Combining data from different sources

Glyph-ready data often combines data from different sources.

Perhaps they come from different experiments or institutions.
Often, they were collected with different objectives than yours.
Perhaps they are completely different types of data,

Example: Medicare data

MedicareProviders: Name and location
DirectRecoveryGroups: Descriptions of standardized medical procedures
MedicareCharges: Charges and payments for different DRGs by different providers
ZipDemographics: Population and age structure in each ZIP code.

Example: Holiday births

Goal: Compare the number of births on holidays to those on non-holiday weekdays.

The mosaicData::Birthdays table gives the daily number of births in US states from 1969 to 1988.
Birthdays doesn’t directly tell us which days are holidays, but …

We can use the date variable to look up each case in a list of holidays, e.g.

Holidays <- read.file("http://tiny.cc/dcf/US-Holidays.csv")

holiday	date	year
Veteran’s Day	11-Nov-1983	1983
New Year’s Day	1-Jan-1996	1996
Columbus Day	13-Oct-2003	2003
Memorial Day	26-May-2014	2014

Relational databases

Storing data in separate tables can be beneficial even when the data are coming from the same source:

There is no “one size fits all” glyph-ready format. Often the kinds of analysis that will be done are not specifically anticipated when data are collected.
Glyph-ready data often contains redundancies. This makes it hard to update or correct data.

Strategy: Don’t even try to smash all data into one big table. Instead, join related tables as needed.

Example: Grades and Enrollment

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")

	sid	grade	sessionID
2197	S31680	B	session3518
259	S31242	B	session2897
4188	S32127	A	session2002
3880	S32058	A-	session2952

Courses <- read.file("http://tiny.cc/mosaic/courses.csv")

	sessionID	dept	level	sem	enroll	iid
640	session2568	J	100	FA2002	15	inst223
76	session1940	d	100	FA2000	16	inst409
1218	session3242	m	200	SP2004	30	inst476

Joins

A join is a data verb that combines two tables.

These care called the left and the right tables.

There are several kinds of join.

All involve establishing a correspondance — a match — between each case in the left table and zero or more cases in the right table.
The various joins differ in how they handle multiple matches or missing matches.

Example: Average class size

Goal: Figure out the average class size seen by each student.

enroll comes from Courses table.
Student (sid) comes from Grades.
sessionID is in both tables.

Grades %>%
  left_join(Courses) %>% sample_n(size=4)

	sid	grade	sessionID	dept	level	sem	enroll	iid
4043	S32061	B+	session3644	K	300	SP2005	18	inst239
1536	S31518	AU	session2901	J	100	FA2003	22	inst224
1844	S31587	B-	session3191	b	200	SP2004	10	inst403
3593	S31962	B	session3821	W	300	SP2005	24	inst138

Once Courses and Grades are joined, it’s straightforward to find the average enrollment seen by each student.

AveClassEachStudent <- Grades %>% 
  left_join(Courses) %>%
  group_by(sid) %>%
  summarise(ave_enroll = mean(enroll, na.rm=TRUE))

sid	ave_enroll
S31626	27.64286
S32007	22.23077
S31905	26.73333

Statistical Digression

Why are these numbers different?

AveClassEachStudent %>% summarise(average = mean(ave_enroll))

average

23.60299

Courses %>% summarise(average = mean(enroll))

average

21.16531

Establishing a match between cases

A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.

You specify which pairs to use.
A pair is a variable from the left table and a variable from the right table.
Cases must have exactly equal values in the left variable and right variable for a match to be made.

Example:

Grades %>% 
  left_join(Courses, by = c(sessionID = "sessionID")) %>%
  head(4)

sid	grade	sessionID	dept	level	sem	enroll	iid
S31185	D+	session1784	M	100	FA1991	22	inst265
S31185	B+	session1785	k	100	FA1991	52	inst458
S31185	A-	session1791	J	100	FA1993	22	inst223
S31185	B+	session1792	J	300	FA1993	20	inst235

The default value of by= is all variables with the same names in both tables.
- This is not reliable unless you’ve checked.

Kinds of join

Different kinds of join have different answers to these questions.

What to do when there is no match between a left case and any right case?
What to do when there are multiple matching cases in the right table for a case in the left table?

Most popular joins: left_join() and inner_join()

No match of a left case to a right case
- left_join() Keep the left case and fill in the new variables with NA
- inner_join() Discard the left case. Less popular joins:
- full_join() Keep left case as well as unmatched right cases.
- semi_join() Discard the left case.
- anti_join() Keep the left case but discard any left case with a match in the right table
Multiple matches of right cases to a left case
- left_join() and inner_join() do the same thing: Keep all combinations. Less popular joins:
- full_join() Keep all combinations.
- semi_join() Keep just one copy of the left case.
- anti_join() Discard the left case.

Example: Grade-point averages

Here are three data tables relating student grades in courses at Macalester in 2005

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")

sid	grade	sessionID
S31185	D+	session1784

Courses <- read.file("http://tiny.cc/mosaic/courses.csv")

sessionID	dept	level	sem	enroll	iid
session1784	M	100	FA1991	22	inst265

GradePoint <- read.file("http://tiny.cc/mosaic/grade-to-number.csv")

grade	gradepoint
A	4

Activity: Which to Join?

For each of these, say what tables you would need to join and what the corresponding variables will be.

How many students in each department?
What fraction of grades are below B+ in each department?
What’s the grade-point average (GPA) for each student?
Grade-point average for each department or instructor
What’s the 95% confidence interval on the GPA for each student?
(Statistically more sophisticated) To what extent does the grade reflect the student or the department or instructor?