Combining data from different sources

Glyph-ready data often combines data from different sources.

Example: Medicare data

Example: Holiday births

Goal: Compare the number of births on holidays to those on non-holiday weekdays.

We can use the date variable to look up each case in a list of holidays, e.g.

Holidays <- read.file("http://tiny.cc/dcf/US-Holidays.csv")
holiday date year
Veteran’s Day 11-Nov-1983 1983
New Year’s Day 1-Jan-1996 1996
Columbus Day 13-Oct-2003 2003
Memorial Day 26-May-2014 2014

Relational databases

Storing data in separate tables can be beneficial even when the data are coming from the same source:

  • There is no “one size fits all” glyph-ready format. Often the kinds of analysis that will be done are not specifically anticipated when data are collected.
  • Glyph-ready data often contains redundancies. This makes it hard to update or correct data.

Strategy: Don’t even try to smash all data into one big table. Instead, join related tables as needed.

Example: Grades and Enrollment

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid grade sessionID
2197 S31680 B session3518
259 S31242 B session2897
4188 S32127 A session2002
3880 S32058 A- session2952
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID dept level sem enroll iid
640 session2568 J 100 FA2002 15 inst223
76 session1940 d 100 FA2000 16 inst409
1218 session3242 m 200 SP2004 30 inst476

Joins

A join is a data verb that combines two tables.

  • These care called the left and the right tables.

There are several kinds of join.

  • All involve establishing a correspondance — a match — between each case in the left table and zero or more cases in the right table.
  • The various joins differ in how they handle multiple matches or missing matches.

Example: Average class size

Goal: Figure out the average class size seen by each student.

  • enroll comes from Courses table.
  • Student (sid) comes from Grades.
  • sessionID is in both tables.
Grades %>%
  left_join(Courses) %>% sample_n(size=4)
sid grade sessionID dept level sem enroll iid
4043 S32061 B+ session3644 K 300 SP2005 18 inst239
1536 S31518 AU session2901 J 100 FA2003 22 inst224
1844 S31587 B- session3191 b 200 SP2004 10 inst403
3593 S31962 B session3821 W 300 SP2005 24 inst138

Once Courses and Grades are joined, it’s straightforward to find the average enrollment seen by each student.

AveClassEachStudent <- Grades %>% 
  left_join(Courses) %>%
  group_by(sid) %>%
  summarise(ave_enroll = mean(enroll, na.rm=TRUE))
sid ave_enroll
S31626 27.64286
S32007 22.23077
S31905 26.73333

Statistical Digression

Why are these numbers different?

AveClassEachStudent %>% summarise(average = mean(ave_enroll))

average

23.60299

Courses %>% summarise(average = mean(enroll))

average

21.16531

Establishing a match between cases

A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.

  • You specify which pairs to use.
  • A pair is a variable from the left table and a variable from the right table.
  • Cases must have exactly equal values in the left variable and right variable for a match to be made.

Example:

Grades %>% 
  left_join(Courses, by = c(sessionID = "sessionID")) %>%
  head(4)
sid grade sessionID dept level sem enroll iid
S31185 D+ session1784 M 100 FA1991 22 inst265
S31185 B+ session1785 k 100 FA1991 52 inst458
S31185 A- session1791 J 100 FA1993 22 inst223
S31185 B+ session1792 J 300 FA1993 20 inst235

The default value of by= is all variables with the same names in both tables.
- This is not reliable unless you’ve checked.

Kinds of join

Different kinds of join have different answers to these questions.

  • What to do when there is no match between a left case and any right case?
  • What to do when there are multiple matching cases in the right table for a case in the left table?

Most popular joins: left_join() and inner_join()

  1. No match of a left case to a right case
    • left_join() Keep the left case and fill in the new variables with NA
    • inner_join() Discard the left case. Less popular joins:
    • full_join() Keep left case as well as unmatched right cases.
    • semi_join() Discard the left case.
    • anti_join() Keep the left case but discard any left case with a match in the right table
  2. Multiple matches of right cases to a left case
    • left_join() and inner_join() do the same thing: Keep all combinations. Less popular joins:
    • full_join() Keep all combinations.
    • semi_join() Keep just one copy of the left case.
    • anti_join() Discard the left case.

Example: Grade-point averages

Here are three data tables relating student grades in courses at Macalester in 2005

Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid grade sessionID
S31185 D+ session1784
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID dept level sem enroll iid
session1784 M 100 FA1991 22 inst265
GradePoint <- read.file("http://tiny.cc/mosaic/grade-to-number.csv")
grade gradepoint
A 4

Activity: Which to Join?

For each of these, say what tables you would need to join and what the corresponding variables will be.

  1. How many students in each department?
  2. What fraction of grades are below B+ in each department?
  3. What’s the grade-point average (GPA) for each student?
  4. Grade-point average for each department or instructor
  5. What’s the 95% confidence interval on the GPA for each student?
  6. (Statistically more sophisticated) To what extent does the grade reflect the student or the department or instructor?