Glyph-ready data often combines data from different sources.
Example: Medicare data
MedicareProviders
: Name and locationDirectRecoveryGroups
: Descriptions of standardized medical proceduresMedicareCharges
: Charges and payments for different DRGs by different providersZipDemographics
: Population and age structure in each ZIP code.Example: Holiday births
Goal: Compare the number of births on holidays to those on non-holiday weekdays.
mosaicData::Birthdays
table gives the daily number of births in US states from 1969 to 1988.Birthdays
doesn’t directly tell us which days are holidays, but …We can use the date
variable to look up each case in a list of holidays, e.g.
Holidays <- read.file("http://tiny.cc/dcf/US-Holidays.csv")
holiday | date | year |
---|---|---|
Veteran’s Day | 11-Nov-1983 | 1983 |
New Year’s Day | 1-Jan-1996 | 1996 |
Columbus Day | 13-Oct-2003 | 2003 |
Memorial Day | 26-May-2014 | 2014 |
Storing data in separate tables can be beneficial even when the data are coming from the same source:
Strategy: Don’t even try to smash all data into one big table. Instead, join related tables as needed.
Example: Grades and Enrollment
Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid | grade | sessionID | |
---|---|---|---|
2197 | S31680 | B | session3518 |
259 | S31242 | B | session2897 |
4188 | S32127 | A | session2002 |
3880 | S32058 | A- | session2952 |
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID | dept | level | sem | enroll | iid | |
---|---|---|---|---|---|---|
640 | session2568 | J | 100 | FA2002 | 15 | inst223 |
76 | session1940 | d | 100 | FA2000 | 16 | inst409 |
1218 | session3242 | m | 200 | SP2004 | 30 | inst476 |
A join is a data verb that combines two tables.
There are several kinds of join.
Goal: Figure out the average class size seen by each student.
enroll
comes from Courses
table.sid
) comes from Grades
.sessionID
is in both tables.Grades %>%
left_join(Courses) %>% sample_n(size=4)
sid | grade | sessionID | dept | level | sem | enroll | iid | |
---|---|---|---|---|---|---|---|---|
4043 | S32061 | B+ | session3644 | K | 300 | SP2005 | 18 | inst239 |
1536 | S31518 | AU | session2901 | J | 100 | FA2003 | 22 | inst224 |
1844 | S31587 | B- | session3191 | b | 200 | SP2004 | 10 | inst403 |
3593 | S31962 | B | session3821 | W | 300 | SP2005 | 24 | inst138 |
Once Courses
and Grades
are joined, it’s straightforward to find the average enrollment seen by each student.
AveClassEachStudent <- Grades %>%
left_join(Courses) %>%
group_by(sid) %>%
summarise(ave_enroll = mean(enroll, na.rm=TRUE))
sid | ave_enroll |
---|---|
S31626 | 27.64286 |
S32007 | 22.23077 |
S31905 | 26.73333 |
Statistical Digression
Why are these numbers different?
AveClassEachStudent %>% summarise(average = mean(ave_enroll))
23.60299
Courses %>% summarise(average = mean(enroll))
21.16531
A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.
Example:
Grades %>%
left_join(Courses, by = c(sessionID = "sessionID")) %>%
head(4)
sid | grade | sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|---|---|
S31185 | D+ | session1784 | M | 100 | FA1991 | 22 | inst265 |
S31185 | B+ | session1785 | k | 100 | FA1991 | 52 | inst458 |
S31185 | A- | session1791 | J | 100 | FA1993 | 22 | inst223 |
S31185 | B+ | session1792 | J | 300 | FA1993 | 20 | inst235 |
The default value of by=
is all variables with the same names in both tables.
- This is not reliable unless you’ve checked.
Different kinds of join have different answers to these questions.
Most popular joins: left_join()
and inner_join()
left_join()
Keep the left case and fill in the new variables with NAinner_join()
Discard the left case. Less popular joins:full_join()
Keep left case as well as unmatched right cases.semi_join()
Discard the left case.anti_join()
Keep the left case but discard any left case with a match in the right tableleft_join()
and inner_join()
do the same thing: Keep all combinations. Less popular joins:full_join()
Keep all combinations.semi_join()
Keep just one copy of the left case.anti_join()
Discard the left case.Here are three data tables relating student grades in courses at Macalester in 2005
Grades <- read.file("http://tiny.cc/mosaic/grades.csv")
sid | grade | sessionID |
---|---|---|
S31185 | D+ | session1784 |
Courses <- read.file("http://tiny.cc/mosaic/courses.csv")
sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|
session1784 | M | 100 | FA1991 | 22 | inst265 |
GradePoint <- read.file("http://tiny.cc/mosaic/grade-to-number.csv")
grade | gradepoint |
---|---|
A | 4 |
For each of these, say what tables you would need to join and what the corresponding variables will be.