grade | sessionID | sid |
---|---|---|
A | session2606 | S31440 |
S | session2491 | S31461 |
A | session1904 | S31461 |
A | session2606 | S31869 |
A | session2044 | S31905 |
A | session2491 | S32028 |
A- | session3524 | S32328 |
A | session2044 | S32328 |
… with 5902 rows altogether.
As we move forward through these Lessons, an individual data frame will be the launching point for a statistical analysis or a graphical or tabular presentation. Inside every data frame, as you know, each row (that is, specimen) is an instance of the same unit of observation. But data science work often involves combining information about different kinds of unit of observation. For example, a health-care research project will presumably be based on patients: the corresponding data frame has a patient as the unit of observation and will include variables on date of birth, gender, and so on. If the project involves looking at doctor and clinic visits, there will be another data frame in which the unit of observation is a doctor/clinic. If medication is part of the project, there will be a data frame listing each patient’s prescriptions and another data frame giving the characteristics of each drug substance. In the prescription data frame, there will be many rows that list the same drug, each such row rendered unique by the patient involved and the date of the prescription. Interested in studying the health consequences of previous illnesses? Then still another data frame will be needed to list each person’s medical history, where the unit of observation is a bout of illness in an individual patient.
Suppose the project is to identify illnesses that might be side-effects of drugs. To evaluate a specific hypothesized drug-to-illness path, a basic question is whether those who took the drug are more likely to subsequently suffer the illness than the people who did not take the drug.
The data frame needed to answer this question might be simple: the unit of observation is a patient. The core variables will be (1) whether and when the patient got the illness and (2) whether and when the patient took the drug. As you will see in later Lessons, we can include in the analysis characteristics of each patient so that we can avoid, for instance, comparing elderly drug takers to young adults who never had the drug. This will entail including additional variables to the data frame, but the unit of observation will remain “a patient.”
How do we construct the data frame described in the previous paragraph. We will need to combine the illness data frame, the drug prescription data frame, the drug-substance data frame (to connect together drugs that belong to the same class of substances), and the patient data frame.
This Lesson is about how to combine data frames with different units of observation, and how to organize those multiple data frames so that they can easily be combined. The set of well-organized data frames is called a database.
Facility in using databases is a core professional skill for data scientists. For the statistical thinker, it is important to know the basics of how databases work so that she can call on data from multiple sources to inform the statistical questions asked.
The traditional national motto of the United States is E pluribus unum: “out of many, one.” The motto is embossed on coinage and printed on paper currency. It refers to the formation of a single country out of the thirteen original colonies. The historically-minded reader knows that the process of creating one country out of many colonies was difficult. On the political side, representatives from each of the thirteen met together in one body to debate, decide, and reconcile their differences.
With databases, the process—combining multiple data frames into a single one suited for statistical analysis—is much simpler. One reason is that there is no need for all the multiple data frames to meet all together simultaneously. Any combination of data frames can be constructed by a series of steps, each of which involves combining only two data frames at a time.
This Lesson introduces the generic process of combining two data frames with different units of observation. The Lesson also illustrates how to organize systems of data frames so that they can easily be combined into the myriad of forms needed to address the myriad of potential scientific and statistical questions.
To illustrate wrangling to join tables, we’ll work with an authentic database in a familiar setting: student transcripts at a college. At many colleges, the person with authority over the database is called the “registrar.” The registrar at one college gave permission to make parts of the database available to the general public so long as the published data is de-identified. This means, for example, that arbitrary codes are used for the names of students, faculty, and departments.
There are three data frames in the database: Grades
(Table 7.1), Sessions
(Table 7.2), and Gradepoint
(Table 7.3).
sid
is the student ID, while sessionID
identifies which course (in which semester) the student took. Students take multiple courses. For instance, student S32328 took sessions 2044, 2491, and 3524 (among others not listed). Student S31461 is listed twice, once for session 2491 and again for 1904. These two students had one course in common, session 2491. They may have sat next to each other! The same is true in session 2606 for students S31440 and S31869.
grade | sessionID | sid |
---|---|---|
A | session2606 | S31440 |
S | session2491 | S31461 |
A | session1904 | S31461 |
A | session2606 | S31869 |
A | session2044 | S31905 |
A | session2491 | S32028 |
A- | session3524 | S32328 |
A | session2044 | S32328 |
… with 5902 rows altogether.
iid
), is associated with a department (dept
). The number of students in that session (enroll
) is listed, as is the semester in which the session was offered. The level
indicates whether the course is directed to new students (level 100) or more advanced students (levels 200 and 300).
sessionID | iid | enroll | dept | level | sem |
---|---|---|---|---|---|
session2044 | inst436 | 16 | m | 100 | FA2001 |
session2491 | inst170 | 34 | n | 200 | FA2002 |
session2606 | inst143 | 25 | C | 300 | SP2003 |
session1904 | inst264 | 26 | M | 100 | SP2001 |
session3524 | inst436 | 21 | g | 100 | FA2004 |
session2911 | inst268 | 10 | M | 300 | FA2003 |
session3822 | inst465 | 25 | k | 200 | SP2005 |
… with 1718 rows altogether
Gradepoint
establishes the college’s policy in converting letter grades to numbers. An A is translated to 4 gradepoints. An NC
(no credit) gets zero gradepoints. Pass-fail students who pass (S
) don’t have the course included in their gradepoint average. Similarly for students who are auditing (AU
) the course.
grade | gradepoint |
---|---|
AU | NA |
S | NA |
A | 4.00 |
A- | 3.66 |
B+ | 3.33 |
B | 3.00 |
B- | 2.66 |
C+ | 2.33 |
C | 2.00 |
C- | 1.66 |
D+ | 1.33 |
D | 1.00 |
D- | 0.66 |
NC | 0.00 |
All rows shown.
Consider the familiar student-by-student gradepoint average (GPA). This averages together each student’s grades. The Grades
tables store the grades, but we can’t average categorical levels like “B+” or “C”. To average, we need to convert each category to a number. This is done via the Gradepoint
table.
The operation is conceptually simple. Add a new column to Grades
that has the number. Work row-by-row through Grades
, referring to the policy in Gradepoint
to fill in the value of the new column for that row. Simple, but tedious!
The left_join()
wrangling operation involves the two data frames to be combined. For each row in the “left” data frame, the corresponding information from the “right” data frame is added. Like this:
Once Gradepoint
has been joined to Grades
, we can compute the GPA summary for each of the 443 students:.
|>
Grades left_join(Gradepoint) |>
summarize(GPA = mean(gradepoint, na.rm = TRUE), .by = sid)
na.rm = TRUE
meaning to remove any NA
values before computing the mean. To judge from the GPA, student S31461 strategically decided to preserve their high GPA by taking a risky course pass/fail.sid | GPA |
---|---|
S31461 | 3.94 |
S31869 | 3.56 |
S31440 | 3.76 |
S32328 | 3.52 |
S32028 | 3.55 |
S31905 | 3.88 |
Students will be sympathetic to the claim that some instructors are harder grading than others. This makes a student-by-student GPA an unreliable indicator of a student’s performance.
Knowing how easy it is to join data frames … Let’s try something different. We can calculate a gradepoint average for each instructor! This will involve joining the Grades
and Sessions
data frames in order to place the instructor’s ID next to each of the grades he or she gave out. Join this combined table with Gradepoint
to get the numerical value of the grade, then average across instructors. We will also keep track of how many students were taught by the instructor.
<- Grades |>
Instructors left_join(Sessions) |>
left_join(Gradepoint) |>
summarize(iGPA = mean(gradepoint, na.rm = TRUE,
nstudents = sum(enroll, na.rm = TRUE)), .by = iid)
Instructors
iid | iGPA |
---|---|
inst143 | 3.76 |
inst198 | 2.99 |
inst263 | 2.85 |
inst501 | 3.85 |
inst269 | 2.72 |
inst411 | 3.01 |
inst459 | 3.74 |
inst419 | 2.95 |
Exercise 7.1
?fig-comparing-gpas(left) compares raw gradepoint averages to instructor adjusted averages. The ambiguity in a gradepoint average is reflected in the horizontal width of the cloud of points. That width is about 0.25 gradepoint, which suggests that a gradepoint average ought to be reported not as a single number, but as a range. So a 3.32 GPA should be reported as 3.20 to 3.45.
Repeat the gradepoint analysis of the grades database using a department-adjusted rather than an instructor-adjusted gradepoint. Make a plot of adj_gpa
versus raw_gpa
and (by eye) measure the horizontal width of the cloud.
Make sure to show all your wrangling and graphics commands.
Exercise 7.2
Small project:
Calculate the iGPA instructor’s GPA, then use this to index each student grade to the iGPA, and average the index across students to get the indexGPA
. Compute the rank of students in the standard student-only way with the rank in the indexGPA
. How consistent are the two rankings?
Exercise 7.3 Refer to the three differently formatted data frames in Exercise 5.10.
Suppose you have another data frame, ContinentData
, which gives the continent that each country is in.
ContinentData
Country | Continent |
---|---|
Algeria | Africa |
Brazil | South America |
Columbia | South America |
Which data-frame format from @Q05-122 do you think would make it easiest to find the sum of the values for each continent for each of the years? How would you do it?
Answer:
Format C.
|> left_join(ContinentData) |>
Format_C summarize(total = sum(value), .by = c(Continent, Year))
Often, a literal display of a data frame may seem inefficient, for instance this view of the Galton
dataframe which was constructed from Figure 1.4.
Galton
family | father | mother | sex | height | nkids |
---|---|---|---|---|---|
1 | 78.5 | 67.0 | M | 73.2 | 4 |
1 | 78.5 | 67.0 | F | 69.2 | 4 |
1 | 78.5 | 67.0 | F | 69.0 | 4 |
1 | 78.5 | 67.0 | F | 69.0 | 4 |
2 | 75.5 | 66.5 | M | 73.5 | 4 |
2 | 75.5 | 66.5 | M | 72.5 | 4 |
2 | 75.5 | 66.5 | F | 65.5 | 4 |
2 | 75.5 | 66.5 | F | 65.5 | 4 |
3 | 75.0 | 64.0 | M | 71.0 | 2 |
3 | 75.0 | 64.0 | F | 68.0 | 2 |
It may seem that the data frame is inefficient, for example repeating the heights of mother and father for all the siblings in a family. But this view of efficiency relates to the use of paper and ink by a table; the computer entity requires a different view of efficiency.