7  Databases

As we move forward through these Lessons, an individual data frame will be the launching point for a statistical analysis or a graphical or tabular presentation. Inside every data frame, as you know, each row (that is, specimen) is an instance of the same unit of observation. But data science work often involves combining information about different kinds of unit of observation. For example, a health-care research project will presumably be based on patients: the corresponding data frame has a patient as the unit of observation and will include variables on date of birth, gender, and so on. If the project involves looking at doctor and clinic visits, there will be another data frame in which the unit of observation is a doctor/clinic. If medication is part of the project, there will be a data frame listing each patient’s prescriptions and another data frame giving the characteristics of each drug substance. In the prescription data frame, there will be many rows that list the same drug, each such row rendered unique by the patient involved and the date of the prescription. Interested in studying the health consequences of previous illnesses? Then still another data frame will be needed to list each person’s medical history, where the unit of observation is a bout of illness in an individual patient.

Suppose the project is to identify illnesses that might be side-effects of drugs. To evaluate a specific hypothesized drug-to-illness path, a basic question is whether those who took the drug are more likely to subsequently suffer the illness than the people who did not take the drug.

The data frame needed to answer this question might be simple: the unit of observation is a patient. The core variables will be (1) whether and when the patient got the illness and (2) whether and when the patient took the drug. As you will see in later Lessons, we can include in the analysis characteristics of each patient so that we can avoid, for instance, comparing elderly drug takers to young adults who never had the drug. This will entail including additional variables to the data frame, but the unit of observation will remain “a patient.”

How do we construct the data frame described in the previous paragraph. We will need to combine the illness data frame, the drug prescription data frame, the drug-substance data frame (to connect together drugs that belong to the same class of substances), and the patient data frame.

This Lesson is about how to combine data frames with different units of observation, and how to organize those multiple data frames so that they can easily be combined. The set of well-organized data frames is called a database.

Facility in using databases is a core professional skill for data scientists. For the statistical thinker, it is important to know the basics of how databases work so that she can call on data from multiple sources to inform the statistical questions asked.

E pluribus unum

The traditional national motto of the United States is E pluribus unum: “out of many, one.” The motto is embossed on coinage and printed on paper currency. It refers to the formation of a single country out of the thirteen original colonies. The historically-minded reader knows that the process of creating one country out of many colonies was difficult. On the political side, representatives from each of the thirteen met together in one body to debate, decide, and reconcile their differences.

With databases, the process—combining multiple data frames into a single one suited for statistical analysis—is much simpler. One reason is that there is no need for all the multiple data frames to meet all together simultaneously. Any combination of data frames can be constructed by a series of steps, each of which involves combining only two data frames at a time.

A phrase from the Declaration of Independence describes this simultaneous as “in General Congress, Assembled.”

This Lesson introduces the generic process of combining two data frames with different units of observation. The Lesson also illustrates how to organize systems of data frames so that they can easily be combined into the myriad of forms needed to address the myriad of potential scientific and statistical questions.

Join: putting tables together

To illustrate wrangling to join tables, we’ll work with an authentic database in a familiar setting: student transcripts at a college. At many colleges, the person with authority over the database is called the “registrar.” The registrar at one college gave permission to make parts of the database available to the general public so long as the published data is de-identified. This means, for example, that arbitrary codes are used for the names of students, faculty, and departments.

There are three data frames in the database: Grades (Table 7.1), Sessions (Table 7.2), and Gradepoint (Table 7.3).

Table 7.1: sid is the student ID, while sessionID identifies which course (in which semester) the student took. Students take multiple courses. For instance, student S32328 took sessions 2044, 2491, and 3524 (among others not listed). Student S31461 is listed twice, once for session 2491 and again for 1904. These two students had one course in common, session 2491. They may have sat next to each other! The same is true in session 2606 for students S31440 and S31869.
grade sessionID sid
A session2606 S31440
S session2491 S31461
A session1904 S31461
A session2606 S31869
A session2044 S31905
A session2491 S32028
A- session3524 S32328
A session2044 S32328

… with 5902 rows altogether.

Table 7.2: Each session is taught by an instructor (iid), is associated with a department (dept). The number of students in that session (enroll) is listed, as is the semester in which the session was offered. The level indicates whether the course is directed to new students (level 100) or more advanced students (levels 200 and 300).
sessionID iid enroll dept level sem
session2044 inst436 16 m 100 FA2001
session2491 inst170 34 n 200 FA2002
session2606 inst143 25 C 300 SP2003
session1904 inst264 26 M 100 SP2001
session3524 inst436 21 g 100 FA2004
session2911 inst268 10 M 300 FA2003
session3822 inst465 25 k 200 SP2005

… with 1718 rows altogether

Table 7.3: Gradepoint establishes the college’s policy in converting letter grades to numbers. An A is translated to 4 gradepoints. An NC (no credit) gets zero gradepoints. Pass-fail students who pass (S) don’t have the course included in their gradepoint average. Similarly for students who are auditing (AU) the course.
grade gradepoint
A 4.00
A- 3.66
B+ 3.33
B 3.00
B- 2.66
C+ 2.33
C 2.00
C- 1.66
D+ 1.33
D 1.00
D- 0.66
NC 0.00

All rows shown.

Consider the familiar student-by-student gradepoint average (GPA). This averages together each student’s grades. The Grades tables store the grades, but we can’t average categorical levels like “B+” or “C”. To average, we need to convert each category to a number. This is done via the Gradepoint table.

The operation is conceptually simple. Add a new column to Grades that has the number. Work row-by-row through Grades, referring to the policy in Gradepoint to fill in the value of the new column for that row. Simple, but tedious!

The left_join() wrangling operation involves the two data frames to be combined. For each row in the “left” data frame, the corresponding information from the “right” data frame is added. Like this:

Notice that student S31461 took session 2491 as a pass/fail class. He or she (we don’t know which, because we don’t have permission to publish the table giving such information for individual students) passed the course with a grade of S which doesn’t count for student’s gradepoint.
Table 7.4
Grades |> left_join(Gradepoint) 
Joining with `by = join_by(grade)`

Once Gradepoint has been joined to Grades, we can compute the GPA summary for each of the 443 students:.

Grades |>
  left_join(Gradepoint) |>
  summarize(GPA = mean(gradepoint, na.rm = TRUE), .by = sid)

In calculating the mean gradepoint, we’ve set na.rm = TRUE meaning to remove any NA values before computing the mean. To judge from the GPA, student S31461 strategically decided to preserve their high GPA by taking a risky course pass/fail.
sid GPA
S31461 3.94
S31869 3.56
S31440 3.76
S32328 3.52
S32028 3.55
S31905 3.88

Case study: What about the instructor?

Students will be sympathetic to the claim that some instructors are harder grading than others. This makes a student-by-student GPA an unreliable indicator of a student’s performance.

Knowing how easy it is to join data frames … Let’s try something different. We can calculate a gradepoint average for each instructor! This will involve joining the Grades and Sessions data frames in order to place the instructor’s ID next to each of the grades he or she gave out. Join this combined table with Gradepoint to get the numerical value of the grade, then average across instructors. We will also keep track of how many students were taught by the instructor.

Instructors <- Grades |> 
  left_join(Sessions) |>
  left_join(Gradepoint) |>
  summarize(iGPA = mean(gradepoint, na.rm = TRUE, 
                        nstudents = sum(enroll, na.rm = TRUE)), .by = iid) 
iid iGPA
inst143 3.76
inst198 2.99
inst263 2.85
inst501 3.85
inst269 2.72
inst411 3.01
inst459 3.74
inst419 2.95


Exercise 7.1  

?fig-comparing-gpas(left) compares raw gradepoint averages to instructor adjusted averages. The ambiguity in a gradepoint average is reflected in the horizontal width of the cloud of points. That width is about 0.25 gradepoint, which suggests that a gradepoint average ought to be reported not as a single number, but as a range. So a 3.32 GPA should be reported as 3.20 to 3.45.

Repeat the gradepoint analysis of the grades database using a department-adjusted rather than an instructor-adjusted gradepoint. Make a plot of adj_gpa versus raw_gpa and (by eye) measure the horizontal width of the cloud.

Make sure to show all your wrangling and graphics commands.


Exercise 7.2  

Small project:

Calculate the iGPA instructor’s GPA, then use this to index each student grade to the iGPA, and average the index across students to get the indexGPA. Compute the rank of students in the standard student-only way with the rank in the indexGPA. How consistent are the two rankings?


Exercise 7.3 Refer to the three differently formatted data frames in Exercise 5.10.

Suppose you have another data frame, ContinentData, which gives the continent that each country is in.


Country Continent
Algeria Africa
Brazil South America
Columbia South America

Which data-frame format from @Q05-122 do you think would make it easiest to find the sum of the values for each continent for each of the years? How would you do it?


Format C.

Format_C |> left_join(ContinentData) |>
 summarize(total = sum(value), .by = c(Continent, Year))


Enrichment topics

Often, a literal display of a data frame may seem inefficient, for instance this view of the Galton dataframe which was constructed from Figure 1.4.

Table 7.5: The records from the table shown in Figure 1.4 in a data-frame format.
family father mother sex height nkids
1 78.5 67.0 M 73.2 4
1 78.5 67.0 F 69.2 4
1 78.5 67.0 F 69.0 4
1 78.5 67.0 F 69.0 4
2 75.5 66.5 M 73.5 4
2 75.5 66.5 M 72.5 4
2 75.5 66.5 F 65.5 4
2 75.5 66.5 F 65.5 4
3 75.0 64.0 M 71.0 2
3 75.0 64.0 F 68.0 2
      ... for 898 rows altogether

It may seem that the data frame is inefficient, for example repeating the heights of mother and father for all the siblings in a family. But this view of efficiency relates to the use of paper and ink by a table; the computer entity requires a different view of efficiency.