6  Databases

INTRODUCTION: One more class of wrangling operations is needed to provide a general framework for handling data: combining multiple tables.

THIS PICKS UP ON THE EXAMPLE IN THE data-frames.qmd chapter, splitting Galton up into tables as an example.

We’ll call the data frame Children, to remind us what is the unit of observation. The Children data frame will have 898 rows, but we will show just the first several. Children has three columns. Columns are also called “variables” since the entries within a column vary from row to row. In a data frame, variables are given names. In Children, the variable names are “height,” “sex,” and “family.”

?(caption)

 height  sex   family 
-------  ----  -------
   73.2  M     1      
   69.2  F     1      
   69.0  F     1      
   69.0  F     1      
   73.5  M     2      
   72.5  M     2      

The “height” variable is numeric (or, equivalently, quantitative). All of the values in the “height” variable are exactly the same kind of thing: a height measured in inches. The “sex” variable contains values that come from a small set of possible levels of “sex,” specifically “M” and “F”. (Keep in mind these data are from the 1880s.) A variable like “sex” is called categorical, because the values are not numeric. The two levels correspond to the two categories of “sex.” The values in the “sex” variable are all the same kind of thing, namely, a category of “sex.”

Including the third column in Children may seem odd at first. But you will see why it’s needed in a little bit. That third column—the “family” variable—records which family a child came from. The different families are identified by different integers: 1, 2, 3, and so on. Best to think about “family” as a categorical variable. It happens that the levels of “family” are integers, but they could have been any ID code assigned to identify the family uniquely, say, “fam_ID_1423” and “fam_ID_76A.”

The Children data frame records only some of the data from Galton’s notebook. To store the other data we use a second data frame, which we will call Parents. As the name suggests, the unit of observation in the Parents data frame is the pair of parents for each family.

?(caption)

mother father family
67.0 78.5 1
66.5 75.5 2
64.0 75.0 3
64.0 75.0 4

A data table, Parents, for storing the data on the heights of the parents.

The evident disadvantage in splitting the data from Galton’s notebook is that it separates related quantify. For instance, Galton’s genetics interest was to relate the heights of parents and their children.

There are big advantages to this organization, however. Imagine that discover that one of the children had been entered twice. To correct things, we need merely strike out the repeated row in the Children data frame. Or, suppose we want to add year_of_birth or eye_color for the children. Just add a new column to Children.

Do these advantages outweigh the disadvantages? This is where data wrangling comes in. The data-wrangling operations, once learned, make it easy to combine the data from the two tables in whatever way serves the purpose at hand. To illustrate, suppose we are interested in checking the hypothesis that taller fathers tend to have bigger families. Here are the wrangling steps:

  1. Process the Children data frame to count the number of children in each family, producing a new table that we might call Family_size. Here’s the wrangling command:
Family_size <- Children |>
  group_by(family) |>
  summarize(nkids = n()) 
 family   nkids
-------  ------
      1       4
      2       1
      3       3
      4       6
      5       6
      6       5
  1. Merge the Family_size data table with the Parents data table.
For_my_research <- Parents |> left_join(Family_size) |> select(father, nkids)
family father nkids
1 78.5 4
2 75.5 4
3 75.0 2
4 75.0 5
5 75.0 6
6 74.0 1
  1. Model the relationship between father and nkids. Later Lessons will cover modeling extensively. Since we’re eager to check the tall-father-big-family hypothesis, we’ll build and summarize the model, even though you aren’t yet in a position to make sense of the results.
lm(nkids ~ father, data=For_my_research) |> conf_interval()
term              .lwr     .coef      .upr
------------  --------  --------  --------
(Intercept)    -0.0477    9.9400   19.9000
father         -0.2210   -0.0776    0.0663

According to the model, the father’s height doesn’t have any discernible relationship to the family size.

Databases

If you can’t figure out what the unit of observation is, then likely you should be organizing things as two or more different data frames.

EXPLAIN ABOUT COLLECTION OF data frames and keys that connect relate how things are connected across data frames.


EXERCISE: Parents is just one of the reasonable formats for storing the parent data. Another perfectly good format would have the unit of observation be an individual parent rather than a mother/father pair.

?(caption)

family role parent_height
1 mother 67.0
1 father 78.5
2 mother 66.5
2 father 75.5
3 mother 64.0
3 father 75.0
4 mother 64.0
4 father 75.0
5 mother 58.5
5 father 75.0
6 mother 68.0
6 father 74.0
7 mother 68.0
7 father 74.0
8 mother 66.5
8 father 74.0
9 mother 66.0
9 father 74.5
10 mother 65.5
10 father 74.0
11 mother 62.0
11 father 74.0
12 mother 61.0
12 father 74.0
14 mother 67.0
14 father 73.0
15 mother 66.5
15 father 73.0
16 mother 65.0
16 father 73.0
17 mother 64.5
17 father 73.0
18 mother 64.0
18 father 73.0
19 mother 63.0
19 father 73.2
20 mother 69.0
20 father 72.7
21 mother 68.0
21 father 72.0
22 mother 67.0
22 father 72.0
23 mother 65.0
23 father 72.0
24 mother 65.5
24 father 72.0
25 mother 64.0
25 father 72.0
26 mother 63.0
26 father 72.0
27 mother 63.0
27 father 72.0
28 mother 63.0
28 father 72.0
29 mother 63.5
29 father 72.5
30 mother 62.0
30 father 72.0
31 mother 62.0
31 father 72.5
32 mother 62.0
32 father 72.0
33 mother 62.0
33 father 72.0
34 mother 61.0
34 father 72.0
35 mother 69.0
35 father 71.0
36 mother 67.0
36 father 71.0
37 mother 66.0
37 father 71.0
38 mother 66.0
38 father 71.0
39 mother 66.0
39 father 71.0
40 mother 66.0
40 father 71.0
41 mother 65.5
41 father 71.7
42 mother 65.5
42 father 71.0
43 mother 65.5
43 father 71.5
44 mother 65.0
44 father 71.5
45 mother 65.0
45 father 71.0
46 mother 64.0
46 father 71.0
47 mother 64.5
47 father 71.7
48 mother 64.0
48 father 71.0
49 mother 64.5
49 father 71.5
51 mother 63.0
51 father 71.2
52 mother 63.5
52 father 71.0
53 mother 63.0
53 father 71.0
54 mother 63.0
54 father 71.0
55 mother 62.0
55 father 71.0
56 mother 62.0
56 father 71.0
57 mother 62.5
57 father 71.0
58 mother 62.0
58 father 71.0
59 mother 61.0
59 father 71.0
60 mother 58.0
60 father 71.0
61 mother 69.0
61 father 70.0
62 mother 69.0
62 father 70.0
63 mother 68.0
63 father 70.0
64 mother 67.0
64 father 70.0
65 mother 67.0
65 father 70.0
66 mother 66.5
66 father 70.0
67 mother 65.0
67 father 70.5
68 mother 65.0
68 father 70.5
69 mother 65.0
69 father 70.0
70 mother 65.0
70 father 70.0
71 mother 65.0
71 father 70.0
72 mother 65.0
72 father 70.0
73 mother 65.0
73 father 70.0
74 mother 65.0
74 father 70.0
75 mother 64.7
75 father 70.0
76 mother 64.0
76 father 70.0
77 mother 64.0
77 father 70.0
78 mother 64.2
78 father 70.0
79 mother 64.0
79 father 70.5
80 mother 64.5
80 father 70.5
81 mother 64.0
81 father 70.0
82 mother 64.0
82 father 70.0
83 mother 63.7
83 father 70.0
85 mother 63.0
85 father 70.5
86 mother 63.5
86 father 70.0
87 mother 63.0
87 father 70.0
88 mother 63.0
88 father 70.0
89 mother 62.0
89 father 70.5
90 mother 62.7
90 father 70.3
91 mother 62.0
91 father 70.5
92 mother 61.0
92 father 70.0
93 mother 60.0
93 father 70.0
94 mother 60.0
94 father 70.0
95 mother 58.5
95 father 70.0
96 mother 58.0
96 father 70.0
97 mother 68.5
97 father 69.0
98 mother 67.0
98 father 69.0
99 mother 66.0
99 father 69.0
100 mother 66.0
100 father 69.0
101 mother 66.7
101 father 69.0
102 mother 66.0
102 father 69.0
103 mother 66.5
103 father 69.0
104 mother 66.5
104 father 69.5
105 mother 66.5
105 father 69.0
106 mother 66.0
106 father 69.5
107 mother 66.0
107 father 69.0
108 mother 65.0
108 father 69.0
109 mother 64.5
109 father 69.5
110 mother 64.0
110 father 69.2
112 mother 63.0
112 father 69.0
113 mother 63.0
113 father 69.0
114 mother 63.0
114 father 69.0
115 mother 63.5
115 father 69.0
116 mother 63.5
116 father 69.0
117 mother 62.0
117 father 69.7
118 mother 62.0
118 father 69.5
119 mother 62.0
119 father 69.0
121 mother 62.5
121 father 69.0
122 mother 62.0
122 father 69.0
123 mother 61.0
123 father 69.5
124 mother 61.0
124 father 69.0
125 mother 60.0
125 father 69.0
126 mother 60.0
126 father 69.0
127 mother 60.5
127 father 69.0
128 mother 70.5
128 father 68.7
129 mother 67.0
129 father 68.5
130 mother 66.5
130 father 68.5
131 mother 65.0
131 father 68.0
132 mother 65.5
132 father 68.0
133 mother 65.5
133 father 68.0
134 mother 65.0
134 father 68.0
135 mother 65.0
135 father 68.5
136 mother 64.0
136 father 68.0
137 mother 64.0
137 father 68.0
138 mother 64.0
138 father 68.0
139 mother 64.5
139 father 68.0
140 mother 64.0
140 father 68.0
141 mother 63.0
141 father 68.0
142 mother 63.5
142 father 68.5
143 mother 63.0
143 father 68.0
144 mother 63.0
144 father 68.0
145 mother 63.0
145 father 68.0
146 mother 63.0
146 father 68.0
147 mother 63.5
147 father 68.5
148 mother 63.0
148 father 68.0
149 mother 63.5
149 father 68.2
150 mother 62.5
150 father 68.0
151 mother 62.0
151 father 68.7
152 mother 62.5
152 father 68.0
153 mother 61.0
153 father 68.0
154 mother 60.2
154 father 68.0
155 mother 60.0
155 father 68.0
156 mother 60.0
156 father 68.0
157 mother 59.0
157 father 68.5
158 mother 59.0
158 father 68.0
159 mother 66.2
159 father 67.0
160 mother 66.5
160 father 67.0
162 mother 65.0
162 father 67.0
163 mother 65.5
163 father 67.0
164 mother 65.5
164 father 67.0
165 mother 65.0
165 father 67.0
166 mother 65.0
166 father 67.5
167 mother 64.0
167 father 67.0
168 mother 63.5
168 father 67.0
169 mother 63.0
169 father 67.0
170 mother 62.0
170 father 67.5
171 mother 61.0
171 father 67.0
172 mother 67.0
172 father 66.0
173 mother 67.0
173 father 66.0
174 mother 66.0
174 father 66.0
175 mother 66.0
175 father 66.0
176 mother 65.0
176 father 66.5
177 mother 65.5
177 father 66.0
178 mother 63.0
178 father 66.0
179 mother 63.5
179 father 66.0
180 mother 63.0
180 father 66.5
181 mother 62.5
181 father 66.5
182 mother 61.5
182 father 66.0
183 mother 60.0
183 father 66.0
184 mother 60.0
184 father 66.0
185 mother 59.0
185 father 66.0
186 mother 67.0
186 father 65.0
187 mother 67.0
187 father 65.0
188 mother 66.0
188 father 65.0
190 mother 65.0
190 father 65.0
191 mother 65.5
191 father 65.0
192 mother 65.0
192 father 65.0
193 mother 64.0
193 father 65.0
194 mother 63.0
194 father 65.0
195 mother 63.0
195 father 65.0
196 mother 63.0
196 father 65.5
197 mother 60.0
197 father 65.5
198 mother 64.0
198 father 64.0
199 mother 64.0
199 father 64.0
200 mother 63.0
200 father 64.0
201 mother 60.0
201 father 64.0
203 mother 66.0
203 father 62.0
204 mother 63.0
204 father 62.5
136A mother 65.0
136A father 68.5

EXERCISE: Are girls from bigger families on average than boys?

EXERCISE: Average family size from the child’s point of view and from the households point of view.

EXERCISE: Use College_database in {math300} to find the GPA by instructor.

Unit of observation

Links between tables

flights13

grades

Putting tables together

  • join()