<- Children |>
Family_size group_by(family) |>
summarize(nkids = n())
6 Databases
INTRODUCTION: One more class of wrangling operations is needed to provide a general framework for handling data: combining multiple tables.
THIS PICKS UP ON THE EXAMPLE IN THE data-frames.qmd
chapter, splitting Galton up into tables as an example.
We’ll call the data frame Children
, to remind us what is the unit of observation. The Children
data frame will have 898 rows, but we will show just the first several. Children
has three columns. Columns are also called “variables” since the entries within a column vary from row to row. In a data frame, variables are given names. In Children
, the variable names are “height,” “sex,” and “family.”
?(caption)
height sex family
------- ---- -------
73.2 M 1
69.2 F 1
69.0 F 1
69.0 F 1
73.5 M 2
72.5 M 2
The “height” variable is numeric (or, equivalently, quantitative). All of the values in the “height” variable are exactly the same kind of thing: a height measured in inches. The “sex” variable contains values that come from a small set of possible levels of “sex,” specifically “M” and “F”. (Keep in mind these data are from the 1880s.) A variable like “sex” is called categorical, because the values are not numeric. The two levels correspond to the two categories of “sex.” The values in the “sex” variable are all the same kind of thing, namely, a category of “sex.”
Including the third column in Children
may seem odd at first. But you will see why it’s needed in a little bit. That third column—the “family” variable—records which family a child came from. The different families are identified by different integers: 1, 2, 3, and so on. Best to think about “family” as a categorical variable. It happens that the levels of “family” are integers, but they could have been any ID code assigned to identify the family uniquely, say, “fam_ID_1423” and “fam_ID_76A.”
The Children
data frame records only some of the data from Galton’s notebook. To store the other data we use a second data frame, which we will call Parents
. As the name suggests, the unit of observation in the Parents
data frame is the pair of parents for each family.
?(caption)
mother | father | family |
---|---|---|
67.0 | 78.5 | 1 |
66.5 | 75.5 | 2 |
64.0 | 75.0 | 3 |
64.0 | 75.0 | 4 |
A data table, Parents
, for storing the data on the heights of the parents.
The evident disadvantage in splitting the data from Galton’s notebook is that it separates related quantify. For instance, Galton’s genetics interest was to relate the heights of parents and their children.
There are big advantages to this organization, however. Imagine that discover that one of the children had been entered twice. To correct things, we need merely strike out the repeated row in the Children
data frame. Or, suppose we want to add year_of_birth
or eye_color
for the children. Just add a new column to Children
.
Do these advantages outweigh the disadvantages? This is where data wrangling comes in. The data-wrangling operations, once learned, make it easy to combine the data from the two tables in whatever way serves the purpose at hand. To illustrate, suppose we are interested in checking the hypothesis that taller fathers tend to have bigger families. Here are the wrangling steps:
- Process the
Children
data frame to count the number of children in each family, producing a new table that we might callFamily_size
. Here’s the wrangling command:
family nkids
------- ------
1 4
2 1
3 3
4 6
5 6
6 5
- Merge the
Family_size
data table with theParents
data table.
<- Parents |> left_join(Family_size) |> select(father, nkids) For_my_research
family | father | nkids |
---|---|---|
1 | 78.5 | 4 |
2 | 75.5 | 4 |
3 | 75.0 | 2 |
4 | 75.0 | 5 |
5 | 75.0 | 6 |
6 | 74.0 | 1 |
- Model the relationship between
father
andnkids
. Later Lessons will cover modeling extensively. Since we’re eager to check the tall-father-big-family hypothesis, we’ll build and summarize the model, even though you aren’t yet in a position to make sense of the results.
lm(nkids ~ father, data=For_my_research) |> conf_interval()
term .lwr .coef .upr
------------ -------- -------- --------
(Intercept) -0.0477 9.9400 19.9000
father -0.2210 -0.0776 0.0663
According to the model, the father
’s height doesn’t have any discernible relationship to the family size.
Databases
If you can’t figure out what the unit of observation is, then likely you should be organizing things as two or more different data frames.
EXPLAIN ABOUT COLLECTION OF data frames and keys that connect relate how things are connected across data frames.
EXERCISE: Parents
is just one of the reasonable formats for storing the parent data. Another perfectly good format would have the unit of observation be an individual parent rather than a mother/father pair.
?(caption)
family | role | parent_height |
---|---|---|
1 | mother | 67.0 |
1 | father | 78.5 |
2 | mother | 66.5 |
2 | father | 75.5 |
3 | mother | 64.0 |
3 | father | 75.0 |
4 | mother | 64.0 |
4 | father | 75.0 |
5 | mother | 58.5 |
5 | father | 75.0 |
6 | mother | 68.0 |
6 | father | 74.0 |
7 | mother | 68.0 |
7 | father | 74.0 |
8 | mother | 66.5 |
8 | father | 74.0 |
9 | mother | 66.0 |
9 | father | 74.5 |
10 | mother | 65.5 |
10 | father | 74.0 |
11 | mother | 62.0 |
11 | father | 74.0 |
12 | mother | 61.0 |
12 | father | 74.0 |
14 | mother | 67.0 |
14 | father | 73.0 |
15 | mother | 66.5 |
15 | father | 73.0 |
16 | mother | 65.0 |
16 | father | 73.0 |
17 | mother | 64.5 |
17 | father | 73.0 |
18 | mother | 64.0 |
18 | father | 73.0 |
19 | mother | 63.0 |
19 | father | 73.2 |
20 | mother | 69.0 |
20 | father | 72.7 |
21 | mother | 68.0 |
21 | father | 72.0 |
22 | mother | 67.0 |
22 | father | 72.0 |
23 | mother | 65.0 |
23 | father | 72.0 |
24 | mother | 65.5 |
24 | father | 72.0 |
25 | mother | 64.0 |
25 | father | 72.0 |
26 | mother | 63.0 |
26 | father | 72.0 |
27 | mother | 63.0 |
27 | father | 72.0 |
28 | mother | 63.0 |
28 | father | 72.0 |
29 | mother | 63.5 |
29 | father | 72.5 |
30 | mother | 62.0 |
30 | father | 72.0 |
31 | mother | 62.0 |
31 | father | 72.5 |
32 | mother | 62.0 |
32 | father | 72.0 |
33 | mother | 62.0 |
33 | father | 72.0 |
34 | mother | 61.0 |
34 | father | 72.0 |
35 | mother | 69.0 |
35 | father | 71.0 |
36 | mother | 67.0 |
36 | father | 71.0 |
37 | mother | 66.0 |
37 | father | 71.0 |
38 | mother | 66.0 |
38 | father | 71.0 |
39 | mother | 66.0 |
39 | father | 71.0 |
40 | mother | 66.0 |
40 | father | 71.0 |
41 | mother | 65.5 |
41 | father | 71.7 |
42 | mother | 65.5 |
42 | father | 71.0 |
43 | mother | 65.5 |
43 | father | 71.5 |
44 | mother | 65.0 |
44 | father | 71.5 |
45 | mother | 65.0 |
45 | father | 71.0 |
46 | mother | 64.0 |
46 | father | 71.0 |
47 | mother | 64.5 |
47 | father | 71.7 |
48 | mother | 64.0 |
48 | father | 71.0 |
49 | mother | 64.5 |
49 | father | 71.5 |
51 | mother | 63.0 |
51 | father | 71.2 |
52 | mother | 63.5 |
52 | father | 71.0 |
53 | mother | 63.0 |
53 | father | 71.0 |
54 | mother | 63.0 |
54 | father | 71.0 |
55 | mother | 62.0 |
55 | father | 71.0 |
56 | mother | 62.0 |
56 | father | 71.0 |
57 | mother | 62.5 |
57 | father | 71.0 |
58 | mother | 62.0 |
58 | father | 71.0 |
59 | mother | 61.0 |
59 | father | 71.0 |
60 | mother | 58.0 |
60 | father | 71.0 |
61 | mother | 69.0 |
61 | father | 70.0 |
62 | mother | 69.0 |
62 | father | 70.0 |
63 | mother | 68.0 |
63 | father | 70.0 |
64 | mother | 67.0 |
64 | father | 70.0 |
65 | mother | 67.0 |
65 | father | 70.0 |
66 | mother | 66.5 |
66 | father | 70.0 |
67 | mother | 65.0 |
67 | father | 70.5 |
68 | mother | 65.0 |
68 | father | 70.5 |
69 | mother | 65.0 |
69 | father | 70.0 |
70 | mother | 65.0 |
70 | father | 70.0 |
71 | mother | 65.0 |
71 | father | 70.0 |
72 | mother | 65.0 |
72 | father | 70.0 |
73 | mother | 65.0 |
73 | father | 70.0 |
74 | mother | 65.0 |
74 | father | 70.0 |
75 | mother | 64.7 |
75 | father | 70.0 |
76 | mother | 64.0 |
76 | father | 70.0 |
77 | mother | 64.0 |
77 | father | 70.0 |
78 | mother | 64.2 |
78 | father | 70.0 |
79 | mother | 64.0 |
79 | father | 70.5 |
80 | mother | 64.5 |
80 | father | 70.5 |
81 | mother | 64.0 |
81 | father | 70.0 |
82 | mother | 64.0 |
82 | father | 70.0 |
83 | mother | 63.7 |
83 | father | 70.0 |
85 | mother | 63.0 |
85 | father | 70.5 |
86 | mother | 63.5 |
86 | father | 70.0 |
87 | mother | 63.0 |
87 | father | 70.0 |
88 | mother | 63.0 |
88 | father | 70.0 |
89 | mother | 62.0 |
89 | father | 70.5 |
90 | mother | 62.7 |
90 | father | 70.3 |
91 | mother | 62.0 |
91 | father | 70.5 |
92 | mother | 61.0 |
92 | father | 70.0 |
93 | mother | 60.0 |
93 | father | 70.0 |
94 | mother | 60.0 |
94 | father | 70.0 |
95 | mother | 58.5 |
95 | father | 70.0 |
96 | mother | 58.0 |
96 | father | 70.0 |
97 | mother | 68.5 |
97 | father | 69.0 |
98 | mother | 67.0 |
98 | father | 69.0 |
99 | mother | 66.0 |
99 | father | 69.0 |
100 | mother | 66.0 |
100 | father | 69.0 |
101 | mother | 66.7 |
101 | father | 69.0 |
102 | mother | 66.0 |
102 | father | 69.0 |
103 | mother | 66.5 |
103 | father | 69.0 |
104 | mother | 66.5 |
104 | father | 69.5 |
105 | mother | 66.5 |
105 | father | 69.0 |
106 | mother | 66.0 |
106 | father | 69.5 |
107 | mother | 66.0 |
107 | father | 69.0 |
108 | mother | 65.0 |
108 | father | 69.0 |
109 | mother | 64.5 |
109 | father | 69.5 |
110 | mother | 64.0 |
110 | father | 69.2 |
112 | mother | 63.0 |
112 | father | 69.0 |
113 | mother | 63.0 |
113 | father | 69.0 |
114 | mother | 63.0 |
114 | father | 69.0 |
115 | mother | 63.5 |
115 | father | 69.0 |
116 | mother | 63.5 |
116 | father | 69.0 |
117 | mother | 62.0 |
117 | father | 69.7 |
118 | mother | 62.0 |
118 | father | 69.5 |
119 | mother | 62.0 |
119 | father | 69.0 |
121 | mother | 62.5 |
121 | father | 69.0 |
122 | mother | 62.0 |
122 | father | 69.0 |
123 | mother | 61.0 |
123 | father | 69.5 |
124 | mother | 61.0 |
124 | father | 69.0 |
125 | mother | 60.0 |
125 | father | 69.0 |
126 | mother | 60.0 |
126 | father | 69.0 |
127 | mother | 60.5 |
127 | father | 69.0 |
128 | mother | 70.5 |
128 | father | 68.7 |
129 | mother | 67.0 |
129 | father | 68.5 |
130 | mother | 66.5 |
130 | father | 68.5 |
131 | mother | 65.0 |
131 | father | 68.0 |
132 | mother | 65.5 |
132 | father | 68.0 |
133 | mother | 65.5 |
133 | father | 68.0 |
134 | mother | 65.0 |
134 | father | 68.0 |
135 | mother | 65.0 |
135 | father | 68.5 |
136 | mother | 64.0 |
136 | father | 68.0 |
137 | mother | 64.0 |
137 | father | 68.0 |
138 | mother | 64.0 |
138 | father | 68.0 |
139 | mother | 64.5 |
139 | father | 68.0 |
140 | mother | 64.0 |
140 | father | 68.0 |
141 | mother | 63.0 |
141 | father | 68.0 |
142 | mother | 63.5 |
142 | father | 68.5 |
143 | mother | 63.0 |
143 | father | 68.0 |
144 | mother | 63.0 |
144 | father | 68.0 |
145 | mother | 63.0 |
145 | father | 68.0 |
146 | mother | 63.0 |
146 | father | 68.0 |
147 | mother | 63.5 |
147 | father | 68.5 |
148 | mother | 63.0 |
148 | father | 68.0 |
149 | mother | 63.5 |
149 | father | 68.2 |
150 | mother | 62.5 |
150 | father | 68.0 |
151 | mother | 62.0 |
151 | father | 68.7 |
152 | mother | 62.5 |
152 | father | 68.0 |
153 | mother | 61.0 |
153 | father | 68.0 |
154 | mother | 60.2 |
154 | father | 68.0 |
155 | mother | 60.0 |
155 | father | 68.0 |
156 | mother | 60.0 |
156 | father | 68.0 |
157 | mother | 59.0 |
157 | father | 68.5 |
158 | mother | 59.0 |
158 | father | 68.0 |
159 | mother | 66.2 |
159 | father | 67.0 |
160 | mother | 66.5 |
160 | father | 67.0 |
162 | mother | 65.0 |
162 | father | 67.0 |
163 | mother | 65.5 |
163 | father | 67.0 |
164 | mother | 65.5 |
164 | father | 67.0 |
165 | mother | 65.0 |
165 | father | 67.0 |
166 | mother | 65.0 |
166 | father | 67.5 |
167 | mother | 64.0 |
167 | father | 67.0 |
168 | mother | 63.5 |
168 | father | 67.0 |
169 | mother | 63.0 |
169 | father | 67.0 |
170 | mother | 62.0 |
170 | father | 67.5 |
171 | mother | 61.0 |
171 | father | 67.0 |
172 | mother | 67.0 |
172 | father | 66.0 |
173 | mother | 67.0 |
173 | father | 66.0 |
174 | mother | 66.0 |
174 | father | 66.0 |
175 | mother | 66.0 |
175 | father | 66.0 |
176 | mother | 65.0 |
176 | father | 66.5 |
177 | mother | 65.5 |
177 | father | 66.0 |
178 | mother | 63.0 |
178 | father | 66.0 |
179 | mother | 63.5 |
179 | father | 66.0 |
180 | mother | 63.0 |
180 | father | 66.5 |
181 | mother | 62.5 |
181 | father | 66.5 |
182 | mother | 61.5 |
182 | father | 66.0 |
183 | mother | 60.0 |
183 | father | 66.0 |
184 | mother | 60.0 |
184 | father | 66.0 |
185 | mother | 59.0 |
185 | father | 66.0 |
186 | mother | 67.0 |
186 | father | 65.0 |
187 | mother | 67.0 |
187 | father | 65.0 |
188 | mother | 66.0 |
188 | father | 65.0 |
190 | mother | 65.0 |
190 | father | 65.0 |
191 | mother | 65.5 |
191 | father | 65.0 |
192 | mother | 65.0 |
192 | father | 65.0 |
193 | mother | 64.0 |
193 | father | 65.0 |
194 | mother | 63.0 |
194 | father | 65.0 |
195 | mother | 63.0 |
195 | father | 65.0 |
196 | mother | 63.0 |
196 | father | 65.5 |
197 | mother | 60.0 |
197 | father | 65.5 |
198 | mother | 64.0 |
198 | father | 64.0 |
199 | mother | 64.0 |
199 | father | 64.0 |
200 | mother | 63.0 |
200 | father | 64.0 |
201 | mother | 60.0 |
201 | father | 64.0 |
203 | mother | 66.0 |
203 | father | 62.0 |
204 | mother | 63.0 |
204 | father | 62.5 |
136A | mother | 65.0 |
136A | father | 68.5 |
EXERCISE: Are girls from bigger families on average than boys?
EXERCISE: Average family size from the child’s point of view and from the households point of view.
EXERCISE: Use College_database
in {math300}
to find the GPA by instructor.
Unit of observation
Links between tables
flights13
grades
Putting tables together
join()