1 Data frames
The origin of recorded history is, literally, data. Five thousand years ago, in Mesopotamia, the climate was changing. Retreating sources of irrigation water called for an organized and coordinated response, beyond the scope of isolated clans of farmers. To provide this response, a new social structure – government – was established and grew. Taxes were owed and paid, each transaction recorded. Food grain had to be measured and stored, livestock counted, trades and shipments memorialized.
Writing emerged as the technological innovation to keep track of all this. We know about this today because the records were incised on soft clay tablets and baked into permanence. Then, when the records were no longer needed, they were recycled as building materials for the growing settlements and cities. Archaeologists started uncovering these tablets more than 100 years ago, spending decades to decipher the meaning of the stylus marks in clay.
The technology for writing and record keeping developed over time: knots in string, wax tablets, papyrus, vellum, paper, computer memory. Making sense of the records has always required literacy, the ability to decipher marks according to the system and language used to represent the writer’s intent. Today, in many societies the vast majority of people have been taught to read and write their native language according to the accepted conventions.
Conventions of record keeping diverge from those of everyday language. For instance, records of financial transactions have to be guarded against error and fraud. Starting in the thirteenth century, financial accountants adopted a practice, double-entry bookkeeping, that has no counterpart in everyday language and is utterly foreign to other kinds of written communication.
Modern conventions make working with data easier and more reliable. Of primary interest to us in these Lessons is the organization provided by a “data frame,” a structure for holding data as exemplified in Figure 1.1.
The display in Figure 1.1 shows a small part of a larger data frame holding observations collected by statistician Francis Galton in the 1880s. I will use this data frame repeatedly across these lessons because of the outsized historical role the data played in the development of statistical methodology. The context for the data collection was Galton’s attempt to quantify the heritability of biological traits. The particular trait of interest to Galton (probably because it is easily measured) is human stature. Galton recorded the heights of full-grown children and their parents.
The row-and-column organization of a data frame is reminiscent of a spreadsheet. But data frames have additional requirements that typical spreadsheet software does not enforce. Often the term “tidy data” is used to emphasize that these requirements are being met.
- Each variable must consist of the same kind of individual entries. For example, the mother variable consists of numbers: a quantity. In this case, the quantity is the mother’s height in inches. It would not be legitimate for an entry in mother to be a word or to be a height in meters or something else entirely, for instance a blood pressure.
- Each row represents an individual real-world entity. For the data frame shown in Figure 1.1, each row corresponds to an individual, fully-grown child. We use the term “unit of observation” to refer to the kind of entity represented in each row. All rows in a data frame must be the same kind of unit of observation. It would not be legitimate for a row to represent a different unit, such as a house or family or country. If you wanted to record data on families, you would need to create a new data frame where the unit of observation is a family.
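To make the first requirement concrete, here is a minimal sketch in R. The values are invented for illustration (they are not Galton’s measurements), and the names heights and mixed are hypothetical.

```r
# Hypothetical values for illustration only; not Galton's measurements.
heights <- data.frame(
  mother = c(67.0, 66.5, 64.0),   # quantitative: height in inches
  father = c(78.5, 75.5, 75.0),   # quantitative: height in inches
  sex    = c("M", "F", "F"),      # categorical
  stringsAsFactors = FALSE
)

# Each column holds a single kind of entry.
sapply(heights, class)

# Mixing a word into a set of numbers forces R to store every entry as text.
mixed <- c(67.0, 66.5, "tall")
class(mixed)
```

Because R stores each column as a single kind of value, slipping a word into a column of numbers turns the whole column into text, which is one practical reason the requirement matters.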
We will use the word “specimen” to refer to an individual instance of the unit of observation. It would not be exactly correct to say that a data frame is a collection of “units of observation,” since the unit of observation (an individual person in Figure 1.1) must be the same for all rows in a data frame.
The unit of observation in Figure 1.1 is a full-grown child. The fifth row in that data frame refers to a particular young woman in London in the 1880s, whose name is lost to history. By using the word “specimen” to refer to this woman, we do not mean to de-humanize her. However, we need a word that can be applied to a single row of any data frame whatever its unit of observation might be: a shipping container, a blood sample, a day of ticket sales, and so on.
Very often, the collection comprised by a data frame is a “sample” from a larger group of the units of observation. Galton did not measure the height of every fully-grown child in London, England, the UK, or the world. He collected a sample from London families. Sometimes, a data frame includes every possible instance of the unit of observation. For example, a library catalog is a data frame comprehensively listing the books in a library. Such a comprehensive collection is called a “census.”
The US Centers for Disease Control (CDC) publishes each year a “public use file” which is a data frame where the unit of observation is an infant born in the US. The published file for 2022 has 3,699,040 rows because there were that number of (known) births in 2022. As such, the CDC data constitutes a census rather than a sample.
Among the many variables are the baby’s weight and sex, the mother’s age, and the number of pre-natal care visits during the pregnancy.
Types of variables
Each column of a data frame is a variable. The word “variable” is appropriate because the entries within a variable vary from one row to another. Other words with the same root include “variation,” “variety,” and even “diversity.”
Data-frame variables come in two fundamental types:
- Quantitative variables record an “amount” of something. These might just as well be called “numerical” variables.
- Categorical variables typically consist of letters. For instance, the sex variable in Figure 1.1 contains entries that are either F or M. In most of the data we work with in these Lessons, there is a fixed set of entry values, called the levels of the categorical variable. The levels of sex are F and M.
Among the many variables in the CDC public use file of births are place and diabetes_gest, which record the place of birth and whether the mother developed gestational diabetes.
The place variable is categorical, with these levels:
- “hospital”
- “home (intended)”
- “home (unintended)”
- “freestanding”
- “other”
The diabetes_gest variable has only two levels: N or Y.
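As a rough sketch of how a fixed set of levels looks in R, the following builds a small categorical variable by hand. The four entries are made up for illustration and are not drawn from the CDC file; only the five level names come from the list above.

```r
# Hypothetical entries; only the level names are taken from the text.
place <- factor(
  c("hospital", "home (intended)", "hospital", "freestanding"),
  levels = c("hospital", "home (intended)", "home (unintended)",
             "freestanding", "other")
)

# The fixed set of possible entry values.
levels(place)

# How many times each level occurs, including levels that never occur.
table(place)
```

Note that the levels are a property of the variable, not of the particular rows at hand: a level such as “other” remains part of the variable even when no row in this small example uses it.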
The codebook
How are you to know for any given data frame what constitutes the unit of observation or what each variable is about? This information, sometimes called metadata, is stored outside the data frame. Often, the metadata is contained in a separate documentation file called a “codebook.”
To start, the codebook should make it clear what is the unit of observation of the data frame. For instance, we described the unit of observation for the data frame shown in Figure 1.1 as a fully-grown child. This detail is important. For instance, each such child, each specimen, can appear only once in the data frame. In contrast, the same mother and father might appear for multiple specimens: all the siblings of the child.
In the CDC data frame, the unit of observation is a new-born baby. If a birth resulted in twins, there would be a separate row for each of the two babies. In contrast, imagine data frames for the birth-mothers or for pre-natal care visits. Each mother could appear only once in the birth-mothers frame, but the same mother can appear multiple times in the pre-natal care data frame.
For quantitative variables, the relevant metadata includes what the number refers to (e.g., mother’s height or baby’s weight) and the physical units of that quantity (e.g. inches or grams).
For categorical variables, the metadata ought to describe the meaning of each level. Often, however, as with the sex variable in Figure 1.1, the meaning is a matter of common sense.
The codebook for the CDC data is a PDF document, entitled “User Guide to the 2022 Natality Public Use File.” You can access it on the CDC website.
Accessing data frames
Most statistics software, including R, makes it easy to access data frames stored in various formats either on computer file systems or on the internet. (For examples, see in
Almost all the data frames used as examples or as exercises in these Lessons are stored in a file system provided by the R system. These R data frames are particularly easy to access. The data frame itself is accessed by a simple name, e.g. Galton. The location of the data frame is specified by a prefix separated by a pair of colons, e.g. mosaicData::Galton. A particularly nice feature of this system is the easy access to each data frame’s documentation. Simply give a command consisting of a question mark followed by the location::name of the data frame, e.g. ?mosaicData::Galton
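The location::name pattern can be tried out with a short sketch like the one below. As a stand-in, it uses faithful, a small data frame in the datasets package that ships with R; that choice is ours for illustration and is not one of the data frames used in these Lessons. The mosaicData lines run only if that package happens to be installed.

```r
# Stand-in example: the faithful data frame comes with every R installation.
head(datasets::faithful)          # first few rows, accessed as location::name

# The documentation is opened with a question mark in front of location::name.
# It is wrapped in interactive() so a non-interactive script does not open help.
if (interactive()) {
  ?datasets::faithful
}

# The same pattern for the data frame discussed in the text,
# assuming the mosaicData package has been installed.
if (requireNamespace("mosaicData", quietly = TRUE)) {
  print(head(mosaicData::Galton))
}
```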
Computing with data frames
Lesson 2 covers how to make informative graphics that give an overview of the contents in a data frame. Lesson 4 introduces commands for manipulating the contents of a data frame to put them in a more useful form for the data-graphics or -summary task at hand.
Here, we will show you how to access data frames and their documentation, and simple tasks such as listing the variable names and glimpsing a few rows of a data frame for the purposes of orientation.
There are many software systems for working with data frames. Commonly available spreadsheet software, while suited to some data-entry and data summarizing tasks, is surprisingly limited when it comes to statistical thinking. The system we will use, called RStudio, is one of a handful used by data-science professionals. It’s available free both as an online, browser-based platform and for installation on a laptop computer or computer server.
Much of the statistical work you do in RStudio consists of writing commands in the R language. The word “language” is offputting to many people, associating it as they do with natural languages such as Chinese or Spanish, mastery of which takes time and much work. Fortunately, you do not have to learn the R language, just a couple dozen simple expressions.
We will continue here under the assumption that you have already been shown how to install and access RStudio by an instructor or other mentor. That person will have arranged to install some additional software written for these Lessons. Once that has been done, give this command at the R prompt in the “console” tab.
library(math300)
posit.cloud
Note: If you are a student using these Lessons as part of a class, check with your instructor who may already have set up a way for you to access RStudio. Otherwise …
posit.cloud is a “freemium” web service. The word “freemium” signals that you can use it for free, up to a point. Fortunately, that point will suffice for you to follow all of these Lessons.
- In your browser, follow this link. This will take you to posit.cloud and, after asking you to log in via Google or to set up an account, will bring you to a page that will look much like the following. (It may take a few minutes.)
On the left half of the window, there are three “tabs” labelled “Console,” “Terminal,” and “Background Jobs.” You will be working in the “Console” tab. Click in that tab and you will see a flashing | cursor after the > sign.
Give this command, exactly as written, and press return:
library(math300)
Now you are ready to go.
All of your work with R will consist of giving commands at the > prompt and pressing return. Possibly the simplest of all commands is merely the name of a data frame. For instance, the math300 library provides, among many others, a data frame named AAUP. Try this as a command:
AAUP
The result of such a command will be a print-out of the first several rows and columns of the data frame. Some of the data frames provided by math300 have a couple of dozen rows, others have tens of thousands. Printing out the first few rows of a data frame is useful since it shows the variable names and you can see whether each variable is quantitative or categorical.
In order to see the codebook for a data frame, simply precede the name with the ? character, for instance:
?AAUP
The codebook for the CDC births data frame can be accessed with ?Births2022. When displayed in the Help tab, you can scroll through the descriptions of all 38 variables.
RStudio arranges for the codebook to be displayed in the “Help” tab. This allows you to scroll through the documentation, follow web links (if any), and keep the names of the variables displayed in the Help tab while you write commands in the Console tab.
Often, the commands you will use in these Lessons will start with the name of a data frame followed by a description of the action you want to perform. Let’s consider two simple actions:
- Count the rows in the data frame:
AAUP |> nrow()
[1] 28
- List the names of the variables.
AAUP |> names()
[1] "subject" "acsal" "fem" "unemp" "nonac" "nonacsal" "licensed"
Each of these commands could be written in a more compact way, e.g. nrow(AAUP) or names(AAUP). This works well for simple commands, but can become burdensome as we work with commands that involve several stages. We are introducing the pipeline syntax from the very beginning, to help you get used to it.
These two commands have a similar structure involving four elements.
\[\underbrace{\mathtt{AAUP}}_\text{name of data frame}\ \ \underbrace{\color{blue}{\texttt{|>}}}_\text{pipeline symbol} \ \ \underbrace{\texttt{nrow}}_\text{function name}\ \ \underbrace{\color{blue}{\texttt{()}}}_\text{open \& close parentheses}\]
There are two names in this command: the name of a data frame and a “function” name. The function name tells what you want to calculate from the data frame.
There are also two bits of punctuation:
- the pipeline symbol |>, which connects the data frame to the function.
- a pair of open and close parentheses immediately following the function name. Every time you use a function the function name will be followed by parentheses.
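The following sketch compares the pipeline form with the compact form, and shows a command with more than one stage. It uses faithful, a data frame that ships with R, as a stand-in for AAUP (our choice, so that the commands run without the math300 library), and it assumes R version 4.1 or later, where the |> symbol is available.

```r
# faithful ships with R; it stands in here for AAUP.

# One stage: the pipeline form and the compact form give the same answer.
faithful |> nrow()
nrow(faithful)

# Two stages: keep the first three rows, then list the variable names.
faithful |> head(3) |> names()

# The compact equivalent must be read from the inside out.
names(head(faithful, 3))
```

With two stages the difference in readability is already visible: the pipeline reads left to right in the order the steps happen, while the compact form nests the first step inside the second.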
You may have noticed that the displays of data frames printed in this book are given labels such as ?tbl-galton-dataframe. It is natural to wonder why the word “table” is used sometimes and “data frame” other times.
In these Lessons we make the following distinction. A “data frame” stores values in the strict format of rows and columns, where every row represents the same kind of specimen and every column consists only of values of the same kind, for instance height or sex. Data frames should always be “machine readable.”
The human working with data frames typically has the goal of making a display intended for human eyes. A “table” is one kind of display for humans. Since humans have common sense and have learned many ways to communicate with other humans, a table does not have to follow the restrictions placed on data frames. Tables are not necessarily organized in strict row-column format, and they can include comments as well as units for numerical quantities. An example is the table put together by Francis Galton (Figure 1.3) to organize his measurements of heights.
We make the distinction between a data frame (for data storage) and a table (for communicating with humans) because many of the operations discussed in later lessons serve the purpose of transforming data frames into human-facing displays such as graphics or tables.
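As a small illustration of that transformation, the sketch below starts from a data frame and produces a summary intended for human readers. The data frame repairs and its values are hypothetical, and aggregate() is just one base-R way to do the summing; later lessons may well use different commands.

```r
# Hypothetical data frame: each row is one part installed during a repair.
repairs <- data.frame(
  mechanic = c("Anne", "Beatrice", "Anne", "Clarisse"),
  price    = c(170.00, 78.42, 385.95, 39.50),
  stringsAsFactors = FALSE
)

# Summarize: total price per mechanic. The result is still machine readable.
summary_table <- aggregate(price ~ mechanic, data = repairs, FUN = sum)

# For human eyes, attach the unit to each number. The price column is now
# text, so this version is a table for display, not for further computation.
summary_table$price <- paste0("$", formatC(summary_table$price,
                                           format = "f", digits = 2))
print(summary_table, row.names = FALSE)
```

Once the dollar sign is attached, the price column no longer holds numbers, which is exactly the kind of liberty a table can take and a data frame should not.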
Exercises
We have access to the physical pages on which Francis Galton originally recorded the data on heights shown in Figure 1.1. Galton, however, did not know about “data frames,” nor the other useful conventions that dominate work with data in the present world.
Here is an excerpt from Galton’s notebook recording the height data:
Describe the ways in which Galton’s data organization differs from that of a data frame.
The data table below records activity at a neighborhood car repair shop.
mechanic product price date
--------- --------------- ------- -----------
Anne starter 170.00 2019-01-12
Beatrice shock absorber 78.42 2019-01-12
Anne alternator 385.95 2019-01-12
Clarisse brake shoe 39.50 2019-01-12
Clarisse brake shoe 39.50 2019-01-12
Beatrice radiator hose 17.90 2019-02-12
The codebook for a data table should describe what is the unit of observation. For the purpose of this exercise, your job is to comment on each of the following possibilities and say why or why not it is plausibly the unit of observation.
- a day. Answer: There must be more to it than that, since the same date may be repeated with different values for the other variables.
- a mechanic. Answer: No. The same mechanic appears multiple times, so the unit of observation is not simply a mechanic.
- a car part used in a repair. Answer: Could be, for instance if every time a mechanic installs a part a new entry is added to the table describing the part, its price, the date, and the mechanic doing the work.
The US Department of Transportation has a program called the Fatality Analysis Reporting System. FARS has a web site which publishes data. Figure 1.4 shows a partial screen shot of their web page.
For several reasons, the table is not in tidy form.
Some of the rows serve as headers for the next several rows, but don’t contain any data. Identify several of those headers. Answer: “Motor vehicle traffic crashes”, “Traffic crash fatalities”, “Vehicle occupants”, “Non-motorists”, “Other national statistics”, “National rates: fatalities”
In tidy data, all the entries in a column should describe the same kind of quantity. You can see that all of the columns contain numbers. But the numbers are not all the same kind of quantity. Referring to the 2016 column:
- What kind of thing is the number 34,439? Answer: A number of crashes
- What kind of thing is 18,610? Answer: A number of drivers
- What kind of thing is 1.18? Answer: A rate: fatalities per 100-million miles.
In tidy data, there is a definite unit of observation that is the same kind of thing for every row. Give an example of two rows that are not the same kind of thing. Answer: For example, “Registered vehicles” and “Licensed drivers”. The first is a count of cars, the second a count of drivers.
Identify a few rows that are summaries of other rows. Such summaries are not themselves a unit of observation. Answer: “Sub Total1”, “Sub Total2”, “Total**”
The data frames natality2014::Natality_2014_100K and mosaicData::Births78 relate to babies born in 2014 and 1978 respectively. They have utterly different units of observation.
- What are the units of observation of each of these two data frames? (Hint: Look at the documentation for each of them using the ? command described in Section 1.2.)
- What are the levels of the categorical variable wday in Births78? (Hint: Use head() or View().)
- One deficiency in the documentation of Natality_2014_100K is that the documentation for variable dwgt_r does not say what units (if any) the values are in. The values are numbers in the range 100 to 400. To judge from the documentation, what are the units of dwgt_r?
In the text, we stated that the “unit of observation” of the data frame shown in Figure 1.1 is an “individual, fully grown person.”
Suppose that a devil’s advocate claimed that this is incorrect, and that the unit of observation is really a family, not an individual. What can you point to in the data frame to argue the point?
The unit of observation in the mosaicData::KidsFeet data frame is a 3rd- or 4th-grade student in the elementary school attended by a statistician’s daughter. You can see the first few rows by giving the R command
head(mosaicData::KidsFeet)
name birthmonth birthyear length width sex biggerfoot domhand
------- ----------- ---------- ------- ------ ---- ----------- --------
David 5 88 24 8.4 B L R
Lars 10 87 25 8.8 B L L
Zach 12 87 24 9.7 B R R
Josh 1 88 25 9.8 B L R
Lang 2 88 25 8.9 B L R
Scotty 3 88 26 9.7 B R R
For each variable, say whether “categorical” or “numerical” gives the better description of the variable’s type.
The birthmonth and birthyear variables are written using numerals, but this is due to deficiencies in the software used to record the data in the 1990s. Describe at least one of the ways in which birthmonth does not behave like a number. (Hint: Is 12/1987 close or far from 01/1988?)
DRAFT, talk about date-times and why the CDC data uses month and day of week.
The text states that “Figure 1.1 shows part of a data frame.” The nature of data frames is that there are two completely distinct ways that data frame might be extended to include more data. What are they?
This table of values is not a proper data frame. Why not?
Have variables with mixed types, and shift units of observations from countries to individuals.
Some examples of reading data from files on the internet: csv, Google spreadsheet.
Note that when they are read in to R they are given a name which can also be used to access the documentation.
Direct the reader to one of the open data sites, e.g. https://data.cityofnewyork.us/City-Government/Open-Parking-and-Camera-Violations/nc67-uf89 Point out that the web site is a combination of documentation, a “table preview”, and a way to access a spreadsheet CSV file containing the data.
A small part of this data is provided as math300::NYC_parking.
Look at 1940 Census sheet. What are some of the ways in which the order is being used to indicate the value of a variable?