Data

These notes focus on an organization of data that is simple yet very widely used, what’s sometimes called “tidy data.”¹ At the heart of tidy data is the data table. Here’s an example: a listing from the US Social Security Administration of the yearly births in the US along with the name given to the baby.

name	sex	count	year
Sherina	F	19	1970
Leketha	F	6	1973
Tahirih	F	6	1976
Radha	F	13	1979
Hatim	M	8	1981
Cissy	F	7	1984
Ahmet	M	6	1986

A very small subset of a data table showing how many babies were given each name in each year.

For instance, there were 6 boys named Ahmet born in the US in 1986, 19 girls named Sherina in 1970. Common sense tells you that this is just a small part of the original table. In fact, the 19 Sharinas are only a small fraction of 1,748,097 girls born in 1970. In 1986 the Ahmets were just 7 of the 1,839,985 boys. This particular data table covers the years 1880 through 2013 and a total of 333,417,770 babies, somewhat larger than the current population of the US.

These data are “tidy” simply because they are organized according to some simple rules.

The rows, called cases, each refer to a specific, unique and similar sort of thing, e.g. girls named Sherina in 1970.
The columns, called variables, each have the same sort of thing recorded for each row. For instance, count gives the number of babies for each case; sex tells which gender the case refers to.
Every row is a “case” (except the first, which gives names to the variables).
Every column is a variable.

In contrast, here’s a table that isn’t in tidy form:

Ward and precinct votes cast in the 2013 Minneapolis Mayoral election. source

The table is attractive and neatly laid out. There are helpful labels and summaries that make it easy for a person to read and draw conclusions. (For instance, Ward 1 had a higher voter turnout than Ward 2, and both wards were lower than the city total.)

But being neat is not what makes data tidy. This table, however neat it is, violates the rules for tidy data.

Each row is the same sort of thing: a case. That’s not true here. For most of the table, the rows represent a single precinct. But other rows give ward or city-wide totals. The first two rows are a caption describing the data, not a case.
Each column is a variable. That’s mostly true in this table, but the tidy pattern is interrupted by labels that are not variables. For instance, the first two cells in row 15 are the label “Ward 1 Subtotal,” different from how the wards are represented in most of the first column, and not telling anything about the precinct.

The tidy data rules make it easy to summarize or analyze data. For instance, in the tidy baby names table, it’s easy to find the total number of babies: just add up all the numbers in the count variable. It’s similarly easy to find the number of cases: just count the rows. And if you want to know the total number of Ahmeds or Sherinas across the years, there’s an easy way to do that.

In contrast, it would be more difficult in the Minneapolis election data to find, say, the total number of ballots cast. If you add up the numbers in column I, the result will be three times the number of ballots, because some of the rows contain summaries.

Indeed, if you wanted to do calculations based on the Minneapolis election data, you would first want to put it in a tidy form, like this:

Ward	Precinct	Registered Voters at 7am	Voters Registering by Absentee	Total Registrations	Voters at Polls	Absentee Voters
1	1	1878	3	28	492	27
1	4	2139	5	29	768	26
1	7	1847	0	47	291	8
2	1	2820	1	63	1011	39
2	4	1582	0	53	117	3
2	7	1013	0	39	138	7
2	10	2822	0	87	196	5
3	3	2590	4	71	893	101
3	6	2724	12	102	927	71

The Minneapolis election data in tidy form.

The tidy form is, admittedly, not as attractive as the form published by the Minneapolis government. But the tidy form is much easier to use for the purpose of generating summaries and analyses.

The strategy of using tidy data so that reports can easily be made lets you construct ways of presenting data that can be more effective than a formatted spreadsheet. For instance, here is a presentation of the turnout in each ward that makes it easy to see how much variation there is.

A graphical depiction of voter turnout in the different wards.

The tidy format also makes it easier to bring together data from different sources. For instance, to explain the variation in voter turnout, you might want to look at variables such as party affiliation, age, income, etc. Such data might be available on a ward-by-ward basis from other records, such as the (public) voter registration logs and census records. This would be difficult if you had to deal with an ideosyncratic format for each different source of data.

Variables

In data, the word “variable” has a different meaning than found in high-school algebra. In algebra, a variable is an unknown quantity. In data, a variable is known; it’s been measured. “Variable” refers to a specific quantity or quality that can vary from case to case.

There are two major types of variables: categorical and quantitative. A quantitative variable is just what it sounds like: a number.

A categorical variable tells which category a case falls into. For instance, in the baby names data table, sex is a categorical variable with two levels F and M, standing for female and male. Similarly the name for each case is categorical. It happens that there are 92,600 different levels for name, ranging from Aaron, Ab, and Abbie to Zyhaire, Zylis, and Zymya.

Cases and what they represent

As already said, a row of a tidy data table refers to a case. To this point, you may have little reason to prefer the word “case” to “row.”

In working with data, keep in mind what a case stands for in the real world. Sometimes the meaning is obvious. For instance, here eight cases from a (tidy) data table showing each in the Minneapolis mayoral election in 2013. (The voters were directed to mark their ballot with the first three choices of candidates, in order. This is part of a procedure called rank choice voting.)

Precinct	First	Second	Third	Ward
P-05	MARK V ANDERSON	undervote	undervote	W-13
P-09	BETSY HODGES	JACKIE CHERRYHOMES	DON SAMUELS	W-13
P-02	BETSY HODGES	DON SAMUELS	JACKIE CHERRYHOMES	W-2
P-02	BETSY HODGES	undervote	undervote	W-13
P-03	DON SAMUELS	MARK ANDREW	BETSY HODGES	W-7
P-05	BETSY HODGES	MARK ANDREW	BOB FINE	W-9
P-04	JACKIE CHERRYHOMES	DON SAMUELS	BETSY HODGES	W-4
P-08	DON SAMUELS	JACKIE CHERRYHOMES	MARK V ANDERSON	W-7

Altogether, there are 80,101 rows² in the table. Each row corresponds to a ballot cast in the 2013 election. That’s the meaning of “case” in this situation.

The case is different in the vote summary table; a case is a ward in a precinct. In the baby names data, a case is a name and sex in one year.

When thinking about cases, ask this question: What description would make every case unique? In the vote summary data, a precinct does not uniquely identify a case. The individual precincts each appear in several rows. But each precinct & ward combination appears once and only once.

Here’s a small part of a data table about results from a 10-mile running race held each year in Washington, D.C.

name.yob	sex	age	year	gun
jane polanek 1974	F	32	2006	114.50
jane poole 1948	F	55	2003	92.72
jane poole 1948	F	56	2004	87.28
jane poole 1948	F	57	2005	85.05
jane poole 1948	F	58	2006	80.75
jane poole 1948	F	59	2007	78.53
jane schultz 1964	F	35	1999	91.37
jane schultz 1964	F	37	2001	79.13
jane schultz 1964	F	38	2002	76.83
jane schultz 1964	F	39	2003	82.70
jane schultz 1964	F	40	2004	87.92
jane schultz 1964	F	41	2005	91.47
jane schultz 1964	F	42	2006	88.43
jane smith 1952	F	47	1999	90.60
jane smith 1952	F	49	2001	97.87

An excerpt of runners’ performance over the years in a 10-mile race.

What’s the meaning of a case here? It’s tempting to think that a case is a person. After all, it’s people who run road races. But notice that individuals appear more than once: Jane Poole ran each year from 2003 to 2007. (Her times improved consistently as she got older!) Jane Smith ran in the races from 1999 to 2006, missing only the year 2000 race. This suggests that the case is a runner in one year’s race.

Note that even if the individual identifier in the first column were stripped away, the case would still be a runner in one year’s race. Data tables do not necessarily display all the variables needed to figure out what makes each row unique. For such information, you sometimes need to look at the documentation of how the data were collected and what the variables mean.

The Codebook

The “codebook” is a document — separate from the data table — that describes various aspects of how the data were collected, what the variables mean and what the different levels of categorical variables refer to. The word “codebook” comes from the days when data was encoded for the computer in ways that make it hard for a human to read.

For data provided by R packages, a codebook is available via the help() function, e.g., for the BabyNames data in the DCF package, try:

help( BabyNames )

For the runners data, the codebook would tell you that the meaning of the gun variable is the time from when the start gun went off to when the runner crosses the finish line and that the unit of measurement is “minutes.” It would also state the obvious: that age is the person’s age in years and sex has two levels, male and female, represented by M and F.³

An example of a complicated codebook of the sort sometimes found in the tabulation of government or social science data, is this document describing data from the Uniform Crime Reporting Data from the US Department of Justice. You generally need a lot of knowledge about the specific field to make sense of such things.

Please use the comment system to make suggestions, point out errors, or to discuss the topic.

comments powered by Disqus

See Hadley Wickham.↩
It’s interesting to note that the citywide tally, by precinct and ward, gives only 80,099. There are two extra ballots in the vote table. Perhaps a couple of legitimate absentee ballots, postmarked before the election, were received after the tally was made.↩
Such codes are not always self evident to the reader. In Welsh, the codes might be B for females and G for males. Not “boy” and “girl” but “Gwryw” for female and “Benyw” for male. A non-Welsh speaker would easily be confused without the codebook.↩