These notes focus on an organization of data that is simple yet very widely used, what’s sometimes called “tidy data.”1 At the heart of tidy data is the data table. Here’s an example: a listing from the US Social Security Administration of the yearly births in the US along with the name given to the baby.
name | sex | count | year |
---|---|---|---|
Sherina | F | 19 | 1970 |
Leketha | F | 6 | 1973 |
Tahirih | F | 6 | 1976 |
Radha | F | 13 | 1979 |
Hatim | M | 8 | 1981 |
Cissy | F | 7 | 1984 |
Ahmet | M | 6 | 1986 |
For instance, there were 6 boys named Ahmet born in the US in 1986, 19 girls named Sherina in 1970. Common sense tells you that this is just a small part of the original table. In fact, the 19 Sharinas are only a small fraction of 1,748,097 girls born in 1970. In 1986 the Ahmets were just 7 of the 1,839,985 boys. This particular data table covers the years 1880 through 2013 and a total of 333,417,770 babies, somewhat larger than the current population of the US.
These data are “tidy” simply because they are organized according to some simple rules.
count
gives the number of babies for each case; sex
tells which gender the case refers to.In contrast, here’s a table that isn’t in tidy form:
The table is attractive and neatly laid out. There are helpful labels and summaries that make it easy for a person to read and draw conclusions. (For instance, Ward 1 had a higher voter turnout than Ward 2, and both wards were lower than the city total.)
But being neat is not what makes data tidy. This table, however neat it is, violates the rules for tidy data.
The tidy data rules make it easy to summarize or analyze data. For instance, in the tidy baby names table, it’s easy to find the total number of babies: just add up all the numbers in the count
variable. It’s similarly easy to find the number of cases: just count the rows. And if you want to know the total number of Ahmeds or Sherinas across the years, there’s an easy way to do that.
In contrast, it would be more difficult in the Minneapolis election data to find, say, the total number of ballots cast. If you add up the numbers in column I, the result will be three times the number of ballots, because some of the rows contain summaries.
Indeed, if you wanted to do calculations based on the Minneapolis election data, you would first want to put it in a tidy form, like this:
Ward | Precinct | Registered Voters at 7am | Voters Registering by Absentee | Total Registrations | Voters at Polls | Absentee Voters |
---|---|---|---|---|---|---|
1 | 1 | 1878 | 3 | 28 | 492 | 27 |
1 | 4 | 2139 | 5 | 29 | 768 | 26 |
1 | 7 | 1847 | 0 | 47 | 291 | 8 |
2 | 1 | 2820 | 1 | 63 | 1011 | 39 |
2 | 4 | 1582 | 0 | 53 | 117 | 3 |
2 | 7 | 1013 | 0 | 39 | 138 | 7 |
2 | 10 | 2822 | 0 | 87 | 196 | 5 |
3 | 3 | 2590 | 4 | 71 | 893 | 101 |
3 | 6 | 2724 | 12 | 102 | 927 | 71 |
The tidy form is, admittedly, not as attractive as the form published by the Minneapolis government. But the tidy form is much easier to use for the purpose of generating summaries and analyses.
The strategy of using tidy data so that reports can easily be made lets you construct ways of presenting data that can be more effective than a formatted spreadsheet. For instance, here is a presentation of the turnout in each ward that makes it easy to see how much variation there is.
The tidy format also makes it easier to bring together data from different sources. For instance, to explain the variation in voter turnout, you might want to look at variables such as party affiliation, age, income, etc. Such data might be available on a ward-by-ward basis from other records, such as the (public) voter registration logs and census records. This would be difficult if you had to deal with an ideosyncratic format for each different source of data.
In data, the word “variable” has a different meaning than found in high-school algebra. In algebra, a variable is an unknown quantity. In data, a variable is known; it’s been measured. “Variable” refers to a specific quantity or quality that can vary from case to case.
There are two major types of variables: categorical and quantitative. A quantitative variable is just what it sounds like: a number.
A categorical variable tells which category a case falls into. For instance, in the baby names data table, sex
is a categorical variable with two levels F and M, standing for female and male. Similarly the name
for each case is categorical. It happens that there are 92,600 different levels for name
, ranging from Aaron, Ab, and Abbie to Zyhaire, Zylis, and Zymya.
As already said, a row of a tidy data table refers to a case. To this point, you may have little reason to prefer the word “case” to “row.”
In working with data, keep in mind what a case stands for in the real world. Sometimes the meaning is obvious. For instance, here eight cases from a (tidy) data table showing each in the Minneapolis mayoral election in 2013. (The voters were directed to mark their ballot with the first three choices of candidates, in order. This is part of a procedure called rank choice voting.)
Precinct | First | Second | Third | Ward |
---|---|---|---|---|
P-05 | MARK V ANDERSON | undervote | undervote | W-13 |
P-09 | BETSY HODGES | JACKIE CHERRYHOMES | DON SAMUELS | W-13 |
P-02 | BETSY HODGES | DON SAMUELS | JACKIE CHERRYHOMES | W-2 |
P-02 | BETSY HODGES | undervote | undervote | W-13 |
P-03 | DON SAMUELS | MARK ANDREW | BETSY HODGES | W-7 |
P-05 | BETSY HODGES | MARK ANDREW | BOB FINE | W-9 |
P-04 | JACKIE CHERRYHOMES | DON SAMUELS | BETSY HODGES | W-4 |
P-08 | DON SAMUELS | JACKIE CHERRYHOMES | MARK V ANDERSON | W-7 |
Altogether, there are 80,101 rows2 in the table. Each row corresponds to a ballot cast in the 2013 election. That’s the meaning of “case” in this situation.
The case is different in the vote summary table; a case is a ward in a precinct. In the baby names data, a case is a name and sex in one year.
When thinking about cases, ask this question: What description would make every case unique? In the vote summary data, a precinct does not uniquely identify a case. The individual precincts each appear in several rows. But each precinct & ward combination appears once and only once.
Here’s a small part of a data table about results from a 10-mile running race held each year in Washington, D.C.
name.yob | sex | age | year | gun |
---|---|---|---|---|
jane polanek 1974 | F | 32 | 2006 | 114.50 |
jane poole 1948 | F | 55 | 2003 | 92.72 |
jane poole 1948 | F | 56 | 2004 | 87.28 |
jane poole 1948 | F | 57 | 2005 | 85.05 |
jane poole 1948 | F | 58 | 2006 | 80.75 |
jane poole 1948 | F | 59 | 2007 | 78.53 |
jane schultz 1964 | F | 35 | 1999 | 91.37 |
jane schultz 1964 | F | 37 | 2001 | 79.13 |
jane schultz 1964 | F | 38 | 2002 | 76.83 |
jane schultz 1964 | F | 39 | 2003 | 82.70 |
jane schultz 1964 | F | 40 | 2004 | 87.92 |
jane schultz 1964 | F | 41 | 2005 | 91.47 |
jane schultz 1964 | F | 42 | 2006 | 88.43 |
jane smith 1952 | F | 47 | 1999 | 90.60 |
jane smith 1952 | F | 49 | 2001 | 97.87 |
What’s the meaning of a case here? It’s tempting to think that a case is a person. After all, it’s people who run road races. But notice that individuals appear more than once: Jane Poole ran each year from 2003 to 2007. (Her times improved consistently as she got older!) Jane Smith ran in the races from 1999 to 2006, missing only the year 2000 race. This suggests that the case is a runner in one year’s race.
Note that even if the individual identifier in the first column were stripped away, the case would still be a runner in one year’s race. Data tables do not necessarily display all the variables needed to figure out what makes each row unique. For such information, you sometimes need to look at the documentation of how the data were collected and what the variables mean.
The “codebook” is a document — separate from the data table — that describes various aspects of how the data were collected, what the variables mean and what the different levels of categorical variables refer to. The word “codebook” comes from the days when data was encoded for the computer in ways that make it hard for a human to read.
For data provided by R packages, a codebook is available via the help()
function, e.g., for the BabyNames
data in the DCF
package, try:
help( BabyNames )
For the runners data, the codebook would tell you that the meaning of the gun
variable is the time from when the start gun went off to when the runner crosses the finish line and that the unit of measurement is “minutes.” It would also state the obvious: that age
is the person’s age in years and sex
has two levels, male and female, represented by M and F.3
An example of a complicated codebook of the sort sometimes found in the tabulation of government or social science data, is this document describing data from the Uniform Crime Reporting Data from the US Department of Justice. You generally need a lot of knowledge about the specific field to make sense of such things.
Please use the comment system to make suggestions, point out errors, or to discuss the topic.
See Hadley Wickham.↩
It’s interesting to note that the citywide tally, by precinct and ward, gives only 80,099. There are two extra ballots in the vote table. Perhaps a couple of legitimate absentee ballots, postmarked before the election, were received after the tally was made.↩
Such codes are not always self evident to the reader. In Welsh, the codes might be B for females and G for males. Not “boy” and “girl” but “Gwryw” for female and “Benyw” for male. A non-Welsh speaker would easily be confused without the codebook.↩