When data are in glyph-ready form, it is straightforward to construct data graphics using the concepts and techniques in Chapters 5 and 6. First, choose an appropriate sort of glyph: dots, stars, bars, countries, etc. Next, select which variables are to be mapped to the various aesthetics for that glyph. Then let the computer do the drawing.
On occasion, data will arrive in glyph-ready form. More typically, you have to wrangle the data into the glyph-ready form appropriate for your own purpose.
Table 7.1 records the choice of each of 80,101 individual voters in the 2013 mayoral election in Minneapolis, MN, USA.
Table 7.1: The choices of each voter in the mayoral election. See Minneapolis2013 in the dcData package.
|P-09||KURTIS W. HANNA||W-10|
|… and so on for 80,101 rows altogether.|
The primary use for data like Table 7.1 is to determine how many votes were given to each candidate. Counting the votes for each candidate is a simple wrangling task that transforms the ballot data (Table 7.1) into a glyph-ready form (Table 7.2 right) that makes the answer obvious. The count information is still latent in the raw ballot data; it is not yet in a form where it is easily seen.
Table 7.2: The number of votes for each candidate. This has been wrangled from Table 7.1 into a glyph-ready form, that is, a form in which the information sought after is readily seen.
|… and so on for 38 rows altogether.|
The instructions for carrying out the wrangling are simple to give in English: Count the number of ballots that each candidate received and sort from highest to lowest. To carry out the process on a computer, you need to express this idea in a form the computer can carry out.
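As a minimal sketch of how that English instruction might be expressed, the following uses a small invented data frame, Ballots, as a stand-in for the real ballot data. The candidate labels and the output variable name votes are assumptions made for illustration, and arrange() is a dplyr sorting verb not otherwise discussed in this section.

```r
library(dplyr)

# Hypothetical stand-in for the ballot data: one row per ballot,
# with the voter's first choice in `First`. Values are invented.
Ballots <- data.frame(
  First = c("A", "B", "A", "C", "A", "B")
)

VoteCounts <- Ballots %>%
  group_by(First) %>%          # one group per candidate
  summarise(votes = n()) %>%   # count the ballots in each group
  arrange(desc(votes))         # sort from highest to lowest

VoteCounts
```

Each row of VoteCounts describes one candidate rather than one ballot, which is the change in the meaning of a case that the wrangling is meant to accomplish.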
Before you start wrangling data, it’s crucial to envision what the goal is to be. In general, the purpose of wrangling is to take the data you have at hand and put it into glyph-ready form. But glyph-ready for what? This is something you, the data analyst, need to determine based on your own objectives.
Table 7.3 has variables indicating each person’s age, sex, and whether he or she smokes.
Table 7.3: The NCHS data from the dcData package. In NCHS, the case is an individual person.
|… and so on for 29,375 rows altogether.|
Suppose you want to use NCHS to explore the links between smoking and age. Often you will have some kind of presentation graphic in mind. Making an informal sketch like Figure 7.1 can be a helpful way to chart a path for data wrangling. The sketch can be based on your imagination of what the data might show. Even so, the sketch contains important information. For instance, from the sketch you can determine what a single glyph represents. As always, each glyph will be one case in the glyph-ready data. In this sketch, the glyph describes a group of people: people of one sex in one age group. This differs from the cases in NCHS, which are individual people.
To make your goal for data wrangling even more explicit, rough out the form of the glyph-ready data frame, as in Figure 7.2. Don’t worry about calculating precise data values; the computer will do that after you have implemented your data wrangling plan.
The glyph-ready data for that graph will have a form like the rough table in Figure 7.2. That glyph-ready form is your target.
Each of these steps would be easy enough to do by hand (if you had the time and patience to work through the 31,126 cases in NCHS). To get the computer to do the work for you, you have to be able to describe each process to the computer. The next sections present a framework for describing wrangling operations that can work for both the human describing the wrangling and the computer that will carry out the calculations.
Over the last half century, researchers have identified a small set of patterns that can be used to describe a wrangling process. Because this set is small, it’s feasible for you to learn quickly what you need to describe the wrangling you have planned.
As you know, doing something in R is accomplished using functions. A function takes one or more inputs (the “arguments”) and returns an output. In thinking about data wrangling, it helps to consider these potential forms for inputs and outputs:
There are three broad families of functions involved in data wrangling. The families differ in what form of input they take in and what form they return.
Each step in data wrangling involves a data verb and one or more reduction or transformation functions.
Reduction functions summarize or reduce a variable to scalar form. You are likely familiar with several reduction functions:
mean(): find a single typical value
sum(): add up numbers into a total
n(): find how many cases there are
Some other frequently used reduction functions:
min() and max(): find the smallest and largest values in a variable
sd() and other functions used in statistics
n_distinct(): find how many different levels there are among the cases
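To see what "reducing to scalar form" looks like in practice, here is a short sketch applied to a small invented vector, ages. The values are assumptions made purely for illustration. Note that n() is left out because it works only inside a data verb such as summarise().

```r
library(dplyr)  # needed only for n_distinct()

# Hypothetical values, invented for illustration
ages <- c(23, 35, 35, 41, 58)

mean(ages)        # a single typical value
sum(ages)         # the total
min(ages)         # the smallest value
max(ages)         # the largest value
sd(ages)          # the standard deviation
n_distinct(ages)  # how many different values there are
```

Whatever the length of the input, each of these calls returns exactly one number.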
Transformation functions take one or more variables as input and return a new variable. In contrast to reduction functions, which produce a single number — a scalar — by combining all the cases, transformation functions produce a result for each individual case.
Often, transformations are mathematical operations, for instance:
weight / height
log10( population )
round( age )
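By way of contrast with reduction functions, the sketch below applies two of these transformations to invented vectors. The values of population and age are assumptions for illustration only. The point to notice is that the output has one entry per case.

```r
# Hypothetical measurements, invented for illustration
population <- c(1000, 250000, 8000000)
age <- c(17.6, 42.2, 65.9)

log_pop <- log10(population)   # one result for each case
rounded_age <- round(age)      # one result for each case

length(log_pop)   # the same length as the input
rounded_age
```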
Other transformation functions include numerical and character comparisons, as in Table 7.4.
Table 7.4: Some of the functions used to compare values.
|For numerical comparisons|
||is one of||
|For character strings|
||is one of||
Another important operation, ifelse(), allows you to translate each value in a variable to one of two values, depending on the result of a comparison. For instance:
ifelse(age >= 18, "voter", "non-voter")
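The following sketch shows comparisons and ifelse() working case by case. The vectors age and name are invented for illustration, and %in% is used here as R's way of asking whether a value "is one of" a set.

```r
# Hypothetical ages and names, invented for illustration
age <- c(12, 18, 45, 17)
name <- c("Ann", "Bo", "Cy", "Di")

age >= 18                  # numerical comparison: one TRUE/FALSE per case
age %in% c(17, 18)         # "is one of" for numbers
name == "Bo"               # character comparison
name %in% c("Ann", "Di")   # "is one of" for character strings

# Translate each comparison result into one of two values
status <- ifelse(age >= 18, "voter", "non-voter")
status
```

Like the other transformation functions, each of these produces a result for every case rather than a single summary value.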
Data verbs carry out an operation on a data frame and return a new data frame. Some data verbs involve the modification of variables, the creation of new variables, or the deletion of existing ones. Other data verbs involve changing the meaning of a case or adding or deleting cases.
In addition to a data frame input, each data verb takes additional arguments that provide the specifics of the operation. Very often, these specifics involve reduction and transformation functions.
There are about a dozen data verbs that are commonly used. Additional data verbs will be introduced in Chapter 10 and beyond, but lots of interesting questions can be investigated with just these two:
summarise(): turns multiple cases into a single case using reduction functions. “Aggregate” is a synonym for “summarise.”
group_by(): modifies the action of reduction functions so that they give a single value for different groups of cases in a data frame.
To illustrate the uses of summarise(), look at the WorldCities data frame (in the dcData package) with information about the most populous cities in each country. Table 7.5 shows a subset of the variables:
Table 7.5: World cities
|… and so on for 23,018 rows altogether.|
A simple summary of the data is a count of the number of cities.
The reduction function is n(). The expression count = n() means that the output data frame should have a variable named count containing the results of n(). The n() function doesn’t take any arguments.
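A counting summary of this kind can be sketched as follows. Cities is a small invented data frame standing in for WorldCities; its variable names and values are assumptions made so that the example runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for WorldCities; all values are invented
Cities <- data.frame(
  name       = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  country    = c("X", "X", "Y", "Y", "Y"),
  population = c(500000, 120000, 3000000, 45000, 0)
)

CityCount <- Cities %>%
  summarise(count = n())   # `count` becomes the variable name in the output

CityCount
```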
Perhaps you want to know the total population in these cities, the average population per city, or the population of the smallest city in the table, as in Table 7.6.
Table 7.6: Summary statistics of the population of world cities
The average city on the list has about 110,000 people. The total population of people living in these cities is about 2.6 billion, a bit more than one-third of the world population. The smallest city has … zero people! Evidently, the WorldCities data frame has one or more cases that are not really cities.
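Several reduction functions can be used in a single summarise() step. The sketch below again uses an invented Cities data frame in place of WorldCities. The output names total, average, and smallest are assumptions, and one row is deliberately given a population of zero to mimic the kind of suspicious case described above.

```r
library(dplyr)

# Hypothetical stand-in for WorldCities; all values are invented
Cities <- data.frame(
  name       = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  country    = c("X", "X", "Y", "Y", "Y"),
  population = c(500000, 120000, 3000000, 45000, 0)
)

PopSummary <- Cities %>%
  summarise(
    total    = sum(population),    # total population across all cases
    average  = mean(population),   # average population per case
    smallest = min(population)     # population of the smallest case
  )

PopSummary
```

A summary like smallest is a cheap way to spot cases that deserve a closer look before any further analysis.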
Some things to notice about the use of summarise():
%>% has been used to pass the first argument to summarise(). This will be useful later on, when there is more than one step in a transfiguration. It’s OK to end a line with %>%, but never start a line with it.
The output of summarise() is a data frame. (In this example the output has only one case, but it is still a data frame.)
summarise() takes named arguments. The name of an argument is taken as the name of the corresponding variable created by summarise().
The group_by() data verb sets things up so that other data verbs will perform their action on a group-by-group basis. For instance, Table 7.7 shows the number of cities in WorldCities broken down by country.
Table 7.7: The number of cities in each country listed in WorldCities.
|… and so on for 243 rows altogether.|
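A group-wise count of this sort can be sketched as follows, once more with an invented Cities data frame standing in for WorldCities. The variable names and values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for WorldCities; all values are invented
Cities <- data.frame(
  name       = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  country    = c("X", "X", "Y", "Y", "Y"),
  population = c(500000, 120000, 3000000, 45000, 0)
)

CountryCounts <- Cities %>%
  group_by(country) %>%      # set up one group per country
  summarise(count = n())     # n() is now evaluated within each group

CountryCounts
```

The output has one case per country, whereas the same summarise() step without group_by() would give a single case for the whole data frame.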
The group_by() verb should always be followed by another verb, which is summarise() in the above example.
group_by() is the way to indicate to the following verbs that reduction operations should be performed on a group-wise basis.
group_by(), along with n(), provides a basis for counting up the number of cases in groups and subgroups, or for calculating group-wise statistics.
Note that the functions used within arguments to summarise() (functions such as sum(), etc.) are not data verbs. A data verb takes a data frame as input and returns a transfigured data frame as output. In contrast, the reduction functions, sum(), and so on, take a variable as input and return a single number as output.
The summarise() and group_by() data verbs are team players; they work best in combination with other data verbs. Once you learn those data verbs, particularly mutate(), you’ll be able to carry out many more operations. The richness of the data-verb system comes from the ways the different verbs can be combined.
Suppose you want to create a graph like Figure 7.3 to examine the relative number of male and female births over time.
The BabyNames data frame contains this information implicitly. Before the information can be graphed, you need to wrangle the existing data frame into glyph-ready form.
Table 7.11: The BabyNames data frame
|… and so on for 1,792,091 rows altogether.|
The particular wrangling needed here is to calculate the sum of count for each sex in each year. Here’s one way to do this.
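A sketch of this wrangling, under stated assumptions, appears below. Names is a tiny invented data frame with the same kind of variables as BabyNames (a name, a sex, a count, and a year), and its values are made up. The .groups = "drop" argument is an optional addition that returns an ungrouped result.

```r
library(dplyr)

# Hypothetical stand-in for BabyNames; all rows are invented
Names <- data.frame(
  name  = c("Ada", "Ada", "Bea", "Cal", "Cal", "Dee"),
  sex   = c("F", "F", "F", "M", "M", "F"),
  count = c(10, 12, 7, 9, 11, 5),
  year  = c(1990, 1991, 1990, 1990, 1991, 1991)
)

YearlyBirths <- Names %>%
  group_by(year, sex) %>%                           # one group per sex per year
  summarise(births = sum(count), .groups = "drop")  # add up the counts in each group

YearlyBirths
```

In the result, a case is no longer a name in a year but a sex in a year, which is what the intended graph needs.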
Table 7.12: The data frame YearlyBirths comes from BabyNames wrangled into counts of births each year.
|… and so on for 268 rows altogether.|
YearlyBirths is in glyph-ready form. Using YearlyBirths, drawing Figure 7.3 is a matter of assigning variables to graphical aesthetics: year to the \(x\)-axis, births to the \(y\)-axis, and sex to dot color.
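Assuming ggplot2 is the graphics package in use (an assumption, since the section itself does not name one), that assignment of variables to aesthetics might be sketched as follows. The small YearlyBirths data frame here is invented so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical glyph-ready data in the shape of YearlyBirths; values are invented
YearlyBirths <- data.frame(
  year   = c(1990, 1990, 1991, 1991),
  sex    = c("F", "M", "F", "M"),
  births = c(17, 9, 17, 11)
)

# Map year to x, births to y, and sex to dot color; draw one dot per case
p <- ggplot(YearlyBirths, aes(x = year, y = births, color = sex)) +
  geom_point()

p
```

Because the data are already glyph-ready, the plotting step needs no further calculation: each row becomes exactly one dot.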
Problem 7.1: For each of the operations listed here, say whether it involves a transformation function or a reduction function or neither.
Problem 7.2: Each of these statements has an error. It might be an error in syntax or an error in the way the data tables are used, etc. Describe what each expression apparently attempts to do, as well as the error(s) that cause it to fail.
BabyNames %>% group_by( "First" ) %>% summarise( votesReceived=n() )
Tmp <- group_by(BabyNames, year, sex ) %>% summarise( Tmp, totalBirths=sum(count))
Tmp <- group_by(BabyNames, year, sex) summarise( BabyNames, totalBirths=sum(count) )
Problem 7.3: Using the Minneapolis2013 data table in the dcData package, answer these questions:
Second vote selections? (That is, of all the possible ways a voter might have marked his or her first and second choices, which received the highest number of votes?)
Precinct had the highest number of ballots cast?
Problem 7.4: Each of these statements has an error. It might be an error in syntax or an error in the way the data tables are used, etc. Write down a correct version of the statement.
BabyNames %>% group_by(BabyNames, year, sex) %>% summarise(BabyNames, total = sum(count))
ZipGeography <- group_by(State) %>% summarise(pop = sum(Population))
Minneapolis2013 %>% group_by(First) -> summarise(voteReceived = n())
summarise(votesReceived = n()) %<% group_by(First) <- Minneapolis2013
Problem 7.5: The data verbs group_by() and summarise() are very frequently used in combination. Experiment with the R code, help documentation, etc. to investigate each of the following.
VoterData_A apparently been modified when compared to the original Minneapolis2013 data? What does a case represent in
VoterData_A <- Minneapolis2013 %>% group_by(First, Second)
VoterData_B apparently been modified when compared to the original Minneapolis2013 data? What does a case represent in
VoterData_B <- Minneapolis2013 %>% summarise( total = n() )
VoterData_C apparently been modified when compared to the original Minneapolis2013 data? What does a case represent in
VoterData_C <- Minneapolis2013 %>% group_by(First, Second) %>% summarise( total = n() )
summarise() steps are reversed and now the result is an error indicating that “Column First is unknown.” Clearly the variable First existed in the Minneapolis2013 data frame, so why is it now unknown?
## Error: Must group by variables found in `.data`.
## * Column `First` is not found.
## * Column `Second` is not found.
Problem 7.6: Using the
Find the total land area and population in each state.
Make a scatter plot showing the relationship between land area and population for each state.
Make a choropleth map showing the population of each state.
Make a choropleth map showing the population per unit area of each state.
Problem 7.7: Imagine a data table,
Patients, with categorical variables
sex, and quantitative variable
You have a statement in the form
Patients %>% group_by( **SOME_VARIABLES** ) %>% summarise(count = n(), meanAge = mean(age))
**SOME_VARIABLES** with each of the following, tell what variables will appear in the output
Problem 7.8: Use the ZipDemography data from the dcData package for the following tasks.
Foreignborn people in a zip code and the number who SpeakalanguageotherthanEnglishathome5yearsandover. (Please note that such long variable names ought to be avoided as a matter of good style.)
Warning: do NOT attempt to facet by ZIP code as you explore solutions. Doing so is very computationally intense (i.e., takes a long time if it finishes at all) only to show that one facet for each ZIP code is not a useful data visualization.
Bachelorsdegreeorhigher. Say how you would go about constructing such a plot — but don’t actually do it! Too much work.