The FAOsimple
dataset provides a quick introduction to “big” data. The structure of the data set is very simple:
We use it as a starting point for making graphics, particularly scatter plots.
As part of the DCF
package, it's easy to read in the data.
data(FAOsimple)
There are 224 cases, which turn out to be countries.
nrow(FAOsimple)
## [1] 224
There are 52 variables, of which a few are listed here — the complete list is below:
head(names(FAOsimple))
## [1] "Country"
## [2] "Year"
## [3] "Agricultural.area"
## [4] "Agricultural.area.certified.organic"
## [5] "Agricultural.area.in.conversion.to.organic"
## [6] "Agricultural.area.irrigated"
Here are a few of the countries: Sweden, Cote d'Ivoire, Poland, United Kingdom, Mongolia, Finland, Iraq, Bhutan, Anguilla, Luxembourg. There are 224 altogether.
Students can explore the relationship among variables with a simple scatterplot, for instance this one about total population as a function of agricultural area.
ggplot(data = FAOsimple, aes(x = Agricultural.area, y = Total.Population...Both.sexes)) +
geom_point() + scale_x_log10() + scale_y_log10()
## Warning: Removed 2 rows containing missing values (geom_point).
Evidently, there's a pretty simple typical relationship between agricultural area and population. To acquaint students with such graphics, some questions to ask:
To explore the graphics yourself, use the mScatter()
function, which let's you interactively choose the variables to be plotted as well as other aspects of the plot. Here's how:
require(DCF)
data(FAOsimple)
mScatter(FAOsimple)
In addition to choosing variables for the \( x \) and \( y \) axes, you can choose variables for color and size of the dot. Doing so in a meaningful way involves an additional data technique: Generating a new variable from existing variables.
Often, you create new variables by transforming old ones.
Some transformations are arithmetic, for example, creating population density by dividing population by land area:
FAOsimple = transform(FAOsimple, density = Total.Population...Both.sexes/Country.area)
When using colors to indicate the level of a variable, it can be helpful to divide into discrete groups. A particularly simple-to-use transformation divides up the cases into groups of (approximately) equal size. This is useful when dealing with variables that have an uneven distribution over a very wide range.
FAOsimple = transform(FAOsimple, ArableG = ntiles(Arable.land, n = 5))
Here's a plot that uses these transformed variables for color and size of the dots:
ggplot(data = FAOsimple, aes(x = Agricultural.area, y = Total.Population...Both.sexes)) +
geom_point() + aes(colour = ArableG) + aes(size = log10(density)) + scale_x_log10() +
scale_y_log10()
## Warning: Removed 2 rows containing missing values (geom_point).
Examine the plot above and describe the features of the plot that support these claims: 1. The amount of arable land is roughly aligned with agricultural area. 2. Countries with low population densities tend to have relatively large agricultural areas compared to the amount of arable land.
Construct new variables
# Your plot here
Is there a clear relationship between the fraction of a country's total land that is arable and the fraction of a country's total population that is engaged in agricultural work?
# Your plot goes here
Pick out some variables that you think ought to be related. Describe the relationship you hypothesize and then demonstrate whether or not your hypothesis is consistent with the data.
# Your plot here
Show a situation where the use of logarithmic axes helps to illustrate a relationship. Do this by contrasting the same plots with and without logarithmic axes and then providing a narrative explanation of what the logarithmic axes make clearer.
# Your plot without logarithmic axes
# The same plot with logarithmic axes
Your narrative goes here …
names(FAOsimple)
## [1] "Country"
## [2] "Year"
## [3] "Agricultural.area"
## [4] "Agricultural.area.certified.organic"
## [5] "Agricultural.area.in.conversion.to.organic"
## [6] "Agricultural.area.irrigated"
## [7] "Agricultural.area.organic..total"
## [8] "Arable.land"
## [9] "Arable.land.and.Permanent.crops"
## [10] "Arable.land.area.certified.organic"
## [11] "Arable.land.area.in.conversion.to.organic"
## [12] "Arable.land.organic..total"
## [13] "Country.area"
## [14] "Fallow.land"
## [15] "Forest.area"
## [16] "Inland.water"
## [17] "Land.area"
## [18] "Other.land"
## [19] "Perm..crops.irrigated"
## [20] "Perm..crops.non.irrigated"
## [21] "Perm..meadows...pastures...Cultivated"
## [22] "Perm..meadows...pastures...Nat..grown"
## [23] "Perm..meadows...pastures.Cult....irrig"
## [24] "Perm..meadows...pastures.Cult..non.irrig"
## [25] "Permanent.crops"
## [26] "Permanent.crops.area.certified.organic"
## [27] "Permanent.crops.area.in.conversion.to.organic"
## [28] "Permanent.crops.organic..total"
## [29] "Permanent.meadows.and.pastures"
## [30] "Permanent.meadows.and.pastures.area.certified.organic"
## [31] "Permanent.meadows.and.pastures.area.in.conversion.to.organic"
## [32] "Permanent.meadows.and.pastures.organic..total"
## [33] "Temp..crops.irrigated"
## [34] "Temp..crops.non.irrigated"
## [35] "Temp..meadows...pastures.irrigated"
## [36] "Temp..meadows...pastures.non.irrig."
## [37] "Temporary.crops"
## [38] "Temporary.meadows.and.pastures"
## [39] "Total.area.equipped.for.irrigation"
## [40] "Agricultural.population"
## [41] "Female.economically.active.population"
## [42] "Female.economically.active.population.in.Agr"
## [43] "Male.economically.active.population"
## [44] "Male.economically.active.population.in.Agr"
## [45] "Non.agricultural.population"
## [46] "Rural.population"
## [47] "Total.economically.active.population"
## [48] "Total.economically.active.population.in.Agr"
## [49] "Total.Population...Both.sexes"
## [50] "Total.Population...Female"
## [51] "Total.Population...Male"
## [52] "Urban.population"
## [53] "density"
## [54] "ArableG"