Up to now, the representation of data has been in the form of graphics. Graphics are often easy to interpret (but not always). They can be used to identify patterns and relationships — this is termed exploratory analysis. They are also useful for telling a story — presenting a pattern or relationship that is important for some purpose.

But graphics are not always the best way to explore or to present data. Graphics work well when there are two or three or even four variables involved in the patterns to be explored or in the story to be told. Why two, three or four? Two variables can be represented with position on the paper or screen or, ultimately, the eye’s retina. To represent a third variable, color or size can be used. In principle, more variables can be represented by other graphical aesthetics: shape, angle, color saturation, opacity, facets, etc. Doing so raises problems for human cognition; people simply don’t do very well at integrating so many graphical modes into a coherent whole.

There are, of course, occasions when a multivariate graphical representation works. A famous example is Charles Minard’s map, made in 1869, of the march on Moscow of Napoleon’s army in 1812-1813.

Figurative map of the successive losses of men from the French Army in the Russian campaign 1812-1813. From [wikimedia](http://en.wikipedia.org/wiki/Wikipedia:Featured_picture_candidates/Napoleon's_Invasion_of_Russia). A higher resolution version is available [here](http://en.wikipedia.org/wiki/Wikipedia:Featured_picture_candidates/Napoleon's_Invasion_of_Russia#mediaviewer/File:Minard.png).


The map shows location (latitude and longitude), size of the army, terrain features, and expeditions off of the main force. The second graphic, on the bottom, shows time and temperature.

The Minard graphic is famous precisely because it is exceptional; graphics that successfully integrate so many variables are rare.

In general, when dealing with multiple variables, a different mode of exploration or presentation is used: machine learning. This chapter introduces a few of the many methods of machine learning.

Keep in mind that with machine learning, as with graphics, techniques and operations are used to transform data into a ready-to-use form. For graphics, this was called “glyph-ready form.” The same term will be used when working with machine learning, even though there are no glyphs.

Regression or Learning?

The most important methods pre-date the introduction of the term “machine learning.” These are the regression techniques, particularly linear least squares and logistic regression.

Presenting the concepts of regression requires a book-length treatment. It’s inappropriate to do that here. Instead, this chapter will focus on machine-learning techniques that can be understood more intuitively. Do remember, however, that mastery of the regression techniques is an essential component of any well-trained data analyst.

Supervised Learning

A function, as you know, is a relationship between inputs and an output. Outdoor temperature is a function of season: season is the input; temperature is the output. Length of the day — i.e. how many hours of daylight — is a function of latitude and day of the year: latitude and day-of-the-year (e.g. March 22) are the inputs; day length is the output. Risk of colon cancer is a function of … well what?

There is a new bit of R syntax to help with defining functions: the tilde. The tilde is used to indicate what is the output variable and what are the input variables. You’ll see expressions like this:

diabetic ~ age + sex + weight + height

Here, the variable diabetic is marked as the output, simply because it’s to the left of the tilde. The variables age, sex, weight, and height are to be the inputs to the function. You’ll also see the syntax diabetic ~ . in which the dot to the right of the tilde means “use all the available variables (except the output).”
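
To make this concrete, here is a small illustration. The variable names are the hypothetical ones used above, not from any particular dataset. A formula is an ordinary R object; nothing is computed until the formula is handed to a model-fitting function:

form <- diabetic ~ age + sex + weight + height
class(form)      # "formula"
all.vars(form)   # the output and input variables the formula mentions
# The dot form, diabetic ~ . , expands to "all the other variables in the data"
# only when a fitting function is given a data= argument.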

There are several different goals that might motivate constructing a function:

  1. To understand or explain how a system works, that is, which inputs genuinely shape the output.
  2. To predict the output for cases in which it hasn’t yet been (or can’t easily be) measured.

Depending on your motivation, the kind of model and the input variables will differ. In understanding how a system works, the variables you use should be related to the actual, causal mechanisms involved, e.g. the genetics of cancer. For predicting an output, it hardly matters what the causal mechanisms are. Instead, all that’s required is that the inputs are known at a time before the prediction is to be made.

Consider the relationship between age and colon cancer. As with many diseases, the risk of colon cancer increases with age. Knowing this can be very useful in practice. For instance, it could help you design a screening program to test people for colon cancer. On the other hand, age does not suggest a way to avoid colon cancer: there’s no way for you to change your age. You can, however, change things like diet, physical fitness, etc.

Supervised learning (also known as function fitting) is the process of constructing a function from data. The defining thing about supervised learning is that the data you have includes as a variable the outcome of the function. With this outcome variable in hand, you can try many different possible functions of input variables and choose those that give a good match to the actual outcome.

Another broad category of machine learning is unsupervised learning. More about that later.

The form or architecture of a function refers to the mathematics behind the function and the approach to finding a specific function that is a good match to the data. Examples of the names of such forms are linear least squares, logistic regression, neural networks, radial basis functions, discriminant functions, nearest neighbors, etc.

Here the focus will be on a single, general-purpose function architecture: recursive partitioning. Software for constructing recursive-partitioning functions is in the party package:

library(party)

To start, take a very familiar condition: pregnancy. You already know what factors predict pregnancy, so you can compare the findings of the machine learning technique of recursive partitioning to what you already know.

The party::ctree() function builds a recursive-partitioning function. The NHANES data has a variable pregnant that identifies whether the person is pregnant.

whoIsPregnant <- 
  ctree( pregnant ~ age + income + sex, data=NHANES )

There are different ways to look at the function contained in whoIsPregnant. An important one is the output produced for any given input. Here’s how to calculate that output when the inputs are an age of 25 (years), an income of 6 (times the poverty threshold), and sex female:

f <- funCT( whoIsPregnant ) # get the function
f( age=25, income=6, sex="female")
  age income    sex       yes        no
1  25      6 female 0.3859649 0.6140351

The input values are repeated to make it easier to compare inputs to outputs. Since the outcome is a yes-or-no, the output is given as the probability of each of these two possibilities. So, for a 25-year-old woman with an income level of 6, the probability of being pregnant is 38.6%.
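
If the funCT() helper isn’t available, a similar lookup can be sketched with the predict() method that party provides for fitted trees. This is an illustration rather than a drop-in command: the factors in newdata need to use the same levels as in NHANES.

newCase <- data.frame( age=25, income=6,
                       sex=factor("female", levels=levels(NHANES$sex)) )
predict( whoIsPregnant, newdata=newCase, type="prob" )  # probability of each outcome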

Another way to look at a recursive partitioning function is as a “decision tree”.

plot(whoIsPregnant, type="simple")

To read the tree, start at the top. The top node, drawn as an oval, specifies the variable sex. From that oval there are two branches: one labeled “male” and the other “female.” This is the partitioning of the node from which the branches stem. As you follow a branch, you may come to another node, which itself branches, or you may come to a terminal (shown in the gray box). The terminal gives the output of the model.

For the 25-year-old female with an income level of 6, the function output can be read off by following the appropriate branches. Starting at the top, follow the “female” branch. That branch leads to a node about the income variable; the two branches from that node indicate the criterion for going left or right. In this case, the threshold for branching right is an income greater than 4.73. For an income level of 6, then, branch right.

The right branch from income leads to a node involving age. The threshold for branching right is an age greater than 39 years. The 25-year-old does not reach that threshold, so branch left.

That left branch leads to another node involving age. The threshold for branching right is 21 years old. The 25-year-old meets that threshold, so branch right. The right branch leads to a terminal: the gray box marked “22”. The output for the example person — income 6, age 25, female — is 0.386. Admittedly, the format of the terminal is a bit obscure, but the information is there.

Within each node there is a notation like \(p < 0.001\). This is not part of the function itself. Instead, it reflects the process by which the function was constructed. Roughly speaking, \(p < 0.001\) means there’s less than a 1-in-1000 chance that the node’s two branches are really the same.1

What has been learned in constructing this function? First, without having been told that sex is related to pregnancy, the recursive partitioning has figured out that sex makes a difference; only females get pregnant. Of course you knew that already, but the machine learned that fact from the data. The machine also identified the fact that females 14 and under are not likely to be pregnant, and that women over 40 are also unlikely to be pregnant. Again, you knew that already, but the machine learned it from the data. The machine also learned something that might be unfamiliar to you: that age patterns of pregnancy are different for low-income and high-income people.

It might be easier to see the income-related patterns by looking at the specific model outputs. Here’s the function’s output for low-income females from 10 to 20 years old:

f( age=10:20, income=3, sex="female")
   age income    sex          yes        no
1   10      3 female 0.0003841721 0.9996158
2   11      3 female 0.0003841721 0.9996158
3   12      3 female 0.0003841721 0.9996158
4   13      3 female 0.0003841721 0.9996158
5   14      3 female 0.0003841721 0.9996158
6   15      3 female 0.0134408602 0.9865591
7   16      3 female 0.0405228758 0.9594771
8   17      3 female 0.0405228758 0.9594771
9   18      3 female 0.2288541334 0.7711459
10  19      3 female 0.2288541334 0.7711459
11  20      3 female 0.2288541334 0.7711459

And for high-income females:

f( age=25, income=6, sex="female")
  age income    sex       yes        no
1  25      6 female 0.3859649 0.6140351

You can see that, at any given age, high-income girls are less likely to be pregnant than low-income girls.

Part of the machine learning is finding out which variables appear to be useful in matching the patterns of pregnancy in the NHANES data. For instance, the following recursive-partitioning tree can make use of several more variables. To avoid rediscovering the obvious, only females will be considered.

whoIsPregnant2 <- 
  ctree( pregnant ~ 
           age + income + ethnicity + 
           height + smoker + bmi, 
         data=NHANES %>% filter(sex=="female") )
plot( whoIsPregnant2, type="simple")

Unsupervised Learning

You may have seen an evolutionary tree like this one of mammals:

We humans (hominidae) are on the far left of the tree. The numbers at the branch points are estimates of how long ago — in millions of years — the branches separated. According to the diagram, rodents and we primates diverged about 90 million years ago.

How do evolutionary biologists construct a tree like this? They look at various traits of different kinds of mammals. Two mammals that have similar traits are deemed closely related; animals with dissimilar traits are distantly related. Putting together all this information about what is close to what results in the evolutionary tree.

The tree, sometimes called a dendrogram, is a nice organizing structure for relationships. Evolutionary biologists imagine that at each branch point there was an actual animal whose descendants split into groups that developed in different directions. In evolutionary biology, the inferences about branches come from comparing existing creatures to one another (as well as creatures from the fossil record). Creatures with similar traits are in nearby branches; creatures with different traits are in more distant branches. It takes considerable expertise in anatomy and morphology to know what similarities and differences to look for.

But notice that there is no outcome variable here, just a construction of what is closely or distantly related to what.

Trees can describe degrees of similarity between different things, regardless of how those relationships came to be. If you have a set of objects or cases, and you can measure how similar any two of the objects are, you can construct a tree. The tree may or may not reflect some deeper relationship among the objects, but it often provides a simple way to visualize relationships.

When the description of an object consists of a set of numerical variables, there are two main steps in constructing a tree:

  1. Represent each object as a point in a Cartesian space.
  2. Make branching decisions based on how close together points or clouds of points are.

The data about automobile characteristics in mtcars can illustrate.

data(mtcars)
names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"  
 [9] "am"   "gear" "carb"
summary(mtcars)
      mpg             cyl             disp      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
 Median :19.20   Median :6.000   Median :196.3  
 Mean   :20.09   Mean   :6.188   Mean   :230.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0  
       hp             drat             wt       
 Min.   : 52.0   Min.   :2.760   Min.   :1.513  
 1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581  
 Median :123.0   Median :3.695   Median :3.325  
 Mean   :146.7   Mean   :3.597   Mean   :3.217  
 3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610  
 Max.   :335.0   Max.   :4.930   Max.   :5.424  
      qsec             vs               am        
 Min.   :14.50   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :17.71   Median :0.0000   Median :0.0000  
 Mean   :17.85   Mean   :0.4375   Mean   :0.4062  
 3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :22.90   Max.   :1.0000   Max.   :1.0000  
      gear            carb      
 Min.   :3.000   Min.   :1.000  
 1st Qu.:3.000   1st Qu.:2.000  
 Median :4.000   Median :2.000  
 Mean   :3.688   Mean   :2.812  
 3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :5.000   Max.   :8.000  

For an individual quantitative variable, it’s easy to measure how far apart any two cars are: take the difference between the numerical values. The different variables are, however, on different scales. For example, gear ranges only from 3 to 5, while hp goes from 52 to 335. This means that some decision needs to be made about rescaling the variables so that the differences along each variable reasonably reflect how different the respective cars are. There is more than one way to do this. The dist() function takes a simple and pragmatic point of view: each variable is equally important.

carDiffs = dist(mtcars)

The result gives the “distance” from each individual car to every other car. Here’s a small sample from the distance table:

xtable( round(as.matrix(carDiffs)[1:5,1:5]))
                  Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
Mazda RX4              0.00          1.00      55.00          98.00            210.00
Mazda RX4 Wag          1.00          0.00      55.00          98.00            210.00
Datsun 710            55.00         55.00       0.00         151.00            265.00
Hornet 4 Drive        98.00         98.00     151.00           0.00            121.00
Hornet Sportabout    210.00        210.00     265.00         121.00              0.00

It’s much like the tables that used to be printed on road maps giving the distance from one city to another, like this one of cities in the US:

Distances between some US Cities


Notice that the distances are symmetrical: it’s the same distance from Boston to San Francisco as from San Francisco to Boston (3043 mi, according to the table).

Knowing the distances between the cities is not the same thing as knowing their locations. But the set of mutual distances is enough information to reconstruct the relative positions.

Cities, of course, lie on the surface of the earth. That need not be true for the abstract distance between automobile types. Even so, the set of mutual distances provides information equivalent to knowing the relative positions. This can be used to construct branches between nearby items, then to connect those branches, and so on to construct an entire tree. The process is called “hierarchical clustering.” For example:

hc = hclust( carDiffs )
plot(hc, hang=-1)
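
As noted above, the choice of scaling matters. Here is a minimal sketch of one alternative, standardizing each variable before computing distances; it’s an illustration, not the approach used in the rest of this section.

carDiffsScaled <- dist( scale(mtcars) )   # put every variable on a z-score scale
hcScaled <- hclust( carDiffsScaled )
plot( hcScaled, hang=-1 )                 # compare with the unscaled tree above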

The same tree-building process can be applied using only performance-related measures:

perfDist = dist(getVarFormula(~mpg+hp+wt+qsec, data=mtcars))
hc = hclust(perfDist)
plot(hc, hang=-1)

There are many ways to graph such trees. For examples, see Visualizing Dendrograms in R.
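
If you want a fixed number of groups rather than the whole tree, the base-R cutree() function cuts a dendrogram at a chosen number of clusters. A small illustration using the most recently built tree, hc:

carGroups <- cutree( hc, k=3 )   # assign each car to one of 3 groups
table( carGroups )               # how many cars fall in each group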

Another way to group similar cases considers just the cases themselves, without constructing a hierarchy. The output is not a tree but a choice of which group each case belongs to.2

As an example, consider the cities of the world (in WorldCities). Cities can be different and similar in many ways: population, age structure, public transport and roads, building space per person, etc. The choice of features depends on the purpose you have for making the grouping.

My purpose is to show you that clustering via machine learning can actually learn genuine patterns in the data. So I’ll choose features that are utterly familiar: the latitude and longitude of each city.

Now, you know something about the location of cities: they are on land. And you know about the organization of land on Earth; most land falls into one of the large clusters called continents. But the WorldCities data doesn’t have any notion of continents. Perhaps this feature, which you know perfectly well, can be learned by a computer that never even took grade-school geography.

For simplicity, consider the 4000 biggest cities in the world and their longitudes and latitudes:

require(mclust)
set.seed(104)
BigCities <- WorldCities %>%
  arrange( desc(population) ) %>%
  head( 4000 ) %>% select( longitude, latitude )
clusts2 <- BigCities %>%
  kmeans( centers=6 )
BigCities$cluster <- as.factor(clusts2$cluster)
ggplot( BigCities, aes(x=longitude, y=latitude, color=cluster, shape=cluster) ) +
  geom_point()

The clustering algorithm seems to have identified the continents. North and South America are clearly distinguished, as is most of Africa. (The cities in North Africa are matched to Europe.) The distinction between Europe and Asia is essentially historic, not geographic, and the cluster for Europe extends a ways into what’s called Asia. East Asia and central Asia are marked as distinct, largely because the low-population areas of Tibet and Siberia look the same as the major oceans to the algorithm.
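
One way to inspect what k-means found is to look at the fitted object directly, for example the cluster centers (in longitude and latitude) and the number of cities assigned to each cluster:

clusts2$centers   # the center of each cluster, in longitude and latitude
clusts2$size      # how many of the 4000 cities fall in each cluster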

Dimension Reduction

It often happens that a variable carries little information relevant to the task at hand. Even for variables that are informative, there can be redundancy or near duplication: two or more variables give essentially the same information because they have very similar patterns across the cases.

Such irrelevant or redundant variables make it harder to learn from data. The irrelevant variables are simply noise that obscures actual patterns. Similarly, when two or more variables are redundant, the differences between them may represent random noise.

It’s helpful to remove irrelevant or redundant variables so that they, and the noise they carry, don’t obscure the patterns that machine learning could learn.

As an example of such a situation, consider votes in a congress or parliament. Each representative (in the US congress) or member (in parliaments) casts votes on a variety of matters. The pattern of ayes and nays may indicate which members are affiliated, i.e. members of the same political party. To test this idea, you might try clustering the members by their voting record.

Take one such voting record: the Scottish Parliament in 2008.3

Here’s a small part of the voting record:

                    S1M-1 S1M-4.1 S1M-4.3 S1M-4 S1M-5 S1M-17
Adam..Brian             1       0       0     0    -1     -1
Aitken..Bill            1       1       0    -1    -1      0
Alexander..Ms.Wendy     1      -1      -1     1     1      1
Baillie..Jackie         1      -1      -1     1     1      1
Barrie..Scott          -1      -1      -1     1     1      1
Boyack..Sarah           0      -1      -1     1     1      1
Brankin..Rhona          0      -1      -1     1     1      1
Brown..Robert          -1      -1      -1     1     1      1
Butler..Bill            0       0       0     0     0      0
Campbell..Colin         1       1       1    -1    -1     -1

The names of the members of parliament are the cases. Each ballot — identified by a file number such as “S1M-4.3” — is a variable. A \(1\) means an aye vote, \(-1\) is nay, and \(0\) is an abstention. There are more than 130 members and more than 700 ballots. It’s impractical to show all of the 100,000+ votes in a table. But since there are only 3 levels for each variable, displaying the table as an image works well:
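
Here is a sketch of how such an image can be drawn. It assumes the ballot values are stored in a numeric matrix called VoteMatrix, with one row per member and one column per ballot; VoteMatrix is a hypothetical name, since the object behind the actual figure isn’t shown here.

image( t(VoteMatrix), col=c("red", "white", "blue") )   # nay / abstain / aye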

Hard to see much of a pattern here, although you may notice the Scottish tartan structure.4

The official tartan pattern of West Virginia University. (Source: Richinstead at Wikimedia. CC-BY-SA-3.0. )


As a start, here’s a graphic showing the ballot values for all of the members of parliament for two randomly selected ballots. (Jittering is used to give a better idea of the point count at each position. The red dots are the actual positions.)

Each point is one member of parliament. Similarly aligned members are grouped together at one of the nine possibilities marked in red: (Aye, Nay), (Aye, Abstain), (Aye, Aye), …, (Nay, Nay). In these two ballots, eight of the nine possibilities are populated. This hardly indicates that there are 8 clusters of members.

Intuition suggests that it would be better to use all of the ballots, rather than just two. In the next plot, the first 336 ballots have been added together, as have the remaining ballots:

This graphic suggests that there might be two clusters of members who are aligned with each other. Using all of the data seems to give more information than using just two ballots.

You may ask why the choice was made to add up the first 336 ballots as \(x\) and the remaining ballots as \(y\). Perhaps there is a better choice to display the underlying patterns, adding up the ballots in a different way.

In fact, there is a mathematical approach to finding the best way to add up the ballots, called “singular value decomposition” (SVD). The mathematics of SVD draw on a knowledge of matrix algebra, but the operation itself is readily available to anyone.5 The next plot shows the position of each member on the best two ways of summing up the ballots.
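
As a sketch of the computation, again using the hypothetical VoteMatrix (members in rows, ballots in columns): base R’s svd() function returns the singular values and vectors, and the first two columns of U, scaled by the singular values, position each member in the plane.

voteSVD <- svd( VoteMatrix )
memberPositions <- voteSVD$u[ , 1:2] %*% diag( voteSVD$d[1:2] )
plot( memberPositions, xlab="Best ballot sum", ylab="Second-best ballot sum" )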

At a glance, you can see that there are three main clusters. The red circle marks the “average” member. The three clusters move away from average in different directions. There are several members whose position is in-between the average and the cluster to which they are closest.

Two is as many variables as can be used for position. In these data, the first 4 or 5 SVD sums are much larger than the rest. These additional sums may allow the three clusters to be split up further. The color in the plot above shows the result of asking for 6 clusters using the 5 best SVD sums.

How well did the clustering do? As you might expect, the party affiliation of each member of parliament is known, even though it wasn’t used in finding the clusters. The table shows the machine-learning classification compared against the actual party with which each member is affiliated.

                                          cluster
                                            1  2  3  4  5  6
  Member for Falkirk West                   0  0  0  1  0  0
  Scottish Conservative and Unionist Party 19  0  0  0  1  0
  Scottish Green Party                      0  0  0  1  0  0
  Scottish Labour                           0 50  0  0  2  6
  Scottish Liberal Democrats                0  1  0  2  0 14
  Scottish National Party                   0  0 34  0  1  1
  Scottish Socialist Party                  0  0  0  1  0  0

The Labour party is cluster 2, the National Party is cluster 3, the Conservative and Unionist Party is cluster 1, and the Liberal Democrats are cluster 6. The two parties with just one member each — Green and Socialist — and the independent Member for Falkirk West are gathered into cluster 4, along with two Liberal Democrats. Out of the 131 members affiliated with one of the four major parties, 117 were placed in the cluster dominated by their own party. Labour has 8 of its 58 members who don’t follow the party line. They and the other strays are perhaps the people to talk to when whipping votes.
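
A comparison table like the one above can be produced with table(). The sketch below assumes a vector Party of party affiliations aligned with the rows of the hypothetical VoteMatrix; the clusters come from k-means on the five best SVD sums, as described above. (Cluster numbering is arbitrary and will differ from run to run.)

memberClusters <- kmeans( voteSVD$u[ , 1:5] %*% diag( voteSVD$d[1:5] ),
                          centers=6 )$cluster
table( Party, memberClusters )   # Party is a hypothetical vector of affiliations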

There’s more information to be extracted from the ballot data. Just as there are clusters of political positions, there are clusters of ballots that might correspond to such factors as social effect, economic effect, etc. Here’s a graphic showing the position of ballots, using the best two SVD sums.
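
A sketch of how the ballot positions can be computed, using the voteSVD object from above: the right singular vectors, scaled by the singular values, give each ballot’s coordinates.

ballotPositions <- voteSVD$v[ , 1:2] %*% diag( voteSVD$d[1:2] )
plot( ballotPositions, xlab="Best member sum", ylab="Second-best member sum" )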

There are obvious clusters in this plot. Still, interpretation can be tricky. Remember that, on each issue, there are both aye and nay votes. This is what accounts for the symmetry of the dots around the center (indicated in red). The opposing dots along each angle from the center might be interpreted in terms of "socially liberal" versus "socially conservative", or "economically liberal" versus "economically conservative". Deciding which is which likely involves reading the bill itself.

Finally, the “best” sums from SVD can be used to re-arrange the cases and, separately, to re-arrange the variables, while keeping exactly the same value for each case on each variable. This amounts simply to re-ordering the members in some way other than alphabetical, and similarly re-ordering the ballots. This can simplify the appearance of the data.
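
A sketch of that re-arrangement, again using the hypothetical VoteMatrix and the voteSVD object from above: order the members by their first SVD score, order the ballots by theirs, and redraw the image.

rowOrder <- order( voteSVD$u[ , 1] )   # members, sorted by their first SVD score
colOrder <- order( voteSVD$v[ , 1] )   # ballots, sorted likewise
image( t( VoteMatrix[rowOrder, colOrder] ), col=c("red", "white", "blue") )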



Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.


  1. Somewhat more technically, \(p\) is the p-value from a hypothesis test. The hypothesis being tested with the data is that the observed difference between the left and right branches is merely the result of random, meaningless chance, and likely wouldn’t show up if the NHANES data were collected on a new group of people.

  2. There can be more detail than this, for instance a probability for each group that a specific case belongs to the group.

  3. The example was constructed by then-student Caroline Ettinger and her faculty advisor, Andrew Beveridge, at Macalester College, and presented in Ms Ettinger’s honors thesis.

  4. The tartan pattern provides an indication to experts that the matrix can be approximated by a matrix of low-rank.

  5. In brief, SVD calculates the best way to add up (i.e. linearly combine) the columns and the rows of a matrix to produce the largest possible variance. Then SVD finds the best way to add up what’s left, and so on.