13  Signal and noise

Note in draft

Move the sum of variances material to the simulation chapter.

Imagine being transported back to June 1940. The family is in the living room, sitting around the radio console, waiting for it to warm up. The news today is from Europe, the surrender of the French in the face of the German invasion. Press the play button and listen to recording #103.

You may have to scroll down to see the play button and the recordings.

The spoken words from the recording are discernible despite the hiss and clicks of the background noise. The situation is similar to a conversation in a sports stadium. The crowd is loud, so the speaker has to shout. The listener ignores the noise (unless it is too loud) and recovers the shouted words.

Engineers and others make a distinction between signal and noise. The engineer aims to separate the signal from the noise. That aim applies to statistics as well.

There are many sources of noise in data; every variable has its own story, part of which is noise from measurement errors and recording blunders. For instance, economists use national statistics, like GDP, even though the definition is arbitrary (a hurricane can raise GDP!), and early reports are invariably corrected a few months later. Historians go back to original documents, but inevitably many of the documents have been lost or destroyed: a source of noise. Even in elections where, in principle, counting is straightforward, the voters’ intentions are measured imperfectly due to “hanging chads,” “butterfly ballots,” broken voting machines, spoiled ballots, and so on.

The statistical thinker is well advised to know about the sources of noise in the system she is studying. Data analysis will be better the more the modeler knows about how measurements are made and how data are collected.

Noise in hiring

The author has, on several occasions, testified in legal hearings as a statistical expert. In one case, the US Department of Labor audited the records of a contractor with several hundred employees and high employee turnover. The records led the Department to bring suit against the contractor for discriminating against Hispanics. The hiring records showed that many Hispanics applied for jobs; the company hired none. An open-and-shut case.

The lawyers for the defense asked me, the statistical expert, to review the findings from the Department of Labor. The lawyers thought they were asking me to check the arithmetic in the hiring spreadsheets. As a statistical thinker, I know that arithmetic is only part of the story; the origin of the data is critically important. So I asked for the complete files on all applicants and hires the previous year.

The spreadsheet files and the paper job applications were in accord; there were many Hispanic applicants. But the data on the paper job application form was not always consistent with the data on hiring spreadsheets. It turned out that whenever an applicant was hired, the contractor (per regulation) got a report on that person from the state police. The report returned by the state police had only two available race/ethnicities: white and Black. The contractor’s personnel office filled in the hired-worker spreadsheet based on the state police report. So all the Hispanic applicants who were hired had been transformed into white or Black by the state police. Noise.

Model values as the signal

To illustrate the statistical problem of signal and noise, we turn to a DAG simulation.

print(dag01)
x ~ exo()
y ~ 1.5 * x + 4 + exo()

dag01 involves only two variables. x is pure exogenous noise. y is a mixture of x with some more noise added.
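If you want to experiment with a simulation like this one on your own, a DAG with the same structure can be built from scratch. The following is only a sketch, assuming the dag_make() constructor from the {LSTbook} package; the pre-built dag01 is used throughout the rest of this Lesson.

# A sketch of constructing a DAG like dag01, assuming dag_make() from {LSTbook}
my_dag <- dag_make(
  x ~ exo(),                # x: pure exogenous noise
  y ~ 1.5 * x + 4 + exo()   # y: a mixture of x and additional noise
)
sample(my_dag, size = 5)    # draw a few rows to check the construction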

It would be fair to say that y is pure noise, coming as it does from a linear combination of two sources of noise. The signal, however, is the relationship between x and y. Separating signal from noise means finding the relationship despite the noisy origins of x and y.

Is it possible to see the relationship from the data frame itself? Table tbl-small-dag01 shows a sample of size \(n=10\):

Small <- sample(dag01, size=10)


       x       y
--------  ------
 -0.7860   1.890
  0.0547   4.120
 -1.1700   2.360
 -0.1670   6.330
 -1.8700   0.933
 -0.1200   2.930
  0.8260   5.700
  1.1900   5.900
 -1.0900   2.130
 -0.3750   4.230

Any of an infinite number of possible relationships could account for the x and y data in Table tbl-small-dag01. The signal separation problem of statistics is to make a guess that is as good as possible.

A careful perusal of Table tbl-small-dag01 suggests some patterns. x is never larger than about 2 in magnitude and can be positive or negative. y is always positive. Furthermore, when x is negative, the corresponding y value is relatively small compared to the y values for positive x.

Human cognition is not well suited to looking at long columns of numbers. Often, we can make better use of our natural human talents by translating the data frame into a graphic, as in Figure fig-dag01-sample-10:

Figure 13.1: The data in Table tbl-small-dag01 in point-plot form.
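The graphic in Figure fig-dag01-sample-10 can be produced with a single command. Here is a minimal sketch, assuming the point_plot() function from the {LSTbook} package used in earlier Lessons:

# Plot the Small sample as points, with the response variable on the vertical axis
Small %>% point_plot(y ~ x)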

The signal that we seek to find is the relationship between x and y. This will be in the form of a function: for any value of x the function gives a corresponding value of y. Figure fig-dag01-functions shows three possible functions, all of which go through the data points.

Figure 13.2: Three of the infinite number of functions that can be drawn through the data in Table tbl-small-dag01.

The functions drawn in Figure fig-dag01-functions are disturbing proposals for the relationship between x and y implied by the data (black points). While any of the functions could be the mechanism for how y is generated from x, they all seem to involve a lot of detail that is not suggested by the actual data points. Even the blue function, which makes minimal excursions from the data points, is not satisfactory.

Separating the signal from the noise involves specifying what kind of function we think is at work, then finding the specific instance of that kind of function that goes through the data points, as in Figure fig-dag01-linear.

Figure 13.3: The straight-line function (blue) that goes through the data points as closely as possible. The noise is estimated as the difference (red for negative noise, black for positive noise) between the actual data points and the function.

The blue function in Figure fig-dag01-linear is based on a simplifying assumption, that the relationship between x and y is itself simple. Straight-line functions are a very simple form. The noise is the deviation of the data from the signal, shown as red for negative deviations and black for positive. These deviations are called the residuals, what’s left over after you take away the signal.
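The residuals drawn in Figure fig-dag01-linear can also be computed numerically. Here is a minimal sketch using base R's resid() on the straight-line model fitted to the Small sample:

mod <- lm(y ~ x, data = Small)   # the straight-line signal
resid(mod)                       # the noise: one residual for each row of Small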

We have assumed that the relationship is that of a sloping line in order to get a handle on what is signal and what is noise. It is perfectly legitimate to make other assumptions about the form of the relationship. Figure fig-dag08-other-forms shows two such alternative forms: one is a flat line, the other a gentle curve. Assuming a flat line, a sloping line, or a gently curving line each leads to a different estimate of the signal.

(a) a flat line

(b) a gentle curve

Figure 13.4: The signal extracted when we assume different forms for the relationship.

It’s natural to wonder which of the three versions of the “signal” presented above is the best. Generations of statisticians have struggled with this problem: how to determine the best version of the signal purely from data. The reality is that the data alone do not define the signal. Instead, it is the data plus an assumption about the shape of the signal that, together, enable us to extract the specific form of signal.

Experience has shown that a good, general-purpose assumption for extracting signal is a sloping-line function. There are certainly many situations where another assumption is more meaningful, and we will consider one such situation in Lesson sec-risk, but for the purposes of these Lessons we will stick mostly to the sloping-line function.

Coefficients as the signal

Keep in mind that the signal we seek to extract from the data is about the relationship among variables. In the above, we have represented that relationship as the graph of a function. The noise in those graphs is represented by the residuals—the black and red segments that connect the function to the actual values of the response variable.

Most of our work in these Lessons will use a different representation of the signal: the coefficients from a model trained on the data. To find these coefficients—that is, to extract the signal—we have to specify what “shape” of relationship we want to use. Then we give the specified shape and the data to lm() and the computer finds the coefficients for us.

The specification of shape is done via the first argument to lm(): a tilde expression. As you know, the left-hand side of the tilde expression is always the name of the response variable. The right-hand side describes the shape.

There are only a few “shapes” in common use.

  • y ~ 1 is the flat-line function.
  • y ~ x is the sloping line function.

To see the coefficients, pipe the model produced by lm() into the conf_interval() summary function:

lm(y ~ 1, data=Small) %>% conf_interval()
term               .lwr     .coef       .upr
------------  ---------  --------  ---------
(Intercept)    2.304149   3.65201   4.999872
lm(y ~ x, data=Small) %>% conf_interval()
term                .lwr      .coef       .upr
------------  ----------  ---------  ---------
(Intercept)    3.4516339   4.262846   5.074058
x              0.8841454   1.741758   2.599370

The .coef column gives the coefficients, which are a compact description of the signal. For instance, with the y ~ x shape specification, the signal (that is, the relationship between y and x) has the form \(\widehat{\ y\ } = 4.26 + 1.74 x\). In conventional mathematical notation, the \(\widehat{\ \ \ }\) on top of \(y\) is the way of saying, “This is the formula for the signal.”
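As a check, the same coefficients can be pulled out with base R's coef() and plugged into the formula for the signal by hand. A minimal sketch:

mod <- lm(y ~ x, data = Small)
coef(mod)                          # the same numbers as the .coef column
coef(mod)[1] + coef(mod)[2] * 1    # the signal evaluated at x = 1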

What about the noise? In the graphical representation of signal and noise (e.g. Figure fig-dag08-other-forms), the noise was shown as the residuals, one for each row of the data frame.

When we use coefficients to represent the signal, the noise shows up as the .lwr and .upr columns of the conf_interval() summary. In Lessons sec-sampling-variation and sec-confidence-intervals, we will show where these columns come from and attribute them to a statistical phenomenon called “sampling variation”.

R-squared (R2)

Let’s turn to the analogy between data and audio transmission.

  • The response variable is the raw transmission, consisting of both signal and noise.

  • The signal is the fitted model values, as in the .output column produced by model_eval(). This is analogous to cleaning up a recording so that it can be listened to without hiss and pop.

  • The noise is the residuals. In an audio signal, the noise is the pure hiss and pop (and other ambient sounds). That is, the noise is what is left over after subtracting the signal from the original transmission.
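That decomposition can be checked numerically: the residuals are exactly the response minus the fitted model values. A minimal sketch using the model of the Small sample (the .resid column is computed by hand here, not produced by model_eval() itself):

lm(y ~ x, data = Small) %>%
  model_eval() %>%
  mutate(.resid = .response - .output) %>%   # noise = transmission minus signal
  select(.response, .output, .resid)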

Signal to noise ratio

Engineers often speak of the “signal-to-noise ratio” (SNR). In sound, this refers to the loudness of the signal compared to the loudness of the noise, and is often measured in decibels (dB). An SNR of 5 dB means that the signal carries roughly three times the power of the noise.
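Decibels translate into a ratio via \(\text{SNR}_{dB} = 10 \log_{10}(P_{signal}/P_{noise})\), so 5 dB corresponds to a power ratio of about 3.2. A quick check of the arithmetic in R:

10^(5 / 10)    # power ratio corresponding to an SNR of 5 dB; about 3.16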

You can listen to examples of noisy music and speech at this web site.

The noisiest examples have an SNR of 5 dB. Press the play/pause button to hear the noisy recording, then compare to the de-noised transmission—the signal—by pressing play/pause in the “Clean” column.

Statisticians have a different way of measuring the size of the signal. Rather than comparing the signal to the noise, they compare the signal to the transmission, that is, to the response variable. This comparison is quantified by a measurement called “R-squared” and written R2.

The “size” of the signal and transmission are measured by their respective variances. For instance, let’s look at the size of the signal and transmission using a simple model from the Galton data frame:

lm(height ~ mother + sex, data = Galton) %>%
  model_eval() %>%
  summarize(signal_size = var(.output),
            transmission_size = var(.response))
Using training data as input to model_eval().
 signal_size   transmission_size
------------  ------------------
        7.21               12.84

The R2 summary is simply the signal size divided by the transmission size. Here, that is \[\text{R}^2 = \frac{7.21}{12.84} = 0.56 = 56\%\ .\]
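The same arithmetic can be done in one pipeline directly from the model output; a minimal sketch extending the summarize() command used above:

lm(height ~ mother + sex, data = Galton) %>%
  model_eval() %>%
  summarize(R_squared = var(.output) / var(.response))   # signal size / transmission size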

The comparison of the signal to the transmission has a nice property. The signal can never be bigger than the transmission; otherwise the signal wouldn’t fit into the transmission. Consequently, the largest possible R2 is 1.0 (or, in terms of percent, 100%). The smallest possible R2 is zero.

The R2() summary function will calculate R2 for you. For instance,

lm(height ~ mother + sex, data = Galton) %>% R2()
   n    k    Rsquared          F       adjR2    p   df.num   df.denom
----  ---  ----------  ---------  ----------  ---  -------  ---------
 898    2   0.5618019   573.7276   0.5608227    0        2        895

R2 is a traditional measure of the “quality” of a model, so you will see it in a large fraction of research reports.

The sorrows of R2

R2 is often described as “the fraction of variance (in the response variable) explained by the explanatory variable.” It is a perfectly reasonable, technical way of accounting for variance. But it is often misinterpreted.

Many researchers prefer to report \(\sqrt{\text{R}^2}\), which they write simply as R. In a purely mathematical sense, it doesn’t matter whether one reports R2 or R; you can easily convert back and forth. However, I dislike reports of R since I suspect that the researcher’s motivation is to be able to report a bigger, more impressive number. (For instance, \(\text{R}^2 = 0.10\) translates into a more impressive \(\text{R} = 0.32\). More impressive-looking, but describing exactly the same strength of relationship.)

R2 is valuable for detecting a possible relationship between the response variable and the explanatory variables or for describing the quality of prediction (Lessons sec-lesson-25 and sec-lesson-26), but the scales for these two purposes are very different. A relationship might be detectable even with a very small R2 (e.g. 0.01) when the sample size is large. But a model is useful for predicting the outcome for an individual only if \(\text{R}^2 \gtrapprox 0.50\), no matter how large the sample.

A better way of quantifying the relationship between variables is in terms of effect size (Lesson sec-lesson-24).

Accumulating variation in a DAG (optional)

set.seed(103)
Large <- sample(dag01, size=10000)

Lesson sec-measuring-variation introduced the standard way to measure variation in a single variable: the variance or its square root, the standard deviation. For instance, we can measure the variation in the variables from the Large sample using sd() and var():

Large %>%
  summarize(sx = sd(x), sy = sd(y), vx = var(x), vy = var(y))
        sx         sy          vx         vy
----------  ---------  ----------  ---------
 0.9830639   1.779003   0.9664146   3.164851

According to the standard deviation, the size of the x variation is about 1. The size of the y variation is about 1.8.

Look again at the formulas that compose dag01:

print(dag01)
x ~ exo()
y ~ 1.5 * x + 4 + exo()

The formula for x shows that x is exogenous, its values coming from a random number generator, exo(), which, unless otherwise specified, generates noise of size 1.

As for y, the formula includes two sources of variation:

  1. The part of y determined by x, that is \(y = \mathbf{1.5 x} + \color{gray}{4 + \text{exo()}}\)
  2. The noise added directly into y, that is \(y = \color{gray}{\mathbf{1.5 x} + 4} + \color{black}{\mathbf{exo(\,)}}\)

The 4 in the formula does not add any variation to y; it is just a number.

We already know that exo() generates random noise of size 1. So the amount of variation contributed by the + exo() term in the DAG formula is 1. The remaining variation is contributed by 1.5 * x. The variation in x is 1 (coming from the exo() in the formula for x). A reasonable guess is that 1.5 * x will have 1.5 times the variation in x. So, the variation contributed by the 1.5 * x component is 1.5. The overall variation in y is the sum of the variations contributed by the individual components. This suggests that the variation in y should be \[\underbrace{1}_\text{from exo()} + \underbrace{1.5}_\text{from 1.5 x} = \underbrace{2.5}_\text{overall variation in y}.\] Simple addition! Unfortunately, the result is wrong. In the previous summary of the Large sample, we measured the overall variation in y as about 1.8.

The variance will give a better accounting than the standard deviation. Recall that exo() generates variation whose standard deviation is 1, so the variance from exo() is \(1^2 = 1\). Since x comes entirely from exo(), the variance of x is 1. So is the variance of the exo() component of y.

Turn to the 1.5 * x component of y. Since variances involve squares, the variance of 1.5 * x works out to be \(1.5^2\, \text{var(}\mathit{x}\text{)} = 2.25\). Adding up the variances from the two components of y gives

\[\text{var(}\mathit{y}\text{)} = \underbrace{2.25}_\text{from 1.5 x} + \underbrace{1}_\text{from exo()} = 3.25\]

This result, a variance of 3.25 for y, closely matches the variance of about 3.16 that we found when summarizing the y data generated by the DAG.
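You can confirm this accounting from the Large sample by splitting y into its two components; a sketch in which the noise component is recovered as whatever is left of y after subtracting 1.5 * x + 4:

Large %>%
  mutate(from_x   = 1.5 * x,              # the component of y determined by x
         from_exo = y - 1.5 * x - 4) %>%  # the noise added directly into y
  summarize(var_from_x   = var(from_x),   # should be near 2.25
            var_from_exo = var(from_exo), # should be near 1
            var_y        = var(y))        # should be near their sum, 3.25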

The lesson here: When adding two sources of variation, the variances of the individual sources add to form the overall variance of the sum. Just like \(A^2 + B^2 = C^2\) in the Pythagorean Theorem.