
Note in draft

Move the sum of variances material to the simulation chapter.

Imagine being transported back to June 1940. The family is in the living room, sitting around the radio console, waiting for it to warm up. The news today is from Europe, the surrender of the French in the face of the German invasion. Press the play button and listen to recording #103.

You may have to scroll down to see the play button and the recordings.

The spoken words from the recording are discernible despite the hiss and clicks of the background noise. The situation is similar to a conversation in a sports stadium. The crowd is loud, so the speaker has to shout. The listener ignores the noise (unless it is too loud) and recovers the shouted words.

Engineers and others make a distinction between **signal** and noise. The engineer aims to separate the signal from the noise. That aim applies to statistics as well.

There are many sources of noise in data; every variable has its own story, part of which is noise from measurement errors and recording blunders. For instance, economists use national statistics, like GDP, even though the definition is arbitrary (a hurricane can raise GDP!), and early reports are invariably corrected a few months later. Historians go back to original documents, but inevitably many of the documents have been lost or destroyed: a source of noise. Even in elections where, in principle, counting is straightforward, the voters’ intentions are measured imperfectly due to “hanging chads,” “butterfly ballots,” broken voting machines, spoiled ballots, and so on.

The statistical thinker is well advised to know about the sources of noise in the system she is studying. Analysis of data will be better the more the modeler knows about how measurements are made and data collected.

Noise in hiring

The author has, on several occasions, testified in legal hearings as a statistical expert. In one case, the US Department of Labor audited the records of a contractor with several hundred employees and high employee turnover. The records led the Department to bring suit against the contractor for discriminating against Hispanics. The hiring records showed that many Hispanics applied for jobs; the company hired none. An open-and-shut case.

The lawyers for the defense asked me, the statistical expert, to review the findings from the Department of Labor. The lawyers thought they were asking me to check the arithmetic in the hiring spreadsheets. As a statistical thinker, I know that arithmetic is only part of the story; the origin of the data is critically important. So I asked for the complete files on all applicants and hires the previous year.

The spreadsheet files and the paper job applications were in accord; there were many Hispanic applicants. But the data on the paper job application form was not always consistent with the data on hiring spreadsheets. It turned out that whenever an applicant was hired, the contractor (per regulation) got a report on that person from the state police. The report returned by the state police had only two available race/ethnicities: white and Black. The contractor’s personnel office filled in the hired-worker spreadsheet based on the state police report. So all the Hispanic applicants who were hired had been transformed into white or Black by the state police. Noise.

To illustrate the statistical problem of signal and noise, we turn to a DAG simulation.

`print(dag01)`

```
x ~ exo()
y ~ 1.5 * x + 4 + exo()
```

`dag01` involves only two variables. `x` is pure exogenous noise. `y` is a mixture of `x` with some more noise added.

It would be fair to say that `y` is pure noise, coming as it does from a linear combination of two sources of noise. The signal, however, is the relationship between `x` and `y`. Separating signal from noise means finding the relationship despite the noisy origins of `x` and `y`.

Is it possible to see the relationship from the data frame itself? Table tbl-small-dag01 shows a sample of size \(n=10\):

`Small <- sample(dag01, size=10)`

```
x y
-------- ------
-0.7860 1.890
0.0547 4.120
-1.1700 2.360
-0.1670 6.330
-1.8700 0.933
-0.1200 2.930
0.8260 5.700
1.1900 5.900
-1.0900 2.130
-0.3750 4.230
```

Any of an infinite number of possible relationships could account for the `x` and `y` data in Table tbl-small-dag01. The signal separation problem of statistics is to make a guess that is as good as possible.

A careful perusal of Table tbl-small-dag01 suggests some patterns. `x` is never larger than about 2 in magnitude and can be positive or negative. `y` is always positive. Furthermore, when `x` is negative, the corresponding `y` value is relatively small compared to the `y` values for positive `x`.

Human cognition is not well suited to looking at long columns of numbers. Often, we can make better use of our natural human talents by translating the data frame into a graphic, as in Figure fig-dag01-sample-10:

The signal that we seek to find is the relationship between `x` and `y`. This will be in the form of a function: for any value of `x`, the function gives a corresponding value of `y`. Figure fig-dag01-functions shows three possible functions, all of which go through the data points.

The functions drawn in Figure fig-dag01-functions are disturbing proposals for the relationship between `x` and `y` implied by the data (black points). While any of the functions *could* be the mechanism for how `y` is generated from `x`, they all seem to involve a lot of detail that is not suggested by the actual data points. Even the blue function, which makes minimal excursions from the data points, is not satisfactory.

Separating the signal from the noise involves specifying *what kind of function* we think is at work, then finding the specific instance of that kind of function that goes through the data points, as in Figure fig-dag01-linear.

The blue function in Figure fig-dag01-linear is based on a simplifying assumption: that the relationship between `x` and `y` is itself simple. Straight-line functions are a very simple form. The noise is the deviation of the data from the signal, shown as red for negative deviations and black for positive. These deviations are called the **residuals**, what’s left over after you take away the signal.

We have *assumed* that the relationship is that of a sloping line in order to get a handle on what is signal and what is noise. It is perfectly legitimate to make other assumptions about the form of the relationship. Figure fig-dag08-other-forms shows two such alternative forms: one a flat line, the other a gentle curve. Assuming a flat line, a sloping line, or a gently curving line leads, in each case, to a different estimate of the signal.
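To make the idea of signal plus residuals concrete, here is a minimal sketch of the flat-line case, written in Python as a stand-in for the R used elsewhere in these Lessons. Under a flat-line assumption, the least-squares signal is a single constant, the mean of the response; the residuals are whatever is left over.

```python
# Sketch: noise = residuals = data minus signal, for a flat-line signal.
# Data: the ten y values from the Small sample shown earlier.
y = [1.890, 4.120, 2.360, 6.330, 0.933,
     2.930, 5.700, 5.900, 2.130, 4.230]

# Flat-line signal: the one constant that best fits the data.
# For least squares, that constant is the mean of y.
signal = sum(y) / len(y)

# Residuals: one per data point.
residuals = [yi - signal for yi in y]

# Negative residuals would be drawn in red, positive in black.
red = [r for r in residuals if r < 0]
black = [r for r in residuals if r >= 0]
```

A handy property of the least-squares flat line: the residuals always sum to zero, since the mean splits the data evenly above and below.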

It’s natural to wonder which of the three versions of the “signal” presented above is the best. Generations of statisticians have struggled with this problem: how to determine the best version of the signal purely from data. The reality is that the data alone do not define the signal. Instead, it is the data **plus** an assumption about the shape of the signal that, together, enable us to extract the specific form of signal.

Experience has shown that a good, general-purpose assumption for extracting signal is a sloping-line function. There are certainly many situations where another assumption is more meaningful, and we will consider one such situation in Lesson sec-risk, but for the purposes of these Lessons we will stick mostly to the sloping-line function.

Keep in mind that the signal we seek to extract from the data is about the relationship among variables. In the above, we have represented that relationship as the graph of a function. The noise in those graphs is represented by the residuals—the black and red segments that connect the function to the actual values of the response variable.

Most of our work in these Lessons will use a different representation of the signal: the coefficients from a model trained on the data. To find these coefficients—that is, to extract the signal—we have to specify what “shape” of relationship we want to use. Then we give the specified shape and the data to `lm()`, and the computer finds the coefficients for us.

The specification of shape is done via the first argument to `lm()`: a tilde expression. As you know, the left-hand side of the tilde expression is always the name of the response variable. The right-hand side describes the shape.

There are only a few “shapes” in common use.

- `y ~ 1` is the flat-line function.
- `y ~ x` is the sloping-line function.

To see the coefficients, pipe the model produced by `lm()` into the `conf_interval()` summary function:

`lm(y ~ 1, data=Small) %>% conf_interval()`

```
term .lwr .coef .upr
------------ --------- -------- ---------
(Intercept) 2.304149 3.65201 4.999872
```

`lm(y ~ x, data=Small) %>% conf_interval()`

```
term .lwr .coef .upr
------------ ---------- --------- ---------
(Intercept) 3.4516339 4.262846 5.074058
x 0.8841454 1.741758 2.599370
```
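Under the assumption that `lm()` fits by ordinary least squares, the `.coef` values above can be reproduced by hand. The following sketch (in Python, standing in for the book's R) uses the ten rows of the `Small` sample printed earlier: with `y ~ 1`, the lone coefficient is just the mean of the response; with `y ~ x`, the slope and intercept come from the textbook least-squares formulas.

```python
# Sketch: reproducing the .coef column by hand, assuming ordinary
# least squares. Data: the ten rows of the Small sample.
x = [-0.7860, 0.0547, -1.1700, -0.1670, -1.8700,
     -0.1200, 0.8260, 1.1900, -1.0900, -0.3750]
y = [1.890, 4.120, 2.360, 6.330, 0.933,
     2.930, 5.700, 5.900, 2.130, 4.230]
n = len(x)

# y ~ 1: the flat-line coefficient is the mean of y (about 3.65).
flat_coef = sum(y) / n

# y ~ x: slope and intercept from the least-squares formulas.
xbar, ybar = sum(x) / n, sum(y) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))   # about 1.74
intercept = ybar - slope * xbar                 # about 4.26
```

The hand-computed values agree with the `.coef` column to the rounding of the printed data.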

The `.coef` column gives the coefficients, which are a compact description of the signal. For instance, with the `y ~ x` shape specification, the signal (that is, the relationship between `y` and `x`) has the form \(\widehat{\ y\ } = 4.26 + 1.74\, x\). In conventional mathematical notation, the \(\widehat{\ \ \ }\) on top of \(y\) is the way of saying, “This is the formula for the signal.”

What about the noise? In the graphical representation of signal and noise (e.g. Figure fig-dag08-other-forms), the noise was shown as the residuals, one for each row of the data frame.

When we use coefficients to represent the signal, the noise shows up as the `.lwr` and `.upr` columns of the `conf_interval()` summary. In Lessons sec-sampling-variation and sec-confidence-intervals, we will show where these columns come from and attribute them to a statistical phenomenon called “**sampling variation**”.

Let’s turn to the analogy between data and audio transmission.

- The **response** variable is the raw **transmission**, consisting of both signal and noise.
- The **signal** is the **fitted model values**, as in the `.output` column produced by `model_eval()`. This is analogous to cleaning up a recording so that it can be listened to without hiss and pop.
- The **noise** is the **residuals**. In an audio signal, the noise is the pure hiss and pop (and other ambient sounds). That is, the noise is what is left over after subtracting the signal from the original transmission.

Signal to noise ratio

Engineers often speak of the “signal-to-noise ratio” (SNR). In sound, this refers to the loudness of the signal compared to the loudness of the noise, often measured in decibels (dB). An SNR of 5 dB means that the signal carries roughly three times the power of the noise.

You can listen to examples of noisy music and speech at this web site:

The noisiest examples have an SNR of 5 dB. Press the play/pause button to hear the noisy recording, then compare to the de-noised transmission—the signal—by pressing play/pause in the “Clean” column.

Statisticians have a different way of measuring the size of the signal. Rather than comparing the signal to the noise, they compare the signal to the *transmission*, that is, to the response variable. This comparison is quantified by a measurement called “R-squared” and written R^{2}.

The “sizes” of the signal and the transmission are measured by their respective variances. For instance, let’s look at the size of the signal and transmission using a simple model from the `Galton` data frame:

```
lm(height ~ mother + sex, data = Galton) %>%
  model_eval() %>%
  summarize(signal_size = var(.output),
            transmission_size = var(.response))
```

`Using training data as input to model_eval().`

```
signal_size transmission_size
------------ ------------------
7.21 12.84
```

The R^{2} summary is simply the signal size divided by the transmission size. Here, that is \[\text{R}^2 = \frac{7.21}{12.84} = 0.56 = 56\%\ .\]

The comparison of the signal to the transmission has a nice property. The signal can never be bigger than the transmission, otherwise the signal wouldn’t fit into the transmission. Consequently, the largest possible R^{2} is 1.0 (or, in terms of percent, 100%). The smallest possible R^{2} is zero.
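As a check on the arithmetic, here is a sketch (in Python, standing in for the book's R) that computes R² as signal variance divided by transmission variance, applied to the `Small` sample under a `y ~ x` least-squares signal. The text does not report R² for `Small`; the point of the sketch is only that the ratio necessarily falls between 0 and 1.

```python
# Sketch: R-squared = var(signal) / var(transmission).
# Data: the ten rows of the Small sample; signal assumed to be the
# least-squares sloping line.
x = [-0.7860, 0.0547, -1.1700, -0.1670, -1.8700,
     -0.1200, 0.8260, 1.1900, -1.0900, -0.3750]
y = [1.890, 4.120, 2.360, 6.330, 0.933,
     2.930, 5.700, 5.900, 2.130, 4.230]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Fit the sloping-line signal by least squares.
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar
signal = [intercept + slope * xi for xi in x]   # fitted model values

def var(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

R2 = var(signal) / var(y)   # signal size over transmission size
```

Because the fitted values are a least-squares projection of the response, `var(signal)` can never exceed `var(y)`, which is why R² cannot exceed 1.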

The `R2()` summary function will calculate R^{2} for you. For instance,

`lm(height ~ mother + sex, data = Galton) %>% R2()`

```
n k Rsquared F adjR2 p df.num df.denom
---- --- ---------- --------- ---------- --- ------- ---------
898 2 0.5618019 573.7276 0.5608227 0 2 895
```

R^{2} is a traditional measure of the “quality” of a model, so you will see it in a large fraction of research reports.

The sorrows of R^{2}

R^{2} is often described as “the fraction of variance (in the response variable) **explained** by the explanatory variable.” It is a perfectly reasonable, technical way of accounting for variance. But it is often misinterpreted.

Many researchers prefer to report \(\sqrt{\text{R}^2}\), which they write simply as R. In a purely mathematical sense, it doesn’t matter whether one reports R^{2} or R; you can easily convert back and forth. However, I dislike reports of R since I suspect that the researcher’s motivation is to be able to report a bigger, more impressive number. (For instance, \(\text{R}^2 = 0.10\) translates into a much more impressive \(\text{R} = 0.32\). More impressive, but conveying exactly the same result.)

R^{2} is valuable for **detecting** a possible relationship between the response variables and the explanatory variables or describing the quality of prediction (Lessons sec-lesson-25 and sec-lesson-26) but the scales for these two purposes are very different. Detecting a relationship might be indicated even by very small R^{2} (e.g. 0.01) when the sample size is large. But a model is only useful for *predicting* the outcome for an individual if \(\text{R}^2 \gtrapprox 0.50\), no matter how large the sample.

A better way of quantifying the relationship between variables is in terms of **effect size** (Lesson sec-lesson-24).

```
set.seed(103)
Large <- sample(dag01, size=10000)
```

Lesson sec-measuring-variation introduced the standard way to measure variation in a single variable: the **variance** or its square root, the **standard deviation**. For instance, we can measure the variation in the variables from the `Large` sample using `sd()` and `var()`:

```
Large %>%
  summarize(sx = sd(x), sy = sd(y), vx = var(x), vy = var(y))
```

```
sx sy vx vy
---------- --------- ---------- ---------
0.9830639 1.779003 0.9664146 3.164851
```

According to the standard deviation, the size of the `x` variation is about 1. The size of the `y` variation is about 1.8.

Look again at the formulas that compose `dag01`:

`print(dag01)`

```
x ~ exo()
y ~ 1.5 * x + 4 + exo()
```

The formula for `x` shows that `x` is exogenous, its values coming from a random number generator, `exo()`, which, unless otherwise specified, generates noise of size 1.

As for `y`, the formula includes two sources of variation:

- The part of `y` determined by `x`, that is \(y = \mathbf{1.5\, x} + \color{gray}{4 + \text{exo()}}\)
- The noise added directly into `y`, that is \(y = \color{gray}{1.5\, x + 4} + \mathbf{\text{exo()}}\)

The 4 in the formula does not add any *variation* to `y`; it is just a number.

We already know that `exo()` generates random noise of size 1. So the amount of variation contributed by the `+ exo()` term in the DAG formula is 1. The remaining variation is contributed by `1.5 * x`. The variation in `x` is 1 (coming from the `exo()` in the formula for `x`). A reasonable guess is that `1.5 * x` will have 1.5 times the variation in `x`. So, the variation contributed by the `1.5 * x` component is 1.5. The overall variation in `y` is the sum of the variations contributed by the individual components. This suggests that the variation in `y` should be \[\underbrace{1}_\text{from exo()} + \underbrace{1.5}_\text{from 1.5 x} = \underbrace{2.5}_\text{overall variation in y}.\] Simple addition! Unfortunately, the result is wrong. In the previous summary of the `Large` sample, we measured the overall variation in `y` as about 1.8.

The *variance* will give a better accounting than the standard deviation. Recall that `exo()` generates variation whose standard deviation is 1, so the variance from `exo()` is \(1^2 = 1\). Since `x` comes entirely from `exo()`, the variance of `x` is 1. So is the variance of the `exo()` component of `y`.

Turn to the `1.5 * x` component of `y`. Since variances involve squares, the variance of `1.5 * x` works out to be \(1.5^2\, \text{var(}\mathit{x}\text{)} = 2.25\). Adding up the variances from the two components of `y` gives

\[\text{var(}\mathit{y}\text{)} = \underbrace{2.25}_\text{from 1.5 x} + \underbrace{1}_\text{from exo()} = 3.25\]

This result, that the variance of `y` is 3.25, closely matches the variance of about 3.16 that we found in summarizing the `y` data generated by the DAG.

**The lesson here**: When adding two sources of variation, the variances of the individual sources add to form the overall variance of the sum. Just like \(A^2 + B^2 = C^2\) in the Pythagorean Theorem.
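A quick simulation can confirm the lesson. This sketch (in Python, standing in for the book's R; `exo()` is assumed to be standard normal noise of size 1, consistent with the DAG description) generates a large sample from `dag01` and checks that the variance of `y` lands near \(2.25 + 1 = 3.25\), while the standard deviation stays near 1.8, far from the "simple addition" guess of 2.5.

```python
import random

# Simulate dag01: x ~ exo(), y ~ 1.5 * x + 4 + exo(),
# with exo() modeled as standard normal noise (size 1).
random.seed(103)
N = 100_000
x = [random.gauss(0, 1) for _ in range(N)]
y = [1.5 * xi + 4 + random.gauss(0, 1) for xi in x]

def var(v):
    """Sample variance with the usual n-1 denominator."""
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

# Variances add: var(y) should be close to 1.5**2 * 1 + 1 = 3.25.
# Standard deviations do not: sd(y) is about 1.8, not 1 + 1.5 = 2.5.
sd_y = var(y) ** 0.5
```

The `+ 4` term shifts `y` but, as the text notes, contributes nothing to its variance.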