
Imagine being transported back to June 1940. The family is in the living room, sitting around the radio console, waiting for it to warm up. The news today is from Europe, the surrender of the French in the face of the German invasion. Press the play button and listen to recording #103.

You may have to scroll down to see the play button and the recordings.

The spoken words from the recording are discernible despite the hiss and clicks of the background noise. The situation is similar to a conversation in a sports stadium. The crowd is loud, so the speaker has to shout. The listener ignores the noise (unless it is too loud) and recovers the shouted words.

Engineers and others make a distinction between **signal** and noise. The engineer aims to separate the signal from the noise. That aim applies to statistics as well.

There are many sources of noise in data; every variable has its own story, part of which is noise from measurement errors and recording blunders. For instance, economists use national statistics, like GDP, even though the definition is arbitrary (a Hurricane can raise GDP!), and early reports are invariably corrected a few months later. Historians go back to original documents, but inevitably many of the documents have been lost or destroyed: a source of noise. Even in elections where, in principle, counting is straightforward, the voters’ intentions are measured imperfectly due to “hanging chads,” “butterfly ballots,” broken voting machines, spoiled ballots, and so on.

The statistical thinker is well advised to know about the sources of noise in the system she is studying. Analysis of data will be better the more the modeler knows about how measurements are made and data collected.

To illustrate the statistical problem of signal and noise, we turn to a DAG simulation.

`print(dag01)`

```
x ~ exo()
y ~ 1.5 * x + 4 + exo()
```
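
The book's `sample(dag01, ...)` machinery does the simulation for us, but the data-generating process is simple enough to sketch outside R. Here is a hypothetical Python equivalent (the function name, generator, and seed are our own choices; `exo()` is taken to be standard normal noise, its default):

```python
import numpy as np

def simulate_dag01(n, rng):
    """Mimic dag01: x is pure exogenous noise; y mixes x with more noise."""
    x = rng.standard_normal(n)                 # x ~ exo()
    y = 1.5 * x + 4 + rng.standard_normal(n)   # y ~ 1.5 * x + 4 + exo()
    return x, y

x, y = simulate_dag01(10_000, np.random.default_rng(103))
```

With ten thousand rows, the sample mean of `y` lands close to 4 and its variance close to \(1.5^2 + 1 = 3.25\), foreshadowing the variance accounting later in this Lesson.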

`dag01` involves only two variables. `x` is pure exogenous noise. `y` is a mixture of `x` with some more noise added.

It would be fair to say that `y` is pure noise, coming as it does from a linear combination of two sources of noise. The signal, however, is the relationship between `x` and `y`. Separating signal from noise means finding the relationship despite the noisy origins of `x` and `y`.

Is it possible to see the relationship from the data frame itself? Table 21.1 shows a sample of size \(n=10\):

`Small <- sample(dag01, size=10)`

| x | y |
|---|---|
| -0.7860 | 1.890 |
| 0.0547 | 4.120 |
| -1.1700 | 2.360 |
| -0.1670 | 6.330 |
| -1.8700 | 0.933 |
| -0.1200 | 2.930 |
| 0.8260 | 5.700 |
| 1.1900 | 5.900 |
| -1.0900 | 2.130 |
| -0.3750 | 4.230 |

Any of an infinite number of possible relationships could account for the `x` and `y` data in Table 21.1. The signal separation problem of statistics is to make a guess that is as good as possible.

A careful perusal of Table 21.1 suggests some patterns. `x` is never larger than about 2 in magnitude and can be positive or negative. `y` is always positive. Furthermore, when `x` is negative, the corresponding `y` value is relatively small compared to the `y` values for positive `x`.

Human cognition is not well suited to looking at long columns of numbers. Often, we can make better use of our natural human talents by translating the data frame into a graphic, as in Figure 21.1:

The signal that we seek to find is the relationship between `x` and `y`. This will be in the form of a function: for any value of `x`, the function gives a corresponding value of `y`. Figure 21.2 shows three possible functions, all of which go through the data points.

The functions drawn in Figure 21.2 are disturbing proposals for the relationship between `x` and `y` implied by the data (black points). While any of the functions *could* be the mechanism for how `y` is generated from `x`, they all seem to involve a lot of detail that is not suggested by the actual data points. Even the blue function, which makes minimal excursions from the data points, is not satisfactory.

Separating the signal from the noise involves specifying *what kind of function* we think is at work, then finding the specific instance of that kind of function that goes through the data points, as in Figure 21.3.

The blue function in Figure 21.3 is based on a simplifying assumption: that the relationship between `x` and `y` is itself simple. Straight-line functions are a very simple form. The noise is the deviation of the data from the signal, shown as red for negative deviations and black for positive. These deviations are called the **residuals**: what’s left over after you take away the signal.

We have *assumed* that the relationship is that of a sloping line in order to give a handle on what is signal and what is noise. It is perfectly legitimate to make other assumptions about the form of the relationship. Figure 21.4 shows two such alternative forms, one is a flat line and the other a gentle curve. Assuming a flat line, a sloping line, or a gently curving line each leads to a different estimation of the signal.

It’s natural to wonder which of the three versions of the “signal” presented above is the best. Generations of statisticians have struggled with this problem: how to determine the best version of the signal purely from data. The reality is that the data alone do not define the signal. Instead, it is the data **plus** an assumption about the shape of the signal that, together, enable us to extract the specific form of signal.

Experience has shown that a good, general-purpose assumption for extracting signal is a sloping-line function. There are certainly many situations where another assumption is more meaningful, and we will consider one such situation in Lesson 33, but for the purposes of these Lessons we will stick mostly to the sloping-line function.

Keep in mind that the signal we seek to extract from the data is about the relationship among variables. In the above, we have represented that relationship as the graph of a function. The noise in those graphs is represented by the residuals—the black and red segments that connect the function to the actual values of the response variable.

Most of our work in these Lessons will use a different representation of the signal: the coefficients from a model trained on the data. To find these coefficients (that is, to extract the signal) we have to specify what “shape” of relationship we want to use. Then we give the specified shape and the data to `lm()` and the computer finds the coefficients for us.

The specification of shape is done via the first argument to `lm()`: a tilde expression. As you know, the left-hand side of the tilde expression is always the name of the response variable. The right-hand side describes the shape.

There are only a few “shapes” in common use.

- `y ~ 1` is the flat-line function.
- `y ~ x` is the sloping-line function.
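
A quick way to see what the `y ~ 1` shape does: fitting a flat line by least squares yields a single coefficient equal to the mean of the response, whatever the explanatory variable is doing. A Python check (our own illustration, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(30)
y = 1.5 * x + 4 + rng.standard_normal(30)

# A degree-0 fit plays the role of the flat-line model y ~ 1; its one
# coefficient minimizes sum((y - b0)^2), which the sample mean does.
b0 = float(np.polyfit(x, y, deg=0)[0])
```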

To see the coefficients, pipe the model produced by `lm()` into the `conf_interval()` summary function:

`lm(y ~ 1, data=Small) %>% conf_interval()`

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | 2.304149 | 3.65201 | 4.999872 |

`lm(y ~ x, data=Small) %>% conf_interval()`

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | 3.4516339 | 4.262846 | 5.074058 |
| x | 0.8841454 | 1.741758 | 2.599370 |
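
The coefficients in the table above are estimates of the DAG's true values: the intercept 4 and slope 1.5 in `dag01`'s formula for `y`. With a much larger sample the estimates tighten around the truth, as this Python sketch suggests (a hypothetical re-simulation of the `dag01` recipe; seed and sample size are our own choices):

```python
import numpy as np

rng = np.random.default_rng(103)
x = rng.standard_normal(10_000)
y = 1.5 * x + 4 + rng.standard_normal(10_000)   # the dag01 recipe

# Least-squares fit, as lm(y ~ x) would do
slope, intercept = np.polyfit(x, y, deg=1)
```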

The `.coef` column gives the coefficients, which are a compact description of the signal. For instance, with the `y ~ x` shape specification, the signal (that is, the relationship between `y` and `x`) has the form \(\widehat{\ y\ } = 4.26 + 1.74 x\). In conventional mathematical notation, the \(\widehat{\ \ \ }\) on top of \(y\) is the way of saying, “This is the formula for the signal.”

What about the noise? In the graphical representation of signal and noise (e.g. Figure 21.4), the noise was shown as the residuals, one for each row of the data frame.

When we use coefficients to represent the signal, the noise shows up as the `.lwr` and `.upr` columns of the `conf_interval()` summary. In Lessons 22 and 23, we will show where these columns come from and attribute them to a statistical phenomenon called “**sampling variation**”.

Let’s turn to the analogy between data and audio transmission.

- The **response** variable is the raw **transmission**, consisting of both signal and noise.
- The **signal** is the **fitted model values**, as in the `.output` column produced by `model_eval()`. This is analogous to cleaning up a recording so that it can be listened to without hiss and pop.
- The **noise** is the **residuals**. In an audio signal, the noise is the pure hiss and pop (and other ambient sounds). That is, the noise is what is left over after subtracting the signal from the original transmission.

Statisticians have a different way of measuring the size of the signal. Rather than comparing the signal to the noise, they compare the signal to the *transmission*, that is, to the response variable. This comparison is quantified by a measurement called “R-squared” and written \(\text{R}^2\).

The “size” of the signal and transmission are measured by their respective variances. For instance, let’s look at the size of the signal and transmission using a simple model from the `Galton` data frame:

```
lm(height ~ mother + sex, data = Galton) %>%
  model_eval() %>%
  summarize(signal_size = var(.output),
            transmission_size = var(.response))
```

`Using training data as input to model_eval().`

| signal_size | transmission_size |
|---|---|
| 7.21 | 12.84 |

The R^{2} summary is simply the signal size divided by the transmission size. Here, that is \[\text{R}^2 = \frac{7.21}{12.84} = 0.56 = 56\%\ .\]

The comparison of the signal to the transmission has a nice property. The signal can never be bigger than the transmission; otherwise the signal wouldn’t fit into the transmission. Consequently, the largest possible \(\text{R}^2\) is 1.0 (or, in terms of percent, 100%). The smallest possible \(\text{R}^2\) is zero.
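
Both the definition and the bounds are easy to confirm numerically. In this Python sketch (our own illustration), \(\text{R}^2\) is the variance of the fitted values divided by the variance of the response, and the variances of signal and residuals add up exactly to the variance of the transmission:

```python
import numpy as np

rng = np.random.default_rng(21)
x = rng.standard_normal(500)
y = 1.5 * x + 4 + rng.standard_normal(500)   # the transmission

slope, intercept = np.polyfit(x, y, deg=1)
signal = intercept + slope * x               # the fitted values

# R-squared: size of the signal relative to the transmission
r_squared = float(np.var(signal) / np.var(y))
```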

The `R2()` summary function will calculate \(\text{R}^2\) for you. For instance,

`lm(height ~ mother + sex, data = Galton) %>% R2()`

| n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
|---|---|---|---|---|---|---|---|
| 898 | 2 | 0.5618019 | 573.7276 | 0.5608227 | 0 | 2 | 895 |

\(\text{R}^2\) is a traditional measure of the “quality” of a model, so you will see it in a large fraction of research reports.

```
set.seed(103)
Large <- sample(dag01, size=10000)
```

Lesson 19 introduced the standard way to measure variation in a single variable: the **variance** or its square root, the **standard deviation**. For instance, we can measure the variation in the variables from the `Large` sample using `sd()` and `var()`:

```
Large %>%
  summarize(sx = sd(x), sy = sd(y), vx = var(x), vy = var(y))
```

| sx | sy | vx | vy |
|---|---|---|---|
| 0.9830639 | 1.779003 | 0.9664146 | 3.164851 |

According to the standard deviation, the size of the `x` variation is about 1. The size of the `y` variation is about 1.8.

Look again at the formulas that compose `dag01`:

`print(dag01)`

```
x ~ exo()
y ~ 1.5 * x + 4 + exo()
```

The formula for `x` shows that `x` is exogenous, its values coming from a random number generator, `exo()`, which, unless otherwise specified, generates noise of size 1.

As for `y`, the formula includes two sources of variation:

- The part of `y` determined by `x`, that is \(y = \mathbf{1.5 x} + \color{gray}{4 + \text{exo()}}\)
- The noise added directly into `y`, that is \(y = \color{gray}{1.5 x + 4} + \mathbf{\text{exo()}}\)

The 4 in the formula does not add any *variation* to `y`; it is just a number.

We already know that `exo()` generates random noise of size 1. So the amount of variation contributed by the `+ exo()` term in the DAG formula is 1. The remaining variation is contributed by `1.5 * x`. The variation in `x` is 1 (coming from the `exo()` in the formula for `x`). A reasonable guess is that `1.5 * x` will have 1.5 times the variation in `x`. So, the variation contributed by the `1.5 * x` component is 1.5. The overall variation in `y` is the sum of the variations contributed by the individual components. This suggests that the variation in `y` should be \[\underbrace{1}_\text{from exo()} + \underbrace{1.5}_\text{from 1.5 x} = \underbrace{2.5}_\text{overall variation in y}.\] Simple addition! Unfortunately, the result is wrong. In the previous summary of the `Large` sample, we measured the overall variation in `y` as about 1.8.

The *variance* will give a better accounting than the standard deviation. Recall that `exo()` generates variation whose standard deviation is 1, so the variance from `exo()` is \(1^2 = 1\). Since `x` comes entirely from `exo()`, the variance of `x` is 1. So is the variance of the `exo()` component of `y`.

Turn to the `1.5 * x` component of `y`. Since variances involve squares, the variance of `1.5 * x` works out to be \(1.5^2\, \text{var(}\mathit{x}\text{)} = 2.25\). Adding up the variances from the two components of `y` gives

\[\text{var(}\mathit{y}\text{)} = \underbrace{2.25}_\text{from 1.5 x} + \underbrace{1}_\text{from exo()} = 3.25\]

This result, that the variance of `y` is 3.25, closely matches what we found in summarizing the `y` data generated by the DAG (3.16 in the `Large` sample; note that the *standard deviations* of the components, 1.5 and 1, would not add up to \(\sqrt{3.25} \approx 1.8\)).

**The lesson here**: When adding two independent sources of variation, the variances of the individual sources add to form the overall variance of the sum, just as \(A^2 + B^2 = C^2\) in the Pythagorean Theorem.
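
That variance-addition rule is easy to verify by simulation. This Python sketch (our own choice of generator and sample size) builds `y` from its two independent components and checks that the variances add:

```python
import numpy as np

rng = np.random.default_rng(103)
n = 100_000
x = rng.standard_normal(n)        # var(x) is about 1
noise = rng.standard_normal(n)    # var(noise) is about 1
y = 1.5 * x + 4 + noise

# Independent components: var(y) should be near 1.5^2 * 1 + 1 = 3.25
var_y = float(np.var(y))
```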