23  Causal influence and DAGs

Warning

The causal inference diagrams in this chapter are not yet sized appropriately to fit in with the text. Otherwise, they are correct.

A DAG is a hypothesis, that is, a statement that might or might not be true. A common purpose for a hypothesis is to organize thinking around an assumed state of the world. But assuming a hypothesis does not make it true. Any conclusions that we draw from hypothetical thinking should be regarded as conditioned on the assumption, not as definitive statements about the world.

Often we work with multiple hypotheses that differ from, or are even utterly inconsistent with, one another. The point is to judge the extent to which conclusions depend on a specific hypothesis or apply across a set of differing hypotheses.

When we use a DAG to represent a causal influence diagram, the hypothesis has to do with what causes what. In many cases, we use the causal influence diagram to study a hypothesized direct causal link between two variables. In other cases, the two variables of interest are hypothesized to have no direct causal link.

The hypothesis that there is a direct causal link from variable X to variable Y will be drawn like this, with the arrow showing the flow of influence from X to Y.

[Diagram: X → Y]

A direct causal link between X and Y.

On the other hand, the hypothesis that there is no direct causal link from X to Y will be drawn like this:

[Diagram: nodes X and Y, with no arrow between them]

No arrow between two nodes means there is no direct causal link connecting them.

These two simple examples, each with only two nodes, have very simple implications regarding data. With a direct causal link \(\text{X} \rightarrow \text{Y}\), we anticipate that a model specification y ~ x trained on sufficient data will show a relationship between the two variables and the coefficients will reveal the effect size. On the other hand, with no direct causal link between X and Y, the model will not show any relationship. By “not showing” we mean that the relevant confidence interval (Lesson 20) includes zero.

Instructor note: We’re assuming here that the X-to-Y relationship is linear. For nonlinear relationships, we would have to consider richer model types to see the relationship.
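The two-node cases can be tried out on simulated data. The following is a minimal sketch, not the book's own software: the effect size of 1.5, the sample size, the seed, and the helper name `fit_slope` are all assumptions made for illustration, and the interval uses the normal approximation (slope ± 1.96 standard errors).

```python
import random
import statistics

def fit_slope(x, y):
    """Least-squares slope for y ~ x, with an approximate 95% confidence interval."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares; the variance estimate uses n - 2 degrees of freedom
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, (slope - 1.96 * se, slope + 1.96 * se)

rng = random.Random(101)  # arbitrary seed
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]

# Hypothesis "X -> Y": Y is built from X with an assumed effect size of 1.5
y_linked = [1.5 * xi + rng.gauss(0, 1) for xi in x]
# Hypothesis "no link": Y is generated without any reference to X
y_unlinked = [rng.gauss(0, 1) for _ in range(n)]

slope_linked, ci_linked = fit_slope(x, y_linked)
slope_unlinked, ci_unlinked = fit_slope(x, y_unlinked)

print("X -> Y :", round(slope_linked, 3), ci_linked)
print("no link:", round(slope_unlinked, 3), ci_unlinked)
```

Under the linked hypothesis the interval should sit well away from zero and near the assumed 1.5. Under the unlinked hypothesis the slope should be close to zero, and the interval will include zero in roughly 95% of simulated samples.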

Things get more interesting when there are more than two nodes with causal links between them. To illustrate, suppose our causal hypothesis is

[Diagram: nodes X, Y, and C, with arrows C → X and C → Y]

As you will see, under this hypothesis, a model y ~ x will show a relationship, even though there is no direct causal connection between X and Y. (However, there is an indirect connection, through C.) In Lesson 24 you will find out that, under the \(\text{X}\leftarrow \text{C} \rightarrow \text{Y}\) hypothesis, we would need to collect data on X, Y, and C, and model using the specification y ~ x + c in order to tell from the data whether there is indeed no direct connection from X to Y. Results from the simpler model, y ~ x, would be utterly unreliable.

The purpose of this chapter is to introduce you to a variety of organizations of causal influence diagrams and to show how to take them apart into “causal pathways.” Of particular interest will be “backdoor pathways,” like that in \(\text{X}\leftarrow \text{C} \rightarrow \text{Y}\), that get in the way of understanding whether there is a direct \(\text{X} \rightarrow \text{Y}\) pathway.
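A small simulation can make the contrast concrete. This is a sketch under stated assumptions: the coefficients 2.0 and 3.0, the seed, and the helper names `slope_simple` and `slopes_adjusted` are invented for illustration. Data are generated so that C drives both X and Y and X has no effect on Y; then y is regressed on x alone and on x together with c.

```python
import random
import statistics

def centered(v):
    m = statistics.fmean(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def slope_simple(y, x):
    """Coefficient on x in the model y ~ x."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)

def slopes_adjusted(y, x, c):
    """Coefficients on x and c in y ~ x + c, from the two-predictor normal equations."""
    xc, cc, yc = centered(x), centered(c), centered(y)
    sxx, scc, sxc = dot(xc, xc), dot(cc, cc), dot(xc, cc)
    sxy, scy = dot(xc, yc), dot(cc, yc)
    det = sxx * scc - sxc ** 2
    return (scc * sxy - sxc * scy) / det, (sxx * scy - sxc * sxy) / det

rng = random.Random(2024)  # arbitrary seed
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]
x = [2.0 * ci + rng.gauss(0, 1) for ci in c]  # C -> X (assumed coefficient)
y = [3.0 * ci + rng.gauss(0, 1) for ci in c]  # C -> Y (assumed coefficient); X plays no role

b_naive = slope_simple(y, x)        # y ~ x
b_x, b_c = slopes_adjusted(y, x, c)  # y ~ x + c

print("y ~ x     : coefficient on x =", round(b_naive, 3))
print("y ~ x + c : coefficient on x =", round(b_x, 3), " coefficient on c =", round(b_c, 3))
```

With these made-up formulas the population slope for y ~ x is \(2 \times 3 / (2^2 + 1) = 1.2\), even though X has no effect on Y. Including c as a covariate should bring the coefficient on x close to zero.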
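Taking a diagram apart into pathways can be done mechanically. The sketch below assumes a diagram stored as a list of (cause, effect) pairs, and uses the three-node diagram with a direct X → Y arrow added as the link in question. The function names are hypothetical. A path is flagged as "backdoor" when its first step runs against an arrow pointing into X; the sketch does not check whether a path is blocked, for instance by a collider.

```python
def paths_between(edges, start, end):
    """All simple paths from start to end, ignoring arrow direction.

    Each path is a list of (from_node, to_node, arrow) steps, where arrow is
    '->' when the edge points along the direction of travel and '<-' otherwise.
    """
    neighbors = {}
    for cause, effect in edges:
        neighbors.setdefault(cause, []).append((effect, "->"))
        neighbors.setdefault(effect, []).append((cause, "<-"))

    found = []

    def walk(node, visited, steps):
        if node == end:
            found.append(list(steps))
            return
        for nxt, arrow in neighbors.get(node, []):
            if nxt not in visited:
                steps.append((node, nxt, arrow))
                walk(nxt, visited | {nxt}, steps)
                steps.pop()

    walk(start, {start}, [])
    return found

def describe(path):
    """Render a path as text, for example 'X <- C -> Y'."""
    text = path[0][0]
    for _, nxt, arrow in path:
        text += f" {arrow} {nxt}"
    return text

def is_backdoor(path):
    """True when the path leaves the start node against an arrow pointing into it."""
    return path[0][2] == "<-"

# Assumed diagram: C causes X and Y, plus the direct X -> Y link under study
edges = [("C", "X"), ("C", "Y"), ("X", "Y")]
paths = paths_between(edges, "X", "Y")
for p in paths:
    print(describe(p), "(backdoor)" if is_backdoor(p) else "(starts with an arrow out of X)")
```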

In draft

PUT THE Galton regression example here.

GraphViz

NOTES FOR DRAWING INFLUENCE DIAGRAMS

Confounding

[Diagram: x → y, labeled “Direct causal link”; c → x and c → y]

Collider

[Diagram: x → y, labeled “Direct causal link”; x → c and y → c]

Common cause

  1. Save the dot to a file in www
  2. cd www
  3. Run `dot -Tpng graphviz-common-cause.dot -ographviz-common-cause.png`

DPI = 30 is for rendering the graph directly in Quarto.

[Diagram: c → x and c → y; the x-to-y connection is labeled “No direct causal link”]

A common-cause relationship between X and Y.
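The DOT source for this diagram is not included in these notes. As a hedged guess at what it might contain, the sketch below builds a DOT description of the common-cause diagram as a Python string. The graph name, the left-to-right layout, and the dashed style of the x-to-y edge are assumptions; `dpi` is a standard Graphviz graph attribute, set here to the value of 30 mentioned in the notes.

```python
def common_cause_dot(dpi=30):
    """Return DOT source for a common-cause diagram (a hypothetical reconstruction)."""
    lines = [
        "digraph common_cause {",
        f"  graph [dpi={dpi}, rankdir=LR];",
        "  c -> x;",
        "  c -> y;",
        # Assumption: the 'no direct link' relation is drawn as a dashed, labeled edge
        '  x -> y [label="No direct causal link", style=dashed];',
        "}",
    ]
    return "\n".join(lines) + "\n"

dot_source = common_cause_dot()
print(dot_source)
```

Saving the printed text as graphviz-common-cause.dot and running the `dot` command from the steps above should render it, provided Graphviz is installed.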

Mermaid

flowchart LR
   Wealth[/"*Wealth*"/]:::start_node -.-> Chemicals
   Chemicals -- "*toxicity*" ---> Disease
   Wealth -.-> Disease
   
   classDef start_node stroke:#333,stroke-width:1px

Causal inference

MOVE FROM DAGs to “influence diagrams.”

Often, but not always, our interest in studying data is to reveal or exploit the causal connections between variables. Understanding causality is essential, for instance, if we are planning to intervene in the world and want to anticipate the consequences. Interventions are things like “increase the dose of medicine,” “stop smoking!”, “lower the budget,” “add more cargo to a plane (which will increase fuel consumption and reduce the range).”

Historically, mainstream statisticians were hostile to using data to explore causal relationships. (The one exception was experiment, which gathers data from an actual intervention in the world. See Lesson 25.) Statistics teachers encouraged students to use phrases like “associated with” or “correlated with” and reminded them that “correlation is not causation.”

Regrettably, this attitude made statistics irrelevant to the many situations where intervention is the core concern and experiment is not feasible. A tragic episode of this sort likely caused millions of unnecessary deaths. Starting in the 1940s, doctors and epidemiologists saw evidence that smoking causes lung cancer. In stepped the most famous statistician of the age, Ronald Fisher, to insist that the statement should be, “smoking is associated with lung cancer.” He speculated that smoking and lung cancer might have a common cause, perhaps genetic. Fisher argued that establishing causation requires running an experiment where people are randomly assigned to smoke or not smoke and then observed for decades to see if they develop lung cancer. Such an experiment is infeasible and unethical, to say nothing of the need to wait decades to get a result.

Fortunately, around 1960, a researcher at the US National Institutes of Health, Jerome Cornfield, was able to show mathematically that the strength of the association between smoking and cancer ruled out any genetic mechanism. Cornfield’s work was an important step in the development of a new area in statistics: “causal inference.”

Causal inference is not about proving that one thing causes another but about formal ways to say something about how the world works that can be used, along with data, to draw responsible conclusions about causal relationships.

As you will see in Lesson 24, DAGs are a major tool in causal inference, allowing you not only to represent a hypothesis about causal relationships, but also to deduce what sorts of models will be able to reveal causal mechanisms.

The point of a DAG is to make a clear statement of a hypothesis about causation. Drawing a DAG does not mean that the hypothesis is correct, just that we believe the hypothesis is, in some sense, a possibility. Different people might have different beliefs about what causes what in real-world systems. Comparing their different DAGs can help, sometimes, to discuss and resolve the disagreement.

We are going to use DAGs for two distinct purposes. One purpose is to inform responsible conclusions from data about what causes what. The data on its own is insufficient to demonstrate the causal connections. However, data combined with a DAG can provide insight. For example, analysis of the paths in a DAG, as in Lesson 24, can tell us which explanatory variables to include and which to exclude from a model if our modeling goal is to represent the hypothetical causal connections.

The second purpose for our use of DAGs involves the generation of simulated data. For this purpose, we outfit each DAG with formulas that specify quantitatively how the variables are related. These formulas constitute a mechanism for simulating data. Since we know the mechanism, we can check the results of our data modeling to see the extent to which those results are consistent with the mechanism. This provides valuable feedback to help us understand what makes models better or worse.
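As a rough illustration of outfitting a DAG with formulas, the sketch below stores each node with its parents and a formula, generates rows of simulated data, and then checks two model specifications against the known mechanism. It is not the book's simulation software. The node names, the coefficients (in particular the direct effect of 0.5 from x to y), and the helper names are assumptions.

```python
import random
import statistics

# Each node maps to (parents, formula). A formula receives its parents' values
# followed by a noise draw. All coefficients are made up for illustration.
dag = {
    "c": ((), lambda noise: noise),
    "x": (("c",), lambda c, noise: 2.0 * c + noise),
    "y": (("x", "c"), lambda x, c, noise: 0.5 * x + 3.0 * c + noise),
}

def simulate(dag, n, seed=0):
    """Generate n rows; assumes parents are listed before their children in the dict."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, (parents, formula) in dag.items():
            row[name] = formula(*(row[p] for p in parents), rng.gauss(0, 1))
        rows.append(row)
    return rows

def centered(v):
    m = statistics.fmean(v)
    return [vi - m for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

rows = simulate(dag, n=5000, seed=7)
x = centered([r["x"] for r in rows])
c = centered([r["c"] for r in rows])
y = centered([r["y"] for r in rows])

# Coefficient on x in y ~ x
b_naive = dot(x, y) / dot(x, x)
# Coefficient on x in y ~ x + c (two-predictor normal equations)
det = dot(x, x) * dot(c, c) - dot(x, c) ** 2
b_adjusted = (dot(c, c) * dot(x, y) - dot(x, c) * dot(c, y)) / det

print("known direct effect of x on y: 0.5")
print("y ~ x     gives", round(b_naive, 3))
print("y ~ x + c gives", round(b_adjusted, 3))
```

Because the mechanism is known, the comparison is unambiguous: the specification that includes c should land near the built-in 0.5, while y ~ x should not (its population value under these formulas is 1.7).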

Reality check: DAGs and data

DAGs represent hypotheses about the connections between variables in the real world. They are a kind of scratchpad for constructing alternative scenarios and, as seen in Lesson 24, thinking about how models might go wrong in the face of a plausible alternative causal mechanism.

In this book, we extend the use of DAGs beyond their scope in professional statistics; we use them as simulations from which we can generate data. Such simulations provide one way to learn about statistical methodology.

DAGs are aids to reasoning, scratchpads that help us play out the consequences of our hypotheses about possible real-world mechanisms. However, take care to distinguish data simulated from a DAG from data collected from reality.

Finding out about the real world requires collecting data from the real world. The proper role of DAGs in real work is to guide model building from real data.

In this course, we sample from DAGs to learn statistical techniques, but never to make claims about real-world phenomena.

With the conceptual tool of DAGs, the statistical thinker can consider multiple possibilities for what might cause what. Sometimes she can discard some of the possibilities based on common sense. (Think: which causes which, the rooster crowing or the sun rising?) However, in other settings, there may be possibilities that she does not favor but might be plausible to other people. In Lesson 24, we will explore how each configuration of DAG has implications for which model specifications can or cannot reveal the hypothesized causal mechanism.

Example: Child development and conversational turns

Lesson 19 included an example of a causal claim regarding the effects of the COVID-induced reduction in “conversational turns” between parents and their infants or toddlers. We will represent that claim in terms of three variables: “COVID”, “healthy development” and “turn count.” The causal claim is that the larger the turn count, the better for healthy development.

There are other possibilities for the causal connections. For instance, other variables such as socio-economic status, education of the parents, and a genetic propensity to talkativeness might be involved. Suppose the genetic propensity to chat causes both “turn count” and “healthy development.” That is, \(\text{turn count} \leftarrow \text{genetic propensity} \rightarrow \text{healthy development}\).

Then the COVID-induced reduction in “turn count” would not have any impact on healthy development.
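This alternative hypothesis can be played out as a simulation. The sketch below is purely illustrative: every number in it (baseline turn count, the size of the COVID drop, the genetic coefficients, the noise levels) is invented, and the results say nothing about real children. The same simulated children are generated twice, once without and once with the COVID effect on turn count, by reusing the random seed.

```python
import random
import statistics

def simulate_children(n, covid, seed):
    """Simulate turn count and development when a genetic propensity drives both."""
    rng = random.Random(seed)
    drop = 15 if covid else 0  # assumed COVID-induced reduction in turn count
    turns, development = [], []
    for _ in range(n):
        genes = rng.gauss(0, 1)                          # propensity to chat
        t = 40 + 10 * genes - drop + rng.gauss(0, 5)     # genes -> turn count <- COVID
        d = 100 + 8 * genes + rng.gauss(0, 5)            # genes -> development; no arrow from turn count
        turns.append(t)
        development.append(d)
    return turns, development

def correlation(a, b):
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

# Same seed in both calls: the same children in two hypothetical worlds
turns_before, dev_before = simulate_children(2000, covid=False, seed=3)
turns_after, dev_after = simulate_children(2000, covid=True, seed=3)

turn_change = statistics.fmean(turns_after) - statistics.fmean(turns_before)
dev_change = statistics.fmean(dev_after) - statistics.fmean(dev_before)
r = correlation(turns_before, dev_before)

print("change in mean turn count :", round(turn_change, 2))
print("change in mean development:", round(dev_change, 2))
print("correlation of turn count with development:", round(r, 2))
```

Under this hypothesis, turn count and development are clearly correlated, yet lowering turn count leaves development untouched, because no arrow runs from turn count to development.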