16  Directed acyclic graphs

  1. The topological layout of the simulations. Cycles, dependent, independent.

Directed Acyclic Graphs

A core tool in thinking about causal connections is a mathematical structure called a “directed acyclic graph” (DAG, for short). DAGs are one of the most popular ways for statistical thinkers to express their ideas about what might be happening in the real world. Despite the long name, DAGs are very accessible to a broad audience.

DAGs, despite the G for “graph,” are not about data graphics. The “graph” in DAG is a mathematical term of art; a suitable synonym is “network.” Mathematical graphs consist of a set of “nodes” and a set of “edges” connecting the nodes. For instance, Figure 16.1 shows three different graphs, each with five nodes labeled A, B, C, D, and E.

Figure 16.1: Graphs of various types: (a) undirected graph, (b) directed but cyclic, (c) directed acyclic graph (DAG)

The nodes are the same in all three graphs of Figure 16.1, but each graph is different from the others. It is not just the nodes that define a graph; the edges (drawn as lines) are part of the definition as well.

The left-most graph in Figure 16.1 is an “undirected” graph; there is no suggestion that the edges run one way or another. In contrast, the middle graph has the same nodes and edges, but the edges are directed. An excellent way to think about a directed graph is that each node is a pool of water; each directed edge shows how the water flows between pools. This analogy is also helpful in thinking about causality: the causal influences flow like water.

Look more carefully at the middle graph. There are a couple of loops; the graph is cyclic. In one loop, water flows from E to C to D and back again to E. The other loop runs B, C, D, E, and back to B. Such a flow pattern cannot exist without pumps pushing the water back uphill.

The rightmost graph reverses the direction of some of the edges. This graph has no cycles; it is acyclic. Using the flowing and pumped water analogy, an acyclic graph needs no pumps; the pools can be arranged at different heights to create a flow exclusively powered by gravity. The node-D pool will be the highest, E lower. C has to be lower than E for gravity to pull water along the edge from E to C. The node-B pool is the lowest, so water can flow in from E, C, and A.
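The "pool heights" idea can be sketched in code: a directed graph is acyclic exactly when its nodes can be listed so that every edge runs from an earlier node to a later one (highest pool first). The sketch below uses base R only. The edge lists are assumptions chosen to be consistent with the description in the text; the actual figure may differ, and `topo_order()` is a hypothetical helper name, not a function from any package used in these Lessons.

```r
# Edge lists assumed for illustration; each row is one directed edge from -> to.
acyclic_edges <- data.frame(
  from = c("D", "D", "E", "E", "C", "A"),
  to   = c("E", "C", "C", "B", "B", "B"),
  stringsAsFactors = FALSE
)
cyclic_edges <- data.frame(
  from = c("E", "C", "D", "B", "E", "A"),
  to   = c("C", "D", "E", "C", "B", "B"),
  stringsAsFactors = FALSE
)

# Return the nodes ordered from "highest pool" to "lowest pool",
# or NULL when a cycle makes such an ordering impossible.
topo_order <- function(edges) {
  nodes <- unique(c(edges$from, edges$to))
  # Count how many edges flow into each node.
  indegree <- setNames(rep(0, length(nodes)), nodes)
  for (target in edges$to) indegree[target] <- indegree[target] + 1

  ordering <- character(0)
  ready <- names(indegree)[indegree == 0]  # nodes with no remaining inflow
  while (length(ready) > 0) {
    node <- ready[1]
    ready <- ready[-1]
    ordering <- c(ordering, node)
    # Removing a node frees the nodes it flows into.
    for (target in edges$to[edges$from == node]) {
      indegree[target] <- indegree[target] - 1
      if (indegree[target] == 0) ready <- c(ready, target)
    }
  }
  if (length(ordering) < length(nodes)) return(NULL)  # some nodes sit on a cycle
  ordering
}

print(topo_order(acyclic_edges))  # an ordering exists: gravity alone suffices
print(topo_order(cyclic_edges))   # NULL: pumps would be needed
```

For the assumed acyclic edge list, any valid ordering places D before E, E before C, and C before B, matching the heights described above.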

Directed acyclic graphs represent causal influences; think of “A causes B,” meaning that causal “water” flows naturally from A to B. In a DAG, a node can have multiple outputs, like D and E, and it might have multiple inputs, like B and C. In terms of causality, a node, like B, having multiple inputs means that more than one factor is responsible for the value of that node. A real-world example: the rising sun causes a rooster to crow, but so can an intruder in the coop.

Often, nodes do not have any inputs. These are called “exogenous factors,” at least by economists. The “genous” means “originates from.” “Exo” means “outside.” The value of an exogenous node is determined by something, just not something that we are interested in (or perhaps capable of) modeling. No edges are directed into an exogenous node since none of the other nodes influence its value.
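In terms of the graph, an exogenous node is simply one that never appears as the target of an edge. A minimal base-R sketch, using an edge list that is assumed for illustration (it is consistent with the description of the rightmost graph, but the figure itself may differ):

```r
# Edge list assumed for illustration; each row is one directed edge from -> to.
edges <- data.frame(
  from = c("D", "D", "E", "E", "C", "A"),
  to   = c("E", "C", "C", "B", "B", "B"),
  stringsAsFactors = FALSE
)

nodes <- unique(c(edges$from, edges$to))

# Exogenous nodes have no incoming edges, so they never appear in the "to" column.
exogenous <- setdiff(nodes, edges$to)
print(exogenous)
```

Under this assumed edge list, the exogenous nodes are D and A.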

For simulating data, we go beyond drawing a graph of causal connections and outfit DAGs with specific formulas representing the mechanism imbued in each node. DAGs equipped with formulas can be used to generate simulated data.[1] Training a model on those data leads to a model function, and we can then check whether that model function matches the DAG’s formulas. This practice helps us learn what can go right or wrong in building a model, just as practice in an aircraft simulator trains pilots to handle real-world situations in real aircraft.

We start with a simple example, dag08. The dag_draw() command draws a picture of the graph.


The graph shows that both c and x contribute to y.

Printing the dag displays the formulas that set the values of the nodes.

c ~ exo()
x ~ c + exo()
y ~ x + c + 3 + exo()

The formulas show that x and c contribute equally to y, with coefficients of 1. To what extent can regression modeling recover this relationship from data?

To find out, we can generate simulated data using the sample() function. For instance,

sample(dag08, size=5)
      c         x      y
-------  --------  -----
 -0.326    0.8480   4.05
  0.552    1.1700   3.93
 -0.675   -0.7880   2.97
  0.214    1.1300   2.88
  0.311    0.0875   3.16

Each row in the sample is one trial; in each trial, each node’s formula sets the value for that node. The formula might use the values of other nodes as input. Alternatively, the formula might specify that the node is exogenous, without input from any other nodes.
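To make the mechanics concrete, here is a rough base-R sketch of what one such simulation might do for the dag08 formulas. It assumes that exo() draws independent standard-normal noise, which is an assumption made here for illustration, and `simulate_dag08()` is a hypothetical helper name rather than part of the software used in these Lessons.

```r
# Hypothetical stand-in for sampling from dag08, assuming exo() behaves like rnorm().
simulate_dag08 <- function(n) {
  c_node <- rnorm(n)                        # c ~ exo(): exogenous, no inputs
  x_node <- c_node + rnorm(n)               # x ~ c + exo()
  y_node <- x_node + c_node + 3 + rnorm(n)  # y ~ x + c + 3 + exo()
  data.frame(c = c_node, x = x_node, y = y_node)
}

set.seed(101)  # arbitrary seed so the sketch is reproducible
trials <- simulate_dag08(5)
print(trials)  # one row per trial, one column per node
```

The nodes are computed in an order where every input is available before it is needed: c first, then x, then y.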

Models can be trained on the simulated data using the same techniques as for any other data. To illustrate, here we generate a sample of size \(n=50\), then fit the model specification y ~ c + x and summarize the fit by its coefficients, reported with lower and upper interval bounds.

sample(dag08, size=50) %>% 
  lm(y ~ c + x, data = .) %>%
  conf_interval()  # assumed final step, inferred from the .lwr/.coef/.upr columns below
term                .lwr       .coef       .upr
------------  ----------  ----------  ---------
(Intercept)    2.6441540   2.9451445   3.246135
c              0.7854440   1.2606473   1.735850
x              0.5016365   0.8235923   1.145548

The coefficients, including the intercept, are close to the values in the DAG’s formulas (3 for the intercept, 1 each for c and x), but not exactly right.
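One way to get a feel for "close, but not exactly right" is to repeat the experiment with a much larger sample. The sketch below uses base R only and again assumes that exo() behaves like standard-normal noise; `simulate_dag08()` is a hypothetical helper written for this illustration.

```r
# Hypothetical stand-in for sampling from dag08, assuming exo() behaves like rnorm().
simulate_dag08 <- function(n) {
  c_node <- rnorm(n)
  x_node <- c_node + rnorm(n)
  y_node <- x_node + c_node + 3 + rnorm(n)
  data.frame(c = c_node, x = x_node, y = y_node)
}

set.seed(2024)  # arbitrary seed so the sketch is reproducible
fit_small <- lm(y ~ c + x, data = simulate_dag08(50))
fit_large <- lm(y ~ c + x, data = simulate_dag08(5000))

# Coefficients side by side; the formulas specify 3 (intercept), 1 (c), and 1 (x).
print(cbind(n50 = coef(fit_small), n5000 = coef(fit_large)))

# Base-R confidence intervals for the larger sample.
print(confint(fit_large))
```

With the larger sample, the fitted coefficients should typically land much nearer the values in the formulas, and the intervals should be correspondingly narrower.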

In Lessons 13 and 19 we will figure out how close we can expect the coefficients to be to the precise values implemented in the simulation.


Beyond their pedagogical use in simulation, DAGs are a helpful framework for organizing ideas about what causes what. There are almost always multiple possibilities for a DAG. Sometimes the modeler can discard some of the possibilities based on common sense. (Think: roosters and the sun.) However, in other settings, there may be possibilities that the modeler does not favor but that might be plausible to other people. In Lesson 24, we will explore how each configuration of DAG has implications for which model specifications can or cannot reveal the hypothesized causal mechanism.

  [1] The value of exogenous nodes is usually set randomly, without input from the other nodes in the DAG.