### General idea

Directed acyclic graphs (DAGs) were largely popularized by Judea Pearl through his work on artificial intelligence. With respect to causal inference, they seem mostly to be a helpful tool for designing a credible model. **We can think of DAGs as a graphical representation of a chain of causal effects.**

It is important to remember that any proposed DAG is just one researcher's (or group of researchers') attempt at modeling a certain phenomenon. It should always be based on deep knowledge of the area of interest.

Below is a simple DAG...

![[dag1.png]]

In this DAG, we have three random variables: $D$, $X$, and $Y$. The arrows within the graph represent causal relationships in which causality flows in the direction of the arrow. For example, we would say that there is a *direct causal path* from $D$ to $Y$, represented with $D \rightarrow Y$.

Notice, however, that the variable $X$ affects both $D$ and $Y$. If we are interested in the causal effect of $D$ on $Y$, then this third variable $X$ creates what is called a *backdoor path*, in which $D$ can appear to affect $Y$ *through* $X$. This is represented with the following path: $D \leftarrow X \rightarrow Y$.

Backdoor paths are problematic because they create spurious correlations between $D$ and $Y$ that are driven solely by the fluctuations of the variable $X$. For example, the DAG pictured above tells us that, if $X$ varies, then $D$ and $Y$ will both fluctuate as a result. If we are aware of $X$, we can control for its fluctuation. However, if we are not (and, for example, only measure $D$ and $Y$), it will appear as if $D$ is causing $Y$ to fluctuate when, in reality, we are measuring a spurious correlation. As a result, we would call $X$ a *confounder* or *confounding* variable because it jointly determines both $D$ and $Y$, and so confounds our ability to discern the effect of $D$ on $Y$ in naive comparisons.

In the example DAG above, all variables are observable. This is indicated by the use of solid lines between all random variables. As a result, given the confounding variable $X$, a researcher could simply control for this variable in their regression analysis to discern the causal effect of $D$ on $Y$. Controlling for a confounding variable's effect is called **closing the backdoor path**.

However, we are often unable to measure all variables of interest. These types of relationships are indicated with dashed arrows, as seen below.

#### Unobserved variables

![[dag2.png]]

In this example, we would say that the random variable $U$, which affects both $D$ and $Y$, is *unobserved*. Since this variable's fluctuations in the world are unobserved, we cannot control for it — **meaning that the backdoor path is left open**.
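To make the confounding story concrete, here is a minimal simulation sketch (my own toy example, not from the book or the linked notebook; all variable names and coefficient values are made up). It generates data consistent with the first DAG, where $X$ drives both $D$ and $Y$, and compares the naive regression of $Y$ on $D$ with the regression that also controls for $X$. If $X$ were instead the unobserved $U$ from the second DAG, the adjusted regression simply would not be available to us.

```python
# Simulate the confounding DAG: X -> D, X -> Y, and D -> Y (true effect = 2).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

x = rng.normal(size=n)                      # confounder (observed in the first DAG)
d = 1.5 * x + rng.normal(size=n)            # treatment is partly driven by X
y = 2.0 * d + 3.0 * x + rng.normal(size=n)  # outcome is driven by D and X

# Naive regression: the backdoor path D <- X -> Y is open, so the estimate is biased.
naive = sm.OLS(y, sm.add_constant(d)).fit()

# Controlling for X closes the backdoor path and recovers the true effect.
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([d, x]))).fit()

print("naive estimate:   ", naive.params[1])     # roughly 3.4, not 2
print("adjusted estimate:", adjusted.params[1])  # close to the true value of 2

# If X were the unobserved U from the second DAG, only the naive (biased)
# regression would be possible.
```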
#### A more realistic example

In the book, Cunningham presents a model by Becker (1994) that examines how education affects one's earnings, seen below.

![[dag3.png]]

The variables in the DAG are defined as follows:

- $PE =$ parental education
- $B =$ unobserved background factors (genetics, family environment, etc.)
- $I =$ family income
- $D =$ the treatment, in this case college education
- $Y =$ the outcome, earnings

As we can see, this DAG is telling us a story about the modeler's assumptions of the world.

> The transparency of DAGs represents one of the most beneficial aspects of this approach. All of the researcher's assumptions are on the table for readers to clearly understand. If these assumptions are reasonable, then the analyses that follow should be credible.

The DAG's story is as follows. (This description is taken from the book.)

> Each person has some background. It’s not contained in most data sets, as it measures things like intelligence, contentiousness, mood stability, motivation, family dynamics, and other environmental factors — hence, it is unobserved in the picture. Those environmental factors are likely correlated between parent and child and therefore subsumed in the variable $B$.
>
> Background causes a child’s parent to choose her own optimal level of education, and that choice also causes the child to choose their level of education through a variety of channels. First, there is the shared background factors, $B$. Those background factors cause the child to choose a level of education, just as her parent had. Second, there’s a direct effect, perhaps through simple modeling of achievement or setting expectations, a kind of peer effect. And third, there’s the effect that parental education has on family earnings, $I$, which in turn affects how much schooling the child receives. Family earnings may itself affect the child’s future earnings through bequests and other transfers, as well as external investments in the child’s productivity.

Whether this DAG is accurate is up to readers to discern for themselves — for example, this model does not directly include the unobserved ability of an individual. Regardless, we can now list all paths (direct and indirect) between $D$ and $Y$:

1. $D \rightarrow Y$: causal effect of education on earnings
2. $D \leftarrow I \rightarrow Y$: backdoor path 1
3. $D \leftarrow PE \rightarrow I \rightarrow Y$: backdoor path 2
4. $D \leftarrow B \rightarrow PE \rightarrow I \rightarrow Y$: backdoor path 3

This DAG makes it explicitly clear to the researcher and the reader that (assuming this model is accurate), to study the causal effect of education on future earnings, we must close *all three backdoor paths*.

### Colliders

Let's take a minute to consider an important type of variable — collider variables. If we have a new simple DAG...

![[dag4.png]]

... we can again list all paths from $D$ to $Y$:

1. $D \rightarrow Y$: causal effect of $D$ on $Y$
2. $D \rightarrow X \leftarrow Y$: backdoor path 1

The second path here is called a "collider path" because the two arrows are coming together and "colliding" on $X$. These paths are important because **when left alone, backdoor collider paths are naturally closed**.

You can think about this idea in the following way: since $D \rightarrow Y \rightarrow X$, if we control for the variation in $X$, we mask part of the effect that $D$ has on $Y$. Put another way, any variation you see in $D$ is "transferred" to $Y$ and then on to $X$, so controlling for the variation in $X$ actually removes variation in the very variables you are interested in observing.

Please reference the book for more examples of colliders; they can be a bit confusing. 😄
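A small simulation can make the collider intuition concrete. This is again a toy sketch of my own (with made-up coefficients), in the spirit of the "bad controls" notebook linked at the end of this note: it generates data from the collider DAG and compares the regression of $Y$ on $D$ with and without the collider $X$ as a control.

```python
# Simulate the collider DAG: D -> Y (true effect = 2), D -> X, and Y -> X.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000

d = rng.normal(size=n)
y = 2.0 * d + rng.normal(size=n)  # outcome caused by D
x = d + y + rng.normal(size=n)    # collider: caused by both D and Y

# There is no confounding here, so the simple regression already recovers the true effect.
unadjusted = sm.OLS(y, sm.add_constant(d)).fit()

# "Controlling" for the collider opens the path D -> X <- Y and biases the estimate.
bad_control = sm.OLS(y, sm.add_constant(np.column_stack([d, x]))).fit()

print("without the collider:", unadjusted.params[1])  # close to 2
print("with the collider:   ", bad_control.params[1]) # far from 2 (about 0.5 here)
```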
### Backdoor criterion

Open backdoor paths are problematic because they create systematic, noncausal correlations between the causal variable of interest $D$ and the outcome variable you're trying to study $Y$. In regression terms, we call this "**omitted variable bias**".

#### Closing backdoor paths

**Two ways to close a backdoor path**

1. Conditioning on the confounding variable (controlling for this variable)
    - This requires holding the confounding variable fixed using some method. This can be done, e.g., via subclassification, matching, regression, or another method.
2. Utilizing collider variables
    - Controlling for a collider variable always *opens* a backdoor path, so by *not controlling for it* we leave that path closed.

When all backdoor paths in a DAG have been closed, we say that the **backdoor criterion** has been met. Only after the backdoor criterion is met can we reasonably conclude that observed effects are driven by causality. Put another way, if you have satisfied the backdoor criterion, then you have in effect isolated some causal effect. More formally:

> A set of variables $X$ satisfies the backdoor criterion in a DAG if and only if $X$ blocks every path between confounders that contain an arrow from $D$ to $Y$.

#### Making use of our DAG

The value of creating a DAG that represents your model of a potential causal relationship is that it can really simplify the process of understanding which variables to control for within your model. For example, let's look again at the paths from the earlier DAG.

![[dag5.png]]

1. $D \rightarrow Y$: causal effect of education on earnings
2. $D \leftarrow I \rightarrow Y$: backdoor path 1
3. $D \leftarrow PE \rightarrow I \rightarrow Y$: backdoor path 2
4. $D \leftarrow B \rightarrow PE \rightarrow I \rightarrow Y$: backdoor path 3

Considering the backdoor criterion, it may now be easier to realize that the only variable we need to control for is $I$ — as every backdoor path runs through $I$, conditioning on it closes all of them at once. Basically, this removes the effect along the curved line from $I \rightarrow Y$, which is separate from the direct path $D \rightarrow Y$. Once $I$ is held fixed, the effects of the other variables reach $Y$ only through the variation of $D$ itself. For example, while a parent's increased education ($PE$) may increase the likelihood of their child getting a college education, once we condition on $I$ that influence affects $Y$ only through $D$. Thus, any remaining variation in $Y$ associated with $D$ is due only to $D$ affecting $Y$.

The resulting regression model to isolate the causal effect of $D$ on $Y$ would simply be the following:

$$
Y_i = \alpha + \delta D_i + \beta I_i + \epsilon_i
$$

where $\alpha$ represents our y-intercept, $\delta$ the causal effect of education on earnings, $\beta$ the effect of family income in the model, and $\epsilon_i$ the structural error term, which is independent of all other variables in our DAG.

### Code

See [this page](https://github.com/mr-devs/causal-inference-mixtape/blob/main/notebooks/02_DAGs.ipynb) for some code that plays with the idea of collider variables and 'bad controls'.

---

#### Related

#causal_inference