Let’s say you want to find out if a pedagogical intervention boosts learners’ conversational skills in L2 French. You’ve learnt that including a well-chosen control variable in your analysis can work wonders in terms of statistical power and precision, so you decide to administer a French vocabulary test to your participants in order to include their score on this test in your analyses as a covariate. But if you administer the vocabulary test after the intervention, it’s possible that the vocabulary scores are themselves affected by the intervention as well. If this is indeed the case, you may end up doing more harm than good. In this blog post, I will take a closer look at four general cases where controlling for such a ‘post-treatment’ variable is harmful, and one case where it improves matters.

In the following, x and y refer to the independent and dependent variable of interest, respectively, i.e., x would correspond to the intervention and y to the L2 French conversational skills in our example. z refers to the post-treatment variable, i.e., the French vocabulary scores in our example. x is a binary variable, y and z are continuous. Since z is a post-treatment variable, it’s possible that it is itself influenced directly or indirectly by x. In the first four cases examined below, this is indeed the case.

I’ve included all R code, as I think running simulations like the ones below is a useful way to learn research design and statistics. If you’re only interested in the upshot, just ignore the code snippets :)

## Case 1: x affects both y and z; y and z don’t affect each other.

In the first case, x affects both y and z, but z and y don’t influence each other.

Figure 1.1. The causal links between x, y and z in Case 1.

In this case, controlling for z doesn’t bias the estimate for the causal influence of x on y. It does, however, reduce the precision of these estimates. To appreciate this, let’s simulate some data. The function case1() defined in the next code snippet generates a dataset corresponding to Case 1. The parameter beta_xy specifies the coefficient of the influence of x on y; the goal of the analysis is to estimate the value of this parameter from the data. The parameter beta_xz similarly specifies the coefficient of the influence of x on z. Estimating the latter coefficient isn’t a goal of the analysis, since z is merely a control variable.

Use this function to create a dataset with 100 participants per group:
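A minimal sketch of what `case1()` and the call that creates the dataset could look like. The default `beta_xz = 1.5`, the standard-normal noise, the seed and the names `d_case1` and `z_split` are assumptions made for illustration; the text only fixes `beta_xy = 1` and 100 participants per group.

```r
# Sketch of a data generator for Case 1: x -> y and x -> z, no link between y and z.
# Assumptions: beta_xz defaults to 1.5 and all noise is standard normal.
case1 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5) {
  # x: n_per_group control participants (0) and n_per_group treated ones (1)
  x <- rep(c(0, 1), each = n_per_group)
  # x affects y; rnorm() adds random noise
  y <- beta_xy * x + rnorm(2 * n_per_group)
  # x affects z, independently of y
  z <- beta_xz * x + rnorm(2 * n_per_group)
  d <- data.frame(x = factor(x), y = y, z = z)
  # Median split on z, only needed for plotting
  d$z_split <- factor(ifelse(d$z > median(d$z), "above median", "below median"))
  d
}

set.seed(2021)  # arbitrary seed, for reproducibility only
d_case1 <- case1(n_per_group = 100)
str(d_case1)
```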

A graphical analysis that doesn’t take the control variable z into account reveals a roughly one-point difference between the two conditions, which is as it should be.

Figure 1.2. Graphical analysis without the covariate for Case 1.

A linear model is able to retrieve the beta_xy coefficient, which was set at 1, well enough ($\widehat{\beta_{xy}} = 1.03 \pm 0.13$).

Alternatively, we could analyse these data while taking the control variable into account. The graphical analysis in Figure 1.3 achieves this by splitting up the control variable at its median and plotting the two subsets separately. This is statistically suboptimal, but it makes the visualisation easier to grok. Here we also find a roughly one-point difference between the two conditions in each panel, which suggests that controlling for z won’t induce any bias.

Figure 1.3. Graphical analysis with the covariate (median split) for Case 1.

The linear model is again able to retrieve the coefficient of interest well enough ($\widehat{\beta_{xy}} = 1.04 \pm 0.16$), though with a slightly wider standard error.
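The two linear models compared so far can be sketched as follows. The data are generated inline so that the snippet runs on its own; `beta_xz = 1.5`, the noise distribution and the seed are assumptions, so the numbers will differ from the ones quoted in the text.

```r
set.seed(2021)  # arbitrary seed
n <- 100        # participants per group
x <- rep(c(0, 1), each = n)
y <- 1 * x + rnorm(2 * n)      # beta_xy = 1
z <- 1.5 * x + rnorm(2 * n)    # beta_xz = 1.5 (assumed)

m_without <- lm(y ~ x)      # ignores the covariate
m_with    <- lm(y ~ x + z)  # adjusts for the covariate

# Estimate and standard error of the x coefficient in each model
est_case1 <- rbind(
  without_z = summary(m_without)$coefficients["x", c("Estimate", "Std. Error")],
  with_z    = summary(m_with)$coefficients["x", c("Estimate", "Std. Error")]
)
est_case1
```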

Of course, it’s difficult to draw any firm conclusions from the analysis of a single simulated dataset. To see that in this general case, the coefficient of interest is indeed estimated without bias but with decreased precision, let’s generate 5,000 such datasets and analyse them with and without taking the control variable into account. The function sim_case1() defined below runs these analyses; the ggplot call plots the estimates for the $\beta_{xy}$ parameter. As the caption to Figure 1.4 explains, this simulation confirms what we observed above.
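A sketch of what `sim_case1()` might look like. It uses base R only and summarises the estimates numerically instead of plotting them with ggplot; `beta_xz = 1.5` and the standard-normal noise are assumptions, as is the name `sims_case1`.

```r
# Simulate n_sim datasets for Case 1 and estimate the x coefficient
# with and without the covariate z.
# Assumptions: beta_xz = 1.5, standard-normal noise.
sim_case1 <- function(n_sim = 5000, n_per_group = 100, beta_xy = 1, beta_xz = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  est <- matrix(NA_real_, nrow = n_sim, ncol = 2,
                dimnames = list(NULL, c("without_z", "with_z")))
  for (i in seq_len(n_sim)) {
    y <- beta_xy * x + rnorm(2 * n_per_group)
    z <- beta_xz * x + rnorm(2 * n_per_group)
    est[i, "without_z"] <- coef(lm(y ~ x))[["x"]]
    est[i, "with_z"]    <- coef(lm(y ~ x + z))[["x"]]
  }
  as.data.frame(est)
}

set.seed(2021)  # arbitrary seed
sims_case1 <- sim_case1()
colMeans(sims_case1)       # centre of each distribution of estimates
apply(sims_case1, 2, sd)   # spread of each distribution of estimates
```

Calling `hist()` on either column gives a quick look at the shape of the distributions.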

Figure 1.4. In Case 1, the distribution of the parameter estimates is centred around the correct value both when the control variable is taken into account and when it isn’t. The distribution is wider when taking the control variable into account, however, i.e., the estimates are less precise when taking the control variable into account than when not taking it into account.

The estimate for the $\beta_{xy}$ parameter is unbiased in both analyses, but the analysis with the covariate offers less rather than more precision: the standard deviation of the distribution of parameter estimates increases from 0.14 to 0.18.
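The size of this precision loss is roughly what the usual variance-inflation result for a two-predictor linear model would lead one to expect. In Case 1, z explains no variation in y beyond what x already explains, so adding it leaves the residual variance essentially unchanged but makes x partly redundant with z. As a sketch, assuming unit-variance noise and $\beta_{xz} = 1.5$ (neither of which is stated in the text) and a balanced 0/1-coded x, so that $\operatorname{Var}(x) = 0.25$:

$$
\mathrm{SE}_{\text{with } z} \approx \frac{\mathrm{SE}_{\text{without } z}}{\sqrt{1 - r_{xz}^2}},
\qquad
r_{xz}^2 = \frac{\beta_{xz}^2 \operatorname{Var}(x)}{\beta_{xz}^2 \operatorname{Var}(x) + 1}
= \frac{0.5625}{1.5625} = 0.36 .
$$

Under these assumptions the standard error is inflated by a factor of $1/\sqrt{0.64} = 1.25$, and $0.14 \times 1.25 \approx 0.18$.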

## Case 2: x affects y, which in turn affects z.

In the second case, x affects y directly, and y in turn affects z.

Figure 2.1. The causal links between x, y and z in Case 2.

This time, controlling for z biases the estimates for the $\beta_{xy}$ parameter. To see this, let’s again simulate and analyse some data.
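A minimal sketch of a generator for Case 2, together with the two analyses. The default `beta_yz = 1.5`, the standard-normal noise, the seed and the name `d_case2` are assumptions; only `beta_xy = 1` is given in the text.

```r
# Sketch of a data generator for Case 2: x -> y -> z.
# Assumptions: beta_yz defaults to 1.5 and all noise is standard normal.
case2 <- function(n_per_group, beta_xy = 1, beta_yz = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  y <- beta_xy * x + rnorm(2 * n_per_group)   # x affects y
  z <- beta_yz * y + rnorm(2 * n_per_group)   # y affects z
  data.frame(x = x, y = y, z = z)
}

set.seed(2021)  # arbitrary seed
d_case2 <- case2(n_per_group = 100)
coef(lm(y ~ x, data = d_case2))[["x"]]      # without the covariate
coef(lm(y ~ x + z, data = d_case2))[["x"]]  # with the covariate
```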

When the data are analysed without taking the control variable into account, we obtain the following result:

Figure 2.2. Graphical analysis without the covariate for Case 2.

This isn’t quite as close to a one-point difference as in the previous example, but as we’ll see below that’s merely due to the randomness inherent in these simulations. The linear model yields a parameter estimate of $\widehat{\beta_{xy}} = 0.76 \pm 0.14$.

When we take the control variable into account, however, the difference between the two groups defined by x becomes smaller:

Figure 2.3. Graphical analysis with the covariate for Case 2.

The linear model now yields a parameter estimate of $\widehat{\beta_{xy}} = 0.18 \pm 0.08$, which is considerably farther from the actual parameter value of 1.

The larger-scale simulation shows that the analysis with the covariate is indeed biased if you want to estimate the causal influence of x on y.

Figure 2.4. In Case 2, the distribution of the parameter estimates is centred around the correct value when the control variable isn’t taken into account but it is strongly biased when this control variable is taken into account, i.e., the analysis with the covariate yields biased estimates.

The fact that the distribution of the parameter estimates is narrower when taking the covariate into account is completely immaterial, since these estimates are estimating the wrong quantity.
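Both the bias and the narrower distribution can be derived for a simple version of Case 2. As a sketch, assume $y = \beta_{xy}x + \varepsilon_y$ and $z = \beta_{yz}y + \varepsilon_z$ with independent standard-normal errors (the unit variances and the symbol $\beta_{yz}$ are assumptions). Given x, both x and z carry information about y, and combining them by their precisions gives

$$
\mathrm{E}[y \mid x, z] = \frac{\beta_{xy}}{1 + \beta_{yz}^2}\,x + \frac{\beta_{yz}}{1 + \beta_{yz}^2}\,z,
\qquad
\operatorname{Var}(y \mid x, z) = \frac{1}{1 + \beta_{yz}^2}.
$$

The coefficient for x in the model with the covariate thus targets $\beta_{xy}/(1 + \beta_{yz}^2)$ rather than $\beta_{xy}$. With $\beta_{xy} = 1$ and, say, $\beta_{yz} = 1.5$, that is $1/3.25 \approx 0.31$. The residual variance shrinks by the same factor, which is why the biased estimates are nonetheless less variable.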

## Case 3: x and y both affect z. x also affects y.

Now z is affected by both x and y. x still affects y, though. Taking the covariate into account again yields biased estimates.

Figure 3.1. The causal links between x, y and z in Case 3.

Same procedure as last year, James.
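A minimal sketch of a generator for Case 3 and the two analyses. The defaults `beta_xz = 1.5` and `beta_yz = 1.5`, the standard-normal noise, the seed and the name `d_case3` are assumptions; only `beta_xy = 1` is given in the text.

```r
# Sketch of a data generator for Case 3: x -> y, and both x and y -> z.
# Assumptions: beta_xz and beta_yz default to 1.5; all noise is standard normal.
case3 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5, beta_yz = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  y <- beta_xy * x + rnorm(2 * n_per_group)                 # x affects y
  z <- beta_xz * x + beta_yz * y + rnorm(2 * n_per_group)   # x and y affect z
  data.frame(x = x, y = y, z = z)
}

set.seed(2021)  # arbitrary seed
d_case3 <- case3(n_per_group = 100)
coef(lm(y ~ x, data = d_case3))[["x"]]      # without the covariate
coef(lm(y ~ x + z, data = d_case3))[["x"]]  # with the covariate
```

Under these assumed defaults, the coefficient for x in the model with the covariate converges to $(\beta_{xy} - \beta_{yz}\beta_{xz})/(1 + \beta_{yz}^2) = -1.25/3.25 \approx -0.38$ in large samples, so a negative estimate is to be expected.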

Figure 3.2. Graphical analysis without the covariate for Case 3.

Again, the analysis without the control variable yields a reasonably accurate estimate of the true parameter value of 1 ($\widehat{\beta_{xy}} = 1.07 \pm 0.15$).

When we take the control variable into account, however, the difference between the two groups defined by x becomes smaller:

Figure 3.3. Graphical analysis with the covariate for Case 3.

The linear model now yields a parameter estimate of $\widehat{\beta_{xy}} = -0.31 \pm 0.11$, which is considerably farther from the actual parameter value of 1 and even has the wrong sign. (This isn’t evident from Figure 3.3, but keep in mind that the graphical analysis in Figure 3.3 uses a median split on the continuous covariate whereas the linear model below respects the continuous nature of this covariate.)

For the sake of completeness, let’s run this simulation 5,000 times, too.

Figure 3.4. In Case 3, too, the distribution of the parameter estimates is centred around the correct value when the control variable isn’t taken into account but it is strongly biased when this control variable is taken into account, i.e., the analysis with the covariate yields biased estimates.

The fact that the distribution of the parameter estimates is narrower when taking the covariate into account is completely immaterial, since these estimates are estimating the wrong quantity.

## Case 4: x affects z; both x and z influence y.

That is, x influences both y and z, but z also influences y. Let $\beta_{xy}$ be the direct effect of x on y, $\beta_{xz}$ the effect of x on z and $\beta_{zy}$ the effect of z on y. Then the total effect of x on y is $\beta_{xy} + \beta_{xz}\times\beta_{zy}$.

Figure 4.1. The causal links between x, y and z in Case 4.

Using the defaults in the following function, the total effect of x on y is $1 + 1.5\times 0.5 = 1.75$. If this doesn’t make immediate sense, consider what a change of one unit in x causes downstream: A one-unit increase in x directly increases y by 1. It also increases z by 1.5. But a one-unit increase in z causes an increase of 0.5 in y as well, so a 1.5-unit increase in z causes an additional increase of 0.75 in y. So in total, a one-unit increase in x causes a 1.75-point increase in y.
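A minimal sketch of a generator for Case 4, using the coefficients mentioned above (1, 1.5 and 0.5) as defaults. The standard-normal noise, the seed and the name `d_case4` are assumptions.

```r
# Sketch of a data generator for Case 4: x -> z, and both x and z -> y.
# Assumption: all noise is standard normal.
case4 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5, beta_zy = 0.5) {
  x <- rep(c(0, 1), each = n_per_group)
  z <- beta_xz * x + rnorm(2 * n_per_group)                 # x affects z
  y <- beta_xy * x + beta_zy * z + rnorm(2 * n_per_group)   # x and z affect y
  data.frame(x = x, y = y, z = z)
}

set.seed(2021)  # arbitrary seed
d_case4 <- case4(n_per_group = 100)
coef(lm(y ~ x, data = d_case4))[["x"]]      # targets the total effect (1.75)
coef(lm(y ~ x + z, data = d_case4))[["x"]]  # targets the direct effect (1)
```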

Figure 4.2. Graphical analysis without the covariate for Case 4.

Again, the analysis without the control variable yields a reasonably accurate estimate of the true total influence of x on y of 1.75 (!) ($\widehat{\beta_{xy}} = 1.67 \pm 0.16$).

When we take the control variable into account, however, the difference between the two groups defined by x becomes smaller:

Figure 4.3. Graphical analysis with the covariate for Case 4.

The linear model now yields a parameter estimate of $\widehat{\beta_{xy}} = 1.11 \pm 0.17$. This analysis correctly estimates the direct effect of x on y (i.e., without the additional causal link between x and y through z). This may be interesting in its own right, but the analysis addresses a question different from “What’s the causal influence of x on y?”

For the sake of completeness, let’s run this simulation 5,000 times, too.
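A sketch of such a simulation for Case 4, in base R and with a numerical summary instead of a plot. The function name `sim_case4()`, the standard-normal noise and the seed are assumptions; the coefficients are the ones given above.

```r
# Simulate n_sim datasets for Case 4 and estimate the x coefficient
# with and without the covariate z.
# Assumption: all noise is standard normal.
sim_case4 <- function(n_sim = 5000, n_per_group = 100,
                      beta_xy = 1, beta_xz = 1.5, beta_zy = 0.5) {
  x <- rep(c(0, 1), each = n_per_group)
  est <- matrix(NA_real_, nrow = n_sim, ncol = 2,
                dimnames = list(NULL, c("without_z", "with_z")))
  for (i in seq_len(n_sim)) {
    z <- beta_xz * x + rnorm(2 * n_per_group)
    y <- beta_xy * x + beta_zy * z + rnorm(2 * n_per_group)
    est[i, "without_z"] <- coef(lm(y ~ x))[["x"]]
    est[i, "with_z"]    <- coef(lm(y ~ x + z))[["x"]]
  }
  as.data.frame(est)
}

set.seed(2021)  # arbitrary seed
sims_case4 <- sim_case4()
colMeans(sims_case4)  # without_z: total effect; with_z: direct effect
```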

Figure 4.4. In Case 4, the analysis without the covariate correctly estimates the total causal influence that x has on y, while the analysis with the covariate correctly estimates the direct causal effect of x on y. Either may be relevant, but you have to know which!

## Case 5: x and z affect y; x and z don’t affect each other.

In the final general case, x and z both affect y, but x and z don’t affect each other. That is, z isn’t affected by the intervention in any way and so functions like a pre-treatment control variable would. The result is an increase in statistical precision. This is the only one of the five cases examined in which the control variable has added value for the purposes of estimating the causal influence of x on y.

Figure 5.1. The causal links between x, y and z in Case 5.

Unlike in Case 4, there is no indirect path from x to y through z here, since z is generated independently of x. The total effect of x on y therefore coincides with the direct effect, which is set to 1 in the simulation.
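A minimal sketch of a generator for Case 5, in which z is drawn independently of x. The default `beta_zy = 1.5`, the standard-normal distributions, the seed, the helper `se_x()` and the name `d_case5` are assumptions; only the effect of x on y (1) is given in the text.

```r
# Sketch of a data generator for Case 5: x -> y and z -> y; x and z unrelated.
# Assumptions: beta_zy defaults to 1.5; z and the noise are standard normal.
case5 <- function(n_per_group, beta_xy = 1, beta_zy = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  z <- rnorm(2 * n_per_group)                               # z is unrelated to x
  y <- beta_xy * x + beta_zy * z + rnorm(2 * n_per_group)   # x and z affect y
  data.frame(x = x, y = y, z = z)
}

# Hypothetical helper: standard error of the x coefficient
se_x <- function(model) summary(model)$coefficients["x", "Std. Error"]

set.seed(2021)  # arbitrary seed
d_case5 <- case5(n_per_group = 100)
se_without <- se_x(lm(y ~ x, data = d_case5))      # without the covariate
se_with    <- se_x(lm(y ~ x + z, data = d_case5))  # with the covariate
c(without_z = se_without, with_z = se_with)
```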

Figure 5.2. Graphical analysis without the covariate for Case 5.

Again, the analysis without the control variable yields an estimate within one standard error of the true parameter value of 1 ($\widehat{\beta_{xy}} = 0.76 \pm 0.24$).

Figure 5.3. Graphical analysis with the covariate for Case 5.

The linear model now yields a parameter estimate of $\widehat{\beta_{xy}} = 1.08 \pm 0.15$, which is also a reasonable estimate but with a smaller standard error.

For the sake of completeness, let’s run this simulation 5,000 times, too.

Figure 5.4. In Case 5, both analyses are centred around the correct value, i.e., both are unbiased. The analysis with the control variable yields a narrower distribution of estimates, however, i.e., it is more precise.

## Conclusion

When a control variable is collected after the intervention took place, it is possible that it is directly or indirectly affected by the intervention. If this is indeed the case, including the control variable in the analysis may yield biased estimates or decrease rather than increase the precision of the estimates. In designed experiments, the solution to this problem is evident: collect the control variable before the intervention takes place. If this isn’t possible, you had better be pretty sure that the control variable isn’t a post-treatment variable. More generally, throwing predictor variables into a statistical model in the hopes that this will improve the analysis is a dreadful idea.

29 June 2021