# The consequences of controlling for a post-treatment variable

Let’s say you want to find out if
a pedagogical intervention boosts
learners’ conversational skills in
L2 French. You’ve learnt that including
a well-chosen control variable
in your analysis can work
wonders
in terms of statistical power and precision,
so you decide to administer a French vocabulary test
to your participants in order to include their
score on this test in your analyses as a covariate.
But if you administer the vocabulary test
*after* the intervention, it’s possible that the
vocabulary scores are themselves affected by the
intervention as well. If this is indeed the case,
you may end up doing more harm than good.
In this blog post, I will take a closer look at four
general cases where controlling for such a ‘post-treatment’
variable is harmful, and one case where it improves matters.

In the following, `x` and `y` refer to the
independent and dependent variable of interest,
respectively, i.e., `x` would correspond to the intervention
and `y` to the L2 French conversational skills in
our example. `z` refers to the post-treatment variable,
i.e., the French vocabulary scores in our example.
`x` is a binary variable; `y` and `z` are continuous.
Since `z` is a post-treatment variable, it’s possible
that it is itself influenced directly or indirectly
by `x`. In the first four cases examined below, this is
indeed the case.

I’ve included all R code as I think running simulations like the ones below is a useful way to learn research design and statistics. If you’re only interested in the upshot, just ignore the code snippets :)

## Case 1: `x` affects both `y` and `z`; `y` and `z` don’t affect each other.

In the first case, `x` affects both `y` and `z`,
but `z` and `y` don’t influence each other.

Figure 1.1. The causal links between `x`, `y` and `z` in Case 1.

In this case, controlling for `z` doesn’t bias the estimate
for the causal influence of `x` on `y`.
It does, however, reduce the precision of this estimate.
To appreciate this, let’s simulate some data. The function `case1()`
defined in the next code snippet generates
a dataset corresponding to Case 1.
The parameter `beta_xy` specifies the coefficient
of the influence of `x` on `y`; the goal of the analysis
is to estimate the value of this parameter from the data.
The parameter `beta_xz` similarly specifies the coefficient
of the influence of `x` on `z`. Estimating the latter
coefficient isn’t a goal of the analysis, since `z`
is merely a control variable.

Use this function to create a dataset with 100 participants per group:
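As a minimal sketch, `case1()` could look as follows. The standard-normal noise and the default of 1.5 for `beta_xz` are assumptions made for illustration; only the value of 1 for `beta_xy` is taken from the text. The seed is arbitrary.

```r
# Sketch of a data-generating function for Case 1.
# Assumptions: standard-normal noise; default beta_xz = 1.5.
case1 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5) {
  # x: n_per_group zeros (control) followed by n_per_group ones (intervention)
  x <- rep(c(0, 1), each = n_per_group)
  # x affects y; rnorm() adds random noise to the observations
  y <- beta_xy * x + rnorm(2 * n_per_group)
  # x affects z; y and z don't affect each other
  z <- beta_xz * x + rnorm(2 * n_per_group)
  data.frame(x = x, y = y, z = z)
}

set.seed(2021)  # arbitrary seed, for reproducibility only
d1 <- case1(n_per_group = 100)
str(d1)
```

The resulting data frame has 200 rows, with `x` coded 0 for one group and 1 for the other.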

A graphical analysis that doesn’t take the control variable
`z` into account reveals a roughly one-point difference
between the two conditions, which is as it should be.

Figure 1.2. Graphical analysis without the covariate for Case 1.

A linear model is able to retrieve the `beta_xy` coefficient,
which was set at 1, well enough ($\widehat{\beta_{xy}} = 1.03 \pm 0.13$).

Alternatively, we could analyse these data while taking
the control variable into account. The graphical analysis
in Figure 1.3 achieves this by splitting up the control variable
at its median and plotting the two subsets separately.
This is statistically suboptimal, but it makes the visualisation
easier to grok. Here we also find a roughly one-point difference
between the two conditions in each panel, which suggests that
controlling for `z` won’t induce any bias.

Figure 1.3. Graphical analysis with the covariate (median split) for Case 1.

The linear model is again able to retrieve the coefficient of interest well enough ($\widehat{\beta_{xy}} = 1.04 \pm 0.16$), though with a slightly wider standard error.
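The two linear models can be sketched as below. The data are regenerated under assumed settings (standard-normal noise, `beta_xy` = 1, `beta_xz` = 1.5) with an arbitrary seed, so the numbers won’t match the ones reported in the text exactly.

```r
# Assumed Case 1 setup: unit-variance noise, beta_xy = 1, beta_xz = 1.5.
set.seed(1)
n_per_group <- 100
x <- rep(c(0, 1), each = n_per_group)
y <- 1 * x + rnorm(2 * n_per_group)
z <- 1.5 * x + rnorm(2 * n_per_group)

mod_without <- lm(y ~ x)      # ignores the control variable
mod_with    <- lm(y ~ x + z)  # adjusts for the control variable

# Estimate and standard error for x in each model
cols <- c("Estimate", "Std. Error")
res <- rbind(
  without_z = summary(mod_without)$coefficients["x", cols],
  with_z    = summary(mod_with)$coefficients["x", cols]
)
print(res)
```

Because `z` is correlated with `x` but carries no information about `y` beyond `x`, adjusting for it can only inflate the standard error of the `x` coefficient here.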

Of course, it’s difficult to draw any firm conclusions from the analysis
of a single simulated dataset. To see that in this general case,
the coefficient of interest is indeed
estimated without bias but with decreased precision, let’s generate 5,000
such datasets and analyse them with and without taking the control variable
into account. The function `sim_case1()` defined below runs these analyses;
the ggplot call plots the estimates for the $\beta_{xy}$ parameter.
As the caption to Figure 1.4 explains, this simulation confirms what we observed
above.

Figure 1.4. In Case 1, the distribution of the parameter estimates is centred around the correct value both when the control variable is taken into account and when it isn’t. The distribution is wider when the control variable is taken into account, however, i.e., these estimates are less precise.

The estimate for the $\beta_{xy}$ parameter is unbiased in both analyses,
but the analysis with the covariate offers *less* rather than more precision:
The standard deviation of the distribution of parameter estimates
increases from 0.14 to 0.18:
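A simulation along these lines could be sketched as follows. The body of `sim_case1()` shown here is an assumption (standard-normal noise, `beta_xz` = 1.5), plotting is left out, and the example uses 1,000 rather than 5,000 runs to keep it quick.

```r
# Sketch of a Case 1 simulation.
# Assumptions: unit-variance noise; default beta_xz = 1.5.
sim_case1 <- function(n_sim, n_per_group = 100, beta_xy = 1, beta_xz = 1.5) {
  est_without <- numeric(n_sim)
  est_with <- numeric(n_sim)
  x <- rep(c(0, 1), each = n_per_group)
  for (i in seq_len(n_sim)) {
    y <- beta_xy * x + rnorm(2 * n_per_group)
    z <- beta_xz * x + rnorm(2 * n_per_group)
    est_without[i] <- coef(lm(y ~ x))[["x"]]      # analysis without covariate
    est_with[i] <- coef(lm(y ~ x + z))[["x"]]     # analysis with covariate
  }
  data.frame(est_without = est_without, est_with = est_with)
}

set.seed(42)  # arbitrary seed
sims <- sim_case1(n_sim = 1000)
colMeans(sims)    # both means should be close to 1 (no bias)
sapply(sims, sd)  # the spread should be larger for est_with
```

Under these assumed settings, the theoretical standard deviations are roughly 0.14 and 0.18, in line with the values reported above.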

## Case 2: `x` affects `y`, which in turn affects `z`.

In the second case, `x` affects `y` directly,
and `y` in turn affects `z`.

Figure 2.1. The causal links between `x`, `y` and `z` in Case 2.

This time, controlling for `z` biases the estimates
for the $\beta_{xy}$ parameter. To see this, let’s again
simulate and analyse some data.
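Here is a minimal sketch of a data-generating function for Case 2 and of the two analyses. The function body, the name `beta_yz`, its default of 1.5 and the standard-normal noise are assumptions. The sample is made much larger than 100 per group so that the bias stands out from the sampling noise.

```r
# Sketch of a data-generating function for Case 2.
# Assumptions: standard-normal noise; default beta_yz = 1.5.
case2 <- function(n_per_group, beta_xy = 1, beta_yz = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  y <- beta_xy * x + rnorm(2 * n_per_group)  # x affects y
  z <- beta_yz * y + rnorm(2 * n_per_group)  # y in turn affects z
  data.frame(x = x, y = y, z = z)
}

set.seed(2)  # arbitrary seed
d2 <- case2(n_per_group = 5000)
coef_without <- coef(lm(y ~ x, data = d2))[["x"]]
coef_with <- coef(lm(y ~ x + z, data = d2))[["x"]]
c(without_z = coef_without, with_z = coef_with)
```

Under these assumed settings, the coefficient for `x` in the model with `z` should settle near $1/(1 + 1.5^2) \approx 0.31$ rather than near 1, since `z` soaks up most of the variation in `y` that `x` caused.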

When the data are analysed without taking the control variable into account, we obtain the following result:

Figure 2.2. Graphical analysis without the covariate for Case 2.

This isn’t quite as close to a one-point difference as in the previous example, but as we’ll see below that’s merely due to the randomness inherent in these simulations. The linear model yields a parameter estimate of $\widehat{\beta_{xy}} = 0.76 \pm 0.14$.

When we take the control variable into account, however, the difference
between the two groups defined by `x` becomes smaller:

Figure 2.3. Graphical analysis with the covariate for Case 2.

The linear model now yields a parameter estimate of $\widehat{\beta_{xy}} = 0.18 \pm 0.08$, which is considerably farther from the actual parameter value of 1.

The larger-scale simulation shows that the analysis with the covariate
is indeed biased if you want to estimate the causal influence of `x` on `y`.

Figure 2.4. In Case 2, the distribution of the parameter estimates is centred around the correct value when the control variable isn’t taken into account but it is strongly biased when this control variable *is* taken into account, i.e., the analysis with the covariate yields biased estimates.

The fact that the distribution of the parameter estimates is narrower when taking the covariate into account is completely immaterial, since these estimates are estimating the wrong quantity.

## Case 3: `x` and `y` both affect `z`. `x` also affects `y`.

Now `z` is affected by both `x` and `y`.
`x` still affects `y`, though. Taking the
covariate into account again yields biased estimates.

Figure 3.1. The causal links between `x`, `y` and `z` in Case 3.

Same procedure as last year, James.
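A minimal sketch of a data-generating function for Case 3 and of the two analyses follows. The function body, the defaults of 1.5 for `beta_xz` and `beta_yz`, and the standard-normal noise are all assumptions. Again, the sample is made large so that the bias is clearly visible.

```r
# Sketch of a data-generating function for Case 3.
# Assumptions: standard-normal noise; defaults beta_xz = beta_yz = 1.5.
case3 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5, beta_yz = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  y <- beta_xy * x + rnorm(2 * n_per_group)                # x affects y
  z <- beta_xz * x + beta_yz * y + rnorm(2 * n_per_group)  # x and y affect z
  data.frame(x = x, y = y, z = z)
}

set.seed(3)  # arbitrary seed
d3 <- case3(n_per_group = 5000)
coef_without <- coef(lm(y ~ x, data = d3))[["x"]]
coef_with <- coef(lm(y ~ x + z, data = d3))[["x"]]
c(without_z = coef_without, with_z = coef_with)
```

Under these assumed settings, the coefficient for `x` in the model with `z` should settle near $(1 - 1.5 \times 1.5)/(1 + 1.5^2) \approx -0.38$, i.e., it gets the sign wrong.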

Figure 3.2. Graphical analysis without the covariate for Case 3.

Again, the analysis without the control variable yields a reasonably accurate estimate of the true parameter value of 1 ($\widehat{\beta_{xy}} = 1.07 \pm 0.15$).

When we take the control variable into account, however, the difference
between the two groups defined by `x` becomes smaller:

Figure 3.3. Graphical analysis with the covariate for Case 3.

The linear model now yields a parameter estimate of $\widehat{\beta_{xy}} = -0.31 \pm 0.11$, which is considerably farther from the actual parameter value of 1 and even has the wrong sign. (This isn’t evident from Figure 3.3, but keep in mind that the graphical analysis in Figure 3.3 uses a median split on the continuous covariate whereas the linear model below respects the continuous nature of this covariate.)

For the sake of completeness, let’s run this simulation 5,000 times, too.

Figure 3.4. In Case 3, too, the distribution of the parameter estimates is centred around the correct value when the control variable isn’t taken into account but it is strongly biased when this control variable *is* taken into account, i.e., the analysis with the covariate yields biased estimates.

The fact that the distribution of the parameter estimates is narrower when taking the covariate into account is completely immaterial, since these estimates are estimating the wrong quantity.

## Case 4: `x` affects `z`; both `x` and `z` influence `y`.

That is, `x` influences both `y` and `z`, but `z` also influences `y`.
Let $\beta_{xy}$ be the direct effect of `x` on `y`,
$\beta_{xz}$ the effect of `x` on `z`
and $\beta_{zy}$ the effect of `z` on `y`.
Then the *total* effect of `x` on `y`
is $\beta_{xy} + \beta_{xz}\times\beta_{zy}$.

Figure 4.1. The causal links between `x`, `y` and `z` in Case 4.

Using the defaults in the following function, the total effect of `x` on `y` is
$1 + 1.5\times 0.5 = 1.75$.
If this doesn’t make immediate sense, consider what a change of one unit in `x`
causes downstream:
A one-unit increase in `x` directly increases `y` by 1.
It also increases `z` by 1.5.
But a one-unit increase in `z` causes an increase of 0.5 in `y`
as well, so a 1.5-unit increase in `z`
causes an additional increase of 0.75 in `y`. So in total, a one-unit increase in `x`
causes a 1.75-point increase in `y`.
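A data-generating function for Case 4 could be sketched as below. The defaults of 1, 1.5 and 0.5 follow the arithmetic in the text; the function body and the standard-normal noise are assumptions. With a large sample, the model without `z` should recover the total effect and the model with `z` the direct effect.

```r
# Sketch of a data-generating function for Case 4.
# Defaults taken from the text; standard-normal noise is an assumption.
case4 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5, beta_zy = 0.5) {
  x <- rep(c(0, 1), each = n_per_group)
  z <- beta_xz * x + rnorm(2 * n_per_group)                # x affects z
  y <- beta_xy * x + beta_zy * z + rnorm(2 * n_per_group)  # x and z affect y
  data.frame(x = x, y = y, z = z)
}

total_effect <- 1 + 1.5 * 0.5  # beta_xy + beta_xz * beta_zy

set.seed(4)  # arbitrary seed
d4 <- case4(n_per_group = 5000)
coef_without <- coef(lm(y ~ x, data = d4))[["x"]]  # targets the total effect
coef_with <- coef(lm(y ~ x + z, data = d4))[["x"]] # targets the direct effect
c(total = total_effect, without_z = coef_without, with_z = coef_with)
```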

Figure 4.2. Graphical analysis without the covariate for Case 4.

Again, the analysis without the control variable
yields a reasonably accurate estimate of the true *total*
influence of `x` on `y` of 1.75 (!)
($\widehat{\beta_{xy}} = 1.67 \pm 0.16$).

When we take the control variable into account, however, the difference
between the two groups defined by `x` becomes smaller:

Figure 4.3. Graphical analysis with the covariate for Case 4.

The linear model now yields a parameter estimate
of $\widehat{\beta_{xy}} = 1.11 \pm 0.17$.
This analysis correctly estimates the *direct*
effect of `x` on `y` (i.e., without the additional causal path from `x`
to `y` through `z`).
This may be interesting in its own right, but the analysis
addresses a question different from “What’s the causal
influence of `x` on `y`?”

For the sake of completeness, let’s run this simulation 5,000 times, too.

Figure 4.4. In Case 4, the analysis without the covariate correctly estimates the *total* causal influence that `x` has on `y`, while the analysis with the covariate correctly estimates the *direct* causal effect of `x` on `y`. Either may be relevant, but you have to know which!

## Case 5: `x` and `z` affect `y`; `x` and `z` don’t affect each other.

In the final general case, `x` and `z` both affect `y`, but
`x` and `z` don’t affect each other. That is, `z` isn’t affected by the
intervention in any way and so functions like a pre-treatment control variable would.
The result is an increase in statistical precision. This is the only one of the five
cases examined in which the control variable has added value for the purposes
of estimating the causal influence of `x` on `y`.

Figure 5.1. The causal links between `x`, `y` and `z` in Case 5.

Since `x` doesn’t affect `z` in this case, there is no indirect path
from `x` to `y` through `z`: the total effect of `x` on `y`
equals its direct effect, $\beta_{xy}$, which is set to 1 here.
Both analyses therefore target the same quantity.

Figure 5.2. Graphical analysis without the covariate for Case 5.

Again, the analysis without the control variable yields an estimate within one standard error of the true parameter value of 1 ($\widehat{\beta_{xy}} = 0.76 \pm 0.24$).

Figure 5.3. Graphical analysis with the covariate for Case 5.

The linear model now yields a parameter estimate of $\widehat{\beta_{xy}} = 1.08 \pm 0.15$, which is also a reasonable estimate but with a smaller standard error.
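The gain in precision can be sketched as follows. The function body, the default of 1.5 for `beta_zy` and the standard-normal noise are assumptions; the key point is only that `z` is generated independently of `x`.

```r
# Sketch of a data-generating function for Case 5.
# Assumptions: standard-normal noise; default beta_zy = 1.5.
case5 <- function(n_per_group, beta_xy = 1, beta_zy = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  z <- rnorm(2 * n_per_group)                              # z is unaffected by x
  y <- beta_xy * x + beta_zy * z + rnorm(2 * n_per_group)  # x and z affect y
  data.frame(x = x, y = y, z = z)
}

set.seed(5)  # arbitrary seed
d5 <- case5(n_per_group = 100)
se_without <- summary(lm(y ~ x, data = d5))$coefficients["x", "Std. Error"]
se_with <- summary(lm(y ~ x + z, data = d5))$coefficients["x", "Std. Error"]
c(without_z = se_without, with_z = se_with)
```

Here `z` explains variation in `y` that would otherwise end up in the residuals, so the standard error for `x` shrinks once `z` is included.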

For the sake of completeness, let’s run this simulation 5,000 times, too.

Figure 5.4. In Case 5, both analyses are centred around the correct value, i.e., both are unbiased. The analysis with the control variable yields a narrower distribution of estimates, however, i.e., it is more precise.

## Conclusion

When a control variable is collected *after* the intervention took place,
it is possible that it is directly or indirectly affected by the intervention.
If this is indeed the case, including the control variable in the analysis
may yield biased estimates or decrease rather than increase the precision of the
estimates. In designed experiments, the solution to this problem is evident:
collect the control variable before the intervention takes place. If this isn’t
possible, you had better be pretty sure that the control variable isn’t a
post-treatment variable. More generally, throwing predictor variables into
a statistical model in the hopes that this will improve the analysis is a dreadful
idea.