# Baby steps in Bayes: Accounting for measurement error on a control variable

In observational studies, it is customary to account for confounding variables
by including measurements of them in the statistical model. This practice is
referred to as *statistically controlling* for the confounding variables. An
underappreciated problem is that if the confounding variables were measured
imperfectly, then statistical control will be imperfect as well, and the
confound won’t be eradicated entirely (see Berthele & Vanhove 2017;
Brunner & Austin 2009; Westfall & Yarkoni 2016; see also
*Controlling for confounding variables in correlational research: Four caveats*).

This blog post details my efforts to specify a Bayesian model in which the measurement error on the confounding variable was taken into account. The ultimate aim was to obtain more honest estimates of the impact of the confounding variable and the variable of actual interest on the outcome. First, I’ll discuss a simulated example to demonstrate the consequences of measurement error on statistical control and what a model needs to do to appropriately take measurement error into account. Then I apply the insights gained to a real-life study in applied linguistics.

I will preface all of this with the disclaimer that I don’t consider myself an expert in the techniques discussed below; one reason for writing this blog is to solicit feedback from readers more knowledgeable than I am.

## Preliminaries

If you want to follow along, you need the following R packages/settings:

You’ll also need the `MASS` package, but you don’t need to load it.
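Something along the following lines should do; treat the exact package list and settings as a suggestion rather than the definitive setup:

```r
# Packages used throughout this post (suggested set)
library(tidyverse)  # plotting and data wrangling
library(rstan)      # interface to Stan

# Recommended rstan settings: use all cores, cache compiled models
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

# Make the simulations below reproducible (seed is arbitrary)
set.seed(2018)
```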

## Demonstration: Measurement error messes up statistical control

Let’s first illustrate the problem that measurement error causes for statistical control using simulated data. That way, we know what goes into the data and what we hope a model should take out of it.

The scenario I want to focus on is the following.
You are pretty sure that a given construct `A` causally affects
a variable `Z`. You are, however, interested in finding out if another
construct `B` also affects `Z`. You can’t manipulate any of the variables,
so you have to make do with an observational study. Unfortunately,
`A` and `B` are likely to be correlated. Let’s simulate some data
to make this more concrete:

- 500 datapoints (`n`).
- Constructs `A` and `B` are correlated at $\rho = 0.73$.
- Constructs `A` and `B` are normally distributed with standard deviations of 1.5 (`sd_A`) and 0.8 (`sd_B`), respectively. The means of these normal distributions are 3 and -4, respectively.

The numbers in the list above aren’t special; I just wanted to make sure the model I will specify further down below isn’t restricted to assuming that the constructs are distributed normally with mean 0 and standard deviation 1.
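Concretely, the correlated construct scores can be drawn with `MASS::mvrnorm()`; the seed is arbitrary:

```r
n <- 500       # number of datapoints
rho <- 0.73    # correlation between the constructs
sd_A <- 1.5    # SD of construct A
sd_B <- 0.8    # SD of construct B

# Covariance matrix of the two constructs
Sigma <- matrix(c(sd_A^2,            rho * sd_A * sd_B,
                  rho * sd_A * sd_B, sd_B^2),
                nrow = 2)

# Draw the latent construct scores (means 3 and -4)
set.seed(2018)
latent <- MASS::mvrnorm(n, mu = c(3, -4), Sigma = Sigma)
A <- latent[, 1]
B <- latent[, 2]
```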

For the purposes of this simulation, I’ll generate data for
`Z` that are affected by `A` but not by `B`:
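A sketch of this step, with the setup repeated so the snippet runs on its own. The intercept of 2, the slope of 0.7 for `A` and the residual SD of 1 are stand-in values picked for illustration; the crucial bit is that `B` gets a slope of exactly zero:

```r
set.seed(2018)
n <- 500
Sigma <- matrix(c(1.5^2,            0.73 * 1.5 * 0.8,
                  0.73 * 1.5 * 0.8, 0.8^2), nrow = 2)
latent <- MASS::mvrnorm(n, mu = c(3, -4), Sigma = Sigma)
A <- latent[, 1]
B <- latent[, 2]

# Z is causally affected by A, but not by B
Z <- 2 + 0.7 * A + 0 * B + rnorm(n, sd = 1)
```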

As **Figure 1** shows, `B` and `Z` are correlated,
even though neither influences the other. This is
because of their link with `A`.

Figure 1. Even though variable `Z` isn’t causally affected by construct `B`, there exists a considerable correlation between `B` and `Z`. This correlation exists because `Z` is causally affected by construct `A`, which is correlated with `B`.

In situations such as these, researchers typically include
both `A` and `B` as predictors in a model with `Z` as the outcome.
And this works: we find a significant relationship between `A` and `Z`,
but not between `B` and `Z`. Moreover, all estimated parameters
are in the vicinity of their true values, as specified in the simulation.
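For instance, with the stand-in simulation values used here (a true slope of 0.7 for `A` and 0 for `B`), an ordinary regression recovers the true parameters:

```r
set.seed(2018)
n <- 500
Sigma <- matrix(c(1.5^2,            0.73 * 1.5 * 0.8,
                  0.73 * 1.5 * 0.8, 0.8^2), nrow = 2)
latent <- MASS::mvrnorm(n, mu = c(3, -4), Sigma = Sigma)
A <- latent[, 1]
B <- latent[, 2]
Z <- 2 + 0.7 * A + rnorm(n, sd = 1)  # B has no effect on Z

# Statistically control for A by entering both constructs as predictors
mod_latent <- lm(Z ~ A + B)
summary(mod_latent)
```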

But in real life, the situation is more complicated.
When researchers “statistically control” for a possible
confound, they’re usually interested in controlling for
the confounding *construct* rather than for any one
*measurement* of this construct. For instance, when teasing
apart the influences of L2 vocabulary knowledge and
L2 morphosyntactic knowledge on L2 speaking fluency,
researchers don’t actually want to control for the
learners’ performance on this or that vocabulary test:
they want to control for L2 vocabulary knowledge itself.
One would hope that the vocabulary test gives a good
indication of the learners’ vocabulary knowledge, but
it’s understood that their performance will be affected
by other factors as well (e.g., form on the day, luck
with guessing, luck with the words occurring in the test etc.).

So let’s add some noise (measurement error) to
constructs `A` and `B`. Here I express the measurement error
in terms of the reliability of the instruments used to measure
the constructs: if $\sigma_A$ is the standard deviation of the unobserved
construct scores and $r_{AA'}$ is the reliability of the measurement instrument,
then the standard deviation of the measurement error is $\sqrt{\frac{\sigma_A^2}{r_{AA'}} - \sigma_A^2}$.
For the purposes of this demonstration, I’m going to specify that
construct `A` was measured with ‘okay’ reliability (0.70), whereas
construct `B` was measured with exceptional reliability (0.95):
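In code, with the stand-in simulation repeated for self-containedness:

```r
reliability_A <- 0.70
reliability_B <- 0.95

# SD of the measurement error implied by the reliability:
# sqrt(sigma^2 / reliability - sigma^2)
error_sd_A <- sqrt(1.5^2 / reliability_A - 1.5^2)
error_sd_B <- sqrt(0.8^2 / reliability_B - 0.8^2)

# Regenerate the latent constructs, then add noise to them
set.seed(2018)
latent <- MASS::mvrnorm(500, mu = c(3, -4),
                        Sigma = matrix(c(1.5^2,            0.73 * 1.5 * 0.8,
                                         0.73 * 1.5 * 0.8, 0.8^2), nrow = 2))
obs_A <- latent[, 1] + rnorm(500, sd = error_sd_A)
obs_B <- latent[, 2] + rnorm(500, sd = error_sd_B)
```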

Crucially, if we include the observed values `obs_A` and `obs_B`
as predictors in a model with `Z` as the outcome, we find that the
parameter for `obs_B` is significant—even though there is no
causal link between `B` and `Z`:
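Sketched with the same stand-in simulation values as before:

```r
set.seed(2018)
n <- 500
Sigma <- matrix(c(1.5^2,            0.73 * 1.5 * 0.8,
                  0.73 * 1.5 * 0.8, 0.8^2), nrow = 2)
latent <- MASS::mvrnorm(n, mu = c(3, -4), Sigma = Sigma)
A <- latent[, 1]
B <- latent[, 2]
Z <- 2 + 0.7 * A + rnorm(n, sd = 1)

# Noisy measurements: reliabilities of 0.70 (A) and 0.95 (B)
obs_A <- A + rnorm(n, sd = sqrt(1.5^2 / 0.70 - 1.5^2))
obs_B <- B + rnorm(n, sd = sqrt(0.8^2 / 0.95 - 0.8^2))

# The same regression as before, but on the observed variables
mod_obs <- lm(Z ~ obs_A + obs_B)
summary(mod_obs)
```

With these settings, the slope for `obs_A` comes out well below the true value of 0.7, and the slope for `obs_B` comes out positive and significant.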

Descriptively, this is perfectly fine:
you do indeed now know more about `Z` if you take into account `obs_B`
in addition to `obs_A`. But if you take this to mean that
the *construct* of `B` can explain variation in `Z` over and beyond
that which can be explained by the construct of `A`, this would be a mistake.

Conceptually, what has happened is that since `obs_A` imperfectly reflects
construct `A`, including `obs_A` in the model controls for construct `A`
imperfectly.

## Fitting a model that takes measurement error into account

Below is the Stan code I used for fitting the simulated data.
The model takes as its input the three observed variables
(`obs_A`, `obs_B` and `Z`). Information about the reliability
of `obs_A` and `obs_B` is also provided in the form of a prior
distribution on `reliability_A` and `reliability_B`. Specifically,
it’s assumed that the reliability coefficient for `obs_A` is
drawn from a `beta(30, 10)` distribution. This assigns a
95% probability to the reliability coefficient lying between
roughly 0.61 and 0.87. `obs_B` is assumed to be measured more
reliably, as encoded by a `beta(95, 5)` distribution, which assigns
a 95% probability to the reliability coefficient lying between
0.90 and 0.98.
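You can check what these beta priors imply with `qbeta()`:

```r
# 95% interval implied by the beta(30, 10) prior on reliability_A
qbeta(c(0.025, 0.975), 30, 10)  # roughly 0.61 to 0.87

# 95% interval implied by the beta(95, 5) prior on reliability_B
qbeta(c(0.025, 0.975), 95, 5)   # roughly 0.90 to 0.98
```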

Importantly, as I noted
in some earlier explorations, the model has to take into account the possibility
that the constructs `A` and `B` are correlated. I specified a prior
that vaguely expects a positive correlation but that wouldn’t find correlations
close to or below zero too surprising either. Priors on the other
parameters are pretty vague; I find it difficult to come up with
reasonable priors in context-free examples.

Let’s put the data into a Stan-friendly list and fit the model:
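A sketch of that step, with the simulation repeated so the snippet is self-contained. The element names of the list and the Stan file name are assumptions here; they have to match whatever the Stan program’s `data` block declares:

```r
set.seed(2018)
n <- 500
Sigma <- matrix(c(1.5^2,            0.73 * 1.5 * 0.8,
                  0.73 * 1.5 * 0.8, 0.8^2), nrow = 2)
latent <- MASS::mvrnorm(n, mu = c(3, -4), Sigma = Sigma)
Z <- 2 + 0.7 * latent[, 1] + rnorm(n, sd = 1)
obs_A <- latent[, 1] + rnorm(n, sd = sqrt(1.5^2 / 0.70 - 1.5^2))
obs_B <- latent[, 2] + rnorm(n, sd = sqrt(0.8^2 / 0.95 - 0.8^2))

# Bundle the observed data for Stan
data_list <- list(n = n, obs_A = obs_A, obs_B = obs_B, Z = Z)

# Fit the model (file name hypothetical; not run here)
# fit <- rstan::stan("measurement_error.stan", data = data_list,
#                    iter = 4000, chains = 4)
```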

We get a couple of warnings, and I’m not sure how serious they are
or how they ought to be resolved. (Again, I’m not an expert in these
techniques; **actionable advice is more than welcome!**)
That said, all estimated parameters are pretty much on the
money. Importantly, this includes the estimated slope for the
`B` construct (`slope_B`: -0.09, with a 95% credible interval of [-0.66, 0.33]).
Notice, too, that the model was able to figure out the correlation
between the latent constructs `A` and `B` (`latent_rho`).

To get some sense of what the model is doing, I’m going to extract the posterior distributions for the latent construct scores. These are the model’s guesses of which scores the simulated participants would have had if there had been no measurement error. These guesses are based on the information we’ve fed the model, including the observed variables, the relationships among them, and their probable reliability. I’m just going to work with the means of these posterior distributions, but there can be substantial uncertainty about the model’s guesses.

**Figure 2** shows the relationships among the three variables
and shows *shrinkage* at work. For the variables about whose
actual values there is uncertainty (viz., A and B), the model
reckons that extreme values are caused by a combination of
skill (or lack thereof) as well as good (or bad) luck. Accordingly,
it adjusts these values towards the bulk of the data. In doing so,
it takes into account both the correlation that we ‘expected’ between A and B
as well as the possible relationship between A and B on the one hand and Z on
the other. For A, the adjustments are fairly large because this variable was assumed
to be measured with considerable error. For B, the adjustments are smaller.
Z, finally, was assumed to be measured without error and so no adjustments
are required.

Figure 2. The relationships among the three variables. The empty circles represent the variables as they were observed. The black circles represent, for each observation, the mean of the model’s guesses as to what the value would have been if there hadn’t been any measurement error.

In statistics at least, shrinkage is generally a good thing: The shrunken values (i.e., the model’s guesses) lie, on average, closer to the true but unobserved values than the observed values do. This is clearly the case for variable A; here I compute the mean of the absolute differences between the true latent construct scores on the one hand and the observed values and the mean estimated values on the other hand.

For variable B, the difference is negligible seeing as this variable was measured with exceptional reliability:

## A real-life example: Linguistic interdependence

For the simulated data, the model seemed to work okay, so let’s turn to a real-life example. I’ll skip the theoretical background, but several studies in applied linguistics have tried to find out if knowledge in a ‘heritage language’ contributes to the development of the societal language (For more information about such research, see Berthele & Lambelet (2017), Vanhove & Berthele (2017) and Berthele & Vanhove (2017)). In a typical research design, researchers collect data on a group of pupils’ language skills in their heritage language as well as in their societal languages at the beginning of the school year. Then, at the end of the school year, they collect similar data. Unsurprisingly, pupils with relatively good societal language skills at the beginning of the year are still relatively good at the end. But what is sometimes also observed is that heritage language proficiency at the first data collection is a predictor of societal language proficiency at the second data collection, even after taking into account societal language proficiency at the first data collection.

It’s tempting but premature to interpret such findings as evidence for a beneficial effect of heritage language skills on the development of societal language proficiency. The reason is that (a) societal and heritage language proficiency are bound to be correlated at the first data collection due to factors such as intelligence, testwiseness, form on the day etc., and (b) language proficiency is invariably measured with error. This is true of heritage language proficiency, but most importantly, it’s true of the variable that is “statistically controlled for”, i.e., societal language proficiency. Consequently, it’s likely that an off-the-shelf statistical model undercorrects for the role of societal language proficiency and overestimates the role of heritage language proficiency.

So let’s fit a model that takes measurement error into account.

### Data and question

The data we’re going to analyse are a subset of those analysed by Vanhove & Berthele (2017) and Berthele & Vanhove (2017). We have data on 91 pupils with French as their societal language and Portuguese as their heritage language. The study consisted of three data collections (and many more pupils), but we’re just going to analyse the reading proficiency data collected during waves 2 and 3 here.

The full datasets are available as an R package from https://github.com/janhove/helascot, but copy-paste the command below into R to work with the reduced dataset we’ll use here.

We’re going to model the French reading scores at the third
data collection (`FR_T3`) in terms of the French and Portuguese
reading scores at the second data collection (`FR_T2` and `PT_T2`).
**Figure 3** shows the observed variables. Note that all values are
bounded between 0 and 1, where 1 was the highest possible result.

Figure 3. The relationships between the French proficiency scores at T3, the French proficiency scores at T2 and the Portuguese proficiency scores at T2.

Fitting an off-the-shelf regression model, we find that
`PT_T2` is significantly related to `FR_T3`, even when accounting for
`FR_T2`.

Lastly, as reported by Pestana et al. (2017), the reliability of the French reading test at T2 was estimated to be 0.73, with a 95% confidence interval of [0.65, 0.78]. For Portuguese at T2, the reliability was estimated to be 0.79, with a 95% confidence interval of [0.72, 0.84]. This is information we can feed to the model. (For French at T3, the estimated reliability coefficient was 0.73, 95% CI: [0.65, 0.79], but for now, we’re not going to model the measurement error on the outcome variable.)

### Model

The model specified below is essentially the same as the model for the simulated example, but with more informed priors.

The reliability estimates for the French T2 and Portuguese T2 variables were incorporated by means of prior distributions.

- For French T2, I put a `beta(73, 27)` prior on the reliability coefficient, which assigns a 95% probability to the reliability coefficient lying between 0.64 and 0.81. This doesn’t exactly correspond to the estimated reliability coefficient’s confidence interval, but I think it’s close enough.
- For Portuguese T2, I put a `beta(79, 21)` prior on the reliability coefficient, which assigns a 95% probability to the reliability coefficient lying between 0.71 and 0.86.
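As before, `qbeta()` confirms what these priors imply:

```r
# 95% intervals implied by the reliability priors
qbeta(c(0.025, 0.975), 73, 27)  # French T2: roughly 0.64 to 0.81
qbeta(c(0.025, 0.975), 79, 21)  # Portuguese T2: roughly 0.71 to 0.86
```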

Other prior distributions reflect the fact that the predictor and the outcome data were restricted to the [0, 1] range and some common knowledge. The rationale for them is explained in the comments sprinkled throughout the code.

After tweaking the `adapt_delta` and `max_treedepth` parameters
and letting the model run for a sufficient number of iterations,
it converged without errors or warnings.

### Results

Compared to the off-the-shelf `lm()` model,
the model that takes measurement error and correlated predictors
into account estimates the parameter for `FR_T2` to be higher
and that of `PT_T2` to be lower. It is also more uncertain
about these parameters than the off-the-shelf model, I think appropriately so.
The slope for `PT_T2`, which is the important bit, is now
estimated to be 0.10, with a 95% credible interval of [-0.17, 0.38].
So when you take measurement error into account, you find that there
is less evidence for a beneficial effect of heritage language proficiency
on the development of societal language proficiency than you could
otherwise have thought.
Moreover, the model estimates that the correlation between
`FR_T2` and `PT_T2` is in the vicinity of 0.81 rather than 0.67, once
measurement error has been accounted for.

Considerable shrinkage is observed for both predictors (**Figure 4**);
keep in mind that we didn’t account for measurement error on the outcome.
The plots in the top row show how predictor values that
are pretty low given a certain outcome value are shifted toward the
right, and how predictor values that are pretty high given
a certain outcome value are shifted toward the left.
The plot in the bottom row additionally shows that the model has learnt,
among other things, that if a participant has a high `PT_T2` score but a low
`FR_T2` score, the former is probably an overestimate and the latter
is likely an underestimate. So it adjusts both values towards the centre.

Figure 4. Relationships among the predictors and the outcome. The empty circles represent the variables as they were observed. The black circles represent, for each observation, the mean of the model’s guesses as to what the value would have been if there hadn’t been any measurement error.

## Caveats

- I’m just trying to figure out this stuff, so this blog post comes with absolutely no warranties.
- We only took *reliability* into account and took *validity* for granted. In the real-life example, this means that we acknowledged that the French reading test measured French reading proficiency imperfectly — but we did assume that it, deep down, measured French reading proficiency. You can easily imagine a test that measures something highly reliably, but that still doesn’t reflect well what it’s supposed to measure. (For instance, a highly reliable trivia quiz about the Belgian national football team under the reign of Georges Leekens that’s used to represent general intelligence.) Taking measurement error (in terms of unreliability) into account doesn’t fix possible validity problems.

## References

Berthele, Raphael and Jan Vanhove. 2017.
What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland.
*International Journal of Bilingual Education and Bilingualism*.

Brunner, Jerry and Peter C. Austin. 2009.
Inflation of Type I error rate in multiple regression when independent variables are measured with error.
*Canadian Journal of Statistics* 37(1). 33–46.

Berthele, Raphael and Amelia Lambelet (eds.). 2017.
*Heritage and school language literacy development in migrant children: Interdependence or independence?*
Multilingual Matters.

Pestana, Carlos, Amelia Lambelet and Jan Vanhove. 2017.
Chapter 4: Reading comprehension development in Portuguese heritage speakers in Switzerland (HELASCOT project).
In Raphael Berthele and Amelia Lambelet (eds.),
*Heritage and school language literacy development in migrant children: Interdependence or independence?*, pp. 58-82.
Multilingual Matters.

Vanhove, Jan and Raphael Berthele. 2017.
Chapter 6: Testing the interdependence of languages (HELASCOT project).
In Raphael Berthele and Amelia Lambelet (eds.), *Heritage and school language literacy development in migrant children: Interdependence or independence?*, pp. 97-118.
Multilingual Matters.

Westfall, Jacob and Tal Yarkoni. 2016.
Statistically controlling for confounding constructs is harder than you think.
*PLOS ONE* 11(3). e0152719.