In observational studies, it is customary to account for confounding variables by including measurements of them in the statistical model. This practice is referred to as statistically controlling for the confounding variables. An underappreciated problem is that if the confounding variables were measured imperfectly, then statistical control will be imperfect as well, and the confound won’t be eradicated entirely (see Berthele & Vanhove 2017; Brunner & Austin 2009; Westfall & Yarkoni 2016) (see also Controlling for confounding variables in correlational research: Four caveats).

This blog post details my efforts to specify a Bayesian model in which the measurement error on the confounding variable was taken into account. The ultimate aim was to obtain more honest estimates of the impact of the confounding variable and the variable of actual interest on the outcome. First, I’ll discuss a simulated example to demonstrate the consequences of measurement error on statistical control and what a model needs to do to appropriately take measurement error into account. Then I apply the insights gained on a real-life study in applied linguistics.

I will preface all of this with the disclaimer that I don’t consider myself an expert in the techniques discussed below; one reason for writing this blog is to solicit feedback from readers more knowledgeable than I am.

## Preliminaries

If you want to follow along, you need the following R packages/settings:

You’ll also need the MASS package, but you don’t need to load it.

## Demonstration: Measurement error messes up statistical control

Let’s first illustrate the problem that measurement error causes for statistical control using simulated data. That way, we know what goes into the data and what we hope a model should take out of it.

The scenario I want to focus on is the following. You are pretty sure that a given construct A causally affects a variable Z. You are, however, interested in finding out if another construct B also affects Z. You can’t manipulate any of the variables, so you have to make do with an observational study. Unfortunately, A and B are likely to be correlated. Let’s simulate some data to make this more concrete:

• 500 datapoints (n)
• Constructs A and B are correlated at $\rho = 0.73$.
• Constructs A and B are normally distributed with standard deviations of 1.5 (sd_A) and 0.8 (sd_B), respectively. The means of these normal distributions are 3 and -4, respectively.

The numbers in the list above aren’t special; I just wanted to make sure the model I will specify further down below isn’t restricted to assuming that the constructs are distributed normally with mean 0 and standard deviation 1.

For the purposes of this simulation, I’ll generate data for Z that are affected by A but not by B:

As Figure 1 shows, B and Z are correlated, even though neither influences the other. This is because of their link with A.

Figure 1. Even though variable Z isn’t causally affected by construct B, there exists a considerable correlation between B and Z. This correlation exists because Z is causally affected by construct A, which is correlated with B.

In situations such as these, researchers typically include both A and B as predictors in a model with Z as the outcome. And this works: we find a significant relationship between A and Z, but not between B and Z. Moreover, all estimated parameters are in the vicinity of their true values, as specified in the simulation.

But in real life, the situation is more complicated. When researchers “statistically control” for a possible confound, they’re usually interested in controlling for the confounding construct rather than for any one measurement of this construct. For instance, when teasing apart the influences of L2 vocabulary knowledge and L2 morphosyntactic knowledge on L2 speaking fluency, researchers don’t actually want to control for the learners performance on this or that vocabulary test: they want to control for L2 vocabulary knowledge itself. One would hope that the vocabulary test gives a good indication of the learners’ vocabulary knowledge, but it’s understood that their performance will be affected by other factors as well (e.g., form on the day, luck with guessing, luck with the words occurring in the test etc.).

So let’s add some noise (measurement error) to constructs A and B. Here I express the measurement error in terms of the reliability of the instruments used to measure the constructs: If $\sigma_A$ is the standard deviation of the unobserved construct scores and $r_{AA’}$ is the reliability of the measurement instrument, then the standard deviation of the measurement error is $\sqrt{\frac{\sigma_A^2}{r_{AA’}} - \sigma_A^2}$. For the purposes of this demonstration, I’m going to specify that construct A was measured with ‘okay’ reliability (0.70), whereas construct B was measured with exceptional reliability (0.95):

Crucially, if we include the observed values obs_A and obs_B as predictors in a model with Z as the outcome, we find that the parameter for obs_B is significant—even though there is no causal link between B and Z:

Descriptively, this is perfectly fine: You do indeed now know more about Z if you take into account obs_B in addition to obs_A. But you take this to interpret that the construct of B can explain variation in Z over and beyond that which can be explained by the construct of A, this would be a mistake.

Conceptually, what has happened is that since obs_A imperfectly reflects construct A, including obs_A in the model control for construct A imperfectly.

## Fitting a model that takes measurement into account

Below is the Stan code I used for fitting the simulated data. The model takes as its input the three observed variables (obs_A, obs_B and Z). Information about the reliability of obs_A and obs_B is also provided in the form of a prior distribution on reliability_A and reliability_B. Specifically, it’s assumed that the reliability coefficient for obs_A is drawn from a beta(30, 10)-distribution. This assigns a 95% probability to the reliability coefficient lying between roughly 0.61 and 0.87. obs_B is assumed to be measured more reliably, as encoded by a beta(95, 5)-distribution, which assigns a 95% probability to the reliability coefficient lying between 0.90 and 0.98.

Importantly, as I noted in some earlier explorations, the model has to take into account the possibility that the constructs of A and B are correlated. I specified a prior vaguely expecting a positive correlation but that wouldn’t find correlations close to or below zero to be too surprising either. Priors on the other parameters are pretty vague; I find it difficult to come up with reasonable priors in context-free examples.

Let’s put the data into a Stan-friendly list and fit the model:

We get a couple of warnings, and I’m not sure how serious they are or how they ought to be resolved. (Again, I’m not an expert in these techniques; actionable advice is more than welcome!). That said, all estimated parameters are pretty much on the money. Importantly, include the estimated slope for the B construct (slope_B: -0.09, with a 95% credible interval of [-0.66, 0.33]). Notice, too, that the model was able to figure out the correlation between the latent constructs A and B (latent_rho).

To get some sense of what the model is doing, I’m going to extract the posterior distributions for the latent construct scores. These are the model’s guesses of which scores the simulated participants would have had if there had been no measurement error. These guesses are based on the information we’ve fed the model, including the observed variables, the relationships among them, and their probable reliability. I’m just going to work with the means of these posterior distributions, but there can be substantial uncertainty about the model’s guesses.

Figure 2 shows the relationships among the three variables and shows shrinkage at work. For the variables about whose actual values there is uncertainty (viz., A and B), the model reckons that extreme values are caused by a combination of skill (or lack thereof) as well as good (or bad) luck. Accordingly, it adjusts these values towards the bulk of the data. In doing so, it takes into account both the correlation that we ‘expected’ between A and B as well as the possible relationship between A and B on the one hand and Z on the other. For A, the adjustments are fairly large because this variable was assumed to be measured with considerable error. For B, the adjustments are smaller. Z, finally, was assumed to be measured without error and so no adjustments are required.

Figure 2. The relationships among the three variables. The empty circles represent the variables as they were observed. The black circles represent, for each observation, the mean of the model’s guesses as to what the value would have been if there hadn’t been any measurement error.

In statistics at least, shrinkage is generally a good thing: The shrunken values (i.e., the model’s guesses) lie, on average, closer to the true but unobserved values than the observed values do. This is clearly the case for variable A; here I compute the mean of the absolute differences between the true latent construct scores on the one hand and the observed values and the mean estimated values on the other hand.

For variable B, the difference is negligible seeing as this variable was measured with exceptional reliability:

## A real-life example: Linguistic interdependence

For the simulated data, the model seemed to work okay, so let’s turn to a real-life example. I’ll skip the theoretical background, but several studies in applied linguistics have tried to find out if knowledge in a ‘heritage language’ contributes to the development of the societal language (For more information about such research, see Berthele & Lambelet (2017), Vanhove & Berthele (2017) and Berthele & Vanhove (2017)). In a typical research design, researchers collect data on a group of pupils’ language skills in their heritage language as well as in their societal languages at the beginning of the school year. Then, at the end of the school year, they collect similar data. Unsurprisingly, pupils with relatively good societal language skills at the beginning of the year are still relatively good at the end. But what is sometimes also observed is that heritage language proficiency at the first data collection is a predictor of societal language proficiency at the second data collection, even after taking into account societal language proficiency at the first data collection.

It’s tempting but premature to interpret such findings as evidence for a beneficial effect of heritage language skills on the development of societal language proficiency. The reason is that (a) societal and heritage language proficiency are bound to be correlated at the first data collection due to factors such as intelligence, testwiseness, form on the day etc., and (b) language proficiency is invariably measured with error. This is true of heritage language proficiency, but most importantly, it’s true of the variable that is “statistically controlled for”, i.e., societal language proficiency. Consequently, it’s likely that an off-the-shelf statistical model undercorrects for the role of societal language proficiency and overestimates the role of heritage language profiency.

So let’s fit a model that takes measurement error into account.

### Data and question

The data we’re going to analyse are a subset of those analysed by Vanhove & Berthele (2017) and Berthele & Vanhove (2017). We have data on 91 pupils with French as their societal language and Portuguese as their heritage language. The study consisted of three data collections (and many more pupils), but we’re just going to analyse the reading proficiency data collected during waves 2 and 3 here.

The full datasets are are available as an R package from https://github.com/janhove/helascot, but copy-paste the command below into R to work with the reduced dataset we’ll work with here.

We’re going to fit the French reading scores at the third data collection (FR_T3) in terms of the French and Portuguese reading scores at the second data collection (FR_T2 and PT_T2). Figure 3 shows the observed variables. Note that all values are bounded between 0 and 1, where 1 was the highest possible result.

Figure 3. The relationships between the French proficiency scores at T3, the French proficiency scores at T2 and the Portuguese profiency scores at T2.

Fitting an off-the-shelf regression model, we find that PT_T2 is significantly related to FR_T3, even when accounting for FR_T2.

Lastly, as reported by Pestana et al. (2017), the reliability of the French reading test at T2 was estimated to be 0.73, with a 95% confidence interval of [0.65, 0.78]. For Portuguese at T2, the reliability was estimated to be 0.79, with a 95% confidence interval of [0.72, 0.84]. This is information we can feed to the model. (For French at T3, the estimated reliability coefficient was 0.73, 95% CI: [0.65, 0.79], but for now, we’re not going to model the measurement error on the outcome variable.)

### Model

The model specified below is essentially the same as the model for the simulated example, but with more informed priors.

The reliability estimates for the French T2 and Portuguese T2 variables were incorporated by means of prior distributions.

• For French T2, I put a beta(73, 27) prior on the reliability coefficient, which assigns a 95% probability of the reliablity coefficient lying between 0.64 and 0.81. This doesn’t exactly correspond to the estimated reliability coefficient’s confidence interval, but I think it’s close enough.
• For Portuguese T2, I put a beta(79, 21) prior on the reliability coefficient, which assigns a 95% probability of the reliablity coefficient lying between 0.71 and 0.86.

Other prior distributions reflect the fact that the predictor and the outcome data were restricted to the [0, 1] range and some common knowledge. The rationale for them is explained in the comments sprinkled throughout the code.

After tweaking the adapt_delta and max_treedepth parameters and letting the model run for a sufficient number of iterations, it converged without errors or warnings.

### Results

Compared to the off-the-shelf lm() model, the model that takes measurement errors and correlated predictors into account estimates the parameter for FR_T2 to be higher and that of PT_T2 to be lower. It is also more uncertain about these parameters than the off-the-shelf model, I think appropriately so. The slope for PT_T2, which is the important bit, is now estimated to be 0.10, with a 95% credible interval of [-0.17, 0.38]. So when you take measurement error into account, you find that there is less evidence for a beneficial effect of heritage language proficiency on the development of societal language proficiency that you could otherwise have thought. Moreover, the model estimates that the correlation between FR_T2 and PT_T2 is in the vicinity of 0.81 rather than 0.67, once measurement error has been accounted for.

Considerable shrinkage is observed for both predictors (Figure 4); keep in mind that we didn’t account for measurement error on the outcome. The plots in the top row shows how predictor values that are pretty low given a certain outcome value are shifted toward the right and how predictor values that are pretty high given a certain outcome value get shifted toward the left. The plot in the bottom row additionally shows that the model has learnt, among other things, that if a participant has a high PT_T2 score but a low FR_T2 score, that the former is probably an overestimate and the latter is likely an underestimate. So it adjusts both values towards the centre.

Figure 4. Relationships among the predictors and the outcome. The empty circles represent the variables as they were observed. The black circles represent, for each observation, the mean of the model’s guesses as to what the value would have been if there hadn’t been any measurement error.

## Caveats

• I’m just trying to figure out this stuff, so this blog posts comes with absolutely no warranties.
• We only took into account reliability and we took validity for granted. In the real-life example, this means that we acknowledged that the French reading test measured French reading proficiency imperfectly — but we did assume that it, deep down, measured French reading proficiency. You can easily imagine a test that measures something highly reliably, but that still doesn’t reflect what it’s supposed to measure well. (For instance, a highly reliable trivia quiz about the Belgian national football team under the reign of Georges Leekens that’s used to represent general intelligence.) Taking measurement error (in terms of unreliability) into account doesn’t fix possible validity problems.

## References

Berthele, Raphael and Jan Vanhove. 2017. What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland. International Journal of Bilingual Education and Bilingualism.

Brunner, Jerry and Peter C. Austin. 2009. Inflation of Type I error rate in multiple regression when independent variables are measured with error. Canadian Journal of Statistics 37(1). 33–46.

Berthele, Raphael and Amelia Lambelet (eds.). 2017. Heritage and school language literacy development in migrant children: Interdependence or independence? Multilingual Matters.

Pestana, Carlos, Amelia Lambelet and Jan Vanhove. 2017. Chapter 4: Reading comprehension development in Portuguese heritage speakers in Switzerland (HELASCOT project). In Raphael Berthele and Amelia Lambelet (eds.), Heritage and school language literacy development in migrant children: Interdependence or independence?, pp. 58-82. Multilingual Matters.

Vanhove, Jan and Raphael Berthele. 2017. Chapter 6: Testing the interdependence of languages (HELASCOT project). In Raphael Berthele and Amelia Lambelet (eds.), Heritage and school language literacy development in migrant children: Interdependence or independence?, pp. 97-118. Multilingual Matters.

Westfall, Jacob and Tal Yarkoni. 2016. Statistically controlling for confounding constructs is harder than you think.. PLOS ONE 11(3). e0152719.

21 January 2020