Researchers sometimes calculate reliability indices such as Cronbach’s $\alpha$ or Revelle’s $\omega_T$, but their statistical models rarely take these reliability indices into account. Here I want to show you how you can incorporate information about the reliability about your measurements in a statistical model so as to obtain more honest and more readily interpretable parameter estimates.

## Reliability and measurement error

This blog post is the sequel to the previous one, where I demonstrated how imperfectly measured control variables undercorrect for the actual confounding in observational studies (also see Berthele & Vanhove 2017; Brunner & Austin 2009; Westfall & Yarkoni 2016). A model that doesn’t account for measurement error on the confounding variable—and hence implicitly assumes that the confound was measured perfectly—may confidently conclude that the variable of actual interest is related to the outcome even when taking into account the confound. From such a finding, researchers typically infer that the variable of actual interest is causally related to the outcome even in absence of the confound. But once this measurement error is duly accounted for, you may find that the evidence for a causal link between the variable of interest and the outcome is more tenuous than originally believed.

So especially in observational studies, where confounds abound, it behooves researchers to account for the measurement error in their variables so that they don’t draw unwarranted conclusions from their data too often. The amount of measurement error on your variables is usually unknown. But if you’ve calculated some reliability estimate such as Cronbach’s $\alpha$ for your variables, you can use this to obtain an estimate of the amount of measurement error.

To elaborate, in classical test theory, the reliability $\rho_{XX’}$ of a measure is equal to the ratio of the variance of the (error-free) true scores to the variance of the observed scores. The latter is the sum of the variance of the true scores and the error variance:

$\rho_{XX'} = \frac{\textrm{true score variance}}{\textrm{true score variance} + \textrm{measurement error variance}} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$

Rearranging, we get

$\sigma^2_E = \sigma^2_T\left(\frac{1}{\rho_{XX'}}-1\right)$

None of these values are known, but they can be estimated based on the sample. Specifically, $\rho_{XX’}$ can be estimated by a reliability index such as Cronbach’s $\alpha$ and the sum $\sigma^2_T + \sigma^2_E$ can be estimated by computing the variable’s sample variance.

## A simulated example

Let’s first deal with a simulated dataset. The main advantage of analysing simulated data is that you check that what comes out of the model corresponds to what went into the data. In preparing this blog post, I was able to detect an arithmetic error in my model code in this way as one parameter was consistently underestimated. Had I applied the model immediately to the real data set, I wouldn’t have noticed anything wrong. But we’ll deal with real data afterwards.

Run these commands to follow along:

### Generating the construct scores

The scenario we’re going to simulate is one in which you have two correlated predictor variables (A and B) and one outcome variable (Z). Unbeknownst to the analyst, Z is causally affected by A but not by B. Moreover, the three variables are measured with some degree of error, but we’ll come to that later. Figure 1 depicts the scenario for which we’re going to simulate data. Figure 1. Graphical causal model for the simulated data. A set of unobserved factors $U$ gives rise to a correlation between A and B. Even though only A causally affects Z, an association between B and Z will be found unless A is controlled for.

The first thing we need are two correlated predictor variables. I’m going to generate these from a bivariate normal distribution. A has a mean of 3 units and a standard deviation of 1.5 units, and B has mean -4 and standard deviation 0.8 units. The correlation between them is $\rho = 0.73$. To generate a sample from this bivariate normal distribution, you need to construct the variance-covariance matrix from the standard deviations and correlation, which I do in the code below:

Next, we need to generate the outcome. In this simulation, Z depends linearly on A but not on B (hence ‘$0 \times B_i$’).

$Z_i = 2 + 0.7 \times A_i + 0 \times B_i + \varepsilon_i$

The error term $\varepsilon$ is drawn from a normal distribution with standard deviation 1.3. Importantly, this error term does not express the measurement error on Z; it is the part of the true score variance in Z that isn’t related to either A or B:

$\varepsilon_i \sim \textrm{Normal}(0, 1.3)$

Even though B isn’t causally related to Z, we find that B and Z are correlated thanks to B’s correlation to A. Figure 2. B is causally unrelated to Z, but these two variables are still correlated because B correlates with A and A and Z are causally related.

A multiple regression model is able to tease apart the effects of A and B on Z:

The confound A is significantly related to Z and its estimated regression parameter is close to its true value of 0.70 ($\widehat{\beta_A} = 0.69 \pm 0.07$). The variable of interest B, by contrast, isn’t significantly related to Z when A has been accounted for ($\widehat{\beta_B} = 0.02 \pm 0.13$). This is all as it should be.

Now let’s add measurement error to all variables. The A values that we actually observe will then be distorted versions of the true A values:

$\textrm{observed } A_i = \textrm{true } A_i + \varepsilon_{Ai}$

The ‘noise’ $\varepsilon_{A}$ is commonly assumed to be normally distributed:

$\varepsilon_{Ai} \sim \textrm{Normal}(0, \tau_A)$

That said, you can easily imagine situations where the noise likely has a different distribution. For instance, when measurements are measured to the nearest integer (e.g., body weights in kilograms), the noise is likely uniformly distributed (e.g., a reported body weight of 66 kg means that the true body weight lies between 65.5 and 66.5 kg).

To make the link with the analysis more transparant, I will express the amount of noise in terms of the variables’ reliabilities. For the confound A, I set the reliability at 0.70. Since A’s standard deviation was 1.5 units, this means that the standard deviation of the noise is $\sqrt{1.5^2\left(\frac{1}{0.70} - 1\right)} = 0.98$ units. I set the reliability for B, the variable of actual interest at 0.90. Its standard deviation is 0.8, so the standard deviation of the noise is $\sqrt{0.8^2\left(\frac{1}{0.90} - 1\right)} = 0.27$ units.

The same logic applies to adding measurement noise to Z. The difficulty here lies in obtaining the population standard deviation (or variance) of Z. I don’t want to just plug in Z’s sample standard deviation since I want to have exact knowledge of the population parameters. While we specified a sigma_Z.AB above, this is not the total population standard deviation of Z: it’s the population standard deviation of Z once Z has been controlled for A and B. To obtain the total standard deviation of Z (here admittedly confusingly labelled $\sigma_Z$), we need to add in the variance in Z due to A and B:

$\sigma_Z = \sqrt{(\beta_A\sigma_A)^2 + (\beta_B\sigma_B)^2 + 2\beta_A\beta_B\sigma_A\sigma_B\rho_{AB} + \sigma_{Z.AB}^2}$

Since $\beta_B = 0$, this simplifies to $\sigma_Z = \sqrt{(\beta_A\sigma_A)^2 + \sigma_{Z.AB}^2}$, but if you want to simulate your own datasets, the full formula may be useful.

The population standard deviation of Z is thus $\sqrt{(0.7 \times 1.5)^2 + 1.3^2} = 1.67$. Setting Z’s reliability to 0.70, we find that the standard deviation of the noise is $\sqrt{1.67^2\left(\frac{1}{0.70} - 1\right)} = 1.09$.

Figure 3 shows the causal diagram for the actually observed simulated data. Figure 3. Graphical causal model for the simulated data. A, B and Z are now unobserved (latent) variables, but so-called ‘descendants’ of them were measured. These reflect their latent constructs imperfectly. Crucially, controlling for obs_A will not entirely control for the confound: the path between obs_B and obs_Z stays open.

As Figure 4 shows, the observed variables are all correlated with each other. Figure 4.

Crucially, controlling for obs_A in a regression model doesn’t entirely eradicate the confound and we find that obs_B is significantly related to obs_Z even after controlling for obs_A ($\widehat{\beta_{B_{\textrm{obs}}}} = 0.40 \pm 0.15$). Moreover, the regression model on the observed variables underestimates the strength of the relationship between the true construct scores ($\widehat{\beta_{A_{\textrm{obs}}}} = 0.40 \pm 0.07$, whereas $\beta_A = 0.7$).

### Statistical model

The statistical model, written below in Stan code, corresponds to the data generating mechanism above and tries to infer its parameters from the observed data and some prior information.

The data block specifies the input that the model should handle. I think this is self-explanatory. Note that the latent variable scores A, B and Z aren’t part of the input as we wouldn’t have directly observed these.

The parameters block first defines the parameters needed for the regression model with the unobserved latent variables (the one we used to generate Z). It then defines the parameters needed to generate the true variables scores for A and B as well as the parameters needed to generate the observed scores from the true scores (viz., the true scores themselves and the reliabilities). Note that it is crucial to allow the model to estimate a correlation between A and B, otherwise it won’t ‘know’ that A confounds the B-Z relationship.

The transformed parameters block contains, well, transformations of these parameters. For instance, the standard deviations of A and B and the correlation between A and B are used to generate a variance-covariance matrix. Moreover, the standard deviations of the measurement noise are computed using the latent variables’ standard deviation and their reliabilities.

The model block, finally, specifies how we think the observed data and the (transformed or untransformed) parameters fit together and what plausible a priori values for the (transformed or untransformed) parameters are. These prior distributions are pretty abstract in this example: we generated context-free data ourselves, so it’s not clear what motivates these priors. The real example to follow will hopefully make more sense in this respect. You’ll notice that I’ve also specified a prior for the reliabilities. The reason is that you typically don’t know the reliability of an observed variable with perfect precision but that you have sample estimate with some inherent uncertainty. The priors reflect this uncertainty. Again, this will become clearer in the real example to follow.

Put the data in a list and fit the model:

### Results

The model recovers the true parameter values pretty well (Table 1) and, on the basis of this model, you wouldn’t erroneously conclude that B is causally related to Z (see the parameter estimate for slope_B).

Table 1. Comparison of the true parameter values and their estimates. Parameters prefixed with ‘*’ are transformations of other parameters. The estimated reliability parameters aren’t really estimated from the data: they just reflect the prior distributions put on them.

Parameter True value Estimate
intercept 2.00 $0.67 \pm 1.53$
slope_A 0.70 $0.86 \pm 0.17$
slope_B 0.00 $-0.20 \pm 0.26$
sigma_Z.AB 1.30 $1.17 \pm 0.12$
sd_A 1.50 $1.40 \pm 0.07$
sd_B 0.80 $0.80 \pm 0.03$
mean_A 3.00 $3.06 \pm 0.10$
mean_B -4.00 $-4.02 \pm 0.05$
rho 0.73 $0.78 \pm 0.05$
reliability_A 0.70 $0.69 \pm 0.04$
reliability_B 0.90 $0.90 \pm 0.03$
reliability_Z 0.70 $0.70 \pm 0.04$
*sd_Z 1.67 $1.59 \pm 0.07$
*error_A 0.98 $0.92 \pm 0.07$
*error_B 0.27 $0.26 \pm 0.04$
*error_Z 1.09 $1.05 \pm 0.08$

In the previous blog post, I’ve shown that such a model also estimates the latent true variable scores and that these estimates correspond more closely to the actual true variable scores than do the observed variable scores. I’ll skip this step here.

## Real-life example: Interdependence

Satisfied that our model can recover the actual parameter values in scenarios such as those depicted in Figure 3, we now turn to a real-life example of such a situation. The example was already described in the previous blog post; here I’ll just draw the causal model that assumes that reflects the null hypothesis that a child’s Portuguese skills at T2 (PT.T2) don’t contribute to their French skills at T3 (FR.T3), but that due to common factors such as intelligence, form on the day etc. ($U$), French skills and Portuguese skills at T2 are correlated across children. What is observed are test scores, not the children’s actual skills. Figure 5.

### Data

The command below is pig-ugly, but allows you to easily read in the data.

## Statistical model

The only thing that’s changed in the statistical model compared to the example with the simulated data is that I’ve renamed the parameters and that the prior distributions are better motivated. Let’s consider each prior distribution in turn:

• intercept ~ normal(0.2, 0.1);: The intercept is the average true French skill score at T3 for children whose true French and Portuguese skill scores at T2 are 0. This is the lowest possible score (the theoretical range of the data is [0, 1]), so we’d expect such children to perform poorly at T3, too. A normal(0.2, 0.1) distribution puts 95% probability on such children having a true French score at T3 between 0 and 0.4.

• slope_FR ~ normal(0.5, 0.25);: This parameter expresses the difference between the average true French skill score at T3 for children with a true French skill score of 1 at T2 (the theoretical maximum) vs. those with a true French skill score of 0 at T2 (the theoretical minimum). This is obviously some value between -1 and 1, and presumably it’s going to be positive. A normal(0.5, 0.25) puts 95% probability on this difference lying between 0 and 1, which I think is reasonable.

• slope_PT ~ normal(0, 0.25);: The slope for Portuguese is bound to be smaller than the one for French. Moreover, it’s not a given that it will be appreciably different from zero. Hence a prior centred on 0 that still gives the data a chance to pull the estimate in either direction.

• sigma ~ normal(0.15, 0.08);: If neither of the T2 variables predicts T3, uncertainty is highest when the mean T3 score is 0.5. Since these scores are bounded between 0 and 1, the standard deviation could not be much higher than 0.20. But French T2 is bound to be a predictor, so let us choose a slightly lower value (0.15).

• latent_means ~ normal(0.5, 0.1);: These are the prior expectations for the true score means of the T2 variables. 0.5 lies in the middle of the scale; the normal(0.5, 0.1) prior puts 95% probability on these means to lie between 0.3 and 0.7.

• sigma_lat_FR_T2 ~ normal(0, 0.25);, sigma_lat_FR_T2 ~ normal(0, 0.25);: The standard deviations of the latent T2 variables. If these truncated normal distributions put a 95% probability of the latent standard deviations to be lower than 0.50.

• latent_rho ~ normal(0.4, 0.3);: The a priori expected correlation between the latent variables A and B. These are bound to be positively correlated.

• reliability_FR_T2 ~ beta(100, 100*0.27/0.73); The prior distribution for the reliability of the French T2 variable. Cronbach’s $\alpha$ for this variable was 0.73 (95% CI: [0.65, 0.78]). This roughly corresponds to a beta(100, 100*0.27/0.73) distribution:

• reliability_PT_T2 ~ beta(120, 120*0.21/0.79); Similarly, Cronbach’s $\alpha$ for the Portuguese T2 variable was 0.79 (95% CI: [0.72, 0.84]), which roughly corresponds to a beta(120, 120*0.21/0.79) distribution:
• reliability_FR_T3 ~ beta(73, 27);: The estimated reliability for the French T3 data was similar to that of the T2 data, so I used the same prior.

### Results

Unsurprisingly, the model confidently finds a link between French skills at T2 and at T3, even on the level of the unobserved true scores ($\widehat{\beta}_{\textrm{French}} = 0.71 \pm 0.16$). But more importantly, the evidence for an additional effect of Portuguese skills at T2 on French skills at T3 is flimsy ($\widehat{\beta}_{\textrm{Portuguese}} = 0.11 \pm 0.14$). The latent T2 variables are estimated to correlate strongly at $\widehat{\rho} = 0.81 \pm 0.08$. These results don’t change much when a flat prior on $\rho$ is specified (this can be accomplished by not specifying any prior at all for $\rho$).

Compared to the model in the previous blog post (Table 2), little has changed. The only appreciable difference is that the estimate for sigma is lower. The reason is that, unlike the previous model, the current model partitions the variance in the French T3 scores into true score variance and measurement error variance. In this model, sigma captures the true score variance that isn’t accounted for by T2 skills, whereas in the previous model, sigma captured the total variance that wasn’t accounted for by T2 skills. But other than that, the current model doesn’t represent a huge change from the previous one.

Table 2. Comparison of parameter estimates between the model fitted in this blog post and the model fitted in the previous one. In the previous model, the outcome was assumed to be perfectly reliable.

Parameter Current estimate Previous estimate
intercept $0.19 \pm 0.05$ $0.19 \pm 0.05$
slope_FR $0.71 \pm 0.16$ $0.71 \pm 0.16$
slope_PT $0.11 \pm 0.14$ $0.10 \pm 0.14$
sigma $0.07 \pm 0.02$ $0.12 \pm 0.01$
latent_rho $0.81 \pm 0.08$ $0.81 \pm 0.08$

A couple of things still remain to be done. First, the French test at T3 was the same as the one at T2 so it’s likely that the measurement errors on both scores won’t be completely independent of one another. I’d like to find out how correlated measurement errors affect the parameter estimates. Second, I’d like to get started with prior and posterior predictive checks: the former to check if the priors give rise to largely possible data patterns, and the latter to check if the full model tends to generate data sets similar to the one actually observed.

Berthele, Raphael and Jan Vanhove. 2017. What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland. International Journal of Bilingual Education and Bilingualism.

Brunner, Jerry and Peter C. Austin. 2009. Inflation of Type I error rate in multiple regression when independent variables are measured with error. Canadian Journal of Statistics 37(1). 33–46.

Westfall, Jacob and Tal Yarkoni. 2016. Statistically controlling for confounding constructs is harder than you think.. PLOS ONE 11(3). e0152719.