Guarantees in the long run vs. interpreting the data at hand: Two analyses of clustered data
An analytical procedure may have excellent long-run properties but still produce nonsensical results in individual cases. I recently encountered a real-life illustration of this, but since those data aren’t mine, I’ll use simulated data with similar characteristics for this blog post.
The illustration occurred for a study with clustered data (pupils in classes) in which there were only one or two clusters per condition. In the dataset that I simulated, there are three classes that were assigned in their entirety to either the control condition (1 class) or the intervention condition (2 classes):
Crucially, the clusters in the data need to be taken into consideration during the analysis (see Vanhove 2015). Having just one or two clusters per condition is far from ideal, but this dataset is, in principle, analysable. Let’s start by plotting the data:
Figure 1. Simulated data for a cluster-randomised experiment in which one class (= cluster) was assigned to the control condition and two were assigned to the intervention condition.
Analysis 1: A hierarchical model
One option is to analyse these data in a hierarchical (mixed-effects) model in which random effects are used to account for class-to-class variability. Based on the present dataset, this variability is in fact estimated to be non-existent: the estimated standard deviation of the distribution of class effects is essentially 0 (0.00003). This seems implausible - not least of all because I simulated data with a class effect. The estimated intervention effect, meanwhile is -0.24 with a standard error of 0.31, which means that the intervention effect is not significant in this analysis (t(1) = 0.77, p = 0.58).
This procedure is somewhat too conservative, which means
that the p-values it produces tend to be too large.
Alternatively, you could fit the mixed-effects model
lmer() function from the
lme4 package, but
then you have to take care that your p-values
aren’t too low.
But in practical terms, this doesn’t matter much:
with a t-value of 0.77, you’re never going to end
up with a p-value anywhere close to 0.05.
Analysis 2: A t-test on the group means
If mixed-effects models aren’t your thing,
you may be tempted to run an analysis that is
both easy to report and perfectly accounts for
the effect of clustering on inferential statistics:
you’d compute the mean outcome per class
and then run a t-test on the class means instead.
This procedure has excellent long-run properties:
it returns a significant result for the intervention
effect in 5% of cases when no such effect exists.
That is, it isn’t too liberal (as t-test on the
individual-level data would be) nor too conservative
(like the analysis with
(Incidentally, an analysis on individual level data would
give you pretty much the same result as the
above for the present data – but as a procedure it’s
much too liberal.)
For the present data, this means that we first compute our three class means and then run a t-test on these three means (i.e., a t-test with one degree of freedom).
I write the t-test in the form of a linear model as the output is more informative.
The estimated intervention effect is still -0.24, but now the standard error is merely 0.001, yielding a t-value of a whopping 174. Even with one degree of freedom, this corresponds to a highly significant effect (p = 0.004). (The data were simulated without an intervention effect, so this is a Type-I error. But you wouldn’t know this.)
Which analysis is better?
So we’ve got two analyses that are both defensible
but give rise to wildly different inferences. The
estimate for the class-by-class variability in the
lme() fit seems implausibly small, but the estimated
intervention effect and its standard error wouldn’t
raise any eyebrows. By contrast, the t-test on the
class means seems to produce absurdly low standard errors –
not to mention a ridiculous R² value of 1. Yet it is
this latter procedure that has the better long-run
characteristics. So which analysis is the better one?
The boring answer is that it depends on what you’re looking
for in an analysis. The second analysis yields a very
low p-value, but it would be incorrect to say that it is too
low: If you simulate lots and lots of datasets similar
in structure to the one analysed (clustered data with
one and two clusters per condition, respectively) and
without any intervention effect present, this procedure
will only find a p-value lower than 0.05 in 5% of cases,
and it’ll only find a p-value lower than 0.004 in 0.4% of cases.
If you analyse those same simulated datasets using
you may not find a single significant result (I didn’t
in 20,000 simulations). So if the long-run behaviour
of the analytical procedure is your chief concern,
running a t-test on the class means is the better option.
You may, however, also be interested in estimating the intervention effect and, crucially, quantifying your uncertainty about this estimate in the form of a standard error or confidence interval. In this respect, the hierarchical model produces a more plausible result - even though for want of data, this uncertainty estimate may be highly inaccurate.
(The actual simulation-to-simulation variation was such
that 90% of the estimated intervention effects lay between
-0.83 and 0.83, a range of 1.65 units. The 90% confidence
interval for the intervention effect in
from -0.25 to -0.23, for a range of merely 0.02 units.
lme_fit, it ranges from -2.17 to 1.70, for a range
of 3.87 units. So the true accuracy lies pretty much half-way
between these two estimates. But in real life, you’d only
have the confidence intervals to go by, and one spanning merely
0.02 units would be considerably more suspicious than one
spanning nearly 4 units.
Incidentally, if you fit the same model using
function, you get a 90% confidence interval from -0.74 to 0.26,
though with a couple of warnings.)
What’s causing the problem?
The ridiculously low standard errors
and huge R² values in
ttest_fit are consequences of the small
residuals that this model produces:
The first value is the residual for the class that served as the single control class. The corresponding fitted value equals the class mean, so the residual is 0 (which is what the ‘e-20’ part essentially means). The second and third value are the residuals for the classes that served as the intervention classes. Their fitted values equal the mean of both class means, and the residuals are the deviations from that mean of means. Since these class means are so similar, the deviations are small, too. Crucially, the similarity in the means of classes 2 and 3 is atypical: Based on the variability within each class, we’d expect their means to differ more than they do.
To demonstrate this, let’s first compute the standard deviation of these two class means:
Now we take bootstrap samples within classes 2 and 3, compute their respective means, and compute the standard deviation of these means.
Figure 2. Bootstrapping the outcomes within each class demonstrates that the class means are more similar to each other than they ordinarily would be.
As Figure 2 shows, given the variability within each class, you would typically expect their class means to differ considerably more than they actually do. In fact, only in about 0.5% of bootstrap runs did the class means differ less than in our dataset.
ttest_fit model doesn’t know that the class means are much more
similar to each other than they normally would be, so it doesn’t account
for this. (Mind you, if only the long-run behaviour of the model is
of importance, it doesn’t need to know this.)
It isn’t unheard of that different defensible analyses of the same data lead to different inferences (see, e.g., Poarch et al., 2018, and Silberzahn et al., 2018). But I think it’s instructive to come across a case where the objectives of achieving the desired long-run behaviour and of sensibly interpreting the data that were actually observed clash.
If I had to analyse a dataset like this one in earnest, I would look
into fitting a Bayesian model with reasonably informed priors in
order to offset the lack of information present in the data.
(A hierarchical model with
brms’s default priors experiences
convergence issues.) If that doesn’t work, I’d probably fit
a frequentist hierarchical model (using
lmer()) as the
results it yields are more plausible in the present case than
that those of the class mean analysis.
Oh, and I’d rant to whoever collected these data that they should’ve dropped by before running the study :) I’d then have told them that it would be great if they could collect data in two or three more classes. Or that it might perhaps be possible to assign half of the pupils within each class to each condition. Still, they were extremely unlucky that the class means were so similar given the within-class variance.
Gregory J. Poarch, Jan Vanhove and Raphael Berthele. 2018. The effect of bidialectalism on executive function. International Journal of Bilingualism.
R. Silberzahn et al.. 2018. Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science 1(3). 337-356.
Jan Vanhove. 2015. Analyzing randomized controlled interventions: Three notes for applied linguists. Studies in Second Language Learning and Teaching 5(1). 135-152.