Analysing experiments with intact groups: the problem and an easy solution
To evaluate a new teaching method, a researcher arranges for one class of 20 students to be taught with the old method and another class of 20 students with the new method. According to a ttest, the students taught with the new method significantly outperform the control group at an endofyear evaluation. The setup of studies like these is pretty standard in applied linguistics – but it is fatally flawed. In this post, I explain the problem with comparing intact groups using traditional statistical tools and present a solution that’s easy to both understand and implement.
The problem
In a textbook experiment, the 40 students from the example above would have been assigned to one of the teaching methods randomly and on an individual basis. In the example above, however, the assignment to one of the teaching methods wasn’t done individually but classbyclass, so that all students in the same class were taught using the same method.
The difference may seem minute – after all, we end up with about 20 participants in either group, so who cares? However, assigning participants to experimental conditions classbyclass – in what is called a clusterrandomised experiment – leads to a problem known as clustering: Due to selection processes, a common background, having the same teachers, etc., two students in the same class tend to be somewhat more alike than two students from different classes. This, in turn, means that the information that each participant contributes to the study isn’t entirely new: if someone’s classmates have aboveaverage reading skills, chances are she too will have aboveaverage reading skills, regardless of the teaching method. Rather than having 40 participants, the researcher in the example above could have the equivalent of, say, 8, 14 or 32 participants per group in terms of the information that the sample contains.
The degree of clustering is expressed in the intraclass correlation (ICC). This number takes a value between 0 and 1. An ICC of 1 means that all values within a cluster (e.g. a class) are identical to each other so that multiple observations per cluster don’t contribute any information to the study. An ICC of 0 means that values within a cluster are no more alike that values from different clusters. For reference, typical ICC values fall in the 0.15 to 0.20 bracket in educational contexts.
Ordinary ttests and ANOVAs as well as correlation and regression analyses don’t take into account the clustered nature of such data. Instead, they assume that each datapoint is independent of any other. This may seem like a technicality, but its effect is staggering. Conventionally, a statistical test has a 5% chance of detecting a ‘significant effect’ even if nothing is going on. This is known as the TypeI error rate. If clustered data are analysed without taking the clustering into account, the TypeI error rate could easily rise to 40% or more. Crucially, even seemingly negligible degrees of clustering bloat the TypeI error rate. Figure 1 shows how the actual TypeI error rate increases as the ICC becomes larger and as the number of participants per cluster increases if the clustered nature of the data is not taken into account.
Figure 1: Spurious significance in clusterrandomised experiments analysed using traditional ttests according to the intraclass correlation (ICC) and the number of participants per cluster (m). The number of clusters was fixed at 10, but the TypeI error rate differs only slightly for different numbers of clusters. Reading guide: If you carry out a clusterrandomised experiment with 10 classes of 20 (m, green line) students each and run an ordinary ttest on the resultant 200 data points, you have a 17% chance of observing a significant difference even if no real difference exists (for an intraclass correlation of 0.05) rather than the nominal 5% chance. Common ICC values in educational contexts are in the 0.15 to 0.20 bracket.
The intuition behind these bloated TypeI error rates is this. If you compare two groups of 10 participants each but you pretend to compare two groups of 100 participants each (e.g. by entering each participant’s outcome ten times), you overestimate the degree of confidence you have in your results. Similarly, if you have 40 participants that are the informational equivalent of only 20 participants, you overestimate your confidence in the results.
Clustered data are fairly common in applied linguistics and the language sciences at large, and often for good reason. For instance, it may not be feasible to randomly assign students to teaching methods on an individual basis, or doing so may jeopardise the ecological validity of the study. But regardless of whether a clusterrandomised experiment is wellmotivated, the clustering it gives rise to needs to be taken into account in order to arrive at statistically valid results. Before discussing how this can be done, here are some examples of common clustered designs so that you may more easily identify clustered data when you stumbled across them:

94 students from 6 classes participate in an intervention study. To improve the study’s ecological validity, students in the same class are all assigned to the same experimental condition. If you see a ttest with 92 (i.e. 94  2) degrees of freedom (e.g. ‘t(92) = 2.3, p = 0.02’), the analysis is overwhelming likely not to have taken into account clustering and probably underestimates the TypeI error rate. In a word, the analysis is invalid.

The length of the KIT vowel ([I]) is measured in 15 bilinguals and 13 monolinguals. A total of 840 tokens are collected. In the Results section, you find the following: “We found a significant length difference between [I] sounds produced by bilinguals (n = 450, M = 72 ms, SD = 29 ms) and those produced by the monolingual controls (n = 390, M = 87 ms, SD = 28 ms; t(838) = 7.38, p < 0.001).” This study doesn’t concern comparisons of intact groups of students. Nevertheless, vowel tokens produced by one speaker tend to be more alike than vowels produced by different speakers. In other words, vowel tokens cluster by speaker in the same way that students cluster by class. A traditional ttest on the individual vowels (i.e. a ttest with 840  2 = 838 degrees of freedom) ignores this clustering and is therefore invalid.

40 participants are randomly assigned to one of two teaching methods on an individual basis. Teaching takes place in four groups of ten participants each (two groups per method). Even though the participants are assigned to the teaching methods individually, some clustering can be expected to arise during teaching (see Lee and Thompson 2005 for a related discussion). A traditional ttest (in this case one with 40  2 = 38 degrees of freedom) ignores this clustering, too, and is likely to yield too low a pvalue.

80 primaryschool students are divided into age groups and, within age group, are randomly combined into pairs. They are then given a map task, and the lexical diversity of their utterances is measured. A regression or correlation analysis is used to gauge the relationship between the students’ age and their lexical diversity. Here, too, the data are clustered – presumably by class but also per dyad: Mutual likes and dislikes and phenomena such as lexical and syntactic priming may contribute to lexical diversity measurements within each dyad being more similar than between different dyads in the same age group.

A questionnairebased experiment is administered to two groups of students taking parallel courses. Students in the first group all fill out version A of the questionnaire; those in the second group fill out version B. Again, even fairly innocuouslooking similarities within each group (friends going to the same course, students having the same timetable because they take the same courses and have similar interests, more convenient to attend one class and not the other due to better train connections etc.) can lead to an overstated degree of confidence in the study’s results.
In the last three cases, researchers may argue that no clustering is present in their data, but this argument would be extremely difficult to prove statistically. (It would involve showing that the betweencluster variance is zero.) Such an argument would be entirely rhetorical and would probably be quite easy to pick apart.
An easy solution
The fancy solution for dealing with the clustering problem is to analyse the data in a multilevel (i.e. mixedeffects) model. Multilevel models aren’t too easy to learn, though, and communicating their results to an audience that’s unfamiliar with them can be a challenge. There’s a conceptually straightforward and easytoimplement alternative, however: average the measurements per cluster and run the ttest on the cluster averages instead. Doing so removes the dependencies in the data and produces nominal TypeI error rates.
To show how easy this is and how fundamentally it can affect the results, I’ll go through the process using simulated data for a clusterrandomised experiment that you can either download or read directly into R.
As you can see, this is a simple experiment with 80 participants per condition.
Crosstabulating the Condition
and Class
variables reveals that all students in a given class were assigned to the same experimental condition.
What if we were to ignore this clustering? A visual inspection of the distribution of the scores in each condition wouldn’t reveal anything out of the ordinary…
… and a ttest would reveal a highly significant difference between the two conditions (t(158) = 3.2, p = 0.002): the intervention worked!
It’s only by plotting the scores per class that we get an idea of how the data are clustered. Class 3, a control class, consists of students who were pretty good at the task relative to the other control classes, for instance. (Incidentally, I set the ICC value to 0.1 when simulating this dataset.)
To take byclass clustering into account, we can compute the mean score per class.
For this, I use the plyr
package (use install.packages("plyr")
to install):
These class means are then compared in a ttest:
This time, the test fails to show a significant difference (t(6) = 1.1, p = 0.31), and we’d be forced to conclude that we have no evidence that the intervention worked. (Which doesn’t mean that it didn’t, mind you.)
Further reading and some additional points
For further references and some additional points concerning clusterrandomised experiments, I refer to a recent article of mine (Section 4). Here’s the bulletpoint version of what I think researchers should be aware of when planning or evaluating classbased experiments:

Properly analysed clusterrandomised experiments have lower statistical power (i.e., a lower probability of observing statistical differences if the intervention actually works) than individually randomised experiments with the same number of participants – no matter what. Having 200 participants in an individually randomised design or 200 participants in a clusterrandomised design are two entirely different things.

Increasing the number of participants per cluster increases power, but not as much as you’d think. For instance, if having 10 classes of 20 participants gives you 40% power, having 10 classes of 200 participants gives you about 53% power. Increasing the number of clusters is much more efficient. With 14 classes of 20 participants each, you’d already have 57% power.

Having 10 clusters of 20 participants each is better than having 2 clusters of 80 participants each and 8 clusters of 5 participants each. If the former gives you 40% power, the latter gives you 27% power.

Think about what sort of covariates would account for uninteresting variance in the dependent variable. Pretest scores, for instance, can be averaged by class, too, and be entered into an analysis of covariance on the class means. Don’t go overboard with this, though: Each covariate costs a degree of freedom, and sacrificing a degree of freedom for a weakly predictive covariate could actually result in a loss of power.

Don’t run experiments with only one cluster per condition. A ttest on two cluster means will have 2  2 = 0 degrees of freedom and will return an error. Conceptually, a study with one cluster per condition has no reliable way of telling whether any difference between the two classes are due to the experimental condition or to the classes tested.

Lastly, in the article referenced above, I suggested that researchers may want to consider weighting the cluster means by the number of observations in each cluster when running analyses on the cluster means. This was incorrect. When running some simulations in preparation of this blog post, I noticed that weighting cluster means in this way inflates the test’s TypeI error rate, sometimes dramatically. My error was due to having misinterpreted some articles on taking unequal cluster sizes into account when planning clusterrandomised designs.