Suggestions for more informative replication studies

design features
mixed-effects models

Jan Vanhove


November 20, 2017

In recent years, psychologists have started to run large-scale replications of seminal studies. For a variety of reasons, which I won’t go into, this welcome development hasn’t quite made it to research on language learning and bi- and multilingualism. That said, I think it can be interesting to scrutinise how these large-scale replications are conducted. In this blog post, I take a closer look at a replication attempt by O’Donnell et al. with some 4,500 participants that’s currently in press at Psychological Science and make five suggestions as to how I think similar replications could be designed to be even more informative.

A replication attempt of “professor priming”

A purported phenomenon that’s come under particular scrutiny in recent years is that it is possible to predictably influence people’s behaviour by ‘priming’ them with a concept related to the desired behaviour. In so-called “professor priming”, you would make your participants engage with the concept of intelligence by having them write down what they imagined their daily life as a university professor would look like. Relative to participants who had to engage with the concept of stupidity (by imagining their daily life as football hooligans), participants primed with intelligence would then perform better on a trivia quiz. (I don’t really understand what trivia quizzing has to do with intelligence, but you get the idea.)

It was this specific example that O’Donnell et al. sought to replicate in 23 labs across the globe with some 4,500 participants in total. About half of the participants imagined their daily life as professors, and the rest as hooligans. Afterwards, they completed an ostensibly unrelated trivia quiz with 30 questions.

“Professor priming” sounds literally unbelievable, and, indeed, O’Donnell et al.’s replication did not yield much support for it. (Replication attempts by Ap Dijksterhuis, the researcher who first suggested the existence of professor priming, also failed. Or they only worked in men. Which is to say that the original result couldn’t be replicated.) In their discussion, the replication authors point out that two-thirds of the participants in fact suspected a link between the two ostensibly unrelated tasks, and duly discuss a number of factors that might explain the discrepancy between the original professor priming study and the replication. These include the possibility that “professor” and “hooligan” are associated with different concepts across cultures and that some trivia items could work better in some cultures than in others.

When reading about this replication attempt, I ventured on Twitter - that oasis of nuance - that it struck me as a waste of time. That wasn’t to say that I thought it wasn’t carried out well, but to me it seemed that both the finding itself (zilch) and the potential mitigating factors were fairly predictable. But that’s hardly a constructive attitude. So, in the hopes that they are useful for future replication attempts, here are some post-hoc suggestions for getting more out of 4,500 participants. I’ll assume throughout that “professor priming” might just be a real phenomenon.

Five suggestions for getting more out of 4,500 participants

(1) A larger pool of stimuli

One possible factor that could explain O’Donnell et al.’s null result is that “professors” and “hooligans” don’t evoke the same concepts now across the globe as they did at the time and place of the original study (The Netherlands in the 1990s). One solution to this would be to use more than one stimulus that’s associated with the concept of intelligence and more than one stimulus associated with the concept of stupidity. For instance, you could have a whole pool of brainy stereotypes and of brawny stereotypes. Each participant assigned to the “brainy” condition is then in turn assigned randomly to one of the brainy stereotypes, and similarly for the “brawny” condition.

The advantage of operationalising intelligence and stupidity using multiple constructs is generalisability. Arguably, “professor priming” is but one instantiation of “intelligence priming” - that is, it’s not so much imagining your life as a professor that would boost your trivia performance as engaging with the concept of intelligence more generally. By using multiple operationalisations, you put yourself in a position to claim that any positive or negative results aren’t just due to your choice of stereotype. Using partial pooling (mixed-effects models), you can get an estimate of how well each operationalisation “works” without falling victim to multiple comparisons.
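To make the partial-pooling idea a bit more concrete, here is one way such a model could be written down. This is only a sketch under simplifying assumptions that are not part of the replication: the outcome $y_p$ is participant $p$'s overall quiz score treated as roughly continuous, $x_p$ is 1 for the “brainy” condition and 0 for the “brawny” one, and $k(p)$ denotes the stereotype participant $p$ was assigned to.

```latex
y_{p} = \beta_0 + \beta_1 x_{p} + s_{k(p)} + \varepsilon_{p},
\qquad s_{k} \sim \mathcal{N}(0, \tau^2),
\qquad \varepsilon_{p} \sim \mathcal{N}(0, \sigma^2)
```

Here $\beta_1$ is the overall priming effect, and each $s_k$ is the deviation of stereotype $k$ from the average of its own condition. Because the $s_k$ are assumed to come from a common distribution with variance $\tau^2$, the estimates for individual stereotypes are shrunk towards their condition mean, which is what keeps the comparison of many stereotypes from turning into a multiple-comparisons exercise.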

If you want to run a more exact, highly powered replication of the original while at the same time exploring the generalisability of these findings, you could still use the “professor” and “hooligan” primes for 2,000 participants and, say, ten other primes for 250 participants each.
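As a rough illustration of this allocation, the sketch below builds a randomised assignment list. The split is an assumption made for the example: the 2,000 "exact" participants are divided evenly between the professor and hooligan primes, and the ten extra primes are five brainy and five brawny ones with placeholder names (`brainy_1`, `brawny_1`, and so on). Only "professor" and "hooligan" come from the study itself.

```python
import random
from collections import Counter

# Primes used in the original study.
EXACT_PRIMES = {"brainy": "professor", "brawny": "hooligan"}

# Hypothetical additional primes: five per condition, placeholder names only.
EXTRA_PRIMES = {
    "brainy": [f"brainy_{i}" for i in range(1, 6)],
    "brawny": [f"brawny_{i}" for i in range(1, 6)],
}


def build_allocation(n_exact=2000, n_per_extra=250, seed=2017):
    """Return a shuffled list of (condition, prime) slots, one per participant."""
    slots = []
    # Exact replication: split n_exact evenly over professor and hooligan.
    for condition, prime in EXACT_PRIMES.items():
        slots += [(condition, prime)] * (n_exact // 2)
    # Looser replication: n_per_extra participants for each additional prime.
    for condition, primes in EXTRA_PRIMES.items():
        for prime in primes:
            slots += [(condition, prime)] * n_per_extra
    # Shuffle so that the order of arrival determines nothing about the prime.
    random.Random(seed).shuffle(slots)
    return slots


allocation = build_allocation()
counts = Counter(allocation)
per_condition = Counter(condition for condition, _ in allocation)
```

With these numbers, the list has 2,000 + 10 × 250 = 4,500 slots, and both conditions end up with 2,250 participants.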

(2) A larger pool of trivia items

Similarly, instead of presenting the same 30 trivia items to all participants, you could have a larger pool of trivia items from which you present 30 to each participant. This again reduces the risk that you happened to select those trivia items that are or aren’t sensitive to priming. Using mixed-effects modelling, it should be fairly easy to explore which trivia items are answered correctly more often following professor priming than following hooligan priming. (I don’t quite get the use of meta-analytic techniques in this replication attempt seeing as all data are available, but this may be inherent to the format.)
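Drawing a different set of items for each participant is easy to script. The following is a minimal sketch, assuming a hypothetical pool of 120 items with placeholder names; the function name `draw_items` and the seeding scheme are likewise just illustrative choices.

```python
import random


def draw_items(item_pool, participant_id, n_items=30, seed=2017):
    """Draw n_items distinct items for one participant, reproducibly."""
    # Seeding on study seed plus participant ID makes each draw reproducible
    # without every participant getting the same items.
    rng = random.Random(f"{seed}-{participant_id}")
    return rng.sample(item_pool, n_items)


# Hypothetical item pool; the replication itself used one fixed set of 30 items.
item_pool = [f"item_{i:03d}" for i in range(1, 121)]

quiz = draw_items(item_pool, participant_id=1)
```

Since every item is then answered by participants in both conditions, item-level differences in sensitivity to priming can be estimated rather than assumed away.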

(Further reading concerning this suggestion: Westfall et al. (2015).)

(3) Multiple outcomes

I understand that earlier replication attempts of professor priming also used IQ tests as the follow-up task, as opposed to trivia quizzes. In the same spirit of generalisability, you could have some or all participants complete a slimmed-down IQ test.

(4) A partial pretest–posttest design (a.k.a. a Solomon four-group design)

Assuming professor priming affects trivia quizzing, it’s natural to ask which aspects of trivia quizzing are improved by it. In addition to using a larger pool of trivia items and analysing them using mixed-effects models, you could also imagine having, say, 25% of your participants complete the quiz before and after the priming intervention. (I wouldn’t do this with all participants for fear that pretest sensitisation becomes another mitigating factor.) That way, you can start to answer questions such as “In what respect did the participants’ performance improve? Was it particular questions they got right afterwards, or did performance improve across all items?” and “Does pretest sensitisation mitigate the effect?”
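The four groups in such a design are simply the crossing of priming condition with pretest (yes or no). Below is a small sketch of how the assignment could be generated, assuming 4,500 participants, a 25% pretest share and a (near-)even split of conditions within both the pretested and the non-pretested participants. The function name and the balancing rule are assumptions made for the example.

```python
import random
from collections import Counter


def assign_groups(n_participants, pretest_share=0.25, seed=2017):
    """Return a shuffled list of (condition, pretested) cells, one per participant."""
    n_pretest = round(n_participants * pretest_share)
    cells = []
    for pretested, n in ((True, n_pretest), (False, n_participants - n_pretest)):
        half = n // 2
        # Split each pretest stratum as evenly as possible over the two conditions.
        cells += [("brainy", pretested)] * half
        cells += [("brawny", pretested)] * (n - half)
    random.Random(seed).shuffle(cells)
    return cells


groups = assign_groups(4500)
group_sizes = Counter(groups)
```

Comparing the priming effect between the pretested and the non-pretested participants then gives a direct handle on pretest sensitisation.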

(5) A pilot phase?

Two-thirds of the participants claimed to have suspected a link between the priming part of the experiment and the quizzing part. This finding was probably more difficult to anticipate. But perhaps future replication studies may want to use some proportion of their data, say, the initial 5%, as pilot data, which is to be analysed in full. Any striking patterns that could call into question the reliability of the replication findings, such as participant awareness, could be discussed and possibly ironed out at that stage.
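Setting aside the pilot portion is a matter of splitting the participants by order of arrival. A minimal sketch, assuming participants are identified by consecutive numbers and that the pilot share is 5% (`split_pilot` is a made-up helper name):

```python
def split_pilot(participants_in_order, pilot_share=0.05):
    """Split participants, ordered by arrival, into a pilot set and a main set."""
    n_pilot = round(len(participants_in_order) * pilot_share)
    return participants_in_order[:n_pilot], participants_in_order[n_pilot:]


# Hypothetical participant IDs 1..4500 in order of arrival.
pilot, main = split_pilot(list(range(1, 4501)))
```

With 4,500 participants, this leaves 225 for the pilot analysis and 4,275 for the main study.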


The informativeness of large replication attempts could be improved even further by carrying out both a maximally exact replication and looser replications that seek to generalise the postulated effect. Using mixed-effects models, you can then analyse how well the finding replicates more or less exactly and whether this generalises to other stimuli and outcomes. This would make positive findings more informative since you can start to explore potency differences between stimuli and susceptibility differences between outcomes. But it would also make negative findings more informative, since it would be more difficult to suggest that a null finding is due to ill-chosen stimuli or items and that the effect really does exist nonetheless.