The Center for Open Science’s Preregistration Challenge: Why it’s relevant and some recommended background reading

multiple comparisons
open science

Jan Vanhove


October 31, 2016

This blog post is an edited version of a mail I sent round to my colleagues at the various language and linguistics departments in Fribourg. Nothing in this post is new per se, but I haven’t seen much discussion of these issues among linguists, applied linguists and bilingualism researchers.

I’d like to point you to an initiative of the Center for Open Science: the $1,000,000 Preregistration Challenge. The basic idea is to foster research transparency by offering a monetary reward to researchers who outline their study design and planned analyses in advance and then report the results of these planned analyses in the resulting article.

I’m not affiliated with this organisation, but I do think both it and its initiative are important developments. For those interested in knowing why I think so, I’ve written a brief text below that includes links to more detailed articles or examples; if you prefer reference lists, there’s one of those at the bottom of this post. Most of the articles were written by and for psychologists, but I reckon pretty much everything in them applies equally to research in linguistics and language learning.

Ever wondered why the literature is filled to the brim with statistically significant results – often mutually contradictory – even though we know that you shouldn’t be able to find so many of them even if everyone’s theories were correct? A big part of the answer is that researchers enjoy a great degree of flexibility in terms of running their studies, analysing their data, and interpreting the findings, and – wittingly or unwittingly – use this flexibility to increase their chances of finding a significant result.

All of this, incidentally, leaves aside false findings due to errors in research design and data analysis, including common ones such as reading too much into the difference between significant and non-significant results, taking the results of an analysis that ‘statistically controls for’ confounding variables at face value, or carving up continuous variables into groups, as well as mistakes in reporting, and flat-out fraud.

Scientific thoroughness is a virtue, but such flexibility messes up statistical inferences in a big way, and the odds that a statistically significant finding represents a fluke skyrocket. As this flexibility almost always remains undisclosed, perusers of the scholarly literature have no way of calibrating their expectations of what p < 0.05 means in terms of providing support for a theory, and budding researchers may find themselves scratching their heads wondering why everyone seems to ‘achieve’ significance but they can’t.
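To make the point about flukes a bit more concrete, here is a minimal simulation sketch in Python. It is an illustration only and isn't taken from any of the articles listed in this post. It assumes that there is no true effect whatsoever, that the researcher measures several outcome variables that are independent of one another, and that each outcome is analysed with a two-sample z-test with a known standard deviation of 1 (a simplification that keeps the example within Python's standard library). The function names are made up for this example.

```python
import random
from statistics import NormalDist, mean

ALPHA = 0.05
STD_NORMAL = NormalDist()  # standard normal distribution, used for the p-values


def z_test_p(group_a, group_b):
    """Two-sided p-value of a two-sample z-test, assuming a known SD of 1."""
    standard_error = (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def null_study_is_significant(rng, n_per_group, n_outcomes):
    """Simulate one study without any true effect.

    Returns True if at least one of the outcome variables yields p < ALPHA,
    i.e. if a flexible analyst could report a 'significant' finding.
    """
    for _ in range(n_outcomes):
        # Both groups are drawn from the same population: the null is true.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p(group_a, group_b) < ALPHA:
            return True
    return False


def false_positive_rate(n_outcomes, n_studies=5000, n_per_group=20, seed=2016):
    """Proportion of simulated null studies with at least one p < ALPHA."""
    rng = random.Random(seed)
    hits = sum(
        null_study_is_significant(rng, n_per_group, n_outcomes)
        for _ in range(n_studies)
    )
    return hits / n_studies


if __name__ == "__main__":
    for n_outcomes in (1, 3, 5):
        simulated = false_positive_rate(n_outcomes)
        expected = 1 - (1 - ALPHA) ** n_outcomes
        print(f"{n_outcomes} outcome(s): simulated {simulated:.3f}, "
              f"expected {expected:.3f}")
```

With independent outcomes, the proportion of ‘significant’ null studies should land near 1 - 0.95^k for k outcomes: about 14% for three outcomes and about 23% for five, rather than the nominal 5%. Correlated outcomes would inflate the rate less, and other kinds of flexibility (optional stopping, covariate selection, outlier removal) would add to it, so the exact numbers shouldn't be read as estimates for any real study.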

Preregistering your study – i.e., writing down, to the extent possible, what you’ll do in your study and how you’ll analyse the data – and then following through on these decisions won’t solve all of these problems. But it’ll help researchers and readers to distinguish more clearly between planned and post-hoc decisions, which will in turn allow them to calibrate their interpretation of the results more accurately.


Brown, Nicholas J. L. & James A. J. Heathers. 2016. The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological & Personality Science.

De Groot, Adriaan D. 2014. The meaning of “significance” for different types of research. Translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas. Acta Psychologica 148. 188-194.

Elson, Malte.

Gelman, Andrew & Eric Loken. 2013. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time.

Ioannidis, John P. A. 2005. Why most published research findings are false. PLOS Medicine 2. e124.

John, Leslie K., George Loewenstein & Drazen Prelec. 2012. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science 23. 524-532.

Kerr, Norbert L. 1998. HARKing: Hypothesizing After the Results are Known. Personality and Social Psychology Review 2. 196-217.

Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349.

Sedlmeier, Peter & Gerd Gigerenzer. 1989. Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105. 309-316.

Simmons, Joseph P., Leif D. Nelson & Uri Simonsohn. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22. 1359-1366.

Sterling, Theodore D. 1959. Publication decisions and their possible effects on inferences drawn from tests of significance–or vice versa. Journal of the American Statistical Association 54. 30-34.

Sterling, Theodore D., W. L. Rosenbaum & J. J. Weinkam. 1995. Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician 49. 108-112.