I blog about statistics and research design with researchers in bilingualism, multilingualism, and applied linguistics in mind.

Feed: Subscribe to new blog posts.

Latest blog posts


Suggestions for more informative replication studies

20 November 2017

In recent years, psychologists have started to run large-scale replications of seminal studies. For a variety of reasons, which I won’t go into, this welcome development hasn’t quite made it to research on language learning and bi- and multilingualism. That said, I think it can be interesting to scrutinise how these large-scale replications are conducted. In this blog post, I take a closer look at a replication attempt by O’Donnell et al. with some 4,500 participants that’s currently in press at Psychological Science, and I make five suggestions for designing similar replications to be even more informative.

Read more...


Increasing power and precision using covariates

24 October 2017

A recurring theme in the writings of methodologists in recent years, and indeed decades, is that researchers need to increase the statistical power (and precision) of the studies they conduct. These writers have rightly stressed the need for larger sample sizes, but other research design characteristics that affect power and precision have received comparatively little attention. In this blog post, I discuss and demonstrate how researchers can achieve more power and precision without testing more participants, simply by capitalising on information that they collect anyway.
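As a rough, hypothetical illustration of the idea (not the analysis worked through in the post), the following R sketch simulates a two-group study in which a pretest score, collected anyway, correlates with the outcome; all variable names and numbers are made up. Adjusting for the pretest shrinks the standard error of the estimated group difference, i.e., it buys precision and power without adding a single participant.

```r
# Minimal simulation: covariate adjustment buys precision "for free".
set.seed(2017)
n <- 100                                     # participants per group (made up)
group   <- rep(c(0, 1), each = n)            # control vs. treatment
pretest <- rnorm(2 * n, mean = 50, sd = 10)  # covariate collected anyway
outcome <- 0.8 * pretest + 3 * group + rnorm(2 * n, sd = 8)

# Same data, two models: without and with the covariate.
summary(lm(outcome ~ group))$coefficients["group", "Std. Error"]
summary(lm(outcome ~ group + pretest))$coefficients["group", "Std. Error"]
# The second standard error is markedly smaller: more precision,
# and hence more power, from information that was already available.
```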

Read more...


Confidence interval-based optional stopping

19 September 2017

Stopping data collection early when the provisional results are significant (“optional stopping”) inflates your chances of finding a pattern in the data when nothing is going on. This can be countered using a technique known as sequential testing, but that’s not what this post is about. Instead, I’d like to illustrate that optional stopping isn’t necessarily a problem if your stopping rule doesn’t involve p-values. If you base your decision to collect more data or not on how wide your current confidence interval (or Bayesian credible interval) is, rather than on p-values, peeking at your data can be a reasonable strategy. Additionally, in this post I want to share some code for R functions that you can adapt to simulate the effects of different stopping rules.
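The post itself links to the actual functions; as a stand-in, here is a minimal, self-contained R sketch of the kind of stopping rule meant above, under the assumption of two groups drawn from the same normal population: keep adding participants in batches until the 95% confidence interval for the mean difference is narrower than some target width, then look at coverage and bias. All names and settings are illustrative.

```r
# Sketch of confidence interval-based optional stopping:
# add participants in batches until the 95% CI for the mean
# difference between two groups is narrower than 'target_width'.
run_until_narrow <- function(target_width = 0.5, batch = 20, max_n = 500) {
  x <- rnorm(batch); y <- rnorm(batch)       # true difference is zero here
  repeat {
    ci <- t.test(x, y)$conf.int
    if (diff(ci) < target_width || length(x) >= max_n) break
    x <- c(x, rnorm(batch))                  # peek, then collect more data
    y <- c(y, rnorm(batch))
  }
  c(n_per_group = length(x),
    estimate    = mean(x) - mean(y),
    covered     = ci[1] <= 0 && 0 <= ci[2])  # does the final CI contain 0?
}

sims <- replicate(1000, run_until_narrow())
mean(sims["covered", ])   # proportion of final CIs containing the true value (0)
mean(sims["estimate", ])  # average estimate across the simulation runs
```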

Read more...


Creating comparable sets of stimuli

14 September 2017

When designing a study, you sometimes have a pool of candidate stimuli (words, sentences, texts, images, etc.) that is too large to present to each participant in its entirety. If you want data on all, or at least most, stimuli, a possible solution is to split up the pool of stimuli into sets of manageable size and assign each participant to one of these sets. Ideally, you’d want the different sets to be as comparable as possible with respect to a number of relevant characteristics, so that each participant is exposed to about the same diversity of stimuli during the task. For instance, when presenting individual words to participants, you may want each participant to be confronted with a similar distribution of words in terms of their frequency, length, and number of adverbs. In this blog post, I share some R code that I used to split up two types of stimuli into sets that are comparable with respect to one or several variables, in the hope that you can easily adapt it for your own needs.
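The code referred to is in the post itself; as a much simpler stand-in, the sketch below illustrates one way to approach the problem in R: sort the stimuli on a matching variable, deal them out to the sets in round-robin fashion, and then check how comparable the sets turned out. The data frame and variable names are invented for the example.

```r
# Toy illustration of one way to create comparable stimulus sets:
# sort the stimuli on a matching variable and assign them to sets
# in round-robin fashion, then check how similar the sets are.
set.seed(123)
stimuli <- data.frame(word      = paste0("word", 1:90),
                      frequency = rlnorm(90, meanlog = 3),
                      length    = sample(3:12, 90, replace = TRUE))

n_sets  <- 3
stimuli <- stimuli[order(stimuli$frequency), ]
stimuli$set <- rep_len(1:n_sets, nrow(stimuli))   # round-robin assignment

# The sets should now have similar frequency (and roughly similar length) profiles:
aggregate(cbind(frequency, length) ~ set, data = stimuli, FUN = mean)
```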

Read more...


Draft: Replication success as predictive utility

24 August 2017

In recent years, a couple of high-profile, multi-lab attempts to replicate previous findings have been conducted in psychology, but there wasn’t much consensus about when a replication attempt should be considered to confirm the original finding. When I started dabbling in predictive modelling a couple of months ago, it occurred to me that it could be useful to view replication success in terms of how accurately the original finding predicts the replication: if taking the original finding at face value permits more accurate predictions about the patterns in the replication data than ignoring it does, the original finding can at least be said to contribute to our body of knowledge. This approach, I think, has some key advantages over the methods used to quantify replication success in the multi-lab replication attempts that I mentioned earlier.

Read more...


Abandoning standardised effect sizes and opening up other roads to power

14 July 2017

Numerical summaries of research findings will typically feature an indication of the sizes of the effects that were studied. These indications are often standardised effect sizes, which means that they are expressed relative to the variance in the data rather than in the units in which the variables were measured. Popular standardised effect sizes include Cohen’s d, which expresses the mean difference between two groups as a proportion of the pooled standard deviation, and Pearson’s r, which expresses the change in one variable, in standard deviations, that is associated with a change of one standard deviation in another variable. There exists a rich literature that discusses which standardised effect sizes ought to be used depending on the study’s design, how this or that standardised effect size should be adjusted for this or that bias, and how confidence intervals should be constructed around standardised effect sizes (see the blog post Confidence intervals for standardised mean differences). But most of this literature should be of little importance to the practising scientist, for the simple reason that standardised effect sizes themselves ought to be of little importance.

In what follows, I will defend this point of view, which I’ve outlined in two previous blog posts (Why I don’t like standardised effect sizes and More on why I don’t like standardised effect sizes), by sharing some quotes from respected statisticians and methodologists. In particular, I hope to, first, make you think about the detrimental effect that the use of standardised effect sizes has on the interpretability of research findings and the accumulation of knowledge and, second, convince you that using standardised effect sizes to plan studies overly stresses sample size as the main determinant of a study’s power, and that abandoning them opens up other roads to power.

Read more...


Interactions between continuous variables

26 June 2017

Splitting up continuous variables is generally a bad idea. In terms of statistical efficiency, the popular practice of dichotomising continuous variables at their median is comparable to throwing out a third of the dataset. Moreover, statistical models based on split-up continuous variables are prone to misinterpretation: threshold effects are easily read into the results when, in fact, none exist. Splitting up, or ‘binning’, continuous variables, then, is something to avoid. But what if you’re interested in how the effect of one continuous predictor varies according to the value of another continuous predictor? In other words, what if you’re interested in the interaction between two continuous predictors? Binning one of the predictors seems appealing since it makes the model easier to interpret. However, as I’ll show in this blog post, it’s fairly straightforward to fit and interpret interactions between continuous predictors.
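As a taster, here is a minimal R sketch (with simulated data and made-up coefficients, not the example from the post) of fitting an interaction between two continuous predictors and reading off how the slope of one predictor changes with the value of the other, no binning required.

```r
# Minimal sketch: fit an interaction between two continuous predictors
# and inspect how the slope of one varies with the other, without binning.
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rnorm(n)

mod <- lm(y ~ x1 * x2)   # shorthand for y ~ x1 + x2 + x1:x2
coef(mod)

# The effect of x1 isn't a single number but a function of x2:
# slope of x1 = b_x1 + b_x1:x2 * x2. Evaluate it at a few x2 values:
b <- coef(mod)
x2_values <- quantile(x2, c(0.1, 0.5, 0.9))
data.frame(x2          = round(x2_values, 2),
           slope_of_x1 = b["x1"] + b["x1:x2"] * x2_values)
```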

Read more...


Tutorial: Adding confidence bands to effect displays

12 May 2017

In the previous blog post, I demonstrated how you can draw effect displays to render regression models more intelligible to yourself and to your audience. However, these effect displays did not convey the uncertainty inherent in estimating regression models. To remedy this, this blog post demonstrates how you can add confidence bands to effect displays for multiple regression, logistic regression, and logistic mixed-effects models, and explains how these confidence bands are constructed.
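As a minimal sketch of the general idea for the simplest of the three cases, ordinary linear regression, the code below computes fitted values and their standard errors along one predictor, holds the other predictor at its mean, and turns them into an approximate pointwise 95% confidence band. The data and variable names are made up; for logistic and mixed-effects models the construction differs (e.g., it proceeds on the link scale), which is what the post itself covers.

```r
# Sketch: a pointwise 95% confidence band along one predictor,
# with the other predictor held at its mean (ordinary regression only).
set.seed(42)
d <- data.frame(x = runif(80, 0, 10), z = rnorm(80))
d$y <- 2 + 0.7 * d$x - 0.5 * d$z + rnorm(80, sd = 2)

mod  <- lm(y ~ x + z, data = d)
grid <- data.frame(x = seq(0, 10, length.out = 100), z = mean(d$z))
pred <- predict(mod, newdata = grid, se.fit = TRUE)

grid$fit   <- pred$fit
grid$lower <- pred$fit - 1.96 * pred$se.fit   # approximate 95% band
grid$upper <- pred$fit + 1.96 * pred$se.fit

plot(y ~ x, data = d, col = "grey60")
lines(fit   ~ x, data = grid, lwd = 2)
lines(lower ~ x, data = grid, lty = 2)
lines(upper ~ x, data = grid, lty = 2)
```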

Read more...


Tutorial: Plotting regression models

23 April 2017

The results of regression models, particularly fairly complex ones, can be difficult to appreciate and hard to communicate to an audience. One useful technique is to plot the effect of each predictor variable on the outcome while holding constant any other predictor variables. Fox (2003) discusses how such effect displays are constructed and provides an implementation in the effects package for R.

Since I think it’s both instructive to see how effect displays are constructed from the ground up and useful to be able to tweak them yourself in R, this blog post illustrates how to draw such plots for three increasingly complex statistical models: ordinary multiple regression, logistic regression, and mixed-effects logistic regression. The goal in each of these three examples is to visualise the effects of the predictor variables without factoring in the uncertainty about these effects; visualising such uncertainty will be the topic of a future blog post.
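For reference, and leaving the from-scratch construction to the post itself, the effects package mentioned above can produce such displays in a couple of lines; the built-in mtcars data set is used here purely as a stand-in.

```r
# Effect displays via the effects package (Fox 2003), using a built-in
# data set as a stand-in: each panel shows one predictor's effect with
# the other predictors held at typical values.
library(effects)
mod <- lm(mpg ~ wt + hp, data = mtcars)
plot(allEffects(mod))
```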

Read more...


Confidence intervals for standardised mean differences

22 February 2017

Standardised effect sizes express patterns found in the data in terms of the variability in those data. For instance, a mean difference in body height could be expressed in the metric in which the data were measured (e.g., a difference of 4 centimetres) or relative to the variation in the data (e.g., a difference of 0.9 standard deviations). The latter is a standardised effect size known as Cohen’s d.

As I’ve written before, I don’t particularly like standardised effect sizes. Nonetheless, I wondered how confidence intervals around standardised effect sizes (more specifically: standardised mean differences) are constructed. Until recently, I hadn’t really thought about it and sort of assumed you would compute them the same way as confidence intervals around raw effect sizes. But unlike raw (unstandardised) mean differences, standardised mean differences are a combination of two estimates subject to sampling error: the mean difference itself and the sample standard deviation. Moreover, the sample standard deviation is a biased estimate of the population standard deviation (it tends to be too low), which causes Cohen’s d to be an upwardly biased estimate of the population standardised mean difference. Surely both of these factors must affect how the confidence intervals around standardised effect sizes are constructed?

It turns out that indeed they do. When I compared the confidence intervals that I computed around a standardised effect size using a naive approach that assumed that the standard deviation wasn’t subject to sampling error and wasn’t biased, I got different results than when I used specialised R functions.

But these R functions all produced different results, too.

Obviously, there may well be more than one way to skin a cat, but this caused me to wonder if the different procedures for computing confidence intervals all covered the true population parameter with the nominal probability (e.g., in 95% of cases for a 95% confidence interval). I ran a simulation to find out, which I’ll report in the remainder of this post. If you spot any mistakes, please let me know.
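To give a flavour of what such a simulation can look like (this is a stripped-down stand-in, not the simulation reported in the post), the R sketch below repeatedly draws two samples from populations with a known standardised mean difference, computes a naive 95% confidence interval around Cohen’s d that ignores the sampling error and bias in the standard deviation, and records how often that interval captures the true value. All settings are arbitrary.

```r
# Stripped-down coverage check for a naive CI around Cohen's d:
# treat the pooled SD as if it were known and unbiased, then see how
# often the resulting 95% interval captures the true standardised difference.
naive_ci_covers <- function(n = 20, true_d = 0.5) {
  x <- rnorm(n, mean = true_d); y <- rnorm(n)   # population SDs are 1
  sd_pooled <- sqrt((var(x) + var(y)) / 2)      # equal group sizes
  d <- (mean(x) - mean(y)) / sd_pooled
  se_naive <- sqrt(2 / n)                       # SE of a raw mean difference (SD = 1)
  ci <- d + c(-1.96, 1.96) * se_naive
  ci[1] <= true_d && true_d <= ci[2]
}

set.seed(271)
mean(replicate(20000, naive_ci_covers()))
# If this proportion deviates noticeably from 0.95, the naive interval
# doesn't live up to its nominal coverage.
```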

Read more...