I blog about statistics and research design, with an audience of researchers in bilingualism, multilingualism, and applied linguistics in mind.


Latest blog posts

Checking model assumptions without getting paranoid

25 April 2018

Statistical models come with a set of assumptions, and violations of these assumptions can render the inferences drawn from these models irrelevant or even invalid. It is important, then, to verify that your model’s assumptions are at least approximately tenable for your data. To this end, statisticians commonly recommend that you graphically check the distribution of your model’s residuals (i.e., the differences between your actual data and the model’s fitted values). This is excellent advice that, unfortunately, causes some students to become paranoid and see violated assumptions everywhere they look. This blog post is for them.
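One way to calibrate your paranoia is to simulate data for which every assumption holds by construction and then inspect the residuals anyway. Below is a minimal Python sketch of that idea (the post itself works with graphical checks in R; all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate data from a model whose assumptions hold exactly:
# y = 2 + 3x + standard-normal noise (coefficients are made up).
n = 50
x = rng.uniform(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 1, n)

# Fit a simple linear regression by least squares.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Even though normality holds by construction, the sample
# skewness of the residuals won't be exactly zero: small
# samples produce 'suspicious-looking' diagnostics by chance.
skewness = np.mean(residuals**3) / np.std(residuals)**3
```

Re-running this with different seeds shows how much perfectly well-behaved residuals can vary, which is a useful benchmark before declaring an assumption violated.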


Consider generalisability

12 February 2018

A good question to ask yourself when designing a study is, “Who and what are any results likely to generalise to?” Generalisability needn’t always be a priority when planning a study. But by giving the matter some thought before collecting your data, you may still be able to alter your design so that you don’t have to smother your conclusions with ifs and buts if you do want to draw generalisations.

The generalisability question is mostly cast in terms of the study’s participants: Would any results apply just to the participants themselves or to some wider population, and if so, to which one? Important as this question is, this blog post deals with a question that is asked less often but is equally crucial: Would any results apply just to the materials used in the study or might they stand some chance of generalising to different materials?


Suggestions for more informative replication studies

20 November 2017

In recent years, psychologists have started to run large-scale replications of seminal studies. For a variety of reasons, which I won’t go into, this welcome development hasn’t quite made it to research on language learning and bi- and multilingualism. That said, I think it can be interesting to scrutinise how these large-scale replications are conducted. In this blog post, I take a closer look at a replication attempt by O’Donnell et al. with some 4,500 participants that’s currently in press at Psychological Science and make five suggestions as to how I think similar replications could be designed to be even more informative.


Increasing power and precision using covariates

24 October 2017

A recurring theme in the writings of methodologists over the last years, and indeed decades, is that researchers need to increase the statistical power (and precision) of the studies they conduct. These writers have rightly stressed the need for larger sample sizes, but other design characteristics that affect power and precision have received comparatively little attention. In this blog post, I discuss and demonstrate how, by capitalising on information they collect anyway, researchers can achieve more power and precision without running more participants.
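The core idea can be illustrated with a small simulation: adjusting for a covariate that correlates with the outcome shrinks the standard error of the treatment estimate at the same sample size. This is a hypothetical Python sketch, not the post's own analysis (the correlation and effect size are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

def treatment_se(n=100, rho=0.7, adjust=True):
    """Standard error of the treatment effect estimate, with or
    without adjusting for a covariate that correlates rho with
    the outcome. (Illustrative simulation; numbers are made up.)"""
    covariate = rng.normal(size=n)
    group = np.repeat([0.0, 1.0], n // 2)
    outcome = (0.4 * group + rho * covariate
               + np.sqrt(1 - rho**2) * rng.normal(size=n))
    cols = [np.ones(n), group] + ([covariate] if adjust else [])
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    resid = outcome - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(cov[1, 1])  # SE of the group coefficient

# Averaged over simulations, the covariate-adjusted analysis
# yields a smaller standard error, i.e., more precision
# without running a single extra participant.
se_adj = np.mean([treatment_se(adjust=True) for _ in range(200)])
se_unadj = np.mean([treatment_se(adjust=False) for _ in range(200)])
```

With a covariate-outcome correlation of 0.7, the residual standard deviation drops by about 30%, and the treatment effect's standard error drops with it.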


Confidence interval-based optional stopping

19 September 2017

Stopping data collection early when the provisional results are significant (“optional stopping”) inflates your chances of finding a pattern in the data when nothing is going on. This can be countered using a technique known as sequential testing, but that’s not what this post is about. Instead, I’d like to illustrate that optional stopping isn’t necessarily a problem if your stopping rule doesn’t involve p-values. If you base your decision to collect more data or not on how wide your current confidence interval (or Bayesian credible interval) is, rather than on p-values, peeking at your data can be a reasonable strategy. Additionally, in this post, I want to share some code for R functions that you can adapt to simulate the effects of different stopping rules.
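The post shares R functions for such simulations; as a language-agnostic illustration of the idea, here is a minimal Python sketch of a CI-width stopping rule (the batch size, target width, and maximum sample size are made-up values):

```python
import numpy as np

rng = np.random.default_rng(3)

def run_until_precise(true_mean=0.0, sd=1.0, batch=20,
                      max_n=500, target_halfwidth=0.2):
    """Keep sampling in batches until the 95% CI for the mean
    is narrower than the target (an illustrative stopping rule,
    not the post's own R code)."""
    data = np.array([])
    while len(data) < max_n:
        data = np.append(data, rng.normal(true_mean, sd, batch))
        se = data.std(ddof=1) / np.sqrt(len(data))
        if 1.96 * se < target_halfwidth:
            break
    return data.mean(), len(data)

# Under this rule, peeking at the data determines *when* you
# stop, not *whether* the result looks significant: with no
# true effect, the estimates still centre on zero.
estimates = [run_until_precise()[0] for _ in range(200)]
```

Because the stopping rule targets precision rather than significance, repeatedly checking the interval's width doesn't bias you towards 'discovering' effects that aren't there.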


Creating comparable sets of stimuli

14 September 2017

When designing a study, you sometimes have a pool of candidate stimuli (words, sentences, texts, images etc.) that is too large to present to each participant in its entirety. If you want data for all or at least most stimuli, a possible solution is to split up the pool of stimuli into sets of manageable size and assign each participant to one of the different sets. Ideally, you’d want the different sets to be as comparable as possible with respect to a number of relevant characteristics, so that each participant is exposed to about the same diversity of stimuli during the task. For instance, when presenting individual words to participants, you may want each participant to be confronted with a similar distribution of words in terms of their frequency, length, and number of adverbs. In this blog post, I share some R code that I used to split up two types of stimuli into sets that are comparable with respect to one or several variables, in the hope that you can easily adapt it for your own needs.
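The post's solution is in R; one simple way to implement the same idea, sketched here in Python with a hypothetical stimulus pool, is to try many random splits and keep the one on which the sets' standardised means are closest together:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stimulus pool: 60 items with a frequency and a
# length measure (stand-ins for whatever variables you match on).
n_stimuli, n_sets = 60, 3
freq = rng.lognormal(2, 1, n_stimuli)
length = rng.integers(3, 12, n_stimuli)

def imbalance(assignment):
    """How far apart the sets are on the matching variables:
    the spread of the standardised set means, summed over variables."""
    score = 0.0
    for var in (freq, length):
        z = (var - var.mean()) / var.std()
        means = [z[assignment == s].mean() for s in range(n_sets)]
        score += max(means) - min(means)
    return score

# Brute force: generate many random equal-sized splits and keep
# the most balanced one (one simple approach among several).
best = None
for _ in range(2000):
    assignment = rng.permutation(np.arange(n_stimuli) % n_sets)
    if best is None or imbalance(assignment) < imbalance(best):
        best = assignment
```

The `np.arange(n_stimuli) % n_sets` trick guarantees equal set sizes before shuffling; more sophisticated approaches (e.g., optimisation-based matching) exist, but best-of-many-random-splits is often good enough.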


Draft: Replication success as predictive utility

24 August 2017

In recent years, a couple of high-profile, multi-lab attempts to replicate previous findings have been conducted in psychology, but there wasn’t much consensus about when a replication attempt should be considered to confirm the original finding. When I started dabbling in predictive modelling a couple of months ago, it occurred to me that it could be useful to view replication success in terms of how accurately the original finding predicts the replication: if taking the original finding at face value permits more accurate predictions about the patterns in the replication data than ignoring it does, the original finding can at least be said to contribute to our body of knowledge. This approach, I think, has some key advantages over the methods used to quantify replication success in the multi-lab replication attempts mentioned above.


Abandoning standardised effect sizes and opening up other roads to power

14 July 2017

Numerical summaries of research findings typically feature an indication of the sizes of the effects that were studied. These indications are often standardised effect sizes, which means that they are expressed relative to the variance in the data rather than in the units in which the variables were measured. Popular standardised effect sizes include Cohen’s d, which expresses the mean difference between two groups as a proportion of the pooled standard deviation, and Pearson’s r, which expresses the change in one variable, in standard deviation units, that is associated with a change of one standard deviation in another variable. There exists a rich literature on which standardised effect sizes ought to be used depending on the study’s design, on how this or that standardised effect size should be adjusted for this or that bias, and on how confidence intervals should be constructed around standardised effect sizes (see the blog post Confidence intervals for standardised mean differences). But most of this literature should be of little importance to the practising scientist, for the simple reason that standardised effect sizes themselves ought to be of little importance.
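For readers unfamiliar with the definition, here is a tiny worked example of Cohen's d in Python, using made-up scores for two groups of four:

```python
import numpy as np

# Toy data: scores for two groups (made-up numbers).
a = np.array([5.0, 6.0, 7.0, 8.0])
b = np.array([3.0, 4.0, 5.0, 6.0])

# Pooled standard deviation: a weighted average of the two
# groups' variances, using n - 1 degrees of freedom each.
pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1)
                     + (len(b) - 1) * b.var(ddof=1))
                    / (len(a) + len(b) - 2))

# Cohen's d: the mean difference in units of the pooled SD.
d = (a.mean() - b.mean()) / pooled_sd
```

Here the raw mean difference is 2 and the pooled SD is about 1.29, so d is about 1.55. The point of the post is precisely that the raw difference of 2, in the variables' own units, is usually the more informative number.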

In what follows, I will defend this point of view, which I’ve outlined in two previous blog posts (Why I don’t like standardised effect sizes and More on why I don’t like standardised effect sizes), by sharing some quotes from respected statisticians and methodologists. In particular, I hope, first, to make you think about the detrimental effect that the use of standardised effect sizes has on the interpretability of research findings and the accumulation of knowledge and, second, to convince you that using standardised effect sizes for planning studies overly stresses sample size as the main determinant of a study’s power, and that abandoning them opens up other roads to power.


Interactions between continuous variables

26 June 2017

Splitting up continuous variables is generally a bad idea. In terms of statistical efficiency, the popular practice of dichotomising continuous variables at their median is comparable to throwing out a third of the dataset. Moreover, statistical models based on split-up continuous variables are prone to misinterpretation: threshold effects are easily read into the results when, in fact, none exist. Splitting up, or ‘binning’, continuous variables, then, is something to avoid. But what if you’re interested in how the effect of one continuous predictor varies according to the value of another continuous predictor? In other words, what if you’re interested in the interaction between two continuous predictors? Binning one of the predictors seems appealing since it makes the model easier to interpret. However, as I’ll show in this blog post, it’s fairly straightforward to fit and interpret interactions between continuous predictors.
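As a sketch of the modelling approach (the post presumably works in R; this is an illustrative Python version with made-up coefficients), you simply include the product of the two predictors in the model and read off the slope of one predictor at chosen values of the other:

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulate two continuous predictors whose effects interact:
# the slope of x1 grows with x2 (all coefficients are made up).
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 0.5 * x1 + 0.3 * x2 + 0.8 * x1 * x2 + rng.normal(size=n)

# Fit the model with a product term instead of binning either
# predictor: y ~ intercept + x1 + x2 + x1:x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fitted slope of x1 at a given value of x2 is
# beta[1] + beta[3] * x2; no median split needed.
slopes = {x2_val: beta[1] + beta[3] * x2_val
          for x2_val in (-1, 0, 1)}
```

Evaluating the conditional slope at a few representative values of the other predictor (say, its mean and one SD either side) gives you the interpretability that binning promises, without the loss of information.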


Tutorial: Adding confidence bands to effect displays

12 May 2017

In the previous blog post, I demonstrated how you can draw effect displays to render regression models more intelligible to yourself and your audience. However, these effect displays did not convey the uncertainty inherent in estimating regression models. This blog post therefore demonstrates how you can add confidence bands to effect displays for multiple regression, logistic regression, and logistic mixed-effects models, and explains how these confidence bands are constructed.
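The construction boils down to propagating the coefficient covariance matrix to the fitted values. Here is a hedged Python sketch for the simple-regression case (the post itself covers more model types, in R; the data here are simulated):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy regression to illustrate how a pointwise confidence band
# is built from the coefficient covariance matrix.
n = 80
x = rng.uniform(0, 10, n)
y = 1 + 0.5 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)          # residual variance
vcov = sigma2 * np.linalg.inv(X.T @ X)    # coefficient covariance

# At each grid point g, the fitted value's standard error is
# sqrt(g' V g); +/- 1.96 SE gives an approximate 95% band.
grid = np.column_stack([np.ones(50), np.linspace(0, 10, 50)])
fit = grid @ beta
se = np.sqrt(np.sum((grid @ vcov) * grid, axis=1))
lower, upper = fit - 1.96 * se, fit + 1.96 * se
```

A tell-tale property worth checking in your own plots: the band is narrowest near the mean of the predictor and fans out towards the extremes, where the model is estimated from less information.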