I blog about statistics and research design with researchers in bilingualism, multilingualism, and applied linguistics in mind.
Latest blog posts
20 November 2017
In recent years, psychologists have started to run large-scale replications of seminal studies. For a variety of reasons, which I won’t go into, this welcome development hasn’t quite made it to research on language learning and bi- and multilingualism. That said, I think it can be interesting to scrutinise how these large-scale replications are conducted. In this blog post, I take a closer look at a replication attempt by O’Donnell et al. with some 4,500 participants that’s currently in press at Psychological Science and make five suggestions as to how I think similar replications could be designed to be even more informative.
24 October 2017
A recurring theme in the writings of methodologists over the last few years, and indeed decades, is that researchers need to increase the statistical power (and precision) of the studies they conduct. These writers have rightly stressed the necessity of larger sample sizes, but other research design characteristics that affect power and precision have received comparatively little attention. In this blog post, I discuss and demonstrate how researchers can achieve more power and precision without running more participants by capitalising on information that they collect anyway.
19 September 2017
Stopping data collection early when the provisional results are significant (“optional stopping”) inflates your chances of finding a pattern in the data when nothing is going on. This can be countered using a technique known as sequential testing, but that’s not what this post is about. Instead, I’d like to illustrate that optional stopping isn’t necessarily a problem if your stopping rule doesn’t involve p-values. If you base your decision to collect more data or not on how wide your current confidence interval (or Bayesian credible interval) is rather than on p-values, peeking at your data can be a reasonable strategy. Additionally, in this post, I want to share some code for R functions that you can adapt in order to simulate the effects of different stopping rules.
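To give a flavour of what such a simulation can look like, here is a minimal sketch in Python (not the R functions from the post). It assumes a one-sample design with no true effect, uses the large-sample critical value of 1.96 instead of exact t quantiles, and peeks after every 10 observations. All function names and settings are made up for illustration.

```python
import random
import statistics

Z_CRIT = 1.96  # large-sample approximation to the 95% critical value (an assumption)


def half_width(sample):
    """Half-width of an approximate 95% confidence interval around the mean."""
    return Z_CRIT * statistics.stdev(sample) / len(sample) ** 0.5


def is_significant(sample):
    """True if the approximate 95% interval around the mean excludes zero."""
    return abs(statistics.fmean(sample)) > half_width(sample)


def stop_on_significance(sample):
    """Stopping rule based on p-values: stop as soon as p < .05."""
    return is_significant(sample)


def stop_on_precision(sample, target=0.25):
    """Stopping rule based on precision: stop once the interval is narrow enough."""
    return half_width(sample) < target


def run_study(stop_rule, rng, n_start=30, step=10, n_max=100):
    """Simulate one study under the null and report whether it ends up significant."""
    sample = [rng.gauss(0, 1) for _ in range(n_start)]
    while len(sample) < n_max and not stop_rule(sample):
        sample.extend(rng.gauss(0, 1) for _ in range(step))
    return is_significant(sample)


def false_positive_rate(stop_rule, n_sims=2000, seed=2017):
    """Proportion of simulated null studies that end with a significant result."""
    rng = random.Random(seed)
    return sum(run_study(stop_rule, rng) for _ in range(n_sims)) / n_sims


if __name__ == "__main__":
    print("stop on significance:", false_positive_rate(stop_on_significance))
    print("stop on precision:   ", false_positive_rate(stop_on_precision))
```

Under these assumptions, the rule based on significance should produce clearly more false positives than the rule based on interval width, whose error rate should stay near the nominal level.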
14 September 2017
When designing a study, you sometimes have a pool of candidate stimuli (words, sentences, texts, images etc.) that is too large to present to each participant in its entirety. If you want data for all or at least most stimuli, a possible solution is to split up the pool of stimuli into sets of manageable size and assign each participant to one of the different sets. Ideally, you’d want the different sets to be as comparable as possible with respect to a number of relevant characteristics so that each participant is exposed to about the same diversity of stimuli during the task. For instance, when presenting individual words to participants, you may want each participant to be confronted with a similar distribution of words in terms of their frequency, length, and number of adverbs. In this blog post I share some R code that I used to split up two types of stimuli into sets that are comparable with respect to one or several variables, in the hope that you can easily adapt it for your own needs.
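One simple way to obtain comparable sets is a random search: shuffle the stimuli many times and keep the split whose set means differ least. The Python sketch below illustrates that idea only; it is not the R code from the post. The word list, the variable names (`log_frequency`, `length`) and the imbalance measure are hypothetical.

```python
import random
import statistics


def imbalance(sets, variables, sds):
    """Sum, over variables, of the range of the set means in SD units."""
    total = 0.0
    for var in variables:
        means = [statistics.fmean([item[var] for item in s]) for s in sets]
        total += (max(means) - min(means)) / sds[var]
    return total


def split_comparable(stimuli, n_sets, variables, n_tries=2000, seed=1):
    """Try n_tries random splits and return the most balanced one with its score."""
    rng = random.Random(seed)
    sds = {var: statistics.stdev([item[var] for item in stimuli]) for var in variables}
    best_sets, best_score = None, float("inf")
    for _ in range(n_tries):
        shuffled = stimuli[:]
        rng.shuffle(shuffled)
        # Deal the shuffled stimuli out like cards so set sizes differ by at most 1.
        sets = [shuffled[i::n_sets] for i in range(n_sets)]
        score = imbalance(sets, variables, sds)
        if score < best_score:
            best_sets, best_score = sets, score
    return best_sets, best_score


# Made-up stimuli: 60 "words" with a random log frequency and length.
data_rng = random.Random(42)
words = [
    {"id": i, "log_frequency": data_rng.gauss(3, 1), "length": data_rng.randint(3, 12)}
    for i in range(60)
]
sets, score = split_comparable(words, 3, ["log_frequency", "length"])
```

A random search like this only balances the means; if the spread of the variables within each set matters too, the imbalance measure would need to be extended accordingly.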
24 August 2017
In recent years, a couple of high-profile, multi-lab attempts to replicate previous findings have been conducted in psychology, but there wasn’t much consensus about when a replication attempt should be considered as confirming the original finding. As I started dabbling in predictive modelling a couple of months ago, I started thinking that it could be useful to view replication success in terms of how accurately the original finding predicts the replication: If taking the original finding at face value permits more accurate predictions about the patterns in the replication data than ignoring it does, the original finding can at least be said to contribute to our body of knowledge. This approach, I think, has some key advantages over the methods used to quantify replication success in the multi-lab replication attempts that I mentioned earlier.
14 July 2017
Numerical summaries of research findings will typically feature an indication of the sizes of the effects that were studied. These indications are often standardised effect sizes, which means that they are expressed relative to the variance in the data rather than with respect to the units in which the variables were measured. Popular standardised effect sizes include Cohen’s d, which expresses the mean difference between two groups as a proportion of the pooled standard deviation, and Pearson’s r, which expresses the difference in one variable as a proportion of its standard deviation that is associated with a change of one standard deviation of another variable. There exists a rich literature that discusses which standardised effect sizes ought to be used depending on the study’s design, how this or that standardised effect size should be adjusted for this and that bias, and how confidence intervals should be constructed around standardised effect sizes (blog post Confidence intervals for standardised mean differences). But most of this literature should be of little importance to the practising scientist for the simple reason that standardised effect sizes themselves ought to be of little importance.
In what follows, I will defend this point of view, which I’ve outlined in two previous blog posts (Why I don’t like standardised effect sizes and More on why I don’t like standardised effect sizes), by sharing some quotes by respected statisticians and methodologists. In particular, I hope to, first, make you think about the detrimental effect that the use of standardised effect sizes has on the interpretability of research findings and the accumulation of knowledge and, second, convince you that the use of standardised effect sizes for planning studies overly stresses sample size as the main determinant of a study’s power and that by abandoning it other roads to power can be opened up.
26 June 2017
Splitting up continuous variables is generally a bad idea. In terms of statistical efficiency, the popular practice of dichotomising continuous variables at their median is comparable to throwing out a third of the dataset. Moreover, statistical models based on split-up continuous variables are prone to misinterpretation: threshold effects are easily read into the results when, in fact, none exist. Splitting up, or ‘binning’, continuous variables, then, is something to avoid. But what if you’re interested in how the effect of one continuous predictor varies according to the value of another continuous predictor? In other words, what if you’re interested in the interaction between two continuous predictors? Binning one of the predictors seems appealing since it makes the model easier to interpret. However, as I’ll show in this blog post, it’s fairly straightforward to fit and interpret interactions between continuous predictors.
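The efficiency claim can be checked with a quick simulation. The Python sketch below assumes a normally distributed predictor and a linear effect; the function names, the slope and the sample size are made up. It compares the squared correlation between the outcome and the predictor before and after a median split. For a normal predictor, the ratio should come out at roughly 2/π ≈ 0.64, in line with the “third of the dataset” figure.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation computed from scratch."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def median_split_efficiency(n=20000, slope=0.5, seed=2017):
    """Ratio of squared correlations: median-split predictor vs. continuous predictor."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [slope * xi + rng.gauss(0, 1) for xi in x]
    cutoff = statistics.median(x)
    x_binned = [1.0 if xi > cutoff else 0.0 for xi in x]
    return pearson_r(x_binned, y) ** 2 / pearson_r(x, y) ** 2


if __name__ == "__main__":
    print(median_split_efficiency())
```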
12 May 2017
In the previous blog post, I demonstrated how you can draw effect displays to render regression models more intelligible to yourself and to your audience. These effect displays did not contain information about the uncertainty inherent to estimating regression models, however. To that end, this blog post demonstrates how you can add confidence bands to effect displays for multiple regression, logistic regression, and logistic mixed-effects models, and explains how these confidence bands are constructed.
23 April 2017
The results of regression models, particularly fairly complex ones, can be difficult to appreciate and hard to communicate to an audience. One useful technique is to plot the effect of each predictor variable on the outcome while holding constant any other predictor variables. How such effect displays are constructed has been discussed in the literature, and an implementation is available in the effects package for R. Since I think it’s both instructive to see how effect displays are constructed from the ground up and useful to be able to tweak them yourself in R, this blog post illustrates how to draw such plots for three increasingly complex statistical models: ordinary multiple regression, logistic regression, and mixed-effects logistic regression. The goal in each of these three examples is to visualise the effects of the predictor variables without factoring in the uncertainty about these effects; visualising such uncertainty will be the topic of a future blog post.
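The core idea can be sketched in a few lines of Python: vary one predictor over a grid, fix the others at a chosen value, and compute the model’s predictions. The logistic regression coefficients below are entirely made up for illustration (they don’t come from any fitted model), as are the predictor names `age` and `proficiency`.

```python
import math

# Hypothetical coefficients on the log-odds scale, invented for this example.
COEFS = {"intercept": -1.0, "age": 0.04, "proficiency": 0.8}


def predicted_probability(age, proficiency, coefs=COEFS):
    """Convert the linear predictor (log-odds) to a probability."""
    log_odds = (coefs["intercept"]
                + coefs["age"] * age
                + coefs["proficiency"] * proficiency)
    return 1 / (1 + math.exp(-log_odds))


def effect_display(focal, grid, held_constant, coefs=COEFS):
    """Predictions for each grid value of the focal predictor, others held constant."""
    rows = []
    for value in grid:
        predictors = dict(held_constant)
        predictors[focal] = value
        rows.append((value, predicted_probability(coefs=coefs, **predictors)))
    return rows


# Effect of age from 20 to 80 with proficiency fixed at 1 (e.g. a value of interest).
age_effect = effect_display("age", range(20, 81, 10), {"proficiency": 1.0})
for age, prob in age_effect:
    print(age, round(prob, 3))
```

Plotting the second column against the first gives the effect display for age. Because the predictions are back-transformed from log-odds, the resulting curve depends on the value at which the other predictor is held constant.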
22 February 2017
Standardised effect sizes express patterns found in the data in terms of the variability found in the data. For instance, a mean difference in body height could be expressed in the metric in which the data were measured (e.g., a difference of 4 centimetres) or relative to the variation in the data (e.g., a difference of 0.9 standard deviations). The latter is a standardised effect size known as Cohen’s d.
As I’ve written before, I don’t particularly like standardised effect sizes. Nonetheless, I wondered how confidence intervals around standardised effect sizes (more specifically: standardised mean differences) are constructed. Until recently, I hadn’t really thought about it and sort of assumed you would compute them the same way as confidence intervals around raw effect sizes. But unlike raw (unstandardised) mean differences, standardised mean differences are a combination of two estimates subject to sampling error: the mean difference itself and the sample standard deviation. Moreover, the sample standard deviation is a biased estimate of the population standard deviation (it tends to be too low), which causes Cohen’s d to be an upwardly biased estimate of the population standardised mean difference. Surely both of these factors must affect how the confidence intervals around standardised effect sizes are constructed?
It turns out that indeed they do. When I computed confidence intervals around a standardised effect size using a naive approach that assumed that the standard deviation wasn’t subject to sampling error and wasn’t biased, I got different results than when I used specialised R functions.
But these R functions all produced different results, too.
Obviously, there may well be more than one way to skin a cat, but this caused me to wonder if the different procedures for computing confidence intervals all covered the true population parameter with the nominal probability (e.g., in 95% of cases for a 95% confidence interval). I ran a simulation to find out, which I’ll report in the remainder of this post. If you spot any mistakes, please let me know.
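To illustrate what such a coverage simulation involves, here is a minimal Python sketch (not the simulation reported in the post). It implements one possible reading of the naive approach: the confidence interval for the raw mean difference is simply divided by the pooled sample standard deviation. The critical t value is hard-coded for two groups of 20, and all names and settings are assumptions for illustration.

```python
import random
import statistics

T_CRIT_DF38 = 2.024  # 97.5th percentile of t with 38 df; only valid for 2 x 20 observations


def naive_ci_for_d(group_a, group_b, t_crit=T_CRIT_DF38):
    """Naive 95% CI for Cohen's d: raw-difference CI divided by the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    d = (statistics.fmean(group_a) - statistics.fmean(group_b)) / pooled_var ** 0.5
    # The raw standard error is pooled_sd * sqrt(1/n_a + 1/n_b); dividing by
    # pooled_sd treats the SD as if it were known without error.
    se_d = (1 / n_a + 1 / n_b) ** 0.5
    return d - t_crit * se_d, d + t_crit * se_d


def coverage(true_d, n_per_group=20, n_sims=4000, seed=2017):
    """Proportion of simulated naive intervals that contain the true d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        group_a = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        lower, upper = naive_ci_for_d(group_a, group_b)
        hits += lower <= true_d <= upper
    return hits / n_sims


if __name__ == "__main__":
    for true_d in (0.0, 0.5, 2.0):
        print(true_d, coverage(true_d))
```

Under these assumptions, the naive interval should cover a true d of 0 at about the nominal rate, since it contains 0 exactly when the raw interval does, but it should fall short for large effects, where the sampling error in the standard deviation matters more.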