# Blog

I blog about statistics and research design, writing with an audience of researchers in bilingualism, multilingualism, and applied linguistics in mind.

## Latest blog posts

### Suggestions for more informative replication studies

20 November 2017

In recent years, psychologists have started to run large-scale replications of seminal studies. For a variety of reasons, which I won’t go into, this welcome development hasn’t quite made it to research on language learning and bi- and multilingualism. That said, I think it can be interesting to scrutinise how these large-scale replications are conducted. In this blog post, I take a closer look at a replication attempt by O’Donnell et al. with some 4,500 participants that’s currently in press at Psychological Science and make five suggestions as to how I think similar replications could be designed to be even more informative.

### Increasing power and precision using covariates

24 October 2017

A recurring theme in the writings of methodologists over the last years, and indeed decades, is that researchers need to increase the statistical power (and precision) of the studies they conduct. These writers have rightly stressed the need for larger sample sizes, but other design characteristics that affect power and precision have received comparatively little attention. In this blog post, I discuss and demonstrate how, by capitalising on information they collect anyway, researchers can achieve greater power and precision without testing more participants.
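The post itself works in R; purely as a loose illustration of the idea, here is a minimal Python sketch (with simulated data and effect sizes I made up) showing how adjusting for a covariate that predicts the outcome shrinks the standard error of a treatment effect estimate:

```python
import numpy as np

rng = np.random.default_rng(2017)
n = 200

# Hypothetical data: a binary treatment, a covariate that strongly
# predicts the outcome, and an outcome with a modest treatment effect.
treatment = np.repeat([0.0, 1.0], n // 2)
covariate = rng.normal(size=n)
outcome = 0.5 * treatment + 1.5 * covariate + rng.normal(size=n)

def se_of_treatment(X, y):
    """Standard error of the treatment coefficient (column 1) in OLS."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

intercept = np.ones(n)
se_without = se_of_treatment(np.column_stack([intercept, treatment]), outcome)
se_with = se_of_treatment(np.column_stack([intercept, treatment, covariate]), outcome)

print(f"SE without covariate: {se_without:.3f}")
print(f"SE with covariate:    {se_with:.3f}")
```

Because the covariate soaks up outcome variance that would otherwise count as noise, the second standard error is substantially smaller at the same sample size.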

### Confidence interval-based optional stopping

19 September 2017

Stopping data collection early when the provisional results are significant (“optional stopping”) inflates your chances of finding a pattern in the data when nothing is going on. This can be countered using a technique known as sequential testing, but that’s not what this post is about. Instead, I’d like to illustrate that optional stopping isn’t necessarily a problem if your stopping rule doesn’t involve p-values. If you base your decision to collect more data or not on how wide your current confidence interval (or Bayesian credible interval) is, rather than on p-values, peeking at your data can be a reasonable strategy. Additionally, in this post, I want to share some code for `R` functions that you can adapt to simulate the effects of different stopping rules.
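The functions shared in the post are in R; as an illustration of the logic only, here is a minimal Python sketch of such a stopping rule (the batch size, maximum sample size, and width criterion are arbitrary choices of mine): keep collecting data until the 95% confidence interval around the mean is narrow enough.

```python
import numpy as np

rng = np.random.default_rng(42)

def run_until_precise(batch_size=20, max_n=500, max_width=0.5):
    """Collect observations in batches until the 95% CI around the
    sample mean is narrower than max_width (or max_n is reached)."""
    data = np.array([])
    while len(data) < max_n:
        data = np.append(data, rng.normal(loc=0.0, scale=1.0, size=batch_size))
        n = len(data)
        # Normal-approximation half-width of the 95% CI for the mean.
        half_width = 1.96 * data.std(ddof=1) / np.sqrt(n)
        if 2 * half_width < max_width:
            break
    return n, (data.mean() - half_width, data.mean() + half_width)

n_final, ci = run_until_precise()
print(f"Stopped at n = {n_final}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Wrapping this in a loop over many simulated studies lets you tally how often such a width-based rule leads you astray compared with stopping on significant p-values.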

### Creating comparable sets of stimuli

14 September 2017

When designing a study, you sometimes have a pool of candidate stimuli (words, sentences, texts, images etc.) that is too large to present to each participant in its entirety. If you want data for all or at least most stimuli, a possible solution is to split up the pool of stimuli into sets of manageable size and assign each participant to one of these sets. Ideally, you’d want the different sets to be as comparable as possible with respect to a number of relevant characteristics, so that each participant is exposed to about the same diversity of stimuli during the task. For instance, when presenting individual words to participants, you may want each participant to be confronted with a similar distribution of words in terms of their frequency, length, and number of adverbs. In this blog post, I share some `R` code that I used to split up two types of stimuli into sets that are comparable with respect to one or several variables, in the hopes that you can easily adapt it for your own needs.
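The code in the post is in R; the underlying idea can be sketched in Python (the stimulus pool and its frequency values below are made up): repeatedly shuffle the items into equally sized sets and keep the split whose sets differ least on the matching variable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pool of 90 words with a made-up log frequency each.
frequencies = rng.normal(loc=4.0, scale=1.2, size=90)

def best_split(values, n_sets=3, n_tries=2000):
    """Randomly partition items into equally sized sets many times
    and keep the partition whose set means differ least."""
    n = len(values)
    best_assignment, best_spread = None, np.inf
    for _ in range(n_tries):
        assignment = rng.permutation(n) % n_sets
        means = [values[assignment == s].mean() for s in range(n_sets)]
        spread = max(means) - min(means)
        if spread < best_spread:
            best_assignment, best_spread = assignment, spread
    return best_assignment, best_spread

sets, spread = best_split(frequencies)
print(f"Largest difference between set means: {spread:.3f}")
```

Matching on several variables at once works the same way; you just need a spread measure that combines them (e.g., the sum of the standardised spreads per variable).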

### Draft: Replication success as predictive utility

24 August 2017

In recent years, a couple of high-profile, multi-lab attempts to replicate previous findings have been conducted in psychology, but there hasn’t been much consensus on when a replication attempt should be considered to confirm the original finding. When I started dabbling in predictive modelling a couple of months ago, it occurred to me that it could be useful to view replication success in terms of how accurately the original finding predicts the replication: if taking the original finding at face value permits more accurate predictions about the patterns in the replication data than ignoring it does, the original finding can at least be said to contribute to our body of knowledge. This approach, I think, has some key advantages over the methods used to quantify replication success in the multi-lab replication attempts mentioned above.

### Abandoning standardised effect sizes and opening up other roads to power

14 July 2017

Numerical summaries of research findings will typically feature an indication of the sizes of the effects that were studied. These indications are often *standardised* effect sizes, which means that they are expressed relative to the variance in the data rather than in the units in which the variables were measured. Popular standardised effect sizes include Cohen’s *d*, which expresses the mean difference between two groups as a proportion of the pooled standard deviation, and Pearson’s *r*, which expresses the change in one variable, in standard deviation units, that is associated with a one-standard-deviation change in another variable. There exists a rich literature that discusses which standardised effect sizes ought to be used depending on the study’s design, how this or that standardised effect size should be adjusted for this and that bias, and how confidence intervals should be constructed around standardised effect sizes (see the blog post ‘Confidence intervals for standardised mean differences’). But most of this literature *should* be of little importance to the practising scientist, for the simple reason that standardised effect sizes themselves ought to be of little importance.

In what follows, I will defend this point of view, which I’ve outlined in two previous blog posts (‘Why I don’t like standardised effect sizes’ and ‘More on why I don’t like standardised effect sizes’), by sharing some quotes from respected statisticians and methodologists. In particular, I hope to, first, make you think about the detrimental effect that the use of standardised effect sizes has on the interpretability of research findings and the accumulation of knowledge and, second, convince you that using standardised effect sizes to plan studies overly stresses sample size as the main determinant of a study’s power, and that by abandoning them, other roads to power can be opened up.

### Interactions between continuous variables

26 June 2017

Splitting up continuous variables is generally a bad idea. In terms of statistical efficiency, the popular practice of dichotomising continuous variables at their median is comparable to throwing out a third of the dataset. Moreover, statistical models based on split-up continuous variables are prone to misinterpretation: threshold effects are easily read into the results when, in fact, none exist. Splitting up, or ‘binning’, continuous variables, then, is something to avoid. But what if you’re interested in how the effect of one continuous predictor varies according to the value of another continuous predictor? In other words, what if you’re interested in the interaction between two continuous predictors? Binning one of the predictors seems appealing since it makes the model easier to interpret. However, as I’ll show in this blog post, it’s fairly straightforward to fit and interpret interactions between continuous predictors.
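The post works through this in R; as a rough Python sketch (with simulated data and coefficients of my choosing), fitting an interaction between two continuous predictors just means adding their product as a predictor, after which the slope of one predictor at any value of the other can be read off directly:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Simulated data in which the effect of x1 depends on x2.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.3 * x2 - 0.5 * x1 * x2 + rng.normal(size=n)

# Ordinary least squares with the product term included.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b0, b1, b2, b12 = np.linalg.lstsq(X, y, rcond=None)[0]

# The slope of x1 at a given value of x2 is b1 + b12 * x2:
for x2_value in (-1.0, 0.0, 1.0):
    print(f"slope of x1 at x2 = {x2_value:+.0f}: {b1 + b12 * x2_value:.2f}")
```

No binning is needed: the fitted model gives the conditional slope at every value of the second predictor, not just at two artificial groups.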

### Tutorial: Adding confidence bands to effect displays

12 May 2017

In the previous blog post, I demonstrated how you can draw effect displays to render regression models more intelligible to yourself and to your audience. These effect displays did not contain information about the uncertainty inherent to estimating regression models, however. To that end, this blog post demonstrates how you can add confidence bands to effect displays for multiple regression, logistic regression, and logistic mixed-effects models, and explains how these confidence bands are constructed.

### Tutorial: Plotting regression models

23 April 2017

The results of regression models, particularly fairly complex ones, can be difficult to appreciate and hard to communicate to an audience. One useful technique is to plot the effect of each predictor variable on the outcome while holding the other predictor variables constant. Fox (2003) discusses how such **effect displays** are constructed and provides an implementation in the `effects` package for `R`.

Since I think it’s both instructive to see how effect displays are constructed from the ground up and useful to be able to tweak them yourself in `R`, this blog post illustrates how to draw such plots for three increasingly complex statistical models: ordinary multiple regression, logistic regression, and mixed-effects logistic regression. The goal in each of these three examples is to visualise the effects of the predictor variables without factoring in the uncertainty about these effects; visualising such uncertainty will be the topic of a future blog post.
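The post builds these displays in R with help from the `effects` package; the core recipe can be sketched in Python (simulated data, made-up coefficients): fit the model, then predict the outcome over a grid of one predictor while fixing the other predictors at their means.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Simulated data for a two-predictor linear model.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.2 * x1 - 0.7 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Effect display for x1: vary x1 over a grid, hold x2 at its mean.
grid = np.linspace(x1.min(), x1.max(), 50)
X_display = np.column_stack([np.ones(50), grid, np.full(50, x2.mean())])
fitted = X_display @ beta  # these fitted values are what gets plotted
```

For a logistic model, the only extra step is pushing the fitted values through the inverse logit before plotting.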

### Confidence intervals for standardised mean differences

22 February 2017

Standardised effect sizes express patterns found in the data in
terms of the variability found in the data. For instance, a mean difference
in body height could be expressed in the metric in which the data were
measured (e.g., a difference of 4 centimetres) or relative to the
variation in the data (e.g., a difference of 0.9 standard deviations).
The latter is a standardised effect size known as Cohen’s *d*.

As I’ve written before, I don’t particularly like standardised effect sizes. Nonetheless, I wondered how confidence intervals around standardised effect sizes (more specifically: standardised mean differences) are constructed. Until recently, I hadn’t really thought about it and sort of assumed you would compute them the same way as confidence intervals around raw effect sizes. But unlike raw (unstandardised) mean differences, standardised mean differences are a combination of *two* estimates subject to sampling error: the mean difference itself and the sample standard deviation. Moreover, the sample standard deviation is a biased estimate of the population standard deviation (it tends to be too low), which causes Cohen’s *d* to be an upwardly biased estimate of the population standardised mean difference. Surely both of these factors must affect how confidence intervals around standardised effect sizes are constructed?

It turns out that they do indeed. When I computed confidence intervals around a standardised effect size using a naive approach that assumed the standard deviation wasn’t subject to sampling error and wasn’t biased, I got different results than when I used specialised `R` functions. But these `R` functions all produced different results from one another, too.

Obviously, there may well be more than one way to skin a cat, but this
caused me to wonder if the different procedures for computing confidence
intervals all covered the true population parameter with the nominal
probability (e.g., in 95% of cases for a 95% confidence interval).
I ran a simulation to find out, which I’ll report in the remainder of this post.
**If you spot any mistakes, please let me know.**
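The simulation reported in the post is in R; as an illustrative Python sketch of the general approach (a two-group design with a true standardised difference I picked myself), one can simulate many experiments, compute a naive 95% confidence interval around Cohen’s *d* in each, and tally how often the interval covers the true value:

```python
import numpy as np

rng = np.random.default_rng(123)

def naive_d_ci_coverage(n_per_group=20, true_d=0.5, n_sims=2000):
    """Empirical coverage of a naive 95% CI around Cohen's d that
    treats the pooled SD as known and unbiased (it is neither)."""
    covered = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(true_d, 1.0, n_per_group)
        sd_pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        d = (b.mean() - a.mean()) / sd_pooled
        # Naive SE: as if d were a raw mean difference in SD units.
        se = np.sqrt(2 / n_per_group)
        if d - 1.96 * se <= true_d <= d + 1.96 * se:
            covered += 1
    return covered / n_sims

coverage = naive_d_ci_coverage()
print(f"Empirical coverage: {coverage:.3f}")
```

Swapping in any of the specialised interval constructions for the naive one and comparing the resulting coverage proportions is exactly the kind of check the simulation in the post performs.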