# Blog

I blog about statistics and research design, with researchers in bilingualism, multilingualism, and applied linguistics in mind as my audience.

## Latest blog posts

### Confidence interval-based optional stopping

19 September 2017

Stopping data collection early when
the provisional results are significant (“optional stopping”)
inflates your chances of finding
a pattern in the data when
nothing is going on. This can be countered using a technique
known as sequential testing, but that’s not what this post
is about. Instead, I’d like to illustrate that optional stopping
isn’t necessarily a problem if your stopping rule doesn’t
involve p-values. If, instead of on p-values, you base
your decision to collect more data or not on how wide
your current confidence interval (or Bayesian credible interval)
is, peeking at your data can be a reasonable strategy.
Additionally, in this post, I want to share some `R` functions
that you can adapt in order to simulate
the effects of different stopping rules.
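The post’s functions are written in `R`; as a taste of the idea, here is a minimal Python sketch of a precision-based stopping rule (the function and parameter names are my own, for illustration): sampling continues not until the result is significant, but until the 95% confidence interval for the mean is narrower than a target width.

```python
import math
import random
import statistics

def run_until_precise(effect=0.0, min_n=10, max_n=500, target_width=0.5, seed=1):
    """Draw one observation at a time from Normal(effect, 1) and stop as soon
    as the 95% CI for the mean (normal approximation, z = 1.96) is narrower
    than target_width, or when max_n observations have been collected."""
    rng = random.Random(seed)
    xs = []
    while len(xs) < max_n:
        xs.append(rng.gauss(effect, 1.0))
        if len(xs) >= min_n:
            se = statistics.stdev(xs) / math.sqrt(len(xs))
            if 2 * 1.96 * se < target_width:
                break  # precise enough: stop collecting data
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return m, (m - 1.96 * se, m + 1.96 * se), len(xs)
```

Because the rule looks only at the interval’s width, not at whether it excludes zero, stopping early doesn’t systematically favour “significant” samples.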

### Creating comparable sets of stimuli

14 September 2017

When designing a study, you sometimes have a pool of candidate stimuli
(words, sentences, texts, images etc.) that is too large to present to
each participant in its entirety. If you want data for all or at least
most stimuli, a possible solution is to split up the pool of stimuli
into sets of overseeable size and assign each participant to one of the
different sets. Ideally, you’d want the different sets to be as comparable
as possible with respect to a number of relevant characteristics so that
each participant is exposed to about the same diversity of stimuli during
the task. For instance, when presenting individual words to participants,
you may want each participant to be confronted with a similar distribution
of words in terms of their frequency, length, and number of adverbs.
In this blog post I share some `R` code that I used to split up two
types of stimuli into sets that are comparable with respect to one or several
variables, in the hopes that you can easily adapt it for your own needs.
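The post’s code is in `R`; the general recipe can be sketched in Python (this is a guess at one simple approach, not the post’s actual code): shuffle the stimulus pool into equal-sized sets many times and keep the split whose sets have the most similar means on a chosen variable.

```python
import random
import statistics

def comparable_sets(stimuli, key, n_sets=2, n_tries=1000, seed=42):
    """Repeatedly shuffle `stimuli` into n_sets equal-sized sets and keep
    the split whose per-set means of key(stimulus) are most similar."""
    rng = random.Random(seed)
    items = list(stimuli)
    best, best_spread = None, float("inf")
    for _ in range(n_tries):
        rng.shuffle(items)
        sets = [items[i::n_sets] for i in range(n_sets)]
        means = [statistics.mean(key(s) for s in one_set) for one_set in sets]
        spread = max(means) - min(means)
        if spread < best_spread:
            best, best_spread = [list(s) for s in sets], spread
    return best
```

To balance several variables at once, the `spread` score could be replaced with a sum of (suitably scaled) spreads, one per variable.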

### Draft: Replication success as predictive utility

24 August 2017

In recent years, a couple of high-profile, multi-lab attempts to replicate previous findings have been conducted in psychology, but there wasn’t much consensus about when a replication attempt should be considered to confirm the original finding. When I started dabbling in predictive modelling a couple of months ago, I began to think that it could be useful to view replication success in terms of how accurately the original finding predicts the replication: if taking the original finding at face value permits more accurate predictions about the patterns in the replication data than ignoring it does, the original finding can at least be said to contribute to our body of knowledge. This approach, I think, has some key advantages over the methods used to quantify replication success in the multi-lab replication attempts that I mentioned earlier.
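To make the idea concrete, here is a toy Python formalisation of my own (not taken from the draft): score a replication by whether predicting its treatment mean from the original effect estimate beats predicting no difference at all.

```python
import statistics

def predictive_gain(original_effect, replication_control, replication_treatment):
    """Compare how well the original effect estimate predicts the replication's
    group difference versus ignoring it (i.e., predicting no difference).
    Positive return values mean the original finding improved the prediction."""
    ctrl_mean = statistics.mean(replication_control)
    treat_mean = statistics.mean(replication_treatment)
    # Prediction that takes the original finding at face value:
    err_original = (treat_mean - (ctrl_mean + original_effect)) ** 2
    # Prediction that ignores the original finding:
    err_null = (treat_mean - ctrl_mean) ** 2
    return err_null - err_original
```

A fuller treatment would compare out-of-sample predictive accuracy observation by observation, but the contrast between the two predictions is the core of the proposal.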

### Abandoning standardised effect sizes and opening up other roads to power

14 July 2017

Numerical summaries of research findings will typically feature
an indication of the sizes of the effects that were studied.
These indications are often *standardised* effect sizes,
which means that they are expressed relative to the
variance in the data rather than with respect to the units
in which the variables were measured.
Popular standardised effect sizes include Cohen’s *d*, which
expresses the mean difference between two groups as a proportion
of the pooled standard deviation, and Pearson’s *r*,
which expresses the change in one variable, in standard deviation units,
that is associated with a change of one standard deviation in another variable.
There exists a rich literature that discusses which standardised
effect sizes ought to be used depending on the study’s design,
how this or that standardised effect size should be adjusted for this
and that bias,
and how confidence intervals should be constructed around standardised
effect sizes (see the blog post *Confidence intervals for standardised mean differences*).
But most of this literature *should* be of little importance to the practising
scientist, for the simple reason that standardised effect sizes themselves ought to
be of little importance.

In what follows, I will defend this point of view, which I’ve outlined in two previous blog posts (*Why I don’t like standardised effect sizes* and *More on why I don’t like standardised effect sizes*), by sharing some quotes by respected statisticians and methodologists. In particular, I hope to, first, make you think about the detrimental effect that the use of standardised effect sizes has on the interpretability of research findings and the accumulation of knowledge and, second, convince you that using standardised effect sizes to plan studies overly stresses sample size as the main determinant of a study’s power, and that abandoning them opens up other roads to power.
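One way to see the interpretability problem is to compute Cohen’s *d*, as defined above, on two datasets with the same raw difference but different variability; a quick Python illustration (the data are made up):

```python
import math
import statistics

def cohens_d(xs, ys):
    """Cohen's d: the mean difference divided by the pooled standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * statistics.variance(xs)
                  + (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    return (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(pooled_var)
```

The same 4-unit raw difference yields a large *d* in a low-variance sample and a much smaller *d* in a high-variance one, which is exactly why a standardised effect size cannot be interpreted without knowing the variability of the sample it came from.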

### Interactions between continuous variables

26 June 2017

Splitting up continuous variables is generally a bad idea. In terms of statistical efficiency, the popular practice of dichotomising continuous variables at their median is comparable to throwing out a third of the dataset. Moreover, statistical models based on split-up continuous variables are prone to misinterpretation: threshold effects are easily read into the results when, in fact, none exist. Splitting up, or ‘binning’, continuous variables, then, is something to avoid. But what if you’re interested in how the effect of one continuous predictor varies according to the value of another continuous predictor? In other words, what if you’re interested in the interaction between two continuous predictors? Binning one of the predictors seems appealing since it makes the model easier to interpret. However, as I’ll show in this blog post, it’s fairly straightforward to fit and interpret interactions between continuous predictors.
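The post itself works in `R`; as a language-agnostic sketch (all names here are my own), the key point can be shown in Python: with an interaction term, the slope of one predictor is itself a linear function of the other, so after fitting y on x1, x2, and x1·x2, reading off b1 + b3·x2 gives the conditional effect of x1 at any value of x2, no binning required.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations X'X b = X'y,
    solved by Gaussian elimination with partial pivoting."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
         for j in range(p)]
    v = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (v[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return beta

def conditional_slope(beta, x2):
    """Effect of x1 when the model is b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return beta[1] + beta[3] * x2
```

In `R` you would of course just use `lm(y ~ x1 * x2)`; the hand-rolled solver is only there to keep the sketch self-contained.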

### Tutorial: Adding confidence bands to effect displays

12 May 2017

In the previous blog post, I demonstrated how you can draw effect displays to render regression models more intelligible to yourself and to your audience. These effect displays did not contain information about the uncertainty inherent to estimating regression models, however. To that end, this blog post demonstrates how you can add confidence bands to effect displays for multiple regression, logistic regression, and logistic mixed-effects models, and explains how these confidence bands are constructed.
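The tutorial builds its bands in `R`; as a minimal illustration of the construction for the simplest case, here is the textbook pointwise band for simple linear regression, sketched in Python (using a critical value of 2 as a rough stand-in for the exact t quantile):

```python
import math
import statistics

def confidence_band(xs, ys, x0, crit=2.0):
    """Pointwise confidence band for the fitted line of a simple linear
    regression at x0: yhat(x0) +/- crit * s * sqrt(1/n + (x0 - xbar)^2 / Sxx)."""
    n = len(xs)
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))  # residual SD
    yhat = b0 + b1 * x0
    half = crit * s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)
    return yhat - half, yhat, yhat + half
```

The half-width grows with the distance of x0 from the predictor’s mean, which is why such bands are narrowest in the middle of the data. For multiple regression and mixed-effects models the same idea applies, but the standard error of the fitted value comes from the full coefficient covariance matrix.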

### Tutorial: Plotting regression models

23 April 2017

The results of regression models,
particularly fairly complex ones,
can be difficult to appreciate
and hard to communicate to an audience.
One useful technique is to plot
the effect of each predictor variable
on the outcome while holding constant
any other predictor variables.
Fox (2003)
discusses how such **effect displays** are
constructed and provides an implementation
in the `effects` package for `R`.

Since I think it’s both instructive to see how effect displays
are constructed from the ground up
and useful to be able to tweak them yourself in `R`,
this blog post illustrates how to draw such plots
for three increasingly complex statistical models:
ordinary multiple regression,
logistic regression,
and mixed-effects logistic regression.
The goal in each of these three examples
is to visualise the effects of the predictor
variables without factoring in the uncertainty
about these effects;
visualising such uncertainty will be the
topic of a future blog post.
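As a bare-bones illustration of the recipe for a linear model without interactions (names are my own; this is not the `effects` package), an effect display just evaluates the fitted equation over a grid of the focal predictor while every other predictor is held at its sample mean:

```python
import statistics

def effect_display(coefs, data, focal, grid):
    """Predicted outcome for each value in `grid` of the `focal` predictor,
    holding the other predictors in `data` at their sample means.
    `coefs` maps predictor names (plus "intercept") to fitted coefficients;
    `data` maps predictor names to their observed values."""
    means = {p: statistics.mean(values) for p, values in data.items() if p != focal}
    display = []
    for x in grid:
        yhat = coefs["intercept"] + coefs[focal] * x
        yhat += sum(coefs[p] * m for p, m in means.items())
        display.append((x, yhat))
    return display
```

Plotting the resulting (x, yhat) pairs gives the effect display for the focal predictor; for logistic models the linear predictor would additionally be passed through the inverse logit.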

### Confidence intervals for standardised mean differences

22 February 2017

Standardised effect sizes express patterns found in the data in
terms of the variability found in the data. For instance, a mean difference
in body height could be expressed in the metric in which the data were
measured (e.g., a difference of 4 centimetres) or relative to the
variation in the data (e.g., a difference of 0.9 standard deviations).
The latter is a standardised effect size known as Cohen’s *d*.

As I’ve written before,
I don’t particularly like standardised effect sizes.
Nonetheless, I wondered how confidence intervals around standardised
effect sizes (more specifically: standardised mean differences)
are constructed. Until recently, I hadn’t really thought about it
and sort of assumed you would compute them the same way as
confidence intervals around
raw effect sizes. But unlike raw (unstandardised) mean differences,
standardised mean differences are a combination of *two* estimates
subject to sampling error: the mean difference itself
and the sample standard deviation.
Moreover, the sample standard deviation is a biased estimate of
the population standard deviation (it tends to be
too low),
which causes Cohen’s *d* to be an upwardly biased estimate of the
population standardised mean difference.
Surely both of these factors must affect how
the confidence intervals around standardised effect sizes are constructed?

It turns out that indeed they do. When I computed confidence intervals around a standardised effect size using a naive approach, one that assumed the standard deviation was neither subject to sampling error nor biased, I got different results than when I used specialised `R` functions. But these `R` functions all produced different results from one another, too.

Obviously, there may well be more than one way to skin a cat, but this
caused me to wonder if the different procedures for computing confidence
intervals all covered the true population parameter with the nominal
probability (e.g., in 95% of cases for a 95% confidence interval).
I ran a simulation to find out, which I’ll report in the remainder of this post.
**If you spot any mistakes, please let me know.**
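The simulation itself lives in the post and uses `R`; the skeleton of such a coverage check is simple enough to sketch generically in Python (my own version, not the post’s code): simulate many samples from a population with a known parameter, compute the interval each time, and count how often it covers the truth.

```python
import math
import random
import statistics

def coverage(sim_fn, ci_fn, true_value, n_sims=2000, seed=7):
    """Fraction of simulated samples whose interval ci_fn(sample)
    covers true_value; should be near the nominal level (e.g., 0.95)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        lo, hi = ci_fn(sim_fn(rng))
        if lo <= true_value <= hi:
            hits += 1
    return hits / n_sims
```

Plugging in a simulator for two-group data and each of the competing confidence-interval procedures for Cohen’s *d* then reveals which procedures attain their nominal coverage.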

### Which predictor is most important? Predictive utility vs. construct importance

15 February 2017

Every so often, I’m asked for my two cents on a correlational study in which the researcher wants to find out which of a set of predictor variables is the most important one. For instance, they may have the results of an intelligence test, of a working memory task and of a questionnaire probing their participants’ motivation for learning French, and they want to find out which of these three is the most important factor in acquiring a nativelike French accent, as measured using a pronunciation task. As I will explain below, research questions such as these can be interpreted in two ways, and whether they can be answered sensibly depends on the interpretation intended.

### Automatise repetitive tasks

31 January 2017

Research often involves many repetitive tasks. For an ongoing project, for instance, we needed to replace all stylised apostrophes (’) with straight apostrophes (') in some 3,000 text files when preparing the texts for the next step. As another example, you may need to split up a bunch of files into different directories depending on, say, the character in the file name just before the extension. When done by hand, such tasks are as mind-numbing and time-consuming as they sound – perhaps you would do them on a Friday afternoon while listening to music, or outsource them to a student assistant. My advice, though, is this: try to automatise repetitive tasks.

Doing repetitive tasks is what computers are for, so rather than spending several hours learning nothing, I suggest you spend that time writing a script or putting together a command-line call that does the task for you. If you have little experience doing this, it will take time at first. In fact, I reckon I often spend roughly the same amount of time trying to automatise menial tasks as it would have taken me to do them by hand. But in the not-so-long run, automatisation is a time-saver: once you have a working script, you can tweak and reuse it. Additionally, while you’re figuring out how to automatise a menial chore, you’re actually learning something useful. The chores become more of a challenge and less mind-numbing. I’m going to present an example or two of what I mean and conclude by giving some general pointers.
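For instance, the apostrophe chore mentioned above takes only a few lines, here sketched in Python (we may well have done it differently at the time; the directory layout and glob pattern are illustrative):

```python
from pathlib import Path

def straighten_apostrophes(directory, pattern="*.txt"):
    """Replace stylised apostrophes (U+2019) with straight ones (')
    in every file under `directory` matching `pattern`.
    Returns the number of files that were changed."""
    changed = 0
    for path in Path(directory).rglob(pattern):
        text = path.read_text(encoding="utf-8")
        fixed = text.replace("\u2019", "'")
        if fixed != text:
            path.write_text(fixed, encoding="utf-8")
            changed += 1
    return changed
```

The same shape – walk the files, transform, write back – covers the file-sorting example too; only the transformation in the middle changes.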