I blog about statistics and research design with an audience consisting of researchers in bilingualism, multilingualism, and applied linguistics in mind.
Latest blog posts
16 April 2019
I wrote an R function that’s hopefully useful to teach students what significance tests do and how they can and can’t be interpreted.
11 April 2019
Beginning analysts tend to be overly anxious about the assumptions of their statistical models. This observation is the point of departure of my tutorial Checking the assumptions of your statistical model without getting paranoid, but it’s probably too general. It’d be more accurate to say that beginning analysts who e-mail me about possible assumption violations and who read tutorials on statistics are overly anxious about model assumptions. (Of course, there are beginning as well as seasoned researchers who hardly ever worry about model assumptions, but they’re unlikely to read papers and blog posts about model assumptions.)
Some anxiety about assumptions is desirable if it results in more careful analyses. But if it leads researchers to abandon attempts to model their data, to resort to arcane modelling techniques with little added value, or to reject outright the results of other researchers’ modelling attempts because they didn’t resort to some arcane model, it is counterproductive. I suspect that part of what causes anxiety about assumptions is that these assumptions tend to be interpreted as mathematical requirements that, if violated, void any inferential guarantees the model may offer. Here, I will take a different perspective: often, the main problem when some model assumptions are clearly violated is not that the inferences won’t be approximately correct but rather that they may not be as relevant.
14 January 2019
An analytical procedure may have excellent long-run properties but still produce nonsensical results in individual cases. I recently encountered a real-life illustration of this, but since those data aren’t mine, I’ll use simulated data with similar characteristics for this blog post.
20 December 2018
Interpreting models that take into account a host of possible
interactions between predictor variables can be a pain, especially
when some of the predictors contain more than two levels.
In this post, I show how I went about fitting and then making sense
of a multilevel model containing a three-way interaction between
its categorical fixed-effect predictors. To this end, I used
which makes it relatively easy to fit Bayesian models using a notation
that hardly differs from the one used in the popular
I won’t discuss the Bayesian bit much here (I don’t think it’s too important),
and I will instead cover the following points:
- How to fit a multilevel model with
R’s default way of handling categorical predictors (treatment coding).
- How to interpret this model’s fixed parameter estimates.
- How to visualise the modelled effects.
- How to recode predictors to obtain more useful parameter estimates.
- How to extract information from the model to home in on specific comparisons.
29 October 2018
In this blog post, I take a closer look at the results of a classic study I sometimes discuss in my classes on second language acquisition. As I’ll show below, the strength of this study’s findings is considerably exaggerated, presumably owing to a mechanical error.
26 September 2018
I’ve put my first R package on GitHub!
cannonball and contains a couple of functions that I use for teaching;
perhaps others will follow.
12 September 2018
I’ve written a paper titled Checking the assumptions of your statistical model without getting paranoid and I’d like to solicit your feedback. The paper is geared towards beginning analysts, so I’m particularly interested in hearing from readers who don’t consider themselves expert statisticians if there is anything that isn’t entirely clear to them. If you’re a more experienced analyst and you spot an error in the paper or accompanying tutorial, I’d be grateful if you could let me know, too, of course.
27 July 2018
In this follow-up to the blog post Baby steps in Bayes: Piecewise regression, I’m going to try to model the relationship between two continuous variables using a piecewise regression with not one but two breakpoints. (The rights to the movie about the first installment are still up for grabs, incidentally.)
6 July 2018
I’m currently working on a large longitudinal project as a programmer/analyst. Most of the data are collected using paper/pencil tasks and questionnaires and need to be entered into the database by student assistants. In previous projects, this led to some minor irritations since some assistants occasionally entered some words with capitalisation and others without, or they inadvertently added a trailing space to the entry, or used participant IDs that didn’t exist – all small things that cause difficulties during the analysis.
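To make these irritations concrete, here is a minimal sketch of the kind of clean-up they force on you at analysis time. It is written in Python purely for illustration; the function names `normalise_entry` and `unknown_ids` are my own inventions for this example and are not part of the platform, and whether folding case is appropriate will depend on the task.

```python
def normalise_entry(raw: str) -> str:
    """Trim surrounding whitespace and lowercase the entry.

    Assumption: capitalisation carries no information for this task.
    """
    return raw.strip().lower()


def unknown_ids(entered_ids, known_ids):
    """Return the entered participant IDs that were never created."""
    known = {pid.strip() for pid in known_ids}
    return [pid for pid in entered_ids if pid.strip() not in known]


# Two assistants entering the same response differently.
print(normalise_entry("Haus ") == normalise_entry("haus"))
# An ID that doesn't exist in the list of created participants.
print(unknown_ids(["4.01.07", "4.01.99"], ["4.01.07"]))
```

Catching these problems when the data are entered, rather than patching them afterwards like this, is the point of the platform.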
Anyway, you can download a slimmed-down version
of this platform
The comments in the PHP files should tell you what
I try to accomplish; if something’s not clear, there’s
a comment section at the bottom of this page.
You’ll need a webserver that supports PHP,
and you’ll need to change the permissions of the
You can also check out the demo.
To log in, use one of the following
(You can change the accepted e-mail addresses in
The password is
Then enter some data. You can only enter
data for participants you’ve already created an ID for,
though. For this project, the participant IDs
consist of the number 4 or 5 (= the participant’s grade), followed by a dot,
followed by a two-digit number between 0 and 39 (= the participant’s class),
followed by a dot and another two-digit number between
0 and 99. The entry for
Grade needs to match the
first number in
If you enter task data for a participant for whom
someone has already entered task data at that data collection wave,
you’ll receive an error. You can override this error
by ticking the
Correct existing entry? box at the bottom.
This doesn’t overwrite the existing entry, but adds the
new entry, which is flagged as the accurate one.
During the analysis, you can then filter out data that
was later updated.
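The two rules described above (the ID format and the handling of corrected entries) can be sketched as follows. This is an illustration in Python, not the platform's PHP code. The field names (`participant_id`, `wave`, `task`, `is_correction`) and both function names are assumptions made up for this example, and I'm assuming that entries arrive in the order in which they were entered.

```python
import re

# Grade (4 or 5), a dot, a two-digit class (00-39), a dot, two more digits (00-99).
ID_PATTERN = re.compile(r"([45])\.([0-3]\d)\.(\d{2})")


def id_is_valid(participant_id: str, grade: int) -> bool:
    """Check the ID format and that its first number matches the grade entered."""
    match = ID_PATTERN.fullmatch(participant_id)
    return match is not None and int(match.group(1)) == grade


def drop_superseded(entries):
    """Keep one entry per participant, wave and task.

    An entry flagged as a correction replaces whatever was stored before it;
    with several corrections, the one entered last is kept.
    """
    current = {}
    for entry in entries:
        key = (entry["participant_id"], entry["wave"], entry["task"])
        if key not in current or entry["is_correction"]:
            current[key] = entry
    return list(current.values())


entries = [
    {"participant_id": "4.07.23", "wave": 1, "task": "reading",
     "score": 12, "is_correction": False},
    {"participant_id": "4.07.23", "wave": 1, "task": "reading",
     "score": 14, "is_correction": True},
]
print(id_is_valid("4.07.23", 4))
print(drop_superseded(entries))
```

Because the original entry is kept in the database rather than overwritten, a filter like this can be rerun at any time, and you can still check what was changed.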
Hopefully this is of some use to some of you!
4 July 2018
Inspired by Richard McElreath’s excellent book Statistical rethinking: A Bayesian course with examples in R and Stan, I’ve started dabbling in Bayesian statistics. In essence, Bayesian statistics is an approach to statistical inference in which the analyst specifies a generative model for the data (i.e., an equation that describes the factors they suspect gave rise to the data) as well as (possibly vague) relevant information or beliefs that are external to the data proper. This information or these beliefs are then adjusted in light of the data observed.
I’m hardly an expert in Bayesian statistics (or the more commonly encountered ‘orthodox’ or ‘frequentist’ statistics, for that matter), but I’d like to understand it better – not only conceptually, but also in terms of how the statistical model should be specified. While quite a few statisticians and methodologists tout Bayesian statistics for a variety of reasons, my interest is primarily piqued by the prospect of being able to tackle problems that would be impossible or at least awkward to tackle with the tools I’m pretty comfortable with at the moment.
In order to gain some familiarity with Bayesian statistics, I plan to set myself a couple of problems and track my efforts in solving them here in a Dear diary fashion. Perhaps someone else finds them useful, too.
The first problem that I’ll tackle is fitting a regression model in which the relationship between the predictor and the outcome may contain a breakpoint at one unknown predictor value. One domain in which such models are useful is testing hypotheses that claim that the relationship between the age of onset of second language acquisition (AOA) and the level of ultimate attainment in that second language flattens after a certain age (typically puberty). It’s possible to fit frequentist breakpoint models, but estimating the breakpoint age is a bit cumbersome (see blog post Calibrating p-values in ‘flexible’ piecewise regression models). But in a Bayesian approach, it should be possible to estimate both the regression parameters and the breakpoint itself in the same model. That’s what I’ll try here.
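To make clear what kind of relationship such a model describes, here is a minimal sketch of the mean function of a regression with a single breakpoint, written in Python for illustration. The parameter names and the numbers are hypothetical, and this is just one common way of parameterising the model (two line segments that join at the breakpoint), not necessarily the one used in the post itself.

```python
def piecewise_mean(x, intercept, slope_before, slope_after, break_at):
    """Expected outcome at predictor value x for a one-breakpoint regression.

    The two segments meet at break_at, so the function is continuous there.
    """
    if x < break_at:
        return intercept + slope_before * x
    # Past the breakpoint: start from the value reached at break_at
    # and continue with the second slope.
    return intercept + slope_before * break_at + slope_after * (x - break_at)


# Made-up example: attainment drops by 2 points per year of AOA until age 12,
# then stays flat.
for aoa in (5, 12, 20):
    print(aoa, piecewise_mean(aoa, intercept=100, slope_before=-2,
                              slope_after=0, break_at=12))
```

In the frequentist approach, `break_at` typically has to be fixed or searched over separately, whereas in the Bayesian approach it is treated as just another parameter to be estimated alongside the intercept and slopes.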