I blog about statistics and research design, writing with researchers in bilingualism, multilingualism, and applied linguistics in mind.


Latest blog posts

Walkthrough: A significance test for a two-group comparison

16 April 2019

I wrote an R function that I hope is useful for teaching students what significance tests do and how they can and can’t be interpreted.


Before worrying about model assumptions, think about model relevance

11 April 2019

Beginning analysts tend to be overly anxious about the assumptions of their statistical models. This observation is the point of departure of my tutorial Checking the assumptions of your statistical model without getting paranoid, but it’s probably too general. It’d be more accurate to say that beginning analysts who e-mail me about possible assumption violations and who read tutorials on statistics are overly anxious about model assumptions. (Of course, there are beginning as well as seasoned researchers who hardly ever worry about model assumptions, but they’re unlikely to read papers and blog posts about model assumptions.)

Some anxiety about assumptions is desirable if it results in more careful analyses. But if it leads researchers to abandon attempts to model their data, to resort to arcane modelling techniques with little added value, or to reject outright the results of other researchers’ modelling attempts because they didn’t resort to some arcane model, it is counterproductive. I suspect that part of what causes anxiety about assumptions is that these assumptions tend to be interpreted as mathematical requirements that, if violated, vacate any inferential guarantees the model may offer. Here, I will take a different perspective: often, the main problem when some model assumptions are clearly violated is not that the inferences won’t be approximately correct but rather that they may not be as relevant.


Guarantees in the long run vs. interpreting the data at hand: Two analyses of clustered data

14 January 2019

An analytical procedure may have excellent long-run properties but still produce nonsensical results in individual cases. I recently encountered a real-life illustration of this, but since those data aren’t mine, I’ll use simulated data with similar characteristics for this blog post.


Baby steps in Bayes: Recoding predictors and homing in on specific comparisons

20 December 2018

Interpreting models that take into account a host of possible interactions between predictor variables can be a pain, especially when some of the predictors have more than two levels. In this post, I show how I went about fitting and then making sense of a multilevel model containing a three-way interaction between its categorical fixed-effect predictors. To this end, I used the brms package, which makes it relatively easy to fit Bayesian models using a notation that hardly differs from the one used in the popular lme4 package. I won’t discuss the Bayesian bit much here (I don’t think it’s too important), and I will instead cover the following points:

  1. How to fit a multilevel model with brms using R’s default way of handling categorical predictors (treatment coding).
  2. How to interpret this model’s fixed parameter estimates.
  3. How to visualise the modelled effects.
  4. How to recode predictors to obtain more useful parameter estimates.
  5. How to extract information from the model to home in on specific comparisons.


A closer look at a classic study (Bailey et al. 1974)

29 October 2018

In this blog post, I take a closer look at the results of a classic study I sometimes discuss in my classes on second language acquisition. As I’ll show below, the strength of this study’s findings is greatly exaggerated, presumably owing to a mechanical error.


Introducing cannonball - Tools for teaching statistics

26 September 2018

I’ve put my first R package on GitHub! It’s called cannonball and contains a couple of functions that I use for teaching; perhaps others will follow.


Looking for comments on a paper on model assumptions

12 September 2018

I’ve written a paper titled Checking the assumptions of your statistical model without getting paranoid and I’d like to solicit your feedback. The paper is geared towards beginning analysts, so I’m particularly interested in hearing from readers who don’t consider themselves expert statisticians if there is anything that isn’t entirely clear to them. If you’re a more experienced analyst and you spot an error in the paper or accompanying tutorial, I’d be grateful if you could let me know, too, of course.


Baby steps in Bayes: Piecewise regression with two breakpoints

27 July 2018

In this follow-up to the blog post Baby steps in Bayes: Piecewise regression, I’m going to try to model the relationship between two continuous variables using a piecewise regression with not one but two breakpoints. (The rights to the movie about the first installment are still up for grabs, incidentally.)


A data entry form with sanity checks

6 July 2018

I’m currently working on a large longitudinal project as a programmer/analyst. Most of the data are collected using paper/pencil tasks and questionnaires and need to be entered into the database by student assistants. In previous projects, this led to some minor irritations: some assistants occasionally entered words with capitalisation and others without, inadvertently added a trailing space to an entry, or used participant IDs that didn’t exist – all small things that cause difficulties during the analysis.

To reduce the chances of such mishaps in the current project, I created an on-line platform that uses HTML, JavaScript and PHP to homogenise how research assistants enter data and that throws errors and warnings when they enter impossible data. Nothing that will make my name pop up at Google board meetings, but useful enough.

Anyway, you can download a slimmed-down version of this platform here. The comments in the PHP files should tell you what I try to accomplish; if something’s not clear, there’s a comment section at the bottom of this page. You’ll need a webserver that supports PHP, and you’ll need to change the permissions of the Data directory to 777.

You can also check out the demo. To log in, use one of the following e-mail addresses: first.assistant@university.ch, second.assistant@university.ch, third.assistant@university.ch. (You can change the accepted e-mail addresses in index.php). The password is projectpassword.

Then enter some data. You can only enter data for participants you’ve already created an ID for, though. For this project, the participant IDs consist of the number 4 or 5 (= the participant’s grade), followed by a dot, followed by a two-digit number between 00 and 39 (= the participant’s class), followed by a dot and another two-digit number between 00 and 99. The entry for Grade needs to match the first number in the ID.
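To give you an idea of what such an ID check might look like, here’s a minimal JavaScript sketch; the function name and the exact regular expression are my own illustration, not the code from the download:

```javascript
// Validate a participant ID of the form G.CC.NN, where
// G is the grade (4 or 5), CC is a two-digit class number (00-39),
// and NN is a two-digit number (00-99). The Grade field entered
// alongside the ID must match the first number in the ID.
function isValidParticipantId(id, grade) {
  const match = /^([45])\.([0-3][0-9])\.([0-9]{2})$/.exec(id);
  if (match === null) return false;
  return Number(match[1]) === grade;
}
```

For example, `isValidParticipantId("4.07.13", 4)` passes, whereas `"5.41.13"` fails because 41 isn’t a valid class number.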

If you enter task data for a participant for whom someone has already entered task data at that data collection wave, you’ll receive an error. You can override this error by ticking the Correct existing entry? box at the bottom. This doesn’t overwrite the existing entry, but adds the new entry, which is flagged as the accurate one. During the analysis, you can then filter out entries that were later corrected.
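As an illustration of this flag-and-filter logic, here’s a hypothetical in-memory sketch in JavaScript (the platform itself implements this server-side in PHP; the names below are my own):

```javascript
// Each entry records the participant, the data collection wave, the data,
// and whether it is (still) considered the accurate entry.
const entries = [];

function addEntry(participant, wave, data, correctExisting = false) {
  const existing = entries.filter(e => e.participant === participant && e.wave === wave);
  if (existing.length > 0 && !correctExisting) {
    throw new Error("Task data already entered for this participant at this wave.");
  }
  // A correction doesn't overwrite anything: the earlier entries are merely
  // flagged as superseded, and the new entry is appended as the accurate one.
  existing.forEach(e => { e.accurate = false; });
  entries.push({ participant, wave, data, accurate: true });
}

// During the analysis, keep only the entries still flagged as accurate.
const accurateEntries = () => entries.filter(e => e.accurate);
```

Keeping the superseded entries around (rather than deleting them) leaves an audit trail of what was corrected and when.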

Hopefully this is of some use to some of you!


Baby steps in Bayes: Piecewise regression

4 July 2018

Inspired by Richard McElreath’s excellent book Statistical rethinking: A Bayesian course with examples in R and Stan, I’ve started dabbling in Bayesian statistics. In essence, Bayesian statistics is an approach to statistical inference in which the analyst specifies a generative model for the data (i.e., an equation that describes the factors they suspect gave rise to the data) as well as (possibly vague) relevant information or beliefs that are external to the data proper. This information or these beliefs are then adjusted in light of the data observed.

I’m hardly an expert in Bayesian statistics (or the more commonly encountered ‘orthodox’ or ‘frequentist’ statistics, for that matter), but I’d like to understand it better – not only conceptually, but also in terms of how the statistical model should be specified. While quite a few statisticians and methodologists tout Bayesian statistics for a variety of reasons, my interest is primarily piqued by the prospect of being able to tackle problems that would be impossible or at least awkward to tackle with the tools I’m pretty comfortable with at the moment.

In order to gain some familiarity with Bayesian statistics, I plan to set myself a couple of problems and track my efforts in solving them here in a Dear diary fashion. Perhaps someone else finds them useful, too.

The first problem that I’ll tackle is fitting a regression model in which the relationship between the predictor and the outcome may contain a breakpoint at one unknown predictor value. One domain in which such models are useful is testing hypotheses that claim that the relationship between the age of onset of second language acquisition (AOA) and the level of ultimate attainment in that second language flattens after a certain age (typically puberty). It’s possible to fit frequentist breakpoint models, but estimating the breakpoint age is a bit cumbersome (see the blog post Calibrating p-values in ‘flexible’ piecewise regression models). In a Bayesian approach, however, it should be possible to estimate both the regression parameters and the breakpoint itself in the same model. That’s what I’ll try here.