# Blog

I blog about statistics and research design, with researchers in bilingualism, multilingualism, and applied linguistics in mind.

## Latest blog posts

### Walkthrough: A significance test for a two-group comparison

16 April 2019

I wrote an R function that’s hopefully useful to teach students what significance tests do and how they can and can’t be interpreted.

### Before worrying about model assumptions, think about model relevance

11 April 2019

Beginning analysts tend to be overly anxious about the assumptions
of their statistical models.
This observation is the point of departure of my tutorial
*Checking the assumptions of your statistical model without getting paranoid*,
but it’s probably too general. It’d be more accurate to say that beginning analysts
who e-mail me about possible assumption violations and who read tutorials on statistics are
overly anxious about model assumptions.
(Of course, there are beginning as well as seasoned researchers who hardly ever worry
about model assumptions, but they’re unlikely to read papers and blog posts about model assumptions.)

Some anxiety about assumptions is desirable if it results in more careful analyses.
But if it leads researchers to abandon attempts to model their data, to resort to arcane
modelling techniques with little added value, or to reject outright the results of other
researchers’ modelling attempts because *they* didn’t resort to some arcane model, it is counterproductive.
I suspect that part of what causes anxiety about assumptions is that these assumptions
tend to be interpreted as mathematical requirements that, if violated, vacate any
inferential guarantees the model may offer.
Here, I will take a different perspective:
often, the main problem when some model assumptions
are clearly violated is not that the inferences won’t be
approximately *correct* but rather that they may not be as *relevant*.

### Guarantees in the long run vs. interpreting the data at hand: Two analyses of clustered data

14 January 2019

An analytical procedure may have excellent long-run properties but still produce nonsensical results in individual cases. I recently encountered a real-life illustration of this, but since those data aren’t mine, I’ll use simulated data with similar characteristics for this blog post.

### Baby steps in Bayes: Recoding predictors and homing in on specific comparisons

20 December 2018

Interpreting models that take into account a host of possible
interactions between predictor variables can be a pain, especially
when some of the predictors contain more than two levels.
In this post, I show how I went about fitting and then making sense
of a multilevel model containing a three-way interaction between
its categorical fixed-effect predictors. To this end, I used
the `brms` package,
which makes it relatively easy to fit Bayesian models using a notation
that hardly differs from the one used in the popular `lme4` package.
I won’t discuss the Bayesian bit much here (I don’t think it’s too important),
and I will instead cover the following points:

- How to fit a multilevel model with `brms` using `R`’s default way of handling categorical predictors (treatment coding).
- How to interpret this model’s fixed parameter estimates.
- How to visualise the modelled effects.
- How to recode predictors to obtain more useful parameter estimates.
- How to extract information from the model to home in on specific comparisons.

### A closer look at a classic study (Bailey et al. 1974)

29 October 2018

In this blog post, I take a closer look at the results of a classic study I sometimes discuss in my classes on second language acquisition. As I’ll show below, the strength of this study’s findings is strongly exaggerated, presumably owing to a mechanical error.

### Introducing cannonball - Tools for teaching statistics

26 September 2018

I’ve put my first R package on GitHub!
It’s called `cannonball` and contains a couple of functions that I use for teaching;
perhaps others will follow.

### Looking for comments on a paper on model assumptions

12 September 2018

I’ve written a paper titled *Checking the assumptions of your statistical model without getting paranoid* and I’d like to solicit your feedback.
The paper is geared towards beginning analysts, so I’m particularly
interested in hearing from readers who don’t consider themselves
expert statisticians if there is anything that isn’t entirely clear to them.
If you’re a more experienced analyst and you spot an error in the paper
or accompanying tutorial, I’d be grateful if you could let me know, too, of course.

### Baby steps in Bayes: Piecewise regression with two breakpoints

27 July 2018

In this follow-up to the blog post *Baby steps in Bayes: Piecewise regression*,
I’m going to try to model the relationship between two continuous variables
using a piecewise regression with not one but two breakpoints.
(The rights to the movie about the first installment are still up for grabs, incidentally.)

### A data entry form with sanity checks

6 July 2018

I’m currently working on a large longitudinal project as a programmer/analyst. Most of the data are collected using paper/pencil tasks and questionnaires and need to be entered into the database by student assistants. In previous projects, this led to some minor irritations since some assistants occasionally entered some words with capitalisation and others without, or they inadvertently added a trailing space to the entry, or used participant IDs that didn’t exist – all small things that cause difficulties during the analysis.

To reduce the chances of such mishaps in the current project, I created an on-line platform that uses HTML, JavaScript and PHP to homogenise how research assistants can enter data and that throws errors and warnings when they enter impossible data. Nothing that will make my name pop up at Google board meetings, but useful enough.
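The kind of homogenising described above can be sketched in JavaScript as follows. This is a minimal illustration rather than the platform’s actual code: the function name `normaliseEntry` and the decision to lowercase every entry are assumptions.

```javascript
// Hypothetical helper, not taken from the platform's source code.
// Trims leading/trailing whitespace, collapses internal runs of whitespace
// to a single space, and lowercases the entry, so that "Apple " and "apple"
// end up as the same value.
function normaliseEntry(raw) {
  return raw.trim().replace(/\s+/g, " ").toLowerCase();
}

console.log(normaliseEntry("  Apple ") === normaliseEntry("apple")); // true
```

Whether lowercasing is appropriate depends on the task; for entries where capitalisation carries information, trimming alone may be the safer choice.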

Anyway, you can download a slimmed-down version
of this platform
here.
The comments in the PHP files should tell you what
I try to accomplish; if something’s not clear, there’s
a comment section at the bottom of this page.
You’ll need a webserver that supports PHP,
and you’ll need to change the permissions of the `Data` directory to `777`.

You can also check out the demo.
To log in, use one of the following
e-mail addresses:
`first.assistant@university.ch`, `second.assistant@university.ch`, `third.assistant@university.ch`.
(You can change the accepted e-mail addresses in `index.php`.)
The password is `projectpassword`.

Then enter some data. You can only enter
data for participants you’ve already created an ID for,
though. For this project, the participant IDs
consist of the number 4 or 5 (= the participant’s grade), followed by a dot,
followed by a two-digit number between 0 and 39 (= the participant’s class),
followed by a dot and another two-digit number between
0 and 99. The entry for `Grade` needs to match the
first number in `ID`.
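The ID rules above could be checked with a regular expression along the following lines. This is a sketch under stated assumptions, not the check the platform actually runs: the name `checkParticipantId` and the shape of its return value are hypothetical, and the pattern assumes that numbers below 10 are zero-padded (e.g. `07`).

```javascript
// Hypothetical check, not taken from the platform's source code.
// Pattern: grade (4 or 5), dot, class (00-39), dot, two more digits (00-99).
// Assumes single-digit numbers are zero-padded.
const ID_PATTERN = /^([45])\.([0-3][0-9])\.([0-9]{2})$/;

function checkParticipantId(id, grade) {
  const match = ID_PATTERN.exec(id);
  if (match === null) {
    return { ok: false, message: "ID does not follow the expected format." };
  }
  // The Grade entry has to match the first number in the ID.
  if (String(grade) !== match[1]) {
    return { ok: false, message: "Grade does not match the first number in ID." };
  }
  return { ok: true, message: "" };
}

console.log(checkParticipantId("4.07.23", 4).ok); // true
console.log(checkParticipantId("4.07.23", 5).ok); // false
```

Note that this only checks the format; whether an ID has already been created for a participant is a separate lookup.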

If you enter task data for a participant for whom
someone has already entered task data at that data collection wave,
you’ll receive an error. You can override this error
by ticking the `Correct existing entry?` box at the bottom.
This doesn’t overwrite the existing entry, but adds the
new entry, which is flagged as the accurate one.
During the analysis, you can then filter out data that
were later updated.
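The filtering step could look roughly like this. The record layout is invented for illustration (the field names `id`, `wave`, `task`, `score` and `correction` are assumptions, as is the example data), and the sketch assumes that entries are stored in the order in which they were entered.

```javascript
// Hypothetical record layout; the field names and values are made up.
const entries = [
  { id: "4.07.23", wave: 1, task: "reading", score: 12, correction: false },
  { id: "4.07.23", wave: 1, task: "reading", score: 14, correction: true },
  { id: "5.12.03", wave: 1, task: "reading", score: 9, correction: false },
];

// Keep one entry per participant, wave and task: the last one entered.
// This relies on the assumption that a flagged correction always comes
// after the entry it supersedes.
function dropSupersededEntries(rows) {
  const latest = new Map();
  for (const row of rows) {
    const key = `${row.id}|${row.wave}|${row.task}`;
    latest.set(key, row);
  }
  return Array.from(latest.values());
}

const cleaned = dropSupersededEntries(entries);
console.log(cleaned.length); // 2
```

Because nothing is overwritten at entry time, the superseded rows remain available if you later want to check what was corrected.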

Hopefully this is of some use to some of you!

### Baby steps in Bayes: Piecewise regression

4 July 2018

Inspired by Richard McElreath’s excellent book
*Statistical rethinking: A Bayesian course with examples in R and Stan*,
I’ve started dabbling in Bayesian statistics.
In essence, Bayesian statistics is an approach to statistical inference
in which the analyst specifies a generative model for the data
(i.e., an equation that describes the factors they suspect gave rise to the
data) as well as (possibly vague) relevant information or beliefs
that are external to the data proper. This information or these
beliefs are then adjusted in light of the data observed.

I’m hardly an expert in Bayesian statistics (or the more commonly encountered ‘orthodox’ or ‘frequentist’ statistics, for that matter), but I’d like to understand it better – not only conceptually, but also in terms of how the statistical model should be specified. While quite a few statisticians and methodologists tout Bayesian statistics for a variety of reasons, my interest is primarily piqued by the prospect of being able to tackle problems that would be impossible or at least awkward to tackle with the tools I’m pretty comfortable with at the moment.

In order to gain some familiarity with Bayesian statistics, I plan
to set myself a couple of problems and track my efforts in solving
them here in a *Dear diary* fashion. Perhaps someone else finds them
useful, too.

The first problem that I’ll tackle is fitting a regression model
in which the relationship between the predictor and the outcome
may contain a breakpoint at one unknown predictor value. One domain
in which such models are useful is in testing hypotheses that claim
that the relationship between the age of onset of second language
acquisition (AOA) and the level of ultimate attainment in that second
language flattens after a certain age (typically puberty).
It’s possible to fit frequentist breakpoint models, but
estimating the breakpoint age is a bit cumbersome (see blog post
*Calibrating p-values in ‘flexible’ piecewise regression models*).
But in a Bayesian approach, it should be possible to estimate
both the regression parameters and the breakpoint itself
in the same model. That’s what I’ll try here.
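One common way to write down a model with a single breakpoint is shown below. The notation is illustrative only (the symbols are assumptions and need not match the parametrisation used in the full post): $x_i$ is the predictor (e.g., AOA), $y_i$ the outcome, $\tau$ the unknown breakpoint, and $\sigma$ the residual standard deviation.

```latex
y_i = \alpha + \beta_1 \min(x_i - \tau,\, 0) + \beta_2 \max(x_i - \tau,\, 0) + \varepsilon_i,
\qquad \varepsilon_i \sim \mathrm{Normal}(0, \sigma)
```

Here $\alpha$ is the expected outcome at the breakpoint, $\beta_1$ is the slope before the breakpoint and $\beta_2$ the slope after it, so a relationship that flattens after a certain age corresponds to $\beta_2$ being closer to zero than $\beta_1$. In a Bayesian model, $\tau$ would be given a prior and estimated alongside the other parameters.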