Jan Vanhove :: Blog

I blog about statistics and research design, with an audience of researchers in bilingualism, multilingualism and applied linguistics in mind.

Latest blog posts

Exact significance tests for 2 × 2 tables

R
significance
Two-by-two contingency tables look so simple that you’d be forgiven for thinking they’re straightforward to analyse. A glance at the statistical literature on the analysis of contingency tables, however, reveals a plethora of techniques and controversies surrounding them that will quickly disabuse you of this notion (see, for instance, Fagerland et al. 2017). In this blog post, I discuss a handful of different study designs that give rise to two-by-two tables and present a few exact significance tests that can be applied to these tables. A more exhaustive overview can be found in Fagerland et al. (2017).
Sep 10, 2024
Jan Vanhove
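The best known of these exact tests, Fisher's exact test, conditions on both margins of the table. As a rough illustration of the logic (my own Python sketch, not code from the post, which works in R), the two-sided p-value sums the hypergeometric probabilities of all tables with the same margins that are no more probable than the observed one:

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Conditions on both margins: row1 = a + b, col1 = a + c, n = a + b + c + d.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):  # P(upper-left cell = x | both margins fixed)
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = p_table(a)
    lo = max(0, row1 + col1 - n)   # smallest feasible upper-left cell
    hi = min(row1, col1)           # largest feasible upper-left cell
    # small tolerance guards against floating-point ties
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-7))
```

For the table [[3, 1], [1, 3]], this yields p = 34/70 ≈ 0.486.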

 

New and updated teaching resources

teaching materials
This is just a quick blog post to let you know that I’ve added a few entries to the Teaching resources page and updated a few others:
Sep 4, 2024
Jan Vanhove

Adjusting to Julia: Piecewise regression

Julia
piecewise regression
non-linearities
In this fourth installment of Adjusting to Julia, I will at long last analyse some actual data. One of the first posts on this blog was Calibrating p-values in ‘flexible’ piecewise regression models. In that post, I fitted a piecewise regression to a dataset comprising the ages at which a number of language learners started learning a second language (age of acquisition, AOA) and their scores on a grammaticality judgement task (GJT) in that second language. A piecewise regression is a regression model in which the slope of the function relating the predictor (here: AOA) to the outcome (here: GJT) changes at some value of the predictor, the so-called breakpoint. The problem, however, was that I didn’t specify the breakpoint beforehand but picked the breakpoint that minimised the model’s deviance. This increased the probability that I would find that the slopes before and after the breakpoint differed, even if they were in fact the same. In the blog post I wrote almost nine years ago, I sought to recalibrate the p-value for the change in slope by running a bunch of simulations in R. In this blog post, I’ll do the same, but in Julia.
Mar 7, 2023
Jan Vanhove

 

Adjusting to Julia: Tea tasting

Julia
In this third blog post in which I try my hand at the Julia language, I’ll tackle a slight variation of an old problem – Fisher’s tea tasting lady – both analytically and using a brute-force simulation.
Feb 23, 2023
Jan Vanhove
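Fisher's original setup has the lady identify which four of eight cups had the milk poured first; under random guessing, only one of the C(8, 4) = 70 possible selections is fully correct. Both approaches can be sketched in a few lines of Python (the post itself works in Julia):

```python
import random
from math import comb

def p_all_correct_analytic(n_cups=8, n_milk_first=4):
    # Only one of the C(n_cups, n_milk_first) possible selections is fully correct.
    return 1 / comb(n_cups, n_milk_first)

def p_all_correct_simulated(n_sims=100_000, n_cups=8, n_milk_first=4, seed=1):
    """Brute force: guess n_milk_first cups at random and count perfect guesses."""
    rng = random.Random(seed)
    truth = set(range(n_milk_first))  # say cups 0..3 had the milk poured first
    hits = sum(
        set(rng.sample(range(n_cups), n_milk_first)) == truth
        for _ in range(n_sims)
    )
    return hits / n_sims
```

Both should land near 1/70 ≈ 0.014.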

 

Adjusting to Julia: The Levenshtein algorithm

Julia
In this second blog post about Julia, I’ll share with you a Julia implementation of the Levenshtein algorithm.
Feb 9, 2023
Jan Vanhove
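For reference, the algorithm itself fits in a dozen lines; here is a Python version of the standard two-row dynamic-programming formulation (the post's implementation is, of course, in Julia):

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string s into string t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```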

Adjusting to Julia: Generating the Fibonacci sequence

Julia
I’m currently learning a bit of Julia and I thought I’d share with you a couple of my attempts at writing Julia code. I’ll spare you the sales pitch, and I’ll skip straight to the goal of this blog post: writing three different Julia functions that can generate the Fibonacci sequence.
Dec 20, 2022
Jan Vanhove
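The simplest of the three approaches is the iterative one; in Python it looks like this (assuming the sequence starts 1, 1; the post's versions are in Julia and may use a different convention):

```python
def fibonacci(n):
    """Return the first n Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    seq = []
    a, b = 1, 1
    for _ in range(n):
        seq.append(a)
        a, b = b, a + b  # shift the window one step along the sequence
    return seq
```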

 

In research, don’t do things you don’t see the point of

simplicity
silly tests
research questions
When I started reading quantitative research reports, I hadn’t taken any methods or statistics classes, so small wonder that I didn’t understand why certain background variables on the participants were collected, why it was reported how many of the participants were women and how many were men, and what all those numbers in the results sections meant. However, I was willing to assume that these reports had been written by some fairly intelligent people and that, by the Gricean maxim of relevance, these bits and bobs must be relevant — why else report them?
Feb 18, 2022
Jan Vanhove

 

An R function for computing Levenshtein distances between texts using the word as the unit of comparison

R
For a new research project, we needed a way to tabulate the changes that were made to a text when correcting it. Since we couldn’t find a suitable tool, I wrote an R function that uses the Levenshtein algorithm to determine both the smallest number of words that need to be changed to transform one version of a text into another and what these changes are.
Feb 17, 2022
Jan Vanhove
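The trick is simply to run the Levenshtein algorithm over word tokens rather than characters, and to backtrack through the distance matrix to recover the changes. A Python sketch of the idea (the actual function from the project is in R, and its interface differs):

```python
def word_edits(text_a, text_b):
    """Word-level Levenshtein distance between two texts, plus one optimal
    sequence of word insertions, deletions and substitutions."""
    a, b = text_a.split(), text_b.split()
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i words of a and the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete a[i-1]
                          d[i][j - 1] + 1,          # insert b[j-1]
                          d[i - 1][j - 1] + cost)   # substitute (or match)
    # backtrack from the bottom-right corner to list the changes
    edits, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] and a[i - 1] == b[j - 1]:
            i, j = i - 1, j - 1                     # words match: no edit
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            edits.append(("substitute", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            edits.append(("delete", a[i - 1]))
            i -= 1
        else:
            edits.append(("insert", b[j - 1]))
            j -= 1
    return d[m][n], list(reversed(edits))
```

For instance, turning “the cat sat” into “the dog sat down” takes two word edits: one substitution and one insertion.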

The consequences of controlling for a post-treatment variable

R
multiple regression
Let’s say you want to find out if a pedagogical intervention boosts learners’ conversational skills in L2 French. You’ve learnt that including a well-chosen control variable in your analysis can work wonders in terms of statistical power and precision, so you decide to administer a French vocabulary test to your participants in order to include their score on this test in your analyses as a covariate. But if you administer the vocabulary test after the intervention, it’s possible that the vocabulary scores are themselves affected by the intervention as well. If this is indeed the case, you may end up doing more harm than good. In this blog post, I will take a closer look at five general cases where controlling for such a ‘post-treatment’ variable is harmful.
Jun 29, 2021
Jan Vanhove
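One such case can be illustrated with a small simulation (my own Python sketch with made-up parameter values, not an analysis from the post): the intervention (true effect 0.5) also boosts vocabulary, and vocabulary additionally reflects a latent aptitude that drives conversational skill. Adjusting for the post-treatment vocabulary score then cancels out the very effect we are trying to estimate:

```python
import random

def ols(y, X):
    """Least-squares coefficients via the normal equations (X'X)b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    M = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * z for x, z in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

def simulate(n=20_000, effect=0.5, seed=42):
    rng = random.Random(seed)
    treat, vocab, skill = [], [], []
    for _ in range(n):
        t = float(rng.random() < 0.5)          # randomised intervention
        a = rng.gauss(0, 1)                    # latent aptitude
        v = t + a + rng.gauss(0, 1)            # vocabulary: boosted by the intervention
        y = effect * t + a + rng.gauss(0, 1)   # skill: true effect plus aptitude
        treat.append(t); vocab.append(v); skill.append(y)
    naive = ols(skill, [[1.0, t] for t in treat])[1]
    adjusted = ols(skill, [[1.0, t, v] for t, v in zip(treat, vocab)])[1]
    return naive, adjusted
```

With these settings, the unadjusted estimate hovers around the true effect of 0.5, whereas the adjusted estimate is driven towards 0: conditioning on vocabulary makes treated and untreated participants with the same vocabulary score differ systematically in aptitude.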

 

Quantitative methodology: An introduction

research design
teaching materials
I’ve taught my last class for the semester and I thought I’d make available the booklet that I wrote for teaching my class on quantitative methodology. You can download it here.
Dec 16, 2020
Jan Vanhove

 

Capitalising on covariates in cluster-randomised experiments

R
power
significance
design features
cluster-randomised experiments
preprint
In cluster-randomised experiments, participants are assigned to the conditions randomly but not on an individual basis. Instead, entire batches (‘clusters’) of participants are assigned in such a way that each participant in the same cluster is assigned to the same condition. A typical example would be an educational experiment in which all pupils in the same class get assigned to the same experimental condition. Crucially, the analysis should take into account the fact that the random assignment took place at the cluster level rather than at the individual level.
Sep 2, 2020
Jan Vanhove

 

Tutorial: Visualising statistical uncertainty using model-based graphs

R
graphs
logistic regression
mixed-effects models
multiple regression
Bayesian statistics
brms
I wrote a tutorial about visualising the statistical uncertainty in statistical models for a conference that took place a couple of months ago, and I’ve just realised that I’ve never advertised this tutorial on this blog. You can find the tutorial here: Visualising statistical uncertainty using model-based graphs.
Jun 29, 2020
Jan Vanhove

 

Interpreting regression models: a reading list

measurement error
logistic regression
correlational studies
mixed-effects models
multiple regression
predictive modelling
research questions
contrast coding
reliability
Last semester I taught a class for PhD students and collaborators that focused on how the output of regression models is to be interpreted. Most participants had at least some experience with fitting regression models, but I had noticed that they were often unsure about the precise statistical interpretation of the output of these models (e.g., What does this parameter estimate of 1.2 correspond to in the data?). Moreover, they were usually a bit too eager to move from the model output to a subject-matter interpretation (e.g., What does this parameter estimate of 1.2 tell me about language learning?). I suspect that the same goes for many applied linguists, and social scientists more generally, so below I provide an overview of the course contents as well as the reading list.
Jun 12, 2020
Jan Vanhove

Tutorial: Obtaining directly interpretable regression coefficients by recoding categorical predictors

R
contrast coding
mixed-effects models
multiple regression
tutorial
research questions
The output of regression models is often difficult to parse, especially when categorical predictors and interactions between them are being modelled. The goal of this tutorial is to show you how you can obtain estimated coefficients that you can interpret directly in terms of your research question. I’ve learnt about this technique thanks to Schad et al. (2020), and I refer to them for a more detailed discussion. What I will do is go through three examples of increasing complexity that should enable you to apply the technique in your own analyses.
May 23, 2020
Jan Vanhove

Nonparametric tests aren’t a silver bullet when parametric assumptions are violated

R
power
significance
simplicity
assumptions
nonparametric tests
Some researchers adhere to a simple strategy when comparing data from two or more groups: when they think that the data in the groups are normally distributed, they run a parametric test (t-test or ANOVA); when they suspect that the data are not normally distributed, they run a nonparametric test (e.g., Mann–Whitney or Kruskal–Wallis). Rather than follow such an automated approach to analysing data, I think researchers ought to consider the following points:
May 23, 2020
Jan Vanhove

Baby steps in Bayes: Incorporating reliability estimates in regression models

R
Stan
Bayesian statistics
measurement error
correlational studies
reliability
Researchers sometimes calculate reliability indices such as Cronbach’s α or Revelle’s ωT, but their statistical models rarely take these reliability indices into account. Here I want to show you how you can incorporate information about the reliability of your measurements in a statistical model so as to obtain more honest and more readily interpretable parameter estimates.
Feb 18, 2020
Jan Vanhove

Baby steps in Bayes: Accounting for measurement error on a control variable

R
Stan
Bayesian statistics
measurement error
correlational studies
In observational studies, it is customary to account for confounding variables by including measurements of them in the statistical model. This practice is referred to as statistically controlling for the confounding variables. An underappreciated problem is that if the confounding variables were measured imperfectly, then statistical control will be imperfect as well, and the confound won’t be eradicated entirely (see Berthele & Vanhove 2017; Brunner & Austin 2009; Westfall & Yarkoni 2016) (see also Controlling for confounding variables in correlational research: Four caveats).
Jan 21, 2020
Jan Vanhove

 

Five suggestions for simplifying research reports

simplicity
silly tests
graphs
cluster-randomised experiments
open science
Whenever I’m looking for empirical research articles to discuss in my classes on second language acquisition, I’m struck by how needlessly complicated and unnecessarily long most articles in the field are. Here are some suggestions for reducing the numerical fluff in quantitative research reports.
Dec 5, 2019
Jan Vanhove

Adjusting for a covariate in cluster-randomised experiments

R
power
significance
simplicity
mixed-effects models
cluster-randomised experiments
Cluster-randomised experiments are experiments in which groups of participants (e.g., classes) are assigned randomly but in their entirety to the experiments’ conditions. Crucially, the fact that entire groups of participants were randomly assigned to conditions - rather than each participant individually - should be taken into account in the analysis, as outlined in a previous blog post. In this blog post, I use simulations to explore the strengths and weaknesses of different ways of analysing cluster-randomised experiments when a covariate (e.g., a pretest score) is available.
Nov 28, 2019
Jan Vanhove

Drawing scatterplot matrices

R
graphs
correlational studies
non-linearities
multiple regression
This is just a quick blog post to share a function with which you can draw scatterplot matrices.
Nov 28, 2019
Jan Vanhove

Collinearity isn’t a disease that needs curing

R
multiple regression
assumptions
collinearity
Every now and again, some worried student or collaborator asks me whether they’re “allowed” to fit a regression model in which some of the predictors are fairly strongly correlated with one another. Happily, most Swiss cantons have a laissez-faire policy with regard to fitting models with correlated predictors, so the answer to this question is “yes”. Such an answer doesn’t always set the student or collaborator at ease, so below you find my more elaborate answer.
Sep 11, 2019
Jan Vanhove

Interactions in logistic regression models

R
logistic regression
tutorial
bootstrapping
Bayesian statistics
brms
When you want to know if the difference between two conditions is larger in one group than in another, you’re interested in the interaction between ‘condition’ and ‘group’. Fitting interactions statistically is one thing, and I will assume in the following that you know how to do this. Interpreting statistical interactions, however, is another matter entirely. In this post, I discuss why this is the case and how it pertains to interactions fitted in logistic regression models.
Aug 7, 2019
Jan Vanhove

 

Walkthrough: A significance test for a two-group comparison

significance
R
teaching materials
I wrote an R function that’s hopefully useful to teach students what significance tests do and how they can and can’t be interpreted.
Apr 16, 2019
Jan Vanhove

Before worrying about model assumptions, think about model relevance

simplicity
graphs
non-linearities
assumptions
Beginning analysts tend to be overly anxious about the assumptions of their statistical models. This observation is the point of departure of my tutorial Checking the assumptions of your statistical model without getting paranoid, but it’s probably too general. It’d be more accurate to say that beginning analysts who e-mail me about possible assumption violations and who read tutorials on statistics are overly anxious about model assumptions. (Of course, there are beginning as well as seasoned researchers who hardly ever worry about model assumptions, but they’re unlikely to read papers and blog posts about model assumptions.)
Apr 11, 2019
Jan Vanhove

Guarantees in the long run vs. interpreting the data at hand: Two analyses of clustered data

R
mixed-effects models
cluster-randomised experiments
An analytical procedure may have excellent long-run properties but still produce nonsensical results in individual cases. I recently encountered a real-life illustration of this, but since those data aren’t mine, I’ll use simulated data with similar characteristics for this blog post.
Jan 14, 2019
Jan Vanhove