I blog about statistics and research design with an audience consisting of researchers in bilingualism, multilingualism, and applied linguistics in mind.

Latest blog posts

Abandoning standardised effect sizes and opening up other roads to power

14 July 2017

Numerical summaries of research findings will typically feature an indication of the sizes of the effects that were studied. These indications are often standardised effect sizes, which means that they are expressed relative to the variability in the data rather than with respect to the units in which the variables were measured. Popular standardised effect sizes include Cohen’s d, which expresses the mean difference between two groups as a proportion of the pooled standard deviation, and Pearson’s r, which expresses the difference in one variable as a proportion of its standard deviation that is associated with a change of one standard deviation of another variable. There exists a rich literature that discusses which standardised effect sizes ought to be used depending on the study’s design, how this or that standardised effect size should be adjusted for this and that bias, and how confidence intervals should be constructed around standardised effect sizes (see the blog post Confidence intervals for standardised mean differences). But most of this literature should be of little importance to the practising scientist for the simple reason that standardised effect sizes themselves ought to be of little importance.

In what follows, I will defend this point of view, which I’ve outlined in two previous blog posts (Why I don’t like standardised effect sizes and More on why I don’t like standardised effect sizes), by sharing some quotes by respected statisticians and methodologists. In particular, I hope to, first, make you think about the detrimental effect that the use of standardised effect sizes has on the interpretability of research findings and the accumulation of knowledge and, second, convince you that the use of standardised effect sizes for planning studies overly stresses sample size as the main determinant of a study’s power and that by abandoning it other roads to power can be opened up.


When justifying their use of standardised effect sizes, researchers usually cite the need to be able to compare results that were obtained on different scales or to render results on scales that are difficult to understand more meaningful. I understand this argument up to a point, but I think it’s overused. Firstly, to the extent that different outcome measures for similar constructs are commonly used, it should be possible to rescale them without relying on the variance of the sample at hand. This could be done by making reference to norming studies. Moreover, standardised effect sizes should not be an excuse to ignore one’s measurements:

“To work constructively with ‘raw’ regression coefficients and confidence intervals, psychologists have to start respecting the units they work with, or develop measurement units they can respect enough so that researchers in a given field or subfield can agree to use them.” (Cohen, 1994)

Secondly, when different instruments are used to measure different constructs, then I think it’s actually an advantage when the measurements cannot directly be compared. Thirdly, I agree with Tukey (1969) in that I think that the increase in interpretability from standardised effect sizes is largely deceptive:

“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. (…) [W]e think we know what r = -.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. (…) Being so disinterested in our variables that we do not care about their units can hardly be desirable.” (Tukey, 1969)

Indeed, correlation coefficients in particular are regularly misinterpreted as somehow representing the slope of the function characterising the relationship between two variables (see Vanhove, 2013). I think that the apparent enhanced interpretability of standardised effect sizes stems from most researchers’ “knowing” that r = 0.10 represents a ‘small’ effect size, r = 0.30 a ‘medium’ one, and r = 0.50 a ‘large’ one according to Cohen (1992), which enables them to map any correlation coefficient somewhere on this grid. (Plonsky & Oswald, 2014, propose different reference values for L2 research, but the idea is the same.) But apart from positioning their own standardised effect size relative to the distribution of standardised effect sizes in a biased literature, most of which isn’t related to their own study, I don’t see what this buys us in terms of interpretability.

One situation where I grant that standardised effect sizes are more useful and fairly easy to interpret is when you want to express how much information one variable contains about another. For instance, when you want to see how collinear two or more predictor variables are so that you know whether they can sensibly be used in the same regression model, or when you want to argue that two cognitive traits may or may not be isomorphic. But more often, we’re interested in characterising the functional relationship between variables, often in terms of a causal narrative. For such a purpose, standardised effect sizes are wholly unsuited:

“The major problem with correlations applied to research data is that they can not provide useful information on causal strength because they change with the degree of variability of the variables they relate. Causality operates on single instances, not on populations whose members vary. The effect of A on B for me can hardly depend on whether I’m in a group that varies greatly in A or another that does not vary at all.” (Cohen, 1994)

Cohen also writes that

“(…) I’ve found that when dealing with variables expressed in units whose magnitude we understand, the effect size in linear relationships is better comprehended with regression than with correlation coefficients.” (Cohen, 1990)

I obviously agree, and I’d like to point out that I think that “variables expressed in units whose magnitude we understand” doesn’t merely include lengths in metres and response latencies in milliseconds, but also responses on 5- or 7-point scales.

Cumulative science

Standardised effect sizes are often used in the pinnacle of cumulative science, namely meta-analyses. That said, as I’ve written in a previous blog post, standardised effect sizes can make it seem as though two studies on the same phenomenon contradict each other when they in fact found the exact same result, and vice versa. As Tukey writes,

“I find the use of a correlation coefficient a dangerous symptom. It is an enemy of generalization, a focuser on the “here and now” to the exclusion of the “there and then.” Any influence that exerts selection on one variable and not on the other will shift the correlation coefficient. What usually remains constant under such circumstances is one of the regression coefficients. If we wish to seek for constancies, then, regression coefficients are much more likely to serve us than correlation coefficients.” (Tukey, 1969)

I know too little of meta-analyses to have any firm views on them. But if standardised effect sizes are indispensable to the meta-analyst, they’re easy to compute on the basis of the raw effect sizes and other summaries provided, so there’s little need for other researchers to focus on them.
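Tukey’s point about selection can be illustrated with a small simulation. This is a minimal sketch under assumptions made for illustration: the seed, the population size, the true slope of 0.3 and the cut-off used to restrict the predictor’s range are all hypothetical. It compares the correlation and the regression slope in the full data with those in a subsample whose predictor values were selected from a narrow band.

```r
# Range restriction on the predictor shrinks the correlation coefficient
# but leaves the regression slope roughly where it was.
set.seed(2017)                                  # arbitrary seed

n <- 20000                                      # hypothetical population size
x <- rnorm(n)                                   # predictor, SD = 1
y <- 0.3 * x + rnorm(n, sd = sqrt(1 - 0.3^2))   # outcome, true slope = 0.3

# Selection on the predictor only: keep cases near the middle of its range
restricted <- abs(x) < 0.5

slope_full <- unname(coef(lm(y ~ x))[2])
slope_restricted <- unname(coef(lm(y[restricted] ~ x[restricted]))[2])
r_full <- cor(x, y)
r_restricted <- cor(x[restricted], y[restricted])

round(c(slope_full = slope_full, slope_restricted = slope_restricted,
        r_full = r_full, r_restricted = r_restricted), 2)
```

The two slopes should agree up to sampling error, whereas the correlation in the restricted subsample should be markedly lower than in the full data.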

Standardised effect sizes in power analyses

I now come to the main point I want to make. One major use of standardised effect sizes is in discussing the statistical power of studies. Indeed, it was Cohen’s (1977) treatise on power analyses that popularised standardised effect sizes in the behavioural sciences. The reason Cohen (1977) used standardised effect sizes when discussing statistical power was a practical one (see p. 11): statistical power is a function of one’s sample size and the ratio of the population-wide effect to the within-population variability. Instead of providing one table with power values for a population-wide effect of 3 units and for a within-population standard deviation of 10 units and another table for an effect of 12 units and a standard deviation of 40 units (and so on), he could just provide a table for an effect-to-standard deviation ratio of 0.3.
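Cohen’s practical argument is easy to check in R. The sketch below uses `power.t.test()` from the stats package; the two-sample t-test and the sample size of 50 per group are assumptions chosen purely for illustration. The three calls differ in their raw effects and standard deviations but share the same effect-to-standard deviation ratio of 0.3.

```r
# Power depends on the raw effect and the standard deviation only through
# their ratio (for a fixed sample size and significance level).
n_per_group <- 50   # hypothetical sample size per group

power_3_10 <- power.t.test(n = n_per_group, delta = 3, sd = 10)$power
power_12_40 <- power.t.test(n = n_per_group, delta = 12, sd = 40)$power
power_ratio <- power.t.test(n = n_per_group, delta = 0.3, sd = 1)$power

c(power_3_10 = power_3_10, power_12_40 = power_12_40,
  power_ratio = power_ratio)
```

All three values coincide, which is why a single table indexed by the ratio suffices. Note, though, that the ratio has two components, and only one of them is outside the researcher’s control.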

Current discussions about statistical power are almost invariably cast in terms of standardised effect sizes. The problem is that in such discussions the standardised effect size of a phenomenon in a particular social context is typically treated as immutable. That is, you can’t change the standardised effect size of the phenomenon you’re investigating at the population level. However, standardised effect sizes in fact conflate the (raw, possibly causal) effect size of what you’re investigating, which indeed you can’t change as a researcher, with the variability of the data, which you can change through optimising the research design:

“[T]he control of various sources of variation through the use of improved research designs serves to increase [standardised] effect sizes as they are defined here.” (…) “Thus, operative effect sizes may be increased not only by improvement in measurement and experimental technique, but also by improved experimental designs.” (Cohen, 1977)

As a result of the assumption of immutable standardised effect sizes, discussions about statistical power overly focus on the other determinant of power, i.e., sample size. Large sample sizes are obviously a good thing, but there are other roads to high-powered studies (or to studies with high precision) that don’t get as much attention. In the next section, I’ll discuss three other roads to power.

Other roads to power

Reducing measurement error

For a given raw effect size and a given sample size, studies are more powerful when there is less residual variance in the outcome variable. Much of this residual variance will be related to differences at the construct level (e.g., people differ with respect to how introverted they are or to how well they can detect grammatical rules in a miniature language). But some part of it will be due to measurement error, i.e., variance unrelated to the construct you’re interested in. If we could reduce the measurement error in the outcome variable, we’d reduce the residual variance and we’d consequently improve the study’s statistical power.

Now, it’s easy for me to say that everyone, myself included, ought to use instruments with less measurement error. Apart from taking time and money to develop and validate, highly reliable instruments can take forbiddingly long to administer. But there are sometimes easier ways to reduce one’s measurement error. For instance, labelling some or all of the points on a rating scale enhances its reliability (Krosnick & Presser, 2010). As another example, when the outcome variable consists of human ratings of text quality, it may be difficult to get raters to agree on a score but it may be fairly easy to recruit additional raters. By averaging the judgements of multiple raters, the ratings’ measurement error can be much reduced.
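To get a feel for how much averaging over raters can help, here is a rough sketch. All numbers are hypothetical, and it assumes that the raters’ errors are independent of each other and of the construct, so that averaging k raters divides the error variance by k. The raw effect and the sample size are held constant throughout; only the measurement error changes.

```r
# Power of a two-sample t-test as a function of the number of raters whose
# judgements are averaged. All values below are hypothetical.
raw_effect <- 0.5     # raw group difference on the rating scale
sd_construct <- 1     # between-person SD at the construct level
sd_error <- 1         # SD of a single rater's measurement error
n_per_group <- 40

n_raters <- c(1, 2, 4, 8)

# Residual SD of the averaged rating: the error variance shrinks with k
sd_total <- sqrt(sd_construct^2 + sd_error^2 / n_raters)

power_by_raters <- sapply(sd_total, function(s) {
  power.t.test(n = n_per_group, delta = raw_effect, sd = s)$power
})

data.frame(n_raters, sd_total = round(sd_total, 2),
           power = round(power_by_raters, 2))
```

Power increases with each additional rater, but with diminishing returns: the construct-level variance puts a floor under the residual standard deviation.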

For further discussion about measurement error and statistical power, see Sam Schwarzkopf’s blog post.

Statistical control using covariates

The residual variance can also be reduced by statistically accounting for known sources of variability in the data. This is usually done by means of covariates. Covariates seem to have a bad reputation nowadays since they can easily be abused to turn non-significant results into significant findings. But used properly, they can work wonders power- and precision-wise. I won’t discuss covariate control in more detail here and refer to my earlier blog posts on the topic instead.
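As a quick illustration of the precision gain, consider the following minimal sketch of a randomised two-group design. The seed, the effect of 0.5 units, the sample size and the strength of the covariate (think of a pretest score) are assumptions made for the example, and the covariate is taken to have been chosen before seeing the data.

```r
# Standard error of a group effect with and without a prognostic covariate.
set.seed(2017)                             # arbitrary seed

n <- 100
group <- rep(c(0, 1), each = n / 2)        # two groups of 50
covariate <- rnorm(n)                      # e.g., a pretest score
outcome <- 0.5 * group + 0.7 * covariate + rnorm(n, sd = sqrt(1 - 0.7^2))

fit_without <- lm(outcome ~ group)
fit_with <- lm(outcome ~ group + covariate)

se_without <- summary(fit_without)$coefficients["group", "Std. Error"]
se_with <- summary(fit_with)$coefficients["group", "Std. Error"]

round(c(se_without = se_without, se_with = se_with), 3)
```

The raw group effect is the same in both models, but the model with the covariate estimates it with a smaller standard error because the covariate soaks up residual variance.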

Purposeful selective sampling

The last underappreciated road to power that I’ll discuss is purposeful selective sampling: instead of collecting data on whichever participants you can convince to sign up for your study, you screen the pool of potential participants and target only a subset of them. Selective sampling is particularly attractive when the outcome variable is difficult or expensive to collect (e.g., because it’s based on a task battery that takes hours to complete), but when the predictor variable of interest (or a proxy of it) can easily be collected in advance (e.g., the participants’ age or their performance on a school test they took the year before). If you’re willing to assume that a linear relationship exists between the predictor and the outcome, you can achieve excellent power for a fraction of the resources that would be needed to attain the same power using random sampling. And if you’re not willing to assume a linear relationship, selective sampling can still be highly efficient.

To illustrate this, let’s say you want to investigate the relationship between an easy-to-collect predictor and a difficult-to-collect outcome. Unbeknownst to you, the population-level variance–covariance matrix for this relationship is:

##      [,1] [,2]
## [1,]  1.0  0.3
## [2,]  0.3  1.0

That is, both variables have a standard deviation of one unit at the population level and a 1-unit increase along the predictor is associated with a 0.3-unit increase in the outcome. In other words, the population-wide regression coefficient for this relationship is 0.3. Since the variables have standard deviations of one unit, the correlation coefficient at the population level is 0.3 as well, which simplifies the comparisons below.

If we wanted to have a 90% chance to detect a significant correlation between these two variables, we would have to sample 112 participants:

# r = 0.3 requires 112 participants to achieve 0.9 power
pwr::pwr.r.test(n = NULL, r = 0.3, power = 0.9)
##      approximate correlation power calculation (arctangh transformation) 
##               n = 112
##               r = 0.3
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
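
For readers without the pwr package, the required sample size can be approximated in base R using Fisher’s z-transformation. This is a textbook approximation rather than the exact routine that `pwr.r.test()` implements, so small discrepancies are to be expected; the variable names are assumptions made for this sketch.

```r
# Approximate sample size for detecting a correlation with a two-sided
# test, based on Fisher's z-transformation of r.
r <- 0.3
alpha <- 0.05
target_power <- 0.9

n_approx <- ((qnorm(1 - alpha / 2) + qnorm(target_power)) / atanh(r))^2 + 3
n_approx
```

The result should land within a participant or so of the figure reported above.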

But this is assuming we sampled participants randomly from the population. Instead, we could sample participants from the extremes. The figure below illustrates a scenario where you collect the predictor data for 100 participants but only go on to collect the outcome data for the 10 participants with the highest score and for the 10 participants with the lowest score, for a total sample size of 20.


In this scenario, sampling at the extremes leads to a study with 65% power. For reference, randomly sampling 20 participants from the population only gives 26% power. Obviously, power can further be increased by sampling more participants at the extremes. But more crucially, casting a wider net during screening, e.g., by screening 200 participants rather than 100, leads to an even bigger increase in power. As the figure below shows, screening 200 participants and retaining 30 or screening 500 and retaining only 20 yields 89-90% power, a considerable improvement over random sampling in terms of efficiency! Sampling at the extremes results in larger correlation coefficients, but crucially, it doesn’t affect regression coefficients and so still allows us to correctly characterise the relationship between the two variables (see my earlier blog post as well as Baguley, 2009).
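Power estimates of this kind can be approximated with a small simulation. What follows is a minimal sketch and not the code behind the figures: the function names, the seed and the number of simulation runs are assumptions made for illustration, and power is estimated as the proportion of simulated studies in which the slope’s p-value falls below 0.05. Since the outcome is only needed for the retained participants, it is only generated for them.

```r
set.seed(2017)   # arbitrary seed

# One simulated study: screen `n_screen` people on the predictor, then
# collect the outcome for the `n_each` lowest and `n_each` highest scorers.
simulate_extremes <- function(n_screen = 100, n_each = 10, slope = 0.3) {
  x <- rnorm(n_screen)
  keep <- order(x)[c(seq_len(n_each), (n_screen - n_each + 1):n_screen)]
  x_kept <- x[keep]
  y_kept <- slope * x_kept + rnorm(length(keep), sd = sqrt(1 - slope^2))
  summary(lm(y_kept ~ x_kept))$coefficients[2, 4] < 0.05
}

# One simulated study with `n` randomly sampled participants.
simulate_random <- function(n = 20, slope = 0.3) {
  x <- rnorm(n)
  y <- slope * x + rnorm(n, sd = sqrt(1 - slope^2))
  summary(lm(y ~ x))$coefficients[2, 4] < 0.05
}

n_sim <- 2000
power_extremes <- mean(replicate(n_sim, simulate_extremes()))
power_random <- mean(replicate(n_sim, simulate_random()))

c(extremes = power_extremes, random = power_random)
```

Changing `n_screen` and `n_each` lets you explore how wide the screening net needs to be for a given final sample size.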


One major drawback to sampling at the extremes is that you have to be willing to assume a linear relationship between the two variables. If you’re not willing to assume such a relationship, you can instead sample both at the extremes and at a couple of midway points. This way, you can check whether the relationship is indeed linear. The figure below illustrates the scenario where you screen 200 participants and then go on to collect the outcome data for the four participants with the highest predictor score, the four with the lowest predictor score, and four participants each closest to the 25th, 50th and 75th screening sample percentile, for a total of 20 participants in the final sample.


In this scenario, sampling at both the extremes and 3 midway points leads to 57% power, which is still a respectable boost relative to the paltry 26% random sampling gives you. The figure below shows that with a wide enough net, excellent power can be achieved with as few as 40–50 participants while enabling you to assess whether the relationship between the two variables is approximately linear. Picking only 2 midway points increases power somewhat, and different strategies for determining the midway points may be more efficient still. But the main point is that abandoning standardised effect sizes as the basis for power computations enables you to explore ways to improve the statistical power or precision of your studies other than doubling your sample size.
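The scheme with midway points can be simulated along the same lines. Again, this is a sketch under assumptions: the function name, the seed, the number of runs and the rule for picking the midway participants (those whose predictor scores are closest to the sample quantiles) are choices made for this illustration and need not match how the figures were produced.

```r
set.seed(2017)   # arbitrary seed

# One simulated study: screen `n_screen` people, then retain the `n_each`
# lowest scorers, the `n_each` highest scorers, and the `n_each` people
# closest to each of the sample quantiles in `probs`.
simulate_midpoints <- function(n_screen = 200, n_each = 4, slope = 0.3,
                               probs = c(0.25, 0.50, 0.75)) {
  x <- sort(rnorm(n_screen))
  lowest <- seq_len(n_each)
  highest <- (n_screen - n_each + 1):n_screen
  midway <- unlist(lapply(quantile(x, probs), function(q) {
    order(abs(x - q))[seq_len(n_each)]
  }))
  keep <- c(lowest, midway, highest)
  x_kept <- x[keep]
  y_kept <- slope * x_kept + rnorm(length(keep), sd = sqrt(1 - slope^2))
  summary(lm(y_kept ~ x_kept))$coefficients[2, 4] < 0.05
}

power_midpoints <- mean(replicate(2000, simulate_midpoints()))
power_midpoints
```

Passing a different `probs` vector, e.g. `c(1/3, 2/3)`, is one way to compare strategies for choosing the midway points.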



Standardised effect sizes conflate raw, possibly causal effect sizes with data variability, and aren’t as readily interpretable as often believed. Discussions about statistical power based on standardised effect sizes correctly stress the importance of sample size, but often gloss over other, complementary ways to design high-powered studies. Untangling raw effect size from data variability when discussing power can lead to practical recommendations to increase statistical power more efficiently.


Baguley, Thom. 2009. Standardized or simple effect size: What should be reported? British Journal of Psychology 100(3). 603-617.

Cohen, Jacob. 1977. Statistical power analysis for the behavioral sciences (rev. edn.). New York: Academic Press.

Cohen, Jacob. 1990. Things I have learned (so far). American Psychologist 45(12). 1304-1312.

Cohen, Jacob. 1992. A power primer. Psychological Bulletin 112(1). 155-159.

Cohen, Jacob. 1994. The Earth is round (p < .05). American Psychologist 49(12). 997-1003.

Krosnick, Jon A. & Stanley Presser. 2010. Question and questionnaire design. In Peter V. Marsden & James D. Wright (eds.), Handbook of survey research (2nd edn.), 263-313. Bingley, UK: Emerald.

Plonsky, Luke & Frederick L. Oswald. 2014. How big is “big”? Interpreting effect sizes in L2 research. Language Learning 64(4). 878-912.

Tukey, John W. 1969. Analyzing data: Sanctification or detective work? American Psychologist 24(2). 83-91.

Vanhove, Jan. 2013. The critical period hypothesis in second language acquisition: A statistical critique and a reanalysis. PLOS ONE 8(7). e69172.


Interactions between continuous variables

26 June 2017

Splitting up continuous variables is generally a bad idea. In terms of statistical efficiency, the popular practice of dichotomising continuous variables at their median is comparable to throwing out a third of the dataset. Moreover, statistical models based on split-up continuous variables are prone to misinterpretation: threshold effects are easily read into the results when, in fact, none exist. Splitting up, or ‘binning’, continuous variables, then, is something to avoid. But what if you’re interested in how the effect of one continuous predictor varies according to the value of another continuous predictor? In other words, what if you’re interested in the interaction between two continuous predictors? Binning one of the predictors seems appealing since it makes the model easier to interpret. However, as I’ll show in this blog post, it’s fairly straightforward to fit and interpret interactions between continuous predictors.


Tutorial: Adding confidence bands to effect displays

12 May 2017

In the previous blog post, I demonstrated how you can draw effect displays to render regression models more intelligible to yourself and to your audience. These effect displays did not contain information about the uncertainty inherent to estimating regression models, however. To that end, this blog post demonstrates how you can add confidence bands to effect displays for multiple regression, logistic regression, and logistic mixed-effects models, and explains how these confidence bands are constructed.


Tutorial: Plotting regression models

23 April 2017

The results of regression models, particularly fairly complex ones, can be difficult to appreciate and hard to communicate to an audience. One useful technique is to plot the effect of each predictor variable on the outcome while holding constant any other predictor variables. Fox (2003) discusses how such effect displays are constructed and provides an implementation in the effects package for R.

Since I think it’s both instructive to see how effect displays are constructed from the ground up and useful to be able to tweak them yourself in R, this blog post illustrates how to draw such plots for three increasingly complex statistical models: ordinary multiple regression, logistic regression, and mixed-effects logistic regression. The goal in each of these three examples is to visualise the effects of the predictor variables without factoring in the uncertainty about these effects; visualising such uncertainty will be the topic of a future blog post.


Confidence intervals for standardised mean differences

22 February 2017

Standardised effect sizes express patterns found in the data in terms of the variability found in the data. For instance, a mean difference in body height could be expressed in the metric in which the data were measured (e.g., a difference of 4 centimetres) or relative to the variation in the data (e.g., a difference of 0.9 standard deviations). The latter is a standardised effect size known as Cohen’s d.

As I’ve written before, I don’t particularly like standardised effect sizes. Nonetheless, I wondered how confidence intervals around standardised effect sizes (more specifically: standardised mean differences) are constructed. Until recently, I hadn’t really thought about it and sort of assumed you would compute them the same way as confidence intervals around raw effect sizes. But unlike raw (unstandardised) mean differences, standardised mean differences are a combination of two estimates subject to sampling error: the mean difference itself and the sample standard deviation. Moreover, the sample standard deviation is a biased estimate of the population standard deviation (it tends to be too low), which causes Cohen’s d to be an upwardly biased estimate of the population standardised mean difference. Surely both of these factors must affect how the confidence intervals around standardised effect sizes are constructed?

It turns out that indeed they do. When I compared the confidence intervals that I computed around a standardised effect size using a naive approach that assumed that the standard deviation wasn’t subject to sampling error and wasn’t biased, I got different results than when I used specialised R functions.

But these R functions all produced different results, too.

Obviously, there may well be more than one way to skin a cat, but this caused me to wonder if the different procedures for computing confidence intervals all covered the true population parameter with the nominal probability (e.g., in 95% of cases for a 95% confidence interval). I ran a simulation to find out, which I’ll report in the remainder of this post. If you spot any mistakes, please let me know.


Which predictor is most important? Predictive utility vs. construct importance

15 February 2017

Every so often, I’m asked for my two cents on a correlational study in which the researcher wants to find out which of a set of predictor variables is the most important one. For instance, they may have the results of an intelligence test, of a working memory task and of a questionnaire probing their participants’ motivation for learning French, and they want to find out which of these three is the most important factor in acquiring a nativelike French accent, as measured using a pronunciation task. As I will explain below, research questions such as these can be interpreted in two ways, and whether they can be answered sensibly depends on the interpretation intended.


Automatise repetitive tasks

31 January 2017

Research often involves many repetitive tasks. For an ongoing project, for instance, we needed to replace all stylised apostrophes (’) with straight apostrophes (') in some 3,000 text files when preparing the texts for the next step. As another example, you may need to split up a bunch of files into different directories depending on, say, the character in the file name just before the extension. When done by hand, such tasks are as mind-numbing and time-consuming as they sound – perhaps you would do them on a Friday afternoon while listening to music or outsource them to a student assistant. My advice, though, is this: Try to automatise repetitive tasks.

Doing repetitive tasks is what computers are for, so rather than spending several hours learning nothing, I suggest you spend that time writing a script or putting together a command line call that does the task for you. If you have little experience doing this, this will take time at first. In fact, I reckon I often spend roughly the same amount of time trying to automatise menial tasks as it would have cost me to do them by hand. But in the not-so-long run, automatisation is a time-saver: Once you have a working script, you can tweak and reuse it. Additionally, while you’re figuring out how to automatise a menial chore, you’re actually learning something useful. The chores become more of a challenge and less mind-numbing. I’m going to present an example or two of what I mean and I will conclude by giving some general pointers.
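To give a flavour of what such a script can look like, here is a minimal sketch in R for the apostrophe task. The function name is hypothetical, the files are assumed to be UTF-8 encoded plain-text files with a .txt extension, and the files are overwritten in place, so try it on a copy of your data first.

```r
# Replace stylised apostrophes (U+2019) with straight ones in every .txt
# file in a directory. Files are assumed to be UTF-8 encoded and are
# overwritten in place.
straighten_apostrophes <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  for (f in files) {
    text <- readLines(f, encoding = "UTF-8", warn = FALSE)
    text <- gsub("\u2019", "'", text, fixed = TRUE)
    writeLines(text, f, useBytes = TRUE)
  }
  invisible(length(files))   # number of files processed
}

# Usage (hypothetical directory name):
# straighten_apostrophes("texts")
```

A dozen lines like these do in seconds what would take hours by hand, and they can be adapted to the next, similar chore.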


Some illustrations of bootstrapping

20 December 2016

This post illustrates a statistical technique that becomes particularly useful when you want to calculate the sampling variation of some custom statistic when you start to dabble in mixed-effects models. This technique is called bootstrapping and I will first illustrate its use in constructing confidence intervals around a custom summary statistic. Then I’ll illustrate three bootstrapping approaches when constructing confidence intervals around a regression coefficient, and finally, I will show how bootstrapping can be used to compute p-values.

The goal of this post is not to argue that bootstrapping is superior to the traditional alternatives (in the examples discussed, they are pretty much on par) but merely to illustrate how it works. The main advantage of bootstrapping, as I understand it, is that it can be applied in situations where the traditional alternatives are not available, where you don’t understand how to use them or where their assumptions are questionable, but I think it’s instructive to see how its results compare to those of traditional approaches where both can readily be applied.
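The basic idea can be sketched in a few lines of R. The example below constructs a percentile bootstrap confidence interval around a sample median; the simulated data, the seed and the number of bootstrap samples are assumptions made for illustration, and the percentile method is only one of several ways to turn bootstrap estimates into an interval.

```r
# Percentile bootstrap confidence interval for a median.
set.seed(2016)                 # arbitrary seed

x <- rexp(50)                  # hypothetical skewed sample of 50 values

# Resample the data with replacement and recompute the statistic each time
boot_medians <- replicate(5000, median(sample(x, replace = TRUE)))

# The middle 95% of the bootstrap estimates serves as the interval
ci <- quantile(boot_medians, c(0.025, 0.975))
ci
```

Swapping `median()` for any other function of the data gives a bootstrap interval for that statistic instead.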


What data patterns can lie behind a correlation coefficient?

21 November 2016

In this post, I want to, first, help you to improve your intuition of what data patterns correlation coefficients can represent and, second, hammer home the point that to sensibly interpret a correlation coefficient, you need the corresponding scatterplot.


Common-language effect sizes

16 November 2016

The goal of this blog post is to share with you a simple R function that may help you to better communicate the extent to which two groups differ and overlap by computing common-language effect sizes.