Standardised effect sizes express patterns in the data relative to the variability in the data. For instance, a mean difference in body height could be expressed in the metric in which the data were measured (e.g., a difference of 4 centimetres) or relative to the variation in the data (e.g., a difference of 0.9 standard deviations). The latter is a standardised effect size known as Cohen’s d.
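To make that concrete, here is a minimal hand-rolled version of the computation; the function name `cohens_d` is mine, not from any package.

```r
# A minimal sketch: Cohen's d is the mean difference divided by the
# pooled standard deviation. The function name is hypothetical.
cohens_d <- function(x, y) {
  s_pooled <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                     (length(x) + length(y) - 2))
  (mean(x) - mean(y)) / s_pooled
}
```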

As I’ve written before, I don’t particularly like standardised effect sizes. Nonetheless, I wondered how confidence intervals around standardised effect sizes (more specifically: standardised mean differences) are constructed. Until recently, I hadn’t really thought about it and sort of assumed you would compute them the same way as confidence intervals around raw effect sizes. But unlike raw (unstandardised) mean differences, standardised mean differences are a combination of two estimates subject to sampling error: the mean difference itself and the sample standard deviation. Moreover, the sample standard deviation is a biased estimate of the population standard deviation (it tends to be too low), which causes Cohen’s d to be an upwardly biased estimate of the population standardised mean difference. Surely both of these factors must affect how the confidence intervals around standardised effect sizes are constructed?
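The bias in the sample standard deviation is easy to verify by simulation; the sample size of 10 and the seed below are arbitrary choices of mine:

```r
# With n = 10 draws from a standard normal, the average sample SD falls
# noticeably below the population value of 1 (around 0.97 here).
set.seed(1)
mean(replicate(10000, sd(rnorm(10, mean = 0, sd = 1))))
```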

It turns out that indeed they do. When I computed confidence intervals around a standardised effect size using a naive approach that treated the sample standard deviation as unbiased and free of sampling error, I got different results than when I used specialised R functions.

But these R functions all produced different results, too.

Obviously, there may well be more than one way to skin a cat, but this caused me to wonder if the different procedures for computing confidence intervals all covered the true population parameter with the nominal probability (e.g., in 95% of cases for a 95% confidence interval). I ran a simulation to find out, which I’ll report in the remainder of this post. If you spot any mistakes, please let me know.

### Introducing the contenders

Below, I’m going to introduce three R functions for computing confidence intervals for standardised effect sizes (standardised mean differences, to be specific). To illustrate how they work, though, I’m first going to generate two samples from normal distributions with standard deviations of 1 and means of 0.5 and 0, respectively.
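The code for this step didn’t survive into this version of the post, so the snippet below is a reconstruction; the seed and the 20 observations per group are assumptions on my part, which is why the intervals reported below won’t be reproduced exactly.

```r
# Reconstructed sample generation; seed and group size are assumptions.
set.seed(2017)
n <- 20
sample1 <- rnorm(n, mean = 0.5, sd = 1)
sample2 <- rnorm(n, mean = 0, sd = 1)
```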

#### `cohen.d` in the `effsize` package

The first function is `cohen.d` from the `effsize` package. It takes as its arguments the two samples you want to compare and the desired confidence level (here: 90%). You can also specify whether you want to apply `hedges.correction`, which causes the function to compute Hedges’ g and confidence intervals for it. (Hedges’ g is less biased than Cohen’s d.)
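Assuming the samples generated above, the calls look like this:

```r
library(effsize)

# Cohen's d with a 90% confidence interval
cohen.d(sample1, sample2, conf.level = 0.90)

# Hedges' g (bias-corrected) with a 90% confidence interval
cohen.d(sample1, sample2, conf.level = 0.90, hedges.correction = TRUE)
```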

90% confidence interval for Cohen’s d: [0.24, 1.38].

90% confidence interval for Hedges’ g: [0.23, 1.36].

(Incidentally, `cohen.d` also has a parameter called `noncentral`, but setting it to `TRUE` doesn’t seem to work…)

#### `tes` in the `compute.es` package

The second function is `tes` from the `compute.es` package. It takes as its arguments the t statistic for the t test comparing the two samples, the sample sizes and the desired confidence level (as a percentage, not as a proportion):
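(The original snippet isn’t shown here, so the following is a reconstruction; computing the t statistic with a pooled-variance t test is my assumption, since Cohen’s d is based on the pooled standard deviation.)

```r
library(compute.es)

# t statistic from a pooled-variance t test comparing the two samples
t_value <- unname(t.test(sample1, sample2, var.equal = TRUE)$statistic)
tes(t_value, n.1 = length(sample1), n.2 = length(sample2), level = 90)
```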

This function outputs a lot of standardised effect sizes and their confidence intervals. Here, I’m only interested in Cohen’s d, whose 90% confidence interval now is [0.26, 1.37]. (The confidence interval for Hedges’ g is also different from that from the `cohen.d` function.)

Note, incidentally, that the Cohen’s d and Hedges’ g values are the same for the `tes` and the `cohen.d` function; it’s just the confidence intervals that are different.

#### `ci.smd` in the `MBESS` package

Lastly, the `ci.smd` function from the `MBESS` package takes as its input a Cohen’s d, the two sample sizes, and the desired confidence level. Here I compute Cohen’s d using the `cohen.d` function and then feed it to `ci.smd`.
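A sketch under the same assumptions:

```r
library(MBESS)

# Point estimate of Cohen's d from the effsize package
d <- cohen.d(sample1, sample2)$estimate
ci.smd(smd = d, n.1 = length(sample1), n.2 = length(sample2),
       conf.level = 0.90)
```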

This time, the 90% confidence interval is [0.26, 1.35]. It was fairly small differences such as these between the three functions that led me to run the simulation reported below.

### Coverage of the population standardised mean difference by different confidence intervals

#### Method

For the simulation, I generated lots of samples from two normal distributions with the same standard deviation whose means were half a standard deviation apart. In other words, the population standardised mean difference was 0.5. For each pair of samples, I computed 90% confidence intervals around the sample standardised mean difference (Cohen’s d) using the `cohen.d`, `tes` and `ci.smd` functions; I also computed 90% confidence intervals around Hedges’ g using the `cohen.d` function. I then checked how often these intervals contained the population standardised mean difference (0.5). Ideally, this should be the case in about 90% of the samples generated. If it’s fewer than that, the confidence intervals are too narrow; if it’s more, they’re too wide. I ran this simulation for different sample sizes, ranging from 5 observations per group to 500.

The R code for this simulation is available at the bottom of this post.

#### Results

Figure 1. Coverage of the population standardised mean difference (0.5) by confidence intervals computed using the `cohen.d`, `tes` and `ci.smd` functions, based on 10,000 simulation runs per sample size. The dashed horizontal line shows the nominal confidence level; the grey lines around it show the values between which the coverage rates should lie with 95% probability if the confidence intervals had their nominal coverage rate.

As Figure 1 clearly shows, the coverage rates for the confidence intervals computed around Cohen’s d using the `ci.smd` function are at their nominal level even for small samples. The confidence intervals computed using the `cohen.d` and `tes` functions, however, are too wide for sample sizes of up to 50 observations per group.

#### Discussion

First of all, I want to reiterate that I think standardised effect sizes, including standardised mean differences and correlation coefficients, are overvalued and that we should strive to interpret raw effect sizes instead.

That said, on a practical level, this simulation suggests that if you nonetheless want to express your results as a standardised mean difference and compute a confidence interval around it, it’s a good idea to take a look at the `MBESS` package. The package’s vignette also has a good discussion of how exact confidence intervals can be constructed around standardised effect sizes, and the package provides a fast implementation of these methods.

By contrast, the `effsize` and `compute.es` packages seem to rely on overly conservative approximations to these exact methods, and they differ from each other in how the variance of Cohen’s d is computed (see here and here).

For those of you interested in further details, Wolfgang Viechtbauer provided some links on Twitter that you may want to take a look at.

### R code

For those interested, here’s the R code I used for the simulation. If you spot an error that explains the results above, please let me know.

First, I defined a function, `d_ci`, that generates two samples from normal distributions with standard deviations of 1. In the simulation, the mean difference between these populations is 0.5, which, since both populations have standard deviations of 1, means that the true standardised mean difference is 0.5. Then, four confidence intervals are computed around the sample Cohen’s d:

1. Using `cohen.d` with Hedges’ correction.
2. Using `tes`. (I only used the code relevant to Cohen’s d to speed things up.)
3. Using `ci.smd`.
4. Using `cohen.d` without Hedges’ correction.

`d_ci` simply returns, for each of these four intervals, whether it contains the true population standardised mean difference (i.e., 0.5), as in the sketch below.
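The original listing didn’t survive into this version of the post, so here is a reconstruction of `d_ci` under the assumptions above. The argument names are mine, and unlike the original I call `tes` directly rather than extracting only its Cohen’s d code, which is slower but easier to follow:

```r
library(effsize)
library(compute.es)
library(MBESS)

# Reconstruction of d_ci: generate two samples, compute four confidence
# intervals, and report whether each covers the true value.
d_ci <- function(n, true_d = 0.5, conf = 0.90) {
  x <- rnorm(n, mean = true_d, sd = 1)
  y <- rnorm(n, mean = 0, sd = 1)

  # (1) cohen.d with Hedges' correction
  g <- cohen.d(x, y, conf.level = conf, hedges.correction = TRUE)
  # (4) cohen.d without Hedges' correction
  d <- cohen.d(x, y, conf.level = conf)
  # (2) tes, fed the pooled-variance t statistic; dig is raised so that
  #     rounding of the output doesn't distort the coverage check
  t_val <- unname(t.test(x, y, var.equal = TRUE)$statistic)
  te <- tes(t_val, n.1 = n, n.2 = n, level = 100 * conf,
            dig = 8, verbose = FALSE)
  # (3) ci.smd, fed cohen.d's point estimate
  smd <- ci.smd(smd = d$estimate, n.1 = n, n.2 = n, conf.level = conf)

  c(hedges = g$conf.int[[1]] <= true_d && true_d <= g$conf.int[[2]],
    tes    = te$l.d <= true_d && true_d <= te$u.d,
    ci.smd = smd$Lower.Conf.Limit.smd <= true_d &&
             true_d <= smd$Upper.Conf.Limit.smd,
    cohen  = d$conf.int[[1]] <= true_d && true_d <= d$conf.int[[2]])
}
```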

Then, I wrote a function, `sim_es`, which runs `d_ci` a set number of times and returns the proportion of times each of the four confidence intervals contained the true population d.
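A minimal sketch of `sim_es`:

```r
# Run d_ci `runs` times and return, per method, the proportion of
# intervals that covered the true standardised mean difference.
sim_es <- function(runs, n, conf = 0.90) {
  rowMeans(replicate(runs, d_ci(n, conf = conf)))
}
```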

Next, I ran `sim_es` with 10,000 runs for each of 9 different sample sizes, from 5 observations per group to 500. The desired confidence level was 90%.
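The post doesn’t list the exact nine sample sizes, so the grid below is an assumption:

```r
# Hypothetical grid of nine per-group sample sizes.
sample_sizes <- c(5, 10, 20, 30, 50, 100, 200, 350, 500)
coverage <- sapply(sample_sizes, function(n) sim_es(runs = 10000, n = n))
```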

Finally, I stored the results in a data frame and plotted them.
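Something along the following lines reproduces the layout of Figure 1; the styling details are guesses:

```r
library(ggplot2)

# Reshape the 4-methods-by-9-sample-sizes coverage matrix into long format.
results_df <- data.frame(
  n        = rep(sample_sizes, each = nrow(coverage)),
  method   = rownames(coverage),
  coverage = as.vector(coverage)
)

# Coverage per sample size, with the nominal 90% level as a dashed line.
ggplot(results_df, aes(x = n, y = coverage, colour = method)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0.90, linetype = "dashed") +
  scale_x_log10()
```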

22 February 2017