# Confidence intervals for standardised mean differences

Standardised effect sizes express patterns found in the data in
terms of the variability found in the data. For instance, a mean difference
in body height could be expressed in the metric in which the data were
measured (e.g., a difference of 4 centimetres) or relative to the
variation in the data (e.g., a difference of 0.9 standard deviations).
The latter is a standardised effect size known as Cohen’s *d*.
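To make the definition concrete, here is a minimal sketch of how Cohen’s *d* can be computed by hand for two independent groups. It assumes the usual pooled standard deviation as the standardiser; the helper name `cohens_d` and the body heights are made up for illustration.

```r
# Hypothetical helper: Cohen's d with a pooled standard deviation
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  # Pooled SD: weighted average of the two sample variances, then square root
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / s_pooled
}

# Made-up body heights in centimetres
group_a <- c(178, 182, 175, 180, 185)
group_b <- c(174, 176, 173, 179, 178)

mean(group_a) - mean(group_b)  # raw difference: 4 centimetres
cohens_d(group_a, group_b)     # the same difference in pooled standard deviations
```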

As I’ve written before, I don’t particularly like standardised effect sizes.
Nonetheless, I wondered how confidence intervals around standardised
effect sizes (more specifically: standardised mean differences)
are constructed. Until recently, I hadn’t really thought about it
and sort of assumed you would compute them the same way as
confidence intervals around
raw effect sizes. But unlike raw (unstandardised) mean differences,
standardised mean differences are a combination of *two* estimates
subject to sampling error: the mean difference itself
and the sample standard deviation.
Moreover, the sample standard deviation is a biased estimate of
the population standard deviation (it tends to be
too low),
which causes Cohen’s *d* to be an upwardly biased estimate of the
population standardised mean difference.
Surely both of these factors must affect how
the confidence intervals around standardised effect sizes are constructed?
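Both biases are easy to see in a quick simulation. The sketch below is only an illustration: the seed, the group size of 5 and the number of simulated data sets are arbitrary choices, and the correction factor at the end is the common approximation used for Hedges’ *g*, which comes up again further down.

```r
set.seed(2024)  # arbitrary seed, only so that the sketch is reproducible

n_per_group <- 5      # small samples make the bias easy to see
true_delta  <- 0.5    # population standardised mean difference
n_sims      <- 20000

sim_once <- function() {
  x <- rnorm(n_per_group, mean = true_delta, sd = 1)
  y <- rnorm(n_per_group, mean = 0, sd = 1)
  s_pooled <- sqrt((var(x) + var(y)) / 2)  # pooled SD for equal group sizes
  c(s_pooled = s_pooled, d = (mean(x) - mean(y)) / s_pooled)
}

sims <- replicate(n_sims, sim_once())  # 2 x n_sims matrix

# On average, s_pooled comes out below 1 and d comes out above 0.5
rowMeans(sims)

# Approximate small-sample correction factor that turns d into Hedges' g
df <- 2 * n_per_group - 2
J  <- 1 - 3 / (4 * df - 1)
mean(sims["d", ] * J)  # the mean of g lies closer to 0.5 than the mean of d
```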

It turns out that they do. When I computed confidence intervals around a standardised effect size using a naive approach that assumed that the standard deviation was neither subject to sampling error nor biased, I got different results than when I used specialised R functions.
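One possible reading of such a naive approach is to take the usual *t*-based confidence interval for the raw mean difference and divide its limits by the pooled standard deviation, as if that standard deviation were a known constant. The sketch below does this in base R. The function name `naive_d_ci`, the seed and the group sizes are chosen here for illustration; the exact naive computation used for this post isn’t shown, so treat this as an assumption.

```r
naive_d_ci <- function(x, y, conf_level = 0.90) {
  n1 <- length(x)
  n2 <- length(y)
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  # Equal-variance t interval for the raw (unstandardised) mean difference
  raw_ci <- t.test(x, y, var.equal = TRUE, conf.level = conf_level)$conf.int
  # Divide by the pooled SD, treating it as fixed rather than estimated
  c(d     = (mean(x) - mean(y)) / s_pooled,
    lower = raw_ci[1] / s_pooled,
    upper = raw_ci[2] / s_pooled)
}

set.seed(123)                        # arbitrary seed
x <- rnorm(20, mean = 0.5, sd = 1)   # group size of 20 is an arbitrary choice
y <- rnorm(20, mean = 0, sd = 1)
naive_d_ci(x, y)
```

The resulting interval is symmetric around *d* and its width depends only on the sample sizes, not on the size of *d* itself.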

But these R functions all produced different results, too.

Obviously, there may well be more than one way to skin a cat, but this
caused me to wonder if the different procedures for computing confidence
intervals all covered the true population parameter with the nominal
probability (e.g., in 95% of cases for a 95% confidence interval).
I ran a simulation to find out, which I’ll report in the remainder of this post.
**If you spot any mistakes, please let me know.**

### Introducing the contenders

Below, I’m going to introduce three R functions for computing confidence intervals for standardised effect sizes (standardised mean differences, to be specific). To illustrate how they work, though, I’m first going to generate two samples from normal distributions with standard deviations of 1 and means of 0.5 and 0, respectively.
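Generating such samples could look as follows. The group size and the seed aren’t stated in the text, so the values below are placeholders, and the data won’t reproduce the intervals quoted further down.

```r
set.seed(2017)     # placeholder seed
n_per_group <- 20  # placeholder; the sample size isn't stated in the text

x <- rnorm(n_per_group, mean = 0.5, sd = 1)  # first sample: population mean 0.5
y <- rnorm(n_per_group, mean = 0,   sd = 1)  # second sample: population mean 0

c(mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y))
```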

#### `cohen.d` in the `effsize` package

The first function is `cohen.d` from the `effsize` package.
It takes as its arguments the two samples you want to compare and
the desired confidence level (here: 90%). You can also specify
whether you want to apply `hedges.correction`,
which causes the function to compute Hedges’ *g*
and confidence intervals for it.
(Hedges’ *g* is less biased than Cohen’s *d*.)
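A call along these lines illustrates the interface. It requires the `effsize` package to be installed. The seed and the group size of 20 are placeholders, so the output will differ from the intervals quoted in this post.

```r
library(effsize)

set.seed(2017)                       # placeholder seed
x <- rnorm(20, mean = 0.5, sd = 1)   # placeholder group size
y <- rnorm(20, mean = 0,   sd = 1)

# Cohen's d with a 90% confidence interval
d_res <- cohen.d(x, y, conf.level = 0.90)
d_res$estimate
d_res$conf.int

# Hedges' g with a 90% confidence interval
g_res <- cohen.d(x, y, hedges.correction = TRUE, conf.level = 0.90)
g_res$estimate
g_res$conf.int
```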

90% confidence interval for Cohen’s *d*: [0.24, 1.38].

90% confidence interval for Hedges’ *g*: [0.23, 1.36].

(Incidentally, `cohen.d` also has a parameter called `noncentral`,
but setting it to `TRUE` doesn’t seem to work…)

#### `tes` in the `compute.es` package

The second function is `tes` from the `compute.es` package.
It takes as its arguments the *t* statistic for the *t* test comparing the two samples,
the sample sizes and the desired confidence level (as a percentage, not as a proportion):
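A sketch of such a call, with the `compute.es` package installed, might look like this. The data are placeholders again, and using the equal-variance *t* test to obtain the *t* statistic is an assumption made for this example.

```r
library(compute.es)

set.seed(2017)                       # placeholder seed
x <- rnorm(20, mean = 0.5, sd = 1)   # placeholder group size
y <- rnorm(20, mean = 0,   sd = 1)

# t statistic from an equal-variance two-sample t test (assumption)
t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)

# level is a percentage; verbose = FALSE suppresses the printed summary
es <- tes(t_stat, n.1 = length(x), n.2 = length(y), level = 90, verbose = FALSE)

es[, c("d", "l.d", "u.d")]  # Cohen's d with its lower and upper limits
es[, c("g", "l.g", "u.g")]  # Hedges' g with its lower and upper limits
```

By default, `tes` rounds its output to two decimal places (argument `dig`).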

This function outputs a lot of standardised effect sizes and their confidence intervals.
Here, I’m only interested in Cohen’s *d*, whose 90% confidence interval now is [0.26, 1.37].
(The confidence interval for Hedges’ *g* is also different from that from the `cohen.d` function.)

Note, incidentally, that the Cohen’s *d* and Hedges’ *g* values are the same for the
`tes` and the `cohen.d` function;
it’s just the confidence intervals that are different.

#### `ci.smd` in the `MBESS` package

Lastly, the `ci.smd` function from the `MBESS` package takes
as its input a Cohen’s *d*, the two sample sizes, and the desired confidence level.
Here I compute Cohen’s *d* using the `cohen.d` function and then feed it to `ci.smd`.
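These two steps could be sketched as follows, with `effsize` and `MBESS` installed. The seed and group size are placeholders, so the limits won’t match the ones quoted in this post.

```r
library(effsize)
library(MBESS)

set.seed(2017)                       # placeholder seed
x <- rnorm(20, mean = 0.5, sd = 1)   # placeholder group size
y <- rnorm(20, mean = 0,   sd = 1)

# Step 1: Cohen's d from cohen.d (unname() drops the label on the estimate)
d_hat <- unname(cohen.d(x, y)$estimate)

# Step 2: feed d and the two sample sizes to ci.smd
ci <- ci.smd(smd = d_hat, n.1 = length(x), n.2 = length(y), conf.level = 0.90)

ci$Lower.Conf.Limit.smd
ci$Upper.Conf.Limit.smd
```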

This time, the 90% confidence interval is [0.26, 1.35]. It was fairly small differences between the three functions such as these that led me to run the simulation I report below.

### Coverage of the population standardised mean difference by different confidence intervals

#### Method

For the simulation I generated lots of samples from two normal distributions
with the same standard deviation whose means were half a standard deviation apart.
In other words, the population standardised mean difference was 0.5.
For each pair of samples, I computed 90% confidence intervals around the sample standardised
mean difference (Cohen’s *d*) using the `cohen.d`, `tes` and `ci.smd` functions;
I also computed 90% confidence intervals around Hedges’ *g* using the `cohen.d` function.
I then checked how often these intervals contained the population standardised mean
difference (0.5). Ideally, this should be the case in about 90% of the samples generated.
If it’s fewer than that, the confidence intervals are too narrow;
if it’s more than that, they’re too wide.
I ran this simulation for different sample sizes, ranging from 5 observations per
group to 500 per group.

The R code for this simulation is available at the bottom of this post.

#### Results

Figure 1. Coverage of the population standardised mean difference (0.5) by confidence intervals computed using the `cohen.d`, `tes` and `ci.smd` functions based on 10,000 simulation runs per sample size. The dashed horizontal line shows the nominal confidence level; the grey lines around it show the values between which the coverage rates should lie with 95% probability if the confidence intervals had their nominal coverage rate.

As Figure 1 clearly shows, the coverage rates for the confidence intervals computed around
Cohen’s *d* using the `ci.smd` function are at their nominal level even for small samples.
The confidence intervals computed using the `cohen.d` and `tes` functions, however, are too wide
for sample sizes of up to 50 observations per group.
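The grey band in Figure 1 can be understood with a binomial calculation: if an interval truly covers the parameter 90% of the time, the number of covering intervals in 10,000 runs follows a binomial distribution. A minimal sketch, assuming the band is based on the 2.5th and 97.5th percentiles of that distribution (the plotting code behind the figure may compute it differently); the helper `within_band` is made up for illustration.

```r
n_runs  <- 10000  # simulation runs per sample size
nominal <- 0.90   # nominal coverage rate

# Central 95% range of the observed coverage rate if true coverage is nominal
band <- qbinom(c(0.025, 0.975), size = n_runs, prob = nominal) / n_runs
band

# Hypothetical helper: does an observed coverage rate fall inside that range?
within_band <- function(coverage) coverage >= band[1] & coverage <= band[2]
within_band(c(0.899, 0.915))
```

Observed coverage rates outside this range are unlikely to be due to simulation noise alone.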

### Conclusions and further reading

First of all, I want to reiterate that I think standardised effect sizes, including standardised mean differences and correlation coefficients, are overvalued and that I think we should strive to interpret raw effect sizes instead.

That said, on a practical level, this simulation suggests that
if you nonetheless want to express your results as a standardised
mean difference and you want to compute a confidence interval around it,
it’s a good idea to take a look at the `MBESS` package.
The package’s vignette also has
a good discussion of how exact confidence intervals can be constructed around standardised effect sizes,
and the package provides a fast implementation of these methods.
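For readers who want to see the idea behind such exact intervals, here is a rough base-R sketch. It assumes two independent groups with equal population variances, in which case the *t* statistic follows a noncentral *t* distribution whose noncentrality parameter equals the population standardised mean difference multiplied by $\sqrt{n_1 n_2 / (n_1 + n_2)}$. The interval is found by searching for the noncentrality parameters that place the observed *t* at the required tail probabilities. The function name `nct_ci_d` and the search range are choices made for this example, and this is not the `MBESS` implementation.

```r
nct_ci_d <- function(d, n1, n2, conf_level = 0.90) {
  alpha <- 1 - conf_level
  df    <- n1 + n2 - 2
  scale <- sqrt(n1 * n2 / (n1 + n2))  # converts d to a t statistic and back
  t_obs <- d * scale

  # Find the noncentrality parameter at which P(T <= t_obs) equals p.
  # The search range of +/- 10 is an assumption that suits moderate effects.
  find_ncp <- function(p) {
    uniroot(function(ncp) pt(t_obs, df = df, ncp = ncp) - p,
            interval = c(t_obs - 10, t_obs + 10), tol = 1e-10)$root
  }

  # Rescale the noncentrality limits to the metric of d
  c(lower = find_ncp(1 - alpha / 2) / scale,
    upper = find_ncp(alpha / 2) / scale)
}

nct_ci_d(d = 0.8, n1 = 20, n2 = 20)
```

Unlike an interval of the form *d* plus or minus a multiple of a standard error, this interval is generally not symmetric around *d*.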

By contrast, the `effsize` and `compute.es` packages seem to rely on overly
conservative approximations to these exact methods, and differ from
each other in how the variance for Cohen’s *d* is computed (see here
and here).

For those of you interested in further details, Wolfgang Viechtbauer provided some links on Twitter that you may want to take a look at.

### R code

For those interested, here’s the R code I used for the simulation. If you spot an error that explains the results above, please let me know.

First, I defined a function, `d_ci`, that generates two samples from normal distributions
with standard deviations of 1. In the simulation, the mean difference between these populations
is 0.5, which, since both populations have standard deviations of 1, means that the
true standardised mean difference is 0.5.
Then, four confidence intervals are computed around the sample Cohen’s *d*:

- Using `cohen.d` with Hedges’ correction.
- Using `tes`. (I only used the code relevant to Cohen’s *d* to speed things up.)
- Using `ci.smd`.
- Using `cohen.d` without Hedges’ correction.

`d_ci` simply returns, for each of these four intervals, whether they contain the true
population standardised mean difference (i.e., 0.5).
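A function matching this description could be sketched as follows. This is a reconstruction under assumptions, not the code used for the simulation: the argument names, the element names of the result and the use of the full `tes` function (instead of a stripped-down version) are choices made for this example. It requires the `effsize`, `compute.es` and `MBESS` packages.

```r
library(effsize)
library(compute.es)
library(MBESS)

d_ci <- function(n, delta = 0.5, conf_level = 0.90) {
  x <- rnorm(n, mean = delta, sd = 1)
  y <- rnorm(n, mean = 0, sd = 1)

  # cohen.d with and without Hedges' correction
  g_eff <- cohen.d(x, y, hedges.correction = TRUE, conf.level = conf_level)
  d_eff <- cohen.d(x, y, conf.level = conf_level)

  # tes, called in full; dig = 6 limits the rounding of its output
  t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)
  d_tes <- tes(t_stat, n.1 = n, n.2 = n, level = 100 * conf_level,
               dig = 6, verbose = FALSE)

  # ci.smd, fed with the Cohen's d computed by cohen.d
  d_mbess <- ci.smd(smd = unname(d_eff$estimate), n.1 = n, n.2 = n,
                    conf.level = conf_level)

  # TRUE if the interval contains the population value
  covers <- function(lower, upper) unname(lower <= delta & delta <= upper)

  c(cohen.d_g = covers(g_eff$conf.int[1], g_eff$conf.int[2]),
    tes       = covers(d_tes$l.d, d_tes$u.d),
    ci.smd    = covers(d_mbess$Lower.Conf.Limit.smd,
                       d_mbess$Upper.Conf.Limit.smd),
    cohen.d_d = covers(d_eff$conf.int[1], d_eff$conf.int[2]))
}

set.seed(1)   # arbitrary seed
d_ci(n = 20)  # one simulated data set with 20 observations per group
```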

Then I wrote a function, `sim_es`, which runs `d_ci` a set number of times
and returns the proportion of times the four confidence intervals
contained the true population *d*.

Next I ran `sim_es` with 10,000 runs for each of 9 different sample sizes, from
5 observations per group to 500.
The desired confidence level was 90%.

Finally, I stored the results to a dataframe and plotted them.
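The last three steps could be sketched along the following lines. To keep the example runnable in base R, it uses a stand-in for `d_ci` that checks two simple approximate intervals rather than the four intervals from the packages, so its results are not those shown in Figure 1. The stand-in, the grid of sample sizes and the number of runs are all assumptions made for this illustration.

```r
# Stand-in for d_ci: coverage of two simple intervals (NOT the ones in the post)
d_ci_standin <- function(n, delta = 0.5, conf_level = 0.90) {
  x <- rnorm(n, mean = delta, sd = 1)
  y <- rnorm(n, mean = 0, sd = 1)
  d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
  p <- 1 - (1 - conf_level) / 2
  # Interval 1: treats the pooled SD as fixed
  hw_fixed <- qt(p, df = 2 * n - 2) * sqrt(2 / n)
  # Interval 2: common large-sample standard error of d for equal group sizes
  hw_large <- qnorm(p) * sqrt(2 / n + d^2 / (4 * n))
  c(fixed_sd     = abs(d - delta) <= hw_fixed,
    large_sample = abs(d - delta) <= hw_large)
}

# Run one_run() `runs` times and return the coverage rate per interval
sim_es <- function(runs, n, one_run) {
  results <- do.call(rbind, lapply(seq_len(runs), function(i) one_run(n)))
  colMeans(results)
}

set.seed(42)                             # arbitrary seed
sample_sizes <- c(5, 10, 20, 50, 100)    # hypothetical grid of group sizes
runs <- 2000                             # fewer runs than the 10,000 in the post

coverage <- t(sapply(sample_sizes, sim_es, runs = runs, one_run = d_ci_standin))
results <- data.frame(n = sample_sizes, coverage)
results

# Coverage by sample size, with the nominal level and a 95% binomial band
matplot(results$n, results[, -1], type = "b", pch = 1:2, lty = 1, col = 1:2,
        log = "x", xlab = "Observations per group", ylab = "Coverage rate")
abline(h = 0.90, lty = 2)
abline(h = qbinom(c(0.025, 0.975), runs, 0.90) / runs, col = "grey")
legend("bottomright", legend = names(results)[-1], pch = 1:2, col = 1:2, lty = 1)
```

Passing the coverage function as the `one_run` argument makes it straightforward to swap the stand-in for a function that calls `cohen.d`, `tes` and `ci.smd`.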