Confidence intervals for standardised mean differences
Standardised effect sizes express patterns found in the data in terms of the variability found in the data. For instance, a mean difference in body height could be expressed in the metric in which the data were measured (e.g., a difference of 4 centimetres) or relative to the variation in the data (e.g., a difference of 0.9 standard deviations). The latter is a standardised effect size known as Cohen’s d.
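To make the distinction concrete, here is a minimal sketch in base R that expresses the same mean difference once in raw units and once in pooled standard deviations. The height data, the group names and the sample size of 30 are made up for illustration.

```r
# Illustrative data only: two hypothetical groups of body heights (in cm).
set.seed(1)
group_a <- rnorm(30, mean = 178, sd = 7)
group_b <- rnorm(30, mean = 174, sd = 7)

n1 <- length(group_a)
n2 <- length(group_b)

# Raw effect size: mean difference in centimetres.
raw_diff <- mean(group_a) - mean(group_b)

# Pooled standard deviation of the two groups.
sd_pooled <- sqrt(((n1 - 1) * var(group_a) + (n2 - 1) * var(group_b)) /
                    (n1 + n2 - 2))

# Standardised effect size: the same difference in standard deviations.
cohens_d <- raw_diff / sd_pooled

c(raw_diff = raw_diff, sd_pooled = sd_pooled, cohens_d = cohens_d)
```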
As I’ve written before, I don’t particularly like standardised effect sizes. Nonetheless, I wondered how confidence intervals around standardised effect sizes (more specifically: standardised mean differences) are constructed. Until recently, I hadn’t really thought about it and sort of assumed you would compute them the same way as confidence intervals around raw effect sizes. But unlike raw (unstandardised) mean differences, standardised mean differences are a combination of two estimates subject to sampling error: the mean difference itself and the sample standard deviation. Moreover, the sample standard deviation is a biased estimate of the population standard deviation (it tends to be too low), which causes Cohen’s d to be an upwardly biased estimate of the population standardised mean difference. Surely both of these factors must affect how the confidence intervals around standardised effect sizes are constructed?
It turns out that indeed they do. When I computed confidence intervals around a standardised effect size using a naive approach that assumed that the standard deviation wasn’t subject to sampling error and wasn’t biased, I got different results than when I used specialised R functions.
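The post doesn’t spell out the naive approach, so the following is only one plausible reading of it: take the usual t-based confidence interval for the raw mean difference and divide its limits by the pooled sample standard deviation, as if that standard deviation were a known constant. The sample size and seed are arbitrary.

```r
# One possible reading of the "naive" approach (an assumption, not the
# author's code): treat the pooled SD as fixed and rescale the raw CI.
set.seed(2)
x <- rnorm(20, mean = 0.5, sd = 1)
y <- rnorm(20, mean = 0, sd = 1)

n1 <- length(x)
n2 <- length(y)
sd_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
d <- (mean(x) - mean(y)) / sd_pooled

# 90% confidence interval for the raw mean difference (equal variances).
raw_ci <- t.test(x, y, var.equal = TRUE, conf.level = 0.90)$conf.int

# Naive standardised interval: ignores sampling error and bias in sd_pooled.
naive_ci <- as.numeric(raw_ci) / sd_pooled
naive_ci
```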
But these R functions all produced different results, too.
Obviously, there may well be more than one way to skin a cat, but this caused me to wonder if the different procedures for computing confidence intervals all covered the true population parameter with the nominal probability (e.g., in 95% of cases for a 95% confidence interval). I ran a simulation to find out, which I’ll report in the remainder of this post. If you spot any mistakes, please let me know.
Introducing the contenders
Below, I’m going to introduce three R functions for computing confidence intervals for standardised effect sizes (standardised mean differences, to be specific). To illustrate how they work, though, I’m first going to generate two samples from normal distributions with standard deviations of 1 and means of 0.5 and 0, respectively.
cohen.d in the effsize package
The first function is cohen.d from the effsize package. It takes as its arguments the two samples you want to compare and the desired confidence level (here: 90%). You can also specify whether you want to apply Hedges’ correction, which causes the function to compute Hedges’ g and confidence intervals for it. (Hedges’ g is less biased than Cohen’s d.)
90% confidence interval for Cohen’s d: [0.24, 1.38].
90% confidence interval for Hedges’ g: [0.23, 1.36].
(cohen.d also has a parameter called noncentral, but setting it to TRUE doesn’t seem to work…)
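The calls described above might look roughly like this. This is a sketch that assumes the effsize package is installed; the seed and the sample size of 20 per group are my own choices, so the intervals it prints will not match the ones reported above.

```r
# Assumes the effsize package is installed.
# The seed and n = 20 per group are illustrative, not the post's values.
set.seed(2017)
x <- rnorm(20, mean = 0.5, sd = 1)
y <- rnorm(20, mean = 0, sd = 1)

# Cohen's d with a 90% confidence interval.
d_res <- effsize::cohen.d(x, y, conf.level = 0.90)

# Hedges' g (bias-corrected) with a 90% confidence interval.
g_res <- effsize::cohen.d(x, y, conf.level = 0.90, hedges.correction = TRUE)

d_res$estimate
d_res$conf.int
g_res$estimate
g_res$conf.int
```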
tes in the compute.es package
The second function is tes from the compute.es package. It takes as its arguments the t statistic for the t test comparing the two samples, the sample sizes and the desired confidence level (as a percentage, not as a proportion).
This function outputs a lot of standardised effect sizes and their confidence intervals. Here, I’m only interested in Cohen’s d, whose 90% confidence interval now is [0.26, 1.37]. (The confidence interval for Hedges’ g is also different from that from the cohen.d function.) Note, incidentally, that the Cohen’s d and Hedges’ g values are the same for the tes and the cohen.d functions; it’s just the confidence intervals that are different.
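As a sketch of the call described above, and assuming the compute.es package is installed, the t statistic can be taken from an equal-variance t test and passed to tes together with the group sizes. The data are regenerated here with an arbitrary seed and sample size, so the numbers will differ from those in the text.

```r
# Assumes the compute.es package is installed.
# The seed and n = 20 per group are illustrative, not the post's values.
set.seed(2017)
x <- rnorm(20, mean = 0.5, sd = 1)
y <- rnorm(20, mean = 0, sd = 1)

# t statistic from a two-sample t test with pooled variance.
t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)

# Note that 'level' is a percentage (90), not a proportion (0.90).
es <- compute.es::tes(t_stat, n.1 = length(x), n.2 = length(y),
                      level = 90, verbose = FALSE)

# Cohen's d and its confidence limits.
c(lower = es$l.d, d = es$d, upper = es$u.d)
```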
ci.smd in the MBESS package
The ci.smd function from the MBESS package takes as its input a Cohen’s d, the two sample sizes, and the desired confidence level. Here I compute Cohen’s d using the cohen.d function and then feed it to ci.smd.
This time, the 90% confidence interval is [0.26, 1.35]. It was fairly small differences such as these between the three functions that led me to run the simulation I report below.
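A sketch of this step, assuming the MBESS package is installed. To keep the block dependent on one package only, Cohen’s d is computed by hand here rather than with cohen.d. The seed and sample size are again arbitrary.

```r
# Assumes the MBESS package is installed.
# The seed and n = 20 per group are illustrative, not the post's values.
set.seed(2017)
x <- rnorm(20, mean = 0.5, sd = 1)
y <- rnorm(20, mean = 0, sd = 1)

n1 <- length(x)
n2 <- length(y)

# Cohen's d by hand (pooled SD), standing in for the cohen.d estimate.
sd_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
d <- (mean(x) - mean(y)) / sd_pooled

# 90% confidence interval for the standardised mean difference.
smd_ci <- MBESS::ci.smd(smd = d, n.1 = n1, n.2 = n2, conf.level = 0.90)

c(lower = smd_ci$Lower.Conf.Limit.smd, upper = smd_ci$Upper.Conf.Limit.smd)
```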
Coverage of the population standardised mean difference by different confidence intervals
For the simulation I generated lots of samples from two normal distributions with the same standard deviation whose means were half a standard deviation apart. In other words, the population standardised mean difference was 0.5.
For each sample, I computed 90% confidence intervals around the sample standardised mean difference (Cohen’s d) using the cohen.d, tes and ci.smd functions. I also computed 90% confidence intervals around Hedges’ g using the cohen.d function.
I then checked how often these intervals contained the population standardised mean difference (0.5). Ideally, this should be the case in about 90% of the samples generated. If it’s fewer than that, the confidence intervals are too narrow; if it’s more than that, they’re too wide.
I ran this simulation for different sample sizes, ranging from 5 observations per
group to 500 per group.
The R code for this simulation is available at the bottom of this post.
Figure 1. Coverage of the population standardised mean difference (0.5) by confidence intervals computed using the cohen.d, tes and ci.smd functions, based on 10,000 simulation runs per sample size. The dashed horizontal line shows the nominal confidence level; the grey lines around it show the values between which the coverage rates should lie with 95% probability if the confidence intervals had their nominal coverage rate.
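The grey band in the caption can be reproduced from the binomial distribution: if an interval truly covers the parameter 90% of the time, the number of covering intervals in 10,000 runs is binomial with probability 0.9. A minimal sketch in base R (the variable names are mine):

```r
runs <- 10000     # simulation runs per sample size
nominal <- 0.90   # nominal coverage of the confidence intervals

# Central 95% range of the observed coverage rate under nominal coverage.
band <- qbinom(c(0.025, 0.975), size = runs, prob = nominal) / runs
band
```

Coverage rates outside this band are unlikely to be due to simulation noise alone.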
As Figure 1 shows, the coverage rates for the confidence intervals computed around Cohen’s d using the ci.smd function are at their nominal level even for small samples. The confidence intervals computed using the cohen.d and tes functions, however, are too wide for sample sizes of up to 50 observations per group.
Conclusions and further reading
First of all, I want to reiterate that I think standardised effect sizes, including standardised mean differences and correlation coefficients, are overvalued and that I think we should strive to interpret raw effect sizes instead.
That said, on a practical level, this simulation suggests that if you nonetheless want to express your results as a standardised mean difference and you want to compute a confidence interval around it, it’s a good idea to take a look at the MBESS package. The package’s vignette also has a good discussion of how exact confidence intervals can be constructed around standardised effect sizes, and the package provides a fast implementation of these methods.
By contrast, the effsize and compute.es packages seem to rely on overly
conservative approximations to these exact methods, and differ from
each other in how the variance for Cohen’s d is computed (see here
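For readers curious what an exact interval involves, here is a rough base-R sketch of the standard noncentral-t inversion for two independent groups. It is not MBESS’s code; the function name nct_ci_smd and the search range of ±10 around the observed t value are my own assumptions, and the range is only wide enough for moderate effect sizes.

```r
# Sketch of a noncentral-t confidence interval for a standardised mean
# difference. nct_ci_smd is a hypothetical helper, not an MBESS function.
nct_ci_smd <- function(d, n1, n2, conf_level = 0.90) {
  alpha <- 1 - conf_level
  df <- n1 + n2 - 2
  scale <- sqrt(n1 * n2 / (n1 + n2))  # t statistic = d * scale
  t_obs <- d * scale

  # Find the noncentrality parameter for which the observed t value sits
  # at cumulative probability p of the noncentral t distribution.
  find_ncp <- function(p) {
    uniroot(function(ncp) pt(t_obs, df = df, ncp = ncp) - p,
            interval = c(t_obs - 10, t_obs + 10), tol = 1e-9)$root
  }

  # Convert the limits for the noncentrality parameter back to the d scale.
  c(lower = find_ncp(1 - alpha / 2), upper = find_ncp(alpha / 2)) / scale
}

ci <- nct_ci_smd(d = 0.8, n1 = 20, n2 = 20, conf_level = 0.90)
ci
```

Because the noncentral t distribution is skewed, the resulting interval is generally not symmetric around the observed d.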
For those interested, here’s the R code I used for the simulation. If you spot an error that explains the results above, please let me know.
First, I defined a function, d_ci, that generates two samples from normal distributions with standard deviations of 1. In the simulation, the mean difference between these populations is 0.5, which, since both populations have standard deviations of 1, means that the true standardised mean difference is 0.5.
Then, four confidence intervals are computed around the sample Cohen’s d:
cohen.d with Hedges’ correction.
tes. (I only used the code relevant to Cohen’s d to speed things up.)
cohen.d without Hedges’ correction.
ci.smd.
d_ci simply returns, for each of these four intervals, whether it contains the true
population standardised mean difference (i.e., 0.5).
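The original code did not survive here, so what follows is a reconstruction sketch rather than the author’s function. It assumes the effsize, compute.es and MBESS packages are installed, and the argument names (n, mean_diff, conf_level) are my own. Unlike the original, it calls tes in full instead of only the part relevant to Cohen’s d, which makes it slower.

```r
# Reconstruction sketch, not the author's code.
# Assumes effsize, compute.es and MBESS are installed.
d_ci <- function(n, mean_diff = 0.5, conf_level = 0.90) {
  # Both populations have SD 1, so the true standardised mean
  # difference equals mean_diff.
  x <- rnorm(n, mean = mean_diff, sd = 1)
  y <- rnorm(n, mean = 0, sd = 1)

  covers <- function(ci) ci[1] <= mean_diff && mean_diff <= ci[2]

  d_res <- effsize::cohen.d(x, y, conf.level = conf_level)
  g_res <- effsize::cohen.d(x, y, conf.level = conf_level,
                            hedges.correction = TRUE)

  t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)
  es <- compute.es::tes(t_stat, n.1 = n, n.2 = n,
                        level = 100 * conf_level, dig = 6, verbose = FALSE)

  smd <- MBESS::ci.smd(smd = unname(d_res$estimate), n.1 = n, n.2 = n,
                       conf.level = conf_level)

  c(cohen_d_hedges = covers(unname(g_res$conf.int)),
    tes            = covers(c(es$l.d, es$u.d)),
    cohen_d        = covers(unname(d_res$conf.int)),
    ci_smd         = covers(c(smd$Lower.Conf.Limit.smd,
                              smd$Upper.Conf.Limit.smd)))
}

set.seed(1)
d_ci(n = 20)
```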
Then I wrote a function, sim_es, which runs d_ci a set number of times and returns the proportion of times the four confidence intervals contained the true population d.
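A sketch of what such a wrapper could look like. To keep the block runnable without extra packages, sim_es takes the checking function as an argument (my own design choice, not necessarily the author’s), and the demonstration uses a base-R stand-in, raw_diff_covered, that checks the ordinary t interval for the raw mean difference instead of the four standardised intervals.

```r
# Sketch only: check_fun is a hypothetical argument so that any function
# returning a named logical vector (such as d_ci) can be plugged in.
sim_es <- function(runs, n, check_fun, ...) {
  hits <- do.call(rbind,
                  lapply(seq_len(runs), function(i) check_fun(n, ...)))
  colMeans(hits)  # proportion of runs in which each interval covered
}

# Base-R stand-in for d_ci: does the 90% t interval for the raw mean
# difference contain the true difference?
raw_diff_covered <- function(n, mean_diff = 0.5, conf_level = 0.90) {
  x <- rnorm(n, mean = mean_diff, sd = 1)
  y <- rnorm(n, mean = 0, sd = 1)
  ci <- t.test(x, y, var.equal = TRUE, conf.level = conf_level)$conf.int
  c(raw_t_interval = ci[1] <= mean_diff && mean_diff <= ci[2])
}

set.seed(123)
coverage <- sim_es(runs = 2000, n = 20, check_fun = raw_diff_covered)
coverage
```

With a correctly calibrated interval, the returned proportion should be close to the nominal 0.90.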
Next I ran sim_es with 10,000 runs for each of 9 different sample sizes, from 5 observations per group to 500. The desired confidence level was 90%.
Finally, I stored the results to a dataframe and plotted them.