Exact significance tests for 2 × 2 tables
Two-by-two contingency tables look so simple that you’d be forgiven for thinking they’re straightforward to analyse. A glance at the statistical literature on the analysis of contingency tables, however, reveals a plethora of techniques and controversies surrounding them that will quickly disabuse you of this notion (see, for instance, Fagerland et al. 2017). In this blog post, I discuss a handful of different study designs that give rise to two-by-two tables and present a few exact significance tests that can be applied to these tables. A more exhaustive overview can be found in Fagerland et al. (2017).
Preliminaries
Two-by-two contingency tables
Two-by-two contingency tables arise when we cross-tabulate observations that have two binary properties (
Exact and approximate tests
This blog post is about exact significance tests. An exact test has the following defining property: If the null hypothesis is true, the
Tests that aren’t exact can still be approximate. A possible problem with approximate tests is that their justification depends on results derived for large samples; for smaller samples,
Contingency tables with both marginals fixed
Two-by-two contingency tables can be the result of different research designs – some fairly common, others exceedingly rare.
Example 1 (Fisher’s exact test, one-sided). Say we want to establish if a learner of English is able to tell the /æ/ phoneme in bat and the /ɛ/ phoneme in bet apart. To this end, we make 35 recordings, 21 of which contain the word bet and 14 contain the word bat. The learner is then asked to identify those 21 audio files that he thinks are recordings of bet; the remaining 14 audio files are suspected recordings of bat. The results are summarised in the following contingency table:
Note that we insisted that the learner select exactly 21 suspected recordings of bet, no more and no fewer. As a result, the column total
The null hypothesis in this setting is that the learner is incapable of distinguishing bet from bat recordings and just selected 21 random audio files as suspected bet recordings in order to comply with the instructions. Under this null hypothesis, the top left entry in the contingency table (
Figure 1 shows the probability mass and cumulative probability functions of the
If the learner did not just pick 21 audio files at random but was in fact able to tell bet and bat recordings apart to some degree, this top-left entry can be expected to be large as opposed to small. This means that we want to compute a right-sided
1 - phyper(15 - 1, 21, 14, 21)
[1] 0.09059986
This computation amounts to running Fisher’s exact test:
tab <- rbind(c(15, 6), c(6, 8))
fisher.test(tab, alternative = "greater")$p.value
[1] 0.09059986
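As a sanity check on the hypergeometric reasoning, the null hypothesis in Example 1 can also be simulated directly. The following is a minimal sketch rather than part of the analysis proper: the seed, the number of simulation runs (n_sim) and the variable names are arbitrary choices of mine. It repeatedly picks 21 of the 35 recordings at random and counts how often at least 15 of the picks are bet recordings.

```r
set.seed(2024)

# 21 bet recordings and 14 bat recordings, as in Example 1
recordings <- rep(c("bet", "bat"), times = c(21, 14))

# Under the null hypothesis, the learner picks 21 files at random.
# For each simulated pick, count how many of the 21 files are bet recordings.
n_sim <- 20000
n_bet_picked <- replicate(n_sim, sum(sample(recordings, 21) == "bet"))

# Proportion of simulated picks with 15 or more bet recordings
p_sim <- mean(n_bet_picked >= 15)

# Exact right-sided p-value from the hypergeometric distribution
p_exact <- phyper(15 - 1, 21, 14, 21, lower.tail = FALSE)

c(simulated = p_sim, exact = p_exact)
```

The simulated proportion should land close to the exact value computed above, and the discrepancy should shrink as n_sim grows.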
Example 2 (Fisher’s exact test, two-sided). Let’s slightly change the design of the study in Example 1. Instead of recording bet 21 times and bat 14 times and asking the learner to select 21 suspected bet recordings, we record both bet and bat 18 times and ask the learner to select 18 suspected bet recordings. The results are summarised in the following contingency table:
Under the null hypothesis that the learner possesses no relevant discriminatory ability, the top-left entry (
Of note, the learner seems to be able to tell bet and bat apart to some extent – it’s just that he seems to identify bet recordings as bat and vice versa. Since we’re interested in the learner’s discriminatory ability, regardless of whether he is then also able to correctly label the two categories, we want to compute a two-sided
p_k <- dhyper(0:18, 18, 18, 18)
sum(p_k[p_k <= dhyper(5, 18, 18, 18)])
[1] 0.01839395
Fisher’s exact test carries out the same computation.
tab <- rbind(c(5, 13), c(13, 5))
fisher.test(tab)$p.value
[1] 0.01839395
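Because all four marginal totals equal 18 in this design, the hypergeometric distribution of the top-left entry is symmetric around 9. The sketch below (variable names are my own) uses this to show where the two-sided result comes from: the outcomes that are no more probable than the observed 5 are 0 to 5 and 13 to 18, so the two-sided p-value is simply twice the left tail.

```r
tab <- rbind(c(5, 13), c(13, 5))

# Probability mass function of the top-left entry under the null hypothesis
p_k <- dhyper(0:18, 18, 18, 18)

# Symmetry check: P(X = k) equals P(X = 18 - k)
is_symmetric <- isTRUE(all.equal(p_k, rev(p_k)))

# Left tail P(X <= 5), doubled to account for the mirror-image right tail
p_left <- phyper(5, 18, 18, 18)
p_two_sided <- 2 * p_left

c(symmetric = is_symmetric,
  agrees_with_fisher = isTRUE(all.equal(p_two_sided, fisher.test(tab)$p.value)))
```

This shortcut only works because the distribution is symmetric here. With unequal marginals, as in Example 1, the two tails have to be assembled outcome by outcome, which is what the comparison against dhyper(5, 18, 18, 18) above does.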
Contingency tables with one marginal fixed
Contingency tables in which both the row and column marginals are fixed in advance are a rare sight. More common in some areas of research are contingency tables where only the row marginals are fixed by design. Such tables can be found in, for instance, experimental research in which a fixed number of participants are assigned to one experimental condition and another fixed number of participants are assigned to the other experimental condition and where for each participant, we have a single binary outcome (e.g., died vs. survived, or passed vs. failed).
Example 3 (Boschloo’s test). Let’s imagine we’re in charge of an agency that designs self-study courses to help students prepare for an entrance exam. We’ve developed a course whose efficacy we want to compare against that of its predecessor. More specifically, we’re interested in finding out whether the new course is better than the old one in terms of helping the students pass the entrance exam. We recruit 24 students willing to participate in an evaluation study and, using complete randomisation, we assign 12 of them to work with the new course and 12 to work with the old one. The results look as follows:
In contrast to the previous two examples, it’s only the row marginals that were known beforehand in this example. Nevertheless, applying Fisher’s exact test to this contingency table is reasonable. In doing so, we would be conditioning the analysis on the observed column marginals (
An exact test remains exact when it is used conditionally (Lydersen 2009:1165), so the resulting
tab <- rbind(c(10, 2), c(6, 6))
fisher.test(tab, alternative = "greater")$p.value
[1] 0.09651366
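To make the conditioning explicit: once the column totals (16 and 8) are treated as given, the top-left entry follows a hypergeometric distribution under the null hypothesis, just as in Examples 1 and 2. The sketch below, with variable names of my own choosing, recomputes the one-sided p-value from that distribution.

```r
tab <- rbind(c(10, 2), c(6, 6))

row_totals <- rowSums(tab)  # 12 and 12: fixed by design
col_totals <- colSums(tab)  # 16 and 8: only known once the data are in

# P(top-left entry >= 10), given both sets of marginal totals
p_conditional <- phyper(tab[1, 1] - 1,
                        row_totals[1], row_totals[2], col_totals[1],
                        lower.tail = FALSE)

p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(hypergeometric = p_conditional, fisher = p_fisher)
```

The two numbers should coincide, which is all that "conditioning on the column marginals" amounts to computationally.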
That said, conditional exact tests tend to be pretty conservative. Intuitively, the reason is that the conditional exact test only considers
Unconditional exact tests consider the whole sample space and are consequently less conservative than conditional exact tests. The contingency table in the fixed row marginals design can be considered the result of two draws from binomial distributions: one with
An unconditional exact test for the fixed row marginals design that is often recommended is Boschloo’s test (Boschloo 1970). The idea behind this test is as follows. First, we define a test statistic that captures the extent to which observed contingency tables differ from the contingency table you’d expect to find under the null hypothesis, given
Let’s walk through the computation step by step. First, we run a one-sided Fisher’s exact test in order to obtain the observed test statistic:
(obs_test_stat <- fisher.test(rbind(c(10, 2), c(6, 6)), alternative = "greater")$p.value)
[1] 0.09651366
Next, we create a grid with all possible combinations of
n_row1 <- 12
n_row2 <- 12
tables <- expand.grid(n11 = 0:n_row1, n21 = 0:n_row2)
We now fix some
The probability of observing the first row is given by the probability mass of
dbinom(8, 12, 0.43) * dbinom(6, 12, 0.43)
[1] 0.01223444
We compute this probability for all 169 tables:
tables$probability <- dbinom(tables$n11, 12, 0.43) * dbinom(tables$n21, 12, 0.43)
For each table, we also compute the test statistic:
tables$test_statistic <- NA
for (i in 1:nrow(tables)) {
  current_table <- rbind(c(tables$n11[i], 12 - tables$n11[i]),
                         c(tables$n21[i], 12 - tables$n21[i]))
  tables$test_statistic[i] <- fisher.test(current_table, alternative = "greater")$p.value
}
We can now compute the probability that we’d observe a test statistic at least as extreme as the test statistic associated with the table we actually observed, assuming the null hypothesis is true and
weighted.mean(tables$test_statistic <= obs_test_stat, w = tables$probability)
[1] 0.04312598
Assuming
If our alternative hypothesis were that the new programme produced worse results than the old one, we’d have used the left-sided
To run Boschloo’s test, you can use the following boschloo_test() function:
boschloo_test <- function(tab, alternative = "two.sided", pi_range = c(0, 1), stepsize = 0.01) {
  # This test assumes fixed row sums.
  # Nuisance parameter values in the interval pi_range are tried out.
  # stepsize governs granularity of search through nuisance parameter value candidates.
  if (!all(dim(tab) == c(2, 2))) stop("tab needs to be a 2*2 contingency table.")
  if (alternative == "two.sided") {
    # Truncate two-sided p-value at 1
    return(
      min(2 * min(boschloo_test(tab, alternative = "less", pi_range = pi_range, stepsize = stepsize),
                  boschloo_test(tab, alternative = "greater", pi_range = pi_range, stepsize = stepsize)),
          1)
    )
  }

  # Use Fisher's exact test p-value as test statistic
  statistic <- function(x) fisher.test(x, alternative = alternative)$p.value

  # Construct grid with possible results
  row_sums <- rowSums(tab)
  my_grid <- expand.grid(n1 = 0:row_sums[1], n2 = 0:row_sums[2])
  my_grid$statistic <- NA
  for (i in 1:nrow(my_grid)) {
    my_tab <- rbind(c(my_grid$n1[i], row_sums[1] - my_grid$n1[i]),
                    c(my_grid$n2[i], row_sums[2] - my_grid$n2[i]))
    my_grid$statistic[i] <- statistic(my_tab)
  }

  # Compute observed test statistic
  obs_p <- statistic(tab)
  is_extreme <- my_grid$statistic <= obs_p

  # Maximise p-value over range
  pis <- seq(pi_range[1], pi_range[2], by = stepsize)
  max_p <- 0
  for (current_pi in pis) {
    current_p <- weighted.mean(x = is_extreme,
                               w = dbinom(my_grid$n1, row_sums[1], current_pi) *
                                 dbinom(my_grid$n2, row_sums[2], current_pi))
    if (current_p > max_p) max_p <- current_p
  }
  max_p
}
It works like so:
boschloo_test(tab = tab, alternative = "greater")
[1] 0.04954898
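To see what the maximisation inside boschloo_test() is doing, it can help to look at the p-value as a function of the nuisance parameter rather than only at its maximum. The following sketch is self-contained and uses names of my own (fisher_greater, p_by_pi, pi_at_max); it redoes the step-by-step computation from above for every candidate value between 0 and 1 in steps of 0.01.

```r
tab <- rbind(c(10, 2), c(6, 6))
n <- rowSums(tab)

# One-sided Fisher p-value for a table with a and b in the first column
fisher_greater <- function(a, b) {
  fisher.test(rbind(c(a, n[1] - a), c(b, n[2] - b)), alternative = "greater")$p.value
}

# Test statistic for all 13 * 13 = 169 possible tables
grid <- expand.grid(n11 = 0:n[1], n21 = 0:n[2])
grid$stat <- mapply(fisher_greater, grid$n11, grid$n21)
is_extreme <- grid$stat <= fisher_greater(tab[1, 1], tab[2, 1])

# p-value for each candidate value of the nuisance parameter
pis <- seq(0, 1, by = 0.01)
p_by_pi <- sapply(pis, function(p) {
  sum(is_extreme * dbinom(grid$n11, n[1], p) * dbinom(grid$n21, n[2], p))
})

pi_at_max <- pis[which.max(p_by_pi)]
c(pi_at_max = pi_at_max, max_p = max(p_by_pi))

# plot(pis, p_by_pi, type = "l") draws the whole profile
```

The maximum of p_by_pi is what boschloo_test() returns, and the entry for 0.43 is the value obtained in the walkthrough. Narrowing pi_range in boschloo_test() corresponds to maximising over only part of this curve.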
Alternatively, you can use the boschloo() function in the exact2x2 package. See ?boschloo for details on the parameters. Here I specify the number of grid points (nPgrid) in order to make the results agree exactly with those produced by boschloo_test():
exact2x2::boschloo(6, 12, 10, 12, alternative = "greater",
                   control = exact2x2::ucControl(nPgrid = 101))$p.value
[1] 0.04954898
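The claim that the conditional test is more conservative than the unconditional one can be checked numerically for this design. The sketch below is only an illustration under assumptions of my own: a significance level of 0.05, the same grid of nuisance parameter values as above, and made-up variable names. It computes, for each of the 169 possible tables, both the one-sided Fisher p-value and the Boschloo p-value, and then works out how often each test would reject a true null hypothesis.

```r
n1 <- 12
n2 <- 12
alpha <- 0.05

# One-sided Fisher p-value for each possible table
grid <- expand.grid(n11 = 0:n1, n21 = 0:n2)
grid$p_fisher <- mapply(function(a, b) {
  fisher.test(rbind(c(a, n1 - a), c(b, n2 - b)), alternative = "greater")$p.value
}, grid$n11, grid$n21)

# prob[i, j]: probability of table i if the common success probability is pis[j]
pis <- seq(0, 1, by = 0.01)
prob <- sapply(pis, function(p) dbinom(grid$n11, n1, p) * dbinom(grid$n21, n2, p))

# Boschloo p-value for each table: maximum over pis of the probability
# of obtaining a Fisher p-value at least as small as this table's
grid$p_boschloo <- sapply(grid$p_fisher, function(s) {
  max(colSums(prob[grid$p_fisher <= s, , drop = FALSE]))
})

# Rejection probability under the null hypothesis, per nuisance parameter value
size_fisher <- colSums(prob[grid$p_fisher <= alpha, , drop = FALSE])
size_boschloo <- colSums(prob[grid$p_boschloo <= alpha, , drop = FALSE])

c(fisher = max(size_fisher), boschloo = max(size_boschloo))
```

Neither maximum should exceed 0.05 on this grid, but the one for Boschloo’s test should sit closer to it: every table that Fisher’s test rejects is also rejected by Boschloo’s test, and possibly a few more besides.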
Contingency tables with only the total sum fixed
A third possibility is that neither the row sums (
Observational studies
Example 4 (Boschloo’s test). During a hike through the Fribourg Prealpes near Schwarzsee, we conduct a linguistic field experiment. Any time we encounter a hiking party chatting in French, we greet them in German; any time we encounter a hiking party chatting in German, we greet them in French. Afterwards, we jot down for each party whether the first person greeting us back did so in the same language in which they were addressed or in a different language. (Parties not chatting in French or German are ignored in this field experiment.) We planned to continue the field experiment until we had encountered the 20th French- or German-speaking hiking party. Here’s the resulting contingency table:
Note that only the total number of observations (
The null hypothesis is that whether the first greeter in the party responded in the same language or in a different language is independent of the language in which the party was chatting. More formally, let
More compactly, our null hypothesis is that the tuple
In terms of the analysis, we could condition on both marginals or on one of them and run Fisher’s or Boschloo’s test, respectively:
tab <- rbind(c(2, 6), c(8, 4))
# Condition on both marginals
fisher.test(tab)$p.value
[1] 0.1698023
# Condition on row marginals only
boschloo_test(tab)
[1] 0.08332351
# Condition on column marginals only
boschloo_test(t(tab))
[1] 0.08940788
The resulting
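One possible solution is to generalise Boschloo’s test to two nuisance parameters. That is, rather than maximising the p-value over the candidate values of a single nuisance parameter, we maximise it over all combinations of candidate values for two nuisance parameters. The unconditional_test() function defined below carries out this procedure: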
unconditional_test <- function(
    tab, alternative = "two.sided",
    pi_range_row = c(0, 1),
    pi_range_col = c(0, 1),
    stepsize = 0.01)
{
  # This test assumes a fixed total sum.
  # Nuisance parameter values in the rectangle pi_range_row * pi_range_col are tried out.
  # stepsize governs granularity of search through nuisance parameter value candidates.
  if (!all(dim(tab) == c(2, 2))) stop("tab needs to be a 2*2 contingency table.")
  if (alternative == "two.sided") {
    # Truncate two-sided p-value at 1.
    # The nuisance parameter ranges are passed on to the one-sided tests.
    return(min(
      2 * min(unconditional_test(tab, alternative = "less",
                                 pi_range_row = pi_range_row, pi_range_col = pi_range_col,
                                 stepsize = stepsize),
              unconditional_test(tab, alternative = "greater",
                                 pi_range_row = pi_range_row, pi_range_col = pi_range_col,
                                 stepsize = stepsize)),
      1))
  }

  # Use Fisher's exact test p-value as test statistic
  statistic <- function(x) fisher.test(x, alternative = alternative)$p.value

  # Helper function for multinomial weights
  weights <- function(pi_row, pi_col, n11, n12, n21, n22) {
    total_sum <- n11 + n12 + n21 + n22
    (pi_row*pi_col)^n11 * (pi_row * (1 - pi_col))^n12 *
      ((1 - pi_row)*pi_col)^n21 * ((1 - pi_row)*(1 - pi_col))^n22 *
      factorial(total_sum)/(factorial(n11) * factorial(n12) * factorial(n21) * factorial(n22))
  }

  # Construct grid with possible results
  total_sum <- sum(tab)
  my_grid <- expand.grid(n11 = 0:total_sum,
                         n12 = 0:total_sum,
                         n21 = 0:total_sum)
  my_grid$n22 <- total_sum - my_grid$n11 - my_grid$n12 - my_grid$n21
  my_grid <- subset(my_grid, n22 >= 0)
  my_grid$statistic <- NA
  for (i in 1:nrow(my_grid)) {
    my_tab <- rbind(c(my_grid$n11[i], my_grid$n12[i]),
                    c(my_grid$n21[i], my_grid$n22[i]))
    my_grid$statistic[i] <- statistic(my_tab)
  }
  my_grid$statistic[is.na(my_grid$statistic)] <- 1

  # Compute observed test statistic
  obs_p <- statistic(tab)
  is_lower <- my_grid$statistic <= obs_p

  # Maximise p value over grid
  pis <- expand.grid(
    pi_row = seq(pi_range_row[1], pi_range_row[2], by = stepsize),
    pi_col = seq(pi_range_col[1], pi_range_col[2], by = stepsize)
  )
  max_p <- 0
  for (i in 1:nrow(pis)) {
    w <- weights(pis$pi_row[i], pis$pi_col[i],
                 my_grid$n11, my_grid$n12, my_grid$n21, my_grid$n22)
    current_p <- weighted.mean(is_lower, w = w)
    if (current_p > max_p) max_p <- current_p
  }
  max_p
}
While it takes noticeably longer to run this test, it should be a bit more powerful than the tests that condition on one or both marginals:
unconditional_test(tab)
[1] 0.07722463
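Two details of unconditional_test() may be worth spelling out. First, the weights() helper is just the multinomial probability mass function with cell probabilities built from a row probability and a column probability under independence. Second, the longer running time follows from the size of the search. The sketch below illustrates both points; the nuisance parameter values 0.4 and 0.5 are hypothetical and chosen purely for illustration, and the variable names are mine.

```r
tab <- rbind(c(2, 6), c(8, 4))
counts <- c(tab[1, 1], tab[1, 2], tab[2, 1], tab[2, 2])

# Hypothetical nuisance parameter values, for illustration only
pi_row <- 0.4
pi_col <- 0.5

# Cell probabilities under independence of rows and columns
cell_probs <- c(pi_row * pi_col, pi_row * (1 - pi_col),
                (1 - pi_row) * pi_col, (1 - pi_row) * (1 - pi_col))

# Multinomial probability computed by hand and with dmultinom()
by_hand <- prod(cell_probs^counts) * factorial(sum(counts)) / prod(factorial(counts))
by_dmultinom <- dmultinom(counts, prob = cell_probs)
same_weight <- isTRUE(all.equal(by_hand, by_dmultinom))

# Size of the search: number of 2 * 2 tables with the observed total,
# and number of nuisance parameter combinations at the default stepsize
n_tables <- choose(sum(tab) + 3, 3)
n_nuisance <- length(seq(0, 1, by = 0.01))^2

c(same_weight = same_weight, tables = n_tables, nuisance_combinations = n_nuisance)
```

By comparison, boschloo_test() only has to consider 9 * 13 = 117 tables and 101 nuisance parameter values for the same data, which is why it returns almost immediately.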
The exact.test() function in the Exact package also implements this procedure. For one-sided tests (i.e., alternative = "less" or "greater"), it produces the same p-value as unconditional_test(), but it computes the two-sided p-value differently:
Exact::exact.test(tab, model = "multinomial",
                  npNumber = 101, ref.pvalue = FALSE, method = "boschloo")$p.value
[1] 0.1139588
Experiments with simple randomisation
Example 5 (Boschloo’s test). We conduct the same experiment as in Example 3, but with one change: We assign the participants to the conditions using simple randomisation rather than using complete randomisation. Hence, we’re not guaranteed to have exactly 12 participants in each condition, and the row marginals aren’t fixed in advance. For the sake of comparison, let’s assume we obtain the same results as in Example 3:
We could again condition on the row marginals:
tab <- rbind(c(10, 2),
             c(6, 6))
boschloo_test(tab, alternative = "greater") # condition on row marginals
[1] 0.04954898
Alternatively, we could run an unconditional test. Note, however, that while
unconditional_test(tab, alternative = "greater", pi_range_row = c(0.5, 0.5))
[1] 0.04392337
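The difference between the two randomisation schemes is easy to see in a small simulation. The sketch below assumes, as the pi_range_row = c(0.5, 0.5) argument suggests, that simple randomisation assigns each participant to either condition with probability 0.5; the seed and the variable names are arbitrary choices of mine.

```r
set.seed(42)
n <- 24

# Complete randomisation: shuffle a list with exactly 12 slots per condition
complete <- sample(rep(c("new", "old"), each = n / 2))

# Simple randomisation: an independent fair coin flip per participant
simple <- sample(c("new", "old"), size = n, replace = TRUE)

table(complete)
table(simple)

# Probability that simple randomisation happens to yield a 12/12 split
p_balanced <- dbinom(n / 2, n, 0.5)
p_balanced
```

Under complete randomisation the group sizes are 12 and 12 by construction. Under simple randomisation they are themselves random, so an even split like the one assumed in this example is only one of many possible outcomes.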
Contingency tables with nothing fixed
Contingency tables where not even
References
Boschloo, R. D. 1970. Raised conditional level of significance for the 2 × 2 table when testing the equality of two probabilities. Statistica Neerlandica 24. 1-35.
Fagerland, Morten W., Stian Lydersen & Petter Laake. 2017. Statistical analysis of contingency tables. Boca Raton, FL: Chapman and Hall/CRC.
Lydersen, Stian, Morten W. Fagerland & Petter Laake. 2009. Recommended tests for association in 2 × 2 tables. Statistics in Medicine 28. 1159-1175.