Silly significance tests: Tests unrelated to the genuine research questions
Quite a few significance tests in applied linguistics don’t seem to be relevant to the genuine aims of the study. Too many of these tests can turn the text into an opaque jumble of t-, F- and p-values, and I suggest we leave them out.
An example of the kind of test I have in mind is this. Let’s say a student wants to investigate the efficiency of a new teaching method. After going to great lengths to collect data in several schools, he compares the mean test scores of classes taught with the new method to those of classes taught with the old method. But in addition to this key comparison, he also runs separate tests to see whether the test scores of the girls differ from those of the boys or whether the pupils’ socio-economic status (SES) is linked to their test performance.
These additional tests aren’t needed to test the efficiency of the new teaching method (the genuine aim of the study), yet tests like these are commonly run and reported. They constitute the third type of ‘silly test’ that we can do without (the first two being superfluous balance tests and tautological tests). In what follows, I respond to three possible arguments in favour of these tests.
Argument 2: ‘The efficiency of the new teaching method might be different according to sex / socio-economic status / age etc.’
The second objection that I would anticipate is that the variable of interest (here: teaching method) might have a different effect depending on some additional variable such as the learners’ sex, SES, age etc. This argument is different from the first one: according to the first argument, boys are expected to perform differently from girls in the experimental condition (new method) and the control condition (old method) alike. According to the second argument, boys might profit more from the new teaching method than girls do (or vice versa). Such a finding could be an important nuance when comparing the new and the old method.
Running an additional t-test with sex as the independent variable doesn’t account for this nuance, however. Instead, it’s the interaction between teaching method and sex that’s of interest here.
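By way of illustration, here’s a minimal sketch of what that could look like in Python with statsmodels, assuming a hypothetical data file with one row per pupil and columns named score, method and sex:

```python
# Minimal sketch (hypothetical data and column names): model the
# method-by-sex interaction instead of running a separate t-test on sex.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # hypothetical file, one row per pupil

# 'C(method) * C(sex)' expands to the main effects of method and sex
# plus their interaction (method:sex).
model = smf.ols("score ~ C(method) * C(sex)", data=df).fit()
print(model.summary())

# The test for the interaction term addresses the question 'does the new
# method work differently for boys and girls?'; a separate t-test on sex
# alone does not.
```

The point isn’t the specific software, of course, but that the interaction lives in the same model as the effect of interest.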
In my experience, however, the rationale behind such interaction tests is rarely theoretically or empirically buttressed: the idea is merely that the effect of interest might somehow differ according to sex, SES etc. Clearly, then, the interactions aren’t the focus of the study. If you want to run such interaction tests at all, I therefore think it’s best to label them explicitly as exploratory and to demarcate them from the analyses that address the study’s main aim, for the reader’s sake. Any interesting patterns can then be followed up in a new study that explicitly targets these interactions.
Argument 3: ‘After painstakingly collecting this much data, running just one test seems a bit meagre.’
The third objection isn’t so much a rational argument as an emotional appeal – and one that I’m entirely sympathetic to: after travelling around the country to collect data, negotiating with school principals, sending out several reminders to obtain parental consent, trying to make sense of illegible handwriting etc., running a single straightforward t-test seems pretty underwhelming.
Saying that a straightforward analysis is the reward for a good design, and that the scientific value of a study isn’t a (positive) function of the number of significance tests it features, probably offers only scant consolation. There’s nothing wrong with conducting additional analyses on your painstakingly collected data, however, provided that these exploratory analyses are labelled as such and, ideally, clearly demarcated from the main analysis. That said, keeping track of tens of test results when reading a research paper is a challenge, which is why I think it pays to be selective when conducting and reporting exploratory analyses.
- First, exploratory analyses are ideally still theoretically guided and pave the way towards a follow-up study.
- Second, exploratory analyses should only compare what can sensibly be compared in the study. For instance, learners’ comprehension of active and passive sentences might sensibly be compared inasmuch as these form each other’s counterpart (especially if the active and passive sentences express the same proposition). But it’d be more difficult to justify a comparison between the comprehension of object questions and that of unrelated relative clauses.
- Third, before drawing sweeping conclusions from exploratory analyses, researchers should remind themselves that their chances of finding some significant results increase with each additional comparison, even if no real differences exist.
Lastly, I think there’s an argument to be made for reporting exploratory analyses descriptively only, e.g. using graphs and descriptive statistics but without t-tests, ANOVAs, p-values and the like. I fear, though, that reviewers and editors would probably insist on some inferential measures.
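To make the point about the number of comparisons concrete, here’s a quick back-of-the-envelope sketch in Python: if you run k independent tests at the 5% level and no real differences exist, the chance of at least one spurious ‘significant’ result is 1 - 0.95^k.

```python
# Back-of-the-envelope calculation: the chance of at least one false
# positive grows with the number of independent comparisons, even when
# no real differences exist.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any = 1 - (1 - alpha) ** k  # P(at least one p < .05 among k null tests)
    print(f"{k:2d} comparison(s): {p_any:.2f}")

# Prints (approximately):
#  1 comparison(s): 0.05
#  5 comparison(s): 0.23
# 10 comparison(s): 0.40
# 20 comparison(s): 0.64
```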
Summary and conclusion
A thread running through this blog is my conviction that typical quantitative research papers in applied linguistics and related fields contain too many significance tests, which can make for a challenging read even for dyed-in-the-wool quantitative researchers. In addition to doing away with balance tests and obviously tautological tests, I suggest that we get rid of tests that don’t contribute to the study’s primary aim. To that end, I propose three guidelines:
- If you analyse variables that you aren’t genuinely interested in because they may nonetheless give rise to differences in the dependent variable, consider including them in the same analysis as the variables that you are interested in (see the sketch after this list).
- If you analyse such variables because they could reasonably interact with the effect you’re really interested in, it’s the interaction effect you want to take a look at.
- By all means, conduct exploratory analyses on rich datasets, but show some restraint in choosing which comparisons to run and in interpreting them.
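To illustrate the first guideline, here’s a minimal sketch, again in Python with statsmodels and again with hypothetical column names: sex and SES aren’t tested separately but are included as covariates in the same model as the teaching method.

```python
# Sketch of the first guideline (hypothetical data and column names):
# sex and SES enter the same model as the variable of interest instead
# of being tested separately.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # hypothetical file, one row per pupil

# Teaching method is the focus; sex and SES are covariates.
adjusted = smf.ols("score ~ C(method) + C(sex) + ses", data=df).fit()
print(adjusted.summary())

# For the second guideline, replace '+' with '*' to test the
# method-by-sex interaction, as in the earlier sketch.
```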
To wrap up, here’s a rule of thumb that could have some heuristic value: if a comparison isn’t worth the time and effort it takes to draw a decent-looking graph of it, it probably isn’t worth testing either.