Classifying second-language learners as native- or non-nativelike: Don't neglect classification error rates
I’d promised to write another installment on drawing graphs, but instead I’m going to write about something that I had to exclude, for reasons of space, from a recently published book chapter on age effects in second language (L2) acquisition: classifying observations (e.g., L2 learners) and estimating error rates.
I’m going to illustrate the usefulness of classification algorithms for addressing some problems in L2 acquisition research, but my broader aim is to show that there’s more to statistics than running significance tests and to encourage you to explore—even if superficially—what else is out there.
Background: classifying L2 learners as native- or non-nativelike
In the field of second language acquisition, there are a couple of theories that predict that L2 learners who begin learning the L2 after a certain age will never be ‘native-like’ in the L2. The ‘certain age’ differs between studies, and what the prediction boils down to in some versions is that no L2 learner will ever be fully ‘native-like’ in the L2.
I, for one, don’t think that ‘nativelikeness’ is a useful scientific construct, but that doesn’t matter for this post: Some researchers obviously do consider it useful, and for them the question is how they can test their prediction.
Researchers interested in nativelikeness usually administer a battery of linguistic tasks to a sample of L2 learners as well as to a ‘control’ sample of L1 speakers. On the basis of the L1 speakers’ results, they then define a nativelikeness criterion—an interval that is considered typical of L1 speakers’ performance. Common intervals are (a) the L1 speakers’ mean ± two standard deviations or (b) the range of the L1 speakers’ results. L2 speakers whose results fall outside this interval are considered non-nativelike, and the goal of the study is often to establish from which age of L2 acquisition onwards no nativelike L2 speakers can be found.
The problem: Misclassifications
The procedure I’ve just sketched is pretty common but it’s fundamentally flawed. One problem with it is that it may misclassify non-nativelike speakers as nativelike. I think most researchers are aware of this problem, as they sometimes seem to imply that fewer L2 learners would’ve qualified as nativelike if only more, and more reliable, data were available. This may well be true. But the other side of the coin is rarely considered: not all L1 speakers may pass the nativelikeness criterion either!
To my knowledge, no paper in L2 acquisition provides an error-rate estimate, i.e., a quantitative appraisal of how well the nativelikeness criterion would distinguish between L2 and L1 speakers other than those used for defining the criterion. Nonetheless, I think this is precisely what is needed if we are to sensibly interpret such studies. Let me illustrate.
Abrahamsson and Hyltenstam subjected 41 advanced Spanish-speaking learners of L2 Swedish as well as 15 native speakers of Swedish to a battery of linguistic tasks. From these tasks, 14 variables were extracted; the details don’t matter much here, but you can look them up in the paper (see Table 6 on page 280). Abrahamsson and Hyltenstam defined the minimum criterion of nativelikeness as the lowest native-speaker result on each measure, but I’m going to define it as the range of native-speaker results (i.e., between lowest and highest; it doesn’t really matter much).
The original raw data aren’t available, but I’ve simulated some placeholder data to illustrate my point.
(For the 15 native speakers, I simulated 14 variables from normal distributions with the same mean as in Abrahamsson & Hyltenstam’s Table 6; the standard deviation was estimated by taking the range and dividing it by 4. For the 41 non-native speakers, I simulated the same 14 variables but with generally lower means and larger standard deviations. None of the variables were systematically correlated. This simulation obviously represents a huge simplification; life would be easier if people put their data online.)
Using these simulated data, we can compute the range of the native-speaker results. Don’t be intimidated by the R code: the comments say what it accomplishes, which is really all you need to know.
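A minimal sketch of the simulation and the range computation could look as follows. The means and standard deviations are hypothetical placeholders, not the values from Abrahamsson & Hyltenstam’s Table 6, and the helper name simulate_speakers is made up for this illustration, so the numbers it produces won’t match the ones reported in this post.

```r
set.seed(2015)

# Simulate n speakers on uncorrelated, normally distributed variables
simulate_speakers <- function(n, means, sds) {
  k <- length(means)
  scores <- rnorm(n * k, mean = rep(means, each = n), sd = rep(sds, each = n))
  as.data.frame(matrix(scores, nrow = n,
                       dimnames = list(NULL, paste0("task", seq_len(k)))))
}

# Hypothetical parameters (NOT taken from the original study)
l1_means <- seq(50, 76, length.out = 14)  # assumed L1 means
l1_sds   <- rep(5, 14)                    # assumed L1 SDs
l2_means <- l1_means - 5                  # assumed: lower L2 means
l2_sds   <- l1_sds * 1.5                  # assumed: larger L2 SDs

l1 <- simulate_speakers(15, l1_means, l1_sds)  # 15 native speakers
l2 <- simulate_speakers(41, l2_means, l2_sds)  # 41 L2 learners

# Nativelikeness criterion: lowest and highest L1 score on each variable
l1_range <- sapply(l1, range)
rownames(l1_range) <- c("min", "max")
round(l1_range, 1)
```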
We can then take a look at the L2 speakers’ results and filter out the L2 speakers whose results aren’t all within the native speakers’ range:
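The filtering step can be sketched like this. It re-creates simulated data with hypothetical placeholder parameters so that it runs on its own; passes_criterion is an illustrative helper name, not a function from any package. The same helper also checks the native speakers against their own range.

```r
set.seed(2015)

# Simulate n speakers on uncorrelated, normally distributed variables
simulate_speakers <- function(n, means, sds) {
  k <- length(means)
  scores <- rnorm(n * k, mean = rep(means, each = n), sd = rep(sds, each = n))
  as.data.frame(matrix(scores, nrow = n,
                       dimnames = list(NULL, paste0("task", seq_len(k)))))
}

# Hypothetical parameters (NOT taken from the original study)
l1_means <- seq(50, 76, length.out = 14)
l1_sds   <- rep(5, 14)
l1 <- simulate_speakers(15, l1_means, l1_sds)
l2 <- simulate_speakers(41, l1_means - 5, l1_sds * 1.5)

# TRUE for each speaker whose scores on ALL variables lie within
# the range of the reference (L1) sample
passes_criterion <- function(scores, reference) {
  scores <- as.matrix(scores)
  above_min <- sweep(scores, 2, sapply(reference, min), ">=")
  below_max <- sweep(scores, 2, sapply(reference, max), "<=")
  rowSums(above_min & below_max) == ncol(scores)
}

l2_pass <- passes_criterion(l2, l1)
sum(l2_pass)   # number of L2 learners classified as nativelike

l1_pass <- passes_criterion(l1, l1)
all(l1_pass)   # TRUE by construction: the range was defined on these speakers
```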
Sure enough, none of the L2 learners classify as nativelike. With more realistic data, a handful probably would have, cf. Abrahamsson & Hyltenstam’s results.
By contrast, and quite obviously, all fifteen native speakers are classified as nativelike:
This comes as no surprise: the nativelikeness criterion was based on these speakers’ scores, so of course they should pass it with flying colours.
But what happens when we test a new sample of native speakers using the old nativelikeness criterion? I simulated data for another 10,000 native speakers using the same procedure I used to create the first 15 native speakers’ data.
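As a sketch, the check on new native speakers amounts to drawing a fresh sample from the same (simulated) population and applying the old range to it. The parameters are again hypothetical placeholders, so the pass rate this code prints will differ from the figure reported in this post.

```r
set.seed(2015)

# Simulate n speakers on uncorrelated, normally distributed variables
simulate_speakers <- function(n, means, sds) {
  k <- length(means)
  scores <- rnorm(n * k, mean = rep(means, each = n), sd = rep(sds, each = n))
  as.data.frame(matrix(scores, nrow = n,
                       dimnames = list(NULL, paste0("task", seq_len(k)))))
}

# TRUE for each speaker whose scores on ALL variables lie within
# the range of the reference sample
passes_criterion <- function(scores, reference) {
  scores <- as.matrix(scores)
  above_min <- sweep(scores, 2, sapply(reference, min), ">=")
  below_max <- sweep(scores, 2, sapply(reference, max), "<=")
  rowSums(above_min & below_max) == ncol(scores)
}

# Hypothetical parameters (NOT taken from the original study)
l1_means <- seq(50, 76, length.out = 14)
l1_sds   <- rep(5, 14)

l1     <- simulate_speakers(15, l1_means, l1_sds)     # original controls
new_l1 <- simulate_speakers(10000, l1_means, l1_sds)  # same population

new_pass <- passes_criterion(new_l1, l1)
sum(new_pass)    # number of new native speakers who pass
mean(new_pass)   # proportion who pass
```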
Only 1,048 of the 10,000 new native speakers pass the nativelikeness criterion! And these 10,000 new native speakers were sampled from the exact same population as the fifteen speakers used to establish the nativelikeness criterion. Factors that would matter in real life, such as social status, age, region, linguistic background, and whatnot, don’t matter here; they would only make matters worse (see this paper on the selection of native-speaker controls by Sible Andringa).
Clearly, the finding that none of the L2 speakers are classified as nativelike carries considerably less weight now that we know that most L1 speakers wouldn’t have been, either. Such information about the error rate associated with the nativelikeness criterion is therefore crucial to properly interpret studies relying on such a criterion. In practice, the bias against being classified as nativelike may not be as large as in this simulated example, but without an error-rate estimate (or access to the raw data), we’ve no way of knowing.
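The low pass rate isn’t a quirk of one simulation run. Here is a back-of-the-envelope calculation under the simulation’s own assumptions (continuous scores, uncorrelated variables, new speakers drawn from the same population as the n = 15 controls). A new speaker’s score is equally likely to occupy any of the n + 1 ranks among the control scores and itself, and it falls inside the control range unless it ranks lowest or highest:

$$P(\text{inside the range on one variable}) = \frac{n-1}{n+1} = \frac{14}{16} = 0.875$$

$$P(\text{inside the range on all 14 variables}) = 0.875^{14} \approx 0.15$$

This is the pass rate averaged over all possible control samples; any single control sample, such as the one used above, will give a somewhat higher or lower figure. The calculation also shows why adding more measures to the test battery makes the criterion harder to pass even for native speakers.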
Estimating error rates using classification algorithms
If researchers want to classify L2 learners as nativelike or non-nativelike and sensibly interpret their results, I suggest they stop defining nativelikeness criteria as intervals based on native speakers’ scores. Instead, they can turn to tools developed in a field specialised in such matters: machine learning, or predictive modelling. There’s an astounding number of algorithms out there that were developed for taking a set of predictor variables (e.g., task scores) on the one hand and a set of class labels (e.g., L1 speaker vs. L2 speaker) on the other hand, deriving a classification model from these data, and estimating the error rate of the classifications.
I won’t provide a detailed introduction—Kuhn & Johnson’s Applied Predictive Modeling seems excellent—but I’ll just illustrate one such classification algorithm, random forests. In fact, the precise workings of this algorithm, which was developed in 2001 by Leo Breiman, needn’t really concern us here—you can read about them from the horse’s mouth, so to speak, in my thesis, or in tutorials by Tagliamonte & Baayen or Strobl and colleagues. What’s important is that it often produces excellent classification models and that it computes an error-rate estimate as a matter of course.
The randomForest function in the randomForest package implements the algorithm. There are a couple of settings that the user can tweak; again, these needn’t concern us here, and you can read about them in the articles referred to above.
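A minimal sketch of such a call, assuming the randomForest package is installed. The data are re-simulated with hypothetical placeholder parameters, and ntree = 1000 is an arbitrary choice made for this illustration, so the error estimate it prints will differ from the one discussed in this post.

```r
library(randomForest)
set.seed(2015)

# Simulate n speakers on uncorrelated, normally distributed variables
simulate_speakers <- function(n, means, sds) {
  k <- length(means)
  scores <- rnorm(n * k, mean = rep(means, each = n), sd = rep(sds, each = n))
  as.data.frame(matrix(scores, nrow = n,
                       dimnames = list(NULL, paste0("task", seq_len(k)))))
}

# Hypothetical parameters (NOT taken from the original study)
l1_means <- seq(50, 76, length.out = 14)
l1_sds   <- rep(5, 14)
l1 <- simulate_speakers(15, l1_means, l1_sds)
l2 <- simulate_speakers(41, l1_means - 5, l1_sds * 1.5)

# Combine both groups and add the class labels
dat <- rbind(l1, l2)
dat$group <- factor(rep(c("L1", "L2"), times = c(15, 41)))

# Grow the forest: predict group membership from all 14 task scores
rf <- randomForest(group ~ ., data = dat, ntree = 1000)

rf                                 # prints OOB error estimate and confusion matrix
rf$err.rate[rf$ntree, "OOB"]       # OOB error estimate after the last tree
rf$confusion                       # rows: actual class; columns: predicted class
```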
The output shows the estimated classification error that was computed on the basis of the original (simulated) data with 15 L1 and 41 L2 speakers (OOB estimate of error rate): an estimated 12.5% of observations will be misclassified by this algorithm. With more data (more observations, more predictors, more reliable predictors), this estimated error rate may become more accurate.
More interesting for our present purposes is the confusion matrix: The algorithm wrongly classifies two out of 41 L2 speakers as L1 speakers—these could perhaps be considered to have passed an updated ‘nativelikeness criterion’ inasmuch as they ‘fooled’ the algorithm. But it also misclassifies 5 of the 15 L1 speakers as L2 speakers. In this case, then, the 5% ‘nativelikeness incidence’ among L2 speakers may be an underestimate, as the algorithm seems to be biased against classifying participants as L1 speakers. This is likely due to the imbalance in the data: there are about 3 times more L2 than L1 speakers, so the algorithm naturally defaults to L2 speakers. (Take-home message if you want to conduct a study on nativelikeness: include more native speakers.)
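To see how these counts hang together, the confusion matrix can be rebuilt by hand. The misclassification counts (2 and 5) are the ones reported above; the correct classifications (10 and 39) follow from the group sizes of 15 and 41.

```r
# Rows: actual group; columns: group predicted by the random forest
confusion <- matrix(c(10, 2,    # predicted L1: 10 actual L1, 2 actual L2
                      5, 39),   # predicted L2: 5 actual L1, 39 actual L2
                    nrow = 2,
                    dimnames = list(actual = c("L1", "L2"),
                                    predicted = c("L1", "L2")))

# Error rate per group: share of each actual group that was misclassified
class_error <- 1 - diag(confusion) / rowSums(confusion)
round(class_error, 3)   # L1: 5/15 = 0.333; L2: 2/41 = 0.049

# Overall error rate: (5 + 2) / 56 = 0.125, i.e., the 12.5% estimate
overall_error <- 1 - sum(diag(confusion)) / sum(confusion)
overall_error
```

The overall 12.5% thus hides a large asymmetry: one in three L1 speakers is misclassified, against roughly one in twenty L2 speakers.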
The same random forest can also be applied to the 10,000 new L1 speakers, which gives a better estimate of how much the odds are stacked against classifying a participant as an L1 speaker:
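In code, this boils down to a call to predict() with the new data. The sketch below is self-contained, so it refits a forest on re-simulated data with hypothetical placeholder parameters; the proportion it prints will differ from the figure reported in this post.

```r
library(randomForest)
set.seed(2015)

# Simulate n speakers on uncorrelated, normally distributed variables
simulate_speakers <- function(n, means, sds) {
  k <- length(means)
  scores <- rnorm(n * k, mean = rep(means, each = n), sd = rep(sds, each = n))
  as.data.frame(matrix(scores, nrow = n,
                       dimnames = list(NULL, paste0("task", seq_len(k)))))
}

# Hypothetical parameters (NOT taken from the original study)
l1_means <- seq(50, 76, length.out = 14)
l1_sds   <- rep(5, 14)

# Training data: 15 L1 speakers and 41 L2 learners
dat <- rbind(simulate_speakers(15, l1_means, l1_sds),
             simulate_speakers(41, l1_means - 5, l1_sds * 1.5))
dat$group <- factor(rep(c("L1", "L2"), times = c(15, 41)))
rf <- randomForest(group ~ ., data = dat, ntree = 1000)

# New L1 speakers from the same population as the original controls
new_l1 <- simulate_speakers(10000, l1_means, l1_sds)

# Classify the new speakers with the existing forest
pred <- predict(rf, newdata = new_l1)
table(pred)          # how many are classified as L1 vs. L2
mean(pred == "L1")   # proportion of new L1 speakers recognised as L1
```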
While the random forest doesn’t classify all L1 speakers in the original control sample as L1 speakers (as the naïve nativelikeness procedure did), it performs much better on new L1 data, classifying 78% of new L1 speakers as L1 speakers. Evidently, in a real study, we wouldn’t have a sample of 10,000 participants on the side to check the estimated classification error rate.
By using common definitions of nativelikeness criteria, L2 acquisition studies are likely to stack the odds against findings of nativelikeness and yield generally uninterpretable results.
Random forests and other classification algorithms will yield considerably better classifications than ad-hoc criteria, but they may be far from perfect. Their imperfection, unlike that of ad-hoc criteria, can be quantified, however, which is crucial for interpreting the results.
You’re unlikely to learn about such algorithms in an introductory course to statistics, but it’s useful to simply know that they exist. This is how you build up your statistical toolbox: when you know that these tools exist and have a vague sense of what they’re for, you can brush up on them when you need them. There’s a world beyond t-tests, ANOVA and Pearson’s r.