When designing a study, you sometimes have a pool of candidate stimuli
(words, sentences, texts, images etc.) that is too large to present to
each participant in its entirety. If you want data for all or at least
most stimuli, a possible solution is to split up the pool of stimuli
into sets of manageable size and assign each participant to one of the
different sets. Ideally, you’d want the different sets to be as comparable
as possible with respect to a number of relevant characteristics so that
each participant is exposed to about the same diversity of stimuli during
the task. For instance, when presenting individual words to participants,
you may want each participant to be confronted with a similar distribution
of words in terms of their frequency and length, and with a similar number of adverbs.
In this blog post I share some R code that I used to split up two
types of stimuli into sets that are comparable with respect to one or several
variables, in the hope that you can easily adapt it for your own needs.
Fair warning: There’s lots of R code below, but it’s mainly a matter
of copy-pasting the same lines and changing one or two things.
If the R code is displayed poorly on your screen, you can copy-paste it into a text editor or RStudio.
Creating sets that are identical with respect to one categorical variable
I have a list of 144 sentences that I want to use as stimuli in an experiment.
I categorised these sentences into five different-sized plausibility categories myself.
The categories themselves don’t matter here, but before running the experiment
itself I want to make sure that my categorisation agrees with people’s
intuitions about the plausibility of these sentences. To that end, I will
run a validation study in which people rate the sentences’ plausibility. The way the
validation study is designed, it would take about half an hour to rate the
plausibility of all 144 sentences. This isn’t that long, but when I can’t pay
the participants, I prefer to keep the task’s duration at 10 minutes or so.
To that end, I split up the pool of 144 sentences into 3 sets of 48 each.
But rather than constructing these sets completely at random,
I ensured that the plausibility categories (as I had defined them)
were distributed identically in each set,
so that each participant would be confronted with roughly the same
variation in terms of the sentences’ plausibility. This is just a matter
of randomly splitting up each of the categories into 3 sets, but the code
below automates this.
First, you need a dataframe that lists the stimuli and their categories.
Below, I just create this dataframe in R, but you can also read in a
spreadsheet with the candidate stimuli.
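For instance, you could create it like this. The category labels and their counts below are made up for illustration; all that matters is that you end up with one row per stimulus and a Category column:

```r
# Made-up stimulus list: five plausibility categories whose counts
# (42 + 36 + 30 + 21 + 15 = 144) are all divisible by 3
stimuli <- data.frame(
  Sentence = paste("sentence", 1:144),
  Category = rep(c("P", "IP", "IR", "INR", "F"),
                 times = c(42, 36, 30, 21, 15))
)
table(stimuli$Category)
```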
In this case, the numbers of stimuli per category are all divisible
by the number of sets (3), but the code below will also work when this isn’t
the case (e.g., if the number of INR items were 16 and the number of IR items 38).
In such cases, the algorithm will create comparable equal-sized sets and a handful
of items will remain unassigned. You can try this yourself by changing the number
of stimuli per category above, or by setting n_sets below to, say, 5.
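One way to implement this in base R is with a small helper function (here called assign_sets) that shuffles the stimuli within each category and then deals them out over the sets; stimuli that don’t fit into equal-sized sets are left unassigned:

```r
n_sets <- 3

# Shuffle the rows within one category and deal them out over n_sets sets
assign_sets <- function(d, n_sets) {
  d <- d[sample(nrow(d)), ]             # random order
  n_per_set <- floor(nrow(d) / n_sets)  # items per set from this category
  d$Set <- c(rep(1:n_sets, each = n_per_set),
             rep(NA, nrow(d) - n_per_set * n_sets))  # leftovers unassigned
  d
}

# Apply the helper to each category separately, then recombine
stimuli <- do.call(rbind,
                   lapply(split(stimuli, stimuli$Category),
                          assign_sets, n_sets = n_sets))

# Check: each set should show the same Category distribution
table(stimuli$Category, stimuli$Set, useNA = "ifany")
```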
The column Set now specifies which set each item was assigned to,
and all sets have the same distribution of Category.
If the numbers of stimuli per category aren’t all divisible by the
number of sets, some stimuli will remain unassigned. If you want
to have data on all stimuli, you could randomly assign these remaining
stimuli to the sets. The sets won’t be identically distributed with
regard to the categories, but depending on what you want, this may be acceptable.
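One quick way to assign the remaining stimuli at random:

```r
# Spread any unassigned stimuli over the sets as evenly as possible
leftover <- is.na(stimuli$Set)
stimuli$Set[leftover] <- sample(rep(1:n_sets, length.out = sum(leftover)))
```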
Creating similar sets with respect to several categorical variables
For another study, we had 1,065 short texts for which we wanted to
collect human lexical richness ratings. Some of the texts were
narrative, others argumentative; some were written by bilingual children
and others by children in a monolingual control group; and some
were written when the children were 8, 9 or 10 years old.
You can’t expect volunteers to rate 1,065 texts, so we decided
to split them up into 20 sets of 53 texts each.
(In reality, each set contained only 50 texts as we excluded a bunch of
very short texts.) The sets were constructed so as to be
maximally similar in terms of narrative vs. argumentative texts,
texts written by bi- vs. monolingual children,
and texts written by 8-, 9- and 10-year-olds.
Below is commented R code along the lines of what we used to construct them.
First read in a dataframe containing the TextIDs
and info regarding bi- vs. monolingual (ControlGroup),
TextType and the children’s age (Time: 1, 2, 3).
The dataframe for this example is available from http://homeweb.unifr.ch/VanhoveJ/Pub/Data/french_texts.csv.
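You can read it in directly from the URL:

```r
texts <- read.csv("http://homeweb.unifr.ch/VanhoveJ/Pub/Data/french_texts.csv")
str(texts)
```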
Here’s how many texts there are for each combination of the three variables:
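```r
# Cross-tabulate the three structural variables
ftable(xtabs(~ ControlGroup + TextType + Time, data = texts))
```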
There are 2×2×3 = 12 combinations of these three variables.
Create a new variable that specifies which of these combinations
each text belongs to:
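```r
# Values look like "FALSEarg1": ControlGroup, TextType and Time pasted together
texts$Combined <- paste0(texts$ControlGroup, texts$TextType, texts$Time)
```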
To divide up these texts evenly into 20 sets, we’ll need at least the
following numbers of texts per combination per batch:
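```r
# Integer part of each combination's count divided by the number of sets (20)
floor(table(texts$Combined) / 20)
```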
That is, we’ll need 5 FALSEarg1 texts, 3 TRUEarg2 texts, etc. per set.
These numbers add up to 48 in this case, which means that 20×48 = 960 of the 1,065
texts can already be assigned by applying the code from the previous example.
Below I only changed the number of sets (n_sets is now 20), the name of the dataframe
(stimuli became texts), and the name of the column containing the categories (Category became Combined).
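```r
n_sets <- 20

texts <- do.call(rbind,
                 lapply(split(texts, texts$Combined),
                        assign_sets, n_sets = n_sets))

# Each set now contains 48 texts; the rest are still unassigned (NA)
table(texts$Set, useNA = "ifany")
```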
Now each set contains 48 texts and all sets are identically distributed with
regard to each combination of the three structural variables. This leaves
105 texts unassigned. For this project, it wasn’t so important that the
distributions of the structural variables TextType, ControlGroup and
Time were exactly identical in each set, just close enough.
In order to assign the remaining texts to sets, we loosened up the constraints
slightly: to each set, we added as many texts as possible while ensuring
that the joint distribution of ControlGroup and Time would be identical
across the sets; we didn’t insist on equality in terms of TextType any more.
We can again copy-paste and slightly adapt the code above.
I’ve commented all changes:
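```r
# CHANGED: only deal out the texts that haven't been assigned yet
assigned   <- texts[!is.na(texts$Set), ]
unassigned <- texts[is.na(texts$Set), ]

# CHANGED: the combinations now ignore TextType
unassigned$Combined <- paste0(unassigned$ControlGroup, unassigned$Time)

unassigned <- do.call(rbind,
                      lapply(split(unassigned, unassigned$Combined),
                             assign_sets, n_sets = n_sets))

# Recombine the previously and newly assigned texts
texts <- rbind(assigned, unassigned)
table(texts$Set, useNA = "ifany")
```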
With 20×2 = 40 additional texts assigned, all sets now contain 50 stimuli
and 65 texts remain unassigned. To assign as many of the remaining texts
as possible, we can again loosen the constraints and only insist that all
sets be identical in terms of their ControlGroup distribution (for instance).
Apart from the line defining Combined, the code is the same as above:
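```r
assigned   <- texts[!is.na(texts$Set), ]
unassigned <- texts[is.na(texts$Set), ]

# CHANGED: group only by ControlGroup
unassigned$Combined <- paste0(unassigned$ControlGroup)

unassigned <- do.call(rbind,
                      lapply(split(unassigned, unassigned$Combined),
                             assign_sets, n_sets = n_sets))

texts <- rbind(assigned, unassigned)
table(texts$Set, useNA = "ifany")
```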
Now each set contains 52 texts. The distribution of ControlGroup
is the same in each set; the distributions of TextType and Time
are similar but not identical across sets.
If you want, you can randomly assign 20 of the remaining 25 texts to a set;
5 texts will remain unassigned if we insist on each set having the same
number of texts:
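```r
assigned   <- texts[!is.na(texts$Set), ]
unassigned <- texts[is.na(texts$Set), ]

# Pick 20 of the remaining texts at random and add one to each set
picked <- sample(nrow(unassigned), n_sets)
unassigned$Set[picked] <- sample(1:n_sets)

texts <- rbind(assigned, unassigned)
```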
This way, each set contains 53 texts:
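```r
table(texts$Set, useNA = "ifany")
```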
While the sets aren’t identical with respect to the structural variables
ControlGroup, TextType and Time, they are highly similar:
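```r
# Compare the distribution of each structural variable across the sets
xtabs(~ Set + ControlGroup, data = texts)
xtabs(~ Set + TextType, data = texts)
xtabs(~ Set + Time, data = texts)
```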
Dealing with numerical variables
I haven’t dealt with numerical variables yet
(e.g., constructing word sets that are similar in terms
of their corpus frequency), but I think a reasonable way
of going about it would be to bin the numerical variable
(e.g., convert it to a variable with 10 categories) and
apply the functions above. If too many stimuli remain
unallocated, the remaining stimuli could be allocated
using a wider bin size.
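A minimal sketch, assuming the stimulus list has a numerical Frequency column (hypothetical here):

```r
# Bin the hypothetical Frequency variable into 10 categories and use
# FreqBin (or its combination with other variables) as the Category above
stimuli$FreqBin <- cut(stimuli$Frequency, breaks = 10)
```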
This was probably a bit messy with all the R code, but
I hope some of you found it useful!