The goal of this primer is to present the techniques for visualising research data that I have found to be most useful and to show you how you can use R, and specifically the ggplot2/tidyverse package, to draw informative plots. The focus is on drawing plots of raw data as opposed to visualising statistical models. If you want to learn more about visualising statistical models, you can check out my tutorial Visualising statistical uncertainty using model-based graphs. I also decided to leave out plot types that are not easy to interpret properly for audiences without a considerable amount of statistical know-how, such as kernel density estimates and quantile–quantile plots.
Type, don’t copy-paste – at home
You’re going to take away much more from this primer if you copy the code snippets by typing them rather than by copy-pasting them. That said, I strongly advise you to do so in your own time and not as you’re attending the lecture: Some things are bound to go wrong (perhaps a comma in the wrong spot, a missing bracket or an upper-case letter that should have been a lower-case one), and as you’re trying to fix the errors, you’ll lose track of the lecture.
So work through the code snippets at home, and be patient with yourselves.
2 Preliminaries
Refer to the section Working with R and RStudio in the datasets primer.
Load the here and tidyverse packages.
library(here)
here() starts at C:/Users/VanhoveJ/switchdrive/Documents/Workshops/DatasetsAndGraphs
library(tidyverse)
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
3 Histograms
The file Vanhove2015_Vpn.csv contains a couple of pieces of information about the participants in an experiment that I conducted several years ago (Vanhove 2016): ID, scores on German, English and French vocabulary tests (Wortschatz, Englisch and Französisch, respectively), sex (Geschlecht), and age (Alter). Make sure that this file is stored in the data subdirectory of the R project you created. Now read in the data as explained in the primer on working with datasets.
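If you’re following along, the read-in code might look as follows (a sketch; the object name d_hist is the one the later snippets use):

d_hist <- read_csv(here("data", "Vanhove2015_Vpn.csv"))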
Rows: 80 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Geschlecht
dbl (5): VPN, Wortschatz, Englisch, Französisch, Alter
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A histogram shows how a single numeric variable is distributed (‘univariate distribution’). The range of the observed values of this variable is split up into a number of intervals, typically all of equal width. Over each interval, a bar is drawn whose height reflects how often the variable takes on values in this interval. In the simplest version, the height of the bar corresponds to the number of times the variable falls in the interval, but we’ll encounter a slightly more complicated (but sometimes more useful) version shortly.
Histograms are particularly useful to check if there are any ‘univariate’ outliers (i.e., values that lie far from the bulk of the data if you consider only this one variable) and to see if the variable seems to be distributed approximately uniformly, normally, bimodally or more wonkily.
Since you’ve already loaded the tidyverse and since you’ve already read in the data, we can start to draw some histograms. To this end, we need to define the object and the variables we want to plot (lines 1–2). If we only executed the first two lines of the code below, all we’d see is a blank canvas. To draw the histogram itself, we need to add a layer to this blank canvas that contains a histogram based on these pieces of information.
ggplot(data = d_hist,        # specify data set
       aes(x = Englisch)) +  # variable to be drawn
  geom_histogram()           # draw as histogram
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Alternatively, you can use the pipe (|>) to pass the tibble to the ggplot() function. This is merely a matter of personal preference.
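For instance, the following sketch produces the same histogram as above:

d_hist |>
  ggplot(aes(x = Englisch)) +
  geom_histogram()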
Different layers of a ggplot() figure are strung together using +, whereas |> is used outside of the ggplot() code proper to relay an object to the next function.
The histogram shows, among other things, that there are 8 individuals with an English vocabulary score just below 0.7 and 4 with a score just below 0.6. By default, 30 such bins are drawn, but as the warning indicates (‘Pick better value’), there’s nothing special about this number. You can adjust the desired bin width:
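For instance, a bin width of 0.05 (the value the later snippets also use) can be set like this:

ggplot(data = d_hist,
       aes(x = Englisch)) +
  geom_histogram(binwidth = 0.05)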
Clearly, the shape of the histogram depends crucially on how the bins were chosen. There aren’t any hard and fast rules for determining the optimal bin width. The default settings (1st plot) produce a histogram that seems to be a bit too fine-grained, whereas a bin width of 0.2 (2nd plot) results in too coarse a histogram. The other three histograms seem fine by me; I’d probably pick the 4th or 5th for a presentation or when writing a paper.
Alternatively, you can specify the number of bins or you can define the bins yourself:
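A sketch of both options (the specific values — 15 bins, boundaries from 0.45 to 0.9 — are just examples):

# specify the number of bins:
ggplot(data = d_hist,
       aes(x = Englisch)) +
  geom_histogram(bins = 15)

# define the bin boundaries yourself:
ggplot(data = d_hist,
       aes(x = Englisch)) +
  geom_histogram(breaks = seq(0.45, 0.9, 0.05))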
The bins don’t need to all be of the same size, but they usually are.
Incidentally, the intervals are, by default, open to the left and closed to the right. That is, if we have bins (0.6, 0.7] and (0.7, 0.8], the value 0.7 is counted in the bin (0.6, 0.7] and not in (0.7, 0.8].
The histograms we’ve already drawn are probably sufficient for our own private use. But if we wanted to use these histograms in a presentation or in a publication, we ought to make some further changes.
First, use xlab() and ylab() to label axes. Don’t forget the quotation marks.
Next, add a title and, if needed, a subtitle and a caption. See ?labs for further options.
ggplot(data = d_hist,
       aes(x = Englisch)) +
  geom_histogram(binwidth = 0.05) +
  xlab("English vocabulary score") +
  ylab("Number of participants") +
  labs(title = "The participants' English vocabulary scores",
       subtitle = "(some subtitle if needed)",
       caption = "English vocabulary scores were obtained using the LexTALE test.")
I don’t really like the default colours, but we can easily override those:
ggplot(data = d_hist,
       aes(x = Englisch)) +
  geom_histogram(binwidth = 0.05,
                 fill = "lightgrey", colour = "black") +
  xlab("English vocabulary score") +
  ylab("Number of participants") +
  labs(title = "The participants' English vocabulary scores",
       subtitle = "(some subtitle if needed)",
       caption = "English vocabulary scores were obtained using the LexTALE test.")
You can also apply one of the ggplot2 themes, see the ggplot2 documentation. For instance,
ggplot(data = d_hist,
       aes(x = Englisch)) +
  geom_histogram(binwidth = 0.05,
                 fill = "lightgrey", colour = "black") +
  xlab("English vocabulary score") +
  ylab("Number of participants") +
  labs(title = "The participants' English vocabulary scores",
       subtitle = "(some subtitle if needed)",
       caption = "English vocabulary scores were obtained using the LexTALE test.") +
  theme_bw()
Saving figures
If the last figure was drawn using ggplot(), we can save it using ggsave(). The following command saves the figure as a PNG file, but other common formats such as PDF, BMP, SVG, JPEG and TIFF are supported as well. You can also specify the dimensions and other details, see ?ggsave.
ggsave(here("figs", "histogram_english.png"))
Saving 7 x 5 in image
Alternatively, we can save the figure by enclosing the code to draw it between a call to pdf(), png() or a couple of other options, and one to dev.off(), like so:
pdf(here("figs", "histogram_english.pdf"))ggplot(data = d_hist,aes(x = Englisch)) +geom_histogram(binwidth =0.05, fill ="lightgrey",colour ="black") +xlab("English vocabulary score") +ylab("Number of participants") +labs(title ="The participants' English vocabulary scores",subtitle ="(some subtitle if needed)",caption ="English vocabulary scores were obtained using the LexTALE test." ) +theme_bw()dev.off()
svg
2
See ?pdf for details. Related functions are png(), tiff(), svg() etc.
In order to facilitate comparisons of different histograms (e.g., histograms for different subsets of the data, or histograms of the same data but using different bin widths), what’s often plotted along the y-axis isn’t the number of data points in each interval, but a so-called density estimate. To this end, the y-axis is scaled in such a way that the total area covered by the histogram is equal to 1. Use y = after_stat(density) to plot density estimates rather than counts:
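For instance:

ggplot(data = d_hist,
       aes(x = Englisch,
           y = after_stat(density))) +
  geom_histogram(binwidth = 0.05,
                 fill = "lightgrey", colour = "black") +
  xlab("English vocabulary score") +
  ylab("density")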
We can verify that \[0.05 \times (0.5 + 1.5 + 5 + 4.5 + 3.5 + 2.75 + 1.5 + 0.75) = 1.\]
If we were to express the English vocabulary scores on a scale from 0 to 10 rather than from 0 to 1 and use bins of width 0.5 points rather than of width 0.05 points, the density estimates would be a tenth of what they are now. This is because, if we’re using density estimates, the area covered by the histogram needs to equal 1.
Exercise 1 (A histogram) Draw a suitable histogram for the Französisch variable as well as for the Wortschatz variable in the d_hist dataset. This will require you to tinker with the binwidth setting. Don’t forget to adjust the axis labels. Use ggsave() to save your preferred histogram for each of the two variables as an SVG file.
On probability densities
The reason why histograms sometimes represent so-called density estimates rather than, say, the proportion of observations that fall in each bin is this. The distribution of a continuous variable \(X\) can be described by its probability density. This is a non-negative function \(f\) with the property that \[\mathbb{P}(X \in [a,b]) = \int_{a}^b f(t) ~\textrm{d}t\] for all \(a \leq b\). Or in words, the probability that \(X\) takes on a value in the interval \([a,b]\) corresponds to the area underneath the function graph (= integral) between the values \(a\) and \(b\). Since \(\mathbb{P}(X \in (-\infty, \infty)) = 1\), the area underneath the whole graph should equal 1.
For instance, the distribution of IQ values can be modelled as a normal distribution with mean 100 and standard deviation 15. The plot below shows the probability density function of this distribution. To compute the probability that a randomly picked person has an IQ between 92 and 103, we can compute the size of the shaded area.
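In R, this area can be computed using the normal cumulative distribution function:

pnorm(103, mean = 100, sd = 15) - pnorm(92, mean = 100, sd = 15) # ≈ 0.28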
Histograms can be thought of as estimators of the probability density function of the distribution from which the observed values were sampled. Other types of plots that serve this purpose exist (kernel density estimation), but I don’t think these are too useful in an introduction such as this. Further note that only continuous distributions have probability densities. (By definition, \(X\) has a continuous distribution if \(\mathbb{P}(X = x) = 0\) for all real numbers \(x\). )
The shape of a histogram doesn’t depend on just the bin width
The code below draws two histograms for the Englisch data. The difference between them is that, for the first one, the bins used are \((0.45, 0.5], (0.5, 0.55], \dots, (0.85, 0.9]\), whereas for the second one, they are \((0.475, 0.525], (0.525, 0.575], \dots, (0.875, 0.925]\). In both cases, though, the width of the bins is 0.05 points. Still, the histograms look somewhat different.
Incidentally, plots drawn using ggplot() can be stored as objects (here h1 and h2). You can combine these plots into a single figure using the patchwork package.
h1 <- ggplot(data = d_hist,
             aes(x = Englisch,
                 y = after_stat(density))) +
  geom_histogram(breaks = seq(0.45, 0.9, 0.05),
                 fill = "lightgrey", colour = "black") +
  xlab("English vocabulary score") +
  ylab("density")

h2 <- ggplot(data = d_hist,
             aes(x = Englisch,
                 y = after_stat(density))) +
  geom_histogram(breaks = seq(0.45, 0.9, 0.05) + 0.025,
                 fill = "lightgrey", colour = "black") +
  xlab("English vocabulary score") +
  ylab("density")

library(patchwork)
h1 + h2 # this requires patchwork to work
4 Boxplots
The file VowelChoices_ij.csv contains a part of the results of a learning experiment (Vanhove 2016) in which 80 participants were randomly assigned to one of two learning conditions (LearningCondition: <oe> participants and <ij> participants). The column PropCorrect contains the proportion of correct answers for each participant, and the question we want to answer concerns the difference in the performance between learners in the different conditions. We’ll use these data to introduce boxplots.
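A sketch of the read-in code, assuming the object name d_box that the snippets below use:

d_box <- read_csv(here("data", "VowelChoices_ij.csv"))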
Rows: 80 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, LearningCondition
dbl (1): PropCorrect
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Below you see a basic boxplot of the vocabulary test scores of the eighty participants in the same learning experiment. The median score is highlighted by a thick line. The 25th and 75th percentiles are highlighted by thinner lines and together form a box. The difference between the 75th and 25th percentile (also known as the third and the first quartile) is called the inter-quartile range (IQR). The data points that don’t fall in the box, that is, the data points that don’t make up the middle 50% of the data, are represented by a line protruding from the upper part of the box and by a line protruding from the lower part of the box. However, data points whose distance to the box exceeds 1.5 times the inter-quartile range are plotted separately. If such data points exist, the lines protruding from the box only extend up to the lowest / highest data point whose distance to the box is lower than 1.5 times the IQR.
Quantiles, percentiles, median?
Let \(X = (X_1, \dots, X_n)\) be a data vector and let \(\alpha \in (0,1)\). We say that a value \(q_{\alpha}\) is an \(\alpha\) quantile (or a \(100\alpha\)th percentile) of \(X\) if at least \(\alpha \cdot n\) values in \(X\) are no greater than \(q_{\alpha}\) and at least \((1-\alpha) \cdot n\) values in \(X\) are no smaller than \(q_{\alpha}\). For instance, the boxplot below shows that 34 is a 0.75 quantile (or 75th percentile) of the vocabulary scores. Indeed, we can verify that at least 75% (in fact 78.75%) of the vocabulary scores are no greater than 34 and that at least 25% (in fact 38.75%) of them are no smaller than 34.
mean(dat$Wortschatz <=34)
[1] 0.7875
mean(dat$Wortschatz >=34)
[1] 0.3875
Note that it is not the case that exactly \(100\alpha\)% of the observations are no greater than any given \(\alpha\) quantile. Moreover, quantiles aren’t necessarily uniquely defined. For instance, 28.4 is a 0.075 quantile of the vocabulary scores. But so is 28.
mean(dat$Wortschatz <=28.4) # at least 0.075
[1] 0.075
mean(dat$Wortschatz >=28.4) # at least 0.925
[1] 0.925
mean(dat$Wortschatz <=28) # at least 0.075
[1] 0.075
mean(dat$Wortschatz >=28) # at least 0.925
[1] 0.9375
When we talk about the (as opposed to an) \(\alpha\) quantile, we mean the average of all the \(\alpha\) quantiles.
A median is merely a \(0.5\) quantile. The median is the average of all \(0.5\) quantiles. In practice, this boils down to the average of the smallest and the largest such quantile.
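A quick toy illustration:

x <- c(1, 2, 3, 4)
median(x) # 2.5: the average of the smallest (2) and largest (3) 0.5 quantiles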
(The discussion above assumes that we are interested in the empirical distribution of the values in the data vector \(X\) rather than in estimating the quantiles of the distribution from which \(X\) was sampled. For our present purposes, this is more than sufficient. But see ?quantile for details.)
From the boxplot below, we can glean that the IQR is \(34 - 31 = 3\). The highest data point has a value of 38. Since \(38 < 34 + 1.5 \times 3 = 38.5\), this data point is captured by the line protruding from the upper part of the box. The lowest data point has a value of 25. Since \(25 < 31 - 1.5 \times 3 = 26.5\), it lies too far from the box to be captured by the line protruding from the lower part of the box, so it gets drawn separately.
We can draw boxplots in order to compare the distribution of a variable in two groups. But you can also use boxplots to compare more than two groups.
The following command plots the accuracy data (PropCorrect) using a separate boxplot for each condition (LearningCondition). We also add axis labels and change the plotting theme:
ggplot(data = d_box,
       aes(x = LearningCondition, y = PropCorrect)) +
  geom_boxplot() +
  xlab("Learning condition") +
  ylab("Proportion of correct translations") +
  theme_bw()
Plain boxplots are useful when you want to get a quick idea of the distribution of the data in different groups. However, identical-looking boxplots can represent markedly different distributions! Furthermore, as the next exercise may illustrate, boxplots may be hard to interpret at first blush.
Exercise 2 (Matching boxplots to histograms) For the plot below, I generated data from five different distributions. For each generated dataset, I drew a histogram and a boxplot. Match the histograms and the boxplots that are drawn on the basis of the same data. Are there any peculiarities of the data distributions that are visible in the histogram but not in the boxplot?
To make boxplots more informative and easier to interpret, it can be useful to draw boxplots that also show the individual data points rather than merely the distribution’s quartiles. Such visualisations are particularly useful when the dataset isn’t prohibitively large, so that you can actually make out the different data points.
To do so, we can add another layer to the boxplot by inserting the command geom_point() into the ggplot call. geom_point() draws the data points as, well, points and doesn’t summarise them in terms of their quartiles like geom_boxplot() does. Since we insert geom_point() after geom_boxplot(), these points will be plotted on top of (rather than underneath) the boxplots.
ggplot(data = d_box,
       aes(x = LearningCondition, y = PropCorrect)) +
  geom_boxplot() +
  geom_point() + # draw individual data points
  xlab("Learning condition") +
  ylab("Proportion correct answers") +
  theme_bw()
If you count the number of points in the graph above, you’ll notice that there are far fewer points (28) than there were participants (80). The reason is that several participants obtained the same result, and their data are plotted on top of each other. For this reason, it makes sense to move the plotted points a bit apart (jittering). Here we can move the points apart horizontally. To do this, we need to specify the position parameter inside the geom_point() command.
We can specify both width (horizontal jittering) and height (vertical jittering). Fiddle with the setting for width to see what happens if you change it. I’d leave the value for height set to 0 (= no vertical jittering) so that we don’t depict proportions of, say, 0.24 as 0.28.
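A sketch with a width of 0.2:

ggplot(data = d_box,
       aes(x = LearningCondition, y = PropCorrect)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2, height = 0)) +
  xlab("Learning condition") +
  ylab("Proportion correct answers") +
  theme_bw()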
Notice that I’ve additionally specified the outlier.shape parameter in the geom_boxplot() command as NA (not available). This prevents the geom_boxplot() command from plotting outlying data points so that these points aren’t plotted twice (as part of the boxplot and as an individual data point).
To make the individual data points more visually distinct, we can change the plotting symbol by changing the shape parameter. This overview shows which plotting symbols are available:
Shapes 21 through 25 resemble shapes 1 through 5, but they can be coloured in.
Exercise 3 (A boxplot) Wagenmakers et al. (2016) set out to replicate an older study across a large number of psychology labs. The dataset Wagenmakers_Zeelenberg.csv contains summary data from one such lab (Zeelenberg et al.). In the experiment, participants were instructed to either smile or pout (Condition) as they judged the funniness of four cartoons on a 10-point scale (0–9). The mean rating per participant can be found in the creatively named column MeanRating. The research question was if smiling participants gave higher funniness ratings than pouting ones.
Compute the number of participants in each condition as well as the mean MeanRating per condition (see the previous primer). Also draw boxplots (with individual data points) on the basis of which this research question can be answered. Label your axes appropriately. Save a graph you’re happy with using ggsave().
Exercise 4 (Another boxplot and some data wrangling) You can find the data from all labs involved in the Wagenmakers et al. (2016) replication attempt on https://osf.io/hgi2y/. Locate and download the data from any of the labs (other than the Zeelenberg et al. one). Inspect the dataset using a spreadsheet program and note how the data are laid out. In particular, observe that the data entries start in row 3 as opposed to row 2. The funniness ratings are in columns K through N, whereas columns D through H indicate whether the participants followed the instruction. Column C shows which condition the participants were assigned to. Using R only (i.e., don’t use your spreadsheet program for the next steps), do the following:
Read in the data. (The skip parameter of the read_csv() function may be useful.)
Create a tibble that contains the mean funniness ratings assigned by each participant that followed the instructions on all four cartoons. This tibble should, at the very least, contain a column indicating the condition the participants were assigned to and the mean funniness rating they gave to the four cartoons. Participants who did not follow the instructions all the time should not be included. It’s probably a good idea to label the conditions pout and smile rather than 0 and 1.
Compute the number of participants with usable data in each condition as well as the mean average funniness rating per condition.
Draw boxplots (with individual data points) on the basis of which this research question can be answered. Label your axes appropriately. Save a graph you’re happy with using ggsave().
Comparing groups using frequency polygons or empirical cumulative distribution functions
It’s possible to compare the distribution of some variable across two or more groups using frequency polygons, too. These are constructed similarly to histograms, but without the bars being drawn. See the code below. Depending on how many bins are drawn and where their boundaries are, these frequency polygons may take on different shapes.
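A minimal sketch (the bin width of 0.1 is just an example value):

ggplot(data = d_box,
       aes(x = PropCorrect, colour = LearningCondition)) +
  geom_freqpoly(binwidth = 0.1)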
Another possibility is to make use of so-called empirical cumulative distribution functions. For a given data vector \(X = (X_1, \dots, X_n)\), the empirical cumulative distribution function outputs for each number \(r\) the proportion of data points \(X_i\) in \(X\) such that \(X_i \leq r\):
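In ggplot2, such a plot can be drawn with stat_ecdf(); a minimal sketch:

ggplot(data = d_box,
       aes(x = PropCorrect, colour = LearningCondition)) +
  stat_ecdf()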
What may be counterintuitive at first blush is that the line for the <oe> participants is higher than the one for the <ij> participants. On reflection, this makes sense: Looking up the value 0.5 on the x-axis, we may deduce that about 62.5% of the <ij> participants have a PropCorrect value of 0.5 or below, whereas more than 87.5% of the <oe> participants have a PropCorrect value of 0.5 or below. So the group that tends to have the lowest values should have higher empirical cumulative distribution values! This may go some way towards explaining why I don’t recall ever having seen empirical cumulative distribution plots outside of the statistical literature. That said, an attractive feature of these plots is that they fully characterise the distribution of the observed data and that they don’t require you to fiddle with parameters such as the number of bins.
Exercise 5 (Boxplots vs empirical cumulative distribution functions) For the plot below, I generated five datasets for two groups. For each generated dataset, I drew boxplots and empirical cumulative distribution plots. Match the boxplots and the empirical cumulative distribution plots that are drawn on the basis of the same data. Are there any peculiarities of the data distributions that are visible in the empirical cumulative distribution plots but not in the boxplot?
Try out alternative plots
Boxplots are a decent default choice for comparing groups. But sometimes they obscure rather than reveal relevant differences between groups, as shown in the previous exercises. Furthermore, if the outcome data are fairly coarse, it may be more sensible to plot the individual data points and highlight the mean instead of the median.
To illustrate this, we use data from an experiment by Berthele (2012). The data are available in the file berthele2012.csv. Skipping over the details, we’d like to know how the numeric variable Potenzial varies depending on the values of the categorical variables CS and Name. In the plot below, we use a technique called facetting to split up the graph into two subgraphs. The boxplots don’t seem to be too informative as they are degenerate: The medians coincide with either the 1st or 3rd quartile due to the coarseness of the data.
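A sketch of the read-in code, assuming the object name berthele used below:

berthele <- read_csv(here("data", "berthele2012.csv"))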
Rows: 155 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): CS, Name
dbl (1): Potenzial
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(berthele,
       aes(x = CS, y = Potenzial)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(shape = 1,
             position = position_jitter(height = 0, width = 0.2)) +
  facet_grid(cols = vars(Name))
Instead of highlighting the medians and the quartiles, let’s compute the mean Potenzial score as well as the standard deviation for the Potenzial scores in each cell and add these to the data plot. Note that we can pass new data arguments to geom-layers (here: geom_pointrange()) in order to visualise data from different sources. I also jittered the data points vertically ever so slightly in order to be able to make out the overlapping data points better.
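One way to do this is sketched below; the object name berthele_means and the column names mean_pot and sd_pot are my own:

berthele_means <- berthele |>
  group_by(CS, Name) |>
  summarise(mean_pot = mean(Potenzial),
            sd_pot = sd(Potenzial),
            .groups = "drop")

ggplot(berthele,
       aes(x = CS, y = Potenzial)) +
  geom_point(shape = 1,
             position = position_jitter(width = 0.2, height = 0.05)) +
  geom_pointrange(data = berthele_means, # pass new data to this layer
                  aes(y = mean_pot,
                      ymin = mean_pot - sd_pot,
                      ymax = mean_pot + sd_pot)) +
  facet_grid(cols = vars(Name))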
5 Line charts
Line charts are a reasonable choice for presenting the results of more complex studies. They’re often used to depict temporal developments, but I don’t think their use should be restricted to this. I often use line charts to better understand complex results that are presented in a table. For instance, Slavin et al. (2011) presented their results in four tables (Tables 4–7). I entered the results pertaining to the PPVT (English) and TVIP (Spanish) post-tests into a CSV file (Peabody_Slavin2011.csv):
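A sketch of the read-in code, assuming the object name slavin used below:

slavin <- read_csv(here("data", "Peabody_Slavin2011.csv"))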
Rows: 16 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): CourseType, Language
dbl (4): Grade, Learners, Mean, StDev
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Grade: Grades 1 through 4.
CourseType: EI (English immersion) or TBE (transitional bilingual education).
Language: English or Spanish.
Learners: Number of learners per grade, language and course type.
Mean: Mean score per grade, language and course type.
StDev: Standard deviation per grade, language and course type.
We can plot the development in the English and Spanish test scores for the two course types:
ggplot(data = slavin,
       aes(x = Grade, y = Mean,
           colour = Language,         # use different colours per language
           linetype = CourseType)) +  # use different line types per course type
  geom_line() +  # connect data points with a line
  geom_point()   # add data points as points for good measure
What’s new in the R code is that you can specify more than just an x and a y parameter in the aes() call. When you associate colour and linetype with variables in the dataset, the data associated with different levels of these variables (i.e., English vs. Spanish; TBE vs. EI) will be plotted in a different colour or using a different line type. The colours and line types used here are ggplot’s default choices; we could override those, see ?scale_colour_manual and ?scale_linetype_manual. The same goes for the appearance of the legends.
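For instance, a sketch with hand-picked (and entirely arbitrary) colours and line types:

ggplot(data = slavin,
       aes(x = Grade, y = Mean,
           colour = Language,
           linetype = CourseType)) +
  geom_line() +
  geom_point() +
  scale_colour_manual(values = c("darkorange", "darkblue")) +
  scale_linetype_manual(values = c("solid", "dashed"))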
Using different colours and line types is all well and good if you have a limited number of variables and a small number of levels per variable. But for more complex data, such graphs quickly become confusing. One useful technique is to plot different lines in separate graphs (small multiples) and show the different graphs side-by-side. This is known as facetting. The command facet_grid() can be used to specify the variables according to which the graph should be separated and how. Since the information regarding Language and CourseType is expressed in the facets, we don’t have to express it using colours and linetypes any more. (But feel free to do so if you think it makes things clearer!) We need to wrap the names of the variables according to which the facets are to be drawn in a vars() call.
ggplot(data = slavin,
       aes(x = Grade, y = Mean)) +
  geom_line() +
  geom_point() +
  facet_grid(rows = vars(CourseType),  # TBE vs. EI in rows
             cols = vars(Language))    # Span. vs. Eng. in columns
Or you can combine facetting with using different colours or line types. In the next plot, the data for different languages are shown in separate panels, but each panel shows the data for both the TBE and EI pupils. This is particularly useful if we want to highlight differences (or similarities) between the TBE and EI pupils. If instead we wanted to highlight differences or similarities in the development of the pupils’ language skills in Spanish and English, we’d draw this graph the other way round. In addition to connecting the means with a line, this graph also visualises the means using a symbol in order to highlight that there were four discrete measurement times (i.e., no measurements between Grades 1 and 2).
ggplot(data = slavin,
       aes(x = Grade,
           y = Mean,
           shape = CourseType,        # different symbols by course type
           linetype = CourseType)) +
  geom_point() +
  geom_line() +
  facet_grid(cols = vars(Language))   # only split graph by column (Sp. vs. Eng.)
Averages don’t tell the whole story
The line charts above only show the average PPVT and TVIP scores by grade and course type as I don’t have access to the raw data. But averages don’t always tell the whole story.
Consider the following example. You’re reading a study in which 39 participants answer a question on a scale from 0 to 5. The study reports that the average response was 2.43 on this 6-point scale. You would be forgiven for assuming that most participants answered ‘2’ or ‘3’. But this doesn’t have to be the case: the three graphs below all show 39 data points whose mean is 2.43 (after rounding). But we’d interpret the left graph (only extreme choices) differently from the middle graph (whole range of opinions) and from the right graph (only moderate responses).
If the reported results include a measure of dispersion (e.g., a standard deviation), the number of possible data patterns is reduced, but there could nonetheless be a multitude of patterns that give rise to the same numerical summaries but that may have to be interpreted differently:
A related reason for graphing the data is to make sure that the numerical results reported are actually relevant. For instance, a study may report a difference in the mean outcomes of two groups and illustrate this with the left panel in the figure below. But if we look at the distribution of the raw data in both groups, shown in the right panel, we realise that there is a strong right skew to these data. This would prompt us to at least ask ourselves the question whether we’re at all interested in the means of these groups.
The upshot is this: If at all possible, don’t just show averages but try to give your readers – and yourselves – a sense of how the actual data are distributed.
Exercise 6 (A line chart) The dataset MorphologicalCues.csv contains data stemming from a learning task in which the participants had to associate a morphological form with a syntactic function. 70 learners were randomly assigned to one of three conditions (Condition):
RuleBasedInput: In this condition, the input showed a perfect association between the morphological form and the syntactic function.
StrongBias: The input showed a strong, but imperfect association between form and function.
WeakBias: The input showed a weak (but non-zero) association between form and function.
To track the learners’ progress throughout the task, the learners took part in three sorts of exercises: Comprehension, grammaticality judgement (GJT) and Production. They did this at four stages during the data collection (Block 1, 2, 3, 4), that is, after receiving a bit of input, some more input, etc.
The column Accuracy shows the mean percentage of correct responses per Block, per Task and per Condition.
The research question: How does Accuracy develop in the course of the data collection depending on the Task and Condition?
Plot these data so that the research question can be answered. You will want to try out several graphs so that you can pick a graph that you find most comprehensible.
Exercise 7 (Compute summary statistics and drawing a line chart) Schlechtweg, Peters, and Frank (2023) investigated how well native speakers of German can tell apart the English /ɛ/ (pen) and /æ/ (pan) vowels. In their first study, they played a total of 198 recordings to each of their 51 participants and asked them to identify whether the vowel sound contained in these recordings was English /ɛ/ or English /æ/. The recordings differed in terms of their spectral properties (formants 1 and 2) as well as in their duration.
The file schlechtweg2023_selection.txt contains the data for this task. It contains the columns Subject (the participants’ IDs), Item (the minimal pair the recording belonged to), GermanPillai (not relevant for this exercise), Duration (short, middle or long), Step (an integer between 1 and 11, with 1 representing a more open and somewhat more back articulation as would be typical of /æ/ and 11 representing a closer and somewhat more fronted articulation as would be typical of /ɛ/), LevelEnglish (high, low or mid), SelVowel1 (with 1 meaning that the participant selected /æ/ and 0 meaning that they selected /ɛ/), and ReactionTime (not relevant for this exercise).
We would expect that the proportion of /æ/ selections drops as Step gets closer to 11. Further, since English /æ/ tends to be longer than English /ɛ/, we expect that the proportion of /æ/ selections is greater for longer durations than for shorter ones. Lastly, we expect that highly-proficient participants are more sensitive to both of these cues.
Read in the data. Aggregate it in a sensible way and visualise the aggregated data using a line chart in such a way that it can be gauged whether these expectations are borne out.
6 Putting it all together: Portuguese reading and writing skills
I’ll walk you through this example in the lecture. The data stem from Desgrippes, Lambelet, and Vanhove (2017) and Pestana, Lambelet, and Vanhove (2017).
Drawing decent graphs can take time
Don’t get disheartened by the lengthy code snippets below. They are not the result of one energetic burst of coding. As I’ll try to demonstrate in the lecture, I started with a fairly basic graph, found some problem with it, tried to fix it, and so on. The end result is a plot with 27 boxplots that hopefully enables readers to compare the three language groups in terms of their test scores, to gauge the progress that each language group makes from T1 to T3, and that also gives readers a sense of the variation that exists within each group.
Rows: 1904 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, LanguageTested
dbl (4): Time, Reading, Argumentation, Narration
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skills_longer_portuguese <- skills |>
  pivot_longer(Reading:Narration,
               values_to = "score",
               names_to = "skill") |>
  mutate(
    class = str_split_i(Subject, "_", 1),
    group = str_split_i(Subject, "_", 2)
  ) |>
  mutate(language_group = case_when(
    group == "CP" ~ "Portuguese control",
    group == "CF" ~ "French control",
    group == "CD" ~ "German control",
    group %in% c("PLF", "PNF") ~ "French-Portuguese",
    group %in% c("PLD", "PND") ~ "German-Portuguese",
    .default = "other"
  )) |>
  filter(LanguageTested == "Portuguese")
7 Cleveland dot plots
Don’t use pie charts
The three graphs below all show the same data (the proportion of sales of a fictitious pie producer by pie taste).
The graph on the left is a traditional pie chart. Though common, this type of graph has serious shortcomings, for which reason it should rarely be used:
Pie charts aren’t very flexible: When the number of categories is larger than here, pie charts are hardly comprehensible. It’s also difficult to add additional information to pie charts (e.g., the proportion of sales by taste the year before; see below).
Readers gauge the numbers (in this case: the proportions) that the pie chart represents less accurately than when they’re presented in a bar chart or in a Cleveland dot plot (Cleveland and McGill 1984).
The graph in the middle is a bar chart, the one on the right a (Cleveland) dot plot. Both are better choices than pie charts. For one thing, they’re more flexible. For instance, you can add the sales from the year before to both graphs – as separate bars or by using different symbols. This would be difficult to accomplish in a pie chart:
I prefer dot plots because they make it easier to visualise a large number of categories and because I find them less ‘busy’.
Bars next to each other
If you prefer bar charts: It’s usually better to plot the bars next to each other rather than to stack them on top of each other.
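A sketch using made-up sales data — geom_col() with position_dodge() puts the bars side by side, whereas the default stacks them:

sales <- tibble(
  taste      = rep(c("apple", "cherry", "rhubarb"), times = 2),
  year       = rep(c("last year", "this year"), each = 3),
  proportion = c(0.30, 0.45, 0.25, 0.35, 0.40, 0.25)
)

ggplot(sales,
       aes(x = taste, y = proportion, fill = year)) +
  geom_col(position = position_dodge())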
7.1 A basic Cleveland dot plot
For this tutorial, we’ll use data from Vanhove (2017). For this study, I recruited 175 native speakers of Dutch from Belgium and the Netherlands. I showed them 44 German words with Dutch cognates (e.g., Stadt (NL stad) or Schiff (NL schip)) and asked them to choose their gender-marked definite article (der (masculine), die (feminine), das (neuter)). Dutch and German are closely related languages, but there are a number of cognates whose grammatical gender differs between both languages. The Dutch word strand (beach), for instance, is neuter, whereas the German word Strand is masculine. One of the research questions was whether Belgian and Dutch participants were more likely to choose the neuter article das if the German word’s Dutch cognate is neuter than when it has common gender. (Common gender = non-neuter. Present-day Standard Dutch hardly distinguishes between masculine and feminine gender.)
Rows: 88 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Stimulus, Country, DutchCognate, GermanGender, DutchGender
dbl (1): NeuterResponses
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The column NeuterResponses contains the proportion of neuter article (das) choices per German word (Stimulus) and per country (Country). The column GermanGender contains the word’s correct gender (in German); the column DutchGender contains the grammatical gender of the word’s Dutch cognate.
We can plot the proportion of das choices per word separately per country (Belgium vs. the Netherlands). This will show us how strongly the response pattern varies per word and whether Belgian and Dutch speakers of Dutch show different preferences. Note that I put the word labels along the y-axis, which makes them much easier to read, and the proportion of neuter responses along the x-axis.
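A sketch of such a plot; it uses the same aesthetics as the final version at the end of this section:

ggplot(d_dot,
       aes(x = NeuterResponses, y = Stimulus,
           shape = Country)) +
  geom_point()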
The plot above doesn’t offer an answer to the research question since the difference between stimuli with different Dutch genders isn’t highlighted. To accomplish this, we can plot stimuli with neuter cognates and those with common-gender cognates in different boxes (facetting, see the tutorial on line charts). (To see what scales = "free_y" and space = "free_y" accomplish, leave these parameters out of the call, i.e., just use facet_grid(rows = vars(DutchGender)).)
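A sketch:

ggplot(d_dot,
       aes(x = NeuterResponses, y = Stimulus,
           shape = Country)) +
  geom_point() +
  facet_grid(rows = vars(DutchGender),
             scales = "free_y", space = "free_y")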
This graph shows pretty clearly that both Belgian and Dutch speakers of Dutch pick das more often if the German word has a neuter-gender cognate than when it has a common-gender cognate: The points in the lower box lie more to the right than those in the upper box. With a single exception (Boot), there is no overlap between these two distributions.
Additionally, the responses of Belgian and Dutch participants don’t seem to vary much from one another. For instance, we don’t observe that most triangles lie to the right of the circles or that Belgian participants prefer das for Wurst and Dutch participants don’t.
7.2 Finishing touches
The previous graph is good enough. But we can do better still. German is taught in school in both Flanders and the Netherlands, and at least some participants will have known which words are neuter in German and which aren’t. So some of the variation between the stimuli will be attributable to the stimuli’s German gender. To highlight this, we can split up the graph not only by the words’ Dutch gender, but also by their German gender. And we still have to label the axes!
ggplot(d_dot,
       aes(x = NeuterResponses, y = Stimulus,
           shape = Country)) +
  geom_point() +
  facet_grid(rows = vars(DutchGender, GermanGender),
             scales = "free_y", space = "free_y") +
  xlab("Proportion 'das'") +
  ylab(element_blank())
The upper three boxes show feminine, masculine and neuter German words with common-gender cognates; the lower three boxes show feminine, masculine and neuter German words with neuter-gender cognates. The graph shows that the factor Dutch gender is the most important determinant of the participants’ article choices. But the factor German gender also plays a role: When a German word is neuter, both Belgian and Dutch people choose das more often than when it’s feminine or masculine.
Now the words are sorted alphabetically in each box. But this order can be customised. We can also add to the word labels the words’ Dutch counterparts.
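A sketch of one way to build such labels; the exact label format is an assumption of mine, but the column name word_label is the one the final snippet below uses:

d_dot <- d_dot |>
  mutate(word_label = paste0(Stimulus, " (", DutchCognate, ")"),
         word_label = reorder(word_label, NeuterResponses))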
We could reorder the facets so that the word categories that result in the most ‘das’ responses are at the top of the graph. While we’re at it, we could also give these facets clearer labels.
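A sketch with hypothetical label texts; the column names dutch_label and german_label are the ones the final snippet below uses:

d_dot <- d_dot |>
  mutate(dutch_label = paste("Dutch:", DutchGender),
         german_label = paste("German:", GermanGender))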
The symbols used in the graph above are difficult to distinguish optically. They, too, can be changed.
Hm. The facets with more ‘das’ responses are at the bottom of the graph, whereas within each facet, the words with more ‘das’ responses are at the top. We can fix this by setting the decreasing parameter in the reorder() function. We also put the legend at the bottom of the graph and slightly increase the size of the symbols:
d_dot <- d_dot |>
  # Reorder the gender labels in decreasing order by proportion 'das'
  mutate(
    dutch_label = reorder(dutch_label, NeuterResponses, decreasing = TRUE),
    german_label = reorder(german_label, NeuterResponses, decreasing = TRUE)
  )

ggplot(d_dot,
       aes(x = NeuterResponses,
           y = word_label,
           shape = Country)) +
  geom_point(size = 2) +
  xlab("Proportion of neuter (das) choices") +
  ylab("German noun") +
  scale_shape_manual(values = c(1, 3)) +
  facet_grid(rows = vars(dutch_label, german_label),
             scales = "free_y",
             space = "free_y") +
  theme_bw() +
  theme(legend.position = "bottom")
Not half bad, I think.
Exercise 8 (Another dotplot) In Vanhove (2016), native speakers of German who hadn’t learnt Dutch yet were assigned to one of two learning conditions (oe–u and ij–ei), in which they were exposed to Dutch–German cognates exhibiting the grapheme correspondences oe–u (e.g., hoed–Hut) and ij–ei (e.g., wijn–Wein), respectively. I then asked them to translate previously unseen Dutch words containing these digraphs. The question was quite simply whether the participants who had been exposed to the oe–u correspondence but not to the ij–ei correspondence would be more likely to translate Dutch words containing oe as German words containing u than the participants who had been exposed to the ij–ei correspondence but not to the oe–u correspondence, and similarly the other way round.
The data are laid out in three data sets:
Vanhove2016_subjects.csv contains information about the participants, including the LearningCondition to which they were assigned;
Vanhove2016_items.csv contains information about the items, including the Category to which they belong (ij cognate, oe cognate, other cognate, non-cognate);
Vanhove2016_translations.csv shows how the subjects translated each item and how these translations were scored. The relevant outcome is CorrectVowel.
Read in and merge these three datasets. Then draw a dotchart similar to the one above that shows, for each Dutch word containing oe or ij, the proportion of participants in each condition that used the correct vowel when translating it. You can ignore the items in the categories other cognate and non-cognate as well as the items occurring in the TrainingBlock. Interpret the results.
Bonus: Aggregate the data in a different way, namely by computing the proportion of correct vowel uses per participant per item category. (You should again ignore the categories other cognate and non-cognate as well as the TrainingBlock.) Visualise these aggregated data, for instance using boxplots. Do you see any advantages and disadvantages of this visualisation compared to the dotchart? Would it be useful to include both kinds of visualisation when presenting the study’s results at a conference or in a research paper?
8 Scatterplots
For my PhD thesis (Vanhove 2014), I investigated how people’s ability to recognise written and spoken cognates in a related but unknown language develops throughout the lifespan and how it is related to linguistic and cognitive factors. The dataset Vanhove2014_Translations.csv contains the raw data of this study. For each of the 163 participants, this dataset contains 100 entries (one for each cognate), for a total of 16,300 rows. Each translation was rated as correct or incorrect. Additionally, the dataset contains some information about the participants (e.g., their performance on other tasks) as well as about the stimuli (e.g., a measure expressing its formal similarity to its French, German or English cognate).
Rows: 16300 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Stimulus, Mode, Translation, Sex, Status
dbl (10): Subject, Trial, Correct, Age, NrLang, DS.Total, WST.Right, Raven.R...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The variables:
Stimulus: The word to be translated.
Subject: The participant’s ID.
Mode: Whether the word was presented in its spoken (Auditory) or in its written form (Visual).
Trial: The position of the word in the task.
Translation: The participant’s translation attempt for the stimulus.
Correct: Whether the translation was correct (1) or incorrect (0).
Sex
Age
NrLang: The number of languages the participant speaks.
DS.Total: The participant’s score on a working memory task.
WST.Right: The participant’s score on a German vocabulary test.
Raven.Right: The participant’s score on an intelligence test.
English.Total: The participant’s score on an English-language test.
Status: Whether the stimulus has a German, English, or French cognate (target) or not (profile).
MinLev: The degree of formal discrepancy between the stimulus and its most similar German, English or French cognate. (lower = more similar)
Missing values were labelled NA (not available).
A rough summary can be obtained like so:
summary(d_scatter)
Stimulus Subject Mode Trial
Length:16300 Min. : 64 Length:16300 Min. : 1.0
Class :character 1st Qu.:2909 Class :character 1st Qu.:13.0
Mode :character Median :5731 Mode :character Median :25.5
Mean :5317 Mean :25.5
3rd Qu.:7794 3rd Qu.:38.0
Max. :9913 Max. :50.0
Translation Correct Sex Age
Length:16300 Min. :0.0000 Length:16300 Min. :10.00
Class :character 1st Qu.:0.0000 Class :character 1st Qu.:16.00
Mode :character Median :0.0000 Mode :character Median :39.00
Mean :0.3523 Mean :40.28
3rd Qu.:1.0000 3rd Qu.:60.00
Max. :1.0000 Max. :86.00
NrLang DS.Total WST.Right Raven.Right
Min. :1.000 Min. : 2.000 Min. : 4.00 Min. : 0.0
1st Qu.:2.000 1st Qu.: 5.000 1st Qu.:29.00 1st Qu.:12.0
Median :3.000 Median : 6.000 Median :34.00 Median :19.0
Mean :3.067 Mean : 6.374 Mean :30.24 Mean :17.8
3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:36.00 3rd Qu.:24.0
Max. :9.000 Max. :12.000 Max. :41.00 Max. :35.0
NA's :100
English.Total Status MinLev
Min. : 3.00 Length:16300 Min. :0.0000
1st Qu.:20.75 Class :character 1st Qu.:0.2857
Median :31.00 Mode :character Median :0.4000
Mean :28.30 Mean :0.4566
3rd Qu.:37.00 3rd Qu.:0.6062
Max. :44.00 Max. :1.0000
NA's :300
In order to be able to sensibly visualise these data, we need to first transform this dataset. If we’re interested in the relationship between the participants’ age, sex, and linguistic and cognitive test results on the one hand and their translation performance on the other hand, it seems useful to first compute the number of correct translations for spoken and written words per participant. To this end, we can use the tools introduced in the Datasets part of this lecture. Note that we group_by() not just Subject and Mode, but also by a bunch of participant-related variables. This way, we don’t need to construct a tibble with these variables and then join it with the summary data:
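A sketch of this step; whether only the target items should be counted is an assumption on my part, based on the ‘(out of 45)’ axis label below:

per_participant <- d_scatter |>
  filter(Status == "target") |> # assumption: count only the target items
  group_by(Subject, Mode, Sex, Age, NrLang, DS.Total,
           WST.Right, Raven.Right, English.Total) |>
  summarise(nr_correct = sum(Correct),
            .groups = "drop")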
To investigate the relationship between, say, WST.Right and nr_correct, we can plot these data in a scatterplot. While we’re at it, we can split up this graph into two panels: one for written words, and one for spoken words.
ggplot(data = per_participant,
       aes(x = WST.Right, y = nr_correct)) +
  geom_point(shape = 1) +
  xlab("L1 vocabulary test") +
  ylab("Number of correct translations\n(out of 45)") +
  facet_grid(cols = vars(Mode)) +
  theme_bw()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
The warning concerns missing values in the WST.Right variable:
per_participant |> filter(is.na(WST.Right))
While only one participant had a missing WST.Right value, this participant is represented twice in the dataset (once for auditory stimuli, once for written stimuli), hence the two removed rows mentioned in the warning.
As for deciding which variable to put along which axis: By and large, put the variable that is most likely to be the cause of the relationship along the x axis and the variable that is most likely to be the effect along the y axis. In this case, it seems more likely that L1 skills affect one’s ability to recognise cognates in a foreign language than vice versa. Hence, put the variable representing L1 skills along the x axis.
No correlations without scatterplots!
Relations between two numeric variables are often summarised by means of a correlation coefficient. It’s important to appreciate that any correlation coefficient can correspond to a multitude of underlying data patterns. Crucially, correlation coefficients close to 0 do not have to mean that there is no relation between the two numeric variables (the relation may be strongly nonlinear), and correlation coefficients close to 1 (or -1) don’t have to mean that there is a strong relation between the two numeric variables (the correlation may be driven by an outlier, among other possibilities).
So, whenever you want to compute a correlation coefficient, draw a scatterplot first. And show it to your readers.
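For reference, the correlation coefficient itself is a one-liner:

cor(per_participant$WST.Right, per_participant$nr_correct,
    use = "pairwise.complete.obs") # drop the missing values noted above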
8.1 Trendlines
Consider the following scatterplots:
ggplot(data = per_participant,
       aes(x = English.Total, y = nr_correct)) +
  geom_point(shape = 1) +
  xlab("Result English test") +
  ylab("Number of correct translations") +
  facet_grid(cols = vars(Mode))
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_point()`).
We can add scatterplot smoothers to these plots by means of geom_smooth(). Scatterplot smoothers were developed to discover relationships (including nonlinear ones) between two variables that aren’t necessarily immediately obvious if the data are shown in a scatterplot. Here I turn off the confidence band around the scatterplot smoother (se = FALSE) as confidence bands are a topic well beyond the scope of this primer.
ggplot(data = per_participant,
       aes(x = English.Total, y = nr_correct)) +
  geom_point(shape = 1) +
  geom_smooth(se = FALSE) +
  xlab("Result English test") +
  ylab("Number of correct translations") +
  facet_grid(cols = vars(Mode))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 6 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 6 rows containing missing values or values outside the scale range
(`geom_point()`).
The points on the smoother are a kind of mean value of the \(Y\) variable for the respective \(X\) value. In the left panel, for instance, the average number of correct translations in the auditory mode for someone with an English test score of 30 is roughly 17–18, whereas the average number of correct translations for written words for participants with a score of 40 on the English test is about 25.
We needn’t necessarily amuse ourselves with the maths behind these smoothers (but check out the info box if you’re curious), but the following points are important:
The trend line is nearly always a bit wiggly. This is the case even when the relationship itself is as good as linear.
The default settings for geom_smooth() tend to work fairly well, but sometimes it’s necessary to fiddle with them so that the smoother captures the trend in the data better.
To elaborate on the second point, consider the graph below, which shows a scatterplot of simulated data with two scatterplot smoothers. The red line was drawn with the default settings. This line doesn’t capture an important feature of the relationship (the data points go up and down). The blue line captures this trend much better. It was drawn using the command geom_smooth(span = 0.1). The span parameter determines how wiggly the curve may be (the smaller span, the wigglier the curve). By default, span is set to 0.75. Finding a decent span value is a matter of trial and error.
In the second example, the blue line was drawn using geom_smooth(span = 0.1). This line is much too wiggly, and it essentially models random deviations from the general trend. The red line, drawn with the default setting (span = 0.75), captures the general trend much more sensibly. The green line, by contrast, isn’t wiggly enough (span = 3).
Summing up: Generally, the default settings work reasonably well. But when you notice that visually salient patterns in the scatterplot aren’t captured by the trend line, you need to fiddle a bit with the span parameter.
More generally, data analysis and statistics aren’t a matter of blindly applying formulae and recipes.
Trend line techniques
By default, geom_smooth() draws the trend line using the LOESS (locally estimated scatterplot smoothing) technique for datasets with fewer than 1,000 observations and uses a GAM (generalised additive model) for larger datasets. Alternatives are simple linear regression (method = "lm") and robust linear regression (method = "rlm"). We can’t discuss all of these techniques in this primer, but LOESS works roughly as follows.
Let \(X = (X_1, \dots, X_n)\) be a data vector of predictor values and let \(Y = (Y_1, \dots, Y_n)\) be a data vector of outcome values. We specify a positive value for the span parameter, also referred to as \(\alpha\). Typically, \(\alpha \in (0, 1)\), and we will restrict our attention to this case. If we want to evaluate the LOESS smoother at a given value \(x_*\), we first compute the distances between the observed predictor values and \(x_*\): \[d_i = |X_i - x_*|\] for \(i = 1, \dots, n\). We sort the distances ascendingly and read out the smallest distance such that at least \(\alpha \cdot n\) of the distances are no larger than it. Call this distance \(M\). Next, for each data point, we compute a weight: \[w_i = \begin{cases}\left(1 - \left(\frac{d_i}{M}\right)^3\right)^3 & \textrm{if $d_i \leq M$}, \\ 0 & \textrm{else},\end{cases}\] for \(i = 1, \dots, n\). This weighting scheme assigns large weights to data points with predictor values close to \(x_*\) and smaller ones to data points with predictor values farther away from \(x_*\). Data points whose distance to \(x_*\) exceeds \(M\), i.e., data points that do not belong to the closest \(100\alpha\)% of data points to \(x_*\), get weight 0. That is, the smaller \(\alpha\), the more the LOESS fit is based on local information, and the larger \(\alpha\), the more it uses global information. Next, a weighted polynomial regression model is fitted to the data. By default, quadratic regression is used, which means that we compute the values \(\widehat{b}_0, \widehat{b}_1, \widehat{b}_2\) that minimise the weighted sum of squares \[\sum_{i=1}^n w_i(Y_i - (\widehat{b}_0 + \widehat{b}_1X_i + \widehat{b}_2X_i^2))^2.\] Finally, we evaluate the smoother at \(x_*\) as \[\widehat{b}_0 + \widehat{b}_1x_* + \widehat{b}_2x_*^2.\] The loess() function that geom_smooth() draws on implements a few improvements on this procedure.
Exercise 9 (Age trends) Still using the data from Vanhove (2014), visualise the relationship between the participants’ age and their performance on the cognate translation task in both modalities. How would you describe the age trends?
Only if you know about regression modelling: Would you use simple linear regression to capture these age trends? Why or why not?
Exercise 10 (Scatterplots for binary data) Scatterplots can also be useful if you have a binary outcome variable and a continuous predictor. To illustrate this, we revisit the data from the study by Schlechtweg, Peters, and Frank (2023); see Exercise 7.
Read in the data.
Retain only the data for Subject 42.
Draw a scatterplot (with a smoother) that shows the Step variable along the x-axis and the SelVowel1 variable (i.e., whether the sound was identified as /æ/) along the y-axis. Use a tiny amount of vertical jittering (see the section on boxplots) so that we can make out overlapping data points, and use facetting to split up the plot by Duration.
Interpret the graph.
In this example, the binary outcome SelVowel1 was already coded as a numeric variable with values 0 and 1. If it had been coded as a character variable with values, say, /æ/ and /ɛ/, you would have to assign the value of 0 to one level of the variable and 1 to the other one.
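Such a recoding could look as follows. This is a sketch: the data frame d, the character variable Vowel and its values are hypothetical.
# Convert a hypothetical character variable into a 0/1 numeric variable.
d <- d |>
  mutate(SelVowel1 = if_else(Vowel == "/æ/", 1, 0))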
Now use the data for all participants, but only plot the data for sounds of middle duration. Use facet_wrap(vars(Subject)) to split up the plot by subjects. Interpret the graph in terms of the main trend. Are there many subjects that seem to buck the trend?
Schlechtweg, Peters, and Frank (2023) also collected data on how strongly their participants distinguish between the German sounds /e:/ (denen) and /ɛ:/ (Dänen). Participants with GermanPillai scores near 1 strongly disambiguate between these sounds in their German production; those with GermanPillai scores near 0 have merged these sounds in their own German production. Instead of facetting by Subject, facet by reorder(Subject, GermanPillai). Then the plots will be arranged from left to right, top to bottom in increasing order of the participants’ GermanPillai scores. Is there some evidence that the participants’ vowel selections in L2 English depend on how strongly they disambiguate between the two German sounds?
Using the same layout as for the previous plot (i.e., facetting by subject, ordered by GermanPillai), draw plots showing the trend lines per subject for sounds of short, middle, and long duration. Use different colours for the different durations. Don’t plot the individual data points, as they would overload the plots. Can you observe any trends? For instance, is one duration typically associated with more /æ/ selections than the others?
8.2 The Tukey mean–difference plot
Within-subjects studies are studies in which the participants are tested in several (typically all) conditions rather than in just one, as in between-subjects studies. Data from within-subjects studies are often difficult to visualise satisfactorily: While you could draw boxplots that show the participants’ performance in the different conditions, these plots don’t show which data points belong to the same participants. Yet this information could be crucial: Are the top performers in one condition also the top performers in the other conditions (suggesting that the conditions don’t affect relative performance), or are the top performers in one condition the bottom performers in the other conditions (suggesting that the conditions are differently suited to different profiles)? One option is to add lines connecting the data points for each participant, but such plots get overloaded pretty quickly.
In the special but common case of a within-subjects design with just two conditions, the Tukey mean–difference plot (also known as the Bland–Altman plot in medical circles) is a useful alternative. For each participant, we compute the average of their two condition scores as well as the difference between these two scores. We then draw a scatterplot and put the averages along the x-axis and the differences along the y-axis. If the scores in one condition are generally better than those in the other condition, most of the differences should lie either above 0 or below 0. Further, if there is no noticeable pattern in the scatterplot, this suggests that the difference between the conditions is pretty consistent across all performance levels. If, however, the scatterplot shows, say, a downward trend, this would mean that the top performers don’t derive the same benefit from one condition relative to the other as the remaining participants do.
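Here is a minimal sketch with simulated data. The 30 participants and their scores in two conditions (A and B) are made up for illustration.
# Simulate paired scores: condition A tends to yield higher scores than B.
set.seed(2024)
d <- tibble(
  B = rnorm(30, mean = 55, sd = 10),
  A = B + rnorm(30, mean = 5, sd = 5)
)
# Tukey mean–difference plot: averages along x, differences along y.
d |>
  mutate(average = (A + B) / 2,
         difference = A - B) |>
  ggplot(aes(x = average, y = difference)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")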
Exercise 11 (Drawing a Tukey mean–difference plot) van den Broek et al. (2018) investigated whether repeated exposures to foreign-language words enhanced word retention more if these words were embedded in an information-rich context that allowed the learners to infer the meaning of the words from context or if these words were embedded in an information-poor context that forced learners to retrieve the meaning of these words from memory. We will focus on their Experiment 2, which was a within-subjects experiment, the summary data for which are reported in their Table 2 (p. 565). Specifically, we will focus on the Retrieval and Context inference conditions (ignoring the Retrieval-plus-context inference condition) and on the productive tests (i.e., translation from the L1 to the L2) at the delayed posttest. van den Broek et al. (2018) report that, on average, the participants correctly translated 42% of the words in the Retrieval condition (with a standard deviation of 21 percentage points) and 37% in the Context inference condition (also with a standard deviation of 21 percentage points).
Read in the data from VanDenBroek2018_Exp2.csv. Retain only those entries for which RepeatedTest is FALSE. The participant IDs are in the column PP; the column Word contains the L2 word labels (in Swahili); the column Translation their actual L1 translations (in Dutch); Condition contains the condition in which the participant learnt the word in question; and the column Test2Swa2NlLenient indicates whether the participant was able to correctly translate the word from Dutch into Swahili at the delayed posttest.
Create a new tibble that contains the proportion of correctly translated Dutch words at the delayed posttest per participant and per condition. Do not include the retrieval-plus-context condition.
Check if the average proportions per condition match the numbers given in van den Broek et al.’s Table 2.
Create a tibble that contains, for each participant, the average proportion of correctly translated words over the two conditions as well as the difference between these proportions (i.e., the proportion of correct translations in the context condition minus the proportion of correct translations in the retrieval condition).
Draw a scatterplot of the tibble you constructed in the previous step, with the averages along the x-axis and the differences along the y-axis. Add a horizontal line at \(y = 0\) to the plot (use geom_hline(yintercept = 0, linetype = "dashed")). Also add a scatterplot smoother.
Interpret the plot.
Exercise 12 (The same results as a Cleveland dotplot) Using the same data as in the previous exercise, draw a Cleveland dotplot showing the proportion of correctly translated words in both conditions for each participant. Which visualisation do you prefer for these data?
9 Scatterplot matrices
Scatterplot matrices are useful for showing the bivariate relationships among multiple variables. I usually use a custom function, scatterplot_matrix(), to plot such scatterplot matrices. To use this function, load the scatterplot_matrix.R file that you’ve put in the functions subdirectory:
source(here("functions", "scatterplot_matrix.R"))
Let’s look at an example. We read in the lexical metrics data and text ratings from the French/German/Portuguese project (see the Datasets primer), compute the average rating per text, and plot the relationships between the average rating, the number of tokens in the texts, and the texts’ type/token ratio in a scatterplot matrix:
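The code could look roughly as follows. This is a sketch: the file names are hypothetical (check the Datasets primer for the actual ones), and I’m assuming that scatterplot_matrix() accepts a data frame containing just the variables to be plotted. The two read_csv() calls produce the output shown below.
# Hypothetical file names; adjust them to match your data subdirectory.
lexical <- read_csv(here("data", "lexical_metrics.csv"))
ratings <- read_csv(here("data", "ratings.csv"))
# Compute the average rating per text and join it to the lexical metrics.
mean_ratings <- ratings |>
  group_by(Text) |>
  summarise(meanRating = mean(Rating))
lexical |>
  inner_join(mean_ratings, by = "Text") |>
  mutate(log2Tokens = log2(nTokens)) |>   # log-2 transform the token counts
  select(meanRating, log2Tokens, TTR) |>
  scatterplot_matrix()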
Rows: 3060 Columns: 137
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Text, Subject, Text_Language, Text_Type, LemmawordsNotInSUBTLEX, ...
dbl (129): Time, TTR, Guiraud, nTokens, nTypes, nLemmas, meanWordLength, MST...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 29179 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Batch, Rater, Text, Subject, Text_Language, Text_Type, Rater_Sex, ...
dbl (4): Trial, Rating, Time, Rater_Age
lgl (1): Rater_NativeLanguageOther
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
On the main diagonal, we see a histogram for each variable as well as the number of data points that each histogram is based on. Here, we have \(n = 180\) for all histograms. But if you have missing data for some variables, these numbers will vary.
In the top triangle, we see scatterplots (including scatterplot smoothers) for the three bivariate relationships. The scatterplot in the \((i,j)\)-th cell shows the relationship between the variable whose histogram is shown in the \(i\)-th row (along the y-axis) and the variable whose histogram is shown in the \(j\)-th column (along the x-axis). That is, the top right scatterplot shows the relationship between the number of tokens (log-2 transformed, along the x-axis) and the mean rating (along the y-axis).
In the bottom triangle, the Pearson correlation coefficients for these relationships are shown, along with the number of data points each is based on. For instance, the number -0.15 is the correlation between the mean ratings and the type/token ratios, whereas 0.70 is the correlation between the mean ratings and the number of tokens (log-2 transformed).
Incidentally, I transformed the number of tokens because this variable was pretty right-skewed. If the log-2 transformed value is 4, then the original value is \(2^4 = 16\); if the log-2 transformed value is 6, then the original value is \(2^6 = 64\).
An alternative to scatterplot_matrix() is the ggpairs() function from the GGally package.
10 Suggested reading
Kieran Healy’s Data visualization: A practical introduction does a sterling job at explaining the why and the how of drawing graphs using R and the tidyverse. In addition to the material covered in this lecture, Healy covers drawing maps and drawing visualisations of statistical models. The entire book is freely available from socviz.co.
As their name suggests, Cleveland dot plots were popularised by Cleveland (1993) in his book Visualizing data. But if you want to learn more about this flexible visualisation tool, I suggest you check out the article by Sönning (2016), which is geared towards linguists.
Berthele, Raphael. 2012. “The Influence of Code-Mixing and Speaker Information on Perception and Assessment of Foreign Language Proficiency: An Experimental Study.” International Journal of Bilingualism 16 (4): 453–66. https://doi.org/10.1177/1367006911429514.
Cleveland, William S. 1993. Visualizing Data. Murray Hill, NJ: AT&T Bell Laboratories.
Cleveland, William S., and Robert McGill. 1984. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association 79: 531–54.
Desgrippes, Magalie, Amelia Lambelet, and Jan Vanhove. 2017. “The Development of Argumentative and Narrative Writing Skills in Portuguese Heritage Speakers in Switzerland (HELASCOT Project).” In Heritage and School Language Literacy Development in Migrant Children: Interdependence or Independence?, edited by Raphael Berthele and Amelia Lambelet, 83–96. Bristol: Multilingual Matters. https://doi.org/10.21832/9781783099054-006.
Pestana, Carlos, Amelia Lambelet, and Jan Vanhove. 2017. “Reading Comprehension in Portuguese Heritage Speakers in Switzerland (HELASCOT Project).” In Heritage and School Language Literacy Development in Migrant Children: Interdependence or Independence?, edited by Raphael Berthele and Amelia Lambelet, 58–82. Bristol: Multilingual Matters. https://doi.org/10.21832/9781783099054-005.
Schlechtweg, Marcel, Jörg Peters, and Marina Frank. 2023. “L1 Variation and L2 Acquisition: L1 German /eː/-/ɛː/ Overlap and Its Effect on the Acquisition of L2 English /ɛ/-/æ/.” Frontiers in Psychology 14: 1133859. https://doi.org/10.3389/fpsyg.2023.1133859.
Slavin, Robert E., Nancy Madden, Margarita Calderón, Anne Chamberlain, and Megan Hennessy. 2011. “Reading and Language Outcomes of a Multiyear Randomized Evaluation of Transitional Bilingual Education.” Educational Evaluation and Policy Analysis 33 (1): 47–58. https://doi.org/10.3102/0162373711398127.
Sönning, Lukas. 2016. “The Dot Plot: A Graphical Tool for Data Analysis and Presentation.” In A Blend of MaLT: Selected Contributions from the Methods and Linguistic Theories Symposium 2015, edited by Hanna Christ, Daniel Klenovšak, Lukas Sönning, and Valentin Werner, 101–29. Bamberg, Germany: University of Bamberg Press. https://doi.org/10.20378/irbo-51101.
van den Broek, Gesa S. E., Atsuko Takashima, Eliane Segers, and Ludo Verhoeven. 2018. “Contextual Richness and Word Learning: Context Enhances Comprehension but Retrieval Enhances Retention.” Language Learning 68 (2): 546–85. https://doi.org/10.1111/lang.12285.
Vanhove, Jan. 2014. “Receptive Multilingualism Across the Lifespan: Cognitive and Linguistic Factors in Cognate Guessing.” PhD thesis, University of Fribourg. http://doc.rero.ch/record/210293.
———. 2016. “The Early Learning of Interlingual Correspondence Rules in Receptive Multilingualism.” International Journal of Bilingualism 20 (5): 580–93. https://doi.org/10.1177/1367006915573338.
———. 2017. “The Influence of Standard and Substandard Dutch on Gender Assignment in Second Language German.” Language Learning 67 (2): 431–60. https://doi.org/10.1111/lang.12230.
———. 2021. “Towards Simpler and More Transparent Quantitative Research Reports.” ITL - International Journal of Applied Linguistics 172 (1): 3–25. https://doi.org/10.1075/itl.20010.van.
Wagenmakers, E.-J., T. Beek, L. Dijkhoff, Q. F. Gronau, A. Acosta, R. B. Adams Jr., D. N. Albohn, et al. 2016. “Registered Replication Report: Strack, Martin, & Stepper (1988).” Perspectives on Psychological Science 11 (6): 917–28. https://doi.org/10.1177/1745691616674458.