Graphs are incredibly useful both for understanding your own data and for communicating your insights to your audience. This is why the next few blog posts will consist of tutorials on how to draw four kinds of graphs that I find most useful: scatterplots, line charts, boxplots and some variations, and Cleveland dotplots. These tutorials are aimed primarily at the students in our MA programme. Today’s graph: the line chart.

What’s a line chart?

A line chart is quite simply a graph in which data points that belong together are connected with a line. If you want to compare different groups, as we do below, you can use lines of different colours or different types to highlight differences between the groups.

As an aside, line charts are often used to plot the development of a variable over time (the tutorial below is an example of this), and I used to think that line charts should only be used when time was involved, that connecting data points using lines was somehow not kosher otherwise. But now I’m fine with using line charts even when time isn’t involved: the lines often highlight the patterns in the data much better than a handful of unconnected symbols do.

Tutorial: Drawing a line chart in ggplot2

In this tutorial, you’ll learn how to draw a basic line chart and how you can tweak it. You’ll also learn how to quickly partition a dataset according to several variables and compute summary statistics within each part. For this, we’ll make use of the free statistical program R and the add-on packages `ggplot2`, `magrittr` and `dplyr`. Working with these programs and packages may be irksome at first if you’re used to pull-down menus, but the trouble is well worth it.

What you’ll need

• The free program R.
• The graphical user interface RStudio – also free. Download and install R first and only then RStudio.

I’m going to assume some familiarity with these programs. Specifically, I’ll assume that you know how to enter commands in RStudio and import datasets stored in the CSV file format. If you need help with this, see Chapter 1 of my introduction to statistics (in German) or google ‘importing data R’.

• The `ggplot2`, `magrittr` and `dplyr` add-on packages for R. To install them, simply enter the following command at the prompt in RStudio.
• A dataset. For this tutorial, we’ll use a dataset on the acquisition of morphological cues to agency that my students compiled. It consists of the responses of 70 learners (`SubjectID`) who were assigned to one of three learning conditions (`BiasCondition`) – the details don’t matter much for our purposes. All learners completed three kinds of tasks (`Task`): understanding sentences in an unknown language, judging the grammaticality of sentences in the same language, and producing sentences in this language. These tasks occurred in `Block`s. The learners’ responses were tracked throughout the experiment (`ResponseCorrect`). Download this dataset to your hard disk.
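For the package installation step in the list above, the plain form of the command would be `install.packages(c("ggplot2", "magrittr", "dplyr"))`. The sketch below is a slightly more cautious variant of my own that only installs the packages that aren’t on your computer yet; the names `needed` and `missing_pkgs` are just illustrative.

```r
# The three add-on packages used in this tutorial
needed <- c("ggplot2", "magrittr", "dplyr")

# Check which of them are not installed yet
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Install only the missing ones (requires an internet connection)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

You only need to do this once per computer; loading the packages, on the other hand, has to be done in every new R session.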

Preliminaries

In RStudio, read in the data.
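Reading in a CSV file and inspecting it might look roughly as follows. Since I can’t know where you saved the download, this sketch first writes a tiny made-up CSV file (same column names, invented values) to a temporary location so that the commands run on their own. With the real dataset, you’d set `csv_path` to the location of the downloaded file instead.

```r
# Stand-in for the downloaded file: a tiny invented CSV in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "SubjectID,BiasCondition,Block,Task,ResponseCorrect",
  "1,RuleBasedInput,1,Comprehension,yes",
  "1,RuleBasedInput,1,Comprehension,no",
  "2,WeakBias,1,Production,no"
), csv_path)

# Read in the data and show a summary of each column
semproj <- read.csv(csv_path)
summary(semproj)
```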

If the summary looks like this, you’re good to go.

Now load the packages we’ll be using. You may get a message that some ‘objects are masked’, but that’s nothing to worry about.
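Loading the three packages takes one `library()` call each; this assumes they have been installed as described above.

```r
# Load the packages for plotting (ggplot2), piping (magrittr)
# and summarising data frames (dplyr)
library(ggplot2)
library(magrittr)
library(dplyr)
```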

Summarising a data frame

We want to compare how response accuracy develops block by block in the different experimental conditions. To that end, we need to calculate the proportion of correct responses by each learner in each block and for each task. The `dplyr` and `magrittr` packages make doing so easy.

The following lines of code create a new data frame called `semproj_perParticipant` by taking the dataset `semproj` (first line), grouping it by the variables `SubjectID`, `BiasCondition`, `Block` and `Task` (second line), and within each ‘cell’ calculating the proportion of entries in `ResponseCorrect` that read `"yes"` (third line).
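The three steps just described could be sketched like this. The toy `semproj` data frame is invented so that the snippet runs on its own; with the real dataset, you’d skip that part and only use the last three lines. The `.groups = "drop"` argument is my addition and assumes dplyr 1.0 or later; it just returns an ungrouped result.

```r
library(dplyr)
library(magrittr)

# Toy stand-in for the real dataset (all values invented)
semproj <- data.frame(
  SubjectID       = c(1, 1, 1, 1, 2, 2, 2, 2),
  BiasCondition   = rep(c("RuleBasedInput", "WeakBias"), each = 4),
  Block           = c(1, 1, 2, 2, 1, 1, 2, 2),
  Task            = "Comprehension",
  ResponseCorrect = c("yes", "no", "yes", "yes", "no", "no", "yes", "no")
)

# Take the dataset, group it, and compute the proportion of "yes" per cell
semproj_perParticipant <- semproj %>%
  group_by(SubjectID, BiasCondition, Block, Task) %>%
  summarise(ProportionCorrect = mean(ResponseCorrect == "yes"), .groups = "drop")

semproj_perParticipant
```

The trick in the last step is that `ResponseCorrect == "yes"` yields `TRUE` or `FALSE` for each response, and the mean of such values is the proportion of `TRUE`s.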

Type the name of the new data frame at the prompt. If you see something like this, everything’s fine.

Now that we’ve computed the proportion of correct responses by each participant for each block and task, we can compute the average proportion of correct responses per block and task according to the experimental condition the participants were assigned to. The code works similarly to before: a new data frame called `semproj_perCondition` is created by taking the `semproj_perParticipant` data frame we constructed above (line 1), grouping it by `BiasCondition`, `Block` and `Task` (line 2), and computing the mean proportion of correct responses (line 3).
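A sketch of this second summary step, again on invented numbers so that it runs without the real data. The toy `semproj_perParticipant` has four made-up participants, two per condition; as before, `.groups = "drop"` is my addition and assumes dplyr 1.0 or later.

```r
library(dplyr)
library(magrittr)

# Toy per-participant proportions (invented values)
semproj_perParticipant <- data.frame(
  SubjectID         = rep(1:4, times = 2),
  BiasCondition     = rep(c("RuleBasedInput", "RuleBasedInput",
                            "WeakBias", "WeakBias"), times = 2),
  Block             = rep(c(1, 2), each = 4),
  Task              = "Comprehension",
  ProportionCorrect = c(0.5, 0.7, 0.8, 0.6, 0.6, 0.8, 0.9, 0.9)
)

# Average the participants' proportions per condition, block and task
semproj_perCondition <- semproj_perParticipant %>%
  group_by(BiasCondition, Block, Task) %>%
  summarise(MeanProportionCorrect = mean(ProportionCorrect), .groups = "drop")

semproj_perCondition
```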

The result should look like this—you can see that those in the ‘rule-based input’ learning condition score an average of 69% on the first comprehension block, 59% on the first grammaticality judgement task (GJT) block, and 13% on the first production block.

A first attempt: Development in comprehension

To start off with a simple example, let’s plot the mean proportion of correct responses in the four comprehension blocks for the three experimental conditions and connect them with a line.

First, we create another new data frame that contains the averages for the comprehension task only. The new data frame `semproj_perCondition_Comprehension` is constructed by taking the data frame `semproj_perCondition` we constructed above and retaining (filtering) the rows for which the `Task` variable reads `Comprehension`.
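The filtering step might look like this. The toy `semproj_perCondition` below is invented (including the condition labels) so that the snippet runs on its own; only the last two lines matter for the real data.

```r
library(dplyr)
library(magrittr)

# Toy condition averages: 4 blocks x 3 tasks x 3 conditions (invented values)
semproj_perCondition <- expand.grid(
  Block         = 1:4,
  Task          = c("Comprehension", "GJT", "Production"),
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  stringsAsFactors = FALSE
)
semproj_perCondition$MeanProportionCorrect <-
  seq(0.2, 0.9, length.out = nrow(semproj_perCondition))

# Retain only the rows for the comprehension task
semproj_perCondition_Comprehension <- semproj_perCondition %>%
  filter(Task == "Comprehension")
```

Note the double equals sign: `==` tests for equality, whereas a single `=` would be read as an assignment.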

To plot these averages, use the following code. The first line specifies the data frame the graph should be based on, the second line specifies that `Block` (1-2-3-4) should go on the x-axis, the third that `MeanProportionCorrect` should go on the y-axis, and the fourth that the different experimental conditions should be rendered using different colours. The fifth line, finally, specifies that the data should be plotted as lines.
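A minimal sketch of such a plotting call is shown below. The data frame is a made-up stand-in with invented condition labels and averages, so the resulting picture won’t show the pattern in the real data; the five lines of the `ggplot` call itself follow the description above.

```r
library(ggplot2)

# Toy comprehension averages: 4 blocks x 3 conditions (invented values)
semproj_perCondition_Comprehension <- expand.grid(
  Block         = 1:4,
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  stringsAsFactors = FALSE
)
semproj_perCondition_Comprehension$MeanProportionCorrect <-
  seq(0.5, 0.9, length.out = 12)

# Data frame, x-axis, y-axis, colour per condition, drawn as lines
p <- ggplot(data = semproj_perCondition_Comprehension,
            aes(x = Block,
                y = MeanProportionCorrect,
                colour = BiasCondition)) +
  geom_line()

p
```

Storing the plot in an object (`p`) isn’t necessary, but it makes it easy to add further layers later on.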

This is decent enough for a start: it’s clear from this graph that, contrary to what we’d expected, those in the weak bias condition actually seem to perform better than the other participants, for instance. We could go on and draw similar graphs for the other two tasks (grammaticality judgement and production), but there’s a better option: draw them all at once so that the results can more easily be compared.

Several line charts in one plot

For this plot, we use the `semproj_perCondition` data frame that contains the averages for all three tasks, split up by block and experimental condition. The code is otherwise the same as before, but I’ve added one additional line: `facet_wrap` splits up the data according to a variable (here `Task`) and draws a separate panel for each part. By default, the axes of the different panels span the same range so that differences in overall performance can easily be compared between the three tasks. So not only is this quicker than drawing three separate graphs, it also saves (vertical) space and the side-by-side panels are easier to compare with one another than three separate plots would be.
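The faceted version could be sketched as follows, once more on an invented stand-in for `semproj_perCondition` (made-up condition labels and averages).

```r
library(ggplot2)

# Toy condition averages: 4 blocks x 3 tasks x 3 conditions (invented values)
semproj_perCondition <- expand.grid(
  Block         = 1:4,
  Task          = c("Comprehension", "GJT", "Production"),
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  stringsAsFactors = FALSE
)
semproj_perCondition$MeanProportionCorrect <-
  seq(0.2, 0.9, length.out = nrow(semproj_perCondition))

# Same call as before, plus one panel per task
p <- ggplot(data = semproj_perCondition,
            aes(x = Block,
                y = MeanProportionCorrect,
                colour = BiasCondition)) +
  geom_line() +
  facet_wrap(~ Task)

p
```

If you’d rather let each panel have its own y-axis range, `facet_wrap` accepts `scales = "free_y"`, but then differences between tasks are harder to read off.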

A printer-friendly version

If you prefer a printer-friendly version, you can add the `theme_bw()` command to the ggplot call (10th line) and specify that the different experimental conditions should be distinguished using different linetypes (solid, dashed, dotted) rather than different colours (4th line). Since the difference between dashed and dotted lines may not be immediately obvious, it can be a good idea to also plot the averages using different symbols (lines 5 and 6).
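A sketch of a printer-friendly variant, using the same kind of invented stand-in data. The layout of the call is my own, so the line numbers mentioned above won’t match it exactly; the comments indicate which part does what.

```r
library(ggplot2)

# Toy condition averages: 4 blocks x 3 tasks x 3 conditions (invented values)
semproj_perCondition <- expand.grid(
  Block         = 1:4,
  Task          = c("Comprehension", "GJT", "Production"),
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  stringsAsFactors = FALSE
)
semproj_perCondition$MeanProportionCorrect <-
  seq(0.2, 0.9, length.out = nrow(semproj_perCondition))

p <- ggplot(data = semproj_perCondition,
            aes(x = Block,
                y = MeanProportionCorrect,
                linetype = BiasCondition,  # linetypes instead of colours
                shape = BiasCondition)) +  # a different symbol per condition
  geom_line() +
  geom_point() +                           # plot the averages as symbols, too
  facet_wrap(~ Task) +
  theme_bw()                               # black-and-white theme

p
```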

With customised legends and labels

The plot above is okay, but you can go the extra mile by customising the axis and legend labels rather than using the defaults—even if they are comprehensible, it just makes a better impression to do so:

1. The `xlab` and `ylab` commands change the names of the x- and y-axes. Note that `\n` starts a new line.
2. With `scale_shape_manual`, I changed the `labels` of the legend for the different symbols. I also changed the symbols themselves (`values`) as I thought the default symbols were difficult to tell apart. The values 1, 2 and 3 work fine for this graph, I think, but you can try out different values (handy list with symbol numbers).
3. If you customise the labels and symbols for the `shape` parameter, you need to do the same for the `linetype` parameter; otherwise, R gets confused. This is what I did in `scale_linetype_manual`. Note that the `labels` must occur in the same order as the labels in `scale_shape_manual`. (handy list with linetypes)
4. In both `scale_shape_manual` and `scale_linetype_manual`, I set `name` to `"Learning condition"`. This changes the title of the legend, and by using the same title twice, you tell R to combine the two legends into one.
5. In `theme`, `legend.position` specifies where the legend should go (on top rather than on the right), and `legend.direction` whether the keys should be plotted next to (horizontal) or under (vertical) each other.
6. The lines with `panel.grid` draw horizontal grid lines to facilitate the comparison between tasks and suppress any vertical grid lines ggplot may draw.
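Putting the six points above together, the call might look roughly like this. The data are an invented stand-in, and the axis titles, legend labels and grid colour are placeholders of my own choosing, so adapt them to your own data.

```r
library(ggplot2)

# Toy condition averages: 4 blocks x 3 tasks x 3 conditions (invented values)
semproj_perCondition <- expand.grid(
  Block         = 1:4,
  Task          = c("Comprehension", "GJT", "Production"),
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  stringsAsFactors = FALSE
)
semproj_perCondition$MeanProportionCorrect <-
  seq(0.2, 0.9, length.out = nrow(semproj_perCondition))

# Legend labels (placeholders), in the same order for both scales
condition_labels <- c("rule-based input", "strong bias", "weak bias")

p <- ggplot(data = semproj_perCondition,
            aes(x = Block, y = MeanProportionCorrect,
                linetype = BiasCondition, shape = BiasCondition)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Task) +
  xlab("Block") +
  ylab("Mean proportion\nof correct responses") +  # \n starts a new line
  scale_shape_manual(name = "Learning condition",
                     values = c(1, 2, 3),
                     labels = condition_labels) +
  scale_linetype_manual(name = "Learning condition",  # same title: one legend
                        values = c("solid", "dashed", "dotted"),
                        labels = condition_labels) +
  theme_bw() +
  theme(legend.position = "top",
        legend.direction = "horizontal",
        panel.grid.major.y = element_line(colour = "grey80"),  # horizontal lines
        panel.grid.major.x = element_blank(),                  # no vertical lines
        panel.grid.minor.x = element_blank())

p
```

Note that `theme()` comes after `theme_bw()`; the other way round, `theme_bw()` would overwrite the customised settings.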

13 June 2016