Tutorial: Drawing a line chart
Graphs are incredibly useful both for understanding your own data and for communicating your insights to your audience. This is why the next few blog posts will consist of tutorials on how to draw four kinds of graphs that I find most useful: scatterplots, line charts, boxplots and some variations, and Cleveland dotplots. These tutorials are aimed primarily at the students in our MA programme. Today’s graph: the line chart.
What’s a linechart?
A line chart is quite simply a graph in which data points that belong together are connected with a line. If you want to compare different groups, as we do below, you can use lines of different colours or different types to highlight differences between the groups.
As an aside, line charts are often used to plot the development of a variable over time—the tutorial below is an example of this—and I used to think that linecharts should only be used when time was involved, that connecting data points using lines was somehow not kosher otherwise. But now I’m fine with using line charts even when time isn’t involved: the lines often highlight the patterns in the data much better than a handful of unconnected symbols do.
Tutorial: Drawing a linechart in ggplot2
In this tutorial, you’ll learn how to draw a basic linechart and how you can tweak it.
You’ll also learn how to quickly partition a dataset according to several variables and compute summary statistics within each part.
For this, we’ll make use of the free statistical program R and the add-on packages ggplot2, dplyr and magrittr.
Working with these programs and packages may be irksome at first if you’re used to pull-down menus,
but the trouble is well worth it.
What you’ll need
- The free program R.
- The graphical user interface RStudio – also free. Download and install R first and only then RStudio.
I’m going to assume some familiarity with these programs. Specifically, I’ll assume that you know how to enter commands in RStudio and import datasets stored in the CSV file format. If you need help with this, see Chapter 1 of my introduction to statistics (in German) or Google importing data R.
- The ggplot2, dplyr and magrittr add-on packages for R. To install them, enter install.packages(c("ggplot2", "dplyr", "magrittr")) at the prompt in RStudio.
- A dataset. For this tutorial, we’ll use a dataset on the acquisition of morphological cues to agency that my students compiled. It consists of the responses of 70 learners (SubjectID) who were assigned to one of three learning conditions (BiasCondition); the details don’t matter much for our purposes. All learners completed three kinds of tasks (Task): understanding sentences in an unknown language, judging the grammaticality of sentences in the same language, and producing sentences in this language. These tasks occurred in blocks (Block). The learners’ responses were tracked throughout the experiment (ResponseCorrect). Download this dataset to your hard disk.
In RStudio, read in the data.
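Reading in a CSV file and summarising it could look roughly like the sketch below. Since the real file isn’t bundled here, the example first writes a tiny stand-in CSV to a temporary file; its values, the condition labels and the csv_path name are all made up for illustration, and only the column names come from the description above.

```r
# Hypothetical stand-in for the downloaded file: a tiny CSV with the
# tutorial's column names but made-up values, written to a temporary file.
toy <- data.frame(
  SubjectID = c("S01", "S01", "S02", "S02"),
  BiasCondition = c("RuleBasedInput", "RuleBasedInput", "WeakBias", "WeakBias"),
  Task = c("Comprehension", "Production", "Comprehension", "Production"),
  Block = c(1, 1, 1, 1),
  ResponseCorrect = c("yes", "no", "no", "yes")
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read in the data (replace csv_path with the path to your own download)
# and display a summary of each column.
semproj <- read.csv(csv_path, stringsAsFactors = TRUE)
summary(semproj)
```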
If the summary lists the variables described above, you’re good to go.
Now load the packages we’ll be using. You may get a message that some ‘objects are masked’, but that’s nothing to worry about.
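Loading the three packages named in this tutorial would look like this (assuming they have been installed):

```r
library(ggplot2)   # plotting
library(dplyr)     # group_by(), summarise(), filter()
library(magrittr)  # the %>% pipe
```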
Summarising a data frame
We want to compare how response accuracy develops block by block
in the different experimental conditions.
To that end, we need to calculate the proportion of correct
responses by each learner in each block and for each task.
The dplyr and magrittr packages make doing so easy.
The following lines of code create a new data frame called semproj_perParticipant that was constructed by taking the dataset semproj (first line), grouping it by the variables SubjectID, BiasCondition, Block and Task (second line), and within each ‘cell’ calculating the proportion of entries in ResponseCorrect that read "yes" (third line).
Type the name of the new data frame at the prompt. If you see one row per participant, block and task along with the proportion of correct responses, everything’s fine.
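The per-participant summary described above can be sketched as follows. The toy semproj data frame, the Trial column, the task and condition labels, and the column name ProportionCorrect are assumptions made for illustration; with the real dataset, only the last four lines are needed.

```r
library(dplyr)
library(magrittr)

# Hypothetical toy data with the tutorial's column names (values are made up):
# 3 participants x 3 tasks x 2 blocks x 4 trials.
semproj <- expand.grid(
  SubjectID = c("S01", "S02", "S03"),
  Task = c("Comprehension", "GJT", "Production"),
  Block = 1:2,
  Trial = 1:4,
  stringsAsFactors = FALSE
)
semproj$BiasCondition <- ifelse(semproj$SubjectID == "S01", "RuleBasedInput", "WeakBias")
# In every cell, trials 1-3 are answered correctly and trial 4 is not.
semproj$ResponseCorrect <- ifelse(semproj$Trial <= 3, "yes", "no")

# Take the dataset, group it, and compute the proportion of "yes" per cell.
semproj_perParticipant <- semproj %>%
  group_by(SubjectID, BiasCondition, Block, Task) %>%
  summarise(ProportionCorrect = mean(ResponseCorrect == "yes"), .groups = "drop")

semproj_perParticipant
```

Comparing a character vector to "yes" gives TRUE/FALSE values, and the mean of those is the proportion of TRUEs, which is why mean() does the job here.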
Now that we’ve computed the proportion of correct responses
by each participant for each block and task,
we can compute the average proportion of correct responses
per block and task according to the experimental condition the participants
were assigned to.
The code works similarly to before: a new data frame called semproj_perCondition is created by taking the semproj_perParticipant data frame we constructed above (line 1), grouping it by BiasCondition, Block and Task (line 2), and computing the mean proportion of correct responses (line 3).
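A minimal sketch of this second summarising step, run on made-up per-participant proportions (the values, the condition labels and the ProportionCorrect column name are assumptions):

```r
library(dplyr)

# Hypothetical per-participant proportions (made-up values):
# two participants per condition, two blocks each, one task.
semproj_perParticipant <- data.frame(
  SubjectID = rep(c("S01", "S02", "S03", "S04"), each = 2),
  BiasCondition = rep(c("RuleBasedInput", "WeakBias"), each = 4),
  Block = rep(1:2, times = 4),
  Task = "Comprehension",
  ProportionCorrect = c(0.50, 0.70, 0.70, 0.90, 0.60, 0.80, 0.80, 1.00)
)

# Average the participants' proportions per condition, block and task.
semproj_perCondition <- semproj_perParticipant %>%
  group_by(BiasCondition, Block, Task) %>%
  summarise(MeanProportionCorrect = mean(ProportionCorrect), .groups = "drop")

semproj_perCondition
```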
In the result, you can see that those in the ‘rule-based input’ learning condition score an average of 69% on the first comprehension block, 59% on the first grammaticality judgement task (GJT) block, and 13% on the first production block.
A first attempt: Development in comprehension
To start off with a simple example, let’s plot the mean proportion of correct responses in the four comprehension blocks for the three experimental conditions and connect them with a line.
First, we create another new data frame that contains the averages
for the comprehension task only.
The new data frame semproj_perCondition_Comprehension is constructed by taking the data frame semproj_perCondition we constructed above and retaining (filtering) the rows for which the Task variable refers to the comprehension task.
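The filtering step might be sketched like this. The toy averages and the labels used for Task and BiasCondition are assumptions; check how the comprehension task is actually labelled in your own data before filtering on it.

```r
library(dplyr)

# Hypothetical per-condition averages (made-up values and labels).
semproj_perCondition <- expand.grid(
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  Block = 1:4,
  Task = c("Comprehension", "GJT", "Production"),
  stringsAsFactors = FALSE
)
semproj_perCondition$MeanProportionCorrect <-
  seq(0.30, 0.90, length.out = nrow(semproj_perCondition))

# Keep only the rows that belong to the comprehension task.
semproj_perCondition_Comprehension <- semproj_perCondition %>%
  filter(Task == "Comprehension")
```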
To plot these averages, use the following code.
The first line specifies the data frame the graph should be based on,
the second line specifies that Block (1-2-3-4) should go on the x-axis,
the third that MeanProportionCorrect should go on the y-axis,
and the fourth that the different experimental conditions
should be rendered using different colours.
The fifth line, finally, specifies that the data should be plotted as lines.
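A call that matches the five lines described above can be sketched as follows. The data frame is rebuilt here with made-up averages and assumed condition labels so that the example runs on its own; the five lines in question are those of the ggplot() call at the end.

```r
library(ggplot2)

# Hypothetical averages for the comprehension task (made-up values):
# four blocks for each of three assumed condition labels.
semproj_perCondition_Comprehension <- expand.grid(
  Block = 1:4,
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  stringsAsFactors = FALSE
)
semproj_perCondition_Comprehension$MeanProportionCorrect <- c(
  0.65, 0.70, 0.74, 0.77,
  0.60, 0.66, 0.70, 0.74,
  0.70, 0.78, 0.84, 0.88
)

p <- ggplot(semproj_perCondition_Comprehension,
            aes(x = Block,
                y = MeanProportionCorrect,
                colour = BiasCondition)) +
  geom_line()
p
```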
This is decent enough for a start: it’s clear from this graph that, contrary to what we’d expected, those in the weak bias condition actually seem to perform better than the other participants, for instance. We could go on and draw similar graphs for the other two tasks (grammaticality judgement and production), but there’s a better option: draw them all at once so that the results can more easily be compared.
Several linecharts in one plot
For this plot, we use the semproj_perCondition data frame
that contains the averages for all three tasks, split up by block and experimental condition.
The code is otherwise the same as before,
but I’ve added one additional line:
facet_wrap splits up the data according to a variable (here Task)
and plots a separate plot for each part.
By default, the axes of the different subplots
span the same range so that
differences in overall performance
can easily be compared between the three tasks.
So not only is this quicker than drawing three separate graphs,
it also saves (vertical) space and the side-by-side plots are easier to compare with one another than
three separate plots would be.
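A sketch of the faceted version, again on made-up averages (the task and condition labels and all numbers are assumptions):

```r
library(ggplot2)

# Hypothetical averages for all three tasks (made-up values).
semproj_perCondition <- expand.grid(
  Block = 1:4,
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  Task = c("Comprehension", "GJT", "Production"),
  stringsAsFactors = FALSE
)
base <- c(Comprehension = 0.60, GJT = 0.55, Production = 0.15)
bump <- c(RuleBasedInput = 0, StrongBias = 0.03, WeakBias = 0.08)
semproj_perCondition$MeanProportionCorrect <- unname(
  base[semproj_perCondition$Task] +
    bump[semproj_perCondition$BiasCondition] +
    0.05 * semproj_perCondition$Block
)

p <- ggplot(semproj_perCondition,
            aes(x = Block,
                y = MeanProportionCorrect,
                colour = BiasCondition)) +
  geom_line() +
  facet_wrap(~ Task)  # one panel per task, with shared axis ranges by default
p
```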
A printer-friendly version
If you prefer a printer-friendly version,
you can add the
theme_bw() command to the ggplot call (10th line)
and specify that the different experimental conditions
should be distinguished using different linetypes (solid, dashed, dotted) rather than different colours (4th line).
Since the difference between dashed and dotted lines may not be immediately obvious,
it can be a good idea to also plot the averages using different symbols (lines 5 and 6).
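One way to write such a printer-friendly version is sketched below. The data are the same kind of made-up averages as before, and the line numbers mentioned above refer to the original code, so they need not match this sketch.

```r
library(ggplot2)

# Hypothetical averages for all three tasks (made-up values and labels).
semproj_perCondition <- expand.grid(
  Block = 1:4,
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  Task = c("Comprehension", "GJT", "Production"),
  stringsAsFactors = FALSE
)
base <- c(Comprehension = 0.60, GJT = 0.55, Production = 0.15)
bump <- c(RuleBasedInput = 0, StrongBias = 0.03, WeakBias = 0.08)
semproj_perCondition$MeanProportionCorrect <- unname(
  base[semproj_perCondition$Task] +
    bump[semproj_perCondition$BiasCondition] +
    0.05 * semproj_perCondition$Block
)

p <- ggplot(semproj_perCondition,
            aes(x = Block,
                y = MeanProportionCorrect,
                linetype = BiasCondition,   # line types instead of colours
                shape = BiasCondition)) +   # a different symbol per condition
  geom_line() +
  geom_point() +
  facet_wrap(~ Task) +
  theme_bw()                                # black-and-white theme
p
```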
With customised legends and labels
The plot above is okay, but you can go the extra mile by customising the axis and legend labels rather than using the defaults—even if they are comprehensible, it just makes a better impression to do so:
- The xlab and ylab commands change the names of the x- and y-axes. Note that \n starts a new line.
- In scale_shape_manual, I changed the labels of the legend for the different symbols. I also changed the symbols themselves (values) as I thought the default symbols were difficult to tell apart. The values 1, 2 and 3 work fine for this graph, I think, but you can try out different values (handy list with symbol numbers).
- If you customise the labels and symbols for the shape parameter, you need to do the same for the linetype parameter; otherwise, R gets confused. This is what I did in scale_linetype_manual. Note that the labels must occur in the same order as the labels in scale_shape_manual. (handy list with linetypes)
- In both scale_shape_manual and scale_linetype_manual, I set name to "Learning condition". This changes the title of the legend, and by using the same title twice, you tell R to combine the two legends into one.
- legend.position specifies where the legend should go (on top rather than on the right), and legend.direction whether the keys should be plotted next to (horizontal) or under (vertical) each other.
- The lines with panel.grid draw horizontal grid lines to facilitate the comparison between tasks and suppress any vertical grid lines ggplot may draw.
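Putting these customisations together might look like the sketch below. The data, the axis titles, the legend labels (in particular "strong bias") and the grid colour are assumptions chosen for illustration rather than the settings used for the original plot.

```r
library(ggplot2)

# Hypothetical averages for all three tasks (made-up values and labels).
semproj_perCondition <- expand.grid(
  Block = 1:4,
  BiasCondition = c("RuleBasedInput", "StrongBias", "WeakBias"),
  Task = c("Comprehension", "GJT", "Production"),
  stringsAsFactors = FALSE
)
base <- c(Comprehension = 0.60, GJT = 0.55, Production = 0.15)
bump <- c(RuleBasedInput = 0, StrongBias = 0.03, WeakBias = 0.08)
semproj_perCondition$MeanProportionCorrect <- unname(
  base[semproj_perCondition$Task] +
    bump[semproj_perCondition$BiasCondition] +
    0.05 * semproj_perCondition$Block
)

# Legend labels, in the (alphabetical) order of the BiasCondition values.
condition_labels <- c("rule-based input", "strong bias", "weak bias")

p <- ggplot(semproj_perCondition,
            aes(x = Block, y = MeanProportionCorrect,
                linetype = BiasCondition, shape = BiasCondition)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Task) +
  xlab("Experimental block") +
  ylab("Mean proportion\nof correct responses") +   # \n starts a new line
  scale_shape_manual(name = "Learning condition",
                     values = c(1, 2, 3),
                     labels = condition_labels) +
  scale_linetype_manual(name = "Learning condition",  # same name and labels,
                        values = c("solid", "dashed", "dotted"),
                        labels = condition_labels) +   # so the legends merge
  theme_bw() +
  theme(legend.position = "top",
        legend.direction = "horizontal",
        panel.grid.major.y = element_line(colour = "grey80"),  # horizontal lines
        panel.grid.major.x = element_blank(),                  # no vertical lines
        panel.grid.minor.x = element_blank())
p
```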