8 Week 8: Scatterplots with trend lines

8.1 Goal

You’ll learn to highlight the trend in the relationship between two variables using scatterplot smoothers.

8.2 Drawing scatterplot smoothers

ggplot(data = per_part,
       aes(x = English.Total, 
           y = number_correct)) +
  geom_point(shape = 1) +
  geom_smooth() + # add smoother
  xlab("Result English test") +
  ylab("Number of correct translations") +
  facet_grid(cols = vars(Mode))
## `geom_smooth()` using method = 'loess' and formula
## = 'y ~ x'
## Warning: Removed 6 rows containing non-finite outside the
## scale range (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values or values
## outside the scale range (`geom_point()`).

The blue line is the smoother; the grey ribbon around it is known as a (95%) confidence band and communicates our uncertainty about the precise location of the smoother (due to lack of data and variability in the data). Of course, you can change the colour of the line to suit your personal taste, and you can turn off the grey ribbon (se = FALSE; se is short for ‘standard error’, which is unfortunate – a standard error and a confidence band aren’t the same thing). We’re going to turn off the confidence band because it’s bound to be misinterpreted anyway.

ggplot(data = per_part,
       aes(x = English.Total, 
           y = number_correct)) +
  geom_point(shape = 1) +
  geom_smooth(se = FALSE, colour = "red") +
  xlab("Result English test") +
  ylab("Number of correct translations") +
  facet_grid(cols = vars(Mode))
## `geom_smooth()` using method = 'loess' and formula
## = 'y ~ x'
## Warning: Removed 6 rows containing non-finite outside the
## scale range (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values or values
## outside the scale range (`geom_point()`).

8.3 So what’s a scatterplot smoother anyway?

Scatterplot smoothers were developed to discover relationships (including nonlinear ones) between two variables that aren’t necessarily immediately obvious if the data are shown in a scatterplot. The points on the smoother are a kind of mean value of the \(Y\) variable for the respective \(X\) value. In the left panel, for instance, the average number of correct translations in the auditory mode for someone with an English test score of 30 is roughly 17–18, whereas the average number of correct translations for written words for participants with a score of 40 on the English test is about 25.
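If you'd like to see the 'kind of mean value' idea in action, here is a minimal sketch. It uses simulated data, not the course's per_part data set, and the names sim, test_score and n_correct are made up for the illustration. It fits a smoother with loess(), the method that geom_smooth() reports using in the output above, and compares the height of the smoother at a score of 30 with the plain mean of the observations near 30.

```r
# Simulated stand-in data (NOT the real per_part data set)
set.seed(2024)
test_score <- runif(150, min = 10, max = 50)
n_correct  <- 5 + 0.4 * test_score + rnorm(150, sd = 3)
sim <- data.frame(test_score, n_correct)

# Fit a loess smoother with its default settings
fit <- loess(n_correct ~ test_score, data = sim)

# Height of the smoother at a test score of 30
at_30 <- predict(fit, newdata = data.frame(test_score = 30))

# Plain mean of the observations with a test score within 3 points of 30
near_30 <- mean(sim$n_correct[abs(sim$test_score - 30) < 3])

at_30
near_30
```

The two numbers should be close but not identical: the smoother also borrows information from observations further away from 30, giving them less weight.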

We needn’t amuse ourselves with the maths behind these smoothers, but the following points are important:

  1. The trend line is nearly always a bit wiggly (to use the actual technical term). This is the case even when the relationship itself is as good as linear.

  2. The uncertainty about the smoother is larger at the extremes (high and low \(x\) values). The confidence band is correspondingly wider at these places. But again, we’re not going to plot the confidence bands in order not to confuse ourselves or our readership.

  3. The default settings for geom_smooth() tend to work fairly well, but sometimes it’s necessary to fiddle with them so that the smoother captures the trend in the data better. See next section.
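To check the second point for yourself, the following sketch computes the half-width of a pointwise 95% confidence band at the lowest, a middle and the highest \(x\) value. The data are simulated and the object names are assumptions made for the illustration. The calculation (standard error times a t quantile) is one common way of building such a band and is meant as an approximation of what the grey ribbon shows, not as a description of ggplot2's internals.

```r
# Simulated data with a roughly linear relationship
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)
sim <- data.frame(x, y)

# Loess smoother with default settings
fit <- loess(y ~ x, data = sim)

# Predictions with standard errors at the low end, the middle and the high end
pred <- predict(fit, newdata = data.frame(x = c(0, 5, 10)), se = TRUE)

# Half-width of a pointwise 95% band at these three x values
half_width <- qt(0.975, df = pred$df) * pred$se.fit
names(half_width) <- c("x = 0", "x = 5", "x = 10")
half_width
```

The half-widths at the two ends should come out larger than the one in the middle, which is why the ribbon flares out at the edges of the plot.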

8.4 Examples of nonlinear relationships

This section serves to show you what true nonlinear relationships can look like and how you sometimes need to fiddle with geom_smooth()’s settings.

In the first example, the red line was drawn with the default settings. This line doesn’t capture an important feature of the relationship (the data points go up and down). The blue line captures this trend much better. It was drawn using the command geom_smooth(span = 0.1). The span parameter determines how wiggly the curve may be (the smaller the span, the wigglier the curve). By default, span is set to 0.75. Finding a decent span value is a matter of trial and error.
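A plot along these lines can be sketched as follows. The data set wave is simulated for the purpose of this illustration, so the result won't look exactly like the example described above, but the two geom_smooth() calls use the same settings: the defaults in red and span = 0.1 in blue.

```r
library(ggplot2)

# Simulated data that go up and down (an assumption for this illustration)
set.seed(42)
wave <- data.frame(x = seq(0, 20, length.out = 300))
wave$y <- sin(wave$x) + rnorm(300, sd = 0.4)

p <- ggplot(data = wave,
            aes(x = x,
                y = y)) +
  geom_point(shape = 1) +
  geom_smooth(se = FALSE, colour = "red") +             # default span (0.75)
  geom_smooth(span = 0.1, se = FALSE, colour = "blue")  # smaller span, wigglier
p
```

Try a few other span values between 0.1 and 0.75 to see how the blue line changes.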

In the second example, the red line was drawn using geom_smooth(span = 0.1). This line is much too wiggly, and it essentially models random deviations from the general trend. The blue line, drawn with the default setting (span = 0.75), captures the general trend much more sensibly. The green line, by contrast, isn’t wiggly enough (span = 3).
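If you'd like a rough number to go with the visual impression, the sketch below fits smoothers with the three span values to simulated data that follow a gentle curve. The helper function roughness() is hypothetical: it is not part of any package, and summing the squared second differences of the fitted values is just one crude way of expressing how wiggly a curve is.

```r
# Simulated data: a gentle, saturating trend plus noise
set.seed(7)
x <- seq(0, 10, length.out = 200)
y <- 10 * x / (2 + x) + rnorm(200, sd = 1)
sim <- data.frame(x, y)

# Hypothetical helper: fit a loess smoother with the given span and
# sum the squared second differences of the fitted curve
roughness <- function(span) {
  fit <- loess(y ~ x, data = sim, span = span)
  sum(diff(fitted(fit), differences = 2)^2)
}

spans <- c(0.1, 0.75, 3)
rough <- sapply(spans, roughness)
names(rough) <- spans
rough
```

The value for span = 0.1 should be much larger than the one for the default span. Don't use such a number to pick the span automatically, though: judging whether the line captures the trend is still up to you.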

Summing up: Generally, the default settings work reasonably well. But when you notice that visually salient patterns in the scatterplot aren’t captured by the trend line, you need to fiddle a bit with the span parameter.

More generally, data analysis and statistics aren’t a matter of blindly applying formulae and recipes.

8.5 Exercise

Draw a scatterplot that shows the relationship between the participants’ age and their performance on the cognate translation task. How would you describe this relationship?

Now add a smoother (default settings) to this scatterplot. Does it capture the trend that you identified in the scatterplot?

Now fiddle with the span parameter (0.1, 0.2, 0.3, etc.).

Hand in the graph you think works best as well as the compiled HTML report.