Working with datasets in R: A primer

Author

Jan Vanhove

Published

June 28, 2024

Goal

The goal of this primer is to furnish you with the basic tools and knowledge to organise, transform, and query datasets in research contexts. We’ll work with R and more specifically with the tidyverse add-on suite that markedly simplifies working with datasets.

Compiling and organising a dataset

Let’s say we’ve run a study in which speakers of German read a couple of words in Swedish and were asked to guess what these words might mean. An excerpt from the raw data might look like this, with the words having been shown to the participants in the order in which they are listed:

  • Participant 1034. Woman, 51 years.
    • Word: söka. Translation: Socken (incorrect).
    • Word: försiktig. Translation: vorsichtig (correct).
    • Word: mjölk. Translation: Milch (correct).
    • Word: behärska. No translation given.
    • Word: fiende. Translation: finden (incorrect).
  • Participant 2384. Woman, 27 years.
    • Word: fiende. No translation given.
    • Word: behärska. No translation given.
    • Word: försiktig. Translation: vorsichtig (correct).
    • Word: mjölk. Translation: Milch (correct).
    • Word: söka. Translation: Socke (incorrect).
  • Participant 8667. Woman, 27 years.
    • Word: mjölk. Translation: Milch (correct).
    • Word: behärska. No translation given.
    • Word: fiende. Translation: finden (incorrect).
    • Word: söka. Translation: suchen (correct).
    • Word: försiktig. Translation: vorsichtig (correct).
  • Participant 5901. Man, 15 years.
    • Word: behärska. Translation: beherrschen (correct).
    • Word: mjölk. Translation: milch (sic.; correct).
    • Word: försiktig. Translation: vorsichtig (correct).
    • Word: fiende. Translation: feinde (sic.; correct; actually Feind).
    • Word: söka. Translation: socken (sic.; incorrect).

There are lots of ways in which we could represent these data in a spreadsheet. Let’s look at a few rules of thumb.

Long datasets tend to be more practical than wide ones

We’ll concern ourselves strictly with rectangular datasets. These are datasets in which the information is laid out in rows and columns and in which all columns have the same length, and all rows have the same width. Examples of non-rectangular data formats are JSON and XML – see here.

Broadly speaking, we can organise our data in a wide format or in a long format. In a wide format, all pieces of information related to a unit of data collection are organised in a single row. For instance, we could think of each participant in the study as a unit of data collection, in which case we could lay out the data as in Figure 1. Note that the spreadsheet contains a column for each word that indicates the position in which it was presented to the participants. Alternatively, we could think of each word as a unit of data collection and organise the data like in Figure 2.

Figure 1: A wide dataset with one row per participant.

Figure 2: A wide dataset with one row per word.

In a long format, all pieces of information pertaining to an observational unit are organised in a single row. It’s difficult to precisely define what units of data collection and observational units are, and it wouldn’t be too useful to have a precise definition, anyway. But in the present example, the observational units would be the individual responses, i.e., the individual translations. Figure 3 shows what the same data look like when organised in a long format.

Figure 3: A long dataset with one row per word per participant. Long datasets tend to be easier to manage and to analyse than wide ones.

It’s usually easier to work with data in a long-ish format than with data in a wide-ish format. Moreover, when you do need your data in a wide-ish format for data analysis, converting a longer dataset to a wider one is typically easier than vice versa. So when in doubt (and assuming you have a say in the matter at all), arrange your data in a long-ish format.

Make sure that all rows can be interpreted independently of one another. What we want to avoid is a situation in which we can’t make sense of the data in some row because we need data from another row to do so, and that other row has been deleted, the order of the rows has changed, etc. For instance, in Figure 3, we also have a column with the Positions, even though we could have derived this information from the ordering of the rows. But once a row gets deleted or once the order of the rows gets permuted, we’d lose this piece of information. So don’t organise the data like in Figure 4.

Figure 4: Not like this! In this dataset, several cells are left empty as their contents can be derived from other cells. But this will result in difficulties when analysing the data. Moreover, deleting some rows or changing their order would make it impossible to reconstruct the intended contents of these empty cells.
Datasets are different from tables

Figure 4 employs a trick often used when presenting data in a table: leave out information that didn’t change compared to the last row. But the goal of a table is to communicate data or patterns in data, not to organise the data so that they can be analysed.

Furthermore, the dataset needs to be rectangular. Figure 5 shows what not to do: The additional rows reporting some averages don’t fit in with the rest of the dataset (“Prozent Männer:” is not the ID of a Versuchsperson, and “30” is not a Geschlecht).

Figure 5: Not like this! This dataset is not rectangular.

Use short but descriptive variable names

Make your life during the data analysis easier by using short but descriptive labels for variables or values in the spreadsheet. By doing so, you avoid constantly having to look up in the project’s codebook what the labels mean, thereby reducing the likelihood that you’ll make errors. Moreover, you save yourself some typing.

A few examples:

  • When working with questionnaire data, don’t simply label the columns with the number of the question in the questionnaire (e.g., Q3 or Question17). Instead, use more descriptive labels such as DegreeFather or DialectUse.
  • If you have a column labelled Sex that is filled with 1s and 0s, then you may end up having to look up whether the 1s refer to men or to women. Instead, either fill the column with m(an) and w(oman) values, or rename the column Man, so that you know that the 1s refer to men.
  • Try to strike a balance between descriptive power and length. For instance, a variable named HowOftenDoYouSpeakStandardGerman will get on your nerves before long; SpeakStandard could be sufficient.
Use a codebook

Having short and descriptive variable names is not an excuse for not maintaining a codebook that spells out precisely what each variable in the dataset refers to. [Example of a codebook].

Label missing values unambiguously

In Figure 3, I left some of the translation cells empty. From this, I can deduce that the word was presented to the participant but that he or she did not translate it. However, it is also possible that some participants were inadvertently never shown a particular word (e.g., due to a programming error) or that some of the participants’ responses were irretrievably lost. We should enable ourselves to distinguish between these cases, for instance by marking the latter cases using NA (not available). (Alternatively, we could explicitly label cases where the participant did not translate the word with [no response] or something similar.)

If you want to be able to tell apart different reasons for missing data (such as data loss, programming errors, participants’ refusal to answer a certain question, participants’ being absent from a data collection etc.), it’s probably easiest to just write NA in the column and add another column with comments detailing the reasons for the missingness.

Don’t use numbers to encode missingness

Some data logging applications use -99 or -9999 to encode missingness. The problem with this is that -99 or -9999 don’t always stand out as conspicuous values, so they risk being treated as genuine data during the analysis.

Reduce redundancy by splitting up datasets

The spreadsheet above contains several repeated pieces of information. For instance, for all five translations provided by participant 1034, we indicated that they were provided by a woman aged 51. We can eliminate this source of redundancy by managing a handful of smaller datasets rather than just a single large one. More specifically, we can manage a dataset that contains all information that depends just on the participants, see Figure 6. Each participant has a unique ID (here: Versuchsperson), and the dataset contains a single row for each participant. If we need to correct an entry related to, say, some participant’s age, we just need to change it here – once – rather than five times in the larger dataset.

Figure 6: The first smaller dataset only contains information concerning the participants.

By the same token, we can put all information that depends just on the stimuli used in a separate dataset, see Figure 7. Here, too, each stimulus has a unique ID (here: Wort).

Figure 7: The second smaller dataset only contains information concerning the words.

The third dataset then only contains information that depends on the combination of a particular participant and a particular word. As shown in Figure 8, each row in this dataset contains the IDs of both the participant and the stimulus that the response is related to. But any other information related to just the word or to just the participant is left out of this dataset. As we’ll see shortly, we can easily add the information related to the participants or to the words to this dataset from the other two datasets.

Figure 8: The third dataset only contains the translations.

Miscellaneous tips

  • Mind capitalisation. For some programs and computer languages, Frau and frau are the same value; for others (including R), they are not.
  • Mind trailing spaces. “Mann ” (with trailing space) and “Mann” (without trailing space) are different values to a computer.
  • Special symbols, including the Umlaut, sometimes lead to problems, especially when collaborating with people whose localisation settings differ from yours.
  • Format dates in the YYYY/MM/DD format. This way, if you sort the data alphabetically, they’re already in chronological order.
  • Don’t use colour-coding in your spreadsheet. Or if you do use it, be aware that you’ll lose this information once you import your data into your statistics program.
Work as little as possible in the spreadsheet

After you’ve entered your data into a spreadsheet, all further steps in your analysis should be carried out in R (or Python, or Julia, or what-have-you). Don’t calculate, sort, copy, paste, move, reshape, draw graphs etc. in Excel or whatever spreadsheet program you prefer. Treat your finished spreadsheet as the immutable source of data from which your results will be obtained, so that when in doubt, it’s clear which file you should go back to.

When using R (or Python, or Julia, or whatever), use (a) script(s) to read in the original dataset and convert the data in it to a format more amenable to analysis, but whatever you do, don’t overwrite the original file.

Use data validation

When working in a spreadsheet program, you can use the data validation functions to minimise the chances that you enter faulty data. For instance, if you know that the possible responses to a certain question are either integers from 0 to 5 or NA, make a data validation rule that prevents you from entering erroneous data.

Also see the blog post A data entry form with failsafes.

Working with R and RStudio

With the data available in a rectangular spreadsheet format, the next challenge is to load them into a program capable of analysing them. Several (free) options are available (JASP, Python, Julia, …), but we’ll focus on R complemented with the tidyverse suite. The main reasons for this are that R is more popular in the humanities and the social sciences and that I’m just more proficient in R. Hadley Wickham lists some further reasons.

R, RStudio, and the tidyverse

To install R, visit the R Project website.

In addition, I recommend you install RStudio, which facilitates working with R. To do so, visit the Posit website and download the open source desktop version of RStudio.

Once you’ve installed R and RStudio, open RStudio, and go to Tools > Global options.... In the tab General, Basic, make sure to untick the option ‘Restore .RData into workspace at startup’ and select the option ‘never’ for ‘Save workspace to .RData on exit’. These settings help ensure that you’re starting from a clean slate each time you open RStudio and prevent the results of calculations in previous sessions from messing up your analyses. Additionally, in the tab Code, Editing, tick the box next to ‘Use native pipe operator, |>’. We’ll encounter this pipe operator shortly. Finally, I recommend setting the default text encoding to UTF-8 (Code, Saving).

The functionality of R can be extended by installing further packages. We’ll be using two of these. The first is the here package, which makes it easier to read in data files if you’re working in a complex file system. Nothing in what follows hinges on your using the here package or not; I just find that it makes my life easier. To install the here package, type the following command at the R prompt:

install.packages("here")

The second package isn’t optional if you want to follow along with the rest of this course: the tidyverse bundles a couple of packages that are based on a set of shared principles and that facilitate working with datasets enormously. You can install it like so:

install.packages("tidyverse")

Setting up an R project

Next, in RStudio, click on File > New Project... > New Directory. Navigate to somewhere on your computer where you want to create a new directory and give this directory a name (for instance, DatasetsAndGraphs). You will use this directory to store the data and scripts needed for drawing the graphs, as well as the graphs themselves. You don’t have to tick the other options.

When you’re done, close RStudio. Navigate to the directory you’ve just created. You should find an .Rproj file there. Double-click it. If all is well, RStudio should fire up.

Create the following subdirectories in the directory you’re currently in: data, scripts, functions and figs.

On github.com/janhove/DatasetsAndGraphs, you can find a couple of datasets (under data) as well as an R script that defines a new function we’ll use (under functions). Put the datasets in your own data directory and the R script in your own functions directory.

You’ve now set up an R project for this pair of lectures. R projects aren’t strictly necessary, but I find them helpful to manage different research projects, publications, classes I teach, etc.

Loading the packages

We’ll work with the here and tidyverse packages below. Even if you’ve already installed these, you need to load these packages in every R session in which you use them. To do so, execute the following lines:

library(here)
here() starts at C:/Users/VanhoveJ/switchdrive/Documents/Workshops/DatasetsAndGraphs
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Incidentally, you don’t need to install these packages every time you need them, but you do need to load them every time you need them in a new session. I recommend including the packages you need at the top of your script.
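
For instance, the top of a script might look like this (a minimal sketch; adjust the list of packages to whatever the script actually needs):

# packages used in this script
library(here)       # file paths relative to the project directory
library(tidyverse)  # data manipulation and plotting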

Use scripts

With few exceptions, you should type the commands you want to execute in an R script (File > New File > R Script) rather than directly in the R console. While nothing terrible will happen if you do enter commands directly into the console, entering them into the script editor first is a good habit to get into: it’s much easier to spot errors, save and document your analysis, make tweaks to your code, and reuse old code in the editor than in the console.

As a rule of thumb, for the analysis of large projects, I tend to create separate R scripts for different parts of the analysis that are internally closely connected but more or less independent of the other parts. For instance, I might have a script for cleaning the datasets, another script for the analyses that are reported in Chapter 4, and yet another script for the analyses reported in Chapter 5.

Reading in datasets

Excel files

We’ll focus on reading in two kinds of spreadsheets: Excel spreadsheets in the XLS(X) format, and CSV files.

In order to read in XLS(X) files, we need the readxl package. This package is part of the tidyverse suite, but it does not get loaded automatically as you load the tidyverse suite. So we load it separately:

library(readxl)

Assuming the file uebersetzungen.xlsx is located in the data subdirectory of your R project directory, we can read it into R as follows.

translations <- read_excel(here("data", "uebersetzungen.xlsx"))

The dataset is now loaded as a so-called tibble named translations. We can display it by simply typing its name at the prompt:

translations
Type, don’t copy-paste – at home

You’re going to take away much more from this primer if you copy the code snippets by typing them rather than by copy-pasting them. That said, I strongly advise you to do so in your own time and not as you’re attending the lecture: Some things are bound to go wrong (perhaps a comma in the wrong spot, a missing bracket or an upper-case letter that should have been a lower-case one), and as you’re trying to fix the errors, you’ll lose track of the lecture.

So work through the code snippets at home, and be patient with yourselves. I receive error messages or unexpected outcomes all the time – it’s just that after using R for 15 or so years, I’ve become reasonably proficient at spotting the problems and Googling solutions :)

Data frames and tibbles

Rectangular data sets are referred to as data frames in R. The tidyverse slightly changes their functionality, mostly in order to allow for prettier displaying, and refers to them as tibbles.
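
If you ever need to convert between the two representations, as_tibble() and as.data.frame() do the job. A quick sketch, using the built-in mtcars data frame for illustration:

as_tibble(mtcars)           # a base-R data frame printed as a tibble
as.data.frame(translations) # back to a plain data frame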

Empty strings interpreted as missing

Note that the empty cells are automatically interpreted as missing data (<NA>). The documentation of the read_excel() function, which you can access by typing ?read_excel at the R prompt, suggests that we can override this behaviour, but we can’t.

If, instead of printing the entire tibble at the prompt, we just want to display the first few rows, we can use slice_head():

slice_head(translations, n = 4)

slice_tail() works similarly. If you want to display a random selection of rows, you can use slice_sample():

slice_sample(translations, n = 5)

Again, for details, you can check the documentation of these functions that is available at ?slice_head.

CSV files

A popular format for storing spreadsheets is the CSV format. CSV is short for comma-separated values: Cells on the same row are separated by commas, see Figure 9. Sometimes, text strings are additionally surrounded by quotation marks.

Figure 9: A spreadsheet stored as comma-separated values.

To read in CSV files, we can use the read_csv() function, which is part of the readr package, which in turn is automatically loaded when the tidyverse suite is loaded:

translations <- read_csv(here("data", "uebersetzungen.csv"))
Rows: 20 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Wort, Übersetzung
dbl (3): Versuchsperson, Position, Richtig

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
translations

We receive a couple of messages (not errors!). These show that the read_csv() function correctly recognised that the columns Wort and Übersetzung contain text (‘chr’, character) strings, whereas Versuchsperson, Position and Richtig contain numbers (‘dbl’ for ‘double’, a number format).

Empty strings interpreted as missing

With read_csv(), we can specify that only cells containing “NA” are marked as missing data:

translations2 <- read_csv(here("data", "uebersetzungen.csv"), na = "NA")
Rows: 20 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Wort, Übersetzung
dbl (3): Versuchsperson, Position, Richtig

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
translations2

Note that the Übersetzung value in row 4 is just empty, not NA.

Let’s also read in the datasets containing the information pertaining to the participants and items:

participants <- read_csv(here("data", "versuchspersonen.csv"))
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Geschlecht, Englisch
dbl (2): Versuchsperson, Alter

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
items <- read_csv(here("data", "woerter.csv"))
Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Wort, RichtigeÜbersetzung

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Different CSV formats

If you save an Excel spreadsheet as a CSV file on a French- or German-language computer system, the cells will be separated using semicolons rather than using commas. The reason is that commas are used as decimal separators in French and German. To read in ‘CSV’ files that use semicolons as cell separators, you can use the read_csv2() function.
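
For instance (a sketch with a hypothetical file name):

# semicolon-separated cells, comma as decimal separator
translations_semi <- read_csv2(here("data", "uebersetzungen_semikolon.csv"))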

Incidentally, if you use LibreOffice Calc instead of Excel, you can choose which cell separator gets used when you export a spreadsheet as a CSV file.

read_csv() vs. read.csv()

More seasoned R users are probably already familiar with the read.csv() function (with a dot rather than an underscore). The read_csv() function is merely the tidyverse counterpart to this function. The practical differences between them are negligible.
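
For comparison, here’s a sketch of the base-R call; the most visible difference is that it returns a plain data frame rather than a tibble:

translations_base <- read.csv(here("data", "uebersetzungen.csv"))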

Other formats

For reading in data from Google Sheets, see googlesheets4.

For interacting with Google Drive, see googledrive.

For reading in SPSS, Stata and SAS files, see haven.
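
For instance, reading in an SPSS file with haven might look like this (a sketch with a hypothetical file name; if haven isn’t installed on your system, install.packages("haven") will take care of that):

library(haven)
questionnaire <- read_sav(here("data", "fragebogen.sav"))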

Joining datasets

Having read in our three datasets as translations, participants, and items, we want to merge these datasets into one large dataset. The most useful function for this is left_join(), which takes the dataset passed as its x argument and adds to this the corresponding rows from the y dataset:

all_data <- left_join(x = translations, y = participants)
Joining with `by = join_by(Versuchsperson)`

Note that the left_join() function recognises that the variable shared between both datasets is called Versuchsperson. Hence, the information related to participant 1034 contained in participants is added to each row in translations where the Versuchsperson value is 1034, and similarly for the other participants:

all_data

If you’d rather not leave it to left_join() to figure out which variable is shared, you can specify it explicitly:

all_data <- left_join(x = translations, y = participants, by = "Versuchsperson")

If the shared variable has different names in the different datasets, you can use something like

new <- left_join(x = left_dataset, y = right_dataset, by = join_by(var_left == var_right))

See ?join_by for more information.

If there are values in the shared variable that occur in the x dataset that don’t occur in the y dataset, the values in the added columns will read NA for these rows.
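
A tiny made-up example illustrates this (the names demo_x and demo_y are just for illustration):

demo_x <- tibble(key = c("a", "b", "c"), value_x = 1:3)
demo_y <- tibble(key = c("a", "b"), value_y = c(10, 20))
# the key "c" has no match in demo_y, so its value_y reads NA
left_join(x = demo_x, y = demo_y, by = "key")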

Further join functions are the following, see ?join for details:

  • right_join(x, y): Keep all entries in y. Add corresponding entries in x to it.
  • full_join(x, y): Keep all entries in both x and y. Add NAs if there is no corresponding entry in the other dataset.
  • inner_join(x, y): Only keep entries in x for which there is a corresponding entry in y. Add these corresponding entries.
  • semi_join(x, y): Only keep entries in x for which there is a corresponding entry in y. Don’t add these corresponding entries.
  • anti_join(x, y): Only keep entries in x for which there are no corresponding entries in y.

In the current example, left_join, full_join and inner_join would all yield the same result. But this isn’t always the case.

Let’s also add the information pertaining to the words:

all_data <- left_join(all_data, items)
Joining with `by = join_by(Wort)`

Incidentally, another way of achieving the same result is this:

all_data <- translations |> 
  left_join(participants) |> 
  left_join(items)
Joining with `by = join_by(Versuchsperson)`
Joining with `by = join_by(Wort)`

Exercise: join functions

  1. Use the following code to generate two tibbles called left and right:
left <- tibble(A = c("a", "b", "c", NA),
               B = c(1, 2, NA, 4))
right <- tibble(B = c(1, 3, 4, 4),
                C = c(10, NA, 12, 7))
left
right
  2. Without running these commands, predict what their output would look like:
left_join(x = left, y = right)
right_join(x = left, y = right)
full_join(x = left, y = right)
inner_join(x = left, y = right)
semi_join(x = left, y = right)
semi_join(x = right, y = left) # !
anti_join(x = left, y = right)
anti_join(x = right, y = left) # !
  3. Now run the commands above to verify your predictions.

  4. Create two new tibbles using the code below:

left <- tibble(A = c("a", "b"),
               B = c(1, NA))
right <- tibble(B = c(1, NA, NA),
                C = c(0, 1, 2))
left
right
  5. Consult the help page for left_join and look up the na_matches parameter under ‘Arguments’. Predict what the output of the following two commands would look like, and only then check your answer.
left_join(left, right)
left_join(left, right, na_matches = "never")

Queries

Selecting by row number

We can select the third row of a dataset like so:

slice(all_data, 3)

Alternatively, we can write this command as follows. The pipe operator |> allows us to take an object (here: all_data) and pass it to a function as its first argument. As we’ll see later, we can use |> to string together a host of function calls without creating illegible code.

all_data |> 
  slice(3)

We can also select multiple rows:

# Rows 5 and 7
all_data |> 
  slice(c(5, 7))
# Rows 5 to (and including) 7
all_data |> 
  slice(5:7)

The results of such actions can be stored as separate objects, for instance, like so:

rows7_12 <- all_data |> 
  slice(7:12)
rows7_12

We can export this new object as a CSV file like so:

write_csv(rows7_12, here("data", "rows7_12.csv"))

Selecting by values

Selecting rows by their number isn’t too useful. But selecting rows satisfying some set of conditions is very useful. Here are a few examples:

# All data corresponding to the word 'fiende'
all_data |> 
  filter(Wort == "fiende")
# Note: use '!=' for 'not equal to'.

# All data corresponding to participants older than 30
all_data |> 
  filter(Alter > 30)
# Note: use '>=' for 'at least as old as', 
#           '<=' for 'no older than', 
#        and '<' for 'younger than'.

We can use is.na to check for missing values. Note the use of ! to negate a condition.

all_data |> 
  filter(is.na(Übersetzung))
all_data |> 
  filter(!is.na(Übersetzung))

We can string together multiple filter calls:

# incorrect translations to first word
all_data |> 
  filter(Position == 1) |> 
  filter(Richtig == 0)

An alternative way of writing this is as follows:

all_data |> 
  filter(Position == 1 & Richtig == 0)

‘or’ conditions can be created using |:

# translations to first word or incorrect translations
all_data |> 
  filter(Position == 1 | Richtig == 0)

Note that in logic, ‘or’ is always inclusive. Exclusive ‘or’ (‘xor’) can be obtained as follows:

# translations to first word or incorrect translations, but not both
all_data |> 
  filter(Position == 1 | Richtig == 0) |> 
  filter(!(Position == 1 & Richtig == 0))

Alternatively,

all_data |> 
  filter(xor(Position == 1, Richtig == 0)) 

Exercise

  1. Run the following commands:
d1 <- all_data |> 
  filter(Übersetzung == "vorsichtig")
d2 <- all_data |> 
  filter(Übersetzung != "vorsichtig")
  2. Explain what both commands achieve.

  3. How many rows are there in d1? How many in d2? How many in all_data? Explain.

  4. Create a tibble d3 that contains only those rows in all_data where the participants did not translate the word as vorsichtig.

Selecting columns

Sometimes, a dataset is too cumbersome to handle because it contains a lot of irrelevant columns. Using select, we can select those columns that are of interest. For instance,

all_data |> 
  select(Wort, RichtigeÜbersetzung, Übersetzung) |> 
  slice_head(n = 5)

There are a couple of auxiliary functions that make it easier to select columns. These are especially useful when working with large datasets. Examples are contains() and starts_with():

all_data |> 
  select(contains("Übersetzung"))
all_data |> 
  select(starts_with("Richt"))

See the tidyselect documentation for further functions.

Further examples

We can string together different types of commands:

# All translations for 'fiende'
all_data |> 
  filter(Wort == "fiende") |> 
  select(Übersetzung)
# All *distinct* translations for 'behärska':
all_data |> 
  filter(Wort == "behärska") |> 
  select(Übersetzung) |> 
  distinct()

Without the pipe (|>), these commands become difficult to read:

distinct(select(filter(all_data, Wort == "behärska"), Übersetzung))

Pivoting

In the course of an analysis, it is often necessary to convert a long-ish dataset to a wider one, and vice versa. This process is known as pivoting. To illustrate pivoting, we’ll make use of a more realistic – and more complicated – dataset derived from a longitudinal project on the development of reading and writing skills in Portuguese–French and Portuguese–German bilingual children (Desgrippes et al. 2017, Pestana et al. 2017).

We read in the dataset helascot_skills.csv as skills:

skills <- read_csv(here("data", "helascot_skills.csv"))
Rows: 1904 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, LanguageTested
dbl (4): Time, Reading, Argumentation, Narration

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skills

For each participant (Subject) at each Time (T1, T2, T3) and for each LanguageTested (Portuguese, French, German), we have up to three scores, Reading, Argumentation and Narration, arranged next to each other. But let’s say we wanted to compute, for each participant, their progress on each task in each language from T1 to T2 and from T2 to T3. The way the data are laid out at present, this would at best be pretty difficult to do. But it would be easy if only the data were arranged differently, namely with all three measurements per participant and task next to each other. We can convert this dataset to the desired format in two steps.

First, we make the dataset longer by stacking the different scores under each other rather than next to each other. To this end, we use the function pivot_longer() and specify that we want to stack the values in the Reading, Argumentation, and Narration columns under each other, that we want to call the resulting column Score, and that we want to put the column names in a new column called Skill:

skills_longer <- skills |> 
  pivot_longer(cols = c("Reading", "Argumentation", "Narration"),
               names_to = "Skill", values_to = "Score")
skills_longer

Then, we make this tibble wider by putting the three measurements per skill and language next to each other using pivot_wider(). We also prefix the Time values with a T.

skills_wider_time <- skills_longer |> 
  pivot_wider(names_from = Time, values_from = Score,
              names_prefix = "T")
skills_wider_time

Using mutate(), we can now easily compute the differences between T1 and T2 and between T2 and T3:

skills_wider_time |> 
  mutate(ProgressT1_T2 = T2 - T1,
         ProgressT2_T3 = T3 - T2) |> 
  select(Subject, LanguageTested, Skill, ProgressT1_T2, ProgressT2_T3)

Now imagine that we wanted to compute, for each participant in each skill at each data collection, the difference between the Portuguese score and the German/French score. The first step is the same, resulting in skills_longer. The second step is now similar, but we put the different languages next to each other:

skills_wider_language <- skills_longer |> 
  pivot_wider(names_from = LanguageTested, values_from = Score)
skills_wider_language

Incidentally, not all values for German are NA. It’s just that the first couple of children were tested in French and Portuguese, not in German. We can check this like so:

skills_wider_language |> 
  slice_sample(n = 10)

If we needed to, we could make this dataset wider still:

skills_wider_time_language <- skills_longer |> 
  pivot_wider(names_from = c(LanguageTested, Time), # combination of Language and Time
              values_from = Score)
skills_wider_time_language

We could reconvert this tibble to a long one using the code snippet below. The code becomes a bit more complex: the notation French_1:German_3 selects all columns from French_1 through German_3 (inclusive), and a so-called regular expression (regex) is used to parse these column names into the Language bit and the Time bit. For the present example, we don’t need this code snippet since we already have skills_longer. But I wanted to illustrate that such conversions are possible. If you ever need to convert a similar dataset to a longer format, you now know that you can look up the details on the help page of pivot_longer() (?pivot_longer) and take it from there:

skills_wider_time_language |> 
  pivot_longer(cols = French_1:German_3,
               names_to = c("Language", "Time"),
               names_pattern = "(.*)_(.*)",
               values_to = "Score")

Alternatively, we could have used starts_with(c("French", "Portuguese", "German")) to select the relevant columns.
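
For completeness, here’s a sketch of that alternative; it selects the same columns:

skills_wider_time_language |> 
  pivot_longer(cols = starts_with(c("French", "Portuguese", "German")),
               names_to = c("Language", "Time"),
               names_pattern = "(.*)_(.*)",
               values_to = "Score")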

Summaries

Using the summarise() function, we can easily summarise large tibbles. For instance, we can compute the average (mean) narration and argumentation scores in the skills tibble like so:

skills |> 
  summarise(mean_narr = mean(Narration, na.rm = TRUE),
            mean_arg  = mean(Argumentation, na.rm = TRUE))

We set the na.rm parameter in the mean() call to TRUE since there are several missing observations in both the Narration and Argumentation variable. If we don’t set na.rm to TRUE, the result of both computations would be NA. By setting na.rm to TRUE, missing observations are ignored.
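
A quick illustration of what na.rm does, using a made-up vector:

mean(c(1, 2, NA))               # NA
mean(c(1, 2, NA), na.rm = TRUE) # 1.5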

summarise() is often used in conjunction with group_by(), which splits up a tibble into subgroups. This way, we can straightforwardly compute summaries for different subgroups. For instance, if we wanted to compute the mean narration and argumentation scores for each time of data collection and each language tested, we could use the following code snippet:

skills |> 
  group_by(Time, LanguageTested) |> 
  summarise(mean_narr = mean(Narration, na.rm = TRUE),
            mean_arg  = mean(Argumentation, na.rm = TRUE),
            .groups = "drop")

By setting .groups = "drop", you make sure that the grouping applied to skills doesn’t apply to the summary tibble any more. (It’s not so important.)

We can treat these summary tibbles like ordinary tibbles and apply all of the other commands to them:

skills |> 
  group_by(Time, LanguageTested) |> 
  summarise(mean_narr = mean(Narration, na.rm = TRUE),
            .groups = "drop") |> 
  pivot_wider(names_from = "Time", names_prefix = "T", values_from = "mean_narr")

In addition to mean(), here are some further functions that are often used when computing summaries; see the functions’ help pages for details:

  • median(),
  • quantile(),
  • min() and max(),
  • sd() and var() for the sample standard deviation and sample variance, respectively,
  • mad() for the (adjusted) median absolute deviation,
  • n() for obtaining the number of observations,
  • sum().

The mean() function can also be used to compute proportions. Consider the following example. The is.na() function checks if a value is NA (in which case it returns TRUE) or not (FALSE). If we compute the mean() of a bunch of TRUE/FALSE values, we obtain the proportion of values that are TRUE. Similarly, the sum() of a bunch of TRUE/FALSE values is the number of TRUE values. Hence, we can quickly obtain the proportion, and number, of the missing Narration scores for each combination of Time and LanguageTested:

skills |> 
  group_by(Time, LanguageTested) |> 
  summarise(
    prop_narr_NA = mean(is.na(Narration)),
    nr_narr_NA = sum(is.na(Narration)),
    n = n(),
    .groups = "drop"
  )
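
In case this trick seems opaque, here’s a minimal illustration with made-up values:

x <- c(4, NA, 5, NA, 6)
is.na(x)        # FALSE  TRUE FALSE  TRUE FALSE
mean(is.na(x))  # 0.4: the proportion of missing values
sum(is.na(x))   # 2: the number of missing values
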
Writing your own functions

You can write your own functions. For instance, an alternative to the mean and the median is the Hodges–Lehmann estimator. We can define a function that computes this estimator like so:

hodges_lehmann <- function(x, na.rm = TRUE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  pairwise_means <- outer(x, x, FUN = "+") / 2
  median(pairwise_means[upper.tri(pairwise_means, diag = TRUE)])
}
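
A quick sanity check on a tiny made-up vector: for the values 1, 2 and 3, the pairwise means (including each value paired with itself) are 1, 1.5, 2, 2, 2.5 and 3, whose median is 2.

hodges_lehmann(c(1, 2, 3, NA))  # 2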

We can then use this function in summarise() like so:

skills |> 
  group_by(Time, LanguageTested) |> 
  summarise(
    hl_narr = hodges_lehmann(Narration),
    .groups = "drop"
  ) |> 
  pivot_wider(names_from = "Time", names_prefix = "T", values_from = "hl_narr")

String manipulation

It often happens that a single cell in a dataset contains several pieces of information. This is the case in our current example, too. The first participant in the dataset is referred to as A_PLF_1:

  • The A in this ID refers to their class.
  • The PLF tells us that this participant resided in French-speaking Switzerland (F), had a Portuguese background (P) and took Portuguese heritage and language courses (L). Other abbreviations in the dataset are CD, CF, and CP for comparison groups in German-speaking Switzerland, French-speaking Switzerland, and Portugal, respectively; PLD for participants residing in German-speaking Switzerland with a Portuguese background who attended a Portuguese course; and PND and PNF for participants in German- and French-speaking Switzerland, respectively, with a Portuguese background who did not take Portuguese classes.
  • The 1 uniquely identifies this participant within their class.

It could make sense to split up the information contained in this one cell into multiple cells. Thankfully, the strings in Subject are structured logically and consistently: the different pieces of information are separated using underscores ("_"). We can hence split these strings at the underscores and retrieve the first and second pieces like so:

skills_wider_language <- skills_wider_language |> 
  mutate(
    Class = str_split_i(Subject, "_", 1),
    Group = str_split_i(Subject, "_", 2)
  )
# check:
skills_wider_language |> 
  select(Subject, Class, Group) |> 
  sample_n(10)

Refer to the stringr documentation for further guidance.

We can now further break down the information contained in the new Group column, for instance as follows:

skills_wider_language <- skills_wider_language |> 
  mutate(language_group = case_when(
    Group == "CP" ~ "Portuguese control",
    Group == "CF" ~ "French control",
    Group == "CD" ~ "German control",
    Group %in% c("PLF", "PNF") ~ "French-Portuguese",
    Group %in% c("PLD", "PND") ~ "German-Portuguese",
    .default =  "other"
  )) |> 
  mutate(heritage_course = case_when(
    Group %in% c("PLF", "PLD") ~ 1,
    .default =  0
  )) |> 
  mutate(has_German = case_when(
    Group %in% c("CD", "PLD", "PND") ~ 1,
    .default =  0
  )) |> 
  mutate(has_French = case_when(
    Group %in% c("CF", "PLF", "PNF") ~ 1,
    .default =  0
  )) |> 
  mutate(has_Portuguese = case_when(
    Group %in% c("CF", "CD") ~ 0,
    .default =  1
  ))

# check:
skills_wider_language |> 
  select(Subject, language_group, heritage_course, has_German, has_French, has_Portuguese) |> 
  sample_n(10)

A full-fledged example

We had the children in the French/German/Portuguese project write short narrative and argumentative texts in each of their languages at three points in time. The quality of these texts was scored using a grid (Desgrippes et al. 2017); it is these scores that are listed in the skills tibble we worked with above. In addition, 3,060 of these texts were rated for their lexical richness by between two and eighteen naïve (i.e., non-instructed) raters each using a 1-to-9 scale. The individual ratings are available in the file helascot_ratings.csv. Furthermore, we computed a bunch of lexical metrics for each text, such as the number of tokens, the mean corpus frequency of the words occurring in the text, etc. These metrics are available in the file helascot_metrics.csv; see Vanhove et al. (2019) and Vanhove (2019) for details.

We’ll use these datasets to answer three questions:

  • What’s the relation between the average lexical richness ratings per text and the text’s Guiraud index? (The Guiraud index is the ratio of the number of types in a text to the square root of the number of tokens in that text; see the small illustration after this list.)
  • What’s the relation between the average lexical richness ratings per text and the mean corpus frequency of the words occurring in the texts?
  • What’s the relation between the grid-based ratings and the lexical richness ratings?
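
As a small illustration of the Guiraud index, consider a made-up six-word ‘text’:

tokens <- c("the", "cat", "sat", "on", "the", "mat")
length(unique(tokens)) / sqrt(length(tokens))  # 5 types, 6 tokens: about 2.04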

In doing so, we’ll need to make use of some of the tools introduced earlier.

Reading in the data

Let’s start from a clean slate and read in the three datasets:

skills <- read_csv(here("data", "helascot_skills.csv"))
Rows: 1904 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, LanguageTested
dbl (4): Time, Reading, Argumentation, Narration

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
metrics <- read_csv(here("data", "helascot_metrics.csv"))
Rows: 3060 Columns: 137
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (8): Text, Subject, Text_Language, Text_Type, LemmawordsNotInSUBTLEX, ...
dbl (129): Time, TTR, Guiraud, nTokens, nTypes, nLemmas, meanWordLength, MST...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ratings <- read_csv(here("data", "helascot_ratings.csv"))
Rows: 29179 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Batch, Rater, Text, Subject, Text_Language, Text_Type, Rater_Sex, ...
dbl  (4): Trial, Rating, Time, Rater_Age
lgl  (1): Rater_NativeLanguageOther

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Inspect the structure of these three datasets. Note that

  • skills contains one row per combination of Subject, Time and LanguageTested;
  • metrics contains one row per text, i.e., one row per combination of Subject, Text_Language, Text_Type and Time;
  • ratings contains one row per rating, i.e., one row per combination of Rater and Text.
Note

These datasets are already pretty clean and a lot of work (partly manual, partly using R, partly using other tools) went into creating them. See the technical report as well as the R code available from osf.io.

There are many ways to skin a cat. But it seems to me that the following course of action is reasonable:

  1. Using ratings, compute the average Rating per text.
  2. Add these average ratings to metrics and draw some plots. (More on plotting in the second part!)
  3. Make skills longer and add the average ratings to it. Then draw some more plots.

Average rating per text

rating_per_text <- ratings |> 
  group_by(Text, Subject, Text_Language, Text_Type, Time) |> 
  summarise(mean_rating = mean(Rating),
            n_ratings = n(),
            .groups = "drop")
Exercise

Some of the raters consider themselves native speakers of several languages (bi-French etc.), others consider themselves monolingual native speakers (mono-French etc.):

table(ratings$Rater_NativeLanguage)

      bi-French       bi-German   bi-Portuguese     mono-French     mono-German 
           1049            2241             728            6235           14149 
mono-Portuguese 
           4777 

What could you do if you wanted to base the average rating per text only on the ratings provided by monolingual raters?

Add to metrics

The tibbles metrics and rating_per_text share a variable (Text), so the average ratings can easily be added to metrics:

metrics_ratings <- metrics |> 
  left_join(rating_per_text)
Joining with `by = join_by(Text, Subject, Text_Language, Text_Type, Time)`

We’ll learn in the second part how to draw meaningful data plots, but here’s one way of visualising the relationship between Guiraud and mean_rating:

metrics_ratings |> 
  ggplot(aes(x = Guiraud, y = mean_rating)) +
  geom_point(shape = 1) +
  facet_grid(rows = vars(Time, Text_Type),
             cols = vars(Text_Language)) +
  xlab("Guiraud value") +
  ylab("average lexical richness rating")

Similarly, for meanSUBTLEX (average corpus frequency):

metrics_ratings |> 
  ggplot(aes(x = meanSUBTLEX, y = mean_rating)) +
  geom_point(shape = 1) +
  facet_grid(rows = vars(Time, Text_Type),
             cols = vars(Text_Language)) +
  xlab("mean SUBTLEX frequency") +
  ylab("average lexical richness rating")

correlations <- metrics_ratings |> 
  group_by(Text_Language, Time, Text_Type) |> 
  summarise(
    cor_guiraud = cor(mean_rating, Guiraud),
    cor_frequency = cor(mean_rating, meanSUBTLEX)
  )
`summarise()` has grouped output by 'Text_Language', 'Time'. You can override
using the `.groups` argument.
correlations

Add to skills

rating_gridscore <- skills |> 
  pivot_longer(Reading:Narration, names_to = "skill",
               values_to = "grid_score") |> 
  filter(skill != "Reading") |> 
  mutate(Text_Type = case_when(
    skill == "Argumentation" ~ "arg",
    skill == "Narration" ~ "narr"
  )) |> 
  full_join(rating_per_text,
            by = join_by(Text_Type, Subject, Time, LanguageTested == Text_Language))
rating_gridscore |> 
  filter(!is.na(grid_score)) |> 
  filter(!is.na(mean_rating)) |> 
  ggplot(aes(x = mean_rating, y = grid_score)) +
  geom_point(shape = 1) +
  facet_grid(rows = vars(Time, Text_Type),
             cols = vars(LanguageTested))

rating_gridscore |> 
  filter(!is.na(grid_score)) |> 
  filter(!is.na(mean_rating)) |> 
  group_by(LanguageTested, Time, Text_Type) |> 
  summarise(
    cor_score = cor(mean_rating, grid_score),
    .groups = "drop"
  )

Suggested reading

The Bible for working with tibbles in R is Wickham et al.’s R for Data Science, which is freely available from r4ds.hadley.nz.