Working with datasets in R: A primer
1 Goal
The goal of this primer is to furnish you with the basic tools and knowledge to organise, transform, and query datasets in research contexts. We’ll work with R and more specifically with the tidyverse add-on suite that markedly simplifies working with datasets.
2 Tips for organising datasets
Let’s say we’ve run a study in which speakers of German read a couple of words in Swedish and were asked to guess what these words might mean. An excerpt from the raw data might look like this, with the words having been shown to the participants in the order in which they are listed:
- Participant 1034. Woman, 51 years.
- Word: söka. Translation: Socken (incorrect).
- Word: försiktig. Translation: vorsichtig (correct).
- Word: mjölk. Translation: Milch (correct).
- Word: behärska. No translation given.
- Word: fiende. Translation: finden (incorrect).
- Participant 2384. Woman, 27 years.
- Word: fiende. No translation given.
- Word: behärska. No translation given.
- Word: försiktig. Translation: vorsichtig (correct).
- Word: mjölk. Translation: Milch (correct).
- Word: söka. Translation: Socke (incorrect).
- Participant 8667. Woman, 27 years.
- Word: mjölk. Translation: Milch (correct).
- Word: behärska. No translation given.
- Word: fiende. Translation: finden (incorrect).
- Word: söka. Translation: suchen (correct).
- Word: försiktig. Translation: vorsichtig (correct).
- Participant 5901. Man, 15 years.
- Word: behärska. Translation: beherrschen (correct).
- Word: mjölk. Translation: milch (sic.; correct).
- Word: försiktig. Translation: vorsichtig (correct).
- Word: fiende. Translation: feinde (sic.; correct; actually Feind).
- Word: söka. Translation: socken (sic.; incorrect).
There are lots of ways in which we could represent these data in a spreadsheet. Let’s look at a few rules of thumb.
2.1 Wide vs. long formats
We’ll concern ourselves strictly with rectangular datasets. These are datasets in which the information is laid out in rows and columns and in which all columns have the same length, and all rows have the same width. Examples of non-rectangular data formats are JSON and XML – see here.
Broadly speaking, we can organise our data in a wide format or in a long format. In a wide format, all pieces of information related to a unit of data collection are organised in a single row. For instance, we could think of each participant in the study as a unit of data collection, in which case we could lay out the data as in Figure 1. Note that the spreadsheet contains a column for each word that indicates the position in which it was presented to the participants. Alternatively, we could think of each word as a unit of data collection and organise the data like in Figure 2.
In a long format, all pieces of information pertaining to an observational unit are organised in a single row. It’s difficult to precisely define what units of data collection and observational units are, and it wouldn’t be too useful to have a precise definition, anyway. But in the present example, the observational units would be the individual responses, i.e., the individual translations. Figure 3 shows what the same data look like when organised in a long format.
It’s usually easier to work with data in a long-ish format than with data in a wide-ish format. Moreover, when you do need your data in a wide-ish format for data analysis, converting a longer dataset to a wider one is typically easier than vice versa. We will, of course, illustrate how you can effect such conversions.
Assuming you have any say in the matter, of course.
Make sure that all rows can be interpreted independently of one another. What we want to avoid is that we can’t make sense of the data in some row because we need data from another row to do so, and this other row was deleted, or the order of the rows has changed, etc. For instance, in Figure 3, we also have a column with the Positions, even though we could have derived this information from the ordering of the rows. But once a row gets deleted or once the order of the rows gets permuted, we’d lose this piece of information. So don’t organise the data like in Figure 4.
Figure 4 employs a trick often used when presenting data in a table: leave out information that didn’t change compared to the last row. But the goal of a table is to communicate data or patterns in data, not to organise the data so that they can be analysed.
Furthermore, the dataset needs to be rectangular. Figure 5 shows what not to do: The additional rows reporting some averages don’t fit in with the rest of the dataset (“Prozent Männer:” is not the ID of a Versuchsperson, and “30” is not a Geschlecht).
2.2 Naming variables
Make your life during the data analysis easier by using short but descriptive labels for variables or values in the spreadsheet. By doing so, you avoid having to constantly look up in the project’s codebook what the labels mean, thereby reducing the likelihood that you’ll make errors. Moreover, you save yourself some typing.
A few examples:
- When working with questionnaire data, don’t simply label the columns with the number of the question in the questionnaire (e.g., Q3 or Question17). Instead, use more descriptive labels such as DegreeFather or DialectUse.
- If you have a column labelled Sex that is filled with 1s and 0s, then you may end up having to look up whether the 1s refer to men or to women. Instead, either fill the column with m(an) and w(oman) values, or rename the column Man, so that you know that the 1s refer to men.
- Try to strike a balance between descriptive power and length. For instance, a variable named HowOftenDoYouSpeakStandardGerman will get on your nerves before long; SpeakStandard could be sufficient.
Having short and descriptive variable names is not an excuse for not maintaining a codebook that spells out precisely what each variable in the dataset refers to. [Example of a codebook].
2.3 Labelling missing values
In Figure 3, I left some of the translation cells empty. From this, I can deduce that the word was presented to the participant but that he or she did not translate the word. However, in other situations, it could be the case that some participants were inadvertently never shown a particular word (e.g., due to a programming error) or that some of the participants’ responses were irretrievably lost. We should enable ourselves to distinguish between these cases, for instance by marking the latter cases using NA (not available). (Alternatively, we could explicitly label cases where the participant did not translate the word with [no response] or something similar.)
If you want to be able to tell apart different reasons for missing data (such as data loss, programming errors, participants’ refusal to answer a certain question, participants’ being absent from a data collection, etc.), it’s probably easiest to just write NA in the column and add another column with comments detailing the reasons for the missingness.
Some data logging applications use -99 or -9999 to encode missingness. The problem with this is that, sometimes, -99 or -9999 don’t immediately stand out as conspicuous values.
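If you inherit a file that uses such sentinel values, you can convert them to proper missing values as soon as you read the data into R (the functions involved are covered in Sections 3 and 4). A minimal sketch, assuming a hypothetical file scores.csv with a numeric Punkte column; both names are made up for illustration:

library(tidyverse)
library(here)

# Treat -99 and -9999 (as well as empty cells) as missing while reading in the
# hypothetical file scores.csv:
scores <- read_csv(here("data", "scores.csv"), na = c("", "NA", "-99", "-9999"))

# Or convert a single (made-up) column after the fact:
scores <- scores |>
  mutate(Punkte = na_if(Punkte, -99))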
2.4 Using several smaller datasets
The spreadsheet above contains several repeated pieces of information. For instance, for all five translations provided by participant 1034, we indicated that they were provided by a woman aged 51. We can eliminate this source of redundancy by managing a handful of smaller datasets rather than just a single large one. More specifically, we can manage a dataset that contains all information that depends just on the participants, see Figure 6. Each participant has a unique ID (here: Versuchsperson), and the dataset contains a single row for each participant. If we need to correct an entry related to, say, some participant’s age, we just need to change it here – once – rather than five times in the larger dataset.
By the same token, we can put all information that depends just on the stimuli used in a separate dataset, see Figure 7. Here, too, each stimulus has a unique ID (here: Wort).
The third dataset then only contains information that depends on the combination of a particular participant and a particular word. As shown in Figure 8, each row in this dataset contains the IDs of both the participant and the stimulus that the response is related to. But any other information related to just the word or to just the participant is left out of this dataset. As we’ll see shortly, we can easily add the information related to the participants or to the words to this dataset from the other two datasets.
2.5 Miscellaneous tips
- Mind capitalisation. For some programs and computer languages (e.g., SQL), Frau and frau are the same value; for others (including R), they are not.
- Mind trailing spaces. “Mann ” (with trailing space) and “Mann” (without trailing space) are different values to a computer. (A small clean-up sketch follows after this list.)
- Special symbols, including the Umlaut, sometimes lead to problems, especially when collaborating with people whose localisation settings differ from yours.
- Format dates in the YYYY/MM/DD format. This way, if you sort the data alphabetically, they’re already in chronological order.
- Don’t use colour-coding in your spreadsheet. Or if you do use it, be aware that you’ll lose this information once you import your data into your statistics program.
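Once the data are in R, stray capitalisation and whitespace can be cleaned up with the stringr functions that come with the tidyverse (introduced in Section 3). A minimal sketch, assuming a tibble d with a Geschlecht column; both names are made up for illustration:

library(tidyverse)

# 'd' and 'Geschlecht' are hypothetical; adjust to your own data.
d <- d |>
  mutate(Geschlecht = str_trim(Geschlecht),       # "Mann " -> "Mann"
         Geschlecht = str_to_lower(Geschlecht))   # "Frau"  -> "frau"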
After you’ve entered your data into a spreadsheet, all further steps in your analysis should be carried out in R (or Python, or Julia, or what-have-you). Don’t calculate, sort, copy, paste, move, reshape, draw graphs etc. in Excel or whatever spreadsheet program you prefer. Treat your finished spreadsheet as the immutable source of data from which your results will be obtained, so that when in doubt, it’s clear which file you should go back to.
When using R (or Python, or Julia, or whatever), use (a) script(s) to read in the original dataset and convert the data in it to a format more amenable to analysis, but whatever you do, don’t overwrite the original file.
When working in a spreadsheet program, you can use the data validation functions to minimise the chances that you enter faulty data. For instance, if you know that the possible responses to a certain question are either integers from 0 to 5 or NA, make a data validation rule that prevents you from entering erroneous data.
Also see the blog post A data entry form with failsafes.
3 Working with R and RStudio
With the data available in a rectangular spreadsheet format, the next challenge is to load them into a program capable of analysing them. Several (free) options are available (JASP, Python, Julia, …), but we’ll focus on R complemented with the tidyverse suite. The main reasons for this are that R is more popular in the humanities and the social sciences and that I’m just more proficient in R. Hadley Wickham lists some further reasons.
Don’t try to work with R as you’re attending this class. Some things are bound to go wrong (perhaps a comma in the wrong spot, a missing bracket or an upper-case letter that should have been a lower-case one), and as you’re trying to fix the errors, you’ll lose track of the lecture. So work through the code snippets at home, and be patient with yourselves. I receive error messages or unexpected outcomes all the time – it’s just that after using R for 15 or so years, I’ve become reasonably proficient at spotting the problems and Googling solutions :)
3.1 R, RStudio, and the tidyverse
To install R, visit the R Project website.
In addition, I recommend you install RStudio, which facilitates working with R. To do so, visit the Posit website and download the open source desktop version of RStudio.
Once you’ve installed R and RStudio, open RStudio, and go to Tools > Global options.... In the tab General, Basic, make sure to untick the option ‘Restore .RData into workspace at startup’ and select the option ‘never’ for ‘Save workspace to .RData on exit’. These settings help ensure that you’re starting from a clean slate each time you open RStudio and prevent the results of calculations in previous sessions from messing up your analyses. Additionally, in the tab Code, Editing, tick the box next to ‘Use native pipe operator, |>’. We’ll encounter this pipe operator shortly. Finally, I recommend setting the default text encoding to UTF-8 (Code, Saving).
The functionality of R can be extended by installing further packages. We’ll be using two of these. The first is the here package, which makes it easier to read in data files if you’re working in a complex file system. Nothing in what follows hinges on your using the here package or not; I just find that it makes my life easier. To install the here package, type the following command at the R prompt:

install.packages("here")

The second package isn’t optional if you want to follow along with the rest of this course: The tidyverse actually bundles a couple of packages that are based on a set of shared principles and that facilitate working with datasets enormously. You can install it like so:
install.packages("tidyverse")
3.2 Setting up an R project
Next, in RStudio, click on File > New Project... > New Directory. Navigate to somewhere on your computer where you want to create a new directory and give this directory a name (for instance, DatasetsAndGraphs). You will use this directory to store the data and scripts that you’ll need for drawing the graphs, as well as to save the graphs themselves. You don’t have to tick the other options.
When you’re done, close RStudio. Navigate to the directory you’ve just created. You should find an .Rproj file there. Double-click it. If all is well, RStudio should fire up.
Create the following subdirectories in the directory you’re currently in: data, scripts and figs.
On github.com/janhove/DatasetsAndGraphs, you can find a couple of datasets (under data) as well as an R script that defines a new function we’ll use (under functions). Put the datasets in your own data directory and the R script in your own functions directory.
You’ve now set up an R project for this pair of lectures. R projects aren’t strictly necessary, but I find them helpful to manage different research projects, publications, classes I teach, etc.
3.3 Working with scripts
With few exceptions, you should type the commands you want to execute in an R script (File > New File > R Script) rather than directly in the R console. While nothing terrible will happen if you do enter commands directly into the console, entering them into the script editor first is a good habit to get into: it’s much easier to spot errors, save and document your analysis, make tweaks to your code, and reuse old code in the editor than in the console.
As a rule of thumb, for the analysis of large projects, I tend to create separate R scripts for different parts of the analysis that are internally closely connected but more or less independent of the other parts. For instance, I might have a script for cleaning the datasets, another script for the analyses that are reported in Chapter 4, and yet another script for the analyses reported in Chapter 5.
In the templates directory, you’ll find two R files. bare_script.R illustrates how I lay out fairly short scripts without any bells or whistles. script_with_markdown.R is more elaborate and combines snippets of R code with running text (including some mark-up such as headings and emphasis).
R scripts can be compiled as HTML files (i.e., web pages) that show both the R commands as well as their output. This only works if all the code is syntactically correct and if its proper execution does not rely on objects that are not defined in the script. Compiling scripts is a great way to verify that your scripts are syntactically correct as well as self-contained. Compile the templates that I provided to check this for yourselves. (RStudio may prompt you to install some further packages as you do so.)
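If you prefer to compile a script from the console rather than via RStudio’s menu, something along the following lines should work. This is a sketch: it assumes the rmarkdown package is installed and that the script lives in a templates subdirectory of your project, so adjust the path as needed.

# install.packages("rmarkdown")   # only needed once
library(rmarkdown)
library(here)

# Compile an R script to an HTML report (path is an example):
render(here("templates", "bare_script.R"), output_format = "html_document")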
Adopt one of the templates for your solutions to the exercises. Use a logical structure in your scripts (e.g., a new section for each exercise if you’re working with the script_with_markdown.R template). Compile the R script as a HTML file. This will only work if your R code is syntactically well-formed and self-contained. Check if the HTML file in fact shows the correct results. If not, this suggests that the results you’d originally obtained depend on some objects that are not defined in the script (or that are defined differently in the script from how you think they are defined). Hand in both the R file as well as the HTML file. Please use one script for all the exercises related to the datasets lecture and another script for all the exercises related to the data visualisation lecture.
3.4 Loading the packages
We’ll work with the here and tidyverse packages below. Even if you’ve already installed these, you need to load these packages in every R session in which you use them. To do so, execute the following lines:
library(here)
here() starts at C:/Users/VanhoveJ/switchdrive/Documents/Workshops/DatasetsAndGraphs
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The output of the library(here) command tells you the path relative to which the here() function in the here package will look for files etc. The messages generated by the library(tidyverse) command tell you which of the packages that make up the tidyverse suite are loaded by default. Some of the functions in these packages have the same name as functions that are part of packages that are loaded automatically whenever you start up R. What this means is that, if you want to use, say, the filter() function from the stats package, you need to use the notation stats::filter() instead of just filter().
Incidentally, you don’t need to install these packages every time you need them, but you do need to load them in every new session in which you use them. I recommend loading all of the packages you need at the top of your script. Remove any library() commands for packages you ended up not using in the script.
Both R itself as well as R packages evolve. When trying to run old scripts using new software versions, you may run into compatibility problems: the code may not run or, worse, it will run but produce different results. For this reason, it’s a good idea to keep track of the versions of R and any packages you used. To this end, you can include the command sessionInfo() at the bottom of your scripts. At the bottom of this page, I’ve used a different command that serves the same purpose but that requires another add-on package to be installed.
If you should ever observe that your code doesn’t run as expected, you can now check if differences in software versions could be to blame. If so, you could try installing the old software versions to see if this solves the problem. [Using different R versions; Installing old package versions]
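For instance, one way to install an older release of a package is via the remotes package; a sketch (the version number here is just an example, not a recommendation):

# install.packages("remotes")   # only needed once
remotes::install_version("dplyr", version = "1.1.2")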
4 Reading in datasets
4.1 Excel files
We’ll focus on reading in two kinds of spreadsheets: Excel spreadsheets in the XLS(X) format, and CSV files.
In order to read in XLS(X) files, we need the readxl package. This package is part of the tidyverse suite, but it does not get loaded automatically as you load the tidyverse suite. So we load it separately:
library(readxl)
Assuming the file uebersetzungen.xlsx is located in the data subdirectory of your R project directory, we can read it into R as follows.

translations <- read_excel(here("data", "uebersetzungen.xlsx"))
The dataset is now loaded as a so-called tibble named translations. We can display it by simply typing its name at the prompt:
translations
You’re going to take away much more from this primer if you copy the code snippets by typing them rather than by copy-pasting them.
Rectangular data sets are referred to as data frames in R. The tidyverse slightly changes their functionality, mostly in order to allow for prettier displaying, and refers to them as tibbles.
Note that the empty cells are automatically interpreted as missing data (<NA>). The documentation of the read_excel() function, which you can access by typing ?read_excel at the R prompt, suggests that we can override this behaviour, but we can’t.
If, instead of printing the entire tibble at the prompt, we just want to display the first few rows, we can use slice_head():
slice_head(translations, n = 4)
slice_tail() works similarly. If you want to display a random selection of rows, you can use slice_sample():
slice_sample(translations, n = 5)
Again, for details, you can check the documentation of these functions that is available at ?slice_head.
Exercise 1 (Excel files with multiple sheets) The file VanDenBroek2018.xlsx contains the data from three experiments reported by van den Broek et al. (2018). Using the read_xlsx() function, read in the data from the second experiment.
Hint: Inspect the Excel file and then consult the help page of read_xlsx() by means of ?read_xlsx.
4.2 CSV files
A popular format for storing spreadsheets is the CSV format. CSV is short for comma-separated values: Cells on the same row are separated by commas, see Figure 9. Sometimes, text strings are additionally surrounded by quotation marks.
To read in CSV files, we can use the read_csv() function, which is part of the readr package, which in turn is automatically loaded when the tidyverse suite is loaded:

translations <- read_csv(here("data", "uebersetzungen.csv"))
Rows: 20 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Wort, Übersetzung
dbl (3): Versuchsperson, Position, Richtig
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
translations
We receive a couple of messages (not errors!). These show that the read_csv() function correctly recognised that the columns Wort and Übersetzung contain text (‘chr’, character) strings, whereas Versuchsperson, Position and Richtig contain numbers (‘dbl’ for ‘double’, a number format).
With read_csv(), we can specify that only cells containing “NA” are marked as missing data:

translations2 <- read_csv(here("data", "uebersetzungen.csv"), na = "NA")
Rows: 20 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Wort, Übersetzung
dbl (3): Versuchsperson, Position, Richtig
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
translations2
Note that the Übersetzung value in row 4 is just empty, not NA.
Let’s also read in the datasets containing the information pertaining to the participants and items:
participants <- read_csv(here("data", "versuchspersonen.csv"))
Rows: 4 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Geschlecht, Englisch
dbl (2): Versuchsperson, Alter
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
items <- read_csv(here("data", "woerter.csv"))
Rows: 5 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Wort, RichtigeÜbersetzung
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you save an Excel spreadsheet as a CSV file on a French- or German-language computer system, the cells will be separated using semicolons rather than using commas. The reason is that commas are used as decimal separators in French and German. To read in ‘CSV’ files that use semicolons as cell separators, you can use the read_csv2() function.
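A quick sketch, assuming a hypothetical semicolon-separated file uebersetzungen_semikolon.csv in the data subdirectory (the file name is made up for illustration):

# 'uebersetzungen_semikolon.csv' is a made-up file name.
translations_semi <- read_csv2(here("data", "uebersetzungen_semikolon.csv"))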
Incidentally, if you use LibreOffice.org Calc instead of Excel, you can choose which cell separator gets used if you export a spreadsheet as a CSV file.
read_csv() vs. read.csv()
More seasoned R users are probably already familiar with the read.csv() function (with a dot rather than an underscore). The read_csv() function is merely the tidyverse counterpart to this function. The practical differences between them are negligible.
4.3 Other formats
Some people prefer to use tabs or spaces to separate cells rather than commas or semicolons. Consult the help page for the read_tsv() and read_delim() functions to see how you can read in data using other separators. Particularly the delim and quote arguments are relevant.
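For example, a sketch for reading a hypothetical tab-separated file results.txt whose text cells are wrapped in double quotes (the file name is made up for illustration):

# 'results.txt' is a made-up file name.
results <- read_delim(here("data", "results.txt"), delim = "\t", quote = "\"")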
For reading in data from Google Sheets, see googlesheets4.
For interacting with Google Drive, see googledrive.
For reading in SPSS, Stata and SAS files, see haven.
5 Joining datasets
Having read in our three datasets as translations, participants, and items, we want to merge these datasets into one large dataset. The most useful function for this is left_join(), which takes the dataset passed as its x argument and adds to this the corresponding rows from the y dataset:
all_data <- left_join(x = translations, y = participants)
Joining with `by = join_by(Versuchsperson)`
Note that the left_join() function recognises that the variable shared between both datasets is called Versuchsperson. Hence, the information related to participant 1034 contained in participants is added to each row in translations where the Versuchsperson value is 1034, and similarly for the other participants:
all_data
If you don’t want the left_join() function to figure out what the shared variable is, you can specify it explicitly:

all_data <- left_join(x = translations, y = participants, by = "Versuchsperson")
If the shared variable has different names in the different datasets, you can use something like
new <- left_join(x = left_dataset, y = right_dataset, by = join_by(var_left == var_right))
See ?join_by for more information.
If there are values in the shared variable that occur in the x dataset but don’t occur in the y dataset, the values in the added columns will read NA for these rows.
Further join functions are the following; see ?join for details:
- right_join(x, y): Keep all entries in y. Add corresponding entries in x to it.
- full_join(x, y): Keep all entries in both x and y. Add NAs if there is no corresponding entry in the other dataset.
- inner_join(x, y): Only keep entries in x for which there is a corresponding entry in y. Add these corresponding entries.
- semi_join(x, y): Only keep entries in x for which there is a corresponding entry in y. Don’t add these corresponding entries.
- anti_join(x, y): Only keep entries in x for which there are no corresponding entries in y.
In the current example, left_join, full_join and inner_join would all yield the same result. But this isn’t always the case.
Let’s also add the information pertaining to the words:
all_data <- left_join(all_data, items)
Joining with `by = join_by(Wort)`
Incidentally, another way of achieving the same result is this:
all_data <- translations |>
  left_join(participants) |>
  left_join(items)
Joining with `by = join_by(Versuchsperson)`
Joining with `by = join_by(Wort)`
Exercise 2 (Join functions)
- Use the following code to generate two tibbles called left and right:

left <- tibble(A = c("a", "b", "c", NA),
               B = c(1, 2, NA, 4))
right <- tibble(B = c(1, 3, 4, 4),
                C = c(10, NA, 12, 7))
left
right
- Without running these commands, predict what their output will look like:
left_join(x = left, y = right)
right_join(x = left, y = right)
full_join(x = left, y = right)
inner_join(x = left, y = right)
semi_join(x = left, y = right)
semi_join(x = right, y = left) # !
anti_join(x = left, y = right)
anti_join(x = right, y = left) # !
Now run the commands above to verify your predictions.
Create two new tibbles using the code below:
left <- tibble(A = c("a", "b"),
               B = c(1, NA))
right <- tibble(B = c(1, NA, NA),
                C = c(0, 1, 2))
left
right
- Consult the help page for left_join and look up the na_matches parameter under ‘Arguments’. Predict what the output of the following two commands will look like, and only then check your answer.
left_join(left, right)
left_join(left, right, na_matches = "never")
6 Queries
6.1 Selecting by row number
We can select the third row of a dataset like so:
slice(all_data, 3)
Alternatively, we can write this command as follows. The symbol |> allows us to take an object (here: all_data) and pass it to a function as its first argument. As we’ll later see, we can use |> to string together a host of function calls without creating illegible code.

all_data |>
  slice(3)
We can also select multiple rows:
# Rows 5 and 7
all_data |>
  slice(c(5, 7))

# Rows 5 to (and including) 7
all_data |>
  slice(5:7)
The results of such actions can be stored as separate objects, for instance, like so:
rows7_12 <- all_data |>
  slice(7:12)
rows7_12
We can export this new object as a CSV file like so:
write_csv(rows7_12, here("data", "rows7_12.csv"))
6.2 Selecting by values
Selecting rows by their number isn’t too useful. But selecting rows satisfying some set of conditions is very useful. Here are a few examples:
# All data corresponding to the word 'fiende'
all_data |>
  filter(Wort == "fiende")
# Note: use '!=' for 'not equal to'.

# All data corresponding to participants older than 30
all_data |>
  filter(Alter > 30)
# Note: use '>=' for 'at least as old as',
# '<=' for 'no older than',
# and '<' for 'younger than'.
We can use is.na() to check for missing values. Note the use of ! to negate a condition.
all_data |>
  filter(is.na(Übersetzung))

all_data |>
  filter(!is.na(Übersetzung))
We can string together multiple filter() calls:
# incorrect translations to first word
all_data |>
  filter(Position == 1) |>
  filter(Richtig == 0)
An alternative way of writing this is as follows:
all_data |>
  filter(Position == 1 & Richtig == 0)
‘or’ conditions can be created using |:
# translations to first word or incorrect translations
all_data |>
  filter(Position == 1 | Richtig == 0)
Note that in logic, ‘or’ is always inclusive. Exclusive ‘or’ (‘xor’) can be obtained as follows:
# translations to first word or incorrect translations, but not both
all_data |>
  filter(Position == 1 | Richtig == 0) |>
  filter(!(Position == 1 & Richtig == 0))
Alternatively,
all_data |>
  filter(xor(Position == 1, Richtig == 0))
Exercise 3 (Filtering)
- Run the following commands:

d1 <- all_data |>
  filter(Übersetzung == "vorsichtig")
d2 <- all_data |>
  filter(Übersetzung != "vorsichtig")
Explain what both commands achieve.
How many rows are there in d1? How many in d2? How many in all_data? Explain.
Create a tibble d3 that contains only those rows in all_data where the participants did not translate the word as vorsichtig.
6.3 Selecting columns
Sometimes, a dataset is too cumbersome to handle because it contains a lot of irrelevant columns. Using select(), we can select those columns that are of interest. For instance,
all_data |>
  select(Wort, RichtigeÜbersetzung, Übersetzung) |>
  slice_head(n = 5)
There are a couple of auxiliary functions that make it easier to select columns. These are especially useful when working with large datasets. Examples are contains() and starts_with():
all_data |>
  select(contains("Übersetzung"))

all_data |>
  select(starts_with("Richt"))
See the tidyselect documentation for further functions.
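Two further helpers worth knowing are ends_with() and matches(); a quick sketch on the dataset constructed above:

all_data |>
  select(ends_with("ung"))       # Übersetzung, RichtigeÜbersetzung
all_data |>
  select(matches("^Richtig"))    # Richtig, RichtigeÜbersetzung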
6.4 Further examples
We can string together different types of commands:
# All translations for 'fiende'
all_data |>
  filter(Wort == "fiende") |>
  select(Übersetzung)
# All *distinct* translations for 'behärska':
all_data |>
  filter(Wort == "behärska") |>
  select(Übersetzung) |>
  distinct()
Without the pipe (|>), these commands become difficult to read:
distinct(select(filter(all_data, Wort == "behärska"), Übersetzung))
7 Pivoting
In the course of an analysis, it is often necessary to convert a long-ish dataset to a wider one, and vice versa. This process is known as pivoting. To illustrate pivoting, we’ll make use of a more realistic – and more complicated – dataset derived from a longitudinal project on the development of reading and writing skills in Portuguese–French and Portuguese–German bilingual children (Desgrippes, Lambelet, and Vanhove 2017; Pestana, Lambelet, and Vanhove 2017). We read in the dataset helascot_skills.csv as skills:

skills <- read_csv(here("data", "helascot_skills.csv"))
Rows: 1904 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, LanguageTested
dbl (4): Time, Reading, Argumentation, Narration
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skills
For each participant (Subject) at each Time (T1, T2, T3) and for each LanguageTested (Portuguese, French, German), we have up to three scores, Reading, Argumentation and Narration, arranged next to each other. But let’s say we wanted to compute, for each participant, their progress on each task in each language from T1 to T2 and from T2 to T3. The way the data are laid out at present, this would at best be pretty difficult to do. But it would be easy if only the data were arranged differently, namely with all three measurements per participant and task next to each other. We can convert this dataset to the desired format in two steps.
First, we make the dataset longer by stacking the different scores underneath each other rather than next to each other. To this end, we use the function pivot_longer() and specify that we want to stack the values in the Reading, Argumentation, and Narration columns under each other, that we want to call the resulting column Score, and that we want to put the column names in a new column called Skill:
skills_longer <- skills |>
  pivot_longer(cols = c("Reading", "Argumentation", "Narration"),
               names_to = "Skill", values_to = "Score")
skills_longer
Then, we make this tibble wider by putting the three measurements per skill and language next to each other using pivot_wider(). We also prefix the Time values with a T.
skills_wider_time <- skills_longer |>
  pivot_wider(names_from = Time, values_from = Score,
              names_prefix = "T")
skills_wider_time
Using mutate(), we can now easily compute the differences between T1 and T2 and between T2 and T3:
skills_wider_time |>
  mutate(ProgressT1_T2 = T2 - T1,
         ProgressT2_T3 = T3 - T2) |>
  select(Subject, LanguageTested, Skill, ProgressT1_T2, ProgressT2_T3)
The two-step approach shown above is one that I’ve found generally useful. First convert the tibble to a format that is longer than needed, then pivot it to the wider format desired.
Now imagine that we wanted to compute, for each participant in each skill at each data collection, the difference between the Portuguese score and the German/French score. The first step is the same, resulting in skills_longer. The second step is now similar, but we put the different languages next to each other:
skills_wider_language <- skills_longer |>
  pivot_wider(names_from = LanguageTested, values_from = Score)
skills_wider_language
Incidentally, not all values for German are NA. It’s just that the first couple of children were tested in French and Portuguese, not in German. We can check this like so:
skills_wider_language |>
  filter(!is.na(German)) |>
  slice_sample(n = 10)
If we needed to, we could make this dataset wider still:
skills_wider_time_language <- skills_longer |>
  pivot_wider(names_from = c(LanguageTested, Time),  # combination of Language and Time
              values_from = Score)
skills_wider_time_language
We could reconvert this tibble to a long one using the code snippet below. The code becomes a bit more complex.
- The notation French_1:German_3 selects all columns between French_1 and German_3 (inclusive). Alternatively, we could have used starts_with(c("French", "Portuguese", "German")) to select the relevant columns.
- A so-called regular expression (regex) is used to split up these column names into the Language bit and the Time bit. The split happens at the underscore (_); each of these column names contains exactly one underscore.
skills_wider_time_language |>
  pivot_longer(cols = French_1:German_3,
               names_to = c("Language", "Time"),
               names_pattern = "(.*)_(.*)",
               values_to = "Score")
For the present example, we don’t need this code snippet since we already have skills_longer. But I wanted to illustrate that such conversions are possible. If you ever need to convert a similar dataset to a longer format, you now know that you can look up the details on the help page of pivot_longer() (?pivot_longer) and take it from there.
With regular expressions, it’s less important to know them than to know about them. Any time a string consists of several pieces of information combined in a more or less predictable way, regular expressions can be used to extract these pieces. That said, compiling regular expressions that actually do the job is pretty difficult. But once you’re aware that they exist and can be used to such ends, you can enlist the help of AI tools to construct them.
Incidentally, the regex used above ((.*)_(.*)) matches any two groups of characters ((.*)) preceding and following an underscore (_).
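To see the regex in action outside of pivot_longer(), you can try it out with str_match() from stringr (loaded with the tidyverse). The first column of the resulting matrix is the full match; the remaining columns are the captured groups. The output should look roughly like this:

str_match(c("French_1", "Portuguese_3"), "(.*)_(.*)")
#      [,1]           [,2]         [,3]
# [1,] "French_1"     "French"     "1"
# [2,] "Portuguese_3" "Portuguese" "3"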
8 Summaries
Using the summarise() function, we can easily summarise large tibbles. For instance, we can compute the average (mean) narration and argumentation scores in the skills tibble like so:
skills |>
  summarise(mean_narr = mean(Narration, na.rm = TRUE),
            mean_arg = mean(Argumentation, na.rm = TRUE))
We set the na.rm parameter in the mean() call to TRUE since there are several missing observations in both the Narration and Argumentation variables. If we don’t set na.rm to TRUE, the result of both computations would be NA. By setting na.rm to TRUE, missing observations are ignored.
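As a quick illustration at the console:

mean(c(1, 2, NA))                 # NA
mean(c(1, 2, NA), na.rm = TRUE)   # 1.5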
summarise() is often used in conjunction with group_by(), which splits up a tibble into subgroups. This way, we can straightforwardly compute summaries for different subgroups. For instance, if we wanted to compute the mean narration and argumentation scores for each time of data collection and each language tested, we could use the following code snippet:
skills |>
  group_by(Time, LanguageTested) |>
  summarise(mean_narr = mean(Narration, na.rm = TRUE),
            mean_arg = mean(Argumentation, na.rm = TRUE),
            .groups = "drop")
By setting .groups = "drop", you make sure that the grouping applied to skills doesn’t apply to the summary tibble any more. (It’s not so important.)
We can treat these summary tibbles like ordinary tibbles and apply all of the other commands to them:
skills |>
  group_by(Time, LanguageTested) |>
  summarise(mean_narr = mean(Narration, na.rm = TRUE),
            .groups = "drop") |>
  pivot_wider(names_from = "Time", names_prefix = "T", values_from = "mean_narr")
In addition to mean(), here are some further functions that are often used when computing summaries; see the functions’ help pages for details:
- median(),
- quantile(),
- min() and max(),
- sd() and var() for the sample standard deviation and sample variance, respectively,
- mad() for the (adjusted) median absolute deviation,
- n() for obtaining the number of observations,
- sum().
The mean() function can also be used to compute proportions. Consider the following example. The is.na() function checks if a value is NA (in which case it returns TRUE) or not (FALSE). If we compute the mean() of a bunch of TRUE/FALSE values, we obtain the proportion of values that are TRUE. Similarly, the sum() of a bunch of TRUE/FALSE values is the number of TRUE values. Hence, we can quickly obtain the proportion, and number, of the missing Narration scores for each combination of Time and LanguageTested:
skills |>
  group_by(Time, LanguageTested) |>
  summarise(
    prop_narr_NA = mean(is.na(Narration)),
    nr_narr_NA = sum(is.na(Narration)),
    n = n(),
    .groups = "drop"
  )
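If the logic of averaging TRUE/FALSE values seems opaque, here is a minimal toy illustration at the console (the vector x is made up):

x <- c(2, NA, 5, NA)
is.na(x)         # FALSE  TRUE FALSE  TRUE
mean(is.na(x))   # 0.5 (proportion of missing values)
sum(is.na(x))    # 2   (number of missing values)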
You can write your own functions. For instance, an alternative to the mean and the median is the Hodges–Lehmann estimator. Let \(X = (X_1, \dots, X_n)\) be a data vector. To compute the Hodges–Lehmann estimator of \(X\)’s central tendency, proceed as follows:
Compute the pairwise averages (‘Walsh averages’) of the \(X\) values, that is, compute \(\frac{X_i + X_j}{2}\) for each pair \(i, j\) with \(1 \leq i \leq j \leq n\).
Compute the median of these Walsh averages.
We can define a function that computes this estimator like so:
hodges_lehmann <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]
  }
  pairwise_means <- outer(x, x, FUN = "+") / 2  # computes *all* means of two observations
  median(pairwise_means[upper.tri(pairwise_means, diag = TRUE)])  # median of the means of x_i, x_j with i <= j
}
We can then use this function in summarise() like so:
skills |>
  group_by(Time, LanguageTested) |>
  summarise(
    hl_narr = hodges_lehmann(Narration, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_wider(names_from = "Time", names_prefix = "T", values_from = "hl_narr")
9 String manipulation
It often happens that a single cell in a dataset contains different pieces of information. This is the case in our current example, too. The first participant in the dataset is referred to as A_PLF_1:
- The A in this ID refers to their class.
- The PLF tells us that this participant resided in French-speaking Switzerland (F), had a Portuguese background (P) and took Portuguese heritage and language courses (L). Other abbreviations in the dataset are CD, CF, and CP for comparison groups in German-speaking Switzerland, French-speaking Switzerland, and Portugal, respectively, PLD for participants residing in German-speaking Switzerland with a Portuguese background who attended a Portuguese course, as well as PND and PNF for participants in German- and French-speaking Switzerland, respectively, with a Portuguese background who did not take Portuguese classes.
- The 1 uniquely identifies this participant within their class.
It could make sense to split up the information contained in this one cell into multiple cells. Thankfully, the strings in Subject are structured logically and consistently: the different pieces of information are separated using underscores ("_"). We can hence split these strings at the underscores and retrieve the first and second pieces like so, obviating the need for more complicated regular expressions:
skills_wider_language <- skills_wider_language |>
  mutate(
    Class = str_split_i(Subject, "_", 1),
    Group = str_split_i(Subject, "_", 2)
  )

# check:
skills_wider_language |>
  select(Subject, Class, Group) |>
  sample_n(10)
Refer to the stringr documentation for further guidance.
We can now further break down the information contained in the new Group column, for instance as follows:
skills_wider_language <- skills_wider_language |>
  mutate(language_group = case_when(
    Group == "CP" ~ "Portuguese control",
    Group == "CF" ~ "French control",
    Group == "CD" ~ "German control",
    Group %in% c("PLF", "PNF") ~ "French-Portuguese",
    Group %in% c("PLD", "PND") ~ "German-Portuguese",
    .default = "other"
  )) |>
  mutate(heritage_course = case_when(
    Group %in% c("PLF", "PLD") ~ 1,
    .default = 0
  )) |>
  mutate(has_German = case_when(
    Group %in% c("CD", "PLD", "PND") ~ 1,
    .default = 0
  )) |>
  mutate(has_French = case_when(
    Group %in% c("CF", "PLF", "PNF") ~ 1,
    .default = 0
  )) |>
  mutate(has_Portuguese = case_when(
    Group %in% c("CF", "CD") ~ 0,
    .default = 1
  ))

# check:
skills_wider_language |>
  select(Subject, language_group, heritage_course, has_German, has_French, has_Portuguese) |>
  sample_n(10)
10 A full-fledged example
We had the children in the French/German/Portuguese project write short narrative and argumentative texts in each of their languages at three points in time. The quality of these texts was scored using a grid (Desgrippes, Lambelet, and Vanhove 2017); it is these scores that are listed in the skills tibble we worked with above. In addition, 3,060 of these texts were rated for their lexical richness by between two and eighteen naïve (i.e., non-instructed) raters each using a 1-to-9 scale (Vanhove et al. 2019). The individual ratings are available in the file helascot_ratings.csv. Furthermore, we computed a bunch of lexical metrics for each text, such as the number of tokens, the mean corpus frequency of the words occurring in the text, etc. These metrics are available in the file helascot_metrics.csv; see Vanhove (2017) for details.
We’ll use these datasets to answer three questions:
- What’s the relation between the average lexical richness ratings per text and the text’s Guiraud index? (The Guiraud index is the ratio of the number of types in a text to the square root of the number of tokens in that text; a small sketch of this computation follows right after this list.)
- What’s the relation between the average lexical richness ratings per text and the mean corpus frequency of the words occurring in the texts?
- What’s the relation between the grid-based ratings and the lexical richness ratings?
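The Guiraud index itself is simple to compute by hand. A minimal sketch of the formula (the metrics dataset already contains a Guiraud column, so this is purely illustrative, and the numbers are made up):

guiraud <- function(n_types, n_tokens) {
  n_types / sqrt(n_tokens)
}
guiraud(n_types = 50, n_tokens = 120)  # about 4.56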
In doing so, we’ll need to make use of some of the tools introduced earlier.
10.1 Reading in the data
Let’s start from a clean slate and read in the three datasets:
skills <- read_csv(here("data", "helascot_skills.csv"))
Rows: 1904 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Subject, LanguageTested
dbl (4): Time, Reading, Argumentation, Narration
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
metrics <- read_csv(here("data", "helascot_metrics.csv"))
Rows: 3060 Columns: 137
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Text, Subject, Text_Language, Text_Type, LemmawordsNotInSUBTLEX, ...
dbl (129): Time, TTR, Guiraud, nTokens, nTypes, nLemmas, meanWordLength, MST...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ratings <- read_csv(here("data", "helascot_ratings.csv"))
Rows: 29179 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): Batch, Rater, Text, Subject, Text_Language, Text_Type, Rater_Sex, ...
dbl (4): Trial, Rating, Time, Rater_Age
lgl (1): Rater_NativeLanguageOther
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Inspect the structure of these three datasets. Note that
- skills contains one row per combination of Subject, Time and LanguageTested;
- metrics contains one row per text, i.e., one row per combination of Subject, Text_Language, Text_Type and Time;
- ratings contains one row per rating, i.e., one row per combination of Rater and Text.
These datasets are already pretty clean and a lot of work (partly manual, partly using R, partly using other tools) went into creating them. See the technical report as well as the R code available from osf.io.
There are many ways to skin a cat. But it seems to me that the following course of action is reasonable:
- Using ratings, compute the average Rating per text.
- Add these average ratings to metrics and draw some plots. (More on plotting in the second part!)
- Make skills longer and add the average ratings to it. Then draw some more plots.
10.2 Average rating per text
rating_per_text <- ratings |>
  group_by(Text, Subject, Text_Language, Text_Type, Time) |>
  summarise(mean_rating = mean(Rating),
            n_ratings = n(),
            .groups = "drop")
Exercise 4 (Filtering out bilingual raters.) Some of the raters consider themselves native speakers of several languages (bi-French etc.), others consider themselves monolingual native speakers (mono-French etc.):
table(ratings$Rater_NativeLanguage)
bi-French bi-German bi-Portuguese mono-French mono-German
1049 2241 728 6235 14149
mono-Portuguese
4777
Compute the average text ratings using only ratings from monolingual French speakers.
10.3 Add to metrics
The tibbles metrics and rating_per_text share a variable (Text), so the average ratings can easily be added to metrics:

metrics_ratings <- metrics |>
  left_join(rating_per_text)
Joining with `by = join_by(Text, Subject, Text_Language, Text_Type, Time)`
We’ll learn in the second part how to draw meaningful data plots, but here’s one way of visualising the relationship between Guiraud and mean_rating:
metrics_ratings |>
  ggplot(aes(x = Guiraud, y = mean_rating)) +
  geom_point(shape = 1) +
  facet_grid(rows = vars(Time, Text_Type),
             cols = vars(Text_Language),
             scales = "free_y") +
  xlab("Guiraud value") +
  ylab("average lexical richness rating")
Similarly, for meanSUBTLEX (average corpus frequency):
metrics_ratings |>
  ggplot(aes(x = meanSUBTLEX, y = mean_rating)) +
  geom_point(shape = 1) +
  facet_grid(rows = vars(Time, Text_Type),
             cols = vars(Text_Language),
             scales = "free_y") +
  xlab("mean SUBTLEX frequency") +
  ylab("average lexical richness rating")
correlations <- metrics_ratings |>
  group_by(Text_Language, Time, Text_Type) |>
  summarise(
    cor_guiraud = cor(mean_rating, Guiraud),
    cor_frequency = cor(mean_rating, meanSUBTLEX),
    .groups = "drop"
  )
correlations
10.4 Add to skills
rating_gridscore <- skills |>
  pivot_longer(Reading:Narration, names_to = "skill",
               values_to = "grid_score") |>
  filter(skill != "Reading") |>
  mutate(Text_Type = case_when(
    skill == "Argumentation" ~ "arg",
    skill == "Narration" ~ "narr"
  )) |>
  full_join(rating_per_text,
            by = join_by(Text_Type, Subject, Time, LanguageTested == Text_Language))
rating_gridscore |>
  filter(!is.na(grid_score)) |>
  filter(!is.na(mean_rating)) |>
  ggplot(aes(x = mean_rating, y = grid_score)) +
  geom_point(shape = 1) +
  facet_grid(rows = vars(Time, Text_Type),
             cols = vars(LanguageTested),
             scales = "free_y") +
  xlab("average lexical richness rating") +
  ylab("grid-based text quality rating")
rating_gridscore |>
  filter(!is.na(grid_score)) |>
  filter(!is.na(mean_rating)) |>
  group_by(LanguageTested, Time, Text_Type) |>
  summarise(
    cor_score = cor(mean_rating, grid_score),
    .groups = "drop"
  )
11 Walkthrough with exercises
In the previous sections, we started from a pretty clean dataset containing the writing data. The goal of the present section is to give you some insight into how this dataset was compiled starting from a more basic representation of the underlying data. This section is organised as a series of exercises that you should tackle in order. In completing these exercises, you’ll convert a very wide dataset (with lots of criterion scores for up to twelve texts per child on a single row) into a dataset that contains a single composite score for a single text on each row. This process is considerably more complicated than running a couple of pivot_*() and summarise() commands due to how the data were laid out. Some of the instructions in the exercises are intentionally not spelt out in great detail: In real life, it’s up to you to figure out the intermediate steps, too.
Exercise 5 (Reading in the data) The file writing_data.csv contains the data reported on in Desgrippes, Lambelet, and Vanhove (2017). Importing it into R requires some additional work.
- Open the file using Notepad (Windows), TextEdit (Mac) or whatever plain text editor you prefer. What character is used to separate different cells?
- Try to read it in as a tibble named writing.
- You may have obtained the following error message: Error in nchar(x, "width") : invalid multibyte string, element 1. Error messages can be notoriously opaque in R, but experience suggests that the accents and umlauts may have something to do with this error. To fix the error, try to read in the dataset again, but additionally override the locale argument of the function you used in Step 2 as follows: locale = locale(encoding = "Windows-1252"). This is a Windows character encoding for Western languages; see Wikipedia for an overview of common character encodings.
- Inspect the resulting tibble. If all is well, it should contain 473 rows and 288 columns. Columns such as Sco1R_Nb_mots and Sco1R_unitThemNbre should be recognised as containing numbers (<dbl>, short for ‘double’, a format for representing numbers in computers), whereas columns such as VPNID and Sco1R_unitThem should be recognised as containing text strings (<chr>, short for ‘character’). You’ll also observe that the first column wasn’t named in the CSV file, so R assigned some meaningless name to it.
When saving spreadsheets as CSV files, use the UTF-8 character encoding. In Excel, you can select the character encoding from a drop-down menu when exporting the spreadsheet as a CSV; in LibreOffice.org Calc you’ll be asked to specify how you want to save the CSV file exactly in a pop-up window.
If all went well, the writing tibble has the following columns:

colnames(writing)
[1] "...1" "VPNID"
[3] "X.x" "Sco1R_Nb_mots"
[5] "Sco1R_titre" "Sco1R_Signature"
[7] "Sco1R_GP" "Sco1R_Fid"
[9] "Sco1R_unitThem" "Sco1R_unitThemNbre"
[11] "Sco1R_Destinataire" "Sco1R_AncrTemp_Echelle"
[13] "Sco1R_AncrTemp_occurences" "Sco1R_ProgrChrono_Occ"
[15] "Sco1R_ProgrChrono" "Sco1R_AncrSpat"
[17] "Sco1R_RéfSpat" "Sco1R_Textual"
[19] "Sco1R_Coh_verb" "Sco1R_Description"
[21] "Sco1R_Ponct" "Sco1R_Maj"
[23] "Sco1R_Synt" "Sco1R_Genre"
[25] "Sco1R_Nombre" "Sco1R_Cas"
[27] "Sco1R_Verbal" "Sco1R_Dialekt"
[29] "Sco1R_remarques" "P1R_Nb_mots"
[31] "P1R_titre" "P1R_Signature"
[33] "P1R_GP" "P1R_Fid"
[35] "P1R_unitThem" "P1R_unitThemNbre"
[37] "P1R_Destinataire" "P1R_AncrTemp_Echelle"
[39] "P1R_AncrTemp_occurences" "P1R_ProgrChrono_Occ"
[41] "P1R_ProgrChrono" "P1R_AncrSpat"
[43] "P1R_RéfSpat" "P1R_Textual"
[45] "P1R_Coh_verb" "P1R_Description"
[47] "P1R_Ponct" "P1R_Maj"
[49] "P1R_Synt" "P1R_Genre"
[51] "P1R_Nombre" "P1R_Verbal"
[53] "P1R_remarques" "Sco2R_Nb_mots"
[55] "Sco2R_titre" "Sco2R_Signature"
[57] "Sco2R_GP" "Sco2R_Fid"
[59] "Sco2R_unitThem" "Sco2R_unitThemNbre"
[61] "Sco2R_Destinataire" "Sco2R_AncrTemp_Echelle"
[63] "Sco2R_AncrTemp_occurences" "Sco2R_ProgrChrono_Occ"
[65] "Sco2R_ProgrChrono" "Sco2R_AncrSpat"
[67] "Sco2R_RéfSpat" "Sco2R_Textual"
[69] "Sco2R_Coh_verb" "Sco2R_Description"
[71] "Sco2R_Ponct" "Sco2R_Maj"
[73] "Sco2R_Synt" "Sco2R_Genre"
[75] "Sco2R_Nombre" "Sco2R_Cas"
[77] "Sco2R_Verbal" "Sco2R_Dialekt"
[79] "Sco2R_remarques" "P2R_Nb_mots"
[81] "P2R_titre" "P2R_Signature"
[83] "P2R_GP" "P2R_Fid"
[85] "P2R_unitThem" "P2R_unitThemNbre"
[87] "P2R_Destinataire" "P2R_AncrTemp_Echelle"
[89] "P2R_AncrTemp_occurences" "P2R_ProgrChrono_Occ"
[91] "P2R_ProgrChrono" "P2R_AncrSpat"
[93] "P2R_RéfSpat" "P2R_Textual"
[95] "P2R_Coh_verb" "P2R_Description"
[97] "P2R_Ponct" "P2R_Maj"
[99] "P2R_Synt" "P2R_Genre"
[101] "P2R_Nombre" "P2R_Verbal"
[103] "P2R_remarques" "Sco3R_Nb_mots"
[105] "Sco3R_titre" "Sco3R_Signature"
[107] "Sco3R_GP" "Sco3R_Fid"
[109] "Sco3R_unitThem" "Sco3R_unitThemNbre"
[111] "Sco3R_Destinataire" "Sco3R_AncrTemp_Echelle"
[113] "Sco3R_AncrTemp_occurences" "Sco3R_ProgrChrono_Occ"
[115] "Sco3R_ProgrChrono" "Sco3R_AncrSpat"
[117] "Sco3R_RéfSpat" "Sco3R_Textual"
[119] "Sco3R_Coh_verb" "Sco3R_Description"
[121] "Sco3R_Ponct" "Sco3R_Maj"
[123] "Sco3R_Synt" "Sco3R_Genre"
[125] "Sco3R_Nombre" "Sco3R_Cas"
[127] "Sco3R_Verbal" "Sco3R_Dialekt"
[129] "Sco3R_remarques" "P3R_Nb_mots"
[131] "P3R_titre" "P3R_Signature"
[133] "P3R_GP" "P3R_Fid"
[135] "P3R_unitThem" "P3R_unitThemNbre"
[137] "P3R_Destinataire" "P3R_AncrTemp_Echelle"
[139] "P3R_AncrTemp_occurences" "P3R_ProgrChrono_Occ"
[141] "P3R_ProgrChrono" "P3R_AncrSpat"
[143] "P3R_RéfSpat" "P3R_Textual"
[145] "P3R_Coh_verb" "P3R_Description"
[147] "P3R_Ponct" "P3R_Maj"
[149] "P3R_Synt" "P3R_Genre"
[151] "P3R_Nombre" "P3R_Verbal"
[153] "P3R_remarques" "hkngroup"
[155] "Schulsprache" "zweiSprachig"
[157] "Group" "PartDef"
[159] "X.y" "Sco1A_Nb_mots"
[161] "Sco1A_Lieu" "Sco1A_Date"
[163] "Sco1A_Appel" "Sco1A_Salutation"
[165] "Sco1A_Signature" "Sco1A_GP"
[167] "Sco1A_Objet" "Sco1A_Raisons"
[169] "Sco1A_TC" "Sco1A_Articulation"
[171] "Sco1A_Destinataire" "Sco1A_Organisateurs"
[173] "Sco1A_Coh_verb" "Sco1A_Ponct"
[175] "Sco1A_Maj" "Sco1A_Synt"
[177] "Sco1A_Genre" "Sco1A_Nombre"
[179] "Sco1A_Cas" "Sco1A_Verbal"
[181] "Sco1A_Dialekt" "P1A_Nb_mots"
[183] "P1A_Lieu" "P1A_Date"
[185] "P1A_Appel" "P1A_salutation"
[187] "P1A_Signature" "P1A_GP"
[189] "P1A_Objet" "P1A_Raisons"
[191] "P1A_TC" "P1A_Articulation"
[193] "P1A_Destinataire" "P1A_Organisateurs"
[195] "P1A_Coh_verb" "P1A_Ponct"
[197] "P1A_Maj" "P1A_Synt"
[199] "P1A_Genre" "P1A_Nombre"
[201] "P1A_Cas" "P1A_Verbal"
[203] "Sco2A_Nb_mots" "Sco2A_Lieu"
[205] "Sco2A_Date" "Sco2A_Appel"
[207] "Sco2A_Salutation" "Sco2A_Signature"
[209] "Sco2A_GP" "Sco2A_Objet"
[211] "Sco2A_Raisons" "Sco2A_TC"
[213] "Sco2A_Articulation" "Sco2A_Destinataire"
[215] "Sco2A_Organisateurs" "Sco2A_Coh_verb"
[217] "Sco2A_Ponct" "Sco2A_Maj"
[219] "Sco2A_Synt" "Sco2A_Genre"
[221] "Sco2A_Nombre" "Sco2A_Cas"
[223] "Sco2A_Verbal" "Sco2A_Dialekt"
[225] "P2A_Nb_mots" "P2A_Lieu"
[227] "P2A_Date" "P2A_Appel"
[229] "P2A_Psalutation" "P2A_Signature"
[231] "P2A_GP" "P2A_Objet"
[233] "P2A_Raisons" "P2A_TC"
[235] "P2A_Articulation" "P2A_Destinataire"
[237] "P2A_Organisateurs" "P2A_Coh_verb"
[239] "P2A_Ponct" "P2A_Maj"
[241] "P2A_Synt" "P2A_Genre"
[243] "P2A_Nombre" "P2A_Cas"
[245] "P2A_Verbal" "Sco3A_Nb_mots"
[247] "Sco3A_Lieu" "Sco3A_Date"
[249] "Sco3A_Appel" "Sco3A_Salutation"
[251] "Sco3A_Signature" "Sco3A_GP"
[253] "Sco3A_Objet" "Sco3A_Raisons"
[255] "Sco3A_TC" "Sco3A_Articulation"
[257] "Sco3A_Destinataire" "Sco3A_Organisateurs"
[259] "Sco3A_Coh_verb" "Sco3A_Ponct"
[261] "Sco3A_Maj" "Sco3A_Synt"
[263] "Sco3A_Genre" "Sco3A_Nombre"
[265] "Sco3A_Cas" "Sco3A_Verbal"
[267] "Sco3A_Dialekt" "P3A_Nb_mots"
[269] "P3A_Lieu" "P3A_Date"
[271] "P3A_Appel" "P3A_salutation"
[273] "P3A_Signature" "P3A_GP"
[275] "P3A_Objet" "P3A_Raisons"
[277] "P3A_TC" "P3A_Articulation"
[279] "P3A_Destinataire" "P3A_Organisateurs"
[281] "P3A_Coh_verb" "P3A_Ponct"
[283] "P3A_Maj" "P3A_Synt"
[285] "P3A_Genre" "P3A_Nombre"
[287] "P3A_Cas" "P3A_Verbal"
Unfortunately, I wasn’t able to locate a technical report or codebook that outlines the meaning of all of these column names and what range of values they can take. That said, this project featured three waves of data collection in which children were, among other things, asked to write two kinds of text: argumentative and narrative. Most of the children were French-Portuguese or German-Portuguese bilinguals, and these were asked to write both kinds of texts in each of their languages. Other children were considered monolingual speakers of French, German or Portuguese, and these only wrote texts in their respective language. Each text was scored on a grid. We can observe the following:
- The second column (VPNID) contains the participants’ identification codes. We already encountered these in Section 9. As you can verify for yourself, there is a single row in the dataset for each participant.
- Columns 154 (hkngroup), 155 (Schulsprache), 156 (zweiSprachig) and 157 (Group) seem to contain information regarding what type of participant we’re dealing with. As we saw in Section 9, we can extract this information from the VPNID entries.
- Columns 1 (...1), 3 (X.x), 158 (PartDef) and 159 (X.y) don’t seem to contain any relevant information. Perhaps they served some administrative purpose?
- All the other column names are built up according to a pattern:
  - Either the first letter is P or the first three letters are Sco. The P stands for ‘Portuguese’, the Sco for ‘langue scolaire’ (i.e., the language of schooling).
  - After the initial letter(s), we find a number between 1 and 3. This number specifies the time of data collection (i.e., T1, T2 or T3).
  - After the number and before the (first) underscore, we find either the letter A or the letter R. From context, A means that we’re dealing with an argumentative text, whereas R means that we’re dealing with a narrative text (perhaps R is short for ‘to relate’?).
  - Whatever comes after the first underscore specifies either a criterion on the rating grid or some further information about the text production.
The goal of the next few exercises is to reshape this dataset so that each row represents a single text.
Exercise 6 (Converting the dataset to a longer format) Convert the writing tibble to a tibble named writing_long by following these steps:
- Retain only the column VPNID as well as all columns that contain solely numerical information and whose names start with either Sco or P. (The resulting tibble contains 473 rows and 250 columns.)
- Convert the resulting tibble to a longer one containing just three columns: VPNID, criterion (containing the former column names) and score (containing the values in those former columns). (The resulting tibble contains 117,777 rows and 3 columns.)
One possible approach is sketched right after this exercise.
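If you want to check the general shape of a solution, here is a minimal sketch (with the tidyverse loaded, as throughout this primer). It assumes the writing tibble described above; the tidyselect helpers matches() and where() are just one option among several.
# One possible approach; not the only correct solution.
writing_long <- writing |>
  select(VPNID, matches("^(Sco|P)") & where(is.numeric)) |>
  pivot_longer(cols = -VPNID, names_to = "criterion", values_to = "score")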
We continue with the writing_long tibble from the previous exercise. The criterion column contains a subset of the column names that were shown above. We want to parse these former column names and extract the following pieces of information from them: the language, the time of data collection, the text type, and the criterion proper. For example, given the string "Sco2R_some_criterion", we want to extract "Sco" as the language, 2 as the time of data collection, "R" as the text type, and "some_criterion" as the criterion name. Similarly, given the string "P3A_OtherCriterion", we want to extract "P" as the language, 3 as the time of data collection, "A" as the text type, and "OtherCriterion" as the criterion name.
If you feed the description in the previous paragraph to an AI tool like ChatGPT, it will probably come up with a suggestion for how to go about this. After some prodding, it suggested the following solution to me:
<- "Sco2R_some_criterion"
my_string <- "([^\\d]+)(\\d)([^_]*)_(.*)$"
my_regex str_match(my_string, my_regex)
[,1] [,2] [,3] [,4] [,5]
[1,] "Sco2R_some_criterion" "Sco" "2" "R" "some_criterion"
<- "P3A_OtherCriterion"
my_string str_match(my_string, my_regex)
[,1] [,2] [,3] [,4] [,5]
[1,] "P3A_OtherCriterion" "P" "3" "A" "OtherCriterion"
Though it’s not too important for the remainder of this walkthrough, the regex above is constructed in the following way:
- ([^\\d]+): This group matches any group of one or more (+) non-numerical (^ for ‘not’, \\d for digits) characters.
- (\\d) matches any single digit.
- ([^_]*) matches any group of zero or more (*) characters that aren’t underscores.
- (.*) matches any group of zero or more characters without restrictions.
- The literal _ matches the underscore itself, and the final $ anchors the match to the end of the string.
Evidently, it will be important to verify that the parsing was done correctly.
We can use str_match() to parse any number of strings using a regex. For instance, the command below parses two strings and outputs the result as a matrix:
<- c("Sco2R_some_criterion", "P3A_OtherCriterion")
my_strings <- "([^\\d]+)(\\d)([^_]*)_(.*)$"
my_regex str_match(my_strings, my_regex)
[,1] [,2] [,3] [,4] [,5]
[1,] "Sco2R_some_criterion" "Sco" "2" "R" "some_criterion"
[2,] "P3A_OtherCriterion" "P" "3" "A" "OtherCriterion"
If we want to just extract, say, the language information from these strings, we can select the second column of this matrix like so:
str_match(my_strings, my_regex)[, 2]
[1] "Sco" "P"
Similarly, we can extract the criterion name like so:
str_match(my_strings, my_regex)[, 5]
[1] "some_criterion" "OtherCriterion"
We can define a helper function extract_info() that bundles these steps:
extract_info <- function(my_strings, i, my_regex = "([^\\d]+)(\\d)([^_]*)_(.*)$") {
  str_match(my_strings, my_regex)[, i]
}
extract_info(my_strings, 4)
[1] "R" "A"
extract_info(my_strings, 2)
[1] "Sco" "P"
Exercise 7 (Parsing the column names) Using the extract_info() function just defined, add columns called Language, Time, TextType and Criterion (uppercase ‘C’) to the writing_long tibble. Evidently, these columns should contain the appropriate pieces of information. Then drop the criterion (lowercase ‘c’) column from the tibble. A sketch of one possible solution follows below.
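If you get stuck, the following sketch shows one possible shape of a solution. Whether you keep Time as a character string or convert it to a number, as done here, is up to you.
writing_long <- writing_long |>
  mutate(Language  = extract_info(criterion, 2),
         Time      = as.numeric(extract_info(criterion, 3)),
         TextType  = extract_info(criterion, 4),
         Criterion = extract_info(criterion, 5)) |>
  select(-criterion)   # drop the original column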
Exercise 8 (Splitting up the dataset) Split the writing_long tibble up into two tibbles: argumentative, which contains only the data relating to argumentative texts, and narrative, which contains only the data relating to narrative texts. (A minimal sketch follows below.)
If all went well, the argumentative tibble should contain 59,125 rows and 6 columns, and the narrative tibble 58,179 rows and 6 columns.
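A minimal sketch, assuming the TextType column created in the previous exercise:
argumentative <- writing_long |> filter(TextType == "A")
narrative <- writing_long |> filter(TextType == "R")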
While I wasn’t able to retrieve a codebook for this dataset, Desgrippes, Lambelet, and Vanhove (2017) describe the grid according to which the texts were scored. I put the relevant information in the Excel file scoring_criteria.xlsx.
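Excel files can be read into R with the readxl package, which is installed alongside the tidyverse but has to be loaded separately. The sketch below assumes that scoring_criteria.xlsx lives in the data subdirectory of your project; adjust the path and the sheet argument to match your own setup.
library(readxl)
library(here)
excel_sheets(here("data", "scoring_criteria.xlsx"))   # list the sheet names
criteria_arg <- read_excel(here("data", "scoring_criteria.xlsx"), sheet = 1)   # or refer to the sheet by name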
Exercise 9 (Obtaining the relevant scores for each argumentative text) We now want to reorganise the data in such a way that we can easily calculate the total score assigned to each text. Since the scoring criteria were different for argumentative and narrative texts, we do this separately for both text types. Here, we’ll only go through the steps for argumentative texts, but as an additional exercise, you could follow the same steps (mutatis mutandis) to calculate the total scores for the narrative texts. Importantly, we need to check whether the scores obtained actually make sense.
- Inspect the scoring_criteria.xlsx file in your spreadsheet program. In particular, observe that it contains two sheets.
- Read the sheet containing the criteria for scoring the argumentative texts into R as criteria_arg.
- In the tibble argumentative, only retain the rows where the value in the Criterion column is one of the scoring criteria for the argumentative texts (i.e., those contained in criteria_arg). (See the hint below.)
- Drop the rows containing missing score values from argumentative.
- Convert the resulting tibble to a wide format in which each row represents a single argumentative text and contains the scores on all relevant criteria for this text. Note that each combination of VPNID, Language and Time present in the dataset represents one text.
- In principle, the current tibble should not contain any missing data. Moreover, the criterion scores should not exceed the MaxScore values listed in scoring_criteria.xlsx. Check if this is the case. (Hint: It isn’t.) Inspect the rows in the dataset that contain abnormalities and figure out what went wrong. (Hint: First use summary(argumentative) to identify the columns containing aberrant entries. Then inspect the rows containing such entries. Look up the corresponding information in writing_data.csv.) Write a brief paragraph detailing what the issues are. Fix the issues without editing the spreadsheet. This may require you to go all the way back to the start of this section and make some modifications there.
- Once all issues have been taken care of, reconvert the tibble to a longer format, i.e., with one row per criterion per text. Then, for each text, compute the variable total_score_arg as the sum of all criterion scores.
- Using write_csv(), save the tibble containing the variables VPNID, Language, Time and total_score_arg to a CSV file in the data subdirectory.
Hint for step 3: One option is to use one of the *_join() functions. Another one is to combine filter() with the %in% operator. The following example shows how the %in% operator works; a small join-based example follows it.
<- c("A", "B", "C") # define vectors x, y
x <- c("E", "C", "D", "A")
y %in% x # which of the values in y occur in x? y
[1] FALSE TRUE FALSE TRUE
%in% y # which of the values in x occur in y? x
[1] TRUE FALSE TRUE
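For the join-based option, a filtering join does the same job: semi_join() keeps only those rows of its first argument that have a match in its second. A toy example (the column name letter is made up purely for illustration):
df <- tibble(letter = c("A", "B", "C", "D"), value = 1:4)
keep <- tibble(letter = c("A", "C"))
semi_join(df, keep, by = "letter")   # returns the rows with letter "A" and "C"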
Exercise 10 (Comparing the calculated scores with the scores used earlier) We now have two sets of argumentative writing scores: the ones that you calculated in the previous exercise, and the ones we used earlier. Let’s compare them.
- Read in the scores you calculated from the CSV file you created in the previous exercise.
- Read in the helascot_skills.csv file we used earlier.
- Combine both files in such a way that, for each text, you have both the total_score_arg value that you computed and the Argumentation value provided in helascot_skills.csv side by side. Make sure that the resulting tibble contains the scores of all texts, even if a score is present for a text in one file but not in the other. (Hint: You can’t directly use one of the *_join() functions. You’ll need to do some thinking and some minor data wrangling first.)
- Are there any texts for which an Argumentation score exists but no total_score_arg was computed? What is the reason (or what are the reasons) for this discrepancy?
- Are there any texts for which a total_score_arg value was computed but no Argumentation score exists? If so, output the relevant rows.
- For each text, compute the difference between the Argumentation and total_score_arg scores. For how many texts is the difference not equal to 0? Can you think of any reason(s) for such differences?
Exercise 11 (Summarising the results)
- For each combination of language (i.e., French, German, Portuguese), group (i.e., bilingual vs. comparison group) and time of data collection, compute the number of texts for which you calculated a total_score_arg value as well as the mean total_score_arg value. (A generic sketch of such a grouped summary follows below.)
- For each combination of language and group, compute the number of children who have total_score_arg values for all three data collections. How many children in each cell have total_score_arg scores for the first and second data collection but not for the third? How many for the first and third, but not for the second? For the second and third, but not for the first? How many children in each cell have a total_score_arg score at only one data collection?
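As a pointer for the first part, the usual pattern is a grouped summary. The sketch below assumes a tibble called scores with columns Language, Group, Time and total_score_arg; the actual names will depend on how you set things up in the earlier exercises.
scores |>
  group_by(Language, Group, Time) |>
  summarise(n_texts    = sum(!is.na(total_score_arg)),          # number of texts with a calculated score
            mean_score = mean(total_score_arg, na.rm = TRUE),   # mean calculated score
            .groups    = "drop")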
12 Suggested reading
The Bible for working with tibbles in R is R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023), which is freely available from r4ds.hadley.nz.
13 Software versions
This page was built using the following R and package versions.
devtools::session_info(pkgs = "attached")
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.1 (2024-06-14 ucrt)
os Windows 10 x64 (build 19045)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United Kingdom.utf8
ctype English_United Kingdom.utf8
tz Europe/Zurich
date 2025-02-13
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
readxl * 1.4.3 2023-07-06 [1] CRAN (R 4.4.0)
stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
[1] C:/Users/VanhoveJ/AppData/Local/R/win-library/4.4
[2] C:/Program Files/R/R-4.4.1/library
──────────────────────────────────────────────────────────────────────────────