We’ll work with the data from the North America study conducted by DeKeyser et al. (2010). If you want to follow along, you can download this dataset here and save it to a subdirectory called `data` in your working directory.

We need the `DataFrames`, `CSV` and `StatsPlots` packages in order to read in the CSV with the dataset as a data frame and draw some basic graphs.

```
using DataFrames, CSV, StatsPlots
d = CSV.read("data/dekeyser2010.csv", DataFrame);
@df d plot(:AOA, :GJT
    , seriestype = :scatter
    , legend = :none
    , xlabel = "AOA"
    , ylabel = "GJT")
```

The `StatsPlots` package uses the `@df` macro to specify that the variables in the `plot()` function can be found in the data frame provided just after it (i.e., `d`).

Let’s fit two regression models to this data set using the `GLM` package. The first model, `lm1`, is a simple regression model with `AOA` as the predictor and `GJT` as the outcome. The syntax should be self-explanatory:

```
using GLM
lm1 = lm(@formula(GJT ~ AOA), d);
coeftable(lm1)
```

|             | Coef.    | Std. Error | t      | Pr(>\|t\|) | Lower 95% | Upper 95% |
|-------------|----------|------------|--------|------------|-----------|-----------|
| (Intercept) | 190.409  | 3.90403    | 48.77  | <1e-57     | 182.63    | 198.188   |
| AOA         | -1.21798 | 0.105139   | -11.58 | <1e-17     | -1.42747  | -1.00848  |

We can visualise this model by plotting the data in a scatterplot and adding the model predictions to it like so. I use `begin` and `end` to force Julia to only produce a single plot.

```
d[!, "prediction"] = predict(lm1);
begin
    @df d plot(:AOA, :GJT
        , seriestype = :scatter
        , legend = :none
        , xlabel = "AOA"
        , ylabel = "GJT");
    @df d plot!(:AOA, :prediction
        , seriestype = :line)
end
```

Our second model will incorporate an ‘elbow’ in the regression line at a given breakpoint – a piecewise regression model. For a breakpoint `bp`, we need to create a variable `since_bp` that encodes how many years beyond this breakpoint the participants’ `AOA` values are. If an `AOA` value is lower than the breakpoint, the corresponding `since_bp` value is just 0. The `add_breakpoint()` function takes a dataset containing an `AOA` variable and adds a variable called `since_bp` to it.

```
function add_breakpoint(data, bp)
    data[!, "since_bp"] = max.(0, data[!, "AOA"] .- bp);
end;
```

To add the `since_bp` variable for a breakpoint at age 12 to our dataset `d`, we just run this function. Note that in Julia, arguments are not copied when they are passed to a function. That is, the `add_breakpoint()` function *changes* the dataset; it does not create a changed *copy* of the dataset like R would:

```
# changes d!
add_breakpoint(d, 12);
print(d);
```
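To see the contrast with R’s copy-on-modify behaviour in miniature, here is a small sketch. The function names `double_in_place!()` and `double_copy()` are made up for illustration; they are not part of the analysis:

```julia
# Julia passes arguments by reference: a mutating function changes the caller's data.
function double_in_place!(v)
    v .= v .* 2   # in-place update; the caller's vector is modified
    return v
end

# Making an explicit copy mimics what R does implicitly.
function double_copy(v)
    w = copy(v)
    w .= w .* 2
    return w
end

x = [1, 2, 3]
double_copy(x)       # x is left unchanged
double_in_place!(x)  # x is now [2, 4, 6]
```

By convention, the `!` suffix signals that a function mutates (at least one of) its arguments.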

```
76×4 DataFrame
Row │ AOA GJT prediction since_bp
│ Int64 Int64 Float64 Int64
─────┼────────────────────────────────────
1 │ 59 151 118.548 47
2 │ 9 182 179.447 0
3 │ 51 127 128.292 39
4 │ 58 113 119.766 46
5 │ 27 157 157.523 15
6 │ 11 188 177.011 0
7 │ 17 125 169.703 5
8 │ 57 138 120.984 45
9 │ 10 171 178.229 0
10 │ 14 168 173.357 2
11 │ 20 174 166.049 8
12 │ 34 149 148.997 22
13 │ 19 155 167.267 7
14 │ 54 149 124.638 42
15 │ 63 107 113.676 51
16 │ 71 104 103.932 59
17 │ 24 176 161.177 12
18 │ 16 143 170.921 4
19 │ 22 133 163.613 10
20 │ 48 113 131.946 36
21 │ 17 171 169.703 5
22 │ 20 144 166.049 8
23 │ 44 151 136.818 32
24 │ 24 182 161.177 12
25 │ 56 113 122.202 44
26 │ 5 197 184.319 0
27 │ 71 114 103.932 59
28 │ 36 170 146.561 24
29 │ 57 115 120.984 45
30 │ 45 115 135.6 33
31 │ 56 118 122.202 44
32 │ 44 118 136.818 32
33 │ 23 155 162.395 11
34 │ 18 186 168.485 6
35 │ 42 132 139.254 30
36 │ 54 116 124.638 42
37 │ 14 169 173.357 2
38 │ 47 131 133.164 35
39 │ 8 196 180.665 0
40 │ 24 122 161.177 12
41 │ 52 148 127.074 40
42 │ 27 188 157.523 15
43 │ 11 198 177.011 0
44 │ 18 174 168.485 6
45 │ 48 150 131.946 36
46 │ 31 158 152.651 19
47 │ 49 131 130.728 37
48 │ 48 131 131.946 36
49 │ 15 180 172.139 3
50 │ 49 113 130.728 37
51 │ 23 167 162.395 11
52 │ 10 193 178.229 0
53 │ 20 164 166.049 8
54 │ 24 183 161.177 12
55 │ 35 118 147.779 23
56 │ 36 136 146.561 24
57 │ 44 115 136.818 32
58 │ 49 141 130.728 37
59 │ 15 181 172.139 3
60 │ 12 193 175.793 0
61 │ 53 140 125.856 41
62 │ 16 153 170.921 4
63 │ 54 110 124.638 42
64 │ 9 163 179.447 0
65 │ 25 174 159.959 13
66 │ 27 169 157.523 15
67 │ 18 179 168.485 6
68 │ 26 143 158.741 14
69 │ 22 162 163.613 10
70 │ 50 128 129.51 38
71 │ 42 119 139.254 30
72 │ 5 197 184.319 0
73 │ 14 168 173.357 2
74 │ 39 132 142.908 27
75 │ 56 140 122.202 44
76 │ 12 182 175.793 0
```

Since we don’t know what the best breakpoint is, we’re going to estimate it from the data. For each integer in a given range (`minbp` through `maxbp`), we’re going to fit a piecewise regression model with that integer as the breakpoint. We’ll then pick the breakpoint that minimises the deviance of the fit (i.e., the sum of squared differences between the model fit and the actual outcome). The `fit_piecewise()` function takes care of this. It outputs both the best-fitting piecewise regression model and the breakpoint used for this model.

```
function fit_piecewise(data, minbp, maxbp)
    min_deviance = Inf
    best_model = nothing
    best_bp = 0
    current_model = nothing
    for bp in minbp:maxbp
        add_breakpoint(data, bp)
        current_model = lm(@formula(GJT ~ AOA + since_bp), data)
        if deviance(current_model) < min_deviance
            min_deviance = deviance(current_model)
            best_model = current_model
            best_bp = bp
        end
    end
    return best_model, best_bp
end;
```

Let’s now apply this function to our dataset. The estimated breakpoint is at age 16:

```
lm2 = fit_piecewise(d, 6, 20);
# the first output is the model itself, the second the breakpoint used
coeftable(lm2[1])
lm2[2]
```

`16`

Let’s visualise this model by drawing a scatterplot and adding the regression fit to it. While we’re at it, we might as well add a 95% confidence band around the regression fit.

```
add_breakpoint(d, 16);
predictions = predict(lm2[1], d;
    interval = :confidence,
    level = 0.95);
d[!, "prediction"] = predictions[!, "prediction"];
d[!, "lower"] = predictions[!, "lower"];
d[!, "upper"] = predictions[!, "upper"];
begin
    @df d plot(:AOA, :GJT
        , seriestype = :scatter
        , legend = :none
        , xlabel = "AOA"
        , ylabel = "GJT"
    );
    @df d plot!(:AOA, :prediction
        , seriestype = :line
        , ribbon = (:prediction .- :lower,
                    :upper .- :prediction)
    )
end
```

We could run an F-test for the model comparison like below, but the p-value corresponds to the p-value for the `since_bp` coefficient, anyway:

`ftest(lm1.model, lm2[1].model);`

But there’s a problem: this p-value can’t be taken at face value. By looping through different possible breakpoints and then picking the one that worked best for our dataset, we’ve increased our chances of finding some pattern in the data even if nothing is going on. So we need to recalibrate the p-value we’ve obtained.

Our strategy is as follows. We will generate a fairly large number of datasets similar to `d` but of which we know that there isn’t any breakpoint in the `GJT`/`AOA` relationship. We will do this by simulating new `GJT` values from the simple regression model fitted above (`lm1`). We will then apply the `fit_piecewise()` function to each of these datasets, using the same `minbp` and `maxbp` values as before, and obtain the p-value associated with each model. We will then compute the proportion of the p-values so obtained that are lower than the p-value from our original model, i.e., 0.0472.

I wasn’t able to find a Julia function similar to R’s `simulate()` that simulates a new outcome variable based on a linear regression model. But such a function is easy enough to put together:

```
using Distributions
function simulate_outcome(null_model)
    resid_distr = Normal(0, dispersion(null_model.model))
    prediction = fitted(null_model)
    new_outcome = prediction + rand(resid_distr, length(prediction))
    return new_outcome
end;
```

The `one_run()` function generates a single new outcome vector, overwrites the `GJT` variable in our dataset with it, and then applies the `fit_piecewise()` function to the dataset, returning the p-value of the best-fitting piecewise regression model.

```
function one_run(data, null_model, min_bp, max_bp)
    new_outcome = simulate_outcome(null_model)
    data[!, "GJT"] = new_outcome
    best_model = fit_piecewise(data, min_bp, max_bp)
    pval = coeftable(best_model[1]).cols[4][3]
    return pval
end;
```

Finally, the `generate_p_distr()` function runs the `one_run()` function a large number of times and outputs the p-values generated.

```
function generate_p_distr(data, null_model, min_bp, max_bp, n_runs)
    pvals = [one_run(data, null_model, min_bp, max_bp) for _ in 1:n_runs]
    return pvals
end;
```

Our simulation will consist of 25,000 runs, and in each run, 15 regression models will be fitted (one for each candidate breakpoint from 6 through 20), for a total of 375,000 models. On my machine, this takes less than 20 seconds (i.e., roughly 50 microseconds per model).

```
n_runs = 25_000;
pvals = generate_p_distr(d, lm1, 6, 20, n_runs);
```

For about 11–12% of the datasets in which no breakpoint governed the data, the `fit_piecewise()` procedure returned a p-value of 0.0472 or lower. So our original p-value of 0.0472 ought to be recalibrated to about 0.12.

`sum(pvals .<= 0.0472) / n_runs`

`0.11864`

In *The Design of Experiments*, Ronald A. Fisher explained the Fisher exact test using the following example. Imagine that a lady claims she can taste the difference between cups of tea in which the tea was poured into the cup first and then milk was added, and cups of tea in which the milk was poured first and then the tea was added. A sceptic might put the lady to the test and prepare eight cups of tea – four with tea to which milk was added, and four with milk to which tea was added. (Yuck to both, by the way.) The lady is presented with these in a random order and is asked to identify those four cups with tea to which milk was added. Now, if the lady has no discriminatory ability whatever, there is only a 1-in-70 chance she identifies all four cups correctly. This is because there are 70 ways of picking four cups out of eight, and only one of these ways is correct. In Julia:

`binomial(8, 4)`

`70`

We can now imagine a slight variation on this experiment. If the lady identifies all four cups correctly, we choose to believe she has the purported discriminatory ability. If she identifies two or fewer cups correctly, we remain sceptical. But if she identifies three out of four cups correctly, we prepare another eight cups of tea and give her another chance under the same conditions.

We can ask two questions about this new procedure:

- With which probability will we believe the lady if she, in fact, does not have any discriminatory ability?
- How many rounds of tea tasting will we need on average before the experiment terminates?

In the following, I’ll share both analytical and simulation-based answers to these questions.

Under the null hypothesis of no discriminatory ability, the number of correctly identified cups in any one draw ($X$) follows a hypergeometric distribution with parameters $N = 8$ (total), $K = 4$ (successes) and $n = 4$ (draws), i.e., $$X \sim \textrm{Hypergeometric}(8, 4, 4).$$ In any given round, the subject fails the test if she only identifies 0, 1 or 2 cups correctly. Under the null hypothesis, the probability of this happening is given by $p = P(X \leq 2)$, the value of which we can determine using the cumulative mass function of the Hypergeometric(8, 4, 4) distribution. We suspend judgement on the subject’s discriminatory abilities if she identifies exactly three cups correctly, in which case she has another go. Under the null hypothesis, the probability of this happening in any given round is given by $q = P(X = 3)$, the value of which can be determined using the probability mass function of the Hypergeometric(8, 4, 4) distribution.

With those probabilities in hand, we can now compute the probability that the subject fails the experiment in a specific round, assuming the null hypothesis is correct. In the first round, she will fail the experiment with probability $p$. In order to fail in the second round, she needed to have advanced to the second round, which happens with probability $q$, and then fail in that round, which happens with probability $p$. That is, she will fail in the second round with probability $qp$. To fail in the third round, she needed to advance to the third round, which happens with probability $q^2$, and then fail in the third round, which happens with probability $p$. That is, she will fail in the third round with probability $q^2 p$. Etc. etc. The probability that she will fail somewhere in the experiment if the null hypothesis is true, then, is given by $$\sum_{i=1}^{\infty} q^{i-1} p = p \sum_{i=0}^{\infty} q^{i} = \frac{p}{1-q},$$ where the first equality is just a matter of shifting the index and the second equality holds because the expression is a geometric series.

Let’s compute the final results using Julia. The following loads the `Distributions` package and then defines `d` as the Hypergeometric(8, 4, 4) distribution. Note that in Julia, the parameters for the Hypergeometric distribution aren’t $N$ (total), $K$ (successes) and $n$ (draws), but rather $s$ (successes), $f$ (failures) and $n$ (draws); see the documentation. We then read off the values for $p$ and $q$ from the cumulative mass function and probability mass function, respectively:

```
using Distributions
d = Hypergeometric(4, 4, 4);
p = cdf(d, 2);
q = pdf(d, 3);
```

The probability that the subject will fail the experiment if she does indeed not have the purported discriminatory ability is now easily computed:

`p / (1 - q)`

`0.9814814814814815`
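As a numerical sanity check on the geometric-series step, we can sum the first hundred terms of $p + qp + q^2p + \dots$ and compare the partial sum to the closed form $p/(1-q)$. The snippet repeats the definitions of `d`, `p` and `q` so that it is self-contained:

```julia
using Distributions

d = Hypergeometric(4, 4, 4)  # successes, failures, draws
p = cdf(d, 2)                # P(X ≤ 2): fail the current round
q = pdf(d, 3)                # P(X = 3): get another round

# Partial sum of p + qp + q²p + … over the first hundred rounds
partial_sum = sum(p * q^(i - 1) for i in 1:100)
partial_sum ≈ p / (1 - q)    # true: the series has essentially converged
```

Since $q \approx 0.23$, the terms shrink fast and a hundred rounds are far more than enough for convergence.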

The next question is how many rounds we expect the experiment to carry on for if the null hypothesis is true. In each round, the probability that the experiment terminates in that round is given by $1 - q$. From the geometric distribution, we know that we then on average need $\frac{1}{1-q}$ attempts before the first terminating event occurs:

`1 / (1 - q)`

`1.2962962962962965`

In sum, if the subject lacks any discriminatory ability, there is only a 1.85% chance that she will pass the test, and on average, the experiment will run for 1.3 rounds.

First, we define a function `experiment()` that runs the experiment once. In essence, we have an `urn` that contains four correct identifications (`true`) and four incorrect identifications (`false`). From this `urn`, we `sample()` (a function from the `StatsBase` package) four identifications without replacement.

Note, incidentally, that Julia functions can take both positional arguments and keyword arguments. In the `sample()` command below, both `urn` and `4` are passed as positional arguments, and you’d have to read the documentation to figure out which argument specifies what. The keyword arguments are separated from the positional arguments by a semi-colon and are identified with the parameter’s name.
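As a minimal illustration of this positional/keyword distinction, here is a toy function (`greet()` is made up for this example and plays no role in the analysis):

```julia
# One positional argument (name) and one keyword argument (shout).
function greet(name; shout = false)
    msg = "Hello, " * name
    return shout ? uppercase(msg) : msg
end

greet("Ada")                # positional only → "Hello, Ada"
greet("Ada"; shout = true)  # keyword after the semi-colon → "HELLO, ADA"
```

Calling `greet("Ada", true)` would throw an error: keyword arguments cannot be supplied positionally.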

Next, we count the number of `true`s in our `pick` using `sum()`. Depending on how many `true`s there are in `pick`, we terminate the experiment, returning `false` if we remain sceptical and `true` if we choose to believe the lady, or we run the experiment for one more round. The number of attempts is tallied and output as well.

```
using StatsBase
function experiment(attempt = 1)
    urn = [false, false, false, false, true, true, true, true]
    pick = sample(urn, 4; replace = false)
    number = sum(pick)
    if number <= 2
        return false, attempt
    elseif number >= 4
        return true, attempt
    else
        return experiment(attempt + 1)
    end
end;
```

A single run of `experiment()` could produce the following output:

`experiment()`

`(false, 1)`

Next, we write a function `simulate()` that runs the `experiment()` function a large number of times, and outputs both whether each `experiment()` led to us believing the lady or remaining sceptical, and how many rounds each `experiment()` took. These results are tabulated in a `DataFrame` – just because. Of note, Julia supports the list comprehensions that Python users will be familiar with. I use this feature here both to run the experiment a large number of times and to parse the output.

```
using DataFrames
function simulate(runs = 10000)
    results = [experiment() for _ in 1:runs]
    success = [results[i][1] for i in 1:runs]
    attempts = [results[i][2] for i in 1:runs]
    d = DataFrame(Successful = success, Attempts = attempts)
    return d
end;
```

Let’s swing for the fences and run this experiment a million times. Like in Python, we can make large numbers easier to parse by inserting underscores in them:

`runs = 1_000_000;`

Using the `@time` macro, we can check how long it takes for this simulation to finish.

`@time d = simulate(runs)`

` 0.359740 seconds (4.07 M allocations: 361.334 MiB, 14.82% gc time, 35.07% compilation time)`

```
1000000×2 DataFrame
     Row │ Successful  Attempts
         │ Bool        Int64
─────────┼──────────────────────
       1 │      false         1
       2 │      false         1
       3 │      false         1
       4 │      false         1
       5 │      false         1
       6 │      false         2
       7 │      false         3
       8 │      false         2
       9 │      false         1
      10 │      false         1
      11 │      false         1
      12 │      false         1
      13 │      false         2
    ⋮    │     ⋮          ⋮
  999989 │      false         1
  999990 │      false         1
  999991 │      false         1
  999992 │      false         4
  999993 │      false         1
  999994 │      false         1
  999995 │      false         1
  999996 │      false         1
  999997 │      false         1
  999998 │      false         1
  999999 │      false         1
 1000000 │      false         1
                999975 rows omitted
```

On my machine then, running this simulation takes less than a second. Note that 60% of this time is compilation time. (Update: When migrating my blog to Quarto, I reran this code using a new Julia version (1.9.1). Now the code runs faster.) Indeed, if we run the function another time, i.e., after it’s been compiled, the run time drops to about 0.3 seconds (Update: 0.2 seconds now.):

`@time d2 = simulate(runs)`

` 0.209087 seconds (3.89 M allocations: 348.982 MiB, 16.29% gc time)`

```
1000000×2 DataFrame
     Row │ Successful  Attempts
         │ Bool        Int64
─────────┼──────────────────────
       1 │      false         1
       2 │      false         1
       3 │      false         1
       4 │      false         1
       5 │      false         1
       6 │      false         2
       7 │      false         2
       8 │      false         1
       9 │      false         1
      10 │      false         1
      11 │      false         2
      12 │      false         1
      13 │      false         1
    ⋮    │     ⋮          ⋮
  999989 │      false         3
  999990 │      false         3
  999991 │      false         1
  999992 │      false         1
  999993 │      false         1
  999994 │      false         1
  999995 │      false         1
  999996 │      false         1
  999997 │      false         1
  999998 │      false         2
  999999 │      false         1
 1000000 │      false         1
                999975 rows omitted
```

Using `describe()`, we see that this simulation – which doesn’t ‘know’ anything about hypergeometric and geometric distributions – produces the same answers that we arrived at by analytical means: there’s a 1.85% chance that we end up believing the lady even if she has no discriminatory ability whatsoever. And if she doesn’t have any discriminatory ability, we’ll need 1.3 rounds on average before terminating the experiment:

`describe(d)`

```
2×7 DataFrame
 Row │ variable    mean      min      median  max   nmissing  eltype
     │ Symbol      Float64   Integer  Float64  Integer  Int64  DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ Successful  0.018533  false       0.0  true         0  Bool
   2 │ Attempts    1.29611   1           1.0  10           0  Int64
```

The slight discrepancy between the simulation-based results and the analytical ones is just due to randomness. Below is a quick way of constructing 95% confidence intervals around both of our simulation-based estimates; the analytical solutions fall within both intervals.

`means = mean.(eachcol(d))`

```
2-element Vector{Float64}:
0.018533
1.296105
```

`ses = std.(eachcol(d)) / sqrt(runs)`

```
2-element Vector{Float64}:
0.00013486862533794173
0.0006198767726106645
```

`upr = means + 1.96*ses`

```
2-element Vector{Float64}:
0.018797342505662368
1.297319958474317
```

`lwr = means - 1.96*ses`

```
2-element Vector{Float64}:
0.018268657494337634
1.2948900415256832
```

The basic Levenshtein algorithm is used to count the minimum number of insertions, deletions and substitutions that are needed to convert one string into another. For instance, to convert English *doubt* into French *doute*, you need at least two operations. You could replace the *b* by a *t* and then replace the *t* by an *e*; or you could delete the *b* and then insert the *e*. As this example shows, there may be more than one way to convert one string into another using the minimum number of required operations, but this minimum number itself is unique for each pair of strings.

I won’t cover the logic of the Levenshtein algorithm here. The following is a straightforward Julia implementation of the pseudocode found on Wikipedia, assuming a cost of 1 for all operations. The function takes two inputs (a string `a` that is to be converted to a string `b`) and outputs an array with the Levenshtein distances between all substrings of `a` on the one hand and all substrings of `b` on the other hand. The entry in the bottom right corner of this array is the Levenshtein distance between the full strings, and this is output separately as well.

```
function levenshtein(a::String, b::String)
    # Initialise table
    distances = zeros(Int, length(a) + 1, length(b) + 1)
    distances[:, 1] = 0:length(a)
    distances[1, :] = 0:length(b)
    # Levenshtein logic
    for row in 2:(length(a) + 1)
        for col in 2:(length(b) + 1)
            distances[row, col] = min(
                distances[row - 1, col - 1] + Int(a[row - 1] != b[col - 1] ? 1 : 0)
                , distances[row, col - 1] + 1
                , distances[row - 1, col] + 1
            )
        end
    end
    return distances, distances[length(a) + 1, length(b) + 1]
end
```

`levenshtein (generic function with 1 method)`

Let’s compute the Levenshtein distance between the German word *Zyklus* (‘cycle’) and its Swedish counterpart *cykel*. Note the use of `;` at the end of the line to suppress the output.

```
dist_matrix, lev_cost = levenshtein("zyklus", "cykel");
display(dist_matrix)
```

```
7×6 Matrix{Int64}:
0 1 2 3 4 5
1 1 2 3 4 5
2 2 1 2 3 4
3 3 2 1 2 3
4 4 3 2 2 2
5 5 4 3 3 3
6 6 5 4 4 4
```

This checks out: you do indeed need four operations to transform *Zyklus* into *cykel*.

But what if we wanted to apply our new functions to several pairs of strings? Let’s first define three Dutch-German word pairs:

```
dutch = ("boek", "zuster", "sneeuw");
german = ("buch", "schwester", "schnee");
```

We can run our `levenshtein()` function on these three word pairs without introducing for-loops by simply appending a dot to the function name:

`levenshtein.(dutch, german)`

`(([0 1 … 3 4; 1 0 … 2 3; … ; 3 2 … 2 3; 4 3 … 3 3], 3), ([0 1 … 8 9; 1 1 … 8 9; … ; 5 4 … 5 6; 6 5 … 6 5], 5), ([0 1 … 5 6; 1 0 … 4 5; … ; 5 4 … 4 3; 6 5 … 5 4], 4))`

However, since the `levenshtein()` function outputs two pieces of information (both the matrix with the distances between the substrings as well as the final Levenshtein distance), this vectorised call yields a tuple of three subtuples, each subtuple containing both a matrix and the corresponding final Levenshtein distance. This is why the output above looks so messy. If we wanted to obtain just the Levenshtein distances, we could write a for-loop to extract them. But I think an easier solution is to first write a wrapper around the `levenshtein()` function that outputs only the final Levenshtein distance and use the vectorised version of this wrapper instead:

```
function lev_dist(a::String, b::String)
    return levenshtein(a, b)[2]
end
```

`lev_dist (generic function with 1 method)`

Now use the vectorised version of `lev_dist()`:

`lev_dist.(dutch, german)`

`(3, 5, 4)`

Nice!

We now know that we need four operations to transform *Zyklus* into *cykel* and five to transform *zuster* into *Schwester*. But which are the operations that you need for these transformations? The function `lev_alignment()` defined below outputs one possible set of operations that would do the job. (Unlike the minimum number of operations required to transform one string into another, the set of operations needed isn’t uniquely defined.)

```
function lev_alignment(a::String, b::String)
    source = Vector{Char}()
    target = Vector{Char}()
    operations = Vector{Char}()
    lev_matrix = levenshtein(a, b)[1]
    row = size(lev_matrix, 1)
    col = size(lev_matrix, 2)
    while (row > 1 && col > 1)
        if lev_matrix[row - 1, col - 1] == lev_matrix[row, col] &&
           lev_matrix[row - 1, col - 1] <= min(
               lev_matrix[row - 1, col]
               , lev_matrix[row, col - 1]
           )
            row = row - 1
            col = col - 1
            pushfirst!(source, a[row])
            pushfirst!(target, b[col])
            pushfirst!(operations, ' ')
        else
            if lev_matrix[row - 1, col] <= min(lev_matrix[row - 1, col - 1], lev_matrix[row, col - 1])
                row = row - 1
                pushfirst!(source, a[row])
                pushfirst!(target, ' ')
                pushfirst!(operations, 'D')
            elseif lev_matrix[row, col - 1] <= lev_matrix[row - 1, col - 1]
                col = col - 1
                pushfirst!(source, ' ')
                pushfirst!(target, b[col])
                pushfirst!(operations, 'I')
            else
                row = row - 1
                col = col - 1
                pushfirst!(source, a[row])
                pushfirst!(target, b[col])
                pushfirst!(operations, 'S')
            end
        end
    end
    # If first column reached, move up.
    while (row > 1)
        row = row - 1
        pushfirst!(source, a[row])
        pushfirst!(target, ' ')
        pushfirst!(operations, 'D')
    end
    # If first row reached, move left.
    while (col > 1)
        col = col - 1
        pushfirst!(source, ' ')
        pushfirst!(target, b[col])
        pushfirst!(operations, 'I')
    end
    return vcat(
        reshape(source, (1, :))
        , reshape(target, (1, :))
        , reshape(operations, (1, :))
    )
end
```

`lev_alignment (generic function with 1 method)`

I won’t cover the logic behind the algorithm as this is more about learning Julia than the Levenshtein algorithm. On the Julia side, note first how empty character vectors can be initialised. Moreover, notice that the `pushfirst!()` function is decorated with a `!` (a ‘bang’). This communicates to whoever is reading the code that this function changes some of its input. For instance, `pushfirst!(source, a[row])` means that the current character of `a` (i.e., `a[row]`) is added to the front of the `source` vector. That is, this command changes the `source` vector. Finally, the `source`, `target` and `operations` vectors are all column vectors. In order to display them somewhat nicely, I converted each of them to a single-row matrix using `reshape()`. Then, the three resulting rows are concatenated vertically using `vcat()` to show how the two strings can be aligned and which operations are needed to transform one into the other.

Let’s see how we can transform *Zyklus* into *cykel*:

`lev_alignment("zyklus", "cykel")`

```
3×7 Matrix{Char}:
'z' 'y' 'k' ' ' 'l' 'u' 's'
'c' 'y' 'k' 'e' 'l' ' ' ' '
'S' ' ' ' ' 'I' ' ' 'D' 'D'
```

So we substitute *c* for *z*, insert an *e* and delete the *u* and *s*. As I mentioned, this set of operations isn’t uniquely defined. Indeed, we could have also substituted *c* for *z*, *e* for *l* and *l* for *u* and then deleted the *s*. This also corresponds to a Levenshtein distance of four operations.

Above, we computed raw Levenshtein distances. The problem with these is that longer string pairs will tend to have larger raw Levenshtein distances than shorter string pairs, even if they do seem more similar. To correct for this, we can compute normalised Levenshtein distances instead. There are various ways to compute these; one option is to divide the raw Levenshtein distance by the length of the alignment:

```
function norm_lev_dist(a::String, b::String)
    raw_dist = lev_dist(a, b)
    alignment_length = size(lev_alignment(a, b), 2)
    return raw_dist / alignment_length
end
```

`norm_lev_dist (generic function with 1 method)`

(Behind the scenes, we run the Levenshtein algorithm twice: once in `lev_dist()` and again in `lev_alignment()`. This seems wasteful – unless the Julia compiler is able to clean up the double work? I’m not sure.)

We obtain a normalised Levenshtein distance of about 0.57 for *Zyklus* - *cykel*:

`norm_lev_dist("zyklus", "cykel")`

`0.5714285714285714`

We can use a vectorised version of this function, too:

`norm_lev_dist.(dutch, german)`

`(0.75, 0.5555555555555556, 0.5)`

Of course, normalised Levenshtein distances are symmetric, so we obtain the same result when running the following command:

`norm_lev_dist.(german, dutch)`

`(0.75, 0.5555555555555556, 0.5)`

The famous Fibonacci sequence is an infinite sequence of natural numbers, the first of which are 1, 1, 2, 3, 5, 8, 13, …. The sequence is defined as follows:

$$F(1) = F(2) = 1; \qquad F(n) = F(n-1) + F(n-2) \text{ for } n > 2.$$

Let’s write some Julia functions that can generate this sequence.

You can download Julia from julialang.org. I’m currently using the Pluto.jl package that allows you to write Julia code in a reactive notebook. Check out the Pluto.jl page for more information.

The Fibonacci sequence is defined recursively: to obtain the *n*th Fibonacci number, you first need to compute the *(n−1)*th and *(n−2)*th Fibonacci numbers and then add them. We can write a Julia function that exactly reflects the definition of the Fibonacci sequence like so:

```
function fibonacci(n)
    if n <= 2
        return 1
    end
    return fibonacci(n - 1) + fibonacci(n - 2)
end
```

`fibonacci (generic function with 1 method)`

This function tacitly assumes that `n` is a non-zero natural number. If `n` is equal to or lower than 2, i.e., if `n` is 1 or 2, it immediately returns 1, as per the definition of the sequence. If this condition isn’t met, the output is computed recursively. The function can be run as follows:

`fibonacci(10)`

`55`

Checks out! But from a computational point of view, the `fibonacci()` function is quite wasteful. In order to obtain `fibonacci(10)`, we need to compute `fibonacci(9)` and `fibonacci(8)`. But in order to compute `fibonacci(9)`, we *also* need to compute `fibonacci(8)`. For both `fibonacci(9)` and `fibonacci(8)`, we need to compute `fibonacci(7)`, etc. In fact, we need to compute the value of `fibonacci(8)` two times, that of `fibonacci(7)` three times, that of `fibonacci(6)` five times, and that of `fibonacci(5)` eight times. So we’d be doing lots of computations over and over again. For this reason, the `fibonacci()` function is hopelessly inefficient: while you can compute `fibonacci(10)` in a fraction of a second, it may take minutes to compute, say, `fibonacci(60)`. Luckily, we can speed up our function considerably.
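We can make the duplicated work explicit by instrumenting the recursion with a call counter. This is just an illustrative sketch; `fib_counted()` and the `counts` dictionary are mine, not part of the original function:

```julia
# Tally how often the naive recursion is entered for each argument.
counts = Dict{Int, Int}()

function fib_counted(n)
    counts[n] = get(counts, n, 0) + 1   # record this call
    n <= 2 && return 1
    return fib_counted(n - 1) + fib_counted(n - 2)
end

fib_counted(10)
# counts[8] == 2, counts[7] == 3, counts[6] == 5, counts[5] == 8:
# the call counts themselves grow like the Fibonacci sequence.
```

This is why the naive recursion has exponential running time: the number of redundant calls itself follows the Fibonacci recurrence.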

Memoisation is a programming technique where any intermediate result that you compute is stored in an array. Before computing any further intermediate results, you first look up in the array whether you haven’t in fact already computed it, saving you a lot of unnecessary computations. The following Julia function is a bit more involved than the previous one, but it’s much more efficient.

```
function fib_memo(n)
    known = zeros(Int64, n)
    function memoize(k)
        if known[k] != 0
            # do nothing
        elseif k == 1 || k == 2
            known[k] = 1
        else
            known[k] = memoize(k-1) + memoize(k-2)
        end
        return known[k]
    end
    return memoize(n)
end
```

`fib_memo (generic function with 1 method)`

The overall function that we’ll actually call is `fib_memo()`. It creates an array called `known` with `n` zeroes. Then it defines an inner function `memoize()`. This latter function obtains an integer `k` that in practice will range from 1 to `n` and does the following. First, it checks if the `k`th value in the array `known` is still 0. If it got changed, the function just returns the `k`th value in `known`. Otherwise, if `k` is equal to either 1 or 2, it sets the first or second value of `known` to 1. If `k` is greater than 2, the `k`th value of `known` is computed recursively. In all cases, the `memoize()` function returns the `k`th value of the `known` array. The outer `fib_memo()` function then just returns the result of `memoize(n)`.

Perhaps by now, your computer has finished running `fibonacci(60)` and you can try out the alternative implementation:

`fib_memo(60)`

`1548008755920`

Notice how much faster this new function is! Even the 200th Fibonacci number can be computed in a fraction of a second:

`fib_memo(200)`

`-1123705814761610347`
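A negative Fibonacci number is clearly wrong. The culprit is that Julia’s fixed-width integers wrap around silently rather than throwing an error, which is easy to demonstrate:

```julia
# Int64 arithmetic wraps around at its maximum value instead of erroring.
typemax(Int64)                        # 9223372036854775807
typemax(Int64) + 1 == typemin(Int64)  # true: silent wrap-around
```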

Unfortunately, we’ve run into a different problem now: integer overflow. The result of the computations has become so large that it exceeds the range of 64-bit integers. To fix this problem, we can work with `BigInt`s instead:

```
function fib_memo(n)
    known = zeros(BigInt, n)
    function memoize(k)
        if known[k] != 0
            # do nothing: the result is already stored
        elseif k == 1 || k == 2
            known[k] = 1
        else
            known[k] = memoize(k-1) + memoize(k-2)
        end
        return known[k]
    end
    return memoize(n)
end
```

`fib_memo (generic function with 1 method)`

`fib_memo(200)`

`280571172992510140037611932413038677189525`

Nice!
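To see what happened under the hood, here's an illustrative Python sketch (my own, not from the post) that contrasts exact arbitrary-precision arithmetic with explicitly simulated signed 64-bit wrap-around of the kind Julia's `Int64` exhibits:

```python
# Exact Fibonacci via Python's arbitrary-precision integers,
# vs. the same iteration with simulated signed 64-bit wrap-around.
def fib_exact(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a  # a Python int never overflows

def fib_int64(n):
    # emulate two's-complement 64-bit arithmetic explicitly
    mask, sign_bit = (1 << 64) - 1, 1 << 63
    a, b = 0, 1
    for _ in range(n):
        a, b = b, (a + b) & mask  # addition wraps modulo 2^64
    return a - (1 << 64) if a & sign_bit else a

print(fib_exact(200))                    # 280571172992510140037611932413038677189525
print(fib_int64(200) == fib_exact(200))  # False: the 64-bit result wrapped around
```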

The third alternative is more of a mathematical solution than a programming one. According to Binet's formula, the *n*th Fibonacci number can be computed as (φ^n − ψ^n)/√5, where φ = (1 + √5)/2 is the Golden Ratio and ψ = (1 − √5)/2 its conjugate. In Julia:

```
function fib_binet(n)
    φ = (1 + sqrt(5))/2
    ψ = (1 - sqrt(5))/2
    fib_n = 1/sqrt(5) * (φ^n - ψ^n)
    return BigInt(round(fib_n))
end
```

`fib_binet (generic function with 1 method)`

Note that you can use mathematical symbols like `φ` and `ψ` in Julia. This function runs very fast, too:

`fib_binet(60)`

`1548008755920`

`fib_binet(200)`

`280571172992512015699912586503521287798784`

Notice, however, that the result for the 200th Fibonacci number differs from the one obtained using `fib_memo()` by a number on the order of 10^27:

`fib_binet(200) - fib_memo(200)`

`1875662300654090482610609259`

By using Binet's formula, we've left the fairly neat world of integer arithmetic and entered the realm of floating-point arithmetic, which is rife with approximation errors. While we're at it, we might as well compute and plot the size of these approximation errors. In the snippet below, I first compute the first 200 Fibonacci numbers using both `fib_memo()` and `fib_binet()`. Note that I added a dot (`.`) to both function names: this is Julia notation for running vectorised (broadcast) computations. Further note that I end all lines with a semicolon so that the results don't get printed to the prompt. Then I compute the absolute values of the differences between the numbers obtained by the two computation methods; note again the dots in `abs.()` and `.-`, which are required to make these functions work on vectors. Finally, I convert these absolute differences to differences relative to the correct answers:

```
fib_integer = fib_memo.(1:200);
fib_math = fib_binet.(1:200);
abs_diff = abs.(fib_math .- fib_integer);
rel_diff = abs_diff ./ fib_integer;
```
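For readers following along in another language, here's an illustrative Python sketch (assumptions and names mine, not the post's code) of the same comparison between the exact and the Binet-based Fibonacci numbers:

```python
# Compare Binet's closed-form formula (double-precision floats) against
# exact integer arithmetic: the absolute error grows huge, while the
# relative error stays tiny.
from math import sqrt

def fib_exact(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_binet(n):
    phi = (1 + sqrt(5)) / 2  # Golden Ratio
    psi = (1 - sqrt(5)) / 2  # its conjugate
    return round((phi ** n - psi ** n) / sqrt(5))

abs_diff = abs(fib_binet(200) - fib_exact(200))
rel_diff = abs_diff / fib_exact(200)
print(abs_diff > 0)      # True: a huge absolute discrepancy
print(rel_diff < 1e-12)  # True: but a negligible relative one
```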

To wrap up this blog post, let's now plot these absolute and relative differences using the Plots.jl package. While Figure 1 shows that the absolute error becomes huge, Figure 2 shows that these discrepancies amount to only a negligible fraction of the correct answers.

```
using Plots
plot(1:200, abs_diff, seriestype = :scatter,
     xlabel = "n",
     ylabel = "absolute difference",
     label = "")
```

Figure 1. Absolute difference between the Fibonacci numbers obtained using `fib_binet()` and those obtained using `fib_memo()`.

```
plot(1:200, rel_diff, seriestype = :scatter,
     xlabel = "n",
     ylabel = "relative difference",
     label = "")
```

Figure 2. Relative difference between the Fibonacci numbers obtained using `fib_binet()` and those obtained using `fib_memo()`.

It’s now fifteen years later, and I still haven’t taken any methods or statistics classes. But, as you can tell from a quick glance at the blog archive, I’ve come round to the view that researchers often take actions that don’t actually help them to address their research questions and that much information that is almost routinely reported in research papers is irrelevant to the nominal goal of that research paper (i.e., answering its research questions). Part of the reason that researchers do things that don’t make much sense is that they have misunderstood what some statistical tool does. But I suspect that another part of the reason is that beginning researchers don’t quite see the point of some procedures they run and of some snippets of information they provide but nonetheless assume that *other* researchers do understand why these are important. From my own experience and discussions with former students, I think that there’s a vicious circle at play:

1. Students read articles with lots of numbers and procedures they don't really understand or see the point of.
2. They reasonably but often incorrectly assume that these ubiquitous numbers and procedures must be integral to the research report.
3. As students become researchers, they still haven't quite understood whether or why all those numbers and procedures are relevant. But they assume that they are relevant. So they'd better also include them in their own reports, or they'd be betraying their own ignorance. Luckily, even if you don't know what p-values, correlation coefficients and reliability coefficients actually express, computing them is a piece of cake.
4. During peer review, you're more likely to be chastised for not including some piece of information than for including a couple of irrelevant numbers. So beginning researchers may rarely be forced to consider the added value of their go-to procedures and of the information they routinely provide.
5. A new cohort of students reads the published research; return to step 1.

It’s not that the beginning researchers in this scenario have misunderstood the tools they use — they have no conception of what these tools do, let alone a false one. All that is required for them to run superfluous procedures and include irrelevant information in their reports is that they think that other people see the relevance of what they’re doing — even if they themselves do not.

Now, it's hard to stop using tools whose purpose you've misunderstood, since you won't know that you've misunderstood it. But if you're a young scholar and you want to run some analysis or report some numbers that are commonly run or reported in your line of work, first ask yourself and your colleagues how running this analysis or reporting these numbers would **help answer your study's research questions or make the answers easier to understand** for you or your readers. Risk appearing ignorant, and don't cram your research reports with analyses and numbers you don't see the added value of.

By the same token, if a young scholar asks you which statistical test they should use, first ask them why they think they need a test at all and what exactly it is they want to test. Similarly, if a novice asks you how they can run this or that analysis, ask them how they think running such an analysis would help them address their research question. Even if the added value of such an analysis is clear to you, it may not be clear to them.

**Edit (February 21, 2022)**: Also see Daniël Lakens’ blog post *The New Heuristics*, where he proposes researchers should adhere to the adage *justify everything*.

You can download the `obtain_edits()` function from https://janhove.github.io/RCode/obtainEdits.R or source it directly:

`source("https://janhove.github.io/RCode/obtainEdits.R")`

The function recognises words that were deleted or inserted, words that were substituted for other words, and cases where one word was split into two or where two words were merged into one. Each of these changes counts as one operation. The algorithm determines the smallest number of operations needed to transform one version of the text into the other and outputs a data frame listing these operations.

Here’s an example:

```
original <- "Check howmany changes need be made in order to change the first tekst in to the second one."
corrected <- "Check how many changes need to be made in order to change the first text into the second one."
obtain_edits(original, corrected)
```

```
[[1]]
[1] 4
[[2]]
change_position change_type change_from change_to
1 14 merger in to into
2 13 substitution tekst text
3 5 insertion to
4 2 split howmany how many
```

Note that while the minimal operation count is uniquely determined, the list of changes that were made isn’t. Consider this example:

```
textA <- "first secondthird"
textB <- "second third"
obtain_edits(textA, textB)
```

```
[[1]]
[1] 2
[[2]]
change_position change_type change_from change_to
1 2 split secondthird second third
2 1 deletion first
```

The algorithm identifies the difference between `textA` and `textB` as a matter of deleting 'first' and splitting up 'secondthird'. But we could also consider it a matter of substituting 'second' for 'first' and 'third' for 'secondthird'.
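To make the underlying idea concrete, here's an illustrative Python sketch of a plain word-level edit distance (the Wagner–Fischer dynamic-programming algorithm). Note that, unlike `obtain_edits()`, this sketch doesn't handle splits and mergers, so the two functions can count some examples differently:

```python
def word_edits(a, b):
    """Smallest number of word insertions, deletions and substitutions
    needed to turn string a into string b (Wagner-Fischer algorithm)."""
    wa, wb = a.split(), b.split()
    # dp[i][j]: cost of turning the first i words of a into the first j words of b
    dp = [[0] * (len(wb) + 1) for _ in range(len(wa) + 1)]
    for i in range(1, len(wa) + 1):
        dp[i][0] = i
    for j in range(1, len(wb) + 1):
        dp[0][j] = j
    for i in range(1, len(wa) + 1):
        for j in range(1, len(wb) + 1):
            cost = 0 if wa[i - 1] == wb[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a word
                           dp[i][j - 1] + 1,         # insert a word
                           dp[i - 1][j - 1] + cost)  # substitute (or keep)
    return dp[len(wa)][len(wb)]

print(word_edits("first secondthird", "second third"))  # 2
```

On the example above, it finds the same minimal count of 2, here realised as two substitutions rather than a deletion plus a split.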

Nothing about stats or research design in this post, but perhaps this function is useful to someone somewhere!

In the following, `x` and `y` refer to the independent and dependent variable of interest, respectively; that is, `x` would correspond to the intervention and `y` to the L2 French conversational skills in our example. `z` refers to the post-treatment variable, i.e., the French vocabulary scores in our example. `x` is a binary variable; `y` and `z` are continuous. Since `z` is a post-treatment variable, it's possible that it is itself influenced, directly or indirectly, by `x`. In the five cases examined below, this is indeed the case.

I've included all R code, as I think running simulations like the ones below is a useful way to learn research design and statistics. If you're just interested in the upshot, simply ignore the code snippets. :)

**Case 1:** `x` affects both `y` and `z`; `y` and `z` don't affect each other.

In the first case, `x` affects both `y` and `z`, but `z` and `y` don't influence each other.

In this case, controlling for `z` doesn't bias the estimate of the causal influence of `x` on `y`. It does, however, reduce the precision of this estimate. To appreciate this, let's simulate some data. The function `case1()` defined in the next code snippet generates a dataset corresponding to Case 1. The parameter `beta_xy` specifies the coefficient of the influence of `x` on `y`; the goal of the analysis is to estimate the value of this parameter from the data. The parameter `beta_xz` similarly specifies the coefficient of the influence of `x` on `z`. Estimating the latter coefficient isn't a goal of the analysis, since `z` is merely a control variable.

```
case1 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5) {
  # Create x (n_per_group 0s and n_per_group 1s)
  x <- rep(c(0, 1), each = n_per_group)
  # x affects y; 'rnorm' just adds some random noise to the observations.
  # In a DAG, this noise corresponds to the influence of other variables that
  # didn't need to be plotted.
  y <- beta_xy*x + rnorm(2*n_per_group)
  # x affects z
  z <- beta_xz*x + rnorm(2*n_per_group)
  # Create data frame
  dfr <- data.frame(x = as.factor(x), y, z)
  # Add info: z above or below median?
  dfr$z_split <- factor(ifelse(dfr$z > median(dfr$z), "above", "below"))
  # Return data frame
  dfr
}
```

Use this function to create a dataset with 100 participants per group:

```
df_case1 <- case1(n_per_group = 100)
# Type 'df_case1' to inspect.
```

A graphical analysis that doesn't take the control variable `z` into account reveals a roughly one-point difference between the two conditions, which is as it should be.

```
library(tidyverse)
ggplot(data = df_case1,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1)
```

A linear model is able to retrieve the `beta_xy` coefficient, which was set at 1, well enough (estimate: 1.04).

`summary(lm(y ~ x, df_case1))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0333 0.0932 -0.357 7.21e-01
x1 1.0432 0.1319 7.911 1.77e-13
```

Alternatively, we could analyse these data while taking the control variable into account. The graphical analysis in Figure 3 achieves this by splitting up the control variable at its median and plotting the two subsets separately. This is statistically suboptimal, but it makes the visualisation easier to grok. Here, too, we find a roughly one-point difference between the two conditions in each panel, which suggests that controlling for `z` won't induce any bias.

```
ggplot(data = df_case1,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1) +
  facet_wrap(~ z_split)
```

The linear model is again able to retrieve the coefficient of interest well enough (estimate: 1.09), though with a slightly larger standard error.

`summary(lm(y ~ x + z, df_case1))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0346 0.0935 -0.371 7.11e-01
x1 1.0866 0.1698 6.398 1.12e-09
z -0.0254 0.0625 -0.407 6.85e-01
```

Of course, it's difficult to draw any firm conclusions from the analysis of a single simulated dataset. To see that in this general case the coefficient of interest is indeed estimated without bias but with decreased precision, let's generate 5,000 such datasets and analyse them with and without taking the control variable into account. The function `sim_case1()` defined below runs these analyses; the ggplot call plots the estimates for the parameter. As the caption to Figure 4 explains, this simulation confirms what we observed above.

```
# Another function. This one takes the function case1(),
# runs it nruns (here: 5,000) times and extracts estimates
# from two analyses per generated dataset.
sim_case1 <- function(nruns = 5000, n_per_group = 100, beta_xy = 1, beta_xz = 1.5) {
  est_without <- vector("double", length = nruns)
  est_with <- vector("double", length = nruns)
  for (i in 1:nruns) {
    # Generate data
    d <- case1(n_per_group = n_per_group, beta_xy = beta_xy, beta_xz = beta_xz)
    # Analyse (in regression model) without covariate and extract estimate
    est_without[[i]] <- coef(lm(y ~ x, data = d))[[2]]
    # Analyse with covariate, extract estimate
    est_with[[i]] <- coef(lm(y ~ x + z, data = d))[[2]]
  }
  # Output data frame with results
  results <- data.frame(
    simulation = rep(1:nruns, 2),
    covariate = rep(c("with covariate", "without covariate"), each = nruns),
    estimate = c(est_with, est_without)
  )
  results
}

# Run function and plot results
results_sim_case1 <- sim_case1()
ggplot(data = results_sim_case1,
       aes(x = estimate)) +
  geom_histogram(fill = "lightgrey", colour = "black", bins = 20) +
  geom_vline(xintercept = 1, linetype = "dashed") +
  facet_wrap(~ covariate)
```

The estimate for the parameter is unbiased in both analyses, but the analysis with the covariate offers *less* rather than more precision: The standard deviation of the distribution of parameter estimates increases from 0.14 to 0.18:

```
results_sim_case1 %>%
  group_by(covariate) %>%
  summarise(mean_est = mean(estimate),
            sd_est = sd(estimate))
```

```
# A tibble: 2 × 3
covariate mean_est sd_est
<chr> <dbl> <dbl>
1 with covariate 0.999 0.176
2 without covariate 0.998 0.140
```

**Case 2:** `x` affects `y`, which in turn affects `z`.

In the second case, `x` affects `y` directly, and `y` in turn affects `z`.

This time, controlling for `z` biases the estimate of the parameter. To see this, let's again simulate and analyse some data.

```
case2 <- function(n_per_group, beta_xy = 1, beta_yz = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  y <- beta_xy*x + rnorm(2*n_per_group)
  # y affects z
  z <- beta_yz*y + rnorm(2*n_per_group)
  dfr <- data.frame(x = as.factor(x), y, z)
  dfr$z_split <- factor(ifelse(dfr$z > median(dfr$z), "above", "below"))
  dfr
}
df_case2 <- case2(n_per_group = 100)
```

When the data are analysed without taking the control variable into account, we obtain the following result:

```
ggplot(data = df_case2,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1)
```

This isn't quite as close to a one-point difference as in the previous example, but as we'll see below, that's merely due to the randomness inherent in these simulations. The linear model yields a parameter estimate of 0.68.

`summary(lm(y ~ x, df_case2))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.295 0.095 3.11 2.15e-03
x1 0.684 0.134 5.09 8.33e-07
```

When we take the control variable into account, however, the difference between the two groups defined by `x` becomes smaller:

```
ggplot(data = df_case2,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1) +
  facet_wrap(~ z_split)
```

The linear model now yields a parameter estimate of 0.16, which is considerably farther from the actual parameter value of 1.

`summary(lm(y ~ x + z, df_case2))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0653 0.0540 1.21 2.28e-01
x1 0.1638 0.0788 2.08 3.89e-02
z 0.4642 0.0221 21.03 3.05e-52
```

The larger-scale simulation shows that the analysis with the covariate is indeed biased if you want to estimate the causal influence of `x` on `y`.

```
# beta_yz replaces beta_xz compared to the previous case
sim_case2 <- function(nruns = 5000, n_per_group = 100, beta_xy = 1, beta_yz = 1.5) {
  est_without <- vector("double", length = nruns)
  est_with <- vector("double", length = nruns)
  for (i in 1:nruns) {
    d <- case2(n_per_group = n_per_group, beta_xy = beta_xy, beta_yz = beta_yz)
    est_without[[i]] <- coef(lm(y ~ x, data = d))[[2]]
    est_with[[i]] <- coef(lm(y ~ x + z, data = d))[[2]]
  }
  results <- data.frame(
    simulation = rep(1:nruns, 2),
    covariate = rep(c("with covariate", "without covariate"), each = nruns),
    estimate = c(est_with, est_without)
  )
  results
}
results_sim_case2 <- sim_case2()
ggplot(data = results_sim_case2,
       aes(x = estimate)) +
  geom_histogram(fill = "lightgrey", colour = "black", bins = 20) +
  geom_vline(xintercept = 1, linetype = "dashed") +
  facet_wrap(~ covariate)
```

The fact that the distribution of the parameter estimates is narrower when taking the covariate into account is completely immaterial, since these estimates are estimating the wrong quantity.

```
results_sim_case2 %>%
  group_by(covariate) %>%
  summarise(mean_est = mean(estimate),
            sd_est = sd(estimate))
```

```
# A tibble: 2 × 3
covariate mean_est sd_est
<chr> <dbl> <dbl>
1 with covariate 0.309 0.0851
2 without covariate 1.00 0.143
```
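The same bias can be reproduced outside R. The following pure-Python sketch (illustrative; variable names like `est_with` are my own) reruns Case 2 with a large sample, so that the slope on `x` with the mediator in the model lands near 0.31 rather than near the true causal effect of 1:

```python
# Case 2 in Python: x -> y -> z. Controlling for the mediator z
# biases the OLS estimate of x's effect on y towards zero.
import random

random.seed(1)
n = 100_000                                       # large n: stable estimates
x = [0.0] * n + [1.0] * n
y = [1.0 * xi + random.gauss(0, 1) for xi in x]   # beta_xy = 1
z = [1.5 * yi + random.gauss(0, 1) for yi in y]   # beta_yz = 1.5

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

xc, yc, zc = demean(x), demean(y), demean(z)

# slope of y ~ x (unbiased here)
est_without = dot(xc, yc) / dot(xc, xc)

# slope on x in y ~ x + z, via the two-predictor normal equations
sxx, szz, sxz = dot(xc, xc), dot(zc, zc), dot(xc, zc)
sxy, szy = dot(xc, yc), dot(zc, yc)
est_with = (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)

print(round(est_without, 2), round(est_with, 2))  # ≈ 1.0 and ≈ 0.31
```

The 0.31 matches the mean estimate in the 5,000-run simulation above; it follows analytically from the residual covariance between `y` and `z` once `x` is held fixed.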

**Case 3:** `x` and `y` both affect `z`; `x` also affects `y`.

Now `z` is affected by both `x` and `y`, though `x` still affects `y`. Taking the covariate into account again yields biased estimates.

Same procedure as last year, James.

```
case3 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5, beta_yz = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  y <- beta_xy*x + rnorm(2*n_per_group)
  # x and y affect z
  z <- beta_xz*x + beta_yz*y + rnorm(2*n_per_group)
  dfr <- data.frame(x = as.factor(x), y, z)
  dfr$z_split <- factor(ifelse(dfr$z > median(dfr$z), "above", "below"))
  dfr
}
df_case3 <- case3(n_per_group = 100)
```

```
ggplot(data = df_case3,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1)
```

Again, the analysis without the control variable yields a reasonably accurate estimate of the true parameter value of 1 (estimate: 1.05).

`summary(lm(y ~ x, df_case3))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.101 0.106 -0.953 3.42e-01
x1 1.047 0.150 6.992 4.07e-11
```

When we take the control variable into account, however, the difference between the two groups defined by `x` becomes smaller:

```
ggplot(data = df_case3,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1) +
  facet_wrap(~ z_split)
```

The linear model now yields a parameter estimate of −0.32, which is considerably farther from the actual parameter value of 1 and even has the wrong sign.

`summary(lm(y ~ x + z, df_case3))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.00596 0.0580 0.103 9.18e-01
x1 -0.31746 0.1033 -3.074 2.41e-03
z 0.46047 0.0213 21.629 6.28e-54
```

For the sake of completeness, let’s run this simulation 5,000 times, too.

```
sim_case3 <- function(nruns = 5000, n_per_group = 100, beta_xy = 1, beta_xz = 1.5, beta_yz = 1.5) {
  est_without <- vector("double", length = nruns)
  est_with <- vector("double", length = nruns)
  for (i in 1:nruns) {
    d <- case3(n_per_group = n_per_group, beta_xy = beta_xy, beta_xz = beta_xz, beta_yz = beta_yz)
    est_without[[i]] <- coef(lm(y ~ x, data = d))[[2]]
    est_with[[i]] <- coef(lm(y ~ x + z, data = d))[[2]]
  }
  results <- data.frame(
    simulation = rep(1:nruns, 2),
    covariate = rep(c("with covariate", "without covariate"), each = nruns),
    estimate = c(est_with, est_without)
  )
  results
}
results_sim_case3 <- sim_case3()
ggplot(data = results_sim_case3,
       aes(x = estimate)) +
  geom_histogram(fill = "lightgrey", colour = "black", bins = 20) +
  geom_vline(xintercept = 1, linetype = "dashed") +
  facet_wrap(~ covariate)
```

The fact that the distribution of the parameter estimates is narrower when taking the covariate into account is completely immaterial, since these estimates are estimating the wrong quantity.

```
results_sim_case3 %>%
  group_by(covariate) %>%
  summarise(mean_est = mean(estimate),
            sd_est = sd(estimate))
```

```
# A tibble: 2 × 3
covariate mean_est sd_est
<chr> <dbl> <dbl>
1 with covariate -0.382 0.105
2 without covariate 1.00 0.145
```

**Case 4:** `x` affects `z`; both `x` and `z` influence `y`.

That is, `x` influences both `y` and `z`, but `z` also influences `y`. Let `beta_xy` be the direct effect of `x` on `y`, `beta_xz` the effect of `x` on `z`, and `beta_zy` the effect of `z` on `y`. Then the *total* effect of `x` on `y` is `beta_xy + beta_xz*beta_zy`.

Using the defaults in the following function, the total effect of `x` on `y` is 1 + 1.5 × 0.5 = 1.75. If this doesn't make immediate sense, consider what a change of one unit in `x` causes downstream: a one-unit increase in `x` directly increases `y` by 1. It also increases `z` by 1.5. But a one-unit increase *in z* causes an increase of 0.5 in `y` as well, so a 1.5-unit increase in `z` causes an additional increase of 0.75 in `y`. So in total, a one-unit increase in `x` causes a 1.75-point increase in `y`.

```
case4 <- function(n_per_group, beta_xy = 1, beta_xz = 1.5, beta_zy = 0.5) {
  x <- rep(c(0, 1), each = n_per_group)
  # x affects z
  z <- beta_xz*x + rnorm(2*n_per_group)
  # x and z affect y
  y <- beta_xy*x + beta_zy*z + rnorm(2*n_per_group)
  dfr <- data.frame(x = as.factor(x), y, z)
  dfr$z_split <- factor(ifelse(dfr$z > median(dfr$z), "above", "below"))
  dfr
}
df_case4 <- case4(n_per_group = 100)
```
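As a quick sanity check on that arithmetic (an illustrative Python snippet of my own, with variable names matching the `case4()` parameters), the total effect is the direct path plus the product of the coefficients along the indirect path:

```python
# total effect of x on y = direct effect + indirect effect through z
beta_xy, beta_xz, beta_zy = 1.0, 1.5, 0.5  # defaults in case4()
total_effect = beta_xy + beta_xz * beta_zy
print(total_effect)  # 1.75
```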

```
ggplot(data = df_case4,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1)
```

Again, the analysis without the control variable yields a reasonable estimate (1.49) of the true *total* influence of `x` on `y`, which is 1.75 (!).

`summary(lm(y ~ x, df_case4))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.19 0.114 1.66 9.79e-02
x1 1.49 0.161 9.21 4.69e-17
```

When we take the control variable into account, however, the difference between the two groups defined by `x` becomes smaller:

```
ggplot(data = df_case4,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1) +
  facet_wrap(~ z_split)
```

The linear model now yields a parameter estimate of 0.99. This analysis correctly estimates the *direct* effect of `x` on `y` (i.e., without the additional causal link from `x` to `y` through `z`). This may be interesting in its own right, but the analysis addresses a question different from "What's the causal influence of `x` on `y`?"

`summary(lm(y ~ x + z, df_case4))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0464 0.1077 0.431 6.67e-01
x1 0.9863 0.1702 5.797 2.65e-08
z 0.4694 0.0777 6.041 7.52e-09
```

For the sake of completeness, let’s run this simulation 5,000 times, too.

```
sim_case4 <- function(nruns = 5000, n_per_group = 100, beta_xy = 1, beta_xz = 1.5, beta_zy = 0.5) {
  est_without <- vector("double", length = nruns)
  est_with <- vector("double", length = nruns)
  for (i in 1:nruns) {
    d <- case4(n_per_group = n_per_group, beta_xy = beta_xy, beta_xz = beta_xz, beta_zy = beta_zy)
    est_without[[i]] <- coef(lm(y ~ x, data = d))[[2]]
    est_with[[i]] <- coef(lm(y ~ x + z, data = d))[[2]]
  }
  results <- data.frame(
    simulation = rep(1:nruns, 2),
    covariate = rep(c("with covariate", "without covariate"), each = nruns),
    estimate = c(est_with, est_without)
  )
  results
}
results_sim_case4 <- sim_case4()
ggplot(data = results_sim_case4,
       aes(x = estimate)) +
  geom_histogram(fill = "lightgrey", colour = "black", bins = 20) +
  geom_vline(xintercept = 1.75, linetype = "dashed") +
  facet_wrap(~ covariate)
```

```
results_sim_case4 %>%
  group_by(covariate) %>%
  summarise(mean_est = mean(estimate),
            sd_est = sd(estimate))
```

```
# A tibble: 2 × 3
covariate mean_est sd_est
<chr> <dbl> <dbl>
1 with covariate 0.998 0.177
2 without covariate 1.75 0.157
```

**Case 5:** `x` and `z` affect `y`; `x` and `z` don't affect each other.

In the final case, `x` and `z` both affect `y`, but `x` and `z` don't affect each other. That is, `z` isn't affected by the intervention in any way, so it functions like a pre-treatment control variable would. The result is an increase in statistical precision. This is the only one of the five cases examined here in which the control variable has added value for the purpose of estimating the causal influence of `x` on `y`.

```
case5 <- function(n_per_group, beta_xy = 1, beta_zy = 1.5) {
  x <- rep(c(0, 1), each = n_per_group)
  # Create z
  z <- rnorm(2*n_per_group)
  # x and z affect y
  y <- beta_xy*x + beta_zy*z + rnorm(2*n_per_group)
  dfr <- data.frame(x = as.factor(x), y, z)
  dfr$z_split <- factor(ifelse(dfr$z > mean(dfr$z), "above", "below"))
  dfr
}
df_case5 <- case5(n_per_group = 100)
```

```
ggplot(data = df_case5,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1)
```

Again, the analysis without the control variable yields an estimate (0.68) that is compatible with the true parameter value of 1, though a fairly imprecise one.

`summary(lm(y ~ x, df_case5))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0201 0.172 0.117 0.9068
x1 0.6818 0.243 2.807 0.0055
```

```
ggplot(data = df_case5,
       aes(x = x, y = y)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(position = position_jitter(width = 0.2), pch = 1) +
  facet_wrap(~ z_split)
```

The linear model now yields a parameter estimate of 1.06, which is also a reasonable estimate, but one with a smaller standard error.

`summary(lm(y ~ x + z, df_case5))$coefficient`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00183 0.1022 -0.0179 9.86e-01
x1 1.05971 0.1459 7.2626 8.62e-12
z 1.44502 0.0759 19.0298 1.67e-46
```

For the sake of completeness, let’s run this simulation 5,000 times, too.

```
sim_case5 <- function(nruns = 5000, n_per_group = 100, beta_xy = 1, beta_zy = 1.5) {
  est_without <- vector("double", length = nruns)
  est_with <- vector("double", length = nruns)
  for (i in 1:nruns) {
    d <- case5(n_per_group = n_per_group, beta_xy = beta_xy, beta_zy = beta_zy)
    est_without[[i]] <- coef(lm(y ~ x, data = d))[[2]]
    est_with[[i]] <- coef(lm(y ~ x + z, data = d))[[2]]
  }
  results <- data.frame(
    simulation = rep(1:nruns, 2),
    covariate = rep(c("with covariate", "without covariate"), each = nruns),
    estimate = c(est_with, est_without)
  )
  results
}
results_sim_case5 <- sim_case5()
ggplot(data = results_sim_case5,
       aes(x = estimate)) +
  geom_histogram(fill = "lightgrey", colour = "black", bins = 20) +
  geom_vline(xintercept = 1, linetype = "dashed") +
  facet_wrap(~ covariate)
```

```
results_sim_case5 %>%
  group_by(covariate) %>%
  summarise(mean_est = mean(estimate),
            sd_est = sd(estimate))
```

```
# A tibble: 2 × 3
covariate mean_est sd_est
<chr> <dbl> <dbl>
1 with covariate 1.00 0.139
2 without covariate 0.995 0.249
```

When a control variable is collected *after* the intervention has taken place, it may itself be directly or indirectly affected by the intervention. If it is, including the control variable in the analysis may yield biased estimates or decrease rather than increase the precision of the estimates. In designed experiments, the solution to this problem is evident: collect the control variable before the intervention takes place. If this isn't possible, you had better be pretty sure that the control variable isn't affected by the intervention. More generally, throwing predictor variables into a statistical model in the hope that this will somehow improve the analysis is a dreadful idea.

Please note that I reran the code on this page on August 6, 2023.

`devtools::session_info()`

```
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os Ubuntu 22.04.3 LTS
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Zurich
date 2023-08-27
pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.3.1)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
boot 1.3-28 2021-05-03 [4] CRAN (R 4.2.0)
cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.1)
checkmate 2.2.0 2023-04-27 [1] CRAN (R 4.3.1)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
cmdstanr * 0.6.0 2023-08-02 [1] local
coda 0.19-4 2020-09-30 [1] CRAN (R 4.3.1)
codetools 0.2-19 2023-02-01 [4] CRAN (R 4.2.2)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1)
curl 5.0.1 2023-06-07 [1] CRAN (R 4.3.1)
dagitty * 0.3-1 2021-01-21 [1] CRAN (R 4.3.1)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.1)
digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
distributional 0.3.2 2023-03-22 [1] CRAN (R 4.3.1)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
evaluate 0.15 2022-02-18 [2] CRAN (R 4.2.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.3.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.1)
httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.1)
inline 0.3.19 2021-05-31 [1] CRAN (R 4.3.1)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)
later 1.3.1 2023-05-02 [1] CRAN (R 4.3.1)
lattice 0.21-8 2023-04-05 [4] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
loo 2.6.0 2023-03-31 [1] CRAN (R 4.3.1)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60 2023-05-04 [4] CRAN (R 4.3.1)
matrixStats 1.0.0 2023-06-02 [1] CRAN (R 4.3.1)
memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.0)
mime 0.10 2021-02-13 [2] CRAN (R 4.0.2)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
mvtnorm 1.2-2 2023-06-08 [1] CRAN (R 4.3.1)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.3.1)
posterior 1.4.1 2023-03-14 [1] CRAN (R 4.3.1)
prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.2.0)
processx 3.8.2 2023-06-30 [1] CRAN (R 4.3.1)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.1)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.1)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.1)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
RcppParallel 5.1.7 2023-02-27 [1] CRAN (R 4.3.1)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
remotes 2.4.2 2021-11-30 [2] CRAN (R 4.2.0)
rethinking * 2.31 2023-08-02 [1] Github (rmcelreath/rethinking@2f01a9c)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
rstan * 2.26.22 2023-08-01 [1] local
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
shape 1.4.6 2021-05-19 [1] CRAN (R 4.3.1)
shiny 1.7.4.1 2023-07-06 [1] CRAN (R 4.3.1)
StanHeaders * 2.26.27 2023-06-14 [1] CRAN (R 4.3.1)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.1)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tensorA 0.36.2 2020-11-19 [1] CRAN (R 4.3.1)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.1)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.1)
usethis 2.2.2 2023-07-06 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1)
V8 4.3.0 2023-04-08 [1] CRAN (R 4.3.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.1)
yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
[1] /home/jan/R/x86_64-pc-linux-gnu-library/4.3
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────
```

It contains seven reading assignments (mostly empirical studies that serve as examples) and ten chapters with lectures:

- Association and causality.
- Constructing a control group.
- Alternative explanations.
- Inferential statistics 101. (The course is not a statistics course, but there’s no avoiding talking about p-values given their omnipresence.)
- Increasing precision.
- Pedagogical interventions.
- Within-subjects experiments.
- Quasi-experiments and correlational studies.
- Constructs and indicators.
- Questionable research practices.

I’ve also included two appendices:

- Reading difficult results sections.
- Reporting research transparently.

Hopefully some of you find it useful, and feel free to let me know what you think.

Also typically in educational experiments, researchers have some information about the participants’ performance before the intervention took place. This information can come in the form of a covariate, for instance the participants’ performance on a pretest or some self-assessment of their skills. Even in experiments that use random assignment, including such covariates in the analysis is useful as they help to reduce the error variance. Lots of different methods for including covariates in the analysis of cluster-randomised experiments are discussed in the literature, but I couldn’t find any discussion about the merits and drawbacks of these different methods.

In order to provide such discussion, I ran a series of simulations to compare 25 (!) different ways of including a covariate in the analysis of a cluster-randomised experiment in terms of their Type-I error and their power. The **article** outlining these simulations and the findings is available from PsyArXiv; the **R code** used for the simulations as well as its output is available from the Open Science Framework. In the remainder of this post, I’ll discuss how these simulations may be useful to you if you’re planning to run a cluster-randomised experiment.

Please read pages 1–3 and pages 40–42 of the article :)

Ah, interesting! It took a long time to run these simulations (about 36 hours), during which I couldn’t use my computer for anything else, so I’m not exactly gung-ho about rerunning them just to include one additional analytic method.

But here’s what you can do. Go to the OSF page and download the files `functions/generate_clustered_data.R` and `scripts/additional_simulations.Rmd`. The latter file contains some smaller-scale simulations that don’t take as long to run. Adapt the simulations there and check if the analytical method you know of has an acceptable Type-I error rate for a variety of parameter settings. (Two examples are available, but if you can’t make sense of them, let me know.) If its Type-I error rate is acceptable, run another batch of simulations to assess its power and compare it to the power for the best-performing methods in my simulation.
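To make the logic concrete, here’s a minimal, self-contained sketch of such a Type-I error check — not the actual OSF scripts, and with made-up parameter values: simulate data under a true null effect, analyse them (here with a simple cluster-level t-test), and compare the proportion of significant results to the nominal 5%.

```r
# Minimal sketch (not the OSF scripts): estimate the Type-I error rate of a
# cluster-level t-test in a cluster-randomised design with no true effect.
# All parameter values are made up for illustration.
set.seed(123)
simulate_p <- function(n_clusters = 10, n_per_cluster = 20, icc = 0.15) {
  condition <- rep(c(0, 1), length.out = n_clusters)   # cluster-level assignment
  cluster_effect <- rnorm(n_clusters, sd = sqrt(icc))
  outcome <- rep(cluster_effect, each = n_per_cluster) +
    rnorm(n_clusters * n_per_cluster, sd = sqrt(1 - icc))  # H0 is true: no effect
  cluster_id <- rep(seq_len(n_clusters), each = n_per_cluster)
  cluster_means <- tapply(outcome, cluster_id, mean)   # analyse at the cluster level
  t.test(cluster_means[condition == 0], cluster_means[condition == 1])$p.value
}
p_values <- replicate(2000, simulate_p())
mean(p_values < 0.05)  # should be close to 0.05 for a method with a nominal error rate
```

If the proportion lands clearly above 0.05, the method is anticonservative and there’s no point in moving on to the power comparison.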

If your method compares well to these best-performing methods in terms of both its Type-I error rate and its power, drop me a line, and perhaps I’ll get round to rerunning the large-scale simulations. Better still, download the other functions and scripts, include your method in `functions/analyse_clustered_data.R`, and adjust the other files accordingly. Then run the simulation yourself :)

Perhaps your study will feature clusters that vary more strongly in size than was the case in my simulations. Or perhaps you suspect that the intracluster correlation will be quite different from the ones that I considered. Or perhaps etc., etc., etc. It’d be better if the results of the simulations were more directly relevant to what you suspect your study will look like.

But here’s the beauty. You can go to the OSF page, download all the functions and scripts, and tailor the simulation parameters to your liking. In `script/simulation_type_I_error.R` and `script/simulation_power.R`, you can change the cluster sizes, the number of clusters, the strength of the covariate, the ICC, the effect, and the randomisation scheme. Then run these scripts and figure out which analysis method will likely retain its nominal Type-I error while maximising power.

The assumptions are outlined in the article on pp. 23–24, and they are made more explicit in the function that generates the data (`functions/generate_clustered_data.R`). Perhaps they’re unrealistic. For instance, the data are all drawn from normal distributions, the covariate is linearly related to the outcome, etc. If you want to revise these assumptions, you’ll have to edit this function. (Test the function extensively afterwards!) Then re-run the simulations, with the simulation parameters tailored to your study.

At the moment, I don’t intend to submit this article to any journal. The main reason is that anyone who may be interested in it already has free access to it. If anyone has any feedback, I’d be happy to hear it, but I don’t currently feel like jumping through a series of hoops in some drawn-out reviewing process.

Contents:

- Why plot models, and why visualise uncertainty?
- The principle: An example with simple linear regression
- Step 1: Fit the model
- Step 2: Compute the conditional means and confidence intervals
- Step 3: Plot!

- Predictions about individual cases vs. conditional means
- More examples
- Several continuous predictors
- Dealing with categorical predictors
- t-tests are models, too
- Dealing with interactions
- Ordinary logistic regression
- Mixed-effects models
- Logistic mixed effects models

- Caveats
- Other things may not be equal
- Your model may be misspecified
- Other models may yield different pictures

Regression models have three main uses. The first is to describe the data at hand. The difficulty here mostly consists in figuring out what aspects of the data the parameter estimates reflect. Weeks 6, 7, 8 and 11 are devoted to statistical interpretation of model parameters and how variables can be recoded so that the model output aligns more closely with the research questions.

However, the main use of regression models in the social sciences is to draw inferences, usually causal ones. Moving from a descriptive to a causal interpretation of a statistical model requires making additional assumptions. Weeks 2 through 5 are devoted to a tool (directed acyclic graphs) that allows you to make explicit the assumptions you’re willing to make about the causal relationships between your variables and that allows you to derive from these assumptions any further permissible causal claims. Another type of inference is the move from observable quantities (e.g., test scores) to unobservables (e.g., language skills). Weeks 10 and 11 are devoted to this topic.

The third use is to use the model to make predictions about new data. This week’s text (Shmueli 2010) explains why a model that has been optimised for making predictions about new data may be all but worthless for inference, and why a model that has been optimised for inference may not yield the best possible predictions. The take-home points are that when planning a research project, you need to be crystal-clear what its main goal is (e.g., causal inference or prediction) and that you should be careful not to assume that a model selected for its predictive power is best-suited for drawing causal conclusions.

- Text: Shmueli (2010).
- Further reading: Breiman (2001); Yarkoni and Westfall (2017).

The texts for weeks 2 through 5 introduce directed acyclic graphs (DAGs) and go through numerous examples for them. DAGs are useful for identifying the variables that you should control for and the ones you should *not* control for if you want to estimate some causal relationship in your data. (Some researchers seem to assume that the more variables you control for, the better, but controlling for the wrong variables can mess up your inferences entirely.) This, of course, is most useful when you’re still planning your research project, because otherwise you may find that you need to control for a variable that you didn’t collect, or that you controlled (on purpose or by accident) for a variable you shouldn’t have controlled for.

- Text: McElreath (2020), Chapter 5.

- Text: McElreath (2020), Chapter 6.

- Text: Rohrer (2018).

- No obligatory reading.
- Further reading: Elwert (2013).

Leaving causal interpretations aside, what do all those numbers in the output of a regression model actually express? DeBruine and Barr (2019) explain how you can analyse simulated datasets to learn which parameter estimates in the simulation correspond to which parameter settings in the simulation set-up.

A related point that I highlighted in class was that the random effect estimates as well as the BLUPs in mixed-effects models should always be interpreted conditionally on the fixed effects in the model. This is true of all estimates in regression models, but people tend to have more difficulties in interpreting random effects and BLUPs. Another point was that you can also gain a better understanding of what the model parameters express by *first* fitting the model on your data and *then* having this model predict new data. By figuring out how the model came up with these predictions, you learn what each parameter estimate literally means.
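As a minimal illustration of both points (with made-up parameter values), you can simulate data from known parameters, fit a model, and then have the fitted model predict new data:

```r
# Simulate data with known parameters, then check which estimates recover them.
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)  # true intercept: 2; true slope: 0.5
m <- lm(y ~ x)
coef(m)  # the estimates should land close to 2 and 0.5
# Having the model predict new data shows what the estimates literally mean:
predict(m, newdata = data.frame(x = c(0, 1)))  # roughly 2 and 2.5
```

The prediction at `x = 0` is the intercept estimate, and the difference between the two predictions is the slope estimate — which is exactly the sense in which these parameters are to be interpreted.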

- Text: DeBruine and Barr (2019).

Weeks 7 and 8 were devoted to contrast coding, i.e., how you can recode non-numeric predictors such that the model’s output aligns more closely with what you want to know. I’ve recently blogged about contrast coding, and I was surprised I didn’t learn about this useful technique until 2020 (of all years).

- Text: Schad et al. (2020), up to and including the section *What makes a good set of contrasts?*

- No obligatory reading.
- Further reading: Schad et al. (2020), from the section *A closer look at hypothesis and contrast matrices*.

The measured variables included in a model are often but approximations of what is actually of interest. For instance, you may be interested in the learners’ L2 skills, but what you’ve measured is their performance on an L2 test. The test results will only approximately reflect the learners’ true skills. Interpreting the output of a model, which may be valid at the level of the observed variables, in terms of such unobserved but inferred constructs is fraught with difficulties that researchers and consumers of research need to be aware of.

The reading for week 9 deals with some consequences of measurement error on a predictor variable. The reading of week 10 doesn’t strictly deal with measurement error but with the mapping of the observed outcome variable on the unobserved construct of interest and how it affects the interpretation of interactions.

- Text: Westfall and Yarkoni (2016).
- Further reading: Berthele and Vanhove (2020).

- Text: Wagenmakers et al. (2012).
- Further reading: Wagenmakers (2015).

Logistic regression models can be difficult to understand, and the linear probability model (i.e., ordinary linear regression) isn’t to be dismissed out of hand when working with binary data. A related blog post is *Interactions in logistic regression models*.

- Text: Huang (2019).

In week 12 I went through some examples of verbal research questions or hypotheses that at first blush seem pretty well delineated. On closer inspection, however, it becomes clear that radically different patterns in the data would yield the same answer to these questions, and that the research questions or hypotheses were, in fact, underspecified. Drawing several possible data patterns and interpreting them in light of your literal research question or hypothesis can help you rephrase that question or hypothesis less ambiguously.

No texts.

For the last week, I stressed the following take-home points from this course:

- Be crystal-clear about the main aim of your statistical model: Describing the data, predicting new data, or drawing inferences about causality or unobserved phenomena? Plan accordingly by identifying the factors that must be controlled for and those that mustn’t be controlled for.
- Anticipate the consequences of measurement error. If measurement error could mess up the interpretation of the results, try to collect several indicators of the constructs of interest and adopt a latent variable approach.
- Outline *precisely* how you’d interpret the possible patterns in the data in terms of your research question.
- If a regression model is necessary, recode your predictors so that you can interpret the parameter estimates directly in terms of your research question.
- Analyse simulated data if you’re unsure what the model’s parameter estimates correspond to.
- Keep in mind that parameter estimates are always to be interpreted conditionally on the other predictors in the model. I suspect that lots of counterintuitive findings stem from researchers interpreting their parameter estimates unconditionally.

I also showed how you can make your analyses reproducible by working with RStudio projects, the here package, and R Markdown.
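For instance (a sketch with a hypothetical file path), the here package resolves paths relative to the project root, so scripts keep working no matter which subdirectory they’re run from:

```r
# Sketch: read a data file relative to the RStudio project root.
# "data/my_data.csv" is a hypothetical path.
library(here)
d <- read.csv(here("data", "my_data.csv"))
```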

No texts.

Berthele, Raphael and Jan Vanhove. 2020. What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland. *International Journal of Bilingual Education and Bilingualism* 23(5). 550-566.

Breiman, Leo. 2001. Statistical modeling: The two cultures. *Statistical Science* 16. 199-215.

DeBruine, Lisa M. and Dale J. Barr. 2019. Understanding mixed effects models through data simulation. PsyArXiv.

Elwert, Felix. 2013. Graphical causal models. In S. L. Morgan (ed.), *Handbook of Causal Analysis for Social Research*, pp. 245-273. Dordrecht, The Netherlands: Springer.

Huang, Francis L. 2019. Alternatives to logistic regression models in experimental studies. *Journal of Experimental Education*.

McElreath, Richard. 2020. *Statistical rethinking: A Bayesian course with examples in R and Stan*, 2nd edn. Boca Raton, FL: CRC Press.

Rohrer, Julia. 2018. Thinking clearly about correlations and causation: Graphical causal models for observational data. *Advances in Methods and Practices in Psychological Science* 1(1). 27-42.

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein and Reinhold Kliegl. 2020. How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. *Journal of Memory and Language* 110.

Shmueli, Galit. 2010. To explain or to predict? *Statistical Science* 25. 289-310.

Wagenmakers, Eric-Jan. 2015. A quartet of interactions. *Cortex* 73. 334-335.

Wagenmakers, Eric-Jan, Angelos-Miltiadis Krypotos, Amy H. Criss and Geoff Iverson. 2012. On the interpretation of removable interactions: A survey of the field 33 years after Lotus. *Memory & Cognition* 40. 145-160.

Westfall, Jacob and Tal Yarkoni. 2016. Statistically controlling for confounding constructs is harder than you think. *PLoS ONE* 11(3). e0152719.

Yarkoni, Tal and Jacob Westfall. 2017. Choosing prediction over explanation in psychology: Lessons from machine learning. *Perspectives on Psychological Science* 12. 1100-1122.

Let’s start off nice but not too easy by analysing an experiment with three conditions and only one observation per participant.

The dataset we’ll work with comes from a study by Vanhove (2019) and is available here. The details hardly matter, but there were three experimental conditions: `information`, `no information` and `strategy`. The `no information` condition serves as the baseline control condition, and the `information` and `strategy` conditions serve as the treatment conditions. The expectation was that the treatment conditions would outperform the control condition on the outcome variable (here: `ProportionCongruent`), and I was also interested in seeing if the `strategy` condition outperformed the `information` condition.

The condition means already show that the participants in the `information` condition did not in fact outperform those in the `no information` condition, but neither that nor the small sample size should keep us from using these data for our example.

```
d <- read.csv("http://homeweb.unifr.ch/VanhoveJ/Pub/Data/Vanhove2018.csv")
table(d$Condition)
```

```
information no information strategy
15 14 16
```

`tapply(d$ProportionCongruent, d$Condition, mean)`

```
information no information strategy
0.517 0.554 0.627
```

If we fit the model directly, R will apply the default coding scheme to the categorical predictor (viz., treatment coding):

```
# The newest version of R doesn't recode strings as factors automatically,
# so code Condition as a factor for good measure.
d$Condition <- factor(d$Condition)
m_default <- lm(ProportionCongruent ~ Condition, data = d)
summary(m_default)$coefficients
```

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5172 0.0512 10.101 8.31e-13
Conditionno information 0.0369 0.0737 0.501 6.19e-01
Conditionstrategy 0.1099 0.0713 1.542 1.31e-01
```

By default, the `information` condition is chosen as the reference level because it’s first in the alphabet. That is, the 0.52 is the estimated mean for the `information` condition. The second estimate (0.04) is the difference between the mean for the `no information` condition and that of the reference level (`information`). Similarly, the third estimate (0.11) is the difference between the mean for the `strategy` condition and that of the reference level (`information`). These estimates are all correct, and they’re fairly easy to interpret once you’ve figured out what the reference level is. But if we want to, we can obtain estimated coefficients that map more directly onto the research questions by recoding the `Condition` variable manually.

`Condition` has three levels, and this means that we can obtain at most three estimated coefficients for it. It’s also possible to obtain fewer than the maximum, but this is not something I will go into here.

The first step is to **write out what you want the model’s intercept to represent** as a null hypothesis. In this example, it makes sense for the intercept to represent the mean performance in the `no information` condition. Written as a null hypothesis, this becomes µ(no information) = 0. This null hypothesis is a bit silly, but that’s not important here, just go with it; the equation is easy enough. Then, **rearrange the equation such that the right-hand side reads 0.** This is already the case here. Finally, **add the factor’s remaining levels to the left-hand side of the equation, but multiplied by 0**. You’re just adding 0s to the left-hand side of the equation, which doesn’t affect it. For clarity, I’ve also written out the implicit coefficient of 1 in front of µ(no information). The result looks like this:

0·µ(information) + 1·µ(no information) + 0·µ(strategy) = 0

Make sure that in the rearranged equation, the levels appear *in the same order* as they do in R. You can check the order of the levels using `levels()`. By default, the order is alphabetical. You can change the order of the factor levels, but then you’ll also need to change the order in which the coefficients appear in the rearranged equation:

`levels(d$Condition)`

`[1] "information" "no information" "strategy" `

The second step is to **write out null hypotheses for the comparisons that you want the remaining coefficients to estimate**. For the sake of the exercise, let’s say that I want the first remaining coefficient to estimate the difference between the mean of the control group (µ(no information)) and the *mean of the means* of the two other groups (i.e., (µ(information) + µ(strategy))/2). First write this as a null hypothesis:

(µ(information) + µ(strategy))/2 = µ(no information)

Note that I write the ‘focus’ of the comparison on the left-hand side and what it’s being compared to on the right-hand side. This will make the signs of the coefficients we later get easier to interpret. Then, bring all terms to the left-hand side:

1/2·µ(information) - 1·µ(no information) + 1/2·µ(strategy) = 0

Do not multiply any terms in the equation, i.e., do *not* write 1·µ(information) - 2·µ(no information) + 1·µ(strategy) = 0 so that you don’t have to work with fractions. The hypotheses you’ll test will be the same, but the output will be more confusing than if you just rearrange the coefficients but keep the fractions.

For the final coefficient, let’s say that I want to estimate the difference in means between the `information` and `strategy` conditions. Again, start from the corresponding null hypothesis (i.e., that these means are the same, µ(strategy) = µ(information)), and then bring all µ’s to the left-hand side while adding the missing factor levels:

-1·µ(information) + 0·µ(no information) + 1·µ(strategy) = 0

The third step is to put the coefficients of the rearranged equations into a **hypothesis matrix**. As you can see, each line in this matrix contains the coefficients belonging to the terms in the three equations above:

```
Hm <- rbind(
H00 = c(info = 0, no_info = 1, strategy = 0),
H01 = c(info = 1/2, no_info = -1, strategy = 1/2),
H02 = c(info = -1, no_info = 0, strategy = 1)
)
```

Fourth, convert this hypothesis matrix into a **contrast matrix** using the `ginv2()` function that Schad et al. (2020) provide:

```
ginv2 <- function(x) {
MASS::fractions(provideDimnames(MASS::ginv(x), base = dimnames(x)[2:1]))
}
Cm <- ginv2(Hm)
```
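A handy sanity check (my own addition, not from Schad et al.): for a full-rank hypothesis matrix, applying `ginv2()` to the contrast matrix recovers the hypothesis matrix, so you can verify that nothing went wrong in the conversion:

```r
# ginv2() as provided by Schad et al. (2020)
ginv2 <- function(x) {
  MASS::fractions(provideDimnames(MASS::ginv(x), base = dimnames(x)[2:1]))
}
Hm <- rbind(
  H00 = c(info = 0,   no_info = 1,  strategy = 0),
  H01 = c(info = 1/2, no_info = -1, strategy = 1/2),
  H02 = c(info = -1,  no_info = 0,  strategy = 1)
)
Cm <- ginv2(Hm)
max(abs(ginv2(Cm) - Hm))  # effectively zero: the conversion round-trips
```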

Fifth, apply this contrast matrix, minus the first column (hence the `-1`), as the coding scheme for `Condition`:

`contrasts(d$Condition) <- Cm[, -1]`

And finally, fit the model:

```
m_manual <- lm(ProportionCongruent ~ Condition, data = d)
summary(m_manual)$coefficients
```

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.554 0.0530 10.455 2.92e-13
ConditionH01 0.018 0.0639 0.282 7.79e-01
ConditionH02 0.110 0.0713 1.542 1.31e-01
```

You can check this yourselves, but the intercept now shows the mean of the `no information` condition, the first term (`ConditionH01`) estimates the difference between the `no information` mean and the mean of the means of the other two conditions, and the second term (`ConditionH02`) estimates the difference between the `strategy` mean and the `information` mean.
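To do that check by hand (assuming the dataset is still available at the URL used above), the estimates can be reproduced directly from the raw condition means:

```r
# Recompute the manually coded estimates from the condition means.
d <- read.csv("http://homeweb.unifr.ch/VanhoveJ/Pub/Data/Vanhove2018.csv")
means <- tapply(d$ProportionCongruent, d$Condition, mean)
means["no information"]                                              # intercept: ~0.554
mean(means[c("information", "strategy")]) - means["no information"]  # H01: ~0.018
means["strategy"] - means["information"]                             # H02: ~0.110
```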

For the second and third example, I’ll use data from Pestana et al. (2018), who measured the Portuguese reading skills of Portuguese children in Portugal, French-speaking Switzerland, and German-speaking Switzerland at three points in time. The data are available as part of the `helascot` package.

```
library(helascot)
library(tidyverse)
library(lme4)
# Combine data and only retain Portuguese test data
d <- skills %>%
left_join(background, by = "Subject") %>%
filter(LanguageTested == "Portuguese") %>%
filter(!is.na(Reading))
# Code Time and LanguageGroup as factors
d$Time <- factor(d$Time)
d$LanguageGroup <- factor(d$LanguageGroup)
# Draw graph
ggplot(data = d,
aes(x = Time,
y = Reading,
fill = LanguageGroup)) +
geom_boxplot()
```

There are up to three observations per child (Time 1, 2 and 3), and the children are clustered in classes. We will take this into account during the analysis using random effects by child and by class.

For the sake of this example, let’s say we’re interested in estimating the development of reading skills through time. The following model estimates the effect of `Time` and allows for this effect to vary between classes. Since there is only one data point per `Subject` per `Time`, no by-subject random slope for `Time` was estimated.

```
m_default <- lmer(Reading ~ Time + (1+Time|Class) + (1|Subject), data = d)
summary(m_default)$coefficients
```

```
Estimate Std. Error t value
(Intercept) 0.532 0.0239 22.25
Time2 0.104 0.0111 9.38
Time3 0.194 0.0161 12.08
```

When using R’s default coding, the `(Intercept)` represents the average reading skill score at Time 1, the next coefficient estimates the difference in reading skill scores between Time 2 and Time 1, and the third coefficient estimates the difference between Time 3 and Time 1. This is fine, but let’s say we wanted to estimate the difference between Time 3 and Time 2 directly. We can obtain this estimate by coding the predictors ourselves.

In the equations below, the µ’s (µ(T1), µ(T2) and µ(T3), the average reading skill scores at Times 1 through 3) are in the same order as R knows them:

`levels(d$Time)`

`[1] "1" "2" "3"`

The average performance at Time 1 is a reasonable choice for the intercept, so let’s stick with that. The silly null hypothesis is that µ(T1) = 0, which we can elaborate with 0·µ(T2) and 0·µ(T3) as follows:

1·µ(T1) + 0·µ(T2) + 0·µ(T3) = 0

If we want the next coefficient to estimate the difference between the average reading skill scores at Time 2 and Time 1, we need the null hypothesis that these average reading skill scores are the same, i.e., µ(T2) = µ(T1). (Remember to put the ‘focus’ of the comparison on the left.) From there:

-1·µ(T1) + 1·µ(T2) + 0·µ(T3) = 0

Similarly, if we want the third coefficient to estimate the difference between the average reading skill scores at Time 3 and Time 2, we need the null hypothesis that these average reading skill scores are the same, i.e., µ(T3) = µ(T2):

0·µ(T1) - 1·µ(T2) + 1·µ(T3) = 0

Put the coefficients in the hypothesis matrix, convert this hypothesis matrix to a contrast matrix, apply this contrast matrix to the factor `Time`, and refit the model.

```
# Put coefficients in hypothesis matrix
Hm <- rbind(H00 = c(T1 = 1, T2 = 0, T3 = 0),
H01 = c(T1 = -1, T2 = 1, T3 = 0),
H02 = c(T1 = 0, T2 = -1, T3 = 1))
# Convert to contrast matrix
Cm <- ginv2(Hm)
# I'm going to copy Time so we can reuse it in example 3:
d$Time2 <- d$Time
# Apply contrast matrix to factor
contrasts(d$Time2) <- Cm[, -1]
# Refit model
m_manual <- lmer(Reading ~ Time2 + (1+Time2|Class) + (1|Subject), data = d)
summary(m_manual)$coefficients
```

```
Estimate Std. Error t value
(Intercept) 0.5317 0.0239 22.25
Time2H01 0.1040 0.0111 9.38
Time2H02 0.0904 0.0137 6.59
```

As you can see, the third coefficient now estimates the difference between the average reading skill score at T3 and at T2. Compared to manually computing this difference from the first model’s output, the main advantage of coding the predictors yourself is that you also obtain a measure of the uncertainty about the estimate of interest (e.g., the standard error, or a confidence interval).
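For example, using the numbers in the output above, an approximate 95% Wald confidence interval for the Time 3 vs. Time 2 difference is simply the estimate plus or minus 1.96 standard errors:

```r
# Approximate 95% Wald CI for the Time2H02 coefficient (values from the output above)
est <- 0.0904
se <- 0.0137
round(est + c(-1.96, 1.96) * se, 3)  # 0.064 to 0.117
```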

Finally, let’s take a look at interactions. Still working with the dataset from the second example, we can fit a model that contains an interaction between `Time` and `LanguageGroup`, i.e., that allows the effect of `Time` to differ between the three language groups. Since `Time` varies within `Class`, but `LanguageGroup` doesn’t, we can’t estimate a by-class random slope for `LanguageGroup`. I’m going to ignore the warning about the singular fit here, because it isn’t related to the topic of the tutorial and I don’t have too many other datasets where interactions need to be modelled.

`m_default <- lmer(Reading ~ Time*LanguageGroup + (1+Time|Class) + (1|Subject), data = d)`

`boundary (singular) fit: see help('isSingular')`

`summary(m_default)$coefficients`

```
Estimate Std. Error t value
(Intercept) 0.5422 0.0233 23.281
Time2 0.1159 0.0198 5.850
Time3 0.1915 0.0279 6.859
LanguageGroupBilingual group German -0.0893 0.0316 -2.822
LanguageGroupControl group Portuguese 0.1316 0.0373 3.532
Time2:LanguageGroupBilingual group German -0.0196 0.0272 -0.720
Time3:LanguageGroupBilingual group German 0.0164 0.0373 0.440
Time2:LanguageGroupControl group Portuguese -0.0121 0.0309 -0.390
Time3:LanguageGroupControl group Portuguese -0.0287 0.0451 -0.636
```

I’m not going to go over the interpretation of all of these coefficients; the point is that they’re not too informative, but that we can obtain more useful estimates by recoding the predictors. To do that, I prefer to **combine the combinations of the factors involved in the interaction into a single variable**, which I’ll call `Cell`:

```
# Combine combinations of Time and Language group into 1 factor
d$Cell <- factor(paste(d$Time, d$LanguageGroup))
table(d$Cell)
```

```
1 Bilingual group French 1 Bilingual group German
104 104
1 Control group Portuguese 2 Bilingual group French
74 105
2 Bilingual group German 2 Control group Portuguese
97 75
3 Bilingual group French 3 Bilingual group German
105 93
3 Control group Portuguese
69
```

We will eventually need to refer to these cells in the same order as they’re known in R:

```
# Order of the factor levels
levels(d$Cell)
```

```
[1] "1 Bilingual group French" "1 Bilingual group German"
[3] "1 Control group Portuguese" "2 Bilingual group French"
[5] "2 Bilingual group German" "2 Control group Portuguese"
[7] "3 Bilingual group French" "3 Bilingual group German"
[9] "3 Control group Portuguese"
```

Let’s think about what we want our estimates to mean. I think it would make sense for the intercept to represent the mean reading skill score at Time 1 across the three language groups. Then, I’d like for the next coefficients to express the average progress (across language groups) from Time 1 to Time 2 and from Time 2 to Time 3. Next, I’d like to know, at each time point, what the average difference between the Portuguese and the bilingual (Swiss) children is, and what the average difference between the Portuguese-French and the Portuguese-German bilinguals is.

Now, in what follows, you’re going to see some fairly long equations. They may look daunting, but they’re really easy: like before, we’re going to express what we want the coefficients to mean as null hypotheses. It’s just that this time we have to include nine µ’s per equation — one per combination of language group (F = French bilingual group, G = German bilingual group, P = Portuguese control group) and time (1, 2, 3), matching the labels used in the hypothesis matrix below.

- The intercept represents the grand mean of the Time 1 cells. The silly corresponding null hypothesis is that this grand mean is 0: (µ(F1) + µ(G1) + µ(P1))/3 = 0.

- The next term represents the difference between the grand mean of the Time 2 cells and that of the Time 1 cells: (µ(F2) + µ(G2) + µ(P2))/3 - (µ(F1) + µ(G1) + µ(P1))/3 = 0.

- The third term represents the difference between the grand mean of the Time 3 cells and that of the Time 2 cells: (µ(F3) + µ(G3) + µ(P3))/3 - (µ(F2) + µ(G2) + µ(P2))/3 = 0.

Now for comparisons between the language groups at each point in time. For each time, I want one term testing whether the Portuguese pupils on the one hand and the French- and German-speaking pupils on the other hand perform equally well, and another testing whether the French- and German-speaking pupils differ amongst themselves.

- The fourth term represents the difference between the mean of the Portuguese scores at Time 1 and the grand mean of the two bilingual groups’ performance at Time 1: µ(P1) - (µ(F1) + µ(G1))/2 = 0.

- The fifth term represents the difference between the two bilingual groups at Time 1: µ(F1) - µ(G1) = 0.

- Same as the fourth term, but for Time 2.

- Same as the fifth term, but for Time 2.

- Same as the fourth and sixth terms, but for Time 3.

- Same as the fifth and seventh terms, but for Time 3.

Put all of these coefficients into a large hypothesis matrix and convert it to a contrast matrix:

```
Hm <- rbind(GM_T1 = c(F1 = 1/3, G1 = 1/3, P1 = 1/3,
                      F2 = 0, G2 = 0, P2 = 0,
                      F3 = 0, G3 = 0, P3 = 0),
            T2vT1 = c(F1 = -1/3, G1 = -1/3, P1 = -1/3,
                      F2 = 1/3, G2 = 1/3, P2 = 1/3,
                      F3 = 0, G3 = 0, P3 = 0),
            T3vT2 = c(F1 = 0, G1 = 0, P1 = 0,
                      F2 = -1/3, G2 = -1/3, P2 = -1/3,
                      F3 = 1/3, G3 = 1/3, P3 = 1/3),
            T1_PtvsBi = c(F1 = -1/2, G1 = -1/2, P1 = 1,
                          F2 = 0, G2 = 0, P2 = 0,
                          F3 = 0, G3 = 0, P3 = 0),
            T1_FrvsGe = c(F1 = 1, G1 = -1, P1 = 0,
                          F2 = 0, G2 = 0, P2 = 0,
                          F3 = 0, G3 = 0, P3 = 0),
            T2_PtvsBi = c(F1 = 0, G1 = 0, P1 = 0,
                          F2 = -1/2, G2 = -1/2, P2 = 1,
                          F3 = 0, G3 = 0, P3 = 0),
            T2_FrvsGe = c(F1 = 0, G1 = 0, P1 = 0,
                          F2 = 1, G2 = -1, P2 = 0,
                          F3 = 0, G3 = 0, P3 = 0),
            T3_PtvsBi = c(F1 = 0, G1 = 0, P1 = 0,
                          F2 = 0, G2 = 0, P2 = 0,
                          F3 = -1/2, G3 = -1/2, P3 = 1),
            T3_FrvsGe = c(F1 = 0, G1 = 0, P1 = 0,
                          F2 = 0, G2 = 0, P2 = 0,
                          F3 = 1, G3 = -1, P3 = 0))
Cm <- ginv2(Hm)
```
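The `ginv2()` function isn’t defined in this excerpt; it comes from Schad et al. (2020). At its core, it computes the generalised inverse of the hypothesis matrix, with fractions and dimension names added for readability. A minimal sketch of that idea (my stand-in, not the original function) on a tiny three-cell example:

```r
library(MASS)

# Sketch of ginv2() following Schad et al. (2020): the contrast matrix
# is the generalised inverse of the hypothesis matrix.
ginv2_sketch <- function(x) {
  fractions(provideDimnames(ginv(x), base = dimnames(x)[2:1]))
}

# Tiny example with three cells: the first hypothesis encodes the grand
# mean, the second the difference between cells 2 and 1.
hm <- rbind(GM    = c(c1 = 1/3, c2 = 1/3, c3 = 1/3),
            C2vC1 = c(c1 = -1,  c2 = 1,   c3 = 0))
ginv2_sketch(hm)
# GM column: all 1s; C2vC1 column: -1/2, 1/2, 0
```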

Apply the contrasts to the `Cell` variable and fit the model:

```
contrasts(d$Cell) <- Cm[, -1]
m_manual <- lmer(Reading ~ Cell + (1|Class) + (1|Subject), data = d)
summary(m_manual)$coefficients
```

```
Estimate Std. Error t value
(Intercept) 0.5586 0.01200 46.56
CellT2vT1 0.1032 0.00975 10.58
CellT3vT2 0.0812 0.00992 8.19
CellT1_PtvsBi 0.1758 0.02725 6.45
CellT1_FrvsGe 0.0818 0.02715 3.01
CellT2_PtvsBi 0.1763 0.02723 6.48
CellT2_FrvsGe 0.1052 0.02733 3.85
CellT3_PtvsBi 0.1412 0.02765 5.11
CellT3_FrvsGe 0.0794 0.02748 2.89
```

The coefficients mean exactly what it says on the tin. There is just one problem: I didn’t include a random slope that captures the varying effect of `Time` by `Class` yet. Adding a by-class random slope for `Cell` wouldn’t work: you’d end up estimating an enormous matrix of random effects, since `Cell` has nine levels. Instead, we’ll have to first refit the model using the dummy variables in the contrast matrix of `Cell` as separate variables:

```
# Add the dummy variables in the contrast matrix of Cell
# to the dataset as separate variables
contrast_matrix <- data.frame(Cm[, -1],
                              Cell = levels(d$Cell))
d <- merge(d, contrast_matrix, by = "Cell")
# Refit the model using these separate dummy variables
m_manual <- lmer(Reading ~ T2vT1 + T3vT2 +
                   T1_PtvsBi + T1_FrvsGe +
                   T2_PtvsBi + T2_FrvsGe +
                   T3_PtvsBi + T3_FrvsGe +
                   (1|Class) + (1|Subject), data = d)
summary(m_manual)$coefficients
```

```
Estimate Std. Error t value
(Intercept) 0.5586 0.01200 46.56
T2vT1 0.1032 0.00975 10.58
T3vT2 0.0812 0.00992 8.19
T1_PtvsBi 0.1758 0.02725 6.45
T1_FrvsGe 0.0818 0.02715 3.01
T2_PtvsBi 0.1763 0.02723 6.48
T2_FrvsGe 0.1052 0.02733 3.85
T3_PtvsBi 0.1412 0.02765 5.11
T3_FrvsGe 0.0794 0.02748 2.89
```

The output is exactly the same as above. Now we need to think about which of these estimates can actually vary by `Class`. If you think about the way we coded these predictors, `T2vT1` and `T3vT2` capture the effect of `Time`, whereas the other predictors capture the effects of `LanguageGroup` at different times. The effect of `Time` can vary according to `Class`, but the effects of `LanguageGroup` can’t (each `Class` belonged to only one `LanguageGroup`). So if we want random slopes of `Time` by `Class`, we need to let the effects of `T2vT1` and `T3vT2` vary by class:

```
m_manual <- lmer(Reading ~ T2vT1 + T3vT2 +
                   T1_PtvsBi + T1_FrvsGe +
                   T2_PtvsBi + T2_FrvsGe +
                   T3_PtvsBi + T3_FrvsGe +
                   (1 + T2vT1 + T3vT2|Class) + (1|Subject), data = d)
```

`boundary (singular) fit: see help('isSingular')`

`summary(m_manual)$coefficients`

```
Estimate Std. Error t value
(Intercept) 0.5563 0.0143 38.84
T2vT1 0.1054 0.0120 8.77
T3vT2 0.0820 0.0142 5.79
T1_PtvsBi 0.1762 0.0331 5.32
T1_FrvsGe 0.0893 0.0316 2.82
T2_PtvsBi 0.1740 0.0315 5.52
T2_FrvsGe 0.1088 0.0306 3.55
T3_PtvsBi 0.1393 0.0275 5.07
T3_FrvsGe 0.0729 0.0272 2.68
```

The warning isn’t relevant to the purposes of this tutorial. As a sanity check, we can compare the predictions of `m_manual` and `m_default` to confirm that `m_manual` is the same model as `m_default`, just with parameter estimates that are easier to interpret:

```
# (I don't know why I need to specify 'newdata'...)
d$predict_default <- predict(m_default, newdata = d)
d$predict_manual <- predict(m_manual, newdata = d)
plot(predict_manual ~ predict_default, d)
```

Both models make the same predictions, and the predictions align reasonably well with the data observed:

```
ggplot(data = d,
       aes(x = Time,
           y = predict_manual,
           fill = LanguageGroup)) +
  geom_boxplot() +
  ylab("Model predictions")
```

If, having specified your own hypothesis matrix, some lines in the regression output contain `NA`, the reason is probably that some of the rows in your hypothesis matrix are combinations of some of the other rows. In essence, you’re asking the model to answer the same question twice, so it only answers it once. Reformulating the hypotheses will usually work.
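One way to catch this before fitting is to check the rank of the hypothesis matrix: if it’s lower than the number of hypotheses, some rows are linear combinations of others. (The matrix below is a made-up illustration, not one from this post.)

```r
# Made-up example: the third hypothesis is the sum of the first two,
# so the matrix has rank 2 rather than 3 and one coefficient would be NA.
hm_bad <- rbind(h1 = c(1, -1,  0),
                h2 = c(0,  1, -1),
                h3 = c(1,  0, -1))  # h3 = h1 + h2
qr(hm_bad)$rank < nrow(hm_bad)  # TRUE: redundant hypotheses
```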

Berthele, Raphael and Amelia Lambelet (eds.). 2018. *Heritage and school language literacy development in migrant children: Interdependence or independence?* Multilingual Matters.

Pestana, Carlos, Amelia Lambelet and Jan Vanhove. 2018. Reading comprehension development in Portuguese heritage speakers in Switzerland (HELASCOT project). In Raphael Berthele and Amelia Lambelet (Eds.), *Heritage language and school language literacy development in migrant children: Interdependence or independence?* (pp. 58-82). Bristol, UK: Multilingual Matters. http://doi.org/10.21832/BERTHE9047

Schad, Daniel J., Shravan Vasishth, Sven Hohenstein and Reinhold Kliegl. 2020. How to capitalize on a priori contrasts in linear (mixed) models: A tutorial. *Journal of Memory and Language* 110. https://doi.org/10.1016/j.jml.2019.104038

Vanhove, Jan. 2019. Metalinguistic knowledge about the native language and language transfer in gender assignment. *Studies in Second Language Learning and Teaching* 9(2). 397-419. https://doi.org/10.14746/ssllt.2019.9.2.7

Please note that I reran the code on this page on August 6, 2023.

`devtools::session_info()`

```
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os Ubuntu 22.04.2 LTS
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Zurich
date 2023-08-06
pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
boot 1.3-28 2021-05-03 [4] CRAN (R 4.2.0)
cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.1)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.1)
digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
evaluate 0.15 2022-02-18 [2] CRAN (R 4.2.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
helascot * 1.0.0 2023-08-02 [1] Github (janhove/helascot@4cf3c1b)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.1)
httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.1)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)
later 1.3.1 2023-05-02 [1] CRAN (R 4.3.1)
lattice 0.21-8 2023-04-05 [4] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lme4 * 1.1-34 2023-07-04 [1] CRAN (R 4.3.1)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60 2023-05-04 [4] CRAN (R 4.3.1)
Matrix * 1.6-0 2023-07-08 [4] CRAN (R 4.3.1)
memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.0)
mime 0.10 2021-02-13 [2] CRAN (R 4.0.2)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.1)
minqa 1.2.5 2022-10-19 [1] CRAN (R 4.3.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
nlme 3.1-162 2023-01-31 [4] CRAN (R 4.2.2)
nloptr 2.0.3 2022-05-26 [1] CRAN (R 4.3.1)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.3.1)
prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.2.0)
processx 3.8.2 2023-06-30 [1] CRAN (R 4.3.1)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.1)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.1)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.1)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
remotes 2.4.2 2021-11-30 [2] CRAN (R 4.2.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
shiny 1.7.4.1 2023-07-06 [1] CRAN (R 4.3.1)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.1)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.1)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.1)
usethis 2.2.2 2023-07-06 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.1)
yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
[1] /home/jan/R/x86_64-pc-linux-gnu-library/4.3
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────
```

- The *t*-test and ANOVA compare *means*; the Mann–Whitney and Kruskal–Wallis don’t.
- The Mann–Whitney and Kruskal–Wallis do *not* in general compare medians, either. I’ll illustrate these first two points in this blog post.
- The main problem with parametric tests when you have nonnormal data is that these tests compare means, but these means don’t necessarily capture a relevant aspect of the data. But even if the data aren’t normally distributed, comparing means can sometimes be reasonable, depending on what the data look like and what it is you’re actually interested in. And if you *do* want to compare means, parametric tests or bootstrapping are more sensible than running a nonparametric test. See also my blog post *Before worrying about model assumptions, think about model relevance*.
- If you want to compare medians, look into bootstrapping or quantile regression.
- Above all, make sure that you know what you’re comparing when you run a test and that this comparison makes sense in light of the data *and your research question*.

In this blog post, I’ll share the results of some simulations that demonstrate that the Mann–Whitney (a) picks up on differences in the variance between two distributions, even if they have the same mean and median; (b) picks up on differences in the median between two distributions, even if they have the same mean and variance; and (c) picks up on differences in the mean between two distributions, even if they have the same median and variance. These points aren’t new (see Zimmerman 1998), but since the automated strategy (‘parametric when normal, otherwise nonparametric’) is pretty widespread, they bear repeating.

The first simulation demonstrates the Mann–Whitney’s sensitivity to differences in the variance. I simulated samples from a uniform distribution going from $-\sqrt{3}$ to $\sqrt{3}$ as well as from a uniform distribution going from $-3\sqrt{3}$ to $3\sqrt{3}$. Both distributions have a mean and median of 0, but the standard deviation of the first is 1 and that of the second is 3. I compared these samples using a Mann–Whitney test and recorded the *p*-value. I generated samples of both 50 and 500 observations and repeated this process 10,000 times. You can reproduce this simulation using the code below.
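The endpoints used in the simulation code follow from the standard deviation of a uniform distribution on $(-a, a)$:

$$\textrm{SD}\left[\textrm{Uniform}(-a, a)\right] = \frac{2a}{\sqrt{12}} = \frac{a}{\sqrt{3}},$$

so $a = \sqrt{3}$ yields a standard deviation of 1, and $a = 3\sqrt{3}$ a standard deviation of 3.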

**Figure 1** shows the distribution of the *p*-values. Even though the distributions’ means and medians are the same, the Mann–Whitney returns significance (*p* < 0.05) in about 7% of the comparisons for the smaller samples and 8% for the larger samples. If the test were sensitive only to differences in the mean or median, it should return significance in only 5% of the comparisons.

```
# Load package for plotting
library(ggplot2)
# Set number of simulation runs
n_sim <- 10000
# Draw a sample of 50 observations from two uniform distributions with the same
# mean and median but with different variances/standard deviations.
# Run the Mann-Whitney on them (wilcox.test()).
# Repeat this n_sim times.
pvals_50 <- replicate(n_sim, {
  x <- runif(50, min = -3*sqrt(3), max = 3*sqrt(3))
  y <- runif(50, min = -sqrt(3), max = sqrt(3))
  wilcox.test(x, y)$p.value
})
# Same but with samples of 500 observations.
pvals_500 <- replicate(n_sim, {
  x <- runif(500, min = -3*sqrt(3), max = 3*sqrt(3))
  y <- runif(500, min = -sqrt(3), max = sqrt(3))
  wilcox.test(x, y)$p.value
})
# Put in data frame
d <- data.frame(
  p = c(pvals_50, pvals_500),
  n = rep(c(50, 500), each = n_sim)
)
# Plot
ggplot(data = d,
       aes(x = p,
           fill = (p < 0.05))) +
  geom_histogram(
    breaks = seq(0, 1, 0.05),
    colour = "grey20") +
  scale_fill_manual(values = c("grey70", "red")) +
  facet_wrap(~ n) +
  geom_hline(yintercept = n_sim*0.05, linetype = 2) +
  theme(legend.position = "none") +
  labs(
    title = element_blank(),
    subtitle = "Same mean, same median, different variance",
    caption = "Comparison for two sample sizes (50 vs. 500 observations per group):
      uniform distribution from -sqrt(3) to sqrt(3)
      vs. uniform distribution from -3*sqrt(3) to 3*sqrt(3)"
  )
```

The second simulation demonstrates that the Mann–Whitney does not compare means. The simulation set-up was the same as before, but the samples were drawn from different distributions. The first sample was drawn from a log-normal distribution with mean $10\sqrt{e} \approx 16.5$, median 10 and standard deviation $10\sqrt{e^2 - e} \approx 21.6$. The second sample was drawn from a normal distribution with the same mean (i.e., about 16.5) and the same standard deviation (i.e., about 21.6), but with a different median (viz., 16.5 rather than 10).
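The code for this second simulation isn’t shown in this excerpt. The following sketch is consistent with the description, under the assumption that the log-normal distribution has meanlog $= \ln 10$ and sdlog $= 1$ (which gives a median of 10, a mean of $10\sqrt{e} \approx 16.5$ and a standard deviation of about 21.6):

```r
# Sketch (assumed parameters): log-normal with meanlog = log(10), sdlog = 1
# vs. a normal distribution with the same mean and sd but a different median.
mu    <- 10 * sqrt(exp(1))           # common mean, about 16.5
sigma <- 10 * sqrt(exp(2) - exp(1))  # common sd, about 21.6
x <- rlnorm(50, meanlog = log(10), sdlog = 1)  # median 10
y <- rnorm(50, mean = mu, sd = sigma)          # median = mean, about 16.5
wilcox.test(x, y)$p.value
```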

**Figure 2** shows that the Mann–Whitney returned significance for 12% of the comparisons of the smaller samples and 92% of the comparisons for the larger samples. So the Mann–Whitney does *not* test for differences in the mean; otherwise only 5% of the comparisons should have been significant (since the means of the distributions are the same).

The last simulation demonstrates that the Mann–Whitney does not compare medians, either. The first sample was again drawn from a log-normal distribution with mean $10\sqrt{e} \approx 16.5$, median 10 and standard deviation $10\sqrt{e^2 - e} \approx 21.6$. The second sample was now drawn from a normal distribution with the same median (i.e., 10) and the same standard deviation (i.e., about 21.6), but with a different mean (viz., 10 rather than 16.5).

**Figure 3** shows that the Mann–Whitney returned significance for 20% of the comparisons of the smaller samples and 91% of the comparisons for the larger samples. So the Mann–Whitney does *not* test for differences in the median; otherwise only 5% of the comparisons should have been significant (since the medians of the distributions are the same).

Many researchers think that nonparametric tests don’t make any assumptions about the distributions from which the data were drawn. This belief is half-true (i.e., wrong): Nonparametric tests such as the Mann–Whitney don’t assume that the data were drawn from a *specific* distribution (e.g., from a normal distribution). However, they do assume that the data in the different groups being compared were drawn from the *same* distribution (but for a shift in the location of this distribution). If researchers run nonparametric tests because they are worried about violating the assumptions of parametric tests, I suggest they worry about the assumptions of their nonparametric tests, too.

But a better solution in my view is for them to consider more carefully what they actually want to compare. If it is really the means that are of interest, parametric tests are often okay, and their results can be double-checked using the bootstrap if needed. Permutation tests would be an alternative. If it is the medians that are of interest, quantile regression, bootstrapping or permutation tests may be useful. If another measure of the data’s central tendency is of interest, robust regression may be useful. A discussion of these techniques is beyond the scope of this blog post, whose aims were merely to alert researchers to the fact that nonparametric tests aren’t a silver bullet when parametric assumptions are violated and that nonparametric tests aren’t just sensitive to differences in the mean or median.

Zimmerman, Donald W. 1998. Invalidation of parametric and nonparametric statistical tests by concurrent violation of two assumptions. *Journal of Experimental Education* 67(1). 55-68.

Please note that I reran the code on this page on August 6, 2023.

`devtools::session_info()`

```
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os Ubuntu 22.04.2 LTS
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Zurich
date 2023-08-06
pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.1)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.1)
digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
evaluate 0.15 2022-02-18 [2] CRAN (R 4.2.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.1)
httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.1)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.3.0)
later 1.3.1 2023-05-02 [1] CRAN (R 4.3.1)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.0)
mime 0.10 2021-02-13 [2] CRAN (R 4.0.2)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.3.1)
prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.2.0)
processx 3.8.2 2023-06-30 [1] CRAN (R 4.3.1)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.1)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.1)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.1)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
remotes 2.4.2 2021-11-30 [2] CRAN (R 4.2.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
shiny 1.7.4.1 2023-07-06 [1] CRAN (R 4.3.1)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.1)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.1)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.1)
usethis 2.2.2 2023-07-06 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.1)
yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
[1] /home/jan/R/x86_64-pc-linux-gnu-library/4.3
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────
```

This blog post is the sequel to the previous one, where I demonstrated how imperfectly measured control variables undercorrect for the actual confounding in observational studies (also see Berthele & Vanhove 2017; Brunner & Austin 2009; Westfall & Yarkoni 2016). A model that doesn’t account for measurement error on the confounding variable—and hence implicitly assumes that the confound was measured perfectly—may confidently conclude that the variable of actual interest is related to the outcome even when taking into account the confound. From such a finding, researchers typically infer that the variable of actual interest is causally related to the outcome even in absence of the confound. But once this measurement error is duly accounted for, you may find that the evidence for a causal link between the variable of interest and the outcome is more tenuous than originally believed.

So especially in observational studies, where confounds abound, it behooves researchers to account for the measurement error in their variables so that they don’t draw unwarranted conclusions from their data too often. The amount of measurement error on your variables is usually unknown. But if you’ve calculated some reliability estimate such as Cronbach’s $\alpha$ for your variables, you can use it to obtain an estimate of the amount of measurement error.

To elaborate, in classical test theory, the reliability of a measure is equal to the ratio of the variance of the (error-free) true scores to the variance of the observed scores. The latter is the sum of the variance of the true scores and the error variance:

$$\textrm{reliability} = \frac{\sigma^2_{\textrm{true}}}{\sigma^2_{\textrm{observed}}} = \frac{\sigma^2_{\textrm{true}}}{\sigma^2_{\textrm{true}} + \sigma^2_{\textrm{error}}}$$

Rearranging, we get

$$\sigma^2_{\textrm{error}} = \sigma^2_{\textrm{true}} \left(\frac{1}{\textrm{reliability}} - 1\right)$$

None of these values are known, but they can be estimated based on the sample. Specifically, the reliability can be estimated by a reliability index such as Cronbach’s $\alpha$, and the sum $\sigma^2_{\textrm{true}} + \sigma^2_{\textrm{error}}$ can be estimated by computing the variable’s sample variance.
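For instance (numbers invented for illustration), for a variable with Cronbach’s $\alpha = 0.70$ and a sample variance of 4, the estimated true score variance and error variance would be

$$\hat\sigma^2_{\textrm{true}} = 0.70 \times 4 = 2.8, \qquad \hat\sigma^2_{\textrm{error}} = 4 - 2.8 = 1.2.$$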

Let’s first deal with a simulated dataset. The main advantage of analysing simulated data is that you can check that what comes out of the model corresponds to what went into the data. In preparing this blog post, I was able to detect an arithmetic error in my model code in this way, as one parameter was consistently underestimated. Had I applied the model immediately to the real data set, I wouldn’t have noticed anything wrong. But we’ll deal with real data afterwards.

**Update (2023-08-06):** When converting this blog from Jekyll/Bootstrap to Quarto, I noticed that the original code used in this blog post, which involved the R package `rstan`, has started to run very slowly. In the present version, I use `cmdstanr` instead.

Run these commands to follow along:

```
library(cmdstanr) # for fitting Bayesian models, v. 2.32.2
library(posterior) # for working with posterior distributions
# For drawing scatterplot matrices
source("https://janhove.github.io/RCode/scatterplot_matrix.R")
# Set random seed for reproducibility
set.seed(2020-02-13, kind = "Mersenne-Twister")
```

The scenario we’re going to simulate is one in which you have two correlated predictor variables (`A` and `B`) and one outcome variable (`Z`). Unbeknownst to the analyst, `Z` is causally affected by `A` but not by `B`. Moreover, the three variables are measured with some degree of error, but we’ll come to that later. Figure 1 depicts the scenario for which we’re going to simulate data.

The first thing we need are two correlated predictor variables. I’m going to generate these from a bivariate normal distribution. `A` has a mean of 3 units and a standard deviation of 1.5 units, and `B` has a mean of -4 and a standard deviation of 0.8 units. The correlation between them is $\rho = 0.73$. To generate a sample from this bivariate normal distribution, you need to construct the variance-covariance matrix from the standard deviations and the correlation, which I do in the code below:

```
# Generate correlated constructs
n <- 300
rho <- 0.73
mean_A <- 3
mean_B <- -4
sd_A <- 1.5
sd_B <- 0.8
# Given the correlation and the standard deviations,
# construct the variance-covariance matrix for the constructs like so:
latent_covariance_matrix <- rbind(c(sd_A, 0), c(0, sd_B)) %*%
  rbind(c(1, rho), c(rho, 1)) %*%
  rbind(c(sd_A, 0), c(0, sd_B))
# Draw data from the multivariate normal distribution:
constructs <- MASS::mvrnorm(n = n, mu = c(mean_A, mean_B),
                            Sigma = latent_covariance_matrix)
# Extract variables from object
A <- constructs[, 1]
B <- constructs[, 2]
```

Next, we need to generate the outcome. In this simulation, `Z` depends linearly on `A` but not on `B` (hence $\beta_B = 0$):

$$Z = 2 + 0.7 \times A + 0 \times B + \varepsilon$$

The error term $\varepsilon$ is drawn from a normal distribution with standard deviation 1.3. Importantly, this error term does *not* express the measurement error on `Z`; it is the part of the true score variance in `Z` that isn’t related to either `A` or `B`:

```
# Create Z
intercept <- 2
slope_A <- 0.7
slope_B <- 0
sigma_Z.AB <- 1.3
Z <- intercept + slope_A*A + slope_B*B + rnorm(n, sd = sigma_Z.AB)
```

Even though `B` isn’t causally related to `Z`, we find that `B` and `Z` are correlated thanks to `B`’s correlation with `A`.

`scatterplot_matrix(cbind(Z, A, B))`

A multiple regression model is able to tease apart the effects of `A` and `B` on `Z`:

`summary(lm(Z ~ A + B))$coefficients`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1224 0.7130 2.977 3.15e-03
A 0.6876 0.0706 9.742 1.22e-19
B 0.0196 0.1340 0.146 8.84e-01
```

The confound `A` is significantly related to `Z` and its estimated regression parameter is close to its true value of 0.70 ($\widehat{\beta}_A = 0.69$, $p < 0.0001$). The variable of interest `B`, by contrast, isn’t significantly related to `Z` when `A` has been accounted for ($\widehat{\beta}_B = 0.02$, $p = 0.88$). This is all as it should be.

Now let’s add measurement error to all variables. The `A` values that we actually observe will then be distorted versions of the true `A` values:

$$A_{\textrm{observed}} = A_{\textrm{true}} + \textrm{noise}$$

The ‘noise’ is commonly assumed to be normally distributed:

$$\textrm{noise} \sim \textrm{Normal}(0, \sigma_{\textrm{noise}})$$

That said, you can easily imagine situations where the noise likely has a different distribution. For instance, when measurements are rounded to the nearest integer (e.g., body weights in kilograms), the noise is likely uniformly distributed (e.g., a reported body weight of 66 kg means that the true body weight lies between 65.5 and 66.5 kg).

To make the link with the analysis more transparent, I will express the amount of noise in terms of the variables’ reliabilities. For the confound `A`, I set the reliability at 0.70. Since `A`’s standard deviation was 1.5 units, this means that the standard deviation of the noise is $\sqrt{1.5^2 \times (1/0.70 - 1)} \approx 0.98$ units. I set the reliability for `B`, the variable of actual interest, at 0.90. Its standard deviation is 0.8, so the standard deviation of the noise is $\sqrt{0.8^2 \times (1/0.90 - 1)} \approx 0.27$ units.

```
# Add measurement error on A and B
obs_A <- A + rnorm(n = n, sd = sqrt(sd_A^2*(1/0.70 - 1))) # reliability 0.70
obs_B <- B + rnorm(n = n, sd = sqrt(sd_B^2*(1/0.90 - 1))) # reliability 0.90
```
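As a quick sanity check on this parameterisation (my addition, not part of the original code), the squared correlation between true and observed scores should recover the reliability in a large sample:

```r
# The squared correlation between true scores and observed scores
# estimates the reliability (here set to 0.70 for A).
set.seed(123)
A_true <- rnorm(1e5, mean = 3, sd = 1.5)
A_obs  <- A_true + rnorm(1e5, sd = sqrt(1.5^2 * (1/0.70 - 1)))
round(cor(A_true, A_obs)^2, 2)  # close to 0.70
```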

The same logic applies to adding measurement noise to `Z`. The difficulty here lies in obtaining the population standard deviation (or variance) of `Z`. I don’t want to just plug in `Z`’s sample standard deviation, since I want to have exact knowledge of the population parameters. While we specified a `sigma_Z.AB` above, this is *not* the total population standard deviation of `Z`: it’s the population standard deviation of `Z` once `Z` has been controlled for `A` and `B`. To obtain the *total* standard deviation of `Z` (labelled $\sigma_Z$ here), we need to add in the variance in `Z` due to `A` and `B`:

$$\sigma_Z^2 = (\beta_A \sigma_A)^2 + (\beta_B \sigma_B)^2 + 2\beta_A \beta_B \textrm{Cov}(A, B) + \sigma_{Z.AB}^2$$

Since $\beta_B = 0$, this simplifies to $\sigma_Z^2 = (\beta_A \sigma_A)^2 + \sigma_{Z.AB}^2$, but if you want to simulate your own datasets, the full formula may be useful.

The population standard deviation of `Z` is thus $\sqrt{(0.7 \times 1.5)^2 + 1.3^2} \approx 1.67$. Setting `Z`’s reliability to 0.70, we find that the standard deviation of the noise is $\sqrt{1.67^2 \times (1/0.70 - 1)} \approx 1.09$.

```
# Measurement error on Z
sd_Z <- sqrt((slope_A*sd_A)^2 + (slope_B*sd_B)^2 +
             2*(slope_A * slope_B * latent_covariance_matrix[1, 2]) +
             sigma_Z.AB^2)
obs_Z <- Z + rnorm(n = n, sd = sqrt(sd_Z^2*(1/0.70 - 1))) # reliability 0.70
```

Figure 3 shows the causal diagram for the actually observed simulated data.

As Figure 4 shows, the observed variables are all correlated with each other.

`scatterplot_matrix(cbind(obs_Z, obs_A, obs_B))`

Crucially, controlling for `obs_A` in a regression model doesn’t entirely eradicate the confound: we find that `obs_B` is significantly related to `obs_Z` even after controlling for `obs_A` ($p = 0.006$). Moreover, the regression model on the observed variables underestimates the strength of the relationship between the true construct scores ($\widehat{\beta}_{\textrm{obs\_A}} = 0.40$, whereas $\beta_A = 0.70$).

`summary(lm(obs_Z ~ obs_A + obs_B))$coefficients`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.498 0.7588 5.93 8.49e-09
obs_A 0.402 0.0737 5.45 1.05e-07
obs_B 0.404 0.1470 2.75 6.41e-03
```

The statistical model, written below in Stan code, corresponds to the data generating mechanism above and tries to infer its parameters from the observed data and some prior information.

The `data` block specifies the input that the model should handle. I think this is self-explanatory. Note that the latent variable scores `A`, `B` and `Z` aren’t part of the input, as we wouldn’t have directly observed these.

The `parameters` block first defines the parameters needed for the regression model with the unobserved latent variables (the one we used to generate `Z`). It then defines the parameters needed to generate the true variable scores for `A` and `B` as well as the parameters needed to generate the observed scores from the true scores (viz., the true scores themselves and the reliabilities). Note that it is crucial to allow the model to estimate a correlation between `A` and `B`; otherwise it won’t ‘know’ that `A` confounds the `B`-`Z` relationship.

The `transformed parameters` block contains, well, transformations of these parameters. For instance, the standard deviations of `A` and `B` and the correlation between them are used to generate a variance-covariance matrix. Moreover, the standard deviations of the measurement noise are computed from the latent variables’ standard deviations and their reliabilities.

The `model` block, finally, specifies how we think the observed data and the (transformed or untransformed) parameters fit together and what plausible a priori values for the (transformed or untransformed) parameters are. These prior distributions are pretty abstract in this example: we generated context-free data ourselves, so it’s not clear what motivates these priors. The real example to follow will hopefully make more sense in this respect. You’ll notice that I’ve also specified a prior for the reliabilities. The reason is that you typically don’t know the reliability of an observed variable with perfect precision but rather have a sample estimate with some inherent uncertainty. The priors reflect this uncertainty. Again, this will become clearer in the real example to follow.

```
meas_error_code <- '
data {
  // Number of observations
  int<lower = 1> N;
  // Observed outcome
  vector[N] obs_Z;
  // Observed predictors
  vector[N] obs_A;
  vector[N] obs_B;
}
parameters {
  // Parameters for regression
  real intercept;
  real slope_A;
  real slope_B;
  real<lower = 0> sigma;
  // Latent predictors (= constructs):
  // standard deviations and means
  real<lower = 0> sigma_lat_A;
  real<lower = 0> sigma_lat_B;
  row_vector[2] latent_means;
  // Correlation between latent predictors
  real<lower = -1, upper = 1> latent_rho;
  // Latent variables (true scores)
  matrix[N, 2] latent_predictors;
  vector[N] lat_Z; // latent outcome
  // Unknown but estimated reliabilities
  real<lower = 0, upper = 1> reliability_A;
  real<lower = 0, upper = 1> reliability_B;
  real<lower = 0, upper = 1> reliability_Z;
}
transformed parameters {
  vector[N] mu_Z;  // conditional mean of outcome
  vector[N] lat_A; // latent variables, separated out
  vector[N] lat_B;
  real error_A;    // standard error of measurement
  real error_B;
  real error_Z;
  // standard deviations of latent predictors, in matrix form
  matrix[2, 2] sigma_lat;
  // correlation and covariance matrix for latent predictors
  cov_matrix[2] latent_cor;
  cov_matrix[2] latent_cov;
  // standard deviation of latent outcome
  real<lower = 0> sigma_lat_Z;
  sigma_lat_Z = sd(lat_Z);
  // Express measurement error in terms of
  // standard deviation of constructs and reliability
  error_A = sqrt(sigma_lat_A^2*(1/reliability_A - 1));
  error_B = sqrt(sigma_lat_B^2*(1/reliability_B - 1));
  error_Z = sqrt(sigma_lat_Z^2*(1/reliability_Z - 1));
  // Define diagonal matrix with standard deviations of latent variables
  sigma_lat[1, 1] = sigma_lat_A;
  sigma_lat[2, 2] = sigma_lat_B;
  sigma_lat[1, 2] = 0;
  sigma_lat[2, 1] = 0;
  // Define correlation matrix for latent variables
  latent_cor[1, 1] = 1;
  latent_cor[2, 2] = 1;
  latent_cor[1, 2] = latent_rho;
  latent_cor[2, 1] = latent_rho;
  // Compute covariance matrix for latent variables
  latent_cov = sigma_lat * latent_cor * sigma_lat;
  // Extract latent variables from matrix
  lat_A = latent_predictors[, 1];
  lat_B = latent_predictors[, 2];
  // Regression model for conditional mean of Z
  mu_Z = intercept + slope_A*lat_A + slope_B*lat_B;
}
model {
  // Priors for regression parameters
  intercept ~ normal(0, 2);
  slope_A ~ normal(0, 2);
  slope_B ~ normal(0, 2);
  sigma ~ normal(0, 2);
  // Priors for latent standard deviations
  sigma_lat_A ~ normal(0, 2);
  sigma_lat_B ~ normal(0, 2);
  // Prior for latent means
  latent_means ~ normal(0, 3);
  // Prior expectation for correlation between latent variables.
  // Tend towards positive correlation, but pretty vague.
  latent_rho ~ normal(0.4, 0.3);
  // Priors for reliabilities.
  // These are estimated with some uncertainty, i.e.,
  // they are not point values but distributions.
  reliability_A ~ beta(70, 30);
  reliability_B ~ beta(90, 10);
  reliability_Z ~ beta(70, 30);
  // Distribution of latent predictors
  for (i in 1:N) {
    latent_predictors[i, ] ~ multi_normal(latent_means, latent_cov);
  }
  // Generate latent outcome
  lat_Z ~ normal(mu_Z, sigma);
  // Add noise to latent variables
  obs_A ~ normal(lat_A, error_A);
  obs_B ~ normal(lat_B, error_B);
  obs_Z ~ normal(lat_Z, error_Z);
}
'
```

Put the data in a list and fit the model:

```
data_list <- list(
  obs_Z = obs_Z,
  obs_A = obs_A,
  obs_B = obs_B,
  N = n
)
```

```
meas_error_model <- cmdstan_model(write_stan_file(meas_error_code))
model_fit <- meas_error_model$sample(
  data = data_list
  , seed = 123
  , chains = 4
  , parallel_chains = 4
  , iter_warmup = 1000
  , iter_sampling = 1000
  , refresh = 500
  , max_treedepth = 15
  , adapt_delta = 0.99
)
```

```
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 3 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 1 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 2 Iteration: 500 / 2000 [ 25%] (Warmup)
Chain 4 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1 Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 4 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 3 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 2 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2 finished in 279.7 seconds.
Chain 1 Iteration: 1500 / 2000 [ 75%] (Sampling)
Chain 4 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4 finished in 320.0 seconds.
Chain 3 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3 finished in 355.9 seconds.
Chain 1 Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1 finished in 369.9 seconds.
All 4 chains finished successfully.
Mean chain execution time: 331.4 seconds.
Total execution time: 370.0 seconds.
```

```
model_fit$summary(
  variables = c("intercept", "slope_A", "slope_B", "sigma"
                , "sigma_lat_A", "sigma_lat_B", "sigma_lat_Z"
                , "latent_means", "latent_rho"
                , "reliability_A", "reliability_B", "reliability_Z"
                , "error_A", "error_B", "error_Z")
  , "mean", "sd"
  , extra_quantiles = ~posterior::quantile2(., probs = c(0.025, 0.975))
  , "rhat"
)
```

```
# A tibble: 16 × 6
variable mean sd q2.5 q97.5 rhat
<chr> <num> <num> <num> <num> <num>
1 intercept 0.649 1.49 -2.43 3.37 1.02
2 slope_A 0.866 0.170 0.564 1.22 1.02
3 slope_B -0.200 0.254 -0.719 0.265 1.02
4 sigma 1.17 0.126 0.903 1.40 1.01
5 sigma_lat_A 1.40 0.0716 1.26 1.54 1.00
6 sigma_lat_B 0.797 0.0353 0.731 0.869 1.00
7 sigma_lat_Z 1.59 0.0702 1.45 1.72 1.01
8 latent_means[1] 3.07 0.0972 2.87 3.26 1.00
9 latent_means[2] -4.02 0.0486 -4.11 -3.92 1.00
10 latent_rho 0.780 0.0493 0.678 0.871 1.01
11 reliability_A 0.694 0.0408 0.616 0.773 1.02
12 reliability_B 0.901 0.0277 0.843 0.947 1.00
13 reliability_Z 0.699 0.0466 0.600 0.782 1.01
14 error_A 0.925 0.0710 0.785 1.06 1.01
15 error_B 0.262 0.0382 0.191 0.338 1.01
16 error_Z 1.04 0.0872 0.880 1.22 1.01
```

The model recovers the true parameter values pretty well (Table 1) and, on the basis of this model, you wouldn’t erroneously conclude that `B` is causally related to `Z` (see the parameter estimate for `slope_B`).

Table 1: True parameter values and the model’s estimates (posterior mean ± posterior standard deviation).

| Parameter | True value | Estimate |
|---|---|---|
| intercept | 2.00 | 0.65 ± 1.53 |
| slope_A | 0.70 | 0.87 ± 0.17 |
| slope_B | 0.00 | -0.20 ± 0.25 |
| sigma_Z.AB | 1.30 | 1.17 ± 0.13 |
| sd_A | 1.50 | 1.40 ± 0.07 |
| sd_B | 0.80 | 0.80 ± 0.04 |
| mean_A | 3.00 | 3.07 ± 0.10 |
| mean_B | -4.00 | -4.02 ± 0.05 |
| rho | 0.73 | 0.78 ± 0.05 |
| reliability_A | 0.70 | 0.69 ± 0.04 |
| reliability_B | 0.90 | 0.90 ± 0.03 |
| reliability_Z | 0.70 | 0.70 ± 0.05 |
| *sd_Z | 1.67 | 1.59 ± 0.07 |
| *error_A | 0.98 | 0.93 ± 0.07 |
| *error_B | 0.27 | 0.26 ± 0.04 |
| *error_Z | 1.09 | 1.04 ± 0.09 |

In the previous blog post, I’ve shown that such a model also estimates the latent true variable scores and that these estimates correspond more closely to the actual true variable scores than the observed variable scores do. I’ll skip this step here.

Satisfied that our model can recover the actual parameter values in scenarios such as those depicted in Figure 3, we now turn to a real-life example of such a situation. The example was already described in the previous blog post; here I’ll just draw the causal model that reflects the null hypothesis that a child’s Portuguese skills at T2 (`PT.T2`) don’t contribute to their French skills at T3 (`FR.T3`), but that, due to common factors such as intelligence, form on the day, etc., French skills and Portuguese skills at T2 are correlated across children. What is observed are test scores, not the children’s actual skills.

The command below is pig-ugly, but allows you to easily read in the data.

```
skills <- structure(list(
Subject = c("A_PLF_1","A_PLF_10","A_PLF_12","A_PLF_13","A_PLF_14","A_PLF_15","A_PLF_16","A_PLF_17","A_PLF_19","A_PLF_2","A_PLF_3","A_PLF_4","A_PLF_5","A_PLF_7","A_PLF_8","A_PLF_9","AA_PLF_11","AA_PLF_12","AA_PLF_13","AA_PLF_6","AA_PLF_7","AA_PLF_8","AD_PLF_10","AD_PLF_11","AD_PLF_13","AD_PLF_14","AD_PLF_15","AD_PLF_16","AD_PLF_17","AD_PLF_18","AD_PLF_19","AD_PLF_2","AD_PLF_20","AD_PLF_21","AD_PLF_22","AD_PLF_24","AD_PLF_25","AD_PLF_26","AD_PLF_4","AD_PLF_6","AD_PLF_8","AD_PLF_9","AE_PLF_1","AE_PLF_2","AE_PLF_4","AE_PLF_5","AE_PLF_6","C_PLF_1","C_PLF_16","C_PLF_19","C_PLF_30","D_PLF_1","D_PLF_2","D_PLF_3","D_PLF_4","D_PLF_5","D_PLF_6","D_PLF_7","D_PLF_8","Y_PNF_12","Y_PNF_15","Y_PNF_16","Y_PNF_17","Y_PNF_18","Y_PNF_2","Y_PNF_20","Y_PNF_24","Y_PNF_25","Y_PNF_26","Y_PNF_27","Y_PNF_28","Y_PNF_29","Y_PNF_3","Y_PNF_31","Y_PNF_32","Y_PNF_33","Y_PNF_34","Y_PNF_36","Y_PNF_4","Y_PNF_5","Y_PNF_6","Y_PNF_7","Y_PNF_8","Y_PNF_9","Z_PLF_2","Z_PLF_3","Z_PLF_4","Z_PLF_5","Z_PLF_6","Z_PLF_7","Z_PLF_8")
, FR_T2 = c(0.6842105263,0.4736842105,1,0.4210526316,0.6842105263,0.6842105263,0.8947368421,0.5789473684,0.7368421053,0.7894736842,0.4210526316,0.5263157895,0.3157894737,0.5263157895,0.6842105263,0.8421052632,0.3684210526,0.8421052632,0.7894736842,0.7894736842,0.6842105263,0.6315789474,0.6315789474,0.3684210526,0.4736842105,0.2631578947,0.4736842105,0.9473684211,0.3157894737,0.5789473684,0.2631578947,0.5263157895,0.5263157895,0.7368421053,0.6315789474,0.8947368421,0.6315789474,0.9473684211,0.7368421053,0.6315789474,0.7894736842,0.7894736842,0.4736842105,0.4736842105,0.9473684211,0.7894736842,0.3157894737,0.9473684211,1,0.7368421053,0.5789473684,0.8421052632,0.8421052632,0.7368421053,0.5789473684,0.6842105263,0.4736842105,0.4210526316,0.6842105263,0.8947368421,0.6842105263,0.7368421053,0.5263157895,0.5789473684,0.8947368421,0.7894736842,0.5263157895,0.6315789474,0.3157894737,0.7368421053,0.5789473684,0.6842105263,0.7368421053,0.5789473684,0.7894736842,0.6842105263,0.6315789474,0.6842105263,0.5789473684,0.7894736842,0.5789473684,0.7368421053,0.4736842105,0.8947368421,0.8421052632,0.7894736842,0.6315789474,0.6842105263,0.8947368421,0.6842105263,0.9473684211)
, PT_T2 = c(0.7368421053,0.5789473684,0.9473684211,0.5263157895,0.6315789474,0.5789473684,0.9473684211,0.4736842105,0.8421052632,0.5263157895,0.2631578947,0.6842105263,0.3684210526,0.3684210526,0.4736842105,0.8947368421,0.4210526316,0.5263157895,0.8947368421,0.8421052632,0.8947368421,0.8947368421,0.6315789474,0.3684210526,0.0526315789,0.3684210526,0.4210526316,0.9473684211,0.3157894737,0.4736842105,0.3157894737,0.5789473684,0.4736842105,0.7894736842,0.5263157895,0.8947368421,0.6315789474,0.7894736842,0.7368421053,0.5789473684,0.6842105263,0.7368421053,0.3684210526,0.7894736842,0.7368421053,0.4736842105,0.5263157895,1,0.8947368421,0.8947368421,0.4736842105,0.8421052632,1,0.6315789474,0.5263157895,0.5789473684,0.5789473684,0.5789473684,0.5263157895,0.9473684211,0.5263157895,0.6315789474,0.5789473684,0.6315789474,0.9473684211,0.7894736842,0.8421052632,0.5263157895,0.7894736842,0.4736842105,0.6842105263,0.3684210526,0.7894736842,0.7368421053,0.6315789474,0.9473684211,0.4210526316,0.5789473684,0.3684210526,0.8947368421,0.6315789474,0.8421052632,0.5789473684,0.5263157895,0.9473684211,0.8947368421,0.7368421053,0.4736842105,0.8421052632,0.7894736842,0.9473684211)
, FR_T3 = c(0.9473684211,0.3157894737,0.9473684211,0.5789473684,0.5789473684,0.6842105263,0.8421052632,0.6842105263,0.7368421053,0.8421052632,0.4210526316,0.5789473684,0.4736842105,0.6842105263,0.5789473684,0.7894736842,0.7368421053,0.7894736842,1,0.8421052632,0.8947368421,0.4210526316,0.8947368421,0.4736842105,0.5263157895,0.4736842105,0.5789473684,1,0.7368421053,0.8421052632,0.2631578947,0.7894736842,0.6842105263,0.8947368421,0.5263157895,0.8947368421,0.6842105263,0.9473684211,0.9473684211,0.5263157895,0.9473684211,0.8421052632,0.4736842105,0.8947368421,0.9473684211,0.7368421053,0.5263157895,0.8421052632,0.9473684211,0.7894736842,0.8947368421,0.8421052632,0.8421052632,0.8947368421,0.5789473684,0.7368421053,0.6842105263,0.4736842105,0.6842105263,0.8947368421,0.4736842105,0.8421052632,0.7894736842,0.5789473684,0.7368421053,0.7894736842,0.8947368421,0.6842105263,0.6842105263,0.9473684211,0.7894736842,0.5263157895,0.7368421053,0.6842105263,0.8421052632,0.7368421053,0.7368421053,0.5789473684,0.4736842105,0.8947368421,0.4210526316,0.8947368421,0.6842105263,1,0.8421052632,0.8421052632,0.6315789474,0.6315789474,0.8947368421,0.6315789474,0.9473684211)
, PT_T3 = c(0.8421052632,0.3684210526,0.9473684211,0.3157894737,0.5789473684,0.7894736842,1,0.5263157895,0.8421052632,0.7894736842,0.3157894737,0.6315789474,0.4210526316,0.5263157895,0.6842105263,0.8421052632,0.8947368421,0.6842105263,0.9473684211,0.8947368421,0.9473684211,0.8421052632,0.8421052632,0.5263157895,0.6842105263,0.5263157895,0.8421052632,0.9473684211,0.4210526316,0.7894736842,0.7894736842,0.8421052632,0.7368421053,1,0.6842105263,1,0.7894736842,0.8421052632,0.9473684211,0.6842105263,0.7894736842,0.7894736842,0.3157894737,0.7894736842,NA,0.6315789474,0.6842105263,0.9473684211,1,0.9473684211,0.7368421053,0.8947368421,0.8421052632,0.8421052632,0.5789473684,0.6315789474,0.6315789474,0.8421052632,0.7894736842,0.8421052632,0.5789473684,0.8421052632,0.7368421053,0.6842105263,0.8421052632,0.8421052632,0.9473684211,0.4736842105,0.8421052632,0.7894736842,0.7368421053,0.2105263158,0.7894736842,0.7894736842,0.7368421053,0.6315789474,0.6315789474,0.4210526316,0.6315789474,0.8421052632,0.6842105263,0.9473684211,0.5789473684,0.5263157895,0.7894736842,0.7894736842,0.7894736842,0.6842105263,0.8421052632,0.8421052632,0.8947368421)
)
, row.names = c(NA, -91L)
, class = c("tbl_df","tbl","data.frame")
)
```

The only thing that’s changed in the statistical model compared to the example with the simulated data is that I’ve renamed the parameters and that the prior distributions are better motivated. Let’s consider each prior distribution in turn:

- `intercept ~ normal(0.2, 0.1);`: The intercept is the average true French skill score at T3 for children whose true French and Portuguese skill scores at T2 are 0. This is the lowest possible score (the theoretical range of the data is [0, 1]), so we’d expect such children to perform poorly at T3, too. A `normal(0.2, 0.1)` distribution puts 95% probability on such children having a true French score at T3 between 0 and 0.4.
- `slope_FR ~ normal(0.5, 0.25);`: This parameter expresses the difference between the average true French skill score at T3 for children with a true French skill score of 1 at T2 (the theoretical maximum) vs. those with a true French skill score of 0 at T2 (the theoretical minimum). This is obviously some value between -1 and 1, and presumably it’s going to be positive. A `normal(0.5, 0.25)` distribution puts 95% probability on this difference lying between 0 and 1, which I think is reasonable.
- `slope_PT ~ normal(0, 0.25);`: The slope for Portuguese is bound to be smaller than the one for French. Moreover, it’s not a given that it will be appreciably different from zero. Hence a prior centred on 0 that still gives the data a chance to pull the estimate in either direction.
- `sigma ~ normal(0.15, 0.08);`: If neither of the T2 variables predicts T3, uncertainty is highest when the mean T3 score is 0.5. Since these scores are bounded between 0 and 1, the standard deviation could not be much higher than 0.20. But French T2 is bound to be a predictor, so let us choose a slightly lower value (0.15).
- `latent_means ~ normal(0.5, 0.1);`: These are the prior expectations for the true score means of the T2 variables. 0.5 lies in the middle of the scale; the `normal(0.5, 0.1)` prior puts 95% probability on these means lying between 0.3 and 0.7.
- `sigma_lat_FR_T2 ~ normal(0, 0.25);`, `sigma_lat_PT_T2 ~ normal(0, 0.25);`: The standard deviations of the latent T2 variables. These truncated normal distributions put 95% probability on the latent standard deviations being lower than 0.50.
- `latent_rho ~ normal(0.4, 0.3);`: The a priori expected correlation between the latent T2 variables. These are bound to be positively correlated.
- `reliability_FR_T2 ~ beta(100, 100*0.27/0.73);`: The prior distribution for the reliability of the French T2 variable. Cronbach’s α for this variable was 0.73 (95% CI: [0.65, 0.78]). This roughly corresponds to a `beta(100, 100*0.27/0.73)` distribution: `qbeta(c(0.025, 0.975), 100, 100*0.27/0.73)` yields `[1] 0.6529105 0.8007296`.
- `reliability_PT_T2 ~ beta(120, 120*0.21/0.79);`: Similarly, Cronbach’s α for the Portuguese T2 variable was 0.79 (95% CI: [0.72, 0.84]), which roughly corresponds to a `beta(120, 120*0.21/0.79)` distribution: `qbeta(c(0.025, 0.975), 120, 120*0.21/0.79)` yields `[1] 0.7219901 0.8507814`.
- `reliability_FR_T3 ~ beta(100, 100*0.27/0.73);`: The estimated reliability for the French T3 data was similar to that of the T2 data, so I used the same prior.
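The recipe behind these beta priors can be sketched in a few lines of R (the choice of the first shape parameter is a tuning knob, not something dictated by the data):

```r
# A beta(a, b) distribution has mean a / (a + b). Setting
# b = a * (1 - m) / m fixes the mean at the reliability estimate m;
# increasing a narrows the distribution. a = 100 roughly matches the
# width of the reported 95% CI for the French T2 reliability.
a <- 100
m <- 0.73
b <- a * (1 - m) / m
round(qbeta(c(0.025, 0.975), a, b), 2)  # approximately 0.65 and 0.80
```

If the confidence interval had been wider, a smaller value of `a` would be called for.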

```
interdependence_code <- '
data {
  // Number of observations
  int<lower = 1> N;
  // Observed outcome
  vector[N] FR_T3;
  // Observed predictors
  vector[N] FR_T2;
  vector[N] PT_T2;
}
parameters {
  // Parameters for regression
  real intercept;
  real slope_FR;
  real slope_PT;
  real<lower = 0> sigma;
  // standard deviations of latent predictors (= constructs)
  real<lower = 0> sigma_lat_FR_T2;
  real<lower = 0> sigma_lat_PT_T2;
  // Means of latent predictors
  row_vector[2] latent_means;
  // Unknown correlation between latent predictors
  real<lower = -1, upper = 1> latent_rho;
  // Latent variables
  matrix[N, 2] latent_predictors;
  vector[N] lat_FR_T3; // latent outcome
  // Unknown but estimated reliabilities
  real<lower = 0, upper = 1> reliability_FR_T2;
  real<lower = 0, upper = 1> reliability_PT_T2;
  real<lower = 0, upper = 1> reliability_FR_T3;
}
transformed parameters {
  vector[N] mu_FR_T3;  // conditional mean of outcome
  vector[N] lat_FR_T2; // latent variables, separated out
  vector[N] lat_PT_T2;
  real error_FR_T2;    // standard error of measurement
  real error_PT_T2;
  real error_FR_T3;
  // standard deviations of latent predictors, in matrix form
  matrix[2, 2] sigma_lat;
  // correlation and covariance matrix for latent predictors
  cov_matrix[2] latent_cor;
  cov_matrix[2] latent_cov;
  // standard deviation of latent outcome
  real<lower = 0> sigma_lat_FR_T3;
  sigma_lat_FR_T3 = sd(lat_FR_T3);
  // Express measurement error in terms of
  // standard deviation of constructs and reliability
  error_FR_T2 = sqrt(sigma_lat_FR_T2^2*(1/reliability_FR_T2 - 1));
  error_PT_T2 = sqrt(sigma_lat_PT_T2^2*(1/reliability_PT_T2 - 1));
  error_FR_T3 = sqrt(sigma_lat_FR_T3^2*(1/reliability_FR_T3 - 1));
  // Define diagonal matrix with standard deviations of latent variables
  sigma_lat[1, 1] = sigma_lat_FR_T2;
  sigma_lat[2, 2] = sigma_lat_PT_T2;
  sigma_lat[1, 2] = 0;
  sigma_lat[2, 1] = 0;
  // Define correlation matrix for latent variables
  latent_cor[1, 1] = 1;
  latent_cor[2, 2] = 1;
  latent_cor[1, 2] = latent_rho;
  latent_cor[2, 1] = latent_rho;
  // Compute covariance matrix for latent variables
  latent_cov = sigma_lat * latent_cor * sigma_lat;
  // Extract latent variables from matrix
  lat_FR_T2 = latent_predictors[, 1];
  lat_PT_T2 = latent_predictors[, 2];
  // Regression model for conditional mean of FR_T3
  mu_FR_T3 = intercept + slope_FR*lat_FR_T2 + slope_PT*lat_PT_T2;
}
model {
  // Priors for regression parameters
  intercept ~ normal(0.2, 0.1);
  slope_FR ~ normal(0.5, 0.25);
  slope_PT ~ normal(0, 0.25);
  sigma ~ normal(0.15, 0.08);
  // Prior for latent means
  latent_means ~ normal(0.5, 0.1);
  // Priors for latent standard deviations
  sigma_lat_FR_T2 ~ normal(0, 0.25);
  sigma_lat_PT_T2 ~ normal(0, 0.25);
  // Prior expectation for correlation between latent variables.
  latent_rho ~ normal(0.4, 0.3);
  // Priors for reliabilities.
  // These are estimated with some uncertainty, i.e.,
  // they are not point values but distributions.
  reliability_FR_T2 ~ beta(100, 100*0.27/0.73);
  reliability_PT_T2 ~ beta(120, 120*0.21/0.79);
  reliability_FR_T3 ~ beta(100, 100*0.27/0.73);
  // Distribution of latent predictors
  for (i in 1:N) {
    latent_predictors[i, ] ~ multi_normal(latent_means, latent_cov);
  }
  // Generate latent outcome
  lat_FR_T3 ~ normal(mu_FR_T3, sigma);
  // Measurement model
  FR_T2 ~ normal(lat_FR_T2, error_FR_T2);
  PT_T2 ~ normal(lat_PT_T2, error_PT_T2);
  FR_T3 ~ normal(lat_FR_T3, error_FR_T3);
}
'
```

```
data_list <- list(
  FR_T2 = skills$FR_T2,
  PT_T2 = skills$PT_T2,
  FR_T3 = skills$FR_T3,
  N = length(skills$FR_T3)
)
interdependence_model <- cmdstan_model(write_stan_file(interdependence_code))
interdependence_fit <- interdependence_model$sample(
  data = data_list
  , seed = 42
  , chains = 4
  , parallel_chains = 4
  , iter_warmup = 2000
  , iter_sampling = 6000
  , refresh = 1000
  , max_treedepth = 15
  , adapt_delta = 0.9999
)
```

```
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 8000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 8000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 8000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 8000 [ 0%] (Warmup)
Chain 1 Iteration: 1000 / 8000 [ 12%] (Warmup)
Chain 4 Iteration: 1000 / 8000 [ 12%] (Warmup)
Chain 2 Iteration: 1000 / 8000 [ 12%] (Warmup)
Chain 3 Iteration: 1000 / 8000 [ 12%] (Warmup)
Chain 1 Iteration: 2000 / 8000 [ 25%] (Warmup)
Chain 1 Iteration: 2001 / 8000 [ 25%] (Sampling)
Chain 4 Iteration: 2000 / 8000 [ 25%] (Warmup)
Chain 4 Iteration: 2001 / 8000 [ 25%] (Sampling)
Chain 2 Iteration: 2000 / 8000 [ 25%] (Warmup)
Chain 2 Iteration: 2001 / 8000 [ 25%] (Sampling)
Chain 3 Iteration: 2000 / 8000 [ 25%] (Warmup)
Chain 3 Iteration: 2001 / 8000 [ 25%] (Sampling)
Chain 1 Iteration: 3000 / 8000 [ 37%] (Sampling)
Chain 4 Iteration: 3000 / 8000 [ 37%] (Sampling)
Chain 1 Iteration: 4000 / 8000 [ 50%] (Sampling)
Chain 4 Iteration: 4000 / 8000 [ 50%] (Sampling)
Chain 2 Iteration: 3000 / 8000 [ 37%] (Sampling)
Chain 1 Iteration: 5000 / 8000 [ 62%] (Sampling)
Chain 4 Iteration: 5000 / 8000 [ 62%] (Sampling)
Chain 4 Iteration: 6000 / 8000 [ 75%] (Sampling)
Chain 1 Iteration: 6000 / 8000 [ 75%] (Sampling)
Chain 2 Iteration: 4000 / 8000 [ 50%] (Sampling)
Chain 3 Iteration: 3000 / 8000 [ 37%] (Sampling)
Chain 4 Iteration: 7000 / 8000 [ 87%] (Sampling)
Chain 1 Iteration: 7000 / 8000 [ 87%] (Sampling)
Chain 4 Iteration: 8000 / 8000 [100%] (Sampling)
Chain 4 finished in 261.0 seconds.
Chain 1 Iteration: 8000 / 8000 [100%] (Sampling)
Chain 1 finished in 271.4 seconds.
Chain 2 Iteration: 5000 / 8000 [ 62%] (Sampling)
Chain 2 Iteration: 6000 / 8000 [ 75%] (Sampling)
Chain 3 Iteration: 4000 / 8000 [ 50%] (Sampling)
Chain 2 Iteration: 7000 / 8000 [ 87%] (Sampling)
Chain 2 Iteration: 8000 / 8000 [100%] (Sampling)
Chain 2 finished in 384.6 seconds.
Chain 3 Iteration: 5000 / 8000 [ 62%] (Sampling)
Chain 3 Iteration: 6000 / 8000 [ 75%] (Sampling)
Chain 3 Iteration: 7000 / 8000 [ 87%] (Sampling)
Chain 3 Iteration: 8000 / 8000 [100%] (Sampling)
Chain 3 finished in 656.3 seconds.
All 4 chains finished successfully.
Mean chain execution time: 393.3 seconds.
Total execution time: 656.4 seconds.
```

```
interdependence_fit$summary(
  variables = c("intercept", "slope_FR", "slope_PT"
                , "sigma", "latent_rho")
  , "mean", "sd"
  , extra_quantiles = ~posterior::quantile2(., probs = c(0.025, 0.975))
  , "rhat"
)
```

```
# A tibble: 5 × 6
variable mean sd q2.5 q97.5 rhat
<chr> <num> <num> <num> <num> <num>
1 intercept 0.189 0.0529 0.0848 0.293 1.00
2 slope_FR 0.712 0.154 0.407 1.01 1.00
3 slope_PT 0.107 0.138 -0.165 0.378 1.00
4 sigma 0.0721 0.0182 0.0331 0.106 1.00
5 latent_rho 0.812 0.0781 0.646 0.950 1.00
```

Unsurprisingly, the model confidently finds a link between French skills at T2 and at T3, even on the level of the unobserved true scores (`slope_FR`: 0.71 ± 0.15). But more importantly, the evidence for an additional effect of Portuguese skills at T2 on French skills at T3 is flimsy (`slope_PT`: 0.11 ± 0.14). The latent T2 variables are estimated to correlate strongly (`latent_rho`: 0.81 ± 0.08). These results don’t change much when a flat prior on `latent_rho` is specified (this can be accomplished by not specifying any prior at all for `latent_rho`). Compared to the model in the previous blog post (Table 2), little has changed. The only appreciable difference is that the estimate for `sigma` is lower. The reason is that, unlike the previous model, the current model partitions the variance in the French T3 scores into true score variance and measurement error variance. In this model, `sigma` captures the true score variance that isn’t accounted for by T2 skills, whereas in the previous model, `sigma` captured the *total* variance that wasn’t accounted for by T2 skills. But other than that, the current model doesn’t represent a huge change from the previous one.

Table 2: Parameter estimates (posterior mean ± posterior standard deviation) for the current model and for the model from the previous blog post.

| Parameter | Current estimate | Previous estimate |
|---|---|---|
| intercept | 0.19 ± 0.05 | 0.19 ± 0.05 |
| slope_FR | 0.71 ± 0.15 | 0.71 ± 0.16 |
| slope_PT | 0.11 ± 0.14 | 0.10 ± 0.14 |
| sigma | 0.07 ± 0.02 | 0.12 ± 0.01 |
| latent_rho | 0.81 ± 0.08 | 0.81 ± 0.08 |

A couple of things still remain to be done. First, the French test at T3 was the same as the one at T2, so the measurement errors on the two scores are unlikely to be completely independent of one another. I’d like to find out how such correlated measurement errors affect the parameter estimates. Second, I’d like to get started with prior and posterior predictive checks: the former to check whether the priors give rise to broadly plausible data patterns, and the latter to check whether the full model tends to generate data sets similar to the one actually observed.
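As a teaser for the first of these checks, a rough prior predictive check can be sketched in plain R without touching Stan at all: draw parameter values from the priors listed above and generate latent French T3 scores. (The simulation settings, including drawing one latent predictor value per simulation from the prior on the latent means, are my own simplifications.)

```r
set.seed(2023)
n_sims <- 10000
# Draw regression parameters from their priors
intercept <- rnorm(n_sims, 0.2, 0.1)
slope_FR  <- rnorm(n_sims, 0.5, 0.25)
slope_PT  <- rnorm(n_sims, 0, 0.25)
sigma     <- abs(rnorm(n_sims, 0.15, 0.08))  # half-normal: sigma is constrained positive
# Draw latent T2 predictor values from the prior on the latent means
lat_FR_T2 <- rnorm(n_sims, 0.5, 0.1)
lat_PT_T2 <- rnorm(n_sims, 0.5, 0.1)
# Generate latent T3 scores and see how many land in the possible range [0, 1]
sim_FR_T3 <- rnorm(n_sims, intercept + slope_FR * lat_FR_T2 + slope_PT * lat_PT_T2, sigma)
mean(sim_FR_T3 >= 0 & sim_FR_T3 <= 1)
```

If a large share of the simulated scores fell outside [0, 1], that would be a sign that the priors need rethinking.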

Berthele, Raphael and Jan Vanhove. 2017. What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland. *International Journal of Bilingual Education and Bilingualism*.

Brunner, Jerry and Peter C. Austin. 2009. Inflation of Type I error rate in multiple regression when independent variables are measured with error. *Canadian Journal of Statistics* 37(1). 33–46.

Westfall, Jacob and Tal Yarkoni. 2016. Statistically controlling for confounding constructs is harder than you think. *PLOS ONE* 11(3). e0152719.

Please note that I reran the code on this page on August 6, 2023.

`devtools::session_info()`

```
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os Ubuntu 22.04.2 LTS
system x86_64, linux-gnu
ui X11
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Zurich
date 2023-08-06
pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-5 2016-07-21 [1] CRAN (R 4.3.1)
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
boot 1.3-28 2021-05-03 [4] CRAN (R 4.2.0)
cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.1)
checkmate 2.2.0 2023-04-27 [1] CRAN (R 4.3.1)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
cmdstanr * 0.6.0 2023-08-02 [1] local
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1)
curl 5.0.1 2023-06-07 [1] CRAN (R 4.3.1)
dagitty * 0.3-1 2021-01-21 [1] CRAN (R 4.3.1)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.1)
digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
distributional 0.3.2 2023-03-22 [1] CRAN (R 4.3.1)
dplyr 1.1.2 2023-04-20 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
evaluate 0.15 2022-02-18 [2] CRAN (R 4.2.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.1)
farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.0)
fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
ggplot2 3.4.2 2023-04-03 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.1)
httpuv 1.6.11 2023-05-11 [1] CRAN (R 4.3.1)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
later 1.3.1 2023-05-02 [1] CRAN (R 4.3.1)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
MASS 7.3-60 2023-05-04 [4] CRAN (R 4.3.1)
memoise 2.0.1 2021-11-26 [2] CRAN (R 4.2.0)
mime 0.10 2021-02-13 [2] CRAN (R 4.0.2)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.3.1)
posterior * 1.4.1 2023-03-14 [1] CRAN (R 4.3.1)
prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.2.0)
processx 3.8.2 2023-06-30 [1] CRAN (R 4.3.1)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.1)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.1)
ps 1.7.5 2023-04-18 [1] CRAN (R 4.3.1)
purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
Rcpp 1.0.11 2023-07-06 [1] CRAN (R 4.3.1)
remotes 2.4.2 2021-11-30 [2] CRAN (R 4.2.0)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
shiny 1.7.4.1 2023-07-06 [1] CRAN (R 4.3.1)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.1)
stringr 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tensorA 0.36.2 2020-11-19 [1] CRAN (R 4.3.1)
tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.1)
usethis 2.2.2 2023-07-06 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1)
V8 4.3.0 2023-04-08 [1] CRAN (R 4.3.0)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.1)
yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
[1] /home/jan/R/x86_64-pc-linux-gnu-library/4.3
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────
```

This blog post details my efforts to specify a Bayesian model in which the measurement error on the confounding variable is taken into account. The ultimate aim is to obtain more honest estimates of the impact of the confounding variable and the variable of actual interest on the outcome. First, I’ll discuss a simulated example to demonstrate the consequences of measurement error for statistical control and what a model needs to do to take measurement error into account appropriately. Then I’ll apply the insights gained to a real-life study in applied linguistics.

I will preface all of this with the disclaimer that I don’t consider myself an expert in the techniques discussed below; one reason for writing this blog is to solicit feedback from readers more knowledgeable than I am.

**Update (2023-08-06):** When converting this blog from Jekyll/Bootstrap to Quarto, I noticed that the original code used in this blog post, which involved the R package `rstan`, has started to run very slowly. In the present version, I use `cmdstanr` instead.

If you want to follow along, you need the following R packages/settings:

```
library(tidyverse)
library(cmdstanr) # for fitting Bayesian models, v. 2.32.2
library(posterior) # for working with posterior distributions
# For drawing scatterplot matrices
source("https://janhove.github.io/RCode/scatterplot_matrix.R")
# Set random seed for reproducibility
set.seed(2020-01-21, kind = "Mersenne-Twister")
```

You’ll also need the `MASS` package, but you don’t need to load it.

Let’s first illustrate the problem that measurement error causes for statistical control using simulated data. That way, we know what goes into the data and what we hope a model should take out of it.

The scenario I want to focus on is the following. You are pretty sure that a given construct `A` causally affects a variable `Z`. You are, however, interested in finding out if another construct `B` also affects `Z`. You can’t manipulate any of the variables, so you have to make do with an observational study. Unfortunately, `A` and `B` are likely to be correlated. Let’s simulate some data to make this more concrete:

- 500 datapoints (`n`).
- Constructs `A` and `B` are correlated at 0.73 (`rho`).
- Constructs `A` and `B` are normally distributed with standard deviations of 1.5 (`sd_A`) and 0.8 (`sd_B`), respectively. The means of these normal distributions are 3 and -4, respectively.

The numbers in the list above aren’t special; I just wanted to make sure the model I will specify further down below isn’t restricted to assuming that the constructs are distributed normally with mean 0 and standard deviation 1.

```
# Generate correlated constructs
n <- 500
rho <- 0.73
sd_A <- 1.5
sd_B <- 0.8
# Given the correlation and the standard deviations,
# construct the covariance matrix for the constructs like so:
latent_covariance_matrix <- rbind(c(sd_A, 0), c(0, sd_B)) %*%
rbind(c(1, rho), c(rho, 1)) %*%
rbind(c(sd_A, 0), c(0, sd_B))
# Draw data from the multivariate normal distribution:
constructs <- MASS::mvrnorm(n = n, mu = c(3, -4)
, Sigma = latent_covariance_matrix)
A <- constructs[, 1]
B <- constructs[, 2]
```

For the purposes of this simulation, I’ll generate data for `Z`

that are affected by `A`

but not by `B`

:

```
# A influences Z, B doesn't
Z <- 2 + 0.7*A + rnorm(n, sd = 1.3)
```

As **Figure 1** shows, `B`

and `Z`

are correlated, even though neither influences the other. This is because of their link with `A`

.

`scatterplot_matrix(cbind(Z, A, B))`

In situations such as these, researchers typically include both `A`

and `B`

as predictors in a model with `Z`

as the outcome. And this works: we find a significant relationship between `A`

and `Z`

, but not between `B`

and `Z`

. Moreover, all estimated parameters are in the vicinity of their true values, as specified in the simulation.

`summary(lm(Z ~ A + B))$coefficients`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.881 0.5439 5.30 1.78e-07
A 0.609 0.0546 11.14 6.93e-26
B 0.146 0.1025 1.43 1.54e-01
```

But in real life, the situation is more complicated. When researchers “statistically control” for a possible confound, they’re usually interested in controlling for the confounding *construct* rather than for any one *measurement* of this construct. For instance, when teasing apart the influences of L2 vocabulary knowledge and L2 morphosyntactic knowledge on L2 speaking fluency, researchers don’t actually want to control for the learners’ performance on this or that vocabulary test: they want to control for L2 vocabulary knowledge itself. One would hope that the vocabulary test gives a good indication of the learners’ vocabulary knowledge, but it’s understood that their performance will be affected by other factors as well (e.g., form on the day, luck with guessing, luck with the words occurring in the test, etc.).

So let’s add some noise (measurement error) to constructs `A`

and `B`

. Here I express the measurement error in terms of the reliability of the instruments used to measure the constructs: if $\sigma$ is the standard deviation of the unobserved construct scores and $r$ is the reliability of the measurement instrument, then the standard deviation of the measurement error is $\sqrt{\sigma^2/r - \sigma^2}$. For the purposes of this demonstration, I’m going to specify that construct `A`

was measured with ‘okay’ reliability (0.70), whereas construct `B`

was measured with exceptional reliability (0.95):

```
obs_A <- A + rnorm(n = n, sd = sqrt(sd_A^2/0.70 - sd_A^2))
obs_B <- B + rnorm(n = n, sd = sqrt(sd_B^2/0.95 - sd_B^2))
```

Crucially, if we include the observed values `obs_A`

and `obs_B`

as predictors in a model with `Z`

as the outcome, we find that the parameter for `obs_B`

is significant—even though there is no causal link between `B`

and `Z`

:

`summary(lm(Z ~ obs_A + obs_B))$coefficients`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.842 0.4557 10.63 6.68e-24
obs_A 0.364 0.0418 8.71 4.36e-17
obs_B 0.456 0.0911 5.01 7.66e-07
```

Descriptively, this is perfectly fine: You do indeed now know more about `Z`

if you take into account `obs_B`

in addition to `obs_A`

. But if you take this to mean that the *construct* of `B`

can explain variation in `Z`

over and beyond that which can be explained by the construct of `A`

, this would be a mistake.

Conceptually, what has happened is that since `obs_A`

imperfectly reflects construct `A`

, including `obs_A`

in the model controls for construct `A`

only imperfectly.
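To see this undercorrection mechanism in isolation, here is a minimal standalone sketch (separate from the simulation above; all names are mine) of the classic attenuation result: regressing on an error-laden predictor shrinks the estimated slope towards zero by roughly the reliability factor.

```r
# Minimal sketch of attenuation bias (hypothetical example, not the
# blog's simulated data). The predictor is measured with reliability
# 0.5, so the estimated slope lands near 0.5 times the true slope of 2.
set.seed(123)
n_sim <- 1e5
true_x <- rnorm(n_sim)                  # true construct, sd = 1
obs_x <- true_x + rnorm(n_sim)          # reliability = 1 / (1 + 1) = 0.5
y <- 2 * true_x + rnorm(n_sim)          # true slope = 2
coef(lm(y ~ obs_x))["obs_x"]            # close to 2 * 0.5 = 1
```

Controlling for `obs_x` in a larger model inherits the same problem: the part of `true_x` that the noisy measurement misses remains available to be ‘explained’ by any correlated covariate.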

Below is the Stan code I used for fitting the simulated data. The model takes as its input the three observed variables (`obs_A`

, `obs_B`

and `Z`

). Information about the reliability of `obs_A`

and `obs_B`

is also provided in the form of a prior distribution on `reliability_A`

and `reliability_B`

. Specifically, it’s assumed that the reliability coefficient for `obs_A`

is drawn from a `beta(30, 10)`

-distribution. This assigns a 95% probability to the reliability coefficient lying between roughly 0.61 and 0.87. `obs_B`

is assumed to be measured more reliably, as encoded by a `beta(95, 5)`

-distribution, which assigns a 95% probability to the reliability coefficient lying between 0.90 and 0.98.
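These interval claims are easy to verify with `qbeta()` (a quick sanity check on the priors, separate from the Stan code):

```r
# 95% central intervals implied by the two reliability priors
round(qbeta(c(0.025, 0.975), 30, 10), 2)  # roughly 0.61 to 0.87
round(qbeta(c(0.025, 0.975), 95, 5), 2)   # roughly 0.90 to 0.98
```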

Importantly, as I noted in some earlier explorations, the model has to take into account the possibility that the constructs of `A`

and `B`

are correlated. I specified a prior vaguely expecting a positive correlation but that wouldn’t find correlations close to or below zero to be too surprising either. Priors on the other parameters are pretty vague; I find it difficult to come up with reasonable priors in context-free examples.

```
meas_error_code <- '
data {
// Number of observations
int<lower = 1> N;
// Observed outcome
vector[N] Z;
// Observed predictors
vector[N] obs_A;
vector[N] obs_B;
}
parameters {
// Parameters for regression
real intercept;
real slope_A;
real slope_B;
real<lower = 0> sigma;
// standard deviations of latent variables (= constructs)
real<lower = 0> sigma_lat_A;
real<lower = 0> sigma_lat_B;
// Unknown but estimated reliabilities
real<lower = 0, upper = 1> reliability_A;
real<lower = 0, upper = 1> reliability_B;
// Means of latent predictors
row_vector[2] latent_means;
// Unknown correlation between latent predictors
real<lower = -1, upper = 1> latent_rho;
// Latent variables
matrix[N, 2] latent_variables;
}
transformed parameters {
vector[N] mu_Z; // conditional mean of outcome
vector[N] lat_A; // latent variables, separated out
vector[N] lat_B;
real error_A; // standard error of measurement
real error_B;
// standard deviations of latent variables, in matrix form
matrix[2, 2] sigma_lat;
// correlation and covariance matrix for latent variables
cov_matrix[2] latent_cor;
cov_matrix[2] latent_cov;
// Standardised slopes for A and B
real slope_A_std;
real slope_B_std;
// Express measurement error in terms of
// standard deviation of constructs and reliability
error_A = sqrt(sigma_lat_A^2/reliability_A - sigma_lat_A^2);
error_B = sqrt(sigma_lat_B^2/reliability_B - sigma_lat_B^2);
// Define diagonal matrix with standard errors of latent variables
sigma_lat[1, 1] = sigma_lat_A;
sigma_lat[2, 2] = sigma_lat_B;
sigma_lat[1, 2] = 0;
sigma_lat[2, 1] = 0;
// Define correlation matrix for latent variables
latent_cor[1, 1] = 1;
latent_cor[2, 2] = 1;
latent_cor[1, 2] = latent_rho;
latent_cor[2, 1] = latent_rho;
// Compute covariance matrix for latent variables
latent_cov = sigma_lat * latent_cor * sigma_lat;
// Extract latent variables from matrix
lat_A = latent_variables[, 1];
lat_B = latent_variables[, 2];
// Regression model
mu_Z = intercept + slope_A*lat_A + slope_B*lat_B;
// Standardised regression parameters
slope_A_std = slope_A * sigma_lat_A;
slope_B_std = slope_B * sigma_lat_B;
}
model {
// Priors for regression parameters
intercept ~ normal(0, 2);
slope_A ~ normal(0, 2);
slope_B ~ normal(0, 2);
sigma ~ normal(0, 2);
// Prior for reliabilities
reliability_A ~ beta(30, 10); // assume the reliability has been estimated to be
// roughly 0.75, but with considerable uncertainty
reliability_B ~ beta(95, 5); // assume B was measured with exceptional reliability
// Prior for latent means
latent_means ~ normal(0, 3);
// Prior for latent standard deviations
sigma_lat_A ~ normal(0, 2);
sigma_lat_B ~ normal(0, 2);
// Prior expectation for correlation between latent variables:
// tend towards positive rho
latent_rho ~ normal(0.4, 0.3);
// Distribution of latent variable
for (i in 1:N) {
latent_variables[i, ] ~ multi_normal(latent_means, latent_cov);
}
// Measurement model
obs_A ~ normal(lat_A, error_A);
obs_B ~ normal(lat_B, error_B);
// Generate outcome
Z ~ normal(mu_Z, sigma);
}
'
```

Let’s put the data into a Stan-friendly list and fit the model:

```
data_list <- list(
Z = Z,
obs_A = obs_A,
obs_B = obs_B,
N = n
)
```

```
meas_error_model <- cmdstan_model(write_stan_file(meas_error_code))
model_fit <- meas_error_model$sample(
data = data_list
, seed = 123
, chains = 4
, parallel_chains = 4
, iter_warmup = 2000
, iter_sampling = 2000
, refresh = 500
, max_treedepth = 15
, adapt_delta = 0.95
)
```

```
Running MCMC with 4 parallel chains...
Chain 1 Iteration: 1 / 4000 [ 0%] (Warmup)
Chain 2 Iteration: 1 / 4000 [ 0%] (Warmup)
Chain 3 Iteration: 1 / 4000 [ 0%] (Warmup)
Chain 4 Iteration: 1 / 4000 [ 0%] (Warmup)
Chain 1 Iteration: 500 / 4000 [ 12%] (Warmup)
Chain 4 Iteration: 500 / 4000 [ 12%] (Warmup)
Chain 3 Iteration: 500 / 4000 [ 12%] (Warmup)
Chain 2 Iteration: 500 / 4000 [ 12%] (Warmup)
Chain 1 Iteration: 1000 / 4000 [ 25%] (Warmup)
Chain 3 Iteration: 1000 / 4000 [ 25%] (Warmup)
Chain 4 Iteration: 1000 / 4000 [ 25%] (Warmup)
Chain 2 Iteration: 1000 / 4000 [ 25%] (Warmup)
Chain 1 Iteration: 1500 / 4000 [ 37%] (Warmup)
Chain 3 Iteration: 1500 / 4000 [ 37%] (Warmup)
Chain 4 Iteration: 1500 / 4000 [ 37%] (Warmup)
Chain 2 Iteration: 1500 / 4000 [ 37%] (Warmup)
Chain 1 Iteration: 2000 / 4000 [ 50%] (Warmup)
Chain 1 Iteration: 2001 / 4000 [ 50%] (Sampling)
Chain 3 Iteration: 2000 / 4000 [ 50%] (Warmup)
Chain 3 Iteration: 2001 / 4000 [ 50%] (Sampling)
Chain 4 Iteration: 2000 / 4000 [ 50%] (Warmup)
Chain 4 Iteration: 2001 / 4000 [ 50%] (Sampling)
Chain 2 Iteration: 2000 / 4000 [ 50%] (Warmup)
Chain 2 Iteration: 2001 / 4000 [ 50%] (Sampling)
Chain 1 Iteration: 2500 / 4000 [ 62%] (Sampling)
Chain 3 Iteration: 2500 / 4000 [ 62%] (Sampling)
Chain 2 Iteration: 2500 / 4000 [ 62%] (Sampling)
Chain 3 Iteration: 3000 / 4000 [ 75%] (Sampling)
Chain 4 Iteration: 2500 / 4000 [ 62%] (Sampling)
Chain 1 Iteration: 3000 / 4000 [ 75%] (Sampling)
Chain 2 Iteration: 3000 / 4000 [ 75%] (Sampling)
Chain 3 Iteration: 3500 / 4000 [ 87%] (Sampling)
Chain 1 Iteration: 3500 / 4000 [ 87%] (Sampling)
Chain 4 Iteration: 3000 / 4000 [ 75%] (Sampling)
Chain 3 Iteration: 4000 / 4000 [100%] (Sampling)
Chain 3 finished in 283.6 seconds.
Chain 2 Iteration: 3500 / 4000 [ 87%] (Sampling)
Chain 1 Iteration: 4000 / 4000 [100%] (Sampling)
Chain 1 finished in 311.5 seconds.
Chain 4 Iteration: 3500 / 4000 [ 87%] (Sampling)
Chain 2 Iteration: 4000 / 4000 [100%] (Sampling)
Chain 2 finished in 330.5 seconds.
Chain 4 Iteration: 4000 / 4000 [100%] (Sampling)
Chain 4 finished in 349.7 seconds.
All 4 chains finished successfully.
Mean chain execution time: 318.8 seconds.
Total execution time: 349.9 seconds.
```

I’ve turned off warning notifications for this blog post, but I did receive this one:

Warning: 4 of 4 chains had an E-BFMI less than 0.2. See https://mc-stan.org/misc/warnings for details.

The mc-stan website does indeed contain some advice, but I’m going to ignore this warning for the time being and get on with the blog post.

That said, all estimated parameters are pretty much on the money. This includes, importantly, the estimated slope for the `B`

construct (`slope_B`
: -0.14, with a 95% credible interval of [-0.75, 0.32]). Notice, too, that the model was able to figure out the correlation between the latent constructs `A`
: -0.09, with a 95% credible interval of [-0.66, 0.33]). Notice, too, that the model was able to figure out the correlation between the latent constructs `A`

and `B`

(`latent_rho`

).

```
model_fit$summary(
variables = c("intercept", "slope_A", "slope_B", "sigma"
,"sigma_lat_A", "sigma_lat_B"
, "latent_means", "latent_rho"
, "slope_A_std", "slope_B_std"
, "reliability_A", "reliability_B")
, "mean", "sd"
, extra_quantiles = ~posterior::quantile2(., probs = c(0.025, 0.975))
, "rhat"
)
```

```
# A tibble: 13 × 6
variable mean sd q2.5 q97.5 rhat
<chr> <num> <num> <num> <num> <num>
1 intercept 1.25 1.63 -2.38 3.92 1.01
2 slope_A 0.782 0.185 0.483 1.20 1.02
3 slope_B -0.137 0.277 -0.750 0.316 1.01
4 sigma 1.21 0.0742 1.05 1.35 1.01
5 sigma_lat_A 1.55 0.0850 1.39 1.72 1.01
6 sigma_lat_B 0.828 0.0279 0.775 0.883 1.00
7 latent_means[1] 2.90 0.0830 2.74 3.07 1.00
8 latent_means[2] -4.01 0.0370 -4.08 -3.93 1.00
9 latent_rho 0.775 0.0465 0.680 0.861 1.01
10 slope_A_std 1.21 0.249 0.792 1.76 1.01
11 slope_B_std -0.114 0.230 -0.620 0.262 1.01
12 reliability_A 0.705 0.0618 0.593 0.829 1.01
13 reliability_B 0.950 0.0209 0.903 0.984 1.01
```

To get some sense of what the model is doing, I’m going to extract the posterior distributions for the latent construct scores. These are the model’s guesses of which scores the simulated participants would have had if there had been no measurement error. These guesses are based on the information we’ve fed the model, including the observed variables, the relationships among them, and their probable reliability. I’m just going to work with the means of these posterior distributions, but there can be substantial uncertainty about the model’s guesses.

```
est_lat_A <- model_fit$draws("lat_A", format = "draws_matrix")
est_lat_B <- model_fit$draws("lat_B", format = "draws_matrix")
df_variables <- tibble(
Z = Z,
obs_A = obs_A,
obs_B = obs_B,
est_A = apply(est_lat_A, 2, mean),
est_B = apply(est_lat_B, 2, mean)
)
```

**Figure 2** shows the relationships among the three variables and shows *shrinkage* at work. For the variables about whose actual values there is uncertainty (viz., A and B), the model reckons that extreme values are caused by a combination of skill (or lack thereof) as well as good (or bad) luck. Accordingly, it adjusts these values towards the bulk of the data. In doing so, it takes into account both the correlation that we ‘expected’ between A and B as well as the possible relationship between A and B on the one hand and Z on the other. For A, the adjustments are fairly large because this variable was assumed to be measured with considerable error. For B, the adjustments are smaller. Z, finally, was assumed to be measured without error and so no adjustments are required.

```
par(mfrow = c(2, 2))
# Z vs. A
plot(Z ~ obs_A, df_variables, pch = 1,
xlab = "A", ylab = "Z")
points(Z ~ est_A, df_variables, pch = 16)
arrows(x0 = df_variables$obs_A, x1 = df_variables$est_A,
y0 = df_variables$Z,
col = "grey80", length = 0)
# Z vs. B
plot(Z ~ obs_B, df_variables, pch = 1,
xlab = "B", ylab = "Z")
points(Z ~ est_B, df_variables, pch = 16)
arrows(x0 = df_variables$obs_B, x1 = df_variables$est_B,
y0 = df_variables$Z,
col = "grey80", length = 0)
# B vs. A
plot(obs_B ~ obs_A, df_variables, pch = 1,
xlab = "A", ylab = "B")
points(est_B ~ est_A, df_variables, pch = 16)
arrows(x0 = df_variables$obs_A, x1 = df_variables$est_A,
y0 = df_variables$obs_B, y1 = df_variables$est_B,
col = "grey80", length = 0)
par(mfrow = c(1, 1))
```

In statistics at least, shrinkage is generally a good thing: The shrunken values (i.e., the model’s guesses) lie, on average, closer to the true but unobserved values than the observed values do. This is clearly the case for variable A:

`mean(abs(A - obs_A))`

`[1] 0.724`

`mean(abs(A - df_variables$est_A))`

`[1] 0.518`

For variable B, the difference is negligible seeing as this variable was measured with exceptional reliability:

`mean(abs(B - obs_B))`

`[1] 0.149`

`mean(abs(B - df_variables$est_B))`

`[1] 0.146`

For the simulated data, the model seemed to work okay, so let’s turn to a real-life example. I’ll skip the theoretical background, but several studies in applied linguistics have tried to find out if knowledge in a ‘heritage language’ contributes to the development of the societal language (For more information about such research, see Berthele & Lambelet (2017), Vanhove & Berthele (2017) and Berthele & Vanhove (2017)). In a typical research design, researchers collect data on a group of pupils’ language skills in their heritage language as well as in their societal languages at the beginning of the school year. Then, at the end of the school year, they collect similar data. Unsurprisingly, pupils with relatively good societal language skills at the beginning of the year are still relatively good at the end. But what is sometimes also observed is that heritage language proficiency at the first data collection is a predictor of societal language proficiency at the second data collection, even after taking into account societal language proficiency at the first data collection.

It’s tempting but premature to interpret such findings as evidence for a beneficial effect of heritage language skills on the development of societal language proficiency. The reason is that (a) societal and heritage language proficiency are bound to be correlated at the first data collection due to factors such as intelligence, testwiseness, form on the day, etc., and (b) language proficiency is invariably measured with error. This is true of heritage language proficiency, but most importantly, it’s true of the variable that is “statistically controlled for”, i.e., societal language proficiency. Consequently, it’s likely that an off-the-shelf statistical model undercorrects for the role of societal language proficiency and overestimates the role of heritage language proficiency.

So let’s fit a model that takes measurement error into account.

The data we’re going to analyse are a subset of those analysed by Vanhove & Berthele (2017) and Berthele & Vanhove (2017). We have data on 91 pupils with French as their societal language and Portuguese as their heritage language. The study consisted of three data collections (and many more pupils), but we’re just going to analyse the reading proficiency data collected during waves 2 and 3 here.

The full datasets are available as an R package from https://github.com/janhove/helascot, but you can copy-paste the command below into R to obtain the reduced dataset we’ll work with here.

```
skills <- structure(list(
Subject = c("A_PLF_1","A_PLF_10","A_PLF_12","A_PLF_13","A_PLF_14","A_PLF_15","A_PLF_16","A_PLF_17","A_PLF_19","A_PLF_2","A_PLF_3","A_PLF_4","A_PLF_5","A_PLF_7","A_PLF_8","A_PLF_9","AA_PLF_11","AA_PLF_12","AA_PLF_13","AA_PLF_6","AA_PLF_7","AA_PLF_8","AD_PLF_10","AD_PLF_11","AD_PLF_13","AD_PLF_14","AD_PLF_15","AD_PLF_16","AD_PLF_17","AD_PLF_18","AD_PLF_19","AD_PLF_2","AD_PLF_20","AD_PLF_21","AD_PLF_22","AD_PLF_24","AD_PLF_25","AD_PLF_26","AD_PLF_4","AD_PLF_6","AD_PLF_8","AD_PLF_9","AE_PLF_1","AE_PLF_2","AE_PLF_4","AE_PLF_5","AE_PLF_6","C_PLF_1","C_PLF_16","C_PLF_19","C_PLF_30","D_PLF_1","D_PLF_2","D_PLF_3","D_PLF_4","D_PLF_5","D_PLF_6","D_PLF_7","D_PLF_8","Y_PNF_12","Y_PNF_15","Y_PNF_16","Y_PNF_17","Y_PNF_18","Y_PNF_2","Y_PNF_20","Y_PNF_24","Y_PNF_25","Y_PNF_26","Y_PNF_27","Y_PNF_28","Y_PNF_29","Y_PNF_3","Y_PNF_31","Y_PNF_32","Y_PNF_33","Y_PNF_34","Y_PNF_36","Y_PNF_4","Y_PNF_5","Y_PNF_6","Y_PNF_7","Y_PNF_8","Y_PNF_9","Z_PLF_2","Z_PLF_3","Z_PLF_4","Z_PLF_5","Z_PLF_6","Z_PLF_7","Z_PLF_8")
, FR_T2 = c(0.6842105263,0.4736842105,1,0.4210526316,0.6842105263,0.6842105263,0.8947368421,0.5789473684,0.7368421053,0.7894736842,0.4210526316,0.5263157895,0.3157894737,0.5263157895,0.6842105263,0.8421052632,0.3684210526,0.8421052632,0.7894736842,0.7894736842,0.6842105263,0.6315789474,0.6315789474,0.3684210526,0.4736842105,0.2631578947,0.4736842105,0.9473684211,0.3157894737,0.5789473684,0.2631578947,0.5263157895,0.5263157895,0.7368421053,0.6315789474,0.8947368421,0.6315789474,0.9473684211,0.7368421053,0.6315789474,0.7894736842,0.7894736842,0.4736842105,0.4736842105,0.9473684211,0.7894736842,0.3157894737,0.9473684211,1,0.7368421053,0.5789473684,0.8421052632,0.8421052632,0.7368421053,0.5789473684,0.6842105263,0.4736842105,0.4210526316,0.6842105263,0.8947368421,0.6842105263,0.7368421053,0.5263157895,0.5789473684,0.8947368421,0.7894736842,0.5263157895,0.6315789474,0.3157894737,0.7368421053,0.5789473684,0.6842105263,0.7368421053,0.5789473684,0.7894736842,0.6842105263,0.6315789474,0.6842105263,0.5789473684,0.7894736842,0.5789473684,0.7368421053,0.4736842105,0.8947368421,0.8421052632,0.7894736842,0.6315789474,0.6842105263,0.8947368421,0.6842105263,0.9473684211)
, PT_T2 = c(0.7368421053,0.5789473684,0.9473684211,0.5263157895,0.6315789474,0.5789473684,0.9473684211,0.4736842105,0.8421052632,0.5263157895,0.2631578947,0.6842105263,0.3684210526,0.3684210526,0.4736842105,0.8947368421,0.4210526316,0.5263157895,0.8947368421,0.8421052632,0.8947368421,0.8947368421,0.6315789474,0.3684210526,0.0526315789,0.3684210526,0.4210526316,0.9473684211,0.3157894737,0.4736842105,0.3157894737,0.5789473684,0.4736842105,0.7894736842,0.5263157895,0.8947368421,0.6315789474,0.7894736842,0.7368421053,0.5789473684,0.6842105263,0.7368421053,0.3684210526,0.7894736842,0.7368421053,0.4736842105,0.5263157895,1,0.8947368421,0.8947368421,0.4736842105,0.8421052632,1,0.6315789474,0.5263157895,0.5789473684,0.5789473684,0.5789473684,0.5263157895,0.9473684211,0.5263157895,0.6315789474,0.5789473684,0.6315789474,0.9473684211,0.7894736842,0.8421052632,0.5263157895,0.7894736842,0.4736842105,0.6842105263,0.3684210526,0.7894736842,0.7368421053,0.6315789474,0.9473684211,0.4210526316,0.5789473684,0.3684210526,0.8947368421,0.6315789474,0.8421052632,0.5789473684,0.5263157895,0.9473684211,0.8947368421,0.7368421053,0.4736842105,0.8421052632,0.7894736842,0.9473684211)
, FR_T3 = c(0.9473684211,0.3157894737,0.9473684211,0.5789473684,0.5789473684,0.6842105263,0.8421052632,0.6842105263,0.7368421053,0.8421052632,0.4210526316,0.5789473684,0.4736842105,0.6842105263,0.5789473684,0.7894736842,0.7368421053,0.7894736842,1,0.8421052632,0.8947368421,0.4210526316,0.8947368421,0.4736842105,0.5263157895,0.4736842105,0.5789473684,1,0.7368421053,0.8421052632,0.2631578947,0.7894736842,0.6842105263,0.8947368421,0.5263157895,0.8947368421,0.6842105263,0.9473684211,0.9473684211,0.5263157895,0.9473684211,0.8421052632,0.4736842105,0.8947368421,0.9473684211,0.7368421053,0.5263157895,0.8421052632,0.9473684211,0.7894736842,0.8947368421,0.8421052632,0.8421052632,0.8947368421,0.5789473684,0.7368421053,0.6842105263,0.4736842105,0.6842105263,0.8947368421,0.4736842105,0.8421052632,0.7894736842,0.5789473684,0.7368421053,0.7894736842,0.8947368421,0.6842105263,0.6842105263,0.9473684211,0.7894736842,0.5263157895,0.7368421053,0.6842105263,0.8421052632,0.7368421053,0.7368421053,0.5789473684,0.4736842105,0.8947368421,0.4210526316,0.8947368421,0.6842105263,1,0.8421052632,0.8421052632,0.6315789474,0.6315789474,0.8947368421,0.6315789474,0.9473684211)
, PT_T3 = c(0.8421052632,0.3684210526,0.9473684211,0.3157894737,0.5789473684,0.7894736842,1,0.5263157895,0.8421052632,0.7894736842,0.3157894737,0.6315789474,0.4210526316,0.5263157895,0.6842105263,0.8421052632,0.8947368421,0.6842105263,0.9473684211,0.8947368421,0.9473684211,0.8421052632,0.8421052632,0.5263157895,0.6842105263,0.5263157895,0.8421052632,0.9473684211,0.4210526316,0.7894736842,0.7894736842,0.8421052632,0.7368421053,1,0.6842105263,1,0.7894736842,0.8421052632,0.9473684211,0.6842105263,0.7894736842,0.7894736842,0.3157894737,0.7894736842,NA,0.6315789474,0.6842105263,0.9473684211,1,0.9473684211,0.7368421053,0.8947368421,0.8421052632,0.8421052632,0.5789473684,0.6315789474,0.6315789474,0.8421052632,0.7894736842,0.8421052632,0.5789473684,0.8421052632,0.7368421053,0.6842105263,0.8421052632,0.8421052632,0.9473684211,0.4736842105,0.8421052632,0.7894736842,0.7368421053,0.2105263158,0.7894736842,0.7894736842,0.7368421053,0.6315789474,0.6315789474,0.4210526316,0.6315789474,0.8421052632,0.6842105263,0.9473684211,0.5789473684,0.5263157895,0.7894736842,0.7894736842,0.7894736842,0.6842105263,0.8421052632,0.8421052632,0.8947368421)
)
, row.names = c(NA, -91L)
, class = c("tbl_df","tbl","data.frame")
)
```

We’re going to model the French reading scores at the third data collection (`FR_T3`

) in terms of the French and Portuguese reading scores at the second data collection (`FR_T2`

and `PT_T2`

). **Figure 3** shows the observed variables. Note that all values are bounded between 0 and 1, where 1 was the highest possible result.

`scatterplot_matrix(skills %>% select(FR_T3, FR_T2, PT_T2))`

Fitting an off-the-shelf regression model, we find that `PT_T2`

is significantly related to `FR_T3`

, even when accounting for `FR_T2`

.

`summary(lm(FR_T3 ~ FR_T2 + PT_T2, skills))$coefficients`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.266 0.0520 5.11 1.84e-06
FR_T2 0.506 0.0990 5.11 1.84e-06
PT_T2 0.196 0.0868 2.26 2.60e-02
```

Lastly, as reported by Pestana et al. (2017), the reliability of the French reading test at T2 was estimated to be 0.73, with a 95% confidence interval of [0.65, 0.78]. For Portuguese at T2, the reliability was estimated to be 0.79, with a 95% confidence interval of [0.72, 0.84]. This is information we can feed to the model. (For French at T3, the estimated reliability coefficient was 0.73, 95% CI: [0.65, 0.79], but for now, we’re not going to model the measurement error on the outcome variable.)

The model specified below is essentially the same as the model for the simulated example, but with more informed priors.

The reliability estimates for the French T2 and Portuguese T2 variables were incorporated by means of prior distributions.

- For French T2, I put a `beta(73, 27)` prior on the reliability coefficient, which assigns a 95% probability to the reliability coefficient lying between 0.64 and 0.81. This doesn’t exactly correspond to the estimated reliability coefficient’s confidence interval, but I think it’s close enough.
- For Portuguese T2, I put a `beta(79, 21)` prior on the reliability coefficient, which assigns a 95% probability to the reliability coefficient lying between 0.71 and 0.86.
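A convenient way to arrive at such parameters (my own sketch; the `a + b = 100` constraint is just a simple device, not something from the original analysis) is to set the prior mean to the estimated reliability with 100 pseudo-observations and then check the implied 95% interval:

```r
# Hypothetical helper: beta prior with mean `reliability` and
# a + b = `total` pseudo-observations; returns the parameters
# together with the implied 95% interval.
beta_prior_for <- function(reliability, total = 100) {
  a <- reliability * total
  b <- total - a
  c(a = a, b = b, round(qbeta(c(0.025, 0.975), a, b), 2))
}
beta_prior_for(0.73)  # French T2: beta(73, 27)
beta_prior_for(0.79)  # Portuguese T2: beta(79, 21)
```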

Other prior distributions reflect the fact that the predictor and the outcome data were restricted to the [0, 1] range, as well as some common knowledge. The rationale for them is explained in the comments sprinkled throughout the code.

```
interdependence_code <- '
data {
// Number of observations
int<lower = 1> N;
// Observed outcome
vector[N] FR_T3;
// Observed predictors
vector[N] FR_T2;
vector[N] P
```