Baby steps in Bayes: Piecewise regression with two breakpoints
In this follow-up to the blog post Baby steps in Bayes: Piecewise regression, I’m going to try to model the relationship between two continuous variables using a piecewise regression with not one but two breakpoints. (The rights to the movie about the first installment are still up for grabs, incidentally.)
The kind of relationship I want to model is plotted in Figure 1. According to some applied linguists, relationships similar to these underpin some aspects of language learning, but we don’t need to go into those discussions here.
Figure 1. The underlying form of the x-y relationship is characterised by two breakpoints. The function’s three pieces are connected to each other.
The code below shows how you can generate such an x-y relationship and adds a modicum of noise to it. Figure 2 shows the simulated data we’ll work with in this blog post. I’m not going to analyse actual data at this juncture because it’s Friday afternoon and I want to ride my bike.
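Something along these lines generates such data (the specific breakpoints, slopes, and noise level below are illustrative choices of mine, not necessarily the values behind Figure 2):

```r
set.seed(2018)            # arbitrary seed
n <- 56
x <- runif(n, 0, 20)

# 'True' parameter values (illustrative)
bp1 <- 3;  bp2 <- 14      # the two breakpoints
a1 <- 4.5; b1 <- -0.5     # intercept/slope of the first piece
a3 <- 15;  b3 <- 0        # intercept/slope of the third piece
# The middle piece follows from connectedness at the breakpoints:
b2 <- ((a3 + b3 * bp2) - (a1 + b1 * bp1)) / (bp2 - bp1)
a2 <- a1 + b1 * bp1 - b2 * bp1

mu <- ifelse(x < bp1, a1 + b1 * x,
             ifelse(x < bp2, a2 + b2 * x, a3 + b3 * x))
y <- mu + rnorm(n, sd = 1)  # a modicum of noise
```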
Figure 2. The simulated data we’ll work with.
Specifying the model
If none of this makes sense to you, please refer to the previous Baby steps in Bayes blog post.
There’s an x and a y variable, both of length N (here: N = 56).
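In the Stan program, this corresponds to a data block along these lines (a sketch; the exact variable names may differ from the original):

```stan
data {
  int<lower=1> N;   // number of observations (here: 56)
  vector[N] x;
  vector[N] y;
}
```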
This model has 7 parameters:
- Two breakpoints. Instead of specifying both breakpoints in the parameters block, I specify the first one (bp1) as well as the distance between the first and the second one (bp_distance). This allows me to tell the model that the second breakpoint needs to occur after the first one by constraining the distance between them to be positive.
- the intercept and slope of the regression before the first breakpoint (a_1 and b_1)
- the intercept and slope of the regression after the second breakpoint (a_3 and b_3). Note that I don’t specify the intercept and the slope of the model’s second piece, i.e., between the two breakpoints. The reason is that I want the three pieces to be connected at the breakpoints. With this constraint, if you know the form of the first and third pieces as well as the position of the two breakpoints, you get the intercept and the slope of the second piece for free.
- the standard deviation of the normal error (error). Standard deviations are always positive; this constraint is set by including <lower = 0> in the declaration.
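Taken together, the parameters block might look like this (a sketch; the parameter names follow the model summary further down, but details may differ from the original):

```stan
parameters {
  real bp1;                    // first breakpoint
  real<lower=0> bp_distance;   // second breakpoint lies bp_distance after the first
  real a_1;                    // intercept of the first piece
  real b_1;                    // slope of the first piece
  real a_3;                    // intercept of the third piece
  real b_3;                    // slope of the third piece
  real<lower=0> error;         // standard deviation of the normal error
}
```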
As in the previous blog post, I specify a transformed parameter that contains the mean of y conditional on the value of x. Additionally, for convenience, I derive three parameters from those specified in the previous block: the position of the second breakpoint (bp2, computed from the position of the first and the distance between them), as well as the intercept and slope of the second piece (a_2 and b_2).
(Sidebar: Since the second piece needs to be connected to the other two pieces at the respective breakpoints, the expected y value at the first breakpoint equals both a_1 + b_1·bp1 and a_2 + b_2·bp1. Similarly, the expected y value at the second breakpoint equals both a_2 + b_2·bp2 and a_3 + b_3·bp2. From these equalities, a_2 and b_2 can be expressed in terms of the parameters specified above.)
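In Stan, the transformed parameters block could then look roughly like this (a sketch; the name mu for the conditional mean is my assumption):

```stan
transformed parameters {
  // position of the second breakpoint
  real bp2 = bp1 + bp_distance;
  // connectedness at both breakpoints pins down the middle piece:
  // solve a_1 + b_1*bp1 = a_2 + b_2*bp1 and a_2 + b_2*bp2 = a_3 + b_3*bp2
  real b_2 = ((a_3 + b_3 * bp2) - (a_1 + b_1 * bp1)) / (bp2 - bp1);
  real a_2 = a_1 + b_1 * bp1 - b_2 * bp1;
  // conditional mean of y
  vector[N] mu;
  for (i in 1:N) {
    if (x[i] < bp1)
      mu[i] = a_1 + b_1 * x[i];
    else if (x[i] < bp2)
      mu[i] = a_2 + b_2 * x[i];
    else
      mu[i] = a_3 + b_3 * x[i];
  }
}
```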
This works similarly to the model specification in the previous blog post. Since I only work with simulated data here, the priors aren’t informed by any subject-matter knowledge.
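For instance, with generic weakly informative priors (the specific priors below are illustrative, not necessarily those of the original post):

```stan
model {
  // weakly informative priors, not informed by subject matter
  bp1 ~ normal(5, 5);
  bp_distance ~ normal(10, 5);
  a_1 ~ normal(0, 10);
  b_1 ~ normal(0, 5);
  a_3 ~ normal(0, 10);
  b_3 ~ normal(0, 5);
  error ~ normal(0, 5);   // half-normal, given the <lower = 0> bound

  // likelihood
  y ~ normal(mu, error);
}
```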
To check if the model picks up on the relevant aspects of the data, I let it generate alternative datasets. Ideally, these look similar to the actual data analysed.
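In Stan this amounts to drawing one replicated dataset per posterior draw in the generated quantities block (a sketch, using the mu and error defined above):

```stan
generated quantities {
  vector[N] sim_y;
  for (i in 1:N)
    sim_y[i] = normal_rng(mu[i], error);
}
```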
Running the model
To fit the model, first put the input data in a list.
Then supply this list and the model code to the stan() function. This function prints a lot of output to the console, which I didn’t reproduce here. Unless you receive genuine warnings or errors (i.e., red text), everything’s fine. (Apologies for the blatant self-plagiarism throughout, by the way.)
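In code, under the assumption that the Stan program is saved as piecewise_two_bps.stan (the file name is mine), this looks roughly like:

```r
library(rstan)

data_list <- list(N = length(x), x = x, y = y)
fit <- stan(file = "piecewise_two_bps.stan",
            data = data_list,
            iter = 2000, warmup = 1000, chains = 4)
```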
Inspecting the model
A summary with the parameter estimates and their uncertainties can be generated using the print() method for the fitted model object:
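Something like the following (the pars selection is my guess, based on the output shown):

```r
print(fit, pars = c("bp1", "bp2", "a_1", "b_1", "a_2", "b_2",
                    "a_3", "b_3", "error"))
```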
Inference for Stan model: ade2f0f31038d9a48e5f69f1db76bc89.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
bp1     2.58    0.05 1.30  1.11  1.63  2.15  3.28  6.02   804    1
bp2    14.53    0.02 0.68 13.04 14.14 14.58 14.97 15.73  1922    1
a_1     4.57    0.02 0.64  3.40  4.12  4.52  5.00  5.88  1113    1
b_1    -0.77    0.03 0.89 -2.76 -1.31 -0.61 -0.06  0.40   808    1
a_2     1.20    0.01 0.56 -0.10  0.92  1.25  1.55  2.11  1433    1
b_2     0.87    0.00 0.06  0.78  0.84  0.87  0.90  1.00  1552    1
a_3    15.08    0.09 3.50  8.88 12.63 14.90 17.33 22.32  1552    1
b_3    -0.08    0.01 0.21 -0.50 -0.21 -0.07  0.06  0.31  1557    1
error   0.93    0.00 0.10  0.76  0.86  0.92  0.99  1.14  1991    1

Samples were drawn using NUTS(diag_e) at Fri Jul 27 15:59:51 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Posterior predictive checks
If the model is any good, data simulated from it should be pretty similar
to the data actually observed. In the
generated quantities block, I let the
model output such simulated data (
sim_y). Using the shinystan app, we can perform some ‘posterior predictive checks’:
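The interface referred to here is shinystan’s; launching it is a one-liner (assuming the stanfit object is called fit):

```r
library(shinystan)
launch_shinystan(fit)
```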
Click ‘Diagnose’ > ‘PPcheck’. Under ‘Select y (vector of observations)’, choose y_obs (the simulated data analysed above). Under ‘Parameter/generated quantity from model’, choose sim_y (the additional simulated data generated in the model code).
Then click on ‘Distributions of observed data vs replications’ and
‘Distributions of test statistics’ to check if the properties of the simulated data correspond
to those of the real data.
You can also take this a step further and check whether the model is able to generate scatterplots similar to the one in Figure 2. If the following doesn’t make any immediate sense, please refer to the blog post Checking model assumptions without getting paranoid, because the logic is pretty similar.
First extract some vectors of simulated data from the model output:
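For instance (assuming the stanfit object is called fit):

```r
# sim_y is an iterations-by-N matrix: one simulated dataset per row
sim_y_draws <- rstan::extract(fit, pars = "sim_y")$sim_y
# keep eight randomly chosen simulated datasets
sim_sets <- sim_y_draws[sample(nrow(sim_y_draws), 8), ]
```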
Then plot both the observed vectors and the simulated vectors:
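A base-R sketch (the original post may have used different plotting code); it re-extracts the draws, so it only assumes the stanfit object fit and the observed vectors x and y:

```r
sim_sets <- rstan::extract(fit, pars = "sim_y")$sim_y
sim_sets <- sim_sets[sample(nrow(sim_sets), 8), ]   # eight random draws

op <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
plot(x, y, main = "observed")                       # top left: actual data
for (i in 1:8) {
  plot(x, sim_sets[i, ], ylab = "y", main = "simulated")
}
par(op)
```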
Figure 3. The actual input data (top left) and eight simulated datasets. If the simulated datasets are highly similar to the actual data, the model was able to learn the relevant patterns in the data.
Seems fine by me!