Setup and Example Models and Data
library(macpan2)
library(ggplot2)
library(dplyr)
library(broom.mixed)
options(macpan2_verbose = FALSE)
library(macpan2)
si = mp_tmb_model_spec(
before = S ~ 1 - I
, during = mp_per_capita_flow(
from = "S" ## compartment from which individuals flow
, to = "I" ## compartment to which individuals flow
, rate = "beta * I" ## expression giving _per-capita_ flow rate
, abs_rate = "infection" ## name for _absolute_ flow rate = beta * I * S
)
, default = list(I = 0.01, beta = 0.2)
)
print(si)
#> ---------------------
#> Default values:
#> ---------------------
#> matrix row col value
#> I 0.01
#> beta 0.20
#>
#> ---------------------
#> Before the simulation loop (t = 0):
#> ---------------------
#> 1: S ~ 1 - I
#>
#> ---------------------
#> At every iteration of the simulation loop (t = 1 to T):
#> ---------------------
#> 1: mp_per_capita_flow(from = "S", to = "I", rate = "beta * I", abs_rate = "infection")
library(ggplot2)
library(dplyr)
example_data = (si
|> mp_simulator(time_steps = 50, outputs = "infection")
|> mp_trajectory()
)
incidence_example = (example_data
|> mutate(value = 100 * value)
|> pull(value)
)
Naming Convensions
Why is the package called macpan2
?
This macpan2
package grew out of the McMasterPandemic
project. Over the years the people involved in that project started
calling it macpan, and macpan2
is the successor. There is
no package called macpan
.
Why do most of the functions start with mp_
?
The mp
stands for mac-pan. The idea is to make it easy
to see in scripts what functions come from macpan2
and what
functions come from other packages.
Why do some of the functions start with mp_tmb_
?
The tmb
part refers to the TMB package, on which
macpan2
is built. We hope to one day use other
computational engines (e.g., stan),
at which point users will be able to see what computational engine is
being used in their scripts. But for now tmb
is the only
computational engine and so the only thing indicated by
mp_tmb_
is that you are using a function that is intended
to be specific to computations with TMB. Functions that do not start
with mp_tmb_
should be engine-agnostic and able to be used
with any engine when and if such engines are supported.
Coding Style
What does |>
mean?
We use the base R pipe operator throughout our examples. Please read this link for details.
Why do I see weird symbols at the start of lines?
In many of our examples we start lines with characters that many people put at the end of lines, such as the commas, plus signs and the base R pipe. These characters tell R how the code on the rest of the line is to combined with previous lines. We prefer to put these characters at the start of lines because it makes it slightly easier to comment lines out. Note that for this approach to work you need to wrap these continued expressions in parentheses. See the quickstart guide for examples of this style.
Calibration Details and Outputs
What model fitting approach does macpan2
use?
We fit models by finding the values of parameters at the peak of a posterior distribution. This approach is sometimes called maximum a posteriori (MAP) estimation. This approach allows us to get approximate Bayesian uncertainty estimates by approximating the posterior distribution as a multivariate Gaussian centered at the peak obtained by MAP. One is able to declare that parameters be considered random effects, which means that they are integrated out of the posterior distribution using the Laplace approximation. Using this approach allows us to reduce the dimensionality of the posterior distribution used in MAP by only optimizing over the fixed effect parameters.
This approximate posterior distribution can be used to obtain confidence / credible intervals for any continuous function of model variables. The two main examples of this are as follows.
- Confidence intervals on fitted parameters using
mp_tmb_coef(., conf.int = TRUE)
. - Confidence intervals on fitted trajectories using
mp_trajectory_sd()
This approach to computing confidence intervals using Gaussian approximations to the posterior is a special case of the delta method that is implemented in TMB, which is described here.
Example calibrations of varying levels of complexity.
What likelihood and prior distribution assumptions does
macpan2
make?
You are free to declare any value in the model to be a trajectory or
parameter. Trajectories are compared with observed data to produce a
likelihood component (e.g., case reports are distributed with a negative
binomial distribution with dispersion parameter 1
) and
parameters are used to produce a component of the prior distribution
(e.g., the transmission rate is log normally distributed with mean
0.2
and standard deviation 1
). Available
distributional assumptions and settings are described here,
but inspecting the examples linked to in the previous FAQ will be more
useful for figuring out how to use this functionality.
Can I hack macpan2
to use MCMC esimation?
Yes.
Plotting Simulated Trajectories
How can I plot simulated trajectories?
This question is answered in the quickstart guide.
How can I plot a calibrated model with the observed data used in the calibration?
This is my favourite way to do it.
fitted_data = mp_trajectory(model_calibrator)
(observed_data
|> ggplot()
+ geom_point(aes(time, value))
+ geom_line(aes(time, value), data = fitted_data)
+ theme_bw()
)
Calibration Troubleshooting
Should I worry about NA/NaN function evaluation
warnings?
The optimizer will sometimes give warnings like this.
Warning messages:
1: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
This is usually nothing to worry about, though you should take note
of it. Typically this warning indicates that the optimizer tried
parameter values that led the objective function (i.e., the negative log
posterior density) to be non-numeric. Usually, the opimizer is smart
enough (if it is possible) to get back on track. The key thing to watch
out for is whether the optimizer converged, which can be explored using
the mp_optimizer_output()
function that is illustrated in
this
article.
What should I do if I do not converge?
It depends on the reason for non-convergence. The optimizer tries to
give you information about this, and you can get this information using
the mp_optimizer_output()
.
How do I increase the number of iterations that the optimizer takes?
When non-convergence results because the optimizer reaches the maximum number of iterations that it was allowed to take, you can increase this number (TODO: show example of how). Sometimes however, running out of iterations can (but not always) be a sign that there is not very much information in the data about the model parameters being estimated.
Parameter Estimation and Uncertainty
What can I do when my confidence intervals overlap zero for a parameter that should be positive?
You should use mp_tmb_insert()
to transform your
parameters so that they cannot go negative. TODO: link to an
example.
Dynamical Instabilities
What is the best way for me to keep my state variables from going negative?
The hazard correction (mp_hazard()
) works best in my
experience. TODO: link to example.
Initial Conditions
My data start mid-epidemic so how do I know what the initial number of infectious individuals is?
This is a common situation as data collection often gets better after it is clear that there is a problem. But simulating epidemic models requires specifying the initial conditions of all of the state variables, and we do not have data on that. The basic idea for solving this problem is to fit the initial numbers of some of the state variables. But there are issues.
Let’s get an example. We manipulate the example infection data above
by removing the first 25 time steps. The first issue is what to set the
time steps as. We could just keep them as they were (i.e.,
26:50
) but this would be cheating because it would be
preserving the time of the start of the epidemic as 25 time steps again,
and in real life we do not know this information. So we pretend that the
epidemic started 10 time steps before the data start.
start_date_offset = 10
in_progress_epidemic = (example_data
|> filter(time > 25)
|> mutate(
value = rpois(length(value), 100 * value)
, time = start_date_offset + row_number()
)
)
max_time = max(in_progress_epidemic$time)
(ggplot()
+ geom_point(aes(time, value), data = in_progress_epidemic)
+ scale_x_continuous(limits = c(1, max_time))
+ theme_bw()
)
The blank space on the graph from time 1
to
10
represents the period of the outbreak without data. The
exact size of this period is not crucial, only that it is long enough to
let the dynamics of model settle down and become independent of the
initial values of the state variables. Actually, for this simple SI
model example we could even ignore this start-time issue, and just fit
the initial values of the number of infectious individuals. But in
models with many state variables it becomes very easy to specify initial
conditions that the model wouldn’t go to naturally, resulting in
dynamical instability. This instability can cause optimization problems
because the optimizer might choose starting conditions that lead to
instabilities and therefore bad fits.
But we will continue with the SI model to illustrate the general idea with relative simplicity. There are a few key ideas in the code below.
- We express the initial number of infectious individuals as
N
times a logistic function of a new parameter,logit_prop_I
, giving the proportion of the population that is infectious on the logit scale. This transformation allows us to fitlogit_prop_I
while ensuring that the initial value ofI
stays between0
andN
. - We use the
mp_sim_bounds
function to assert that the first time step is1
and the last time step ismax_time
, which in this case is 35. - We give a prior distribution for
logit_prop_I
calibrator = ("starter_models"
|> mp_tmb_library("si", package = "macpan2")
|> mp_tmb_insert(phase = "before"
, expressions = list(I ~ N / (1 + exp(-logit_prop_I)))
, default = list(logit_prop_I = 0)
)
|> mp_tmb_calibrator(
data = in_progress_epidemic
, traj = list(infection = mp_neg_bin(disp = 1))
, par = list(
beta = mp_normal(location = 0.4, sd = 1)
, logit_prop_I = mp_normal(location = 0, sd = 1)
)
, time = mp_sim_bounds(1, max_time, "steps")
)
)
mp_optimize(calibrator)
#> Warning in (function (start, objective, gradient = NULL, hessian = NULL, :
#> NA/NaN function evaluation
#> $par
#> params params
#> 0.09910844 0.55847313
#>
#> $objective
#> [1] 38.00748
#>
#> $convergence
#> [1] 0
#>
#> $iterations
#> [1] 9
#>
#> $evaluations
#> function gradient
#> 14 10
#>
#> $message
#> [1] "relative convergence (4)"
cc = mp_tmb_coef(calibrator, conf.int = TRUE)
print(cc)
#> term mat row col default type estimate std.error conf.low
#> 1 params beta 0 0 0.2 fixed 0.09910844 0.04234644 0.01611094
#> 2 params.1 prop_I 0 0 0.5 fixed 0.63609918 0.42942417 0.04404283
#> conf.high
#> 1 0.1821059
#> 2 0.9851457
We see that we are able to fit both the transmission rate,
beta
, and the initial proportion of infectious individuals,
and get reasonable confidence intervals. Note that
logit_prop_I
has been automatically back-transformed to
prop_I
, but this functionality can be turned off by setting
back_transform = FALSE
in mp_tmb_coef
. The
plot below shows how the model is able to fit to the data without
knowing the initial number of infectious individuals 10
fitted_data = mp_trajectory(calibrator)
(ggplot()
+ geom_point(aes(time, value), data = in_progress_epidemic)
+ geom_line(aes(time, value), data = fitted_data)
+ theme_bw()
)
Real data will not be as simple, but these principles will help.
Under Reporting, Delayed Reporting, and Total Reporting
How should I account for delayed reporting as well as
under-reporting in macpan2
?
The challenge introduced by reporting delays is that we cannot compare the simulated incidence values from our model with observed case reports, because there is a reporting delay. The following toy example shows how the apparent peak in the data follows the simulations.
So before comparing the simulations with observed data, we use convolution to transform the simulated incidence values into a time series of reported cases that is expected by the model. Keep reading to find out specifically how this is done.
Now we can now compare the observed and expected reported cases in a manner that accounts for reporting delays.
Now to explain how the expected reported case curve is computed. In this method we assume that each new case waits a Gamma-distributed random amount of time before it is observed. The mean and coefficient of variation of this distribution can be fitted parameters. During the COVID-19 pandemic the McMaster group found that a Gamma distribution with mean delay of 11 days and coefficient of variation of 0.25 was a reasonable model.
The following figure illustrates how a single expected case report is computed, in this case at time 26. The Gamma distribution of delay times is represented as bars giving the probability that a new case at a particular time is reported at time 26, which is given as the dashed horizontal line. Note how the reported number of cases is roughly near (but not identical) to the simulated number of cases at the peak of the delay distribution. This makes sense because we would expect most of our reports now to come from the highest probability time in the past.
In macpan2
we typically have a model that doesn’t
account for delayed reporting and we want to update it to do so before
fitting to data. One can use the mp_tmb_insert_reports()
function for this purpose.
si_with_delays = (si
|> mp_tmb_insert_reports(
incidence_name = "infection"
, mean_delay = 11
, cv_delay = 0.25
, report_prob = 0.1
, reports_name = "reports"
)
)
(si_with_delays
|> mp_simulator(
time_steps = 50L
, outputs = c("infection", "reports")
)
|> mp_trajectory()
|> ggplot()
+ geom_line(aes(time, value, colour = matrix))
+ theme_bw()
)
Note how this function allows accounting for under-reporting as well,
by specifying a report_prob
that is less than
1
.
16: dist ~ pgamma(1:17, 1/(incidence_cv_delay), incidence_mean_delay * (incidence_cv_delay^2))
17: delta ~ dist[1:16] - dist[0:15]
18: kernel ~ incidence_report_prob * delta/sum(delta)
19: reported_incidence ~ convolution(incidence, kernel)
Stochastic Simulation
What are the mathematical details of the Euler multinomial process error model?
The Euler-multinomial state update method assumes that the number of individuals that move from one box to another in a single time-step is a random non-negative integer, coming from a particular multinomial distribution that we now describe.
The probability of staying in the th box through an entire time-step is assumed to be the following (TODO: relate this to Poisson processes).
This probability assumes that the are held constant throughout the entire time-step, although they can change when each time-step begins.
The probability of moving from box to box in one time-step is given by the following.
This probability is just the probability of not staying in box , which is , and specifically going to box , which is assumed to be given by this expression .
Let be the random number of individuals that move from box to box in one time-step. The expected value of is . However, these random variables are not independent events, because the total number of individuals, , has to remain constant through a single time-step (at least in the models that we are currently considering).
To account for this non-independence, we collect the associated with a from compartment into a vector . We collect similar vector of probabilities, . Each is a random draw from a multinomial distribution with trials and probability vector, .
Once these random draws have been made, the state variables can be updated at each time-step as follows.
Note that we do not actually need to generate values for the diagonal elements, , because they cancel out in this update equation. We also can ignore any such that .
Under the SIR example we have a particularly simple Euler-binomial distribution because there are no branching flows – when individuals leave S they can only go to I and when they leave I they can only go to R. These two flows are given by the following distributions.
In models with branching flows we would have multinomial distributions. But in this model the state update is given by the following equations.
Can macpan2
do the Gillespie algorithm
No. At least not without substantial hacking. The Euler Multinomial is used instead
More Process Error (incomplete)
set.seed(1L)
spec = mp_tmb_model_spec(
before = list(
R ~ vaxprop * N
, S ~ N - I - R
)
, during = list(
infection ~ reulermultinom(S, beta * I / N)
, recovery ~ reulermultinom(I, gamma)
, S ~ S - infection
, I ~ I + infection - recovery
, R ~ R + recovery
)
, default = list(
vaxprop = 0.71111111
, N = 1000
, I = 10
, beta = 1.25
, gamma = 1.1
)
)
(spec
|> mp_simulator(
time_steps = 10
, outputs = "infection"
)
|> mp_trajectory_replicate(1L)
)
#> [[1]]
#> matrix time row col value
#> 1 infection 1 0 0 2
#> 2 infection 2 0 0 2
#> 3 infection 3 0 0 1
#> 4 infection 4 0 0 3
#> 5 infection 5 0 0 2
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
vaxprop = 0.7
N = 1000
set.seed(10)
spec = mp_tmb_model_spec(
before = list(S ~ N - I - R)
, during = list(
infection ~ reulermultinom(S, beta * I / N)
, recovery ~ reulermultinom(I, gamma)
, S ~ S - infection
, I ~ I + infection - recovery
, R ~ R + recovery
)
, default = list(
N = N
, R = round(vaxprop * N)
, I = 10
, beta = 1.25
, gamma = 1.1
)
)
sim = (spec
|> mp_simulator(
time_steps = 10
, outputs = "infection"
)
)
sim |> mp_trajectory_replicate(10)
#> [[1]]
#> matrix time row col value
#> 1 infection 1 0 0 3
#> 2 infection 2 0 0 2
#> 3 infection 3 0 0 0
#> 4 infection 4 0 0 0
#> 5 infection 5 0 0 0
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[2]]
#> matrix time row col value
#> 1 infection 1 0 0 4
#> 2 infection 2 0 0 3
#> 3 infection 3 0 0 0
#> 4 infection 4 0 0 0
#> 5 infection 5 0 0 0
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[3]]
#> matrix time row col value
#> 1 infection 1 0 0 1
#> 2 infection 2 0 0 1
#> 3 infection 3 0 0 2
#> 4 infection 4 0 0 2
#> 5 infection 5 0 0 1
#> 6 infection 6 0 0 1
#> 7 infection 7 0 0 1
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[4]]
#> matrix time row col value
#> 1 infection 1 0 0 2
#> 2 infection 2 0 0 2
#> 3 infection 3 0 0 3
#> 4 infection 4 0 0 3
#> 5 infection 5 0 0 1
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[5]]
#> matrix time row col value
#> 1 infection 1 0 0 0
#> 2 infection 2 0 0 0
#> 3 infection 3 0 0 0
#> 4 infection 4 0 0 0
#> 5 infection 5 0 0 0
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[6]]
#> matrix time row col value
#> 1 infection 1 0 0 3
#> 2 infection 2 0 0 2
#> 3 infection 3 0 0 0
#> 4 infection 4 0 0 0
#> 5 infection 5 0 0 0
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[7]]
#> matrix time row col value
#> 1 infection 1 0 0 3
#> 2 infection 2 0 0 4
#> 3 infection 3 0 0 3
#> 4 infection 4 0 0 0
#> 5 infection 5 0 0 0
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[8]]
#> matrix time row col value
#> 1 infection 1 0 0 4
#> 2 infection 2 0 0 0
#> 3 infection 3 0 0 0
#> 4 infection 4 0 0 0
#> 5 infection 5 0 0 0
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[9]]
#> matrix time row col value
#> 1 infection 1 0 0 6
#> 2 infection 2 0 0 2
#> 3 infection 3 0 0 1
#> 4 infection 4 0 0 1
#> 5 infection 5 0 0 0
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
#>
#> [[10]]
#> matrix time row col value
#> 1 infection 1 0 0 3
#> 2 infection 2 0 0 3
#> 3 infection 3 0 0 1
#> 4 infection 4 0 0 1
#> 5 infection 5 0 0 1
#> 6 infection 6 0 0 0
#> 7 infection 7 0 0 0
#> 8 infection 8 0 0 0
#> 9 infection 9 0 0 0
#> 10 infection 10 0 0 0
proc_err = rnorm(50)
spec = mp_tmb_model_spec(
before = list(S ~ N - I - R)
, during = list(
mean_foi ~ beta * I / N
, mean_infection ~ mean_foi * S
, dev_foi ~ exp(proc_err[time_step(1)])
, mp_per_capita_flow("S", "I", "dev_foi * mean_foi", "infection")
, mp_per_capita_flow("I", "R", "gamma", "recovery")
)
, default = list(
N = N
, R = 0
, I = 1
, beta = 0.4
, gamma = 0.2
, proc_err = proc_err
)
)
sim = (spec |> mp_hazard()
|> mp_simulator(
time_steps = 50
, outputs = "infection"
)
)
sim_infection = (sim
|> mp_trajectory()
|> mutate(value = rpois(length(value), value))
)
fixef = list(
beta = mp_log_normal(location = 0.2, sd = 1)
, gamma = mp_log_normal(location = 0.2, sd = 1)
)
ranef = list()
#proc_err = mp_normal(location = 0, sd = 0.1)
#)
cal = mp_tmb_calibrator(
spec = mp_tmb_update(spec, default = list(beta = 0.2, proc_err = rep(0, 50)))
, data = sim_infection
, traj = list(infection = mp_neg_bin(disp = 1))
, par = fixef # mp_par(param = fixef, random = ranef)
, outputs = "mean_infection"
)
mp_optimize(cal)
#> Warning in (function (start, objective, gradient = NULL, hessian = NULL, :
#> NA/NaN function evaluation
#> $par
#> params params
#> 0.5811725 0.2381576
#>
#> $objective
#> [1] 215.4921
#>
#> $convergence
#> [1] 0
#>
#> $iterations
#> [1] 10
#>
#> $evaluations
#> function gradient
#> 13 11
#>
#> $message
#> [1] "relative convergence (4)"
mp_tmb_coef(cal)
#> term mat row col default type estimate std.error
#> 1 params beta 0 0 0.2 fixed 0.5811725 0.03470795
#> 2 params.1 gamma 0 0 0.2 fixed 0.2381576 0.04467113
fit_infection = mp_trajectory(cal)
(ggplot()
+ geom_point(aes(time, value), data = sim_infection)
+ geom_line(aes(time, value, colour = matrix), data = fit_infection)
+ theme_bw()
)