---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# endogenr

<!-- badges: start -->
<!-- badges: end -->

The goal of `endogenr` is to make it easy to simulate dynamic systems from
regression models, mathematical equations, and exogenous inputs (either
drawn from a probability distribution or supplied as data). It assumes a
panel-data structure with two columns identifying the time and unit
dimensions.

The simulator identifies the dependency graph of the models added to the
system and derives the order of calculation from that graph. Parallel
execution is opt-in via the `future` package.
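
As a rough illustration of how a calculation order can be derived from a dependency graph, the sketch below runs a Kahn-style topological sort in base R. It is not `endogenr`'s internal code: `deps` and `calc_order()` are hypothetical names, and the dependency map is written by hand for four variables from the example further down, counting only same-period inputs.

```r
# Hypothetical dependency map: each outcome lists the same-period
# variables it needs. Lagged inputs are already known at time t,
# so they impose no ordering constraint.
deps <- list(
  gdppc_grwt = character(0),
  population = character(0),
  gdppc      = "gdppc_grwt",
  gdp        = c("gdppc", "population")
)

# Kahn-style topological sort: repeatedly emit the variables whose
# dependencies have all been computed already.
calc_order <- function(deps) {
  done <- character(0)
  todo <- names(deps)
  while (length(todo) > 0) {
    ready <- todo[vapply(deps[todo], function(d) all(d %in% done), logical(1))]
    if (length(ready) == 0) {
      stop("circular dependency among: ", paste(todo, collapse = ", "))
    }
    done <- c(done, ready)
    todo <- setdiff(todo, ready)
  }
  done
}

ord <- calc_order(deps)
print(ord)
```

If two variables depended on each other's same-period values, no variable would ever become ready and the sketch would stop with an error.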

## Installation

You can install the development version of `endogenr` from
[GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pak("prio-data/endogenr")
# alternatively with
renv::install("prio-data/endogenr")
```

You can also clone the repository, open it as a project in RStudio,
find the "Build" tab, and press "Install".

## Example

`setup_system()` accepts a plain `data.frame` or `data.table`; no
`tsibble` conversion is required. Right-hand-side formula terms can use `lag()`,
`zoo::rollmean()`, or any function you define yourself (pass user functions
through the `globals` argument).

```{r example}
library(endogenr)
library(dplyr)
df <- endogenr::example_data

# Drop units with any NA in modelled outcomes over the training window 1970–2009.
required <- c("gdppc", "gdppc_grwt", "best", "v2x_polyarchy", "psecprop", "population")
train_window <- df[df$year >= 1970 & df$year <= 2009, ]
ok <- aggregate(train_window[, required],
                by = list(gwcode = train_window$gwcode),
                FUN = function(x) all(!is.na(x)))
keep <- ok$gwcode[apply(ok[, required], 1, all)]
df   <- df[df$gwcode %in% keep, ]
df$gdp <- df$gdppc * df$population  # derived outcome; pre-computed so the initial state is non-NA

c1 <- best ~ lag(best) + lag(log(gdppc)) + lag(log(population)) +
  lag(psecprop) + lag(v2x_polyarchy) + lag(gdppc_grwt) +
  lag(zoo::rollmean(best, k = 5, fill = NA, align = "right"))

model_system <- list(
  build_model("deterministic", formula = gdppc ~ I(abs(lag(gdppc) * (1 + gdppc_grwt)))),
  build_model("deterministic", formula = gdp ~ I(abs(gdppc * population))),
  build_model("parametric_distribution", formula = ~gdppc_grwt, distribution = "norm"),
  build_model("linear", formula = c1, boot = "resid"),
  build_model("univariate_fable",
              formula = v2x_polyarchy ~ error("A") + trend("N") + season("N"),
              method = "ets"),
  build_model("exogen", formula = ~psecprop),
  build_model("exogen", formula = ~population)
)
```

### Validation

`setup_system()` runs `validate_panel()` and `validate_system_closure()`
on your inputs and builds the dependency graph (no models are fit yet).
These check that time is contiguous and integer-valued within each unit,
the initial state at `test_start - 1` has no NAs in any modelled outcome
for units present there, and every variable referenced by a formula is
either modelled or supplied as a column. Panels may be unbalanced: units
may enter late or exit early. Units without a row at `test_start - 1`
are used for training only and are excluded from the simulation. With
`factor()` terms, prefer pre-converted factor columns so window fits
keep all levels. See `?validate_panel` and `?validate_system_closure`
if you want to run them ad-hoc on a candidate panel.

```{r setup}
sys <- setup_system(
  models      = model_system,
  data        = df,
  train_start = 1970,
  test_start  = 2010,
  horizon     = 12,
  groupvar    = "gwcode",
  timevar     = "year",
  inner_sims  = 2,
  min_window  = 10
)
```
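
To see what "contiguous and integer-valued within each unit" means in practice, the check can be approximated by hand in base R. This is a sketch on made-up data, not the code behind `validate_panel()`; `panel`, `is_contiguous()`, and `gappy_units` are hypothetical names.

```r
# Toy panel (hypothetical data): unit 2 is missing the year 2002.
panel <- data.frame(
  unit = c(1, 1, 1, 2, 2, 2),
  year = c(2000, 2001, 2002, 2000, 2001, 2003)
)

# A unit is contiguous when its sorted time values are whole numbers
# that increase by exactly 1.
is_contiguous <- function(time) {
  time <- sort(time)
  all(time == round(time)) && all(diff(time) == 1)
}

contiguous <- tapply(panel$year, panel$unit, is_contiguous)
print(contiguous)

# Units that would need fixing before being used as simulation input
gappy_units <- names(contiguous)[!contiguous]
print(gappy_units)
```

A late entry or early exit still passes this check, because only gaps inside a unit's own time range break contiguity.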

### Fit the system

`fit_system()` estimates the models and **stores** the fitted objects, so the
coefficients that drive the simulation can be inspected with
`get_coefficients()` or plotted with `plot_coefficients()`. `nsim` lives here:
it is the number of coefficient draws. Because this example sets `min_window`,
the bootstrapped linear model is refit on a random training window for each
draw, so the stored coefficients vary across draws.

```{r fit}
future::plan(future::multisession, workers = 2)

set.seed(42)
fit <- fit_system(sys, nsim = 2)

# The coefficients actually used across draws
get_coefficients(fit)
```

### Checking the fit against a plain regression

The fit path is designed to match a plain pooled regression exactly.
With `boot = NULL` on a spec and no `min_window`, `fit_system()` fits
that spec once on the full training window
`[train_start, test_start - 1]`, and its coefficients equal `lm()` on
the materialized training data. With `boot = "resid"` the bootstrap
draws centre on that fit and use the same estimation sample (complete
cases on the model's own columns). If a spec ever looks off, compare it
against the reference regression directly:

```{r parity, eval=FALSE}
# Assumes `dt` is a data.table with columns `unit`, `time`, `y`, and `x`.
library(data.table)

# endogenr's fit (one shared draw, full window)
sys <- setup_system(
  list(build_model("linear", formula = y ~ lag(y) + lag(x)),
       build_model("exogen", formula = ~x)),
  dt, train_start = 1965, test_start = 2010, horizon = 5,
  groupvar = "unit", timevar = "time", inner_sims = 1
)
fit <- fit_system(sys, nsim = 1)
get_coefficients(fit)

# the same regression by hand
dt[, `:=`(lag_y = shift(y), lag_x = shift(x)), by = unit]
coef(lm(y ~ lag_y + lag_x, dt[time < 2010]))
```

### Parallel execution and progress

The simulator no longer manages a `future` plan internally and no longer
refits anything: `simulate_system()` predicts using the stored draws and
derives `nsim` from them. Set a plan yourself before calling, and wrap the
call in `progressr::with_progress()` if you want a progress bar:

```{r simulate}
set.seed(42)
progressr::with_progress({
  res <- simulate_system(fit)
})

future::plan(future::sequential)
```

### Scoring and plotting

`simulate_system()` stamps `panel_unit` / `panel_time` attributes on the
result so `get_accuracy()` and `plotsim()` can infer the panel context
without any further configuration. Filter to the forecast window with a
plain row subset on the time column.

```{r postprocess}
res <- res[res$year >= sys$test_start, ]

acc <- get_accuracy(res, "gdppc_grwt", df)
acc |>
  dplyr::summarize(dplyr::across(crps:winkler, ~ mean(.x))) |>
  dplyr::arrange(crps) |>
  knitr::kable()

plotsim(res, "gdppc", c(2, 20, 530), df)
plotsim(res, "gdppc_grwt", c(2, 20, 530), df)
plotsim(res, "v2x_polyarchy", c(2, 20, 530), df)
```
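
For intuition about the `crps` column, the continuous ranked probability score can be estimated from a set of simulated draws with the standard sample formula. The sketch below assumes that textbook definition and uses made-up numbers; whether `get_accuracy()` uses this exact estimator is not stated here, and `crps_sample()` is a hypothetical helper.

```r
# Sample-based CRPS for one (unit, time) cell:
#   CRPS = E|X - y| - 0.5 * E|X - X'|
# where X and X' are independent forecast draws and y is the observed outcome.
crps_sample <- function(draws, obs) {
  mean(abs(draws - obs)) - 0.5 * mean(abs(outer(draws, draws, "-")))
}

set.seed(1)
obs   <- 0.02
sharp <- rnorm(500, mean = 0.02, sd = 0.01)  # well-centred, narrow forecast
vague <- rnorm(500, mean = 0.02, sd = 0.10)  # well-centred, wide forecast

print(c(sharp = crps_sample(sharp, obs), vague = crps_sample(vague, obs)))

# A degenerate one-point forecast reduces to the absolute error.
print(crps_sample(0.05, obs))
```

Lower values are better: the score rewards forecasts that are both centred on the outcome and no wider than they need to be.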

## Spatial lag

`endogenr` supports spatial-lag variables in the simulation. The setup is
two steps: (1) compute the spatial lag for the historical data, and
(2) register a `spatial_lag` model in the system so the lag is recomputed
at each simulated time step.

### Step 1: prepare spatial weights and the historical lag

`st_weights_from_sf()` builds a neighbourhood list and weights from an `sf`
object. The default is queen contiguity. For other schemes, call `sfdep`
directly and supply `nb`, `wt`, and `unit_ids` yourself (in the order they
appear in the spatial object).

Computing the historical lag can be fiddly when the panel is unbalanced or
has missing observations; the example below filters to units present in
the neighbourhood structure before computing the lag.

```{r spatial-lag, eval=FALSE}
library(endogenr)
library(dplyr)

df <- endogenr::example_data

# Load a map and filter to units present in the data
map <- poldat::cshp_gw_modifications(france_overseas = FALSE) |>
  dplyr::filter(end == as.Date("2019-12-31"))
map <- map |> dplyr::filter(gwcode %in% unique(df$gwcode))

# Build spatial weights (queen contiguity by default)
sf::sf_use_s2(FALSE)
neigh <- st_weights_from_sf(map, "gwcode", weights_args = list(allow_zero = TRUE))

# Compute the spatial lag of `best` for each year in the historical data
df <- df |>
  dplyr::filter(gwcode %in% neigh$unit_ids) |>
  dplyr::group_by(year) |>
  dplyr::mutate(
    sl_best = sfdep::st_lag(best, neigh$nb, neigh$wt, allow_zero = TRUE)
  ) |>
  dplyr::ungroup()
```
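
To make the computation concrete, a row-standardised spatial lag is simply the mean of a variable over each unit's neighbours. The base-R sketch below uses a hypothetical four-unit map and a simplified neighbour list in which an empty vector marks an island; `spatial_lag()` is an illustrative helper, not the `sfdep` or `endogenr` implementation.

```r
# Hypothetical 4-unit map: units 1-2-3 form a chain, unit 4 is an island.
# nb[[i]] holds the indices of unit i's neighbours.
nb <- list(2L, c(1L, 3L), 2L, integer(0))
y  <- c(10, 0, 4, 7)

# Row-standardised lag: the mean of y over each unit's neighbours.
# Units without neighbours receive `island_default` instead.
spatial_lag <- function(y, nb, island_default = 0) {
  vapply(nb, function(idx) {
    if (length(idx) == 0) island_default else mean(y[idx])
  }, numeric(1))
}

sl_y <- spatial_lag(y, nb)
print(sl_y)
```

The values of `y` must be in the same unit order as the neighbour list, which is why the historical data are filtered and grouped by year before the lag is computed.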

### Step 2: add a `spatial_lag` model to the system

A `spatial_lag` model is a cross-sectional transformation applied at every
simulated `t`. Reference the lag in other formulas as `lag(sl_y)`
(`lag(sl_best)` in the example below); using the same-period value `sl_y`
would create a circular dependency.

```{r spatial-lag-system, eval=FALSE}
c1 <- best ~ lag(best) + lag(sl_best) + lag(log(gdppc)) +
  lag(log(population)) + lag(psecprop) + lag(v2x_polyarchy) + lag(gdppc_grwt) +
  lag(zoo::rollmean(best, k = 5, fill = NA, align = "right"))

model_system <- list(
  # spatial_lag recomputes sl_best from `best` at each simulated t
  build_model("spatial_lag", formula = sl_best ~ best,
              nb = neigh$nb, wt = neigh$wt, unit_ids = neigh$unit_ids,
              island_default = 0),
  build_model("deterministic", formula = gdppc ~ I(abs(lag(gdppc) * (1 + gdppc_grwt)))),
  build_model("deterministic", formula = gdp ~ I(abs(gdppc * population))),
  build_model("parametric_distribution", formula = ~gdppc_grwt, distribution = "norm"),
  build_model("linear", formula = c1, boot = "resid"),
  build_model("univariate_fable",
              formula = v2x_polyarchy ~ error("A") + trend("N") + season("N"),
              method = "ets"),
  build_model("exogen", formula = ~psecprop),
  build_model("exogen", formula = ~population)
)

sys <- setup_system(
  models      = model_system,
  data        = df,
  train_start = 1970,
  test_start  = 2010,
  horizon     = 12,
  groupvar    = "gwcode",
  timevar     = "year",
  inner_sims  = 2,
  min_window  = 10
)

future::plan(future::multisession, workers = 2)
set.seed(42)
fit <- fit_system(sys, nsim = 2)
set.seed(42)
progressr::with_progress({
  res <- simulate_system(fit)
})
future::plan(future::sequential)
```

## Long-horizon comparison

For each horizon `h`, the long-horizon API fits a single *direct* regression of
the h-step-ahead outcome on covariates observed at the forecast origin
(`test_start - 1`, the last observed period, which is the same information the
dynamic simulator conditions on). It is a reduced-form benchmark for the dynamic
simulator. See `?cv_long_horizon` for the cross-validated entry point.

The outcome on the formula LHS must be wrapped in `lead_horizon()`; the horizon
`h` is supplied internally per horizon. The RHS is evaluated at the origin, so
write the covariates *unlagged* for the standard benchmark (use `lag()` only
when you deliberately want history older than the origin).

```{r long-horizon, eval=FALSE}
formulas <- list(
  lh_linear = lead_horizon(gdppc_grwt) ~ gdppc_grwt + log(gdppc) + best
)

lh_setup <- setup_long_horizon(
  data       = df,
  formulas   = formulas,
  horizons   = 1:12,
  groupvar   = "gwcode",
  timevar    = "year",
  test_start = 2010
)

lh_forecasts <- forecast_long_horizon(
  lh_setup,
  data       = df,
  test_start = 2010,   # defaults to lh_setup$test_start
  nsim       = 100,
  inner_sims = 10
)

# Score both approaches on the same (unit, horizon) grid, then stack them.
lh_acc  <- get_lh_accuracy(lh_forecasts, df, lh_setup)
sim_acc <- get_accuracy(res, "gdppc_grwt", df,
                        test_start = 2010, by = c("gwcode", "horizon"))

compare_approaches(lh_acc, sim_acc) |>
  dplyr::group_by(approach, horizon) |>
  dplyr::summarize(crps = mean(crps), .groups = "drop") |>
  dplyr::arrange(horizon, approach)
```
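
The idea of a direct regression can be sketched in base R for a single horizon: shift the outcome forward by `h` periods within each unit, regress it on covariates at the origin, and predict from the last observed row. Everything here is illustrative; `toy`, `lead_by()`, and `fit_h` are hypothetical names on simulated data and do not reproduce what `lead_horizon()` or `setup_long_horizon()` do internally.

```r
# Hypothetical toy panel: 3 units observed over 2000-2009.
set.seed(1)
toy <- expand.grid(year = 2000:2009, unit = 1:3)
toy$x <- rnorm(nrow(toy))
toy$y <- 0.5 * toy$x + rnorm(nrow(toy), sd = 0.1)

# Shift v forward by h periods, padding the tail with NA.
lead_by <- function(v, h) c(v, rep(NA, h))[seq_along(v) + h]

h <- 3
toy <- toy[order(toy$unit, toy$year), ]
toy$y_lead <- ave(toy$y, toy$unit, FUN = function(v) lead_by(v, h))

# One regression per horizon: outcome at t + h on covariates at t.
# Rows whose lead is NA (the last h years of each unit) are dropped by lm().
fit_h <- lm(y_lead ~ y + x, data = toy)

# Forecast for origin + h from the last observed row of each unit.
origin <- toy[toy$year == max(toy$year), ]
pred   <- predict(fit_h, newdata = origin)
print(pred)
```

Repeating this for each `h` gives one model per horizon, in contrast to the dynamic simulator, which iterates one-step models forward.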


For a transformed outcome (e.g. `lead_horizon(asinh(gdppc)) ~ ...`), score on
the modelled scale with `get_lh_accuracy(..., scale = "model")`, or on the
native scale with `scale = "native", inverse = sinh`.