Single layer of cross-validation for base-learners
- Convert to B Monte Carlo sets of 5-fold CV for base-learners
- Out-of-fold predictions from the base-learner models in each CV set $b_i \in \{b_1, \dots, b_B\}$ are used as input into the meta-learner models.
- CV statistics form a distribution based on the $B \times L$ base learners, both in total and by learner $l = 1, \dots, L$.
Results in a Base-Learner Results Dataset:
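As a sketch of the Monte Carlo CV step above (in Python with scikit-learn rather than the tidymodels stack this issue assumes; the data and learner are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical data standing in for the real space-time features/target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

B = 3                        # number of Monte Carlo 5-fold CV sets
oof = np.empty((B, len(y)))  # out-of-fold predictions, one row per set b
for b in range(B):
    # Each Monte Carlo set b is a freshly shuffled 5-fold split.
    cv = KFold(n_splits=5, shuffle=True, random_state=b)
    oof[b] = cross_val_predict(LinearRegression(), X, y, cv=cv)

# Each row of `oof` becomes an input column for the meta-learner,
# and CV statistics can be computed per set to form a distribution.
```

With $L$ base learners this loop would run per learner, yielding the $B \times L$ out-of-fold prediction columns described above.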
Meta-learner is an ensemble
- K-fold space-time CV, repeated M times.
- CV statistics are a distribution based on the M CV sets
Results in a Meta-Learner Results Dataset.
Adding additional covariates into the meta-learner
Let's consider adding a handful of covariates into the meta-learner that are not used in the base learners:
- Temporally lagged predictions. e.g. time $t$ gets base-learner predictions for times $t$ and $t-1$, perhaps $t-7$ for a week lag.
- Random and targeted spatial fields. e.g. a spatial field (with mean zero) that increases slightly moving E-W, N-S, etc.
- Spatial intercept terms for regions, states, etc. e.g. 1 in NC, 0 everywhere else.
- Interaction terms of the above with the base-learners (because we will have column-wise random dropout).
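A minimal sketch of building the lagged, spatial-intercept, and interaction covariates (Python/pandas; the column names and toy data are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical site/time panel of base-learner predictions.
df = pd.DataFrame({
    "site": ["a"] * 4 + ["b"] * 4,
    "time": [0, 1, 2, 3] * 2,
    "pred_learner1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "state": ["NC"] * 4 + ["VA"] * 4,
})

# Temporally lagged prediction: time t also sees the prediction at t-1.
df["pred_learner1_lag1"] = df.groupby("site")["pred_learner1"].shift(1)

# Spatial intercept term: 1 in NC, 0 everywhere else.
df["in_nc"] = (df["state"] == "NC").astype(int)

# Interaction of the spatial intercept with a base-learner column.
df["in_nc_x_learner1"] = df["in_nc"] * df["pred_learner1"]
```

A week lag would use `shift(7)` on a daily series; longer lags leave more leading `NaN` rows per site that the meta-learner fit must drop or impute.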
Space-Time CV for all models
According to Phillips, R. V., Van der Laan, M. J., Lee, H., & Gruber, S. (2023). Practical considerations for specifying a super learner. International Journal of Epidemiology, 52(4), 1276-1285, we can use:
- Base-learners are trained with k-fold space-time CV.
- Meta-learners are trained with k-fold space-time CV.
The meta-learning stage is where the models will be encouraged to extrapolate out-of-sample.
Add CV metrics
In order to better compare with other papers, let's add other metrics from yardstick
- Mean Absolute Error (MAE)
Investigate PM2.5 sample size
Other papers are reporting a larger number of sites and total space-time samples. For example, https://d197for5662m48.cloudfront.net/documents/publicationstatus/224422/preprint_pdf/47e14a632995ad350c6d3cbe756fbde8.pdf includes PM2.5 from AQS under both parameter 88101 and 88502, where the latter is a subset with "acceptable PM2.5". We have 883 sites in 2019, while they report 1,281.
Random Cross-Validation
- As a maximum benchmark, create a target series starting from the base-learners and `rset` objects that is regular ol' random cross-validation with no concern for spatial or temporal locations. This should give us an upper benchmark that some other papers have reported.
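The contrast between random and grouped (spatial) CV can be sketched as follows (Python/scikit-learn rather than `rsample`; the site labels are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical site labels: random CV ignores them,
# spatial CV holds out whole sites.
sites = np.repeat(np.arange(5), 4)  # 5 sites, 4 observations each
idx = np.arange(len(sites))

random_cv = KFold(n_splits=5, shuffle=True, random_state=1)
spatial_cv = GroupKFold(n_splits=5)

# Under the grouped split, no site appears in both train and test,
# so the benchmark is harder (true out-of-site extrapolation).
for train, test in spatial_cv.split(idx, groups=sites):
    assert set(sites[train]).isdisjoint(sites[test])
```

Random CV typically leaks sites across train/test folds, which is exactly why it yields the optimistic upper benchmark some papers report.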