Single layer of cross-validation for base-learners
- Convert to B Monte Carlo sets of 5-fold CV for base-learners
- Out-of-fold predictions from the base-learner models in each CV set $b_i \in \{b_1, \dots, b_B\}$ are used as input into the meta-learner models.
- CV statistics form a distribution based on the $B \times L$ base learners, both in total and by learner $l = 1, \dots, L$.
Results in a Base-Learner Results Dataset:
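As a sketch of the Monte Carlo CV step above (in Python with scikit-learn rather than the tidymodels stack this issue assumes; the data and learner are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Hypothetical data standing in for the real space-time features/target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

B = 3                        # number of Monte Carlo 5-fold CV sets
oof = np.empty((B, len(y)))  # out-of-fold predictions, one row per set b
for b in range(B):
    # Each Monte Carlo set b is a freshly shuffled 5-fold split.
    cv = KFold(n_splits=5, shuffle=True, random_state=b)
    oof[b] = cross_val_predict(LinearRegression(), X, y, cv=cv)

# Each row of `oof` becomes an input column for the meta-learner,
# and CV statistics can be computed per set to form a distribution.
```

With $L$ base learners this loop would run per learner, yielding the $B \times L$ out-of-fold prediction columns described above.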
Meta-learner is an ensemble
- K-fold space-time CV, repeated M times.
- CV statistics are a distribution based on the M CV sets
Results in a Meta-Learner Results Dataset.
Adding additional covariates into the meta-learner
Let's consider adding a handful of covariates into the meta-learner that are not used in the base learners:
- Temporally lagged predictions. e.g. time $t$ gets base-learner predictions for times $t$ and $t-1$, perhaps $t-7$ for a week lag.
- Random and targeted spatial fields. e.g. a spatial field (with mean zero) that increases slightly moving E-W, N-S, etc.
- Spatial intercept terms for regions, states, etc. e.g. 1 in NC, 0 everywhere else.
- Interaction terms of the above with the base-learners (because we will have column-wise random dropout).
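A minimal sketch of building the lagged, spatial-intercept, and interaction covariates (Python/pandas; the column names and toy data are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical site/time panel of base-learner predictions.
df = pd.DataFrame({
    "site": ["a"] * 4 + ["b"] * 4,
    "time": [0, 1, 2, 3] * 2,
    "pred_learner1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "state": ["NC"] * 4 + ["VA"] * 4,
})

# Temporally lagged prediction: time t also sees the prediction at t-1.
df["pred_learner1_lag1"] = df.groupby("site")["pred_learner1"].shift(1)

# Spatial intercept term: 1 in NC, 0 everywhere else.
df["in_nc"] = (df["state"] == "NC").astype(int)

# Interaction of the spatial intercept with a base-learner column.
df["in_nc_x_learner1"] = df["in_nc"] * df["pred_learner1"]
```

A week lag would use `shift(7)` on a daily series; longer lags leave more leading `NaN` rows per site that the meta-learner fit must drop or impute.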
Space-Time CV for all models
According to Phillips, R. V., Van der Laan, M. J., Lee, H., & Gruber, S. (2023). Practical considerations for specifying a super learner. International Journal of Epidemiology, 52(4), 1276-1285, we can use:
- Base-learners are trained with k-fold space-time CV.
- Meta-learners are trained with k-fold space-time CV.
The meta-learning stage is where the models will be encouraged to extrapolate out-of-sample.
Add CV metrics
In order to better compare with other papers, let's add other metrics from yardstick
- Mean Absolute Error (MAE)
Investigate PM2.5 sample size
Other papers are reporting a larger number of sites and total space-time samples. For example, https://d197for5662m48.cloudfront.net/documents/publicationstatus/224422/preprint_pdf/47e14a632995ad350c6d3cbe756fbde8.pdf includes PM2.5 from AQS under both parameter 88101 and 88502, where the latter is a subset with "acceptable PM2.5". We have 883 sites in 2019, while they report 1,281.
Random Cross-Validation
- As a maximum benchmark, create a target series starting from the base-learners and `rset` objects that is regular ol' random cross-validation with no concern for spatial or temporal locations. This should give us an upper benchmark that some other papers have reported.
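The contrast between random and grouped (spatial) CV can be sketched as follows (Python/scikit-learn rather than `rsample`; the site labels are hypothetical):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical site labels: random CV ignores them,
# spatial CV holds out whole sites.
sites = np.repeat(np.arange(5), 4)  # 5 sites, 4 observations each
idx = np.arange(len(sites))

random_cv = KFold(n_splits=5, shuffle=True, random_state=1)
spatial_cv = GroupKFold(n_splits=5)

# Under the grouped split, no site appears in both train and test,
# so the benchmark is harder (true out-of-site extrapolation).
for train, test in spatial_cv.split(idx, groups=sites):
    assert set(sites[train]).isdisjoint(sites[test])
```

Random CV typically leaks sites across train/test folds, which is exactly why it yields the optimistic upper benchmark some papers report.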