Updates to working preliminary pipeline #418

@kyle-messier

Description

Single layer of cross-validation for base-learners

  • Convert to $B$ Monte Carlo sets of 5-fold CV for the base learners.
  • Out-of-fold predictions for base-learner model $b_i \in \{b_1, \dots, b_B\}$ are used as input to the meta-learner models.
  • CV statistics form a distribution based on the $B \times L$ base learners, both in total and by learner type $l = 1, \dots, L$.

This results in $B \times L$ out-of-fold prediction vectors, each of total sample size $N$, where $L$ is the number of learner types (3 in our case).

Base Learner Results Dataset: $[ N \times (B \times L)]$
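As an illustrative sketch of the scheme above (Python with scikit-learn rather than the project's R stack; the sample size, feature count, and Ridge base learner are all placeholders), the $B$ Monte Carlo sets of 5-fold CV assemble into the $N \times (B \times L)$ out-of-fold dataset like this:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, P = 200, 4          # hypothetical sample size and feature count
B, L = 3, 1            # B Monte Carlo CV sets; one learner type here for brevity
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + rng.normal(scale=0.1, size=N)

# Each column holds the out-of-fold predictions from one (b, l) pair,
# giving the N x (B * L) base-learner results dataset described above.
oof = np.empty((N, B * L))
for b in range(B):
    kf = KFold(n_splits=5, shuffle=True, random_state=b)  # fresh Monte Carlo split
    for train_idx, test_idx in kf.split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        oof[test_idx, b] = model.predict(X[test_idx])

print(oof.shape)  # (200, 3)
```

Each row of `oof` then becomes one input row for the meta-learner.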

Meta-learner is an ensemble

  • K-fold space-time CV, repeated M times.
  • CV statistics are a distribution based on the M CV sets

This results in $M$ out-of-fold prediction vectors.

$\overline{M} = \frac{1}{M}\sum_{m=1}^{M} \hat{Y}_m$ is the mean of all $M$ ensembles. We can use it to visualize the full dataset (scatterplots, etc.).
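A sketch of the averaging step (Python; the simulated prediction matrix is a stand-in for the real $M$ repeated-CV outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 10                       # hypothetical sample size and number of CV repeats
y = rng.normal(size=N)
# One out-of-fold ensemble prediction vector per repeat m = 1..M
Y_hat = y[:, None] + rng.normal(scale=0.2, size=(N, M))

M_bar = Y_hat.mean(axis=1)           # mean over the M ensembles
print(M_bar.shape)  # (200,)
```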

Adding additional covariates into the meta-learner

Let's consider adding a handful of covariates into the meta-learner that are not used in the base learners.

  • Temporally lagged predictions, e.g. time $t$ gets base-learner predictions for times $t$ and $t-1$, and perhaps $t-7$ for a week lag.
  • Random and targeted spatial fields, e.g. a mean-zero spatial field that increases slightly moving E-W, N-S, etc.
  • Spatial intercept terms for regions, states, etc., e.g. 1 in NC, 0 everywhere else.
  • Interaction terms of the above with the base learners (because we will have column-wise random dropout).
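A minimal sketch of constructing these meta-learner covariates (Python/pandas for illustration; the `site`, `state`, and `pred` columns are hypothetical stand-ins for the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "site":  np.repeat(["a", "b"], 8),
    "time":  np.tile(np.arange(8), 2),
    "state": np.repeat(["NC", "VA"], 8),
    "pred":  rng.normal(size=16),    # base-learner prediction at (site, time)
})

# Temporally lagged predictions: time t gets the prediction from t-1
# (and t-7 for a weekly lag), computed within each site
df["pred_lag1"] = df.groupby("site")["pred"].shift(1)
df["pred_lag7"] = df.groupby("site")["pred"].shift(7)

# Spatial intercept term: 1 in NC, 0 everywhere else
df["nc"] = (df["state"] == "NC").astype(int)

# Interaction of the regional indicator with the base-learner prediction
df["nc_x_pred"] = df["nc"] * df["pred"]
```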

Space-Time CV for all models

According to Phillips, R. V., Van Der Laan, M. J., Lee, H., & Gruber, S. (2023). Practical considerations for specifying a super learner. International Journal of Epidemiology, 52(4), 1276-1285, we can use $K \geq 2$ for the K-folds because we easily have an effective sample size greater than 10,000.

  • Base-learners are trained with k-fold space-time CV.
  • Meta-learners are trained with k-fold space-time CV.

$k=5$ seems like a reasonable balance.

The meta-learning stage is where the models will be encouraged toward out-of-sample extrapolation.
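One way to sketch the space-time CV constraint (Python with scikit-learn's `GroupKFold` as a stand-in for the actual spatiotemporal resampling; the block labels are simulated):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)
N = 100
X = rng.normal(size=(N, 2))
# Hypothetical space-time block labels; a fold never splits a block across
# train and test, so held-out data come from blocks unseen during training.
blocks = rng.integers(0, 20, size=N)

gkf = GroupKFold(n_splits=5)
disjoint = all(
    set(blocks[tr]).isdisjoint(blocks[te])
    for tr, te in gkf.split(X, groups=blocks)
)
print(disjoint)  # True
```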

Add CV metrics

To better compare with other papers, let's add other metrics from yardstick:

  • Mean Absolute Error (MAE)
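For reference, MAE is just the average absolute residual — `yardstick::mae()` computes the same quantity in R; this Python helper is only for illustration:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

print(mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.5
```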

Investigate PM2.5 sample size

Other papers report a larger number of sites and total space-time samples. For example, https://d197for5662m48.cloudfront.net/documents/publicationstatus/224422/preprint_pdf/47e14a632995ad350c6d3cbe756fbde8.pdf includes PM2.5 from AQS under both parameter codes 88101 and 88502, where the latter is a subset with "acceptable PM2.5". We have 883 sites in 2019, while they report 1,281.

Random Cross-Validation

  • As a maximum benchmark, create a target series, starting from the base learners and rset objects, that uses regular ol' random cross-validation with no concern for spatial or temporal locations. This should give us the upper benchmark that some other papers have reported.
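A sketch of why plain random CV is the optimistic upper benchmark (Python; the block labels are simulated): randomly shuffled folds share space-time blocks between train and test, unlike the grouped space-time folds.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(4)
blocks = rng.integers(0, 20, size=100)   # hypothetical space-time block labels

# Plain random k-fold ignores spatial/temporal structure, so held-out folds
# share blocks with the training data and scores come out optimistic.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(kf.split(blocks))
overlap = set(blocks[train_idx]) & set(blocks[test_idx])
print(len(overlap) > 0)
```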
