Description
Note
#1106 came first. This is a higher level summary written after some design discussions happened in that PR.
Context
SARS-CoV-2 data has substantial sampling bias. One of the goals of this workflow is to produce datasets that are representative of real-world incidence, stripping away as much sampling bias as possible.
Currently, this is approximated by sampling with various group_by settings, each a combination of geographic (division/country) and temporal (month/week) attributes. These define groups that are then uniformly sampled based on a target max_sequences.
The need for uniform sampling at the group level is an inherent limitation of augur filter. It has prompted workarounds in this workflow such as #1074.
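As a rough illustration of what uniform sampling at the group level means, here is a minimal Python sketch. It is not augur filter's actual implementation; the function name `uniform_group_sample` and the record layout are hypothetical, and the even integer split of max_sequences across groups is a simplifying assumption.

```python
import random
from collections import defaultdict


def uniform_group_sample(records, group_by, max_sequences, seed=0):
    """Hypothetical sketch: give every group the same target size.

    records       -- list of dicts, one per sequence (assumed layout)
    group_by      -- list of column names, e.g. ["country", "month"]
    max_sequences -- requested total across all groups
    """
    rng = random.Random(seed)

    # Bucket records by their group key.
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[col] for col in group_by)].append(rec)
    if not groups:
        return []

    # Assumption: the target is split evenly, ignoring group size.
    per_group = max_sequences // len(groups)

    sampled = []
    for key in sorted(groups):
        members = groups[key]
        # A group cannot contribute more sequences than it has.
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled
```

In this sketch a country with 100 sequences and a country with 5 sequences get the same target, regardless of how many people live in each, which is the limitation the workarounds try to soften.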
Proposal
There is a proposal to remove this limitation of augur filter: nextstrain/augur#1318. The option to specify sampling weights could be used directly in this workflow. Population-based weighted sampling would bring this workflow one step closer to representing real-world incidence, though there will still be some inherent sampling bias¹.
Case count data was also considered as a potential source of weights, but it was determined that population data would be a better source. See discussion: #1106 (comment)
¹ Weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries would still be under-sampled, and the output would contain fewer total sequences than requested by max_sequences. This is already the case with current uniform sampling, but it may be more noticeable under population-based weighted sampling for large countries that are under-sampled.
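The shortfall described in the footnote can be sketched with a small Python example. The helper names `weighted_targets` and `realized_counts` are hypothetical, the weights and availability numbers are made up for illustration, and the proportional rounding is an assumption rather than the method proposed in nextstrain/augur#1318.

```python
def weighted_targets(weights, max_sequences):
    """Hypothetical: split max_sequences in proportion to each group's weight."""
    total = sum(weights.values())
    return {g: round(max_sequences * w / total) for g, w in weights.items()}


def realized_counts(targets, available):
    """A group can contribute at most the sequences it actually has."""
    return {g: min(t, available.get(g, 0)) for g, t in targets.items()}


# Made-up population weights (arbitrary units) and sequence availability.
weights = {"A": 1000, "B": 250, "C": 250}
available = {"A": 150, "B": 500, "C": 500}

targets = weighted_targets(weights, max_sequences=600)
realized = realized_counts(targets, available)

print(targets)                 # {'A': 400, 'B': 100, 'C': 100}
print(realized)                # {'A': 150, 'B': 100, 'C': 100}
print(sum(realized.values()))  # 350
```

Here the large but under-sampled group A is assigned most of the target yet can only supply 150 sequences, so the total falls well short of the 600 requested. Because targets are fixed before availability is known, the unused share is not redistributed to B or C.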
Progress
- Allow weighted subsampling augur#1318
- Use weighted sampling for Asia builds #1106
  - I chose these builds to directly address the workaround applied in Update subsampling #1074
- Fix Asia weighted sampling #1150
- Use weighted sampling for other builds #1151
- Figure out what to do with global 1m/2m/6m builds