Use weighted sampling #1141

@victorlin

Note

#1106 came first. This is a higher-level summary written after design discussions in that PR.

Context

SARS-CoV-2 data has substantial sampling bias. One of the goals of this workflow is to produce datasets that are representative of real-world incidence, stripping away as much sampling bias as possible.

Currently, this is approximated by sampling with various group_by settings, which combine geographic (division/country) and temporal (month/week) attributes to define groups. Those groups are then sampled uniformly based on a target max_sequences.
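To make the current behavior concrete, here is a minimal Python sketch of uniform sampling across groups. It is an illustration only and not augur filter's actual implementation: the function name `uniform_group_sample`, the record layout, and the integer-division target are all assumptions made for this example.

```python
import random
from collections import defaultdict


def uniform_group_sample(records, group_by, max_sequences, seed=0):
    """Sample up to max_sequences records with the same target size per group.

    Hypothetical helper for illustration; not part of augur.
    """
    rng = random.Random(seed)

    # Bucket records by the combination of group_by columns.
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in group_by)
        groups[key].append(record)
    if not groups:
        return []

    # Every group gets the same target, regardless of population or incidence.
    per_group = max(1, max_sequences // len(groups))

    sampled = []
    for members in groups.values():
        # A group with fewer records than the target contributes all it has.
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled


# Made-up records: country "A" is heavily sequenced, country "B" is not.
records = [{"country": "A", "month": "2023-01"} for _ in range(100)]
records += [{"country": "B", "month": "2023-01"} for _ in range(5)]

subsample = uniform_group_sample(records, ["country", "month"], max_sequences=20)
```

With these made-up inputs both groups get the same target of 10, even though nothing says the two countries should contribute equally. That equal treatment is the limitation described above.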

The need for uniform sampling at the group level is an inherent limitation of augur filter. It has prompted workarounds in this workflow such as #1074.

Proposal

There is a proposal to remove this limitation of augur filter: nextstrain/augur#1318. The proposed option to specify sampling weights could be used directly in this workflow. Population-based weighted sampling would bring this workflow one step closer to representing real-world incidence, though there will still be some inherent sampling bias¹.

Case count data was also considered as a potential source of weights; however, it was determined that population data would be a better source. See discussion: #1106 (comment)

¹ Weighted target sizes are calculated without taking into account the number of sequences actually available per group. This means under-sampled countries would still be under-sampled, and the output would contain fewer total sequences than requested by max_sequences. This is already the case with current uniform sampling, but it may be more noticeable under population-based weighted sampling for large countries that are under-sampled.
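The shortfall described in the footnote can be shown with a small Python sketch. This is a sketch under stated assumptions, not the design in nextstrain/augur#1318: the function names, the proportional-rounding rule, and all numbers below are made up for illustration.

```python
def weighted_targets(weights, max_sequences):
    """Split max_sequences across groups in proportion to their weights.

    Hypothetical helper; targets ignore how many sequences actually exist.
    """
    total = sum(weights.values())
    return {
        group: round(max_sequences * weight / total)
        for group, weight in weights.items()
    }


def realized_sizes(targets, available):
    """Cap each group's target at the number of sequences available for it."""
    return {
        group: min(target, available.get(group, 0))
        for group, target in targets.items()
    }


# Illustrative population weights (arbitrary units, not real data).
population = {"A": 80, "B": 15, "C": 5}
# Illustrative sequence counts: the largest country is under-sampled.
available = {"A": 200, "B": 5000, "C": 5000}

targets = weighted_targets(population, max_sequences=1000)
realized = realized_sizes(targets, available)
shortfall = 1000 - sum(realized.values())
```

Here country A is assigned a target of 800 but has only 200 sequences, so the subsample ends up with 400 sequences instead of the requested 1000. The larger the weight of an under-sampled group, the larger this gap becomes.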

Labels

enhancement (New feature or request)
