Use weighted sampling

> [!NOTE]
> https://github.com/nextstrain/ncov/pull/1106 came first. This is a higher-level summary written after some design discussions in that PR.

## Context

SARS-CoV-2 data has substantial sampling bias. One of the goals of this workflow is to produce datasets that are representative of real-world incidence, stripping away as much sampling bias as possible.

Currently, this is approximated by sampling with various `group_by` settings: combinations of geographic (`division`/`country`) and temporal (`month`/`week`) attributes define groups, which are then sampled uniformly based on a target `max_sequences`.
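To make the current behavior concrete, here is a minimal sketch of uniform group-level sampling. It is a simplification for illustration, not `augur filter`'s actual implementation; the function name `uniform_group_sample` and the record layout (a list of dicts keyed by metadata column) are assumptions.

```python
import random
from collections import defaultdict


def uniform_group_sample(records, group_by, max_sequences, seed=0):
    """Sample up to max_sequences records, giving every group the same target.

    Hypothetical helper for illustration only; not augur's implementation.
    """
    rng = random.Random(seed)

    # Bucket records by the values of the group_by columns.
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in group_by)
        groups[key].append(record)
    if not groups:
        return []

    # Every group gets the same target size, regardless of population or incidence.
    per_group = max(1, max_sequences // len(groups))

    sampled = []
    for key in sorted(groups):
        members = groups[key]
        # A group with fewer records than its target contributes all of them.
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled
```

With two countries and `max_sequences=6`, each country gets a target of 3 whether it has a population of one million or one billion, which is the limitation described below.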

The need for uniform sampling at the group level is an inherent limitation of `augur filter`. It has prompted workarounds in this workflow such as https://github.com/nextstrain/ncov/pull/1074.

## Proposal

There is a proposal to remove the limitation of `augur filter`: https://github.com/nextstrain/augur/issues/1318. The option to specify sampling weights could be directly used in this workflow. Population-based weighted sampling would bring this workflow one step closer to representing real-world incidence, though there will still be some inherent sampling bias¹.

Case count data was also considered as a potential source of weights, but population data was determined to be a better source. See discussion: https://github.com/nextstrain/ncov/pull/1106#discussion_r1596112762

¹ Weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries would still be under-sampled, and the output would contain fewer total sequences than requested by `max_sequences`. This is already the case with the current uniform sampling, but it may be more noticeable under population-based weighted sampling for large countries that are under-sampled.
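The shortfall described in the footnote can be sketched as follows. This is an illustration under stated assumptions, not how `augur filter` computes targets: `weighted_targets` and `expected_yield` are hypothetical helpers, and the weights and availability numbers are made up.

```python
def weighted_targets(weights, max_sequences):
    """Split max_sequences across groups in proportion to their weights.

    Hypothetical helper; targets ignore how many sequences actually exist.
    """
    total = sum(weights.values())
    return {
        group: round(max_sequences * weight / total)
        for group, weight in weights.items()
    }


def expected_yield(targets, available):
    """Cap each group's target at the number of sequences available for it."""
    return {
        group: min(target, available.get(group, 0))
        for group, target in targets.items()
    }


# Made-up example: country A has 4x the population of B but far fewer sequences.
weights = {"A": 80, "B": 20}
available = {"A": 30, "B": 500}

targets = weighted_targets(weights, max_sequences=100)
yields = expected_yield(targets, available)
```

Here the targets are 80 for A and 20 for B, but A only has 30 sequences, so the sample contains 50 sequences instead of the requested 100. The unused part of A's target is not redistributed to B in this sketch.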

## Progress

- [x] https://github.com/nextstrain/augur/issues/1318
- [x] https://github.com/nextstrain/ncov/pull/1106
    - I chose these builds to directly address the workaround applied in https://github.com/nextstrain/ncov/pull/1074
- [x] https://github.com/nextstrain/ncov/pull/1150
- [x] https://github.com/nextstrain/ncov/pull/1151
- [ ] Figure out what to do with global 1m/2m/6m builds
    - https://github.com/nextstrain/ncov/pull/1161
    - https://github.com/nextstrain/ncov/pull/1168

Use weighted sampling #1141
