Description
Progress
- New command: augur subsample #1879
- Add route for augur subsample config schema nextstrain.org#1216
- Update subsampling guide with augur subsample docs.nextstrain.org#257
- WNV draft
- WNV draft with future features (e.g. proximity-based sampling)
- WNV draft against WADOH fork
- mpox draft
Tasks
- Create a prototype
- Define initial requirements
- Implement the command, write docs
- Demo the prototype in existing workflows (anything with an existing rule subsample should work)
  - ncov includes proximity-based sampling which is out of scope of the initial implementation. However, it could be a good target if that feature is deprecated, since the rest of the complex subsampling logic should be covered by the initial implementation.
  - See prior work on generating the subsampling YAML config file: rule extract_subsampling_scheme
  - wnv is a good target because it currently adopts the generalized subsampling approach which the initial implementation is meant to replace.
Requirements for initial release
Add a new command that takes an input dataset of metadata/sequences and outputs a subset of metadata/sequences based on command line arguments, including a YAML configuration file defining one or more samples.
The command will call augur filter once per sample, then a final time to combine the intermediate samples.
Details
- Sample-specific and sampling-related global configuration to be done through the YAML file.
  - Example: all filtering options
- Input and output configuration to be done through command line arguments.
  - Examples: --metadata, --metadata-delimiters, --output-metadata, --skip-checks
- Other global configuration also to be done through command line arguments.
  - Examples: --nthreads, --subsample-seed
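The call pattern described in the requirements (one augur filter call per sample, then a final combining call) can be sketched roughly as below. This is only an illustration under stated assumptions, not the command's actual implementation: the helper names `option_to_args` and `plan_filter_calls`, the intermediate `<sample>.strains.txt` files, and the choice to combine samples with `--exclude-all` plus `--include` are all hypothetical. The sketch also assumes every YAML key maps to an augur filter flag by swapping underscores for dashes.

```python
from pathlib import Path


def option_to_args(key, value):
    """Turn one config entry into CLI tokens, e.g. group_by -> --group-by."""
    flag = "--" + key.replace("_", "-")
    values = value if isinstance(value, list) else [value]
    return [flag] + [str(v) for v in values]


def plan_filter_calls(samples, metadata, output_metadata, workdir="."):
    """Return one argv list per sample, followed by a final combining call."""
    calls, strain_files = [], []
    for name, options in samples.items():
        # Hypothetical intermediate file listing the strains kept for this sample.
        strains = str(Path(workdir) / f"{name}.strains.txt")
        argv = ["augur", "filter", "--metadata", metadata]
        for key, value in options.items():
            argv += option_to_args(key, value)
        argv += ["--output-strains", strains]
        calls.append(argv)
        strain_files.append(strains)
    # Final call (assumed approach): exclude everything, then force-include
    # the strains selected by each intermediate sample.
    calls.append(
        ["augur", "filter", "--metadata", metadata, "--exclude-all",
         "--include", *strain_files, "--output-metadata", output_metadata]
    )
    return calls


# Sample-specific options come from the YAML file; inputs and outputs
# come from the command line.
samples = {
    "region": {
        "group_by": ["region", "year"],
        "subsample_max_sequences": 3000,
        "exclude": ["defaults/exclude.txt"],
    }
}
for argv in plan_filter_calls(samples, "metadata.tsv", "subsampled.tsv"):
    print(" ".join(argv))
```

Building the argument lists before running anything keeps the plan easy to print, which may help when debugging an error in one of the underlying calls.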
Example YAML configuration (taken from wnv draft):
samples:
  region:
    query: is_lab_host != 'true'
    query_columns:
      - is_lab_host:str
    min_length: 8200
    group_by:
      - region
      - year
    subsample_max_sequences: 3000
    exclude:
      - defaults/exclude.txt
    include:
      - defaults/include.txt
Points to consider:
- Is the config schema intuitive?
- Is the config schema sufficiently documented?
- Is it easy to debug a config schema validation error?
- Is it easy to debug an error in one of the underlying augur filter calls?
- Do relative filepaths get passed through properly?
Features to be considered after initial release:
Links
- 2021-08-17 take 1 by James
- 2022-01-12 discussion during Nextstrain meeting and Slack thread
- 2022-11-03 Slack thread on re-visiting augur subsample with proximities
- 2024-02-25 Slack thread
- 2024-04-12 take 2 by James/Victor
Original issue
A common use case is versatile subsampling of datasets to suit a particular research question. The current best example of this is the (wonderful) SARS-CoV-2 pipeline, which leverages an augur filter rule, a script to calculate priorities, and Snakemake wizardry to allow versatile, declarative subsampling schemes to be simply and intuitively defined.
This allows a simple-to-reason-with YAML file to result in a very bespoke subsampling scheme:

The question arises: how do we do this for a different pathogen?
As the SARS-CoV-2 example leverages Snakemake, one solution would be to abstract that logic into an importable Snakemake rule. The alternative approach would be a new augur command, augur subsample, which takes a YAML file declaring the desired subsampling settings. Learning from our work on nCoV, this would essentially replace the Snakemake-controlled augur filter commands with a single augur subsample command. The YAML file would look similar or identical to the current Snakemake implementation. The subcommand would leverage the functions used by augur filter as well as the priorities script from nCoV.
Thoughts?
Examples
subsampling.yaml:
schemes:
  switzerland:
    # Focal samples for country
    country:
      group_by: "division year month"
      max_sequences: 1500
      exclude: "--exclude-where 'country!={country}'"
    # Contextual samples from country's region
    region:
      group_by: "country year month"
      seq_per_group: 20
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    # Contextual samples from the rest of the world,
    # excluding the current region to avoid resampling.
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"
Command:
augur subsample --include <TXT> --sequences <FASTA> \
  --metadata <TSV> --schemes <YAML> --output <FASTA>
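To make the proposed scheme format more concrete, here is a minimal sketch of how one sample entry from subsampling.yaml might be expanded into augur filter arguments. It is not the nCoV workflow's code or a proposed implementation: the `sample_to_args` helper and the `KEY_TO_FLAG` mapping are hypothetical, the `{country}` and `{region}` placeholders are assumed to be filled with plain string formatting, and the `priorities` block is ignored entirely.

```python
import shlex

# Hypothetical mapping from the proposed YAML keys to augur filter flags.
KEY_TO_FLAG = {
    "max_sequences": "--subsample-max-sequences",
    "seq_per_group": "--sequences-per-group",
}


def sample_to_args(sample, **placeholders):
    """Expand one sample definition into a list of augur filter arguments."""
    args = []
    if "group_by" in sample:
        # "country year month" -> --group-by country year month
        args += ["--group-by", *sample["group_by"].split()]
    for key, flag in KEY_TO_FLAG.items():
        if key in sample:
            args += [flag, str(sample[key])]
    if "exclude" in sample:
        # Fill in {country} / {region}, then split the string like a shell would.
        args += shlex.split(sample["exclude"].format(**placeholders))
    return args


region_sample = {
    "group_by": "country year month",
    "seq_per_group": 20,
    "exclude": "--exclude-where 'country={country}' 'region!={region}'",
}
print(sample_to_args(region_sample, country="Switzerland", region="Europe"))
```

Passing raw flag strings such as the `exclude` value through the YAML file keeps the format close to the Snakemake implementation, at the cost of mixing command line syntax into the configuration.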