augur subsample command #635

@jameshadfield

Progress

Tasks

  • Create a prototype
    • Define initial requirements
    • Implement the command, write docs
  • Demo the prototype in existing workflows (anything with an existing rule subsample should work)
    • ncov includes proximity-based sampling, which is out of scope for the initial implementation. However, it could be a good target if that feature is deprecated, since the initial implementation should cover the rest of its complex subsampling logic.
    • wnv is a good target because it currently uses the generalized subsampling approach that the initial implementation is meant to replace.

Requirements for initial release

Add a new command that takes an input dataset of metadata/sequences and outputs a subset of metadata/sequences based on command line arguments, including a YAML configuration file defining one or more samples.

The command will call augur filter once per sample, then a final time to combine the intermediate samples.
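
The call sequence described above can be sketched roughly as follows. This is only an illustration, not the planned implementation: the function name plan_filter_calls, the intermediate file naming, and the use of --exclude-all plus --include for the combining call are assumptions, and the per-sample filter options are left out.

```python
from pathlib import Path


def plan_filter_calls(sample_names, metadata, output_metadata, workdir):
    """Return one augur filter command per sample plus a final combining command."""
    commands = []
    strain_lists = []
    for name in sample_names:
        # Hypothetical naming scheme for the intermediate strain lists.
        strains = (Path(workdir) / f"{name}.strains.txt").as_posix()
        strain_lists.append(strains)
        commands.append([
            "augur", "filter",
            "--metadata", metadata,
            # Sample-specific options from the YAML file would be added here.
            "--output-strains", strains,
        ])
    # Final call (assumed approach): exclude everything, then force-include
    # the strains selected by each intermediate sample.
    commands.append([
        "augur", "filter",
        "--metadata", metadata,
        "--exclude-all",
        "--include", *strain_lists,
        "--output-metadata", output_metadata,
    ])
    return commands


calls = plan_filter_calls(["region", "global"], "metadata.tsv", "subsampled.tsv", "tmp")
for call in calls:
    print(" ".join(call))
```

With two samples this yields three commands: two intermediate calls and one combining call.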

Details

  • Sample-specific and sampling-related global configuration to be done through the YAML file.
    • Example: all filtering options
  • Input and output configuration to be done through command line arguments.
    • Examples: --metadata, --metadata-delimiters, --output-metadata, --skip-checks
  • Other global configuration also to be done through command line arguments.
    • Examples: --nthreads, --subsample-seed
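
To make the split concrete, here is a minimal argparse sketch of the command line side. The option names --metadata, --metadata-delimiters, --output-metadata, --skip-checks, --nthreads and --subsample-seed come from the list above; --config is a hypothetical name for the option that points at the YAML file, and all types and defaults are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="augur subsample")
    # Input and output configuration.
    parser.add_argument("--metadata", required=True)
    parser.add_argument("--metadata-delimiters", nargs="+", default=[",", "\t"])
    parser.add_argument("--output-metadata")
    parser.add_argument("--skip-checks", action="store_true")
    # Hypothetical option naming the YAML file that holds the sample definitions.
    parser.add_argument("--config", required=True)
    # Other global configuration.
    parser.add_argument("--nthreads", type=int, default=1)
    parser.add_argument("--subsample-seed", type=int)
    return parser


args = build_parser().parse_args(
    ["--metadata", "metadata.tsv", "--config", "samples.yaml", "--subsample-seed", "0"]
)
print(args.metadata, args.config, args.subsample_seed, args.nthreads)
```

Everything that describes what to sample stays out of this parser and lives in the YAML file instead.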

Example YAML configuration (taken from wnv draft):

samples:
  region:
    query: is_lab_host != 'true'
    query_columns:
      - is_lab_host:str
    min_length: 8200
    group_by:
      - region
      - year
    subsample_max_sequences: 3000
    exclude:
      - defaults/exclude.txt
    include:
      - defaults/include.txt
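
As a rough illustration of how such a sample could map onto a single augur filter call, the sketch below turns each YAML key into the flag of the same name with underscores replaced by hyphens. That naming rule is an assumption made for this example (it happens to hold for the keys shown above), the helper name sample_to_filter_args is hypothetical, and the config is written as a Python dict to keep the example free of a YAML dependency.

```python
def sample_to_filter_args(sample):
    """Translate one sample's options into a flat list of augur filter arguments."""
    args = []
    for key, value in sample.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            # Boolean options become bare flags.
            args.append(flag)
        elif value is False or value is None:
            continue
        else:
            values = value if isinstance(value, list) else [value]
            args.append(flag)
            args.extend(str(v) for v in values)
    return args


# The "region" sample from the example above, as a Python dict.
region = {
    "query": "is_lab_host != 'true'",
    "query_columns": ["is_lab_host:str"],
    "min_length": 8200,
    "group_by": ["region", "year"],
    "subsample_max_sequences": 3000,
    "exclude": ["defaults/exclude.txt"],
    "include": ["defaults/include.txt"],
}

filter_args = sample_to_filter_args(region)
print(["augur", "filter", *filter_args])
```

Building an argument list rather than a shell string avoids quoting problems with values such as the query expression.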

Points to consider:

  • Is the config schema intuitive?
  • Is the config schema sufficiently documented?
  • Is it easy to debug a config schema validation error?
  • Is it easy to debug an error in one of the underlying augur filter calls?
  • Do relative filepaths get passed through properly?
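
One way to keep validation errors easy to debug is to report the path of the offending key. The sketch below is a hand-rolled check for illustration only, not a proposal for the actual schema mechanism; the set of allowed keys is limited to those in the example above, and the function name validate_config is hypothetical.

```python
# Allowed per-sample options and their expected types (assumed, from the example config).
ALLOWED_KEYS = {
    "query": str,
    "query_columns": list,
    "min_length": int,
    "group_by": list,
    "subsample_max_sequences": int,
    "exclude": list,
    "include": list,
}


def validate_config(config):
    """Return a list of readable errors, each prefixed with the key path."""
    samples = config.get("samples")
    if not isinstance(samples, dict) or not samples:
        return ["samples: expected a non-empty mapping of sample names to options"]
    errors = []
    for name, options in samples.items():
        if not isinstance(options, dict):
            errors.append(f"samples.{name}: expected a mapping of options")
            continue
        for key, value in options.items():
            path = f"samples.{name}.{key}"
            expected = ALLOWED_KEYS.get(key)
            if expected is None:
                errors.append(f"{path}: unknown option")
            elif not isinstance(value, expected):
                errors.append(
                    f"{path}: expected {expected.__name__}, got {type(value).__name__}"
                )
    return errors


bad = {"samples": {"region": {"min_length": "8200", "groupby": ["region"]}}}
for message in validate_config(bad):
    print(message)
```

Collecting all errors before reporting lets a user fix a config in one pass instead of one error per run.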

Features to be considered after initial release:

Links


Original issue

A common use case is versatile sub-sampling of datasets to suit a particular research question. The current best example of this is the (wonderful) SARS-CoV-2 pipeline, which leverages an augur filter rule, a script to calculate priorities, and Snakemake wizardry to allow versatile, declarative subsampling schemes to be simply and intuitively defined.

This allows a simple-to-reason-with YAML file to result in a very bespoke subsampling scheme:
[image]

The question arises: how do we do this for a different pathogen?

As the SARS-CoV-2 example leverages Snakemake, one solution would be to abstract that logic into an importable Snakemake rule. The alternative approach would be a new augur command, augur subsample, which takes a YAML file declaring the desired subsampling settings. Learning from our work on nCoV, this would essentially replace the Snakemake-controlled augur filter commands with a single augur subsample command. The YAML file would look similar or identical to the current Snakemake implementation. The subcommand would leverage the functions used by augur filter as well as the priorities script from nCoV.

Thoughts?

Examples

subsampling.yaml:

schemes:
  switzerland:
    # Focal samples for country
    country:
      group_by: "division year month"
      max_sequences: 1500
      exclude: "--exclude-where 'country!={country}'"
    # Contextual samples from country's region
    region:
      group_by: "country year month"
      seq_per_group: 20
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    # Contextual samples from the rest of the world,
    # excluding the current region to avoid resampling.
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"

Proposed invocation:

augur subsample --include <TXT> --sequences <FASTA> \
    --metadata <TSV> --schemes <YAML> --output <FASTA>
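
The {country} and {region} placeholders in the scheme above imply a templating step before the arguments reach augur filter. A minimal sketch of that step, assuming Python str.format style substitution followed by shell-like splitting (both are assumptions for illustration; the function name expand_exclude and the example values are hypothetical):

```python
import shlex


def expand_exclude(template, **values):
    """Fill in placeholders, then split into individual command line arguments."""
    return shlex.split(template.format(**values))


exclude_args = expand_exclude(
    "--exclude-where 'country={country}' 'region!={region}'",
    country="Switzerland",
    region="Europe",
)
print(exclude_args)
```

Splitting with shlex keeps each quoted condition as a single argument, so the result can be appended directly to an argument list.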

Labels: enhancement (New feature or request)