augur subsample command #635

## Progress

- https://github.com/nextstrain/augur/pull/1879
- https://github.com/nextstrain/nextstrain.org/pull/1216
- https://github.com/nextstrain/docs.nextstrain.org/pull/257
- [WNV draft](https://github.com/nextstrain/WNV/pull/96)
- [WNV draft with future features (e.g. proximity-based sampling)](https://github.com/nextstrain/WNV/pull/97)
- [WNV draft against WADOH fork](https://github.com/NW-PaGe/WNV/pull/3)
- [mpox draft](https://github.com/nextstrain/mpox/pull/334)

## Tasks

- Create a prototype
    - Define initial requirements
    - Implement the command, write docs
- Demo the prototype in existing workflows (anything with an existing `rule subsample` should work)
    - **ncov** includes proximity-based sampling, which is out of scope for the initial implementation. However, it could be a good target if that feature is deprecated, since the rest of the complex subsampling logic should be covered by the initial implementation.
        - See prior work on generating the subsampling YAML config file: [rule extract_subsampling_scheme](https://github.com/nextstrain/ncov/blob/f7b11697effa5801f24f1c1c5e4395909670f2d1/workflow/snakemake_rules/main_workflow.smk#L300-L378)
    - **wnv** is a good target because it currently adopts the [generalized subsampling approach](https://docs.nextstrain.org/en/latest/guides/bioinformatics/filtering-and-subsampling.html#generalizing-subsampling-in-a-workflow), which the initial implementation is meant to replace.


## Requirements for initial release

Add a new command that takes an input dataset of metadata/sequences and outputs a subset of metadata/sequences based on command line arguments, including a YAML configuration file defining one or more samples.

The command will call `augur filter` once per sample, then a final time to combine the intermediate samples. 
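
The two-stage flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the helper name `plan_filter_calls`, the intermediate file naming, and the choice of `--exclude-all` plus `--include` for the combining step are all assumptions. The sketch only builds argument lists and does not run anything.

```python
def plan_filter_calls(metadata, sample_args, output_metadata, workdir="tmp"):
    """Return argument lists for one `augur filter` call per sample,
    followed by a final call that combines the intermediate samples.

    `sample_args` maps a sample name to that sample's filter options,
    already expressed as command line arguments.
    """
    calls = []
    strain_lists = []

    for name, args in sample_args.items():
        # Assumption: each sample writes its selected strain names to a file.
        strains = f"{workdir}/{name}.txt"
        strain_lists.append(strains)
        calls.append(
            ["augur", "filter", "--metadata", metadata, *args,
             "--output-strains", strains]
        )

    # Final call: exclude everything, then force-include each intermediate sample.
    calls.append(
        ["augur", "filter", "--metadata", metadata,
         "--exclude-all", "--include", *strain_lists,
         "--output-metadata", output_metadata]
    )
    return calls


calls = plan_filter_calls(
    metadata="metadata.tsv",
    sample_args={
        "focal": ["--group-by", "region", "year",
                  "--subsample-max-sequences", "3000"],
        "context": ["--min-length", "8200"],
    },
    output_metadata="subsampled.tsv",
)
for call in calls:
    print(" ".join(call))
```

With two samples this plans three calls: one per sample, then the combining call.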

### Details

- Sample-specific and sampling-related global configuration to be done through the YAML file.
    - Example: all filtering options
- Input and output configuration to be done through command line arguments.
    - Examples: `--metadata`, `--metadata-delimiters`, `--output-metadata`, `--skip-checks`
- Other global configuration also to be done through command line arguments.
    - Examples: `--nthreads`, `--subsample-seed`
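
One way to picture this split is an `argparse` sketch in which the command line carries only input/output and other global options, while everything sampling-related lives in the YAML file. The option names below are the ones listed above, except `--config`, which is a hypothetical name for the YAML file argument; the argument types and the `--nthreads` default are also assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="augur subsample")
    # Hypothetical option: path to the YAML file defining the samples.
    parser.add_argument("--config", required=True)
    # Input and output configuration.
    parser.add_argument("--metadata", required=True)
    parser.add_argument("--metadata-delimiters", nargs="+")
    parser.add_argument("--output-metadata")
    parser.add_argument("--skip-checks", action="store_true")
    # Other global configuration (the default of 1 thread is an assumption).
    parser.add_argument("--nthreads", type=int, default=1)
    parser.add_argument("--subsample-seed", type=int)
    return parser


args = build_parser().parse_args(
    ["--config", "config.yaml",
     "--metadata", "metadata.tsv",
     "--output-metadata", "subsampled.tsv",
     "--subsample-seed", "0"]
)
print(args)
```

Filtering options such as `--query` or `--group-by` are deliberately absent from the parser, since they would be set per sample in the YAML file.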

Example YAML configuration (taken from [wnv draft](https://github.com/nextstrain/WNV/blob/1a8d35c4b7abcc22e3e003485a8e480f4b636967/phylogenetic/defaults/all-lineages/config.yaml#L8-L21)):

```yaml
samples:
  region:
    query: is_lab_host != 'true'
    query_columns:
      - is_lab_host:str
    min_length: 8200
    group_by:
      - region
      - year
    subsample_max_sequences: 3000
    exclude:
      - defaults/exclude.txt
    include:
      - defaults/include.txt
```
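
To illustrate how a sample like this might translate into an `augur filter` call, here is a minimal sketch. It assumes each config key corresponds to the filter option of the same name with underscores replaced by hyphens, which holds for the keys in this example but is not confirmed as a general rule. The helper name `sample_to_filter_args` is hypothetical, and the sample is written as a Python dict, as it would look once the YAML is parsed.

```python
def sample_to_filter_args(sample):
    """Translate one sample's options into `augur filter` style arguments.

    Assumption: a key such as `min_length` maps to the flag `--min-length`.
    """
    args = []
    for key, value in sample.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            # Boolean options become bare flags, emitted only when true.
            if value:
                args.append(flag)
        elif isinstance(value, list):
            # List options become one flag followed by all of its values.
            args.extend([flag, *map(str, value)])
        else:
            args.extend([flag, str(value)])
    return args


# The `region` sample from the example above, after YAML parsing.
region = {
    "query": "is_lab_host != 'true'",
    "query_columns": ["is_lab_host:str"],
    "min_length": 8200,
    "group_by": ["region", "year"],
    "subsample_max_sequences": 3000,
    "exclude": ["defaults/exclude.txt"],
    "include": ["defaults/include.txt"],
}

filter_args = sample_to_filter_args(region)
print(filter_args)
```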

Points to consider:

- Is the config schema intuitive?
- Is the config schema sufficiently documented?
- Is it easy to debug a config schema validation error?
- Is it easy to debug an error in one of the underlying `augur filter` calls?
- Do relative filepaths get passed through properly?
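
On the validation question, one property that tends to help debugging is an error message that names the exact location of the offending key. The sketch below is a hand-rolled illustration of that idea only. `SAMPLE_SCHEMA` is a hypothetical subset of options taken from the example config, not the real schema, and the message wording is an assumption.

```python
# Hypothetical subset of allowed sample options and their expected types.
SAMPLE_SCHEMA = {
    "query": str,
    "query_columns": list,
    "min_length": int,
    "group_by": list,
    "subsample_max_sequences": int,
    "exclude": list,
    "include": list,
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    samples = config.get("samples")
    if not isinstance(samples, dict) or not samples:
        return ["samples: expected a non-empty mapping of sample names to options"]

    errors = []
    for name, options in samples.items():
        if not isinstance(options, dict):
            errors.append(f"samples.{name}: expected a mapping of options")
            continue
        for key, value in options.items():
            location = f"samples.{name}.{key}"
            expected = SAMPLE_SCHEMA.get(key)
            if expected is None:
                errors.append(f"{location}: unknown option")
            elif isinstance(value, bool) or not isinstance(value, expected):
                # bool is rejected explicitly because it is a subclass of int.
                errors.append(
                    f"{location}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
    return errors


problems = validate_config(
    {"samples": {"region": {"min_length": "8200", "groupby": ["region"]}}}
)
for problem in problems:
    print(problem)
```

Here a quoted number and a misspelled key are each reported with their full path, for example `samples.region.groupby`.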

Features to be considered after initial release:

- https://github.com/nextstrain/ncov/issues/816

## Links

- [2021-08-17 take 1 by James](https://github.com/nextstrain/augur/pull/762)
- [2022-01-12 discussion during Nextstrain meeting](https://docs.google.com/document/d/1wMHMP9Oi5iqBMiPfqkVT_VMczIze7AHR-RFHFCkJniA/edit) and [Slack thread](https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1642027484048500)
- [2022-11-03 Slack thread on re-visiting `augur subsample` with proximities](https://bedfordlab.slack.com/archives/C021ZG2P8NN/p1667503063638959?thread_ts=1667502850.978699&cid=C021ZG2P8NN)
- [2024-02-25 Slack thread](https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1708902454228779)
- [2024-04-12 take 2 by James/Victor](https://github.com/nextstrain/augur/pull/1444)

---

# Original issue

A common use case is versatile subsampling of datasets to suit a particular research question. The current best example of this is the (wonderful) SARS-CoV-2 pipeline, which leverages an [augur filter rule](https://github.com/nextstrain/ncov/blob/master/workflow/snakemake_rules/main_workflow.smk#L352-L403), a [script to calculate priorities](https://github.com/nextstrain/ncov/blob/master/scripts/priorities.py) and [snakemake](https://github.com/nextstrain/ncov/blob/21e8b4297f3725cae0fe6e41b800ac96d9798a28/workflow/snakemake_rules/main_workflow.smk#L275) [wizardry](https://github.com/nextstrain/ncov/blob/21e8b4297f3725cae0fe6e41b800ac96d9798a28/workflow/snakemake_rules/main_workflow.smk#L432) to allow versatile, declarative subsampling schemes to be [simply and intuitively defined](https://github.com/nextstrain/ncov/blob/master/my_profiles/example_advanced_customization/builds.yaml#L95-L134).

This allows a simple-to-reason-with YAML file to result in a very bespoke subsampling scheme:
![image](https://user-images.githubusercontent.com/8350992/100815202-457e7e80-34a8-11eb-88d9-331108087513.png)

The question arises: how do we do this for a different pathogen?

As the SARS-CoV-2 example leverages snakemake, one solution would be to abstract that logic into an importable snakemake rule. The alternative approach would be a new augur command, `augur subsample`, which takes a YAML file declaring the desired subsampling settings. Learning from our work on nCoV, this would essentially replace the snakemake-controlled `augur filter` commands with a single `augur subsample` command. The YAML file would look similar or identical to the [current snakemake implementation](https://github.com/nextstrain/ncov/blob/master/my_profiles/example_advanced_customization/builds.yaml#L95-L134). The subcommand would leverage the functions used by `augur filter` as well as the priorities script from nCoV.

Thoughts?


**Examples**  

`subsampling.yaml`:
```yaml
schemes:
  switzerland:
    # Focal samples for country
    country:
      group_by: "division year month"
      max_sequences: 1500
      exclude: "--exclude-where 'country!={country}'"
    # Contextual samples from country's region
    region:
      group_by: "country year month"
      seq_per_group: 20
      exclude: "--exclude-where 'country={country}' 'region!={region}'"
      priorities:
        type: "proximity"
        focus: "country"
    # Contextual samples from the rest of the world,
    # excluding the current region to avoid resampling.
    global:
      group_by: "country year month"
      seq_per_group: 10
      exclude: "--exclude-where 'region={region}'"
      priorities:
        type: "proximity"
        focus: "country"
```

```
augur subsample --include <TXT> --sequences <FASTA> \
    --metadata <TSV> --schemes <YAML> --output <FASTA>
```
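
The `exclude` strings in the scheme above contain placeholders such as `{country}` and `{region}` that need to be filled in before the arguments reach `augur filter`. A minimal sketch of one way to do that follows, assuming plain `str.format` substitution followed by shell-style splitting. The helper name `expand_exclude` is hypothetical, and this is not necessarily how the nCoV workflow resolves the placeholders.

```python
import shlex


def expand_exclude(template, **values):
    """Fill `{placeholder}` fields, then split the result as a shell would."""
    return shlex.split(template.format(**values))


# The `exclude` string of the `region` scheme, filled in for Switzerland.
exclude_args = expand_exclude(
    "--exclude-where 'country={country}' 'region!={region}'",
    country="Switzerland",
    region="Europe",
)
print(exclude_args)
```

Splitting with `shlex` removes the single quotes, so each condition arrives as its own argument.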
