
Commit cf79e41

Merge pull request #814 from nextstrain/move-filter
Move filter after subsampling
2 parents 9d45734 + d04c639 commit cf79e41

19 files changed: 287 additions & 424 deletions


.github/workflows/preprocess-gisaid.yml

Lines changed: 0 additions & 70 deletions
This file was deleted.

.github/workflows/preprocess-open.yml

Lines changed: 0 additions & 70 deletions
This file was deleted.

docs/dev_docs.md

Lines changed: 3 additions & 21 deletions
@@ -73,27 +73,9 @@ We do not release new minor versions for new features, but you should document n
 The "core" nextstrain builds consist of a global analysis and six regional analyses, performed independently for GISAID data and open data (currently open data is GenBank data).
 Stepping back, the process can be broken into three steps:
 1. Ingest and curation of raw data. This is performed by the [ncov-ingest](https://github.com/nextstrain/ncov-ingest/) repo and resulting files are uploaded to S3 buckets.
-2. Preprocessing of data (alignment, masking and QC filtering). This is performed by the profiles `nextstrain_profiles/nextstrain-open-preprocess` and `nextstrain_profiles/nextstrain-gisaid-preprocess`. The resulting files are uploaded to S3 buckets by the `upload` rule.
-3. Phylogenetic builds, which start from the files produced by the previous step. This is performed by the profiles `nextstrain_profiles/nextstrain-open` and `nextstrain_profiles/nextstrain-gisaid`. The resulting files are uploaded to S3 buckets by the `upload` rule.
+2. Phylogenetic builds, which start from the files produced by the previous step. This is performed by the profiles `nextstrain_profiles/nextstrain-open` and `nextstrain_profiles/nextstrain-gisaid`. The resulting files are uploaded to S3 buckets by the `upload` rule.
 
 
-### Manually running preprocessing
-
-To run these pipelines without uploading the results:
-```sh
-snakemake -pf results/filtered_open.fasta.xz --profile nextstrain_profiles/nextstrain-open-preprocess
-snakemake -pf results/filtered_gisaid.fasta.xz --profile nextstrain_profiles/nextstrain-gisaid-preprocess
-```
-
-If you wish to upload the resulting information, you should run the `upload` rule.
-Optionally, you may wish to define a specific `S3_DST_BUCKET` to avoid overwriting the files already present on the S3 buckets:
-```sh
-snakemake -pf upload --profile nextstrain_profiles/nextstrain-open-preprocess \
---config S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME
-snakemake -pf upload --profile nextstrain_profiles/nextstrain-gisaid-preprocess \
---config S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME
-```
-
 ### Manually running phylogenetic builds
 
 To run these pipelines locally, without uploading the results:
@@ -111,13 +93,13 @@ You may wish to overwrite these parameters for your local runs to avoid overwrit
 For instance, here are the commands used by the trial builds action (see below):
 ```sh
 snakemake -pf upload deploy \
---profile nextstrain_profiles/nextstrain-open-preprocess \
+--profile nextstrain_profiles/nextstrain-open \
 --config \
 S3_DST_BUCKET=nextstrain-staging/files/ncov/open/trial/TRIAL_NAME \
 deploy_url=s3://nextstrain-staging/ \
 auspice_json_prefix=ncov_open_trial_TRIAL_NAME
 snakemake -pf upload deploy \
---profile nextstrain_profiles/nextstrain-gisaid-preprocess \
+--profile nextstrain_profiles/nextstrain-gisaid \
 --config \
 S3_DST_BUCKET=nextstrain-ncov-private/trial/TRIAL_NAME \
 deploy_url=s3://nextstrain-staging/ \

docs/src/analysis/orientation-files.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ We'll walk through all of the files one by one, but here are the most important
 ## Output files and directories
 
 * `auspice/<build_name>.json`: output file for visualization in Auspice where `<build_name>` is the name of your build in the workflow configuration file.
-* `results/aligned.fasta`, `results/filtered.fasta`, etc.: raw results files (dependencies) that are shared across all builds.
+* `results/aligned.fasta`, etc.: raw results files (dependencies) that are shared across all builds.
 * `results/<build_name>/`: raw results files (dependencies) that are specific to a single build.
 * `logs/`: Log files with error messages and other information about the run.
 * `benchmarks/`: Run-times (and memory usage on Linux systems) for each rule in the workflow.

docs/src/reference/change_log.md

Lines changed: 4 additions & 0 deletions
@@ -3,6 +3,10 @@
 As of April 2021, we use major version numbers (e.g. v2) to reflect backward incompatible changes to the workflow that likely require you to update your Nextstrain installation.
 We also use this change log to document new features that maintain backward compatibility, indicating these features by the date they were added.
 
+## v10 (January 2022)
+
+- Move filter and diagnostic steps after subsampling. For workflows with subsampling that does not depend on priority calculations, these changes allow the workflow to start subsampling from the metadata, skipping sequence alignment of the full input sequences and only looping through these input sequences once per build when subsampled sequences are extracted. To skip the alignment step, define your input sequences with the `aligned` directive. If you use priority-based subsampling, define your input sequences with the `sequences` directive. This reorganization of the workflow causes a breaking change in that the workflow no longer supports input-specific filtering with the `exclude_where`, `min_date`, and `exclude_ambiguous_dates_by` parameters. The workflow continues to support input-specific filtering by `min_length` and skipping of diagnostic filters with `skip_diagnostics`. [PR #814](https://github.com/nextstrain/ncov/pull/814).
+
 ## New features since last version update
 
 - 20 December 2021: Surface the crowding penalty parameter via the config file: [PR #828](https://github.com/nextstrain/ncov/pull/827), [Issue #708](https://github.com/nextstrain/ncov/issues/708). The crowding penalty, used when calculating `priority scores` during subsampling, decreases the number of identical samples that are included in the tree during random subsampling to provide a broader picture of the viral diversity in your dataset. However, you may wish to set `crowding_penalty = 0.0` (default value = `0.1`) if you are interested in seeing as many samples as possible that are closely related to your `focal` set. You can change this parameter via `config['priorities']['crowding_penalty']`. There is no change to default behavior.
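
Reviewer note: the v10 entry above asks users to pick between the `aligned` and `sequences` directives depending on how they subsample. A minimal sketch of what that could look like in a build configuration follows. The input names and file paths are placeholders borrowed from the documentation examples in this diff (the `data/example_metadata.tsv.xz` path is assumed for illustration), not files that necessarily exist.

```yaml
# Hypothetical builds.yaml fragment; names and paths are placeholders.
inputs:
  # Subsampling that does not depend on priority calculations:
  # declare the sequences with `aligned` so the workflow can start
  # subsampling from the metadata and skip the full alignment.
  - name: prealigned-data
    metadata: data/other_metadata.tsv.xz
    aligned: data/other_aligned.fasta.xz
  # Priority-based subsampling: declare the sequences with `sequences`.
  - name: example-data
    metadata: data/example_metadata.tsv.xz
    sequences: data/example_sequences.fasta.xz
```

An input that previously used the `filtered` directive would need to switch to one of these two keys, since this commit removes `filtered` from the accepted input attributes.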

docs/src/reference/configuration.md

Lines changed: 3 additions & 15 deletions
@@ -273,7 +273,7 @@ Builds support any named attributes that can be referenced by subsampling scheme
 * required
 * `name`
 * `metadata`
-* `sequences` or `aligned` or `filtered`
+* `sequences` or `aligned`
 * examples:
 ```yaml
 inputs:
@@ -283,9 +283,6 @@ inputs:
   - name: prealigned-data
     metadata: data/other_metadata.tsv.xz
     aligned: data/other_aligned.fasta.xz
-  - name: prealigned-and-filtered-data
-    metadata: data/other_metadata.tsv.xz
-    filtered: data/other_filtered.fasta.xz
 ```
 
 Valid attributes for list entries in `inputs` are provided below.
@@ -310,7 +307,7 @@ Valid attributes for list entries in `inputs` are provided below.
 
 ### sequences
 * type: string
-* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **_un_aligned and _un_filtered** genome sequences. Sequences can be uncompressed or compressed.
+* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **_un_aligned** genome sequences. Sequences can be uncompressed or compressed.
 * examples:
 * `data/example_sequences.fasta`
 * `data/example_sequences.fasta.xz`
319316

320317
### aligned
321318
* type: string
322-
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned and _un_filtered** genome sequences. Sequences can be uncompressed or compressed.
319+
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned** genome sequences. Sequences can be uncompressed or compressed.
323320
* examples:
324321
* `data/aligned.fasta`
325322
* `data/aligned.fasta.xz`
326323
* `s3://your-bucket/aligned.fasta.gz`
327324
* `https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz`
328325

329-
### filtered
330-
* type: string
331-
* description: Path to a local or remote (S3, HTTP(S), GS) FASTA file with **aligned and filtered** genome sequences. Sequences can be uncompressed or compressed.
332-
* examples:
333-
* `data/filtered.fasta`
334-
* `data/filtered.fasta.xz`
335-
* `s3://your-bucket/filtered.fasta.gz`
336-
* `https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz`
337-
338326
## localrules
339327
* type: string
340328
* description: Path to a Snakemake file to include in the workflow. This parameter is redundant with `custom_rules` and may be deprecated soon.

docs/src/reference/remote_inputs.md

Lines changed: 0 additions & 4 deletions
@@ -41,7 +41,6 @@ A side-effect of this is the creation and upload of processed versions of the en
 
 * `aligned.fasta.xz` alignment via [nextalign](https://github.com/nextstrain/nextclade/tree/master/packages/nextalign_cli). The default reference genome is [MN908947](https://www.ncbi.nlm.nih.gov/nuccore/MN908947) (Wuhan-Hu-1).
 * `mutation-summary.tsv.xz` A summary of the data in `aligned.fasta.xz`.
-* `filtered.fasta.xz` The alignment excluding data with incomplete / invalid dates, unexpected genome lengths, missing metadata etc. We also maintain a [list of sequences to exclude](https://github.com/nextstrain/ncov/blob/master/defaults/exclude.txt) which are removed at this step. These sequences represent duplicates, outliers in terms of divergence or sequences with faulty metadata.
 
 ## Subsampled datasets
 
@@ -71,7 +70,6 @@ This means that the full GenBank metadata and sequences are typically updated a
 | Full GenBank data | metadata | https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz |
 | | sequences | https://data.nextstrain.org/files/ncov/open/sequences.fasta.xz |
 | | aligned | https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz |
-| | filtered | https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz |
 | Global sample | metadata | https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz |
 | | sequences | https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz |
 | | aligned | https://data.nextstrain.org/files/ncov/open/global/aligned.fasta.xz |
@@ -138,8 +136,6 @@ inputs:
 The following starting points are available:
 
 * replace `sequences` with `aligned` (skips alignment)
-* replace `sequences` with `filtered` (skips alignment and basic filtering steps)
-
 
 ## Compressed vs uncompressed starting points
 

nextstrain_profiles/nextstrain-gisaid-preprocess/builds.yaml

Lines changed: 0 additions & 22 deletions
This file was deleted.

nextstrain_profiles/nextstrain-gisaid-preprocess/config.yaml

Lines changed: 0 additions & 10 deletions
This file was deleted.

nextstrain_profiles/nextstrain-gisaid/builds.yaml

Lines changed: 3 additions & 4 deletions
@@ -17,13 +17,12 @@ upload:
 genes: ["ORF1a", "ORF1b", "S", "ORF3a", "E", "M", "ORF6", "ORF7a", "ORF7b", "ORF8", "N", "ORF9b"]
 use_nextalign: true
 
-# Note: we have a separate profile for aligning GISAID sequences. This is triggered
-# as soon as new sequences are available. This workflow is thus intended to be
-# started from the filtered alignment. james, sept 2021
+# Note: unaligned sequences are provided as "aligned" sequences to avoid an initial full-DB alignment
+# as we re-align everything after subsampling.
 inputs:
   - name: gisaid
     metadata: "s3://nextstrain-ncov-private/metadata.tsv.gz"
-    filtered: "s3://nextstrain-ncov-private/filtered.fasta.xz"
+    aligned: "s3://nextstrain-ncov-private/sequences.fasta.xz"
 
 # Define locations for which builds should be created.
 # For each build we specify a subsampling scheme via an explicit key.
