-
Notifications
You must be signed in to change notification settings - Fork 9
Add 2025 outbreak dataset #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
2334ab7 to
c4c0238
Compare
e2a0d72 to
520f757
Compare
c4c0238 to
dfdaca0
Compare
31dbd22 to
91dabec
Compare
cdab432 to
2a60133
Compare
91dabec to
fbefcd1
Compare
2a60133 to
de1fc5e
Compare
fbefcd1 to
707ff4c
Compare
de1fc5e to
7a5bf2b
Compare
7a5bf2b to
5771ba8
Compare
5771ba8 to
6f06ddb
Compare
|
@victorlin I rebased this onto main and added a bunch of commits to make the workflow more ergonomic. None of the changes here are scientific, they're just technical. (The scientific stuff will come tomorrow.) I misunderstood #36 so some of this takes things in a different direction - happy to discuss! |
|
@jameshadfield thanks! Briefly skimmed and I think those are overall good directions for phylogenetic workflows in general. |
Disable the west-africa-2014 dataset while we focus on this one for now. Still keeping its config and files around as a reference for what's currently hosted on nextstrain.org/ebola, though re-running it will no longer produce the same dataset due to the change in input data.
workaround for an augur filter bug that doesn't treat XXXX-XX-XX as a "missing date" for date filters (FIXME: write up the bug)
Keep the working directory from where we invoke snakemake from (or which we set via `--directory`). Note that I expect the working directory to be `phylogenetic` when run within the ebola repo or an external analysis directory (not yet supported). (This commit breaks the workflow but subsequent commits will fix it)
Runnable via a number of ways: **From within the phylogenetic directory** 1. `snakemake --snakefile outbreak-specific/Snakefile ...` 2. `nextstrain build . --snakefile outbreak-specific/Snakefile ...` **From within the (base) ebola directory** 3. `nextstrain build phylogenetic --snakefile outbreak-specific/Snakefile ...` **From within an external analysis directory** 4. `snakemake --snakefile <ebola>/outbreak-specific/Snakefile ...`
6f06ddb to
ddf65ed
Compare
|
Rebased onto main and added some scientific changes. The Ebov-2025 dataset now looks good and is ready to go should we get more outbreak genomes. At this time this virological post is a better summary of the genomic situation. The West-African dataset is still very much a work-in-progress, and I'm not sure there's any pressing need to update what's currently at nextstrain.org/ebola, but I think it does serve as a good baseline check of how the workflow is behaving which is why I pushed it along a little here. I think this PR is much improved cf. |
victorlin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Briefly skimmed and I think those are overall good directions for phylogenetic workflows in general.
But I have some second thoughts after a more thorough review, particularly around configs.
The changes for the 2025 outbreak dataset look good though. If you want to rebase my old in-progress commits as you see fit, I think those can be merged.
| align: | ||
| reference: &Yambuku_Mayinga_Reference outbreak-specific/drc-2025/NC_002549.1.gb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: ca77a27
This duplicates the "reference" path but YAML anchors make this ok
I don't think this is a good pattern for users of --configfile, since it means they would have to override multiple config values instead of just one. If a user overrides align.reference but not translate.reference, I imagine the workflow will run fine but produce unexpected/inaccurate results. We could add error handling to warn/error if align.reference and translate.reference are the same file, but if there is an expectation that they are the same, then it should just be a single config value in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I've dropped the duplication & anchors from this work.
When we move to small-multiple (expanded, pre-processed) configs we will want this duplication and so we'll have to consider whether to duplicate it in the user-facing config or as part of the pre-processing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: abbffff
This has the slight oddity that the alignment reference is used for translation as well, which is non-obvious from the config structure.
From a user perspective, this would be a case for keeping it as files.reference.
Rule-specific config paths like align.reference are clarifying for workflow authors, but I think they expose unnecessary implementation details to workflow users. What do you think about keeping everything under files and translating them to rule-specific paths in the config preprocessing step?
We've used 'files' ~forever but I find rule-specific config paths much more interpretable. This has the slight oddity that the alignment reference is used for translation as well, which is non-obvious from the config structure. I explored duplicating this (via YAML anchors) into `config.translate.reference` but this exposed the footgun of using different references for different steps -- see <#19 (comment)>
to be more ergonomic for the config style used in this repo
to use the outbreak metadata (assigned via Nextclade in ingest), which follows the approach we use in the all-outbreaks dataset where we use this data to subsample on. I played around with `--exclude-all` and `--include-where` arguments but found them too limited as I also wanted to exclude lab hosts, whereas the `--query` argument is both powerful enough to do it all itself if needed and also doesn't need the `--exclude-all` argument. Finally the verbosity of the `rule filter` is striking. Ideally we want the config YAML to be able to express anything that `augur filter` can handle, but currently we need a lot of conditional logic in the snakemake rule to allow this. I believe the `augur subsample` config YAML will give us this.
Colors taken directly from https://github.com/nextstrain/ebola/blob/0a9401b6e0d4220cdbc3dcb564a3085c1e518864/config/colors.tsv More colors need to be added for missing countries / divisions, and this also flags up some mis-spellings of division (e.g. 'Montserrado' is also spelt 'Montesserrado' and 'Monstserrado')
also updates how we pass through parameters from the config to `augur refine` to allow more control
ddf65ed to
0db9dfd
Compare
|
I'm going to keep this open for another day or so - hopefully we can make a decision about the dataset names and update those here too. |
Description of proposed changes
This PR adds a dataset with data from the 2025 outbreak.
Private preview: https://nextstrain.org/groups/blab-private/ebola/drc-2025
Related issue(s)
Checklist