Add 2025 outbreak dataset #19

victorlin · 2025-09-11T21:53:37Z

Description of proposed changes

This PR adds a dataset with data from the 2025 outbreak.

Private preview: https://nextstrain.org/groups/blab-private/ebola/drc-2025

Related issue(s)

Checklist

Merge base PR Move top-level phylogenetic workflow to outbreak-specific #36
Decide on clock rate (presumably it can't it be inferred yet). Should talk to JT
Decide on tree root
Check location metadata coverage & accuracy
- We can use a manual annotations TSV for any changes here and talk to Eddy while there are still so few actual samples to confirm things
Checks pass
Update changelog

jameshadfield · 2025-09-25T04:15:28Z

@victorlin I rebased this onto main and added a bunch of commits to make the workflow more ergonomic. None of the changes here are scientific, they're just technical. (The scientific stuff will come tomorrow.) I misunderstood #36 so some of this takes things in a different direction - happy to discuss!

victorlin · 2025-09-25T19:44:37Z

@jameshadfield thanks! Briefly skimmed and I think those are overall good directions for phylogenetic workflows in general.

Disable the west-africa-2014 dataset while we focus on this one for now. Still keeping its config and files around as a reference for what's currently hosted on nextstrain.org/ebola, though re-running it will no longer produce the same dataset due to the change in input data.

workaround for an augur filter bug that doesn't treat XXXX-XX-XX as a "missing date" for date filters (FIXME: write up the bug)

Keep the working directory from where we invoke snakemake from (or which we set via `--directory`). Note that I expect the working directory to be `phylogenetic` when run within the ebola repo or an external analysis directory (not yet supported). (This commit breaks the workflow but subsequent commits will fix it)

Runnable via a number of ways: **From within the phylogenetic directory** 1. `snakemake --snakefile outbreak-specific/Snakefile ...` 2. `nextstrain build . --snakefile outbreak-specific/Snakefile ...` **From within the (base) ebola directory** 3. `nextstrain build phylogenetic --snakefile outbreak-specific/Snakefile ...` **From within an external analysis directory** 4. `snakemake --snakefile <ebola>/outbreak-specific/Snakefile ...`

jameshadfield · 2025-09-29T03:34:46Z

Rebased onto main and added some scientific changes.

The Ebov-2025 dataset now looks good and is ready to go should we get more outbreak genomes. At this time this virological post is a better summary of the genomic situation.

The West-African dataset is still very much a work-in-progress, and I'm not sure there's any pressing need to update what's currently at nextstrain.org/ebola, but I think it does serve as a good baseline check of how the workflow is behaving which is why I pushed it along a little here.

I think this PR is much improved cf. main, so I'd like to merge (after review)

victorlin

I previously said:

Briefly skimmed and I think those are overall good directions for phylogenetic workflows in general.

But I have some second thoughts after a more thorough review, particularly around configs.

The changes for the 2025 outbreak dataset look good though. If you want to rebase my old in-progress commits as you see fit, I think those can be merged.

phylogenetic/rules/prepare_sequences.smk

victorlin · 2025-10-01T21:39:47Z

phylogenetic/outbreak-specific/config.yaml

+    align:
+      reference: &Yambuku_Mayinga_Reference outbreak-specific/drc-2025/NC_002549.1.gb


Re: ca77a27

This duplicates the "reference" path but YAML anchors make this ok

I don't think this is a good pattern for users of --configfile, since it means they would have to override multiple config values instead of just one. If a user overrides align.reference but not translate.reference, I imagine the workflow will run fine but produce unexpected/inaccurate results. We could add error handling to warn/error if align.reference and translate.reference are the same file, but if there is an expectation that they are the same, then it should just be a single config value in the first place.

Fair enough. I've dropped the duplication & anchors from this work.

When we move to small-multiple (expanded, pre-processed) configs we will want this duplication and so we'll have to consider whether to duplicate it in the user-facing config or as part of the pre-processing code.

Re: abbffff

This has the slight oddity that the alignment reference is used for translation as well, which is non-obvious from the config structure.

From a user perspective, this would be a case for keeping it as files.reference.

Rule-specific config paths like align.reference are clarifying for workflow authors, but I think they expose unnecessary implementation details to workflow users. What do you think about keeping everything under files and translating them to rule-specific paths in the config preprocessing step?

phylogenetic/rules/config.smk

phylogenetic/rules/prepare_sequences.smk

We've used 'files' ~forever but I find rule-specific config paths much more interpretable. This has the slight oddity that the alignment reference is used for translation as well, which is non-obvious from the config structure. I explored duplicating this (via YAML anchors) into `config.translate.reference` but this exposed the footgun of using different references for different steps -- see <#19 (comment)>

to be more ergonomic for the config style used in this repo

to use the outbreak metadata (assigned via Nextclade in ingest), which follows the approach we use in the all-outbreaks dataset where we use this data to subsample on. I played around with `--exclude-all` and `--include-where` arguments but found them too limited as I also wanted to exclude lab hosts, whereas the `--query` argument is both powerful enough to do it all itself if needed and also doesn't need the `--exclude-all` argument. Finally the verbosity of the `rule filter` is striking. Ideally we want the config YAML to be able to express anything that `augur filter` can handle, but currently we need a lot of conditional logic in the snakemake rule to allow this. I believe the `augur subsample` config YAML will give us this.

Colors taken directly from https://github.com/nextstrain/ebola/blob/0a9401b6e0d4220cdbc3dcb564a3085c1e518864/config/colors.tsv More colors need to be added for missing countries / divisions, and this also flags up some mis-spellings of division (e.g. 'Montserrado' is also spelt 'Montesserrado' and 'Monstserrado')

also updates how we pass through parameters from the config to `augur refine` to allow more control

jameshadfield · 2025-10-05T22:43:07Z

I'm going to keep this open for another day or so - hopefully we can make a decision about the dataset names and update those here too.

victorlin self-assigned this Sep 11, 2025

victorlin force-pushed the victorlin/2025-outbreak branch from 2334ab7 to c4c0238 Compare September 11, 2025 23:43

victorlin force-pushed the victorlin/all-outbreaks branch from e2a0d72 to 520f757 Compare September 11, 2025 23:43

victorlin force-pushed the victorlin/2025-outbreak branch from c4c0238 to dfdaca0 Compare September 12, 2025 00:19

victorlin force-pushed the victorlin/all-outbreaks branch 2 times, most recently from 31dbd22 to 91dabec Compare September 12, 2025 00:27

victorlin force-pushed the victorlin/2025-outbreak branch 2 times, most recently from cdab432 to 2a60133 Compare September 12, 2025 19:03

victorlin force-pushed the victorlin/all-outbreaks branch from 91dabec to fbefcd1 Compare September 12, 2025 19:03

victorlin force-pushed the victorlin/2025-outbreak branch from 2a60133 to de1fc5e Compare September 13, 2025 01:56

victorlin force-pushed the victorlin/all-outbreaks branch from fbefcd1 to 707ff4c Compare September 13, 2025 01:56

victorlin force-pushed the victorlin/2025-outbreak branch from de1fc5e to 7a5bf2b Compare September 13, 2025 02:10

victorlin changed the base branch from victorlin/all-outbreaks to main September 13, 2025 02:10

victorlin assigned jameshadfield Sep 13, 2025

victorlin force-pushed the victorlin/2025-outbreak branch from 7a5bf2b to 5771ba8 Compare September 19, 2025 17:13

victorlin changed the base branch from main to victorlin/outbreak-specific September 19, 2025 17:13

Base automatically changed from victorlin/outbreak-specific to main September 21, 2025 22:46

jameshadfield force-pushed the victorlin/2025-outbreak branch from 5771ba8 to 6f06ddb Compare September 25, 2025 04:13

joverlee521 mentioned this pull request Sep 25, 2025

How should we standardize config schema for files? nextstrain/pathogen-repo-guide#26

Open

victorlin and others added 8 commits September 29, 2025 16:07

🚧 filter: Exclude sequences with ambiguous month

b27d1ba

workaround for an augur filter bug that doesn't treat XXXX-XX-XX as a "missing date" for date filters (FIXME: write up the bug)

⛔️ add warning banner

fbb04c3

⛔️ set build_url to PR link

7c60f7f

Improve how config files are loaded

1471dcc

Use inputs config interface

981215e

jameshadfield force-pushed the victorlin/2025-outbreak branch from 6f06ddb to ddf65ed Compare September 29, 2025 03:32

jameshadfield marked this pull request as ready for review September 29, 2025 23:32

victorlin commented Oct 1, 2025

View reviewed changes

jameshadfield added 6 commits October 6, 2025 11:26

Update conditional helper function

fa7aa93

to be more ergonomic for the config style used in this repo

[ebov-2025] root tree

272b5e6

also updates how we pass through parameters from the config to `augur refine` to allow more control

update readme

0db9dfd

jameshadfield force-pushed the victorlin/2025-outbreak branch from ddf65ed to 0db9dfd Compare October 5, 2025 22:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add 2025 outbreak dataset #19

Add 2025 outbreak dataset #19

Uh oh!

victorlin commented Sep 11, 2025 •

edited

Loading

Uh oh!

jameshadfield commented Sep 25, 2025

Uh oh!

victorlin commented Sep 25, 2025

Uh oh!

jameshadfield commented Sep 29, 2025 •

edited

Loading

Uh oh!

victorlin left a comment

Uh oh!

Uh oh!

victorlin Oct 1, 2025

Uh oh!

jameshadfield Oct 5, 2025 •

edited

Loading

Uh oh!

victorlin Oct 6, 2025

Uh oh!

Uh oh!

Uh oh!

jameshadfield commented Oct 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		align:
		reference: &Yambuku_Mayinga_Reference outbreak-specific/drc-2025/NC_002549.1.gb

Add 2025 outbreak dataset #19

Are you sure you want to change the base?

Add 2025 outbreak dataset #19

Uh oh!

Conversation

victorlin commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of proposed changes

Related issue(s)

Checklist

Uh oh!

jameshadfield commented Sep 25, 2025

Uh oh!

victorlin commented Sep 25, 2025

Uh oh!

jameshadfield commented Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

victorlin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

victorlin Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

jameshadfield Oct 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

victorlin Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jameshadfield commented Oct 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

victorlin commented Sep 11, 2025 •

edited

Loading

jameshadfield commented Sep 29, 2025 •

edited

Loading

jameshadfield Oct 5, 2025 •

edited

Loading