Skip to content

Conversation

@victorlin
Copy link
Member

@victorlin victorlin commented Sep 11, 2025

Description of proposed changes

This PR adds a dataset with data from the 2025 outbreak.

Private preview: https://nextstrain.org/groups/blab-private/ebola/drc-2025

Related issue(s)

Checklist

  • Merge base PR Move top-level phylogenetic workflow to outbreak-specific #36
  • Decide on clock rate (presumably it can't it be inferred yet). Should talk to JT
  • Decide on tree root
  • Check location metadata coverage & accuracy
    • We can use a manual annotations TSV for any changes here and talk to Eddy while there are still so few actual samples to confirm things
  • Checks pass
  • Update changelog

@victorlin victorlin self-assigned this Sep 11, 2025
@victorlin victorlin force-pushed the victorlin/2025-outbreak branch from 2334ab7 to c4c0238 Compare September 11, 2025 23:43
@victorlin victorlin force-pushed the victorlin/all-outbreaks branch from e2a0d72 to 520f757 Compare September 11, 2025 23:43
@victorlin victorlin force-pushed the victorlin/2025-outbreak branch from c4c0238 to dfdaca0 Compare September 12, 2025 00:19
@victorlin victorlin force-pushed the victorlin/all-outbreaks branch 2 times, most recently from 31dbd22 to 91dabec Compare September 12, 2025 00:27
@victorlin victorlin force-pushed the victorlin/2025-outbreak branch 2 times, most recently from cdab432 to 2a60133 Compare September 12, 2025 19:03
@victorlin victorlin force-pushed the victorlin/all-outbreaks branch from 91dabec to fbefcd1 Compare September 12, 2025 19:03
@victorlin victorlin force-pushed the victorlin/2025-outbreak branch from 2a60133 to de1fc5e Compare September 13, 2025 01:56
@victorlin victorlin force-pushed the victorlin/all-outbreaks branch from fbefcd1 to 707ff4c Compare September 13, 2025 01:56
@victorlin victorlin force-pushed the victorlin/2025-outbreak branch from de1fc5e to 7a5bf2b Compare September 13, 2025 02:10
@victorlin victorlin changed the base branch from victorlin/all-outbreaks to main September 13, 2025 02:10
@victorlin victorlin force-pushed the victorlin/2025-outbreak branch from 7a5bf2b to 5771ba8 Compare September 19, 2025 17:13
@victorlin victorlin changed the base branch from main to victorlin/outbreak-specific September 19, 2025 17:13
Base automatically changed from victorlin/outbreak-specific to main September 21, 2025 22:46
@jameshadfield jameshadfield force-pushed the victorlin/2025-outbreak branch from 5771ba8 to 6f06ddb Compare September 25, 2025 04:13
@jameshadfield
Copy link
Member

@victorlin I rebased this onto main and added a bunch of commits to make the workflow more ergonomic. None of the changes here are scientific, they're just technical. (The scientific stuff will come tomorrow.) I misunderstood #36 so some of this takes things in a different direction - happy to discuss!

@victorlin
Copy link
Member Author

@jameshadfield thanks! Briefly skimmed and I think those are overall good directions for phylogenetic workflows in general.

victorlin and others added 8 commits September 29, 2025 16:07
Disable the west-africa-2014 dataset while we focus on this one for now.
Still keeping its config and files around as a reference for what's
currently hosted on nextstrain.org/ebola, though re-running it will no
longer produce the same dataset due to the change in input data.
workaround for an augur filter bug that doesn't treat XXXX-XX-XX as a
"missing date" for date filters (FIXME: write up the bug)
Keep the working directory from where we invoke snakemake from (or which
we set via `--directory`). Note that I expect the working directory
to be `phylogenetic` when run within the ebola repo or an external
analysis directory (not yet supported).

(This commit breaks the workflow but subsequent commits will fix it)
Runnable via a number of ways:

**From within the phylogenetic directory**
  1. `snakemake --snakefile outbreak-specific/Snakefile ...`
  2. `nextstrain build . --snakefile outbreak-specific/Snakefile ...`

**From within the (base) ebola directory**
  3. `nextstrain build phylogenetic --snakefile outbreak-specific/Snakefile ...`

**From within an external analysis directory**
  4. `snakemake --snakefile <ebola>/outbreak-specific/Snakefile ...`
@jameshadfield jameshadfield force-pushed the victorlin/2025-outbreak branch from 6f06ddb to ddf65ed Compare September 29, 2025 03:32
@jameshadfield
Copy link
Member

jameshadfield commented Sep 29, 2025

Rebased onto main and added some scientific changes.

The Ebov-2025 dataset now looks good and is ready to go should we get more outbreak genomes. At this time this virological post is a better summary of the genomic situation.

The West-African dataset is still very much a work-in-progress, and I'm not sure there's any pressing need to update what's currently at nextstrain.org/ebola, but I think it does serve as a good baseline check of how the workflow is behaving which is why I pushed it along a little here.

I think this PR is much improved cf. main, so I'd like to merge (after review)

@jameshadfield jameshadfield marked this pull request as ready for review September 29, 2025 23:32
Copy link
Member Author

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I previously said:

Briefly skimmed and I think those are overall good directions for phylogenetic workflows in general.

But I have some second thoughts after a more thorough review, particularly around configs.


The changes for the 2025 outbreak dataset look good though. If you want to rebase my old in-progress commits as you see fit, I think those can be merged.

Comment on lines 23 to 24
align:
reference: &Yambuku_Mayinga_Reference outbreak-specific/drc-2025/NC_002549.1.gb
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: ca77a27

This duplicates the "reference" path but YAML anchors make this ok

I don't think this is a good pattern for users of --configfile, since it means they would have to override multiple config values instead of just one. If a user overrides align.reference but not translate.reference, I imagine the workflow will run fine but produce unexpected/inaccurate results. We could add error handling to warn/error if align.reference and translate.reference are the same file, but if there is an expectation that they are the same, then it should just be a single config value in the first place.

Copy link
Member

@jameshadfield jameshadfield Oct 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I've dropped the duplication & anchors from this work.

When we move to small-multiple (expanded, pre-processed) configs we will want this duplication and so we'll have to consider whether to duplicate it in the user-facing config or as part of the pre-processing code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: abbffff

This has the slight oddity that the alignment reference is used for translation as well, which is non-obvious from the config structure.

From a user perspective, this would be a case for keeping it as files.reference.

Rule-specific config paths like align.reference are clarifying for workflow authors, but I think they expose unnecessary implementation details to workflow users. What do you think about keeping everything under files and translating them to rule-specific paths in the config preprocessing step?

We've used 'files' ~forever but I find rule-specific config paths much
more interpretable.

This has the slight oddity that the alignment reference is used for
translation as well, which is non-obvious from the config structure. I
explored duplicating this (via YAML anchors) into
`config.translate.reference` but this exposed the footgun of using
different references for different steps -- see
<#19 (comment)>
to be more ergonomic for the config style used in this repo
to use the outbreak metadata (assigned via Nextclade in ingest), which
follows the approach we use in the all-outbreaks dataset where we
use this data to subsample on.

I played around with `--exclude-all` and `--include-where` arguments
but found them too limited as I also wanted to exclude lab hosts,
whereas the `--query` argument is both powerful enough to do it all
itself if needed and also doesn't need the `--exclude-all` argument.

Finally the verbosity of the `rule filter` is striking. Ideally we want
the config YAML to be able to express anything that `augur filter` can
handle, but currently we need a lot of conditional logic in the
snakemake rule to allow this. I believe the `augur subsample` config
YAML will give us this.
Colors taken directly from https://github.com/nextstrain/ebola/blob/0a9401b6e0d4220cdbc3dcb564a3085c1e518864/config/colors.tsv

More colors need to be added for missing countries / divisions, and this also flags up some mis-spellings of division (e.g. 'Montserrado' is also spelt 'Montesserrado' and 'Monstserrado')
also updates how we pass through parameters from the config to
`augur refine` to allow more control
@jameshadfield jameshadfield force-pushed the victorlin/2025-outbreak branch from ddf65ed to 0db9dfd Compare October 5, 2025 22:33
@jameshadfield
Copy link
Member

I'm going to keep this open for another day or so - hopefully we can make a decision about the dataset names and update those here too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants