Skip to content

Conversation

@joverlee521
Copy link
Contributor

Description of proposed changes

Workflow updates to support nextstrain run. Since there are multiple builds, I've created a separate entrypoint/Snakefile per build.

This allows users to use nextstrain run without know the internal config file locations:

nextstrain run mpox phylogenetic/clade-i <analysis-dir>

For users who still want to use nextstrain build, the command is now

nextstrain build phylogenetic -s clade-i/Snakefile

Discussion

Currently a draft PR mainly for discussion of the changes needed to support nextstrain run.

I want to mainly gather thoughts on the 4 extra directories/Snakefiles as separate workflow entrypoints. I think the long term goal is for nextstrain run to be smarter about picking workflows to run, e.g. imagine defining the workflows within nextstrain-pathogen.yaml.

Related issue(s)

Part of #314

Checklist

  • Checks pass
  • Update changelog

joverlee521 and others added 2 commits July 9, 2025 15:46
Workflow updates to support `nextstrain run`. Since there are multiple builds, 
I've created a separate entrypoint/Snakefile per build.

This allows users to use `nextstrain run` without know the internal config file 
locations:
```
nextstrain run mpox phylogenetic/clade-i <analysis-dir>
```

For users who still want to use `nextstrain build`, the command is now 
```
nextstrain build phylogenetic -s clade-i/Snakefile
```
Copy link
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to mainly gather thoughts on the 4 extra directories/Snakefiles as separate workflow entrypoints. I think the long term goal is for nextstrain run to be smarter about picking workflows to run, e.g. imagine defining the workflows within nextstrain-pathogen.yaml.

Regarding the 4 extra Snakefiles I have two thoughts. Firstly, it's great that each repo has the power to do its own thing, and that it can expose ~any Snakefile (with the necessary path wrangling etc) as an interface to our runner and leverage the power that brings (cloud compute, runtimes etc). Secondly, I imagine you want to decide on a pattern that people can just copy-paste for multi-build workflows and it seems like this approach is in direct competition with using wildcards to parameterise builds. (The long-term plans for nextstrain-pathogen.yaml are opaque to me.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to rename hmpxv1 to clade-iib (which seems fine 🤷 I don't have much context here) then we should go the whole way and rename everything - defaults/hmpxv1 etc. (Perhaps you plan to do this -- it's a draft PR after all -- if so this can just serve as a reminder)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users in office hours and myself have often been confused by the mapping between the config files and the named builds, so I thought the workflow should at least match the build name.

If there's not push back from others here, I'll rename everything.

@@ -0,0 +1,14 @@
configfile: os.path.join(workflow.basedir, "../defaults/hmpxv1/config.yaml")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's cleaner and easier to understand if we co-locate the config.yaml with the Snakefile, i.e.

$ tree phylogenetic/clade-iib
clade-iib
├── Snakefile
└── config.yaml

This raises the question of where to store clade-iib specific config files: phylogenetic/defaults/clade-iib or phylogenetic/clade-iib/defaults (or some variant thereof). If we look at some values in the config YAML (swapping hmpxv1 for clade-iib for clarity) we have things like:

color_scheme: "color_schemes.tsv"
auspice_config: "clade-iib/auspice_config.json"

which isn't very ergonomic from an external users point of view. I think something like the following is cleaner:

color_scheme: "color_schemes.tsv"
auspice_config: "auspice_config.json"
$ tree
phylogenetic
├── defaults
│   └── color_schemes.tsv
├── clade-iib
│   ├── Snakefile
│   ├── config.yaml
│   ├── defaults            # maybe use a defaults subdir?
│   │   └── auspice_config.json
│   └── auspice_config.json # or maybe just do this?

Where you have a path resolution of (1) analysis dir, (2) clade-iib, (3) phylogenetic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, totally see your point about the config not being ergonomic for external users here. On the other hand, as a maintainer, it's no longer clear from

color_scheme: "color_schemes.tsv"
auspice_config: "auspice_config.json"

which files need to be edited to change the config...I am also wary of the splintering of config files, which is making me reconsider the separate Snakefile approach. See proposal in nextstrain/cli#454

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for exploring ways to get different configfile "entrypoints" into nextstrain run beyond creating multiple Snakefiles. I originally did just that for avian-flu, but my solution wasn't as nice as your proposed use of nextstrain-pathogen.yaml which might have been why it didn't get any traction.

Even with the --configfile approach we'll still have subtleties for analysis directory usage such as "color_schemes.tsv" vs "clade-iib/auspice_config.json". I guess the solution here is better per-pathogen docs.

Comment on lines +10 to +26
include: "../../shared/vendored/snakemake/config.smk"


def phylo_resolve_config_path(path: str) -> Callable[[Wildcards], str]:
"""
Wrapper around the shared `resolve_config_path` to force the default directory
to be `phylogenetic/defaults`. This is necessary because the entry point for
each build is nested within phylogenetic (e.g. phylogenetic/clade-i/Snakefile).
"""
PHYLO_DEFAULTS_DIR = os.path.normpath(os.path.join(workflow.current_basedir, "../defaults"))
# Strip the `defaults/` prefix to be backwards compatible with older configs
# This is necessary in this wrapper because we are providing a custom defaults dir
# which skips the handling of the defaults/ prefix within resolve_config_path.
path = path.removeprefix("defaults/")
return resolve_config_path(path, PHYLO_DEFAULTS_DIR)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still wish we got rid of all this magic around defaults/ and the path in the config.yaml was the path used, relative to an ordered set of "base directories" which are easily documented. I understand people don't want external users to have to make a "defaults" directory, but if we used "config" instead I think there'd be less objection.

Secondly I still think phylo_resolve_config_path is overly verbose and leaves the door open for unintentionally using resolve_config_path (which is still in the namespace). I continue to think the following is preferable:

include: "../../shared/vendored/snakemake/config.smk"

_shared_resolve_config_path = resolve_config_path

def resolve_config_path(path: str) -> Callable[[Wildcards], str]:
    return _shared_resolve_config_path(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is only necessary because of the nested Snakefiles. If we keep the main phylogenetic/Snakefile then we could just directly use the shared resolve_config_path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, moving the main Snakefile breaks the pathogen-repo-ci because it expects the phylogenetic/Snakefile...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is why I used pathogen-repo-ci.yaml@v0 for avian-flu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants