-
Notifications
You must be signed in to change notification settings - Fork 23
(WIP) phylogenetic: Support nextstrain run
#327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Workflow updates to support `nextstrain run`. Since there are multiple builds, I've created a separate entrypoint/Snakefile per build. This allows users to use `nextstrain run` without know the internal config file locations: ``` nextstrain run mpox phylogenetic/clade-i <analysis-dir> ``` For users who still want to use `nextstrain build`, the command is now ``` nextstrain build phylogenetic -s clade-i/Snakefile ```
for more information, see https://pre-commit.ci
jameshadfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to mainly gather thoughts on the 4 extra directories/Snakefiles as separate workflow entrypoints. I think the long term goal is for nextstrain run to be smarter about picking workflows to run, e.g. imagine defining the workflows within nextstrain-pathogen.yaml.
Regarding the 4 extra Snakefiles I have two thoughts. Firstly, it's great that each repo has the power to do its own thing, and that it can expose ~any Snakefile (with the necessary path wrangling etc) as an interface to our runner and leverage the power that brings (cloud compute, runtimes etc). Secondly, I imagine you want to decide on a pattern that people can just copy-paste for multi-build workflows and it seems like this approach is in direct competition with using wildcards to parameterise builds. (The long-term plans for nextstrain-pathogen.yaml are opaque to me.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to rename hmpxv1 to clade-iib (which seems fine 🤷 I don't have much context here) then we should go the whole way and rename everything - defaults/hmpxv1 etc. (Perhaps you plan to do this -- it's a draft PR after all -- if so this can just serve as a reminder)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users in office hours and myself have often been confused by the mapping between the config files and the named builds, so I thought the workflow should at least match the build name.
If there's not push back from others here, I'll rename everything.
| @@ -0,0 +1,14 @@ | |||
| configfile: os.path.join(workflow.basedir, "../defaults/hmpxv1/config.yaml") | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's cleaner and easier to understand if we co-locate the config.yaml with the Snakefile, i.e.
$ tree phylogenetic/clade-iib
clade-iib
├── Snakefile
└── config.yamlThis raises the question of where to store clade-iib specific config files: phylogenetic/defaults/clade-iib or phylogenetic/clade-iib/defaults (or some variant thereof). If we look at some values in the config YAML (swapping hmpxv1 for clade-iib for clarity) we have things like:
color_scheme: "color_schemes.tsv"
auspice_config: "clade-iib/auspice_config.json"which isn't very ergonomic from an external users point of view. I think something like the following is cleaner:
color_scheme: "color_schemes.tsv"
auspice_config: "auspice_config.json"$ tree
phylogenetic
├── defaults
│ └── color_schemes.tsv
├── clade-iib
│ ├── Snakefile
│ ├── config.yaml
│ ├── defaults # maybe use a defaults subdir?
│ │ └── auspice_config.json
│ └── auspice_config.json # or maybe just do this?Where you have a path resolution of (1) analysis dir, (2) clade-iib, (3) phylogenetic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, totally see your point about the config not being ergonomic for external users here. On the other hand, as a maintainer, it's no longer clear from
color_scheme: "color_schemes.tsv"
auspice_config: "auspice_config.json"which files need to be edited to change the config...I am also wary of the splintering of config files, which is making me reconsider the separate Snakefile approach. See proposal in nextstrain/cli#454
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for exploring ways to get different configfile "entrypoints" into nextstrain run beyond creating multiple Snakefiles. I originally did just that for avian-flu, but my solution wasn't as nice as your proposed use of nextstrain-pathogen.yaml which might have been why it didn't get any traction.
Even with the --configfile approach we'll still have subtleties for analysis directory usage such as "color_schemes.tsv" vs "clade-iib/auspice_config.json". I guess the solution here is better per-pathogen docs.
| include: "../../shared/vendored/snakemake/config.smk" | ||
|
|
||
|
|
||
| def phylo_resolve_config_path(path: str) -> Callable[[Wildcards], str]: | ||
| """ | ||
| Wrapper around the shared `resolve_config_path` to force the default directory | ||
| to be `phylogenetic/defaults`. This is necessary because the entry point for | ||
| each build is nested within phylogenetic (e.g. phylogenetic/clade-i/Snakefile). | ||
| """ | ||
| PHYLO_DEFAULTS_DIR = os.path.normpath(os.path.join(workflow.current_basedir, "../defaults")) | ||
| # Strip the `defaults/` prefix to be backwards compatible with older configs | ||
| # This is necessary in this wrapper because we are providing a custom defaults dir | ||
| # which skips the handling of the defaults/ prefix within resolve_config_path. | ||
| path = path.removeprefix("defaults/") | ||
| return resolve_config_path(path, PHYLO_DEFAULTS_DIR) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still wish we got rid of all this magic around defaults/ and the path in the config.yaml was the path used, relative to an ordered set of "base directories" which are easily documented. I understand people don't want external users to have to make a "defaults" directory, but if we used "config" instead I think there'd be less objection.
Secondly I still think phylo_resolve_config_path is overly verbose and leaves the door open for unintentionally using resolve_config_path (which is still in the namespace). I continue to think the following is preferable:
include: "../../shared/vendored/snakemake/config.smk"
_shared_resolve_config_path = resolve_config_path
def resolve_config_path(path: str) -> Callable[[Wildcards], str]:
return _shared_resolve_config_path(...)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is only necessary because of the nested Snakefiles. If we keep the main phylogenetic/Snakefile then we could just directly use the shared resolve_config_path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, moving the main Snakefile breaks the pathogen-repo-ci because it expects the phylogenetic/Snakefile...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is why I used pathogen-repo-ci.yaml@v0 for avian-flu
Description of proposed changes
Workflow updates to support
nextstrain run. Since there are multiple builds, I've created a separate entrypoint/Snakefile per build.This allows users to use
nextstrain runwithout know the internal config file locations:For users who still want to use
nextstrain build, the command is nowDiscussion
Currently a draft PR mainly for discussion of the changes needed to support
nextstrain run.I want to mainly gather thoughts on the 4 extra directories/Snakefiles as separate workflow entrypoints. I think the long term goal is for
nextstrain runto be smarter about picking workflows to run, e.g. imagine defining the workflows withinnextstrain-pathogen.yaml.Related issue(s)
Part of #314
Checklist