@jameshadfield

Problem description

Ongoing, long-term work has made inroads into running (and installing) pathogen workflows decoupled from the pathogen repo itself; this work goes by names such as "workflows-as-programs", "nextstrain run" and "external analysis directories". Simultaneously, we're extending the ways workflow invocations can customise their behaviour: overwriting default config files, config overlays, multiple inputs, etc.

This presents us with a situation where we want people to run a workflow (e.g. zika), starting with an ~empty analysis directory, and without access to the pathogen repo source code (this repo). So, how are they to know what commands to run, and what configuration knobs are available?

There's been some back and forth about this concept. The most salient example I could find is in this avian-flu prototyping PR.

Goal: per-pathogen docs

This prototype explores encapsulating a set of HTML-based docs within the pathogen repo itself and surfacing them to the user via nextstrain docs <pathogen> and/or online via docs.nextstrain.org. These docs would briefly introduce the concept of running analyses from a working directory, include brief tutorials for adding in your own data etc, and fully document all the available configuration options alongside their defaults (essentially the API of the program).

If we are to think of workflows as programs, then this documentation can be thought of as man pages.

Keeping docs and code in sync

Documentation must be reliable and accurate. There's nothing more annoying than having program behaviour and documentation disagree.

Keeping code and docs in the same repo helps keep things in sync and goes hand-in-hand with our plans to version pathogen repos. Our only attempt at this has been ncov. Since pathogen repos vendor shared code via nextstrain/shared, we can co-locate relevant documentation in that repo and workflows (e.g. zika) can bring it in alongside the code. For instance, we vendor snakemake code for path resolution and this PR adds a .rst docs file alongside that shared code which the zika docs use.

We can leverage sphinx extensions to help keep program state and docs in sync. This PR implements a couple of ideas (using cursor.ai for the python coding) and I think there's a lot more we can do.

  • We add a custom configvalue role which allows us to reference config values in .rst docs and have them filled in at build time.
  • We add a custom snakemake-dag directive which will use snakemake to construct the dag and use graphviz to render the DAG in the docs. This all happens at docs build time.
  • The avian-flu prototyping PR explored config schemas and automatically generating HTML docs from them. An easier (and perhaps nicer) solution would be to manually document the config options in the docs, use the custom :configvalue: role to show the default values, and have some custom sphinx build-time code to check that all values in the config are documented. Going further, it's entirely plausible to write a sphinx extension which would examine the snakemake code and list all the rules / functions which use each config value.
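
To make the "check that all values in the config are documented" idea concrete, here's a rough sketch of what such a build-time check could look like. This is not code from this PR: `ROLE_PATTERN`, `top_level_keys_documented` and `undocumented_keys` are hypothetical names, the role syntax is assumed to be the `:configvalue:` form quoted in the prompts below, and for simplicity it only checks top-level keys of an already-parsed config.

```python
import re

# Assumed role syntax (taken from the prompts in this PR description):
#   :configvalue:`path/to/config.yaml:key`
ROLE_PATTERN = re.compile(r":configvalue:`([^:`]+):([^`]+)`")


def top_level_keys_documented(rst_text, config_path):
    """Return the top-level config keys that the docs reference for one config file."""
    keys = set()
    for path, key in ROLE_PATTERN.findall(rst_text):
        if path == config_path:
            # A nested reference such as "inputs[0].metadata" counts as
            # documenting the top-level key "inputs".
            keys.add(re.split(r"[.\[]", key, maxsplit=1)[0])
    return keys


def undocumented_keys(config, rst_text, config_path):
    """List top-level keys of a parsed config (a dict) that no role mentions."""
    return sorted(set(config) - top_level_keys_documented(rst_text, config_path))
```

A real extension would presumably load the YAML itself and emit a Sphinx warning for each undocumented key; that wiring is left out here.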

How to test the docs in this PR

We don't have the nice nextstrain docs UI but this PR emulates it using commands like nextstrain run zika docs . See the added docs/README.md for instructions on how to test it out. When we eventually bundle pathogens up into their own images, the built docs can be part of this, and before that time there are a number of other ways we can build the docs upon each workflow release. The content in the docs added here expands on many of the ideas introduced in this PR description.

Future directions

This is a draft PR as I don't expect it to be merged in its current state. I would, however, like these ideas to make it to production, and would encourage anyone to try out the ideas in this PR.

Code entirely by cursor.ai + some manual and prompt-based debugging. Main prompts:

---

I'm in @tutorial.rst and I want to be able to write a custom interpreted text role to do something like

:configvalue:`phylogenetic/defaults/config.yaml:strain_id_field`

and have it replaced by the corresponding value in `phylogenetic/defaults/config.yaml` of `strain_id_field` which is "accession". The exact syntax used can vary.

This will require writing some custom python code to provide this functionality in sphinx.

---

That's working great. In @tutorial.rst I actually want to reference the yaml field `inputs[0].metadata` however this renders as `<config value 'inputs[0].metadata' not found in phylogenetic/defaults/config.yaml>`.
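
For anyone wondering what resolving a nested key like `inputs[0].metadata` might involve, here's a minimal sketch. It is not the code in this PR: `SEGMENT` and `resolve` are hypothetical names, and it assumes the YAML has already been parsed into nested dicts and lists.

```python
import re

# One path segment: a key name optionally followed by list indices, e.g. "inputs[0]"
SEGMENT = re.compile(r"^([^\[\]]+)((?:\[\d+\])*)$")


def resolve(config, key_path):
    """Walk a parsed config along a path such as 'inputs[0].metadata'."""
    value = config
    for segment in key_path.split("."):
        match = SEGMENT.match(segment)
        if match is None:
            raise KeyError(f"malformed path segment: {segment!r}")
        name, indices = match.groups()
        value = value[name]  # raises KeyError if the key is missing
        for index in re.findall(r"\[(\d+)\]", indices):
            value = value[int(index)]  # raises IndexError if out of range
    return value
```

With a config where `strain_id_field` is "accession", `resolve(config, "strain_id_field")` returns that string, and `resolve(config, "inputs[0].metadata")` returns whatever the first input's `metadata` entry holds.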

---

This is working great for adding the value inline. For some values, e.g.

:configvalue:`ingest/defaults/config.yaml:ncbi_datasets_fields`

The output is a list and would be better rendered as a YAML codeblock by itself rather than inline. Is this possible?
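
One way to think about the inline-versus-block choice is sketched below. This is an illustration only, not what the PR implements: `render_config_value` is a hypothetical helper, it hand-formats flat lists of strings as YAML sequence lines (without handling strings that need quoting), and it falls back to JSON for anything more complex.

```python
import json


def render_config_value(value):
    """Return ("inline", text) for scalars and ("block", text) for lists and mappings."""
    if isinstance(value, (list, dict)):
        if isinstance(value, list) and all(isinstance(item, str) for item in value):
            # Flat list of strings: write simple YAML sequence lines by hand.
            # Strings that would need quoting in YAML are NOT handled here.
            text = "\n".join(f"- {item}" for item in value)
        else:
            # Fallback for nested structures: indented JSON.
            text = json.dumps(value, indent=2)
        return ("block", text)
    return ("inline", str(value))
```

In a Sphinx extension the "block" case would presumably end up in a literal block node and the "inline" case in an inline literal; that part is omitted here.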

---

I hoped there was a better way than symlinking, but the robot told me there wasn't, so here we are.

This printed the reason for each executed rule but was already
deprecated and always true in v7 [1] and has been removed entirely in
later snakemake versions. The upgrade path is clear [2]: "Deprecated:
Drop it and don't worry about anything"

[1] <https://snakemake.readthedocs.io/en/v7.22.0/executing/cli.html#output>
[2] <https://snakemake.readthedocs.io/en/stable/getting_started/migration.html>

Code entirely written by cursor.ai. Here's the initial prompt:

```
I'd like to add a custom sphinx extension in `docs/src/extensions` which should take a directive similar to:

:snakemake-dag:`ingest`

1. Run a command which will generate a visualisation in the dot language. In this case it'd be run from the top level `ingest` directory, and the command would be `snakemake --cores 1 -npf --forceall --dag | grep -v 'Building'`. The STDOUT is the graph viz in dot format. Remember you must use the `augur-dev-snakemake-v9` conda env to run any commands.

2. Take this dot code and use the native sphinx graphviz extension <@https://www.sphinx-doc.org/en/master/usage/extensions/graphviz.html > to render it in the docs page.
```

and lots of debugging prompts followed.
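
For reference, step 1 of that prompt can be sketched in plain Python roughly as follows. This is an assumption-laden illustration rather than the extension in this PR: `extract_dot` and `snakemake_dag_dot` are hypothetical names, the snakemake flags are copied from the prompt, and instead of `grep -v 'Building'` it drops everything printed before the `digraph` line.

```python
import subprocess


def extract_dot(stdout):
    """Keep only the dot graph, dropping any log lines printed before 'digraph'."""
    lines = stdout.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("digraph"):
            return "\n".join(lines[i:])
    raise ValueError("no 'digraph' found in snakemake output")


def snakemake_dag_dot(workflow_dir):
    """Run snakemake from workflow_dir and return the DAG in dot format.

    Assumes snakemake is on PATH in the active environment.
    """
    result = subprocess.run(
        ["snakemake", "--cores", "1", "-npf", "--forceall", "--dag"],
        cwd=workflow_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return extract_dot(result.stdout)
```

The returned string could then be handed to the graphviz extension named in step 2 of the prompt; that wiring is not shown.
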
@jameshadfield added the enhancement, help wanted, proposal and documentation labels Jul 10, 2025
@victorlin deleted the branch james/storage July 31, 2025 21:51
@victorlin closed this Jul 31, 2025
@victorlin

Oops, I deleted james/storage because its PR #89 was closed, but didn't see that this PR was based on it. I'll restore the branch and reopen this.

@tsibley

tsibley commented Sep 11, 2025

I looked over this at a high level yesterday (i.e. without diving into the code much at all) and agree with the direction it goes in general. I think there are a few details I'd tweak and some things to still figure out, but overall: yeah, this is the direction I'd been imagining for a while.
