-
Notifications
You must be signed in to change notification settings - Fork 136
subsample: Improve default search paths, allow customization #1918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
ac75919 to
d58a672
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1918 +/- ##
==========================================
+ Coverage 74.15% 74.24% +0.08%
==========================================
Files 82 82
Lines 8986 9054 +68
Branches 1828 1848 +20
==========================================
+ Hits 6664 6722 +58
- Misses 2018 2025 +7
- Partials 304 307 +3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
tsibley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty close to what I meant when I was describing the approach!
A couple small niggles noted.
| } | ||
| ], | ||
| "description": "File(s) with list of strains to exclude. Paths must be relative to the\nworking directory." | ||
| "description": "File(s) with list of strains to exclude. Paths must be relative to the\nconfig file location." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc change (and the others below) seems incorrect and is also reverted a couple of commits later. Methinks devel/regenerate-subsample-schema needs to be re-run on this commit (and others)?
git rebase -i --exec './devel/regenerate-subsample-schema && git add augur/data/schema-subsample-config.json && git commit --fixup @ --allow-empty' master
git rebase -i --autosquash master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, good catch. "relative to the config file location" was from an old draft where I had swapped the only search path from cwd to config file parent. With the ability to configure this in augur subsample and the unchanged search path for augur filter as cwd, I figured it'd be best to remove the note entirely from the shared description as part of aa68ab4. I'll note this in the commit message.
Thanks for the handy command!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in force-push, and noted in commit message of 616bc41.
augur/subsample.py
Outdated
|
|
||
| def _resolve_filepath_str(path: str, search_paths: List[Path]) -> str: | ||
| """Resolve a string-based filepath by wrapping _resolve_filepath.""" | ||
| return _resolve_filepath(Path(path), search_paths).as_posix() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return _resolve_filepath(Path(path), search_paths).as_posix() | |
| return str(_resolve_filepath(Path(path), search_paths)) |
so that we don't assume POSIX paths? (Not that I think we support Augur on Windows, but…)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I didn't know that str() takes OS into account (doc). That's a good idea.
The string representation of a path is the raw filesystem path itself (in native form, e.g. with backslashes under Windows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 One thing I meant to add but forget is: at the point of doing str(fn(…)), it seems not worth wrapping into a separate function at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the wrapper and switched to str() in various places:
Lines 381 to 386 in 25c4448
| if isinstance(value, list): | |
| config[key] = [str(_resolve_filepath(Path(v), search_paths)) for v in value] | |
| filepaths.extend(config[key]) | |
| elif isinstance(value, str): | |
| config[key] = str(_resolve_filepath(Path(value), search_paths)) | |
| filepaths.append(config[key]) |
Lines 473 to 474 in 25c4448
| if not path.exists(): | |
| raise AugurError(f"File {str(path)!r} does not exist.") |
Lines 484 to 487 in 25c4448
| raise AugurError(dedent(f"""\ | |
| File {str(path)!r} not resolvable from any of the following paths: | |
| {indented_list([str(p) for p in search_paths], ' ' + ' ')}""")) |
augur/subsample.py
Outdated
| print_err(dedent(f"""\ | ||
| WARNING: Both the environment variable AUGUR_SEARCH_PATHS | ||
| and the command line argument --search-paths are set. | ||
| Only the environment variable will be used.""")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think it's more conventional (though not universal) for command line options to take precedence over environment variables, i.e. swap the order here.
Conventional precedence from least to most goes: config file → env var → command line option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll swap the order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done:
Lines 327 to 336 in 25c4448
| if from_cli: | |
| if from_env: | |
| print_err(dedent(f"""\ | |
| WARNING: Both the command line argument --search-paths | |
| and the environment variable AUGUR_SEARCH_PATHS are set. | |
| Only the command line argument will be used.""")) | |
| return [ | |
| *(Path(p) for p in from_cli), | |
| *default, | |
| ] |
| .. envvar:: AUGUR_SEARCH_PATHS | ||
|
|
||
| Colon-separated directory paths. | ||
| If set, ``augur subsample`` will search these directories for relative filepaths specified in the config file. | ||
| Multiple directories can be set by separating them with colons (e.g., ``/path/one:/path/two``). | ||
|
|
||
| This takes precedence over the ``--search-paths`` command line argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
Were you thinking that we'd expand support for this env var (and option) to other commands too as needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I briefly noted this in "Additional notes / 2. Expanding to additional augur commands" in #1897.
It might be weird to customize the search path for files given directly as command line arguments, but it would be good to avoid resolving filepaths in both the workflow and Augur (a pattern which this PR enables).
| def get_referenced_files( | ||
| config_file: str, | ||
| config_section: Optional[List[str]] = None, | ||
| search_paths: Optional[List[str]] = None, | ||
| ) -> Set[str]: | ||
| """Get the files referenced in a subsample config file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking
This seems likely to fall out of date with the behaviour of augur subsample since it's not used directly. It would be better if augur subsample's own code and this function shared more of a code path. Perhaps file paths should be resolved as an early transformation of config, rather than transformed one by one as Samples are constructed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I'll look into doing that in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up doing this with a function _resolve_filepaths(). It recursively walks through the config/schema and might be slightly hard to follow, but I think it's a decent way to share the same code path between augur subsample and get_referenced_files().
d58a672 to
0e4f947
Compare
This makes it easier to add/remove steps.
This will be used in a future commit.
Preparing to use the schema in another function.
Previously, the only search path was implicitly the current working directory. Make this explicit and add an option to search other paths before it. This means the notes in option descriptions mentioning cwd as search path are no longer accurate for augur subsample. Since the descriptions are shared with augur filter which does not allow custom search paths, I figured it'd be best to remove the notes entirely.
This allows more intuitive relative filepaths from the perspective of the config file.
Environment variables can be easier to use than CLI options in some situations.
This can be used by Snakemake workflows to declare input files for rules that call augur subsample.
0e4f947 to
25c4448
Compare
Description of proposed changes
Previously, the only search path was implicitly the current working directory.
This PR adds another default – the config file's parent directory – and prefers it over cwd. Custom search paths are supported via command line option and environment variable.
Related issue(s)
Checklist