-
Notifications
You must be signed in to change notification settings - Fork 136
filter: Use tsv-utils for --output-strains and --output-metadata
#1469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1469 +/- ##
==========================================
- Coverage 68.85% 68.70% -0.16%
==========================================
Files 69 69
Lines 7607 7624 +17
Branches 1861 1867 +6
==========================================
Hits 5238 5238
- Misses 2086 2100 +14
- Partials 283 286 +3 ☔ View full report in Codecov by Sentry. |
| output_metadata_handle.close() | ||
| if output_strains: | ||
| output_strains.close() | ||
| tsv_join = which("tsv-join") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using tsv-utils/tsv-join in Augur
@tsibley and I chatted about this yesterday. Two options:
- Detect
tsv-joinin the environment and use it if available. Otherwise, fall back to the Python approach. Maintenance and additional testing on both code paths would be necessary in this case. This is effectively the same approach as current invocation offasttree/raxml/iqtree/vcftools/etc. except those are explicitly requested by the user whiletsv-joincould be detected and used automatically as a faster alternative to the Python approach. - We could bundle
tsv-joinas part of Augur to avoid the the downsides of (1). Based on the latest release v2.2.1, I thought tsv-utils only distributed binaries for macOS, but it looks like previous versions distribute binaries for both Linux and macOS (and this is how it's advertised). I think we can get away with using an older version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could bundle tsv-join as part of Augur to avoid the the downsides of (1).
Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:
We could bundle tsv-join as part of Augur to avoid the the downsides of (1).
This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above.
At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[bundling] seems like the best way to provide this better experience to the most users
I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including tsv-join in the managed runtimes.
[bundling] follows the pattern of bundling other third-party tools like you mention above.
Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.
What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?
Great point - I think this will be the easiest way to push the feature through:
- don't bundle
- use
tsv-joinif it's available - use Python fallback with a warning to consider downloading
tsv-joinin the environment if experiencing slowness
We can still consider bundling in a future version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.
Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue. conda-base uses the conda-forge one seamlessly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now.
a48025b to
afb010c
Compare
victorlin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more fragile than I initially expected.
-
tsv-joinwill only be used when all of these conditions are met:tsv-joinis availablexzcat/gzcat/zstdcatis available if the input type is compressed- the output type is uncompressed (due to limitations)
-
Even with uncompressed output, there are some slight differences in behavior when it comes to handling quoted columns: afb010c
Threads for each point below.
|
|
||
| Quoted columns containing the tab delimiter are left unchanged. | ||
|
|
||
| # FIXME: tsv-join has different behavior here. Test both? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These differences should be tested more before we use this as default behavior across pathogen workflows (and others start using it too). Maybe we can start by releasing this as an opt-in "beta" e.g. --output-metadata-attempt-tsv-utils.
022fcd3 to
91dafbf
Compare
--output-strains and --output-metadata--output-strains and --output-metadata
tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).
91dafbf to
b65e7fa
Compare
| if not args.output_strains: | ||
| os.remove(strains_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: pass args.output_strains to write_output_metadata(), and do or don't write the strains file there based on the arg, rather than always writing it and then sometimes removing it.
|
|
||
| $ head -n 1 filtered_metadata.tsv | ||
| strain "col""1" "col2""" | ||
| strain col"1 col2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for linking to this draft PR in yesterday's lab meeting!
Seeing this change reminded me that this was implemented before the discussions around consistent TSV formats in #1566. I think we'd want to keep the consistent CSV-like quoting here. Not sure if wrapping the tsv-util calls with csv2tsv and csvtk fix-quotes is the correct move here as I suspect they would slow things down.
Description of proposed changes
tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).
Related issue(s)
Checklist