filter: Use tsv-utils for `--output-strains` and `--output-metadata` #1469

victorlin · 2024-05-18T21:32:16Z

Description of proposed changes

tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).

Related issue(s)

Prompted by Slack discussion

Checklist

Address FIXMEs
Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

codecov · 2024-05-18T21:41:54Z

Codecov Report

Attention: Patch coverage is 59.52381%, with 17 lines in your changes missing coverage. Please review.

Project coverage is 68.70%. Comparing base (4923408) to head (a48025b).
Report is 1 commit behind head on master.

Files	Patch %	Lines
augur/filter/io.py	54.05%	14 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1469      +/-   ##
==========================================
- Coverage   68.85%   68.70%   -0.16%     
==========================================
  Files          69       69              
  Lines        7607     7624      +17     
  Branches     1861     1867       +6     
==========================================
  Hits         5238     5238              
- Misses       2086     2100      +14     
- Partials      283      286       +3


victorlin · 2024-05-22T19:00:59Z

augur/filter/io.py

-        output_metadata_handle.close()
-    if output_strains:
-        output_strains.close()
+    tsv_join = which("tsv-join")


Using tsv-utils/tsv-join in Augur

@tsibley and I chatted about this yesterday. Two options:

1. Detect tsv-join in the environment and use it if available. Otherwise, fall back to the Python approach. Maintenance and additional testing on both code paths would be necessary in this case. This is effectively the same approach as the current invocation of fasttree/raxml/iqtree/vcftools/etc., except those are explicitly requested by the user while tsv-join could be detected and used automatically as a faster alternative to the Python approach.

2. We could bundle tsv-join as part of Augur to avoid the downsides of (1). Based on the latest release v2.2.1, I thought tsv-utils only distributed binaries for macOS, but it looks like previous versions distribute binaries for both Linux and macOS (and this is how it's advertised). I think we can get away with using an older version.

> We could bundle tsv-join as part of Augur to avoid the downsides of (1).

Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:

> We could bundle tsv-join as part of Augur to avoid the downsides of (1).

This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above.

At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

> [bundling] seems like the best way to provide this better experience to the most users

I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including tsv-join in the managed runtimes.

> [bundling] follows the pattern of bundling other third-party tools like you mention above.

Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.

> What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

Great point - I think this will be the easiest way to push the feature through:

- don't bundle
- use tsv-join if it's available
- use Python fallback with a warning to consider downloading tsv-join in the environment if experiencing slowness

We can still consider bundling in a future version.
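The "use it if available, otherwise warn and fall back" plan could look roughly like the sketch below. Only the `which("tsv-join")` lookup comes from the diff in this PR; the helper name `choose_join_backend`, its return values, and the warning text are assumptions made for illustration.

```python
import shutil
import sys
from typing import Callable, Optional


def choose_join_backend(
    which: Callable[[str], Optional[str]] = shutil.which,
    stream=None,
) -> str:
    """Return "tsv-join" when the executable is on PATH, else "python".

    Hypothetical helper: the name, return values, and message are
    assumptions for this sketch, not Augur's actual code.
    """
    if which("tsv-join") is not None:
        return "tsv-join"

    # Fallback path: tell the user why the run may be slower.
    print(
        "WARNING: tsv-join was not found; using the slower Python "
        "implementation. Consider installing tsv-utils.",
        file=stream if stream is not None else sys.stderr,
    )
    return "python"
```

Passing `which` and `stream` as parameters is only there so both code paths can be exercised in tests without touching the real environment.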

> Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.

All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue. conda-base uses the conda-forge one seamlessly.

tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now.

victorlin

This is more fragile than I initially expected.

tsv-join will only be used when all of these conditions are met:
- tsv-join is available
- xzcat/gzcat/zstdcat is available if the input type is compressed
- the output type is uncompressed (due to limitations)
Even with uncompressed output, there are some slight differences in behavior when it comes to handling quoted columns: afb010c
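As a rough sketch of the three conditions listed above, the check might be written as follows. The function name `can_use_tsv_join`, the suffix-to-tool mapping, and the use of file suffixes to detect compression are all assumptions; the actual detection logic in `augur/filter/io.py` may differ.

```python
import shutil
from typing import Callable, Optional

# Assumed mapping from file suffix to the decompression tools named above.
DECOMPRESSORS = {".xz": "xzcat", ".gz": "gzcat", ".zst": "zstdcat"}


def can_use_tsv_join(
    input_path: str,
    output_path: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True only if every condition for the tsv-join path holds."""
    # 1. tsv-join itself must be available.
    if which("tsv-join") is None:
        return False

    # 2. A compressed input needs its matching decompressor.
    for suffix, tool in DECOMPRESSORS.items():
        if input_path.endswith(suffix) and which(tool) is None:
            return False

    # 3. The output must be uncompressed.
    if output_path.endswith(tuple(DECOMPRESSORS)):
        return False

    return True
```

Any `False` result would send the run down the existing Python path, which is what makes the fast path easy to miss silently.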

Threads for each point below.

augur/filter/io.py

victorlin · 2024-07-17T22:33:04Z

tests/functional/filter/cram/filter-output-metadata-header.t


 Quoted columns containing the tab delimiter are left unchanged.

+# FIXME: tsv-join has different behavior here. Test both?


These differences should be tested more before we use this as default behavior across pathogen workflows (and others start using it too). Maybe we can start by releasing this as an opt-in "beta" e.g. --output-metadata-attempt-tsv-utils.


genehack · 2025-07-10T17:11:24Z

augur/filter/_run.py

+    if not args.output_strains:
+        os.remove(strains_file)


suggestion: pass args.output_strains to write_output_metadata(), and do or don't write the strains file there based on the arg, rather than always writing it and then sometimes removing it.
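A minimal sketch of that control flow, assuming the strains list is not also needed as an intermediate input elsewhere (the thread does not say either way). `write_strains` is a hypothetical stand-in for the strains-writing part of `write_output_metadata()`, whose real signature is not shown here.

```python
from typing import Iterable, Optional


def write_strains(strains: Iterable[str], output_strains: Optional[str]) -> Optional[str]:
    """Write one strain per line, but only when a path was requested.

    Hypothetical helper for illustration; returns the path written,
    or None when nothing was written.
    """
    if not output_strains:
        return None  # nothing written, so nothing to remove afterwards

    with open(output_strains, "w", encoding="utf-8", newline="") as handle:
        for strain in sorted(strains):
            handle.write(f"{strain}\n")
    return output_strains
```

With this shape the caller no longer needs the `os.remove(strains_file)` cleanup step.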

joverlee521 · 2025-07-10T17:35:55Z

tests/functional/filter/cram/filter-output-metadata-header.t


  $ head -n 1 filtered_metadata.tsv
-  strain	"col""1"	"col2"""
+  strain	col"1	col2"


Thanks for linking to this draft PR in yesterday's lab meeting!

Seeing this change reminded me that this was implemented before the discussions around consistent TSV formats in #1566. I think we'd want to keep the consistent CSV-like quoting here. Not sure if wrapping the tsv-utils calls with csv2tsv and csvtk fix-quotes is the correct move here, as I suspect they would slow things down.
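To illustrate the difference in the header above, here is a small sketch using Python's standard `csv` module as a stand-in for CSV-like quoting (an assumption; Augur's own writer is not shown in this thread), next to a plain tab join that leaves quote characters untouched. Both function names are hypothetical.

```python
import csv
import io
from typing import List


def csv_like_row(fields: List[str]) -> str:
    """Tab-delimited row with CSV-style quoting (quotes doubled and wrapped)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue().rstrip("\n")


def raw_row(fields: List[str]) -> str:
    """Tab-delimited row with no quoting at all."""
    return "\t".join(fields)


header = ["strain", 'col"1', 'col2"']
print(csv_like_row(header))  # strain	"col""1"	"col2"""
print(raw_row(header))       # strain	col"1	col2"
```

The two outputs correspond to the removed and added lines in the test diff, which is why output that is passed through without CSV-style quoting changes the expected header.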

victorlin self-assigned this May 18, 2024


victorlin force-pushed the victorlin/update-filter-outputs branch from a48025b to afb010c Compare July 17, 2024 22:14


tsibley reviewed Jul 24, 2024


victorlin force-pushed the victorlin/update-filter-outputs branch from 022fcd3 to 91dafbf Compare August 3, 2024 01:44

victorlin mentioned this pull request Aug 9, 2024

Speed up filtering/subsampling without replacing Pandas #1573

Closed

7 tasks

victorlin mentioned this pull request Apr 24, 2025

Speed up augur filter #1575

Open

4 tasks

victorlin changed the title ~~filter: Improve speed of --output-strains and --output-metadata~~ filter: Use tsv-utils for --output-strains and --output-metadata May 23, 2025

victorlin added 2 commits July 8, 2025 18:47

Add tests for compressed metadata outputs

adfff1e

🚧 Use tsv-utils for --output-metadata

b65e7fa

tsv-join is much faster than the other implementation here (18x faster - 12s vs. 3m43s on the current SARS-CoV-2 GISAID dataset containing 16 million rows).

victorlin force-pushed the victorlin/update-filter-outputs branch from 91dafbf to b65e7fa Compare July 9, 2025 02:15

fixup! 🚧 Use tsv-utils for --output-metadata

0f5911e

genehack reviewed Jul 10, 2025


joverlee521 reviewed Jul 10, 2025


