filter: Use tsv-utils for --output-strains and --output-metadata
#1469
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,18 +1,20 @@ | ||
| import argparse | ||
| import csv | ||
| from argparse import Namespace | ||
| import os | ||
| import re | ||
| from shlex import quote as shquote | ||
| from shutil import which | ||
| from textwrap import dedent | ||
| from typing import Sequence, Set | ||
| from typing import Sequence | ||
| import numpy as np | ||
| from collections import defaultdict | ||
| from xopen import xopen | ||
|
|
||
| from augur.errors import AugurError | ||
| from augur.io.file import open_file | ||
| from augur.io.metadata import Metadata, METADATA_DATE_COLUMN | ||
| from augur.io.metadata import METADATA_DATE_COLUMN | ||
| from augur.io.print import print_err | ||
| from augur.io.shell_command_runner import run_shell_command | ||
| from augur.utils import augur | ||
| from .constants import GROUP_BY_GENERATED_COLUMNS | ||
| from .include_exclude_rules import extract_variables, parse_filter_query | ||
|
|
||
|
|
@@ -96,25 +98,29 @@ def constant_factory(value): | |
| raise AugurError(f"missing or malformed priority scores file {fname}") | ||
|
|
||
|
|
||
| def write_output_metadata(input_metadata_path: str, delimiters: Sequence[str], | ||
| id_columns: Sequence[str], output_metadata_path: str, | ||
| ids_to_write: Set[str]): | ||
| def write_output_metadata(input_filename: str, id_column: str, output_filename: str, ids_file: str): | ||
| """ | ||
| Write output metadata file given input metadata information and a set of IDs | ||
| to write. | ||
| Write output metadata file given input metadata information and a file | ||
| containing ids to write. | ||
| """ | ||
| input_metadata = Metadata(input_metadata_path, delimiters, id_columns) | ||
|
|
||
| with xopen(output_metadata_path, "w", newline="") as output_metadata_handle: | ||
| output_metadata = csv.DictWriter(output_metadata_handle, fieldnames=input_metadata.columns, | ||
| delimiter="\t", lineterminator=os.linesep) | ||
| output_metadata.writeheader() | ||
|
|
||
| # Write outputs based on rows in the original metadata. | ||
| for row in input_metadata.rows(): | ||
| row_id = row[input_metadata.id_column] | ||
| if row_id in ids_to_write: | ||
| output_metadata.writerow(row) | ||
| # FIXME: make this a function like augur() and seqkit() | ||
| tsv_join = which("tsv-join") | ||
|
Member
Author
Using tsv-utils/tsv-join in Augur
@tsibley and I chatted about this yesterday. Two options:
Member
Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.
Contributor
@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:
This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above. At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?
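The warn-and-fall-back idea raised in this comment could look roughly like the sketch below. It is illustrative only and not code from this PR: `find_tsv_join` and `filter_rows_fallback` are hypothetical names, the warning text is made up, and the fallback simply mirrors the row-by-row loop that the diff removes.

```python
import sys
from shutil import which
from typing import Callable, Dict, Iterable, Iterator, Optional, Set


def find_tsv_join(finder: Callable[[str], Optional[str]] = which) -> Optional[str]:
    """Return the path to tsv-join, or None (after warning) if it is not on PATH.

    `finder` defaults to shutil.which and is a parameter only so the lookup
    can be swapped out in tests. This helper is hypothetical.
    """
    path = finder("tsv-join")
    if path is None:
        # Assumed wording; the PR discussion does not settle on a message.
        print(
            "WARNING: tsv-join (from tsv-utils) was not found on PATH; "
            "falling back to a slower pure-Python implementation.",
            file=sys.stderr,
        )
    return path


def filter_rows_fallback(
    rows: Iterable[Dict[str, str]], id_column: str, ids_to_write: Set[str]
) -> Iterator[Dict[str, str]]:
    """Pure-Python fallback: yield only the rows whose ID is in ids_to_write."""
    for row in rows:
        if row[id_column] in ids_to_write:
            yield row
```

A caller would branch on the return value of `find_tsv_join()`: build the shell pipeline when a path is returned, and otherwise stream rows through `filter_rows_fallback()`.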
Member
Author
I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including
Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.
Great point - I think this will be the easiest way to push the feature through:
We can still consider bundling in a future version.
Member
Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.
Member
All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue.
Member
tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now. |
||
|
|
||
| command = f""" | ||
| {augur()} read-file {shquote(input_filename)} | | ||
| {tsv_join} -H --filter-file <(printf "%s\n" {shquote(id_column)}; cat {shquote(ids_file)}) --key-fields {shquote(id_column)} | | ||
| {augur()} write-file {shquote(output_filename)} | ||
| """ | ||
|
|
||
| try: | ||
| run_shell_command(command, raise_errors=True) | ||
| except Exception: | ||
| if os.path.isfile(output_filename): | ||
| # Remove the partial output file. | ||
| os.remove(output_filename) | ||
| raise AugurError(f"Metadata output failed, see error(s) above.") | ||
| else: | ||
| raise AugurError(f"Metadata output failed, see error(s) above. The command may have already written data to stdout. You may want to clean up any partial outputs.") | ||
|
|
||
|
|
||
| # These are the types accepted in the following function. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| Setup | ||
|
|
||
| $ source "$TESTDIR"/_setup.sh | ||
|
|
||
| Use the same options with 3 different compression methods. | ||
|
|
||
| $ ${AUGUR} filter \ | ||
| > --metadata "$TESTDIR/../data/metadata.tsv" \ | ||
| > --subsample-max-sequences 5 \ | ||
| > --subsample-seed 0 \ | ||
| > --output-metadata filtered_metadata.tsv.gz 2>/dev/null | ||
|
|
||
| $ ${AUGUR} filter \ | ||
| > --metadata "$TESTDIR/../data/metadata.tsv" \ | ||
| > --subsample-max-sequences 5 \ | ||
| > --subsample-seed 0 \ | ||
| > --output-metadata filtered_metadata.tsv.xz 2>/dev/null | ||
|
|
||
| $ ${AUGUR} filter \ | ||
| > --metadata "$TESTDIR/../data/metadata.tsv" \ | ||
| > --subsample-max-sequences 5 \ | ||
| > --subsample-seed 0 \ | ||
| > --output-metadata filtered_metadata.tsv.zst 2>/dev/null | ||
|
|
||
| # The uncompressed outputs are identical. | ||
|
|
||
| $ diff <(gzcat filtered_metadata.tsv.gz) <(xzcat filtered_metadata.tsv.xz) | ||
|
|
||
| $ diff <(gzcat filtered_metadata.tsv.gz) <(zstdcat filtered_metadata.tsv.zst) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2,10 +2,7 @@ Setup | |
|
|
||
| $ source "$TESTDIR"/_setup.sh | ||
|
|
||
| Since Pandas's read_csv() and to_csv() are used with a double-quote character as | ||
| the default quotechar, any column names with that character may be altered. | ||
|
|
||
| Quoted columns containing the tab delimiter are left unchanged. | ||
| Quoting is unchanged regardless of placement. | ||
|
|
||
| $ cat >metadata.tsv <<~~ | ||
| > strain "col 1" | ||
|
|
@@ -19,8 +16,6 @@ Quoted columns containing the tab delimiter are left unchanged. | |
| $ head -n 1 filtered_metadata.tsv | ||
| strain "col 1" | ||
|
|
||
| Quoted columns without the tab delimiter are stripped of the quotes. | ||
|
|
||
| $ cat >metadata.tsv <<~~ | ||
| > strain "col1" | ||
| > SEQ_1 a | ||
|
|
@@ -31,9 +26,7 @@ Quoted columns without the tab delimiter are stripped of the quotes. | |
| > --output-metadata filtered_metadata.tsv 2>/dev/null | ||
|
|
||
| $ head -n 1 filtered_metadata.tsv | ||
| strain col1 | ||
|
|
||
| Any other columns with quotes are quoted, and pre-existing quotes are escsaped by doubling up. | ||
| strain "col1" | ||
|
|
||
| $ cat >metadata.tsv <<~~ | ||
| > strain col"1 col2" | ||
|
|
@@ -45,4 +38,4 @@ Any other columns with quotes are quoted, and pre-existing quotes are escsaped b | |
| > --output-metadata filtered_metadata.tsv 2>/dev/null | ||
|
|
||
| $ head -n 1 filtered_metadata.tsv | ||
| strain "col""1" "col2""" | ||
| strain col"1 col2" | ||
|
Contributor
Thanks for linking to this draft PR in yesterday's lab meeting! Seeing this change reminded me that this was implemented before the discussions around consistent TSV formats in #1566. I think we'd want to keep the consistent CSV-like quoting here. Not sure if wrapping the tsv-util calls with |
||
Suggestion: pass `args.output_strains` to `write_output_metadata()`, and write or skip the strains file there based on the argument, rather than always writing it and then sometimes removing it.
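One possible reading of this suggestion, sketched under assumptions: since the `tsv-join` pipeline in the diff needs a file of IDs either way, a helper could write to the user-requested path when one is given and to a temporary file otherwise. `ids_file_for` is a hypothetical name and is not part of Augur; how it would be wired into `write_output_metadata()` is left open.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, Optional, Sequence


@contextmanager
def ids_file_for(ids: Sequence[str], output_strains: Optional[str]) -> Iterator[str]:
    """Yield the path to a file listing IDs, one per line.

    If output_strains is given, the IDs are written there and the file is kept.
    Otherwise a temporary file is created and removed on exit.
    (Hypothetical helper, for illustration only.)
    """
    if output_strains is not None:
        with open(output_strains, "w") as handle:
            handle.writelines(f"{strain}\n" for strain in ids)
        yield output_strains
        return

    # No strains output requested: use a throwaway file for the join step.
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{strain}\n" for strain in ids)
        yield path
    finally:
        os.remove(path)
```

With this shape, the caller never has to delete a strains file it did not want; cleanup is confined to the temporary-file branch.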