Draft
Changes from all commits
11 changes: 5 additions & 6 deletions augur/filter/_run.py
@@ -402,7 +402,7 @@ def run(args):
     strains_file = None
     if args.output_strains:
         strains_file = args.output_strains
-    elif args.output_sequences:
+    elif args.output_sequences or args.output_metadata:
         strains_file = NamedTemporaryFile(delete=False).name
 
     if strains_file is not None:
@@ -419,14 +419,13 @@ def run(args):
             write_vcf(args.sequences, args.output_sequences, dropped_samps)
         else:
             subset_fasta(args.sequences, args.output_sequences, strains_file, args.nthreads)
-        if not args.output_strains:
-            os.remove(strains_file)
 
     if args.output_metadata:
         print_debug(f"Reading metadata from {args.metadata!r} and writing to {args.output_metadata!r}…")
-        write_output_metadata(args.metadata, args.metadata_delimiters,
-                              args.metadata_id_columns, args.output_metadata,
-                              valid_strains)
+        write_output_metadata(args.metadata, metadata_object.id_column, args.output_metadata, strains_file)
+
+    if not args.output_strains:
+        os.remove(strains_file)
Comment on lines +427 to +428

Contributor:
suggestion: pass args.output_strains to write_output_metadata(), and do or don't write the strains file there based on the arg, rather than always writing it and then sometimes removing it.


# Calculate the number of strains that don't exist in either metadata or
# sequences.
48 changes: 27 additions & 21 deletions augur/filter/io.py
@@ -1,18 +1,20 @@
import argparse
import csv
from argparse import Namespace
import os
import re
from shlex import quote as shquote
from shutil import which
from textwrap import dedent
from typing import Sequence, Set
from typing import Sequence
import numpy as np
from collections import defaultdict
from xopen import xopen

from augur.errors import AugurError
from augur.io.file import open_file
from augur.io.metadata import Metadata, METADATA_DATE_COLUMN
from augur.io.metadata import METADATA_DATE_COLUMN
from augur.io.print import print_err
from augur.io.shell_command_runner import run_shell_command
from augur.utils import augur
from .constants import GROUP_BY_GENERATED_COLUMNS
from .include_exclude_rules import extract_variables, parse_filter_query

@@ -96,25 +98,29 @@ def constant_factory(value):
raise AugurError(f"missing or malformed priority scores file {fname}")


-def write_output_metadata(input_metadata_path: str, delimiters: Sequence[str],
-                          id_columns: Sequence[str], output_metadata_path: str,
-                          ids_to_write: Set[str]):
+def write_output_metadata(input_filename: str, id_column: str, output_filename: str, ids_file: str):
     """
-    Write output metadata file given input metadata information and a set of IDs
-    to write.
+    Write output metadata file given input metadata information and a file
+    containing ids to write.
     """
-    input_metadata = Metadata(input_metadata_path, delimiters, id_columns)
-
-    with xopen(output_metadata_path, "w", newline="") as output_metadata_handle:
-        output_metadata = csv.DictWriter(output_metadata_handle, fieldnames=input_metadata.columns,
-                                         delimiter="\t", lineterminator=os.linesep)
-        output_metadata.writeheader()
-
-        # Write outputs based on rows in the original metadata.
-        for row in input_metadata.rows():
-            row_id = row[input_metadata.id_column]
-            if row_id in ids_to_write:
-                output_metadata.writerow(row)
+    # FIXME: make this a function like augur() and seqkit()
+    tsv_join = which("tsv-join")
Member Author (@victorlin):
Using tsv-utils/tsv-join in Augur

@tsibley and I chatted about this yesterday. Two options:

  1. Detect tsv-join in the environment and use it if available. Otherwise, fall back to the Python approach. Maintenance and additional testing on both code paths would be necessary in this case. This is effectively the same approach as current invocation of fasttree/raxml/iqtree/vcftools/etc. except those are explicitly requested by the user while tsv-join could be detected and used automatically as a faster alternative to the Python approach.
  2. We could bundle tsv-join as part of Augur to avoid the downsides of (1). Based on the latest release v2.2.1, I thought tsv-utils only distributed binaries for macOS, but it looks like previous versions distribute binaries for both Linux and macOS (and this is how it's advertised). I think we can get away with using an older version.

Member:
We could bundle tsv-join as part of Augur to avoid the downsides of (1).

Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

Contributor:
@victorlin This is a clever solution and the speed-up you observe with ncov data suggests it's worth pursuing! Regarding:

We could bundle tsv-join as part of Augur to avoid the downsides of (1).

This seems like the best way to provide this better experience to the most users and follows the pattern of bundling other third-party tools like you mention above.

At first, I liked the idea of tsv-utils being an implementation detail that users don't have to know about, but I wonder about the user experience for people who don't have tsv-utils installed and don't realize why the same command runs slower than in an environment where tsv-utils is available. What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

Member Author (@victorlin), May 29, 2024:

[bundling] seems like the best way to provide this better experience to the most users

I'm wary of the extra work required to figure out how to properly bundle tsv-join with Augur. I'd argue that the best way to provide this experience is already accomplished by including tsv-join in the managed runtimes.

[bundling] follows the pattern of bundling other third-party tools like you mention above.

Oh, I meant that we don't bundle any other third-party tools currently so this would be a new approach.

What if we provided some warning when tsv-utils isn't available to alert users that we are using the fallback implementation? Is there a potential cost to exposing the implementation detail that outweighs the benefit of letting users know they could speed up their filters by installing tsv-utils?

Great point - I think this will be the easiest way to push the feature through:

  • don't bundle
  • use tsv-join if it's available
  • use Python fallback with a warning to consider downloading tsv-join in the environment if experiencing slowness

We can still consider bundling in a future version.
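Sketched out, the agreed detect-and-fall-back approach might look roughly like this (function name and messages are illustrative assumptions, not the PR's actual code):

```python
import sys
from shutil import which

def choose_join_impl() -> str:
    """Pick the metadata-join implementation: tsv-join if it is on PATH,
    otherwise warn and use the pure-Python fallback."""
    if which("tsv-join") is not None:
        return "tsv-join"
    print("WARNING: tsv-join not found; using a slower Python fallback. "
          "Install tsv-utils to speed up this step.", file=sys.stderr)
    return "python-fallback"
```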

Member:

Last I checked tsv-utils wasn't available for osx-arm64. It may be something we could fix.

Cornelius has made this available in conda-forge. Note that bioconda's tsv-utils still does not support osx-arm64.

Member:

All bioconda environments always use conda-forge preferentially (if correctly configured) so the migration from bioconda -> conda-forge is not an issue. conda-base uses the conda-forge one seamlessly.

Member:

tsv-utils is built from source over at conda-forge, so it's available for more platforms than the pre-built binaries. linux-aarch64 and osx-arm64 don't have pre-built binaries, but conda-forge has them now.


+    command = f"""
+        {augur()} read-file {shquote(input_filename)} |
+        {tsv_join} -H --filter-file <(printf "%s\n" {shquote(id_column)}; cat {shquote(ids_file)}) --key-fields {shquote(id_column)} |
+        {augur()} write-file {shquote(output_filename)}
+    """
+
+    try:
+        run_shell_command(command, raise_errors=True)
+    except Exception:
+        if os.path.isfile(output_filename):
+            # Remove the partial output file.
+            os.remove(output_filename)
+            raise AugurError(f"Metadata output failed, see error(s) above.")
+        else:
+            raise AugurError(f"Metadata output failed, see error(s) above. The command may have already written data to stdout. You may want to clean up any partial outputs.")


# These are the types accepted in the following function.
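For readers without tsv-utils on hand, here is a pure-Python stand-in for the tsv-join step above (a semantic sketch only, with a hypothetical helper name — not the fallback implementation discussed in this thread):

```python
import csv
from typing import Iterable, List, Set, Tuple

def filter_tsv_rows(metadata_lines: Iterable[str], ids: Set[str],
                    id_column: str = "strain") -> Tuple[List[str], List[dict]]:
    """Keep only rows whose id_column value appears in ids; return (header, rows)."""
    reader = csv.DictReader(metadata_lines, delimiter="\t")
    kept = [row for row in reader if row[id_column] in ids]
    return list(reader.fieldnames), kept

# Example: keep SEQ_1 and SEQ_3, preserving the header.
header, rows = filter_tsv_rows(
    ["strain\tcountry", "SEQ_1\tusa", "SEQ_2\tuk", "SEQ_3\tusa"],
    ids={"SEQ_1", "SEQ_3"},
)
```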
29 changes: 29 additions & 0 deletions tests/functional/filter/cram/filter-output-metadata-compressed.t
@@ -0,0 +1,29 @@
Setup

$ source "$TESTDIR"/_setup.sh

Use the same options with 3 different compression methods.

$ ${AUGUR} filter \
> --metadata "$TESTDIR/../data/metadata.tsv" \
> --subsample-max-sequences 5 \
> --subsample-seed 0 \
> --output-metadata filtered_metadata.tsv.gz 2>/dev/null

$ ${AUGUR} filter \
> --metadata "$TESTDIR/../data/metadata.tsv" \
> --subsample-max-sequences 5 \
> --subsample-seed 0 \
> --output-metadata filtered_metadata.tsv.xz 2>/dev/null

$ ${AUGUR} filter \
> --metadata "$TESTDIR/../data/metadata.tsv" \
> --subsample-max-sequences 5 \
> --subsample-seed 0 \
> --output-metadata filtered_metadata.tsv.zst 2>/dev/null

# The uncompressed outputs are identical.

$ diff <(gzcat filtered_metadata.tsv.gz) <(xzcat filtered_metadata.tsv.xz)

$ diff <(gzcat filtered_metadata.tsv.gz) <(zstdcat filtered_metadata.tsv.zst)
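The property this test relies on, sketched with standard-library codecs (zstd is omitted because it is not in the Python standard library):

```python
import gzip
import lzma

data = b"strain\tcountry\nSEQ_1\tusa\n"

# Different compression codecs must round-trip to identical bytes,
# so decompressing each output yields the same metadata.
assert gzip.decompress(gzip.compress(data)) == data
assert lzma.decompress(lzma.compress(data)) == data
```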
13 changes: 3 additions & 10 deletions tests/functional/filter/cram/filter-output-metadata-header.t
@@ -2,10 +2,7 @@ Setup

$ source "$TESTDIR"/_setup.sh

-Since Pandas's read_csv() and to_csv() are used with a double-quote character as
-the default quotechar, any column names with that character may be altered.
-
-Quoted columns containing the tab delimiter are left unchanged.
+Quoting is unchanged regardless of placement.

$ cat >metadata.tsv <<~~
> strain "col 1"
@@ -19,8 +16,6 @@ Quoted columns containing the tab delimiter are left unchanged.
$ head -n 1 filtered_metadata.tsv
strain "col 1"

-Quoted columns without the tab delimiter are stripped of the quotes.
-
$ cat >metadata.tsv <<~~
> strain "col1"
> SEQ_1 a
@@ -31,9 +26,7 @@ Quoted columns without the tab delimiter are stripped of the quotes.
> --output-metadata filtered_metadata.tsv 2>/dev/null

$ head -n 1 filtered_metadata.tsv
-  strain col1
-
-Any other columns with quotes are quoted, and pre-existing quotes are escaped by doubling up.
+  strain "col1"

$ cat >metadata.tsv <<~~
> strain col"1 col2"
@@ -45,4 +38,4 @@ Any other columns with quotes are quoted, and pre-existing quotes are escaped b
> --output-metadata filtered_metadata.tsv 2>/dev/null

$ head -n 1 filtered_metadata.tsv
-  strain "col""1" "col2"""
+  strain col"1 col2"
Contributor:

Thanks for linking to this draft PR in yesterday's lab meeting!

Seeing this change reminded me that this was implemented before the discussions around consistent TSV formats in #1566. I think we'd want to keep the consistent CSV-like quoting here. Not sure if wrapping the tsv-utils calls with csv2tsv and csvtk fix-quotes is the correct move here, as I suspect they would slow things down.
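For reference, the CSV-like quoting being removed can be reproduced with Python's csv module (a minimal sketch of the behavior under discussion, not the removed implementation itself):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
# With the default QUOTE_MINIMAL, fields containing the quote character are
# wrapped in quotes and embedded quotes are doubled.
writer.writerow(["strain", 'col"1', 'col2"'])
```

This produces the header line the old test expected: `strain\t"col""1"\t"col2"""`.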
