merge: Support sequences #1579

tsibley · 2024-08-15T18:30:34Z

augur merge should support input --sequences and --output-sequences.

Features:

Sequences with same id are coalesced right-to-left, as with metadata rows.
Duplication handling within FASTA file? Compare to duplication handling within metadata file (i.e. no dups allowed).
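
As a rough illustration of the coalescing rule above, here is a minimal sketch, not augur's implementation. The names `parse_fasta` and `merge_sequences` are hypothetical, and rejecting duplicate ids within a single input is an assumption borrowed from the metadata behaviour mentioned above.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (sequence id, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (id, sequence) pairs.

    The id is taken to be the first word of the header line (an assumption).
    """
    records: List[Record] = []
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            seq_id, chunks = line[1:].split()[0], []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def merge_sequences(inputs: Iterable[List[Record]]) -> Dict[str, str]:
    """Coalesce inputs so that, for a shared id, the last input wins."""
    merged: Dict[str, str] = {}
    for records in inputs:
        seen = set()
        for seq_id, seq in records:
            # Assumed rule: no duplicate ids within one input file.
            if seq_id in seen:
                raise ValueError(f"duplicate id {seq_id!r} within one input")
            seen.add(seq_id)
            merged[seq_id] = seq  # later inputs overwrite earlier ones
    return merged
```

With inputs a (SEQ1, SEQ2) and b (SEQ2, SEQ3) given in that order, SEQ2 in the result comes from b.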

Prior art:

ncov deduplicates sequences via scripts/sanitize_sequences.py using a rule that's only called when multiple inputs are involved (paralleling how scripts/combine_metadata.py is used).

See previous discussion in the PR which introduced augur merge.

Possible solutions


tsibley · 2024-08-21T21:00:08Z

Assigning to @victorlin per planning doc commentary.

victorlin · 2024-08-29T22:04:28Z

The proposed workaround:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
cat a.fasta b.fasta > all.fasta
augur filter --metadata c.tsv --sequences all.fasta --output-sequences c.fasta

does not work in the case of an entry existing in both a and b. augur merge will de-duplicate while cat keeps both entries, so augur filter would not properly handle (and should error on) the duplicates in all.fasta.

A proper workaround would be using seqkit rmdup in place of cat:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
seqkit rmdup b.fasta a.fasta > c.fasta
augur filter --metadata c.tsv --sequences c.fasta --output-metadata merged.tsv --output-sequences merged.fasta

noting that the order of b.fasta a.fasta is necessary to keep consistent with augur merge behavior of coalescing right-to-left. The purpose of augur filter is then to remove any entries present in sequences that are missing from metadata (and vice versa).
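
To see why the input order matters, here is a small sketch that models a first-occurrence de-duplication (the behaviour this workaround relies on from seqkit rmdup) and the subsequent restriction to ids present in the metadata. The function names and the toy records are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str]  # (sequence id, sequence)


def rmdup_keep_first(records: Iterable[Record]) -> Dict[str, str]:
    """Keep only the first record seen for each id."""
    kept: Dict[str, str] = {}
    for seq_id, seq in records:
        kept.setdefault(seq_id, seq)
    return kept


def keep_shared(metadata_ids: Set[str], sequences: Dict[str, str]) -> Dict[str, str]:
    """Drop sequences whose id is absent from the metadata."""
    return {i: s for i, s in sequences.items() if i in metadata_ids}


a = [("SEQ1", "AAAA"), ("SEQ2", "CCCC")]
b = [("SEQ2", "GGGG"), ("SEQ3", "TTTT")]

# b first: SEQ2 is taken from b, matching a right-to-left merge of a then b.
b_then_a = rmdup_keep_first(b + a)
# a first: SEQ2 is taken from a, which would disagree with the merged metadata.
a_then_b = rmdup_keep_first(a + b)
```

Restricting `b_then_a` to a metadata id set such as `{"SEQ1", "SEQ2"}` with `keep_shared` then removes SEQ3, which is the role augur filter plays in the workaround.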

The workaround is still not great since the metadata and sequence files are loosely coupled. augur merge could more tightly couple this by requiring the NAME=FILE input format. Take these hypothetical user errors and warnings as an example:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences b=b.fasta a=a.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta c=c.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta \
  --output-metadata c.tsv \
  --output-sequences c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.
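
The checks behind those hypothetical messages could look roughly like the sketch below. It is only an illustration of the idea; `check_input_names` and `unpaired_ids` are made-up names, and the message wording is shortened.

```python
from typing import List, Optional, Sequence, Set, Tuple


def check_input_names(
    metadata_names: Sequence[str], sequence_names: Sequence[str]
) -> Optional[str]:
    """Return an error message for mismatched named inputs, or None if they agree."""
    unmatched = [n for n in sequence_names if n not in metadata_names]
    if unmatched:
        return (
            "Sequence input(s) without a corresponding metadata input: "
            + ", ".join(unmatched)
        )
    # Compare against the metadata order, restricted to names that have sequences.
    expected = [n for n in metadata_names if n in sequence_names]
    if list(sequence_names) != expected:
        return "Order of inputs differs between metadata ({}) and sequences ({}).".format(
            ",".join(metadata_names), ",".join(sequence_names)
        )
    return None


def unpaired_ids(
    metadata_ids: Set[str], sequence_ids: Set[str]
) -> Tuple[List[str], List[str]]:
    """For one named pair, list ids lacking a sequence and ids lacking metadata."""
    return sorted(metadata_ids - sequence_ids), sorted(sequence_ids - metadata_ids)
```

Each id returned by `unpaired_ids` would correspond to one WARNING line in the third example.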

tsibley · 2024-08-31T01:18:31Z

Hmm. I'm not sure about these example errors. I know the original thinking was to pair metadata and sequences via the names given, but upon further reflection, I'm not sure we should require it.

# ERROR: Order of inputs differs between metadata (a,b) and sequences (b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.

Why require a matched order? That seems unfriendly to me, like asking the user to do unnecessary tedium for the computer's sake. I'd think to either

a. require named sequence inputs and make the order of --metadata be the authoritative one regardless of order in --sequences, or
b. don't require named sequence inputs and make the order of --sequences be the order to merge sequences, regardless of the order of --metadata. Names, if given, would allow for stricter warnings/validation of the metadata matching sequences.

Of the two, (b) would be better, I think.

If (a) and we want to support invocations without --metadata (i.e. --sequences only), then the order given to --sequences can matter in that specific case. For (b), this isn't an issue.
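
For comparison, the two options can be sketched as ordering rules. This is a hypothetical illustration: the function names are invented, and raising an error for a sequence name that has no metadata counterpart under (a) is an assumption, since that case is not settled here.

```python
from typing import Dict, List, Optional, Sequence, Tuple

SequenceInput = Tuple[Optional[str], str]  # (name or None, path)


def merge_order_a(
    metadata_names: Sequence[str], sequence_inputs: Sequence[SequenceInput]
) -> List[str]:
    """Option (a): names are required and the --metadata order is authoritative."""
    by_name: Dict[str, str] = {}
    for name, path in sequence_inputs:
        if name is None:
            raise ValueError(f"sequence input {path!r} must be named")
        by_name[name] = path
    if not metadata_names:
        # Sequences-only invocation: the order given to --sequences is used.
        return [path for _, path in sequence_inputs]
    unknown = [n for n in by_name if n not in metadata_names]
    if unknown:
        # Assumed behaviour for names without a metadata counterpart.
        raise ValueError(f"no metadata input named: {', '.join(unknown)}")
    return [by_name[n] for n in metadata_names if n in by_name]


def merge_order_b(sequence_inputs: Sequence[SequenceInput]) -> List[str]:
    """Option (b): the --sequences order is used as given; names are optional."""
    return [path for _, path in sequence_inputs]
```

Under (a), `--sequences b=b.fasta a=a.fasta` with `--metadata a=... b=...` would merge a.fasta then b.fasta; under (b) it would merge b.fasta then a.fasta.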

# ERROR: Sequence file (c.fasta) does not have a corresponding metadata file.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata. With strictly named pairs of metadata/sequence files, that wouldn't be possible and a stub metadata file would have to be fabricated for the sequences to avoid this error.

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

These warnings seem good to me in general, but may need tweaking if behaviour (a) or (b) above is chosen instead.

victorlin · 2024-09-03T20:00:24Z

Before debating requirement of a matched order we should first settle on whether to require named sequence inputs.

The only way to have these warnings

# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.

is if the different inputs could be paired together (e.g. via named sequence inputs). Without this pairing, there is not much benefit to allowing --metadata and --sequences in the same command. I think at that point, it would be clearer to use separate commands, signaling that no pairing/validation is happening:

augur merge --metadata a=a.tsv b=b.csv --output-metadata c.tsv
augur merge --sequences b.fasta a.fasta --output-sequences c.fasta

How about (c) require named sequence inputs when used with metadata? I'll provide an example below.

It seems to me to be reasonable/likely to have two or more paired metadata/sequence files, plus an extra file or two of sequences (e.g. corrected sequences) that don't have or need their own separate metadata.

This is reasonable, thanks for providing the example. As an extension: suppose there are datasets A (a.tsv, a.fasta) and B (b.csv, b.fasta, b_corrected.fasta). We could support all in one command:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

but that wouldn't do any validation between metadata/sequences. It might as well be separate commands:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --output-metadata merged.tsv
augur merge \
  --sequences a.fasta b.fasta b_corrected.fasta \
  --output-sequences merged.fasta

(c) would allow for this:

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b_corrected_merged.fasta is missing from b.csv. It will not be present in any output.

tsibley · 2024-09-03T20:26:53Z

Some specific thoughts below, but the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

The only way to have these warnings […] is if the different inputs could be paired together (e.g. via named sequence inputs).

Couldn't we have the warnings list all files if not paired by name?

# WARNING: Sequence `XXX` in a.tsv is missing from all sequence files.
# WARNING: Sequence `YYY` in b.fasta is missing from all metadata tables.

Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b.fasta b=b_corrected.fasta cross_input_corrected.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta
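
Parsing such optionally named inputs might look like the following sketch. The helper names are hypothetical, and treating any argument without a `NAME=` prefix as unnamed is an assumption (a bare path that itself contains `=` would be misread as named).

```python
from typing import Dict, List, Optional, Sequence, Tuple


def parse_input(arg: str) -> Tuple[Optional[str], str]:
    """Split an optional NAME=FILE argument into (name or None, path)."""
    name, sep, path = arg.partition("=")
    if sep and name and path:
        return name, path
    return None, arg


def group_by_name(args: Sequence[str]) -> Dict[Optional[str], List[str]]:
    """Group paths by name, keeping the given order within each group."""
    groups: Dict[Optional[str], List[str]] = {}
    for arg in args:
        name, path = parse_input(arg)
        groups.setdefault(name, []).append(path)
    return groups
```

For the command above, group `b` would hold both b.fasta and b_corrected.fasta, and the unnamed group would hold cross_input_corrected.fasta, so only the named groups could be cross-checked against their metadata.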

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

It might as well be separate commands:

Except that as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

victorlin · 2024-09-03T22:10:02Z

Couldn't we have the warnings list all files if not paired by name?
Or allow optional pairing by name, and multiple sequence inputs per name, to enable the warnings, but not require naming?

Sure, both of those seem reasonable.

Separately, when your example error says, "It will not be present in any output", does that mean missing sequences will filter out metadata records? I'm not sure that's what we want for augur merge.

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

augur/tests/functional/filter/cram/subsample-max-sequences-with-probabilistic-sampling-warning.t, lines 19 to 25 in f1d65fb:

10 strains were dropped during filtering
	1 had no metadata
	1 had no sequence data
	1 was dropped because it was earlier than 2012.0 or missing a date
	1 was dropped during grouping due to ambiguous month information
	6 were dropped because of subsampling criteria, using seed 314159
3 strains passed all filters

as separate commands it'll take up more (potentially much more) transient disk space that's not otherwise required.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

the big picture is that I think we'll want to gather more insight into existing use cases and feedback from potential users on the behaviour that's most useful here.

👍 good discussion of possibilities so far. Some potential questions for feedback:

Is it beneficial to validate that an entry is present in both files of an input pair?
Example: warn if an entry in a.tsv is missing from a.fasta
Is it beneficial to aggregate input types and validate that an entry is present in both types?
Example: warn if an entry in a.tsv + b.tsv is missing from a.fasta + b.fasta

In this example, would it be beneficial or confusing if SEQ2 in c.tsv came from b.tsv while in c.fasta it came from a.fasta?

# a: SEQ1, SEQ2
# b:       SEQ2, SEQ3

augur merge \
  --metadata a=a.tsv b=b.tsv \
  --sequences b=b.fasta a=a.fasta \
  --output-metadata c.tsv \
  --output-sequences c.fasta
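
Tracing that example by hand, assuming each input type is coalesced in its own given order with the last input winning (the function name `last_source` is hypothetical):

```python
from typing import Dict, List, Sequence, Tuple


def last_source(inputs: Sequence[Tuple[str, List[str]]]) -> Dict[str, str]:
    """Map each id to the name of the last input that contains it."""
    source: Dict[str, str] = {}
    for name, ids in inputs:
        for seq_id in ids:
            source[seq_id] = name
    return source


ids_a = ["SEQ1", "SEQ2"]
ids_b = ["SEQ2", "SEQ3"]

metadata_source = last_source([("a", ids_a), ("b", ids_b)])  # --metadata a b
sequence_source = last_source([("b", ids_b), ("a", ids_a)])  # --sequences b a

# Ids whose metadata row and sequence would come from different inputs.
mismatched = sorted(
    i for i in metadata_source
    if i in sequence_source and metadata_source[i] != sequence_source[i]
)
```

Here `mismatched` contains only SEQ2: its metadata row comes from b while its sequence comes from a.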

tsibley · 2024-09-04T06:36:10Z

Yes. That's how augur filter works currently, and I'm assuming we want to stay consistent:

Yeah, I know that's how augur filter works. I guess I think of merging as something separate from a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

If augur merge --sequences a.fasta b.fasta is a wrapper for seqkit rmdup b.fasta a.fasta, I don't see any overhead with separate commands. But this could change with implementation.

Ah, I meant in this example of yours:

# Create single FASTA file for B
augur merge \
  --sequences b.fasta b_corrected.fasta \
  --output-sequences b_corrected_merged.fasta

# Merge with paired validation
augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences a=a.fasta b=b_corrected_merged.fasta \
  --output-metadata merged.tsv \
  --output-sequences merged.fasta

where b_corrected_merged.fasta is the extra disk space needed.

victorlin · 2024-09-04T17:09:01Z

Yeah, I know that's how augur filter works. I guess I think of merging as something separate from a filter step here. It'd be weird IMO for data to be filtered out by augur merge.

That's a good point. My mind was stuck in augur filter land but it sounds reasonable to keep filtering out of augur merge. In that case, is there any point in having paired validation between input types? If not, the simplest approach would be to implement metadata and sequence merge support to be mutually exclusive. This would avoid the complexities that arise when allowing them to be used together such as which order to coalesce and the extent of paired validation.

In the future, if paired validation becomes necessary in workflows, we can consider adding support for both metadata+sequences in a single command as an additional feature.

victorlin · 2024-09-16T19:05:38Z

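To summarize, there are two different approaches, each with a prototype: (1) support sequences without cross-checking against metadata, and (2) support sequences with cross-checking.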

(1) is much simpler to implement: the bulk of it is an alias to seqkit rmdup.

(2) needs work to first define how much cross-checking to do (this was discussed above). Depending on the amount of cross-checking, it may be necessary to:

split the metadata merging logic into database import / merge output
read the sequence IDs into a database table
add validation across metadata and sequence tables
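
The last two steps could be sketched with Python's built-in sqlite3 module as below. This is an assumption-laden illustration, not the prototype's code: the table names, column names, and the `find_unpaired` function are invented for the example.

```python
import sqlite3
from typing import Iterable, List, Tuple


def find_unpaired(
    metadata_ids: Iterable[str], sequence_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Load ids into two tables and report ids present in only one of them."""
    db = sqlite3.connect(":memory:")
    try:
        db.execute("create table metadata (id text primary key)")
        db.execute("create table sequences (id text primary key)")
        db.executemany("insert into metadata values (?)", [(i,) for i in metadata_ids])
        db.executemany("insert into sequences values (?)", [(i,) for i in sequence_ids])

        # Ids with metadata but no sequence.
        no_sequence = [
            row[0]
            for row in db.execute(
                "select id from metadata except select id from sequences order by id"
            )
        ]
        # Ids with a sequence but no metadata.
        no_metadata = [
            row[0]
            for row in db.execute(
                "select id from sequences except select id from metadata order by id"
            )
        ]
    finally:
        db.close()
    return no_sequence, no_metadata
```

Only the sequence ids are stored in this sketch, so the table stays small even when the FASTA files are large.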

The prototypes incorporate all of the above and should be functional enough to help decide which approach we want to take, at least initially.

victorlin · 2024-09-25T22:07:04Z

Thinking through this again, we could allow all these scenarios in the same implementation:

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c.fasta b.fasta a.fasta
# WARNING: Sequence inputs are unnamed. Skipping validation between metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv \
  --sequences c=c.fasta b=b.fasta a=a.fasta
# ERROR: Order of inputs differs between metadata (a,b) and sequences (c,b,a).
# Please update one to match the other, noting that when an entry is in multiple
# inputs, only the entry in the last input will be kept.
# Alternatively, use unnamed sequence inputs to skip validation between
# metadata and sequences.

augur merge \
  --metadata a=a.tsv b=b.csv c=c.tsv \
  --sequences a=a.fasta b=b.fasta c=c.fasta
# WARNING: Sequence `XXX` in a.tsv is missing from a.fasta. It will not be present in any output.
# WARNING: Sequence `YYY` in b.fasta is missing from b.csv. It will not be present in any output.
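
The decision between these three scenarios might be sketched as follows. `plan_validation` is a hypothetical name, and skipping validation as soon as any sequence input is unnamed is an assumption, since mixed named and unnamed inputs are not covered by the examples.

```python
from typing import Optional, Sequence, Tuple


def plan_validation(
    metadata_names: Sequence[str], sequence_names: Sequence[Optional[str]]
) -> Tuple[str, Optional[str]]:
    """Return ("skip" | "error" | "validate", message or None)."""
    if any(name is None for name in sequence_names):
        return (
            "skip",
            "Sequence inputs are unnamed. "
            "Skipping validation between metadata and sequences.",
        )
    if list(sequence_names) != list(metadata_names):
        return (
            "error",
            "Order of inputs differs between metadata ({}) and sequences ({}).".format(
                ",".join(metadata_names), ",".join(n for n in sequence_names if n)
            ),
        )
    # Names and order agree: per-pair checks can run and emit warnings.
    return "validate", None
```

The three commands above map to "skip", "error", and "validate" respectively.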

I'll gather feedback on this interface in Slack.

tsibley added the enhancement New feature or request label Aug 15, 2024

tsibley self-assigned this Aug 15, 2024

victorlin mentioned this issue Aug 15, 2024

augur merge #1563

Merged

7 tasks

tsibley assigned victorlin and unassigned tsibley Aug 21, 2024

This was referenced Aug 27, 2024

merge: Support sequences with cross-checking #1601

Open

Subcommand for data de-duplication #919

Open

victorlin mentioned this issue Sep 16, 2024

merge: Support sequences without cross-checking #1631

Closed

6 tasks

victorlin mentioned this issue Nov 15, 2024

Provide a generic pattern for including additional user data alongside curated data nextstrain/pathogen-repo-guide#72

Open
