align: Improve handling of duplicate reference sequence #1892

@victorlin

augur align will error if the reference is present in the input --sequences and also given as --reference-sequence:

Duplicate strains of "XXXXXX" detected

It is not obvious from the error message, but this is by design: if the reference is already in --sequences, --reference-name should be used.

--reference-name
strip insertions relative to reference sequence; use if the reference is already in the input sequences

--reference-sequence
Add this reference sequence to the dataset & strip insertions relative to this. Use if the reference is NOT already in the input sequences
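Until the error handling changes, a workflow can make this choice itself before calling augur align. The following is a minimal sketch, not part of augur: the helper names `read_fasta_names` and `reference_args` are hypothetical, and it assumes that the reference file is a single-record FASTA and that strain names are the first whitespace-delimited token of each header.

```python
def read_fasta_names(path):
    """Return the strain name (first token of the header) of every FASTA record."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                names.append(tokens[0] if tokens else "")
    return names


def reference_args(sequences_path, reference_path):
    """Pick the augur align reference option (hypothetical helper).

    Returns ["--reference-name", NAME] when the reference is already among
    the input sequences, else ["--reference-sequence", PATH].
    """
    ref_names = read_fasta_names(reference_path)
    if len(ref_names) != 1:
        # Assumption: the reference file holds exactly one FASTA record.
        raise ValueError(f"expected one reference record, found {len(ref_names)}")
    ref_name = ref_names[0]
    if ref_name in set(read_fasta_names(sequences_path)):
        return ["--reference-name", ref_name]
    return ["--reference-sequence", reference_path]
```

The returned list could be appended to the rest of the augur align arguments (for example `["augur", "align", "--sequences", path, "--output", out]`) before running the command, so the same workflow works whether or not subsampling happened to include the reference.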

In a standard phylogenetic workflow where the reference sequence is present in the curated sequences input file, the workflow must explicitly exclude the reference sequence before alignment (e.g. zika). It's easy to forget this extra step, and the error appears only intermittently when random subsampling sometimes, but not always, includes the reference in --sequences (e.g. measles).

Possible solutions

  1. Add a special error message in this case:

    Duplicate reference sequence "XXXXXX" detected. You have two options:
    1. Use --reference-name instead of --reference-sequence
    2. Remove the reference sequence from --sequences (e.g. with augur filter)
    
  2. Automatically drop the reference from input sequences if --remove-reference is specified. This option currently only applies to --output, but it could easily be extended to apply to the input --sequences as well.
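The two options could be sketched roughly as follows. This is an illustration only and does not reflect augur's actual code: `DuplicateReferenceError`, `resolve_duplicate_reference`, and the `drop_duplicate` parameter are hypothetical names, and the input is reduced to a plain list of strain names.

```python
class DuplicateReferenceError(Exception):
    """Hypothetical error for a reference that is also in the input sequences."""


def resolve_duplicate_reference(sequence_names, reference_name, drop_duplicate=False):
    """Return the input strain names to align alongside the reference.

    drop_duplicate=False models option 1: raise a targeted error message.
    drop_duplicate=True models option 2: silently drop the reference from
    the input (e.g. when --remove-reference is given).
    """
    if reference_name not in sequence_names:
        return list(sequence_names)
    if drop_duplicate:
        return [name for name in sequence_names if name != reference_name]
    raise DuplicateReferenceError(
        f'Duplicate reference sequence "{reference_name}" detected. '
        "You have two options:\n"
        "1. Use --reference-name instead of --reference-sequence\n"
        "2. Remove the reference sequence from --sequences (e.g. with augur filter)"
    )
```

With option 2, the caller gets back the input names minus the reference and alignment proceeds; with option 1, the user sees both remedies in the error instead of the generic duplicate-strain message.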

Labels: enhancement (New feature or request)
