Description
augur align will error if the reference is present in the input --sequences and also given as --reference-sequence:
Duplicate strains of "XXXXXX" detected
It is not obvious from the error message, but this is by design: if the reference is already in --sequences, --reference-name should be used.
--reference-name
    strip insertions relative to reference sequence; use if the reference is already in the input sequences
--reference-sequence
    Add this reference sequence to the dataset & strip insertions relative to this. Use if the reference is NOT already in the input sequences
In a standard phylogenetic workflow where the reference sequence is also present in the curated sequences input file, the workflow must explicitly exclude the reference sequence before alignment (e.g. zika). It's easy to forget this extra step, and the error becomes intermittent if the reference sequence is only sometimes included in --sequences due to random sampling (e.g. measles).
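To make the exclusion step concrete, here is a minimal sketch in plain Python of dropping the reference from a set of FASTA records before alignment. The helper names `read_fasta` and `drop_reference` are hypothetical and are not part of augur, and the sketch assumes the strain name is the first whitespace-delimited token of each FASTA header.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: the strain name is the first token of the header.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_reference(records, reference_name):
    """Return the records with any entry named `reference_name` removed."""
    return [(n, s) for n, s in records if n != reference_name]


if __name__ == "__main__":
    fasta = ">REF\nACGT\n>strain_1\nACGA\n"
    kept = drop_reference(read_fasta(fasta), "REF")
    print([name for name, _ in kept])
```

Because the filter is a no-op when the reference is absent, running it unconditionally would also cover the case where the reference is only sometimes sampled into --sequences.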
Possible solutions
- Add a special error message in this case:

      Duplicate reference sequence "XXXXXX" detected. You have two options:
      1. Use --reference-name instead of --reference-sequence
      2. Remove the reference sequence from --sequences (e.g. with augur filter)

- Automatically drop the reference from input sequences if --remove-reference is specified. This option currently only applies to --output, but it could easily be extended to apply to the input --sequences as well.
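The two proposals can be sketched together as follows. This is an illustration only, not augur's actual implementation: the function `resolve_reference_duplicate`, the exception `DuplicateReferenceError`, and the `remove_reference` parameter are hypothetical names chosen for this example, and the message text is the one proposed above.

```python
class DuplicateReferenceError(ValueError):
    """Raised when the reference is supplied separately and is also in the input."""


def resolve_reference_duplicate(strain_names, reference_name, remove_reference=False):
    """Return the strain names to align alongside a separately supplied reference.

    Hypothetical helper: if the reference is not among the inputs, nothing
    changes. If it is, either drop it (second proposal) or raise a tailored
    error message (first proposal).
    """
    if reference_name not in strain_names:
        return list(strain_names)
    if remove_reference:
        # Second proposal: silently drop the duplicate from the input sequences.
        return [name for name in strain_names if name != reference_name]
    # First proposal: explain the situation and the available fixes.
    raise DuplicateReferenceError(
        f'Duplicate reference sequence "{reference_name}" detected. '
        "You have two options:\n"
        "1. Use --reference-name instead of --reference-sequence\n"
        "2. Remove the reference sequence from --sequences (e.g. with augur filter)"
    )


if __name__ == "__main__":
    print(resolve_reference_duplicate(["A", "REF"], "REF", remove_reference=True))
```

One trade-off to weigh: the tailored error keeps the current behaviour explicit, while the automatic drop removes the intermittent failure but changes what an existing flag does.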