align: Improve handling of duplicate reference sequence

`augur align` errors if the reference is present in the input `--sequences` and is also passed via `--reference-sequence`:

```
Duplicate strains of "XXXXXX" detected
```

It is not obvious from the error message, but this is [by design](https://docs.nextstrain.org/projects/augur/en/stable/usage/cli/align.html): if the reference is already in `--sequences`, `--reference-name` should be used.

> **--reference-name**
> strip insertions relative to reference sequence; use if the reference is already in the input sequences

> **--reference-sequence**
> Add this reference sequence to the dataset & strip insertions relative to this. Use if the reference is NOT already in the input sequences

In a standard phylogenetic workflow, when the reference sequence is available in the curated sequences input file, the workflow must explicitly exclude the reference sequence before alignment (e.g. [zika](https://github.com/nextstrain/zika/blob/4d8d8c452b8f3f50b771d95a17662c88818b6048/phylogenetic/defaults/exclude.txt#L2)). It's easy to forget this extra step, and the error only shows up intermittently when random sampling includes the reference sequence in `--sequences` in some runs but not others (e.g. [measles](https://github.com/nextstrain/measles/pull/68)).
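For illustration, the workflow-side exclusion step can be sketched in plain Python. This is not augur code: `drop_reference` is a hypothetical helper, and it assumes the strain name is the first whitespace-delimited token of each FASTA header.

```python
def drop_reference(fasta_text: str, reference_name: str) -> str:
    """Return FASTA text without records named `reference_name`.

    Hypothetical helper, not part of augur. Assumes the strain name is
    the first whitespace-delimited token after '>' in each header.
    """
    kept_lines = []
    keep_record = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            # Decide once per record; sequence lines follow the header's fate.
            keep_record = name != reference_name
        if keep_record:
            kept_lines.append(line)
    return "\n".join(kept_lines) + ("\n" if kept_lines else "")
```

Running this on the sampled sequences before alignment would make the result independent of whether sampling happened to pick up the reference.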

### Possible solutions

1. Add a special error message in this case:

    ```
    Duplicate reference sequence "XXXXXX" detected. You have two options:
    1. Use --reference-name instead of --reference-sequence
    2. Remove the reference sequence from --sequences (e.g. with augur filter)
    ```

2. Automatically drop the reference from input sequences if `--remove-reference` is specified. This option currently only applies to `--output`, but it could easily be extended to apply to the input `--sequences` as well.
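
A rough sketch of how the two options could fit together is below. It is only an illustration of the proposed behavior, not augur's actual implementation: `resolve_duplicate_reference` and `DuplicateReferenceError` are hypothetical names, and strains are represented as a plain list of names rather than sequence records.

```python
class DuplicateReferenceError(Exception):
    """Hypothetical error for a reference that is also in --sequences."""


def resolve_duplicate_reference(input_strains, reference_name, remove_reference=False):
    """Return the input strain names to keep before adding the reference.

    Hypothetical sketch of the proposal, not augur's real code path.
    """
    if reference_name not in input_strains:
        # No conflict: nothing to do.
        return list(input_strains)
    if remove_reference:
        # Option 2: silently drop the duplicate from the input.
        return [s for s in input_strains if s != reference_name]
    # Option 1: fail with an actionable message.
    raise DuplicateReferenceError(
        f'Duplicate reference sequence "{reference_name}" detected. '
        "You have two options:\n"
        "1. Use --reference-name instead of --reference-sequence\n"
        "2. Remove the reference sequence from --sequences (e.g. with augur filter)"
    )
```

With this shape, the specific error is raised only when the duplicate is the reference itself, so other duplicate strains would still go through the existing generic check.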

align: Improve handling of duplicate reference sequence #1892
