Subcommand for data de-duplication #919
> One question I have is: do we want the deduplicate command to be more general and not specifically tied to "metadata"?
The point about a general deduplicate command might help us scope this issue. Deduplicating sequences (FASTA files) is a different process from deduplicating data frames. Sequences often don't include the metadata we need to optimally resolve duplicates, so we have to rely on methods like "keep first". When additional metadata does exist with the sequence content, it is stored in the sequence name as delimited values (e.g., separated by `|`).

For data frames (usually CSV/TSV files), other tools already exist that could perform the deduplication process accurately and quickly. For example, one can apply csvtk and/or tsv-utils to deduplicate example SARS-CoV-2 metadata from the ncov workflow:

```
cd ncov/

# csvtk to sort by name and accession and get first record by name.
csvtk sort -t -k "Virus name" -k gisaid_epi_isl:r tests/unsanitized_metadata.tsv | csvtk uniq -t -f "Virus name"

# csvtk to sort by name and accession, tsv-utils to get first record by name.
csvtk sort -t -k "Virus name" -k gisaid_epi_isl:r tests/unsanitized_metadata.tsv | tsv-uniq -H -f "Virus\ name"

# UNIX sort with tsv-utils's keep-header to sort by name and accession and get first record by name.
keep-header tests/unsanitized_metadata.tsv -- sort -k 1,1 -k 4,4r | tsv-uniq -H -f "Virus\ name"
```

Along this same line of using existing tools to accomplish the deduplication task, we could use seqkit to convert FASTA files to TSVs, apply the same "unique" logic, and convert the result back to FASTA:

```
seqkit fx2tab -Q data/example_sequences.fasta | tsv-uniq -f 1 | seqkit tab2fx
```

If detection of duplicates is our goal (as opposed to resolution), seqkit also includes options to generate hashes of sequences. The following command generates MD5 hashes of all sequences in the input. We could then pass the names and hashes to commands like `tsv-uniq` and `tsv-summarize`:

```
# Get sequence names and hashes in TSV format.
seqkit fx2tab -H -s -n data/example_sequences.fasta > hashes.tsv

# Get duplicate sequences by name.
tsv-uniq -H -r -f "#name" hashes.tsv > duplicate_sequences_by_name.tsv

# Get sequence names with duplicate different sequences.
tsv-summarize -H -g "#name" --unique-count "seq.hash" hashes.tsv | tsv-filter -H --str-gt seq.hash_unique_count:1 > duplicate_distinct_sequences.tsv
```

You can imagine skipping the hash step completely and applying the same logic directly to the sequence content.

I just tested out a couple of the above commands on 119,053 sequences from Africa.
```
# Find duplicate distinct sequences for the same sequence name.
# This ran in 14 seconds.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -H -n -s | tsv-summarize -H -g "#name" --unique-count "seq.hash" | tsv-filter -H --str-gt seq.hash_unique_count:1
#name	seq.hash_unique_count
hCoV-19/Rwanda/U66872-RD/2020|2020-07|2020-11-05	2

# Find duplicate sequences by name (N=48).
# This ran in 10 seconds.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -n | tsv-uniq -r -f 1 | wc -l
48

# Get distinct sequences by name (N=119,004).
# This ran for 1 minute.
$ xz -c -d data/sequences_africa.fasta.xz | seqkit fx2tab -Q | tsv-uniq -f 1 | seqkit tab2fx | xz -2 -c > unique.fasta.xz
```

These sequences from Africa are a great example of how metadata can appear in the "sequence name", depending on how it is downloaded. For example, the duplicate GISAID record found above, `hCoV-19/Rwanda/U66872-RD/2020|2020-07|2020-11-05`, carries date fields delimited by `|` in its name.
The sanitize sequences script currently ignores this kind of metadata by throwing away whatever follows the first `|` in each sequence name.

We might spend a little time playing around with these approaches that rely on existing tools before we try to implement a command in Augur. We could find that these external tools are sufficient or that a light-weight wrapper command around these would be enough for our needs. A sketch of what the sequence side of such a wrapper might look like appears below.
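If a light-weight wrapper turns out to be the right scope, the sequence side could be quite small. Here is a minimal Python sketch, assuming Biopython and the "keep first" resolution discussed above; the function name, hashing choice, and error behavior are illustrative rather than Augur's actual implementation:

```python
import hashlib

from Bio import SeqIO


def drop_duplicates_by_name(records, error_on_conflict=False):
    """Yield the first record seen for each sequence name.

    Names that reappear with *different* sequence content are tracked and
    optionally reported as an error once all records have been read.
    """
    first_hash_by_name = {}
    conflicting_names = set()

    for record in records:
        # Hash the sequence so we only keep one small value per name in memory.
        sequence_hash = hashlib.md5(str(record.seq).upper().encode()).hexdigest()

        if record.id not in first_hash_by_name:
            first_hash_by_name[record.id] = sequence_hash
            yield record
        elif first_hash_by_name[record.id] != sequence_hash:
            conflicting_names.add(record.id)

    if conflicting_names and error_on_conflict:
        raise ValueError(f"Found different sequences for the same name(s): {sorted(conflicting_names)}")


# Example usage: deduplicate a FASTA file, keeping the first record per name.
records = SeqIO.parse("data/example_sequences.fasta", "fasta")
SeqIO.write(drop_duplicates_by_name(records), "deduplicated.fasta", "fasta")
```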
Thanks for compiling that @huddlej! Really useful to know that existing tools are sufficient for these use cases. I think this sub-command would be more beneficial if we use databases over TSV/FASTA files down the road (#866). In the short term, how about:
We should definitely implement the first short-term solution. I'm not sure about the best place to have the discussion about deduplicating data. A tutorial (or how-to guide, depending on the audience) might be easier for users to find and also allow us to contextualize deduplication as part of "data preparation". As we start to support users running workflows for seasonal flu and other pathogens, some of the content in the ncov tutorial will need to bubble up to the top-level docs. I didn't mean to suggest that the existing tools are completely sufficient for our use cases, but that we should try them out and see where they fail to meet our needs. It's possible that an Augur subcommand could handle some of the Nextstrain-specific details and validation that were implicit in my examples above. For example, the subcommand might check the input metadata for the presence of valid accession id columns and strain id columns before piping the input to an external tool like the ones above.
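For example, column validation along those lines might look like the following sketch, where the candidate column lists and function name are assumptions for illustration rather than an existing Augur API:

```python
import pandas as pd

# Hypothetical candidate columns; a real subcommand would presumably take
# these from flags like --metadata-id-columns and --database-id-columns.
STRAIN_ID_COLUMNS = ["strain", "name", "Virus name"]
DATABASE_ID_COLUMNS = ["gisaid_epi_isl", "genbank_accession", "Accession ID"]


def validate_id_columns(metadata_path):
    """Return the first recognized strain id and database id columns,
    raising an error before any data is piped to an external tool."""
    # Read only the header row to avoid loading the full table into memory.
    columns = pd.read_csv(metadata_path, sep="\t", nrows=0).columns

    strain_column = next((column for column in STRAIN_ID_COLUMNS if column in columns), None)
    database_column = next((column for column in DATABASE_ID_COLUMNS if column in columns), None)

    if strain_column is None or database_column is None:
        raise ValueError(
            f"Metadata must include one of {STRAIN_ID_COLUMNS} "
            f"and one of {DATABASE_ID_COLUMNS}."
        )

    return strain_column, database_column
```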
While working on the seasonal flu builds with data from FluDB, I noticed that seqkit has an `rmdup` command that can remove duplicate sequences by name or sequence content and optionally log the duplicates to a file. Adding this command right after the initial sequence download could resolve duplicate records before they reach the rest of the workflow (see the sketch below).
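A wrapper in Augur could shell out to this command directly. Here is a minimal Python sketch using `subprocess`; the file names are placeholders, and only the `seqkit rmdup` flags themselves come from seqkit's documentation:

```python
import subprocess

# Remove duplicate sequences by full name with seqkit rmdup, logging the
# number and names of duplicates to a separate file for later review.
# Input and output file names here are hypothetical.
subprocess.run(
    [
        "seqkit", "rmdup",
        "--by-name",                         # deduplicate on the full sequence name
        "--dup-num-file", "duplicates.txt",  # save number and list of duplicated names
        "--out-file", "deduplicated.fasta",
        "sequences.fasta",
    ],
    check=True,  # raise CalledProcessError if seqkit fails
)
```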
[note] This was proposed in #860, but creating this issue for discussion tailored to the de-duplication task.
Context
Currently, the ncov workflow has scripts for removing duplicates from metadata and sequences:

- `sanitize_metadata.get_database_ids_by_strain()`
- `sanitize_metadata.filter_duplicates()`
- `sanitize_sequences.drop_duplicate_sequences()`
The goal is to add this functionality into Augur for more general use, especially to help users that run into situations such as #616 and #915.
Things to address
Name/placement of command
Both were proposed in #860, but we need to choose here:

- A sub-command of a clean/transform/sanitize sub-command (name also undetermined)
- A direct sub-command of `augur`
Parameters
Here are some parameters and I/O combinations that may be appropriate for this new command.
Metadata I/O:
Currently, in `sanitize_metadata.get_database_ids_by_strain()`, duplicates are detected by ID column (e.g., `strain`) and the set of duplicate IDs is written to a file. Then, in `sanitize_metadata.filter_duplicates()`, duplicates are resolved by sorting records by strain ID and by database accession IDs (e.g., `Accession ID` from GISAID, `gisaid_epi_isl`, or `genbank_accession`) in descending order and selecting the record with the most recent accession ID.

Sequences I/O:
Currently, in `sanitize_sequences.drop_duplicate_sequences()`, duplicates are detected by sequence name (usually the strain ID, matching `strain` in the metadata) and only the first occurrence is kept. This function optionally raises a `DuplicateSequenceError`, if the user has requested it and if the function finds different sequences for the same sequence name. The function does not consider multiple identical sequences for the same name to be an error.

Metadata + sequences I/O:
If the metadata + sequences are a 1:1 mapping, then we may be able to use the metadata way of removing duplicates and apply it to the sequences counterpart. However, when there is not a 1:1 mapping between metadata and sequences, this implementation of a deduplicate command could effectively turn into a "filter" command. Common issues include presence of metadata but not sequences, absence of metadata but presence of sequences, or any number of duplicates for either data type (a sketch of surfacing these mismatches follows).
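As a rough illustration of detecting those mismatches, assuming tab-delimited metadata with a `strain` column and Biopython for FASTA parsing (file and column names are placeholders):

```python
import pandas as pd
from Bio import SeqIO

# Collect strain ids from both inputs.
metadata_strains = set(pd.read_csv("metadata.tsv", sep="\t", usecols=["strain"])["strain"])
sequence_strains = {record.id for record in SeqIO.parse("sequences.fasta", "fasta")}

# Mismatches that would push a deduplicate command toward "filter" behavior.
metadata_without_sequences = metadata_strains - sequence_strains
sequences_without_metadata = sequence_strains - metadata_strains

print(f"{len(metadata_without_sequences)} metadata records lack sequences.")
print(f"{len(sequences_without_metadata)} sequences lack metadata records.")
```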
Current parameters from `ncov/sanitize_metadata.py`:

- `--metadata-id-columns`: A list of potential id columns for strain names in the metadata.
- `--database-id-columns`: A list of potential database id columns whose values can be used to deduplicate records with the same strain name.
- `--metadata-chunk-size`: Number of records to read into memory at once from the metadata.
- `--error-on-duplicate-strains`: Throw an error when duplicate records are detected.

Current parameters from `ncov/sanitize_sequences.py`:

- `--error-on-duplicate-strains`: Throw an error when duplicate records are detected.

De-duplication method (similar to the `keep` parameter of `pandas.DataFrame.drop_duplicates()`): in the ncov workflow, this is effectively `last` for metadata and `first` for sequences (see the pandas sketch at the end of this issue).

Write duplicates to a custom file instead of the hardcoded `{metadata}.duplicates.txt`.

Anything else?
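To make the de-duplication method item above concrete, the metadata behavior maps onto pandas roughly like this (file and column names are assumed from the GISAID examples earlier in this thread, not taken from `sanitize_metadata` itself):

```python
import pandas as pd

metadata = pd.read_csv("metadata.tsv", sep="\t")

# Sort by strain id, then by accession id so the most recent accession sorts
# last, and keep the last record per strain. This is equivalent to the
# descending sort + "keep first" used in the csvtk/tsv-utils examples above.
deduplicated = (
    metadata
    .sort_values(["strain", "gisaid_epi_isl"])
    .drop_duplicates(subset="strain", keep="last")
)

deduplicated.to_csv("deduplicated_metadata.tsv", sep="\t", index=False)
```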