Sequence ID and Comment formatting breaks when using individual sequences (-i) flag #188

@OliverDietrich

Dear developers,

I am using Mash to compute distances between genome assemblies of bacterial strains (e.g. Pseudomonas).
I prepare sketches for the query and reference sequences from lists of sequence files:

mash sketch -o query.msh -l query_seqs
mash sketch -o ref.msh -l ref_seqs

and compute distances using

mash dist query.msh ref.msh > distances.tsv

which can easily be attributed to the different samples using the file paths contained in the 'ID' columns.

I understand that this works because the sketch file

mash info query.msh

carries the file path in 'ID' and a description of the sequence headers in the 'Comment' (e.g. "[N seqs] header 1 [...]").

I assumed that with the individual sequences (-i) flag, the IDs would stay the same (the file path) and the Comment would show the individual sequence headers, which are concatenated in distances.tsv separated by ":" (as when using the '-C' flag). However, I find that the file path is missing from the sketch files (.msh). Instead, each sequence header is split at the first whitespace (" "): the first token becomes the ID and everything after it becomes the Comment.
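To make the observed behavior concrete, here is a minimal Python sketch of the split described above. It only mirrors what I see in the `mash info` output and is not taken from Mash's source code; the function name `split_header` and the example header are made up for illustration.

```python
def split_header(header):
    """Split a FASTA header the way the -i sketches appear to:
    first whitespace-delimited token -> ID, remainder -> Comment.
    (Illustration of the observed behavior, not Mash's actual code.)"""
    text = header.lstrip(">").strip()
    parts = text.split(None, 1)  # split once, on the first run of whitespace
    seq_id = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return seq_id, comment


# Hypothetical header; note that the originating file path appears nowhere.
print(split_header(">contig_1 length=1000 cov=12"))
```

Nothing in either field identifies the file the sequence came from, which is the information I am missing.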

I agree that with the individual sequences (-i) flag, the assignment of IDs has to change so as not to duplicate IDs. However, I assumed that the file of origin would be stored somewhere in the sketch, either in the ID or in the Comment. With the current formatting, only the individual sequence headers are retained and the file path is lost.

Can you confirm that this behavior is intended?

I assumed that the "-I" and "-C" flags could be used to set IDs and Comments for sketches, but they only apply to the first entry. Is there any way to use "-i" and still retain the file path information?
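In case it helps others, one possible workaround is to rewrite the FASTA headers before sketching so that the file path becomes part of the first token, which then ends up in the ID when "-i" is used. The following is a rough sketch under that assumption and not a Mash feature: the function names, the "|" separator, and the file names in the usage note are my own choices. It assumes the file paths contain neither whitespace nor the separator.

```python
from pathlib import Path


def tag_fasta_headers(src, dst, sep="|"):
    """Copy FASTA file `src` to `dst`, prefixing every header with the
    source path so the path becomes part of the first header token.
    Assumes the path contains no whitespace and no `sep` character."""
    src = Path(src)
    with src.open() as fin, Path(dst).open("w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(f">{src}{sep}{line[1:]}")
            else:
                fout.write(line)


def split_tagged_id(tagged_id, sep="|"):
    """Recover (file path, original sequence ID) from a tagged ID,
    e.g. when reading the ID columns of the distance table."""
    path, _, seq_id = tagged_id.partition(sep)
    return path, seq_id
```

For example, `tag_fasta_headers("strainA.fasta", "strainA.tagged.fasta")` (hypothetical file names) would turn `>contig_1 length=1000` into `>strainA.fasta|contig_1 length=1000`. The tagged files can then be sketched with "-i", and `split_tagged_id` separates path and sequence ID again afterwards.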

Cheers,

Oliver
