Description
Dear developers,
I am using Mash to compute distances between genome assemblies of bacterial strains (e.g. Pseudomonas).
I prepare sketches for the query and reference sequences from lists of sequence files:
mash sketch -o query.msh -l query_seqs
mash sketch -o ref.msh -l ref_seqs
and compute distances using
mash dist query.msh ref.msh > distances.tsv
The resulting distances can then be attributed to the individual samples via the file paths in the 'ID' columns.
I understand that this works because the sketch file, as shown by
mash info query.msh
carries the file path in 'ID' and a summary of the sequence headers in 'Comment' (e.g. "[N seqs] header 1 [...]").
I assumed that with the individual sequences (-i) flag, the ID would stay the same (the file path) and the Comment would hold the individual sequence header, with both then appearing in distances.tsv joined by ":" (as when using the '-C' flag). However, I find that the file path is missing from the sketch files (.msh). Instead, each sequence header is split at the first whitespace (" "): the first field becomes the ID and everything after it becomes the Comment.
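To make the observed behavior concrete, here is a small Python sketch of how the headers appear to be split when -i is set. It only mirrors what I see in the mash info output and is not taken from the Mash source; `split_header` is a name I made up for illustration.

```python
def split_header(header):
    """Split a FASTA header the way the -i output appears to be split.

    Assumption: the text before the first whitespace becomes the ID and
    the remainder becomes the Comment. The file path appears nowhere.
    """
    text = header.lstrip(">").strip()
    parts = text.split(None, 1)  # split once, at the first run of whitespace
    seq_id = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return seq_id, comment
```

For a header such as ">contig_1 some description", this gives the ID "contig_1" and the Comment "some description".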
I agree that with the individual sequences (-i) flag, the assignment of IDs has to change so as not to duplicate IDs. However, I assumed that the file of origin would still be stored somewhere in the sketch, either in the ID or in the Comment. With the current formatting, only the individual sequence headers are retained and the file path is lost.
Can you confirm that this behavior is intended?
I assumed that the "-I" and "-C" flags could be used to set IDs and Comments for sketches, but they only apply to the first entry. Is there any way to change the default behavior so that "-i" can be used while retaining the file path information?
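As a stop-gap, I am considering restoring the file of origin outside of Mash by indexing the FASTA headers myself and appending the paths to the distance table. Below is a minimal Python sketch of that idea. The function names are hypothetical, and it assumes uncompressed FASTA files, sequence IDs that are unique across files, and that columns 1 and 2 of the mash dist output hold the IDs from the first and second sketch given on the command line.

```python
def ids_from_fasta(lines):
    """Yield the ID (text before the first whitespace) of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if fields:
                yield fields[0]


def build_index(list_file):
    """Map sequence ID -> FASTA path for every file named in list_file."""
    with open(list_file) as handle:
        paths = [line.strip() for line in handle if line.strip()]
    index = {}
    for path in paths:
        with open(path) as fasta:  # assumption: plain-text, uncompressed FASTA
            for seq_id in ids_from_fasta(fasta):
                # IDs repeated across files would collide; keep the first seen
                index.setdefault(seq_id, path)
    return index


def annotate(rows, first_index, second_index):
    """Append the file paths for columns 1 and 2 to each distance row."""
    for row in rows:
        yield list(row) + [
            first_index.get(row[0], "NA"),
            second_index.get(row[1], "NA"),
        ]
```

With the commands above, this would mean calling `build_index("query_seqs")` and `build_index("ref_seqs")`, reading distances.tsv as tab-separated rows, and passing them through `annotate`. It works around the missing path rather than storing it in the sketch, so I would still prefer a built-in option.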
Cheers,
Oliver