Description
This is to start the ball rolling on adding sequence collections to the SAM header, so we can use them in SAM, BAM and CRAM. There is also a related issue for VCF and BCF. (It's easier there, though, due to the free-form nature of the header, so for now we can concentrate on the harder case, and VCF will hopefully follow in a simpler fashion.)
I see a few potential avenues.
- Adding a new `@SC` header tag. Cleanest solution as it's a direct separation of data types, but problematic for parsers.
- Permitting an SQ header without SN or LN, but with a new sequence collection tag (e.g. SC). This is basically a substitute for `@SC`, but one that wouldn't break (as many) parsers. (Or alternatively, a known fake SN/LN field, e.g. `SN:SeqCol LN:0`.)
- Keeping all the existing `@SQ` header lines but populating the sequence collection tag. This is comparable to UR where we indicate the reference file, but now it's the collection this SQ belongs in. Inevitably it's somewhat verbose as we specify the same thing many times.
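To make the parser trade-off concrete, here is a rough sketch of how a strict header parser might react to each of the three layouts above. Everything in it is illustrative: the `SC` tag is only the name floated in this issue, `EXAMPLE_DIGEST` is a placeholder rather than a real identifier, and `strict_parser_accepts` is a hypothetical stand-in for existing tools, which will differ in how strict they actually are.

```python
from typing import Dict, Tuple

# Header record types a strict parser is assumed to know about.
KNOWN_RECORD_TYPES = {"@HD", "@SQ", "@RG", "@PG", "@CO"}


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one tab-separated SAM header line into (record type, tag dict)."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0]
    if record_type == "@CO":
        # Comment lines carry free text, not TAG:VALUE pairs.
        return record_type, {}
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags


def strict_parser_accepts(line: str) -> bool:
    """Hypothetical strict check: known record type, and @SQ needs SN and LN.

    Only the presence of SN and LN is checked; a real parser may also
    validate the LN range and so reject LN:0.
    """
    record_type, tags = parse_header_line(line)
    if record_type not in KNOWN_RECORD_TYPES:
        return False
    if record_type == "@SQ":
        return "SN" in tags and "LN" in tags
    return True


# Hypothetical header lines for each option; the digest is a placeholder.
option_1 = "@SC\tSC:EXAMPLE_DIGEST"
option_2_bare = "@SQ\tSC:EXAMPLE_DIGEST"
option_2_fake = "@SQ\tSN:SeqCol\tLN:0\tSC:EXAMPLE_DIGEST"
option_3 = "@SQ\tSN:chr1\tLN:248956422\tSC:EXAMPLE_DIGEST"

for name, line in [
    ("new @SC record", option_1),
    ("@SQ without SN/LN", option_2_bare),
    ("@SQ with fake SN/LN", option_2_fake),
    ("existing @SQ plus SC tag", option_3),
]:
    print(f"{name}: accepted={strict_parser_accepts(line)}")
```

Under these assumptions only the fake SN/LN variant and the per-`@SQ` tag get through the strict check, and the fake-field variant would in turn add a phantom reference called SeqCol to the sequence dictionary.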
Inherently there are problems here, but it's hard to know how to navigate them such that we have the least pain while actually gaining useful functionality. The cleanest method breaks pretty much everything, from SAM parsing to BAM to CRAM. The last method has the fewest problems with existing parsers, but it doesn't really add anything other than improved data provenance. Not to belittle it - that is useful - but it's basically just a synonym for identifying the assembly we used, which we can already do.
This issue arose out of the latest RefGet meeting. Other topics that may be relevant were:
- SQ M5 tag is for MD5sum, but sequence collections use sha512t24u.
- Caching. We'd probably treat SeqCol just as another way of populating a reference cache, so we fetch sequences once and from then on we either load directly from local repositories (if we have SQ headers) or fetch the small SeqCol meta-data to get sequence IDs. (With caveats on types of checksum.)
- Use of DRS for discovery from within sequence collections, and permitting more distributed reference servers, but that's early and ongoing work still (to be tracked in their GitHub repo instead of here).
- Whole gzipped fasta downloads for the entire collection (again via DRS and housed on e.g. s3 buckets). Not so useful except for initial population of a local cache, and even then it may include duplicate data when we have multiple variations on the same assembly.
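As a rough illustration of the checksum mismatch and the caching idea above, the sketch below computes both digest styles for a single sequence and stores the sequence in a toy cache under either key. It assumes sha512t24u means SHA-512 truncated to 24 bytes and base64url-encoded, and that both digests are taken over the uppercased sequence with whitespace removed. `ReferenceCache` is a hypothetical in-memory stand-in, not how any existing reference cache works, and the digest of a whole collection (computed over its metadata) is not covered here.

```python
import base64
import hashlib
from typing import Dict, Optional


def normalise(sequence: str) -> bytes:
    """Uppercase the sequence and drop all whitespace (assumed convention)."""
    return "".join(sequence.split()).upper().encode("ascii")


def md5_hex(sequence: str) -> str:
    """MD5 hex digest, in the style of the existing SQ M5 tag."""
    return hashlib.md5(normalise(sequence)).hexdigest()


def sha512t24u(sequence: str) -> str:
    """SHA-512 truncated to 24 bytes, base64url-encoded (32 characters)."""
    digest = hashlib.sha512(normalise(sequence)).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


class ReferenceCache:
    """Toy cache: one sequence is reachable via either checksum type."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, bytes] = {}

    def add(self, sequence: str) -> None:
        data = normalise(sequence)
        self._by_digest[md5_hex(sequence)] = data
        self._by_digest[sha512t24u(sequence)] = data

    def get(self, digest: str) -> Optional[bytes]:
        # None signals a miss, i.e. the caller would fetch remotely.
        return self._by_digest.get(digest)


cache = ReferenceCache()
cache.add("acgt\nACGT")
print(md5_hex("acgt\nACGT"), sha512t24u("acgt\nACGT"))
print(cache.get(sha512t24u("ACGTACGT")))
```

Indexing by both digests is one way to handle the caveat on checksum types: a sequence fetched once via an M5 value would still be found later when the lookup comes from sequence collection metadata.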
Thoughts on where to start? I'm aware this is likely to be a long-term issue, so discussing it now and getting the ball rolling may help speed things up when the sequence collections specification is finalised, as well as perhaps steering their implementations in the most useful directions.