This is just a list of things that could be improved whenever we next revise the format, so that we don't forget any. I'm not suggesting an immediate update, just gathering ideas in one place.
- Magic number with version string.
- Add number of reads / bases as columns. This will make very approximate coverage plots trivial as well as improve tools like samtools idxstats so they work on both BAM and CRAM. What else in idxstats needs replicating?
- A generation UUID. If coupled with an identical UUID in the SAM header then we can use this to spot cases where the CRAM file has been updated without rebuilding the index. (We want to add this same feature to .BAI and .CSI too.)
- Check the utility of the container size column. I think currently it is the number of remaining bytes after decoding the container header (and perhaps the compression header?). More useful for random slicing would simply be the size of the entire container.
- Consider whether gzipped text is the right format. We could provide for random access on a compressed index by self-indexing the index, but that's a far larger change.
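
To make the read/base count idea concrete, here is a rough Python sketch of how a reader might consume such an index. It assumes the current layout is gzipped, tab-separated text with six integer columns per slice (reference id, alignment start, alignment span, container offset, slice offset, size), which is my reading of the existing format. The two trailing columns `n_reads` and `n_bases` are hypothetical: they do not exist today, and their names and positions are only an illustration. The helper names (`parse_crai`, `reads_per_reference`, `approx_depth`) are made up for this example too.

```python
import gzip
import io
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class CraiEntry(NamedTuple):
    ref_id: int            # reference sequence id
    start: int             # alignment start
    span: int              # alignment span
    container_offset: int  # byte offset of the container in the file
    slice_offset: int      # byte offset of the slice within the container
    size: int              # size column (exact meaning is what the issue questions)
    n_reads: Optional[int] = None  # HYPOTHETICAL extra column, not in the current format
    n_bases: Optional[int] = None  # HYPOTHETICAL extra column, not in the current format


def parse_crai(data: bytes) -> List[CraiEntry]:
    """Parse gzipped index text; accepts 6 columns, or 8 with the hypothetical counts."""
    entries = []
    with gzip.open(io.BytesIO(data), "rt") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            fields = [int(x) for x in line.split("\t")]
            entries.append(CraiEntry(*fields))
    return entries


def reads_per_reference(entries: List[CraiEntry]) -> Dict[int, int]:
    """idxstats-like summary: total reads per reference id, from the index alone."""
    totals: Dict[int, int] = defaultdict(int)
    for e in entries:
        if e.n_reads is not None:
            totals[e.ref_id] += e.n_reads
    return dict(totals)


def approx_depth(entry: CraiEntry) -> Optional[float]:
    """Very approximate mean depth over one slice: bases divided by span."""
    if entry.n_bases is None or entry.span <= 0:
        return None
    return entry.n_bases / entry.span
```

With the extra columns present, a per-reference read count and a coarse coverage plot need nothing but the index. With a six-column index the same code still parses, and the summaries simply come back empty.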