Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add splitting BAM index to spec #321

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

tomwhite
Copy link

@tomwhite tomwhite commented Jun 5, 2018

Initial draft of a spec for splitting BAM index files, similar to that used in Hadoop-BAM. A htsjdk implementation can be found in samtools/htsjdk#1138.

that used in Hadoop-BAM. A htsjdk implementation can be found in
samtools/htsjdk#1138.
@tomwhite
Copy link
Author

tomwhite commented Jun 5, 2018

Here's what the generated page looks like: SAMv1-sbi.pdf

@jmarshall
Copy link
Member

jmarshall commented Jun 5, 2018

Thanks, that's looking pretty nice, and seems like a useful convention to establish.

Some thoughts from a fairly quick reading:

  • You don't actually chop up the BAM file on disk (I assume), so maybe say “by conceptually dividing” or similar.

  • Would be good to de-emphasise the empty BAM corner case. Would a footnote saying “In the unlikely event the BAM file has no alignment records, the list of offsets will be empty” be accurate?

  • If you split on a fixed number of compressed bytes, presumably the granularity is an approximate thing? Can the granularity be approximate or is it usually -1 (unless you've split by constant number of alignment records, which the first paragraph suggests would be atypical)?

  • Would the format be improved by a n_offsets field after granularity giving the number of entries in the list? (I guess this is an existing already-in-use format, alas…)

  • Is the third paragraph (“To find the alignments[…]”) how you actually use the index? Or might one just load in the index, and process [offset[0], offset[k]), [offset[k], offset[2k]), [offset[2k], offset[3k]), etc in parallel, i.e., process in batches of k offset entries…?

  • The spec writes intervals without a space, but actually we should probably use thin spaces since they're available; so {\tt [beg,\,end)} twice.

  • s/suffix/extension/ perhaps, and where does this extension fit in? For a BAM file foo.bam, is the index file foo.bam.sbi or foo.sbi — specifically one of those in existing practice, or could it be either? (cf Spec needs to include the accepted file extentions for the index files #215 argh; also it's the index filenames that have the suffix/extension, not the index files)

  • Should probably spell out “ascending” in more than one word 😄, and use the word “sort” somewhere.

@jmarshall jmarshall added the sam label Jun 5, 2018
@tomwhite
Copy link
Author

tomwhite commented Jun 5, 2018

@jmarshall thanks for your comments. I've addressed them in turn below.

Latest draft here: SAMv1-sbi.pdf

  • You don't actually chop up the BAM file on disk (I assume), so maybe say “by conceptually dividing” or similar.

Correct. Updated the wording.

  • Would be good to de-emphasise the empty BAM corner case. Would a footnote saying “In the unlikely event the BAM file has no alignment records, the list of offsets will be empty” be accurate?

There would be a single offset for the length of the file. Added a footnote saying "In the unlikely event the BAM file has no alignment records, the list of offsets will contain a single entry for the overall length of the BAM file."

  • If you split on a fixed number of compressed bytes, presumably the granularity is an approximate thing? Can the granularity be approximate or is it usually -1 (unless you've split by constant number of alignment records, which the first paragraph suggests would be atypical)?

Splitting and granularity are separate things, since you choose a granularity when writing the index (often at the same time as writing the BAM file), which is before you know what the split size is going to be. Indeed, the split size may vary from system to system (e.g. in Hadoop and Spark the split size defaults to 128MB, but may vary on a per-file basis). When you split the file (conceptually), you have a byte range that you convert into virtual file pointers, then you adjust the first point to start on an alignment. It may not be the very first alignment after the split start (unless the granularity is 1), but that doesn't matter (since the assumption is that there are lots of offsets in the split), and the key property is that a set of file ranges that cover the file map into a set of virtual file ranges that cover the file.

There is a case when the granularity is not exact, which is when you write what is conceptually a single BAM file in multiple parts (in parallel) then merge them into a single file (we do this a lot in Spark). It's desirable to write the splitting index files at the same time, but merging them means the granularity is not exact.

Perhaps granularity should be relaxed to be an approximate concept to cope with this case.

  • Would the format be improved by a n_offsets field after granularity giving the number of entries in the list? (I guess this is an existing already-in-use format, alas…)

You'd have to buffer all the offsets for a file in memory to do this, since you don't know the total number until you reach the end of the file. The format in use in Hadoop-BAM is slightly different actually, as it's just a list of offsets; the one proposed here adds a magic string and the granularity, and uses a little-endian representation to be consistent with BAM. So we could make changes here.

  • Is the third paragraph (“To find the alignments[…]”) how you actually use the index? Or might one just load in the index, and process [offset[0], offset[k]), [offset[k], offset[2k]), [offset[2k], offset[3k]), etc in parallel, i.e., process in batches of k offset entries…?

Yes, it describes how to use the index in the case where the splits are predetermined. The way you describe is another way to use the index.

  • The spec writes intervals without a space, but actually we should probably use thin spaces since they're available; so {\tt [beg,\,end)} twice.

Thanks. Haven't written latex for years...

Changed to mandate foo.bam.sbi (like CRAM, which is the only one that is consistent according to #215).

  • Should probably spell out “ascending” in more than one word 😄, and use the word “sort” somewhere.

Done.

SAMv1.tex Outdated
\multicolumn{3}{|l|}{\sf magic} & Magic string & {\tt char[4]} & {\tt SBI\char92 1}\\\cline{1-6}
\multicolumn{3}{|l|}{\sf granularity} & Number of alignments between offsets, or $-1$ if unspecified & {\tt int32\_t} & \\\cline{1-6}
\multicolumn{6}{|c|}{\textcolor{gray}{\it List of offsets}} \\\cline{2-6}
& \multicolumn{2}{l|}{\sf offset} & Virtual file offset of the alignment & {\tt uint64\_t} & \\\cline{1-6}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there should be a special final entry after the list the table for the offset to the end of the bam.

SAMv1.tex Outdated
file.\footnote{In the unlikely event the BAM file has no alignment records,
the index will consist of a single entry for the overall length of the
BAM file.} It does not need to contain a virtual file offset for every
alignment, merely a subset. A granularity of $n$ means that an offset is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow indication of approximate number of records per offset? Or is that just making things unnecessarily complicated? At the cost of increasing the index size we could include the number of records in each section in the index. Instead of [offset], we'd have [offset, number of records until next offset] That might be useful for tools deciding how to split the file.

SAMv1.tex Outdated
\subsection{Splitting BAM}\label{sec:code}
A BAM file can be processed in parallel by conceptually dividing the file into
splits (typically of a fixed, but arbitrary, number of bytes) and for each
split processing alignments from the first known alignment after the split
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want a sentence here describing why this additional index is necessary and this use can't be handled by the bai.

SAMv1.tex Outdated
use the index to find the smallest virtual file offset, {\tt v1}, that falls
in this range, and the smallest virtual file offset, {\tt v2}, that is
greater than or equal to {\tt end}. If {\tt v1} does not exist, then the
split has no alignments. Otherwise, it has alignments in the range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no alignments -> no alignment starts

SAMv1.tex Outdated
split processing alignments from the first known alignment after the split
start up to the first known alignment of the next split.

A splitting BAM index is a linear index of virtual file offsets of alignment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is a bit hard to parse on the first run through. To many of in a row. I'm not sure how to improve it without making it much more wordy though, but maybe someone has a good idea.

Also, does linear index imply that it's a sorted list of increasing offsets? Should we mention that somewhere?

@lh3
Copy link
Member

lh3 commented Jun 5, 2018

You'd have to buffer all the offsets for a file in memory to do this, since you don't know the total number until you reach the end of the file.

For BGZF files, it doesn't make sense to choose an granularity in bytes smaller than the average size of a BGZF block. Suppose we have a 100GB BAM and the average block size is 10000 bytes (I don't know the exact number). We need 100e9/10000=10M virtual offsets or 80M bytes. This is a relatively small number. In practice, we will want to use a larger granularity and thus less memory. I don't think we should worry about memory when creating .sbi. When using .sbi, we can choose to randomly seek into .sbi anyway. Having an extra n_offsets field allows us to check if .sbi is truncated and makes it possible for future extensions.

Also, I think such an index should be made more general. In principle, all BGZF'd files, including bam, bcf2 and tabix files, can use this index. To be compatible with tabix, we need to store the sequence dictionary like in .csi. There is already a tool for tabix. It would be good to make a similar tool available to all BGZF files that htslib supports.

@lh3
Copy link
Member

lh3 commented Jun 5, 2018

I just noticed that csi-v2 keeps the number of records in the index (I was not aware of this; has this been implemented?). It achieves a similar goal to .sbi and is more powerful. What is the intended use of .sbi? Do you load the entire .sbi into memory, or seek into it?

EDIT: .csi only works with coordinate-sorted files. .sbi works with any record-based files. In this sense, .sbi is not redundant. Nonetheless, I do think it should be generalized to more file types like .csi.

@lbergelson
Copy link
Member

@lh3 The goal of the .sbi is to make it easy to find record start sites in close proximity to a physical compressed file location. It's for the case where you want to split a bam in parallel in a deterministic way into roughly even chunks by compressed size. Does .csi-v2 support that use case? .bai isn't good for this case because you can have regions that are very large in file size but cover a small genomic region, and there isn't an easy way to subdivide that.

@lh3
Copy link
Member

lh3 commented Jun 5, 2018

I have added an EDIT line later to clarify that csi-v2 doesn't replace .sbi. However, I do think .sbi should have n_offsets and should be generalized.

@lbergelson
Copy link
Member

@lh3 Ah, sorry, I missed the edit. I agree that we should / can generalize them. I think the original thought was scoping it to bam at first might be easier in some way, but there's no reason we can't make it more general right off the bat.

@tomwhite One problem we often have with indexes is that there isn't any way of identifying if they were generated from the bam file that they're associated with. Would it be a good idea to include a checksum of the associated bam in the index here?

@lh3
Copy link
Member

lh3 commented Jun 5, 2018

The index format I am thinking about looks like the following, where anything beforefirst_offset is header and anything after end_offset is footer. It is possible to remove first_offset and require the first offset in the list to always be the offset of the first record. n_records and end_offset are not really necessary, but they are cheap to derive and help to detect and reduce out-of-range access.

screen shot 2018-06-05 at 14 10 17

@jmarshall
Copy link
Member

CSIv2 was never implemented (or the format design considered workable) because no conclusion was ever reached as to how to correctly include counts from the higher bins (see mailing list conversation from January 2015).

Separating the flat split index into a separate .sbi file has the advantage that the data file need not be coordinate sorted. Agreed that this index needn't be specific to BAM and could be described in terms applicable to any record-orientated BGZF-compressed file.

@lh3
Copy link
Member

lh3 commented Jun 5, 2018

@lbergelson One crude way to check if the index corresponds to a specific BGZF file is to store the total file size in .sbi. This approach is cheap to implement at the cost of occasionally missing errors. Checksum would require to read through the entire file (either compressed or uncompressed), so is not suitable for a quick check. Another possible way is to store the file size and the checksum of the uncompressed header. It is a bit complicated to implement, though.

@lbergelson
Copy link
Member

@lh3 File size is a good inexpensive check. In google cloud they keep an md5 of each file as part of the object store, so on some platforms md5 is also an inexpensive test. I think adding some sort of consistency check to the header of the index would be a good idea.

@lh3
Copy link
Member

lh3 commented Jun 5, 2018

@lbergelson I realized that file size is not always easy to implement. When we index a BAM stream and reach the last record, it is not always clear what's the size of a footer. If I am right, it is technically possible to append multiple empty BGZF blocks to the last record. Then we wouldn't easily know the exact file size on disk. Checksum might have a similar issue.

I agree it would be good to have a cheap check, but I am not sure what is the best strategy. I'll let others chime in.

@tomwhite
Copy link
Author

tomwhite commented Jun 6, 2018

Thanks for all the feedback. I agree about having a more general specification for the format. I'll create a new document for it.

@lh3 I agree that it's feasible to store the offsets in memory when writing, so that n_offsets can be computed and written out before writing the offsets.

Regarding first_offset and end_offset, in the case of a file with no records, these would be the same, with both pointing to the end of header/start of footer. The list of offsets would be empty.

I had a look at implementing end_offset in htsjdk for BAMs, and it's tricky to do it from a SamReader since there is no way to access the underlying BlockCompressedInputStream to get the offset after the last record has been read. (Also the first_offset in the case of no records.) This is an API limitation, so I'm not sure it should have much bearing on the spec though.

Checksums: I think MD5 is tricky since it's not possible to compute in parallel easily. I think file length would be good enough in most cases. @h3 is your concern that a BAM file can be tampered with (by adding an empty BGZF block)? I don't think we can guard against that, but the file length is still a good sanity check in most cases.

A use case that I want to be able to support is the ability to write SBI files in parallel and merge them afterwards into a single SBI file. This is a pattern we have with BAM files where we write them as headerless shards in parallel using Spark and concatenate them together. I think the proposed format would support this pattern, with the caveat that granularity would be unspecified. (Actually, it might be possible to get the granularity right by writing index files of granularity 1 from each task and then downsampling them appropriately when combining them. It might not be worth the effort for the parallel processing use case though.) Note that this is a different use case to tools like grabix, where granularity must be set. I think we can support both by allowing an unspecified granularity (-1) in the first case.

@jmarshall
Copy link
Member

I agree about having a more general specification for the format. I'll create a new document for it.

Don't worry about a separate document for this, just organised as a section like you have but couched in more generic "BGZF file" rather than "BAM" language will be fine.

(I have a project on the back-burner to bring the CSIv1.tex and tabix.tex document contents, such as they are, alongside the BGZF indexing and BAI text. We should have fewer micro-documents for index formats, not more! 😄)

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 6, 2018

One omission from bai, csi and crai is any way to tell if they're in sync with the associated file. If we were designing formats from scratch (that have separate indices) then we'd probably add a UUID to the file and require the same UUID in the index so it's possible to spot replacement of one without the other.

Here however we're defining one new format and not redefining the other, so it's less clear how to achieve this, unless we wish to also at some stage acknowledge a UUID optional tag in the sam @HD line. If so, we could consider making space for this ability in sbi. (If present it can be checked for, if not, then meh - no worse off than now.)

Edit: blergh didn't notice this above. Sorry for dup, but it's subtly different anyway.

Actually we can even restrospectively add this in for BAI and CSI if we wish to use the tried and trusted method of fake bins for meta-data.

@magicDGS
Copy link
Member

magicDGS commented Jun 6, 2018

An optional field with the checksum (md5 or other algorithm) can be also useful as a temporary solution until an UUID is added to BAM (and other formats)...

Add fields for file length, number of records, start and end offsets, and number of offsets.
@tomwhite
Copy link
Author

tomwhite commented Jun 6, 2018

Updated to incorporate latest suggestions: SAMv1-sbi.pdf

Re UUID: would this be a fixed or variable length string?

@lbergelson
Copy link
Member

A UUID seems worse than a checksum of some sort to me. How do you guarantee that the file isn't edited without updating the UUID?

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 7, 2018

Agreed it's not ideal and we probably shouldn't be starting from here. However computing a checksum has a huge CPU cost while checking headers match does not.

Of course it's possible to edit the data and keep the old UUID, but that's a tool chain bug (or more likely just using old tool chains). I don't really know how you can fix this though. All we can do is best efforts, and if something catches some problems but not all then it's better than having no check at all. (Over time it'll improve.)

Right now it's pure wishful thinking though as we don't have a way of holding this in headers either - I was simply suggesting making room for a (fixed length) checksum and/or UUID at some point in the future without requiring a change of file format.

Edit: actually there can be some other useful checks thinking about it. If we contents of a record but keep the size the same, then the index works anyway. If we extend or shrink the records then the file size is a very simple short check on validity. If we rewrite a BAM header while keeping the size the same (it's possible using eg the ways CRAM manages this, but not done in any BAM implementation I know of) then the md5sum of the header is also a useful check, and quick to compute.

[Rewriting headers in place ideally requires a nul terminated header with multiple nul bytes and more than one gzip block, the second of which is uncompressed to grow or shrink up to a predefined max compressed header size.]

@magicDGS
Copy link
Member

magicDGS commented Jun 7, 2018

In addition to the length of the file as a quick check, what's about a checksum for the header and per-split (stored in the offset list)? This can be useful in two ways:

  • The faster check is to compare the length of the file and the header checksum.
  • Computing the checksum of some splits but not all is more expensive, but it might be more reliable. For example, check the first and last slice is possible as the SBI header contains the offsets.
  • The complete file could be evaluated by computing all the checksums, and this could be also parallelize.
  • It allows to generate the SBI index in parallel and compute the checksum at the same time (e.g., the use case of @tomwhite)
  • While processing in parallel, the checksum can be computed while reading and check at the end for concordance.

For example, the HDF5 file-format has a per-block checksum (only for some blocks of data) using the Jenkins lookup3.hashlittle algorithm.

@jmarshall
Copy link
Member

The previous text had

The index must contain the virtual file offset for the first alignment (in the case of a non-empty BAM), and a virtual file offset for the overall length of the BAM file.

but this text seems to have disappeared. For using the index, it seems important for the first offset in the list to point to the first alignment record and for there to be a sentinel offset at the end of the list pointing to EOF — otherwise the code to find the first and last splits is more complex.

IMHO adding explicit first_offset and end_offset fields is a mistake. first_offset is the same as the first offset list entry, and end_offset is the same as the last offset list entry (or it should be; it's currently badly described).

(In other trivia, please remove \label{sec:code} from the section heading line: it's for making cross-references to the previous subsection.)

@lh3
Copy link
Member

lh3 commented Jun 7, 2018

IMHO adding explicit first_offset and end_offset fields is a mistake. first_offset is the same as the first offset list entry, and end_offset is the same as the last offset list entry (or it should be; it's currently badly described).

As I said above, first_offset can be redundant. We can require that the first offset in the list always corresponds to the first record, which is already somewhat implicit in the text. end_offset points to the end (not the start) of the last record. If you only keep the offset of the last record, you need to jump to the last record and read the record from the disk to find end_offset. I am not sure how useful an end_offset is. I just thought it costs little to find anyway.

@jmarshall
Copy link
Member

What the original text about the sentinel last entry in the list is trying to describe is: the virtual file offset of the byte after the end of the last alignment record. …whether that is EOF or the start of some footers of some kind.

It is the virtual offset of the first non-existent alignment record, the same as the open end of a half-open interval like [5,7).

So it's the same thing as your end_offset, just more usable.

@lh3
Copy link
Member

lh3 commented Jun 7, 2018

end_offset is a special case in that it doesn't point to a record and doesn't follow the granularity rule. Nonetheless, I am ok to merge end_offset into the offset list.

One note about offset in general: if the record starts at the boundary of two adjacent blocks, there are two ways to specify the same virtual offset: (blockEnd,lastByte) vs (blockStart,0). Which to report can be implementation dependent.

@tomwhite
Copy link
Author

tomwhite commented Jun 7, 2018

@jmarshall, @lh3 I've updated the text to explain the final sentinel offset in a way that I hope is clearer. I've also removed first_offset and end_offset from the header, so they only appear in the offset list now. See SAMv1-sbi.pdf.

@lbergelson, @jkbonfield, @magicDGS The minimum level of integrity checking is file length, which I think we should include. Having a checksum per offset may be overkill since it would inflate the size of the index by a factor of three. I would rather add optional metadata fields for MD5 and UUID which can be used by tooling.

@lbergelson
Copy link
Member

+1 to adding optional MD5 and UUID fields

@magicDGS
Copy link
Member

magicDGS commented Jun 8, 2018

+1 to add an optional checksum (MD5 or a different one)

It is true that per-offset checksum is overkill, but maybe only for the first and last split would be a faster/less-expensive way to check integrity instead of whole file checksum.

@tomwhite
Copy link
Author

Added optional MD5 and UUID fields: SAMv1-sbi.pdf

I wonder if n_offsets should be stored as a long - it's not impossible to imagine an index that has an entry for every record (even if it's not read into memory). Similarly, granularity might need to be a long in some (odd) cases, such as an index with two virtual offsets (plus the sentinel) for a data file with 5 billion records. Making both these fields longs would future proof against these kind of things.

@lbergelson
Copy link
Member

It makes sense to make any field that only shows up once a long, unless there's actually a hard limit on it. Someone always finds a way to make a file with too many of it otherwise.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jun 11, 2018

And while we're at it, make quantities unsigned unless there is a valid reason why they have to be negative (eg chromosome -1 being unmapped in BAM).

Edit: I mean that as a general principle, rather than this document which is already unsigned appropriately.

@tomwhite
Copy link
Author

Changed granularity and n_offsets to be unsigned longs. A granularity of 0 now means not specified.

SAMv1-sbi.pdf

@tomwhite
Copy link
Author

Updated the htsjdk implementation to reflect the latest draft: samtools/htsjdk#1138

@tomwhite
Copy link
Author

Discussion seems to have settled down on this. Are there any more changes to the text that are needed?

@yfarjoun
Copy link
Contributor

given that UUID is not a thing (yet?) for SAM, I find it strange that SBI spec would reference it as if it is a known thing. could you modify the wording in the table to say "Placeholder for UUID..." and we can remove the "Placeholder" part when a UUID is included in the SAM spec...

@yfarjoun
Copy link
Contributor

yfarjoun commented Sep 20, 2018

Also, in the second usecase, please note that this is only possible if granularity is specified.

Specify that "Nth record" is zero-based.

@lbergelson
Copy link
Member

Does anyone have further suggestions here or should we return this to @tomwhite for final edits?

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 6, 2020

@lbergelson I think you need to commandeer this.

@lbergelson
Copy link
Member

lbergelson commented Mar 4, 2020

Someone suggested we look at the PBI index as an an already extant alternative to introducing SBI. I've taken a quick look at it and it does indeed provide a per read-offset. So it could be used as a splitting index instead of sbi.

It has a number of required fields which are specific to pac bio bams though. Things like a required ZMW hole number for every read. Obviously this isn't relevant to non-pac bio bams. I imagine you could create a degenerate one which has a 0 for every irrelevant value except for the offset though.

It devotes a lot more bytes to each individual read than the SBI and doesn't offer the granularity option, so pbi indexes will inevitably contain more data than sbi files. A complete pbi with non-placeholder values on a short read file seems like it will be rather large. They're bgzf compressed though, so If you fill in 0's for all the required fields other than the offsets they should be roughly equivalent to a granularity 1 sbi index.

I tested the pbindex tool on a random tiny (76mb) bam I have on my harddrive. It successfully created an index even though it wasn't a pac bio bam. The index it creates is ~3.4 mb. A granularity one sbi on the same file is 5mb, because we didn't bother compressing them... Compressed it's 1.7mb. So... it seems like we should compress sbi files... That was a silly oversight. In practice for short reads we typically use a much higher granularity value, default in gatk is 4096, so it's usually not an issue... The standard sbi we'd produce would only be 1.3k even uncompressed.

These numbers are on a tiny bam because that's what I had on my hd at the moment, so they might scale slightly differently on a much larger one. I suspect it's roughly accurate % wise though.

File Size
bam 76 mb
bai 11 kb
default pbi 3.4 mb
default sbi 1.3 kb
granularity 1 sbi 5.0 mb
gzip granularity 1 sbi 1.7 mb

So it seems like pbi could practically be used instead of SBI. It's substantially more complicated and has a bunch fo vender specific fields that are not relevant for non-pacbio bams. On the other hand, it is already deployed. I'm not sure if there's any non-pacbio specific tooling support for it though. Htsjdk already supports sbi.

I still think there is a value in the dedicated sbi. What do others think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Stalled
Development

Successfully merging this pull request may close these issues.

7 participants