-
Notifications
You must be signed in to change notification settings - Fork 76
Support for constructing and using GZI format files for BGZF compressed FASTA #164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There is a lot here that doesn't work, but mainly I was trying to figure out the format of the GZI file and provide methods to unpack and pack the binary on-disk format. There are also methods for loading the GZI into an object for use by Faidx.
pyfaidx/__init__.py
Outdated
| raise e | ||
|
|
||
|
|
||
| def build_gzi(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this method should work as-is. The idea is to load the BGZF block boundaries into list that we can bisect to find BGZF virtual offsets that correspond to the closest genomic coordinate we're seeking.
pyfaidx/__init__.py
Outdated
| if not eof.empty: | ||
| raise IOError("BGZF EOF marker not found. File %s is not a valid BGZF file." % self.filename) | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The read_gzi and write_gzi methods should be re-written to use the functions from the end of this file (I think).
pyfaidx/__init__.py
Outdated
| chunk = start0 + newlines_before + newlines_inside + seq_len | ||
| chunk_seq = self.file.read(chunk).decode() | ||
| seq = chunk_seq[start0 + newlines_before:] | ||
| bstart = i.offset + newlines_before + start0 # uncompressed offset for the start of requested string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this section was really working, and should be tested. This is where most of the work needs to happen to close out this feature.
| def build_gzi(self): | ||
| """ Build the htslib .gzi index format """ | ||
| from Bio import bgzf | ||
| with open(self.filename, 'rb') as bgzf_file: | ||
| for i, values in enumerate(bgzf.BgzfBlocks(bgzf_file)): | ||
| self.gzi_index[i] = BGZFblock(*values) | ||
|
|
||
| def write_gzi(self): | ||
| """ Write the on disk format for the htslib .gzi index | ||
| https://github.com/samtools/htslib/issues/473""" | ||
| with open(self.gzi_indexname, 'wb') as bzi_file: | ||
| bzi_file.write(struct.pack('<Q', len(self.gzi_index))) | ||
| for block in self.gzi_index.values(): | ||
| bzi_file.write(block.as_bytes()) | ||
|
|
||
| def read_gzi(self): | ||
| """ Read the on disk format for the htslib .gzi index | ||
| https://github.com/samtools/htslib/issues/473""" | ||
| from ctypes import c_uint64, sizeof | ||
| with open(self.gzi_indexname, 'rb') as bzi_file: | ||
| number_of_blocks = struct.unpack('<Q', bzi_file.read(sizeof(c_uint64)))[0] | ||
| for i in range(number_of_blocks): | ||
| cstart, ustart = struct.unpack('<QQ', bzi_file.read(sizeof(c_uint64) * 2)) | ||
| if cstart == '' or ustart == '': | ||
| raise IndexError("Unexpected end of .gzi file. ") | ||
| else: | ||
| self.gzi_index[i] = BGZFblock(cstart, None, ustart, None) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a duplicate code block?
|
Hello all - wanted to ask about the progress of this pull request? Any way we can help with testing / contributing? Thanks! |
|
@mdshw5: same as @ThomVett I'm interested in the progress of this pull request as a solution to whatshap/whatshap#151 which is included in HKU-BAL/Clair3#163 which in turn is the most popular variant calling library for long-read sequencing data. |
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
|
@ThomVett and @dennishendriksen Thank you for your patience. I've finally finished work on this PR and it looks like the performance and compatibility issues are all worked out. See #153 for the efficiency improvements in fetching sequence from the end of a long chromosome in a BGZF compressed FASTA. I do still want to have some time to consider edge cases and investigate .fai and .gzi compatibility with samtools. Since this version of |
This is a work-in-progress implementation for #126 and much doesn't work properly.