Skip to content

Conversation

@mdshw5
Copy link
Owner

@mdshw5 mdshw5 commented Jun 25, 2020

This is a work-in-progress implementation for #126 and much doesn't work properly.

There is a lot here that doesn't work, but mainly I was trying
to figure out the format of the GZI file and provide methods to
unpack and pack the binary on-disk format. There are also methods
for loading the GZI into an object for use by Faidx.
raise e


def build_gzi(self):
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should work as-is. The idea is to load the BGZF block boundaries into list that we can bisect to find BGZF virtual offsets that correspond to the closest genomic coordinate we're seeking.

if not eof.empty:
raise IOError("BGZF EOF marker not found. File %s is not a valid BGZF file." % self.filename)


Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The read_gzi and write_gzi methods should be re-written to use the functions from the end of this file (I think).

chunk = start0 + newlines_before + newlines_inside + seq_len
chunk_seq = self.file.read(chunk).decode()
seq = chunk_seq[start0 + newlines_before:]
bstart = i.offset + newlines_before + start0 # uncompressed offset for the start of requested string
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this section was really working, and should be tested. This is where most of the work needs to happen to close out this feature.

Comment on lines 671 to 698
def build_gzi(self):
""" Build the htslib .gzi index format """
from Bio import bgzf
with open(self.filename, 'rb') as bgzf_file:
for i, values in enumerate(bgzf.BgzfBlocks(bgzf_file)):
self.gzi_index[i] = BGZFblock(*values)

def write_gzi(self):
""" Write the on disk format for the htslib .gzi index
https://github.com/samtools/htslib/issues/473"""
with open(self.gzi_indexname, 'wb') as bzi_file:
bzi_file.write(struct.pack('<Q', len(self.gzi_index)))
for block in self.gzi_index.values():
bzi_file.write(block.as_bytes())

def read_gzi(self):
""" Read the on disk format for the htslib .gzi index
https://github.com/samtools/htslib/issues/473"""
from ctypes import c_uint64, sizeof
with open(self.gzi_indexname, 'rb') as bzi_file:
number_of_blocks = struct.unpack('<Q', bzi_file.read(sizeof(c_uint64)))[0]
for i in range(number_of_blocks):
cstart, ustart = struct.unpack('<QQ', bzi_file.read(sizeof(c_uint64) * 2))
if cstart == '' or ustart == '':
raise IndexError("Unexpected end of .gzi file. ")
else:
self.gzi_index[i] = BGZFblock(cstart, None, ustart, None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a duplicate code block?

@ThomVett
Copy link

ThomVett commented Feb 8, 2023

Hello all - wanted to ask about the progress of this pull request? Any way we can help with testing / contributing? Thanks!

@dennishendriksen
Copy link

@mdshw5: same as @ThomVett I'm interested in the progress of this pull request as a solution to whatshap/whatshap#151 which is included in HKU-BAL/Clair3#163 which in turn is the most popular variant calling library for long-read sequencing data.

@mdshw5 mdshw5 self-assigned this Aug 19, 2025
@codecov
Copy link

codecov bot commented Aug 19, 2025

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@mdshw5
Copy link
Owner Author

mdshw5 commented Aug 20, 2025

@ThomVett and @dennishendriksen Thank you for your patience. I've finally finished work on this PR and it looks like the performance and compatibility issues are all worked out. See #153 for the efficiency improvements in fetching sequence from the end of a long chromosome in a BGZF compressed FASTA. I do still want to have some time to consider edge cases and investigate .fai and .gzi compatibility with samtools. Since this version of pyfaidx will now create, read and write the .gzi block index files I'll also need to consider how to handle migrations from previous versions of pyfaidx which only produced .fai files with byte offsets into the compressed files. I'll likely choose to trigger a re-index of a BGZF FASTA file if only a .fai index is found and no .gzi index exists, but am open to suggestions about how to handle this.

@mdshw5 mdshw5 merged commit de5cdf8 into master Aug 21, 2025
9 checks passed
@mdshw5 mdshw5 deleted the samtools_bgzf_compatibility branch August 22, 2025 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants