Support for constructing and using GZI format files for BGZF compressed FASTA #164

mdshw5 · 2020-06-25T20:07:19Z

This is a work-in-progress implementation for #126, and much of it doesn't work properly.

There is a lot here that doesn't work, but mainly I was trying to figure out the format of the GZI file and provide methods to unpack and pack the binary on-disk format. There are also methods for loading the GZI into an object for use by Faidx.

mdshw5 · 2020-06-25T20:15:18Z

pyfaidx/__init__.py

                raise e
-
+
+    def build_gzi(self):


I think this method should work as-is. The idea is to load the BGZF block boundaries into a list that we can bisect to find the BGZF virtual offset of the block closest to the genomic coordinate we're seeking.
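A minimal sketch of that bisect lookup, not the code in this PR. The `Block` record, its field names, and the example offsets are assumptions made for illustration. The virtual offset packing (compressed block start shifted left 16 bits, OR-ed with the offset inside the uncompressed block) follows the BGZF convention.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical record: compressed and uncompressed start of one BGZF block.
Block = namedtuple("Block", ["cstart", "ustart"])


def find_block(blocks, uoffset):
    """Return the last block whose uncompressed start is <= uoffset.

    `blocks` must be sorted by `ustart`.
    """
    ustarts = [b.ustart for b in blocks]
    i = bisect_right(ustarts, uoffset) - 1
    if i < 0:
        raise ValueError("offset precedes the first block")
    return blocks[i]


def virtual_offset(block, uoffset):
    """Pack a BGZF virtual offset: (compressed start << 16) | offset within block."""
    within = uoffset - block.ustart
    if not 0 <= within < 1 << 16:
        raise ValueError("offset does not fall inside this block")
    return (block.cstart << 16) | within


# Made-up block boundaries, for illustration only.
blocks = [Block(0, 0), Block(18000, 65280), Block(35500, 130560)]
block = find_block(blocks, 70000)
voffset = virtual_offset(block, 70000)
```

Building `ustarts` once and reusing it would avoid rebuilding the list on every lookup; it is rebuilt here only to keep the sketch short.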

mdshw5 · 2020-06-25T20:16:45Z

pyfaidx/__init__.py

+	        if not eof.empty:
+	            raise IOError("BGZF EOF marker not found. File %s is not a valid BGZF file." % self.filename)
+
+


The read_gzi and write_gzi methods should be re-written to use the functions from the end of this file (I think).

mdshw5 · 2020-06-25T20:19:04Z

pyfaidx/__init__.py

-                chunk = start0 + newlines_before + newlines_inside + seq_len
-                chunk_seq = self.file.read(chunk).decode()
-                seq = chunk_seq[start0 + newlines_before:]
+            bstart = i.offset + newlines_before + start0  # uncompressed offset for the start of requested string


I'm not sure this section was ever really working; it should be tested. This is where most of the work needs to happen to close out this feature.
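For context, a sketch of how the uncompressed byte offset of a base can be derived from .fai fields, assuming the usual faidx layout (sequence start offset, bases per line, bytes per line including the newline). The function and parameter names are hypothetical and do not match the variables in this diff.

```python
def uncompressed_offset(seq_offset, line_bases, line_width, start0):
    """Byte offset, in the uncompressed FASTA, of 0-based coordinate `start0`.

    seq_offset: byte offset of the first base of the sequence
    line_bases: number of bases on each full line
    line_width: number of bytes on each full line, newline included
    """
    # Every complete line before the target adds its line terminator bytes.
    newlines_before = (start0 // line_bases) * (line_width - line_bases)
    return seq_offset + start0 + newlines_before
```

The result is an offset into the uncompressed stream, so it still has to be mapped to a BGZF block before seeking in the compressed file.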

Maarten-vd-Sande · 2022-02-12T09:51:15Z

pyfaidx/__init__.py

+    def build_gzi(self):
+        """ Build the htslib .gzi index format """
+        from Bio import bgzf
+        with open(self.filename, 'rb') as bgzf_file:
+            for i, values in enumerate(bgzf.BgzfBlocks(bgzf_file)):
+                self.gzi_index[i] = BGZFblock(*values)
+
+    def write_gzi(self):
+        """ Write the on disk format for the htslib .gzi index
+        https://github.com/samtools/htslib/issues/473"""
+        with open(self.gzi_indexname, 'wb') as bzi_file:
+            bzi_file.write(struct.pack('<Q', len(self.gzi_index)))
+            for block in self.gzi_index.values():
+                bzi_file.write(block.as_bytes())
+
+    def read_gzi(self):
+        """ Read the on disk format for the htslib .gzi index
+        https://github.com/samtools/htslib/issues/473"""
+        from ctypes import c_uint64, sizeof
+        with open(self.gzi_indexname, 'rb') as bzi_file:
+            number_of_blocks = struct.unpack('<Q', bzi_file.read(sizeof(c_uint64)))[0]
+            for i in range(number_of_blocks):
+                cstart, ustart = struct.unpack('<QQ', bzi_file.read(sizeof(c_uint64) * 2))
+                if cstart == '' or ustart == '':
+                    raise IndexError("Unexpected end of .gzi file. ")
+                else:
+                    self.gzi_index[i] = BGZFblock(cstart, None, ustart, None)
+


Is this a duplicate code block?
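For reference, a self-contained sketch of a .gzi round trip, assuming the layout described in the bgzip documentation: a little-endian `uint64` entry count followed by that many (compressed offset, uncompressed offset) `uint64` pairs. The function names are illustrative and are not the methods in this diff. One point worth noting: on a short read `struct.unpack` raises `struct.error` rather than returning an empty string, so truncation is detected here by checking the length of the bytes read.

```python
import io
import struct


def write_gzi(fh, entries):
    """Write (compressed_offset, uncompressed_offset) pairs to a binary file object."""
    fh.write(struct.pack("<Q", len(entries)))
    for cstart, ustart in entries:
        fh.write(struct.pack("<QQ", cstart, ustart))


def read_gzi(fh):
    """Read all (compressed_offset, uncompressed_offset) pairs from a binary file object."""
    header = fh.read(8)
    if len(header) != 8:
        raise IndexError("Unexpected end of .gzi file.")
    (count,) = struct.unpack("<Q", header)
    entries = []
    for _ in range(count):
        pair = fh.read(16)
        if len(pair) != 16:
            raise IndexError("Unexpected end of .gzi file.")
        entries.append(struct.unpack("<QQ", pair))
    return entries


# Round trip through an in-memory buffer with made-up offsets.
buf = io.BytesIO()
write_gzi(buf, [(18000, 65280), (35500, 130560)])
buf.seek(0)
roundtrip = read_gzi(buf)
```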

ThomVett · 2023-02-08T08:26:52Z

Hello all - wanted to ask about the progress of this pull request? Any way we can help with testing / contributing? Thanks!

dennishendriksen · 2023-05-22T08:54:42Z

@mdshw5: like @ThomVett, I'm interested in the progress of this pull request as a solution to whatshap/whatshap#151, which is included in HKU-BAL/Clair3#163. Clair3, in turn, is the most popular variant calling library for long-read sequencing data.

codecov · 2025-08-19T19:18:27Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

mdshw5 · 2025-08-20T00:52:03Z

@ThomVett and @dennishendriksen Thank you for your patience. I've finally finished work on this PR and it looks like the performance and compatibility issues are all worked out. See #153 for the efficiency improvements in fetching sequence from the end of a long chromosome in a BGZF compressed FASTA. I do still want to have some time to consider edge cases and investigate .fai and .gzi compatibility with samtools. Since this version of pyfaidx will now create, read and write the .gzi block index files I'll also need to consider how to handle migrations from previous versions of pyfaidx which only produced .fai files with byte offsets into the compressed files. I'll likely choose to trigger a re-index of a BGZF FASTA file if only a .fai index is found and no .gzi index exists, but am open to suggestions about how to handle this.
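One way the proposed migration check could look, as a sketch only. The function name, the injectable `exists` parameter, and the assumption that the index files sit next to the FASTA as `<fasta>.fai` and `<fasta>.gzi` are all illustrative and not taken from this PR.

```python
import os


def needs_reindex(fasta_path, exists=os.path.exists):
    """Return True when a BGZF FASTA should be re-indexed.

    A .fai without a matching .gzi may hold byte offsets into the compressed
    file (as written by older pyfaidx versions), so it is treated as stale.
    A missing .fai also requires indexing.
    """
    fai = fasta_path + ".fai"  # assumed naming convention
    gzi = fasta_path + ".gzi"  # assumed naming convention
    return not (exists(fai) and exists(gzi))
```

Passing `exists` in makes the decision testable without touching the filesystem. An alternative to silently re-indexing would be to warn first, since rebuilding the index of a large genome can take a while.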

mdshw5 added 4 commits October 11, 2017 11:52

Initial changes to support #126

db7f140

Initial changes to support #126

b5d375a

Merge local changes for #126

e526ce8

Resolve merge conflicts

f878775

mdshw5 added the enhancement label Jun 25, 2020

mdshw5 mentioned this pull request Jun 25, 2020

Create or load htslib .fai and .gzi index files when using BGZF files #126

Closed

mdshw5 mentioned this pull request Aug 6, 2020

Implement fsspec in place of open #168

Closed

mdshw5 mentioned this pull request Feb 11, 2022

BGZip slow performance near end of chromosomes #153

Closed

mdshw5 mentioned this pull request Feb 19, 2022

Version in APT doesn't have correct dependenceis #187

Closed

Merge remote-tracking branch 'origin/master' into samtools_bgzf_compatibility

2dc58d6

mdshw5 self-assigned this Aug 19, 2025

mdshw5 added 3 commits August 19, 2025 17:41

Merge upstream changes

dbcaa94

Fix BgzfBlock.empty logic

3d3e2c5

Working implementation of more efficient bgzip FASTA indexed retrieval

0c9d0c4

mdshw5 added 3 commits August 20, 2025 17:19

Remove redundant funcitons and clean up

646036f

Ensure on-disk .gzi format matches htslib output

608f8ad

Make sure to re-create .fai if .gzi is not present. Avoid incompatibility with pyfaidx <0.9

532419d

mdshw5 added 2 commits August 21, 2025 01:00

Remove unused gzi packing functions.

dfa5df4

Expand test suite based on coverage gaps

ad3d878

mdshw5 merged commit de5cdf8 into master Aug 21, 2025
9 checks passed

ggydush mentioned this pull request Aug 21, 2025

feat: Add bgzip index support for fsspec objects #232

Merged

mdshw5 deleted the samtools_bgzf_compatibility branch August 22, 2025 00:53
