
feat: Add GenBank file format support for biological sequence data #7951

Open
behroozazarkhalili wants to merge 3 commits into huggingface:main from behroozazarkhalili:feat/genbank-support

Conversation

@behroozazarkhalili

Summary

Add native support for loading GenBank (.gb, .gbk, .genbank) files, a standard format for biological sequence data with annotations maintained by NCBI.

Changes

  • Add a genbank packaged module with a pure-Python state-machine parser
  • Register GenBank extensions in _PACKAGED_DATASETS_MODULES and _EXTENSION_TO_MODULE
  • Add comprehensive test suite (28 tests)

Features

  • Metadata parsing: LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, ORGANISM, taxonomy
  • Feature parsing: Structured JSON output with location parsing (complement, join)
  • Sequence parsing: ORIGIN section with automatic length calculation
  • Compression support: gzip, bz2, xz via magic bytes detection
  • Memory efficiency: Dual-threshold batching (batch_size + max_batch_bytes)
  • Large sequences: Uses large_string Arrow type for sequences/features
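
The magic-bytes detection mentioned in the compression bullet can be sketched as follows. This is an illustrative helper, not the PR's actual code: the function name `open_maybe_compressed` is hypothetical, though the byte prefixes are the standard signatures for gzip, bz2, and xz.

```python
import bz2
import gzip
import lzma

# Standard magic-byte signatures for the three supported formats
# (illustrative helper, not the PR's implementation).
_MAGIC_BYTES = {
    b"\x1f\x8b": gzip.open,        # gzip
    b"BZh": bz2.open,              # bz2
    b"\xfd7zXZ\x00": lzma.open,    # xz
}


def open_maybe_compressed(path):
    """Open a GenBank file, transparently decompressing gzip/bz2/xz."""
    with open(path, "rb") as f:
        header = f.read(6)
    for magic, opener in _MAGIC_BYTES.items():
        if header.startswith(magic):
            return opener(path, "rt")
    return open(path, "rt")
```

Sniffing the header rather than trusting the file extension means a `.gb` file that happens to be gzipped still loads correctly.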

Usage

from datasets import load_dataset

# Load GenBank files
ds = load_dataset("genbank", data_files="sequences.gb")

# With options
ds = load_dataset("genbank", data_files="*.gbk", 
                  columns=["sequence", "organism", "features"],
                  parse_features=True)

Test plan

  • All 28 unit tests pass
  • Tests cover: basic loading, multi-record, compression, feature parsing, column filtering, batching, schema types

- Add GenBank packaged module with state machine parser
- Support .gb, .gbk, .genbank file extensions
- Parse LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, ORGANISM metadata
- Parse FEATURES section with structured JSON output
- Parse ORIGIN sequence data with automatic compression detection (gzip, bz2, xz)
- Implement dual-threshold batching (batch_size + max_batch_bytes)
- Use large_string Arrow type for sequences to handle very long data
- Add comprehensive test suite with 28 tests
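
The dual-threshold batching listed above (flush on either a row-count or a byte-size limit) could look roughly like this sketch. Only the `batch_size` / `max_batch_bytes` thresholds come from the description; the generator name and the crude byte-size estimate are assumptions.

```python
def iter_batches(records, batch_size=1000, max_batch_bytes=100 << 20):
    """Yield lists of records, flushing whenever EITHER threshold is hit
    (sketch of the dual-threshold idea; byte accounting is approximate)."""
    batch, nbytes = [], 0
    for rec in records:
        batch.append(rec)
        # Rough per-record size estimate; a real loader would track
        # encoded sizes more precisely.
        nbytes += sum(len(str(v)) for v in rec.values())
        if len(batch) >= batch_size or nbytes >= max_batch_bytes:
            yield batch
            batch, nbytes = [], 0
    if batch:
        yield batch
```

The byte threshold matters for GenBank specifically because a single record can hold a multi-megabase sequence, so a fixed row count alone can produce arbitrarily large Arrow batches.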

return location

def _parse_genbank(self, fp):
Member

hi, was this function written by you / an AI / someone else ? if it comes from somewhere else you should mention it

it's too long for me to review for now, please consider using an external / mature library to parse such data instead if it helps simplify the code

Author

Hi @lhoestq, thanks for the review!

I wrote this parser myself. The reason I chose a custom pure-Python state machine parser over an external library is to stay consistent with the zero-external-dependency pattern used by the other packaged modules in this project.

The only mature library for GenBank parsing is Biopython (Bio.GenBank / Bio.SeqIO), but it's a very heavy dependency (~150 MB installed) and would be disproportionate for what we need here. The same reasoning was applied for the FASTA and FASTQ loaders in this project — they also use custom lightweight parsers (based on Heng Li's readfq.py) rather than pulling in Biopython.

The GenBank flat file format is well-documented by NCBI (spec), so a focused parser that only extracts the fields we need is more practical than adding a large dependency.

That said, I'm happy to refactor the _parse_genbank method to make it shorter and easier to review — for example by breaking it into smaller helper methods per section (LOCUS, FEATURES, ORIGIN, etc.). Would that help?

# Conflicts:
#	src/datasets/packaged_modules/__init__.py
