
feat: Add GenBank file format support for biological sequence data #7951

Open
behroozazarkhalili wants to merge 3 commits into huggingface:main from behroozazarkhalili:feat/genbank-support

Conversation

@behroozazarkhalili

Summary

Add native support for loading GenBank (.gb, .gbk, .genbank) files, a standard format for biological sequence data with annotations maintained by NCBI.

Changes

  • Add a genbank packaged module with a pure-Python state-machine parser
  • Register GenBank extensions in _PACKAGED_DATASETS_MODULES and _EXTENSION_TO_MODULE
  • Add comprehensive test suite (28 tests)

Features

  • Metadata parsing: LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, ORGANISM, taxonomy
  • Feature parsing: Structured JSON output with location parsing (complement, join)
  • Sequence parsing: ORIGIN section with automatic length calculation
  • Compression support: gzip, bz2, xz via magic bytes detection
  • Memory efficiency: Dual-threshold batching (batch_size + max_batch_bytes)
  • Large sequences: Uses large_string Arrow type for sequences/features
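
The magic-bytes detection mentioned in the compression bullet can be sketched as follows. This is an illustrative helper, not the PR's actual code: the function name `open_maybe_compressed` is hypothetical, though the byte prefixes are the standard signatures for gzip, bz2, and xz.

```python
import bz2
import gzip
import lzma

# Standard magic-byte signatures for the three supported formats
# (illustrative helper, not the PR's implementation).
_MAGIC_BYTES = {
    b"\x1f\x8b": gzip.open,        # gzip
    b"BZh": bz2.open,              # bz2
    b"\xfd7zXZ\x00": lzma.open,    # xz
}


def open_maybe_compressed(path):
    """Open a GenBank file, transparently decompressing gzip/bz2/xz."""
    with open(path, "rb") as f:
        header = f.read(6)
    for magic, opener in _MAGIC_BYTES.items():
        if header.startswith(magic):
            return opener(path, "rt")
    return open(path, "rt")
```

Sniffing the header rather than trusting the file extension means a `.gb` file that happens to be gzipped still loads correctly.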

Usage

from datasets import load_dataset

# Load GenBank files
ds = load_dataset("genbank", data_files="sequences.gb")

# With options
ds = load_dataset("genbank", data_files="*.gbk", 
                  columns=["sequence", "organism", "features"],
                  parse_features=True)

Test plan

  • All 28 unit tests pass
  • Tests cover: basic loading, multi-record, compression, feature parsing, column filtering, batching, schema types

- Add GenBank packaged module with state machine parser
- Support .gb, .gbk, .genbank file extensions
- Parse LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, ORGANISM metadata
- Parse FEATURES section with structured JSON output
- Parse ORIGIN sequence data with automatic compression detection (gzip, bz2, xz)
- Implement dual-threshold batching (batch_size + max_batch_bytes)
- Use large_string Arrow type for sequences to handle very long data
- Add comprehensive test suite with 28 tests
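
The dual-threshold batching listed above (flush on either a row-count or a byte-size limit) could look roughly like this sketch. Only the `batch_size` / `max_batch_bytes` thresholds come from the description; the generator name and the crude byte-size estimate are assumptions.

```python
def iter_batches(records, batch_size=1000, max_batch_bytes=100 << 20):
    """Yield lists of records, flushing whenever EITHER threshold is hit
    (sketch of the dual-threshold idea; byte accounting is approximate)."""
    batch, nbytes = [], 0
    for rec in records:
        batch.append(rec)
        # Rough per-record size estimate; a real loader would track
        # encoded sizes more precisely.
        nbytes += sum(len(str(v)) for v in rec.values())
        if len(batch) >= batch_size or nbytes >= max_batch_bytes:
            yield batch
            batch, nbytes = [], 0
    if batch:
        yield batch
```

The byte threshold matters for GenBank specifically because a single record can hold a multi-megabase sequence, so a fixed row count alone can produce arbitrarily large Arrow batches.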

return location

def _parse_genbank(self, fp):
Member

hi, was this function written by you / an AI / someone else ? if it comes from somewhere else you should mention it

it's too long for me to review for now, please consider using an external / mature library to parse such data instead if it helps simplify the code

Author

Hi @lhoestq, thanks for the review!

I wrote this parser myself. The reason I chose a custom pure-Python state machine parser over an external library is to stay consistent with the zero-external-dependency pattern used by the other packaged modules in this project.

The only mature library for GenBank parsing is Biopython (Bio.GenBank / Bio.SeqIO), but it's a very heavy dependency (~150 MB installed) and would be disproportionate for what we need here. The same reasoning was applied for the FASTA and FASTQ loaders in this project — they also use custom lightweight parsers (based on Heng Li's readfq.py) rather than pulling in Biopython.

The GenBank flat file format is well-documented by NCBI (spec), so a focused parser that only extracts the fields we need is more practical than adding a large dependency.

That said, I'm happy to refactor the _parse_genbank method to make it shorter and easier to review — for example by breaking it into smaller helper methods per section (LOCUS, FEATURES, ORIGIN, etc.). Would that help?

# Conflicts:
#	src/datasets/packaged_modules/__init__.py
