feat: Add GenBank file format support for biological sequence data #7951
behroozazarkhalili wants to merge 3 commits into huggingface:main
- Add GenBank packaged module with state machine parser
- Support .gb, .gbk, .genbank file extensions
- Parse LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, ORGANISM metadata
- Parse FEATURES section with structured JSON output
- Parse ORIGIN sequence data with automatic compression detection (gzip, bz2, xz)
- Implement dual-threshold batching (batch_size + max_batch_bytes)
- Use large_string Arrow type for sequences to handle very long data
- Add comprehensive test suite with 28 tests
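The dual-threshold batching mentioned in the commit message can be illustrated with a minimal sketch. This is not the code from the PR: the function name iter_batches, the default values, and the way record size is estimated (UTF-8 length of the string values) are assumptions made for illustration. The idea is that a batch is flushed as soon as either the record count or the accumulated byte size reaches its limit, so a handful of very long sequences cannot blow up memory.

```python
from typing import Dict, Iterable, Iterator, List


def iter_batches(
    records: Iterable[Dict[str, str]],
    batch_size: int = 1000,                    # hypothetical default
    max_batch_bytes: int = 10 * 1024 * 1024,   # hypothetical default (10 MiB)
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of records, flushing when either threshold is reached."""
    batch: List[Dict[str, str]] = []
    batch_bytes = 0
    for record in records:
        batch.append(record)
        # Approximate the payload size by the UTF-8 length of the string values.
        batch_bytes += sum(len(value.encode("utf-8")) for value in record.values())
        if len(batch) >= batch_size or batch_bytes >= max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        # Flush the final, partially filled batch.
        yield batch
```

With this shape a single record larger than max_batch_bytes still ends up in a batch of its own rather than being rejected, which is one reasonable policy for genome-sized sequences.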
        return location

    def _parse_genbank(self, fp):
Hi, was this function written by you, an AI, or someone else? If it comes from somewhere else, you should mention it.
It's too long for me to review for now. Please consider using an external, mature library to parse such data instead, if it helps simplify the code.
Hi @lhoestq, thanks for the review!
I wrote this parser myself. The reason I chose a custom pure-Python state machine parser over an external library is to stay consistent with the zero-external-dependency pattern used by the other packaged modules in this project.
The only mature library for GenBank parsing is Biopython (Bio.GenBank / Bio.SeqIO), but it's a very heavy dependency (~150 MB installed) and would be disproportionate for what we need here. The same reasoning was applied for the FASTA and FASTQ loaders in this project — they also use custom lightweight parsers (based on Heng Li's readfq.py) rather than pulling in Biopython.
The GenBank flat file format is well-documented by NCBI (spec), so a focused parser that only extracts the fields we need is more practical than adding a large dependency.
That said, I'm happy to refactor the _parse_genbank method to make it shorter and easier to review — for example by breaking it into smaller helper methods per section (LOCUS, FEATURES, ORIGIN, etc.). Would that help?
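To make the proposed refactor concrete, here is a rough sketch of what a state machine with one small helper per section could look like. It is not the parser from this PR: the class name GenBankSketchParser, the helper names, and the output fields are hypothetical, and it handles only LOCUS, DEFINITION, ACCESSION, and ORIGIN (other sections such as FEATURES are skipped). It assumes the standard flat file layout in which keywords start in column 1, values start in column 13, and "//" ends a record.

```python
import io
from typing import Any, Callable, Dict, Iterator, Optional, TextIO


class GenBankSketchParser:
    """Simplified, hypothetical parser: one small handler per section."""

    def __init__(self) -> None:
        # Map each top-level keyword to the helper that consumes its lines.
        self._handlers: Dict[str, Callable[[Dict[str, Any], str], None]] = {
            "LOCUS": self._handle_locus,
            "DEFINITION": self._handle_definition,
            "ACCESSION": self._handle_accession,
            "ORIGIN": self._handle_origin,
        }

    def parse(self, fp: TextIO) -> Iterator[Dict[str, Any]]:
        record = self._new_record()
        state: Optional[str] = None  # keyword of the section being read
        for raw_line in fp:
            line = raw_line.rstrip("\r\n")
            if line.startswith("//"):
                # "//" terminates a record: emit it and reset the machine.
                yield self._finish(record)
                record, state = self._new_record(), None
                continue
            if line and not line[0].isspace():
                # A keyword in column 1 switches the state.
                state = line.split(None, 1)[0]
            handler = self._handlers.get(state or "")
            if handler is not None:
                handler(record, line)

    @staticmethod
    def _new_record() -> Dict[str, Any]:
        return {"locus": "", "definition": "", "accession": "", "_parts": []}

    @staticmethod
    def _finish(record: Dict[str, Any]) -> Dict[str, Any]:
        # Join the sequence chunks collected from the ORIGIN section.
        record["sequence"] = "".join(record.pop("_parts"))
        return record

    @staticmethod
    def _handle_locus(record: Dict[str, Any], line: str) -> None:
        parts = line.split()
        if len(parts) > 1 and parts[0] == "LOCUS":
            record["locus"] = parts[1]

    @staticmethod
    def _handle_definition(record: Dict[str, Any], line: str) -> None:
        # Continuation lines are indented; append them to the definition.
        text = line[12:].strip()
        record["definition"] = (record["definition"] + " " + text).strip()

    @staticmethod
    def _handle_accession(record: Dict[str, Any], line: str) -> None:
        tokens = line[12:].split()
        if tokens and not record["accession"]:
            record["accession"] = tokens[0]  # keep the primary accession only

    @staticmethod
    def _handle_origin(record: Dict[str, Any], line: str) -> None:
        if line.startswith("ORIGIN"):
            return  # the header line carries no sequence data
        # Sequence lines look like "        1 acgtacgtac gt": drop the offset.
        record["_parts"].extend(line.split()[1:])


SAMPLE = (
    "LOCUS       TEST0001                  12 bp    DNA     linear   SYN 01-JAN-2000\n"
    "DEFINITION  Example record spanning\n"
    "            two lines.\n"
    "ACCESSION   TEST0001\n"
    "FEATURES             Location/Qualifiers\n"
    "     source          1..12\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
)
records = list(GenBankSketchParser().parse(io.StringIO(SAMPLE)))
```

The dispatch table keeps the main loop short, and each helper can be reviewed and tested on its own. A real implementation would also need handlers for VERSION, KEYWORDS, ORGANISM, and FEATURES, plus a policy for files that lack a final "//".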
# Conflicts:
#	src/datasets/packaged_modules/__init__.py
Summary
Add native support for loading GenBank (.gb, .gbk, .genbank) files. GenBank is a standard format, maintained by NCBI, for annotated biological sequence data.
Changes
- genbank packaged module with pure Python state machine parser
- _PACKAGED_DATASETS_MODULES and _EXTENSION_TO_MODULE
Features
- large_string Arrow type for sequences/features
Usage
Test plan