@behroozazarkhalili

Summary

Add native support for loading GenBank files (.gb, .gbk, .genbank). GenBank is a standard, NCBI-maintained format for annotated biological sequence data.

Changes

  • Add genbank packaged module with pure Python state machine parser
  • Register GenBank extensions in _PACKAGED_DATASETS_MODULES and _EXTENSION_TO_MODULE
  • Add comprehensive test suite (28 tests)
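
For readers unfamiliar with the approach, a state machine parser for GenBank can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the `State` enum, the `parse_genbank` function, and the output field names are hypothetical, and it handles only LOCUS, DEFINITION, ORIGIN, and the `//` record terminator.

```python
from enum import Enum, auto


class State(Enum):
    HEADER = auto()    # LOCUS, DEFINITION, and other metadata lines
    FEATURES = auto()  # feature table (skipped in this sketch)
    ORIGIN = auto()    # sequence lines


def _new_record():
    return {"locus": None, "definition": None, "chunks": []}


def parse_genbank(lines):
    """Yield one dict per record from an iterable of GenBank text lines."""
    state = State.HEADER
    record = _new_record()
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # Record terminator: emit the record and reset the machine.
            sequence = "".join(record["chunks"])
            yield {
                "locus": record["locus"],
                "definition": record["definition"],
                "sequence": sequence,
                "length": len(sequence),
            }
            state = State.HEADER
            record = _new_record()
        elif line.startswith("FEATURES"):
            state = State.FEATURES
        elif line.startswith("ORIGIN"):
            state = State.ORIGIN
        elif state is State.HEADER:
            if line.startswith("LOCUS"):
                parts = line.split()
                if len(parts) > 1:
                    record["locus"] = parts[1]
            elif line.startswith("DEFINITION"):
                record["definition"] = line[len("DEFINITION"):].strip()
        elif state is State.ORIGIN:
            # Lines look like "        1 gatcctccat ..."; keep letters only.
            record["chunks"].append("".join(c for c in line if c.isalpha()))
```

Because the parser is a generator over lines, it never needs to hold more than one record in memory, which is what makes the batching described under Features possible.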

Features

  • Metadata parsing: LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, ORGANISM, taxonomy
  • Feature parsing: Structured JSON output with location parsing (complement, join)
  • Sequence parsing: ORIGIN section with automatic length calculation
  • Compression support: gzip, bz2, and xz, detected from the file's magic bytes rather than its extension
  • Memory efficiency: Dual-threshold batching (batch_size + max_batch_bytes)
  • Large sequences: Uses large_string Arrow type for sequences/features
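
As an illustration of the compression bullet above, magic-byte detection could look something like the sketch below. The helper names `detect_compression` and `open_text` are hypothetical and are not taken from this PR; the magic numbers themselves are the standard gzip, bzip2, and xz file signatures.

```python
import bz2
import gzip
import lzma

# Standard file signatures (magic bytes) for each supported format.
_MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
)

_OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open}


def detect_compression(head):
    """Return 'gzip', 'bz2', 'xz', or None given the first bytes of a file."""
    for magic, name in _MAGIC:
        if head.startswith(magic):
            return name
    return None


def open_text(path):
    """Open a possibly compressed file as text, ignoring its extension."""
    with open(path, "rb") as f:
        kind = detect_compression(f.read(6))
    opener = _OPENERS.get(kind, open)
    return opener(path, "rt", encoding="utf-8")
```

Sniffing the content instead of trusting the extension means a file named `sequences.gb` that is actually gzip-compressed would still load.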

Usage

from datasets import load_dataset

# Load GenBank files
ds = load_dataset("genbank", data_files="sequences.gb")

# With options
ds = load_dataset("genbank", data_files="*.gbk", 
                  columns=["sequence", "organism", "features"],
                  parse_features=True)

Test plan

  • All 28 unit tests pass
  • Tests cover: basic loading, multi-record, compression, feature parsing, column filtering, batching, schema types

- Add GenBank packaged module with state machine parser
- Support .gb, .gbk, .genbank file extensions
- Parse LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, ORGANISM metadata
- Parse FEATURES section with structured JSON output
- Parse ORIGIN sequence data; detect compression automatically (gzip, bz2, xz)
- Implement dual-threshold batching (batch_size + max_batch_bytes)
- Use large_string Arrow type for sequences to handle very long data
- Add comprehensive test suite with 28 tests
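
The dual-threshold batching mentioned above can be pictured with a sketch like the following, which flushes a batch when either the record count or the accumulated string size crosses its limit. The function name `iter_batches`, the default values, and the byte estimate (string lengths only) are assumptions made for illustration, not the behaviour of the code in this PR.

```python
def iter_batches(records, batch_size=1000, max_batch_bytes=64 * 1024 * 1024):
    """Group records into lists, flushing on whichever threshold is hit first.

    Defaults are illustrative assumptions, not the PR's actual values.
    """
    batch, batch_bytes = [], 0
    for record in records:
        batch.append(record)
        # Rough size estimate: total length of the record's string fields.
        batch_bytes += sum(len(v) for v in record.values() if isinstance(v, str))
        if len(batch) >= batch_size or batch_bytes >= max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        # Emit whatever is left after the input is exhausted.
        yield batch
```

The count threshold keeps batches small for files with many short records, while the byte threshold bounds memory when a few records carry very long sequences.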