@behroozazarkhalili behroozazarkhalili commented Dec 31, 2025

Summary

This PR adds support for loading FASTA files directly with load_dataset(), addressing feedback from #7851.

FASTA is a text-based format for representing nucleotide sequences (DNA/RNA) or peptide sequences (proteins), widely used in bioinformatics.

Key Features

  • Zero external dependencies - Uses a lightweight pure Python parser based on readfq.py by Heng Li
  • Streaming support - Generator-based parsing for memory efficiency with large genomic files
  • Compression support - Automatic detection and handling of gzip, bzip2, and xz compressed files via magic bytes
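  • Large sequence support - Uses the large_string Arrow type, whose 64-bit offsets avoid the 32-bit offset overflow that the regular string type hits on viral genomes and other long sequences (the "UTF-8 overflow" reported in #7851)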
  • Adaptive batching - max_batch_bytes parameter (default 256MB) prevents Parquet page size errors with very large sequences
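To make the generator-based parsing concrete, here is a minimal sketch of a streaming FASTA reader. It is not the parser from this PR and not readfq.py itself; the function name `iter_fasta` is hypothetical, and the sketch assumes plain FASTA input (no FASTQ records), ignoring any lines that appear before the first `>` header.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of text lines.

    Hypothetical helper for illustration only. Records are yielded one at a
    time, so only the current record is held in memory.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, which keeps memory use proportional to the largest single record rather than to the file size.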

Technical Decisions (Addressing #7851 Feedback)

| Concern | Solution |
| --- | --- |
| Long sequences → UTF-8 overflow (@apcamargo, @UriNeri) | Uses pa.large_string() for sequence column |
| BioPython is overkill (@apcamargo) | Pure Python parser based on Heng Li's readfq.py |
| Parquet page size limit i32::MAX (@UriNeri) | Adaptive dual-threshold batching with max_batch_bytes |
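One way to read "dual-threshold batching" is that a batch is flushed when either a row limit or a byte limit would be exceeded. The sketch below illustrates that reading only; it is an assumption about the approach, not the code in this PR. The name `batch_records` and the `max_rows` parameter are hypothetical, while `max_batch_bytes` and its 256MB default come from the description above.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (id, description, sequence)


def batch_records(
    records: Iterable[Record],
    max_rows: int = 10_000,  # hypothetical row threshold
    max_batch_bytes: int = 256 * 1024 * 1024,
) -> Iterator[List[Record]]:
    """Group records into batches bounded by row count and sequence bytes."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        # Assumes ASCII sequences, so character count equals byte count.
        size = len(record[2])
        full_rows = len(batch) >= max_rows
        full_bytes = batch_bytes + size > max_batch_bytes
        if batch and (full_rows or full_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

In this sketch a single sequence larger than `max_batch_bytes` still ends up in a batch of its own, so no record is ever dropped or split.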

Columns

| Column | Type | Description |
| --- | --- | --- |
| id | string | Sequence identifier (first word after >) |
| description | string | Full description line (everything after id) |
| sequence | large_string | The biological sequence (DNA/RNA/protein) |
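The split between id and description can be illustrated with a small helper. This is a sketch under assumptions: the name `split_header` is hypothetical, and returning an empty string for a missing description is a guess (the builder in this PR may represent it differently).

```python
from typing import Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a FASTA header (without the leading '>') into (id, description).

    The id is the first whitespace-delimited word; the description is the
    rest of the line, or "" when nothing follows the id.
    """
    parts = header.split(None, 1)
    if not parts:
        return "", ""
    seq_id = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    return seq_id, description
```

For a header such as `sp|P12345|ABC_HUMAN Some protein`, the id would be `sp|P12345|ABC_HUMAN` and the description `Some protein`.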

Supported Extensions

.fa, .fasta, .fna, .ffn, .faa, .frn (and compressed variants)
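Detecting compression by magic bytes rather than by file extension can be sketched with the standard library alone. The helper name `open_maybe_compressed` is hypothetical and this is not the PR's implementation; the byte signatures are the standard ones for gzip, bzip2, and xz.

```python
import bz2
import gzip
import lzma

# (signature, opener) pairs; signatures are the leading bytes of each format.
MAGIC_OPENERS = (
    (b"\x1f\x8b", gzip.open),        # gzip
    (b"BZh", bz2.open),              # bzip2
    (b"\xfd7zXZ\x00", lzma.open),    # xz
)


def open_maybe_compressed(path):
    """Open a file as UTF-8 text, decompressing if a known signature matches."""
    with open(path, "rb") as fh:
        head = fh.read(6)  # longest signature above is 6 bytes
    for magic, opener in MAGIC_OPENERS:
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")
```

Sniffing the first bytes means a gzipped file is still read correctly when its name lacks a `.gz` suffix, at the cost of opening each file twice.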

Usage

from datasets import load_dataset

# Load FASTA file
dataset = load_dataset("fasta", data_files="sequences.fasta")

# Load with column filtering
dataset = load_dataset("fasta", data_files="sequences.fa", columns=["id", "sequence"])

# Load gzipped file
dataset = load_dataset("fasta", data_files="sequences.fa.gz")

# Configure batching for very large genomes
dataset = load_dataset("fasta", data_files="genome.fasta", max_batch_bytes=128*1024*1024)

Test Plan

  • Basic FASTA loading (3 sequences, multi-line)
  • Multiple extension support (.fa, .fasta, .fna, .ffn, .faa, .frn)
  • Compression formats (gzip, bz2, xz)
  • Long sequences with large_string type
  • Column filtering
  • Batch size configuration
  • Byte-based batching (max_batch_bytes)
  • Large genome handling (simulated 50KB sequences)
  • Empty description handling
  • Multiple files loading
  • Custom feature casting

All 22 tests passing.

cc: @georgia-hf

Add native support for loading FASTA biological sequence files with zero
external dependencies. This addresses feedback from PR huggingface#7851.

Key features:
- Pure Python parser based on Heng Li's readfq.py (no BioPython dependency)
- Uses pa.large_string() for sequences to avoid 32-bit offset overflow with long genomes
- Adaptive byte-based batching (max_batch_bytes=256MB) prevents Parquet page
  size errors with very large sequences like complete viral genomes
- Supports gzip, bzip2, and xz compression via magic byte detection
- Column filtering: select subset of [id, description, sequence]

Supported extensions: .fa, .fasta, .fna, .ffn, .faa, .frn