feat(fasta): add lightweight FASTA file format support #7923

behroozazarkhalili · 2025-12-31T19:33:00Z

Summary

This PR adds support for loading FASTA files directly with load_dataset(), addressing feedback from #7851.

FASTA is a text-based format for representing nucleotide sequences (DNA/RNA) or peptide sequences (proteins), widely used in bioinformatics.

Key Features

Zero external dependencies - Uses a lightweight pure Python parser based on readfq.py by Heng Li
Streaming support - Generator-based parsing for memory efficiency with large genomic files
Compression support - Automatic detection and handling of gzip, bzip2, and xz compressed files via magic bytes
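Large sequence support - Uses the Arrow large_string type (64-bit offsets) for the sequence column, so viral genomes and other long sequences do not overflow the 32-bit offsets of the regular string type (fixes the UTF-8 overflow reported in #7851)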
Adaptive batching - max_batch_bytes parameter (default 256MB) prevents Parquet page size errors with very large sequences
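For reviewers unfamiliar with magic-byte detection, here is a minimal sketch of the idea using only the standard library. The helper name `open_maybe_compressed` and the `MAGIC_OPENERS` table are hypothetical and are not taken from this PR; the actual implementation may structure this differently.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes that identify each compression format.
MAGIC_OPENERS = (
    (b"\x1f\x8b", gzip.open),       # gzip
    (b"BZh", bz2.open),             # bzip2
    (b"\xfd7zXZ\x00", lzma.open),   # xz
)


def open_maybe_compressed(path):
    """Open `path` as text, decompressing when the leading bytes match.

    Hypothetical helper for illustration; not the PR's actual function.
    """
    with open(path, "rb") as raw:
        head = raw.read(6)  # the longest signature above is 6 bytes
    for magic, opener in MAGIC_OPENERS:
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    # No signature matched: treat the file as plain text.
    return open(path, "rt", encoding="utf-8")
```

Detecting by content rather than by file extension means a gzipped file named `sequences.fa` would still be read correctly under this approach.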

Technical Decisions (Addressing #7851 Feedback)

Concern	Solution
Long sequences → UTF-8 overflow (@apcamargo, @UriNeri)	Uses `pa.large_string()` for sequence column
BioPython is overkill (@apcamargo)	Pure Python parser based on Heng Li's readfq.py
Parquet page size limit i32::MAX (@UriNeri)	Adaptive dual-threshold batching with `max_batch_bytes`
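To make the batching row above concrete, here is a rough sketch of what dual-threshold batching could look like. It assumes the two thresholds are a row count and a byte budget; `batch_records` and `batch_size` are hypothetical names, and only `max_batch_bytes` and its 256MB default come from this PR.

```python
def batch_records(records, batch_size=10_000, max_batch_bytes=256 * 1024 * 1024):
    """Yield lists of records, flushing on whichever limit is reached first.

    Hypothetical sketch: `records` is an iterable of dicts with a
    "sequence" key. Sequence size is measured in characters, which equals
    bytes under the assumption that FASTA sequences are ASCII.
    """
    batch, batch_bytes = [], 0
    for record in records:
        record_bytes = len(record["sequence"])
        # Flush first if adding this record would exceed the byte budget.
        if batch and batch_bytes + record_bytes > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += record_bytes
        # Flush when the row-count threshold is reached.
        if len(batch) >= batch_size:
            yield batch
            batch, batch_bytes = [], 0
    if batch:
        yield batch
```

In this sketch a single record larger than `max_batch_bytes` is still emitted, alone in its own batch, since it cannot be split.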

Columns

Column	Type	Description
`id`	string	Sequence identifier (first word after `>`)
`description`	string	Description text from the header line (everything after the id)
`sequence`	large_string	The biological sequence (DNA/RNA/protein)
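As an illustration of how a record maps onto these three columns, here is a simplified generator-based parser. It is not readfq.py and not the parser in this PR; `parse_fasta` and `_make_record` are hypothetical names, and the sketch assumes `description` holds only the text after the id.

```python
def parse_fasta(lines):
    """Yield one dict per FASTA record from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence data may be wrapped across several lines.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # The first whitespace-delimited word is the id; the rest is the description.
    parts = header.split(None, 1)
    return {
        "id": parts[0] if parts else "",
        "description": parts[1] if len(parts) > 1 else "",
        "sequence": "".join(chunks),
    }
```

Because the function is a generator, it can be fed an open file handle and only holds one record in memory at a time.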

Supported Extensions

.fa, .fasta, .fna, .ffn, .faa, .frn (and compressed variants)

Usage

from datasets import load_dataset

# Load FASTA file
dataset = load_dataset("fasta", data_files="sequences.fasta")

# Load with column filtering
dataset = load_dataset("fasta", data_files="sequences.fa", columns=["id", "sequence"])

# Load gzipped file
dataset = load_dataset("fasta", data_files="sequences.fa.gz")

# Configure batching for very large genomes
dataset = load_dataset("fasta", data_files="genome.fasta", max_batch_bytes=128*1024*1024)

Test Plan

All 22 tests passing.

cc: @georgia-hf

Add native support for loading FASTA biological sequence files with zero external dependencies. This addresses feedback from PR huggingface#7851.

Key features:
- Pure Python parser based on Heng Li's readfq.py (no BioPython dependency)
- Uses pa.large_string() for sequences to handle UTF-8 overflow with long genomes
- Adaptive byte-based batching (max_batch_bytes=256MB) prevents Parquet page size errors with very large sequences like complete viral genomes
- Supports gzip, bzip2, and xz compression via magic byte detection
- Column filtering: select subset of [id, description, sequence]

Supported extensions: .fa, .fasta, .fna, .ffn, .faa, .frn

This was referenced Dec 31, 2025

Add lightweight FASTQ file format support #7924

Open

feat: Add mmCIF file support for macromolecular structures #7925

Open

Add lightweight PDB (Protein Data Bank) file support #7926

Open
