@behroozazarkhalili

Summary

This PR adds support for loading PDB (Protein Data Bank) files directly with load_dataset().

PDB is the legacy fixed-width format for representing 3D macromolecular structures, widely used for historical datasets and still common in computational biology workflows.

Key Features

  • Zero external dependencies - Pure Python parser using the fixed-width column positions defined in the official PDB specification
  • Record type filtering - Load ATOM, HETATM, or both record types
  • Column selection - Choose specific columns to reduce memory usage
  • Compression support - Automatic detection of gzip, bzip2, and xz compressed files via magic bytes
  • Batch processing - Configurable batch size for memory-efficient processing
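The batch-processing feature can be pictured as grouping parsed rows into fixed-size chunks, so that only one chunk is held in memory at a time. The sketch below is illustrative only: `iter_batches` is a hypothetical helper name, not a function taken from this PR.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most `batch_size` items from `rows` (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(rows)
    while True:
        # islice pulls at most batch_size items without reading the whole input
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in iter_batches(range(5), batch_size=2):
        print(batch)
```

A smaller batch size lowers peak memory at the cost of more per-batch overhead.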

Columns (ATOM/HETATM records)

| Column | Type | Description |
| --- | --- | --- |
| record_type | string | ATOM or HETATM |
| atom_serial | int32 | Atom serial number |
| atom_name | string | Atom name (e.g., CA, N, C) |
| residue_name | string | Residue name (e.g., ALA, GLY) |
| chain_id | string | Chain identifier |
| residue_seq | int32 | Residue sequence number |
| x, y, z | float32 | Coordinates (Å) |
| occupancy | float32 | Occupancy factor |
| temp_factor | float32 | Temperature factor (B-factor) |
| element | string | Element symbol |
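For readers unfamiliar with the format, here is a minimal sketch of how these columns can be sliced out of a fixed-width ATOM/HETATM line. The slice bounds follow the column ranges in the wwPDB format specification. `ATOM_FIELDS` and `parse_atom_line` are hypothetical names chosen for illustration; this is not the parser shipped in this PR.

```python
from typing import Any, Dict, Optional, Sequence

# name -> (start, end, cast); bounds are 0-based Python slices derived from the
# 1-based column ranges of the PDB spec (e.g. columns 31-38 -> [30:38]).
ATOM_FIELDS = {
    "record_type": (0, 6, str),
    "atom_serial": (6, 11, int),
    "atom_name": (12, 16, str),
    "residue_name": (17, 20, str),
    "chain_id": (21, 22, str),
    "residue_seq": (22, 26, int),
    "x": (30, 38, float),
    "y": (38, 46, float),
    "z": (46, 54, float),
    "occupancy": (54, 60, float),
    "temp_factor": (60, 66, float),
    "element": (76, 78, str),
}


def parse_atom_line(
    line: str, record_types: Sequence[str] = ("ATOM", "HETATM")
) -> Optional[Dict[str, Any]]:
    """Return a row dict for a matching record, or None for any other line."""
    if line[0:6].strip() not in record_types:
        return None
    row: Dict[str, Any] = {}
    for name, (start, end, cast) in ATOM_FIELDS.items():
        raw = line[start:end].strip()
        # Blank numeric fields (or short lines) become None instead of raising
        row[name] = raw if cast is str else (cast(raw) if raw else None)
    return row


# Sample line assembled field by field so the column widths are explicit.
SAMPLE = (
    "ATOM  " + "    1" + " " + " CA " + " " + "ALA" + " " + "A" + "   1"
    + " " + "   " + "  11.104" + "   6.134" + "  -6.504"
    + "  1.00" + " 20.50" + " " * 10 + " C"
)

if __name__ == "__main__":
    print(parse_atom_line(SAMPLE))
```

Slicing by position rather than splitting on whitespace matters because adjacent fields may touch, for example a four-character atom name next to an alternate-location code, or large coordinates that fill their full eight-character width.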

Supported Extensions

.pdb, .ent (and compressed variants)
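Compressed variants can be recognized by their leading magic bytes rather than by file extension. The following is a rough sketch of that idea using only the standard library. `detect_compression` and `open_text` are hypothetical helpers, and the actual detection code in this PR may differ.

```python
import bz2
import gzip
import lzma
from typing import Optional, TextIO

# Well-known file signatures for the three supported compression formats.
MAGIC_BYTES = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bz2"),
    (b"\xfd7zXZ\x00", "xz"),
)

OPENERS = {"gzip": gzip.open, "bz2": bz2.open, "xz": lzma.open}


def detect_compression(head: bytes) -> Optional[str]:
    """Return 'gzip', 'bz2', 'xz', or None based on the first bytes of a file."""
    for magic, name in MAGIC_BYTES:
        if head.startswith(magic):
            return name
    return None


def open_text(path: str) -> TextIO:
    """Open `path` as text, decompressing transparently when a signature matches."""
    with open(path, "rb") as fh:
        head = fh.read(6)  # the longest signature above is 6 bytes
    opener = OPENERS.get(detect_compression(head), open)
    return opener(path, "rt", encoding="utf-8")
```

Checking content instead of the suffix means a gzip-compressed file named `structure.pdb` would still be read correctly.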

Usage

from datasets import load_dataset

# Load PDB file
dataset = load_dataset("pdb", data_files="structure.pdb")

# Load only ATOM records (exclude ligands/water)
dataset = load_dataset("pdb", data_files="structure.pdb", record_types=["ATOM"])

# Load specific columns
dataset = load_dataset("pdb", data_files="structure.pdb", 
                       columns=["atom_name", "residue_name", "x", "y", "z"])

Use Cases

  • Legacy structure dataset processing
  • Molecular dynamics trajectory analysis
  • Structure-based ML training data
  • Protein visualization data preparation

Test Results

All 24 tests pass:

  • Basic loading, column filtering, record type filtering
  • Gzip compression, multi-chain structures, alternate locations
  • Charged atoms, batch sizes, schema types, feature casting
  • Empty files, multiple files, insertion codes, negative coordinates

Part of the bioinformatics file format support series (FASTA #7923, FASTQ #7924, mmCIF #7925).

cc @georgia-hf

- Add zero-dependency pure Python parser for PDB format
- Support ATOM and HETATM record types with configurable filtering
- Handle fixed-width column parsing per official PDB specification
- Support gzip, bzip2, and xz compression via magic bytes detection
- Support .pdb and .ent file extensions
- Add comprehensive test suite with 24 tests
- Add documentation to loading.mdx

Columns include: atom_serial, atom_name, residue_name, chain_id,
residue_seq, x, y, z, occupancy, temp_factor, element, and more.

Part of the bioinformatics file format support series.