Proposal: Protein 3D Structure Visualization for Dataset Viewer #7930
behroozazarkhalili wants to merge 3 commits into huggingface:main
This PR proposes adding 3D protein structure visualization to the HuggingFace Dataset Viewer using 3Dmol.js (~150KB gzipped). See PR body for full proposal details.
cc @georgia-hf - Following up on your question about protein visualization for the Dataset Viewer. This proposal recommends 3Dmol.js (~150KB gzipped) as a lightweight alternative to Mol* (~1.3MB gzipped). Looking forward to your feedback!
Exciting! cc @cfahlgren1 @severo for the Viewer part. For the
I don't know the JS libraries, but indeed, the lighter the better, as we don't require advanced features.
From a quick look at the PDB and mmCIF PRs, I noticed that the dataset has one row = one atom. However, I humbly believe that such datasets would be more practical to use if one row = one structure. This way each row is independent, which is practical in ML for train/test splits or dataset shuffling. It would also make it easier to add labels and metadata for each structure, similar to what we already do for images. E.g. you could group the files into folders named after a label, or add a metadata.parquet file with custom metadata per structure. And this way the Viewer could show one 3D render per row. What do you think?
@lhoestq @severo @georgia-hf I will wait for all your comments, then start implementing the final plan.
adding some remarks from @0gust1 (feel free to add comments here!):
Proposal: Protein 3D Structure Visualization for HuggingFace Dataset Viewer
Executive Summary
This proposal outlines adding 3D protein structure visualization to the HuggingFace Dataset Viewer, enabling users to interactively view PDB and mmCIF molecular structures directly within the dataset preview interface.
Data Type Support (Updated Architecture)
Supported formats (from recent PRs):
- .pdb, .ent extensions via PdbFolder builder
- .cif, .mmcif extensions via MmcifFolder builder
New Implementation Pattern (One Row = One Structure):
Both PRs have been refactored to follow the ImageFolder pattern, where each row in the dataset contains one complete protein structure file. This is the recommended ML-friendly approach:
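A minimal sketch of what one row per structure buys in practice, using only the Python standard library. The extension-to-builder mapping and the structure column name come from this proposal; the row layout (a dict with structure, label and builder keys), the helper names and the sample files are hypothetical illustrations, not the actual schema or API of either builder.

```python
import random
from pathlib import PurePosixPath

# Extension -> builder mapping as listed in the proposal.
BUILDER_BY_EXTENSION = {
    ".pdb": "PdbFolder",
    ".ent": "PdbFolder",
    ".cif": "MmcifFolder",
    ".mmcif": "MmcifFolder",
}


def builder_for(path: str) -> str:
    """Return the builder name that handles this file extension."""
    return BUILDER_BY_EXTENSION[PurePosixPath(path).suffix.lower()]


def make_row(path: str, content: str) -> dict:
    """Build one row holding one complete structure file (hypothetical layout)."""
    parts = PurePosixPath(path).parts
    # ImageFolder-style convention: the parent folder name acts as the label.
    label = parts[-2] if len(parts) >= 2 else None
    return {"structure": content, "label": label, "builder": builder_for(path)}


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle whole rows, then split; a structure is never cut in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up file names and truncated contents, for illustration only.
files = {
    "enzyme/a.pdb": "ATOM      1  N   MET A   1\nEND\n",
    "enzyme/b.ent": "ATOM      1  N   GLY A   1\nEND\n",
    "receptor/c.cif": "data_c\n",
    "receptor/d.mmcif": "data_d\n",
}
rows = [make_row(path, content) for path, content in files.items()]
train, test = train_test_split(rows)
print(len(train), "train rows,", len(test), "test rows")
```

Because every row carries a whole file, shuffling and splitting operate on complete structures, and a viewer can render exactly one 3D model per row.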
Key Components:
What gets visualized:
Not applicable (1D sequence only):
Visualization Library Comparison
Bundle sizes verified by downloading actual distribution files from npm/CDN (January 2026)
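For anyone who wants to repeat the size check, here is a rough sketch of measuring the gzipped size of a distribution file with the Python standard library. The function names and the sample payload are hypothetical; a real check would read the downloaded bundle from disk, and results vary slightly with the compression level a CDN uses.

```python
import gzip


def gzipped_size(data: bytes, level: int = 9) -> int:
    """Return the size in bytes of `data` after gzip compression."""
    return len(gzip.compress(data, compresslevel=level))


def describe(name: str, data: bytes) -> str:
    """Format the raw and gzipped sizes of a bundle in KB."""
    raw_kb = len(data) / 1024
    gz_kb = gzipped_size(data) / 1024
    return f"{name}: {raw_kb:.1f} KB raw, {gz_kb:.1f} KB gzipped"


# Stand-in payload; in practice use something like
# pathlib.Path("bundle.min.js").read_bytes() on the downloaded file.
sample = b"function render(viewer) { viewer.render(); }\n" * 500
print(describe("sample bundle", sample))
```

With the figures quoted in this proposal, and taking 1 MB as 1000 KB, Mol* at ~1.3 MB is roughly 8.7 times the size of 3Dmol.js at ~150 KB.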
Recommendation: 3Dmol.js
Primary choice: 3Dmol.js
Rationale:
Why not Mol*? As Georgia noted, Mol* is heavy (~1.3 MB gzipped). While it is the industry standard used by RCSB PDB, it is overkill for a dataset preview, where users only need to verify that the structure data looks correct.
Alternative for power users: If users need advanced features like density maps, ligand interactions, or sequence alignment overlay, consider PDBe Molstar as an optional "full viewer" mode.
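To give a feel for how little code a basic 3Dmol.js preview needs, the sketch below generates a standalone HTML page from the text of one structure file. It is an illustration under assumptions, not the Dataset Viewer's implementation: the function name, the CDN URL, the element id and the cartoon/spectrum styling are choices made for this example, and the format string for mmCIF files should be checked against the 3Dmol.js documentation before use.

```python
import html
import json

# Assumption: the library is loaded from its public build URL; a real
# integration would more likely bundle it with the front end.
THREEDMOL_SRC = "https://3Dmol.org/build/3Dmol-min.js"


def render_preview_html(structure: str, fmt: str = "pdb") -> str:
    """Return a standalone HTML page that shows `structure` with 3Dmol.js."""
    # json.dumps produces a valid JavaScript string literal; escaping "</"
    # keeps file content from closing the <script> element early.
    js_data = json.dumps(structure).replace("</", "<\\/")
    js_fmt = json.dumps(fmt)
    return f"""<!DOCTYPE html>
<html>
<head><script src="{html.escape(THREEDMOL_SRC)}"></script></head>
<body>
<div id="viewer" style="width: 400px; height: 300px; position: relative;"></div>
<script>
  const viewer = $3Dmol.createViewer(document.getElementById("viewer"),
                                     {{ backgroundColor: "white" }});
  viewer.addModel({js_data}, {js_fmt});
  viewer.setStyle({{}}, {{ cartoon: {{ color: "spectrum" }} }});
  viewer.zoomTo();
  viewer.render();
</script>
</body>
</html>"""


# Truncated, made-up PDB content, only to exercise the function.
page = render_preview_html("ATOM      1  N   MET A   1\nEND\n")
print(len(page), "characters of HTML")
```

Writing the returned string to a .html file and opening it in a browser is enough to try the viewer on a real PDB file.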
Summary
Recommended approach:
Backend implementation (Updated):
- The structure column contains the complete file content, ready for 3D rendering
Next Steps