UME HF Integration with ONNX #177


Open · wants to merge 15 commits into main

Conversation

@karinazad (Collaborator) commented on Aug 1, 2025

Description

Add UME to HuggingFace (HF) with ONNX export.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring

def __init__(
    self,
    model_name: Literal[
        "ume-mini-base-12M", "ume-small-base-90M", "ume-medium-base-480M", "ume-large-base-740M"
karinazad (Collaborator, Author) commented:

avoid lobster imports and redefine this literal here

karinazad changed the title from "[Draft] UME HF Integration with ONNX" to "UME HF Integration with ONNX" on Aug 1, 2025

# Run inference
with torch.no_grad():
    output = model(input_ids.unsqueeze(1), attention_mask.unsqueeze(1))
karinazad (Collaborator, Author) commented:

This unsqueeze here is pretty ugly - I need to fix it in the tokenizer or the ONNX export directly.
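As a rough sketch of the fix being discussed, the extra dimension could be added once in a small helper around the tokenizer output instead of at every call site; the helper name and dict keys below are assumptions, not the actual tokenizer or export API:

import torch

def prepare_onnx_inputs(encoded: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    # Hypothetical helper: add the singleton dimension the exported model expects
    # in one place, so callers no longer need to unsqueeze themselves.
    return {
        "input_ids": encoded["input_ids"].unsqueeze(1),
        "attention_mask": encoded["attention_mask"].unsqueeze(1),
    }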

)

# Example amino acid sequences (same sequence duplicated to make a batch of two)
sequences = ["MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLA"] * 2
Collaborator commented:

TODO (future PR): modify the example to show how to dynamically detect the modality, then tokenize and embed appropriately.
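A minimal sketch of what that dynamic detection could look like, assuming a simple character-set heuristic; the function name and modality labels are illustrative, not the library's API:

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDES = set("ACGTU")

def detect_modality(sequence: str) -> str:
    # Guess the modality of a raw string from its alphabet; anything that is
    # neither pure nucleotide nor pure amino acid is treated as SMILES.
    chars = set(sequence.upper())
    if chars and chars <= NUCLEOTIDES:
        return "nucleotide"
    if chars and chars <= AMINO_ACIDS:
        return "amino_acid"
    return "smiles"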

# TODO: currently, these will work for internal users
# Support for external users will be added soon

UME_CHECKPOINT_DICT_S3_BUCKET = "prescient-lobster"
UME_CHECKPOINT_DICT_S3_KEY = "ume/checkpoints.json"
UME_CHECKPOINT_DICT_S3_URI = f"s3://{UME_CHECKPOINT_DICT_S3_BUCKET}/{UME_CHECKPOINT_DICT_S3_KEY}"

UME_MODEL_VERSIONS = [
Collaborator commented:

Use an Enum here, as Allen suggested?
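A minimal sketch of that suggestion, reusing the four model names from this PR; the class name is an assumption:

from enum import Enum

class UMEModelVersion(str, Enum):
    # Hypothetical Enum replacing the UME_MODEL_VERSIONS list.
    MINI = "ume-mini-base-12M"
    SMALL = "ume-small-base-90M"
    MEDIUM = "ume-medium-base-480M"
    LARGE = "ume-large-base-740M"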

@@ -0,0 +1,29 @@
# UME HuggingFace Integration
Collaborator commented:

nice, can you link to this in the developer docs README?


def __init__(
    self,
    model_name: Literal[
Collaborator commented:

can we import this so it only has to be defined in one place?
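One possible shape for that, sketched under the assumption of a shared constants module inside the HF integration package (the module path and alias name are assumptions):

# constants.py in the HF integration package (hypothetical path)
from typing import Literal

UMEModelName = Literal[
    "ume-mini-base-12M",
    "ume-small-base-90M",
    "ume-medium-base-480M",
    "ume-large-base-740M",
]

# then, in the modules that need it:
# from .constants import UMEModelName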

def export_ume_models_to_onnx():
    for model_version in UME_MODEL_VERSIONS:
        model = UME.from_pretrained(model_version)
        model.export_onnx(HF_UME_MODEL_DIRPATH / f"{model_version}.onnx", modality=Modality.SMILES)
Collaborator commented:

Will this only handle SMILES-compatible models?
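If the goal is to cover every modality, the loop could be extended along these lines; the Modality members other than SMILES and the per-modality file suffix are assumptions, not confirmed by this PR:

def export_ume_models_to_onnx_all_modalities():
    # Hypothetical variant: export one ONNX file per (model version, modality) pair
    # instead of SMILES only.
    for model_version in UME_MODEL_VERSIONS:
        model = UME.from_pretrained(model_version)
        for modality in (Modality.SMILES, Modality.AMINO_ACID, Modality.NUCLEOTIDE):
            model.export_onnx(
                HF_UME_MODEL_DIRPATH / f"{model_version}-{modality.value}.onnx",
                modality=modality,
            )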

| UME-medium | 480M | 24 | 1280 | 20 | High performance applications |
| UME-large | 740M | 24 | 1600 | 25 | Best performance |

All model sizes are optimized for GPU hardware efficiency following established best practices. Currently, all variants use the same model identifier. The default loaded model is UME-mini.
Collaborator commented:

should we change this default to UME-medium?

### Intra-Entity Modalities
Different representations of the **same biological entity**:
- Protein sequence → SMILES representation (chemical view of peptide)
- DNA sequence → Amino acid sequence (central dogma)
Collaborator commented:

We'd need transcription & translation (and the reverse) to claim the central dogma here; could say "towards central dogma" instead.

- **Proteins**: AMPLIFY (360.7M), PeptideAtlas (4.2M)
- **Small molecules**: ZINC (588.7M), M³-20M (20.8M)
- **Nucleotides**: CaLM (8.6M)
- **Structures**: PINDER (267K), PDBBind, ATOMICA, GEOM (1.17M)
Collaborator commented:

how many samples in PDBBind and ATOMICA?


### Capabilities & Limitations
**Q: Can UME generate sequences?**
- No - encoder-only model for representation learning
Collaborator commented:

Technically yes, via Gibbs sampling - primarily for infilling/inpainting and conditional generation.
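To illustrate the point, a toy sketch of Gibbs-style infilling with a masked encoder; `model`, `tokenizer`, and the `.logits` attribute are generic stand-ins here, not UME's actual interface:

import torch

def gibbs_infill(model, tokenizer, input_ids: torch.Tensor, masked_positions: list[int], n_iters: int = 10) -> torch.Tensor:
    # Repeatedly re-mask and resample the chosen positions from the masked-LM head.
    ids = input_ids.clone()
    for _ in range(n_iters):
        for pos in masked_positions:
            ids[0, pos] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(ids).logits  # (batch, seq_len, vocab_size)
            probs = torch.softmax(logits[0, pos], dim=-1)
            ids[0, pos] = int(torch.multinomial(probs, num_samples=1).item())
    return ids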

from transformers import PreTrainedTokenizer

# HuggingFace repository configuration
HF_UME_REPO_ID = "karina-zadorozhny/ume"
Collaborator commented:

import from constants?
