UME HF Integration with ONNX #177
base: main
Conversation
def __init__(
    self,
    model_name: Literal[
        "ume-mini-base-12M", "ume-small-base-90M", "ume-medium-base-480M", "ume-large-base-740M"
avoid lobster imports and redefine this literal here
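For reference, a minimal sketch of what redefining the literal locally (without the lobster import) could look like, reusing the model names already in the diff; the `UMEModelName` alias is a hypothetical name, not the PR's code:

```python
# Hypothetical local alias defined in the HF integration module itself.
from typing import Literal

UMEModelName = Literal[
    "ume-mini-base-12M",
    "ume-small-base-90M",
    "ume-medium-base-480M",
    "ume-large-base-740M",
]
```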
# Run inference
with torch.no_grad():
    output = model(input_ids.unsqueeze(1), attention_mask.unsqueeze(1))
this unsqueeze here is pretty ugly - I need to fix it in the tokenizer or ONNX export directly
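One way to absorb the extra dimension at export time would be to wrap the model before calling the exporter, so the `unsqueeze` lives inside the exported graph rather than at the call site; this is only a sketch under that assumption, and the wrapper class is not part of the PR:

```python
import torch

class UMEOnnxWrapper(torch.nn.Module):
    """Hypothetical wrapper so callers can pass plain 2D (batch, seq_len) tensors."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask):
        # the unsqueeze happens inside the graph, not at the call site
        return self.model(input_ids.unsqueeze(1), attention_mask.unsqueeze(1))
```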
# Example amino acid sequence
sequences = ["MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLA"] * 2
TODO: future PR, modify example to show how to dynamically detect modality and tokenize and embed appropriately
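As a rough illustration of that TODO, a character-set heuristic is one way modality detection could work; the helper name and returned labels below are assumptions, not the planned implementation:

```python
# Hypothetical heuristic; real detection would more likely live in the tokenizer.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDES = set("ACGTU")

def guess_modality(sequence: str) -> str:
    chars = set(sequence.upper())
    if chars and chars <= NUCLEOTIDES:
        return "nucleotide"
    if chars and chars <= AMINO_ACIDS:
        return "amino_acid"
    return "smiles"
```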
# TODO: currently, these will work for internal users
# Support for external users will be added soon

UME_CHECKPOINT_DICT_S3_BUCKET = "prescient-lobster"
UME_CHECKPOINT_DICT_S3_KEY = "ume/checkpoints.json"
UME_CHECKPOINT_DICT_S3_URI = f"s3://{UME_CHECKPOINT_DICT_S3_BUCKET}/{UME_CHECKPOINT_DICT_S3_KEY}"

UME_MODEL_VERSIONS = [ |
Enum, as Allen suggested?
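A sketch of the Enum being suggested, reusing the model names already in the diff; the class and member names are hypothetical:

```python
from enum import Enum

class UMEModelVersion(str, Enum):
    MINI = "ume-mini-base-12M"
    SMALL = "ume-small-base-90M"
    MEDIUM = "ume-medium-base-480M"
    LARGE = "ume-large-base-740M"
```

Subclassing `str` would keep existing string comparisons and f-string usage working.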
@@ -0,0 +1,29 @@
# UME HuggingFace Integration |
nice, can you link to this in the developer docs README?
def __init__(
    self,
    model_name: Literal[
can we import this so it only has to be defined in one place?
def export_ume_models_to_onnx():
    for model_version in UME_MODEL_VERSIONS:
        model = UME.from_pretrained(model_version)
        model.export_onnx(HF_UME_MODEL_DIRPATH / f"{model_version}.onnx", modality=Modality.SMILES)
this will only do SMILES-compatible models?
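If the answer is that other modalities should also be exported, the loop could iterate over modalities as well; the extra `Modality` members and the file-naming scheme below are assumptions, not confirmed by the PR:

```python
def export_ume_models_to_onnx():
    for model_version in UME_MODEL_VERSIONS:
        model = UME.from_pretrained(model_version)
        # Modality.AMINO_ACID / Modality.NUCLEOTIDE are assumed member names
        for modality in (Modality.SMILES, Modality.AMINO_ACID, Modality.NUCLEOTIDE):
            model.export_onnx(
                HF_UME_MODEL_DIRPATH / f"{model_version}-{modality.value}.onnx",
                modality=modality,
            )
```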
| UME-medium | 480M | 24 | 1280 | 20 | High performance applications |
| UME-large | 740M | 24 | 1600 | 25 | Best performance |

All model sizes are optimized for GPU hardware efficiency following established best practices. Currently, all variants use the same model identifier. The default loaded model is UME-mini. |
should we change this default to UME-medium?
### Intra-Entity Modalities
Different representations of the **same biological entity**:
- Protein sequence → SMILES representation (chemical view of peptide)
- DNA sequence → Amino acid sequence (central dogma)
need transcription & translation and reverse to claim central dogma here. could say "towards central dogma"
- **Proteins**: AMPLIFY (360.7M), PeptideAtlas (4.2M)
- **Small molecules**: ZINC (588.7M), M³-20M (20.8M)
- **Nucleotides**: CaLM (8.6M)
- **Structures**: PINDER (267K), PDBBind, ATOMICA, GEOM (1.17M)
how many samples in PDBBind and ATOMICA?
### Capabilities & Limitations | ||
**Q: Can UME generate sequences?** | ||
- No - encoder-only model for representation learning |
technically yes, via Gibbs sampling - primarily for infilling/inpainting and conditional generation
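For context, a generic sketch of Gibbs-style infilling with a masked encoder; the HF masked-LM-style `model(ids).logits` interface and the function below are illustrative assumptions, not UME's API:

```python
import torch

def gibbs_infill(model, tokenizer, input_ids, masked_positions, n_sweeps=10):
    # Repeatedly re-mask and resample each chosen position from the model's conditionals.
    ids = input_ids.clone()
    for _ in range(n_sweeps):
        for pos in masked_positions:
            ids[0, pos] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(ids).logits  # assumed shape (1, seq_len, vocab)
            probs = torch.softmax(logits[0, pos], dim=-1)
            ids[0, pos] = torch.multinomial(probs, num_samples=1)
    return ids
```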
from transformers import PreTrainedTokenizer

# HuggingFace repository configuration
HF_UME_REPO_ID = "karina-zadorozhny/ume"
import from constants?
Description
Add UME to HF
Type of Change