BioNeMo Framework Version
Bug Description
I am trying to create gene-level embeddings from a dataset using Geneformer. I converted the dataset from H5AD to SingleCellMemMapDataset, downloaded a model checkpoint, converted the checkpoint to TE, and then tried to run the infer.py script for Geneformer in BioNeMo. For each of these steps, I followed READMEs found throughout the GitHub repo.
The infer script threw the error below because it looks for files in the checkpoint directory that do not exist. It seems to expect the model checkpoint in a different format than the one produced by export.py in the BioNeMo recipes.
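For reference, the H5AD to SingleCellMemMapDataset conversion was done roughly as follows (a minimal sketch; the paths are placeholders, and I am assuming the SingleCellMemMapDataset constructor documented in the bionemo-scdl README):

from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

# Read the AnnData (H5AD) file and write it out in the memory-mapped SCDL format
# that the Geneformer data module consumes.
dataset = SingleCellMemMapDataset(
    "/path/to/scmm/bj_fibroblast",            # output SCDL directory
    h5ad_path="/path/to/bj_fibroblast.h5ad",  # input H5AD file
)
dataset.save()  # persist the memmap arrays and metadata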
Steps to Reproduce
Download and save a model checkpoint using bionemo recipes. I had to clone the repo for this because I did not see this code in the Docker image.
cd /path/to/bionemo-framework
python export.py --model Geneformer-V2-104M --/scratch/ngalioto/bionemo/models

Still using the cloned repo: open the model, convert it from HF to TE, and save the result.
from transformers import AutoModelForMaskedLM
from geneformer.convert import convert_geneformer_hf_to_te
# Load the exported Geneformer-V2-104M checkpoint
model = AutoModelForMaskedLM.from_pretrained("/scratch/ngalioto/bionemo/models/Geneformer-V2-104M")
model_te = convert_geneformer_hf_to_te(model)
# Save the TE model
model_te.save_pretrained("/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint")

Then run the Geneformer inference script on the converted checkpoint:

python /workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py --data-dir "/nfs/turbo/<account>/shared/projects/bionemo/bionemo_v2/geneformer/scmm/bj_fibroblast" --checkpoint-path "/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/" --results-path "/scratch/ngalioto/bionemo/results"

Error Messages and Logs
Singularity> python /workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py --data-dir "/nfs/turbo/umms-indikar/shared/projects/bionemo/bionemo_v2/geneformer/scmm/bj_fibroblast" --checkpoint-path "/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/" --results-path "/scratch/ngalioto/bionemo/results"
Import of quick_gelu from megatron.core.fusions.fused_bias_geglu failed with: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/nemo/utils/import_utils.py", line 319, in safe_import_from
return getattr(imported_module, symbol), True
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'megatron.core.fusions.fused_bias_geglu' has no attribute 'quick_gelu'
INFO:nemo.utils.import_utils:Import of quick_gelu from megatron.core.fusions.fused_bias_geglu failed with: Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/nemo/utils/import_utils.py", line 319, in safe_import_from
return getattr(imported_module, symbol), True
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'megatron.core.fusions.fused_bias_geglu' has no attribute 'quick_gelu'
Downloading data from 'nvidia/clara/singlecell-testdata:2.0' to file '/home/ngalioto/.cache/bionemo/d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3-singlecell-scdltestdata-20241203.tar.gz'.
{
"download_end": "2025-12-19 15:17:18",
"download_start": "2025-12-19 15:17:13",
"download_time": "5s",
"files_downloaded": 1,
"local_path": "/home/ngalioto/.cache/bionemo/tmp4vcsiu41/singlecell-testdata_v2.0",
"size_downloaded": "224.17 MB",
"status": "COMPLETED"
}
Untarring contents of '/home/ngalioto/.cache/bionemo/d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3-singlecell-scdltestdata-20241203.tar.gz' to '/home/ngalioto/.cache/bionemo/d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3-singlecell-scdltestdata-20241203.tar.gz.untar'
[NeMo I 2025-12-19 15:17:30 nemo_logging:393] Downloading resource: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_name_id_dict_gc30M.pkl?download=true
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Downloading resource: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_median_dictionary_gc30M.pkl?download=true
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] *************** Preprocessing Finished ************
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Fixing mis-match between ddp-config & mcore-optimizer config
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Ranks 0 has data parallel rank: 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has context parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All context parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Ranks 0 has context parallel rank: 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has model parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All model parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All tensor model parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has tensor model parallel rank: 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has embedding group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All pipeline model parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has pipeline model parallel rank 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All embedding group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has embedding rank: 0
[W1219 15:17:31.864436550 socket.cpp:755] [c10d] The client socket cannot be initialized to connect to [localhost]:53394 (errno: 97 - Address family not supported by protocol).
INFO:pytorch_lightning.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
WARNING:/usr/local/lib/python3.12/dist-packages/bionemo/llm/model/config.py:Loading /scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/api.py", line 58, in load_context
[rank0]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/mixin.py", line 787, in load
[rank0]: raise FileNotFoundError(f"No such file: '{_path}'")
[rank0]: FileNotFoundError: No such file: '/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/context'
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py", line 299, in <module>
[rank0]: geneformer_infer_entrypoint()
[rank0]: File "/workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py", line 159, in geneformer_infer_entrypoint
[rank0]: infer_model(
[rank0]: File "/workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py", line 150, in infer_model
[rank0]: trainer.predict(module, datamodule=datamodule) # return_predictions=False failing due to a lightning bug
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 858, in predict
[rank0]: return call._call_and_handle_interrupt(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 897, in _predict_impl
[rank0]: results = self._run(model, ckpt_path=ckpt_path)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 945, in _run
[rank0]: call._call_configure_model(self)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 119, in _call_configure_model
[rank0]: _call_lightning_module_hook(trainer, "configure_model")
[rank0]: File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
[rank0]: output = fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/bionemo/llm/lightning.py", line 326, in configure_model
[rank0]: model: MegatronModelType = self.config.configure_model(**module_construct_args)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/bionemo/llm/model/biobert/model.py", line 556, in configure_model
[rank0]: self.load_settings_from_checkpoint(self.initial_ckpt_path)
[rank0]: File "/usr/local/lib/python3.12/dist-packages/bionemo/llm/model/config.py", line 93, in load_settings_from_checkpoint
[rank0]: initial_config: MegatronBioNeMoTrainableModelConfig = io.load_context(
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/api.py", line 65, in load_context
[rank0]: return load(path, output_type=TrainerContext, subpath=subpath, build=build)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/mixin.py", line 787, in load
[rank0]: raise FileNotFoundError(f"No such file: '{_path}'")
[rank0]: FileNotFoundError: No such file: '/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/io.json'

Docker Image
nvcr.io/nvidia/clara/bionemo-framework:nightly
System Information
Environment Details:
- OS: "Ubuntu 24.04.3 LTS"
- CPU: Intel(R) Xeon(R) Platinum 8468
- RAM: 200 GB
GPU Details:
- GPU Model: NVIDIA H100
- GPU Memory: 80 GB
- CUDA Version: 12.9
- CUDA Driver: 570.124.06
- cuDNN Version: 9.10.2
Additional Context
The model checkpoint folder contains
config.json model.safetensors
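For comparison, the call that fails in the traceback is nemo.lightning.io.load_context, which looks for context/ (and io.json) inside the checkpoint path, i.e. it appears to expect a NeMo-2-style checkpoint layout rather than the HF-style save_pretrained output above. A minimal sketch of just the failing load step, using the same API that appears in the traceback (the path is the one from my command):

from pathlib import Path
from nemo.lightning import io

ckpt = Path("/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint")
# Raises FileNotFoundError because neither <ckpt>/context nor <ckpt>/io.json exists
# in the directory written by save_pretrained (config.json + model.safetensors only).
context = io.load_context(ckpt)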