[BUG] Geneformer's infer.py cannot load model checkpoint #1391

@ngalioto

BioNeMo Framework Version

3c68452

Bug Description

I am trying to create gene-level embeddings from a dataset using Geneformer. I converted the dataset from H5AD to SingleCellMemMapDataset, downloaded a model checkpoint, converted the checkpoint to TE, and then ran the Geneformer inference script (infer_geneformer.py) in BioNeMo. For each step, I followed the READMEs in the GitHub repo.

The script failed with a FileNotFoundError: it looked for context and io.json in the checkpoint directory, and neither exists. It seems to expect the model checkpoint in a different format from the one produced by export.py in the BioNeMo recipes.

Steps to Reproduce

Download and save a model checkpoint using bionemo recipes. I had to clone the repo for this because I did not see this code in the Docker image.

cd /path/to/bionemo-framework
python export.py --model Geneformer-V2-104M --/scratch/ngalioto/bionemo/models

Still using the cloned repo: open the model, convert it from HF to TE, and save the result.

from transformers import AutoModelForMaskedLM
from geneformer.convert import convert_geneformer_hf_to_te

# Load the exported HF checkpoint (Geneformer-V2-104M)
model = AutoModelForMaskedLM.from_pretrained("/scratch/ngalioto/bionemo/models/Geneformer-V2-104M")
model_te = convert_geneformer_hf_to_te(model)

# Save the TE model
model_te.save_pretrained("/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint")

Inside the Docker image: run the Geneformer inference script against the TE checkpoint.

python /workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py --data-dir "/nfs/turbo/<account>/shared/projects/bionemo/bionemo_v2/geneformer/scmm/bj_fibroblast" --checkpoint-path "/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/" --results-path "/scratch/ngalioto/bionemo/results"

Error Messages and Logs

Singularity> python /workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py --data-dir "/nfs/turbo/umms-indikar/shared/projects/bionemo/bionemo_v2/geneformer/scmm/bj_fibroblast" --checkpoint-path "/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/" --results-path "/scratch/ngalioto/bionemo/results"
Import of quick_gelu from megatron.core.fusions.fused_bias_geglu failed with: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nemo/utils/import_utils.py", line 319, in safe_import_from
    return getattr(imported_module, symbol), True
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'megatron.core.fusions.fused_bias_geglu' has no attribute 'quick_gelu'

INFO:nemo.utils.import_utils:Import of quick_gelu from megatron.core.fusions.fused_bias_geglu failed with: Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/nemo/utils/import_utils.py", line 319, in safe_import_from
    return getattr(imported_module, symbol), True
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'megatron.core.fusions.fused_bias_geglu' has no attribute 'quick_gelu'

Downloading data from 'nvidia/clara/singlecell-testdata:2.0' to file '/home/ngalioto/.cache/bionemo/d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3-singlecell-scdltestdata-20241203.tar.gz'.
{
    "download_end": "2025-12-19 15:17:18",
    "download_start": "2025-12-19 15:17:13",
    "download_time": "5s",
    "files_downloaded": 1,
    "local_path": "/home/ngalioto/.cache/bionemo/tmp4vcsiu41/singlecell-testdata_v2.0",
    "size_downloaded": "224.17 MB",
    "status": "COMPLETED"
}
Untarring contents of '/home/ngalioto/.cache/bionemo/d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3-singlecell-scdltestdata-20241203.tar.gz' to '/home/ngalioto/.cache/bionemo/d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3-singlecell-scdltestdata-20241203.tar.gz.untar'
[NeMo I 2025-12-19 15:17:30 nemo_logging:393] Downloading resource: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_name_id_dict_gc30M.pkl?download=true
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Downloading resource: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_median_dictionary_gc30M.pkl?download=true
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] *************** Preprocessing Finished ************
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Fixing mis-match between ddp-config & mcore-optimizer config
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Ranks 0 has data parallel rank: 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has context parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All context parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Ranks 0 has context parallel rank: 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has model parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All model parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has tensor model parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All tensor model parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has tensor model parallel rank: 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has pipeline model parallel group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has embedding group: [0]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All pipeline model parallel group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has pipeline model parallel rank 0
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] All embedding group ranks: [[0]]
[NeMo I 2025-12-19 15:17:31 nemo_logging:393] Rank 0 has embedding rank: 0
[W1219 15:17:31.864436550 socket.cpp:755] [c10d] The client socket cannot be initialized to connect to [localhost]:53394 (errno: 97 - Address family not supported by protocol).
INFO:pytorch_lightning.utilities.rank_zero:----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
WARNING:/usr/local/lib/python3.12/dist-packages/bionemo/llm/model/config.py:Loading /scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/api.py", line 58, in load_context
[rank0]:     return load(path, output_type=TrainerContext, subpath=subpath, build=build)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/mixin.py", line 787, in load
[rank0]:     raise FileNotFoundError(f"No such file: '{_path}'")
[rank0]: FileNotFoundError: No such file: '/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/context'

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py", line 299, in <module>
[rank0]:     geneformer_infer_entrypoint()
[rank0]:   File "/workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py", line 159, in geneformer_infer_entrypoint
[rank0]:     infer_model(
[rank0]:   File "/workspace/bionemo2/sub-packages/bionemo-geneformer/src/bionemo/geneformer/scripts/infer_geneformer.py", line 150, in infer_model
[rank0]:     trainer.predict(module, datamodule=datamodule)  # return_predictions=False failing due to a lightning bug
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 858, in predict
[rank0]:     return call._call_and_handle_interrupt(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 897, in _predict_impl
[rank0]:     results = self._run(model, ckpt_path=ckpt_path)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/trainer.py", line 945, in _run
[rank0]:     call._call_configure_model(self)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 119, in _call_configure_model
[rank0]:     _call_lightning_module_hook(trainer, "configure_model")
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/bionemo/llm/lightning.py", line 326, in configure_model
[rank0]:     model: MegatronModelType = self.config.configure_model(**module_construct_args)
[rank0]:                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/bionemo/llm/model/biobert/model.py", line 556, in configure_model
[rank0]:     self.load_settings_from_checkpoint(self.initial_ckpt_path)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/bionemo/llm/model/config.py", line 93, in load_settings_from_checkpoint
[rank0]:     initial_config: MegatronBioNeMoTrainableModelConfig = io.load_context(
[rank0]:                                                           ^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/api.py", line 65, in load_context
[rank0]:     return load(path, output_type=TrainerContext, subpath=subpath, build=build)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/nemo/lightning/io/mixin.py", line 787, in load
[rank0]:     raise FileNotFoundError(f"No such file: '{_path}'")
[rank0]: FileNotFoundError: No such file: '/scratch/ngalioto/bionemo/models/Geneformer-V2-104M/te_checkpoint/io.json'

Docker Image

nvcr.io/nvidia/clara/bionemo-framework:nightly

System Information

Environment Details:

  • OS: "Ubuntu 24.04.3 LTS"
  • CPU: Intel(R) Xeon(R) Platinum 8468
  • RAM: 200 GB

GPU Details:

  • GPU Model: NVIDIA H100
  • GPU Memory: 80 GB
  • CUDA Version: 12.9
  • CUDA Driver: 570.124.06
  • cuDNN Version: 9.10.2

Additional Context

The model checkpoint folder contains

config.json  model.safetensors
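
The mismatch can be checked before launching inference. The sketch below is a hypothetical helper, not part of BioNeMo or NeMo. It assumes only what this report shows: the traceback looks for context and then io.json in the checkpoint directory, while the saved folder holds config.json and model.safetensors. The function name and the returned labels are made up for illustration.

```python
from pathlib import Path

# Entries the loader looked for, according to the traceback in this report.
NEMO_MARKERS = ("context", "io.json")
# Files present in the saved checkpoint folder, as listed above.
HF_MARKERS = ("config.json", "model.safetensors")


def describe_checkpoint(path):
    """Classify a checkpoint directory by the marker files it contains.

    Hypothetical helper: returns "nemo-like" if any entry the loader
    searched for exists, "hf-like" if both Hugging Face style files
    exist, and "unknown" otherwise.
    """
    root = Path(path)
    if any((root / name).exists() for name in NEMO_MARKERS):
        return "nemo-like"
    if all((root / name).exists() for name in HF_MARKERS):
        return "hf-like"
    return "unknown"
```

For a folder containing only config.json and model.safetensors, this helper returns "hf-like", which is consistent with the FileNotFoundError above.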
