Unable to Resume Training from -last Checkpoint using AutoResume in NeMo (v25.04) #14259

@Shrii-WorkspaceNSX

Hi team 👋,

I'm encountering an issue while trying to resume training using the AutoResume feature in the NeMo framework (container version: 25.04).

Despite having a valid -last checkpoint, training always starts from scratch with the warning:

[NeMo W 2025-07-16 09:08:27 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated. Training from scratch.
The message persists even when resume_from_directory points to sub-paths, including the full path of the -last checkpoint itself.

✅ Checkpoint exists at this location:
/workspace/logs_15_07_curated/llama31_8b_dapt/2025-07-16_06-36-17/checkpoints/model_name=0--val_loss=2.33-step=2999-consumed_samples=12000.0-last

🔧 Code Snippet:
from nemo.lightning import AutoResume

resume_test = AutoResume(
    resume_if_exists=True,
    resume_from_directory="/workspace/logs_15_07_curated/",
    resume_ignore_no_checkpoint=True,
)

recipe = configure_recipe(nodes=1, gpus_per_node=4)
recipe.resume = resume_test.setup(recipe.trainer)

I verified that the checkpoint path is correct and that the -last file exists. Here is a simplified view of the config:

log_dir: /workspace/logs_16_07_curated
save_last: True
save_top_k: 2
monitor: "val_loss"

🧪 Full Output (Trimmed):
[NeMo W 2025-07-16 09:08:27 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated. Training from scratch.
[NeMo W 2025-07-16 09:17:12 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated/llama31_8b_dapt/2025-07-16_06-36-17/checkpoints/model_name=0--val_loss=2.33-step=2999-consumed_samples=12000.0-last. Training from scratch.
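
To help narrow this down, here is a minimal diagnostic sketch using only the standard library. It assumes (this is not confirmed from the NeMo source here) that the resume logic only inspects the direct children of the directory it is given, so both the log root and the -last path itself would come up empty. The helper name find_last_checkpoint_parents is hypothetical and not part of NeMo; it just lists every directory under a root that directly contains an entry ending in -last, which are the candidate values to try for resume_from_directory.

```python
from pathlib import Path
from typing import List


def find_last_checkpoint_parents(root: str) -> List[Path]:
    """Return each directory under `root` that directly contains an entry
    whose name ends in '-last'.

    Hypothetical helper for debugging only; it is not a NeMo API.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing to search if the root is missing or is not a directory.
        return []

    parents = set()
    # rglob searches recursively, so '-last' entries nested under
    # <run_name>/<timestamp>/checkpoints/ are found as well.
    for entry in root_path.rglob("*-last"):
        parents.add(entry.parent)
    return sorted(parents)


if __name__ == "__main__":
    # Path taken from the report above; adjust to your own log root.
    for parent in find_last_checkpoint_parents("/workspace/logs_15_07_curated"):
        print(parent)
```

If the assumption holds, the printed directory (the one ending in /checkpoints in my case) would be the value to pass, rather than the log root or the -last path.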

Any help or pointers to debug this would be greatly appreciated!
Thanks in advance!
