Unable to Resume Training from -last Checkpoint using AutoResume in NeMo (v25.04)

Hi team 👋,

I'm encountering an issue while trying to resume training using the AutoResume feature in the NeMo framework (container version: 25.04).

Despite having a valid -last checkpoint, training always starts from scratch with the warning:

[NeMo W 2025-07-16 09:08:27 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated. Training from scratch.

The message persists even when I point resume_from_directory at various sub-paths of this directory.

✅ Checkpoint exists at this location:
/workspace/logs_15_07_curated/llama31_8b_dapt/2025-07-16_06-36-17/checkpoints/model_name=0--val_loss=2.33-step=2999-consumed_samples=12000.0-last

🔧 Code Snippet:
from nemo.lightning import AutoResume

resume_test = AutoResume(
    resume_if_exists=True,
    resume_from_directory="/workspace/logs_15_07_curated/",
    resume_ignore_no_checkpoint=True,
)

recipe = configure_recipe(nodes=1, gpus_per_node=4)
recipe.resume = resume_test.setup(recipe.trainer)

I verified that the checkpoint path is correct and that the -last checkpoint exists. Here is a simplified view of the config:

log_dir: /workspace/logs_16_07_curated
save_last: True
save_top_k: 2
monitor: "val_loss"

🧪 Full Output (Trimmed):
[NeMo W 2025-07-16 09:08:27 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated. Training from scratch.
[NeMo W 2025-07-16 09:17:12 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated/llama31_8b_dapt/2025-07-16_06-36-17/checkpoints/model_name=0--val_loss=2.33-step=2999-consumed_samples=12000.0-last. Training from scratch.
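
To narrow down which directory level directly contains the -last entry, a small standard-library check can help. This is only a debugging sketch: the helper name find_last_checkpoints is hypothetical and not part of NeMo, it assumes the checkpoint name ends in "-last" as in the path above, and it makes no claim about how AutoResume itself resolves resume_from_directory.

```python
from pathlib import Path


def find_last_checkpoints(root: str) -> list[Path]:
    """Return every entry under `root` whose name ends in '-last', sorted by path.

    Hypothetical helper for debugging; it does not use or mimic any NeMo API.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Missing or non-directory root: nothing to report.
        return []
    # rglob walks the tree recursively; a "-last" checkpoint may be a file or a directory.
    return sorted(root_path.rglob("*-last"))


if __name__ == "__main__":
    # Root taken from the warning message in this report.
    for ckpt in find_last_checkpoints("/workspace/logs_15_07_curated"):
        # The parent is the directory that directly contains the "-last" entry.
        print(f"{ckpt.parent}  ->  {ckpt.name}")
```

Each printed parent directory can then be compared against the checkpoint_dir reported in the warning to see whether the two are at the same level.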

Any help or pointers to debug this would be greatly appreciated!
Thanks in advance!



Unable to Resume Training from -last Checkpoint using AutoResume in NeMo (v25.04) #14259
