Description
Hi team 👋,
I'm encountering an issue while trying to resume training using the AutoResume feature in the NeMo framework (container version: 25.04).
Despite having a valid -last checkpoint, training always starts from scratch with the warning:
[NeMo W 2025-07-16 09:08:27 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated. Training from scratch.
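The message persists even when I point resume_from_directory at sub-paths, including the -last checkpoint path itself (see the second warning in the output below).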
✅ Checkpoint exists at this location:
/workspace/logs_15_07_curated/llama31_8b_dapt/2025-07-16_06-36-17/checkpoints/model_name=0--val_loss=2.33-step=2999-consumed_samples=12000.0-last
🔧 Code Snippet:
from nemo.lightning import AutoResume

resume_test = AutoResume(
    resume_if_exists=True,
    resume_from_directory="/workspace/logs_15_07_curated/",
    resume_ignore_no_checkpoint=True,
)

recipe = configure_recipe(nodes=1, gpus_per_node=4)
recipe.resume = resume_test.setup(recipe.trainer)
I verified that the checkpoint path is correct and that the -last file exists. Here is a simplified view of the config:
log_dir: /workspace/logs_16_07_curated
save_last: True
save_top_k: 2
monitor: "val_loss"
🧪 Output (trimmed):
[NeMo W 2025-07-16 09:08:27 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated. Training from scratch.
[NeMo W 2025-07-16 09:17:12 nemo_logging:405] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/logs_15_07_curated/llama31_8b_dapt/2025-07-16_06-36-17/checkpoints/model_name=0--val_loss=2.33-step=2999-consumed_samples=12000.0-last. Training from scratch.
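Neither of the two paths in the warnings above is the checkpoints directory itself: the first is the top-level log directory and the second is the -last checkpoint path. As a debugging aid, here is a minimal stdlib-only sketch that lists every entry ending in -last under a root and prints its parent directory. It assumes (unconfirmed) that the resume logic only inspects the immediate children of the directory it is given, so those parents would be the candidates to try for resume_from_directory. The helper names find_last_checkpoints and candidate_resume_dirs are hypothetical and not part of NeMo.

```python
from pathlib import Path
from typing import List


def find_last_checkpoints(root: str) -> List[Path]:
    """Return every entry under `root` whose name ends in '-last'."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Missing or non-directory root: nothing to report.
        return []
    # rglob walks the whole tree, so nested run/timestamp folders are covered.
    return sorted(root_path.rglob("*-last"))


def candidate_resume_dirs(root: str) -> List[Path]:
    """Return the de-duplicated parent directories of all '-last' entries."""
    return sorted({p.parent for p in find_last_checkpoints(root)})


if __name__ == "__main__":
    # Root taken from the warning message in this report.
    for directory in candidate_resume_dirs("/workspace/logs_15_07_curated"):
        print(directory)
```

If this prints the .../checkpoints folder from the path listed earlier, that folder (rather than its ancestors or the -last entry itself) is the next value worth trying.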
Any help or pointers to debug this would be greatly appreciated!
Thanks in advance!