Skip to content

fix: only trigger on_save if the output checkpoint path actually contains files#652

Open
HarikrishnanBalagopal wants to merge 1 commit intofoundation-model-stack:mainfrom
HarikrishnanBalagopal:fix/onsaveevent
Open

fix: only trigger on_save if the output checkpoint path actually contains files#652
HarikrishnanBalagopal wants to merge 1 commit intofoundation-model-stack:mainfrom
HarikrishnanBalagopal:fix/onsaveevent

Conversation

@HarikrishnanBalagopal
Copy link
Contributor

Description of the change

Trigger on_save event only if anything was actually saved in the output checkpoint directory.
This simplifies the use of trainer_controller for triggering subsequent steps (like artifact upload)

Related issue number

Internal issue in our pipeline

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the fix label Jan 19, 2026
@ashokponkumar
Copy link
Collaborator

@HarikrishnanBalagopal I don't think this is the right fix. If save_ model is called, and files are not there, its a problem. Trainer should be fixed rather.

@dushyantbehl @YashasviChaurasia Can you look at the situations where this occurs?

@HarikrishnanBalagopal
Copy link
Contributor Author

HarikrishnanBalagopal commented Jan 21, 2026

@HarikrishnanBalagopal I don't think this is the right fix. If save_ model is called, and files are not there, its a problem. Trainer should be fixed rather.

@dushyantbehl @YashasviChaurasia Can you look at the situations where this occurs?

@ashokponkumar The original issue has already been fixed in the main branch and we are now using the image built from main branch commit b3a8a78d5a1e81635a974934cb1b08a3a5361b3e that contains the fix.

This is just an additional/sanity check on top since on_save should not be called if nothing was saved.

@ashokponkumar
Copy link
Collaborator

@HarikrishnanBalagopal I don't think this is the right fix. If save_ model is called, and files are not there, its a problem. Trainer should be fixed rather.
@dushyantbehl @YashasviChaurasia Can you look at the situations where this occurs?

@ashokponkumar The original issue has already been fixed in the main branch and we are now using the image built from main branch commit b3a8a78d5a1e81635a974934cb1b08a3a5361b3e that contains the fix.

This is just an additional/sanity check on top since on_save should not be called if nothing was saved.

@dushyantbehl Shouldn't the tuning fail if save_model was called, and nothing ended up in the dir? In which usecase is it expected to have no files in the dir after save_model is called?

@HarikrishnanBalagopal
Copy link
Contributor Author

@HarikrishnanBalagopal I don't think this is the right fix. If save_ model is called, and files are not there, its a problem. Trainer should be fixed rather.
@dushyantbehl @YashasviChaurasia Can you look at the situations where this occurs?

@ashokponkumar The original issue has already been fixed in the main branch and we are now using the image built from main branch commit b3a8a78d5a1e81635a974934cb1b08a3a5361b3e that contains the fix.
This is just an additional/sanity check on top since on_save should not be called if nothing was saved.

@dushyantbehl Shouldn't the tuning fail if save_model was called, and nothing ended up in the dir? In which usecase is it expected to have no files in the dir after save_model is called?

@ashokponkumar I can send you the builds that were running into that exact scenario. Most likely a bug but this check allows us to catch those sort of bugs early without having to spend hours debugging unrelated things.

@ashokponkumar
Copy link
Collaborator

If save_model fails, shouldn't the tuning fail? @dushyantbehl @HarikrishnanBalagopal ? @HarikrishnanBalagopal Can you please bring this topic up in our next scrum?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants