feat: support for robust benchmarking of fms-hf-tuning #142
Comments
@Ssukriti Addressing the above will make it significantly easier for the benchmarking effort to use the latest fms_hf_tuning changes, as well as for others to execute benchmarking runs, so please let us know how we can help - we can contribute a PR with our proposed changes, for example.
@dushyantbehl can you please work with @VassilisVassiliadis to understand the issues with AIM and whether the changes requested are already addressed by your pending PRs?
These are the changes we're proposing:
If these seem reasonable to you, I can either contribute bullet 2 directly to the PR that @dushyantbehl is working on, or open a new PR after @dushyantbehl's PR is merged into main. We can then discuss how to handle bullet point 1.
Thanks, and great work @VassilisVassiliadis. This is already on my plate. Let me know if we need to connect on this for any other details.
Please follow PR #89 for bullets 1 and 3. We have a separate PR for 2.
@VassilisVassiliadis #89 is merged.
Thank you @dushyantbehl, I'll take a look. Since you're already working on 2.2, we still need to decide how we're going to handle 1 (i.e. reporting errors). I noticed that @kellyaa has already started working on something related to this in #149. @Ssukriti, would it make sense to design a unified way of reporting errors that all scripts (sft_trainer.py, accelerate_launch.py, etc.) can use?
The importance of #149 is to ensure that whatever error messages are thrown:
This is so that when tuning jobs are being executed via Kubernetes CRs, the pod can accurately reflect the end state of the job without the user having to scrape the logs. I'm good with whatever unified mechanisms you want to create, as long as we are able to accomplish the above.
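For illustration, a minimal sketch of one possible shared helper for this, assuming the pods rely on Kubernetes' default terminationMessagePath (/dev/termination-log); the helper name and exit-code convention here are assumptions, not existing fms-hf-tuning code:

```python
# Illustrative helper (not fms-hf-tuning code): launcher scripts such as
# accelerate_launch.py or sft_trainer.py could call this from a top-level
# try/except so the pod's termination status carries the failure reason.
import sys

TERMINATION_LOG = "/dev/termination-log"  # Kubernetes' default terminationMessagePath


def report_fatal_error(message: str, exit_code: int = 1) -> None:
    """Write a user-facing error message where Kubernetes can surface it, then exit non-zero."""
    try:
        with open(TERMINATION_LOG, "w") as f:
            f.write(message)
    except OSError:
        # Not running inside a pod (or the path is unwritable); stderr still gets the message.
        pass
    print(message, file=sys.stderr)
    sys.exit(exit_code)
```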
I think this is now done. I just found a small bug and opened a PR for it here: #199
Is your feature request related to a problem? Please describe.
When running benchmarks using fms-hf-tuning we need to:
However, fms-hf-tuning does not support these features, which requires us to maintain external forks of fms-hf-tuning that provide them. Specifically:
Describe the solution you'd like
Here is a simplified view of the current state of the `sft_trainer.py` script:
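A rough sketch of that structure (illustrative only; everything other than `train()`, `main()` and `get_aimstack_callback()` is a placeholder):

```python
# Illustrative sketch, not the actual fms-hf-tuning source.
def get_aimstack_callback():
    """Placeholder for the existing helper that builds the AIM tracking callback."""
    ...


def train(model_args, data_args, training_args):
    # The AIM callback is created inside train(), so callers cannot substitute a
    # different tracker or control how benchmarking metadata is recorded.
    callbacks = [get_aimstack_callback()]
    # ... build the model/tokenizer and run SFTTrainer(..., callbacks=callbacks)


def main():
    # Parse the CLI/JSON configuration (details omitted) and delegate to train().
    model_args = data_args = training_args = None  # placeholders
    train(model_args, data_args, training_args)
```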
We propose the following:
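For illustration, one possible shape of a refactor along these lines, sketched very roughly; the `benchmarking_mode()` switch and the argument names are assumptions, while `get_aimstack_callback()` and `BenchmarkingFriendlyAIMCallback()` are the names used elsewhere in this issue:

```python
# Illustrative sketch only: main() owns callback construction and train() accepts
# externally built callbacks, so benchmarking code can inject its own AIM callback.
def train(model_args, data_args, training_args, additional_callbacks=None):
    callbacks = list(additional_callbacks or [])
    # ... build the model/tokenizer and run SFTTrainer(..., callbacks=callbacks)


def main():
    model_args = data_args = training_args = None  # placeholders for real argument parsing
    if benchmarking_mode():  # illustrative switch, e.g. an env var or CLI flag
        callbacks = [BenchmarkingFriendlyAIMCallback()]  # supplied by the benchmarking code
    else:
        callbacks = [get_aimstack_callback()]  # the existing default AIM callback
    train(model_args, data_args, training_args, additional_callbacks=callbacks)
```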
Pros:
Cons:
As a plus, the proposed changes in `train()` enable us to experiment with new implementations of `BenchmarkingFriendlyAIMCallback()` without requiring updates to fms-hf-tuning (e.g. by using a wrapper script).
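For instance, such a callback could be maintained alongside the benchmarking code as an ordinary `transformers.TrainerCallback`; the class below is purely illustrative and not part of fms-hf-tuning:

```python
# Illustrative only: one possible shape for a benchmarking-friendly callback that the
# benchmarking harness maintains outside fms-hf-tuning and passes in through train().
from transformers import TrainerCallback


class BenchmarkingFriendlyAIMCallback(TrainerCallback):
    def __init__(self, run_metadata=None):
        self.run_metadata = run_metadata or {}
        self.records = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Keep every logged record together with benchmarking metadata so the harness
        # can post-process it (e.g. push it to an AIM server) however it likes.
        self.records.append({**(logs or {}), **self.run_metadata, "step": state.global_step})
```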
Describe alternatives you've considered
Make minimal changes to `train()` to assist in developing third-party wrapper scripts
We could slightly refactor the code such that the logic in the `train()` method for activating the current `AIMCallback` (via a call to `get_aimstack_callback()`) is in the `main()` method instead of the `train()` method. This would enable us to use a wrapper script which implements our earlier proposed design of `sft_trainer.py`. The wrapper script would have a `main()` function similar to the above proposal, and which would invoke `tuning.train()`. `sft_trainer.py` in fms-hf-tuning would look like this:
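A rough illustrative sketch of that shape (the `callbacks` parameter and the argument handling below are assumptions about the refactored signature, not the actual code):

```python
# Illustrative sketch: get_aimstack_callback() is now invoked from main() rather than
# inside train(), so train() only uses the callbacks it is handed.
def train(model_args, data_args, training_args, callbacks=None):
    callbacks = list(callbacks or [])
    # ... build the model/tokenizer and run SFTTrainer(..., callbacks=callbacks)


def main():
    model_args = data_args = training_args = None  # placeholders for real argument parsing
    train(model_args, data_args, training_args,
          callbacks=[get_aimstack_callback()])  # the AIM wiring now lives in main()
```

A third-party wrapper script could then supply its own `main()` and call `tuning.train()` directly, roughly like this (again assuming the refactored signature above):

```python
# Illustrative wrapper script maintained outside fms-hf-tuning.
import tuning  # assumes the refactored train() signature sketched above


def main():
    model_args = data_args = training_args = None  # placeholders
    tuning.train(model_args, data_args, training_args,
                 callbacks=[BenchmarkingFriendlyAIMCallback()])  # defined by the wrapper


if __name__ == "__main__":
    main()
```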
Pros:
Cons: