train rework, introduce --backend and --dtype flags #1157
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Nice work, Charlie! I have left a few inline comments.
The current state of the PR breaks Intel Gaudi support. It may also cause problems for AMD ROCm. We don't have CI runners for these accelerators, so you probably haven't noticed it.
I also suggest renaming the module to match the --backend argument: from instructlab.train.pytorch import pytorch_train is more descriptive than linux_macos_train.
requirements.txt (Outdated)

torch>=2.2.0a0,<3.0.0 ; python_version == '3.10'
torch>=2.2.1,<3.0.0 ; python_version != '3.10'
torch>=2.3.0,<3.0.0
tqdm>=4.66.2,<5.0.0
transformers>=4.30.0,<=4.38.2
transformers==4.38.1
trl>=0.7.11,<0.8.0
wandb>=0.16.4,<0.17.0
langchain-text-splitters
# the below library should NOT be imported into any python files
# it is for CLI usage ONLY
yamllint>=1.35.1,<1.36.0
bitsandbytes
accelerate
The changes are breaking Intel Gaudi support and may cause issues for AMD ROCm support.
- Intel Gaudi stack has PyTorch 2.2.0a0 on Python 3.10. The special case is required.
- Intel Gaudi uses optimum-habana 1.10.3 from a fork, which requires transformers<4.38.0,>=4.37.0
- bitsandbytes upstream only supports Nvidia CUDA and CPU. There is a work-in-progress fork for AMD ROCm. Intel Gaudi does not use it.
I guess we have to introduce more optional dependencies: instructlab[cuda], instructlab[rocm], instructlab[cpu], and instructlab[mps].
I see there is a requirements-hpu.txt. I don't think we should have niche stacks in our main requirements.txt. Would having that other requirements.txt contain all the Gaudi reqs be ok?
Yes, there is a separate requirements file that is used to generate dependencies for instructlab[hpu]. AFAIK optional dependencies cannot override base requirements. If instructlab base requires torch>=2.3.0, then instructlab[hpu] cannot change it to 2.2.0a.
@tiran ok, here are the requirements we have overlapping:
- Gaudi: Python 3.10, torch 2.2.0a
- MPS and all other devices: Python 3.10+, torch 2.3.0+

Can you please advise the best way to merge these two requirements?

I was thinking pip install -r requirements-hpu.txt -e . would allow us to override the torch version in requirements.txt.

One thing I was thinking of would be to have make hpu, make mps, or make cuda targets that also build ilab to be able to handle these backends?

I am open to whatever way is best! Though, I don't think having this stuff in the base requirements file makes sense, since conflicts like this are bound to happen a lot.
Optional dependencies cannot introduce conflicting dependencies. They are limited to additional dependencies or more restrictive dependencies. Pip flat-out refuses to resolve conflicts:
ERROR: Cannot install instructlab[hpu]==0.14.1.dev183+gdeee169.d20240514 because these package versions have conflicting dependencies.
The conflict is caused by:
instructlab[hpu] 0.14.1.dev183+gdeee169.d20240514 depends on torch<3.0.0 and >=2.3.0
instructlab[hpu] 0.14.1.dev183+gdeee169.d20240514 depends on torch==2.2.0a0; extra == "hpu"
Why do we require PyTorch 2.3.0 anyway? Is there a specific feature that is required?
The mps backend is broken below 2.3. I can test it out to give you the specific reason. However, I don't think it can compile with MPS below 2.3.
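For context, a minimal sketch of the kind of device/version gate under discussion; this is illustrative only and not the PR's actual code (the packaging dependency and the helper name are assumptions):

# Illustrative sketch, not the PR's code: pick an accelerator device and
# require torch >= 2.3.0 before using MPS, per the discussion above.
import torch
from packaging import version  # assumed to be available

def select_device() -> torch.device:  # hypothetical helper name
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        if version.parse(torch.__version__) < version.parse("2.3.0"):
            raise RuntimeError("MPS training requires torch >= 2.3.0")
        return torch.device("mps")
    return torch.device("cpu")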
@@ -25,6 +25,8 @@
# Local
from ..chat.chat import CONTEXTS

torch.set_autocast_enabled(False)
Isn't autocast disabled by default?
apparently not, everything was broken until I added this :)
That's really strange. Some 3rd party extension must mess with the settings. Torch 2.3.0 returns False for is_autocast_enabled.
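A quick diagnostic for this (not code from the PR, just a way to confirm the reported behaviour at runtime):

# Diagnostic sketch: print the global autocast state before and after the
# workaround this PR adds.
import torch

print("autocast enabled:", torch.is_autocast_enabled())  # expected: False on a clean install
torch.set_autocast_enabled(False)  # the line added in this PR
print("autocast enabled:", torch.is_autocast_enabled())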
Thanks for the review @tiran, I had a feeling Gaudi support might be impacted by this. I will try to clean everything up.
LGTM. Witnessed this PR work end-to-end.
@RobotSail Could you please undo your approval? Mergify is going to merge the PR automatically, but there are still issues with the PR.
# TODO: Why would we call convert without a model? this seems broken
# convert_llama_to_gguf_mock.assert_called_once()
# assert (
#     convert_llama_to_gguf_mock.call_args[1]["model"]
#     == "./training_results/final"
# )
# assert convert_llama_to_gguf_mock.call_args[1]["pad_vocab"] is True
# assert len(convert_llama_to_gguf_mock.call_args[1]) == 2
I'd like to see all the commented-out code removed throughout
... or at least the newly introduced commented-out code; in some places I see it's moving some lines that were already commented out.
My previous request for changes was about the UX-related changes and the fact that the alignment with the proposed training commands design was not clear. The new options have been hidden now, making it easier to continue to change them to get to the proposed design once it's agreed on. I'm not blocking on that point anymore.
@@ -50,6 +52,7 @@
htcore = None
hpu = None
hpu_backends = None
# 'fork' incompatible with some hardware accelerator
what is this referring to?
apologies, will remove this
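For reference, the usual workaround when 'fork' clashes with an accelerator runtime is to switch the multiprocessing start method; a generic sketch, not code from this PR:

# Generic sketch: use the "spawn" start method so child processes do not
# inherit an already-initialized accelerator runtime via fork().
import multiprocessing

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn", force=True)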
@click.option(
    "--input-dir",
    type=click.Path(),
    show_default=True,  # TODO: set to None and change help message
    help="Path to generated files to use as input.",
)
just as some general feedback, including the various refactoring of options in the same commit makes this harder to review because refactoring is mixed up with the core changes you were making. When doing refactoring in a change like this, I would suggest at least keeping refactoring isolated to its own commit (and probably multiple commits that explain logical stages of refactoring). A structure like that goes a LONG way in helping people understand your changes as a logical progression without different things mixed up in the same view.
print(
    f"Did not get best checkpoint, choosing most recent which is {best_checkpoint}"
)
Other logging at this level is done using click.secho(), though this seems more like a debug message than something actionable to a user.
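For illustration, the two options mentioned could look roughly like this (a sketch; the placeholder value for best_checkpoint is only there to make it self-contained):

# Sketch: emit the message via click.secho (user-facing) or logging (debug).
import logging
import click

logger = logging.getLogger(__name__)
best_checkpoint = "training_results/checkpoint-100"  # placeholder value

click.secho(
    f"Did not get best checkpoint, choosing most recent which is {best_checkpoint}",
    fg="yellow",
)
# ...or, if it is really a debug message:
logger.debug("Did not get best checkpoint, choosing most recent: %s", best_checkpoint)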
# torch compile only for MPS
if device.type == "mps":
    torch_compile = True
    torch_backend = "aot_eager"
and why is this different for mps?
Added another comment, but the backend needs to be rebuilt in order to get MPS compiled, I believe. I can retry it without this to get the exact error if you want!
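As a point of reference, the MPS-specific compile path in question conceptually looks like the sketch below; the placeholder model and the wiring are assumptions, not the PR's actual code:

# Sketch: only wrap the model with torch.compile on MPS, using the
# aot_eager backend, since the default backend is assumed not to work there.
import torch
import torch.nn as nn

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = nn.Linear(16, 16).to(device)  # placeholder model

if device.type == "mps":
    model = torch.compile(model, backend="aot_eager")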
max_seq_length = 300

# TODO: This to @cdoern smells like it needs to be a different function, file, etc. Too different from the rest of train. Or at least, add flags for a bunch of this.
probably doesn't need to be in the code ...
why? imo leaving a TODO is better practice than forgetting about this entirely.
# TODO: why did this ever work??? linux train will create the gguf file unless we exit 0 earlier
# assert not os.path.isfile(LINUX_GGUF_FILE)
would prefer this gets resolved -- did this start failing with your PR and that's why you commented it out?
This and the above one, I think, were tests that were written with the wrong functionality in mind. They passed by luck, I think. This one seems to be saying "assert the gguf final file is not created on a successful train", which makes no sense?
I could totally be misinterpreting here. But it is possible the testing suite set this up differently and I somehow broke it?
ilab train on macos and linux are really different just for MLX support. It turns out pytorch supports an mps device, enabling hardware acceleration on macos. This allows us to use 99% of the codepath for linux train with less storage and memory used. The mlx method requires adapter files and a whole -fused dir that almost doubles the storage used in this whole process. This backend is also more maintainable, as mlx is written in a way that led us to need some heavy infrastructure to run it.

ilab train now defaults to --backend=pytorch but --backend=mlx is still available.

Added a few more enhancements:
- the --backend flag lets users choose pytorch or mlx
- I got rid of the multiple flags referencing "model" and just made a --model-repo flag, since train should always pull the full safetensors from HF
- I got rid of a few bad flags people should probably never use, like gguf-model-path; gguf models for training yield horrible results most of the time
- --dtype takes fp16, bf16, fp32, and auto. This sets the dtype to use in pytorch; bf16 and fp16 were hardcoded in the past.

Signed-off-by: Charlie Doern <[email protected]>
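Based on the description above, the --dtype handling amounts to a small mapping from the flag value to a torch dtype. A sketch of what that selection could look like (function and mapping names are illustrative, not necessarily what the PR uses):

# Illustrative sketch: map the --dtype flag value to a torch dtype, with
# "auto" picking a per-device default. Names here are hypothetical.
import torch

DTYPE_MAP = {
    "fp16": torch.float16,
    "bf16": torch.bfloat16,
    "fp32": torch.float32,
}

def resolve_dtype(flag: str, device: torch.device) -> torch.dtype:
    if flag != "auto":
        return DTYPE_MAP[flag]
    if device.type == "cuda" and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    if device.type == "mps":
        return torch.float16
    return torch.float32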
This pull request has merge conflicts that must be resolved before it can be merged.
@cdoern Is this work tracked in a design document? What's the status? Thanks!
@cdoern do you still want to keep this open?
This pull request has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
This pull request has merge conflicts that must be resolved before it can be merged.
Changes

- added --backend flag for ilab train, allowing users to switch their training backend between pytorch and mlx.
- changed what was linux_train.py to work for both macos and linux, depending on some new flags like fp16, bf16, etc.

Which issue is resolved by this Pull Request:
resolves #1108
Description of your changes:
ilab train on macos and linux are really different just for MLX support.
It turns out pytorch supports an mps device, enabling hardware acceleration on macos. This allows us to use 99% of the codepath for linux train with less storage and memory used. The mlx method requires adapter files and a whole -fused dir that almost doubles the storage used in this whole process.
This backend is also more maintainable, as mlx is written in a way that led us to need some heavy infrastructure to run it.
In testing, the pytorch train code takes roughly the same amount of time on comparable hardware and yields a better result, with a model that appears to learn the new skill/knowledge in a better manner with fewer hallucinations.
I will note, 18 GB of RAM is the minimum to run this backend. When you have 32/38 GB, train takes less than a quarter of the time that 18 GB does, and produces better results.
ilab train now defaults to --backend=pytorch but --backend=mlx is still available.
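For readers unfamiliar with the CLI plumbing, a declaration for such a flag could look roughly like the sketch below (illustrative only; the PR's actual option names, defaults, and help text may differ):

# Sketch of a --backend choice option in click; not the PR's exact code.
import click

@click.command()
@click.option(
    "--backend",
    type=click.Choice(["pytorch", "mlx"]),
    default="pytorch",
    show_default=True,
    help="Training backend to use.",
)
def train(backend: str) -> None:
    click.echo(f"training with backend: {backend}")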
added a few more enhancements: