Create CI Eager/Lazy for Language Modeling #1448

Open: wants to merge 7 commits into base: main

Contributor

@Luca-Calabria Luca-Calabria commented Oct 22, 2024

What does this PR do?

Add a test to the CI suite to validate LLM training/fine-tuning in both Eager and Lazy modes.

How to run it manually:
root@id:~/optimum-habana# GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_language_modeling_example.py::test_language_modeling_bf16_1x -s -v

@Luca-Calabria
Contributor Author

Luca-Calabria commented Oct 22, 2024

I need help catching a "segmentation fault" event like the one below (it was the main reason for creating this test case):

Traceback (most recent call last):
  File "/root/optimum-habana/examples/language-modeling/run_clm.py", line 695, in <module>
    main()
  File "/root/optimum-habana/examples/language-modeling/run_clm.py", line 641, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1042, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 170, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpex/optimizers/FusedAdamW.py", line 58, in wrap_
    result = step_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpex/optimizers/FusedAdamW.py", line 103, in step
    state["exp_avg"] = torch.zeros(p.data.shape, dtype=exp_avg_dtype).to(p.device)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
[Rank:0] Habana exception raised from compile at graph.cpp:597

mme_descriptor_generator_base.cpp::1114 function: setParams, failed condition: (0), message: input element types should match

Internal Error: Received signal - Segmentation fault

The process doesn't return to the main thread, so I can't complete the test run.
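One possible way to keep a crash like this from blocking the test run is to launch the training script in a child process with a timeout and inspect how it ended. The following is only a sketch under stated assumptions: `run_guarded` is a hypothetical helper, not part of optimum-habana, and it assumes a POSIX system where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_guarded(cmd, timeout_s=600):
    """Run cmd in a child process and classify how it ended.

    Returns (status, output), where status is "ok", "failed",
    "timeout", or "signal:<NAME>" (for example "signal:SIGSEGV").
    Hypothetical helper for illustration only.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # The child did not finish in time; subprocess.run kills it
        # before raising, so the caller is never blocked forever.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return "timeout", out

    if proc.returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        name = signal.Signals(-proc.returncode).name
        return "signal:" + name, proc.stdout
    if proc.returncode != 0:
        return "failed", proc.stdout
    return "ok", proc.stdout


if __name__ == "__main__":
    # Simulate a crashing child by sending SIGSEGV to itself.
    crash = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    status, _ = run_guarded(crash, timeout_s=30)
    print(status)
```

A test could then assert that the status is `"ok"` and fail with the captured output otherwise. Since the log above shows the runtime reporting the signal itself, the exit code may not always reflect the crash, so checking the captured output for the string "Segmentation fault" could be a useful extra guard.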

@Luca-Calabria
Contributor Author

@emascarenhas this is the test to add to the CI suite. Please take a look when you have time. Thank you.

@Chris-Sigopt Chris-Sigopt left a comment

Minor quibbles, otherwise looks good to me.

tests/test_language_modeling_example.py: 3 review comments (outdated, resolved)
@Luca-Calabria
Contributor Author

@Chris-Sigopt any idea about this issue? #1448 (comment)

@emascarenhas
Contributor

> @Chris-Sigopt any idea about this issue? #1448 (comment)

> @emascarenhas this is the test to add to CI suite. Please take a look when you have time. Thank you

@Luca-Calabria, can you make this part of test_examples.py instead of a new file?
I think it will run as part of CI if it lives in test_examples.py, and there are already language modeling tests in there.

@Luca-Calabria
Contributor Author

> @Chris-Sigopt any idea about this issue? #1448 (comment)

> @emascarenhas this is the test to add to CI suite. Please take a look when you have time. Thank you

> @Luca-Calabria , Can you make this part of test_examples.py instead of a new file? I think it will run as part of CI if you make it part of test_examples.py and there are already language modeling tests in there.

@emascarenhas sure, I can. This was actually a question I wanted to ask you, because I saw similar test cases in other existing files. I'll make the new test part of test_examples.py.

@Chris-Sigopt

I have not encountered that particular error message before. Based on the message, my guess would be that some Python type coercion is happening incorrectly, which is a problem we've encountered regularly, but that's just a guess.

@Luca-Calabria
Contributor Author

@emascarenhas @Chris-Sigopt I moved the new test to test_examples.py as part of the Causal Language Modeling test case.
I see that this test file covers only Lazy Mode runs. Do you think it makes sense to also create a test case that covers Eager Mode?

Contributor

@emascarenhas emascarenhas left a comment

@Luca-Calabria,
About adding an Eager test: yes, I think you should, because that was one of the failures, right?

Otherwise, the code looks good to me. If you add the Eager test, I can do a quick re-review. I assume you ran the test and it passed without hitting the earlier crash? Please confirm.

@Luca-Calabria
Contributor Author

> @Luca-Calabria , About adding Eager test, yes I think you should, because that was one of the failures?

> Otherwise, code looks good to me. If you add Eager test, I can do a quick re-review. I suppose you ran the test and it worked without issues and you didn't hit the crash from earlier? Please confirm.

@emascarenhas, yes: after the command-line fix suggested by @Chris-Sigopt here #1448 (comment), I'm able to run both Lazy and Eager modes successfully, including on older Synapse versions.
I'll add an Eager mode test anyway because it is not covered by the current CI.
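As a rough illustration of covering both modes with one test body, the sketch below builds the command line and environment for each mode. It is not the code from this PR. It assumes that the `PT_HPU_LAZY_MODE` environment variable and the `--use_lazy_mode` argument are the switches that select the mode, and `MODE_SETTINGS` and `build_run` are hypothetical names introduced only for this example.

```python
import os
import sys

# Assumption: these settings select the execution mode. Check them against
# the Synapse and optimum-habana versions actually in use.
MODE_SETTINGS = {
    "lazy": {"env": {"PT_HPU_LAZY_MODE": "1"},
             "args": ["--use_lazy_mode", "True"]},
    "eager": {"env": {"PT_HPU_LAZY_MODE": "0"},
              "args": ["--use_lazy_mode", "False"]},
}


def build_run(mode, script="examples/language-modeling/run_clm.py",
              extra_args=()):
    """Return (cmd, env) for launching the example script in the given mode.

    Hypothetical helper for illustration only.
    """
    if mode not in MODE_SETTINGS:
        raise ValueError(f"unknown mode: {mode!r}")
    settings = MODE_SETTINGS[mode]
    cmd = [sys.executable, script, *settings["args"], *extra_args]
    # Start from the current environment so variables such as RUN_SLOW
    # and GAUDI2_CI are passed through to the child process.
    env = {**os.environ, **settings["env"]}
    return cmd, env


if __name__ == "__main__":
    for mode in ("lazy", "eager"):
        cmd, env = build_run(mode, extra_args=["--do_train"])
        print(mode, env["PT_HPU_LAZY_MODE"], " ".join(cmd[1:]))
```

A test could loop over (or parametrize on) the two mode names and pass the resulting `cmd` and `env` to whatever runner the test suite already uses, so Lazy and Eager share the same assertions.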

3 participants