Create CI Eager/Lazy for Language Modeling #1448

Luca-Calabria · 2024-10-22T12:04:12Z

What does this PR do?

Adds a test to the CI suite to validate LLM training/fine-tuning in both Eager and Lazy mode.

How to run it manually:
root@id:~/optimum-habana# GAUDI2_CI=1 RUN_SLOW=1 python -m pytest tests/test_language_modeling_example.py::test_language_modeling_bf16_1x -s -v

Luca-Calabria · 2024-10-22T12:08:33Z

I need help catching a "segmentation fault" event like the one below (it was the main reason for creating this test case):

Traceback (most recent call last):
  File "/root/optimum-habana/examples/language-modeling/run_clm.py", line 695, in <module>
    main()
  File "/root/optimum-habana/examples/language-modeling/run_clm.py", line 641, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/trainer.py", line 1042, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 170, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpex/optimizers/FusedAdamW.py", line 58, in wrap_
    result = step_func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpex/optimizers/FusedAdamW.py", line 103, in step
    state["exp_avg"] = torch.zeros(p.data.shape, dtype=exp_avg_dtype).to(p.device)
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_BRIDGE Exception in Lowering thread...
Graph compile failed. synStatus=synStatus 26 [Generic failure]. 
[Rank:0] Habana exception raised from compile at graph.cpp:597

mme_descriptor_generator_base.cpp::1114 function: setParams, failed condition: (0), message: input element types should match

Internal Error: Received signal - Segmentation fault

The process doesn't return to the main thread, so I can't complete the test run.
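One possible way to keep a test runner alive when the training script dies or hangs like this is to launch it in a child process with a timeout and inspect the return code. The sketch below is only an illustration under stated assumptions: `run_and_report` and its result keys are hypothetical names, the thread does not say how the test actually launches `run_clm.py`, and the negative-return-code convention applies to POSIX systems.

```python
import signal
import subprocess


def run_and_report(cmd, timeout_s=3600):
    """Run cmd in a child process and classify how it ended.

    Hypothetical helper: the name and the result layout are illustrative,
    not part of optimum-habana.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung process
        # cannot block the rest of the test session.
        return {"status": "timeout", "signal": None, "returncode": None}

    if proc.returncode < 0:
        # On POSIX, a negative return code -N means the child was
        # terminated by signal N (for example SIGSEGV).
        name = signal.Signals(-proc.returncode).name
        return {"status": "signal", "signal": name, "returncode": proc.returncode}

    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "signal": None, "returncode": proc.returncode}
```

A test could then assert that `run_and_report([...])["status"] == "ok"` and fail with a readable message otherwise, instead of crashing or stalling together with the child process.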

Luca-Calabria · 2024-10-23T10:26:11Z

@emascarenhas this is the test to add to the CI suite. Please take a look when you have time. Thank you.

Chris-Sigopt

Minor quibbles, otherwise looks good to me.

tests/test_language_modeling_example.py

Luca-Calabria · 2024-11-05T12:18:25Z

@Chris-Sigopt any idea about this issue? #1448 (comment)

emascarenhas · 2024-11-05T17:21:52Z

@Luca-Calabria, can you make this part of test_examples.py instead of a new file?
I think it will run as part of CI if you put it in test_examples.py, and there are already language modeling tests in there.

Luca-Calabria · 2024-11-05T18:08:37Z

@emascarenhas sure, I can. I was actually going to ask you about this, because I saw similar test cases in other existing files. I'll make the new test part of test_examples.py.

Chris-Sigopt · 2024-11-05T18:13:56Z

I have not encountered that particular error message before. Based on the message, my guess is that some Python type coercion is happening incorrectly, which is a problem we've encountered regularly, but that's just a guess.

Luca-Calabria · 2024-11-06T17:11:49Z

@emascarenhas @Chris-Sigopt I moved the new test to test_examples.py as part of the Causal Language Modeling test case.
I see that this test file covers only Lazy Mode runs. Do you think it makes sense to add a test case that also covers Eager Mode?

emascarenhas

@Luca-Calabria,
About adding an Eager test: yes, I think you should, since that was one of the failures?

Otherwise, code looks good to me. If you add Eager test, I can do a quick re-review. I suppose you ran the test and it worked without issues and you didn't hit the crash from earlier? Please confirm.

Luca-Calabria · 2024-11-12T09:18:04Z

@emascarenhas, yes: after the command-line fix suggested by @Chris-Sigopt here #1448 (comment), I'm able to run both Lazy and Eager successfully, including on older Synapse versions.
I'll add an Eager mode test anyway, because it is not covered by the current CI.
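As a rough illustration of covering both modes from one place, the sketch below builds the environment and argument list for a Lazy or an Eager run. It is not the code added in this PR: `build_clm_invocation` is a hypothetical helper, and the use of the `PT_HPU_LAZY_MODE` environment variable together with the `--use_lazy_mode` argument is an assumption that should be checked against the Intel Gaudi and optimum-habana documentation for the Synapse version in use.

```python
import os


def build_clm_invocation(mode, base_args):
    """Return (env, argv) for one run of the example script in the given mode.

    Hypothetical helper for illustration only. The environment variable and
    the command-line flag below are assumptions about how Lazy/Eager mode is
    selected; verify them before relying on this.
    """
    if mode not in ("lazy", "eager"):
        raise ValueError(f"unknown mode: {mode!r}")

    env = dict(os.environ)   # copy, so the caller's environment is untouched
    argv = list(base_args)   # copy, so the caller's list is untouched

    lazy = mode == "lazy"
    env["PT_HPU_LAZY_MODE"] = "1" if lazy else "0"
    argv += ["--use_lazy_mode", "True" if lazy else "False"]
    return env, argv
```

With a helper like this, a single test body could be parametrized over `"lazy"` and `"eager"` and compared against a separate baseline per mode, which matches the "updated Eager baseline" commit listed in the timeline.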

Create CI Eager/Lazy for Language Modeling

4c01a78

Luca-Calabria requested a review from regisss as a code owner October 22, 2024 12:04

Luca-Calabria added 2 commits October 22, 2024 14:48

fix style

0d4a6a7

updated Eager baseline

6d9c440

Chris-Sigopt approved these changes Nov 4, 2024

remove redundant argument

be825da

added gemma test case to test_examples

4ac2b81

Luca-Calabria requested review from emascarenhas and Chris-Sigopt November 6, 2024 17:08

emascarenhas reviewed Nov 8, 2024

Luca-Calabria added 2 commits November 13, 2024 11:36

Merge branch 'main' into lcalabri/ci_test_lang_mod_eager_lazy

db04f0b

added Eager Mode test case

1115f69

Luca-Calabria requested a review from emascarenhas November 15, 2024 14:50
