Update EVO2 tests according to Hyena arch changes #798
base: main
Conversation
Signed-off-by: Farhad Ramezanghorbani <[email protected]>
Force-pushed from fca02bc to 58706fe (Compare)
/ok to test
1 similar comment
/ok to test
Force-pushed from 4c5ac7d to c14f433 (Compare)
LGTM, but will let John verify.
Signed-off-by: Farhad Ramezanghorbani <[email protected]>
/ok to test
❌ 21 Tests Failed:
View the top 3 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard
Approved, but see my inline comment about manual verification of tensor-parallel correctness. Ideally the same could be done for CP=2, but I am not 100% sure that we have that working in the predict script.
x1 = torch.ones((batch_size, seq_len, g, dg), device=device)
x2 = torch.ones((batch_size, seq_len, g, dg), device=device)
v = torch.ones((batch_size, seq_len, g, dg), device=device)
x1 = torch.ones((batch_size, (g * dg), seq_len), device=device)
Is there a test somewhere covering that this still works with tensor parallel? It could be that moving the sequence to the last dimension breaks tensor parallel, because that code has a lot of hardcoded assumptions about splitting on axis 1. Maybe if you run the brca notebook with TP=2 (using the experimental bf16 model weights if doing this on a non-FP8 node) and it still works, that would be good? Please post a manual verification to this effect.
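For context, a minimal sketch of the kind of axis-1 split this refers to, under the new layout. The shapes, the TP degree, and the use of torch.chunk are illustrative assumptions for this sketch, not the Megatron/Hyena tensor-parallel implementation:

```python
import torch

# Illustrative shapes and TP degree (assumptions, not taken from the real code).
batch_size, seq_len, g, dg, tp_size = 2, 8, 4, 16, 2

# New convention: flattened features (g * dg) on axis 1, sequence on the last axis.
x = torch.ones((batch_size, g * dg, seq_len))

# A split hardcoded to axis 1 now shards the flattened feature dimension.
shards = torch.chunk(x, tp_size, dim=1)
assert shards[0].shape == (batch_size, (g * dg) // tp_size, seq_len)

# Under the old (batch_size, seq_len, g, dg) convention, the same dim=1 split
# would have sharded the sequence instead, hence the request for a manual TP=2 check.
```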
I am not aware of any tests for TP, but all the tests in NeMo and BioNeMo are passing. The current CI failure is discussed in this thread and is unrelated to these changes.
I will run the notebook with TP=2 and report the results here.
I can now confirm that the notebook reproduces ToT results with TP=2 or CP=2 on two A6000s. However, there is a regression in ToT compared to the last time the notebook was executed, and it is unrelated to the changes here (more info regarding the ToT regression).
Need to bump NeMo to get the changes in NVIDIA/NeMo#12988 after it's merged.
Description
NVIDIA/NeMo#12856 introduces code reduction and performance improvements, including standardizing input/output shapes for the Hyena operators and consequently reducing rearrangement overhead. This PR updates the EVO2 tests to comply with those changes.
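As a rough illustration of the shape change described above (variable names and sizes are made up for this sketch, not copied from the actual tests), the test inputs move from a (batch, seq_len, groups, group_dim) layout to a flattened-feature, sequence-last (batch, groups * group_dim, seq_len) layout:

```python
import torch

# Made-up sizes for illustration only.
batch_size, seq_len, g, dg = 2, 8, 4, 16

# Old convention: separate group dimensions, sequence on axis 1.
x_old = torch.ones((batch_size, seq_len, g, dg))

# New convention: groups flattened into a single feature axis, sequence last.
x_new = x_old.reshape(batch_size, seq_len, g * dg).permute(0, 2, 1).contiguous()
assert x_new.shape == (batch_size, g * dg, seq_len)
```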
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
If a pull request is opened by a trusted user and contains only trusted changes, its code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
Otherwise, an NVIDIA org member must leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.
Usage
Pre-submit Checklist