
Update EVO2 tests according to Hyena arch changes #798

Open · wants to merge 4 commits into main from farhadr/evo2_cleanup
Conversation

farhadrgh
Collaborator

Description

NVIDIA/NeMo#12856 introduces code reduction and performance improvements, including standardizing input/output shapes for Hyena operators and consequently reducing rearrangement overhead. This PR updates the EVO2 tests to comply with those changes.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

Note

By default, the notebooks validation tests are skipped unless explicitly enabled.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Usage

TODO: Add code snippet

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Signed-off-by: Farhad Ramezanghorbani <[email protected]>
Signed-off-by: Farhad Ramezanghorbani <[email protected]>

copy-pr-bot bot commented Apr 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@farhadrgh farhadrgh force-pushed the farhadr/evo2_cleanup branch from fca02bc to 58706fe Compare April 2, 2025 19:20
@farhadrgh
Collaborator Author

/ok to test

1 similar comment
@farhadrgh
Collaborator Author

/ok to test

@farhadrgh farhadrgh force-pushed the farhadr/evo2_cleanup branch from 4c5ac7d to c14f433 Compare April 2, 2025 21:06
@cspades
Member

cspades commented Apr 8, 2025

LGTM but will let John verify:

- features = rearrange(features, "l b d -> b l d").contiguous()
+ features = rearrange(features, "l b d -> b d l").contiguous()
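The one-line diff above moves the sequence axis from the middle to the end of the tensor. A minimal sketch of the two layouts using plain NumPy transposes (toy sizes are assumed for illustration; einops' `rearrange` performs the equivalent axis relabeling):

```python
import numpy as np

# Assumed toy sizes: l = sequence length, b = batch, d = hidden dim.
l, b, d = 8, 2, 4
features = np.ones((l, b, d))

# Old layout: "l b d -> b l d" (sequence kept as the middle axis).
old_layout = features.transpose(1, 0, 2)
assert old_layout.shape == (b, l, d)

# New layout: "l b d -> b d l" (sequence moved to the last axis,
# matching the standardized Hyena operator input/output shape).
new_layout = features.transpose(1, 2, 0)
assert new_layout.shape == (b, d, l)
```

The `.contiguous()` call in the diff matters because a transpose only changes strides; downstream kernels that expect densely packed memory need the copy.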

Signed-off-by: Farhad Ramezanghorbani <[email protected]>
Signed-off-by: Farhad Ramezanghorbani <[email protected]>
@farhadrgh
Collaborator Author

/ok to test

@codecov-commenter

❌ 21 Tests Failed:

Tests completed: 728 · Failed: 21 · Passed: 707 · Skipped: 9
View the top 3 failed test(s) by shortest run time
sub-packages/bionemo-amplify/tests/bionemo/amplify/test_convert.py::sub-packages.bionemo-amplify.tests.bionemo.amplify.test_convert
Stack Traces | 0s run time
ImportError while importing test module '.../bionemo/amplify/test_convert.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
.../usr/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
.../bionemo/amplify/test_convert.py:25: in <module>
    from bionemo.amplify.convert import HFAMPLIFYImporter, maybe_mock_xformers  # noqa: F401
.../local/lib/python3.12.../bionemo/amplify/convert.py:27: in <module>
    from bionemo.amplify.model import AMPLIFYConfig
.../local/lib/python3.12.../bionemo/amplify/model.py:38: in <module>
    from bionemo.llm.model.biobert.model import BioBertConfig, MegatronBioBertModel, PositionEmbeddingKinds
.../local/lib/python3.12.../model/biobert/model.py:57: in <module>
    from bionemo.llm.model.loss import BERTMLMLossWithReduction
.../local/lib/python3.12.../llm/model/loss.py:22: in <module>
    from nemo.lightning.megatron_parallel import (
E   ImportError: cannot import name 'masked_token_loss_context_parallel' from 'nemo.lightning.megatron_parallel' (.../local/lib/python3.12.../nemo/lightning/megatron_parallel.py)
sub-packages/bionemo-amplify/tests/bionemo/amplify/test_hf_rotary.py::sub-packages.bionemo-amplify.tests.bionemo.amplify.test_hf_rotary
Stack Traces | 0s run time
ImportError while importing test module '.../bionemo/amplify/test_hf_rotary.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
.../usr/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
.../bionemo/amplify/test_hf_rotary.py:21: in <module>
    from bionemo.amplify.convert import maybe_mock_xformers
.../local/lib/python3.12.../bionemo/amplify/convert.py:27: in <module>
    from bionemo.amplify.model import AMPLIFYConfig
.../local/lib/python3.12.../bionemo/amplify/model.py:38: in <module>
    from bionemo.llm.model.biobert.model import BioBertConfig, MegatronBioBertModel, PositionEmbeddingKinds
.../local/lib/python3.12.../model/biobert/model.py:57: in <module>
    from bionemo.llm.model.loss import BERTMLMLossWithReduction
.../local/lib/python3.12.../llm/model/loss.py:22: in <module>
    from nemo.lightning.megatron_parallel import (
E   ImportError: cannot import name 'masked_token_loss_context_parallel' from 'nemo.lightning.megatron_parallel' (.../local/lib/python3.12.../nemo/lightning/megatron_parallel.py)
sub-packages/bionemo-esm2/tests/bionemo/esm2/model/test_model.py::sub-packages.bionemo-esm2.tests.bionemo.esm2.model.test_model
Stack Traces | 0s run time
ImportError while importing test module '.../esm2/model/test_model.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
.../usr/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
.../esm2/model/test_model.py:28: in <module>
    from bionemo.esm2.api import ESM2Config, ESM2Model
.../local/lib/python3.12.../bionemo/esm2/api.py:19: in <module>
    from bionemo.esm2.model.model import ESM2Config, ESM2GenericConfig, ESM2Model
.../local/lib/python3.12.../esm2/model/model.py:40: in <module>
    from bionemo.llm.model.biobert.model import BioBertConfig, MegatronBioBertModel, PositionEmbeddingKinds
.../local/lib/python3.12.../model/biobert/model.py:57: in <module>
    from bionemo.llm.model.loss import BERTMLMLossWithReduction
.../local/lib/python3.12.../llm/model/loss.py:22: in <module>
    from nemo.lightning.megatron_parallel import (
E   ImportError: cannot import name 'masked_token_loss_context_parallel' from 'nemo.lightning.megatron_parallel' (.../local/lib/python3.12.../nemo/lightning/megatron_parallel.py)
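All three failures trace back to the same missing symbol in the installed NeMo. A small, hypothetical compatibility probe (the helper name is invented; the module and symbol names come from the traceback) that checks whether an installed package exposes a given name:

```python
import importlib


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if module_name imports cleanly and exposes symbol."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


# The failing import above would be probed as:
# has_symbol("nemo.lightning.megatron_parallel",
#            "masked_token_loss_context_parallel")

# Sanity checks against the standard library:
assert has_symbol("math", "sqrt")
assert not has_symbol("math", "no_such_name")
```

A probe like this makes version-skew failures surface as a clear skip or error message instead of an import-time crash deep in a test module.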


Collaborator

@jstjohn jstjohn left a comment


Approved, but see my inline comment about manual verification of tensor parallel correctness. Ideally the same could be done for CP=2, but I am not 100% sure that we have that working in the predict script.

x1 = torch.ones((batch_size, seq_len, g, dg), device=device)
x2 = torch.ones((batch_size, seq_len, g, dg), device=device)
v = torch.ones((batch_size, seq_len, g, dg), device=device)
x1 = torch.ones((batch_size, (g * dg), seq_len), device=device)
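The fixture change above flattens the group axes into a single channel dimension and moves the sequence axis last. A hypothetical NumPy sketch of the relabeling (toy sizes assumed):

```python
import numpy as np

# Assumed toy sizes: batch, sequence length, groups, per-group width.
batch_size, seq_len, g, dg = 2, 8, 4, 16

# Old test fixture: separate group (g) and per-group (dg) axes,
# with the sequence axis in the middle.
x1_old = np.ones((batch_size, seq_len, g, dg))

# New fixture: groups flattened into one channel axis (g * dg),
# sequence moved to the last dimension.
x1_new = np.ones((batch_size, g * dg, seq_len))

# Same data, re-laid-out: (b, l, g, dg) -> (b, l, g*dg) -> (b, g*dg, l)
x1_converted = x1_old.reshape(batch_size, seq_len, g * dg).transpose(0, 2, 1)
assert x1_converted.shape == x1_new.shape
```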
Collaborator


Is there a test somewhere covering that this still works with tensor parallel? It could be that moving sequence to the last dimension breaks tensor parallel because that has a lot of hardcoded assumptions about splitting on axis 1. Maybe if you run the brca notebook but with TP=2 (using the experimental bf16 model weights if doing this on a non fp8 node) and it still works, that would be good? Please post a manual verification to this effect.

Collaborator Author


I am not aware of any tests for TP, but all the tests in NeMo and BioNeMo are passing. The current CI failure is discussed in this thread and is unrelated to these changes.

I will run the notebook with TP=2 and report the results here.

Collaborator Author


I can now confirm that the notebook reproduces ToT results with TP=2 or CP=2 on two A6000s. However, there is a regression in ToT compared to the last time the notebook was executed, and this is unrelated to the changes here (more info regarding the ToT regression).

@farhadrgh
Collaborator Author

Need to bump NeMo to pick up the changes in NVIDIA/NeMo#12988 after it's merged.
