Update EVO2 tests according to Hyena arch changes #798
base: main
Conversation
Signed-off-by: Farhad Ramezanghorbani <[email protected]>
Force-pushed from fca02bc to 58706fe (Compare)
/ok to test
1 similar comment
/ok to test
Force-pushed from 4c5ac7d to c14f433 (Compare)
LGTM, but will let John verify.
Signed-off-by: Farhad Ramezanghorbani <[email protected]>
/ok to test
❌ 21 Tests Failed:
View the top 3 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard
Approved, but see my inline comment about manual verification of tensor-parallel correctness. Ideally the same could be done for CP=2, but I am not 100% sure that we have that working in the predict script.
x1 = torch.ones((batch_size, seq_len, g, dg), device=device)
x2 = torch.ones((batch_size, seq_len, g, dg), device=device)
v = torch.ones((batch_size, seq_len, g, dg), device=device)
x1 = torch.ones((batch_size, (g * dg), seq_len), device=device)
Is there a test somewhere covering that this still works with tensor parallel? It could be that moving the sequence to the last dimension breaks tensor parallel, because that code has a lot of hardcoded assumptions about splitting on axis 1. Maybe if you run the brca notebook with TP=2 (using the experimental bf16 model weights if doing this on a non-FP8 node) and it still works, that would be good? Please post a manual verification to this effect.
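For context, a minimal sketch of the kind of axis-1 split this refers to, under the new layout. The shapes, the TP degree, and the use of torch.chunk are illustrative assumptions for this sketch, not the Megatron/Hyena tensor-parallel implementation:

```python
import torch

# Illustrative shapes and TP degree (assumptions, not taken from the real code).
batch_size, seq_len, g, dg, tp_size = 2, 8, 4, 16, 2

# New convention: flattened features (g * dg) on axis 1, sequence on the last axis.
x = torch.ones((batch_size, g * dg, seq_len))

# A split hardcoded to axis 1 now shards the flattened feature dimension.
shards = torch.chunk(x, tp_size, dim=1)
assert shards[0].shape == (batch_size, (g * dg) // tp_size, seq_len)

# Under the old (batch_size, seq_len, g, dg) convention, the same dim=1 split
# would have sharded the sequence instead, hence the request for a manual TP=2 check.
```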
I am not aware of any tests for TP, but all the tests in NeMo and BioNeMo are passing. The current CI failure is discussed in this thread and is unrelated to these changes.
I will run the notebook with TP=2 and report the results here.
I can now confirm that the notebook reproduces ToT results with TP=2 or CP=2 on two A6000s. However, there is a regression in ToT compared to the last time the notebook was executed, and it is unrelated to the changes here (more info regarding the ToT regression).
Need to bump NeMo to get the changes in NVIDIA/NeMo#12988 after it's merged.
Description
NVIDIA/NeMo#12856 introduces code reduction and performance improvements, including standardizing input/output shapes for the Hyena operators and consequently reducing rearrangement overhead. This PR updates the EVO2 tests to comply with those changes.
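As a rough illustration of the shape change described above (variable names and sizes are made up for this sketch, not copied from the actual tests), the test inputs move from a (batch, seq_len, groups, group_dim) layout to a flattened-feature, sequence-last (batch, groups * group_dim, seq_len) layout:

```python
import torch

# Made-up sizes for illustration only.
batch_size, seq_len, g, dg = 2, 8, 4, 16

# Old convention: separate group dimensions, sequence on axis 1.
x_old = torch.ones((batch_size, seq_len, g, dg))

# New convention: groups flattened into a single feature axis, sequence last.
x_new = x_old.reshape(batch_size, seq_len, g * dg).permute(0, 2, 1).contiguous()
assert x_new.shape == (batch_size, g * dg, seq_len)
```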
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels:
Note
By default, the notebooks validation tests are skipped unless explicitly enabled.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
If a pull request is opened by a trusted user and contains only trusted changes, its code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
Otherwise, an NVIDIA org member must leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.
Usage
Pre-submit Checklist