[BUG] Evo2 Data Parallel Distributed Gradient Test Fails

### BioNeMo Framework Version

f14497c0fbeaff6cd3ae6e457cd126c532434b85

### Bug Description

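The data-parallel (DP) gradient test for Evo2 is failing. In the DP=2 run (`parallel_dp2_cp1_tp1_pp1`), the gradients, as compared through the optimizer `exp_avg` state in the saved checkpoints, do not match the base checkpoint within tolerance. The other three gradient tests (CP, TP, PP) pass.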
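For context, the expectation behind this kind of test can be sketched with a toy example: with a mean-reduced loss, averaging the gradients of two equal half-batches reproduces the single-process full-batch gradient up to floating-point noise. The model, data, and function names below are hypothetical and are not taken from the Evo2 test code.

```python
# Toy illustration only (not the BioNeMo test): data-parallel gradient
# averaging over two equal shards versus a single full-batch gradient.

def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

# Hypothetical data and weight.
xs = [0.5, -1.0, 2.0, 1.5]
ys = [1.0, 0.0, 3.0, 2.5]
w = 0.3

# Single-process baseline: one gradient over the whole batch.
full = mse_grad(w, xs, ys)

# Simulated DP=2: each "rank" sees half the batch, then gradients are averaged.
rank0 = mse_grad(w, xs[:2], ys[:2])
rank1 = mse_grad(w, xs[2:], ys[2:])
dp_avg = (rank0 + rank1) / 2.0

rel_err = abs(full - dp_avg) / abs(full)
print(f"full={full:.6f} dp_avg={dp_avg:.6f} rel_err={rel_err:.2e}")
```

In the failing comparison, the reported relative differences are roughly 0.09 to 0.12 against a bound of about 0.083.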

### Steps to Reproduce

Check out the commit hash above, build the container, and run `pytest -v --cov=bionemo --cov-append --cov-report=xml:coverage.xml -m 'slow and multi_gpu' --junitxml=bionemo-evo2.junit.xml -o junit_family=legacy ./sub-packages/bionemo-evo2/`. The test requires 2 GPUs.
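Before launching the run, it may help to confirm that two GPUs are visible inside the container. The sketch below assumes `nvidia-smi` is on the `PATH` and counts the lines printed by `nvidia-smi -L`; the helper names are hypothetical and are not part of BioNeMo.

```python
import shutil
import subprocess


def count_gpus_from_listing(listing: str) -> int:
    """Count devices in `nvidia-smi -L` style output (one 'GPU N: ...' line per device)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def visible_gpu_count() -> int:
    """Return the number of GPUs reported by nvidia-smi, or 0 if it is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return 0
    result = subprocess.run([exe, "-L"], capture_output=True, text=True, check=False)
    return count_gpus_from_listing(result.stdout)


count = visible_gpu_count()
print(f"visible GPUs: {count} (the multi_gpu tests in this report need 2)")
```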

### Error Messages and Logs

```shell
E       AssertionError: AssertionErrors comparing ['/tmp/pytest-of-root/pytest-0/base_checkpoint_session0/base_training/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights', '/tmp/pytest-of-root/pytest-0/test_distributed_training_grad0/parallel_dp2_cp1_tp1_pp1/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights']:
E       [AssertionError("AssertionError comparing /tmp/pytest-of-root/pytest-0/base_checkpoint_session0/base_training/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights to /tmp/pytest-of-root/pytest-0/test_distributed_training_grad0/parallel_dp2_cp1_tp1_pp1/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights:\nAssertion Errors found comparing keys: [AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.0.mixer.dense.weight: AssertionError on relative norm magnitude (rel=0.0995602235198021, bnd=0.08288547619861457, ok=False, rel_shuff=1.4244751930236816, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 20300 / 16777216 (0.1%)\\nGreatest absolute difference: 0.0004218118265271187 at index (0, 0, 15386801) (up to 1e-05 allowed)\\nGreatest relative difference: inf at index (0, 0, 332858) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[ 6.6002e-07, -9.3903e-27,  2.6702e-08,  ...,  1.1502e-06,\\n           3.0168e-09, -6.8193e-07]]])\\nRight: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[ 6.8919e-07, -9.3031e-27,  2.5913e-08,  ...,  1.0105e-06,\\n           1.0983e-08, -6.0682e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense.weight: AssertionError on relative norm magnitude (rel=0.0969272181391716, bnd=0.08288547619861457, ok=False, rel_shuff=1.4254369735717773, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 893 / 16777216 (0.0%)\\nGreatest absolute difference: 9.87960520433262e-05 at index (0, 0, 15387913) (up to 1e-05 allowed)\\nGreatest relative difference: 65.82613372802734 at index (0, 0, 15386834) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[-7.8326e-07, -1.1502e-06,  1.3009e-07,  ...,  1.8897e-07,\\n           1.9718e-06, -5.5047e-07]]])\\nRight: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[-9.3357e-07, -1.1697e-06, 
-5.4916e-09,  ...,  1.6955e-07,\\n           1.8891e-06, -5.3818e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense_projection.layer_norm_weight: AssertionError on relative norm magnitude (rel=0.09405025094747543, bnd=0.08288547619861457, ok=False, rel_shuff=1.4441505670547485, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 10 / 4096 (0.2%)\\nGreatest absolute difference: 1.8577644368633628e-05 at index (0, 2282) (up to 1e-05 allowed)\\nGreatest relative difference: 0.3205079734325409 at index (0, 2138) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-1.4679e-06,  2.3333e-06, -1.7747e-06,  ..., -3.3740e-06,\\n         -3.1769e-06, -4.7982e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-1.1862e-06,  1.5267e-06, -1.5486e-06,  ..., -3.4268e-06,\\n         -1.8452e-06, -3.7343e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense_projection.weight: AssertionError on relative norm magnitude (rel=0.110261470079422, bnd=0.08288547619861457, ok=False, rel_shuff=1.4172718524932861, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 75 / 50331648 (0.0%)\\nGreatest absolute difference: 2.5135617761407048e-05 at index (0, 0, 31119373) (up to 1e-05 allowed)\\nGreatest relative difference: 11.342885971069336 at index (0, 0, 31118966) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[ 5.0323e-08, -6.6327e-09, -2.2936e-08,  ..., -2.7250e-07,\\n          -1.4310e-07, -8.8322e-08]]])\\nRight: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[ 6.1780e-08, -8.8380e-09, -2.5399e-08,  ..., -3.0204e-07,\\n          -1.6887e-07, -1.1944e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.mixer.conv_bias: AssertionError on relative norm magnitude (rel=0.09650460630655289, bnd=0.08288547619861457, ok=False, 
rel_shuff=1.4202098846435547, ok_shuff=False) but torch.testing.assert_close(left, right) passes. \\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-4.3819e-07, -5.0392e-07, -1.5063e-07,  ..., -2.1909e-07,\\n          1.4570e-06,  1.3967e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-4.7777e-07, -5.7662e-07,  0.0000e+00,  ..., -1.5788e-07,\\n          1.1752e-06,  1.4717e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.mixer.filter.h: AssertionError on relative norm magnitude (rel=0.11557027697563171, bnd=0.08288547619861457, ok=False, rel_shuff=1.0917565822601318, ok_shuff=False) but torch.testing.assert_close(left, right) passes. \\nLeft: torch.Size([1, 1, 32768])/torch.float32 tensor([[[-8.5994e-07,  1.3858e-06,  1.3639e-06,  ...,  2.0595e-06,\\n           2.3005e-06,  3.3521e-06]]])\\nRight: torch.Size([1, 1, 32768])/torch.float32 tensor([[[-8.6218e-07,  1.1258e-06,  8.9513e-07,  ...,  1.9660e-06,\\n           2.2406e-06,  3.4048e-06]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc1.layer_norm_weight: AssertionError on relative norm magnitude (rel=0.10146525502204895, bnd=0.08288547619861457, ok=False, rel_shuff=1.3731473684310913, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 2558 / 4096 (62.5%)\\nGreatest absolute difference: 0.00040685414569452405 at index (0, 2282) (up to 1e-05 allowed)\\nGreatest relative difference: inf at index (0, 21) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-2.0332e-04, -4.7324e-05,  2.3557e-04,  ...,  4.0839e-05,\\n          1.7668e-04,  4.9077e-04]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-2.0385e-04, -4.2175e-05,  2.2634e-04,  ...,  4.2175e-05,\\n          1.7222e-04,  4.9486e-04]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc1.weight: AssertionError on relative norm magnitude 
(rel=0.12013006210327148, bnd=0.08288547619861457, ok=False, rel_shuff=1.4374887943267822, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 2 / 90177536 (0.0%)\\nGreatest absolute difference: 1.2533068002085201e-05 at index (0, 0, 40454378) (up to 1e-05 allowed)\\nGreatest relative difference: 35.659820556640625 at index (0, 0, 40454378) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 2.1772e-07, -6.9836e-07,  9.2019e-07,  ..., -8.9828e-07,\\n           2.1581e-06, -2.2348e-06]],\\n\\n        [[ 9.3663e-07,  1.9390e-06,  1.7856e-06,  ...,  6.9836e-07,\\n          -9.2019e-07, -7.4492e-07]]])\\nRight: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 3.2675e-07, -7.9079e-07,  1.1642e-06,  ..., -8.0177e-07,\\n           1.9770e-06, -1.8452e-06]],\\n\\n        [[ 5.3268e-07,  1.5376e-06,  1.4278e-06,  ...,  6.5350e-07,\\n          -8.8964e-07, -7.1940e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc2.weight: AssertionError on relative norm magnitude (rel=0.10087946057319641, bnd=0.08288547619861457, ok=False, rel_shuff=1.429401159286499, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 97 / 45088768 (0.0%)\\nGreatest absolute difference: 3.1207731808535755e-05 at index (0, 0, 41354288) (up to 1e-05 allowed)\\nGreatest relative difference: 70.81315612792969 at index (0, 0, 41354260) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-4.7105e-07, -1.7199e-06, -3.0673e-07,  ...,  6.8467e-07,\\n           3.8341e-07, -2.2594e-07]]])\\nRight: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-6.3153e-07, -1.7244e-06, -1.6200e-07,  ...,  7.0842e-07,\\n           4.0363e-07, -2.1829e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.dense_projection.weight: AssertionError on relative norm magnitude (rel=0.0925312265753746, bnd=0.08288547619861457, 
ok=False, rel_shuff=1.4162689447402954, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 1029 / 50331648 (0.0%)\\nGreatest absolute difference: 9.991967817768455e-05 at index (0, 0, 34900202) (up to 1e-05 allowed)\\nGreatest relative difference: 29.17668914794922 at index (0, 0, 34898983) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[-5.9703e-07,  3.2727e-07, -1.3830e-07,  ...,  1.2872e-07,\\n           2.2868e-07, -1.2598e-07]]])\\nRight: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[-5.9584e-07,  2.7870e-07, -9.3357e-08,  ...,  1.3866e-07,\\n           1.8397e-07, -1.0915e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.conv_bias: AssertionError on relative norm magnitude (rel=0.11542163044214249, bnd=0.08288547619861457, ok=False, rel_shuff=1.4485046863555908, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 98 / 4096 (2.4%)\\nGreatest absolute difference: 0.00015035035903565586 at index (0, 261) (up to 1e-05 allowed)\\nGreatest relative difference: 2.6042587757110596 at index (0, 2612) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[ 1.0385e-05, -1.5268e-07, -1.9171e-07,  ...,  1.9718e-06,\\n          4.0258e-07,  6.6166e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[ 8.9183e-06, -2.2653e-07, -1.0159e-07,  ...,  2.1417e-06,\\n          3.5146e-07,  6.1506e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.filter.gamma: AssertionError on relative norm magnitude (rel=0.10202537477016449, bnd=0.08288547619861457, ok=False, rel_shuff=1.4891784191131592, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 3 / 65536 (0.0%)\\nGreatest absolute difference: 1.5955651178956032e-05 at index (0, 0, 4178) (up to 1e-05 allowed)\\nGreatest relative difference: 0.24407503008842468 at index (0, 0, 4178) (up to 1.3e-06 
allowed)\\nLeft: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 2.9167e-07, -1.1845e-07, -1.1845e-07,  ...,  2.3856e-09,\\n           2.3963e-09, -1.7801e-08]]])\\nRight: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 3.0204e-07, -1.2013e-07, -1.2082e-07,  ...,  2.6493e-09,\\n           2.6600e-09, -1.8277e-08]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.filter.p: AssertionError on relative norm magnitude (rel=0.10202537477016449, bnd=0.08288547619861457, ok=False, rel_shuff=1.4891784191131592, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 3 / 65536 (0.0%)\\nGreatest absolute difference: 1.5955651178956032e-05 at index (0, 0, 4178) (up to 1e-05 allowed)\\nGreatest relative difference: 0.24407503008842468 at index (0, 0, 4178) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 2.9167e-07, -1.1845e-07, -1.1845e-07,  ...,  2.3856e-09,\\n           2.3963e-09, -1.7801e-08]]])\\nRight: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 3.0204e-07, -1.2013e-07, -1.2082e-07,  ...,  2.6493e-09,\\n           2.6600e-09, -1.8277e-08]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mlp.linear_fc1.weight: AssertionError on relative norm magnitude (rel=0.09465793520212173, bnd=0.08288547619861457, ok=False, rel_shuff=1.438188076019287, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 70 / 90177536 (0.0%)\\nGreatest absolute difference: 2.4612105335108936e-05 at index (1, 0, 35907595) (up to 1e-05 allowed)\\nGreatest relative difference: 2.262340784072876 at index (1, 0, 12875787) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[-1.4652e-07, -1.5610e-07,  4.0532e-07,  ...,  3.4370e-07,\\n          -1.7747e-06,  7.2232e-08]],\\n\\n        [[-2.7934e-07,  2.4511e-07,  2.6565e-07,  ..., -7.9969e-07,\\n          -3.9437e-07,  2.5607e-07]]])\\nRight: torch.Size([2, 
1, 45088768])/torch.float32 tensor([[[-1.0228e-07, -1.7710e-07,  3.6519e-07,  ...,  4.1187e-07,\\n          -1.6035e-06,  1.6337e-07]],\\n\\n        [[-2.5261e-07,  2.4987e-07,  3.0066e-07,  ..., -8.9513e-07,\\n          -6.2330e-07,  4.2560e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.3.mlp.linear_fc1.weight: AssertionError on relative norm magnitude (rel=0.08877922594547272, bnd=0.08288547619861457, ok=False, rel_shuff=1.4074212312698364, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 190 / 90177536 (0.0%)\\nGreatest absolute difference: 4.655582597479224e-05 at index (1, 0, 10504426) (up to 1e-05 allowed)\\nGreatest relative difference: 12.370415687561035 at index (1, 0, 20887316) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 7.2575e-08,  1.6158e-07, -2.5247e-09,  ...,  7.3944e-07,\\n          -1.5446e-06,  6.0525e-07]],\\n\\n        [[ 5.0118e-07,  8.4556e-08,  3.6424e-07,  ...,  7.8052e-08,\\n           1.1667e-06, -1.5227e-06]]])\\nRight: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 3.3636e-08,  1.4484e-07,  4.9768e-09,  ...,  7.2489e-07,\\n          -1.3235e-06,  5.6014e-07]],\\n\\n        [[ 4.2285e-07,  1.4827e-07,  4.1462e-07,  ...,  2.0868e-07,\\n           1.1532e-06, -1.6365e-06]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.3.mlp.linear_fc2.weight: AssertionError on relative norm magnitude (rel=0.08539237082004547, bnd=0.08288547619861457, ok=False, rel_shuff=1.415677547454834, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 1069 / 45088768 (0.0%)\\nGreatest absolute difference: 6.49754365440458e-05 at index (0, 0, 41351259) (up to 1e-05 allowed)\\nGreatest relative difference: 41.8884162902832 at index (0, 0, 41352960) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-4.3434e-09, -1.7887e-08,  9.9277e-09,  ...,  9.3607e-10,\\n         
  6.3332e-09,  4.4075e-09]]])\\nRight: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-3.1319e-09, -1.7676e-08,  1.0125e-08,  ...,  1.2764e-09,\\n           6.3926e-09,  5.6632e-09]]])')]")]
```
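Since the raw assertion output is hard to scan, a small parser can list the failing keys with their `rel` and `bnd` values. This is a minimal sketch that assumes the message format shown in the log above; the `summarize` helper is hypothetical, and the sample text is a shortened excerpt of two entries from that log.

```python
import re

# Matches "AssertionError for <key>: AssertionError on relative norm magnitude (rel=..., bnd=..., ok=...".
# \s* between tokens tolerates the line wraps present in the pasted log.
PATTERN = re.compile(
    r"AssertionError for (?P<key>[\w.]+):\s*"
    r"AssertionError on relative norm magnitude\s*"
    r"\(rel=(?P<rel>[-+.\deE]+),\s*bnd=(?P<bnd>[-+.\deE]+),\s*ok=(?P<ok>True|False)"
)


def summarize(log_text):
    """Return (key, rel, bnd, rel / bnd) tuples, largest ratio first."""
    rows = []
    for m in PATTERN.finditer(log_text):
        rel, bnd = float(m["rel"]), float(m["bnd"])
        rows.append((m["key"], rel, bnd, rel / bnd))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Shortened excerpt of two entries from the log above (second one includes a line wrap).
sample = (
    "AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.0.mixer.dense.weight: "
    "AssertionError on relative norm magnitude (rel=0.0995602235198021, bnd=0.08288547619861457, ok=False, \n"
    "AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc1.weight: "
    "AssertionError on relative norm magnitude \n(rel=0.12013006210327148, bnd=0.08288547619861457, ok=False, "
)

rows = summarize(sample)
for key, rel, bnd, ratio in rows:
    print(f"{ratio:5.2f}x bound  rel={rel:.4f}  {key}")
```

Feeding the full pytest output to `summarize` in place of `sample` would give one row per failing tensor, sorted by how far `rel` exceeds `bnd`.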

### Docker Image

_No response_

### System Information

Environment Details:
- Failing in CI (set `ciflow:multi-gpu` label)

GPU Details:
- GPU Model: 2x RTX A6000
- GPU Memory: 48 GB each


### Additional Context

_No response_

[BUG] Evo2 Data Parallel Distributed Gradient Test Fails #1323
