[BUG] Evo2 Data Parallel Distributed Gradient Test Fails #1323

@gagank1

BioNeMo Framework Version

f14497c

Bug Description

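The data parallel gradient test for Evo2 is failing. The gradients from the data parallel run (parallel_dp2_cp1_tp1_pp1) don't match the base checkpoint: every mismatch reported in the log below is in the optimizer's exp_avg state. The other three gradient tests (CP, TP, PP) pass.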
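
The log in this issue reports rel, bnd, and rel_shuff for each failing key, but it does not define them. One plausible reading, which is an assumption and not the test's confirmed implementation, is that rel is the L2 norm of the difference divided by the norm of the reference, and rel_shuff is the same quantity against a shuffled copy. The pure-Python sketch below (all names are hypothetical) shows why a shuffled comparison under that reading lands near sqrt(2), which is roughly the range of most rel_shuff values in the log.

```python
import math
import random


def relative_norm_diff(left, right):
    """Return ||left - right||_2 / ||left||_2 for two equal-length sequences.

    Assumed definition for illustration only; the real test may differ.
    """
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(left, right)))
    base = math.sqrt(sum(a * a for a in left))
    return diff / base


rng = random.Random(0)
n = 20_000
reference = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Hypothetical "nearly matching" run: reference plus noise at 10% of its scale.
noisy = [x + 0.1 * rng.gauss(0.0, 1.0) for x in reference]

# Shuffled copy: same values, scrambled positions, so it is uncorrelated
# with the reference while keeping the same norm.
shuffled = reference[:]
rng.shuffle(shuffled)

rel = relative_norm_diff(reference, noisy)
rel_shuff = relative_norm_diff(reference, shuffled)
print(f"rel={rel:.3f}  rel_shuff={rel_shuff:.3f}  sqrt(2)={math.sqrt(2):.3f}")
```

If this reading is right, a rel around 0.09 to 0.12 against a bound of about 0.083 would mean the data parallel state is close to the base run but slightly outside tolerance, not uncorrelated with it.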

Steps to Reproduce

Check out the commit above, build the container, and run the command below. It needs 2 GPUs.

    pytest -v --cov=bionemo --cov-append --cov-report=xml:coverage.xml -m 'slow and multi_gpu' --junitxml=bionemo-evo2.junit.xml -o junit_family=legacy ./sub-packages/bionemo-evo2/

Error Messages and Logs

E       AssertionError: AssertionErrors comparing ['/tmp/pytest-of-root/pytest-0/base_checkpoint_session0/base_training/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights', '/tmp/pytest-of-root/pytest-0/test_distributed_training_grad0/parallel_dp2_cp1_tp1_pp1/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights']:
E       [AssertionError("AssertionError comparing /tmp/pytest-of-root/pytest-0/base_checkpoint_session0/base_training/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights to /tmp/pytest-of-root/pytest-0/test_distributed_training_grad0/parallel_dp2_cp1_tp1_pp1/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights:\nAssertion Errors found comparing keys: [AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.0.mixer.dense.weight: AssertionError on relative norm magnitude (rel=0.0995602235198021, bnd=0.08288547619861457, ok=False, rel_shuff=1.4244751930236816, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 20300 / 16777216 (0.1%)\\nGreatest absolute difference: 0.0004218118265271187 at index (0, 0, 15386801) (up to 1e-05 allowed)\\nGreatest relative difference: inf at index (0, 0, 332858) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[ 6.6002e-07, -9.3903e-27,  2.6702e-08,  ...,  1.1502e-06,\\n           3.0168e-09, -6.8193e-07]]])\\nRight: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[ 6.8919e-07, -9.3031e-27,  2.5913e-08,  ...,  1.0105e-06,\\n           1.0983e-08, -6.0682e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense.weight: AssertionError on relative norm magnitude (rel=0.0969272181391716, bnd=0.08288547619861457, ok=False, rel_shuff=1.4254369735717773, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 893 / 16777216 (0.0%)\\nGreatest absolute difference: 9.87960520433262e-05 at index (0, 0, 15387913) (up to 1e-05 allowed)\\nGreatest relative difference: 65.82613372802734 at index (0, 0, 15386834) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[-7.8326e-07, -1.1502e-06,  1.3009e-07,  ...,  1.8897e-07,\\n           1.9718e-06, -5.5047e-07]]])\\nRight: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[-9.3357e-07, -1.1697e-06, 
-5.4916e-09,  ...,  1.6955e-07,\\n           1.8891e-06, -5.3818e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense_projection.layer_norm_weight: AssertionError on relative norm magnitude (rel=0.09405025094747543, bnd=0.08288547619861457, ok=False, rel_shuff=1.4441505670547485, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 10 / 4096 (0.2%)\\nGreatest absolute difference: 1.8577644368633628e-05 at index (0, 2282) (up to 1e-05 allowed)\\nGreatest relative difference: 0.3205079734325409 at index (0, 2138) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-1.4679e-06,  2.3333e-06, -1.7747e-06,  ..., -3.3740e-06,\\n         -3.1769e-06, -4.7982e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-1.1862e-06,  1.5267e-06, -1.5486e-06,  ..., -3.4268e-06,\\n         -1.8452e-06, -3.7343e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense_projection.weight: AssertionError on relative norm magnitude (rel=0.110261470079422, bnd=0.08288547619861457, ok=False, rel_shuff=1.4172718524932861, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 75 / 50331648 (0.0%)\\nGreatest absolute difference: 2.5135617761407048e-05 at index (0, 0, 31119373) (up to 1e-05 allowed)\\nGreatest relative difference: 11.342885971069336 at index (0, 0, 31118966) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[ 5.0323e-08, -6.6327e-09, -2.2936e-08,  ..., -2.7250e-07,\\n          -1.4310e-07, -8.8322e-08]]])\\nRight: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[ 6.1780e-08, -8.8380e-09, -2.5399e-08,  ..., -3.0204e-07,\\n          -1.6887e-07, -1.1944e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.mixer.conv_bias: AssertionError on relative norm magnitude (rel=0.09650460630655289, bnd=0.08288547619861457, ok=False, 
rel_shuff=1.4202098846435547, ok_shuff=False) but torch.testing.assert_close(left, right) passes. \\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-4.3819e-07, -5.0392e-07, -1.5063e-07,  ..., -2.1909e-07,\\n          1.4570e-06,  1.3967e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-4.7777e-07, -5.7662e-07,  0.0000e+00,  ..., -1.5788e-07,\\n          1.1752e-06,  1.4717e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.mixer.filter.h: AssertionError on relative norm magnitude (rel=0.11557027697563171, bnd=0.08288547619861457, ok=False, rel_shuff=1.0917565822601318, ok_shuff=False) but torch.testing.assert_close(left, right) passes. \\nLeft: torch.Size([1, 1, 32768])/torch.float32 tensor([[[-8.5994e-07,  1.3858e-06,  1.3639e-06,  ...,  2.0595e-06,\\n           2.3005e-06,  3.3521e-06]]])\\nRight: torch.Size([1, 1, 32768])/torch.float32 tensor([[[-8.6218e-07,  1.1258e-06,  8.9513e-07,  ...,  1.9660e-06,\\n           2.2406e-06,  3.4048e-06]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc1.layer_norm_weight: AssertionError on relative norm magnitude (rel=0.10146525502204895, bnd=0.08288547619861457, ok=False, rel_shuff=1.3731473684310913, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 2558 / 4096 (62.5%)\\nGreatest absolute difference: 0.00040685414569452405 at index (0, 2282) (up to 1e-05 allowed)\\nGreatest relative difference: inf at index (0, 21) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-2.0332e-04, -4.7324e-05,  2.3557e-04,  ...,  4.0839e-05,\\n          1.7668e-04,  4.9077e-04]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-2.0385e-04, -4.2175e-05,  2.2634e-04,  ...,  4.2175e-05,\\n          1.7222e-04,  4.9486e-04]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc1.weight: AssertionError on relative norm magnitude 
(rel=0.12013006210327148, bnd=0.08288547619861457, ok=False, rel_shuff=1.4374887943267822, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 2 / 90177536 (0.0%)\\nGreatest absolute difference: 1.2533068002085201e-05 at index (0, 0, 40454378) (up to 1e-05 allowed)\\nGreatest relative difference: 35.659820556640625 at index (0, 0, 40454378) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 2.1772e-07, -6.9836e-07,  9.2019e-07,  ..., -8.9828e-07,\\n           2.1581e-06, -2.2348e-06]],\\n\\n        [[ 9.3663e-07,  1.9390e-06,  1.7856e-06,  ...,  6.9836e-07,\\n          -9.2019e-07, -7.4492e-07]]])\\nRight: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 3.2675e-07, -7.9079e-07,  1.1642e-06,  ..., -8.0177e-07,\\n           1.9770e-06, -1.8452e-06]],\\n\\n        [[ 5.3268e-07,  1.5376e-06,  1.4278e-06,  ...,  6.5350e-07,\\n          -8.8964e-07, -7.1940e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc2.weight: AssertionError on relative norm magnitude (rel=0.10087946057319641, bnd=0.08288547619861457, ok=False, rel_shuff=1.429401159286499, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 97 / 45088768 (0.0%)\\nGreatest absolute difference: 3.1207731808535755e-05 at index (0, 0, 41354288) (up to 1e-05 allowed)\\nGreatest relative difference: 70.81315612792969 at index (0, 0, 41354260) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-4.7105e-07, -1.7199e-06, -3.0673e-07,  ...,  6.8467e-07,\\n           3.8341e-07, -2.2594e-07]]])\\nRight: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-6.3153e-07, -1.7244e-06, -1.6200e-07,  ...,  7.0842e-07,\\n           4.0363e-07, -2.1829e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.dense_projection.weight: AssertionError on relative norm magnitude (rel=0.0925312265753746, bnd=0.08288547619861457, 
ok=False, rel_shuff=1.4162689447402954, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 1029 / 50331648 (0.0%)\\nGreatest absolute difference: 9.991967817768455e-05 at index (0, 0, 34900202) (up to 1e-05 allowed)\\nGreatest relative difference: 29.17668914794922 at index (0, 0, 34898983) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[-5.9703e-07,  3.2727e-07, -1.3830e-07,  ...,  1.2872e-07,\\n           2.2868e-07, -1.2598e-07]]])\\nRight: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[-5.9584e-07,  2.7870e-07, -9.3357e-08,  ...,  1.3866e-07,\\n           1.8397e-07, -1.0915e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.conv_bias: AssertionError on relative norm magnitude (rel=0.11542163044214249, bnd=0.08288547619861457, ok=False, rel_shuff=1.4485046863555908, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 98 / 4096 (2.4%)\\nGreatest absolute difference: 0.00015035035903565586 at index (0, 261) (up to 1e-05 allowed)\\nGreatest relative difference: 2.6042587757110596 at index (0, 2612) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[ 1.0385e-05, -1.5268e-07, -1.9171e-07,  ...,  1.9718e-06,\\n          4.0258e-07,  6.6166e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[ 8.9183e-06, -2.2653e-07, -1.0159e-07,  ...,  2.1417e-06,\\n          3.5146e-07,  6.1506e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.filter.gamma: AssertionError on relative norm magnitude (rel=0.10202537477016449, bnd=0.08288547619861457, ok=False, rel_shuff=1.4891784191131592, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 3 / 65536 (0.0%)\\nGreatest absolute difference: 1.5955651178956032e-05 at index (0, 0, 4178) (up to 1e-05 allowed)\\nGreatest relative difference: 0.24407503008842468 at index (0, 0, 4178) (up to 1.3e-06 
allowed)\\nLeft: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 2.9167e-07, -1.1845e-07, -1.1845e-07,  ...,  2.3856e-09,\\n           2.3963e-09, -1.7801e-08]]])\\nRight: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 3.0204e-07, -1.2013e-07, -1.2082e-07,  ...,  2.6493e-09,\\n           2.6600e-09, -1.8277e-08]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.filter.p: AssertionError on relative norm magnitude (rel=0.10202537477016449, bnd=0.08288547619861457, ok=False, rel_shuff=1.4891784191131592, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 3 / 65536 (0.0%)\\nGreatest absolute difference: 1.5955651178956032e-05 at index (0, 0, 4178) (up to 1e-05 allowed)\\nGreatest relative difference: 0.24407503008842468 at index (0, 0, 4178) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 2.9167e-07, -1.1845e-07, -1.1845e-07,  ...,  2.3856e-09,\\n           2.3963e-09, -1.7801e-08]]])\\nRight: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 3.0204e-07, -1.2013e-07, -1.2082e-07,  ...,  2.6493e-09,\\n           2.6600e-09, -1.8277e-08]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mlp.linear_fc1.weight: AssertionError on relative norm magnitude (rel=0.09465793520212173, bnd=0.08288547619861457, ok=False, rel_shuff=1.438188076019287, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 70 / 90177536 (0.0%)\\nGreatest absolute difference: 2.4612105335108936e-05 at index (1, 0, 35907595) (up to 1e-05 allowed)\\nGreatest relative difference: 2.262340784072876 at index (1, 0, 12875787) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[-1.4652e-07, -1.5610e-07,  4.0532e-07,  ...,  3.4370e-07,\\n          -1.7747e-06,  7.2232e-08]],\\n\\n        [[-2.7934e-07,  2.4511e-07,  2.6565e-07,  ..., -7.9969e-07,\\n          -3.9437e-07,  2.5607e-07]]])\\nRight: torch.Size([2, 
1, 45088768])/torch.float32 tensor([[[-1.0228e-07, -1.7710e-07,  3.6519e-07,  ...,  4.1187e-07,\\n          -1.6035e-06,  1.6337e-07]],\\n\\n        [[-2.5261e-07,  2.4987e-07,  3.0066e-07,  ..., -8.9513e-07,\\n          -6.2330e-07,  4.2560e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.3.mlp.linear_fc1.weight: AssertionError on relative norm magnitude (rel=0.08877922594547272, bnd=0.08288547619861457, ok=False, rel_shuff=1.4074212312698364, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 190 / 90177536 (0.0%)\\nGreatest absolute difference: 4.655582597479224e-05 at index (1, 0, 10504426) (up to 1e-05 allowed)\\nGreatest relative difference: 12.370415687561035 at index (1, 0, 20887316) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 7.2575e-08,  1.6158e-07, -2.5247e-09,  ...,  7.3944e-07,\\n          -1.5446e-06,  6.0525e-07]],\\n\\n        [[ 5.0118e-07,  8.4556e-08,  3.6424e-07,  ...,  7.8052e-08,\\n           1.1667e-06, -1.5227e-06]]])\\nRight: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 3.3636e-08,  1.4484e-07,  4.9768e-09,  ...,  7.2489e-07,\\n          -1.3235e-06,  5.6014e-07]],\\n\\n        [[ 4.2285e-07,  1.4827e-07,  4.1462e-07,  ...,  2.0868e-07,\\n           1.1532e-06, -1.6365e-06]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.3.mlp.linear_fc2.weight: AssertionError on relative norm magnitude (rel=0.08539237082004547, bnd=0.08288547619861457, ok=False, rel_shuff=1.415677547454834, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 1069 / 45088768 (0.0%)\\nGreatest absolute difference: 6.49754365440458e-05 at index (0, 0, 41351259) (up to 1e-05 allowed)\\nGreatest relative difference: 41.8884162902832 at index (0, 0, 41352960) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-4.3434e-09, -1.7887e-08,  9.9277e-09,  ...,  9.3607e-10,\\n         
  6.3332e-09,  4.4075e-09]]])\\nRight: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-3.1319e-09, -1.7676e-08,  1.0125e-08,  ...,  1.2764e-09,\\n           6.3926e-09,  5.6632e-09]]])')]")]
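
The report above is hard to scan by eye. As a reading aid, here is a small stdlib-only sketch that pulls the parameter key, rel, and bnd out of each entry and ranks them by how far rel exceeds bnd. The regular expression is based only on the message format visible in this log; the function and variable names are hypothetical and not part of BioNeMo.

```python
import re

# One per-key failure entry. Whitespace between fields is matched loosely
# because long logs are often re-wrapped when pasted.
FAILURE_RE = re.compile(
    r"AssertionError for (?P<key>\S+?):\s+"
    r"AssertionError on relative norm magnitude\s+"
    r"\(rel=(?P<rel>[0-9.eE+-]+),\s+bnd=(?P<bnd>[0-9.eE+-]+)"
)

# Two entries copied from the log above, truncated after the ok flag.
SAMPLE = (
    "AssertionError('AssertionError for optimizer.state.exp_avg.module."
    "decoder.layers.0.mixer.dense.weight: AssertionError on relative norm "
    "magnitude (rel=0.0995602235198021, bnd=0.08288547619861457, ok=False"
    "'), AssertionError('AssertionError for optimizer.state.exp_avg.module."
    "decoder.layers.1.mlp.linear_fc1.weight: AssertionError on relative norm "
    "magnitude (rel=0.12013006210327148, bnd=0.08288547619861457, ok=False"
)


def summarize_failures(log_text):
    """Return (key, rel, bnd, rel / bnd) tuples, largest ratio first."""
    rows = []
    for match in FAILURE_RE.finditer(log_text):
        rel = float(match.group("rel"))
        bnd = float(match.group("bnd"))
        rows.append((match.group("key"), rel, bnd, rel / bnd))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for key, rel, bnd, ratio in summarize_failures(SAMPLE):
        print(f"{ratio:5.2f}x  rel={rel:.4f}  bnd={bnd:.4f}  {key}")
```

To use it on the full failure, pass the pytest output (for example, the text of the junit XML or a captured console log) to summarize_failures in place of SAMPLE.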

Docker Image

No response

System Information

Environment Details:

  • Failing in CI (set ciflow:multi-gpu label)

GPU Details:

  • GPU Model: 2x RTX A6000
  • GPU Memory: 48 GB each

Additional Context

No response

Metadata

  • Assignees: none
  • Labels: bug