Open
Labels
bug: Something isn't working
Description
BioNeMo Framework Version
Bug Description
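The data-parallel (DP) gradient test for Evo2 is failing. With data parallelism enabled (the parallel_dp2_cp1_tp1_pp1 run in the log below), the compared optimizer.state.exp_avg tensors do not match the base checkpoint within the test's bound. The other three gradient tests (context parallel, tensor parallel, and pipeline parallel) pass.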
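The log in this issue reports rel, bnd, and rel_shuff for each failing tensor without defining them. The sketch below is an assumption, not taken from the test source: it treats rel as the L2 norm of the difference divided by the L2 norm of the reference, and rel_shuff as the same quantity computed against a shuffled copy. Under that reading, two uncorrelated tensors of similar norm give a value near sqrt(2), which is consistent with the rel_shuff values of roughly 1.4 in the log. The function name relative_norm_diff and the synthetic data are hypothetical and for illustration only.

```python
import math
import random


def relative_norm_diff(left, right):
    """Return ||left - right||_2 / ||left||_2 for two equal-length sequences.

    ASSUMPTION: this is one plausible definition of the `rel` value in the
    log; the actual test helper may compute it differently.
    """
    if len(left) != len(right):
        raise ValueError("inputs must have the same length")
    diff_sq = sum((a - b) ** 2 for a, b in zip(left, right))
    left_sq = sum(a * a for a in left)
    if left_sq == 0.0:
        raise ValueError("left has zero norm")
    return math.sqrt(diff_sq / left_sq)


# Synthetic stand-ins for the two checkpoints' tensors (not real data).
rng = random.Random(0)
left = [rng.gauss(0.0, 1e-6) for _ in range(20000)]
# `right` is `left` plus noise at about 10% of its scale.
right = [a + rng.gauss(0.0, 1e-7) for a in left]

# Shuffling destroys the element-wise correspondence with `left`.
shuffled = right[:]
rng.shuffle(shuffled)

rel = relative_norm_diff(left, right)
rel_shuff = relative_norm_diff(left, shuffled)
print(f"rel={rel:.4f} rel_shuff={rel_shuff:.4f}")
```

If this reading is right, rel values of about 0.09 to 0.12 against a bound of about 0.083 suggest the DP run is close to the base run but just outside tolerance, rather than producing unrelated values.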
Steps to Reproduce
Check out the hash, build the container, and run the command below. Requires 2 GPUs.

pytest -v --cov=bionemo --cov-append --cov-report=xml:coverage.xml -m 'slow and multi_gpu' --junitxml=bionemo-evo2.junit.xml -o junit_family=legacy ./sub-packages/bionemo-evo2/
Error Messages and Logs
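The raw log below is a single long string, so it can be hard to see which parameters are furthest out of bound. A minimal triage sketch follows, assuming only the message format visible in this issue ("AssertionError for <key>: AssertionError on relative norm magnitude (rel=..., bnd=..."). The helper name summarize_failures is hypothetical, and the sample string contains two entries copied from the log.

```python
import re

# Matches the per-tensor message format seen in this issue's log.
# Whitespace is matched loosely because the pasted log wraps mid-message.
FAILURE_RE = re.compile(
    r"AssertionError for (?P<key>[\w.]+):\s+"
    r"AssertionError on relative norm magnitude\s*"
    r"\(rel=(?P<rel>[-+.\deE]+),\s*bnd=(?P<bnd>[-+.\deE]+)"
)


def summarize_failures(log_text):
    """Return (key, rel, bnd, rel / bnd) tuples, worst ratio first."""
    rows = []
    for match in FAILURE_RE.finditer(log_text):
        rel = float(match["rel"])
        bnd = float(match["bnd"])
        rows.append((match["key"], rel, bnd, rel / bnd))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Two entries copied from the log in this issue, truncated after ok_shuff.
sample = (
    "AssertionError('AssertionError for optimizer.state.exp_avg.module."
    "decoder.layers.0.mixer.dense.weight: AssertionError on relative norm "
    "magnitude (rel=0.0995602235198021, bnd=0.08288547619861457, ok=False, "
    "rel_shuff=1.4244751930236816, ok_shuff=False)'), "
    "AssertionError('AssertionError for optimizer.state.exp_avg.module."
    "decoder.layers.1.mlp.linear_fc1.weight: AssertionError on relative norm "
    "magnitude (rel=0.12013006210327148, bnd=0.08288547619861457, ok=False, "
    "rel_shuff=1.4374887943267822, ok_shuff=False)')"
)

rows = summarize_failures(sample)
for key, rel, bnd, ratio in rows:
    print(f"{ratio:5.2f}x  rel={rel:.4f}  bnd={bnd:.4f}  {key}")
```

Running the same function on the full pytest output should list every failing key sorted by how far rel exceeds bnd.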
E AssertionError: AssertionErrors comparing ['/tmp/pytest-of-root/pytest-0/base_checkpoint_session0/base_training/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights', '/tmp/pytest-of-root/pytest-0/test_distributed_training_grad0/parallel_dp2_cp1_tp1_pp1/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights']:
E [AssertionError("AssertionError comparing /tmp/pytest-of-root/pytest-0/base_checkpoint_session0/base_training/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights to /tmp/pytest-of-root/pytest-0/test_distributed_training_grad0/parallel_dp2_cp1_tp1_pp1/evo2/checkpoints/epoch=0-step=0-consumed_samples=0-last/weights:\nAssertion Errors found comparing keys: [AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.0.mixer.dense.weight: AssertionError on relative norm magnitude (rel=0.0995602235198021, bnd=0.08288547619861457, ok=False, rel_shuff=1.4244751930236816, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 20300 / 16777216 (0.1%)\\nGreatest absolute difference: 0.0004218118265271187 at index (0, 0, 15386801) (up to 1e-05 allowed)\\nGreatest relative difference: inf at index (0, 0, 332858) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[ 6.6002e-07, -9.3903e-27, 2.6702e-08, ..., 1.1502e-06,\\n 3.0168e-09, -6.8193e-07]]])\\nRight: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[ 6.8919e-07, -9.3031e-27, 2.5913e-08, ..., 1.0105e-06,\\n 1.0983e-08, -6.0682e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense.weight: AssertionError on relative norm magnitude (rel=0.0969272181391716, bnd=0.08288547619861457, ok=False, rel_shuff=1.4254369735717773, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 893 / 16777216 (0.0%)\\nGreatest absolute difference: 9.87960520433262e-05 at index (0, 0, 15387913) (up to 1e-05 allowed)\\nGreatest relative difference: 65.82613372802734 at index (0, 0, 15386834) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[-7.8326e-07, -1.1502e-06, 1.3009e-07, ..., 1.8897e-07,\\n 1.9718e-06, -5.5047e-07]]])\\nRight: torch.Size([1, 1, 16777216])/torch.float32 tensor([[[-9.3357e-07, -1.1697e-06, -5.4916e-09, ..., 1.6955e-07,\\n 1.8891e-06, 
-5.3818e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense_projection.layer_norm_weight: AssertionError on relative norm magnitude (rel=0.09405025094747543, bnd=0.08288547619861457, ok=False, rel_shuff=1.4441505670547485, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 10 / 4096 (0.2%)\\nGreatest absolute difference: 1.8577644368633628e-05 at index (0, 2282) (up to 1e-05 allowed)\\nGreatest relative difference: 0.3205079734325409 at index (0, 2138) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-1.4679e-06, 2.3333e-06, -1.7747e-06, ..., -3.3740e-06,\\n -3.1769e-06, -4.7982e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-1.1862e-06, 1.5267e-06, -1.5486e-06, ..., -3.4268e-06,\\n -1.8452e-06, -3.7343e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.dense_projection.weight: AssertionError on relative norm magnitude (rel=0.110261470079422, bnd=0.08288547619861457, ok=False, rel_shuff=1.4172718524932861, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 75 / 50331648 (0.0%)\\nGreatest absolute difference: 2.5135617761407048e-05 at index (0, 0, 31119373) (up to 1e-05 allowed)\\nGreatest relative difference: 11.342885971069336 at index (0, 0, 31118966) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[ 5.0323e-08, -6.6327e-09, -2.2936e-08, ..., -2.7250e-07,\\n -1.4310e-07, -8.8322e-08]]])\\nRight: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[ 6.1780e-08, -8.8380e-09, -2.5399e-08, ..., -3.0204e-07,\\n -1.6887e-07, -1.1944e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.mixer.conv_bias: AssertionError on relative norm magnitude (rel=0.09650460630655289, bnd=0.08288547619861457, ok=False, rel_shuff=1.4202098846435547, ok_shuff=False) but torch.testing.assert_close(left, right) passes. 
\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-4.3819e-07, -5.0392e-07, -1.5063e-07, ..., -2.1909e-07,\\n 1.4570e-06, 1.3967e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-4.7777e-07, -5.7662e-07, 0.0000e+00, ..., -1.5788e-07,\\n 1.1752e-06, 1.4717e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mixer.mixer.filter.h: AssertionError on relative norm magnitude (rel=0.11557027697563171, bnd=0.08288547619861457, ok=False, rel_shuff=1.0917565822601318, ok_shuff=False) but torch.testing.assert_close(left, right) passes. \\nLeft: torch.Size([1, 1, 32768])/torch.float32 tensor([[[-8.5994e-07, 1.3858e-06, 1.3639e-06, ..., 2.0595e-06,\\n 2.3005e-06, 3.3521e-06]]])\\nRight: torch.Size([1, 1, 32768])/torch.float32 tensor([[[-8.6218e-07, 1.1258e-06, 8.9513e-07, ..., 1.9660e-06,\\n 2.2406e-06, 3.4048e-06]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc1.layer_norm_weight: AssertionError on relative norm magnitude (rel=0.10146525502204895, bnd=0.08288547619861457, ok=False, rel_shuff=1.3731473684310913, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 2558 / 4096 (62.5%)\\nGreatest absolute difference: 0.00040685414569452405 at index (0, 2282) (up to 1e-05 allowed)\\nGreatest relative difference: inf at index (0, 21) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[-2.0332e-04, -4.7324e-05, 2.3557e-04, ..., 4.0839e-05,\\n 1.7668e-04, 4.9077e-04]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[-2.0385e-04, -4.2175e-05, 2.2634e-04, ..., 4.2175e-05,\\n 1.7222e-04, 4.9486e-04]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc1.weight: AssertionError on relative norm magnitude (rel=0.12013006210327148, bnd=0.08288547619861457, ok=False, rel_shuff=1.4374887943267822, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 2 / 90177536 
(0.0%)\\nGreatest absolute difference: 1.2533068002085201e-05 at index (0, 0, 40454378) (up to 1e-05 allowed)\\nGreatest relative difference: 35.659820556640625 at index (0, 0, 40454378) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 2.1772e-07, -6.9836e-07, 9.2019e-07, ..., -8.9828e-07,\\n 2.1581e-06, -2.2348e-06]],\\n\\n [[ 9.3663e-07, 1.9390e-06, 1.7856e-06, ..., 6.9836e-07,\\n -9.2019e-07, -7.4492e-07]]])\\nRight: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 3.2675e-07, -7.9079e-07, 1.1642e-06, ..., -8.0177e-07,\\n 1.9770e-06, -1.8452e-06]],\\n\\n [[ 5.3268e-07, 1.5376e-06, 1.4278e-06, ..., 6.5350e-07,\\n -8.8964e-07, -7.1940e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.1.mlp.linear_fc2.weight: AssertionError on relative norm magnitude (rel=0.10087946057319641, bnd=0.08288547619861457, ok=False, rel_shuff=1.429401159286499, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 97 / 45088768 (0.0%)\\nGreatest absolute difference: 3.1207731808535755e-05 at index (0, 0, 41354288) (up to 1e-05 allowed)\\nGreatest relative difference: 70.81315612792969 at index (0, 0, 41354260) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-4.7105e-07, -1.7199e-06, -3.0673e-07, ..., 6.8467e-07,\\n 3.8341e-07, -2.2594e-07]]])\\nRight: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-6.3153e-07, -1.7244e-06, -1.6200e-07, ..., 7.0842e-07,\\n 4.0363e-07, -2.1829e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.dense_projection.weight: AssertionError on relative norm magnitude (rel=0.0925312265753746, bnd=0.08288547619861457, ok=False, rel_shuff=1.4162689447402954, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 1029 / 50331648 (0.0%)\\nGreatest absolute difference: 9.991967817768455e-05 at index (0, 0, 34900202) (up to 1e-05 allowed)\\nGreatest relative 
difference: 29.17668914794922 at index (0, 0, 34898983) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[-5.9703e-07, 3.2727e-07, -1.3830e-07, ..., 1.2872e-07,\\n 2.2868e-07, -1.2598e-07]]])\\nRight: torch.Size([1, 1, 50331648])/torch.float32 tensor([[[-5.9584e-07, 2.7870e-07, -9.3357e-08, ..., 1.3866e-07,\\n 1.8397e-07, -1.0915e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.conv_bias: AssertionError on relative norm magnitude (rel=0.11542163044214249, bnd=0.08288547619861457, ok=False, rel_shuff=1.4485046863555908, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 98 / 4096 (2.4%)\\nGreatest absolute difference: 0.00015035035903565586 at index (0, 261) (up to 1e-05 allowed)\\nGreatest relative difference: 2.6042587757110596 at index (0, 2612) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 4096])/torch.float32 tensor([[ 1.0385e-05, -1.5268e-07, -1.9171e-07, ..., 1.9718e-06,\\n 4.0258e-07, 6.6166e-06]])\\nRight: torch.Size([1, 4096])/torch.float32 tensor([[ 8.9183e-06, -2.2653e-07, -1.0159e-07, ..., 2.1417e-06,\\n 3.5146e-07, 6.1506e-06]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.filter.gamma: AssertionError on relative norm magnitude (rel=0.10202537477016449, bnd=0.08288547619861457, ok=False, rel_shuff=1.4891784191131592, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 3 / 65536 (0.0%)\\nGreatest absolute difference: 1.5955651178956032e-05 at index (0, 0, 4178) (up to 1e-05 allowed)\\nGreatest relative difference: 0.24407503008842468 at index (0, 0, 4178) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 2.9167e-07, -1.1845e-07, -1.1845e-07, ..., 2.3856e-09,\\n 2.3963e-09, -1.7801e-08]]])\\nRight: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 3.0204e-07, -1.2013e-07, -1.2082e-07, ..., 2.6493e-09,\\n 2.6600e-09, -1.8277e-08]]])'), 
AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mixer.mixer.filter.p: AssertionError on relative norm magnitude (rel=0.10202537477016449, bnd=0.08288547619861457, ok=False, rel_shuff=1.4891784191131592, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 3 / 65536 (0.0%)\\nGreatest absolute difference: 1.5955651178956032e-05 at index (0, 0, 4178) (up to 1e-05 allowed)\\nGreatest relative difference: 0.24407503008842468 at index (0, 0, 4178) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 2.9167e-07, -1.1845e-07, -1.1845e-07, ..., 2.3856e-09,\\n 2.3963e-09, -1.7801e-08]]])\\nRight: torch.Size([1, 1, 65536])/torch.float32 tensor([[[ 3.0204e-07, -1.2013e-07, -1.2082e-07, ..., 2.6493e-09,\\n 2.6600e-09, -1.8277e-08]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.2.mlp.linear_fc1.weight: AssertionError on relative norm magnitude (rel=0.09465793520212173, bnd=0.08288547619861457, ok=False, rel_shuff=1.438188076019287, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 70 / 90177536 (0.0%)\\nGreatest absolute difference: 2.4612105335108936e-05 at index (1, 0, 35907595) (up to 1e-05 allowed)\\nGreatest relative difference: 2.262340784072876 at index (1, 0, 12875787) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[-1.4652e-07, -1.5610e-07, 4.0532e-07, ..., 3.4370e-07,\\n -1.7747e-06, 7.2232e-08]],\\n\\n [[-2.7934e-07, 2.4511e-07, 2.6565e-07, ..., -7.9969e-07,\\n -3.9437e-07, 2.5607e-07]]])\\nRight: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[-1.0228e-07, -1.7710e-07, 3.6519e-07, ..., 4.1187e-07,\\n -1.6035e-06, 1.6337e-07]],\\n\\n [[-2.5261e-07, 2.4987e-07, 3.0066e-07, ..., -8.9513e-07,\\n -6.2330e-07, 4.2560e-07]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.3.mlp.linear_fc1.weight: AssertionError on relative norm magnitude 
(rel=0.08877922594547272, bnd=0.08288547619861457, ok=False, rel_shuff=1.4074212312698364, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 190 / 90177536 (0.0%)\\nGreatest absolute difference: 4.655582597479224e-05 at index (1, 0, 10504426) (up to 1e-05 allowed)\\nGreatest relative difference: 12.370415687561035 at index (1, 0, 20887316) (up to 1.3e-06 allowed)\\nLeft: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 7.2575e-08, 1.6158e-07, -2.5247e-09, ..., 7.3944e-07,\\n -1.5446e-06, 6.0525e-07]],\\n\\n [[ 5.0118e-07, 8.4556e-08, 3.6424e-07, ..., 7.8052e-08,\\n 1.1667e-06, -1.5227e-06]]])\\nRight: torch.Size([2, 1, 45088768])/torch.float32 tensor([[[ 3.3636e-08, 1.4484e-07, 4.9768e-09, ..., 7.2489e-07,\\n -1.3235e-06, 5.6014e-07]],\\n\\n [[ 4.2285e-07, 1.4827e-07, 4.1462e-07, ..., 2.0868e-07,\\n 1.1532e-06, -1.6365e-06]]])'), AssertionError('AssertionError for optimizer.state.exp_avg.module.decoder.layers.3.mlp.linear_fc2.weight: AssertionError on relative norm magnitude (rel=0.08539237082004547, bnd=0.08288547619861457, ok=False, rel_shuff=1.415677547454834, ok_shuff=False): Tensor-likes are not close!\\n\\nMismatched elements: 1069 / 45088768 (0.0%)\\nGreatest absolute difference: 6.49754365440458e-05 at index (0, 0, 41351259) (up to 1e-05 allowed)\\nGreatest relative difference: 41.8884162902832 at index (0, 0, 41352960) (up to 1.3e-06 allowed)\\nLeft: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-4.3434e-09, -1.7887e-08, 9.9277e-09, ..., 9.3607e-10,\\n 6.3332e-09, 4.4075e-09]]])\\nRight: torch.Size([1, 1, 45088768])/torch.float32 tensor([[[-3.1319e-09, -1.7676e-08, 1.0125e-08, ..., 1.2764e-09,\\n 6.3926e-09, 5.6632e-09]]])')]")]
Docker Image
No response
System Information
Environment Details:
- Failing in CI (set the ciflow:multi-gpu label)
GPU Details:
- GPU Model: 2x RTX A6000
- GPU Memory: 48 GB each
Additional Context
No response