
flash_attention_v2_backward #10495

Merged: 5 commits into master, May 6, 2024

Conversation

cccddd77 (Contributor):

flash attention v2 backward operator
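For context, a minimal numpy sketch of the math a scaled-dot-product-attention backward has to compute for a single head. This is a naive reference only, not the fused kernel added in this PR and not OneFlow API; `q`, `k`, `v`, `do`, and the function name are hypothetical:

```python
import numpy as np

def attention_fwd_bwd(q, k, v, do, scale):
    """Naive single-head reference for what a fused attention backward computes."""
    # Forward: S = QK^T * scale, P = softmax(S), O = PV.
    s = (q @ k.T) * scale
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    o = p @ v

    # Backward, given the upstream gradient do = dL/dO:
    dv = p.T @ do                                         # dV = P^T dO
    dp = do @ v.T                                         # dP = dO V^T
    ds = p * (dp - (dp * p).sum(axis=-1, keepdims=True))  # softmax backward
    dq = (ds @ k) * scale                                 # dQ = dS K * scale
    dk = (ds.T @ q) * scale                               # dK = dS^T Q * scale
    return o, dq, dk, dv

# Example: one head, seq_len=128, head_dim=64.
rng = np.random.default_rng(0)
q, k, v, do = (rng.standard_normal((128, 64)) for _ in range(4))
o, dq, dk, dv = attention_fwd_bwd(q, k, v, do, scale=64 ** -0.5)
```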

Contributor:

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@cccddd77 requested review from oneflow-ci-bot and removed the request for oneflow-ci-bot on April 23, 2024 at 00:53
Contributor commented on the diff:

// Unpack gradient outputs of the backward op; (*output_)[0] holds grad_q_.
auto grad_k_ = (*output_)[1];
auto grad_v_ = (*output_)[2];

// auto grad_q_padded = JUST(functional::Transpose(grad_q_, {0, 2, 1, 3}));
Contributor:

Delete this commented-out line.
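(For reference, a minimal numpy sketch of what the commented-out `{0, 2, 1, 3}` permutation would do, assuming a hypothetical `(batch, num_heads, seq_len, head_dim)` layout:)

```python
import numpy as np

# Hypothetical gradient tensor in (batch, num_heads, seq_len, head_dim) layout.
grad_q = np.zeros((2, 8, 128, 64))

# Permutation (0, 2, 1, 3) swaps the head and sequence axes:
# (batch, num_heads, seq_len, head_dim) -> (batch, seq_len, num_heads, head_dim).
grad_q_padded = np.transpose(grad_q, (0, 2, 1, 3))
assert grad_q_padded.shape == (2, 128, 8, 64)
```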

if dtype == flow.float16:
    test_case.assertTrue(np.allclose(ref_out, fused_out, atol=1e-2, rtol=1e-2))
    error_tol = 1e-2
Contributor:

Isn't a tolerance of 1e-2 too large for float16?

cccddd77 (Contributor, Author):

This 1e-2 was taken from the existing reference:

test_case.assertTrue(np.allclose(ref_out, fused_out, atol=1e-2, rtol=1e-2))

and in testing, a tolerance of 1e-3 also failed to pass.
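A minimal sketch of the dtype-dependent tolerance pattern under discussion; the helper name and the float32 tolerance are illustrative, not from this PR:

```python
import numpy as np

def check_close(test_case, ref_out, fused_out, dtype):
    # float16 kernels accumulate noticeably more rounding error than float32,
    # so they get a looser tolerance (1e-2, per the discussion above).
    tol = 1e-2 if dtype == np.float16 else 1e-5  # float32 value is illustrative
    test_case.assertTrue(np.allclose(ref_out, fused_out, atol=tol, rtol=tol))
```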

github-actions bot commented May 6, 2024:

Speed stats:
GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.7ms (= 4371.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 58.0ms (= 5797.6ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.33 (= 58.0ms / 43.7ms)

OneFlow resnet50 time: 26.2ms (= 2616.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 38.1ms (= 3812.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.46 (= 38.1ms / 26.2ms)

OneFlow resnet50 time: 19.7ms (= 3932.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.3ms (= 7060.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.80 (= 35.3ms / 19.7ms)

OneFlow resnet50 time: 17.9ms (= 3571.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 31.5ms (= 6297.8ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.76 (= 31.5ms / 17.9ms)

OneFlow resnet50 time: 16.8ms (= 3353.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.5ms (= 5903.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.76 (= 29.5ms / 16.8ms)

OneFlow swin dataloader time: 0.201s (= 40.171s / 200, num_workers=1)
PyTorch swin dataloader time: 0.127s (= 25.467s / 200, num_workers=1)
Relative speed: 0.634 (= 0.127s / 0.201s)

OneFlow swin dataloader time: 0.054s (= 10.830s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.583s / 200, num_workers=4)
Relative speed: 0.608 (= 0.033s / 0.054s)

OneFlow swin dataloader time: 0.031s (= 6.216s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.320s / 200, num_workers=8)
Relative speed: 0.534 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 49.2ms (= 4924.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.9ms (= 6586.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 65.9ms / 49.2ms)

OneFlow resnet50 time: 36.4ms (= 3638.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.1ms (= 4710.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.29 (= 47.1ms / 36.4ms)

OneFlow resnet50 time: 27.9ms (= 5587.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 42.5ms (= 8501.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 42.5ms / 27.9ms)

OneFlow resnet50 time: 25.5ms (= 5100.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.0ms (= 7800.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 39.0ms / 25.5ms)

OneFlow resnet50 time: 24.5ms (= 4901.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.7ms (= 7149.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 35.7ms / 24.5ms)

@MARD1NO merged commit ea585f6 into master on May 6, 2024. 20 checks passed.
@MARD1NO deleted the flash_attention_v2_backward branch on May 6, 2024 at 10:30.