update vdot #778

Closed

meinie0826 wants to merge 1 commit from the op/vdot branch.

Conversation

meinie0826 (Collaborator) commented Jul 12, 2025

PR Category

Operator

Type of Change

Performance Optimization

Description

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT (see the correctness sketch below).
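For reference, a minimal correctness check along the lines of the UT item above might look like the following. This is an illustrative sketch, not the PR's actual test: it assumes the `flag_gems.use_gems()` context manager described in the FlagGems README, and it uses eager `torch.vdot` as the reference (for complex inputs, `vdot` conjugates its first argument). The dtypes match the ones benchmarked below.

```python
import torch
import flag_gems

def check_vdot(dtype, n=4096):
    a = torch.randn(n, dtype=dtype, device="cuda")
    b = torch.randn(n, dtype=dtype, device="cuda")
    ref = torch.vdot(a, b)            # eager PyTorch reference result
    with flag_gems.use_gems():        # dispatch aten ops to FlagGems kernels
        res = torch.vdot(a, b)
    # loose tolerances: the reduction order may differ between implementations
    torch.testing.assert_close(res, ref, rtol=1e-2, atol=1e-2)

for dt in (torch.float16, torch.float32, torch.bfloat16, torch.complex64):
    check_vdot(dt)
```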

Performance

Before:

benchmark/test_blas_perf.py 
Operator: vdot  Performance Test (dtype=torch.complex64, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008544            0.009280               0.921          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008672            0.009664               0.897          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008736            0.009632               0.907          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008800            0.009792               0.899          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.010080            0.010368               0.972          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008576            0.023136               0.371          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008640            0.023840               0.362          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008736            0.022624               0.386          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008768            0.022912               0.383          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009760            0.024192               0.403          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008576            0.008896               0.964          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008704            0.009152               0.951          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008672            0.009136               0.949          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008736            0.009280               0.941          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009824            0.009632               1.020          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008576            0.022688               0.378          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008640            0.023008               0.376          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008736            0.023616               0.370          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008768            0.022848               0.384          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009760            0.023328               0.418          [torch.Size([65536]), torch.Size([65536])]

After the update, the performance test results are as follows:

benchmark/test_blas_perf.py 
Operator: vdot  Performance Test (dtype=torch.complex64, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008640            0.009600               0.900          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008832            0.009808               0.900          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008864            0.009856               0.899          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008928            0.010032               0.890          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.010304            0.010592               0.973          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008800            0.008768               1.004          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008864            0.008960               0.989          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008896            0.009056               0.982          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008896            0.009024               0.986          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009952            0.009344               1.065          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008768            0.008736               1.004          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008864            0.008896               0.996          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008832            0.009056               0.975          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008864            0.008992               0.986          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009920            0.009408               1.054          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008800            0.008736               1.007          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008864            0.008928               0.993          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008896            0.009056               0.982          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008896            0.009072               0.981          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009952            0.009344               1.065          [torch.Size([65536]), torch.Size([65536])]
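
For a rough sense of how per-call latencies like those above can be measured, here is an illustrative CUDA-event timing sketch. It is not the project's benchmark harness (the numbers in the tables come from benchmark/test_blas_perf.py); the shape and dtype are taken from the largest float16 case above.

```python
import torch

def time_ms(fn, iters=100, warmup=10):
    for _ in range(warmup):                 # exclude one-time JIT/caching cost
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()                # wait for the recorded events
    return start.elapsed_time(end) / iters  # average milliseconds per call

a = torch.randn(65536, dtype=torch.float16, device="cuda")
b = torch.randn_like(a)
print(f"torch.vdot latency: {time_ms(lambda: torch.vdot(a, b)):.6f} ms")
```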

meinie0826 closed this Jul 12, 2025
meinie0826 deleted the op/vdot branch July 12, 2025 16:47