update vdot #778

Closed

meinie0826 wants to merge 1 commit from the op/vdot branch.

Conversation

meinie0826 (Collaborator) commented Jul 12, 2025

PR Category

Operator

Type of Change

Performance Optimization

Description

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT (see the correctness sketch below).
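For reference, a minimal correctness check along the lines of the UT item above might look like the following. This is an illustrative sketch, not the PR's actual test: it assumes the `flag_gems.use_gems()` context manager described in the FlagGems README, and it uses eager `torch.vdot` as the reference (for complex inputs, `vdot` conjugates its first argument). The dtypes match the ones benchmarked below.

```python
import torch
import flag_gems

def check_vdot(dtype, n=4096):
    a = torch.randn(n, dtype=dtype, device="cuda")
    b = torch.randn(n, dtype=dtype, device="cuda")
    ref = torch.vdot(a, b)            # eager PyTorch reference result
    with flag_gems.use_gems():        # dispatch aten ops to FlagGems kernels
        res = torch.vdot(a, b)
    # loose tolerances: the reduction order may differ between implementations
    torch.testing.assert_close(res, ref, rtol=1e-2, atol=1e-2)

for dt in (torch.float16, torch.float32, torch.bfloat16, torch.complex64):
    check_vdot(dt)
```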

Performance

Before:

benchmark/test_blas_perf.py 
Operator: vdot  Performance Test (dtype=torch.complex64, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008544            0.009280               0.921          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008672            0.009664               0.897          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008736            0.009632               0.907          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008800            0.009792               0.899          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.010080            0.010368               0.972          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008576            0.023136               0.371          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008640            0.023840               0.362          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008736            0.022624               0.386          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008768            0.022912               0.383          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009760            0.024192               0.403          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008576            0.008896               0.964          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008704            0.009152               0.951          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008672            0.009136               0.949          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008736            0.009280               0.941          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009824            0.009632               1.020          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008576            0.022688               0.378          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008640            0.023008               0.376          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008736            0.023616               0.370          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008768            0.022848               0.384          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009760            0.023328               0.418          [torch.Size([65536]), torch.Size([65536])]

After the update, the performance test results are as follows:

benchmark/test_blas_perf.py 
Operator: vdot  Performance Test (dtype=torch.complex64, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008640            0.009600               0.900          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008832            0.009808               0.900          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008864            0.009856               0.899          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008928            0.010032               0.890          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.010304            0.010592               0.973          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008800            0.008768               1.004          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008864            0.008960               0.989          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008896            0.009056               0.982          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008896            0.009024               0.986          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009952            0.009344               1.065          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008768            0.008736               1.004          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008864            0.008896               0.996          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008832            0.009056               0.975          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008864            0.008992               0.986          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009920            0.009408               1.054          [torch.Size([65536]), torch.Size([65536])]


Operator: vdot  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Size Detail
-----------------------------------------------------------------------------------------------
SUCCESS               0.008800            0.008736               1.007          [torch.Size([64]), torch.Size([64])]
SUCCESS               0.008864            0.008928               0.993          [torch.Size([1024]), torch.Size([1024])]
SUCCESS               0.008896            0.009056               0.982          [torch.Size([2048]), torch.Size([2048])]
SUCCESS               0.008896            0.009072               0.981          [torch.Size([4096]), torch.Size([4096])]
SUCCESS               0.009952            0.009344               1.065          [torch.Size([65536]), torch.Size([65536])]
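
For a rough sense of how per-call latencies like those above can be measured, here is an illustrative CUDA-event timing sketch. It is not the project's benchmark harness (the numbers in the tables come from benchmark/test_blas_perf.py); the shape and dtype are taken from the largest float16 case above.

```python
import torch

def time_ms(fn, iters=100, warmup=10):
    for _ in range(warmup):                 # exclude one-time JIT/caching cost
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()                # wait for the recorded events
    return start.elapsed_time(end) / iters  # average milliseconds per call

a = torch.randn(65536, dtype=torch.float16, device="cuda")
b = torch.randn_like(a)
print(f"torch.vdot latency: {time_ms(lambda: torch.vdot(a, b)):.6f} ms")
```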

meinie0826 closed this Jul 12, 2025
meinie0826 deleted the op/vdot branch July 12, 2025 16:47