update select_scatter #777


Merged

merged 1 commit into master from op/select_scatter on Jul 14, 2025

Conversation

meinie0826 (Collaborator)

PR Category

Operator

Type of Change

Performance Optimization

Description

Merge the kernels used by `select_scatter` into a single kernel.
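For context, `select_scatter` returns a copy of the input with a source tensor written into the slice selected by an index along a given dimension. A minimal pure-Python sketch of these semantics (an assumption here: this mirrors `torch.select_scatter(input, src, dim, index)` for the 2-D cases exercised by the benchmark below):

```python
def select_scatter_2d(inp, src, dim, index):
    """Return a copy of `inp` (a 2-D list) with `src` written into the
    slice selected by `index` along `dim`, mirroring select_scatter."""
    out = [row[:] for row in inp]  # copy so the input is left untouched
    if dim == 0:
        out[index] = list(src)       # overwrite one row
    else:
        for i, v in enumerate(src):  # overwrite one column
            out[i][index] = v
    return out

inp = [[0, 0, 0], [0, 0, 0]]
src = [1, 2, 3]
print(select_scatter_2d(inp, src, 0, 1))  # [[0, 0, 0], [1, 2, 3]]
```

The names `select_scatter_2d`, `inp`, and `src` are illustrative only; the actual implementation in this PR is a fused GPU kernel, not Python.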

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT.

Performance

benchmark/test_select_and_slice_perf.py 
Operator: select_scatter  Performance Test (dtype=torch.float16, mode=cuda, level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.008896            0.005856               1.519               0.950               1.443          [torch.Size([64, 64]), torch.Size([64]), 1, 17]
SUCCESS               0.009600            0.006048               1.587              13.760              21.841          [torch.Size([256, 256]), torch.Size([256]), 1, 20]
SUCCESS               0.012352            0.008160               1.514             170.114             257.506          [torch.Size([1024, 1024]), torch.Size([1024]), 1, 654]
SUCCESS               0.034752            0.029536               1.177             966.011            1136.607          [torch.Size([4096, 4096]), torch.Size([4096]), 1, 3715]
SUCCESS               0.099904            0.094848               1.053            1343.508            1415.126          [torch.Size([1024, 65536]), torch.Size([1024]), 1, 23580]
SUCCESS               0.015296            0.010528               1.453             337.343             490.122          [torch.Size([10000, 256]), torch.Size([10000]), 1, 224]
SUCCESS               0.870112            0.863968               1.007            1506.427            1517.139          [torch.Size([10000, 65536]), torch.Size([10000]), 1, 14848]


Operator: select_scatter  Performance Test (dtype=torch.float32, mode=cuda, level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.008832            0.006592               1.340               1.913               2.563          [torch.Size([64, 64]), torch.Size([64]), 1, 13]
SUCCESS               0.009952            0.006976               1.427              26.547              37.872          [torch.Size([256, 256]), torch.Size([256]), 1, 158]
SUCCESS               0.012896            0.009888               1.304             325.876             425.010          [torch.Size([1024, 1024]), torch.Size([1024]), 1, 389]
SUCCESS               0.055712            0.051872               1.074            1205.156            1294.371          [torch.Size([4096, 4096]), torch.Size([4096]), 1, 517]
SUCCESS               0.186912            0.184544               1.013            1436.203            1454.632          [torch.Size([1024, 65536]), torch.Size([1024]), 1, 51497]
SUCCESS               0.017408            0.014176               1.228             592.831             727.991          [torch.Size([10000, 256]), torch.Size([10000]), 1, 43]
SUCCESS               1.731968            1.730288               1.001            1513.608            1515.077          [torch.Size([10000, 65536]), torch.Size([10000]), 1, 58823]


Operator: select_scatter  Performance Test (dtype=torch.bfloat16, mode=cuda, level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.008896            0.006144               1.448               0.950               1.375          [torch.Size([64, 64]), torch.Size([64]), 1, 26]
SUCCESS               0.009632            0.006336               1.520              13.714              20.848          [torch.Size([256, 256]), torch.Size([256]), 1, 23]
SUCCESS               0.012288            0.007808               1.574             171.000             269.115          [torch.Size([1024, 1024]), torch.Size([1024]), 1, 41]
SUCCESS               0.034752            0.029184               1.191             966.011            1150.316          [torch.Size([4096, 4096]), torch.Size([4096]), 1, 3040]
SUCCESS               0.100096            0.095104               1.052            1340.931            1411.316          [torch.Size([1024, 65536]), torch.Size([1024]), 1, 20529]
SUCCESS               0.014816            0.009856               1.503             348.272             523.539          [torch.Size([10000, 256]), torch.Size([10000]), 1, 42]
SUCCESS               0.870544            0.864128               1.007            1505.679            1516.859          [torch.Size([10000, 65536]), torch.Size([10000]), 1, 33231]
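The "Gems Speedup" column above is simply the ratio of the two latency columns, which can be sanity-checked directly. (The GBPS columns divide bytes moved by latency, but the exact byte accounting is not shown here, so only the speedup is reproduced in this sketch.)

```python
def speedup(torch_latency_ms, gems_latency_ms):
    """Speedup of the Gems kernel over the Torch baseline:
    ratio of the two measured latencies."""
    return torch_latency_ms / gems_latency_ms

# First float16 row: 0.008896 ms (Torch) vs 0.005856 ms (Gems)
print(round(speedup(0.008896, 0.005856), 3))  # 1.519
```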


@iclementine (Collaborator) left a comment:


LGTM

@iclementine merged commit ad109d6 into master on Jul 14, 2025
10 of 14 checks passed
@iclementine deleted the op/select_scatter branch on July 14, 2025 at 08:39