
Opt sum dim for middle dim case #788


Open · wants to merge 3 commits into master

Conversation


@0x45f (Collaborator) commented on Jul 15, 2025

PR Category

Operator

Type of Change

Performance Optimization

Description

Optimize sum along a dim for the middle-dim case, i.e. when the reduced dimension is neither the first nor the last (e.g. dim=1 of a 3-D tensor (M, N, K)).
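
For intuition, here is a minimal Triton sketch of the middle-dim strategy (an illustration under assumed names, not the PR's actual kernel): view the input as (M, N, K) with N the reduced middle dimension, give each program one m and one tile of K, and loop over N, so every load and store stays contiguous along K.

import torch
import triton
import triton.language as tl

# Hypothetical illustration of a middle-dim sum kernel; assumes a
# contiguous 3-D input viewed as (M, N, K) with dim=1 (N) reduced.
@triton.jit
def sum_mid_dim_kernel(in_ptr, out_ptr, M, N, K,
                       BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)  # one program per index of the leading dim
    pid_k = tl.program_id(1)  # ... and per tile of the trailing dim
    k_off = pid_k * BLOCK_K + tl.arange(0, BLOCK_K)
    k_mask = k_off < K
    acc = tl.zeros([BLOCK_K], dtype=tl.float32)  # accumulate in fp32
    for n_start in range(0, N, BLOCK_N):  # walk the reduced middle dim
        n_off = n_start + tl.arange(0, BLOCK_N)
        mask = (n_off[:, None] < N) & k_mask[None, :]
        ptrs = in_ptr + pid_m * N * K + n_off[:, None] * K + k_off[None, :]
        vals = tl.load(ptrs, mask=mask, other=0.0).to(tl.float32)
        acc += tl.sum(vals, axis=0)
    tl.store(out_ptr + pid_m * K + k_off,
             acc.to(out_ptr.dtype.element_ty), mask=k_mask)

def sum_mid_dim(x):
    # x.sum(dim=1) for a contiguous 3-D CUDA tensor x of shape (M, N, K)
    M, N, K = x.shape
    out = torch.empty((M, K), device=x.device, dtype=x.dtype)
    grid = (M, triton.cdiv(K, 256))
    sum_mid_dim_kernel[grid](x, out, M, N, K, BLOCK_N=32, BLOCK_K=256)
    return out

Parallelizing over both M and K, rather than reducing along the non-contiguous N within a single program per row, is what recovers coalesced memory access for these shapes.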

Issue

Progress

  • Change is properly reviewed (1 reviewer required, 2 recommended).
  • Change responds to an issue.
  • Change is fully covered by a UT (a minimal test sketch follows this list).
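
A minimal sketch of such a UT (hypothetical, assuming the flag_gems.use_gems() context manager documented in the project README; the repository's real tests are more thorough):

import pytest
import torch
import flag_gems

# Hypothetical correctness check for the middle-dim path; the shapes
# are taken from the benchmark tables below.
@pytest.mark.parametrize("shape", [(64, 512, 512), (1000, 8, 2048)])
@pytest.mark.parametrize("dtype", [torch.float16, torch.float32, torch.bfloat16])
def test_sum_mid_dim(shape, dtype):
    x = torch.randn(shape, device="cuda", dtype=dtype)
    ref = torch.sum(x.to(torch.float64), dim=1).to(dtype)  # fp64 reference
    with flag_gems.use_gems():  # route torch.sum to the Gems kernel
        out = torch.sum(x, dim=1)
    torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)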

Performance

Shapes used in Qwen3

  • before
benchmark/test_reduction_perf.py 
Operator: sum  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.019328            0.015136               1.277             217.007             277.108          [torch.Size([1048576])]
SUCCESS               0.010656            0.007616               1.399               1.538               2.151          [torch.Size([64, 64]), 1]
SUCCESS               0.050912            0.043744               1.164            1318.135            1534.127          [torch.Size([4096, 4096]), 1]
SUCCESS               0.050752            0.224096               0.226            1322.290             299.465          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.494208           13.030944               0.115            2874.411             329.598          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.010720            0.037856               0.283               6.113               1.731          [torch.Size([1, 8, 2048]), 1]
SUCCESS               0.011168            0.102112               0.109             281.673              30.807          [torch.Size([48, 8, 2048]), 1]
SUCCESS               0.443904           30.461409               0.015            2357.739              34.359          [torch.Size([15970, 8, 2048]), 1]
SUCCESS               0.045696            1.912704               0.024            1434.174              34.264          [torch.Size([1000, 8, 2048]), 1]
SUCCESS               0.043232            1.783488               0.024            1412.832              34.247          [torch.Size([932, 8, 2048]), 1]
SUCCESS               0.037536            1.530528               0.025            1395.015              34.213          [torch.Size([799, 8, 2048]), 1]
SUCCESS               0.034624            1.342304               0.026            1324.954              34.176          [torch.Size([700, 8, 2048]), 1]
SUCCESS               0.031872            1.193696               0.027            1278.972              34.149          [torch.Size([622, 8, 2048]), 1]
SUCCESS               0.040192            1.621792               0.025            1381.096              34.227          [torch.Size([847, 8, 2048]), 1]
SUCCESS               0.455520           31.200191               0.015            2357.178              34.415          [torch.Size([16384, 8, 2048]), 1]
SUCCESS               0.023488            0.728192               0.032            1054.692              34.019          [torch.Size([378, 8, 2048]), 1]
SUCCESS               0.020256            0.580064               0.035             970.616              33.894          [torch.Size([300, 8, 2048]), 1]
SUCCESS               0.016448            0.392384               0.042             800.872              33.571          [torch.Size([201, 8, 2048]), 1]
SUCCESS               0.012064            0.140384               0.086             369.401              31.745          [torch.Size([68, 8, 2048]), 1]


Operator: sum  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.017536            0.014976               1.171             478.365             560.137          [torch.Size([1048576])]
SUCCESS               0.008928            0.007136               1.251               3.670               4.592          [torch.Size([64, 64]), 1]
SUCCESS               0.076064            0.072800               1.045            1764.537            1843.650          [torch.Size([4096, 4096]), 1]
SUCCESS               0.072448            0.264256               0.274            1852.608             507.908          [torch.Size([64, 512, 512]), 1]
SUCCESS               2.905344           28.224545               0.103            2956.598             304.343          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.009664            0.038048               0.254              13.563               3.445          [torch.Size([1, 8, 2048]), 1]
SUCCESS               0.012832            0.092960               0.138             490.294              67.679          [torch.Size([48, 8, 2048]), 1]
SUCCESS               0.857440           27.213600               0.032            2441.244              76.918          [torch.Size([15970, 8, 2048]), 1]
SUCCESS               0.075840            1.725344               0.044            1728.270              75.969          [torch.Size([1000, 8, 2048]), 1]
SUCCESS               0.072960            1.608928               0.045            1674.330              75.926          [torch.Size([932, 8, 2048]), 1]
SUCCESS               0.064320            1.381568               0.047            1628.211              75.803          [torch.Size([799, 8, 2048]), 1]
SUCCESS               0.058656            1.211904               0.048            1564.212              75.708          [torch.Size([700, 8, 2048]), 1]
SUCCESS               0.053408            1.079136               0.049            1526.490              75.548          [torch.Size([622, 8, 2048]), 1]
SUCCESS               0.067904            1.462688               0.046            1634.925              75.900          [torch.Size([847, 8, 2048]), 1]
SUCCESS               0.879264           27.874912               0.032            2442.365              77.040          [torch.Size([16384, 8, 2048]), 1]
SUCCESS               0.036096            0.661120               0.055            1372.596              74.941          [torch.Size([378, 8, 2048]), 1]
SUCCESS               0.030112            0.525440               0.057            1305.845              74.836          [torch.Size([300, 8, 2048]), 1]
SUCCESS               0.023424            0.353024               0.066            1124.721              74.628          [torch.Size([201, 8, 2048]), 1]
SUCCESS               0.014496            0.126816               0.114             614.852              70.282          [torch.Size([68, 8, 2048]), 1]


Operator: sum  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.016000            0.013024               1.229             262.144             322.044          [torch.Size([1048576])]
SUCCESS               0.008704            0.007136               1.220               1.882               2.296          [torch.Size([64, 64]), 1]
SUCCESS               0.050976            0.042912               1.188            1316.480            1563.872          [torch.Size([4096, 4096]), 1]
SUCCESS               0.050432            0.223776               0.225            1330.680             299.893          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.492352           12.973152               0.115            2877.985             331.066          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.009664            0.039008               0.248               6.781               1.680          [torch.Size([1, 8, 2048]), 1]
SUCCESS               0.012224            0.100288               0.122             257.340              31.367          [torch.Size([48, 8, 2048]), 1]
SUCCESS               0.444544           29.853121               0.015            2354.345              35.059          [torch.Size([15970, 8, 2048]), 1]
SUCCESS               0.045280            1.875168               0.024            1447.350              34.949          [torch.Size([1000, 8, 2048]), 1]
SUCCESS               0.043264            1.748832               0.025            1411.787              34.926          [torch.Size([932, 8, 2048]), 1]
SUCCESS               0.038528            1.500448               0.026            1359.096              34.898          [torch.Size([799, 8, 2048]), 1]
SUCCESS               0.034624            1.316128               0.026            1324.954              34.856          [torch.Size([700, 8, 2048]), 1]
SUCCESS               0.031968            1.170432               0.027            1275.131              34.828          [torch.Size([622, 8, 2048]), 1]
SUCCESS               0.039584            1.589952               0.025            1402.309              34.912          [torch.Size([847, 8, 2048]), 1]
SUCCESS               0.455488           30.574944               0.015            2357.344              35.118          [torch.Size([16384, 8, 2048]), 1]
SUCCESS               0.022720            0.714016               0.032            1090.344              34.695          [torch.Size([378, 8, 2048]), 1]
SUCCESS               0.020288            0.568896               0.036             969.085              34.560          [torch.Size([300, 8, 2048]), 1]
SUCCESS               0.016544            0.384992               0.043             796.224              34.216          [torch.Size([201, 8, 2048]), 1]
SUCCESS               0.013152            0.137824               0.095             338.842              32.334          [torch.Size([68, 8, 2048]), 1]
  • after
benchmark/test_reduction_perf.py 
Operator: sum  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.019392            0.015104               1.284             216.290             277.695          [torch.Size([1048576])]
SUCCESS               0.011200            0.007488               1.496               1.463               2.188          [torch.Size([64, 64]), 1]
SUCCESS               0.050144            0.043104               1.163            1338.323            1556.906          [torch.Size([4096, 4096]), 1]
SUCCESS               0.051520            0.062976               0.818            1302.579            1065.626          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.494752            1.561664               0.957            2873.364            2750.251          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.010080            0.008960               1.125               6.502               7.314          [torch.Size([1, 8, 2048]), 1]
SUCCESS               0.011616            0.015776               0.736             270.810             199.400          [torch.Size([48, 8, 2048]), 1]
SUCCESS               0.444064            0.631488               0.703            2356.890            1657.371          [torch.Size([15970, 8, 2048]), 1]
SUCCESS               0.045760            0.064192               0.713            1432.168            1020.937          [torch.Size([1000, 8, 2048]), 1]
SUCCESS               0.043104            0.061856               0.697            1417.027             987.447          [torch.Size([932, 8, 2048]), 1]
SUCCESS               0.037504            0.053280               0.704            1396.205             982.794          [torch.Size([799, 8, 2048]), 1]
SUCCESS               0.034528            0.049184               0.702            1328.638             932.726          [torch.Size([700, 8, 2048]), 1]
SUCCESS               0.030912            0.044576               0.693            1318.691             914.469          [torch.Size([622, 8, 2048]), 1]
SUCCESS               0.039520            0.056416               0.701            1404.580             983.923          [torch.Size([847, 8, 2048]), 1]
SUCCESS               0.454944            0.647168               0.703            2360.163            1659.139          [torch.Size([16384, 8, 2048]), 1]
SUCCESS               0.022592            0.031456               0.718            1096.521             787.532          [torch.Size([378, 8, 2048]), 1]
SUCCESS               0.020224            0.027744               0.729             972.152             708.650          [torch.Size([300, 8, 2048]), 1]
SUCCESS               0.017216            0.020480               0.841             765.145             643.200          [torch.Size([201, 8, 2048]), 1]
SUCCESS               0.012032            0.014240               0.845             370.383             312.953          [torch.Size([68, 8, 2048]), 1]


Operator: sum  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.017504            0.014848               1.179             479.240             564.966          [torch.Size([1048576])]
SUCCESS               0.009568            0.007488               1.278               3.425               4.376          [torch.Size([64, 64]), 1]
SUCCESS               0.076768            0.072352               1.061            1748.355            1855.066          [torch.Size([4096, 4096]), 1]
SUCCESS               0.072384            0.098624               0.734            1854.246            1360.903          [torch.Size([64, 512, 512]), 1]
SUCCESS               2.904928            3.015520               0.963            2957.022            2848.575          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.010528            0.008192               1.285              12.450              16.000          [torch.Size([1, 8, 2048]), 1]
SUCCESS               0.013728            0.011712               1.172             458.294             537.180          [torch.Size([48, 8, 2048]), 1]
SUCCESS               0.857568            0.855520               1.002            2440.879            2446.722          [torch.Size([15970, 8, 2048]), 1]
SUCCESS               0.075776            0.074208               1.021            1729.730            1766.279          [torch.Size([1000, 8, 2048]), 1]
SUCCESS               0.072160            0.070144               1.029            1692.892            1741.548          [torch.Size([932, 8, 2048]), 1]
SUCCESS               0.064288            0.063104               1.019            1629.021            1659.586          [torch.Size([799, 8, 2048]), 1]
SUCCESS               0.058560            0.056768               1.032            1566.776            1616.234          [torch.Size([700, 8, 2048]), 1]
SUCCESS               0.052704            0.051296               1.027            1546.880            1589.340          [torch.Size([622, 8, 2048]), 1]
SUCCESS               0.067872            0.065312               1.039            1635.696            1699.810          [torch.Size([847, 8, 2048]), 1]
SUCCESS               0.878304            0.877088               1.001            2445.035            2448.424          [torch.Size([16384, 8, 2048]), 1]
SUCCESS               0.035200            0.034944               1.007            1407.535            1417.846          [torch.Size([378, 8, 2048]), 1]
SUCCESS               0.030016            0.028384               1.057            1310.021            1385.344          [torch.Size([300, 8, 2048]), 1]
SUCCESS               0.024160            0.021152               1.142            1090.458            1245.531          [torch.Size([201, 8, 2048]), 1]
SUCCESS               0.014656            0.013280               1.104             608.140             671.152          [torch.Size([68, 8, 2048]), 1]


Operator: sum  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.016672            0.012864               1.296             251.578             326.050          [torch.Size([1048576])]
SUCCESS               0.008800            0.008224               1.070               1.862               1.992          [torch.Size([64, 64]), 1]
SUCCESS               0.050208            0.043680               1.149            1336.617            1536.375          [torch.Size([4096, 4096]), 1]
SUCCESS               0.051136            0.063808               0.801            1312.360            1051.731          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.492736            1.560544               0.957            2877.245            2752.224          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.009888            0.007840               1.261               6.628               8.359          [torch.Size([1, 8, 2048]), 1]
SUCCESS               0.011808            0.015232               0.775             266.407             206.521          [torch.Size([48, 8, 2048]), 1]
SUCCESS               0.444704            0.632576               0.703            2353.498            1654.520          [torch.Size([15970, 8, 2048]), 1]
SUCCESS               0.045760            0.064000               0.715            1432.168            1024.000          [torch.Size([1000, 8, 2048]), 1]
SUCCESS               0.043232            0.060896               0.710            1412.832            1003.014          [torch.Size([932, 8, 2048]), 1]
SUCCESS               0.038432            0.053056               0.724            1362.491             986.943          [torch.Size([799, 8, 2048]), 1]
SUCCESS               0.034528            0.049152               0.702            1328.638             933.333          [torch.Size([700, 8, 2048]), 1]
SUCCESS               0.031840            0.045216               0.704            1280.257             901.526          [torch.Size([622, 8, 2048]), 1]
SUCCESS               0.040160            0.056224               0.714            1382.196             987.283          [torch.Size([847, 8, 2048]), 1]
SUCCESS               0.454816            0.647296               0.703            2360.827            1658.811          [torch.Size([16384, 8, 2048]), 1]
SUCCESS               0.023360            0.032320               0.723            1060.471             766.479          [torch.Size([378, 8, 2048]), 1]
SUCCESS               0.020896            0.026976               0.775             940.888             728.826          [torch.Size([300, 8, 2048]), 1]
SUCCESS               0.016576            0.021184               0.782             794.687             621.825          [torch.Size([201, 8, 2048]), 1]
SUCCESS               0.012064            0.013760               0.877             369.401             323.870          [torch.Size([68, 8, 2048]), 1]
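
For reference, the middle-dim pattern these tables exercise reduces the size-8 dim=1 of each Qwen3 shape; a short repro, assuming the flag_gems.enable() patch documented in the README:

import torch
import flag_gems

flag_gems.enable()  # patch aten ops so torch.sum dispatches to Gems

# One of the Qwen3 shapes from the tables above; dim=1 is the middle dim.
x = torch.randn(15970, 8, 2048, device="cuda", dtype=torch.float16)
out = torch.sum(x, dim=1)  # -> shape (15970, 2048)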

FlagGems perf test

  • before
Operator: sum  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.018912            0.014848               1.274             221.780             282.483          [torch.Size([1048576])]
SUCCESS               0.011296            0.008640               1.307               1.450               1.896          [torch.Size([64, 64]), 1]
SUCCESS               0.050144            0.042784               1.172            1338.323            1568.550          [torch.Size([4096, 4096]), 1]
SUCCESS               0.050688            0.224000               0.226            1323.960             299.593          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.493664           13.032288               0.115            2875.457             329.564          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.016736            0.012928               1.295             250.860             324.752          [torch.Size([1049600])]
SUCCESS               1.557536            1.508032               1.033            2757.540            2848.061          [torch.Size([1073741824])]
SUCCESS               0.008768            0.007808               1.123               0.467               0.525          [torch.Size([1024, 1]), 1]
SUCCESS               0.009472            0.008736               1.084               6.919               7.502          [torch.Size([1024, 16]), 1]
SUCCESS               0.009728            0.008160               1.192             107.789             128.502          [torch.Size([1024, 256]), 1]
SUCCESS               0.028608            0.016928               1.690             586.452             991.093          [torch.Size([1024, 4096]), 1]
SUCCESS               0.127520            0.120896               1.055            2105.046            2220.383          [torch.Size([1024, 65536]), 1]
SUCCESS               1.627936            1.532992               1.062            2638.290            2801.689          [torch.Size([1024, 1048576]), 1]
SUCCESS               0.008640            0.009152               0.944               1.896               1.790          [torch.Size([64, 1, 64]), 1]
SUCCESS               0.011424            0.036096               0.316              22.947               7.262          [torch.Size([64, 16, 64]), 1]
SUCCESS               0.019360            0.037632               0.514             216.648             111.456          [torch.Size([64, 256, 64]), 1]
SUCCESS               0.057312            0.217120               0.264            1170.939             309.087          [torch.Size([64, 4096, 64]), 1]


Operator: sum  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.017408            0.014880               1.170             481.882             563.751          [torch.Size([1048576])]
SUCCESS               0.010080            0.007104               1.419               3.251               4.613          [torch.Size([64, 64]), 1]
SUCCESS               0.076672            0.073504               1.043            1750.544            1825.992          [torch.Size([4096, 4096]), 1]
SUCCESS               0.072480            0.264192               0.274            1851.790             508.031          [torch.Size([64, 512, 512]), 1]
SUCCESS               2.904800           28.293217               0.103            2957.152             303.604          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.017760            0.014976               1.186             472.793             560.684          [torch.Size([1049600])]
SUCCESS               2.974144            3.059712               0.972            2888.204            2807.432          [torch.Size([1073741824])]
SUCCESS               0.008608            0.007232               1.190               0.952               1.133          [torch.Size([1024, 1]), 1]
SUCCESS               0.009408            0.008128               1.157              13.932              16.126          [torch.Size([1024, 16]), 1]
SUCCESS               0.009984            0.009248               1.080             210.051             226.768          [torch.Size([1024, 256]), 1]
SUCCESS               0.035616            0.025504               1.396             942.117            1315.654          [torch.Size([1024, 4096]), 1]
SUCCESS               0.214912            0.214144               1.004            2498.097            2507.056          [torch.Size([1024, 65536]), 1]
SUCCESS               3.222368            2.989184               1.078            2665.721            2873.672          [torch.Size([1024, 1048576]), 1]
SUCCESS               0.008640            0.008416               1.027               3.793               3.894          [torch.Size([64, 1, 64]), 1]
SUCCESS               0.011104            0.036832               0.301              47.216              14.235          [torch.Size([64, 16, 64]), 1]
SUCCESS               0.019968            0.039008               0.512             420.103             215.048          [torch.Size([64, 256, 64]), 1]
SUCCESS               0.083936            0.257792               0.326            1599.048             520.644          [torch.Size([64, 4096, 64]), 1]


Operator: sum  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.015936            0.012896               1.236             263.197             325.241          [torch.Size([1048576])]
SUCCESS               0.009824            0.007232               1.358               1.668               2.265          [torch.Size([64, 64]), 1]
SUCCESS               0.050848            0.043840               1.160            1319.794            1530.768          [torch.Size([4096, 4096]), 1]
SUCCESS               0.051168            0.223584               0.229            1311.540             300.151          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.492352           12.978944               0.115            2877.985             330.918          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.016992            0.012960               1.311             247.081             323.951          [torch.Size([1049600])]
SUCCESS               1.559648            1.507136               1.035            2753.805            2849.754          [torch.Size([1073741824])]
SUCCESS               0.008192            0.007456               1.099               0.500               0.549          [torch.Size([1024, 1]), 1]
SUCCESS               0.009856            0.008736               1.128               6.649               7.502          [torch.Size([1024, 16]), 1]
SUCCESS               0.010336            0.008608               1.201             101.449             121.814          [torch.Size([1024, 256]), 1]
SUCCESS               0.028768            0.016416               1.752             583.190            1022.004          [torch.Size([1024, 4096]), 1]
SUCCESS               0.126880            0.121440               1.045            2115.664            2210.437          [torch.Size([1024, 65536]), 1]
SUCCESS               1.630848            1.527552               1.068            2633.579            2811.667          [torch.Size([1024, 1048576]), 1]
SUCCESS               0.008640            0.008288               1.042               1.896               1.977          [torch.Size([64, 1, 64]), 1]
SUCCESS               0.011040            0.053664               0.206              23.745               4.885          [torch.Size([64, 16, 64]), 1]
SUCCESS               0.019552            0.038624               0.506             214.520             108.593          [torch.Size([64, 256, 64]), 1]
SUCCESS               0.057152            0.217376               0.263            1174.217             308.723          [torch.Size([64, 4096, 64]), 1]
  • after
benchmark/test_reduction_perf.py 
Operator: sum  Performance Test (dtype=torch.float16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.018784            0.014976               1.254             223.291             280.068          [torch.Size([1048576])]
SUCCESS               0.011840            0.008096               1.462               1.384               2.024          [torch.Size([64, 64]), 1]
SUCCESS               0.050144            0.043104               1.163            1338.323            1556.906          [torch.Size([4096, 4096]), 1]
SUCCESS               0.050752            0.063776               0.796            1322.290            1052.259          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.493216            1.561664               0.956            2876.320            2750.251          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.016000            0.012928               1.238             262.400             324.752          [torch.Size([1049600])]
SUCCESS               1.569984            1.507328               1.042            2735.676            2849.391          [torch.Size([1073741824])]
SUCCESS               0.008160            0.007296               1.118               0.502               0.561          [torch.Size([1024, 1]), 1]
SUCCESS               0.009344            0.008000               1.168               7.014               8.192          [torch.Size([1024, 16]), 1]
SUCCESS               0.010144            0.008096               1.253             103.369             129.518          [torch.Size([1024, 256]), 1]
SUCCESS               0.028576            0.016576               1.724             587.109            1012.139          [torch.Size([1024, 4096]), 1]
SUCCESS               0.126784            0.117792               1.076            2117.266            2278.894          [torch.Size([1024, 65536]), 1]
SUCCESS               1.628128            1.478048               1.102            2637.979            2905.838          [torch.Size([1024, 1048576]), 1]
SUCCESS               0.009120            0.008320               1.096               1.796               1.969          [torch.Size([64, 1, 64]), 1]
SUCCESS               0.011040            0.008864               1.245              23.745              29.574          [torch.Size([64, 16, 64]), 1]
SUCCESS               0.019968            0.011904               1.677             210.051             352.344          [torch.Size([64, 256, 64]), 1]
SUCCESS               0.056544            0.060384               0.936            1186.843            1111.368          [torch.Size([64, 4096, 64]), 1]


Operator: sum  Performance Test (dtype=torch.float32, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.017504            0.014912               1.174             479.240             562.541          [torch.Size([1048576])]
SUCCESS               0.010016            0.007008               1.429               3.272               4.676          [torch.Size([64, 64]), 1]
SUCCESS               0.076768            0.073088               1.050            1748.355            1836.385          [torch.Size([4096, 4096]), 1]
SUCCESS               0.073312            0.097888               0.749            1830.774            1371.136          [torch.Size([64, 512, 512]), 1]
SUCCESS               2.904128            3.012960               0.964            2957.836            2850.995          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.017728            0.014976               1.184             473.646             560.684          [torch.Size([1049600])]
SUCCESS               2.974976            3.059904               0.972            2887.396            2807.256          [torch.Size([1073741824])]
SUCCESS               0.008512            0.007808               1.090               0.962               1.049          [torch.Size([1024, 1]), 1]
SUCCESS               0.008832            0.008672               1.018              14.841              15.114          [torch.Size([1024, 16]), 1]
SUCCESS               0.010400            0.009184               1.132             201.649             228.348          [torch.Size([1024, 256]), 1]
SUCCESS               0.035584            0.024512               1.452             942.964            1368.898          [torch.Size([1024, 4096]), 1]
SUCCESS               0.214912            0.208800               1.029            2498.097            2571.221          [torch.Size([1024, 65536]), 1]
SUCCESS               3.222912            2.932416               1.099            2665.271            2929.303          [torch.Size([1024, 1048576]), 1]
SUCCESS               0.008608            0.009248               0.931               3.807               3.543          [torch.Size([64, 1, 64]), 1]
SUCCESS               0.011040            0.009632               1.146              47.490              54.432          [torch.Size([64, 16, 64]), 1]
SUCCESS               0.019872            0.016704               1.190             422.132             502.192          [torch.Size([64, 256, 64]), 1]
SUCCESS               0.083872            0.098688               0.850            1600.269            1360.021          [torch.Size([64, 4096, 64]), 1]


Operator: sum  Performance Test (dtype=torch.bfloat16, mode=cuda,level=comprehensive)
Status       Torch Latency (ms)    Gems Latency (ms)         Gems Speedup          Torch GBPS            Gems GBPS           Size Detail
-----------------------------------------------------------------------------------------------------------------------------------------
SUCCESS               0.015936            0.012896               1.236             263.197             325.241          [torch.Size([1048576])]
SUCCESS               0.009344            0.007008               1.333               1.753               2.338          [torch.Size([64, 64]), 1]
SUCCESS               0.050816            0.043104               1.179            1320.625            1556.906          [torch.Size([4096, 4096]), 1]
SUCCESS               0.050400            0.063008               0.800            1331.525            1065.085          [torch.Size([64, 512, 512]), 1]
SUCCESS               1.492416            1.561472               0.956            2877.862            2750.589          [torch.Size([1024, 1024, 1024]), 1]
SUCCESS               0.016128            0.012960               1.244             260.317             323.951          [torch.Size([1049600])]
SUCCESS               1.570848            1.507520               1.042            2734.171            2849.028          [torch.Size([1073741824])]
SUCCESS               0.008160            0.007296               1.118               0.502               0.561          [torch.Size([1024, 1]), 1]
SUCCESS               0.008832            0.007648               1.155               7.420               8.569          [torch.Size([1024, 16]), 1]
SUCCESS               0.009728            0.008096               1.202             107.789             129.518          [torch.Size([1024, 256]), 1]
SUCCESS               0.029504            0.016704               1.766             568.642            1004.383          [torch.Size([1024, 4096]), 1]
SUCCESS               0.126944            0.118592               1.070            2114.597            2263.521          [torch.Size([1024, 65536]), 1]
SUCCESS               1.629600            1.480096               1.101            2635.596            2901.817          [torch.Size([1024, 1048576]), 1]
SUCCESS               0.009824            0.008320               1.181               1.668               1.969          [torch.Size([64, 1, 64]), 1]
SUCCESS               0.012000            0.008544               1.404              21.845              30.682          [torch.Size([64, 16, 64]), 1]
SUCCESS               0.019520            0.011904               1.640             214.872             352.344          [torch.Size([64, 256, 64]), 1]
SUCCESS               0.056480            0.061024               0.926            1188.188            1099.713          [torch.Size([64, 4096, 64]), 1]

@0x45f 0x45f marked this pull request as ready for review July 16, 2025 10:03
@iclementine iclementine self-assigned this Jul 17, 2025