Implement group merging for bwd_weight #3637

johannes-graner · 2026-01-23T08:38:03Z

Proposed changes

Adds support for merging groups in the bwd weight V3 kernel, along with four kernel instances that use it. For shapes that result in very skinny GEMMs, this improves performance. Examples of such shapes and their uplift on MI350X are shown below.

profiler command	before (ms)	after (ms)	speedup
ckProfiler grouped_conv_bwd_weight 1 2 1 2 0 1 2 32 32 16 16 3 3 50 50 1 1 1 1 1 1 1 1 all	0.224	0.205	1.09
ckProfiler grouped_conv_bwd_weight 1 2 1 2 0 1 2 32 32 8 8 3 3 100 100 1 1 1 1 1 1 1 1 all	0.590	0.470	1.26
ckProfiler grouped_conv_bwd_weight 1 2 1 2 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 all	0.756	0.569	1.33
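For reference, the speedup column appears to be the ratio of the "before" time to the "after" time, rounded to two decimals. A minimal sketch of that arithmetic follows; the helper names `Speedup` and `RoundTo2` are hypothetical and not part of CK or ckProfiler.

```cpp
#include <cassert>
#include <cmath>

// Speedup as reported in the table: time before divided by time after.
// Hypothetical helper for illustration only.
inline double Speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

// Round to two decimals, matching the precision used in the table.
inline double RoundTo2(double x)
{
    return std::round(x * 100.0) / 100.0;
}
```

Applied to the three rows above, `RoundTo2(Speedup(before, after))` reproduces the listed 1.09, 1.26 and 1.33.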

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which helps the maintainers understand the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

vpietila-amd

Looks good to me. @bartekxk what do you think?

Copilot

Pull request overview

This PR implements group merging functionality for the backward weight V3 kernel in grouped convolutions. The feature enables multiple groups to be merged and processed together, improving performance for workloads that result in skinny GEMM operations.

Changes:

Added NumGroupsToMerge template parameter to the V3 kernel implementation with default value of 1
Updated batch stride calculations to account for merged groups
Added four new kernel instances with group merging factors of 2 and 4
Added validation to ensure the number of groups is evenly divisible by the merge factor
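The divisibility check and the stride adjustment listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `MergedLaunch`, `CanMergeGroups`, `MakeMergedLaunch` and `group_stride` are hypothetical names, and the sketch assumes consecutive groups sit a fixed number of elements apart in memory.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration; these names do not come from the PR.
struct MergedLaunch
{
    std::int64_t num_batches;  // number of merged-group bundles to process
    std::int64_t batch_stride; // element offset between consecutive bundles
};

// True when num_groups splits into whole bundles of `merge` groups.
constexpr bool CanMergeGroups(std::int64_t num_groups, std::int64_t merge)
{
    return merge > 0 && num_groups % merge == 0;
}

// Assumption: consecutive groups are `group_stride` elements apart, so one
// bundle of `merge` groups spans `group_stride * merge` elements.
inline MergedLaunch MakeMergedLaunch(std::int64_t num_groups,
                                     std::int64_t merge,
                                     std::int64_t group_stride)
{
    assert(CanMergeGroups(num_groups, merge));
    return MergedLaunch{num_groups / merge, group_stride * merge};
}
```

With 32 groups and a merge factor of 4, this sketch yields 8 bundles, each stepping over 4 groups' worth of elements. A merge factor of 1 leaves both values unchanged, which matches the parameter's stated default.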

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
device_grouped_conv_bwd_weight_v3_xdl_instance.hpp	Adds four new kernel instances with group merging enabled (merge factors of 2 and 4)
device_grouped_conv_bwd_weight_xdl_cshuffle_v3.hpp	Implements the core group merging logic including template parameter, stride adjustments, and validation

Copilot · 2026-01-23T16:39:15Z

include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_bwd_weight_xdl_cshuffle_v3.hpp

+        if constexpr(NumGroupsToMerge > 1)
+        {
+            if(arg.Conv_G_ % NumGroupsToMerge != 0)
+            {
+                if(ck::EnvIsEnabled(CK_ENV(CK_LOGGING)))
+                {
+                    std::cout << "Unsupported! Conv_G_ % NumGroupsToMerge != 0: Conv_G_="
+                              << arg.Conv_G_ << ", NumGroupsToMerge=" << NumGroupsToMerge
+                              << std::endl;
+                }
+                return false;
+            }
+        }


The error message describes an internal condition check rather than a user-facing error. Consider rephrasing to explain the constraint in user terms, such as 'Number of groups must be evenly divisible by the merge factor' and include both values for clarity.
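One way the suggested wording might look is sketched below as a standalone helper. `MergeConstraintMessage` is a hypothetical name used only for illustration; the actual change, if adopted, would presumably stay inside the existing `CK_LOGGING` branch shown in the diff.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: states the constraint in user terms and reports
// both values, as the review comment proposes.
inline std::string MergeConstraintMessage(int num_groups, int merge_factor)
{
    assert(merge_factor > 0);
    std::ostringstream os;
    os << "Unsupported: number of groups (" << num_groups
       << ") must be evenly divisible by the merge factor (" << merge_factor << ").";
    return os.str();
}
```

The message names the quantities rather than the member variables, so it stays readable even if `Conv_G_` or `NumGroupsToMerge` are renamed later.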

Copilot · 2026-01-23T16:39:15Z

...tion_instance/gpu/grouped_conv_bwd_weight/device_grouped_conv_bwd_weight_v3_xdl_instance.hpp

        //#########################################|        |         |          |          |       |        |        |        |            |            |            |                          |      |      |      |      |   |     |     |     |     |                |                 |               |               |               |               |          |                |               |               |              |               |               |          |            |            | NBlock_NPerBlock|                |          |          |
        // generic instance
        DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial,  ALayout,   BLayout,   ELayout,    F16,     F16,     F16,     F32, PassThrough, PassThrough, PassThrough,                  ConvSpec,    64,    32,    32,     32,   8,   32,   32,    1,    1,  S<4, 8,  1>, S<2, 0, 1>,  S<1, 0, 2>,                   1,              2,              2,      false,  S<4, 16, 1>,  S<2, 0, 1>,  S<1, 0, 2>,                1,              2,              2,      false,           1,           1,   S<1, 8, 1, 8>,                  2, Scheduler, PipelineVersion>,
        DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial,  ALayout,   BLayout,   ELayout,    F16,     F16,     F16,     F32, PassThrough, PassThrough, PassThrough,                  ConvSpec,    64,    32,    64,     32,   8,   32,   32,    1,    2,  S<4, 8,  1>, S<2, 0, 1>,  S<1, 0, 2>,                   1,              4,              4,      false,  S<4, 16, 1>,  S<2, 0, 1>,  S<1, 0, 2>,                1,              4,              4,      false,           1,           1,   S<1, 8, 1, 8>,                  2, Scheduler, PipelineVersion>,


Line 96 is a duplicate of line 95 except for the last three template parameters (F16, F16, 2). This creates two nearly identical kernel instances, which could lead to confusion. Consider adding a comment explaining why both configurations are needed or if this duplication is intentional.

Suggested change

-        DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial,  ALayout,   BLayout,   ELayout,    F16,     F16,     F16,     F32, PassThrough, PassThrough, PassThrough,                  ConvSpec,    64,    32,    64,     32,   8,   32,   32,    1,    2,  S<4, 8,  1>, S<2, 0, 1>,  S<1, 0, 2>,                   1,              4,              4,      false,  S<4, 16, 1>,  S<2, 0, 1>,  S<1, 0, 2>,                1,              4,              4,      false,           1,           1,   S<1, 8, 1, 8>,                  2, Scheduler, PipelineVersion>,
+        DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial,  ALayout,   BLayout,   ELayout,    F16,     F16,     F16,     F32, PassThrough, PassThrough, PassThrough,                  ConvSpec,    64,    32,    64,     32,   8,   32,   32,    1,    2,  S<4, 8,  1>, S<2, 0, 1>,  S<1, 0, 2>,                   1,              4,              4,      false,  S<4, 16, 1>,  S<2, 0, 1>,  S<1, 0, 2>,                1,              4,              4,      false,           1,           1,   S<1, 8, 1, 8>,                  2, Scheduler, PipelineVersion>,
+        // Note: this variant is intentionally similar to the previous line, but overrides
+        // the final template parameters of DeviceGroupedConvBwdWeight_Xdl_CShuffleV3
+        // (here: F16, F16, 2) to select a distinct kernel implementation.

Implement group merging for bwd_weight and add instances

da202f1

johannes-graner requested review from a team, Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 23, 2026 08:38

vpietila-amd previously approved these changes Jan 23, 2026

bartekxk previously approved these changes Jan 23, 2026

Remove unnecessary instances

964372a

johannes-graner dismissed stale reviews from bartekxk and vpietila-amd via 964372a January 23, 2026 12:34

bartekxk approved these changes Jan 23, 2026

johannes-graner enabled auto-merge (squash) January 23, 2026 12:36

afagaj requested a review from Copilot January 23, 2026 16:38

Copilot AI reviewed Jan 23, 2026

4 participants
