Conversation

@johannes-graner
Contributor

Proposed changes

Adds support for merging groups in the bwd weight V3 kernel. Four kernel instances that use group merging are also added. For shapes that result in very skinny GEMMs, this leads to a performance improvement. Examples of such shapes and their uplift on MI350X are shown below.

| profiler command | before (ms) | after (ms) | speedup |
|---|---|---|---|
| `ckProfiler grouped_conv_bwd_weight 1 2 1 2 0 1 2 32 32 16 16 3 3 50 50 1 1 1 1 1 1 1 1 all` | 0.224 | 0.205 | 1.09 |
| `ckProfiler grouped_conv_bwd_weight 1 2 1 2 0 1 2 32 32 8 8 3 3 100 100 1 1 1 1 1 1 1 1 all` | 0.590 | 0.470 | 1.26 |
| `ckProfiler grouped_conv_bwd_weight 1 2 1 2 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1 all` | 0.756 | 0.569 | 1.33 |
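
For readers checking the figures, the speedup column appears to be the ratio of the "before" time to the "after" time, rounded to two decimals. A minimal sketch of that arithmetic follows; the helper names `Speedup` and `RoundTo2` are illustrative and not part of ckProfiler or Composable Kernel.

```cpp
#include <cassert>
#include <cmath>

// Illustrative helpers only (hypothetical names, not from the PR).

// Speedup as the ratio of the old runtime to the new runtime.
inline double Speedup(double before_ms, double after_ms)
{
    return before_ms / after_ms;
}

// Round to two decimal places, matching the precision used in the table.
inline double RoundTo2(double x)
{
    return std::round(x * 100.0) / 100.0;
}
```

With the reported timings, `RoundTo2(Speedup(0.756, 0.569))` evaluates to 1.33, which agrees with the last row.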

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

vpietila-amd
vpietila-amd previously approved these changes Jan 23, 2026
Contributor

@vpietila-amd vpietila-amd left a comment

Looks good to me. @bartekxk what do you think?

bartekxk
bartekxk previously approved these changes Jan 23, 2026
@johannes-graner johannes-graner dismissed stale reviews from bartekxk and vpietila-amd via 964372a January 23, 2026 12:34
@johannes-graner johannes-graner enabled auto-merge (squash) January 23, 2026 12:36
@afagaj afagaj requested a review from Copilot January 23, 2026 16:38
Contributor

Copilot AI left a comment

Pull request overview

This PR implements group merging functionality for the backward weight V3 kernel in grouped convolutions. The feature enables multiple groups to be merged and processed together, improving performance for workloads that result in skinny GEMM operations.

Changes:

  • Added NumGroupsToMerge template parameter to the V3 kernel implementation with default value of 1
  • Updated batch stride calculations to account for merged groups
  • Added four new kernel instances with group merging factors of 2 and 4
  • Added validation to ensure the number of groups is evenly divisible by the merge factor
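
The changes listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code from this PR: the `Arg` struct, the `a_group_stride` field, and the `MergedGroupsSketch` wrapper are hypothetical, and only `NumGroupsToMerge` and `Conv_G_` are names taken from the PR. It assumes that merging means each batch of work covers `NumGroupsToMerge` consecutive groups, so the batch count shrinks and the batch stride grows by that factor. It requires C++17 for `if constexpr`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the kernel argument; only Conv_G_ mirrors the PR.
struct Arg
{
    std::int64_t Conv_G_;        // number of convolution groups
    std::int64_t a_group_stride; // assumed: elements between consecutive groups
};

// Default of 1 means "no merging", matching the default described in the PR.
template <std::int64_t NumGroupsToMerge = 1>
struct MergedGroupsSketch
{
    static_assert(NumGroupsToMerge >= 1, "merge factor must be positive");

    // Reject problems whose group count is not a multiple of the merge factor.
    static bool IsSupported(const Arg& arg)
    {
        if constexpr(NumGroupsToMerge > 1)
        {
            if(arg.Conv_G_ % NumGroupsToMerge != 0)
            {
                return false;
            }
        }
        return true;
    }

    // Assumed: one batch of work now covers NumGroupsToMerge groups.
    static std::int64_t NumBatches(const Arg& arg)
    {
        return arg.Conv_G_ / NumGroupsToMerge;
    }

    // Assumed: stepping to the next batch skips NumGroupsToMerge groups.
    static std::int64_t BatchStride(const Arg& arg)
    {
        return arg.a_group_stride * NumGroupsToMerge;
    }
};
```

For example, with 8 groups and a merge factor of 4, this sketch yields 2 batches and a batch stride four times the per-group stride, while 6 groups with the same factor is rejected.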

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| device_grouped_conv_bwd_weight_v3_xdl_instance.hpp | Adds four new kernel instances with group merging enabled (merge factors of 2 and 4) |
| device_grouped_conv_bwd_weight_xdl_cshuffle_v3.hpp | Implements the core group merging logic including template parameter, stride adjustments, and validation |

Comment on lines +1373 to +1385
if constexpr(NumGroupsToMerge > 1)
{
    if(arg.Conv_G_ % NumGroupsToMerge != 0)
    {
        if(ck::EnvIsEnabled(CK_ENV(CK_LOGGING)))
        {
            std::cout << "Unsupported! Conv_G_ % NumGroupsToMerge != 0: Conv_G_="
                      << arg.Conv_G_ << ", NumGroupsToMerge=" << NumGroupsToMerge
                      << std::endl;
        }
        return false;
    }
}

Copilot AI Jan 23, 2026

The error message describes an internal condition check rather than a user-facing error. Consider rephrasing to explain the constraint in user terms, such as 'Number of groups must be evenly divisible by the merge factor' and include both values for clarity.
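
One possible shape for such a message is sketched below. The helper `MergeFactorError` is hypothetical and not part of Composable Kernel; it only illustrates wording that states the constraint in user terms and includes both values. The "Unsupported!" prefix is kept from the existing message.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: builds a user-facing message for the divisibility check.
inline std::string MergeFactorError(long num_groups, long merge_factor)
{
    std::ostringstream os;
    os << "Unsupported! Number of groups (" << num_groups
       << ") must be evenly divisible by the merge factor (" << merge_factor << ").";
    return os.str();
}
```

In the check quoted above, the returned string could be streamed to `std::cout` in place of the current literal, with `arg.Conv_G_` and `NumGroupsToMerge` as the two arguments.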

//#########################################| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NBlock_NPerBlock| | | |
// generic instance
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 64, 32, 32, 32, 8, 32, 32, 1, 1, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 2, 2, false, S<4, 16, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 2, 2, false, 1, 1, S<1, 8, 1, 8>, 2, Scheduler, PipelineVersion>,
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 64, 32, 64, 32, 8, 32, 32, 1, 2, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 4, false, S<4, 16, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 4, false, 1, 1, S<1, 8, 1, 8>, 2, Scheduler, PipelineVersion>,

Copilot AI Jan 23, 2026

Line 96 is a duplicate of line 95 except for the last three template parameters (F16, F16, 2). This creates two nearly identical kernel instances, which could lead to confusion. Consider adding a comment explaining why both configurations are needed or if this duplication is intentional.

Suggested change
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 64, 32, 64, 32, 8, 32, 32, 1, 2, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 4, false, S<4, 16, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 4, false, 1, 1, S<1, 8, 1, 8>, 2, Scheduler, PipelineVersion>,
DeviceGroupedConvBwdWeight_Xdl_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 64, 32, 64, 32, 8, 32, 32, 1, 2, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 4, false, S<4, 16, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 4, false, 1, 1, S<1, 8, 1, 8>, 2, Scheduler, PipelineVersion>,
// Note: this variant is intentionally similar to the previous line, but overrides
// the final template parameters of DeviceGroupedConvBwdWeight_Xdl_CShuffleV3
// (here: F16, F16, 2) to select a distinct kernel implementation.

4 participants