
Conversation

@vpietila-amd (Contributor)

Proposed changes

Added new instances for merging multiple forward convolution groups into a single GEMM batch. Group merging is now also allowed for C > 1 when the vector load/store size for the output tensor is 1. The new instances improve performance for grouped convolutions when the number of channels per group is low.

| CK prof command | Baseline (TFLOPS) | Group merging (TFLOPS) |
| --- | --- | --- |
| `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1` | 2.85698 | 4.80702 |
| `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1` | 10.4224 | 18.969 |
| `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1` | 15.4201 | 20.3821 |

The baseline was measured on gfx950 prior to #3632.
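
For clarity, here is a minimal sketch of the relaxed merging condition described above. The function and parameter names are illustrative only; the actual check lives in device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp and is expressed in terms of the CDEBlockTransferScalarPerVector_NPerBlock template parameter.

```cpp
// Illustrative sketch, not the actual CK code: groups may be merged either
// when each group has a single channel (C == 1) or when the output tensor is
// accessed with vector size 1 (scalar loads/stores).
#include <cstdint>

constexpr bool CanMergeGroups(std::int64_t channels_per_group,
                              std::int64_t out_vector_size)
{
    return channels_per_group == 1 || out_vector_size == 1;
}
```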


Copilot AI left a comment


Pull request overview

This PR adds new device instances for grouped forward convolution that enable merging multiple convolution groups into a single GEMM batch operation. The change relaxes the constraint on group merging to allow cases where the number of channels per group (C) is greater than 1, provided the vector load/store size for the output tensor is 1. This optimization improves performance for grouped convolutions with low channel counts per group.

Changes:

  • Modified the group merging condition to allow C > 1 when the vector size is 1
  • Added three new device instances with different block and thread configurations for merged group convolutions (summarized in the sketch after this list)
  • Performance improvements demonstrated: 2.85 → 4.80 TFLOPS, 10.42 → 18.97 TFLOPS, and 15.42 → 20.38 TFLOPS for different test cases
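
As a compact summary of those instances, the sketch below records the two facts stated in this review: three configurations, all with NumGroupsPerBatch = 8. The struct, its field names, and the reading of "256x64x64" as a block tile shape are illustrative assumptions; the real instances are full CK template instantiations in device_grouped_conv_fwd_xdl_merged_groups_instance.hpp.

```cpp
// Hypothetical descriptor table, not the actual CK instance definitions.
#include <array>
#include <string_view>

struct MergedGroupsInstance
{
    std::string_view block_tile;   // tile configuration as quoted in this review
    int num_groups_per_batch;      // NumGroupsPerBatch template argument
};

inline constexpr std::array<MergedGroupsInstance, 3> kNewInstances{{
    {"256x64x64", 8},
    {"256x128x64 (first variant)", 8},
    {"256x128x64 (second variant)", 8},
}};
```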

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| device_grouped_conv_fwd_xdl_merged_groups_instance.hpp | Adds three new device instance configurations with NumGroupsPerBatch=8 for different block size combinations (256x64x64, 256x128x64 variants) |
| device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp | Updates the group merging validation logic to allow C > 1 when CDEBlockTransferScalarPerVector_NPerBlock is 1 |
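
As a rough illustration of what NumGroupsPerBatch = 8 means for the launch shape, the helper below computes how many batched-GEMM entries G convolution groups collapse into. The helper is hypothetical and not taken from the CK sources; in particular, whether CK rounds up, pads, or requires divisibility is not stated in this PR, so the ceiling division is an assumption.

```cpp
// Hypothetical helper (not from the CK sources): G groups packed
// NumGroupsPerBatch at a time into batched-GEMM entries.
#include <cstdint>

constexpr std::int64_t MergedGemmBatchCount(std::int64_t num_groups,
                                            std::int64_t num_groups_per_batch)
{
    // Ceiling division is an assumption; the PR does not state how remainders
    // are handled.
    return (num_groups + num_groups_per_batch - 1) / num_groups_per_batch;
}

static_assert(MergedGemmBatchCount(32, 8) == 4,
              "32 groups with NumGroupsPerBatch = 8 -> 4 GEMM batch entries");
```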

