
Conversation

@vpietila-amd (Contributor)

Proposed changes

Added new instances for merging multiple forward convolution groups into a single GEMM batch. Group merging is now also allowed for C > 1 when the vector load/store size for the output tensor is 1. The new instances improve performance for grouped convolutions when the number of channels per group is low.

| CK prof command | Baseline (TFLOPS) | Group merging (TFLOPS) |
| --- | --- | --- |
| `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1` | 2.85698 | 4.80702 |
| `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 200 200 2 2 1 1 1 1 1 1` | 10.4224 | 18.969 |
| `grouped_conv_fwd 1 1 1 0 1 0 1 2 32 32 8 8 3 3 100 100 1 2 1 1 1 1 1 1` | 15.4201 | 20.3821 |

The baseline was measured on gfx950 prior to #3632.
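
For clarity, here is a minimal sketch of the relaxed merging condition described above. The function and parameter names are illustrative only; the actual check lives in device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp and is expressed in terms of the CDEBlockTransferScalarPerVector_NPerBlock template parameter.

```cpp
// Illustrative sketch, not the actual CK code: groups may be merged either
// when each group has a single channel (C == 1) or when the output tensor is
// accessed with vector size 1 (scalar loads/stores).
#include <cstdint>

constexpr bool CanMergeGroups(std::int64_t channels_per_group,
                              std::int64_t out_vector_size)
{
    return channels_per_group == 1 || out_vector_size == 1;
}
```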


Copilot AI left a comment


Pull request overview

This PR adds new device instances for grouped forward convolution that enable merging multiple convolution groups into a single GEMM batch operation. The change relaxes the constraint on group merging to allow cases where the number of channels per group (C) is greater than 1, provided the vector load/store size for the output tensor is 1. This optimization improves performance for grouped convolutions with low channel counts per group.

Changes:

  • Modified the group merging condition to allow C > 1 when the vector size is 1
  • Added three new device instances with different block and thread configurations for merged group convolutions (summarized in the sketch after this list)
  • Performance improvements demonstrated: 2.85 → 4.80 TFLOPS, 10.42 → 18.97 TFLOPS, and 15.42 → 20.38 TFLOPS for different test cases
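
As a compact summary of those instances, the sketch below records the two facts stated in this review: three configurations, all with NumGroupsPerBatch = 8. The struct, its field names, and the reading of "256x64x64" as a block tile shape are illustrative assumptions; the real instances are full CK template instantiations in device_grouped_conv_fwd_xdl_merged_groups_instance.hpp.

```cpp
// Hypothetical descriptor table, not the actual CK instance definitions.
#include <array>
#include <string_view>

struct MergedGroupsInstance
{
    std::string_view block_tile;   // tile configuration as quoted in this review
    int num_groups_per_batch;      // NumGroupsPerBatch template argument
};

inline constexpr std::array<MergedGroupsInstance, 3> kNewInstances{{
    {"256x64x64", 8},
    {"256x128x64 (first variant)", 8},
    {"256x128x64 (second variant)", 8},
}};
```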

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| device_grouped_conv_fwd_xdl_merged_groups_instance.hpp | Adds three new device instance configurations with NumGroupsPerBatch=8 for different block size combinations (256x64x64, 256x128x64 variants) |
| device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp | Updates the group merging validation logic to allow C > 1 when CDEBlockTransferScalarPerVector_NPerBlock is 1 |
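
As a rough illustration of what NumGroupsPerBatch = 8 means for the launch shape, the helper below computes how many batched-GEMM entries G convolution groups collapse into. The helper is hypothetical and not taken from the CK sources; in particular, whether CK rounds up, pads, or requires divisibility is not stated in this PR, so the ceiling division is an assumption.

```cpp
// Hypothetical helper (not from the CK sources): G groups packed
// NumGroupsPerBatch at a time into batched-GEMM entries.
#include <cstdint>

constexpr std::int64_t MergedGemmBatchCount(std::int64_t num_groups,
                                            std::int64_t num_groups_per_batch)
{
    // Ceiling division is an assumption; the PR does not state how remainders
    // are handled.
    return (num_groups + num_groups_per_batch - 1) / num_groups_per_batch;
}

static_assert(MergedGemmBatchCount(32, 8) == 4,
              "32 groups with NumGroupsPerBatch = 8 -> 4 GEMM batch entries");
```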

