[CK_TILE] Add irregular tail vectorloads pipeline #3606
base: develop
Pull request overview
This PR introduces a new GEMM pipeline (COMPUTE_V7) that enables vectorized loads greater than 1 when the last dimensions of matrices A and B are not multiples of the vector load size. The pipeline handles irregular tail cases by falling back to scalar reads in the final loop iteration, preventing out-of-bounds accesses.
Changes:
- Added COMPUTE_V7 pipeline variant with support for irregular tail vectorloads
- Implemented load_tile_with_elementwise_vectorload1 for scalar fallback reads
- Modified kernel validation logic to skip vector size divisibility checks for COMPUTE_V7
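The scalar fallback described above can be illustrated with a host-side sketch. This is not the CK_TILE implementation: the function name `load_row_scalar_last_tile` and the constants `kVectorSize` and `kTileK` are hypothetical, and `std::memcpy` merely stands in for a vector load. The sketch only assumes what the overview states, namely that every loop iteration except the last uses vector loads, and the last iteration reads element by element so nothing past the end of the row is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sizes for illustration only; kTileK is a multiple of kVectorSize.
constexpr std::size_t kVectorSize = 4; // elements per "vector load"
constexpr std::size_t kTileK      = 8; // elements consumed per loop iteration

// Copies a row of k elements from src to dst, one K-tile per iteration.
// All iterations except the last copy kVectorSize elements at a time.
// The last iteration uses bounds-checked scalar reads, so it never reads
// past src[k - 1] even when k is not a multiple of kVectorSize.
// Returns the number of scalar reads that were issued.
inline std::size_t load_row_scalar_last_tile(const float* src, float* dst, std::size_t k)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t scalar_reads    = 0;
    const std::size_t num_tiles = (k + kTileK - 1) / kTileK;

    for(std::size_t t = 0; t < num_tiles; ++t)
    {
        const std::size_t begin = t * kTileK;
        const bool is_last_tile = (t + 1 == num_tiles);

        if(!is_last_tile)
        {
            // Hot loop: the whole tile is in bounds, so vector-sized copies are safe.
            for(std::size_t i = begin; i < begin + kTileK; i += kVectorSize)
                std::memcpy(dst + i, src + i, kVectorSize * sizeof(float));
        }
        else
        {
            // Tail iteration: scalar reads, stopping at the end of the row.
            for(std::size_t i = begin; i < begin + kTileK && i < k; ++i)
            {
                dst[i] = src[i];
                ++scalar_reads;
            }
        }
    }
    return scalar_reads;
}
```

With these hypothetical sizes, a row of 10 elements takes one vectorized iteration (elements 0 to 7) followed by two scalar reads. The cost of this scheme is that the last iteration is scalar even when the row length happens to be a multiple of the vector size.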
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| include/ck_tile/ops/gemm/pipeline/gemm_pipelines.hpp | Added COMPUTE_V7 enum entry |
| include/ck_tile/ops/gemm/pipeline/gemm_pipeline_ag_bg_cr_comp_v7.hpp | New pipeline implementation with irregular tail support |
| include/ck_tile/ops/gemm/kernel/universal_gemm_kernel.hpp | Added validation for COMPUTE_V7 padding requirements and relaxed vector size checks |
| include/ck_tile/ops/gemm.hpp | Added include for new pipeline header |
| include/ck_tile/core/tensor/tile_window_base.hpp | Added TraitsVectorload1 structure for scalar load traits |
| include/ck_tile/core/tensor/tile_window.hpp | Implemented load_vectorload1 methods for scalar fallback |
| include/ck_tile/core/tensor/load_tile.hpp | Added load_tile_with_elementwise_vectorload1 function |
| example/ck_tile/03_gemm/gemm_utils.hpp | Added configuration and pipeline traits for COMPUTE_V7 |
    bool_constant<oob_conditional_check> = {})
    {
    // TODO: Tile windows should works with unknow number of params
    // Load element_wise API works only when the input typle is a tuple-type
Copilot
AI
Jan 20, 2026
Corrected spelling of 'typle' to 'tuple'.
Suggested change:

    - // Load element_wise API works only when the input typle is a tuple-type
    + // Load element_wise API works only when the input tuple is a tuple-type
    number<i_access> = {},
    bool_constant<oob_conditional_check> = {})
    {
    // TODO: Tile windows should works with unknow number of params
Copilot
AI
Jan 20, 2026
Corrected spelling of 'unknow' to 'unknown' and grammar 'works' to 'work'.
    auto first_a_load_tile_with_elementwise = [&]() {
        if constexpr(HasHotLoop)
            return load_tile_with_elementwise(a_copy_dram_window, a_element_func);
        else
            return load_tile_with_elementwise_vectorload1(a_copy_dram_window,
                                                          a_element_func);
    }();

    auto first_b_load_tile_with_elementwise = [&]() {
        if constexpr(HasHotLoop)
            return load_tile_with_elementwise(b_copy_dram_window, b_element_func);
        else
            return load_tile_with_elementwise_vectorload1(b_copy_dram_window,
                                                          b_element_func);
    }();
Copilot
AI
Jan 20, 2026
The conditional logic determining which load function to use (load_tile_with_elementwise vs load_tile_with_elementwise_vectorload1) is duplicated for both A and B tiles (lines 444-450 and 452-458). Consider extracting this into a helper lambda or template function to reduce code duplication.
Suggested change:

    - auto first_a_load_tile_with_elementwise = [&]() {
    -     if constexpr(HasHotLoop)
    -         return load_tile_with_elementwise(a_copy_dram_window, a_element_func);
    -     else
    -         return load_tile_with_elementwise_vectorload1(a_copy_dram_window,
    -                                                       a_element_func);
    - }();
    - auto first_b_load_tile_with_elementwise = [&]() {
    -     if constexpr(HasHotLoop)
    -         return load_tile_with_elementwise(b_copy_dram_window, b_element_func);
    -     else
    -         return load_tile_with_elementwise_vectorload1(b_copy_dram_window,
    -                                                       b_element_func);
    - }();
    + auto make_first_load_tile_with_elementwise =
    +     [&](auto& copy_dram_window, auto& element_func) {
    +         if constexpr(HasHotLoop)
    +             return load_tile_with_elementwise(copy_dram_window, element_func);
    +         else
    +             return load_tile_with_elementwise_vectorload1(copy_dram_window,
    +                                                           element_func);
    +     };
    + auto first_a_load_tile_with_elementwise =
    +     make_first_load_tile_with_elementwise(a_copy_dram_window, a_element_func);
    + auto first_b_load_tile_with_elementwise =
    +     make_first_load_tile_with_elementwise(b_copy_dram_window, b_element_func);
        AsTensorIsValid = false;
    }
    - if(kargs.K % vectorSizeA != 0)
    + if(kargs.K % vectorSizeA != 0 && GemmPipeline::GetPipelineName() != "COMPUTE_V7")
Copilot
AI
Jan 20, 2026
The pipeline name check GemmPipeline::GetPipelineName() != "COMPUTE_V7" is repeated in multiple validation conditions (lines 455, 487, 527, 560-561). Consider extracting this into a boolean constant at the beginning of the validation function to improve maintainability and reduce magic string comparisons.
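One possible shape for that refactoring is sketched below. Everything here is an assumption made for illustration: the mock types `OtherPipeline` and `IrregularTailPipeline`, the function `is_k_supported`, and the choice of `std::string_view` as the return type of `GetPipelineName()` are not taken from the PR. The point is only that the name comparison is evaluated once and the resulting flag is reused by each divisibility check.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical stand-ins for GEMM pipeline types; only the name accessor matters here.
struct OtherPipeline
{
    static constexpr std::string_view GetPipelineName() { return "OTHER"; }
};
struct IrregularTailPipeline
{
    static constexpr std::string_view GetPipelineName() { return "COMPUTE_V7"; }
};

// Hypothetical validation helper: returns true when K is acceptable for the
// given vector sizes, either because it is divisible by both or because the
// pipeline handles an irregular tail itself.
template <typename GemmPipeline>
constexpr bool is_k_supported(int K, int vectorSizeA, int vectorSizeB)
{
    assert(vectorSizeA > 0 && vectorSizeB > 0);

    // The string comparison appears once; the checks below reuse the flag.
    constexpr bool supports_irregular_tail =
        GemmPipeline::GetPipelineName() == "COMPUTE_V7";

    if(K % vectorSizeA != 0 && !supports_irregular_tail)
        return false;
    if(K % vectorSizeB != 0 && !supports_irregular_tail)
        return false;
    return true;
}
```

A type trait or a static boolean member on the pipeline would avoid the string comparison entirely, but that is a larger change than the one suggested in the comment.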
Proposed changes
This PR adds a new pipeline that supports vectorloads > 1 when the last dimension of A/B is not a multiple of the vectorload size. Previously such shapes resulted in out-of-bounds reads; this pipeline uses scalar reads in the last loop iteration, which fixes that problem.