- 
                Notifications
    
You must be signed in to change notification settings  - Fork 74
 
[BACKEND] Enhance the remove layout implementation to reduce the duplicated values with different layout in scf.for. #4527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
486ed4a    to
    f42bd66      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the remove layout implementation to better handle layout propagation across scf.for operations, addressing limitations that create performance bottlenecks on Intel GPU. The changes focus on reducing duplicated layout conversion operations by improving support for multi-result operations and nested basic blocks.
- Adds support for propagating layouts through 
scf.foroperations with a newincludeForOpparameter - Refactors 
mappedValuesto handle multiple attribute mappings per value instead of single mappings - Includes debug output and unreachable code handling for 
scf.foroperations 
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description | 
|---|---|
| Utility.h | Adds includeForOp parameter to getConvertBackwardSlice function signature | 
| Utility.cpp | Implements scf.for layout propagation logic with early return check and debug output | 
| RemoveLayoutConversions.cpp | Updates data structures to support multiple encodings per value and enables scf.for processing | 
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
        
          
                third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
              
                Outdated
          
            Show resolved
            Hide resolved
        
      f42bd66    to
    7a8f858      
    Compare
  
    | 
           The flex attn backward ttgir has been simplified by these changes. There are only two root tiling layout of the dpas and the transpose of dot of dpas. Another major in-efficient issue on Xe-Xe3 is that the regular pointer under different layout like: The simplified ttgir | 
    
7a8f858    to
    5828b55      
    Compare
  
    
          
 I checkout this branch and run the benchmark code in flex attn bwdBut the TTGI I get is not the same as the one you mentioned, the IR around the  Snippet from TTGIRThe IR still contains the convert_layout operations and there are other ops (e.g. ttg.local_alloc) which do not appear for you. How did you compile the benchmark? Does this PR contains all of your code ?  | 
    
5828b55    to
    d85a2ba      
    Compare
  
    | 
           @etiotto Added the missed changes for debug the backward.  | 
    
…d values with different layout in scf.for. Signed-off-by: Lu,Chengjun <[email protected]>
Signed-off-by: Lu,Chengjun <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
        
          
                third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
              
                Outdated
          
            Show resolved
            Hide resolved
        
      …ntel/intel-xpu-backend-for-triton into chengjun/enhance_remove_layout
…ntel/intel-xpu-backend-for-triton into chengjun/enhance_remove_layout
Signed-off-by: Ettore Tiotto <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
        
          
                third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | ||
| targetType.getEncoding(), | ||
| slice, layout, nullptr); | 
    
      
    
      Copilot
AI
    
    
    
      Sep 30, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Adding a nullptr parameter without documentation or context makes the API unclear. Consider adding a comment explaining what this parameter represents or using a named constant instead of nullptr.
| LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | |
| targetType.getEncoding(), | |
| slice, layout, nullptr); | |
| // No filter function is provided, so we pass kNoRematerializationFilter (nullptr). | |
| static constexpr auto kNoRematerializationFilter = nullptr; | |
| LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | |
| targetType.getEncoding(), | |
| slice, layout, kNoRematerializationFilter); | 
Signed-off-by: Ettore Tiotto <[email protected]>
Signed-off-by: Ettore Tiotto <[email protected]>

The layout propagation across the
scf.forop in RemoveLayout is not implemented well for these aspects:scf.forops.With the limitations, the
scf.foroperation is the bottle neck of the efficient after the remove layout pass.This is not issue on NV GPU because the NV GPU convert the layout convert operations to async.cp in software pipeline.
But it is an issue for Intel GPU. We rely on the remove layout to get a simple program with less convert layout operations.
Plan to enhance the remove layout to enhance the limitations of the remove layout.
scf.forops on demand.This is an PR for CI.