Contributor

@dhernandez0 dhernandez0 commented Oct 7, 2025

Motivation

In order to allow attention pipelining of the outer loop, we need to be able to annotate the liveness of LDS buffers. Currently, we do:

%alloc1 = rock.alloc()
%alloc2 = rock.alloc()

gemm(%alloc1, %alloc2)
rock.dealloc %alloc1
rock.dealloc %alloc2

%alloc3 = rock.alloc()
output_swizzle(%alloc3)
rock.dealloc %alloc3

Note that we currently use rock.dealloc manually. However, for attention, we need to do:

%alloc = rock.alloc()
for ... {
  stage {
    store_lds(%alloc)
    load_lds(%alloc)
    rock.dealloc %alloc
  }

  %alloc1 = rock.alloc()
  stage {
    store_lds(%alloc1)
  }
  stage {
    load_lds(%alloc1)
    compute(...)
  }
  rock.dealloc %alloc1
...
}

So, after pipelining, we would end up with multiple calls to rock.dealloc for the same buffer. The underlying problem is that we currently have a single liveness range for each alloc: it starts when we call rock.alloc() and ends when we call rock.dealloc().

Technical Details

This PR introduces rock.live_in and rock.live_out (replacing rock.dealloc) and decouples rock.alloc from liveness analysis. A buffer can then have multiple liveness ranges: regions where it is used, then not used, then used again.

We also introduce a new pass, AnnotateLiveness, that automatically marks the liveness of the buffers. It assumes that accesses to a buffer come in blocks of write() followed by load() (or write(), write(), ... load(), load()). Each such block is a liveness range: we add rock.live_in before the first write() and rock.live_out after the last load().

See annotateLiveness() comment for more details about other assumptions.
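
The write-then-load assumption above can be sketched roughly as follows. This is a simplified standalone model, not the pass itself: the `Access` enum, the `Range` alias and `annotateRanges` are hypothetical names, the accesses to a single buffer are assumed to be already collected in program order, and indices stand in for the places where rock.live_in and rock.live_out would be inserted. The two error strings are the ones quoted in the review comments on this PR.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the accesses to one LDS buffer, in program order.
enum class Access { Write, Read };

// A liveness range as (index of first write, index of last read). In the
// pass, these would correspond to inserting rock.live_in before the first
// write and rock.live_out after the last read.
using Range = std::pair<std::size_t, std::size_t>;

std::vector<Range> annotateRanges(const std::vector<Access> &accesses) {
  std::vector<Range> ranges;
  std::optional<std::size_t> firstWrite; // start of the open range, if any
  std::optional<std::size_t> lastRead;   // last read seen in the open range

  for (std::size_t i = 0; i < accesses.size(); ++i) {
    if (accesses[i] == Access::Write) {
      // A write after a read closes the previous range and opens a new one.
      if (lastRead) {
        ranges.emplace_back(*firstWrite, *lastRead);
        firstWrite.reset();
        lastRead.reset();
      }
      if (!firstWrite)
        firstWrite = i;
    } else {
      if (!firstWrite)
        throw std::runtime_error("Read before write");
      lastRead = i;
    }
  }

  // At the end, an open range must contain both a write and a read.
  if (firstWrite.has_value() != lastRead.has_value())
    throw std::runtime_error("Found a non closed read-write pattern");
  if (firstWrite)
    ranges.emplace_back(*firstWrite, *lastRead);
  return ranges;
}
```

Under this model, write, write, read, read, write, read yields two ranges, while write, write, read, write is rejected because the trailing write is never read.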

Test Plan

Tests pass.

Test Result

All tests pass.

Submission Checklist

@dhernandez0 dhernandez0 requested a review from causten as a code owner October 7, 2025 08:50
@dhernandez0 dhernandez0 changed the title Annotate liveness pass (+ reuseLDS pass changes) [DRAFT] Annotate liveness pass (+ reuseLDS pass changes) Oct 7, 2025
@pabloantoniom
Contributor

What quickly draws my attention is that rock.dealloc is removed in this PR, which seems very wrong. But the intention is to actually rename rock.dealloc to rock.live_out.

Initially, that also struck me as a bad idea: I'd rather have both rock.alloc/rock.dealloc and rock.live_in/rock.live_out, where the former ops are for memory management and the latter serve as metadata. However, the point here is that rock.dealloc does not actually deallocate anything, since it only works on LDS, which cannot be deallocated.

To me it looks like it was a mistake to call it rock.dealloc, so renaming rock.dealloc to rock.live_out makes a lot of sense.

@dhernandez0 dhernandez0 changed the title [DRAFT] Annotate liveness pass (+ reuseLDS pass changes) Annotate liveness pass (+ reuseLDS pass changes) Oct 7, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a new liveness annotation system to enable attention pipelining of the outer loop by replacing the previous rock.dealloc approach with rock.live_in and rock.live_out annotations, allowing for multiple liveness ranges per buffer allocation.

  • Replaces rock.dealloc with rock.live_in/rock.live_out for more flexible LDS memory management
  • Adds a new AnnotateLiveness pass that automatically detects and marks buffer liveness based on write/read patterns
  • Updates the ReuseLDS pass to work with the new liveness annotations and handle multiple liveness ranges per buffer
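
As a rough illustration of why multiple liveness ranges per buffer matter for reuse: two buffers can share LDS only if none of their ranges overlap. The sketch below is a hypothetical standalone model (the `Range` alias, `overlaps` and `interferes` are illustrative names, and ranges are assumed to be closed intervals of op positions); it is not taken from ReuseLDS.cpp.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical representation: a closed interval [live_in, live_out] of op
// positions in program order.
using Range = std::pair<std::size_t, std::size_t>;

// Two closed intervals overlap if each one starts before the other ends.
bool overlaps(const Range &a, const Range &b) {
  return a.first <= b.second && b.first <= a.second;
}

// Two buffers interfere (cannot share LDS) if any pair of their liveness
// ranges overlaps.
bool interferes(const std::vector<Range> &a, const std::vector<Range> &b) {
  for (const Range &ra : a)
    for (const Range &rb : b)
      if (overlaps(ra, rb))
        return true;
  return false;
}
```

For example, a buffer live over [0, 3] and [8, 10] does not interfere with one live over [4, 7], whereas a single range [0, 10] for the first buffer would.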

Reviewed Changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| mlir/lib/Dialect/Rock/Transforms/AnnotateLiveness.cpp | New pass implementation for automatic liveness annotation |
| mlir/lib/Dialect/Rock/Transforms/ReuseLDS.cpp | Major refactor to use new liveness annotations and interference graph analysis |
| mlir/include/mlir/Dialect/Rock/IR/RockOps.td | Defines new rock.live_in and rock.live_out operations |
| mlir/test/Dialect/Rock/lowering_reuse_lds.mlir | Updated test cases to use new liveness annotations |
| mlir/test/Dialect/Rock/lowering_annotate_liveness.mlir | New test file for the liveness annotation pass |
| Multiple test files | Removal of rock.dealloc calls throughout existing tests |


// Annotate lifetime of memory allocation on GPU memory hierachy.
def Rock_GpuDeallocOp:
Rock_Op<"dealloc", [MemoryEffects<[MemFree<DefaultResource>]>]>,
Contributor

It looks like rock.dealloc annotated the memref input arg with [MemoryEffects<[MemFree<DefaultResource>]>]. Based on Pablo's earlier comment, we were never using this for actual deallocations, but is there a chance that the presence of this annotation was leading some community passes to treat it as if it were doing deallocations?

Contributor Author

@dhernandez0 dhernandez0 Oct 7, 2025

I don't think so, because we got rid of rock.dealloc in RockToGPU.cpp (see MIGPUDeallocRewritePattern), which is the last pass of buildKernelPipeline().

return emitError("The size of rock.alloc should be greather than zero.");
}

//===-----------------------------------------------------===//
Contributor

Do we want verifiers for the new LiveIn and LiveOut ops?

Contributor Author

done

funcPm.addPass(rock::createRockAnnotateLivenessPass());
funcPm.addPass(rock::createRockReuseLDSPass());
funcPm.addPass(rock::createRockOutputSwizzlePass());
funcPm.addPass(rock::createRockAnnotateLivenessPass());
Contributor

Can you add a comment explaining why we need to call RockAnnotateLiveness and RockReuseLDS again after running RockOutputSwizzle?

Contributor Author

Sure, I'll add the following comments:

  // We run reuse LDS before the output swizzle pass because output swizzle uses a heuristic to decide whether to swizzle, and that heuristic needs the actual LDS usage.
  // Output swizzle creates a new LDS buffer, so we run reuse LDS again afterwards to be able to reuse LDS memory.

// (outside the loop). This would be incorrect because the buffer is alive for
// the whole loop. However, in practise, this is not a problem because if there
// are any interferences they will also happen in the epilogue and prologue.
// This might need to get improved if changes to pipelining are made.
Contributor

Is this worth filing a case for?

Contributor Author

I'm not sure. If it's not a lot of work, we could fix it, because this might happen in the future if we change the pipeline we currently use. But if it's a lot of work, it might not be worth it...

}

// Annotate LDS buffer usage based on the following assumptions:
// 1. Liveness range is determined by a pattern of write(), then read()
Contributor

What will happen right now if the number of writes and reads does not match? Should we catch that error in this pass?

Contributor Author

It's OK if they don't match. You can have write(buffer), write(buffer), read(buffer). I guess what you mean is something like: write(buffer), write(buffer), read(buffer), write(buffer)?

Contributor Author

I think we should fail for those cases, see line 206: "Found a non closed read-write pattern"

Contributor

Yeah, that was the case that I was thinking of. Can you add a LIT test for the failing case as well?

Contributor Author

done

Contributor

@pabloantoniom pabloantoniom left a comment

Great work!

@@ -0,0 +1,305 @@
//===- AnnotateLiveness - MLIR Rock ops lowering passes -----===//
//
// Copyright 2025 The MLIR Authors.
Contributor

nit: The MLIR authors?

}

// Annotate LDS buffer usage based on the following assumptions:
// 1. Liveness range is determined by a pattern of write(), then read()
Contributor

"1. Liveness range is determined by a pattern of write(), then read()" is difficult to understand. What about:

1. Liveness range is determined by a pattern of one or more write() ops, and then one or more read() ops. In other words, there cannot be a write() after a read().

// read([0, 1, 2]).
// clang-format on
//
// Where write(buffer, indices, data), read(indices), alloc(size). We would be
Contributor

A bit pedantic but I would prefer read(buffer, indices) instead of read(indices)

func::FuncOp func = getOperation();

// Only run this pass on GPU kernel functions.
if (!func->hasAttr("kernel"))
Contributor

if (!func->hasAttr("kernel")) {
  LLVM_DEBUG(llvm::dbgs() << "Skipping RockAnnotateLivenessPass on func with no kernel attribute\n");
  return;
}

@@ -0,0 +1,142 @@
// RUN: sed s/##TOKEN_ARCH##/%arch/g %s | rocmlir-opt -rock-annotate-liveness | FileCheck %s

#wg = #gpu.address_space<workgroup>
Contributor

nit: Either use #wg everywhere or use #gpu.address_space<workgroup> everywhere

//===-----------------------------------------------------===//

-LogicalResult GpuDeallocOp::verify() {
+LogicalResult LiveInOp::verify() {
Contributor

If live_in/live_out targets only LDS memory, this is a good moment to check if the GpuAllocOp is LDS or not.

@@ -0,0 +1,47 @@
// RUN: sed s/##TOKEN_ARCH##/%arch/g %s | rocmlir-opt -rock-annotate-liveness -verify-diagnostics

#wg = #gpu.address_space<workgroup>
Contributor

nit: same here, let's use either #wg or #gpu.address_space<workgroup> below

// RUN: sed s/##TOKEN_ARCH##/%arch/g %s | rocmlir-opt -rock-annotate-liveness -verify-diagnostics

#wg = #gpu.address_space<workgroup>
#priv = #gpu.address_space<private>
Contributor

nit: #priv is not used

bool hasRead = lastRead != nullptr;
bool hasWrite = currentWrite != nullptr;
if (hasRead != hasWrite) {
return buffer->emitError("Found a non closed read-write pattern");
Contributor

I'm not sure I understand this. Let's take non_closed_read_write_pattern as an example.

Yes, there is a write on a buffer that is not read later, but we could have a valid program that does that, right? It would be cleaned up by DCE at some point.

I guess supporting that complicates the logic, but can't we treat the return op as an implicit rock.live_out for all buffers?

// Update the last read (could be write, read, read, ... pattern)
lastRead = op;
if (!currentWrite) {
return buffer->emitError("Read before write");
Contributor

I think this error makes sense: we are reading from LDS before writing to it, and it makes no sense to read from an uninitialized buffer (i.e., one we have not written anything to yet). But I wonder if this error string is appropriate. The error is not just any "Read before write", but a read from a position of LDS that has never been written (i.e., we only care about the first read, before any write). Not sure how to put it cleanly.
