
[WIP] Fix overflow when shape >= 32Kx32Kx32K, buffer overflow#1061

Open
xintin wants to merge 7 commits into main from xintin/fix_overflow_and_dynamic_pipeline_remainder_loop_start

Conversation

@xintin (Contributor) commented Mar 6, 2026

  • Buffer offset calculations used signed 32-bit limits (2^31 - 1), capping addressable memory at ~2 GB. This patch switches to unsigned 32-bit limits (2^32 - 1) to support up to ~4 GB, and drops the nsw (no-signed-wrap) overflow flag on offset arithmetic so the compiler doesn't misoptimize offsets above the signed range.

  • Updated buffer size constants: i8 (2147483646 → 4294967294), OOB index dense constant (2147483647 → 4294967295), validBytes constant (2147483646 → 4294967294); f16, f32, and i32 buffer sizes updated accordingly.
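To make the limits described above concrete, here is a minimal sketch (illustrative only, not the patch's code) of why one operand of a 32Kx32Kx32K f16 GEMM overflows a signed 32-bit byte offset but still fits an unsigned one:

```python
# Hypothetical illustration of the addressing limits; names are not
# from the patch itself.
INT32_MAX = 2**31 - 1   # 2147483647: signed 32-bit limit (~2 GB)
UINT32_MAX = 2**32 - 1  # 4294967295: unsigned 32-bit limit (~4 GB)

# Byte extent of a single 32K x 32K f16 operand.
rows = cols = 32 * 1024
size_bytes = rows * cols * 2  # 2 bytes per f16 element

print(size_bytes)                # 2147483648
print(size_bytes > INT32_MAX)    # True: overflows a signed 32-bit offset
print(size_bytes <= UINT32_MAX)  # True: representable as unsigned 32-bit
```

This is also why the patch drops the `nsw` flag: once offsets legitimately exceed 2^31 - 1, treating signed wraparound as undefined behavior lets the compiler miscompile valid unsigned offset arithmetic.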

@xintin xintin force-pushed the xintin/fix_overflow_and_dynamic_pipeline_remainder_loop_start branch 2 times, most recently from affcf72 to 01bb2aa Compare March 6, 2026 03:58
@xintin xintin requested a review from harsh-nod March 6, 2026 03:59
@xintin xintin force-pushed the xintin/fix_overflow_and_dynamic_pipeline_remainder_loop_start branch 2 times, most recently from 04b34b5 to b26d3b3 Compare March 6, 2026 05:19
@xintin xintin marked this pull request as draft March 6, 2026 05:29
@xintin xintin marked this pull request as ready for review March 6, 2026 05:36
@xintin xintin force-pushed the xintin/fix_overflow_and_dynamic_pipeline_remainder_loop_start branch 2 times, most recently from 53d89ff to 7660835 Compare March 6, 2026 06:25
@ftynse (Contributor) left a comment:

The size of the output memref looks completely off, before and after this patch. You may want to find the root cause of that rather than trying to increase the size for it to fit.

"s_waitcnt vmcnt(0)",
"s_waitcnt vmcnt(0) lgkmcnt(0)",
"s_waitcnt vmcnt(0)",
"s_waitcnt lgkmcnt(14)",
Contributor commented:

This is... a lot.

@xintin (Author) replied Mar 6, 2026:

That's what the error message stated.
If the error message is wrong, then it should be corrected or guarded as well.

Contributor replied:

What error message? This is a test looking for the presence of exact strings; it likely told you that a new string is present. But we need to understand what that means. In particular, we are adding a lot of waits here, which will decrease performance.

Comment on lines +262 to +263
# CHECK: memref.reinterpret_cast %[[D1]] to offset: [0], sizes: [2147483646], strides: [1] : memref<f16> to memref<2147483646xf16, strided<[1]>>
# CHECK: vector.store %[[V]], {{.*}}[{{.*}}] : memref<2147483646xf16, strided<[1]>>, vector<16xf16>
Contributor commented:

Fly-by: where does this number come from? This is an 8GB buffer, whereas it looks like we have M, N = 16, 16, meaning I'd expect to see 256 here.

@xintin (Author) replied:

See the diff: it was updated from 1073741822 (f32) to 2147483646 (f16), which is ((2^32 - 1) // 2) - 1.
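The constants quoted throughout this thread are consistent with the following derivation (an assumed formula reconstructed from the quoted values, not confirmed against the project's source):

```python
UINT32_MAX = 2**32 - 1  # 4294967295

def max_buffer_elems(elem_bytes: int) -> int:
    # Largest element count whose byte extent stays below the
    # unsigned 32-bit addressing limit, minus one element of slack.
    return UINT32_MAX // elem_bytes - 1

print(max_buffer_elems(1))  # i8:      4294967294
print(max_buffer_elems(2))  # f16:     2147483646
print(max_buffer_elems(4))  # f32/i32: 1073741822
```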

Contributor replied:

Yes, but why do we use this number? Memref sizes are meaningful for MLIR optimization; you must not have a wrong size.

@xintin xintin changed the title Fix overflow and dynamic pipeline remainder loop start [WIP] Fix overflow and dynamic pipeline remainder loop start Mar 6, 2026
@xintin xintin changed the title [WIP] Fix overflow and dynamic pipeline remainder loop start [WIP] Fix overflow when shape >= 32Kx32Kx32K, buffer overflow Mar 6, 2026
@xintin xintin force-pushed the xintin/fix_overflow_and_dynamic_pipeline_remainder_loop_start branch from c94f3cf to 9688fb2 Compare March 6, 2026 17:56
harsh-nod and others added 6 commits March 6, 2026 17:57
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Until now, we have only been verifying the absence of a second non-unit step in index expressions of read and write operations. Do so for every operation in the trait that attaches the attribute. This is not super-efficient, as it requires looking up the attribute on the same parent from all operations, but it guarantees the check happens, unlike the attribute verifier, which will not kick in in the absence of the hyperparameters attribute even when we can see a problem. A better, longer-term solution is to introduce a top-level wave kernel operation where hyperparameters are mandatory. We could also go for a normal form that performs a top-down verification, collecting the attributes on the way.

Closes #1013.

---------

Signed-off-by: Alex Zinenko <git@ozinenko.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
The schedule.py changes are now in
xintin/fix_dynamic_pipeline_remainder_loop_start.

Signed-off-by: xintin <gaurav.verma@amd.com>
Made-with: Cursor
Signed-off-by: xintin <gaurav.verma@amd.com>
@xintin xintin force-pushed the xintin/fix_overflow_and_dynamic_pipeline_remainder_loop_start branch from 7ed44c8 to 7a21b68 Compare March 6, 2026 17:58
willghatch added a commit that referenced this pull request Mar 6, 2026
This builds on PRs #1061, #1063, and #1067 to get the block size 256x224x256 working for the list of shapes we were looking at today.

Signed-off-by: William G Hatch <william@hatch.uno>
willghatch added a commit that referenced this pull request Mar 6, 2026
With PR #1061 this gets the block size 256x224x256 working for the list of shapes we were looking at today.

Without #1061 it passes all shapes but one; right now, when I try to run that one, I get an error that HIP doesn't have enough memory. I'll re-run it later when the machine hopefully has less usage.

Signed-off-by: William G Hatch <william@hatch.uno>
3 participants