[WIP] Fix overflow when shape >= 32Kx32Kx32K, buffer overflow #1061
ftynse
left a comment
The size of the output memref looks completely off, before and after this patch. You may want to find the root cause of that rather than trying to increase the size for it to fit.
"s_waitcnt vmcnt(0)",
"s_waitcnt vmcnt(0) lgkmcnt(0)",
"s_waitcnt vmcnt(0)",
"s_waitcnt lgkmcnt(14)",
That's what the error message stated.
If the error message is wrong, then that should be corrected or guarded too.
What error message? This is a test looking for the presence of exact strings. It likely told you that a new string is present. But we need to understand what that means; in particular, we are adding a lot of waits here, which will decrease performance.
# CHECK: memref.reinterpret_cast %[[D1]] to offset: [0], sizes: [2147483646], strides: [1] : memref<f16> to memref<2147483646xf16, strided<[1]>>
# CHECK: vector.store %[[V]], {{.*}}[{{.*}}] : memref<2147483646xf16, strided<[1]>>, vector<16xf16>
Fly-by: where does this number come from? This is an 8GB buffer, whereas it looks like we have M, N = 16, 16, meaning I'd expect to see 256 here.
See the diff: it is updated from 1073741822 (f32) to 2147483646 (f16).
That is ((2^32 - 1) // 2) - 1.
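The arithmetic behind these constants can be checked with a short sketch. It assumes, based only on the formula quoted here, that the element count is the unsigned 32-bit limit divided by the element width in bytes, minus one. `max_elements` is a hypothetical helper name used for illustration, not a function from the patch.

```python
# Hypothetical helper: reproduces the formula quoted in the comment,
# ((2^32 - 1) // bytes_per_element) - 1. Not taken from the patch itself.
UINT32_MAX = 2**32 - 1  # 4294967295


def max_elements(bytes_per_element: int) -> int:
    """Element count implied by the unsigned 32-bit limit for a given width."""
    return UINT32_MAX // bytes_per_element - 1


print(max_elements(2))  # f16 elements are 2 bytes wide
print(max_elements(4))  # f32 elements are 4 bytes wide
```

Under this assumption the two calls give the f16 and f32 values mentioned in the diff, which shows the number is derived from an addressing limit rather than from the M, N = 16, 16 problem shape.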
Yes, but why do we use this number? Memref sizes are meaningful for MLIR optimization; you must not have a wrong size.
#1057) Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Until now, we have only been verifying the absence of a second non-unit step in the index expressions of read and write operations. Do so for every operation in the trait that attaches the attribute. This is not especially efficient, as it requires looking up the attribute on the same parent from every operation, but it guarantees that the check happens. The attribute verifier, by contrast, does not run when the hyperparameters attribute is absent, even if a problem is visible. A better, longer-term solution is to introduce a top-level wave kernel operation where hyperparameters are mandatory. We can also go for a normal form that performs a top-down verification, collecting the attributes on the way. Closes #1013. Signed-off-by: Alex Zinenko <git@ozinenko.com> Signed-off-by: xintin <gaurav.verma@amd.com>
The schedule.py changes are now in xintin/fix_dynamic_pipeline_remainder_loop_start. Signed-off-by: xintin <gaurav.verma@amd.com> Made-with: Cursor Signed-off-by: xintin <gaurav.verma@amd.com>
…ainder_loop_start
With PR #1061, this gets the block size 256x224x256 working for the list of shapes we were looking at today. Without #1061, it passes all shapes but one; when I try to run that one right now, I get an error that HIP doesn't have enough memory. I'll re-run it later, when the machine hopefully has less usage. Signed-off-by: William G Hatch <william@hatch.uno>
Buffer offset calculations used signed 32-bit limits (2^31 - 1), capping addressable memory at ~2 GB. This patch switches to unsigned 32-bit limits (2^32 - 1) to support up to ~4 GB, and drops the nsw (no-signed-wrap) overflow flag on offset arithmetic so the compiler doesn't misoptimize offsets above the signed range.
Updated the buffer size constants for the i8 type (2147483646 to 4294967294), the OOB index dense constant (2147483647 to 4294967295), and the validBytes constant (2147483646 to 4294967294). Also updated the f16, f32, and i32 buffer sizes.
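To illustrate why the signed limit matters, the sketch below reinterprets the same 32-bit offset as unsigned and as signed. It is a plain-Python illustration under stated assumptions, not code from the patch; `as_u32` and `as_i32` are hypothetical helpers that mimic 32-bit integer wrapping.

```python
INT32_MAX = 2**31 - 1   # 2147483647, the old signed limit
UINT32_MAX = 2**32 - 1  # 4294967295, the new unsigned limit


def as_u32(value: int) -> int:
    """Keep the low 32 bits, read as an unsigned integer."""
    return value & 0xFFFFFFFF


def as_i32(value: int) -> int:
    """Keep the low 32 bits, read as a two's-complement signed integer."""
    low = as_u32(value)
    return low - 2**32 if low > INT32_MAX else low


offset = 3 * 2**30  # a byte offset of 3 GiB, above the signed range

print(as_u32(offset))  # 3221225472
print(as_i32(offset))  # -1073741824
```

An offset like this is valid as an unsigned value but wraps to a negative number when read as signed. Arithmetic marked `nsw` lets the compiler assume that such signed wrapping never occurs, which is presumably why the flag has to go once offsets may exceed 2^31 - 1.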