Why is diff > 0 when M < 256?

Hi expert,

Thanks for your contribution.
When I ran run_sample.sh, I found that **CublasLt-Gemm** and **Cutlass-Gemm** report a nonzero numerical diff, while **Cublas-Gemm** reports 0.
This is the log for M=128, N=4096, K=4096:
```
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:46 main] CUTLASS GEMM start with 96 CPU processes on the 0-th GPU: NVIDIA GeForce RTX 4090
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:53 main] CUDA driver version / runtime version: 12.2 / 12.0
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:55 main] CUDA capability major/minor version number: 8.9
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:56 main] 128 multiprocessors, 128 CUDA cores/MP: 16384 CUDA cores
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:59 main] GPU max clock rate: 2520 MHz (2.52 GHz)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:61 main] Memory clock rate: 10501 MHz (10.50 GHz)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:63 main] Memory bus width: 384-bit
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:64 main] Total amount of global memory: 24217 MBytes (25393692672 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:66 main] Total amount of constant memory: 64 KBytes (65536 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:68 main] Total amount of shared memory per block: 48 KBytes (49152 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:70 main] Total shared memory per multiprocessor: 100 KBytes (102400 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:72 main] L2 cache size: 73728 KBytes (75497472 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:74 main] Total number of registers available per block: 65536
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:75 main] Warp size: 32
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:76 main] Max number of threads per multiprocessor: 1536
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:77 main] Max number of threads per block: 1024
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:78 main] Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:80 main] Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:85 main] A (128 x 4096) * B (4096 x 4096) = C (128 x 4096)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:86 main] Profiling: alpha: 1.000000, beta: 0.000000, stream: (nil), is bf16: 1, warmup iterations: 1, profiling iterations: 10, sleep duration: 100 ms, enable check: 1
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix A: 128 * 4096, cpu: 0x55951ad80200, gpu: 0x7f1bb2c00000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix B: 4096 * 4096, cpu: 0x7f1bd1fff010, gpu: 0x7f1bbe000000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix C: 128 * 4096, cpu: 0x55951ae81f50, gpu: 0x7f1bb2d00000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix D: 128 * 4096, cpu: 0x55951af82360, gpu: 0x7f1bb2e00000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix Base: 128 * 4096, cpu: 0x55951b0830c0, gpu: 0x7f1bb2f00000
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:57 GemmTester] Cublas-Gemm use: 75.685 ms
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:67 evaluate] ----------------- Evaluating Cublas-Gemm -----------------
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:78 evaluate] Warm up time: 0.428 ms
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:123 checkValue] Max diff: 0.000000, avg diff: 0.000000
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:122 profile] Cublas-Gemm exit, profiling time: 0.052 ms (100.00%), throughput: 83.334 TFLOPS (100.00%)
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:67 evaluate] ----------------- Evaluating CublasLt-Gemm -----------------
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:78 evaluate] Warm up time: 0.154 ms
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:123 checkValue] Max diff: 0.500000, avg diff: 0.027669
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:122 profile] CublasLt-Gemm exit, profiling time: 0.065 ms (125.54%), throughput: 66.379 TFLOPS (79.65%)
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:67 evaluate] ----------------- Evaluating Cutlass-Gemm -----------------
[CG 2025-05-04 10:25:20 1944667:1944667 gemm_tester.h:78 evaluate] Warm up time: 1.992 ms
[CG 2025-05-04 10:25:20 1944667:1944667 matrix.h:123 checkValue] Max diff: 0.500000, avg diff: 0.027669
[CG 2025-05-04 10:25:20 1944667:1944667 gemm_tester.h:122 profile] Cutlass-Gemm exit, profiling time: 0.068 ms (131.86%), throughput: 63.197 TFLOPS (75.84%)
[CG 2025-05-04 10:25:20 1944667:1944667 benchmark_gemm.cpp:103 main] Done
```
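One thing that may be relevant: the run uses `is bf16: 1`, and bf16 keeps only 8 significant bits, so adjacent representable values are 0.5 apart for magnitudes in [64, 128). If the outputs fall in that range (an assumption, since the log does not print them), a max diff of 0.5 would be a single rounding step, which could come from a different accumulation order rather than a wrong result. Below is a minimal sketch in plain C++ to check that spacing; `round_to_bf16` and `bf16_ulp` are hypothetical helper names and are not part of this repo.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: round a finite float to the nearest bf16 value
// (round-to-nearest-even) and return the result widened back to float.
static float round_to_bf16(float x) {
    assert(std::isfinite(x));  // NaN/Inf are not handled in this sketch
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    // bf16 is the upper 16 bits of an IEEE-754 float; round the lower 16 away.
    std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    bits &= 0xFFFF0000u;
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// Hypothetical helper: distance from x (rounded to bf16) to the next
// larger-magnitude bf16 value. Intended for positive, finite x.
static float bf16_ulp(float x) {
    float a = round_to_bf16(x);
    std::uint32_t bits;
    std::memcpy(&bits, &a, sizeof(bits));
    bits += 0x10000u;  // step the bf16 mantissa by one
    float b;
    std::memcpy(&b, &bits, sizeof(b));
    return b - a;
}
```

For example, `bf16_ulp(100.0f)` gives 0.5 and `bf16_ulp(1.0f)` gives 0.0078125, so comparing the max diff against the spacing at the actual output magnitudes would show whether the mismatch is more than one rounding step.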
Do you also see this issue? If so, how can it be solved?
Thanks.
