Why diff > 0 when m < 256? #1

@xxyux

Hi expert,

Thanks for your contribution.
When I ran run_sample.sh, I found that CublasLt-Gemm and Cutlass-Gemm report a numerical error.
Here is the log for M=128, N=4096, K=4096:

[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:46 main] CUTLASS GEMM start with 96 CPU processes on the 0-th GPU: NVIDIA GeForce RTX 4090
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:53 main] CUDA driver version / runtime version: 12.2 / 12.0
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:55 main] CUDA capability major/minor version number: 8.9
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:56 main] 128 multiprocessors, 128 CUDA cores/MP: 16384 CUDA cores
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:59 main] GPU max clock rate: 2520 MHz (2.52 GHz)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:61 main] Memory clock rate: 10501 MHz (10.50 GHz)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:63 main] Memory bus width: 384-bit
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:64 main] Total amount of global memory: 24217 MBytes (25393692672 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:66 main] Total amount of constant memory: 64 KBytes (65536 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:68 main] Total amount of shared memory per block: 48 KBytes (49152 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:70 main] Total shared memory per multiprocessor: 100 KBytes (102400 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:72 main] L2 cache size: 73728 KBytes (75497472 Bytes)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:74 main] Total number of registers available per block: 65536
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:75 main] Warp size: 32
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:76 main] Max number of threads per multiprocessor: 1536
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:77 main] Max number of threads per block: 1024
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:78 main] Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:80 main] Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:85 main] A (128 x 4096) * B (4096 x 4096) = C (128 x 4096)
[CG 2025-05-04 10:25:19 1944667:1944667 benchmark_gemm.cpp:86 main] Profiling: alpha: 1.000000, beta: 0.000000, stream: (nil), is bf16: 1, warmup iterations: 1, profiling iterations: 10, sleep duration: 100 ms, enable check: 1
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix A: 128 * 4096, cpu: 0x55951ad80200, gpu: 0x7f1bb2c00000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix B: 4096 * 4096, cpu: 0x7f1bd1fff010, gpu: 0x7f1bbe000000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix C: 128 * 4096, cpu: 0x55951ae81f50, gpu: 0x7f1bb2d00000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix D: 128 * 4096, cpu: 0x55951af82360, gpu: 0x7f1bb2e00000
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:38 Matrix] Matrix Base: 128 * 4096, cpu: 0x55951b0830c0, gpu: 0x7f1bb2f00000
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:57 GemmTester] Cublas-Gemm use: 75.685 ms
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:67 evaluate] ----------------- Evaluating Cublas-Gemm -----------------
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:78 evaluate] Warm up time: 0.428 ms
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:123 checkValue] Max diff: 0.000000, avg diff: 0.000000
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:122 profile] Cublas-Gemm exit, profiling time: 0.052 ms (100.00%), throughput: 83.334 TFLOPS (100.00%)
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:67 evaluate] ----------------- Evaluating CublasLt-Gemm -----------------
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:78 evaluate] Warm up time: 0.154 ms
[CG 2025-05-04 10:25:19 1944667:1944667 matrix.h:123 checkValue] Max diff: 0.500000, avg diff: 0.027669
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:122 profile] CublasLt-Gemm exit, profiling time: 0.065 ms (125.54%), throughput: 66.379 TFLOPS (79.65%)
[CG 2025-05-04 10:25:19 1944667:1944667 gemm_tester.h:67 evaluate] ----------------- Evaluating Cutlass-Gemm -----------------
[CG 2025-05-04 10:25:20 1944667:1944667 gemm_tester.h:78 evaluate] Warm up time: 1.992 ms
[CG 2025-05-04 10:25:20 1944667:1944667 matrix.h:123 checkValue] Max diff: 0.500000, avg diff: 0.027669
[CG 2025-05-04 10:25:20 1944667:1944667 gemm_tester.h:122 profile] Cutlass-Gemm exit, profiling time: 0.068 ms (131.86%), throughput: 63.197 TFLOPS (75.84%)
[CG 2025-05-04 10:25:20 1944667:1944667 benchmark_gemm.cpp:103 main] Done

Do you also see this issue? If so, how can it be solved?
Thanks
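
One possible reading of the "Max diff: 0.500000" lines, offered only as a hypothesis and not confirmed by the log: the run uses bf16 ("is bf16: 1"), and bf16 keeps 8 significand bits, so adjacent representable values in [64, 128) are 0.5 apart. If two kernels accumulate in a different order, their float32 sums can land on opposite sides of a rounding midpoint and differ by one bf16 step. The sketch below emulates bf16 rounding in plain Python; `to_bf16` and `bf16_spacing` are hypothetical helper names, and the sample values are made up for illustration since the log does not show the actual output magnitudes.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Illustrative only: assumes x is finite and representable in float32;
    NaN and infinity are not handled.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits; add a bias so truncation rounds to nearest even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_spacing(x: float) -> float:
    """Gap between adjacent bf16 values near a finite, normal, nonzero x."""
    _, exponent = math.frexp(abs(x))  # abs(x) = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 8)      # 8 significand bits (7 stored + 1 implicit)


# Two hypothetical float32 accumulator results that differ by only 0.02
# round to bf16 values that are a full step (0.5) apart.
a = to_bf16(100.24)
b = to_bf16(100.26)
print(a, b, abs(a - b), bf16_spacing(100.0))
```

If this is what is happening, comparing results with a tolerance scaled by `bf16_spacing` of the reference value, instead of expecting an exact match, would be one way to check it.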
