Status update: lifting the unaligned GPU matmul codegen boats #13227
nicolasvasilache started this conversation in Codegen
Replies: 1 comment
Exciting results! I can create new benchmarks with this flag enabled once #13133 is merged.
-
I wanted to share a quick update on unaligned matmul codegen for tensorcore-based GPUs before disappearing for 2 weeks.
Below are the performance gains that become available (once #13133 lands) by turning on the
--iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy flag (a 5-40x improvement over the current IREE unaligned cases). This can be reproduced today by just patching in #13191 (which extracts the key change required from #13133) and running
make unaligned_matmuls with this iree-samples commit. This runs a few combinations of align1/align2/align4/align_more around the 3456_1024_2048 size, f32 only for now.
Feel free to try other sizes.
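For readers who want to experiment directly rather than through the benchmark harness, a rough sketch of passing the flag to the compiler might look like the following. The flag name is taken from this update; the input file name, target backend, and output name are placeholder assumptions, not from the original post.

```shell
# Sketch only: compile a matmul module with the transform-dialect
# tensorcore strategy enabled. File and target names are assumptions.
iree-compile matmul.mlir \
  --iree-hal-target-backends=cuda \
  --iree-codegen-llvmgpu-enable-transform-dialect-matmul-tensorcore-strategy \
  -o matmul.vmfb
```

This is only illustrative of how an opt-in codegen flag is typically threaded through iree-compile; the supported reproduction path is the make unaligned_matmuls target described above.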
Now, we are still 2-4x off where we want to be, and there is still work to do around some of the low-level aspects:
- 128x128x16x3xwmma
If people feel bold, they could try turning the flag on by default to get the first 5-40x perf gains.
I'll pick this up again in 2 weeks.
@silvasean @mariecwhite @mattwalsh @stellaraccident @ftynse