[QST] Is there any fp16xfp16 GEMM sample using CUTE with a performance comparable to cublas?

**What is your question?**
I want to write my own fused fp16xfp16 GEMM kernel with CuTe, but I cannot find a tutorial or sample code whose performance is comparable to cuBLAS.

I noticed the tutorials in https://github.com/NVIDIA/cutlass/tree/main/examples/cute/tutorial, which cover fp32xfp32 and int8xint8 GEMM, but the performance of the int8xint8 GEMM is not good enough. I also found a third-party fp16xfp16 GEMM written with CuTe at https://github.com/leimao/CUDA-GEMM-Optimization?tab=readme-ov-file, but as its README shows, its performance is not yet comparable to cuBLAS. Could an official CuTe fp16xfp16 GEMM kernel with good performance be provided, so that I can build on it?

[QST] Is there any fp16xfp16 GEMM sample using CUTE with a performance comparable to cublas? #1686

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[QST] Is there any fp16xfp16 GEMM sample using CUTE with a performance comparable to cublas? #1686

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions