We have tried offloading mat_mul to the GPU using cuBLAS through a C interface, and obtained at least a 2x speedup on the SlabSS_calc task with an A100.
The idea is still very preliminary, but it is compatible with the existing MPI parallelization.
We hope it serves as a catalyst that inspires more people to optimize and refine this project.
https://github.com/pkusc/wannier_tools_cuda
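
For readers unfamiliar with the approach, here is a minimal sketch of what a C wrapper around cuBLAS for a complex matrix product might look like. It is not taken from the repository: the function name `gpu_zgemm`, the `USE_CUBLAS` build flag, and the CPU fallback are all assumptions made for illustration. It assumes square, column-major (Fortran-ordered) double-complex matrices.

```c
#include <complex.h>
#include <stddef.h>

#ifdef USE_CUBLAS            /* hypothetical build flag, not from the repository */
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cublas_v2.h>
#endif

/* Hypothetical wrapper: computes C = A * B for n x n column-major
 * double-complex matrices. Returns 0 on success, nonzero on failure. */
int gpu_zgemm(int n, const double complex *A, const double complex *B,
              double complex *C)
{
    if (n <= 0 || !A || !B || !C) return 1;

#ifdef USE_CUBLAS
    size_t bytes = (size_t)n * (size_t)n * sizeof(cuDoubleComplex);
    cuDoubleComplex *dA = NULL, *dB = NULL, *dC = NULL;
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cublasHandle_t handle;
    int ok = 0;

    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

    /* Copy inputs to the device, multiply there, copy the result back.
     * C99 double complex and cuDoubleComplex share the same memory layout
     * (two consecutive doubles), so the host arrays can be copied as-is. */
    if (cudaMalloc((void **)&dA, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dB, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dC, bytes) == cudaSuccess &&
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one, dA, n, dB, n, &zero, dC, n)
            == CUBLAS_STATUS_SUCCESS &&
        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
        ok = 1;
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cublasDestroy(handle);
    return ok ? 0 : 1;
#else
    /* CPU fallback so the sketch also builds and runs without a GPU. */
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double complex sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i + (size_t)k * n] * B[k + (size_t)j * n];
            C[i + (size_t)j * n] = sum;
        }
    }
    return 0;
#endif
}
```

On a machine with the CUDA toolkit, the GPU path would be enabled by compiling with `-DUSE_CUBLAS` and linking against `-lcublas -lcudart`. A Fortran caller could reach a function like this through an `iso_c_binding` interface. Creating the handle and allocating device buffers on every call keeps the sketch short but is wasteful; a real port would more likely create them once and reuse them across calls.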