**Describe the bug** `__shfl_down_sync` in `masked_matmul` triggers an illegal instruction error on a V100 when the dim is not a multiple of 32.
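
For context, here is a minimal sketch of one way this class of failure can arise. It is an assumption for illustration, not a confirmed diagnosis of `masked_matmul`. The CUDA programming guide requires every lane named in the mask of `__shfl_down_sync` to execute the same intrinsic; if out-of-range lanes return early while the mask is still `0xffffffff`, the behavior is undefined. The hypothetical `warp_sum` kernel below avoids this by rounding the block size up to whole warps and letting out-of-range lanes contribute zero, so all 32 lanes reach the shuffle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration only; this is NOT the masked_matmul code.
// Each thread loads one element of a length-n vector and each warp reduces
// its partial sums with __shfl_down_sync.
__global__ void warp_sum(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Keep every lane alive: out-of-range lanes contribute 0 instead of
    // returning early, so all 32 lanes named in the mask reach the shuffle.
    float val = (idx < n) ? in[idx] : 0.0f;

    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }

    // Lane 0 of each warp holds that warp's total.
    if ((threadIdx.x % warpSize) == 0) {
        atomicAdd(out, val);
    }
}

int main() {
    const int n = 40;  // deliberately not a multiple of 32
    float h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    // Round the block size up to a whole number of warps.
    const int threads = ((n + 31) / 32) * 32;
    warp_sum<<<1, threads>>>(d_in, d_out, n);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f (status: %s)\n", h_out,
           cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

If the kernel cannot be padded this way, another option is to build the mask from the lanes that actually participate (for example with `__ballot_sync`) before any lane exits, rather than hard-coding `0xffffffff`.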