Commit 172cdb4

cmake: nccl/rccl cleanup (#130)
- simplify FindNCCL.cmake
- remove FindRCCL.cmake (rocm provides a cmake config for it)
- add COSMA_WITH_RCCL cmake option
- update cosmaConfig.cmake
1 parent 2534f97 · commit 172cdb4

7 files changed: +85, -238 lines

CMakeLists.txt (+8, -17)
@@ -3,7 +3,7 @@ cmake_minimum_required(VERSION 3.17 FATAL_ERROR)
 project(cosma
     DESCRIPTION "Communication Optimal Matrix Multiplication"
     HOMEPAGE_URL "https://github.com/eth-cscs/COSMA"
-    VERSION 2.6.4
+    VERSION 2.6.5
     LANGUAGES CXX C)

@@ -25,6 +25,7 @@ option(COSMA_WITH_APPS "Generate the miniapp targets." ON)
 option(COSMA_WITH_BENCHMARKS "Generate the benchmark targets." ON)
 option(COSMA_WITH_PROFILING "Enable profiling." OFF)
 option(COSMA_WITH_NCCL "Use NCCL as communication backend." OFF)
+option(COSMA_WITH_RCCL "Use RCCL as communication backend." OFF)
 option(COSMA_WITH_GPU_AWARE_MPI "Use gpu-aware MPI for communication." OFF)
 option(BUILD_SHARED_LIBS "Build shared libraries." OFF)
 set(COSMA_SCALAPACK "OFF" CACHE STRING "scalapack implementation. Can be MKL, CRAY_LIBSCI, CUSTOM or OFF.")
@@ -120,27 +121,18 @@ endif()

 # these are only GPU-backends
 if (COSMA_GPU_BACKEND MATCHES "CUDA|ROCM")
-
     set(TILEDMM_GPU_BACKEND ${COSMA_GPU_BACKEND} CACHE STRING "GPU backend" FORCE)
     add_git_submodule_or_find_external(Tiled-MM libs/Tiled-MM)
     if (NOT TARGET Tiled-MM::Tiled-MM AND TARGET Tiled-MM)
         add_library(Tiled-MM::Tiled-MM ALIAS Tiled-MM)
     endif()

-    if (COSMA_WITH_NCCL OR COSMA_WITH_GPU_AWARE_MPI)
-        if (${COSMA_GPU_BACKEND} MATCHES "CUDA")
-            find_package(CUDAToolkit REQUIRED)
-            find_package(NCCL REQUIRED)
-            message(INFO "NCCL INCLUDE DIRS = ${NCCL_INCLUDE_DIRS}")
-            message(INFO "NCCL LIBRARIES = ${NCCL_LIBRARIES}")
-        elseif(${COSMA_GPU_BACKEND} MATCHES "ROCM")
-            finmd_package(hip REQUIRED)
-            find_package(RCCL REQUIRED)
-            message(INFO "RCCL INCLUDE DIRS = ${RCCL_INCLUDE_DIRS}")
-            message(INFO "RCCL LIBRARIES = ${RCCL_LIBRARIES}")
-        else()
-            message(FATAL_ERROR "COSMA_WITH_NCCL AND/OR COSMA_WITH_GPU_AWARE_MPI are specified, but no GPU backend chosen.")
-        endif()
+    if (COSMA_WITH_NCCL)
+        find_package(CUDAToolkit REQUIRED)
+        find_package(NCCL REQUIRED)
+    elseif (COSMA_WITH_RCCL)
+        find_package(hip REQUIRED)
+        find_package(rccl REQUIRED)
     endif()
 endif()
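With the simplified branching above, each interconnect library maps to its own option. A minimal configure sketch of the two paths (the CUDA/NCCL line matches the example already in the README; the ROCm/RCCL line is an assumption based on the new `COSMA_WITH_RCCL` option, so the exact `COSMA_BLAS`/`COSMA_SCALAPACK` values may need adjusting for your system):

```bash
# NVIDIA GPUs: NCCL communication backend (as in the README example)
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=ON ..

# AMD GPUs: RCCL communication backend -- a sketch using the option added by this
# commit; COSMA_BLAS=ROCM is assumed to select the ROCm (rocBLAS) backend
cmake -DCOSMA_BLAS=ROCM -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_RCCL=ON ..
```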

@@ -207,7 +199,6 @@ install(FILES "${cosma_BINARY_DIR}/cosmaConfig.cmake"
               "${cosma_SOURCE_DIR}/cmake/FindCRAY_LIBSCI.cmake"
               "${cosma_SOURCE_DIR}/cmake/FindGenericBLAS.cmake"
               "${cosma_SOURCE_DIR}/cmake/FindNCCL.cmake"
-              "${cosma_SOURCE_DIR}/cmake/FindRCCL.cmake"
               "${cosma_SOURCE_DIR}/cmake/FindBLIS.cmake"
         DESTINATION "${CMAKE_INSTALL_LIBDIR}/cmake/cosma")
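Because `FindRCCL.cmake` is no longer installed, `find_package(rccl)` now resolves through the CMake package configuration that ROCm itself ships. A minimal sketch of pointing CMake at it; the `/opt/rocm` prefix and the exact config location are assumptions about a conventional ROCm install:

```bash
# Either add the ROCm prefix to CMake's search path ...
cmake -DCMAKE_PREFIX_PATH=/opt/rocm -DCOSMA_BLAS=ROCM -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_RCCL=ON ..

# ... or point directly at the directory assumed to contain rccl's package config
cmake -Drccl_DIR=/opt/rocm/lib/cmake/rccl -DCOSMA_BLAS=ROCM -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_RCCL=ON ..
```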

README.md (+31, -31)
@@ -38,7 +38,7 @@ COSMA alleviates the issues of current state-of-the-art algorithms, which can be

 - `2D (SUMMA)`: Requires manual tuning and not communication-optimal in the presence of extra memory.
 - `2.5D`: Optimal for `m=n`, but inefficient for `m << n` or `n << m` and for some numbers of processes `p`.
-- `Recursive (CARMA)`: Asymptotically communication-optimal for all `m, n, k, p`, but splitting always the largest dimension might lead up to `√3` increase in communication volume.
+- `Recursive (CARMA)`: Asymptotically communication-optimal for all `m, n, k, p`, but splitting always the largest dimension might lead up to `√3` increase in communication volume.
 - `COSMA (this work)`: Strictly communication-optimal (not just asymptotically) for all `m, n, k, p` and memory sizes that yields the speedups by factor of up to 8.3x over the second-fastest algorithm.

 In addition to being communication-optimal, this implementation is higly-optimized to reduce the memory footprint in the following sense:
@@ -48,21 +48,21 @@ In addition to being communication-optimiz
 The library supports both one-sided and two-sided MPI communication backends. It uses `dgemm` for the local computations, but also has a support for the `GPU` acceleration through our `Tiled-MM` library using `cublas` or `rocBLAS`.

 ## COSMA Literature
-
+
 The paper and other materials on COSMA are available under the following link:
-- **ACM Digital Library (Best Student Paper Award at SC19):** https://dl.acm.org/doi/10.1145/3295500.3356181
+- **ACM Digital Library (Best Student Paper Award at SC19):** https://dl.acm.org/doi/10.1145/3295500.3356181
 - **Arxiv:** https://arxiv.org/abs/1908.09606
 - **YouTube Presentation:** https://www.youtube.com/watch?v=5wiZWw5ltR0
 - **Press Release:** https://www.cscs.ch/science/computer-science-hpc/2019/new-matrix-multiplication-algorithm-pushes-the-performance-to-the-limits/

 ## Features

-- **[NEW] Multi-GPU Systems Support:** COSMA is now able to take advantage of fast GPU-to-GPU interconnects either through the use of NCCL/RCCL libraries or by using the GPU-aware MPI. Both, NVIDIA and AMD GPUs are supported.
+- **[NEW] Multi-GPU Systems Support:** COSMA is now able to take advantage of fast GPU-to-GPU interconnects either through the use of NCCL/RCCL libraries or by using the GPU-aware MPI. Both, NVIDIA and AMD GPUs are supported.
 - **ScaLAPACK API Support:** it is enough to link to COSMA, without changing the code and all `p?gemm` calls will use ScaLAPACK wrappers provided by COSMA.
 - **C/Fortran Interface:** written in `C++`, but provides `C` and `Fortran` interfaces.
 - **Custom Types:** fully templatized types.
 - **GPU acceleration:** supports both **NVIDIA** and **AMD** GPUs.
-- **Supported BLAS (CPU) backends:** MKL, LibSci, NETLIB, BLIS, ATLAS.
+- **Supported BLAS (CPU) backends:** MKL, LibSci, NETLIB, BLIS, ATLAS.
 - **Custom Data Layout Support:** natively uses its own blocked data layout of matrices, but supports arbitrary grid-like data layout of matrices.
 - **Tranposition/Conjugation Support:** matrices `A` and `B` can be transposed and/or conjugated.
 - **Communication and Computation Overlap:** supports overlapping of communication and computation.
@@ -77,11 +77,11 @@ See [Installation Instructions](INSTALL.md).

 COSMA is a CMake project and requires a recent CMake(>=3.17).

-External dependencies:
+External dependencies:

 - `MPI 3`: (required)
 - `BLAS`: when the problem becomes local, COSMA uses provided `?gemm` backend, which can be one of the following:
-  - `MKL` (default)
+  - `MKL` (default)
   - `OPENBLAS`
   - `BLIS`
   - `ATLAS`
@@ -105,7 +105,7 @@ To allow easy integration, COSMA can be used in the following ways:
 - **adapting your code:** if your code is not using ScaLAPACK, then there are two interfaces that can be used:
   - **custom layout:** if you matrices are distributed in a custom way, then it is eanough to pass the descriptors of your data layout to `multiply_using_layout` function, which will then adapt COSMA to your own layout.
   - **native COSMA layout:** to get the maximum performance, the native COSMA matrix layout should be used. To get an idea of the performance you can expect to get, please have a look at the [matrix multiplication miniapp](#matrix-multiplication).
-
+
 The documentation for the latter option will soon be published here.

 ## Using COSMA in 30 seconds
@@ -140,27 +140,27 @@ make install
 2) Link your code to COSMA:
 - **CPU-only** version of COSMA:
 - link your code to:
-> -L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack
-
+> -L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack
+
 - then link to the BLAS and ScaLAPACK you built COSMA with (see `COSMA_BLAS` and `COSMA_SCALAPACK` flags in cmake):
 > -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm
-
-
-- using **GPU-accelerated** version of COSMA:
+
+
+- using **GPU-accelerated** version of COSMA:
 - link your code to:
 >-L<installation dir>/cosma/lib64 -lcosma_pxgemm -lcosma -lcosta_scalapack -lTiled-MM
-
+
 - link to the GPU backend you built COSMA with (see `COSMA_BLAS` flag in cmake):
 >-lcublas -lcudart -lrt
-
+
 - then link to the ScaLAPACK you built COSMA with (see `COSMA_SCALAPACK` flag in cmake):
 >-L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lgomp -lpthread -lm
-
+
 3) Include headers:
 >-I<installation dir>/cosma/include
-
+
 ## COSMA on Multi-GPU Systems
-
+
 COSMA is able to take advantage of fast GPU-to-GPU interconnects on multi-gpu systems. This can be achieved in one of the following ways.

 ### Using `NCCL/RCCL` Libraries
@@ -173,7 +173,7 @@ When running `cmake` for COSMA, make sure to specify `-DCOSMA_WITH_NCCL=ON`, e.g
 # - NCCL_INCLUDE_DIR: Directory where NCCL header is found
 # - NCCL_LIB_DIR: Directory where NCCL library is found
 cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=ON ..
-
+
 # AMD GPUs
 # this will looks for RCCL library in the following environment variables:
 # - RCCL_ROOT_DIR: Base directory where all RCCL components are found
@@ -210,11 +210,11 @@ On **128 nodes**, we compared the performance of CP2K using the following algori

 <p align="center"><img src="./docs/cp2k-results-128.svg" width="95%"></p>

-With COSMA, even higher speedups are possible, depending on matrix shapes. To illustrate possible performance gains, we also ran different **square matrix** multiplications on the same number of nodes (**=128**) of [Piz Daint supercomputer](https://www.cscs.ch/computers/piz-daint/). The block size is `128x128` and the processor grid is also square: `16x16` (2 ranks per node). The performance of COSMA is compared against Intel MKL ScaLAPACK (version: 19.0.1.144). The results on Cray XC50 (GPU-accelerated) and Cray XC40 (CPU-only) are summarized in the following table:
+With COSMA, even higher speedups are possible, depending on matrix shapes. To illustrate possible performance gains, we also ran different **square matrix** multiplications on the same number of nodes (**=128**) of [Piz Daint supercomputer](https://www.cscs.ch/computers/piz-daint/). The block size is `128x128` and the processor grid is also square: `16x16` (2 ranks per node). The performance of COSMA is compared against Intel MKL ScaLAPACK (version: 19.0.1.144). The results on Cray XC50 (GPU-accelerated) and Cray XC40 (CPU-only) are summarized in the following table:

 <p align="center"><img src="./docs/square-results.svg" width="80%"></p>

-All the results from this section assumed matrices given in (block-cyclic) ScaLAPACK data layout. However, if the native COSMA layout is used, even higher throughput is possible.
+All the results from this section assumed matrices given in (block-cyclic) ScaLAPACK data layout. However, if the native COSMA layout is used, even higher throughput is possible.

 ### Julia language

@@ -257,7 +257,7 @@ the project):
 # set the number of threads to be used by each MPI rank
 export OMP_NUM_THREADS=18
 # if using CPU version with MKL backend, set MKL_NUM_THREADS as well
-export MKL_NUM_THREADS=18
+export MKL_NUM_THREADS=18
 # run the miniapp
 mpirun -np 4 ./build/miniapp/cosma_miniapp -m 1000 -n 1000 -k 1000 -r 2
 ```
@@ -287,10 +287,10 @@ The miniapp consists of an executable `./build/miniapp/pxgemm_miniapp` which can
 # set the number of threads to be used by each MPI rank
 export OMP_NUM_THREADS=18
 # if using CPU version with MKL backend, set MKL_NUM_THREADS as well
-export MKL_NUM_THREADS=18
+export MKL_NUM_THREADS=18
 # run the miniapp
 mpirun -np 4 ./build/miniapp/pxgemm_miniapp -m 1000 -n 1000 -k 1000 \
-    --block_a=128,128 \
+    --block_a=128,128 \
     --block_b=128,128 \
     --block_c=128,128 \
     --p_grid=2,2 \
@@ -301,9 +301,9 @@ mpirun -np 4 ./build/miniapp/pxgemm_miniapp -m 1000 -n 1000 -k 1000 \

 The overview of all supported options is given below:
 - `-m (--m_dim)` (default: `1000`): number of rows of matrices `A` and `C`.
-- `-n (--n_dim)` (default: `1000`): number of columns of matrices `B` and `C`.
+- `-n (--n_dim)` (default: `1000`): number of columns of matrices `B` and `C`.
 - `-k (--k_dim)` (default: `1000`): number of columns of matrix `A` and rows of matrix `B`.
-- `--block_a` (optional, default: `128,128`): 2D-block size for matrix A.
+- `--block_a` (optional, default: `128,128`): 2D-block size for matrix A.
 - `--block_b` (optional, default `128,128`): 2D-block size for matrix B.
 - `--block_c` (optional, default `128,128`): 2D-block size for matrix C.
 - `-p (--p_grid)` (optional, default: `1,P`): 2D-processor grid. By default `1xP` where `P` is the total number of MPI ranks.
@@ -320,12 +320,12 @@ The overview of all supported options is given below:

 ### Parameters Overview

-The overview of tunable parameters, that can be set through environment variables is given in the table below. The default values are given in **bold**.
+The overview of tunable parameters, that can be set through environment variables is given in the table below. The default values are given in **bold**.

 ENVIRONMENT VARIABLE | POSSIBLE VALUES | DESCRIPTION
 | :------------------- | :------------------- |:------------------- |
 `COSMA_OVERLAP_COMM_AND_COMP` | ON, **OFF** | If enabled, commmunication and computation might be overlapped, depending on the built-in heuristics.
-`COSMA_ADAPT_STRATEGY` | **ON**, OFF | If enabled, COSMA will try to natively use the scalapack layout, without transforming to the COSMA layout. Used only in the pxgemm wrapper.
+`COSMA_ADAPT_STRATEGY` | **ON**, OFF | If enabled, COSMA will try to natively use the scalapack layout, without transforming to the COSMA layout. Used only in the pxgemm wrapper.
 `COSMA_CPU_MAX_MEMORY` | integer (`size_t`), by default: **infinite** | CPU memory limit in megabytes per MPI process (rank). Allowing too little memory might reduce the performance.
 `COSMA_GPU_MEMORY_PINNING` | **ON**, OFF | If enabled, COSMA will pin parts of the host memory to speed up CPU-GPU memory transfers. Used only in the GPU backend.
 `COSMA_GPU_MAX_TILE_M`, `COSMA_GPU_MAX_TILE_N`, `COSMA_GPU_MAX_TILE_K` | integer (`size_t`), by default: **5000** | Tile sizes for each dimension, that are used to pipeline the local CPU matrices to GPU. `K` refers to the shared dimension and `MxN` refer to the dimensions of matrix `C`
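The variables in the table above are plain environment variables and can be combined with the miniapp invocation shown earlier in this README. A short, purely illustrative sketch (the chosen values are assumptions, not recommendations):

```bash
# illustrative values only; the defaults are listed in the table above
export COSMA_OVERLAP_COMM_AND_COMP=ON
export COSMA_CPU_MAX_MEMORY=4096        # in megabytes, per MPI rank
export COSMA_GPU_MEMORY_PINNING=ON      # only relevant for the GPU backend
mpirun -np 4 ./build/miniapp/cosma_miniapp -m 1000 -n 1000 -k 1000 -r 2
```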
@@ -350,7 +350,7 @@ export COSMA_GPU_MAX_TILE_K=5000
 ```
 where `K` refers to the shared dimension and `MxN` refer to the dimensions of matrix `C`. By default, all tiles are square and have dimensions `5000x5000`.

-These are only the maximum tiles and the actual tile sizes that will be used might be less, depending on the problem size. These variables are only used in the GPU backend for pipelining the local matrices to GPUs.
+These are only the maximum tiles and the actual tile sizes that will be used might be less, depending on the problem size. These variables are only used in the GPU backend for pipelining the local matrices to GPUs.

 It is also possible to specify the number of GPU streams:
 ```bash
@@ -411,7 +411,7 @@ The precentage is always relative to the first level above. All time measurement

 - Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solca, Torsten Hoefler

-Cite as:
+Cite as:
 ```
 @inproceedings{cosma_algorithm_2019,
 title={Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication},
@@ -432,7 +432,7 @@ For questions, feel free to contact us, and we will soon get back to you:

 ## Acknowledgements

-This work was funded in part by:
+This work was funded in part by:

 <img align="left" height="50" src="./docs/eth-logo.svg"> | [**ETH Zurich**](https://ethz.ch/en.html)**: Swiss Federal Institute of Technology in Zurich**
 | :------------------- | :------------------- |
