COSMA alleviates the issues of current state-of-the-art algorithms:
- `2D (SUMMA)`: requires manual tuning and is not communication-optimal in the presence of extra memory.
- `2.5D`: optimal for `m=n`, but inefficient for `m << n` or `n << m` and for some numbers of processes `p`.
- `Recursive (CARMA)`: asymptotically communication-optimal for all `m, n, k, p`, but always splitting the largest dimension can increase the communication volume by up to a factor of `√3`.
- `COSMA (this work)`: strictly communication-optimal (not just asymptotically) for all `m, n, k, p` and memory sizes, yielding speedups of up to 8.3x over the second-fastest algorithm.
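To make the CARMA point concrete, here is a small illustrative sketch (our own code, not CARMA's or COSMA's; the function name and shapes are ours) of the "always split the largest dimension" recursion described in the list above:

```python
# Illustration only: CARMA-style recursion always halves the currently
# largest of (m, n, k) when dividing work among p processes. For skewed
# shapes this schedule can be up to sqrt(3) worse in communication
# volume than an optimal split.
def carma_split(m, n, k, p):
    """Return which dimension is split at each of the log2(p) steps."""
    dims = {"m": m, "n": n, "k": k}
    steps = []
    while p > 1:
        d = max(dims, key=dims.get)  # always pick the largest dimension
        dims[d] //= 2
        steps.append(d)
        p //= 2
    return steps

print(carma_split(100, 100, 6400, 8))  # ['k', 'k', 'k']
```

For a strongly skewed shape like `m = n << k`, every recursion step divides `k`; COSMA instead chooses the split to be communication-optimal for the given shape and memory size.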
In addition to being communication-optimal, this implementation is highly optimized to reduce the memory footprint.
The library supports both one-sided and two-sided MPI communication backends. It uses `dgemm` for the local computations, but also supports `GPU` acceleration through our `Tiled-MM` library using `cublas` or `rocBLAS`.
## COSMA Literature
The paper and other materials on COSMA are available at the following link:

- **ACM Digital Library (Best Student Paper Award at SC19):** https://dl.acm.org/doi/10.1145/3295500.3356181
- **[NEW] Multi-GPU Systems Support:** COSMA can now take advantage of fast GPU-to-GPU interconnects, either through the NCCL/RCCL libraries or through GPU-aware MPI. Both NVIDIA and AMD GPUs are supported.
- **ScaLAPACK API Support:** it is enough to link to COSMA, without changing the code, and all `p?gemm` calls will use the ScaLAPACK wrappers provided by COSMA.
- **C/Fortran Interface:** written in `C++`, but provides `C` and `Fortran` interfaces.
- **Custom Types:** fully templatized types.
- **GPU acceleration:** supports both **NVIDIA** and **AMD** GPUs.
- **Custom Data Layout Support:** natively uses its own blocked data layout of matrices, but supports arbitrary grid-like data layouts.
- **Transposition/Conjugation Support:** matrices `A` and `B` can be transposed and/or conjugated.
- **Communication and Computation Overlap:** supports overlapping of communication and computation.
See [Installation Instructions](INSTALL.md).
COSMA is a CMake project and requires a recent CMake (>= 3.17).

External dependencies:
- `MPI 3` (required)
- `BLAS`: when the problem becomes local, COSMA uses the provided `?gemm` backend, which can be one of the following:
  - `MKL` (default)
  - `OPENBLAS`
  - `BLIS`
  - `ATLAS`
To allow easy integration, COSMA can be used in the following ways:
- **adapting your code:** if your code is not using ScaLAPACK, there are two interfaces that can be used:
  - **custom layout:** if your matrices are distributed in a custom way, it is enough to pass the descriptors of your data layout to the `multiply_using_layout` function, which adapts COSMA to your layout.
  - **native COSMA layout:** to get the maximum performance, the native COSMA matrix layout should be used. To get an idea of the performance you can expect, please have a look at the [matrix multiplication miniapp](#matrix-multiplication).

The documentation for the latter option will soon be published here.
With COSMA, even higher speedups are possible, depending on matrix shapes. To illustrate possible performance gains, we also ran different **square matrix** multiplications on the same number of nodes (**=128**) of the [Piz Daint supercomputer](https://www.cscs.ch/computers/piz-daint/). The block size is `128x128` and the processor grid is also square: `16x16` (2 ranks per node). The performance of COSMA is compared against Intel MKL ScaLAPACK (version 19.0.1.144). The results on Cray XC50 (GPU-accelerated) and Cray XC40 (CPU-only) are summarized in the following table:

All the results from this section assumed matrices given in the (block-cyclic) ScaLAPACK data layout. However, if the native COSMA layout is used, even higher throughput is possible.
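The block-cyclic ScaLAPACK layout mentioned above can be sketched in a few lines. This is our own illustration (not COSMA API) of block-cyclic ownership, using the block size `128` and a `16`-row process grid as in the benchmark above:

```python
# Illustration only: under a ScaLAPACK-style block-cyclic distribution,
# global rows are dealt out in blocks of `nb` rows, cycling over `p`
# process rows. (The same rule applies independently to columns.)
def owner(i, nb=128, p=16):
    """Process-grid row that owns global matrix row i."""
    return (i // nb) % p

print(owner(0), owner(128), owner(2048))  # 0 1 0
```

With 16 process rows and 128-row blocks, the pattern repeats every `16 * 128 = 2048` rows, which is why row `2048` lands back on process row `0`.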
### Julia language
```bash
# set the number of threads to be used by each MPI rank
export OMP_NUM_THREADS=18
# if using CPU version with MKL backend, set MKL_NUM_THREADS as well
```
The overview of all supported options is given below:
- `-m (--m_dim)` (default: `1000`): number of rows of matrices `A` and `C`.
- `-n (--n_dim)` (default: `1000`): number of columns of matrices `B` and `C`.
- `-k (--k_dim)` (default: `1000`): number of columns of matrix `A` and rows of matrix `B`.
- `--block_a` (optional, default: `128,128`): 2D block size for matrix `A`.
- `--block_b` (optional, default: `128,128`): 2D block size for matrix `B`.
- `--block_c` (optional, default: `128,128`): 2D block size for matrix `C`.
- `-p (--p_grid)` (optional, default: `1,P`): 2D processor grid. By default `1xP`, where `P` is the total number of MPI ranks.
### Parameters Overview
The overview of tunable parameters that can be set through environment variables is given in the table below. The default values are given in **bold**.
ENVIRONMENT VARIABLE | POSSIBLE VALUES | DESCRIPTION
--- | --- | ---
`COSMA_OVERLAP_COMM_AND_COMP` | ON, **OFF** | If enabled, communication and computation might be overlapped, depending on the built-in heuristics.
`COSMA_ADAPT_STRATEGY` | **ON**, OFF | If enabled, COSMA will try to natively use the ScaLAPACK layout, without transforming to the COSMA layout. Used only in the pxgemm wrapper.
`COSMA_CPU_MAX_MEMORY` | integer (`size_t`), by default: **infinite** | CPU memory limit in megabytes per MPI process (rank). Allowing too little memory might reduce the performance.
`COSMA_GPU_MEMORY_PINNING` | **ON**, OFF | If enabled, COSMA will pin parts of the host memory to speed up CPU-GPU memory transfers. Used only in the GPU backend.
`COSMA_GPU_MAX_TILE_M`, `COSMA_GPU_MAX_TILE_N`, `COSMA_GPU_MAX_TILE_K` | integer (`size_t`), by default: **5000** | Tile sizes used to pipeline the local CPU matrices to the GPU, where `K` refers to the shared dimension and `MxN` refers to the dimensions of matrix `C`. By default, all tiles are square with dimensions `5000x5000`.
These are only the maximum tile sizes; the actual tile sizes used might be smaller, depending on the problem size. These variables are only used in the GPU backend for pipelining the local matrices to GPUs.
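A minimal sketch of how such a cap on tile sizes might play out (our own illustration, not COSMA's actual pipelining code): a matrix dimension is split into near-equal tiles, each no larger than the corresponding `COSMA_GPU_MAX_TILE_*` value:

```python
import math

# Illustration only: split a local matrix dimension into near-equal
# tiles, each at most `max_tile` (the COSMA_GPU_MAX_TILE_* cap).
def tile_sizes(dim, max_tile=5000):
    """Return tile sizes covering `dim`, each <= max_tile."""
    n_tiles = math.ceil(dim / max_tile)
    base = dim // n_tiles
    rem = dim % n_tiles
    # spread the remainder over the first `rem` tiles
    return [base + (1 if i < rem else 0) for i in range(n_tiles)]

print(tile_sizes(12000))  # [4000, 4000, 4000]
```

This shows why the variables are only upper bounds: a `12000`-wide dimension with the default cap of `5000` ends up with three tiles of `4000` rather than two tiles of `5000` plus a small remainder.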
It is also possible to specify the number of GPU streams:
The percentage is always relative to the first level above.
- Grzegorz Kwasniewski, Marko Kabic, Maciej Besta, Joost VandeVondele, Raffaele Solca, Torsten Hoefler
Cite as:
```
@inproceedings{cosma_algorithm_2019,
  title={Red-blue pebbling revisited: Near optimal parallel matrix-matrix multiplication},
  author={Kwasniewski, Grzegorz and Kabic, Marko and Besta, Maciej and VandeVondele, Joost and Solca, Raffaele and Hoefler, Torsten},
  booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '19)},
  year={2019}
}
```
For questions, feel free to contact us, and we will get back to you soon:
## Acknowledgements
This work was funded in part by:
<img align="left" height="50" src="./docs/eth-logo.svg"> | [**ETH Zurich**](https://ethz.ch/en.html)**: Swiss Federal Institute of Technology in Zurich**