
Commit 808c253

Author: Manish Gupta
CUTLASS 2.8 (#363)
1 parent 6fc5008 commit 808c253

127 files changed, +18568 −1351 lines


CHANGELOG.md (+24)

@@ -1,5 +1,29 @@
 # NVIDIA CUTLASS Changelog
 
+## [2.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.8.0) (2021-11-19)
+
+  * **TF32x3:** emulated single-precision using Tensor Cores
+    * 45+ TFLOPs on NVIDIA A100
+    * [GEMM SDK example](/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu) (real)
+    * [COMPLEX GEMM SDK example](/examples/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm.cu) (complex)
+    * [Implicit GEMM Convolution SDK example](/examples/28_ampere_3xtf32_fast_accurate_tensorop_fprop/ampere_3xtf32_fast_accurate_tensorop_fprop.cu)
+  * **Mainloop fusion for Convolution:** convolution with fused per-channel scale-bias-relu
+    * [Conv Fprop SDK example](/examples/25_ampere_fprop_mainloop_fusion/ampere_fprop_mainloop_fusion.cu)
+    * [Conv WGrad SDK example](/examples/26_ampere_wgrad_mainloop_fusion/ampere_wgrad_mainloop_fusion.cu)
+    * [cutlass::conv::device::ImplicitGemmConvolutionFusion](/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h)
+  * **Grouped GEMM:** similar to batched GEMM, but with a distinct problem size per group
+    * [SDK example](/examples/24_gemm_grouped) with a performance comparison against batched strided GEMM
+    * [cutlass::gemm::device::GemmGrouped](/include/cutlass/gemm/device/gemm_grouped.h)
+  * [Implicit GEMM Convolution fusion](/examples/13_two_tensor_op_fusion/) supports staging the first convolution's output accumulator in shared memory on Turing. This allows more flexible warp tile sizes and less register pressure.
+  * Optimal performance using [**CUDA 11.5**](https://developer.nvidia.com/cuda-downloads)
+  * Updates from the community (thanks!)
+
+  * **Deprecation announcement:** CUTLASS plans to deprecate the following platforms in the future. Let us know if this affects your use case.
+    * Maxwell and Pascal GPU architectures
+    * Ubuntu 16.04
+    * CUDA 10.2
+
+
 ## [2.7.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.7.0) (2021-09-24)
 * Mainloop fusion for GEMM: [summation over A or B](/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu)
 * [Strided DGRAD (optimized iterators)](/include/cutlass/conv/kernel/default_conv2d_dgrad.h)
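
The **TF32x3** feature listed above emulates near-single-precision accuracy from TF32 Tensor Core instructions by splitting each fp32 operand into a "large" TF32 part and a small residual, then summing three TF32 products. The sketch below illustrates only the numerical idea on the host in plain C++; it is not the CUTLASS implementation (see the linked SDK examples for that), and the `to_tf32` helper here truncates rather than rounds, for simplicity.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reduce an fp32 value to TF32 precision (10 explicit mantissa bits) by
// zeroing the low 13 mantissa bits. Real hardware rounds to nearest; this
// truncating version is only for illustration.
float to_tf32(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFFE000u;
  float y;
  std::memcpy(&y, &bits, sizeof(y));
  return y;
}

// 3xTF32 idea: with a = a_big + a_small and b = b_big + b_small,
//   a * b ≈ a_big*b_big + a_big*b_small + a_small*b_big
// (the small-by-small term is dropped). Each of the three products maps to a
// TF32 Tensor Core MMA while accumulation stays in fp32.
float mul_3xtf32(float a, float b) {
  float a_big = to_tf32(a), a_small = to_tf32(a - a_big);
  float b_big = to_tf32(b), b_small = to_tf32(b - b_big);
  return a_big * b_big + a_big * b_small + a_small * b_big;
}

int main() {
  float a = 1.00012345f, b = 0.99987654f;
  std::printf("fp32:    %.9f\n", a * b);
  std::printf("1xTF32:  %.9f\n", to_tf32(a) * to_tf32(b));
  std::printf("3xTF32:  %.9f\n", mul_3xtf32(a, b));
  return 0;
}
```

Dropping the `a_small * b_small` term keeps the cost at roughly three TF32 MMAs per fp32 multiply-accumulate while recovering most of the mantissa bits lost by a single TF32 product.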

CUDA.cmake (+12 −13) — whitespace-only changes (trailing whitespace stripped; one blank line removed after `endfunction()`)

@@ -74,7 +74,7 @@ find_library(
   lib64
   lib
   NO_DEFAULT_PATH
-  # We aren't going to search any system paths. We want to find the runtime
+  # We aren't going to search any system paths. We want to find the runtime
   # in the CUDA toolkit we're building against.
 )
 
@@ -89,10 +89,10 @@ if(NOT TARGET cudart AND CUDART_LIBRARY)
   # from the PATH search.
 else()
   add_library(cudart SHARED IMPORTED GLOBAL)
-endif()
+endif()
 
 add_library(nvidia::cudart ALIAS cudart)
-
+
 set_property(
   TARGET cudart
   PROPERTY IMPORTED_LOCATION
@@ -120,7 +120,7 @@ find_library(
   lib64/stubs
   lib/stubs
   NO_DEFAULT_PATH
-  # We aren't going to search any system paths. We want to find the runtime
+  # We aren't going to search any system paths. We want to find the runtime
   # in the CUDA toolkit we're building against.
 )
 
@@ -135,10 +135,10 @@ if(NOT TARGET cuda_driver AND CUDA_DRIVER_LIBRARY)
   # from the PATH search.
 else()
   add_library(cuda_driver SHARED IMPORTED GLOBAL)
-endif()
+endif()
 
 add_library(nvidia::cuda_driver ALIAS cuda_driver)
-
+
 set_property(
   TARGET cuda_driver
   PROPERTY IMPORTED_LOCATION
@@ -164,7 +164,7 @@ find_library(
   lib64
   lib
   NO_DEFAULT_PATH
-  # We aren't going to search any system paths. We want to find the runtime
+  # We aren't going to search any system paths. We want to find the runtime
   # in the CUDA toolkit we're building against.
 )
 
@@ -179,10 +179,10 @@ if(NOT TARGET nvrtc AND NVRTC_LIBRARY)
   # from the PATH search.
 else()
   add_library(nvrtc SHARED IMPORTED GLOBAL)
-endif()
-
+endif()
+
 add_library(nvidia::nvrtc ALIAS nvrtc)
-
+
 set_property(
   TARGET nvrtc
   PROPERTY IMPORTED_LOCATION
@@ -242,15 +242,15 @@ function(cutlass_unify_source_files TARGET_ARGS_VAR)
 
   set(CUDA_FILE_ARGS)
   set(TARGET_SOURCE_ARGS)
-
+
   foreach(ARG ${__UNPARSED_ARGUMENTS})
     if(${ARG} MATCHES ".*\.cu$")
       list(APPEND CUDA_FILE_ARGS ${ARG})
     else()
       list(APPEND TARGET_SOURCE_ARGS ${ARG})
     endif()
   endforeach()
-
+
   list(LENGTH CUDA_FILE_ARGS NUM_CUDA_FILE_ARGS)
   while(NUM_CUDA_FILE_ARGS GREATER 0)
     list(SUBLIST CUDA_FILE_ARGS 0 ${__BATCH_SIZE} CUDA_FILE_BATCH)
@@ -280,7 +280,6 @@ function(cutlass_unify_source_files TARGET_ARGS_VAR)
   set(${TARGET_ARGS_VAR} ${TARGET_SOURCE_ARGS} PARENT_SCOPE)
 
 endfunction()
-
 function(cutlass_add_library NAME)
 
   set(options)

README.md (+22 −82)

@@ -1,8 +1,8 @@
 ![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")
 
-# CUTLASS 2.7
+# CUTLASS 2.8
 
-_CUTLASS 2.7 - September 2021_
+_CUTLASS 2.8 - November 2021_
 
 CUTLASS is a collection of CUDA C++ template abstractions for implementing
 high-performance matrix-multiplication (GEMM) and related computations at all levels
@@ -34,77 +34,20 @@ See the [Quick Start Guide](/media/docs/quickstart.md) to get started quickly.
 See the [functionality listing](/media/docs/functionality.md) for the list of operations
 supported at each level of the execution model hierarchy.
 
-See the [CHANGELOG](CHANGELOG.md) for descriptions of recent updates.
-
-# What's New in CUTLASS 2.7
-CUTLASS 2.7 is a minor update to CUTLASS adding:
-- Mainloop fusion for GEMM: [summation over A or B](/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu)
-- [Optimizations for strided DGRAD](/include/cutlass/conv/kernel/default_conv2d_dgrad.h)
-- [Half-precision GELU_taylor activation functions](/include/cutlass/epilogue/thread/activation.h#L196)
-- Tuning and bug fixes to [fused GEMM + GEMM example](/examples/13_two_tensor_op_fusion/)
-- Support for smaller than 128b aligned Convolutions: [see examples](test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu#L272)
-- Caching of results to accelerate Convolution [unit tests](test/unit/conv/device/cache_testbed_output.h)
-- Numerous updates from the community (thanks!)
-
-# What's New in CUTLASS 2.6
-CUTLASS 2.6 is a minor update to CUTLASS adding:
-- Fused [broadcast](test/unit/gemm/device/gemm_with_broadcast_f16n_f16n_f16n_tensorop_f32_sm75.cu) and [reductions](/test/unit/gemm/device/gemm_with_reduction_f16n_f16n_f16n_tensorop_f32_sm75.cu) in the epilogues of GEMM and Convolution
-- [Quaternion-valued GEMM](/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](/examples/22_quaternion_conv/quaternion_conv.cu) in single-precision
-- [New strided Dgrad](test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) implementation offers up to 4x performance improvements over previous strided Dgrad
-- 64-bit strides for large tensor allocations
-- [General affine layouts](/examples/18_ampere_fp64_tensorop_affine2_gemm/ampere_fp64_tensorop_affine2_gemm.cu) fp64 tensor core and simt GEMM
-- [Batched GEMV](/test/unit/gemm/device/gemv.cu) preview implementation
-- Enhanced functionality, boosted performance, and bug fixes in the epilogue.
-- Optimal performance when compiled with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
-- Adopt new L2 prefetch feature in [ptx instruction](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-4).
-- Enhanced Clang support and the combination of Clang 13 and CUDA 11.4 can build and run kernels from Pascal and Ampere.
-- Numerous updates from the community (thanks!)
-
-# What's New in CUTLASS 2.5
-CUTLASS 2.5 is a minor update to CUTLASS adding:
-- [Tensor reductions](/test/unit/reduction/device/tensor_reduce_contiguous.cu)
-- [Optimizations for 3-D convolution](include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h)
-- [Fused Convolution+Convolution example](/examples/13_two_tensor_op_fusion/README.md)
-
-# What's New in CUTLASS 2.4
-CUTLASS 2.4 is a significant update to CUTLASS adding:
-- 1-D, 2-D, and 3-D convolution targeting Tensor and CUDA cores for NVIDIA Ampere, Turing, and Volta GPU architectures
-- CUTLASS profiler support for convolution
-- [Documentation](/media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation
-
-# What's New in CUTLASS 2.3
-
-CUTLASS 2.3 is a minor update to CUTLASS adding:
-- GEMMs targeting structured [Sparse Tensor Cores](test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) in NVIDIA Ampere Architecture GPUs
-- Fast SGEMM kernels targeting GeForce RTX 30-series CUDA Cores
-- Intended to be compiled with [CUDA 11.1 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
-
-# What's New in CUTLASS 2.2
-
-CUTLASS 2.2 is a significant update to CUTLASS adding:
-
-- Coverage of [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
-- Tensor Core-accelerated GEMMs targeting Tensor Float 32, BFloat16, and double-precision data types
-- Deep software pipelines using asynchronous copy
-- Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745)
-- Intended to be compiled with [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
-
-# What's New in CUTLASS 2.1
-
-CUTLASS 2.1 is a minor update to CUTLASS adding:
-
-- [Planar complex GEMM kernels](/examples/10_planar_complex/planar_complex.cu) targeting Volta and Turing Tensor Cores
-- BLAS-style API to launch kernels compiled into the [CUTLASS Library](/media/docs/quickstart.md#cutlass-library)
-
-# What's New in CUTLASS 2.0
-
-CUTLASS 2.0 is a substantial refactoring from the previous version, intended to offer:
-
-- Better performance over 1.x, particularly for kernels targeting Turing Tensor Cores
-- Robust and durable templates that reliably span the design space
-- Encapsulated functionality that may be reusable in other contexts
-
-**See the [CHANGELOG](CHANGELOG.md) for more details.**
+# What's New in CUTLASS 2.8
+CUTLASS 2.8 is an update to CUTLASS adding:
+- [TF32x3:](/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm) emulated single-precision using Tensor Cores; 45+ TFLOPs on NVIDIA A100
+- [Mainloop fusion for Convolution:](/examples/25_ampere_fprop_mainloop_fusion) convolution with fused per-channel bias-add
+- [Grouped GEMM:](/examples/24_gemm_grouped) similar to batched GEMM, but with a distinct problem size per group
+- [Implicit GEMM Convolution fusion](/examples/13_two_tensor_op_fusion/) supports staging the first convolution's output accumulator in shared memory on Turing.
+- Optimal performance using [CUDA 11.5](https://developer.nvidia.com/cuda-downloads)
+- CUTLASS plans to **deprecate** the following platforms in the future. Let us know if this affects your use case.
+  - Maxwell and Pascal GPU architectures
+  - Ubuntu 16.04
+  - CUDA 10.2
+- Updates and bugfixes from the community (thanks!)
+
+**See the [CHANGELOG](CHANGELOG.md) for a detailed listing of releases and updates.**
 
 # Performance
 
@@ -120,38 +63,35 @@ using CUDA 11.0 Toolkit. Tensor Core operations are implemented using CUDA's
 # Compatibility
 
 CUTLASS requires a C++11 host compiler and
-performs best when built with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit).
-It is also compatible with CUDA 10.2, CUDA 11.0, CUDA 11.1, CUDA 11.2, and CUDA 11.3.
+performs best when built with the [CUDA 11.5 Toolkit](https://developer.nvidia.com/cuda-toolkit).
+It is also compatible with CUDA 11.0, CUDA 11.1, CUDA 11.2, CUDA 11.3, and CUDA 11.4.
 
 We have tested the following environments.
 
 |**Operating System** | **Compiler** |
 |-----------------|----------|
 | Windows 10 | Microsoft Visual Studio 2015|
 | | Microsoft Visual Studio 2017|
-| Ubuntu 16.04 | GCC 5.4.0 |
 | Ubuntu 18.04 | GCC 7.5.0 |
-| Ubuntu 20.04 | GCC 10.2.0 |
+| Ubuntu 20.04 | GCC 10.3.0 |
 
 Additionally, CUTLASS may be built with clang.
 See [these instructions](media/docs/quickstart.md#clang) for more details.
 
 CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on
-any Maxwell-, Pascal-, Volta-, Turing-, or NVIDIA Ampere-architecture NVIDIA GPU.
+any Volta-, Turing-, or NVIDIA Ampere-architecture NVIDIA GPU.
 
-For all GPUs, we recommend compiling with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
+For all GPUs, we recommend compiling with the [**CUDA 11.5 Toolkit**](https://developer.nvidia.com/cuda-toolkit)
 for best performance.
 
 |**GPU**|**CUDA Compute Capability**|**Minimum CUDA Toolkit**|**CUDA Toolkit Enabling Native Tensor Cores**|
 |---|---|---|---|
-|NVIDIA Tesla P100|6.0|9.2| |
-|NVIDIA GeForce 1080|6.1|9.2| |
-|NVIDIA TitanXP|6.1|9.2| |
 |NVIDIA Tesla V100|7.0|9.2|10.1|
 |NVIDIA TitanV|7.0|9.2|10.1|
 |NVIDIA GeForce RTX 2080 TI, 2080, 2070|7.5|10.0|10.2|
 |NVIDIA Tesla T4|7.5|10.0|10.2|
 |NVIDIA A100|8.0|11.0|11.0|
+|NVIDIA A10 |8.6|11.1|11.1|
 |NVIDIA GeForce 3090|8.6|11.1|11.1|
 
 # Documentation
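
As a reader's note on the **Grouped GEMM** item in the What's New list above: unlike batched GEMM, where every batch shares one (M, N, K), a grouped GEMM runs many independent GEMMs with distinct shapes from a single launch. The plain C++ reference below sketches only those semantics; the `GemmProblem` struct is hypothetical and is not the `cutlass::gemm::device::GemmGrouped` interface, which (roughly speaking) takes arrays of per-group problem sizes, operand pointers, and leading dimensions.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical per-group descriptor, for illustration only.
struct GemmProblem {
  int m, n, k;     // this group's problem size
  const float *A;  // m x k, row-major
  const float *B;  // k x n, row-major
  float *C;        // m x n, row-major
};

// Reference semantics of a grouped GEMM: each group has its own shape,
// in contrast to batched GEMM where all batches share one (m, n, k).
void grouped_gemm_reference(const std::vector<GemmProblem> &problems) {
  for (const GemmProblem &p : problems) {
    for (int i = 0; i < p.m; ++i) {
      for (int j = 0; j < p.n; ++j) {
        float acc = 0.0f;
        for (int kk = 0; kk < p.k; ++kk) {
          acc += p.A[i * p.k + kk] * p.B[kk * p.n + j];
        }
        p.C[i * p.n + j] = acc;
      }
    }
  }
}

int main() {
  // Two groups with different problem sizes.
  std::vector<float> A0(2 * 3, 1.0f), B0(3 * 4, 1.0f), C0(2 * 4, 0.0f);  // M=2, N=4, K=3
  std::vector<float> A1(5 * 2, 2.0f), B1(2 * 3, 0.5f), C1(5 * 3, 0.0f);  // M=5, N=3, K=2
  std::vector<GemmProblem> problems = {
      {2, 4, 3, A0.data(), B0.data(), C0.data()},
      {5, 3, 2, A1.data(), B1.data(), C1.data()},
  };
  grouped_gemm_reference(problems);
  std::printf("group 0: C[0][0] = %f\n", C0[0]);  // 3.0
  std::printf("group 1: C[0][0] = %f\n", C1[0]);  // 2.0
  return 0;
}
```

Roughly speaking, the CUTLASS device kernel schedules the threadblock tiles of all groups onto a fixed pool of threadblocks, which is why the SDK example in `examples/24_gemm_grouped` benchmarks it against batched strided GEMM.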
