![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition")

- # CUTLASS 2.7
+ # CUTLASS 2.8

- _CUTLASS 2.7 - September 2021_
+ _CUTLASS 2.8 - November 2021_

CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-multiplication (GEMM) and related computations at all levels
@@ -34,77 +34,20 @@ See the [Quick Start Guide](/media/docs/quickstart.md) to get started quickly.
See the [functionality listing](/media/docs/functionality.md) for the list of operations
supported at each level of the execution model hierarchy.
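
These components compose into complete device-wide kernels. As a minimal sketch of the device-level API, in the spirit of the basic GEMM example and the Quick Start Guide (a single-precision, column-major GEMM on CUDA cores; the `run_sgemm` wrapper is illustrative, not part of the library):

```c++
#include "cutlass/gemm/device/gemm.h"

// Compute C = alpha * A * B + beta * C for column-major fp32 matrices.
// A, B, and C are device pointers. Returns zero on success.
// Hypothetical helper for illustration only.
int run_sgemm(int M, int N, int K,
              float alpha, float const *A, int lda,
              float const *B, int ldb,
              float beta, float *C, int ldc) {

  // Instantiate a device-wide GEMM: element types and layouts for A, B, C.
  using Gemm = cutlass::gemm::device::Gemm<
      float, cutlass::layout::ColumnMajor,   // ElementA, LayoutA
      float, cutlass::layout::ColumnMajor,   // ElementB, LayoutB
      float, cutlass::layout::ColumnMajor>;  // ElementC, LayoutC

  Gemm gemm_op;

  // Launch: problem size, tensor refs for A, B, C (source) and D (destination,
  // aliasing C here), and the linear-scaling epilogue {alpha, beta}.
  cutlass::Status status = gemm_op({{M, N, K},
                                    {A, lda},
                                    {B, ldb},
                                    {C, ldc},
                                    {C, ldc},
                                    {alpha, beta}});

  return status == cutlass::Status::kSuccess ? 0 : -1;
}
```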

- See the [CHANGELOG](CHANGELOG.md) for descriptions of recent updates.
-
- # What's New in CUTLASS 2.7
- CUTLASS 2.7 is a minor update to CUTLASS adding:
- - Mainloop fusion for GEMM: [summation over A or B](/examples/23_ampere_gemm_operand_reduction_fusion/ampere_gemm_operand_reduction_fusion.cu)
- - [Optimizations for strided DGRAD](/include/cutlass/conv/kernel/default_conv2d_dgrad.h)
- - [Half-precision GELU_taylor activation functions](/include/cutlass/epilogue/thread/activation.h#L196)
- - Tuning and bug fixes to [fused GEMM + GEMM example](/examples/13_two_tensor_op_fusion/)
- - Support for smaller than 128b aligned Convolutions: [see examples](test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f16_sm80.cu#L272)
- - Caching of results to accelerate Convolution [unit tests](test/unit/conv/device/cache_testbed_output.h)
- - Numerous updates from the community (thanks!)
-
- # What's New in CUTLASS 2.6
- CUTLASS 2.6 is a minor update to CUTLASS adding:
- - Fused [broadcast](test/unit/gemm/device/gemm_with_broadcast_f16n_f16n_f16n_tensorop_f32_sm75.cu) and [reductions](/test/unit/gemm/device/gemm_with_reduction_f16n_f16n_f16n_tensorop_f32_sm75.cu) in the epilogues of GEMM and Convolution
- - [Quaternion-valued GEMM](/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](/examples/22_quaternion_conv/quaternion_conv.cu) in single-precision
- - [New strided Dgrad](test/unit/conv/device/conv2d_strided_dgrad_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) implementation offers up to 4x performance improvements over the previous strided Dgrad
- - 64-bit strides for large tensor allocations
- - [General affine layouts](/examples/18_ampere_fp64_tensorop_affine2_gemm/ampere_fp64_tensorop_affine2_gemm.cu) for fp64 tensor core and SIMT GEMM
- - [Batched GEMV](/test/unit/gemm/device/gemv.cu) preview implementation
- - Enhanced functionality, boosted performance, and bug fixes in the epilogue.
- - Optimal performance when compiled with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
- - Adopted the new L2 prefetch feature in [PTX instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-4).
- - Enhanced Clang support; the combination of Clang 13 and CUDA 11.4 can build and run kernels on Pascal and Ampere architectures.
- - Numerous updates from the community (thanks!)
-
- # What's New in CUTLASS 2.5
- CUTLASS 2.5 is a minor update to CUTLASS adding:
- - [Tensor reductions](/test/unit/reduction/device/tensor_reduce_contiguous.cu)
- - [Optimizations for 3-D convolution](include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h)
- - [Fused Convolution+Convolution example](/examples/13_two_tensor_op_fusion/README.md)
-
- # What's New in CUTLASS 2.4
- CUTLASS 2.4 is a significant update to CUTLASS adding:
- - 1-D, 2-D, and 3-D convolution targeting Tensor and CUDA cores for NVIDIA Ampere, Turing, and Volta GPU architectures
- - CUTLASS profiler support for convolution
- - [Documentation](/media/docs/implicit_gemm_convolution.md) describing the Implicit GEMM Convolution algorithm and implementation
-
- # What's New in CUTLASS 2.3
-
- CUTLASS 2.3 is a minor update to CUTLASS adding:
- - GEMMs targeting structured [Sparse Tensor Cores](test/unit/gemm/device/gemm_f16n_f16n_f32t_tensor_op_f32_sparse_sm80.cu) in NVIDIA Ampere Architecture GPUs
- - Fast SGEMM kernels targeting GeForce RTX 30-series CUDA Cores
- - Intended to be compiled with the [CUDA 11.1 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
-
- # What's New in CUTLASS 2.2
-
- CUTLASS 2.2 is a significant update to CUTLASS adding:
-
- - Coverage of [NVIDIA Ampere Architecture features](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)
- - Tensor Core-accelerated GEMMs targeting Tensor Float 32, BFloat16, and double-precision data types
- - Deep software pipelines using asynchronous copy
- - Described in [GTC 2020 Webinar (SR 21745)](https://developer.nvidia.com/gtc/2020/video/s21745)
- - Intended to be compiled with the [CUDA 11 Toolkit](https://developer.nvidia.com/cuda-toolkit) or later
-
- # What's New in CUTLASS 2.1
-
- CUTLASS 2.1 is a minor update to CUTLASS adding:
-
- - [Planar complex GEMM kernels](/examples/10_planar_complex/planar_complex.cu) targeting Volta and Turing Tensor Cores
- - BLAS-style API to launch kernels compiled into the [CUTLASS Library](/media/docs/quickstart.md#cutlass-library)
-
- # What's New in CUTLASS 2.0
-
- CUTLASS 2.0 is a substantial refactoring from the previous version, intended to offer:
-
- - Better performance over 1.x, particularly for kernels targeting Turing Tensor Cores
- - Robust and durable templates that reliably span the design space
- - Encapsulated functionality that may be reusable in other contexts
-
- **See the [CHANGELOG](CHANGELOG.md) for more details.**
+ # What's New in CUTLASS 2.8
+ CUTLASS 2.8 is an update to CUTLASS adding:
+ - [TF32x3:](/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm) emulated single-precision using Tensor Cores; 45+ TFLOPs on NVIDIA A100 (see the scalar sketch after this list)
+ - [Mainloop fusion for Convolution:](/examples/25_ampere_fprop_mainloop_fusion) convolution with fused per-channel bias-add
+ - [Grouped GEMM:](/examples/24_gemm_grouped) similar to batched GEMM with a distinct problem size per group
+ - [Implicit GEMM Convolution fusion](/examples/13_two_tensor_op_fusion/) supports staging the first convolution's output accumulator in shared memory on Turing.
+ - Optimal performance using [CUDA 11.5](https://developer.nvidia.com/cuda-downloads)
+ - CUTLASS plans to **deprecate** the following platforms in the future. Let us know if this affects your use case.
+   - Maxwell and Pascal GPU architectures
+   - Ubuntu 16.04
+   - CUDA 10.2
+ - Updates and bugfixes from the community (thanks!)
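
The TF32x3 entry above is, at heart, an error-compensation scheme. In rough scalar form (a sketch of the idea only, not the Tensor Core kernels of example 27; `to_tf32` and `mac_3xtf32` are illustrative names): each fp32 operand splits into a TF32-representable "big" part plus a TF32 residual, and three TF32-precision products are accumulated in fp32.

```c++
#include <cstring>

// Truncate an fp32 value to TF32 precision (10 explicit mantissa bits) by
// masking off the 13 low mantissa bits. Hardware conversion rounds to
// nearest; plain truncation is used here only to keep the sketch short.
inline float to_tf32(float x) {
  unsigned bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xffffe000u;  // keep sign, exponent, and the top 10 mantissa bits
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// One multiply-accumulate of the 3xTF32 scheme: three TF32 products
// accumulated in fp32. The small*small term is dropped, which is what
// makes it three products rather than a full four-product expansion.
inline float mac_3xtf32(float a, float b, float acc) {
  float a_big = to_tf32(a), a_small = to_tf32(a - a_big);
  float b_big = to_tf32(b), b_small = to_tf32(b - b_big);
  acc += a_small * b_big;   // correction terms first
  acc += a_big * b_small;
  acc += a_big * b_big;     // dominant term
  return acc;
}
```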
+
+ **See the [CHANGELOG](CHANGELOG.md) for a detailed listing of releases and updates.**

# Performance

@@ -120,38 +63,35 @@ using CUDA 11.0 Toolkit. Tensor Core operations are implemented using CUDA's
# Compatibility

CUTLASS requires a C++11 host compiler and
- performs best when built with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit).
- It is also compatible with CUDA 10.2, CUDA 11.0, CUDA 11.1, CUDA 11.2, and CUDA 11.3.
+ performs best when built with the [CUDA 11.5 Toolkit](https://developer.nvidia.com/cuda-toolkit).
+ It is also compatible with CUDA 11.0, CUDA 11.1, CUDA 11.2, CUDA 11.3, and CUDA 11.4.

We have tested the following environments.

| **Operating System** | **Compiler** |
|-----------------|----------|
| Windows 10 | Microsoft Visual Studio 2015|
| | Microsoft Visual Studio 2017|
- | Ubuntu 16.04 | GCC 5.4.0 |
| Ubuntu 18.04 | GCC 7.5.0 |
- | Ubuntu 20.04 | GCC 10.2.0 |
+ | Ubuntu 20.04 | GCC 10.3.0 |

Additionally, CUTLASS may be built with clang.
See [these instructions](media/docs/quickstart.md#clang) for more details.

CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be efficient on
- any Maxwell-, Pascal-, Volta-, Turing-, or NVIDIA Ampere-architecture NVIDIA GPU.
+ any Volta-, Turing-, or NVIDIA Ampere-architecture NVIDIA GPU.

- For all GPUs, we recommend compiling with the [CUDA 11.4 Toolkit](https://developer.nvidia.com/cuda-toolkit)
+ For all GPUs, we recommend compiling with the [**CUDA 11.5 Toolkit**](https://developer.nvidia.com/cuda-toolkit)
for best performance.

| **GPU** | **CUDA Compute Capability** | **Minimum CUDA Toolkit** | **CUDA Toolkit Enabling Native Tensor Cores** |
|---|---|---|---|
- | NVIDIA Tesla P100| 6.0| 9.2| |
- | NVIDIA GeForce 1080| 6.1| 9.2| |
- | NVIDIA TitanXP| 6.1| 9.2| |
| NVIDIA Tesla V100| 7.0| 9.2| 10.1|
| NVIDIA TitanV| 7.0| 9.2| 10.1|
| NVIDIA GeForce RTX 2080 TI, 2080, 2070| 7.5| 10.0| 10.2|
| NVIDIA Tesla T4| 7.5| 10.0| 10.2|
| NVIDIA A100| 8.0| 11.0| 11.0|
+ | NVIDIA A10 | 8.6| 11.1| 11.1|
| NVIDIA GeForce 3090| 8.6| 11.1| 11.1|

# Documentation