Releases: ROCm/hipBLASLt
hipBLASLt 1.2.0 for ROCm 7.2.0
Added
- Support for the 'BF16' data type for gfx90a.
hipblaslt 1.1.0 for ROCm 7.1.1
hipBLASLt code for ROCm 7.1.1 did not change. The library was rebuilt for the updated ROCm 7.1.1 stack.
hipBLASLt 1.1.0 for ROCm 7.1.0
Added
- Fused Clamp GEMM for HIPBLASLT_EPILOGUE_CLAMP_EXT and HIPBLASLT_EPILOGUE_CLAMP_BIAS_EXT. This feature requires the minimum (HIPBLASLT_MATMUL_DESC_EPILOGUE_ACT_ARG0_EXT) and maximum (HIPBLASLT_MATMUL_DESC_EPILOGUE_ACT_ARG1_EXT) to be set.
- Support for ReLU/Clamp activation functions with auxiliary output for the f16 and bf16 data types for gfx942 to capture intermediate results. This feature is enabled for HIPBLASLT_EPILOGUE_RELU_AUX, HIPBLASLT_EPILOGUE_RELU_AUX_BIAS, HIPBLASLT_EPILOGUE_CLAMP_AUX_EXT, and HIPBLASLT_EPILOGUE_CLAMP_AUX_BIAS_EXT.
- Support for HIPBLAS_COMPUTE_32F_FAST_16BF for FP32 data type for gfx950 only.
- Added the CPP extension APIs setMaxWorkspaceBytes and getMaxWorkspaceBytes.
- Added the ability to print logs (using HIPBLASLT_LOG_MASK=32) for Grouped GEMM.
- Support for swizzleA by using the hipblaslt-ext cpp API.
- Support for hipBLASLt extop for gfx11xx and gfx12xx.
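The fused Clamp epilogue listed above can be read as a clamp applied to the GEMM result before it is written out. The following is a CPU-side reference sketch in plain Python, not hipBLASLt code. The function name clamp_gemm_reference is hypothetical, alpha = 1 and beta = 0 are assumed, and the bias is assumed to hold one value per output row. The lo and hi arguments stand in for the values supplied through HIPBLASLT_MATMUL_DESC_EPILOGUE_ACT_ARG0_EXT and HIPBLASLT_MATMUL_DESC_EPILOGUE_ACT_ARG1_EXT.

```python
def clamp_gemm_reference(a, b, lo, hi, bias=None):
    """Reference for D = clamp(A @ B (+ bias), lo, hi) on row-major nested lists.

    Assumptions (not taken from the release notes): alpha = 1, beta = 0,
    and bias, when given, has one entry per row of the output.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    if bias is not None and len(bias) != m:
        raise ValueError("bias must have one entry per output row")

    d = []
    for i in range(m):
        out_row = []
        for j in range(n):
            # Plain dot product of row i of A with column j of B.
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            if bias is not None:
                acc += bias[i]
            # The clamp step: limit the value to the range [lo, hi].
            out_row.append(min(max(acc, lo), hi))
        d.append(out_row)
    return d
```

A reference like this may be useful for checking GPU output on small problems. Passing bias=None corresponds to the variant without bias, and passing a bias list to the variant with bias.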
Changed
- hipblasLtMatmul() now returns an error when the workspace size is insufficient, rather than causing a segmentation fault.
Resolved issues
- Fixed incorrect results when using ldd and ldc with some solutions.
hipblaslt 1.0.0 for ROCm 7.0.2
hipBLASLt code for ROCm 7.0.2 did not change. The library was rebuilt for the updated ROCm 7.0.2 stack.
hipblaslt 1.0.0 for ROCm 7.0.1
hipBLASLt code for ROCm 7.0.1 did not change. The library was rebuilt for the updated ROCm 7.0.1 stack.
hipBLASLt 1.0.0 for ROCm 7.0.0
Added
- Stream-K GEMM support has been enabled for the FP32, FP16, BF16, FP8, and BF8 data types on the MI300A APU. To activate this feature, set the TENSILE_SOLUTION_SELECTION_METHOD environment variable to 2, for example, export TENSILE_SOLUTION_SELECTION_METHOD=2.
- Fused Swish/SiLU GEMM in hipBLASLt (enabled by HIPBLASLT_EPILOGUE_SWISH_EXT and HIPBLASLT_EPILOGUE_SWISH_BIAS_EXT)
- Added support for HIPBLASLT_EPILOGUE_GELU_AUX_BIAS for gfx942
- Added HIPBLASLT_TUNING_USER_MAX_WORKSPACE to constrain max workspace size for user offline tuning
- Added HIPBLASLT_ORDER_COL16_4R16 and HIPBLASLT_ORDER_COL16_4R8 to hipblasLtOrder_t to support FP16/BF16 swizzle GEMM and FP8/BF8 swizzle GEMM respectively.
- Added TF32 emulation on gfx950
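The Stream-K item above is switched on through an environment variable, so it can be scoped to a single process instead of being exported for the whole shell. A minimal sketch of that, assuming the workload is launched from Python: the helper name run_with_stream_k is hypothetical, and the command passed in is whatever program links against hipBLASLt.

```python
import os
import subprocess


def run_with_stream_k(cmd):
    """Run cmd with TENSILE_SOLUTION_SELECTION_METHOD=2 set for that child only.

    The parent environment is copied rather than modified, so other
    processes started later are unaffected.
    """
    env = dict(os.environ, TENSILE_SOLUTION_SELECTION_METHOD="2")
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
```

For example, run_with_stream_k(["./my_gemm_app"]) would start a hypothetical my_gemm_app binary with the variable set. The returned object carries the child's stdout and stderr, and check=True raises if the child exits with a non-zero status.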
Changed
- HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER_VEC_EXT and HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER_VEC_EXT are removed. Use the HIPBLASLT_MATMUL_DESC_A_SCALE_MODE and HIPBLASLT_MATMUL_DESC_B_SCALE_MODE attributes to set scalar (HIPBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F) or vector (HIPBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F).
- The non-V2 APIs (GemmPreference, GemmProblemType, GemmEpilogue, GemmTuning, GemmInputs) in the Cpp header are now the same as the V2 APIs (GemmPreferenceV2, GemmProblemTypeV2, GemmEpilogueV2, GemmTuningV2, GemmInputsV2). The original non-V2 APIs are removed.
- hipblasltExtAMaxWithScale API is removed.
Optimized
- Improved performance for 8-bit (FP8/BF8/I8) NN/NT cases by adding s_delay_alu to reduce stalls from dependent ALU operations on gfx12+.
- Improved performance for 8-bit and 16-bit (FP16/BF16) TN cases by enabling software dependency check (Expert Scheduling Mode) under certain restrictions to reduce redundant hardware dependency checks on gfx12+.
- Improved performance for 8-bit, 16-bit, and 32-bit batched GEMM with a better heuristic search algorithm for gfx942.
Upcoming changes
- V2 APIs (GemmPreferenceV2, GemmProblemTypeV2, GemmEpilogueV2, GemmTuningV2, GemmInputsV2) are deprecated.
hipBLASLt 0.12.1 for ROCm 6.4.4
hipBLASLt code for ROCm 6.4.4 did not change. The library was rebuilt for the updated ROCm 6.4.4 stack.
hipBLASLt 0.12.1 for ROCm 6.4.3
hipBLASLt code for ROCm 6.4.3 did not change. The library was rebuilt for the updated ROCm 6.4.3 stack.
hipBLASLt 0.12.1 for ROCm 6.4.2
Added
- Support for gfx1151
hipBLASLt 0.12.1 for ROCm 6.4.1
Resolved issues
- Fixed an accuracy issue that occurred for some solutions using an FP32 or TF32 data type with a TT transpose.