Release v3.5.0
Release notes
IREE-Turbine Core
- Enabled stream sharing between IREE and iree-turbine #883: Leveraged IREE's external HIP stream functionality to reduce launch latency when interoperating between PyTorch and IREE, improving performance for mixed-framework workflows.
- Introduced initial async execution support in the runtime #719: Added support for asynchronous task execution in the IREE runtime, enabling better parallelism and future integration with asynchronous frontends and runtimes.
IREE-Wave
New Operators & Kernels
- Added soft-sign kernel to replace tanh approximation kernel #829
- Added GEMM example without transpose B #881
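For readers unfamiliar with the soft-sign function mentioned in the list above, the sketch below shows its standard mathematical definition, x / (1 + |x|), next to tanh. This is plain Python for illustration only and does not use the Wave API; the function name `softsign` is ours, and the kernel added in #829 may apply scaling or other details not described in these notes.

```python
import math


def softsign(x: float) -> float:
    """Standard soft-sign: x / (1 + |x|). Bounded in (-1, 1) like tanh."""
    return x / (1.0 + abs(x))


# Compare against tanh at a few sample points. Both are odd, monotonic
# functions that saturate toward +/-1, but soft-sign needs no exponential.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  softsign={softsign(x):+.4f}  tanh={math.tanh(x):+.4f}")
```

Soft-sign approaches its asymptotes more slowly than tanh, so the two are not numerically interchangeable; whether that matters depends on how the kernel is used.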
Documentation
- Added documentation on several optimization passes #833, #835
- Added Jupyter notebook with GEMM example #784
- Added tutorial to implement fused softmax using wave #891
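As background for the fused softmax tutorial listed above, here is a minimal reference for what a row-wise softmax computes. It is a plain-Python sketch, not the tutorial's code and not the Wave API; the name `softmax_row` is hypothetical. A fused kernel typically performs the max, exponent, sum, and divide steps in a single pass over on-chip data instead of as separate kernels.

```python
import math


def softmax_row(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row."""
    # Subtracting the row maximum keeps exp() from overflowing
    # without changing the result.
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_row([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

A reference like this is mainly useful for checking a kernel's output within a tolerance.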
Kernel Improvements
- Various bug fixes and performance improvements for decode attention and speculative decoding kernels #838, #839, #854, #858, #878, #890, #888
Compiler Optimizations
- Improved GEMM performance with ping-pong optimizations #846, #832
- Better shared memory allocation management #822, #831
General Improvements
New Contributors
- @AWoloszyn made their first contribution in #719
- @fabianmcg made their first contribution in #838
- @SourishW made their first contribution in #866
- @badgerbroch made their first contribution in #867
- @bodhisaha made their first contribution in #869
- @efric made their first contribution in #881
- @Groverkss made their first contribution in #887
Full changelog
List of changes
- First pass of async execution in the runtime. by @AWoloszyn in #719
- [Wave] Add scaling to reference kernel in cache test by @raikonenfnu in #781
- [Wave] Implement block wide reduction by @raikonenfnu in #765
- [WAVE] Support wave-wise cumsum scan op::more than one elem per thread by @xintin in #753
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #782
- Enable benchmarking in performance ci by @harsh-nod in #164
- [TKW] Paged decode MHA using GenericDot by @Hardcode84 in #779
- [TKW] Update paged decode API by @Hardcode84 in #786
- [TKW] Silence expansion warning by @Hardcode84 in #788
- [Wave] Fix broken lit tests by @harsh-nod in #789
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #785
- [BOO] enable tuning specs for boo_convs by @zjgarvey in #783
- Bump version to 3.5.0 after 3.4.0 release. by @ScottTodd in #793
- [BOO] Add a timeout for out-of-process compilation by @zjgarvey in #796
- [Wave] add excalidraw visualization of wave compilation pipeline by @martin-luecke in #777
- [Wave] support for partial unrolling of iterate by @martin-luecke in #776
- [BOO] Include tuning specs in pip package by @zjgarvey in #800
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #799
- Bump actions/create-github-app-token from 2.0.2 to 2.0.6 in the github-actions group by @dependabot in #795
- Add support for additional type interop by @rsuderman in #803
- [TKW] Avoid cache folder race conditions when running tests in parallel by @Hardcode84 in #808
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #806
- [TKW] Paged decode: move scaling into the kernel by @Hardcode84 in #807
- [runtime] Enable async for CustomOp eager execution by @zjgarvey in #802
- [boo_driver] Align -t option name with miopen driver by @rkayaith in #810
- [Wave] Add jupyter notebook showing how to do a GEMM by @harsh-nod in #784
- [TKW] Paged Decode dynamic symbols by @Hardcode84 in #809
- [TKW] Tweak Paged Decode kernel waves/tile sizes by @Hardcode84 in #794
- [Wave] Add mlir style printing by @harsh-nod in #775
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #812
- [BOO] Fix a bug in boo_driver.py resulting from arg rename by @zjgarvey in #818
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #814
- [boo] Add more tuning configurations by @rkayaith in #819
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #823
- [BOO] remove generate.py script by @zjgarvey in #824
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #825
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #826
- [Wave] Minimize shared allocs by @harsh-nod in #822
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #830
- [BOO] Enable bad kernel launch fix + kwarg filtered replacement by @zjgarvey in #828
- [Wave] Add additional check when minimizing allocs by @harsh-nod in #831
- [BOO] Add autocast functionality to boo_conv by @zjgarvey in #827
- [Wave] Add scheduling/synchronization related ops by @raikonenfnu in #832
- [Wave] Add documentation by @harsh-nod in #833
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #834
- [Wave] Add install requirement for mermaid diagrams by @harsh-nod in #835
- [TKW] BF16 support in paged decode kernel by @Hardcode84 in #836
- [TKW] Option to print pass timings by @Hardcode84 in #813
- [runtime] Rename dltensor capsule to avoid memory leak by @zjgarvey in #843
- [Wave] Generalize Barriers to work across iterations by @raikonenfnu in #839
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #842
- [TKW] Improve wave runtime by @Hardcode84 in #844
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #848
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #849
- [Wave] Two PP cluster GEMM by @raikonenfnu in #846
- [Wave] Add ability to compare against flash_attn_func by @harsh-nod in #798
- [BOO] Enable torch.compile for models with boo convs by @zjgarvey in #850
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #852
- [wave] Fix bug in Permute.transform_index by @fabianmcg in #838
- Pin IREE version to 3.5.0rc20250516 by @Hardcode84 in #853
- [TKW] Large shape and wave_runtime for paged decode test by @Hardcode84 in #855
- [Wave] Make APLP faster by @harsh-nod in #840
- [Wave] Fix failing jupyter test by @harsh-nod in #854
- [Wave] Fix IREE reference failing with wrong signature by @raikonenfnu in #858
- Use turbine.runtime for generate_iree_ref by @Hardcode84 in #861
- [boo] More tuning configurations by @rkayaith in #864
- [BOO] Add an env variable for toggling backward boo convolutions by @zjgarvey in #865
- [Wave] Introduce softsign kernel to replace tanh_approx by @yichiche in #829
- [Wave] Align variable names in speculative decode kernel by @xintin in #872
- [Wave] Added sin op by @SourishW in #866
- [Wave] Added atan2 operation by @badgerbroch in #867
- [Wave] added cos op by @bodhisaha in #869
- [TKW] Drop inplace flag by @Hardcode84 in #863
- Move scaled_dot_product_attention_bhsd under iree.turbine by @aviator19941 in #870
- [Wave] Implement C = A @ B GEMM example by @efric in #881
- [BOO] add a top-level readme showcasing basic usage by @zjgarvey in #882
- Update IREE deps by @Hardcode84 in #880
- [TKW] Add patterns to sink vector.extract/splat ops by @Hardcode84 in #851
- Bump IREE requirement pins to their latest versions. by @iree-pr-automator in #886
- Revert "[runtime] Rename dltensor capsule to avoid memory leak" by @Groverkss in #887
- [WAVE] Correcting predicts dim in the first speculative decode kernel by @xintin in #878
- [Wave] Speculative decode - default predict token to last token in vocab by @nithinsubbiah in #890
- Allow stream sharing between iree and iree-turbine. by @AWoloszyn in #883
- Update IREE version, remove dlpack hack and add tensor leak test by @Hardcode84 in #893
- [Wave] Hide APLP warning by @Hardcode84 in #898
- [BOO] Add Conv1d and Conv3d replacement capability by @zjgarvey in #885
- [BOO] don't permute None in pytorch convolution backward by @zjgarvey in #884
- [WAVE] Added Fused Softmax and Vector Addition Kernels/Tests Closes #874 by @bodhisaha in #891
- [WAVE] Update ci tk yaml to have updated rocm installation by @xintin in #902
- [WAVE] Update speculative decode first kernel's dims from 3D to 1D by @xintin in #888
- Add a python3.11 mi300 unit tests and type checking job by @zjgarvey in #903
- Reduce CPU overhead in lookup_device_from_torch by @zjgarvey in #905
Commit history: v3.4.0...v3.5.0