IREE Release v3.9.0
1. Compiler
1.1 Data Tiling & GEMM Improvements
- `iree-opt-data-tiling` promoted to umbrella flag with suggested config. (#22295)
- Default path switched to DispatchCreation phase; use `--iree-global-opt-data-tiling` for legacy behavior. See docs. (#21441)
- Implemented `subgroups_k` in data-tiled MMA layouts. (#22519)
- Added per-operand M/N/K interleaving control. (#22626)
- Added layout transfer support in MaterializeEncoding. (#22582)
- Strict `inner_tiled` verifier with `distributed`/`opaque` params. (#22369)
- Unified encoding materialization passes. (#22472)
- Encoding op fusion with multi-use producers at `-O3`. (#22444)
- Intentional padding for non-K-major layouts (~2.7% GEMM improvement). (#22486)
- Better heuristics for extremely large GEMMs. (#22636)
- Refactored narrow matmul tile size selection. (#22177)
- Split reduction for large-K GEMMs. (#22357)
- Updated ukernel data layout. (#22350)
- Fixed large f16 ukernel bounds. (#22481)
- Added LLaMA 8B FP8 benchmark tests on gfx942. (#22387)
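The flag changes above can be exercised from the command line. A minimal sketch, assuming a CPU target (the input file name is illustrative; exact flag spellings may vary by release):

```shell
# New default: data tiling runs starting from the DispatchCreation phase,
# controlled by the promoted umbrella option.
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-opt-data-tiling=true \
  -o model.vmfb

# Legacy behavior: run data tiling in the GlobalOptimization phase instead.
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-global-opt-data-tiling \
  -o model.vmfb
```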
1.2 Dispatch Creation
- Added split-reduction support for arg_compare, preventing shared-memory overflow and fixing LLaMA 8B FP16 compilation failures. (#22466)
- Added aggressive multi-use fusion for encoding ops (enabled at `-O3`), significantly improving fusion patterns seen in SDXL. (#22444)
- Enabled consumer fusion for GPUApplyTilingLevel on scf.forall loops, enhancing padding-level fusion. (#22522)
1.3 GPU Codegen
- Added barrier insertion before first shared-memory write for AMD GPUs, fixing non-deterministic strided conv results (13% -> 0% failure rate). (#22669)
- Rewrote loop prefetcher with a stage-based backward slicing model for better maintainability (no functional change). (#22605)
- Implemented vector size inference for `UKernelGenericOp`, enabling downstream ops (e.g., unpack) to correctly vectorize instead of falling back to scalar code. (#22440)
- Improved f16 medium ukernel bounds on ROCm for better matmul throughput. (#22393)
- Added mmt4d ukernel support for RISC-V zvfh/zvfhmin, enabling f16xf16->f16/f32 kernels with runtime hardware probing. (#22231)
- Generalized GPU lowering for linalg.reduce ops, converting illegal i1 reductions to generic form to unblock split-reduction pipelines. (#22490)
1.4 Others
- Interfaces, Layouts & IR Improvements (#22467, #22390, #22368)
- Various correctness and quality improvements across codegen, layout propagation, and GPU lowering. (#22636, #22490, #22466, #22669, #22522, #22605, #22486, #22519, #22444, #22393, #22231, #22467, #22390, #22368, #22440, #22598)
- Exposed C and Python bindings for IGEMM convolution details (#22598)
2. Runtime
- Implemented the first end-to-end support for external transients, enabling early but functional handling of control flow and cross-dispatch transient values.
- Current limitations: no function calls and no data-dependent values; simple control flow is supported and aligns with future dispatch specialization work. (#22625)
- Added timeline-aware async execution across module boundaries, introducing foundational interfaces for precise cross-module scheduling. (#22381)
- Improved support for `iree_codegen.extract_strided_metadata`, ensuring information-preserving lowering:
  - Now normalizes into `iree_codegen` earlier, avoiding loss of stride/offset/alignment information that occurred when prematurely converting to `memref`. (#22606)
- Added new Stream canonicalizations and improved `RefineUsage` to reduce unnecessary copies and fix correctness bugs. (#22610)
- Added `--gen-dialect-json` to `iree-tblgen`, generating JSON databases of dialect definitions using tablegen metadata. (#22603)
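The new JSON output is invoked like any other tablegen backend. A hypothetical invocation (the `.td` path and include directories are placeholders):

```shell
# Emit a JSON database of a dialect's definitions from its tablegen sources
# (paths below are illustrative, not real repository layout).
iree-tblgen --gen-dialect-json \
  -I /path/to/iree/compiler/src \
  -I /path/to/llvm-project/mlir/include \
  /path/to/MyDialectOps.td \
  -o my_dialect.json
```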
Change Log
Git History
What's Changed
- [LinalgExt] Don't vectorize map_scatter in non-contiguous sub-byte access by @jtuyls in #22242
- [python] Set up binding for preprocessing transform ops by @bangtianliu in #22227
- Re-enable lds_barrier on RDNA4 by @krzysz00 in #21922
- [CI][iree-test-suites] Try to make torch_models benchmarks more stable by @Groverkss in #22271
- Reapply "[GPU] Allow multi result and indexing compute generic ops in TileAndFuse pipeline (#22205)" by @nirvedhmeshram in #22223
- Reapply "[Dispatch Creation] Rework dispatch formation logic (#21854)" by @IanWood1 in #22065
- [debugging][gpu] Add --iree-hip-emit-debug-info flag by @willghatch in #22216
- [Codegen] Update the td spec using the contraction matcher op by @bangtianliu in #22249
- [Codegen] Update the td spec using the attention matcher op by @bangtianliu in #22266
- Revert "Re-enable lds_barrier on RDNA4" by @kuhar in #22278
- Integrate llvm/llvm-project@b92483c by @newling in #22274
- Support skinny scaled matmul in kernel config by @jtuyls in #22042
- Use llvm wrappers for accumulate. NFC. by @kuhar in #22279
- [NFC][GPU] Move reduction configuration to gpu utilities by @Groverkss in #22286
- [GPU] Move convolution check out of unrelated function by @Groverkss in #22287
- [GPU] Support iree_tensor_ext.dispatch.tensor.store for broadcast producer by @nirvedhmeshram in #22291
- [Docs] Read from first line of `rocm_agent_enumerator` output by @sjain-stanford in #22283
- [Codegen] Adding an optional `dma_sizes` field in GPU attributes by @lialan in #22281
- Bump LLVM to llvm/llvm-project@5a636c6 by @MaheshRavishankar in #22290
- Let MLIR ukernels provide their matching and data-tiled-layout info. by @bjacob in #22254
- [LLVMCPU] Propagate target features and CPU name to individual LLVMFuncOp by @mshockwave in #22036
- [CI][TorchModels] Update flags used for LLaMa 8b f8/fp16. by @MaheshRavishankar in #22297
- Promote iree-opt-data-tiling to pipeline options. by @hanhanW in #22295
- Bump version to 3.9.0 after 3.8.0 release. by @sa-faizal in #22308
- [GPU] Enabling Gather-like ops to go through GPUTileAndFuse pipeline by @Abhishek-Varma in #22251
- [python] Set up python binding for matcher convolution and attention op by @bangtianliu in #22311
- [DT][NFC] Trim IRs in encoding materialization tests for GPU and RISCV backends. by @hanhanW in #22313
- [GPU] Update K Tile size picking for multiple K dims by @Muzammiluddin-Syed-ECE in #22310
- [codegen][gpu] Make transfer_write conditional when not fully distributed by @newling in #22198
- [Stream] Replicate globals per affinity before Stream conversion. by @hanhanW in #22117
- Fix non-deterministic hoisting by @IanWood1 in #22319
- Drop revert of llvm/llvm-project#159083 by @MaheshRavishankar in #22298
- [Codegen] Allow pre-padding other dims of a conv except the input channel by @yzhang93 in #22296
- [CI][Torch] Update dispatch counts after non-determinism fix by @Groverkss in #22333
- [Codegen] Use llvm accumulate wrappers. NFC. by @kuhar in #22331
- [Codegen] Tile memref.copy when vectorizing for dynamic dims by @jtuyls in #22168
- Reapply "Re-enable lds_barrier on RDNA4" (#22278) by @krzysz00 in #22326
- [Codegen] Handle multiple dyn dims in tensor load pattern by @IanWood1 in #22328
- [DT][NFC] Add test files for materializing IREE ops with encodings. by @hanhanW in #22322
- [DT][NFC] Trim IRs for materialize_encoding_aarch64.mlir test. by @hanhanW in #22327
- [DT][NFC] Trim unnecessary IRs for materialize_encoding_vmvx.mlir test. by @hanhanW in #22330
- [DT][NFC] Trim unnecessary IRs for materialize_encoding_x86_64.mlir test. by @hanhanW in #22332
- [DispatchCreation] Add split reduction for weight backward convs by @yzhang93 in #22275
- [Integrate] Bump LLVM to llvm/llvm-project@893b1d4 by @MaheshRavishankar in #22334
- [DT][NFCI] Implement getOffsetsSizesStrides for GPU padding resolver. by @hanhanW in #22339
- Remove `moveCrossThreadOutermost` by @bjacob in #22284
- [Global Opt] Don't propagate edge reshapes by @IanWood1 in #22320
- [DT][NFC] Collapse MaterializeScaledContractionOp into generic pattern. by @hanhanW in #22340
- [Codegen][Tuner] Add root_op for matvec and reduction along VectorDistribute pipeline by @bangtianliu in #22348
- Catch MLIR ukernel parsing errors by @bjacob in #22353
- [ROCM][DT] Update ukernel data layout by @Yu-Zhewen in #22350
- [GlobalOpt] Fix transpose propagation for index-semantic ops by interchanging indexing maps by @ziliangzl in #22248
- [build flags] 2nd prep to enable more warnings in compile flags (#21996) by @schuermans-roofline in #22273
- [LinalgExt] Fix scatter unique_indices when dropping unit dims by @IanWood1 in #22362
- [DT][NFC] Refactor linalg.fill/generic op lowering to interface implementation. by @hanhanW in #22343
- [DT] Mark partial slices unsupported in padding encoding resolver. by @hanhanW in #22359
- [DT] Implement LayoutMaterializerAttr for identity resolver. by @hanhanW in #22337
- [Codegen] Canonicalize loops and subviews after copy vectorization by @jtuyls in #22344
- Bump LLVM to llvm/llvm-project@c8cf393 by @Muzammiluddin-Syed-ECE in #22354
- [DT] Support partial load/store for identity encoding resolver. by @hanhanW in #22360
- [Codegen] Remove batch size in target intrinsic checks by @jtuyls in #22289
- [NFC] Wrap directory structure within a block. by @hanhanW in #22373
- [DT] Support partial load/store for GPU padding encoding resolver. by @hanhanW in #22372
- [AMDGPU] Cache_swizzle stride for fat raw buffer loads should be in bytes by @sebvince in #22314
- [LLVMCPU] Refactor multi lowering config propagation and setting by @Yu-Zhewen in #22126
- [build flags] enable more warnings in compile flags (#21996) by @schuermans-roofline in #22240
- Bump LLVM to llvm/llvm-project@683e2bf by @Muzammiluddin-Syed-ECE in #22366
- [NFC][ROCM] Simplify ukernel encoding materialization tests by @jtuyls in #22376
- [StableHLO] Fix reshape canonicalization for dense_resource constants. by @weidel-p in #22365
- [CI][TorchModels] Add SDXL int8 model to Torch Models CI. by @MaheshRavishankar in #22364
- [VectorDistribute] Fix transfer_write broadcasting guard by @Groverkss in #22352
- [NFC] Merge common type constraints by @krzysz00 in #22358
- [Encoding] fix dependency issues with @3815582bbd by @Muzammiluddin-Syed-ECE in #22384
- [Stream] Deduplicate the dispatch workloads by @jtuyls in #22187
- [DispatchCreation] Set split reduction size for GEMM with large k dim by @yzhang93 in #22357
- Adding markAllAnalysesPreserved to verification passes. by @benvanik in #22380
- Rewriting CombineInitializersPass to not make incorrect programs. by @benvanik in #22118
- Three reverts to undo transfer_write deduplication and return to previous state by @newling in #22392
- [CI][Torch] Add llama 8b fp16 quality tests by @Groverkss in #22379
- [Codegen] Implement value bounds interface for LoadFromBufferOp by @jtuyls in #22390
- [ROCM] Improve f16 medium ukernel bounds by @jtuyls in #22393
- Add mmt4d ukernel for riscv64's zvfhmin and zvfh feature, for types f16xf16->f16/f32 by @adeel10x in #22231
- [DispatchCreation] Add clean up pattern for fusing pad into split reduction dispatch by @yzhang93 in #22398
- Add Max191 to CODEOWNERS by @Max191 in #22411
- [NFC] Replace all uses of OpBuilder.create with OpTy::create by @Muzammiluddin-Syed-ECE in #22406
- [ROCM][Target] Add target for Strix Halo, and Phoenix by @raikonenfnu in #22410
- [Codegen] Cleanup VectorLayoutAnalysis testing by @Groverkss in #22417
- Add final dispatch name to AMDGPU Register spill warning by @sebvince in #22407
- [LinalgExt][NFC] Split the op definition between pure ops and LinalgExt ops by @sakupan102 in #22368
- Give `inner_tiled` a strict verifier and explicit semantics with boolean parameters `distributed` and `opaque` by @bjacob in #22369
- [LinalgExt][NFC] Move AttrSizedOperandSegments from base class to individual ops by @Copilot in #22430
- Rewrite SingleSubgroupLayout documentation by @bjacob in #22412
- [Codegen][Tuner] solve name conflicts for merging td specs by @bangtianliu in #22409
- [tools] Add bash autocomplete script for iree-opt/iree-compile by @Groverkss in #22424
- Bump LLVM to llvm/llvm-project@e903494 by @Yu-Zhewen in #22427
- [Global Opt] Raise tensor.extract to input by @IanWood1 in #22434
- [Global Opt] Add flag to control edge reshape propagation by @IanWood1 in #22438
- Adding HAL virtual memory APIs. by @benvanik in #22437
- Fix ReplicateGlobalsPerAffinity to maintain correct order of globals and initializers by @Copilot in #22401
- Update IanWood1 in CODEOWNERS by @IanWood1 in #22447
- [Codegen][ROCm] Don't branch on undef in `getPaddingConvSize` by @kuhar in #22449
- [CI][TorchModels] Update llama 8b fp16 golden time by @jtuyls in #22426
- [LLVMGPU] Fix coding standards / style issues in config utils by @kuhar in #22454
- [Codegen] Cleanup VectorLayoutAnalysis details by @Groverkss in #22418
- [Codegen] Rewrite VectorLayoutAnalysis to a simpler implementation by @Groverkss in #22420
- Bump LLVM to llvm/llvm-project@466c526 by @Yu-Zhewen in #22450
- [Codegen] Move GPUApplyPaddingLevel to an interface implementation by @Groverkss in #22422
- [ukernels] Add missing specializations on gfx942/gfx950 and associated e2e tests by @sebvince in #22446
- [Codegen] Fix more coding style / standards issues by @kuhar in #22459
- [Codegen] Add vector size inference for ukernel operations. by @Copilot in #22440
- Migrate custom LDBG macro to LLVM’s built-in debug logging by @Yu-Zhewen in #22456
- Adding sysfs topology detection logic and switching to it by default. by @benvanik in #22455
- Fix e2e matmul mxfp4 tests on gfx950 post #22446 by @bjacob in #22464
- Adding SILENCE_DEPRECATIONS option to LLVM external projects cmake. by @benvanik in #22463
- [DT][NFC] Fix coding style / standards issues for encoding materialization. by @hanhanW in #22471
- [DT][NFCI] Use no-rollback driver for MaterializeEncoding passes. by @hanhanW in #22474
- Add myself to .github CODEOWNERS by @Groverkss in #22477
- Adding iree-link tool. by @benvanik in #22419
- [ci] Remove gh installation for mi325 ci by @Groverkss in #22476
- [DT] Implement MaterializeInterfaceBindingEncoding with interface methods. by @hanhanW in #22467
- [CPU] Switch IREE::CPU::TilingLevel to enum class by @Copilot in #22433
- Bump the github-actions group with 2 updates by @dependabot[bot] in #22436
- CMake: When `rocminfo` is present, ask users to explicitly enable or disable ROCm testing. by @bjacob in #22478
- [Integrate] Cherry-pick llvm/llvm-project@41f6566 by @Yu-Zhewen in #22470
- Harmonize `*ScaledMMAAttr` operand order and drop `MMAFragment` by @bjacob in #22465
- Revert "[LLVMCPU] Propagate target features and CPU name to individual LLVMFuncOp" by @hanhanW in #22488
- Bump LLVM to llvm/llvm-project@03e66ae by @Yu-Zhewen in #22487
- [GPU] Add serial tiling level by @Groverkss in #22479
- Add Cursor files to gitignore by @Max191 in #22469
- [compiler][nfc] Remove using-declarations pollution from headers. by @hanhanW in #22501
- [DT] Collapse MaterializeEncodingIntoPaddingPass into the generic pass. by @hanhanW in #22472
- Bump LLVM to llvm/llvm-project@09318c6 by @Yu-Zhewen in #22494
- [CI] Run w7900 tests on any runner with two w7900 gpus by @kuhar in #22511
- [CPU][NFC] Style fixes and address post-commit comments. by @hanhanW in #22505
- [CI] Fix typo in reserved trailers by @kuhar in #22514
- [CI] Make rdna3 runner requirements more fine-grained by @kuhar in #22513
- Bump LLVM to llvm/llvm-project@04f87c693c7e by @hanhanW in #22515
- [LinalgExt] Decompose sub-byte map_scatter to extract/store by @jtuyls in #22315
- [ROCM] Update bounds for large f16 data-tiling ukernel by @jtuyls in #22481
- Remove value bounds interface for ExpandShapeOp by @jtuyls in #22460
- Revert "Three reverts to undo transfer_write deduplication and return… by @Groverkss in #22521
- [CPU][NFC] Trim IRs for lowering_config tests. (2/N) by @hanhanW in #22512
- Implement `subgroups_k` in data-tiled MMA layouts by @bjacob in #22519
- [Codegen][ROCm] Add WMMA intrinsics for gfx1250 by @kuhar in #22516
- Bump LLVM to llvm-project@6a275de13f6c by @hanhanW in #22524
- [Torch] Disable deprecation declaration warnings when building torch-mlir-dialects by @hanhanW in #22526
- [LinalgExt] Don't force MxK layout for im2col output by @Max191 in #22396
- [GPU] Clean up misc issues in IREEGPUAttrs. NFC. by @kuhar in #22531
- [Codegen][GPU] Allow intentional padding for non-K-major matmul layouts by @jerryyin in #22486
- [DispatchCreation] Enable splitting multiple reduction dimensions for weight backward convs by @yzhang93 in #22491
- [Integrate] Drop the revert of affine canonicalization commit (8c05b5cc) by @hanhanW in #22530
- [GPU] Add consumer fusion for GPUApplyTilingLevel by @Groverkss in #22522
- [CPU][NFC] Trim unnecessary IRs for CPU tests. by @hanhanW in #22546
- [DispatchCreation] Enable fusion of encoding ops with multi-use producers by @Abhishek-Varma in #22444
- [LinalgExt] Decompose map_scatter with strided rank-reducing subviews by @Max191 in #22504
- [Global Opt] Move strided contraction pass after transpose prop by @IanWood1 in #22534
- Bump LLVM to llvm/llvm-project@0ce03c2be4c4 by @hanhanW in #22550
- [Input] Add RecomposeComplexOps pass in Torch/InputConversion/Passes by @raayandhar in #22276
- Using our own tablegen with depfile support. by @benvanik in #22554
- [LinalgExt] Added TilingInterface support for ExpReductionOp by @hhkit in #22316
- Fix `iree.build` source directory being gitignore'd by @rkayaith in #22391
- [Dispatch Creation] Drop unit dims from tensor.extract ops by @IanWood1 in #22503
- [Dispatch Creation] Don't add unfusable consumers to fusion group by @IanWood1 in #22461
- Integrate torch-mlir at llvm/torch-mlir@288cd5e8adb by @IanWood1 in #22508
- [GPU][DT] Refactor tile size selection for narrow matmul by @Yu-Zhewen in #22177
- [CI] Change numprocesses to 1 for amdgpu_vulkan_O0 by @hanhanW in #22567
- Fix BYO LLVM build: handle MLIRTargetLLVMIRImport as non-object library by @hanhanW in #22553
- Bump LLVM to llvm/llvm-project@f60e69315e9e by @hanhanW in #22565
- [CodeGen][Tuner] Add bindings to query SIMDs and CUs info by @RattataKing in #22527
- Bump spirv-cross submodule by @kuhar in #22556
- [runtime] Require aligned memory accesses by default by @kuhar in #22557
- [runtime] Simplify unaligned load/store impl for u64/f64. NFC. by @kuhar in #22570
- Update Lit test checks caused by upstream fcf79e5 by @lialan in #22480
- [Codegen] Allow iree_codegen.swizzle_hint to operate on tensors by @krzysz00 in #22552
- Bump LLVM to llvm/llvm-project@6fce53af846c by @hanhanW in #22573
- [CI] Force amdgpu_vulkan runner be shark10-ci by @hanhanW in #22580
- Example of using HalModuleDebugSink to find numerical divergence by @newling in #22535
- Enable CI for torch ops by @amd-eochoalo in #22548
- [CI][torch_ops] Force amdgpu_vulkan runner be shark10-ci by @amd-eochoalo in #22588
- [NFC] Refresh golden values for benchmarks. by @hanhanW in #22583
- [CI] Relax golden values for torch_models. by @hanhanW in #22592
- [CI] Relax golden values for torch_models more. by @hanhanW in #22593
- Fix LLD support in BYO LLVM builds by @hanhanW in #22594
- Bump LLVM to llvm/llvm-project@37403685298bd3a7 by @hanhanW in #22591
- Increase acceptable error in punet by @newling in #22169
- [CI] Refresh golden values for failing benchmarks: min(val*1.1, val+5ms) by @hanhanW in #22595
- [Codegen][GPU] Update heuristic to consider distribution from split reduction by @yzhang93 in #22575
- [CI] Force CPU torch benchmarks to use Threadripper. by @hanhanW in #22600
- Adding new .td metadata classes and making our defs consistent. by @benvanik in #22569
- [Codegen][GPU] Introduce scf::pipelineForLoop function from upstream for prefetchSharedMemory pass by @jerryyin in #22523
- Adding iree_hal_executable_cache_infer_format. by @benvanik in #21763
- Adding timeline-aware async execution across module boundaries. by @benvanik in #22381
- [NFC] Renaming `stream.parameter.*` to `stream.cmd.parameter.*`. by @benvanik in #22607
- Adding --gen-dialect-json to iree-tblgen. by @benvanik in #22603
- Integrate llvm 2025-11-10 by @nirvedhmeshram in #22608
- [CI] Update clip benchmark by @nirvedhmeshram in #22612
- [Codegen][Tuner] Extend ireeGPUTargetInfo constructor with new added attributes by @RattataKing in #22597
- [TensorExt] Add barrier ops and roundtrip tests 1/2 by @IanWood1 in #22577
- Improving support for iree_codegen.extract_strided_metadata. by @benvanik in #22606
- Integrates/llvm 2025-11-10 (part 2) by @nirvedhmeshram in #22613
- [PJRT] Update rocm pjrt by @castigli in #22317
- Update split reduction cutoff conditions by @yzhang93 in #22596
- Bump the github-actions group with 2 updates by @dependabot[bot] in #22614
- Integrates/llvm 20251112 by @nirvedhmeshram in #22624
- [Stream] Fixing update order and improving the cache for ReplicateGlobalsPerAffinity pass. by @hanhanW in #22499
- Add passes to insert and remove barriers 2/2 by @IanWood1 in #22566
- [TensorExt] Rename barrier to compute_barrier by @IanWood1 in #22627
- [DT] Add support for layout transfer in MaterializeEncoding pass. by @hanhanW in #22582
- [e2e] Use remarks to verify ukernel match by @Yu-Zhewen in #22620
- [runtime] Add explicit casts to char* to silence ubsan warnings by @kuhar in #22628
- [docs] Fix a typo in LinalgExtOps.td by @sakupan102 in #22633
- Fix mixed precision operands in splitReduction pass by @FlintWangacc in #22138
- [TensorExt] Add folder for barrier ops by @IanWood1 in #22616
- [Codegen][Tuner] expose python binding for getIGEMMGenericConvDetails by @bangtianliu in #22598
- [runtime] Fix incorrect alignment assumptions by @kuhar in #22571
- [LLVMCPU] Support tile-and-fuse anchoring on producer ops by @hanhanW in #22632
- Silence remaining UBSan warnings across runtime and spirv-cross by @kuhar in #22638
- Bump torch-mlir to llvm/torch-mlir@8d563af0b68 by @hanhanW in #22637
- [Codegen][GPU] Replace prefetchLoop with stage-based backward slicing by @jerryyin in #22605
- [CI] Optimize and clean up asan and tsan build scripts by @kuhar in #22639
- [VMVX][NFC] Trim unnecessary IRs from select_lowering_strategy.mlir by @hanhanW in #22641
- [DT] Allow to enable/disable interleaving separately for M/N/K dimensions, for each operand by @bjacob in #22626
- [DataTiling] Switch default to start from the DispatchCreation phase. by @hanhanW in #21441
- [Flow] Move ReplicateGlobalsPerAffinity pass to Flow by @sommerlukas in #22634
- [SPIRV][NFC] Simplify lowering strategy tests by removing unnecessary IRs by @hanhanW in #22648
- Bump llvm to llvm/llvm-project@7b7a422 by @nirvedhmeshram in #22635
- Use llvm cast function objects. NFC. by @kuhar in #22652
- Drop unnecessary namespaces from cast functions in plugins. NFC. 1/10 by @kuhar in #22653
- Drop unnecessary namespaces from cast functions in bindings/dispatch/external. NFC. 2/10 by @kuhar in #22654
- Drop unnecessary namespaces from cast functions in codegen common. NFC. 3/10 by @kuhar in #22655
- Drop unnecessary namespaces from cast functions in codegen backends. NFC. 4/10 by @kuhar in #22656
- Drop unnecessary namespaces from cast functions in dialect flow *ext. NFC. 7/10 by @kuhar in #22659
- Drop unnecessary namespaces from cast functions in dialect util. NFC. 8/10 by @kuhar in #22660
- Drop unnecessary namespaces from cast functions in dialect stream. NFC. 9/10 by @kuhar in #22661
- Drop unnecessary namespaces from cast functions in dialect vm vmvx etc. NFC. 10/10 by @kuhar in #22662
- Drop unnecessary namespaces from cast functions in dialect hal encoding. NFC. 6/10 by @kuhar in #22658
- Drop unnecessary namespaces from cast functions in codegen dialect utils. NFC. 5/10 by @kuhar in #22657
- [LLVMGPU][NFC] Simplify lowering_config tests. 1/N by @hanhanW in #22665
- Partial Revert "[e2e] Use remarks to verify ukernel match" by @Yu-Zhewen in #22647
- [CI] Optimize cmake flags for debug info builds by @kuhar in #22651
- [CI] Add ubsan build and test script. Run ubsan tests in CI. by @kuhar in #22650
- [AMD][GPU] Insert barrier in prologue before first shared memory write by @jerryyin in #22669
- [NFC] Switch to dyn_cast_if_present for consistency. by @hanhanW in #22670
- Update split reduction heuristic for extreme large GEMMs by @yzhang93 in #22636
- [Integrate] Bump torch-mlir to llvm/torch-mlir@a2bcca0f025bf0 by @hanhanW in #22680
- Suppress ROCm lsan errors in HIP driver tests by @qedawkins in #22675
- Update `coalesced_gather_dma` definitions by @lialan in #22294
- [Codegen][GPU] Add configurable num-stages option to prefetch pass by @jerryyin in #22673
- RHS type should be used by @NoumanAmir657 in #22686
- Drop prefetches in AVX512 ukernels by @bjacob in #22668
- Bump actions/checkout from 5.0.0 to 5.0.1 in the github-actions group by @dependabot[bot] in #22677
- [tuner][docs] update sharktuner readme by @bangtianliu in #22683
- Revert "[LDS] Lower to `coalesced_gather_dma` (#22294)" by @lialan in #22691
- Relax assert in task_worker_deinitialize in case thread creation failed by @qedawkins in #22689
- [tuner][docs] update the example td spec in sharktuner readme by @bangtianliu in #22692
- Integrate LLVM at 21e0b56d7afc by @lialan in #22667
- Revert "[PJRT] Update rocm pjrt (#22317)" by @lialan in #22678
- [CI] Reduce ctest parallelism in the clang job by @kuhar in #22704
- [RISCV] Clean up toolchain CMake configuration by @HanKuanChen in #22663
- Integrate LLVM at c2b4e481a050 by @lialan in #22701
- Implementing initial end-to-end support for external transients. by @benvanik in #22625
- [Preprocessing] Add compute_barrier in ConvertConvFilterToChannelsLast pass by @yzhang93 in #22679
- [Codegen][GPU] Generalize linalg.reduce operations by @bangtianliu in #22490
- [CI] Update iree-org/iree-test-suites@17a391dc38 by @IanWood1 in #22698
- [Dispatch Creation] Add pass to fold reshapes into barriers by @IanWood1 in #22642
- Integrate llvm @ aa3f930931e6 by @lialan in #22713
- [Dispatch Creation] Don't fuse uses from above by @IanWood1 in #22708
- [DispatchCreation] Move RemoveTensorBarriers to end of pipeline by @IanWood1 in #22703
- [docs] Clarify code review process by @kuhar in #22714
- [docs] Fix a typo in code review process by @kuhar in #22716
- [DispatchCreation] Set split reduction size for ArgCompare by @bangtianliu in #22466
- [CI][TorchModels] Update SDXL int8 model CI (1/2) by @raayandhar in #22621
- [CI][TorchModels] Add data-tiling for Llama 8B Fp8 on gfx942 by @Abhishek-Varma in #22387
- [Build] Optionally use hip headers from system Hip package by @AaronStGeorge in #22715
- [Flow] Transfer globals per affinity instead of replicating by @sommerlukas in #22623
- Adding some Stream canonicalizations and RefineUsage improvements. by @benvanik in #22610
- [LDS] Reland "Lower to `coalesced_gather_dma` (#22294)" by @lialan in #22696
- [Codegen] Fold bitcast into bufferized tensor load by @Yu-Zhewen in #22672
- [DispatchCreation][NFC] Refactor split reduction helper methods to static functions by @bangtianliu in #22727
- [spirv] Handle 0d vectors during unrolling by @kuhar in #22730
- [LLVMGPU][Codegen] Emit packed chain FMA from select multi_reductions and contracts by @efric in #21855
- [Encoding] Add SerializableAttr interface to packed_storage by @sommerlukas in #22688
- Revert "[LLVMGPU][Codegen] Emit packed chain FMA from select multi_reductions and contracts" by @hanhanW in #22736
- [Codegen][GPU]Fixing barrier placement for 3+ stages pipelining by @jerryyin in #22725
- [Dispatch Creation] Add aggressive reshape movement flag by @IanWood1 in #22707
- Update CODEOWNERS to add more reviewers for GPU codegen pieces by @MaheshRavishankar in #22721
- [CI][TorchModels] Update flags for CLIP test. by @MaheshRavishankar in #22413
- [TensorExt] Add Operations/Attributes/Interfaces for specifying ragged tensors. by @MaheshRavishankar in #22267
- Bump actions/checkout from 5.0.1 to 6.0.0 in the github-actions group by @dependabot[bot] in #22742
- Fix incompatible pointer types for macOS build. by @hanhanW in #22738
- Integrate llvm/llvm-project@778e104d by @yzhang93 in #22741
- [Codegen] Test Cleanup 1/8: Common CPU tests by @qedawkins in #22744
- [CI] Bump golden value to 165*1.1=181.5 for prefill benchmark on mi325 by @hanhanW in #22752
- [Codegen] Test Cleanup 8/8: VMVX tests by @qedawkins in #22751
- [Codegen] Test Cleanup 4/8: Dialect tests by @qedawkins in #22747
New Contributors
- @willghatch made their first contribution in #22216
- @sjain-stanford made their first contribution in #22283
- @ziliangzl made their first contribution in #22248
- @weidel-p made their first contribution in #22365
- @sakupan102 made their first contribution in #22368
- @Copilot made their first contribution in #22430
- @raayandhar made their first contribution in #22276
- @FlintWangacc made their first contribution in #22138
- @sommerlukas made their first contribution in #22634
Full Changelog: v3.8.0...v3.9.0