IREE Release v3.9.0

1. Compiler

1.1 Data Tiling & GEMM Improvements

iree-opt-data-tiling promoted to umbrella flag with suggested config. (#22295)
Default path switched to the DispatchCreation phase; use --iree-global-opt-data-tiling for the legacy behavior. See docs. (#21441)
Implemented subgroups_k in data-tiled MMA layouts. (#22519)
Added per-operand M/N/K interleaving control. (#22626)
Added layout transfer support in MaterializeEncoding. (#22582)
Gave inner_tiled a strict verifier and explicit semantics via the boolean parameters distributed and opaque. (#22369)
Unified encoding materialization passes. (#22472)
Encoding op fusion with multi-use producers at -O3. (#22444)
Intentional padding for non-K-major layouts (~2.7% GEMM improvement). (#22486)
Better heuristics for extremely large GEMMs. (#22636)
Refactored narrow matmul tile size selection. (#22177)
Split reduction for large-K GEMMs. (#22357)
Updated ukernel data layout. (#22350)
Fixed large f16 ukernel bounds. (#22481)
Added LLaMA 8B FP8 benchmark tests on gfx942. (#22387)
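
To illustrate the idea behind split reduction for large-K GEMMs (#22357), here is a minimal Python sketch. It is not IREE's implementation: the function names and the num_splits parameter are hypothetical, and the compiler chooses its own split sizes. The sketch only shows that the K dimension can be cut into chunks, each chunk reduced independently, and the partial results summed in a final step.

```python
def matmul(a, b):
    """Reference GEMM: a is M x K, b is K x N (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a, b, num_splits):
    """Same result as matmul, computed as a sum of partial GEMMs.

    num_splits is a hypothetical knob for this sketch; each chunk of K
    could run as independent work before the final reduction.
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // num_splits)  # ceiling division
    partials = []
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        # Partial GEMM over the K range [start, stop).
        partials.append(
            [[sum(a[i][p] * b[p][j] for p in range(start, stop))
              for j in range(n)] for i in range(m)])
    # Final reduction: elementwise sum of all partial results.
    return [[sum(part[i][j] for part in partials) for j in range(n)]
            for i in range(m)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]
split_result = split_k_matmul(a, b, num_splits=2)
reference = matmul(a, b)
```

With integer inputs the two results match exactly; with floating-point inputs the changed summation order can introduce small rounding differences.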

1.2 Dispatch Creation

Added split-reduction support for arg_compare, preventing shared-memory overflow and fixing LLaMA 8B FP16 compilation failures. (#22466)
Added aggressive multi-use fusion for encoding ops (enabled at -O3), significantly improving fusion patterns seen in SDXL. (#22444)
Enabled consumer fusion for GPUApplyTilingLevel on scf.forall loops, enhancing padding-level fusion. (#22522)
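
The split-reduction change for arg_compare (#22466) can be pictured with a small Python sketch. It assumes the comparator is a plain "greater than" (an argmax), which is only one possible case, and the names argmax, split_argmax and chunk_size are hypothetical rather than taken from IREE. Each chunk yields a (value, index) pair, and a second pass picks the winner among the per-chunk results, so no single step has to hold the whole reduction dimension at once.

```python
def argmax(values):
    """Return (max value, index of its first occurrence); values must be non-empty."""
    best_idx = 0
    for i, v in enumerate(values):
        if v > values[best_idx]:  # strict comparison keeps the first occurrence
            best_idx = i
    return values[best_idx], best_idx


def split_argmax(values, chunk_size):
    """Argmax computed as per-chunk argmax followed by a combine step."""
    partials = []
    for start in range(0, len(values), chunk_size):
        val, idx = argmax(values[start:start + chunk_size])
        # Convert the chunk-local index back to a global index.
        partials.append((val, start + idx))
    best_val, best_idx = partials[0]
    for val, idx in partials[1:]:
        if val > best_val:  # chunks are visited in order, so ties keep the earlier index
            best_val, best_idx = val, idx
    return best_val, best_idx


values = [3, 9, 2, 9, 7, 1]
whole = argmax(values)
split = split_argmax(values, chunk_size=2)
```

Both paths agree on the value and on the index, including on ties, because the strict comparison is applied in the same order in each stage.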

1.3 GPU Codegen

Added barrier insertion before first shared-memory write for AMD GPUs, fixing non-deterministic strided conv results (13% -> 0% failure rate). (#22669)
Rewrote loop prefetcher with a stage-based backward slicing model for better maintainability (no functional change). (#22605)
Implemented vector size inference for UKernelGenericOp, enabling downstream ops (e.g., unpack) to correctly vectorize instead of falling back to scalar code. (#22440)
Improved f16 medium ukernel bounds on ROCm for better matmul throughput. (#22393)
Added mmt4d ukernel support for RISC-V zvfh/zvfhmin, enabling f16xf16->f16/f32 kernels with runtime hardware probing. (#22231)
Generalized GPU lowering for linalg.reduce ops, converting illegal i1 reductions to generic form to unblock split-reduction pipelines. (#22490)

1.4 Others

Interfaces, Layouts & IR Improvements (#22467, #22390, #22368)
Various correctness and quality improvements across codegen, layout propagation, and GPU lowering. (#22636, #22490, #22466, #22669, #22522, #22605, #22486, #22519, #22444, #22393, #22231, #22467, #22390, #22368, #22440, #22598)
Exposed C and Python bindings for IGEMM convolution details (#22598)

2. Runtime

Implemented the first end-to-end support for external transients, enabling early but functional handling of control flow and cross-dispatch transient values.
- Current limitations: no function calls and no data-dependent values; simple control flow is supported and aligns with future dispatch specialization work. (#22625)
Added timeline-aware async execution across module boundaries, introducing foundational interfaces for precise cross-module scheduling. (#22381)
Improved support for iree_codegen.extract_strided_metadata, ensuring information-preserving lowering:
- Now normalizes into iree_codegen earlier, avoiding loss of stride/offset/alignment information that occurred when prematurely converting to memref. (#22606)
Added new Stream canonicalizations and improved RefineUsage to reduce unnecessary copies and fix correctness bugs. (#22610)
Added --gen-dialect-json to iree-tblgen, generating JSON databases of dialect definitions using tablegen metadata. (#22603)

Change Log

Git History

What's Changed

[LinalgExt] Don't vectorize map_scatter in non-contiguous sub-byte access by @jtuyls in #22242
[python] Set up binding for preprocessing transform ops by @bangtianliu in #22227
Re-enable lds_barrier on RDNA4 by @krzysz00 in #21922
[CI][iree-test-suites] Try to make torch_models benchmarks more stable by @Groverkss in #22271
Reapply "[GPU] Allow multi result and indexing compute generic ops in TilleAndFuse pipeline" (#22205)" by @nirvedhmeshram in #22223
Reapply "[Dispatch Creation] Rework dispatch formation logic (#21854)" by @IanWood1 in #22065
[debugging][gpu] Add --iree-hip-emit-debug-info flag by @willghatch in #22216
[Codegen] Update the td spec using the contraction matcher op by @bangtianliu in #22249
[Codegen] Update the td spec using the attention matcher op by @bangtianliu in #22266
Revert "Re-enable lds_barrier on RDNA4" by @kuhar in #22278
Integrate llvm/llvm-project@b92483c by @newling in #22274
Support skinny scaled matmul in kernel config by @jtuyls in #22042
Use llvm wrappers for accumulate. NFC. by @kuhar in #22279
[NFC][GPU] Move reduction configuration to gpu utilities by @Groverkss in #22286
[GPU] Move convolution check out of unrelated function by @Groverkss in #22287
[GPU] Support iree_tensor_ext.dispatch.tensor.store for broadcast producer by @nirvedhmeshram in #22291
[Docs] Read from first line of rocm_agent_enumerator output by @sjain-stanford in #22283
[Codegen] Adding an optional dma_sizes field in GPU attributes by @lialan in #22281
Bump LLVM to llvm/llvm-project@5a636c6 by @MaheshRavishankar in #22290
Let MLIR ukernels provide their matching and data-tiled-layout info. by @bjacob in #22254
[LLVMCPU] Propagate target features and CPU name to individual LLVMFuncOp by @mshockwave in #22036
[CI][TorchModels] Update flags used for LLaMa 8b f8/fp16. by @MaheshRavishankar in #22297
Promote iree-opt-data-tiling to pipeline options. by @hanhanW in #22295
Bump version to 3.9.0 after 3.8.0 release. by @sa-faizal in #22308
[GPU] Enabling Gather-like ops to go through GPUTileAndFuse pipeline by @Abhishek-Varma in #22251
[python] Set up python binding for matcher convolution and attention op by @bangtianliu in #22311
[DT][NFC] Trim IRs in encoding materialization tests for GPU and RISCV backends. by @hanhanW in #22313
[GPU] Update K Tile size picking for multiple K dims by @Muzammiluddin-Syed-ECE in #22310
[codegen][gpu] Make transfer_write conditional when not fully distributed by @newling in #22198
[Stream] Replicate globals per affinity before Stream conversion. by @hanhanW in #22117
Fix non-deterministic hoisting by @IanWood1 in #22319
Drop revert of llvm/llvm-project#159083 by @MaheshRavishankar in #22298
[Codegen] Allow pre-padding other dims of a conv except the input channel by @yzhang93 in #22296
[CI][Torch] Update dispatch counts after non-determinism fix by @Groverkss in #22333
[Codegen] Use llvm accumulate wrappers. NFC. by @kuhar in #22331
[Codegen] Tile memref.copy when vectorizing for dynamic dims by @jtuyls in #22168
Reapply "Re-enable lds_barrier on RDNA4" (#22278) by @krzysz00 in #22326
[Codegen] Handle multiple dyn dims in tensor load pattern by @IanWood1 in #22328
[DT][NFC] Add test files for materializing IREE ops with encodings. by @hanhanW in #22322
[DT][NFC] Trim IRs for materialize_encoding_aarch64.mlir test. by @hanhanW in #22327
[DT][NFC] Trim unnecessary IRs for materialize_encoding_vmvx.mlir test. by @hanhanW in #22330
[DT][NFC] Trim unnecessary IRs for materialize_encoding_x86_64.mlir test. by @hanhanW in #22332
[DispatchCreation] Add split reduction for weight backward convs by @yzhang93 in #22275
[Integrate] Bump LLVM to llvm/llvm-project@893b1d4 by @MaheshRavishankar in #22334
[DT][NFCI] Implement getOffsetsSizesStrides for GPU padding resolver. by @hanhanW in #22339
Remove moveCrossThreadOutermost by @bjacob in #22284
[Global Opt] Don't propagate edge reshapes by @IanWood1 in #22320
[DT][NFC] Collapse MaterializeScaledContractionOp into generic pattern. by @hanhanW in #22340
[Codegen][Tuner] Add root_op for matvec and reduction along VectorDistribute pipeline by @bangtianliu in #22348
Catch MLIR ukernel parsing errors by @bjacob in #22353
[ROCM][DT] Update ukernel data layout by @Yu-Zhewen in #22350
[GlobalOpt] Fix transpose propagation for index-semantic ops by interchanging indexing maps by @ziliangzl in #22248
[build flags] 2nd prep to enable more warnings in compile flags (#21996) by @schuermans-roofline in #22273
[LinalgExt] Fix scatter unique_indices when dropping unit dims by @IanWood1 in #22362
[DT][NFC] Refactor linalg.fill/generic op lowering to interface implementation. by @hanhanW in #22343
[DT] Mark partial slices unsupported in padding encoding resolver. by @hanhanW in #22359
[DT] Implement LayoutMaterializerAttr for identity resolver. by @hanhanW in #22337
[Codegen] Canonicalize loops and subviews after copy vectorization by @jtuyls in #22344
Bump LLVM to llvm/llvm-project@c8cf393 by @Muzammiluddin-Syed-ECE in #22354
[DT] Support partial load/store for identity encoding resolver. by @hanhanW in #22360
[Codegen] Remove batch size in target intrinsic checks by @jtuyls in #22289
[NFC] Wrap directory structure within a block. by @hanhanW in #22373
[DT] Support partial load/store for GPU padding encoding resolver. by @hanhanW in #22372
[AMDGPU] Cache_swizzle stride for fat raw buffer loads should in bytes by @sebvince in #22314
[LLVMCPU] Refactor multi lowering config propagation and setting by @Yu-Zhewen in #22126
[build flags] enable more warnings in compile flags (#21996) by @schuermans-roofline in #22240
Bump LLVM to llvm/llvm-project@683e2bf by @Muzammiluddin-Syed-ECE in #22366
[NFC][ROCM] Simplify ukernel encoding materialization tests by @jtuyls in #22376
[StableHLO] Fix reshape canonicalization for dense_resource constants. by @weidel-p in #22365
[CI][TorchModels] Add SDXL int8 model to Torch Models CI. by @MaheshRavishankar in #22364
[VectorDistribute] Fix transfer_write broadcasting guard by @Groverkss in #22352
[NFC] Merge common type constraints by @krzysz00 in #22358
[Encoding] fix dependency issues with @3815582bbd by @Muzammiluddin-Syed-ECE in #22384
[Stream] Deduplicate the dispatch workloads by @jtuyls in #22187
[DispatchCreation] Set split reduction size for GEMM with large k dim by @yzhang93 in #22357
Adding markAllAnalysesPreserved to verification passes. by @benvanik in #22380
Rewriting CombineInitializersPass to not make incorrect programs. by @benvanik in #22118
Three reverts to undo transfer_write deduplication and return to previous state by @newling in #22392
[CI][Torch] Add llama 8b fp16 quality tests by @Groverkss in #22379
[Codegen] Implement value bounds interface for LoadFromBufferOp by @jtuyls in #22390
[ROCM] Improve f16 medium ukernel bounds by @jtuyls in #22393
Add mmt4d ukernel for riscv64's zvfhmin and zvfh feature, for types f16xf16->f16/f32 by @adeel10x in #22231
[DispatchCreation] Add clean up pattern for fusing pad into split reduction dispatch by @yzhang93 in #22398
Add Max191 to CODEOWNERS by @Max191 in #22411
[NFC] Replace all uses of OpBuilder.create with OpTy::create by @Muzammiluddin-Syed-ECE in #22406
[ROCM][Target] Add target for Strix Halo, and Phoenix by @raikonenfnu in #22410
[Codegen] Cleanup VectorLayoutAnalysis testing by @Groverkss in #22417
Add final dispatch name to AMDGPU Register spill warning by @sebvince in #22407
[LinalgExt][NFC] Split the op definition between pure ops and LinalgExt ops by @sakupan102 in #22368
Give inner_tiled a strict verifier and explicit semantics with boolean parameters distributed and opaque by @bjacob in #22369
[LinalgExt][NFC] Move AttrSizedOperandSegments from base class to individual ops by @Copilot in #22430
Rewrite SingleSubgroupLayout documentation by @bjacob in #22412
[Codegen][Tuner] solve name conflicts for merging td specs by @bangtianliu in #22409
[tools] Add bash autocomplete script for iree-opt/iree-compile by @Groverkss in #22424
Bump LLVM to llvm/llvm-project@e903494 by @Yu-Zhewen in #22427
[Global Opt] Raise tensor.extract to input by @IanWood1 in #22434
[Global Opt] Add flag to control edge reshape propagation by @IanWood1 in #22438
Adding HAL virtual memory APIs. by @benvanik in #22437
Fix ReplicateGlobalsPerAffinity to maintain correct order of globals and initializers by @Copilot in #22401
Update IanWood1 in CODEOWNERS by @IanWood1 in #22447
[Codegen][ROCm] Don't branch on undef in getPaddingConvSize by @kuhar in #22449
[CI][TorchModels] Update llama 8b fp16 golden time by @jtuyls in #22426
[LLVMGPU] Fix coding standards / style issues in config utils by @kuhar in #22454
[Codegen] Cleanup VectorLayoutAnalysis details by @Groverkss in #22418
[Codegen] Rewrite VectorLayoutAnalysis to a simpler implementation by @Groverkss in #22420
Bump LLVM to llvm/llvm-project@466c526 by @Yu-Zhewen in #22450
[Codegen] Move GPUApplyPaddingLevel to an interface implementation by @Groverkss in #22422
[ukernels] Add missing specializations on gfx942/gfx950 and associated e2e tests by @sebvince in #22446
[Codegen] Fix more coding style / standards issues by @kuhar in #22459
[Codegen] Add vector size inference for ukernel operations. by @Copilot in #22440
Migrate custom LDBG macro to LLVM’s built-in debug logging by @Yu-Zhewen in #22456
Adding sysfs topology detection logic and switching to it by default. by @benvanik in #22455
Fix e2e matmul mxfp4 tests on gfx950 post #22446 by @bjacob in #22464
Adding SILENCE_DEPRECATIONS option to LLVM external projects cmake. by @benvanik in #22463
[DT][NFC] Fix coding style / standards issues for encoding materialization. by @hanhanW in #22471
[DT][NFCI] Use no-rollback driver for MaterializeEncoding passes. by @hanhanW in #22474
Add myself to .github CODEOWNERS by @Groverkss in #22477
Adding iree-link tool. by @benvanik in #22419
[ci] Remove gh installation for mi325 ci by @Groverkss in #22476
[DT] Implement MaterializeInterfaceBindingEncoding with interface methods. by @hanhanW in #22467
[CPU] Switch IREE::CPU::TilingLevel to enum class by @Copilot in #22433
Bump the github-actions group with 2 updates by @dependabot[bot] in #22436
CMake: When rocminfo is present, ask users to explicitly enable or disable ROCm testing. by @bjacob in #22478
[Integrate] Cherry-pick llvm/llvm-project@41f6566 by @Yu-Zhewen in #22470
Harmonize *ScaledMMAAttr operand order and drop MMAFragment by @bjacob in #22465
Revert "[LLVMCPU] Propagate target features and CPU name to individual LLVMFuncOp" by @hanhanW in #22488
Bump LLVM to llvm/llvm-project@03e66ae by @Yu-Zhewen in #22487
[GPU] Add serial tiling level by @Groverkss in #22479
Add Cursor files to gitignore by @Max191 in #22469
[compiler][nfc] Remove using-declarations pollution from headers. by @hanhanW in #22501
[DT] Collapse MaterializeEncodingIntoPaddingPass into the generic pass. by @hanhanW in #22472
Bump LLVM to llvm/llvm-project@09318c6 by @Yu-Zhewen in #22494
[CI] Run w7900 tests on any runner with two w7900 gpus by @kuhar in #22511
[CPU][NFC] Style fixes and address post-commit comments. by @hanhanW in #22505
[CI] Fix typo in reserved trailers by @kuhar in #22514
[CI] Make rdna3 runner requirements more fine-grained by @kuhar in #22513
Bump LLVM to llvm/llvm-project@04f87c693c7e by @hanhanW in #22515
[LinalgExt] Decompose sub-byte map_scatter to extract/store by @jtuyls in #22315
[ROCM] Update bounds for large f16 data-tiling ukernel by @jtuyls in #22481
Remove value bounds interface for ExpandShapeOp by @jtuyls in #22460
Revert "Three reverts to undo transfer_write deduplication and return… by @Groverkss in #22521
[CPU][NFC] Trim IRs for lowering_config tests. (2/N) by @hanhanW in #22512
Implement subgroups_k in data-tiled MMA layouts by @bjacob in #22519
[Codegen][ROCm] Add WMMA intrinsics for gfx1250 by @kuhar in #22516
Bump LLVM to llvm-project@6a275de13f6c by @hanhanW in #22524
[Torch] Disable deprecation declaration warnings when building torch-mlir-dialects by @hanhanW in #22526
[LinalgExt] Don't force MxK layout for im2col output by @Max191 in #22396
[GPU] Clean up misc issues in IREEGPUAttrs. NFC. by @kuhar in #22531
[Codegen][GPU] Allow intentional padding for non-K-major matmul layouts by @jerryyin in #22486
[DispatchCreation] Enable splitting multiple reduction dimensions for weight backward convs by @yzhang93 in #22491
[Integrate] Drop the revert of affine canonicalization commit (8c05b5cc) by @hanhanW in #22530
[GPU] Add consumer fusion for GPUApplyTilingLevel by @Groverkss in #22522
[CPU][NFC] Trim unnecessary IRs for CPU tests. by @hanhanW in #22546
[DispatchCreation] Enable fusion of encoding ops with multi-use producers by @Abhishek-Varma in #22444
[LinalgExt] Decompose map_scatter with strided rank-reducing subviews by @Max191 in #22504
[Global Opt] Move strided contraction pass after transpose prop by @IanWood1 in #22534
Bump LLVM to llvm/llvm-project@0ce03c2be4c4 by @hanhanW in #22550
[Input] Add RecomposeComplexOps pass in Torch/InputConversion/Passes by @raayandhar in #22276
Using our own tablegen with depfile support. by @benvanik in #22554
[LinalgExt] Added TilingInterface support for ExpReductionOp by @hhkit in #22316
Fix iree.build source directory being gitignore'd by @rkayaith in #22391
[Dispatch Creation] Drop unit dims from tensor.extract ops by @IanWood1 in #22503
[Dispatch Creation] Don't add unfusable consumers to fusion group by @IanWood1 in #22461
Integrate torch-mlir at llvm/torch-mlir@288cd5e8adb by @IanWood1 in #22508
[GPU][DT] Refactor tile size selection for narrow matmul by @Yu-Zhewen in #22177
[CI] Change numprocesses to 1 for amdgpu_vulkan_O0 by @hanhanW in #22567
Fix BYO LLVM build: handle MLIRTargetLLVMIRImport as non-object library by @hanhanW in #22553
Bump LLVM to llvm/llvm-project@f60e69315e9e by @hanhanW in #22565
[CodeGen][Tuner] Add bindings to query SIMDs and CUs info by @RattataKing in #22527
Bump spirv-cross submodule by @kuhar in #22556
[runtime] Require aligned memory accesses by default by @kuhar in #22557
[runtime] Simplify unaligned load/store impl for u64/f64. NFC. by @kuhar in #22570
Update Lit test checks caused by upstream fcf79e5 by @lialan in #22480
[Codegen] Allow iree_codegen.swizzle_hint to operate on tensors by @krzysz00 in #22552
Bump LLVM to llvm/llvm-project@6fce53af846c by @hanhanW in #22573
[CI] Force amdgpu_vulkan runner be shark10-ci by @hanhanW in #22580
Example of using HalModuleDebugSink to find numerical divergence by @newling in #22535
Enable CI for torch ops by @amd-eochoalo in #22548
[CI][torch_ops] Force amdgpu_vulkan runner be shark10-ci by @amd-eochoalo in #22588
[NFC] Refresh golden values for benchmarks. by @hanhanW in #22583
[CI] Relax golden values for torch_models. by @hanhanW in #22592
[CI] Relax golden values for torch_models more. by @hanhanW in #22593
Fix LLD support in BYO LLVM builds by @hanhanW in #22594
Bump LLVM to llvm/llvm-project@37403685298bd3a7 by @hanhanW in #22591
Increase acceptable error in punet by @newling in #22169
[CI] Refresh golden values for failing benchmarks: min(val*1.1, val+5ms) by @hanhanW in #22595
[Codegen][GPU] Update heuristic to consider distribution from split reduction by @yzhang93 in #22575
[CI] Force CPU torch benchmarks to use Threadripper. by @hanhanW in #22600
Adding new .td metadata classes and making our defs consistent. by @benvanik in #22569
[Codegen][GPU] Introduce scf::pipelineForLoop function from upstream for prefetchSharedMemory pass by @jerryyin in #22523
Adding iree_hal_executable_cache_infer_format. by @benvanik in #21763
Adding timeline-aware async execution across module boundaries. by @benvanik in #22381
[NFC] Renaming stream.parameter.* to stream.cmd.parameter.*. by @benvanik in #22607
Adding --gen-dialect-json to iree-tblgen. by @benvanik in #22603
Integrate llvm 2025-11-10 by @nirvedhmeshram in #22608
[CI] Update clip benchmark by @nirvedhmeshram in #22612
[Codegen][Tuner] Extend ireeGPUTargetInfo constructor with new added attributes by @RattataKing in #22597
[TensorExt] Add barrier ops and roundtrip tests 1/2 by @IanWood1 in #22577
Improving support for iree_codegen.extract_strided_metadata. by @benvanik in #22606
Integrates/llvm 2025-11-10 (part 2) by @nirvedhmeshram in #22613
[PJRT] Update rocm pjrt by @castigli in #22317
Update split reduction cutoff conditions by @yzhang93 in #22596
Bump the github-actions group with 2 updates by @dependabot[bot] in #22614
Integrates/llvm 20251112 by @nirvedhmeshram in #22624
[Stream] Fixing update order and improving the cache for ReplicateGlobalsPerAffinity pass. by @hanhanW in #22499
Add passes to insert and remove barriers 2/2 by @IanWood1 in #22566
[TensorExt] Rename barrier to compute_barrier by @IanWood1 in #22627
[DT] Add support for layout transfer in MaterializeEncoding pass. by @hanhanW in #22582
[e2e] Use remarks to verify ukernel match by @Yu-Zhewen in #22620
[runtime] Add explicit casts to char* to silence ubsan warnings by @kuhar in #22628
[docs] Fix a typo in LinalgExtOps.td by @sakupan102 in #22633
Fix mixed precision operands in splitReduction pass by @FlintWangacc in #22138
[TensorExt] Add folder for barrier ops by @IanWood1 in #22616
[Codegen][Tuner] expose python binding for getIGEMMGenericConvDetails by @bangtianliu in #22598
[runtime] Fix incorrect alignment assumptions by @kuhar in #22571
[LLVMCPU] Support tile-and-fuse anchoring on producer ops by @hanhanW in #22632
Silence remaining UBSan warnings across runtime and spirv-cross by @kuhar in #22638
Bump torch-mlir to llvm/torch-mlir@8d563af0b68 by @hanhanW in #22637
[Codegen][GPU] Replace prefetchLoop with stage-based backward slicing by @jerryyin in #22605
[CI] Optimize and clean up asan and tsan build scripts by @kuhar in #22639
[VMVX][NFC] Trim unnecessary IRs from select_lowering_strategy.mlir by @hanhanW in #22641
[DT] Allow to enable/disable interleaving separately for M/N/K dimensions, for each operand by @bjacob in #22626
[DataTiling] Switch default to start from the DispatchCreation phase. by @hanhanW in #21441
[Flow] Move ReplicateGlobalsPerAffinity pass to Flow by @sommerlukas in #22634
[SPIRV][NFC] Simplify lowering strategy tests by removing unnecessary IRs by @hanhanW in #22648
Bump llvm to llvm/llvm-project@7b7a422 by @nirvedhmeshram in #22635
Use llvm cast function objects. NFC. by @kuhar in #22652
Drop unnecessary namespaces from cast functions in plugins. NFC. 1/10 by @kuhar in #22653
Drop unnecessary namespaces from cast functions in bindings/dispatch/external. NFC. 2/10 by @kuhar in #22654
Drop unnecessary namespaces from cast functions in codegen common. NFC. 3/10 by @kuhar in #22655
Drop unnecessary namespaces from cast functions in codegen backends. NFC. 4/10 by @kuhar in #22656
Drop unnecessary namespaces from cast functions in dialect flow *ext. NFC. 7/10 by @kuhar in #22659
Drop unnecessary namespaces from cast functions in dialect util. NFC. 8/10 by @kuhar in #22660
Drop unnecessary namespaces from cast functions in dialect stream. NFC. 9/10 by @kuhar in #22661
Drop unnecessary namespaces from cast functions in dialect vm vmvx etc. NFC. 10/10 by @kuhar in #22662
Drop unnecessary namespaces from cast functions in dialect hal encoding. NFC. 6/10 by @kuhar in #22658
Drop unnecessary namespaces from cast functions in codegen dialect utils. NFC. 5/10 by @kuhar in #22657
[LLVMGPU][NFC] Simplify lowering_config tests. 1/N by @hanhanW in #22665
Partial Revert "[e2e] Use remarks to verify ukernel match" by @Yu-Zhewen in #22647
[CI] Optimize cmake flags for debug info builds by @kuhar in #22651
[CI] Add ubsan build and test script. Run ubsan tests in CI. by @kuhar in #22650
[AMD][GPU] Insert barrier in prologue before first shared memory write by @jerryyin in #22669
[NFC] Switch to dyn_cast_if_present for consistency. by @hanhanW in #22670
Update split reduction heuristic for extreme large GEMMs by @yzhang93 in #22636
[Integrate] Bump torch-mlir to llvm/torch-mlir@a2bcca0f025bf0 by @hanhanW in #22680
Suppress ROCm lsan errors in HIP driver tests by @qedawkins in #22675
Update coalesced_gather_dma definitions by @lialan in #22294
[Codegen][GPU] Add configurable num-stages option to prefetch pass by @jerryyin in #22673
RHS type should be used by @NoumanAmir657 in #22686
Drop prefetches in AVX512 ukernels by @bjacob in #22668
Bump actions/checkout from 5.0.0 to 5.0.1 in the github-actions group by @dependabot[bot] in #22677
[tuner][docs] update sharktuner readme by @bangtianliu in #22683
Revert "[LDS] Lower to coalesced_gather_dma (#22294)" by @lialan in #22691
Relax assert in task_worker_deinitialize in case thread creation failed by @qedawkins in #22689
[tuner][docs] update the example td spec in sharktuner readme by @bangtianliu in #22692
Integrate LLVM at 21e0b56d7afc by @lialan in #22667
Revert "[PJRT] Update rocm pjrt (#22317)" by @lialan in #22678
[CI] Reduce ctest parallelism in the clang job by @kuhar in #22704
[RISCV] Clean up toolchain CMake configuration by @HanKuanChen in #22663
Integrate LLVM at c2b4e481a050 by @lialan in #22701
Implementing initial end-to-end support for external transients. by @benvanik in #22625
[Preprocessing] Add compute_barrier in ConvertConvFilterToChannelsLast pass by @yzhang93 in #22679
[Codegen][GPU] Generalize linalg.reduce operations by @bangtianliu in #22490
[CI] Update iree-org/iree-test-suites@17a391dc38 by @IanWood1 in #22698
[Dispatch Creation] Add pass to fold reshapes into barriers by @IanWood1 in #22642
Integrate llvm @ aa3f930931e6 by @lialan in #22713
[Dispatch Creation] Don't fuse uses from above by @IanWood1 in #22708
[DispatchCreation] Move RemoveTensorBarriers to end of pipeline by @IanWood1 in #22703
[docs] Clarify code review process by @kuhar in #22714
[docs] Fix a typo in code review process by @kuhar in #22716
[DispatchCreation] Set split reduction size for ArgCompare by @bangtianliu in #22466
[CI][TorchModels] Update SDXL int8 model CI (1/2) by @raayandhar in #22621
[CI][TorchModels] Add data-tiling for Llama 8B Fp8 on gfx942 by @Abhishek-Varma in #22387
[Build] Optionally use hip headers from system Hip package by @AaronStGeorge in #22715
[Flow] Transfer globals per affinity instead of replicating by @sommerlukas in #22623
Adding some Stream canonicalizations and RefineUsage improvements. by @benvanik in #22610
[LDS] Reland "Lower to coalesced_gather_dma (#22294)" by @lialan in #22696
[Codegen] Fold bitcast into bufferized tensor load by @Yu-Zhewen in #22672
[DispatchCreation][NFC] Refactor split reduction helper methods to static functions by @bangtianliu in #22727
[spirv] Handle 0d vectors during unrolling by @kuhar in #22730
[LLVMGPU][Codegen] Emit packed chain FMA from select multi_reductions and contracts by @efric in #21855
[Encoding] Add SerializableAttr interface to packed_storage by @sommerlukas in #22688
Revert "[LLVMGPU][Codegen] Emit packed chain FMA from select multi_reductions and contracts" by @hanhanW in #22736
[Codegen][GPU]Fixing barrier placement for 3+ stages pipelining by @jerryyin in #22725
[Dispatch Creation] Add aggressive reshape movement flag by @IanWood1 in #22707
Update CODEOWNERS to add more reviewers for GPU codegen pieces by @MaheshRavishankar in #22721
[CI][TorchModels] Update flags for CLIP test. by @MaheshRavishankar in #22413
[TensorExt] Add Operations/Attributes/Interfaces for specifying ragged tensors. by @MaheshRavishankar in #22267
Bump actions/checkout from 5.0.1 to 6.0.0 in the github-actions group by @dependabot[bot] in #22742
Fix incompatible pointer types for macOS build. by @hanhanW in #22738
Integrate llvm/llvm-project@778e104d by @yzhang93 in #22741
[Codegen] Test Cleanup 1/8: Common CPU tests by @qedawkins in #22744
[CI] Bump golden value to 165*1.1=181.5 for prefill benchmark on mi325 by @hanhanW in #22752
[Codegen] Test Cleanup 8/8: VMVX tests by @qedawkins in #22751
[Codegen] Test Cleanup 4/8: Dialect tests by @qedawkins in #22747

New Contributors

@willghatch made their first contribution in #22216
@sjain-stanford made their first contribution in #22283
@ziliangzl made their first contribution in #22248
@weidel-p made their first contribution in #22365
@sakupan102 made their first contribution in #22368
@Copilot made their first contribution in #22430
@raayandhar made their first contribution in #22276
@FlintWangacc made their first contribution in #22138
@sommerlukas made their first contribution in #22634

Full Changelog: v3.8.0...v3.9.0
