Release v3.9.0

@sa-faizal sa-faizal released this 25 Nov 18:51
1a66819

IREE Release v3.9.0

1. Compiler

1.1 Data Tiling & GEMM Improvements

  • iree-opt-data-tiling promoted to umbrella flag with suggested config. (#22295)
  • Default path switched to the DispatchCreation phase; use --iree-global-opt-data-tiling for legacy behavior. See docs. (#21441)
  • Implemented subgroups_k in data-tiled MMA layouts. (#22519)
  • Added per-operand M/N/K interleaving control. (#22626)
  • Added layout transfer support in MaterializeEncoding. (#22582)
  • Strict inner_tiled verifier with distributed/opaque params. (#22369)
  • Unified encoding materialization passes. (#22472)
  • Encoding op fusion with multi-use producers at -O3. (#22444)
  • Intentional padding for non-K-major layouts (~2.7% GEMM improvement). (#22486)
  • Better heuristics for extremely large GEMMs. (#22636)
  • Refactored narrow matmul tile size selection. (#22177)
  • Split reduction for large-K GEMMs. (#22357)
  • Updated ukernel data layout. (#22350)
  • Fixed large f16 ukernel bounds. (#22481)
  • Added LLaMA 8B FP8 benchmark tests on gfx942. (#22387)
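
For readers new to data tiling, the general idea behind the items above can be sketched in a few lines of Python: a row-major matrix is repacked into fixed-size tiles, with padding where the shape is not a multiple of the tile size, so that each tile is contiguous for the inner kernel. This is only an illustration of the concept, not IREE's implementation; the function names, the nested-list representation, and the tile sizes are all assumptions chosen for the example.

```python
def pack_tiles(matrix, tile_rows, tile_cols, pad_value=0):
    """Pack a row-major 2-D list into [outer_r][outer_c][tile_rows][tile_cols].

    Hypothetical helper for illustration only; elements that fall outside
    the original matrix are filled with pad_value.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    outer_r = -(-rows // tile_rows)  # ceiling division
    outer_c = -(-cols // tile_cols)
    packed = []
    for tr in range(outer_r):
        tile_row = []
        for tc in range(outer_c):
            tile = []
            for i in range(tile_rows):
                r = tr * tile_rows + i
                row = []
                for j in range(tile_cols):
                    c = tc * tile_cols + j
                    in_bounds = r < rows and c < cols
                    row.append(matrix[r][c] if in_bounds else pad_value)
                tile.append(row)
            tile_row.append(tile)
        packed.append(tile_row)
    return packed


def unpack_tiles(packed, rows, cols):
    """Inverse of pack_tiles: drop the padding and restore row-major order.

    Assumes packed is non-empty (hypothetical helper, illustration only).
    """
    tile_rows = len(packed[0][0])
    tile_cols = len(packed[0][0][0])
    return [
        [packed[r // tile_rows][c // tile_cols][r % tile_rows][c % tile_cols]
         for c in range(cols)]
        for r in range(rows)
    ]


# A 3x5 matrix packed into 2x4 tiles needs a 2x2 grid of tiles with padding.
example = [[r * 5 + c for c in range(5)] for r in range(3)]
packed_example = pack_tiles(example, tile_rows=2, tile_cols=4)
```

Packing followed by unpacking returns the original matrix, which is the property that lets a compiler change the layout without changing results.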

1.2 Dispatch Creation

  • Added split-reduction support for arg_compare, preventing shared-memory overflow and fixing LLaMA 8B FP16 compilation failures. (#22466)
  • Added aggressive multi-use fusion for encoding ops (enabled at -O3), significantly improving fusion patterns seen in SDXL. (#22444)
  • Enabled consumer fusion for GPUApplyTilingLevel on scf.forall loops, enhancing padding-level fusion. (#22522)
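
As a rough illustration of what split reduction means for an arg-compare style operation such as argmax, the sketch below reduces each chunk independently and then combines the per-chunk winners. It shows the general technique only and is not taken from IREE; the function names and the chunk size are assumptions for the example.

```python
def argmax_with_value(values, base=0):
    """Return (index, value) of the first maximum; indices are offset by base."""
    best_idx, best_val = base, values[0]
    for offset, v in enumerate(values):
        if v > best_val:  # strict comparison keeps the earliest index on ties
            best_idx, best_val = base + offset, v
    return best_idx, best_val


def split_argmax(values, chunk_size):
    """Argmax computed as per-chunk partial reductions plus a final combine.

    Hypothetical helper for illustration; assumes values is non-empty.
    """
    # Step 1: reduce each chunk independently (these could run in parallel).
    partials = [
        argmax_with_value(values[start:start + chunk_size], base=start)
        for start in range(0, len(values), chunk_size)
    ]
    # Step 2: combine the partial (index, value) pairs in chunk order.
    best_idx, best_val = partials[0]
    for idx, val in partials[1:]:
        if val > best_val:
            best_idx, best_val = idx, val
    return best_idx
```

Each partial reduction only touches `chunk_size` elements at a time, which is why splitting can keep the working set of a single reduction small; the combine step must carry both the index and the value so that ties resolve the same way as in a single pass.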

1.3 GPU Codegen

  • Added barrier insertion before first shared-memory write for AMD GPUs, fixing non-deterministic strided conv results (13% -> 0% failure rate). (#22669)
  • Rewrote loop prefetcher with a stage-based backward slicing model for better maintainability (no functional change). (#22605)
  • Implemented vector size inference for UKernelGenericOp, enabling downstream ops (e.g., unpack) to correctly vectorize instead of falling back to scalar code. (#22440)
  • Improved f16 medium ukernel bounds on ROCm for better matmul throughput. (#22393)
  • Added mmt4d ukernel support for RISC-V zvfh/zvfhmin, enabling f16xf16->f16/f32 kernels with runtime hardware probing. (#22231)
  • Generalized GPU lowering for linalg.reduce ops, converting illegal i1 reductions to generic form to unblock split-reduction pipelines. (#22490)

1.4 Others

2. Runtime

  • Implemented initial end-to-end support for external transients, enabling early but functional handling of control flow and cross-dispatch transient values.
    • Current limitations: no function calls and no data-dependent values; simple control flow is supported and aligns with future dispatch specialization work. (#22625)
  • Added timeline-aware async execution across module boundaries, introducing foundational interfaces for precise cross-module scheduling. (#22381)
  • Improved support for iree_codegen.extract_strided_metadata, ensuring information-preserving lowering:
    • Now normalizes into iree_codegen earlier, avoiding loss of stride/offset/alignment information that occurred when prematurely converting to memref. (#22606)
  • Added new Stream canonicalizations and improved RefineUsage to reduce unnecessary copies and fix correctness bugs. (#22610)
  • Added --gen-dialect-json to iree-tblgen, generating JSON databases of dialect definitions using tablegen metadata. (#22603)

Full Changelog: v3.8.0...v3.9.0