What's Changed
- bump release to dev version post 0.2.5 by @t-vi in #2513
- torch.cumsum api change by @jjsjann123 in #2507
- DTensor: support linear by @kshitij12345 in #2422
- TEv2 as default TE executor by @riccardofelluga in #2510
- Create `Symbol` with `is_prim=True` in `_register_custom_op` by @crcrpar in #2516
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in #2515
- Use underlying class's `__new__` in `MutableMappingWrapper.__new__` by @t-vi in #2514
- fix: working README.md example, support nvfuser for torch==2.8 by @lianakoleva in #2525
- Add KaelanDt as codeowner by @t-vi in #2540
- add ci skips by @t-vi in #2547
- TE: Fix cudnn.h not found by @kshitij12345 in #2536
- feat: _register_custom_op supports List[torch.Tensor] by @lianakoleva in #2529
- feat: provide guidance when registering custom op by @lianakoleva in #2530
- fix: string -> f-string where intended by @lianakoleva in #2528
- Initial TE NVFP4 recipe support by @riccardofelluga in #2523
- Add DTensor prim and torch symbol for exp by @kshitij12345 in #2496
- [DTensor] Add prim and torch sym for neg and reciprocal by @kshitij12345 in #2552
- Bump pytest from 8.3.5 to 8.4.2 by @dependabot[bot] in #2567
- Bump bitsandbytes from 0.47.0 to 0.48.0 by @dependabot[bot] in #2565
- Bump diffusers from 0.34.0 to 0.35.1 by @dependabot[bot] in #2564
- switch CI to non-interruptible by @t-vi in #2554
- tests: clean up xfails in VJP tests by @aobolensk in #2578
- fix output dtype for nvfuserex cumsum by @jjsjann123 in #2580
- Use signature (*args, **kwargs) when signature is unavailable by @shino16 in #2542
- fix: call torch.cuda.is_available() in available_devices by @aobolensk in #2602
- Add uint64 to thunder->torch dtype map by @crcrpar in #2519
- Enable direct bindings in Thunder by @rdspring1 in #2502
- move to nvfuser-cu128-torch28 by @t-vi in #2604
- [DTensor] Add torch symbol and prim for _grouped_mm by @kshitij12345 in #2503
- [DTensor] Add prim and torch symbol for `add` by @kshitij12345 in #2581
- MoE TensorParallel with Eager by @kshitij12345 in #2582
- fix: missing 'import torch' in README.md by @aobolensk in #2608
- Remove E741 from ruff lint ignore rules by @tpremrud in #2601
- Disallow `custom_op` that mutates arguments by @crcrpar in #2603
- Disabled TF32 on Ampere+ devices to stabilize numeric accuracy by @mattteochen in #2579
- Propagate rounding_mode in div_ by @beverlylytle in #2614
- Refactor quantization.py to use TSP by @tejapulagam in #2522
- [DTensor] Add test with parallelize_module by @kshitij12345 in #2598
- Inference benchmark of "meta-llama/Llama-4-Maverick-17B-128E" by @crcrpar in #2487
- Remove outdated scenarios from inference benchmark by @crcrpar in #2619
- Add `float4_e2m1fn_x2` to lcdtype_to_nvdtype_map by @crcrpar in #2532
- avoid `torch.float4_e2m1fn_x2` in `_get_min_and_val` by @crcrpar in #2533
- [benchmark_inference] Fix replacing the MoE by @kshitij12345 in #2620
- Fix FSDP NB by @kshitij12345 in #2629
- Fixes "function name is not defined" when using `DebugTransform` by @kiya00 in #2617
- Enable MoE TP with thunderfx by @kshitij12345 in #2611
- Add DIV_EXACT prim by @beverlylytle in #2626
- try getting version from `nvfuser_direct` first by @crcrpar in #2623
- Fix hf example and benchmark run on CPU by @aobolensk in #2583
- getattr should always be taken from the class and then bound by @t-vi in #2584
- Update benchmark_inference.py to support TP with thunderfx by @kshitij12345 in #2625
- Add TE's NVFP4 recipe to the test suite by @riccardofelluga in #2612
- Jj/cumsum nvfuserex opinfo tolerance by @jjsjann123 in #2586
- tests: Extend testing for dunder and binary elementwise operations by @aobolensk in #2597
- Propagate `disable_torch_autograd` to thunderfx's `_splitter` by @crcrpar in #2534
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in #2521
- [benchmark_inference] Enable `tqdm` only on rank 0 by @crcrpar in #2630
- fix tests for CI by @t-vi in #2627
- Remove cuda checks from TE and Triton xentropy executors by @riccardofelluga in #2613
- Fixes backward issue where silu outputs nan by @kiya00 in #2624
- [benchmark_inference] Update `from_linear` and `from_grouped_linear` to accept `fqn: str` by @crcrpar in #2631
- Have seed and offset in int64 for cudnn-frontend SDPA by @crcrpar in #2520
- Refactor low precision option handling in `benchmark_litgpt.py` by @riccardofelluga in #2615
- Bump the gha-updates group with 3 updates by @dependabot[bot] in #2568
- Fix import of TE `Recipe` by @ksivaman in #2635
- fix benchmarks job by @t-vi in #2607
- Add profile transform by @t-vi in #2636
- Enable interpolate tests, add PrimID mapping for ceil and floor by @aobolensk in #2609
- [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in #2637
- Update jvp computation in `test_grad.py` by @mattteochen in #2618
- Propagated requires_grad to torch tensor by @mattteochen in #2616
- add mask lookaside into transformers recipe by @t-vi in #2639
- Fix thunderjit in inference benchmark by @t-vi in #2644
- Warm up sufficiently by @wujingyue in #2638
- Reset peak memory stats before measurement by @wujingyue in #2647
- drop extra executor by @t-vi in #2648
- Relax Test Tolerances for TE executor tests by @riccardofelluga in #2646
- test_vjp_correctness_sdpa_manual: relax test tolerance (#2576) by @kiya00 in #2628
- [benchmark_inference] Decrease max_new_tokens for warm-up by @kshitij12345 in #2649
- [benchmark_inference] Reshape the output from run_routed_experts by @kshitij12345 in #2650
- Revert "[benchmark_inference] Decrease max_new_tokens for warm-up" by @wujingyue in #2660
- Fix the kv cache input STATIC_MEMORY_LOCATION tag in QuickStart example by @kiya00 in #2667
- be more thorough in replacing thunder.Device with torch.device in epilogue by @t-vi in #2669
- TE inference executor for 8 bit by @t-vi in #2632
- Store GroupedLinear's weight in GNK layout by @wujingyue in #2659
- [DTensor] Skip exp test on nvfuser by @kshitij12345 in #2671
- Add an option to profile only non-warmup iterations by @wujingyue in #2661
- Revert new primitive for grad bug fix; Apply localized solution for division output type consistency in _div_prim_grad by @Copilot in #2665
- Remove stray print by @kshitij12345 in #2673
- Remove `--dtensor-single-gpu` by @wujingyue in #2666
- reflect GroupedLinear's changed weight by @t-vi in #2674
- Use `torch._inductor.compile` for ThunderFX fallback entrypoint by @shino16 in #2600
- Tom/readme by @t-vi in #2684
- Add repr function for CACHE_OPTIONS and SHARP_EDGES_OPTIONS by @kiya00 in #2676
- add profile plugin by @t-vi in #2683
- bump version for release by @t-vi in #2685
New Contributors
- @aobolensk made their first contribution in #2578
- @tpremrud made their first contribution in #2601
- @mattteochen made their first contribution in #2579
Full Changelog: 0.2.5...0.2.6