Releases · NVIDIA/Model-Optimizer
ModelOpt 0.29.0 Release
Backward Breaking Changes
- Refactor `SequentialQuantizer` to improve its implementation and maintainability while preserving its functionality.
Deprecations
- Deprecate `torch<2.4` support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the `llm_ptq` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can use the `mtq.compress` (`modelopt.torch.quantization.compress`) API to accelerate evaluation of quantized models (see the PTQ sketch after this list).
- Use the shapes of the PyTorch parameters and buffers of `TensorQuantizer` (`modelopt.torch.quantization.nn.modules.TensorQuantizer`) to initialize them during restore. This makes restoring quantized models more robust.
- Support adding new custom quantization calibration algorithms. Please refer to `mtq.calibrate` (`modelopt.torch.quantization.model_quant.calibrate`) or the custom calibration algorithm documentation for more details.
- Add EAGLE3 (`LlamaForCausalLMEagle3`) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the `--override_shapes` flag in ONNX quantization: `--calibration_shapes` is reserved for the input shapes used during the calibration process, while `--override_shapes` overrides the model's input shapes with static shapes (see the ONNX sketch after this list).
- Add support for UNet ONNX quantization.
- Enable the `concat_elimination` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in `moq.quantize` (`modelopt.onnx.quantization.quantize`).
- Add a new attribute `parallel_state` to `DynamicModule` (`modelopt.torch.opt.dynamic.DynamicModule`) to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8 and NVFP4 quantized ONNX export support.
- Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
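As referenced in the FP8 GEMM item above, a typical flow is to fake-quantize with PTQ and then compress the weights for faster evaluation. The sketch below is illustrative only: the model, tokenizer, calibration texts, and the choice of `mtq.FP8_DEFAULT_CFG` are placeholder assumptions; `mtq.quantize` and `mtq.compress` are the APIs named in these notes.

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and calibration data; any PyTorch model works the same way.
model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)
calib_texts = ["Hello world", "Quantization calibration sample"]  # tiny stand-in dataset

def forward_loop(m):
    # Run calibration samples through the model so the quantizers can collect statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Fake-quantize with an FP8 per-tensor config (placeholder choice of config).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Compress the weights into their real low-precision representation so that
# evaluating the quantized model runs faster.
mtq.compress(model)
```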
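The two ONNX shape flags play different roles: calibration shapes affect only the calibration run, while override shapes rewrite the graph's input shapes. A hedged sketch follows; the keyword names are assumed to mirror the CLI flags, and the model path and shape strings are made up for illustration, so consult the `modelopt.onnx.quantization` documentation for the exact interface.

```python
# A hedged sketch: `quantize` is the function referenced in these notes as
# moq.quantize; the keyword names below mirror the CLI flags and are an
# assumption, not a confirmed signature.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="unet.onnx",                 # hypothetical input model
    quantize_mode="int8",                  # assumed kwarg name
    # Shapes used only while running calibration data through the model.
    calibration_shapes="sample:2x4x64x64",  # assumed shape-string syntax
    # Statically override the model's input shapes in the exported graph.
    override_shapes="sample:1x4x64x64",     # assumed shape-string syntax
)
```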
ModelOpt 0.27.1 Release
Add experimental quantization support for Llama 4, QwQ, and Qwen MoE models.
ModelOpt 0.27.0 Release
Deprecations
- Deprecate real quantization configs. Please use the `mtq.compress` (`modelopt.torch.quantization.compress`) API for model compression after quantization.
New Features
- New model support in the `llm_ptq` example: OpenAI Whisper.
- Blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- To support distributed checkpoint resume with expert parallelism (EP), `modelopt_state` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is now stored differently. The legacy `modelopt_state` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be re-saved in the new format.
- Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API, `mtq.compress` (`modelopt.torch.quantization.compress`), for compressing model weights after quantization.
- Add an option to simplify the ONNX model before quantization is performed.
- (Experimental) Improve support for ONNX models with custom TensorRT ops:
  - Add support for the `--calibration_shapes` flag.
  - Add automatic type and shape tensor propagation for full ORT support with the TensorRT EP.
- Add support for
Known Issues
- Quantization of T5 models is broken. Please use `nvidia-modelopt==0.25.0` with `transformers<4.50` in the meantime.
ModelOpt 0.25.0 Release
Deprecations
- Deprecate Torch 2.1 support.
- Deprecate the `humaneval` benchmark in the `llm_eval` examples. Please use the newly added `simple_eval` instead.
- Deprecate the `fp8_naive` quantization format in the `llm_ptq` examples. Please use `fp8` instead.
New Features
- Support the fast Hadamard transform in the `TensorQuantizer` class (`modelopt.torch.quantization.nn.modules.TensorQuantizer`). It can be used for rotation-based quantization methods, e.g. QuaRot. Users need to install the `fast_hadamard_transform` package to use this feature.
- Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
- Add FSDP2 support. FSDP2 can now be used for QAT.
- Add LiveCodeBench and Simple Evals to the `llm_eval` examples.
- Disable saving the ModelOpt state in the unified Hugging Face export APIs by default, i.e., add a `save_modelopt_state` flag to the `export_hf_checkpoint` API that defaults to False (see the sketch after this list).
- Add FP8 and NVFP4 real quantization support with an LLM QLoRA example.
- The `modelopt.deploy.llm.LLM` class now supports the `tensorrt_llm._torch.LLM` backend for quantized Hugging Face checkpoints.
- Add an NVFP4 PTQ example for DeepSeek-R1.
- Add end-to-end AutoDeploy example for AutoQuant LLM models.
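A short sketch of the new export default mentioned above; `export_hf_checkpoint` and `save_modelopt_state` come from these notes, while the `export_dir` argument and the pre-quantized `model` are assumptions for illustration.

```python
from modelopt.torch.export import export_hf_checkpoint

# `model` is assumed to have been quantized with modelopt.torch.quantization.
# As of this release the ModelOpt state is no longer written by default;
# pass save_modelopt_state=True to keep the previous behavior.
export_hf_checkpoint(model, export_dir="quantized_ckpt")  # new default: state not saved
export_hf_checkpoint(
    model,
    export_dir="quantized_ckpt_with_state",
    save_modelopt_state=True,  # opt back in explicitly
)
```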
ModelOpt 0.23.2 Release
Fix export for NVIDIA NeMo models.
ModelOpt 0.23.1 Release
Bug Fixes
- Set `torch.load(..., weights_only=False)` where the Model Optimizer state is restored, since torch 2.6 changes the default value to `True` (see the sketch below).
- Other minor fixes.
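For context on the fix above: torch 2.6 changed the default of `weights_only` to `True`, which refuses to unpickle the Python objects inside a ModelOpt state file. A minimal sketch of the explicit override (the file name, `model`, and the restore call shown are illustrative assumptions):

```python
import torch
import modelopt.torch.opt as mto

# torch >= 2.6 defaults to weights_only=True, which cannot load the arbitrary
# Python objects inside a ModelOpt state dict, so the flag is set explicitly.
modelopt_state = torch.load("modelopt_state.pth", weights_only=False)

# Re-apply the saved ModelOpt modifications to a freshly built `model`
# (assumed to be defined elsewhere).
model = mto.restore_from_modelopt_state(model, modelopt_state)
```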
ModelOpt 0.23.0 - First OSS Release!
Backward Breaking Changes
- NVIDIA TensorRT Model Optimizer has changed its license from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Deprecate Python 3.8, Torch 2.0, and CUDA 11.x support.
- The ONNX Runtime dependency is upgraded to 1.20, which no longer supports Python 3.9.
- In the Hugging Face examples, `trust_remote_code` is set to False by default, requiring users to explicitly turn it on with the `--trust_remote_code` flag (see the sketch below).
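The example flag presumably forwards to the standard Hugging Face `trust_remote_code` argument; a minimal sketch of what opting in looks like (the repository id is a placeholder):

```python
from transformers import AutoModelForCausalLM

# Models whose repositories ship custom modeling code only load when the user
# explicitly opts in; the examples now default to trust_remote_code=False.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/model-with-custom-code",  # placeholder repo id
    trust_remote_code=True,             # what --trust_remote_code turns on
)
```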
New Features
- Added OCP Microscaling Formats (MX) for fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, INT8.
- Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples.
- Allow exporting a TensorRT-LLM checkpoint with a quantized lm_head. Quantizing lm_head can benefit smaller models at the potential cost of additional accuracy loss.
- TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
- New model support in the `llm_ptq` example: Llama 3.3, Phi 4.
- Added Minitron pruning support for NeMo 2.0 GPT models.
- Excluded modules in TensorRT-LLM export configs are now wildcards.
- The unified Llama 3.1 FP8 Hugging Face checkpoints can be deployed on SGLang.