Releases: NVIDIA/Model-Optimizer

ModelOpt 0.29.0 Release

09 May 05:26

Backward Breaking Changes

  • Refactor SequentialQuantizer to improve its implementation and maintainability while preserving its functionality.

Deprecations

  • Deprecate torch<2.4 support.

New Features

  • Upgrade LLM examples to use TensorRT-LLM 0.18.
  • Add new model support in the llm_ptq example: Gemma-3, Llama-Nemotron.
  • Add INT8 real quantization support.
  • Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can use the mtq.compress <modelopt.torch.quantization.compress> API to accelerate evaluation of quantized models (see the sketch after this list).
  • Use the shape of PyTorch parameters and buffers of TensorQuantizer <modelopt.torch.quantization.nn.modules.TensorQuantizer> to initialize them during restore, making quantized-model restoration more robust.
  • Support adding new custom quantization calibration algorithms. Refer to mtq.calibrate <modelopt.torch.quantization.model_quant.calibrate> or the custom calibration algorithm documentation for details.
  • Add EAGLE3 (LlamaForCausalLMEagle3) training and unified ModelOpt checkpoint export support for Megatron-LM.
  • Add support for the --override_shapes flag in ONNX quantization.
    • --calibration_shapes is reserved for the input shapes used during the calibration process.
    • --override_shapes is used to override the model's input shapes with static shapes.
  • Add support for UNet ONNX quantization.
  • Enable concat_elimination pass by default to improve the performance of quantized ONNX models.
  • Enable Redundant Cast elimination pass by default in moq.quantize <modelopt.onnx.quantization.quantize>.
  • Add a new attribute, parallel_state, to DynamicModule <modelopt.torch.opt.dynamic.DynamicModule> to support distributed parallelism such as data parallelism and tensor parallelism.
  • Add MXFP8, NVFP4 quantized ONNX export support.
  • Add a new example of torch quantization exported to ONNX in MXFP8 and NVFP4 precision.
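
The mtq.compress flow referenced above, as a minimal sketch: the model and calibration-loader helpers are hypothetical placeholders, and mtq.FP8_DEFAULT_CFG is used as a representative built-in PTQ config.

```python
import modelopt.torch.quantization as mtq

model = get_model()                  # hypothetical helper returning a PyTorch model
calib_dataloader = get_calib_data()  # hypothetical helper returning calibration batches

def forward_loop(model):
    # Run a few calibration batches so the inserted quantizers can collect statistics.
    for batch in calib_dataloader:
        model(batch)

# Post-training quantization with a built-in FP8 config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Compress the quantized weights so evaluation can use the real-quantization kernels.
mtq.compress(model)
```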

ModelOpt 0.27.1 Release

15 Apr 18:24

Add experimental quantization support for Llama 4, QwQ, and Qwen MoE models.

ModelOpt 0.27.0 Release

03 Apr 05:24

Deprecations

  • Deprecate real quantization configs. Please use the mtq.compress <modelopt.torch.quantization.compress> API for model compression after quantization.

New Features

  • New model support in the llm_ptq example: OpenAI Whisper.
  • Blockwise FP8 quantization support in unified model export.
  • Add quantization support to the Transformer Engine Linear module.
  • Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
  • To support expert-parallel (EP) resume from distributed checkpoints, modelopt_state in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is now stored in a different format. Legacy modelopt_state in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be re-saved in the new format.
  • Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% better performance than the previous implementation.
  • Add a new API, mtq.compress <modelopt.torch.quantization.compress>, for compressing model weights after quantization.
  • Add an option to simplify the ONNX model before quantization is performed.
  • (Experimental) Improve support for ONNX models with custom TensorRT op:
    • Add support for the --calibration_shapes flag (see the sketch after this list).
    • Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
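
A hedged sketch of ONNX PTQ with calibration shapes, tied to the flags above; the keyword names (quantize_mode, calibration_data, calibration_shapes, simplify, output_path) are assumptions modeled on the CLI flags in these notes and may differ from the actual signature of modelopt.onnx.quantization.quantize.

```python
import numpy as np
from modelopt.onnx.quantization import quantize

# Placeholder calibration batch keyed by input tensor name (assumed data format).
calibration_data = {"input": np.random.rand(8, 3, 224, 224).astype(np.float32)}

quantize(
    onnx_path="model.onnx",                  # placeholder model path
    quantize_mode="int8",
    calibration_data=calibration_data,
    calibration_shapes="input:8x3x224x224",  # shapes used only during calibration
    simplify=True,                           # simplify the ONNX model before quantization
    output_path="model.quant.onnx",
)
```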

Known Issues

  • Quantization of T5 models is broken. Please use nvidia-modelopt==0.25.0 with transformers<4.50 in the meantime.

ModelOpt 0.25.0 Release

03 Mar 17:41

Deprecations

  • Deprecate Torch 2.1 support.
  • Deprecate humaneval benchmark in llm_eval examples. Please use the newly added simple_eval instead.
  • Deprecate fp8_naive quantization format in llm_ptq examples. Please use fp8 instead.

New Features

  • Support the fast Hadamard transform in the TensorQuantizer class (modelopt.torch.quantization.nn.modules.TensorQuantizer).
    It can be used for rotation-based quantization methods, e.g. QuaRot. Users need to install the fast_hadamard_transform package to use this feature.
  • Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
  • Add FSDP2 support. FSDP2 can now be used for QAT.
  • Add LiveCodeBench and Simple Evals to the llm_eval examples.
  • Disable saving the ModelOpt state in the unified HF export APIs by default, i.e., add a save_modelopt_state flag to the export_hf_checkpoint API, which defaults to False (see the sketch after this list).
  • Add FP8 and NVFP4 real quantization support with LLM QLoRA example.
  • The modelopt.deploy.llm.LLM class now supports the tensorrt_llm._torch.LLM backend for quantized Hugging Face checkpoints.
  • Add NVFP4 PTQ example for DeepSeek-R1.
  • Add end-to-end AutoDeploy example for AutoQuant LLM models.
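
A short sketch of the unified HF export with the new save_modelopt_state flag mentioned above; the export directory is a placeholder and the model is assumed to come from an earlier mtq.quantize call.

```python
from modelopt.torch.export import export_hf_checkpoint

# `model` is a quantized Hugging Face model produced earlier (placeholder).
# save_modelopt_state now defaults to False; pass True only if the ModelOpt state
# must be restorable from the exported checkpoint.
export_hf_checkpoint(
    model,
    export_dir="exported_ckpt",  # placeholder output directory
    save_modelopt_state=False,
)
```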

ModelOpt 0.23.2 Release

19 Feb 12:27

Fix export for NVIDIA NeMo models.

ModelOpt 0.23.1 Release

14 Feb 10:50

Bug Fixes

  • Set torch.load(..., weights_only=False) where the Model Optimizer state is restored, since torch 2.6 changed the default value to True (see the snippet below).
  • Other minor fixes
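
For context on the weights_only fix above, a minimal illustration of the torch 2.6 behavior change; the checkpoint path is a placeholder.

```python
import torch

# torch 2.6 changed the default of `weights_only` to True, which rejects the pickled
# Python objects that make up the Model Optimizer state. Restoring such a checkpoint
# therefore needs the flag set explicitly (only do this for checkpoints you trust).
state = torch.load("modelopt_checkpoint.pth", weights_only=False)
```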

ModelOpt 0.23.0 - First OSS Release!

29 Jan 19:05

Backward Breaking Changes

  • NVIDIA TensorRT Model Optimizer has changed its license from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
  • Deprecate Python 3.8, Torch 2.0, and CUDA 11.x support.
  • The ONNX Runtime dependency is upgraded to 1.20, which no longer supports Python 3.9.
  • In the Hugging Face examples, trust_remote_code is set to False by default; users must explicitly enable it with the --trust_remote_code flag.

New Features

  • Added OCP Microscaling Formats (MX) fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, and INT8.
  • Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples (see the sketch after this list).
  • Allow exporting TensorRT-LLM checkpoints with a quantized lm_head. Quantizing lm_head can benefit smaller models at the potential cost of additional accuracy loss.
  • TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
  • New models support in the llm_ptq example: Llama 3.3, Phi 4.
  • Added Minitron pruning support for NeMo 2.0 GPT models.
  • Excluded modules in TensorRT-LLM export configs are now specified as wildcards.
  • The unified Llama 3.1 FP8 Hugging Face checkpoints can be deployed on SGLang.
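
A minimal sketch of trying the new NVFP4 format via fake quantization, as referenced above; the config name mtq.NVFP4_DEFAULT_CFG and the helper functions are assumptions patterned on the standard mtq.quantize workflow.

```python
import modelopt.torch.quantization as mtq

model = get_model()                  # hypothetical helper returning a PyTorch model
calib_dataloader = get_calib_data()  # hypothetical calibration loader

def forward_loop(model):
    # Feed a handful of calibration batches so the NVFP4 quantizers collect ranges.
    for batch in calib_dataloader:
        model(batch)

# NVFP4_DEFAULT_CFG is assumed to be the built-in config name for the new format.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```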