Releases · NVIDIA/Model-Optimizer
ModelOpt 0.29.0 Release
Backward Breaking Changes
- Refactor `SequentialQuantizer` to improve its implementation and maintainability while preserving its functionality.
Deprecations
- Deprecate `torch<2.4` support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the `llm_ptq` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can use the `mtq.compress` (`modelopt.torch.quantization.compress`) API to accelerate evaluation of quantized models (see the PTQ sketch after this list).
- Use the shapes of the PyTorch parameters and buffers of `TensorQuantizer` (`modelopt.torch.quantization.nn.modules.TensorQuantizer`) to initialize them during restore. This makes restoring quantized models more robust.
- Support adding new custom quantization calibration algorithms. Please refer to `mtq.calibrate` (`modelopt.torch.quantization.model_quant.calibrate`) or the custom calibration algorithm documentation for more details.
- Add EAGLE3 (`LlamaForCausalLMEagle3`) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the `--override_shapes` flag in ONNX quantization: `--calibration_shapes` is reserved for the input shapes used during the calibration process, while `--override_shapes` overrides the model's input shapes with static shapes (see the ONNX sketch after this list).
- Add support for UNet ONNX quantization.
- Enable the `concat_elimination` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in `moq.quantize` (`modelopt.onnx.quantization.quantize`).
- Add a new attribute `parallel_state` to `DynamicModule` (`modelopt.torch.opt.dynamic.DynamicModule`) to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8 and NVFP4 quantized ONNX export support.
- Add new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
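As referenced in the FP8 GEMM item above, a typical flow is to fake-quantize with PTQ and then compress the weights for faster evaluation. The sketch below is illustrative only: the model, tokenizer, calibration texts, and the choice of `mtq.FP8_DEFAULT_CFG` are placeholder assumptions; `mtq.quantize` and `mtq.compress` are the APIs named in these notes.

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model and calibration data; any PyTorch model works the same way.
model_id = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)
calib_texts = ["Hello world", "Quantization calibration sample"]  # tiny stand-in dataset

def forward_loop(m):
    # Run calibration samples through the model so the quantizers can collect statistics.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Fake-quantize with an FP8 per-tensor config (placeholder choice of config).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Compress the weights into their real low-precision representation so that
# evaluating the quantized model runs faster.
mtq.compress(model)
```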
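The two ONNX shape flags play different roles: calibration shapes affect only the calibration run, while override shapes rewrite the graph's input shapes. A hedged sketch follows; the keyword names are assumed to mirror the CLI flags, and the model path and shape strings are made up for illustration, so consult the `modelopt.onnx.quantization` documentation for the exact interface.

```python
# A hedged sketch: `quantize` is the function referenced in these notes as
# moq.quantize; the keyword names below mirror the CLI flags and are an
# assumption, not a confirmed signature.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="unet.onnx",                 # hypothetical input model
    quantize_mode="int8",                  # assumed kwarg name
    # Shapes used only while running calibration data through the model.
    calibration_shapes="sample:2x4x64x64",  # assumed shape-string syntax
    # Statically override the model's input shapes in the exported graph.
    override_shapes="sample:1x4x64x64",     # assumed shape-string syntax
)
```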
ModelOpt 0.27.1 Release
Add experimental quantization support for Llama 4, QwQ, and Qwen MoE models.
ModelOpt 0.27.0 Release
Deprecations
- Deprecate real quantization configs. Please use the `mtq.compress` (`modelopt.torch.quantization.compress`) API for model compression after quantization.
New Features
- New model support in the `llm_ptq` example: OpenAI Whisper.
- Blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- To support distributed checkpoint resume with expert parallelism (EP), `modelopt_state` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is now stored differently. The legacy `modelopt_state` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be re-saved in the new format.
- Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API, `mtq.compress` (`modelopt.torch.quantization.compress`), for compressing model weights after quantization.
- Add an option to simplify the ONNX model before quantization is performed.
- (Experimental) Improve support for ONNX models with custom TensorRT ops:
  - Add support for the `--calibration_shapes` flag.
  - Add automatic type and shape tensor propagation for full ORT support with the TensorRT EP.
- Add support for
Known Issues
- Quantization of T5 models is broken. Please use `nvidia-modelopt==0.25.0` with `transformers<4.50` in the meantime.
ModelOpt 0.25.0 Release
Deprecations
- Deprecate Torch 2.1 support.
- Deprecate the `humaneval` benchmark in the `llm_eval` examples. Please use the newly added `simple_eval` instead.
- Deprecate the `fp8_naive` quantization format in the `llm_ptq` examples. Please use `fp8` instead.
New Features
- Support the fast Hadamard transform in the `TensorQuantizer` class (`modelopt.torch.quantization.nn.modules.TensorQuantizer`). It can be used for rotation-based quantization methods, e.g. QuaRot. Users need to install the `fast_hadamard_transform` package to use this feature.
- Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
- Add FSDP2 support. FSDP2 can now be used for QAT.
- Add LiveCodeBench and Simple Evals to the `llm_eval` examples.
- Disable saving the ModelOpt state in the unified Hugging Face export APIs by default, i.e., add a `save_modelopt_state` flag to the `export_hf_checkpoint` API that defaults to False (see the sketch after this list).
- Add FP8 and NVFP4 real quantization support with an LLM QLoRA example.
- The `modelopt.deploy.llm.LLM` class now supports the `tensorrt_llm._torch.LLM` backend for quantized Hugging Face checkpoints.
- Add an NVFP4 PTQ example for DeepSeek-R1.
- Add end-to-end AutoDeploy example for AutoQuant LLM models.
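A short sketch of the new export default mentioned above; `export_hf_checkpoint` and `save_modelopt_state` come from these notes, while the `export_dir` argument and the pre-quantized `model` are assumptions for illustration.

```python
from modelopt.torch.export import export_hf_checkpoint

# `model` is assumed to have been quantized with modelopt.torch.quantization.
# As of this release the ModelOpt state is no longer written by default;
# pass save_modelopt_state=True to keep the previous behavior.
export_hf_checkpoint(model, export_dir="quantized_ckpt")  # new default: state not saved
export_hf_checkpoint(
    model,
    export_dir="quantized_ckpt_with_state",
    save_modelopt_state=True,  # opt back in explicitly
)
```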
ModelOpt 0.23.2 Release
Fix export for NVIDIA NeMo models.
ModelOpt 0.23.1 Release
Bug Fixes
- Set `torch.load(..., weights_only=False)` where the Model Optimizer state is restored, since torch 2.6 changes the default value to `True` (see the sketch below).
- Other minor fixes.
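For context on the fix above: torch 2.6 changed the default of `weights_only` to `True`, which refuses to unpickle the Python objects inside a ModelOpt state file. A minimal sketch of the explicit override (the file name, `model`, and the restore call shown are illustrative assumptions):

```python
import torch
import modelopt.torch.opt as mto

# torch >= 2.6 defaults to weights_only=True, which cannot load the arbitrary
# Python objects inside a ModelOpt state dict, so the flag is set explicitly.
modelopt_state = torch.load("modelopt_state.pth", weights_only=False)

# Re-apply the saved ModelOpt modifications to a freshly built `model`
# (assumed to be defined elsewhere).
model = mto.restore_from_modelopt_state(model, modelopt_state)
```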
ModelOpt 0.23.0 - First OSS Release!
Backward Breaking Changes
- NVIDIA TensorRT Model Optimizer has changed its license from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Deprecate Python 3.8, Torch 2.0, and CUDA 11.x support.
- The ONNX Runtime dependency is upgraded to 1.20, which no longer supports Python 3.9.
- In the Hugging Face examples, `trust_remote_code` is set to False by default, requiring users to explicitly turn it on with the `--trust_remote_code` flag (see the sketch below).
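The example flag presumably forwards to the standard Hugging Face `trust_remote_code` argument; a minimal sketch of what opting in looks like (the repository id is a placeholder):

```python
from transformers import AutoModelForCausalLM

# Models whose repositories ship custom modeling code only load when the user
# explicitly opts in; the examples now default to trust_remote_code=False.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/model-with-custom-code",  # placeholder repo id
    trust_remote_code=True,             # what --trust_remote_code turns on
)
```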
New Features
- Added OCP Microscaling Formats (MX) for fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, INT8.
- Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples.
- Allow exporting a TensorRT-LLM checkpoint with a quantized lm_head. Quantizing lm_head can benefit smaller models at the potential cost of additional accuracy loss.
- TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
- New model support in the `llm_ptq` example: Llama 3.3, Phi 4.
- Added Minitron pruning support for NeMo 2.0 GPT models.
- Excluded modules in TensorRT-LLM export configs are now wildcards.
- The unified Llama 3.1 FP8 Hugging Face checkpoints can be deployed on SGLang.