int4 quantization output onnx does not load #156

Open
thejaswi01 opened this issue Mar 13, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@thejaswi01

Describe the bug

Running int4 quantization on an ONNX file with a Conformer architecture produces an error at the end of quantization, and the output ONNX model does not load.

Steps/Code to reproduce bug

Run quantization with the script below:

from modelopt.onnx.quantization import quantize
import os

def run_quantization():
    input_path = 'model.onnx'
    output_path = 'model_int4.onnx'

    # Create the output directory only if the path has one;
    # os.makedirs('') raises when output_path is a bare filename.
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    quantize(
        input_path,
        quantize_mode='int4',
        use_external_data_format=True,
        output_path=output_path,
        verbose=True,
    )

if __name__ == "__main__":
    run_quantization()

Error

INFO:root:Quantized onnx model is saved as xxx
WARNING:root:ONNX model checker failed, check your deployment status.
WARNING:root:Unrecognized attribute: block_size for operator DequantizeLinear

==> Context: Bad node spec for node. Name: onnx::MatMul_4725_DequantizeLinear OpType: DequantizeLinear
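
The block_size attribute on DequantizeLinear was introduced in ONNX opset 21, so this checker warning typically means the model's declared default-domain opset (or the installed onnx package) is older than the version that defines the attribute. A minimal sketch to confirm what the saved model reports, assuming the onnx Python package is installed and the output path from the script above:

import onnx

# Load only the graph structure (skip external weight files) to read metadata.
model = onnx.load('model_int4.onnx', load_external_data=False)

# Print the opset each domain declares; block_size on DequantizeLinear is
# only defined for the default domain at opset 21 and newer.
for imp in model.opset_import:
    print(f"domain={imp.domain or 'ai.onnx'} version={imp.version}")

# Re-run the checker on the file path (this also works for models that use
# external data) to reproduce the warning outside the quantizer.
try:
    onnx.checker.check_model('model_int4.onnx')
    print('Model passes the ONNX checker.')
except onnx.checker.ValidationError as e:
    print(f'Checker failed: {e}')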

Expected behavior

The output model should be valid ONNX and should load and run as expected.

System information

  • Container used (if applicable): ?
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 22.04.5 LTS
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): NVIDIA A10G
  • GPU memory size: 22.5 GB
  • Number of GPUs: 1
  • Library versions (if applicable):
    • Python: 3.10.12
    • ModelOpt version or commit hash: 0.25.0
    • CUDA: 12.8
    • PyTorch: 2.5.1+cu124
    • Transformers: 4.47.1
    • TensorRT-LLM: 0.17.0.post1
    • ONNXRuntime: 1.20.1
    • TensorRT: 10.8.0.43
@thejaswi01 thejaswi01 added the bug Something isn't working label Mar 13, 2025
@thejaswi01 thejaswi01 changed the title int4 quantization outputs buggy onnx int4 quantization output onnx does not load Mar 13, 2025
@i-riyad
Collaborator

i-riyad commented Apr 21, 2025

The warning is not critical. The quantized model should compile and run successfully on both the TensorRT and DML backends. I have tested with TensorRT 10.8 and 10.9, and was able to generate the engine successfully from the quantized ONNX model.
If you’re encountering any issues during deployment, please share the specific error messages so we can help troubleshoot further.
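
As a concrete smoke test, the quantized model can be handed to ONNX Runtime with the TensorRT execution provider. A minimal sketch, assuming onnxruntime-gpu with TensorRT support is installed, the output path from the repro script above, and float32 inputs (adjust dtypes and shapes for a real Conformer model):

import numpy as np
import onnxruntime as ort

# Session creation is where the TensorRT EP compiles the graph; CUDA and CPU
# act as fallbacks for any unsupported nodes.
session = ort.InferenceSession(
    'model_int4.onnx',
    providers=[
        'TensorrtExecutionProvider',
        'CUDAExecutionProvider',
        'CPUExecutionProvider',
    ],
)

# Feed random data matching the declared input shapes as a quick sanity check.
inputs = {}
for inp in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]  # 1 for dynamic dims
    inputs[inp.name] = np.random.rand(*shape).astype(np.float32)

outputs = session.run(None, inputs)
print([o.shape for o in outputs])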

@i-riyad i-riyad self-assigned this Apr 23, 2025