
Conversation

cajeonrh
Contributor

@cajeonrh cajeonrh commented Oct 2, 2025

SUMMARY:
Created a FAQ page under the "Getting Started" section

TEST PLAN:

  • Created a new page called FAQ under "Getting Started" section
  • Included all FAQs from INFERENG-1867

Requesting review of content


github-actions bot commented Oct 2, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Contributor

Summary of Changes

Hello @cajeonrh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new Frequently Asked Questions (FAQ) page within the "Getting Started" section of the documentation. The primary goal is to centralize answers to common user queries about the LLM Compressor, thereby enhancing user self-service and clarity on topics such as model performance post-compression, integration with other tools like sglang, and practical guidance on compression strategies and memory requirements.

Highlights

  • New FAQ Page: A new Frequently Asked Questions (FAQ) page has been added to the documentation.
  • Comprehensive Coverage: The FAQ page addresses common questions regarding LLM Compressor, including performance expectations after compression, sglang integration, compression strategy selection, memory usage, and specifics on which layers to quantize.
  • Resource Links: The page provides direct links to relevant guides and code examples for users seeking more in-depth information on various topics.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new FAQ page, which is a great addition to the documentation. The content is relevant and covers important user questions. I've identified a few areas for improvement, mainly related to Markdown link formatting, content clarity, and consistency. There are several instances of incorrect link syntax that need to be fixed across the document. I've also suggested consolidating a couple of redundant questions and using relative paths for internal links to improve maintainability.

Collaborator

@fynnsu fynnsu left a comment

Thanks Cassie! Added a couple of suggestions below.

Also, some of your links are incorrectly formatted. It should be `[link text](link url)`.

@cajeonrh
Contributor Author

cajeonrh commented Oct 2, 2025

Thanks Fynn! I've incorporated your feedback.

Collaborator

@brian-dellabetta brian-dellabetta left a comment

Looks good! One comment: add a note on multimodal models for question 5.

@fynnsu
Collaborator

fynnsu commented Oct 2, 2025

[Screenshot: sidebar navigation showing the FAQ page title]

In the sidebar we're getting this title for the page. Can we simplify this to just "Frequently Asked Questions" or maybe even "FAQ"? I believe this is being set by the # header at the top of the file.

@fynnsu
Collaborator

fynnsu commented Oct 2, 2025

[Screenshot: "Getting Started" page boxes for "Installation", "Compress Your Model", and "Deploy on vLLM"]

Also on the "Getting Started" page we have these boxes for "Installation", "Compress Your Model" and "Deploy on vLLM". Could we add a box for FAQ?

Collaborator

@dsikka dsikka left a comment

Could we add a quick question on installation? vLLM and llmcompressor should be used in separate environments, as they may have dependency mismatches.

Collaborator

@dsikka dsikka left a comment

The other common question we get asked is about multi-GPU support.

Can we add the following?

  1. LLM Compressor handles all GPU movement for you.
  2. For data-free pathways, we leverage all available GPUs and offload anything that doesn't fit onto the allocated GPUs. If using pathways that require data, we sequentially onload model layers onto a single GPU. This is the case for LLM Compressor 0.6-0.8. (A sketch of the data-free flow follows below.)
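To make the data-free point above concrete, here is a minimal sketch of that flow, assuming the standard `transformers` + `llmcompressor` APIs as used in the project's published examples; the model ID and quantization scheme are placeholders, and exact import paths may differ between releases.

```python
# Hedged sketch: data-free quantization with automatic GPU placement.
# With device_map="auto", the model is spread across all visible GPUs and
# anything that does not fit is offloaded; LLM Compressor then handles any
# further device movement during oneshot. No calibration data is needed here,
# so no sequential layer onloading takes place.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",   # use all available GPUs, offload the remainder
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# A data-free scheme (dynamic activation quantization), so no dataset argument.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained(MODEL_ID.split("/")[-1] + "-FP8-Dynamic")
tokenizer.save_pretrained(MODEL_ID.split("/")[-1] + "-FP8-Dynamic")
```

For data-dependent pathways (e.g. GPTQ or AWQ with a calibration dataset), the same call takes a dataset and, as noted above, layers are onloaded sequentially onto a single GPU.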

@cajeonrh
Contributor Author

cajeonrh commented Oct 6, 2025

I've incorporated feedback, added more questions, and also added a FAQ box on the Getting Started page. Please let me know if I missed anything.

fynnsu
fynnsu previously approved these changes Oct 6, 2025
Collaborator

@fynnsu fynnsu left a comment

Looks great! Thanks for making those changes!

@fynnsu
Collaborator

fynnsu commented Oct 6, 2025

Looks like you need to fix DCO though. There are some instructions here: https://github.com/vllm-project/llm-compressor/pull/1896/checks?check_run_id=52066401360.

@cajeonrh cajeonrh requested a review from dsikka October 6, 2025 20:44
Collaborator

@kylesayrs kylesayrs left a comment

I'm really not a fan of using casual pronouns like "us", "we", "my". This may sound pedantic, but speaking from personal experience contributing to other OSS repos, words like "we" have the effect of alienating open source contributors. LLM Compressor is owned by everyone; the Red Hat / LLM Compressor team helps to maintain and shepherd it.

Collaborator

@kylesayrs kylesayrs left a comment

We should add a section titled "Where can I learn more about LLM Compressor?" which links to talks we've given.

https://www.youtube.com/watch?v=caLYSZMVQ1c
https://www.youtube.com/watch?v=GrhuqQDmBk8
https://www.youtube.com/watch?v=WVenRmF4dPY
https://www.youtube.com/watch?v=G1WNlLxPLSE

@cajeonrh
Contributor Author

cajeonrh commented Oct 10, 2025

> I'm really not a fan of using casual pronouns like "us", "we", "my". This may sound pedantic, but speaking from personal experience contributing to other OSS repos, words like "we" have the effect of alienating open source contributors. LLM Compressor is owned by everyone; the Red Hat / LLM Compressor team helps to maintain and shepherd it.

@kylesayrs
I think this would need consensus from the team before changing it. I went through the LLM Compressor docs and there are about 35 pages that use the pronouns "we"/"us." Unless the FAQ page should be the only one that doesn't use any personal pronouns. Otherwise, if the change needs to be applied to all the pages for LLM Compressor, that should be a new ticket for the work.

@kylesayrs
Collaborator

@cajeonrh That’s fine, we can table the discussion for now

@dsikka dsikka requested a review from kylesayrs October 14, 2025 14:07
Collaborator

@brian-dellabetta brian-dellabetta left a comment

LGTM, one question on links but we can revisit in a follow-up

kylesayrs and others added 4 commits October 16, 2025 13:25
…servers (vllm-project#1903)

## Purpose ##
* FP4
* Fix bug discovered
[here](vllm-project#1830 (comment))
where dynamic="local" nvfp4 calculations would increment the observer
twice as fast as normal
  * Enable MSE observer to be used with FP4 (the selection rule is restated after this list)
    ```pseudocode
    mse_quant_error := mean((x - fake_quant(x))**2)
    global_scale <- min[min_vals, max_vals, global_scale](mse_quant_error(x))
    scale, zp <- min[min_vals, max_vals](mse_quant_error(x, global_scale))
    ```
* Simplification
* Make supporting attention calibration easier by separating out
weight/activation/attention reshaping
* Improve readability of observer code by removing many levels of function indirection
* Drop support for calibration with non-divisible group sizes. This is
not really a loss, since [forward
passes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/lifecycle/forward.py#L279)
also make this assumption
* New observers
* `memoryless_minmax` computes min and max values on the fly in a dynamic-quantization style. This observer is useful for PTQ weight quantization (an illustrative sketch follows the diagrams below)
* `static_minmax` computes absolute min and max values across all
observations. This observer is useful for PTQ activation quantization
* `memoryless_mse` computes best qparams w.r.t. MSE loss for each
observation. This observer is useful for PTQ weight quantization
* Memory improvements
* All observers no longer store copies of scales and zero points,
reducing the amount of required memory
* Newly introduced "memoryless" observers do not store any quantization
parameters, which greatly reduces the memory requirements for PTQ weight
quantization of very large models
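For readers skimming the pseudocode above, the selection rule can be restated in plainer notation (a paraphrase, not text from the PR):

$$
\mathrm{err}(x) = \operatorname{mean}\big[(x - \mathrm{fake\_quant}(x))^{2}\big]
$$

$$
s_{\text{global}} = \operatorname*{arg\,min}_{(x_{\min},\ x_{\max},\ s_{\text{global}})} \mathrm{err}(x), \qquad
(s, z) = \operatorname*{arg\,min}_{(x_{\min},\ x_{\max})} \mathrm{err}(x \mid s_{\text{global}})
$$

that is, the global scale is chosen first by minimizing quantization error over candidate ranges and global scales, and the per-group scale and zero point are then chosen with the global scale held fixed.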

| Diagrams |
| - |
| Before |
| <img width="886" height="595" alt="before" src="https://github.com/user-attachments/assets/660d94c2-3ac8-4e05-9e9b-53d21145abac" /> |
| After |
| <img width="1527" height="595" alt="after" src="https://github.com/user-attachments/assets/51a0107e-3fbd-413c-a7a6-03ddc3612169" /> |
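To make the "memoryless" idea concrete (see the forward reference in the observer list above), here is an illustrative sketch rather than the actual `Observer` implementation: quantization parameters are computed from the current tensor alone and nothing is cached between observations. The function name and the asymmetric min-max rule are assumptions for illustration.

```python
import torch

def memoryless_minmax_qparams(x: torch.Tensor, group_size: int, num_bits: int = 4):
    """Per-group asymmetric scale/zero-point computed on the fly; no state kept."""
    assert x.numel() % group_size == 0, "assumes divisible group sizes"
    q_max = 2**num_bits - 1
    groups = x.reshape(-1, group_size)          # one row per quantization group
    min_vals = groups.amin(dim=-1)
    max_vals = groups.amax(dim=-1)
    scales = (max_vals - min_vals).clamp(min=1e-8) / q_max
    zero_points = torch.round(-min_vals / scales)
    return scales, zero_points                   # nothing retained on an observer object

weight = torch.randn(4096, 4096)
scales, zps = memoryless_minmax_qparams(weight, group_size=128)
```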

## Changes ##
* Standardize reshaping using `flatten_for_calibration`
* This function reshapes all observed values to `(num_observations,
*qparams_shape, group_size)`
* This function removes the complexity associated with passing "reduce dims" and trying to handle weights, activations, and attention states all in the same function (a toy illustration of the resulting layout follows this list)
* In the future, this function could be applied to the quantization
forward pass, although there's probably no need to outside of
standardization
* Implement `get_global_scale` on `Observer` base
* This function decouples minmax calculations from regular qparam
calculations (avoiding the double increment bug)
* This function enables the MSE observer to be used with FP4 global
scales
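As a toy illustration of the `(num_observations, *qparams_shape, group_size)` layout described in the first bullet (and under the divisible-group-size assumption stated in the Purpose section), the grouped-weight case looks roughly like this; `flatten_weight_for_calibration` is a hypothetical name, not the project's `flatten_for_calibration`, which also handles activations and attention states.

```python
import torch

def flatten_weight_for_calibration(weight: torch.Tensor, group_size: int) -> torch.Tensor:
    """View an (out_features, in_features) weight as (num_observations, *qparams_shape, group_size)."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "calibration assumes divisible group sizes"
    num_groups = in_features // group_size
    # For grouped weight quantization, qparams_shape is (out_features, num_groups);
    # the weight is a single observation, hence the leading dimension of 1.
    return weight.reshape(1, out_features, num_groups, group_size)

w = torch.randn(256, 1024)
print(flatten_weight_for_calibration(w, group_size=128).shape)  # torch.Size([1, 256, 8, 128])
```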

## Testing ##
* Added additional minmax tests which check exact values of scales. This
test passes both on main and this branch, demonstrating that minmax
observer behavior remains unchanged
* Added additional MSE tests which check exact values of mse losses.
This test passes both on main and this branch, demonstrating that MSE
observer behavior remains unchanged
* Added FP4 MSE test

## Evaluation ##
```
nvfp4-static-minmax
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6167|±  |   N/A|
```

```
nvfp4-minmax
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6011|±  |   N/A|
```

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Dan Huang <[email protected]>
Co-authored-by: dhuangnm <[email protected]>
SUMMARY:

In models with mamba-2 layers, e.g.
[nvidia/NVIDIA-Nemotron-Nano-12B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2) and
[Qwen/Qwen3-Next-80B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct),
tracing `_update_mamba_mask` would lead to
```
  File "NemotronHModel_8045287568680_autowrapped", line 57, in forward
  File "/mnt/LinuxDrive/huggingface/modules/transformers_modules/NVIDIA_hyphen_Nemotron_hyphen_Nano_hyphen_12B_hyphen_v2/modeling_nemotron_h.py", line 1461, in _update_mamba_mask
    if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
       ^^^^^^^^^^^^^^^^^^^^^
  File "/home/toncao/anaconda3/envs/llm-compressor_v1/lib/python3.12/site-packages/transformers/utils/fx.py", line 674, in __bool__
    return super().__bool__()
           ^^^^^^^^^^^^^^^^^^
  File "/home/toncao/anaconda3/envs/llm-compressor_v1/lib/python3.12/site-packages/torch/fx/proxy.py", line 577, in __bool__
    return self.tracer.to_bool(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/toncao/anaconda3/envs/llm-compressor_v1/lib/python3.12/site-packages/torch/fx/proxy.py", line 388, in to_bool
    raise TraceError(
torch.fx.proxy.TraceError: symbolically traced variables cannot be used as inputs to control flow
```
from the function:
```python
def _update_mamba_mask(self, attention_mask, cache_position):
    """
    No need for zeroing states when
    1. Cached forward
    2. Attending to all inputs
    """
    mamba_mask = attention_mask
    if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
        mamba_mask = None
    return mamba_mask
```

Thus, adding `_update_mamba_mask` to the tracing ignore list makes AWQ sequential tracing work.

TEST PLAN:

local make test results:
```
===================================================== short test summary info =====================================================
FAILED tests/llmcompressor/modeling/test_calib_deepseek_v3.py::test_calib_deepseekv3_module - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 14.1...
FAILED tests/llmcompressor/utils/test_helpers.py::test_disable_cache[MllamaForConditionalGeneration-meta-llama/Llama-3.2-11B-Vision-Instruct] - huggingface_hub.errors.GatedRepoError: 403 Client Error. (Request ID: Root=1-68ee275c-378c35b1649b823602164fc0;24ebe331-9031-4...
FAILED tests/lmeval/test_lmeval.py::TestLMEval::test_lm_eval[None] - TypeError: argument should be a str or an os.PathLike object where __fspath__ returns a str, not 'NoneType'
====================================== 3 failed, 242 passed, 4 skipped in 129.47s (0:02:09) =======================================
```

Co-authored-by: toncao <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
SUMMARY:
Added e2e testing for block quantization.


TEST PLAN:
Tested locally with the following command:
```
python -m pytest tests/e2e/vLLM/test_vllm.py -vv -s
```

log:
```
================= vLLM GENERATION =================

PROMPT:
The capital of France is
GENERATED TEXT:
 Paris, which is located in the Île-de-France region. The

PROMPT:
The president of the US is
GENERATED TEXT:
 paying for the protests against him. The White House has reportedly cut

PROMPT:
My name is
GENERATED TEXT:
 [insert name], and I am a [insert job title]. I am excited

PASSED

===================================================================================================================== 1 passed in 130.10s (0:02:10) =====================================================================================================================
```

---------

Signed-off-by: shanjiaz <[email protected]>
@dsikka dsikka merged commit 5061adf into vllm-project:main Oct 17, 2025
9 checks passed

Labels

ready: When a PR is ready for review
