Enable loading pre-quantized INT4 weights in Llama4 #330
base: main
Conversation
Generate INT4 MP8 checkpoint:

```
torchrun --nproc-per-node=8 -m models.llama4.scripts.quantize --ckpt_dir ../checkpoints/Llama-4-Scout-17B-16E-Instruct --output_dir ../checkpoints/Llama-4-Scout-17B-16E-Instruct-INT4 --quantization_mode int4_mixed --world_size 8
```

Verify the generated INT4 MP8 checkpoint with int4_mixed on a single GPU (output):

```
PYTHONPATH=$(git rev-parse --show-toplevel) torchrun --nproc_per_node=1 -m models.llama4.scripts.chat_completion ../checkpoints/Llama-4-Scout-17B-16E-Instruct-INT4 --world_size 1 --quantization-mode int4_mixed
```

Generate FP8 MP8 checkpoint:

```
torchrun --nproc-per-node=8 -m models.llama4.scripts.quantize --ckpt_dir ../checkpoints/Llama-4-Scout-17B-16E-Instruct --output_dir ../checkpoints/Llama-4-Scout-17B-16E-Instruct-FP8 --quantization_mode fp8_mixed --world_size 8
```

Verify the generated FP8 MP8 checkpoint with fp8_mixed (output):

```
PYTHONPATH=$(git rev-parse --show-toplevel) torchrun --nproc_per_node=8 -m models.llama4.scripts.chat_completion ../checkpoints/Llama-4-Scout-17B-16E-Instruct-FP8 --world_size 8 --quantization-mode fp8_mixed
```

Verify the BF16 MP8 checkpoint (output):

```
PYTHONPATH=$(git rev-parse --show-toplevel) torchrun --nproc_per_node=8 -m models.llama4.scripts.chat_completion ../checkpoints/Llama-4-Scout-17B-16E-Instruct --world_size 8
```

Verify the BF16 MP8 checkpoint with fp8_mixed (output):

```
PYTHONPATH=$(git rev-parse --show-toplevel) torchrun --nproc_per_node=8 -m models.llama4.scripts.chat_completion ../checkpoints/Llama-4-Scout-17B-16E-Instruct --world_size 8 --quantization-mode fp8_mixed
```

Verify the BF16 MP8 checkpoint with int4_mixed on a single GPU (output):

```
PYTHONPATH=$(git rev-parse --show-toplevel) torchrun --nproc_per_node=1 -m models.llama4.scripts.chat_completion ../checkpoints/Llama-4-Scout-17B-16E-Instruct --world_size 1 --quantization-mode int4_mixed
```
```
dtype = torch.get_default_dtype()
if int4_weight:
```
this feels like complexity that truly doesn't belong at this layer. can we please keep it outside into quantization code somehow?
we don't want `llama-models` to become torchao or vllm or whatever really. it is not a full-fledged, all-powerful inference engine.
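One way to read that suggestion is a sketch like the one below, where the model module keeps a single default-dtype code path and all int4-specific handling lives in quantization code. The `Experts` class here is a toy stand-in, and `convert_experts_to_int4` plus the `w_int4`/`w_scale` buffer names are hypothetical, not the PR's actual implementation:

```python
import torch
import torch.nn as nn


class Experts(nn.Module):
    """Toy stand-in for the real Experts module in models.llama4."""

    def __init__(self, num_experts: int, dim: int):
        super().__init__()
        # The model definition stays dtype-agnostic: no `if int4_weight:` branch.
        self.w = nn.Parameter(torch.randn(num_experts, dim, dim))


def convert_experts_to_int4(model: nn.Module, int4_state: dict) -> nn.Module:
    """Hypothetical quantization-side helper: replace each Experts weight with
    pre-packed int4 data plus per-block scales loaded from the quantized ckpt."""
    for prefix, module in model.named_modules():
        if isinstance(module, Experts):
            del module.w  # drop the default-dtype parameter
            module.register_buffer("w_int4", int4_state[f"{prefix}.w_int4"])
            module.register_buffer("w_scale", int4_state[f"{prefix}.w_scale"])
    return model
```

Under a split like this, generation.py would only build the Transformer and, when int4_mixed is requested, hand the model to the quantization code; the bf16 path stays untouched.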
```
)
model_args.quantization_args = QuantizationArgs()
model_args.quantization_args.int4_weight = True
print("Loaded scale checkpoint")
torch.set_default_tensor_type(torch.BFloat16Tensor)
model = Transformer(model_args)
print("Loading state dict...")
model.load_state_dict(state_dict, strict=False)
```
if you move the `model.load_state_dict()` call to `convert_to_quantized_model()`, then you can do the following:

- change the structure of the Transformer from the outside in this code path (whatever you are doing with Experts)
- move all this scale ckpt path complexity into quantization land

nobody reading generation.py should know about quantization unless they want to dig into it.
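Concretely, the suggested split might look roughly like the sketch below; the signature of `convert_to_quantized_model` shown here, the `load_int4_scales` helper, and the scale-file path are placeholders rather than the repo's actual API:

```python
import torch
from torch import nn


def load_int4_scales(ckpt_dir: str) -> dict:
    # Placeholder: however the scale checkpoint is actually stored/sharded,
    # reading it belongs here, not in generation.py.
    return torch.load(f"{ckpt_dir}/int4_scales.pt", map_location="cpu")


def convert_to_quantized_model(
    model: nn.Module, ckpt_dir: str, state_dict: dict, quantization_mode: str
) -> nn.Module:
    """Owns everything quantization-specific: scale-ckpt paths, restructuring
    the Experts, and the final load_state_dict."""
    if quantization_mode == "int4_mixed":
        state_dict = {**state_dict, **load_int4_scales(ckpt_dir)}
        # ...restructure the Experts for the packed int4 layout here...
    model.load_state_dict(state_dict, strict=False)
    return model


# generation.py then stays quantization-agnostic:
#   model = Transformer(model_args)
#   model = convert_to_quantized_model(model, ckpt_dir, state_dict, quantization_mode)
```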