

[Bug] Llama-3.1-70B-Instruct-q3f16_1-MLC model running across two GPUs with tensor_parallel_shards=2 #3004

Open
shahizat opened this issue Oct 31, 2024 · 2 comments
Labels
bug Confirmed bugs

Comments

@shahizat

Greetings to all,

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. python3 -m mlc_llm serve HF://mlc-ai/Llama-3.1-70B-Instruct-q3f16_1-MLC --overrides "tensor_parallel_shards=2"

Output error: ValueError: The linear dimension 16384 has 409 groups under group size 40. The groups cannot be evenly distributed on 2 GPUs.
Possible solutions: reduce the number of GPUs, or use quantization with a smaller group size.

Is it possible to run a 3-bit version of the MLC-LLM model using multiple GPUs?

Thanks in advance!

@shahizat shahizat added the bug Confirmed bugs label Oct 31, 2024
@Hzfengsy (Member) commented Nov 1, 2024

q3 might not be suitable for tensor_parallel :(

@MasterJH5574 (Member) commented
Hi @shahizat, as the error message suggests, under 3-bit quantization the quantization groups along this dimension cannot be divided evenly in half, so this case is not supported.
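The divisibility constraint can be sketched in a few lines of Python. The group size of 40 for q3f16_1 and the dimension 16384 come from the error message above; the group size of 32 for q4f16_1 is an assumption for contrast, and the check mirrors (but is not) MLC-LLM's actual validation code:

```python
def can_shard(dim: int, group_size: int, shards: int) -> bool:
    """Sketch of the constraint: a linear dimension's quantization
    groups must split evenly across the tensor-parallel shards."""
    groups = dim // group_size
    return groups % shards == 0

# Dimension 16384 under q3f16_1 (group size 40):
# 16384 // 40 = 409 groups; 409 is odd, so 2-way sharding fails.
print(can_shard(16384, 40, 2))  # → False

# A group size of 32 (e.g. q4f16_1, assumed here) yields 512 groups,
# which split evenly across 2 GPUs.
print(can_shard(16384, 32, 2))  # → True
```

This is why the error suggests either reducing the GPU count or switching to a quantization whose group size divides the dimension into an even number of groups.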
