

[Bug] Llama-3.1-70B-Instruct-q3f16_1-MLC model running across two GPUs with tensor_parallel_shards=2 #3004

Open
shahizat opened this issue Oct 31, 2024 · 2 comments
Labels
bug Confirmed bugs

Comments

@shahizat

Greetings to all,

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. python3 -m mlc_llm serve HF://mlc-ai/Llama-3.1-70B-Instruct-q3f16_1-MLC --overrides "tensor_parallel_shards=2"

Output error: ValueError: The linear dimension 16384 has 409 groups under group size 40. The groups cannot be evenly distributed on 2 GPUs.
Possible solutions: reduce the number of GPUs, or use quantization with a smaller group size.

Is it possible to run a 3-bit version of the MLC-LLM model using multiple GPUs?

Thanks in advance!

@shahizat shahizat added the bug Confirmed bugs label Oct 31, 2024
@Hzfengsy (Member) commented Nov 1, 2024

q3 might not be suitable for tensor_parallel :(

@MasterJH5574 (Member) commented
Hi @shahizat, as the error message suggests, under 3-bit quantization the quantization groups along this dimension cannot be divided evenly in half, so this case is not supported.
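The divisibility constraint can be sketched in a few lines of Python. The group size of 40 for q3f16_1 and the dimension 16384 come from the error message above; the group size of 32 for q4f16_1 is an assumption for contrast, and the check mirrors (but is not) MLC-LLM's actual validation code:

```python
def can_shard(dim: int, group_size: int, shards: int) -> bool:
    """Sketch of the constraint: a linear dimension's quantization
    groups must split evenly across the tensor-parallel shards."""
    groups = dim // group_size
    return groups % shards == 0

# Dimension 16384 under q3f16_1 (group size 40):
# 16384 // 40 = 409 groups; 409 is odd, so 2-way sharding fails.
print(can_shard(16384, 40, 2))  # → False

# A group size of 32 (e.g. q4f16_1, assumed here) yields 512 groups,
# which split evenly across 2 GPUs.
print(can_shard(16384, 32, 2))  # → True
```

This is why the error suggests either reducing the GPU count or switching to a quantization whose group size divides the dimension into an even number of groups.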
