
Unable to specify the maximum model length when using the vLLM engine, causing loading to fail #2787

Open
tensorflowt opened this issue Jan 26, 2025 · 1 comment
@tensorflowt

Feature request

When using the vLLM engine in Xinference, commonly used inference parameters should be supported at launch time for ease of use, for example the maximum model length.

Motivation

When loading a model that Xinference already supports, the model's default maximum length may not fit within the available GPU memory, so loading fails. This can currently be worked around by registering a custom model, but that is cumbersome. I would like the commonly used vLLM inference parameters to be exposed directly so they are easy to set.

Your contribution

```
xinference launch --model-engine vllm \
  --model-name deepseek-r1-distill-qwen \
  --size-in-billions 32 \
  --model-format pytorch \
  --quantization none \
  --max-model-len 18080
```
@XprobeBot added the gpu label on Jan 26, 2025
@XprobeBot added this to the v1.x milestone on Jan 26, 2025
@qinxuye (Contributor) commented Jan 26, 2025

`--max_model_len 18080`

Specify the model length for vLLM this way.

But we'd like to unify this option across the various engines.
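
As a minimal sketch, the full launch command from the original report would then look like the following, assuming engine-specific vLLM options are passed through with underscore spelling as qinxuye indicates:

```
# Same command as in the report, but with the underscore-spelled
# vLLM option that gets forwarded to the engine.
xinference launch --model-engine vllm \
  --model-name deepseek-r1-distill-qwen \
  --size-in-billions 32 \
  --model-format pytorch \
  --quantization none \
  --max_model_len 18080
```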
