
Unable to specify the maximum model length when using the vLLM engine, causing loading to fail #2787

Open
tensorflowt opened this issue Jan 26, 2025 · 1 comment
@tensorflowt

Feature request

When using the vLLM engine in Xinference, commonly used inference parameters should be supported at launch time for ease of use, for example the maximum model length.

Motivation

When loading a model that Xinference already supports, the model's default maximum length may not fit within the available GPU memory, so loading fails. This can currently be worked around by registering a custom model, but that is cumbersome. I would like the commonly used vLLM inference parameters to be exposed directly so they are easy to set.

Your contribution

```
xinference launch --model-engine vllm \
  --model-name deepseek-r1-distill-qwen \
  --size-in-billions 32 \
  --model-format pytorch \
  --quantization none \
  --max-model-len 18080
```
@XprobeBot added the gpu label on Jan 26, 2025
@XprobeBot added this to the v1.x milestone on Jan 26, 2025
@qinxuye (Contributor) commented Jan 26, 2025

`--max_model_len 18080`

Specify the model length for vLLM this way.

But we'd like to unify this option across the various engines.
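
As a minimal sketch, the full launch command from the original report would then look like the following, assuming engine-specific vLLM options are passed through with underscore spelling as qinxuye indicates:

```
# Same command as in the report, but with the underscore-spelled
# vLLM option that gets forwarded to the engine.
xinference launch --model-engine vllm \
  --model-name deepseek-r1-distill-qwen \
  --size-in-billions 32 \
  --model-format pytorch \
  --quantization none \
  --max_model_len 18080
```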
