Add Max Context Len value to get around the context len 1 error #292

Open

tokk-nv wants to merge 2 commits into main
Conversation

@tokk-nv (Member) commented Mar 15, 2025

Originally, --max-model-len=1 was set, and it caused the "context length 1" error:

"This model's maximum context length is 1 tokens. However, you requested 28 tokens in the messages, Please reduce the length of the messages."

This workaround generates a command like the one below instead of passing --max-model-len=1.

Verified with JAO 64GB and JAO 32GB (gemma-3-4b-it), and with Orin NX 16GB (gemma-3-1b-it), using vlm.py.

docker run -it --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_TOKEN=${HUGGINGFACE_TOKEN} \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  dustynv/vllm:0.7.4-r36.4.0-cu128-24.04 \
  vllm serve google/gemma-3-4b-it \
  --host=0.0.0.0 --port=9000 --dtype=auto --max-num-seqs=1 --max-model-len=8192 --gpu-memory-utilization=0.75
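
For reference, a minimal sketch of how the flag substitution might look in vlm.py; the function name build_vllm_args, the max_context_len parameter, and the fallback behavior are illustrative assumptions, not the actual code from this PR.

def build_vllm_args(model, host="0.0.0.0", port=9000,
                    max_context_len=None, gpu_mem_util=0.75):
    # Assemble the `vllm serve` arguments; `max_context_len` replaces
    # the previously hard-coded --max-model-len=1. (Hypothetical sketch.)
    args = [
        "vllm", "serve", model,
        f"--host={host}",
        f"--port={port}",
        "--dtype=auto",
        "--max-num-seqs=1",
    ]
    if max_context_len:
        # Only emit the flag when a usable value is configured; when the
        # flag is omitted, vLLM falls back to the model's own context length.
        args.append(f"--max-model-len={max_context_len}")
    args.append(f"--gpu-memory-utilization={gpu_mem_util}")
    return args

# Example: build_vllm_args("google/gemma-3-4b-it", max_context_len=8192)
# yields the argument list used in the docker command above.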
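
As a quick check that the fix took effect (assuming the container above is running locally), a multi-token chat completion can be sent to vLLM's OpenAI-compatible endpoint on port 9000:

import requests

# If the old --max-model-len=1 were still in effect, this request would
# fail with the "maximum context length is 1 tokens" error quoted above.
resp = requests.post(
    "http://localhost:9000/v1/chat/completions",
    json={
        "model": "google/gemma-3-4b-it",
        "messages": [
            {"role": "user",
             "content": "Describe the Jetson AGX Orin in one sentence."}
        ],
        "max_tokens": 64,
    },
)
print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])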

@tokk-nv requested a review from dusty-nv on March 15, 2025 05:42