Add Max Context Len value to get around the context len 1 error #292

Open

tokk-nv wants to merge 2 commits into main
Conversation

@tokk-nv (Member) commented Mar 15, 2025

Originally, --max-model-len=1 was set, and it caused the "context length 1" error:

"This model's maximum context length is 1 tokens. However, you requested 28 tokens in the messages, Please reduce the length of the messages."

This workaround generates a command like the one below instead of passing --max-model-len=1.

Verified with JAO 64GB and JAO 32GB (gemma-3-4b-it), and with Orin NX 16GB (gemma-3-1b-it), using vlm.py.

docker run -it --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_TOKEN=${HUGGINGFACE_TOKEN} \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  dustynv/vllm:0.7.4-r36.4.0-cu128-24.04 \
  vllm serve google/gemma-3-4b-it \
  --host=0.0.0.0 --port=9000 --dtype=auto --max-num-seqs=1 --max-model-len=8192 --gpu-memory-utilization=0.75
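
For reference, a minimal sketch of how the flag substitution might look in vlm.py; the function name build_vllm_args, the max_context_len parameter, and the fallback behavior are illustrative assumptions, not the actual code from this PR.

def build_vllm_args(model, host="0.0.0.0", port=9000,
                    max_context_len=None, gpu_mem_util=0.75):
    # Assemble the `vllm serve` arguments; `max_context_len` replaces
    # the previously hard-coded --max-model-len=1. (Hypothetical sketch.)
    args = [
        "vllm", "serve", model,
        f"--host={host}",
        f"--port={port}",
        "--dtype=auto",
        "--max-num-seqs=1",
    ]
    if max_context_len:
        # Only emit the flag when a usable value is configured; when the
        # flag is omitted, vLLM falls back to the model's own context length.
        args.append(f"--max-model-len={max_context_len}")
    args.append(f"--gpu-memory-utilization={gpu_mem_util}")
    return args

# Example: build_vllm_args("google/gemma-3-4b-it", max_context_len=8192)
# yields the argument list used in the docker command above.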
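
As a quick check that the fix took effect (assuming the container above is running locally), a multi-token chat completion can be sent to vLLM's OpenAI-compatible endpoint on port 9000:

import requests

# If the old --max-model-len=1 were still in effect, this request would
# fail with the "maximum context length is 1 tokens" error quoted above.
resp = requests.post(
    "http://localhost:9000/v1/chat/completions",
    json={
        "model": "google/gemma-3-4b-it",
        "messages": [
            {"role": "user",
             "content": "Describe the Jetson AGX Orin in one sentence."}
        ],
        "max_tokens": 64,
    },
)
print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])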

@tokk-nv requested a review from dusty-nv on March 15, 2025 05:42