Name and Version
./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 750 Ti, compute capability 5.0, VMM: yes
version: 5797 (de56944)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server --embedding --pooling last (also tried mean and the other pooling modes; full script below)
Problem description & steps to reproduce
Hello,
For an input that tokenizes to ten tokens, I get 10 vectors back even though --pooling is enabled. Am I missing something obvious?
Below are the server script, the curl POST, and checks on the resulting embeddings file. The output is 10x1024 vectors, not a single 1x1024 vector.
Server script
#!/bin/bash
LLAMA_MODEL="Qwen3-Embedding-0.6B-Q8_0.gguf"
LLAMA_MODEL_PATH="/home/DATA/GGUF/embed"
LLAMA_OPTS="-c 1024 --temp 0.3 --top-k 40 --top-p 0.9 --n-predict 60 --no-warmup --port 8081 --embedding"
LLAMA_PERF_OPTS="-ngl 99 --mlock --pooling last"
llama-server ${LLAMA_PERF_OPTS} ${LLAMA_OPTS} -m ${LLAMA_MODEL_PATH}/${LLAMA_MODEL} ${@}
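For reference, the sampling flags (--temp, --top-k, --top-p, --n-predict) should be irrelevant in embedding mode, so I would expect a minimal invocation like this (paths taken from the script above) to reproduce the same behaviour:

llama-server -m /home/DATA/GGUF/embed/Qwen3-Embedding-0.6B-Q8_0.gguf \
  -c 1024 -ngl 99 --mlock --no-warmup --embedding --pooling last --port 8081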
curl -s -X POST http://localhost:8081/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B-Q8_0.gguf",
    "input": "The quick brown fox jumps over the lazy dog."
  }' > q-test-embedding.txt
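For comparison, the same input through the OpenAI-compatible endpoint (assuming the standard llama-server /v1/embeddings route and an OAI-style response shape), printing the length of the first returned vector:

curl -s -X POST http://localhost:8081/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "The quick brown fox jumps over the lazy dog."}' \
  | jq '.data[0].embedding | length'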
ls -lh q-test-embedding.txt
212K Jul 5 03:51 q-test-embedding.txt
jq '.[].embedding | length' q-test-embedding.txt
10
grep -o ',' q-test-embedding.txt | wc -l
10240
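To confirm the nesting (one result object holding ten 1024-dim token vectors, rather than ten separate results), a quick jq shape check, assuming the response is a JSON array of result objects as the filters above suggest:

jq 'length, (.[0].embedding | length), (.[0].embedding[0] | length)' q-test-embedding.txt
# expected output: 1, 10, 1024 if pooling was not applied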
First Bad Commit
No response