
Misc. bug: Embedding/pooling: I receive 10xvector not 1xvector #14543

Open
@brunette69-ruby

Description


Name and Version

./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 750 Ti, compute capability 5.0, VMM: yes
version: 5797 (de56944)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server --embedding --pooling last/mean/any etc....

Problem description & steps to reproduce

Hello,
For a text that turns out to be ten tokens, I receive 10 vectors even though --pooling is enabled. Am I missing something obvious?
Below are the server script, the curl POST, and the resulting embeddings file. The output is 10x1024 vectors, not a single 1x1024 vector.

Server script
#!/bin/bash
LLAMA_MODEL="Qwen3-Embedding-0.6B-Q8_0.gguf"
LLAMA_MODEL_PATH="/home/DATA/GGUF/embed"
LLAMA_OPTS="-c 1024 --temp 0.3 --top-k 40 --top-p 0.9 --n-predict 60 --no-warmup --port 8081 --embedding"
LLAMA_PERF_OPTS="-ngl 99 --mlock --pooling last"

llama-server ${LLAMA_PERF_OPTS} ${LLAMA_OPTS} -m ${LLAMA_MODEL_PATH}/${LLAMA_MODEL} ${@}
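
To help isolate the pooling flag, a reduced launch may be worth trying. This is only a sketch: the sampling options above (--temp, --top-k, --top-p, --n-predict) are generation settings and presumably have no effect on embeddings, so they are dropped here; all remaining flags are taken from the script above.

```shell
# Minimal embedding-only launch (sketch, same model path and flags as above)
llama-server -m /home/DATA/GGUF/embed/Qwen3-Embedding-0.6B-Q8_0.gguf \
  --embedding --pooling last -ngl 99 -c 1024 --no-warmup --port 8081
```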

curl -s -X POST http://localhost:8081/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B-Q8_0.gguf",
    "input": "The quick brown fox jumps over the lazy dog."
  }' > q-test-embedding.txt

ls -l
212K Jul 5 03:51 q-test-embedding.txt
jq '.[].embedding | length' ~/tmp/q-test-embedding.txt
10
grep -o ',' q-test-embedding.txt | wc -l
10240
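
As a client-side cross-check (not a fix for the server behavior), the per-token vectors can be mean-pooled with jq. The response shape assumed here, an array of objects whose embedding field is a list of per-token vectors, is inferred from the jq output above; a tiny 3x2 sample stands in for the real 10x1024 payload.

```shell
# Mean-pool per-token embeddings client side with jq (workaround sketch).
# Assumed response shape: [{"index":0,"embedding":[[...],[...],...]}]
sample='[{"index":0,"embedding":[[1,2],[3,4],[5,6]]}]'
echo "$sample" | jq -c '.[0].embedding | transpose | map(add / length)'
# -> [3,4]  (column-wise mean of the three token vectors)
```

Applied to the real file, jq '.[0].embedding | transpose | map(add / length)' q-test-embedding.txt should yield a single 1024-dimensional vector, assuming the response shape above.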

First Bad Commit

No response

Relevant log output
