
Eval bug: llama-server is too slow when running inference with an int8 model in the reranker #11114

Open

Tian14267 opened this issue Jan 7, 2025 · 2 comments

Tian14267 commented Jan 7, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA A800 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A800 80GB PCIe, compute capability 8.0, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

A800 * 2 & Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, 32 cores

Models

bge-reranker-v2-m3; quantized to q8_0

Problem description & steps to reproduce

When I run inference with the int8 model, it is far too slow.
The original PyTorch bge-reranker-v2-m3 model takes about 4 minutes on GPU to score 10000 sentence pairs.
With llama-server, on the same GPU and the same data, it takes about 40 minutes, roughly a 10x difference. I want to know why.

My llama-server launch command:

CUDA_VISIBLE_DEVICES="0" ./llama-server \
              -m ./bge-reranker-v2-m3_q8_0.gguf \
              --reranking \
              -cd 2048 -c 4096 -b 4096 -ub 4096

And my inference code is:

import json

import requests
from tqdm import tqdm


def get_url_result(data_input):
    # Send one (query, document) pair to the llama-server rerank endpoint.
    url = "http://127.0.0.1:8080/v1/rerank"
    headers = {"Content-Type": "application/json"}

    data = {"model": "some-model",
            "top_n": 1,
            "query": data_input[0],
            "documents": [data_input[1]]
            }
    r = requests.post(url=url,
                      data=json.dumps(data), headers=headers)
    task_result = r.json()
    return task_result


if __name__ == "__main__":
    ###  load data ...
    for one_data in tqdm(all_test_data):
        one_result = get_url_result(one_data)
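
For reference, the /v1/rerank endpoint accepts a list of documents per query, so pairs that share the same query could be scored in one request instead of one HTTP round-trip per pair. A minimal sketch, assuming the pairs have already been grouped by query (rerank_batch is a hypothetical helper, not part of the original code):

import json
import requests

def rerank_batch(query, documents, url="http://127.0.0.1:8080/v1/rerank"):
    # Score many documents against one query in a single request.
    data = {
        "model": "some-model",
        "top_n": len(documents),
        "query": query,
        "documents": documents,
    }
    r = requests.post(url=url, data=json.dumps(data),
                      headers={"Content-Type": "application/json"})
    return r.json()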

First Bad Commit

No response

Relevant log output

request: POST /v1/rerank 127.0.0.1 200
slot launch_slot_: id  0 | task 2650 | processing task
slot update_slots: id  0 | task 2650 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 3190
slot update_slots: id  0 | task 2650 | kv cache rm [0, end)
slot update_slots: id  0 | task 2650 | prompt processing progress, n_past = 3190, n_tokens = 3190, progress = 1.000000
slot update_slots: id  0 | task 2650 | prompt done, n_past = 3190, n_tokens = 3190
slot      release: id  0 | task 2650 | stop processing: n_past = 3190, truncated = 0
srv  update_slots: all slots are idle
ggerganov (Owner) commented
Add -ngl 99 in order to enable the GPU. Also try adding -fa - it might improve the performance further in most cases.
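
Applying those flags to the launch command from the issue would look roughly like this (a sketch; -ngl 99 offloads all layers to the GPU and -fa enables flash attention):

CUDA_VISIBLE_DEVICES="0" ./llama-server \
              -m ./bge-reranker-v2-m3_q8_0.gguf \
              --reranking \
              -ngl 99 -fa \
              -cd 2048 -c 4096 -b 4096 -ub 4096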

Tian14267 (Author) commented

> Add -ngl 99 in order to enable the GPU. Also try adding -fa - it might improve the performance further in most cases.

I have tried that, but it didn't work... The inference time is still almost 30-40 minutes.
