
Eval bug: llama-server is too slow when running inference with an int8 model in the reranker #11114

Open

Tian14267 opened this issue Jan 7, 2025 · 2 comments

Tian14267 commented Jan 7, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA A800 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A800 80GB PCIe, compute capability 8.0, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

A800 * 2 & Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, 32 cores

Models

bge-reranker-v2-m3; quantized to q8_0

Problem description & steps to reproduce

When I run inference with the int8 model, it is far too slow.
The original PyTorch bge-reranker-v2-m3 model takes about 4 minutes on GPU to score 10000 sentence pairs.
With llama-server, on the same GPU and the same data, it takes about 40 minutes, roughly a 10x difference. I want to know why.

My llama-server launch command:

CUDA_VISIBLE_DEVICES="0" ./llama-server \
              -m ./bge-reranker-v2-m3_q8_0.gguf \
              --reranking \
              -cd 2048 -c 4096 -b 4096 -ub 4096

And my inference code is:

import json

import requests
from tqdm import tqdm


def get_url_result(data_input):
    # Send one (query, document) pair to the llama-server rerank endpoint.
    url = "http://127.0.0.1:8080/v1/rerank"
    headers = {"Content-Type": "application/json"}

    data = {"model": "some-model",
            "top_n": 1,
            "query": data_input[0],
            "documents": [data_input[1]]
            }
    r = requests.post(url=url,
                      data=json.dumps(data), headers=headers)
    task_result = r.json()
    return task_result


if __name__ == "__main__":
    ###  load data ...
    for one_data in tqdm(all_test_data):
        one_result = get_url_result(one_data)
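
For reference, the /v1/rerank endpoint accepts a list of documents per query, so pairs that share the same query could be scored in one request instead of one HTTP round-trip per pair. A minimal sketch, assuming the pairs have already been grouped by query (rerank_batch is a hypothetical helper, not part of the original code):

import json
import requests

def rerank_batch(query, documents, url="http://127.0.0.1:8080/v1/rerank"):
    # Score many documents against one query in a single request.
    data = {
        "model": "some-model",
        "top_n": len(documents),
        "query": query,
        "documents": documents,
    }
    r = requests.post(url=url, data=json.dumps(data),
                      headers={"Content-Type": "application/json"})
    return r.json()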

First Bad Commit

No response

Relevant log output

request: POST /v1/rerank 127.0.0.1 200
slot launch_slot_: id  0 | task 2650 | processing task
slot update_slots: id  0 | task 2650 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 3190
slot update_slots: id  0 | task 2650 | kv cache rm [0, end)
slot update_slots: id  0 | task 2650 | prompt processing progress, n_past = 3190, n_tokens = 3190, progress = 1.000000
slot update_slots: id  0 | task 2650 | prompt done, n_past = 3190, n_tokens = 3190
slot      release: id  0 | task 2650 | stop processing: n_past = 3190, truncated = 0
srv  update_slots: all slots are idle
ggerganov (Owner) commented
Add -ngl 99 in order to enable the GPU. Also try adding -fa - it might improve the performance further in most cases.
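
Applying those flags to the launch command from the issue would look roughly like this (a sketch; -ngl 99 offloads all layers to the GPU and -fa enables flash attention):

CUDA_VISIBLE_DEVICES="0" ./llama-server \
              -m ./bge-reranker-v2-m3_q8_0.gguf \
              --reranking \
              -ngl 99 -fa \
              -cd 2048 -c 4096 -b 4096 -ub 4096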

Tian14267 (Author) commented

> Add -ngl 99 in order to enable the GPU. Also try adding -fa - it might improve the performance further in most cases.

I have tried that, but it didn't work... The inference time is still almost 30-40 minutes.
