Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA A800 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A800 80GB PCIe, compute capability 8.0, VMM: yes
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
A800 * 2 & Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, 32 cores
Models
bge-reranker-v2-m3, quantized to q8_0
Problem description & steps to reproduce
When I use the int8 model for inference, it is too slow.
With the original PyTorch bge-reranker-v2-m3 model, scoring 10000 sentence pairs on GPU takes about 4 minutes. With llama-server, the same data on the same GPU takes about 40 minutes, roughly a 10x difference. I want to know why.
My llama-server command and my inference code follow the pattern sketched below.
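A minimal sketch of this kind of client loop, assuming llama-server was started with `--reranking` and exposes the `/v1/rerank` endpoint; the launch flags, endpoint, payload shape, and placeholder data here are assumptions for illustration, not the exact code used:

```python
# Sketch only: flags, endpoint, and payload shape are assumptions,
# not the original code from this report.
# Assumed server launch:
#   ./llama-server -m bge-reranker-v2-m3-q8_0.gguf --reranking -ngl 99 --port 8080
import requests

URL = "http://localhost:8080/v1/rerank"  # assumed endpoint

# 10000 (query, document) pairs; placeholder data for illustration
pairs = [("query text", "document text")] * 10000

scores = []
for query, document in pairs:
    # One pair per request: every pair pays a full HTTP round trip
    resp = requests.post(URL, json={"query": query, "documents": [document]})
    resp.raise_for_status()
    scores.append(resp.json()["results"][0]["relevance_score"])
```

If the real loop sends one pair per request like this, each pair pays HTTP overhead and the server scores documents one at a time; grouping many documents per query into a single request would let the server batch them.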
First Bad Commit
No response
Relevant log output