llama : add thread safety test #14035
base: master
Conversation
llama : ignore main_gpu <= 0 if there are no GPUs ggml-ci
Maybe we can use an even smaller model for this test:
The SYCL ggml-ci does not seem to have libcurl installed yet.
Should be installed now.
Force-pushed from 2c5874e to a2a0289.
Force-pushed from a2a0289 to b046f0c.
There is some issue with this model (stories15M-q4_0.gguf) on CPU, but I don't think it is a threading issue. It only seems to happen on CPUs with AVX512.
I looked into it a bit and it does not seem to happen if OpenMP is disabled. I think it is something related to the repacking, but I didn't confirm. I'll take an extra look now.
Pretty sure this is a data race, because the chunk counter will be shared by all contexts: llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp, lines 395 to 398 at 487a5e0.
If I disable […] @Djip007 could you take a look and propose a fix?
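For illustration, here is a minimal standalone sketch of the failure mode being described. This is hypothetical code, not the actual sgemm.cpp source: it only assumes the pattern of a single process-wide chunk counter handed out to worker threads, which is correct while exactly one matmul is in flight but breaks once two contexts run matmuls concurrently.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Shared by every context -- this is the problematic pattern.
static std::atomic<int> current_chunk{0};

// Each "matmul" expects to process chunks [0, nchunk) exactly once.
static int run_matmul(int nchunk) {
    current_chunk.store(0);              // both contexts reset the same counter
    int processed = 0;
    for (;;) {
        const int c = current_chunk.fetch_add(1);
        if (c >= nchunk) {
            break;
        }
        processed++;                     // pretend to compute chunk c
    }
    return processed;
}

int main() {
    int a = 0;
    int b = 0;
    // Two contexts running a matmul at the same time, as in the multi-context test.
    std::thread t1([&] { a = run_matmul(1000); });
    std::thread t2([&] { b = run_matmul(1000); });
    t1.join();
    t2.join();
    // With a per-operation counter both would report 1000; with the shared
    // counter, chunks get stolen or recomputed and the totals drift.
    printf("matmul 1 processed %d/1000 chunks, matmul 2 processed %d/1000\n", a, b);
    return 0;
}
```

One plausible direction for a fix would be to make the counter part of the per-operation compute state instead of a shared static, but that is only a suggestion, not what the PR does.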
ggml-ci
429 is "too many requests". @ngxson do you know if it is a temporary issue with huggingface, or are we being throttled?
The HF backend currently has a problem; the team is investigating, and it should be back very soon.
@0cc4m @jeffbolznv The Vulkan backend is crashing on this test. It happens even with a single context per model (…).
It is known that the Vulkan backend is not thread-safe yet, yes. |
A basic thread-safety test that loads a copy of the model on each GPU and on the CPU, and runs inference with multiple contexts in different threads.
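For reference, a hedged sketch of that structure: one model shared read-only by several threads, each thread owning its own llama_context. The llama.h calls used below (llama_load_model_from_file, llama_new_context_with_model, llama_token_bos, llama_batch_get_one) are assumptions based on the public API around the time of this PR; they are not the actual test-thread-safety code, and names or signatures may differ in other versions.

```cpp
#include <thread>
#include <vector>

#include "llama.h"

int main(int argc, char ** argv) {
    // Hypothetical default; the real test downloads its model via libcurl.
    const char * model_path = argc > 1 ? argv[1] : "stories15M-q4_0.gguf";

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(model_path, mparams);
    if (model == nullptr) {
        return 1;
    }

    const int n_threads = 4; // number of concurrent contexts (illustrative value)

    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([model] {
            // Every thread gets its own context; the model itself is shared read-only.
            llama_context_params cparams = llama_context_default_params();
            llama_context * ctx = llama_new_context_with_model(model, cparams);

            llama_token tok = llama_token_bos(model); // assumed accessor
            for (int step = 0; step < 32; ++step) {
                // Assumed 2-argument form; feeding the same token repeatedly is
                // enough to exercise the backend from multiple threads.
                llama_batch batch = llama_batch_get_one(&tok, 1);
                if (llama_decode(ctx, batch) != 0) {
                    break;
                }
            }
            llama_free(ctx);
        });
    }
    for (auto & w : workers) {
        w.join();
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The key design point the test relies on is that contexts are independent per thread, so any crash or corruption indicates shared mutable state inside the backend rather than in the test itself.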