Added cuda and opencl support #746
Conversation
Force-pushed from 8592c13 to dde586f
This is a very important improvement, but it will have to be carefully tested. We need to test that
I'll be happy to test it on Windows 10, maybe even Linux. Nvidia 3060. Just ping me when you think it's in a good-enough state.
I'm happy to test as well.
I'd be happy to test on my Windows 10 machine. CUDA is installed already. Edit: GPU is a 1660 Ti
Wonderful! Thanks everyone 🙂
Just a warning: old models as downloaded automatically will not work properly with OpenCL. Currently it makes the GUI freeze, but that's a change that needs to be made on the GUI side. Old llama.cpp simply doesn't support them.
Here to help with testing on Windows 11, RTX 3090.
Here to help with testing on Windows 11, RTX 3060 Ti. Thanks everyone!
I've added testing instructions to the top post. :-)
Hello! Thanks for the hard work. I'm on Linux with an Iris Xe integrated GPU (OpenCL compatible). Is there any chance of it working? I've forced buildVariant = "opencl" in the code as specified above. Backend and chat built without any errors. But when I launch "chat", it just sits there forever without doing anything (neither consuming CPU nor RAM), with only the message "deserializing chats took: 0 ms". I use the 13b snoozy model; it works perfectly on the main Nomic branch.
OK, I finally got it working! First of all, I had the OpenCL libs and headers but not CLBlast (I overlooked the CMake warning). I built it from here, as the version included in my repos (Ubuntu 20.04) did not work: https://github.com/CNugteren/CLBlast. I also downloaded a new model (https://huggingface.co/TheBloke/samantha-13B-GGML/tree/main), as the snoozy one did not work (as you specified in the first message, sorry for reading too fast). I now have a working OpenCL setup! Hope it can help others. But unfortunately it doesn't speed anything up :D (my integrated GPU is probably not very suited for that). Any idea how I could speed that up?
My GPU capabilities (using the OpenCL API) are below:
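For reference, a minimal standalone sketch of how capabilities like these can be queried via the OpenCL C API (an illustration, not part of this PR; assumes an OpenCL development environment, link with -lOpenCL):

#include <CL/cl.h>
#include <cstdio>

// Query the first GPU device of the first platform and print a few
// commonly checked capabilities.
int main() {
    cl_platform_id platform;
    cl_device_id device;
    if (clGetPlatformIDs(1, &platform, nullptr) != CL_SUCCESS) return 1;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) != CL_SUCCESS) return 1;

    char name[256] = {0};
    cl_uint computeUnits = 0;
    cl_ulong globalMem = 0;
    clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(computeUnits), &computeUnits, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMem), &globalMem, nullptr);

    std::printf("device:        %s\n", name);
    std::printf("compute units: %u\n", (unsigned)computeUnits);
    std::printf("global memory: %llu MiB\n", (unsigned long long)(globalMem >> 20));
    return 0;
}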
Please reread the first message, the old models aren't supported :-)
Nope. Integrated graphics are pretty much unsuitable for this. But this should be enough to show that it's working! Thank you for giving it a try :-)
Thank you for your answer, that's what I thought :(. Just out of curiosity, what would be the limiting factor for such an iGPU: the RAM, because it's shared with the system (the GPU indeed has only 128 MB from what I understand), or just the number of compute cores? Or something else? I did more tests and noticed something strange: I'm using intel_gpu_top to see the GPU usage. It's clearly used when I'm watching a 4K 60 fps video on YouTube (=> hw acceleration), but it seems not to be used at all with GPT4All (GPU version). Am I missing something?
Sorry, it's taking a bit longer. I hadn't actually compiled anything with MSVC in a while -- and it looks like that's the way to go -- so I'm now dealing with some fun build errors (who doesn't love those!). Although I've already knocked a few down by installing/upgrading certain things. Question: are there known minimum requirements for the things involved in the whole toolchain? I'm now up to date with some things and using VS Community 2022 and Windows 10 SDK v10.0.20348.0, so newer than that isn't possible anyway (for Win10). Still relying on an older CUDA SDK (v11.6), however. Might just have to go update that, too, if nothing else helps. I should probably go and have a closer look at the llama.cpp project.
Yeah, CUDA setup should be documented in the
I'm a bit reluctant to turn this into a troubleshooting session here -- in a pull request comment of all places -- but what I've seen so far might help others who want to try CUDA. Well, it's quite weird with MSVC to say the least. So far I've run into the following problems. This was still before the forced pull/merge yesterday, which has helped quite a bit now:
(Of course, I cannot exclude the possibility that all of this is yet another case of PEBKAC.)
=> So now I have managed to get a compiled backend, at last.
P.S. I could also try compiling everything with a MinGW setup (I prefer MSYS2 MinGW here). Is that something that's supposed to be supported in the future? I've invested quite some time helping troubleshoot problems there (mainly in 758, 717 and 710), and I guess it's not a good user experience -- but that also has to do with the Python bindings package.
A compile issue on MSVC has been found and will be solved soon, @cosmic-snow! Will notify you when there's more.
Oh really? That's good to know. But not urgent, because here's where I am now:
However, it did not seem to use my GPU, despite me setting it
Edit: It's clearly doing work in the cuda-enabled library (the cut-off name is
Edit2: Added simple debugging
Edit3: Added
Edit4: One thing that feels odd is that the macro definition
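(A minimal sketch of the kind of debug aid "simple debugging" could mean here; hypothetical, not the actual edit, with names following the llmodel.cpp snippet quoted later in this thread:)

// Hypothetical debug aid in llmodel.cpp (requires <iostream>): log the
// chosen variant right before the implementation lookup, to verify which
// library actually gets loaded.
std::cerr << "[debug] buildVariant = " << buildVariant << std::endl;
impl = implementation(f, buildVariant);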
@cosmic-snow thanks for the testing efforts!! Please note that MPT/GPT-J isn't supported in the new GGML formats yet. I have added the missing compile defines to the CMake file for llama; please try again now. :-)
I'm getting the error:
Note: I just copied your most recent changes over, not going through Git. Not sure if that changed any line numbers, but the error should be clear: CUDA isn't present yet in that version. I think I've seen some conditionals like that in
Edit: I was mistaken, a previous build produced a
Edit2: Trying again with a clean version of the patchset helped already, but now I'm getting the
Edit3: Maybe it's an ordering problem now in how the
Edit4: Something is still decidedly wrong here. I'm now getting a linker error (in short, it doesn't find the
I'll keep trying for a bit, but I guess I ultimately need to figure out what's wrong with the build process as a whole here.
I apologize, there was a little mistake in the
You're welcome. And yes, although I'm not going to pull those fixes again right now, that looks like it solves that particular problem. In the meantime I've managed to get it to work somehow, although I don't understand it yet. And I can confirm it was running on CUDA (still v11.6 instead of the latest v12.1), at least until it crashed:
Next, I guess I'll try to figure out:
Edit: gpt4all/gpt4all-backend/llama.cpp.cmake, lines 361 to 363 in e859086
It's target_compile_options(... instead of definitions(... . I was looking at ...\llama.cpp*\CMakeLists.txt this whole time, so it's no wonder I couldn't figure that one out.
Edit2:
Edit3: gpt4all/gpt4all-backend/CMakeLists.txt, lines 19 to 21 in e859086
Edit4: gpt4all/gpt4all-backend/CMakeLists.txt, lines 31 to 38 in e859086
set(LLAMA_LTO OFF) at first, but that was not enough.
Oh also, note: When I said:
This was still with what I learned earlier. That is, I explicitly did a
I'm not sure what this branch is based off now, but
Signed-off-by: niansa/tuxifan <[email protected]>
Thanks for the tip!
Hi, I tested this with OpenCL on my AMD Radeon RX 5600 XT, but it doesn't seem to utilize the GPU at all, even though I did receive a response. Is there another flag that I need to set for GPU support? I followed the instructions at https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python and used orca-mini-3b.ggmlv3.q4_0.bin as suggested in the documentation.
TL;DR: results inconclusive; general response times vary between 3 and 10 seconds. Testing results: I tested the buildVariant = "cuda" version and built the backend with the provided instructions. I reverted to the default n_threads. The inference feels (somewhat) faster(-ish) than before. With moderate CPU utilization, it feels like nvtop and nvidia-smi report slightly higher utilization and power draw, but those sightings appear to correlate only some of the time. I think any potential improvements disappear within general noise. No guarantees of scientific correctness; I may or may not have overlooked something totally obvious.
Huh. I thought MPT wasn't supported? Maybe that has changed in the meantime? I did my tests with a LLaMA-based model. Those with v3 in their name should work, I think. Maybe you can try one of those, too?
Edit: Although Orca-mini is kind of a LLaMA model? Odd. Maybe try a 7b or 13b one? I think I've used
I have now also tested "wizardLM-13B-Uncensored.ggmlv3.q4_0", which is larger and has v3 in its name. In addition, setting n_threads to 16 had little to no effect, where it would normally improve inference times drastically.
I probably failed to mention that I am testing directly on the CLI.
It shouldn't.
As I said:
// llmodel.cpp
if (!impl) {
    //TODO: Auto-detect CUDA/OpenCL
    if (buildVariant == "auto") {
        if (requires_avxonly()) {
            buildVariant = "avxonly";
        } else {
            buildVariant = "default";
        }
    }
    buildVariant = "cuda"; // <- here, before impl = ...
    impl = implementation(f, buildVariant);
    if (!impl) return nullptr;
}
f.close();
How did you install the CUDA toolkit anyway? I haven't tried that on Linux yet.
Can confirm, I didn't RTFM and followed the old post, and also put it in the wrong spot. wizardLM-13B-Uncensored.ggmlv3.q4_0 works; it's getting loaded onto the GPU. As far as CUDA toolkit installation goes, I think I just followed the official instructions found here:
Can't remember, though.
Signed-off-by: niansa/tuxifan <[email protected]>
Are there any updates on getting this merged in, @nomic-ai?
Sharing my attempts to run this. Apologies for any dumb mistakes.
System details
Debian 12 (Bookworm)
How I'm building
Errors/Warnings
CLBlast
CMAKE_CUDA_ARCHITECTURES
I added this to
Missing cuda_runtime.h
I added this to
Is that necessary? Feels wrong ...
cannot find -lcudart
I can get around this by adding to
But that feels super wrong. Did anyone else get stuck on these warnings/errors?
Thanks for sharing.
Yes, ignore.
I haven't really looked into it myself and just ignored this one, too. I guess the proper way of dealing with this would be to:
But you can of course just override it yourself (or override
No idea why this is necessary, but I haven't tried building it on Linux yet.
Also no idea why this would be necessary. I think if it's installed in the default folder, CMake should just detect it? At least that's how it works on Windows.
@cosmic-snow Is this CUDA support still coming? Is there still a problem with it? We now have GPU support, which is a bit faster, but I think CUDA would be much faster still.
Can't say, I've only ever done testing here. I'm in favour of having it as an option, but I don't know where this is going.
@cosmic-snow I'd say the owner of this repository should just merge it then, because there are no errors that I've read of in this thread, and it works as it should. Or did I miss something? Are there still problems that keep this from being merged?
I would change CUDA to HIP now, though, since it works on both Nvidia and AMD.
Yeah, I understand, but this would make your work (since you initially started this PR) count for nothing. As for shipping CUDA, I am not exactly sure, because all AI projects use CUDA/cuDNN/TensorRT in one way or another. However, one downside would be that installation sizes get bigger, because the CUDA libraries are large. Shall we close this PR then?
@niansa Can we add this or remove the support altogether? If yes, then I'd say we can close this PR?
This PR aims to add support for CUDA and OpenCL. Once ready, I'll need someone to test CUDA support since I don't own an Nvidia card myself.
Testing instructions
Just a warning: old models as downloaded automatically will not work properly with OpenCL. Currently it makes the GUI freeze, but that's a change that needs to be made on the GUI side. Old llama.cpp simply doesn't support them.
Download a GGML model from here: https://huggingface.co/TheBloke and place it in your models folder. Make sure it starts with ggml-! The GUI might attempt to load another model on the way there and crash, since updating that won't be part of this PR. To prevent this, move the other models somewhere else.
To make the GUI actually use the GPU, you'll need to add either buildVariant = "cuda"; or buildVariant = "opencl"; after this line:
https://github.com/tuxifan/gpt4all/blob/dlopen_gpu/gpt4all-backend/llmodel.cpp#L69
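For illustration, the edited section might then look roughly like this (a sketch; the full auto-detection block is quoted earlier in this thread, and the surrounding lines may differ in your checkout):

// llmodel.cpp, at the end of the auto-detection block:
buildVariant = "cuda"; // or "opencl"; forces the GPU build variant for testing
impl = implementation(f, buildVariant);
if (!impl) return nullptr;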
We also need some people testing on Windows with AMD graphics cards! And some people on Linux testing on Nvidia.