Added CUDA and OpenCL support #746

Closed
wants to merge 13 commits

Conversation

niansa
Contributor

@niansa niansa commented May 28, 2023

This PR aims to add support for CUDA and OpenCL. Once ready, I'll need someone to test CUDA support since I don't own an Nvidia card myself.

Testing instructions

Just a warning: old models as downloaded automatically will not work properly with OpenCL. Currently they make the GUI freeze; handling that gracefully is a change that still needs to happen on the GUI side. Old llama.cpp simply doesn't support them.
Download a GGML model from here: https://huggingface.co/TheBloke and place it in your models folder. Make sure its filename starts with ggml-! The GUI might attempt to load another model along the way and crash, since updating that won't be part of this PR. To prevent this, move the other models somewhere else.

To make the GUI actually use the GPU, you'll need to add either buildVariant = "cuda"; or buildVariant = "opencl"; after this line:
https://github.com/tuxifan/gpt4all/blob/dlopen_gpu/gpt4all-backend/llmodel.cpp#L69
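
For testing, the forced variant ends up sitting right before the implementation lookup, roughly like this (just a sketch of the surrounding code in llmodel.cpp; the exact context may have drifted):

    // gpt4all-backend/llmodel.cpp -- sketch, not a verbatim excerpt
    if (buildVariant == "auto") {
        if (requires_avxonly()) {
            buildVariant = "avxonly";
        } else {
            buildVariant = "default";
        }
    }
    buildVariant = "cuda";  // or "opencl" -- forces the GPU backend for testing
    impl = implementation(f, buildVariant);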

We also need some people testing on Windows with AMD graphics cards! And some people testing on Linux with Nvidia.

@niansa niansa force-pushed the dlopen_gpu branch 2 times, most recently from 8592c13 to dde586f on May 28, 2023 15:40
@niansa niansa marked this pull request as ready for review May 28, 2023 15:41
@AndriyMulyar
Contributor

AndriyMulyar commented May 28, 2023

This is a very important improvement but will have to be carefully tested.

We need to test that

  1. GPU support works on Windows and Linux machines with Nvidia graphics cards.
  2. Either the chat client or one set of bindings can effectively utilize the support.

@cosmic-snow
Collaborator

cosmic-snow commented May 28, 2023

I'll be happy to test it on Windows 10, maybe even Linux. NVidia 3060. Just ping me when you think it's in a good-enough state.

@ani1797

ani1797 commented May 29, 2023

I'm happy to test as well.
I have a Windows machine with a 3090.

@chadnice

chadnice commented May 29, 2023

I'd be happy to test on my Windows 10 machine. CUDA is installed already.

edit: GPU is 1660 Ti

@niansa
Contributor Author

niansa commented May 29, 2023

Wonderful! Thanks everyone 🙂

@niansa
Contributor Author

niansa commented May 29, 2023

Just a warning: old models as downloaded automatically will not work properly with OpenCL. Currently they make the GUI freeze; handling that gracefully is a change that still needs to happen on the GUI side. Old llama.cpp simply doesn't support them.

@maiko

maiko commented May 29, 2023

Here to help with testing on Windows 11, RTX 3090.

@duouoduo

Here to help with testing on Windows 11, RTX 3060 Ti. Thanks everyone!

@niansa
Contributor Author

niansa commented May 29, 2023

I've added testing instructions to the top post. :-)

@pierreduf

pierreduf commented May 30, 2023

Hello! Thanks for the hard work.

I'm on Linux with an Iris Xe integrated GPU (OpenCL compatible). Is there any chance of it working? I've forced buildVariant = "opencl"; in the code as specified above. Backend and chat built without any errors.

But when I launch "chat", it just sits there forever doing nothing (consuming neither CPU nor RAM), with only the message "deserializing chats took: 0 ms".

I use the 13B snoozy model; it works perfectly on the main Nomic branch.

@pierreduf

pierreduf commented May 30, 2023

OK, I finally got it working!

First of all, I had the OpenCL libs and headers but not CLBlast (I overlooked the CMake warning). I built it from here, as the version included in my repos (Ubuntu 20.04) did not work: https://github.com/CNugteren/CLBlast
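
For anyone else who needs to build CLBlast from source, the generic CMake sequence should be roughly this (adjust paths/options as needed):

    git clone https://github.com/CNugteren/CLBlast.git
    cmake -S CLBlast -B CLBlast/build -DCMAKE_BUILD_TYPE=Release
    cmake --build CLBlast/build
    sudo cmake --install CLBlast/build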

I also downloaded a new model (https://huggingface.co/TheBloke/samantha-13B-GGML/tree/main), as the snoozy one did not work (as you said in the first message; sorry for reading too fast).

I now have a working OpenCL setup! Hope it can help others. But unfortunately it does not speed anything up :D (my integrated GPU is probably not well suited for that).

Any idea how I could speed that up?

qt.dbus.integration: Could not connect "org.freedesktop.IBus" to globalEngineChanged(QString)
deserializing chats took: 0 ms
llama.cpp: loading model from /opt/gpt4all/gpt4allgpu//ggml-samantha-13b.ggmlv3.q5_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0,09 MB
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) Gen12LP HD Graphics NEO'
ggml_opencl: device FP16 support: true
llama_model_load_internal: mem required  = 10583,26 MB (+ 1608,00 MB per state)

My GPU capabilities (using the OpenCL API) are below:

GPU VRAM Size: 25440 MB
Number of Compute Units: 96

@niansa
Contributor Author

niansa commented May 31, 2023

Hello! Thanks for the hard work.

I'm on Linux with an Iris Xe integrated GPU (OpenCL compatible). Is there any chance of it working? I've forced buildVariant = "opencl"; in the code as specified above. Backend and chat built without any errors.

But when I launch "chat", it just sits there forever doing nothing (consuming neither CPU nor RAM), with only the message "deserializing chats took: 0 ms".

I use the 13B snoozy model; it works perfectly on the main Nomic branch.

Please reread the first message, the old models aren't supported :-)

@niansa
Contributor Author

niansa commented May 31, 2023

Any idea how I could speed that up?

Nope. Integrated graphics are pretty much unsuitable for this. But this should be enough to show that it's working! Thank you for giving it a try :-)

@pierreduf

pierreduf commented May 31, 2023

Thank you for your answer, that's what I thought :(. Just out of curiosity, what would be the limiting factor for such an iGPU: RAM, because it's shared with the system (the GPU itself has only 128 MB dedicated, from what I understand), or just the number of compute cores? Or something else?

I did more tests and I noticed something strange: I'm using intel_gpu_top to watch GPU usage. The GPU is clearly used when I watch a 4K 60 fps video on YouTube (i.e. hardware acceleration), but it seems not to be used at all by GPT4All (GPU version). Am I missing something?

@cosmic-snow
Collaborator

Sorry, it's taking a bit longer. I hadn't actually compiled anything with MSVC in a while -- and it looks like that's the way to go -- so I'm now dealing with some fun build errors (who doesn't love those!). Although I've already knocked a few down by installing/upgrading certain things.

Question: are there known minimum requirements for the things involved in the whole toolchain?

I'm now up-to-date with some things and using VS Community 2022 and Windows 10 SDK v10.0.20348.0, so newer than that isn't possible anyway (for win10). Still relying on an older CUDA SDK (v11.6), however. Might just have to go update that, too, if nothing else helps.

I should probably go and have a closer look at the llama.cpp project.

@niansa
Contributor Author

niansa commented May 31, 2023

Yeah, CUDA setup should be documented in the llama.cpp repo

@niansa niansa changed the base branch from dlopen_backend_3 to main May 31, 2023 21:41
@cosmic-snow
Collaborator

I'm a bit reluctant to turn this into a troubleshooting session here -- in a pull request comment of all places -- but what I've seen so far might help others who want to try CUDA.

Well, it's quite weird with MSVC to say the least. So far I've run into the following problems. This was still before the forced pull/merge yesterday, which has helped quite a bit now:

  • Note: in all of the following, I've used ...\vcvarsall.bat x64 and I was trying to simply build the backend itself in a first step. I worked locally with a git fetch origin pull/746/head:trying-cuda; git checkout trying-cuda.

  • Some earlier problems got resolved by updating to Visual Studio 2022 and the latest Windows 10 SDK. I'm not going to go into detail about those. I'm still on CUDA v11.6, however. It doesn't seem to be a problem, after all.

  • many errors in gpt4all-backend\llama.cpp-mainline\ggml-cuda.cu with message: error : expected an expression

    • Was a very puzzling error initially, because the GGML_CUDA_DMMV_X/GGML_CUDA_DMMV_Y this pointed to were simple #defines in the code. Turns out that for some reason, these #defines are overridable through cmake compiler options and are actually set in the config -- only those settings were somehow not passed through in the end. Resolved by manually editing the relevant .vcxproj file by changing all relevant compiler invocations.
    • Resolved. This doesn't happen anymore since the force push.
  • Warning about a feature requiring C++ standard 20.

    • Fixed by editing CMakeLists.txt and replacing set(CMAKE_CXX_STANDARD 17) with set(CMAKE_CXX_STANDARD 20)
    • Resolved. This isn't necessary anymore since the force push/merge.
  • minor problem: warning C5102: ignoring invalid command-line macro definition '/arch:AVX2' but /arch:AVX2 is a perfectly valid flag in MSVC.

    • I've figured out why it happens: it's following a /D, but is not about setting a macro definition. It's a valid flag by itself. Have not figured out why it's generated that way, though.
    • Doesn't occur when compiling the main branch, it seems?
    • Still happens after the force push/merge.
  • Main problem: Build errors in many projects: error MSB3073: ... <many script lines omitted> ... :VCEnd" exited with code -1073741819.

    • code -1073741819 is hexadecimal 0xC0000005, which seems to be the code for an access violation. Yikes. Did my compiler just crash?
    • Found this and this as potentially talking about the same problem. The former is a downvoted and unanswered SO question, and the latter says to disable the /GL compiler flag (not tried before the force push).
    • Still seeing these errors after the force push/merge.
    • So far I did everything on the command line. This was somehow resolved by opening the .sln in Visual Studio and building the whole thing twice (after the first run showed the same errors). (???)

(Of course, I cannot exclude the possibility that all of this is yet another case of PEBKAC.)

=> So now I have managed to have a compiled backend, at last.

P.S. I could also try compiling everything with a MinGW setup (I prefer MSYS2 MinGW here). Is that something that's supposed to be supported in the future? I've invested quite some time to help troubleshoot problems there (mainly in 758, 717 and 710) and I guess it's not a good user experience -- but that also has to do with the Python bindings package.

@niansa
Contributor Author

niansa commented Jun 1, 2023

A compile issue on MSVC has been found and will be fixed soon, @cosmic-snow! I'll notify you when there's more.

@cosmic-snow
Collaborator

cosmic-snow commented Jun 1, 2023

Oh really? That's good to know. But not urgent, because here's where I am now:

  • I tried compiling the backend by itself so I might get away with just testing through the Python bindings.

  • Turns out, the C API has changed, too. So I decided to finally do the full setup and download Qt Creator.

  • Some time and a few gigabytes later, it wasn't very hard to configure, most of the things were set correctly out of the box (I did have to compile this one twice, too, but that's a minor inconvenience). The only thing I changed was CMAKE_GENERATOR to Visual Studio 17 2022:
    [screenshot]

  • I already had prepared mpt-7b-instruct.ggmlv3.q4_1.bin which I renamed ggml-mpt-7b-instruct.ggmlv3.q4_1.bin, downloaded from: https://huggingface.co/TheBloke/MPT-7B-Instruct-GGML/tree/main. This did not get recognised correctly:

    • gptj_model_load: invalid model file ... (bad vocab size 2003 != 4096) and GPT-J ERROR: failed to load model although of course it's not a GPT-J model.
  • I then downloaded Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_1.bin renamed to ggml-Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_1.bin (https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/tree/main). And this works.


However, it did not seem to use my GPU, despite me setting buildVariant = "cuda";, so that's what I'm looking into at the moment.

Edit: It's clearly doing work in the cuda-enabled library (the cut-off name is ggml_graph_compute):
[screenshot]

Edit2: Added a simple debugging printf at line 9786 in ggml.c, and it looks like the check ggml_cuda_can_mul_mat(...) is simply never true in my case. Maybe I need a different model? But that's just a guess. To really understand what's going on I'd need to spend more time with llama.cpp.

Edit3: Added set(CMAKE_AUTOMOC OFF) to the beginning of gpt4all-backend/CMakeLists.txt. This makes it easier for me to understand the compilation output and shouldn't mess anything up, I think (but I'm no expert here). Aside: it'd probably be better not to set it ON globally in the chat CMakeLists.txt, but only for the targets that actually use Qt. Might improve build speed slightly, too.

Edit4: One thing that feels odd is that the macro definition GGML_USE_CUBLAS is only ever activated in the compiler options of ggml.c, but llama.cpp (the file, not the project) has an #ifdef section depending on it. Talking about mainline here, but I think other targets have that, too.
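
If that's really the case, I'd guess the fix is something along these lines in llama.cpp.cmake (just a sketch; GGML_CUBLAS_USE and the ggml${SUFFIX}/llama${SUFFIX} naming are taken from that file, so treat the exact target name as an assumption):

    if (GGML_CUBLAS_USE)
        # mirror the define that ggml${SUFFIX} already gets onto the llama target
        target_compile_definitions(llama${SUFFIX} PRIVATE GGML_USE_CUBLAS)
    endif()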

@niansa
Contributor Author

niansa commented Jun 2, 2023

@cosmic-snow thanks for the testing efforts!! Please note that MPT/GPT-J isn't supported in the new GGML formats yet. I have added the missing compile defines to the CMake file for llama, please try again now. :-)

@cosmic-snow
Collaborator

cosmic-snow commented Jun 2, 2023

I'm getting the error:

CMake Error at llama.cpp.cmake:280 (target_compile_definitions):
  Cannot specify compile definitions for target "llama-230511-cuda" which is
  not built by this project.
Call Stack (most recent call first):
  CMakeLists.txt:90 (include_ggml)

Note: I just copied your most recent changes over, not going through Git. Not sure if that changed any line numbers, but the error should be clear: CUDA isn't present yet in that version.

I think I've seen some conditionals like that in CMakeLists.txt. Maybe I can fix it myself.

Edit: I was mistaken, a previous build produced a llama-230511-cuda.dll. Sorry, it's probably better to just start from a clean slate again.

Edit2: Trying again with a clean version of the patchset helped already, but now I'm getting the GGML_CUDA_DMMV_X/GGML_CUDA_DMMV_Y error again which I thought was resolved. Although I can see they're supposed to be defined in the cmake files -- in the compiler string for ...\llama.cpp-mainline\ggml-cuda.cu they show up empty: ... -DGGML_CUDA_DMMV_X= -DGGML_CUDA_DMMV_Y= .... I'm starting to think it's something on my end I'm missing here.

Edit3: Maybe it's an ordering problem now in how the CMakeLists.txt get read? Copying the following from ...\llama.cpp-mainline\CMakeLists.txt to right before they're used in llama.cpp.cmake fixed that particular error:

    set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kernels")
    set(LLAMA_CUDA_DMMV_Y "1" CACHE STRING  "llama: y block size for dmmv CUDA kernels")
    if (GGML_CUBLAS_USE)
        target_compile_definitions(ggml${SUFFIX} PRIVATE
            GGML_USE_CUBLAS
            GGML_CUDA_DMMV_X=${LLAMA_CUDA_DMMV_X}
            GGML_CUDA_DMMV_Y=${LLAMA_CUDA_DMMV_Y})
    ...

Edit4: Something is still decidedly wrong here. I'm now getting a linker error (in short, it doesn't find the LLModel::construct() symbol) when trying to build the chat application, and that doesn't look like something that was even touched by your previous commit. I know where its implementation is, but somehow the llmodel.dll just winds up empty now; at least that's what inspecting it with 'DLL Export Viewer' says. I successfully built main yesterday and can see the symbol in that version's DLL.
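
Aside: to check a DLL's exports without extra tools, dumpbin from a VS developer prompt does the same job:

    dumpbin /EXPORTS llmodel.dll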

I'll keep trying for a bit, but I guess I ultimately need to figure out what's wrong with the build process as a whole here.

@niansa
Contributor Author

niansa commented Jun 2, 2023

I apologize, there was a little mistake in llama.cpp.cmake :-)
That should be solved now. Again, thanks a lot for testing all this!

@cosmic-snow
Collaborator

cosmic-snow commented Jun 2, 2023

That should be solved now. Again, thanks a lot for testing all this!

You're welcome. And yes, although I'm not going to pull those fixes again right now, that looks like it solves that particular problem.

In the meantime I've managed to get it to work somehow, although I don't understand it yet. And can confirm it was running on CUDA (still v11.6 instead of the latest v12.1), at least until it crashed:
[screenshot]

Next, I guess I'll try to figure out:

  • Build problems, esp. error MSB3073 with code -1073741819 / 0xC0000005, which seems to be the main culprit
  • the /arch:AVX2 warning

Edit:
I think I've found the problem with the /arch:AVX2. Here:

target_compile_definitions(ggml${SUFFIX} PRIVATE
$<$<COMPILE_LANGUAGE:C>:/arch:AVX2>
$<$<COMPILE_LANGUAGE:CXX>:/arch:AVX2>)
it should be target_compile_options(... instead of definitions(.... I was looking at ...\llama.cpp*\CMakeLists.txt this whole time, so it's no wonder I couldn't figure that one out.
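
So presumably the corrected call would look like this (same generator expressions, just the right command):

    target_compile_options(ggml${SUFFIX} PRIVATE
        $<$<COMPILE_LANGUAGE:C>:/arch:AVX2>
        $<$<COMPILE_LANGUAGE:CXX>:/arch:AVX2>)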

Edit2:
Regarding the build problems, I've figured at least something out: If after compiling everything twice the llmodel.dll ends up empty, manually opening its Visual Studio project, disabling /GL (as mentioned above and recommended here) and recompiling it by itself fixes the problem.

Edit3:
Maybe also bump the version number?

set(LLMODEL_VERSION_MAJOR 0)
set(LLMODEL_VERSION_MINOR 2)
set(LLMODEL_VERSION_PATCH 0)
The new C API is not compatible with the previous one, otherwise I could've just tested the backend with the Python bindings.

Edit4:
So I guess the /GL setting was the problem in all the projects that failed with error MSB3073 ... and had to be built twice. As a workaround, I've added set(IPO_SUPPORTED OFF) right after the following:

# Check for IPO support
include(CheckIPOSupported)
check_ipo_supported(RESULT IPO_SUPPORTED OUTPUT IPO_ERROR)
if (NOT IPO_SUPPORTED)
message(WARNING "Interprocedural optimization is not supported by your toolchain! This will lead to bigger file sizes and worse performance: ${IPO_ERROR}")
else()
message(STATUS "Interprocedural optimization support detected")
endif()
Note: I'm not suggesting it should be turned off permanently for MSVC, maybe myself or someone else is able to figure out why it behaves like that and can come up with a proper fix. I did try with only set(LLAMA_LTO OFF) at first, but that was not enough.

@cosmic-snow
Collaborator

Oh also, note: When I said:

Then I did a regular build inside Qt Creator. It went without a problem and I could run it (but it only used the CPU).

This was still with what I learned earlier. That is, I explicitly did a set(IPO_SUPPORTED OFF) to get around all the build errors this would otherwise cause.

I'm not sure what this branch is based off now, but main already has this kind of workaround somewhere in it. You might want to take a closer look and make sure there won't be any regressions.

@niansa
Contributor Author

niansa commented Jul 5, 2023

Thanks for the tip!

@jstr00

jstr00 commented Jul 12, 2023

Hi, I tested this with OpenCL on my AMD Radeon RX 5600 XT, but it doesn't seem to utilize the GPU at all, even though I did receive a response. Is there another flag that I need to set for GPU support? I followed the instructions at https://github.com/nomic-ai/gpt4all/tree/main/gpt4all-bindings/python and used orca-mini-3b.ggmlv3.q4_0.bin as suggested in the documentation.

@MJakobs97

TL;DR: results inconclusive; general response times vary between 3 and 10 seconds.

Testing results:
OS: Ubuntu 22.04.2 LTS
CPU: Ryzen 7 3700x
GPU: RTX 3060TI
GPU Driver: 525.125.06 (non-free)

I tested the buildVariant = "cuda" version and built the backend with the provided instructions.
I re-downloaded the "ggml-mpt-7b-chat" model in hopes of achieving maximum compatibility.
"Orca-mini-3b.ggmlv3.q4_0.bin" complained about incompatible model file.

I reverted to default n_threads.
Also tested with n_threads 1-16, which tanked the performance.

The inference feels (somewhat) faster(-ish) than before. With moderate CPU utilization, it feels like nvtop and nvidia-smi report slightly higher utilization and power draw, but those sightings only appear to correlate some of the time. I think any potential improvement disappears into the general noise.

No guarantees on scientific correctness; I may or may not have overlooked something totally obvious.

@cosmic-snow
Collaborator

cosmic-snow commented Jul 12, 2023

@MJakobs97

I re-downloaded the "ggml-mpt-7b-chat" model in hopes of achieving maximum compatibility.

Huh. I thought MPT wasn't supported? Maybe that has changed in the meantime? I did my tests with a LLaMA based model. Those with v3 in their name should work, I think. Maybe you can try one of those, too?

Edit: Although Orca-mini is kind of a LLaMA model? Odd. Maybe try a 7b or 13b one? I think I've used ggml-Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_1.bin for my tests.

@MJakobs97

I have now also tested "wizardLM-13B-Uncensored.ggmlv3.q4_0" which is larger and has v3 in its name.
I haven't seen any conclusive spikes in utilization or power consumption using either nvidia-smi or nvtop.
Power consumption stays stagnant and utilization is within the normal range.

In addition, setting n_threads to 16 had little to no effect, whereas it would normally improve inference times drastically.
This observation may be moot, as I reckon the models have gotten more efficient over the last two months.

@cosmic-snow
Collaborator

cosmic-snow commented Jul 12, 2023

I haven't seen any conclusive spikes in utilization or power consumption using either nvidia-smi or nvtop.
Power consumption stays stagnant and utilization is within the normal range.

🤔

You should not just see spikes. You should have full load on it.

[screenshot]

@MJakobs97

I probably failed to mention that I am testing directly on the CLI.
In theory that shouldn't make any difference, I thought.
Which model are you running? Is there some kind of reproducible test sequence?

@cosmic-snow
Collaborator

I probably failed to mention that I am testing directly on the CLI. In theory that shouldn't make any difference, I thought.

It shouldn't.

Which model are you running? Is there some kind of reproducible test sequence?

As I said: ggml-Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_1.bin. Btw, earlier when I retried it, I put the buildVariant = "cuda"; in the wrong spot. Maybe check that?

// llmodel.cpp
    if (!impl) {
        //TODO: Auto-detect CUDA/OpenCL
        if (buildVariant == "auto") {
            if (requires_avxonly()) {
                buildVariant = "avxonly";
            } else {
                buildVariant = "default";
            }
        }
        buildVariant = "cuda";  // <- here before impl = ...
        impl = implementation(f, buildVariant);
        if (!impl) return nullptr;
    }
    f.close();

How did you install the CUDA toolkit anyway? I haven't tried that on Linux yet.

@MJakobs97

Can confirm, I didn't RTFM; I followed the old post and also put it in the wrong spot.
Now I've got it working with CUDA. Tested the following models I had lying around:

wizardLM-13B-Uncensored.ggmlv3.q4_0 - works, gets loaded onto the GPU
orca-mini-3b.ggmlv3.q4_0.bin - works, gets loaded onto the GPU, answers BLAZINGLY fast but gets confused almost as fast.
ggml-mpt-7b-chat.bin - not working with CUDA
ggml-gpt4all-j-v1.3-groovy.bin - not working with CUDA

As far as CUDA toolkit installation goes, I think I just followed the official instructions found here:
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

Can't remember for sure, though.

@sammcj

sammcj commented Aug 10, 2023

Are there any updates on getting this merged in @nomic-ai ?

@kevinschaul

Sharing my attempts to run this. Apologies for any dumb mistakes.

System details

Debian 12 (Bookworm)

% nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
% lspci -k | grep -EA3 'VGA|3D|Display'
01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. GA106 [GeForce RTX 3060 Lite Hash Rate]
        Kernel driver in use: nvidia
        Kernel modules: nvidia

How I'm building

cd gpt4all-backend
mkdir build
cd build
cmake ..
cmake --build .

Errors/Warnings

CLBlast
I think this is expected since I'm using CUDA not OpenCL, so ignoring this one

CMake Warning at CMakeLists.txt:66 (find_package):
  By not providing "FindCLBlast.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "CLBlast", but
  CMake did not find one.

CMAKE_CUDA_ARCHITECTURES

CMake Warning (dev) in CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

I added this to CMakeLists.txt, warning is gone now:

    if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
        set(CMAKE_CUDA_ARCHITECTURES native)
    endif()

Missing cuda_runtime.h

/usr/bin/c++ -DLIB_FILE_EXT=\".so\" -DLLMODEL_CUDA -Dllmodel_EXPORTS -I/home/kevin/dev/gpt4all/gpt4all-backend/build -O3 -DNDEBUG -std=gnu++20 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -MD -MT CMakeFiles/llmodel.dir/llmodel.cpp.o -MF CMakeFiles/llmodel.dir/llmodel.cpp.o.d -o CMakeFiles/llmodel.dir/llmodel.cpp.o -c /home/kevin/dev/gpt4all/gpt4all-backend/llmodel.cpp
/home/kevin/dev/gpt4all/gpt4all-backend/llmodel.cpp:14:10: fatal error: cuda_runtime.h: No such file or directory
   14 | #include <cuda_runtime.h>
      |          ^~~~~~~~~~~~~~~~

I added this to CMakeLists.txt, error is gone now:

    include_directories("${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}")

Is that necessary? Feels wrong ...

cannot find -lcudart

[100%] Linking CXX shared library libllmodel.so
/home/kevin/.pyenv/versions/3.9.14/lib/python3.9/site-packages/cmake/data/bin/cmake -E cmake_link_script CMakeFiles/llmodel.dir/link.txt --verbose=1
/usr/bin/c++ -fPIC -O3 -DNDEBUG -shared -Wl,-soname,libllmodel.so.0 -o libllmodel.so.0.3.0 CMakeFiles/llmodel.dir/llmodel.cpp.o CMakeFiles/llmodel.dir/llmodel_shared.cpp.o CMakeFiles/llmodel.dir/llmodel_c.cpp.o  -lcudart
/usr/bin/ld: cannot find -lcudart: No such file or directory
collect2: error: ld returned 1 exit status
gmake[2]: *** [CMakeFiles/llmodel.dir/build.make:132: libllmodel.so.0.3.0] Error 1

I can get around this by adding to CMakeLists.txt:

    link_directories("/usr/local/cuda-11.8/lib64/")

But that feels super wrong.
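
Maybe the cleaner route would be to lean on CMake's own CUDA toolkit package instead, which should also take care of the cuda_runtime.h include above; something like this (untested on my side):

    find_package(CUDAToolkit REQUIRED)                    # CMake 3.17+
    target_link_libraries(llmodel PRIVATE CUDA::cudart)   # carries the include dirs and libcudart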

Did anyone else get stuck on these warnings/errors?

@cosmic-snow
Collaborator

cosmic-snow commented Aug 11, 2023

Sharing my attempts to run this. Apologies for any dumb mistakes.

Thanks for sharing.

...

Errors/Warnings

CLBlast I think this is expected since I'm using CUDA not OpenCL, so ignoring this one

Yes, ignore.

CMAKE_CUDA_ARCHITECTURES

CMake Warning (dev) in CMakeLists.txt:
  Policy CMP0104 is not set: CMAKE_CUDA_ARCHITECTURES now detected for NVCC,
  empty CUDA_ARCHITECTURES not allowed.  Run "cmake --help-policy CMP0104"
  for policy details.  Use the cmake_policy command to set the policy and
  suppress this warning.

I added this to CMakeLists.txt, warning is gone now:

    if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
        set(CMAKE_CUDA_ARCHITECTURES native)
    endif()

I haven't really looked into it myself and just ignored this one, too. I guess the proper way of dealing with this would be to:

  • Set CMAKE_CUDA_ARCHITECTURES which specifies the default value for CUDA_ARCHITECTURES according to the docs, as suggested in the policy help.
    • I guess the value could be all or all-major if there is ever going to be a release of this (which feels like it's unlikely to happen). Or as you did yourself: native for a local build.
    • Although it looks like all of these values require an even more recent version of CMake (3.23/3.24)
  • Bump CMake minimum version to 3.18, as that's where it was introduced.

But you can of course just override it yourself (or override CUDA_ARCHITECTURES directly).
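
In CMakeLists.txt terms, that default would probably look something like this (sketch; as noted above, all/all-major need CMake 3.23+ and native needs 3.24+):

    if (NOT DEFINED CMAKE_CUDA_ARCHITECTURES)
        set(CMAKE_CUDA_ARCHITECTURES all-major)  # or 'native' for a purely local build
    endif()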

Missing cuda_runtime.h

...

I added this to CMakeLists.txt, error is gone now:

    include_directories("${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}")

No idea why this is necessary, but I haven't tried building it on Linux yet.

cannot find -lcudart

...

I can get around this by adding to CMakeLists.txt:

    link_directories("/usr/local/cuda-11.8/lib64/")

Also no idea why this would be necessary. I think if it's installed in the default folder, CMake should just detect it? At least that's how it works on Windows.

@jensdraht1999

@cosmic-snow Is this CUDA support still coming? Is there still a problem with this?

We now have GPU support, which is a bit faster, but I think CUDA would still be much faster.

@cosmic-snow
Collaborator

cosmic-snow commented Sep 25, 2023

@cosmic-snow Is this CUDA support still coming? Is there still a problem with this?

We now have GPU support, which is a bit faster, but I think CUDA would still be much faster.

I can't say; I've only ever done testing here. I'm in favour of having it as an option, but I don't know where this is going.

@jensdraht1999

@cosmic-snow I'd say the owner of this repository should just merge it then, because from what I've read in this thread there are no errors and it works as it should. Or did I miss something? Are there still problems of some kind that keep this from being merged?

@niansa
Contributor Author

niansa commented Sep 26, 2023

I would change CUDA to HIP now though, since it works on both Nvidia and AMD.
@jensdraht1999 what's holding this back are the Nvidia licensing terms. Shipping CUDA is quite... problematic.

@jensdraht1999

I would change CUDA to HIP now though, since it works on both Nvidia and AMD. @jensdraht1999 what's holding this back are the Nvidia licensing terms. Shipping CUDA is quite... problematic.

Yeah, I understand, but that would make your work (since you initially started this PR) all for nothing.

As for shipping CUDA, I'm not exactly sure, because all AI projects use CUDA/cuDNN/TensorRT in one way or another.

However, one downside would be that installation sizes will get bigger, because the CUDA libraries are big.

Shall we close this PR then?

@jensdraht1999

@niansa Can we add this or remove the support altogether? If yes, then I'd say we can close this PR?
