-
Download the prebuilt binaries with IPEX for your system, then launch the server (a sketch of the launch command follows). As for the model, there are currently better options available.
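A minimal sketch of what launching the server typically looks like, assuming the standard llama-server flags; the model filename, layer count, and port below are placeholders, not the exact command from the original reply:

```bash
# Set up the oneAPI runtime first (needed for SYCL/IPEX builds).
source /opt/intel/oneapi/setvars.sh
# Launch the OpenAI-compatible server; -ngl 99 offloads all layers to the GPU.
# Replace the model path with the GGUF file you actually downloaded.
./llama-server -m ./models/your-model-Q4_0.gguf -ngl 99 --host 127.0.0.1 --port 8080
```

The server then exposes an OpenAI-compatible API at http://127.0.0.1:8080, which tools like the continue extension can point at.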
-
You can use Vulkan as well, which should be very simple to set up. The Ubuntu binaries should just work, and building it yourself is also not complex. For Intel GPUs, I would currently recommend the legacy quants (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0), as they have the best performance; the others are not yet as optimized. It won't be as fast as SYCL, but it should be usable.
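For reference, a build-from-source sketch for the Vulkan backend; the cmake flag follows llama.cpp's documented GGML_VULKAN option, and the assumption is that the Vulkan SDK and drivers are already installed:

```bash
# Build llama.cpp with the Vulkan backend (requires Vulkan headers/loader,
# e.g. the vulkan-devel or libvulkan-dev packages on most distributions).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# Binaries end up in build/bin/; run build/bin/llama-server as usual.
```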
-
llama-b4040-bin-win-sycl-x64.zip is a binary for Windows, but your OS is Arch, so that package can't run on your system. In your case, please follow the SYCL guide; you will need to adapt its commands to Arch Linux. I know someone has built llama.cpp with the SYCL backend on Arch Linux.
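A sketch of what adapting the SYCL guide to Arch Linux might look like; the oneAPI install location and the icx/icpx compiler names follow Intel's defaults and are assumptions, not commands from the original reply:

```bash
# Requires the Intel oneAPI Base Toolkit (provides the icx/icpx compilers).
source /opt/intel/oneapi/setvars.sh
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# GGML_SYCL enables the SYCL backend; icx/icpx are Intel's oneAPI compilers.
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
```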
-
Speaking as a fellow A750 user: SYCL is faster than Vulkan at the moment.
-
Can anyone give me a guide that explains how to self-host DeepSeek-Coder 6.7b with an Intel Arc A750? I have no previous experience with self-hosting or Docker. I want to use it in a VS Code extension named `continue` for auto-completion. I followed this guide for SYCL and was able to get this information, but by then I got really tired.

My system information:
CPU: Intel i5 13400F
GPU: Intel Arc A750, 8 GB VRAM
RAM: 16 GB
OS: Arch Linux
Kernel: Linux 6.12.21-1-lts

Purpose:
I want to self-host an AI that can help me with auto-completion in Vim or VS Code.

Thanks!