ggml: Add initial WebGPU backend #14521

Open · wants to merge 24 commits into master

Conversation

reeselevine

Following the discussion in #7773, this is an initial PR for the WebGPU backend.

At the moment, it supports only enough to pass the CI and to run the basic matrix multiplication example in ggml: https://github.com/ggml-org/ggml/tree/master/examples/simple. I do notice that test-tokenizers-ggml-vocabs fails on my local machine (M3), but this appears to be unrelated to anything I've added for WebGPU.

I'm opening this PR in case it makes sense to integrate it into llama.cpp now, but I'm also happy to keep working on it and merge once the backend is more fully featured. I'm committed to working on it (along with a couple of others from UC Santa Cruz) as my main project over the next couple of months, and to maintaining it afterwards, but I'd also appreciate any collaboration, as there's a lot to do and I have some other obligations.

@tyler-utah @ggerganov @ngxson @audiovention

@reeselevine reeselevine requested a review from ggerganov as a code owner July 3, 2025 19:21
@github-actions github-actions bot added labels: documentation (improvements or additions to documentation), python (python script changes), devops (improvements to build systems and github actions), ggml (changes relating to the ggml tensor library for machine learning) — Jul 3, 2025
Comment on lines +20 to +23
stride_dst0: u32,
stride_dst1: u32,
stride_dst2: u32,
stride_dst3: u32,
Member

I think I remember a comment about WebGPU not supporting 64-bit integers. Is this correct?

If it cannot support them, then it might get quite tricky, especially with the new ggml_set_rows() requiring I64 data (#14274).

Author

Yes, WebGPU currently does not support 64-bit integers. However, support for them, along with other data types, is at least on the WebGPU working group's radar: gpuweb/gpuweb#273

Do you think this is something that could be emulated for now, is it a showstopper, or just something we should keep in mind?

Member

Not a showstopper. We just have to be careful when implementing GGML_OP_SET_ROWS to take the lower 32 bits of the indices and, if possible, raise an error when the upper bits are non-zero.
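
For illustration, a minimal host-side sketch of that truncation. The helper name is hypothetical (not part of this PR), and the exception is a stand-in for whatever error mechanism ggml would actually use:

#include <cstdint>
#include <stdexcept>

// Sketch of the narrowing described above: reject any I64 row index whose
// upper 32 bits are non-zero (including negative values), then hand the
// lower 32 bits to the WGSL shader.
static uint32_t webgpu_narrow_i64_index(int64_t idx) {
    if ((uint64_t) idx > (uint64_t) UINT32_MAX) {
        throw std::runtime_error("set_rows index does not fit in 32 bits");
    }
    return (uint32_t) idx;
}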



wgpu::InstanceDescriptor instanceDescriptor{};
instanceDescriptor.capabilities.timedWaitAnyEnable = true;
Member

With the latest Dawn this causes a build error:

llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:886:24: error: no member named 'capabilities' in 'wgpu::InstanceDescriptor'
  886 |     instanceDescriptor.capabilities.timedWaitAnyEnable = true;
      |     ~~~~~~~~~~~~~~~~~~ ^
1 error generated.

I'm building on Mac. Could you suggest a workaround so I can run some tests?
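
For anyone hitting the same error, a hedged sketch of a possible workaround, assuming a Dawn revision that replaced the capabilities member with a required-features list. The InstanceDescriptor layout has changed across Dawn versions, so treat the member and enum names below as assumptions to verify against your local headers:

#include <webgpu/webgpu_cpp.h>

// Hedged sketch: on Dawn revisions without `capabilities`, timed waits may
// instead be requested through a required-features list. The name
// wgpu::InstanceFeatureName::TimedWaitAny is an assumption to check locally.
static wgpu::Instance ggml_webgpu_make_instance() {
    wgpu::InstanceFeatureName required[] = { wgpu::InstanceFeatureName::TimedWaitAny };
    wgpu::InstanceDescriptor desc{};
    desc.requiredFeatureCount = 1;
    desc.requiredFeatures     = required;
    return wgpu::CreateInstance(&desc);
}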

return false;

case GGML_OP_CPY: {
std::lock_guard<std::mutex> lock(ctx->mutex);
Member

This lock should be avoided. What is the reason for putting it here?

Comment on lines +267 to +278
params[1] = src_misalignment;
params[2] = dst_misalignment;

// Convert byte-strides to element-strides
params[3] = (uint32_t)src->nb[0]/ggml_type_size(src->type);
params[4] = (uint32_t)src->nb[1]/ggml_type_size(src->type);
params[5] = (uint32_t)src->nb[2]/ggml_type_size(src->type);
params[6] = (uint32_t)src->nb[3]/ggml_type_size(src->type);
params[7] = (uint32_t)node->nb[0]/ggml_type_size(node->type);
params[8] = (uint32_t)node->nb[1]/ggml_type_size(node->type);
params[9] = (uint32_t)node->nb[2]/ggml_type_size(node->type);
params[10] = (uint32_t)node->nb[3]/ggml_type_size(node->type);
Member

For consistency, all of these should either be byte offsets or element offsets, but not both.
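
One hedged way to resolve this, keeping the variable names from the snippet above and expressing everything in elements (this assumes the misalignment values are byte counts evenly divisible by the element size):

// Sketch: store the misalignment offsets in elements as well, so every
// entry of `params` uses the same unit as the strides below it.
params[1] = src_misalignment / (uint32_t) ggml_type_size(src->type);
params[2] = dst_misalignment / (uint32_t) ggml_type_size(node->type);
// params[3..10] already hold element strides and stay unchanged.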

Comment on lines +682 to +683
// TODO: Does this need to be thread safe? Is it only called once?
static ggml_backend_t ggml_backend_webgpu_device_init(ggml_backend_dev_t dev, const char * params) {
Member

Yes, it can be called more than once. For single device I think this implementation should be OK for now.
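
If stronger guarantees are ever needed, a self-contained sketch of one option (names are illustrative, not from this PR): guard the one-time setup with std::call_once so concurrent or repeated calls initialize the device exactly once.

#include <mutex>

// Illustrative guard, not part of the PR: repeated or racing calls to the
// init path run the expensive setup exactly once.
static std::once_flag g_webgpu_device_once;

static void webgpu_device_one_time_setup() {
    // expensive instance/adapter/device creation would go here
}

void webgpu_device_init_guarded() {
    std::call_once(g_webgpu_device_once, webgpu_device_one_time_setup);
    // cheap per-call work (e.g. handing out the backend handle) follows
}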
