ggml: Add initial WebGPU backend #14521

Open · wants to merge 24 commits into master

Conversation

reeselevine

Following the discussion in #7773, this is an initial PR for the WebGPU backend.

At the moment, it supports only enough to pass the CI and to run the basic matrix multiplication example in ggml: https://github.com/ggml-org/ggml/tree/master/examples/simple. I do notice that test-tokenizers-ggml-vocabs fails on my local machine (M3), but this appears to be unrelated to anything I've added for WebGPU.

I'm opening this PR in case it makes sense to integrate it into llama.cpp now, but I'm also happy to keep working on it and merge once the backend is more fully featured. I'm committed to working on it (along with a couple of others from UC Santa Cruz) as my main project over the next couple of months, and to maintaining it afterwards, but I'd also appreciate any collaboration, as there's a lot to do and I have some other obligations.

@tyler-utah @ggerganov @ngxson @audiovention

@reeselevine reeselevine requested a review from ggerganov as a code owner July 3, 2025 19:21
@github-actions github-actions bot added labels: documentation (improvements or additions to documentation), python (python script changes), devops (improvements to build systems and github actions), ggml (changes relating to the ggml tensor library for machine learning) — Jul 3, 2025
Comment on lines +20 to +23
stride_dst0: u32,
stride_dst1: u32,
stride_dst2: u32,
stride_dst3: u32,
Member

I think I remember a comment about WebGPU not supporting 64-bit integers. Is this correct?

If it cannot support them, then it might get quite tricky, especially with the new ggml_set_rows() requiring I64 data (#14274).

Author

Yes, WebGPU currently does not support 64-bit integers. However, support for them, along with other data types, is at least on the WebGPU working group's radar: gpuweb/gpuweb#273

Do you think this is something that could be emulated for now, is it a showstopper, or just something we should keep in mind?

Member

Not a showstopper. We just have to be careful when implementing GGML_OP_SET_ROWS to take the lower 32 bits of the indices and, if possible, raise an error when the upper bits are non-zero.
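
For illustration, a minimal host-side sketch of that truncation. The helper name is hypothetical (not part of this PR), and the exception is a stand-in for whatever error mechanism ggml would actually use:

#include <cstdint>
#include <stdexcept>

// Sketch of the narrowing described above: reject any I64 row index whose
// upper 32 bits are non-zero (including negative values), then hand the
// lower 32 bits to the WGSL shader.
static uint32_t webgpu_narrow_i64_index(int64_t idx) {
    if ((uint64_t) idx > (uint64_t) UINT32_MAX) {
        throw std::runtime_error("set_rows index does not fit in 32 bits");
    }
    return (uint32_t) idx;
}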



wgpu::InstanceDescriptor instanceDescriptor{};
instanceDescriptor.capabilities.timedWaitAnyEnable = true;
Member

With the latest Dawn this causes a build error:

llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:886:24: error: no member named 'capabilities' in 'wgpu::InstanceDescriptor'
  886 |     instanceDescriptor.capabilities.timedWaitAnyEnable = true;
      |     ~~~~~~~~~~~~~~~~~~ ^
1 error generated.

I'm building on Mac. Could you suggest a workaround so I can run some tests?
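
For anyone hitting the same error, a hedged sketch of a possible workaround, assuming a Dawn revision that replaced the capabilities member with a required-features list. The InstanceDescriptor layout has changed across Dawn versions, so treat the member and enum names below as assumptions to verify against your local headers:

#include <webgpu/webgpu_cpp.h>

// Hedged sketch: on Dawn revisions without `capabilities`, timed waits may
// instead be requested through a required-features list. The name
// wgpu::InstanceFeatureName::TimedWaitAny is an assumption to check locally.
static wgpu::Instance ggml_webgpu_make_instance() {
    wgpu::InstanceFeatureName required[] = { wgpu::InstanceFeatureName::TimedWaitAny };
    wgpu::InstanceDescriptor desc{};
    desc.requiredFeatureCount = 1;
    desc.requiredFeatures     = required;
    return wgpu::CreateInstance(&desc);
}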

return false;

case GGML_OP_CPY: {
std::lock_guard<std::mutex> lock(ctx->mutex);
Member

This lock should be avoided. What is the reason for putting it here?

Comment on lines +267 to +278
params[1] = src_misalignment;
params[2] = dst_misalignment;

// Convert byte-strides to element-strides
params[3] = (uint32_t)src->nb[0]/ggml_type_size(src->type);
params[4] = (uint32_t)src->nb[1]/ggml_type_size(src->type);
params[5] = (uint32_t)src->nb[2]/ggml_type_size(src->type);
params[6] = (uint32_t)src->nb[3]/ggml_type_size(src->type);
params[7] = (uint32_t)node->nb[0]/ggml_type_size(node->type);
params[8] = (uint32_t)node->nb[1]/ggml_type_size(node->type);
params[9] = (uint32_t)node->nb[2]/ggml_type_size(node->type);
params[10] = (uint32_t)node->nb[3]/ggml_type_size(node->type);
Member

For consistency, all of these should either be byte offsets or element offsets, but not both.
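
One hedged way to resolve this, keeping the variable names from the snippet above and expressing everything in elements (this assumes the misalignment values are byte counts evenly divisible by the element size):

// Sketch: store the misalignment offsets in elements as well, so every
// entry of `params` uses the same unit as the strides below it.
params[1] = src_misalignment / (uint32_t) ggml_type_size(src->type);
params[2] = dst_misalignment / (uint32_t) ggml_type_size(node->type);
// params[3..10] already hold element strides and stay unchanged.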

Comment on lines +682 to +683
// TODO: Does this need to be thread safe? Is it only called once?
static ggml_backend_t ggml_backend_webgpu_device_init(ggml_backend_dev_t dev, const char * params) {
Member

Yes, it can be called more than once. For single device I think this implementation should be OK for now.
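
If stronger guarantees are ever needed, a self-contained sketch of one option (names are illustrative, not from this PR): guard the one-time setup with std::call_once so concurrent or repeated calls initialize the device exactly once.

#include <mutex>

// Illustrative guard, not part of the PR: repeated or racing calls to the
// init path run the expensive setup exactly once.
static std::once_flag g_webgpu_device_once;

static void webgpu_device_one_time_setup() {
    // expensive instance/adapter/device creation would go here
}

void webgpu_device_init_guarded() {
    std::call_once(g_webgpu_device_once, webgpu_device_one_time_setup);
    // cheap per-call work (e.g. handing out the backend handle) follows
}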
