ggml: Add initial WebGPU backend #14521
base: master
Conversation
…since webgpu doesn't support it
stride_dst0: u32,
stride_dst1: u32,
stride_dst2: u32,
stride_dst3: u32,
I think I remember a comment about WebGPU not supporting 64-bit integers. Is this correct? If it cannot support them, then it might get quite tricky, especially with the new ggml_set_rows() requiring I64 data (#14274).
Yes, WebGPU currently does not support 64-bit integers. However, it is at least on their radar, along with other data types: gpuweb/gpuweb#273
Do you think this could be emulated for now, is it a showstopper, or is it just something we should be aware of?
Not a showstopper. We just have to be careful when implementing GGML_OP_SET_ROWS to take the lower 32 bits of the indices and, if possible, raise an error when the upper bits are non-zero.
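As a concrete illustration, here is a minimal host-side sketch (a hypothetical helper, not code from this PR) of that narrowing step: I64 row indices are reduced to u32 before being handed to a WGSL shader, which has no 64-bit integer type, and any index whose upper bits are non-zero is reported.

#include <cstdint>
#include <cstdio>

// Hypothetical helper: narrow I64 row indices to u32 for a WGSL shader.
// Keeps the lower 32 bits and reports any index whose upper bits are set.
static bool webgpu_narrow_i64_indices(const int64_t * src, uint32_t * dst, size_t n) {
    bool ok = true;
    for (size_t i = 0; i < n; i++) {
        if ((uint64_t) src[i] >> 32 != 0) {
            fprintf(stderr, "set_rows: index %zu does not fit in 32 bits\n", i);
            ok = false;
        }
        dst[i] = (uint32_t) src[i]; // lower 32 bits only
    }
    return ok;
}

Alternatively, the shader itself could bind the I64 buffer as pairs of u32 and read only the low words, though reporting overflow from the GPU would then need a separate flag buffer.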
wgpu::InstanceDescriptor instanceDescriptor{};
instanceDescriptor.capabilities.timedWaitAnyEnable = true;
With the latest Dawn, this causes a build error:
llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp:886:24: error: no member named 'capabilities' in 'wgpu::InstanceDescriptor'
886 | instanceDescriptor.capabilities.timedWaitAnyEnable = true;
| ~~~~~~~~~~~~~~~~~~ ^
1 error generated.
I'm building on Mac. Could you suggest a workaround so I can run some tests?
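Not verified against this setup, but upstream webgpu-headers have churned this API: older revisions spelled the flag instanceDescriptor.features.timedWaitAnyEnable, and newer ones replace the struct with a required-feature list. A hedged sketch of the list-based form, assuming your local webgpu_cpp.h provides wgpu::InstanceFeatureName::TimedWaitAny:

#include <webgpu/webgpu_cpp.h>

// Sketch only: enable timed waits via the required-feature list that newer
// webgpu-headers use in place of InstanceDescriptor::capabilities.
static wgpu::Instance create_instance_with_timed_wait() {
    wgpu::InstanceFeatureName required = wgpu::InstanceFeatureName::TimedWaitAny;
    wgpu::InstanceDescriptor  desc{};
    desc.requiredFeatureCount = 1;
    desc.requiredFeatures     = &required;
    return wgpu::CreateInstance(&desc);
}

Whichever spelling compiles locally is the one your Dawn revision ships; the intent, enabling timed WaitAny on the instance, is the same.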
return false;

case GGML_OP_CPY: {
    std::lock_guard<std::mutex> lock(ctx->mutex);
This lock should be avoided. What is the reason for adding it?
params[1] = src_misalignment;
params[2] = dst_misalignment;

// Convert byte-strides to element-strides
params[3] = (uint32_t)src->nb[0]/ggml_type_size(src->type);
params[4] = (uint32_t)src->nb[1]/ggml_type_size(src->type);
params[5] = (uint32_t)src->nb[2]/ggml_type_size(src->type);
params[6] = (uint32_t)src->nb[3]/ggml_type_size(src->type);
params[7] = (uint32_t)node->nb[0]/ggml_type_size(node->type);
params[8] = (uint32_t)node->nb[1]/ggml_type_size(node->type);
params[9] = (uint32_t)node->nb[2]/ggml_type_size(node->type);
params[10] = (uint32_t)node->nb[3]/ggml_type_size(node->type);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency, all of these should either be byte offsets or element offsets, but not both.
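One way to enforce that is to derive every stride in a single helper, so the uploaded params cannot silently mix units. A hypothetical sketch (names are illustrative, not from this PR):

#include <cstdint>
#include <cstddef>

// Hypothetical helper: convert a tensor's four byte strides (ggml's nb[])
// into element strides in one place, so all shader params share one unit.
static void push_element_strides(uint32_t * params, size_t base,
                                 const size_t nb[4], size_t type_size) {
    for (int i = 0; i < 4; i++) {
        params[base + i] = (uint32_t)(nb[i] / type_size);
    }
}

With that, params[3..6] for src and params[7..10] for dst go through the same path, and the misalignment values in params[1..2] should then be expressed in elements as well.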
// TODO: Does this need to be thread safe? Is it only called once?
static ggml_backend_t ggml_backend_webgpu_device_init(ggml_backend_dev_t dev, const char * params) {
Yes, it can be called more than once. For a single device, I think this implementation should be OK for now.
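If stronger guarantees are needed later, a standard pattern is to hang the one-time setup off a function-local static, which C++11 initializes in a thread-safe way. A generic sketch with illustrative names (not this PR's code):

// Illustrative sketch: C++11 guarantees thread-safe, once-only
// initialization of local statics, so repeated device_init calls from any
// thread share a single fully-initialized global context.
struct webgpu_global_ctx {
    // wgpu::Instance, adapter, device, compiled pipelines, ...
};

static webgpu_global_ctx & webgpu_get_global_ctx() {
    static webgpu_global_ctx ctx = [] {
        webgpu_global_ctx c{};
        // one-time setup: create the instance, request adapter/device, ...
        return c;
    }();
    return ctx;
}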
Following the discussion in #7773, this is an initial PR for the WebGPU backend.
At the moment, it only supports enough to pass the CI and to run the basic matrix multiplication example in ggml: https://github.com/ggml-org/ggml/tree/master/examples/simple. I do notice that test-tokenizers-ggml-vocabs fails on my local machine (M3), but this appears to be independent of anything I've added for WebGPU.

Opening this PR in case it makes sense to integrate it into llama.cpp now, but I'm also happy to keep working on it and merge once the backend is more fully featured. I'm committed to working on it (along with a couple of others from UC Santa Cruz) as my main project over the next couple of months, and to maintaining it, but I would also appreciate collaboration, as there's a lot to do and I have some other obligations.
@tyler-utah @ggerganov @ngxson @audiovention