Depthwise 2D convolution #1152
Conversation
As an idea, could the operator call the appropriate implementation function, depending on whether the op params require dilation etc.? This way, we'd keep a single operator for depthwise conv, but it would use the appropriate implementation dynamically (for the given params). Similar to how PyTorch dispatches to different conv implementations depending on the params. Thanks!
We can do dynamic dispatch when the op is created. But in this case it would dispatch to having different ops in the graph. In other words, when dilation is required we would end up with the existing im2col + mul_mat ops instead of the new one.
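For illustration, such a dispatch could look roughly like this (a minimal sketch; the helper name and the exact parameter list of the new op are assumptions, not part of this PR):

```c
#include "ggml.h"

// Hypothetical application-side helper: keep one entry point for depthwise conv
// and pick the implementation from the parameters. ggml_conv_2d_dw is the
// existing im2col + mul_mat based op; the direct op's signature is assumed here.
static struct ggml_tensor * conv_2d_dw_auto(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel,   // a
        struct ggml_tensor  * input,    // b
        int s0, int s1, int p0, int p1, int d0, int d1) {
    if (d0 == 1 && d1 == 1) {
        // no dilation -> the new direct kernel can handle it
        return ggml_depthwise_conv_2d(ctx, kernel, input, s0, s1, p0, p1);
    }
    // dilation required -> fall back to the existing op (different nodes in the graph)
    return ggml_conv_2d_dw(ctx, kernel, input, s0, s1, p0, p1, d0, d1);
}
```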
The conv2d function already has too many parameters, and I don't think we need to merge everything together. The biggest issue is that there isn't a good way to seamlessly introduce convolution kernels that do not depend on IM2COL without introducing new ops, because it would require applications to choose which function to use depending on the backend that they are using. We also cannot let backends determine which implementation to use, because IM2COL requires more memory to store the transformed input, and some backends cannot handle that.
True, I was only talking about the existing depthwise function (ggml_conv_2d_dw). I understand the reason why it's defined as a separate function in this PR, but was trying to see if there's a way to dynamically switch between the two implementations of depthwise conv.
Yeah, that's the idea. But yes, the challenge is that we'd have to implement the new kernel on every backend. That's out of scope for this PR though, so no worries :)

I'm actually facing a very similar problem: I'm experimenting with a custom CUDA kernel for one of the existing ops and would like to dispatch to it dynamically. Not sure how to do that.

To be clear, I have no objections to this PR. I'm just trying to see if there's a way to dispatch to different implementations dynamically, without implementing it on every backend.
Ideally we would have a single conv 2D op and let each backend pick the best implementation internally; the difficulty is allocating the temporary memory that the IM2COL-based path needs.
Agreed, this definitely shouldn't be the application's responsibility. Doing this completely on the backend side is the cleaner way, and makes more sense. But like you said, temporary memory allocation is a challenge.

The other approach (which I don't like) would require the graph creation functions in ggml to know about the backend that will run the graph. But that opens up other messy situations.
It's not possible to do this in ggml_backend_sched without making it too costly.
I think the Metal backend can create a memory pool using MTLHeap and allocate temporary buffers from it - I'll take a look and try to add support for this.
The nodes would be the same (i.e. the graph itself wouldn't change); only the implementation chosen by the backend would differ.
That seemed like a good place to me: create a new graph, duplicate nodes over from the input graph and conditionally replace them with "low level" ops if they're not supported by the backend. Maintain an additional buffer in the schedule if required. Is that what you consider too costly, or am I missing something? I've also stumbled upon some more places where this would be useful.
Dealing with those substitutions individually in all backends may end up being a lot of work. By now I have some experience with the Vulkan backend, and while it's technically capable of allocating a temp buffer, doing these substitutions is... architecturally difficult. Although you could argue that it shouldn't be.
It's costly because you have to make a copy of the list of nodes. Manipulating a graph in this way would be easy: you create a new subgraph, replace some pointers, and you are done. But since we are operating on a list rather than a graph, effectively you have to duplicate the entire graph. If it were something that only needed to be done once that wouldn't be a problem, but in practice graphs can rarely be reused and need to be reconstructed on every evaluation. It would also add a significant amount of complexity to code that is already bordering on too complex, and that should be rewritten in a cleaner way before adding more complexity on top.
@slaren In ggml-org/llama.cpp#12850 I prototyped a way for Metal to allocate and use temporary buffers. I think it works correctly, and it should be possible to use it for the convolutions and for other use cases that require temporary data. At the moment it's probably not implemented in the best possible way (it requires an extra pass over the graph nodes to determine the necessary amount of temporary memory, and to reallocate the heap from which the temporary buffers are allocated if that is more than what is currently available), but it should be good enough to consider combining the convolutions under a single op type. I have a few more ideas I want to try for the Metal implementation and will look to merge it at some point in the coming days.
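Roughly, the extra pass could look like this (a sketch only; the choice of op to check and the scratch-size formula are illustrative assumptions, not the actual Metal code from that PR):

```c
#include "ggml.h"

// Walk the graph once and compute an upper bound on the temporary memory an
// im2col-based conv path would need, so the heap can be reallocated up front.
static size_t conv_tmp_mem_upper_bound(struct ggml_cgraph * gf) {
    size_t need = 0;
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        const struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op != GGML_OP_CONV_2D_DW) {  // which ops need scratch is illustrative
            continue;
        }
        const struct ggml_tensor * kernel = node->src[0];
        // im2col stores KW*KH*IC values per output element
        const size_t per_out = (size_t) kernel->ne[0]*kernel->ne[1]*kernel->ne[2];
        const size_t n_out   = (size_t) node->ne[0]*node->ne[1]*node->ne[3];
        const size_t tmp     = per_out*n_out*sizeof(float);
        if (tmp > need) {
            need = tmp;
        }
    }
    return need; // reallocate the temporary heap if this exceeds its current size
}
```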
We could merge this implementation now to avoid more merge conflicts, and over time we will work on porting the conv2d implementations to use a single op.
include/ggml.h
// a: KW KH 1 C convolution kernel
// b: W H C N input data
// res: W_out H_out C N
GGML_API struct ggml_tensor * ggml_depthwise_conv_2d(
Could we rename this to something closer to the current naming scheme? Something like ggml_conv_2d_dw_direct or similar. It would be a temporary name until we unify the different conv2d implementations into a single op.
Will do.

Is there a preference on which order to use for tensor dimensions in comments and function names? When working on ggml I find it less confusing to stick with the tensor->ne order, but things like NCHW and NHWC are commonly used to describe memory layout across frameworks and in the literature.
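For a concrete example of the two conventions (assuming an initialized ggml_context ctx; purely illustrative):

```c
// A batch of one 256x256 RGB image with the standard NCHW memory layout.
// ggml lists dimensions fastest-first in tensor->ne, so the same tensor is
//   ne = {W, H, C, N} = {256, 256, 3, 1},
// while "NCHW" in the literature lists the dimensions slowest-first.
struct ggml_tensor * img = ggml_new_tensor_4d(ctx, GGML_TYPE_F32,
        /*W =*/ 256, /*H =*/ 256, /*C =*/ 3, /*N =*/ 1);
```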
src/ggml-cpu/ops.cpp
const struct ggml_tensor * src,
const struct ggml_tensor * kernel,
struct ggml_tensor * dst,
const struct ggml_depthwise_conv_2d_params p) {
- const struct ggml_depthwise_conv_2d_params p) {
+ const ggml_depthwise_conv_2d_params & p) {
src/ggml-cpu/ops.cpp
const struct ggml_tensor * src,
const struct ggml_tensor * kernel,
struct ggml_tensor * dst,
const struct ggml_depthwise_conv_2d_params p) {
- const struct ggml_depthwise_conv_2d_params p) {
+ const ggml_depthwise_conv_2d_params & p) {
src/ggml-cpu/ops.cpp
for (int64_t dst_y = 0; dst_y < p.dst_h; ++dst_y) {
    for (int64_t dst_x = 0; dst_x < p.dst_w; ++dst_x) {

        float sum = 0.0f;
Sounds good. I could add dilation to the new kernels too - it's only a small change to the index computation, and might make the transition easier.
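For reference, the index computation in question would look roughly like this (illustrative names, not the PR's actual code):

```c
#include <stdint.h>

// With dilation, the kernel taps are spread apart by the dilation factor;
// dilation == 1 reduces to the computation the direct kernel already does.
static inline int64_t conv_src_index(int64_t dst_i, int64_t knl_i,
                                     int64_t stride, int64_t pad, int64_t dilation) {
    return dst_i*stride - pad + knl_i*dilation;
}
```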
…words, pass by ref, whitespace
src/ggml-cpu/ops.cpp
@@ -6064,6 +6064,178 @@ void ggml_compute_forward_conv_transpose_2d(
    }
}

// ggml_compute_forward_depthwise_conv_2d

struct ggml_depthwise_conv_2d_params {
I think it's preferable to also rename the symbols throughout - although not strictly necessary since these are temporary, it's mainly to keep consistency and ease fuzzy searches:
- ggml_depthwise_conv_2d_params -> ggml_conv_2d_dw_params
- GGML_OP_DEPTHWISE_CONV_2D -> GGML_OP_CONV_2D_DEPTHWISE or GGML_OP_CONV_2D_DW
- ggml_compute_forward_depthwise_conv_2d -> ggml_compute_forward_conv_2d_dw
- test-depthwise-conv2d.cpp -> test-conv2d-dw.cpp
- etc.
Done - I liked being able to search for "depthwise", but yes, consistency and finding all the places with one search is nice.
This PR adds kernels for depthwise 2D convolution (CPU only for now).
There is an existing ggml_conv_2d_dw based on im2col + mul_mat, but it has high overhead. That approach makes sense for regular conv2d since it can profit from fast GEMM, but depthwise convolution is much simpler, and I think im2col will always slow it down.

Timings (W=256, H=256, C=256)
- ggml_conv_2d_dw
- ggml_depthwise_conv_2d
- ggml_depthwise_conv_2d
Timings (W=1024, H=1024, C=3)
- ggml_conv_2d_dw
- ggml_depthwise_conv_2d
- ggml_depthwise_conv_2d
I didn't replace ggml_conv_2d_dw because it supports more backends (and dilation).
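To illustrate why a direct kernel can skip the im2col machinery entirely: depthwise conv2d is an independent 2D correlation per channel, so there is nothing for a GEMM to batch up. A minimal scalar sketch (WHCN layout, single batch element, unit stride, no padding; not the PR's actual code):

```c
// Reference depthwise conv2d: every channel is filtered with its own kw x kh kernel.
static void depthwise_conv2d_ref(
        const float * src, const float * knl, float * dst,
        int w, int h, int c, int kw, int kh) {
    const int dst_w = w - kw + 1;
    const int dst_h = h - kh + 1;
    for (int ch = 0; ch < c; ++ch) {
        for (int y = 0; y < dst_h; ++y) {
            for (int x = 0; x < dst_w; ++x) {
                float sum = 0.0f;
                for (int ky = 0; ky < kh; ++ky) {
                    for (int kx = 0; kx < kw; ++kx) {
                        sum += src[(ch*h + y + ky)*w + x + kx] *
                               knl[(ch*kh + ky)*kw + kx];
                    }
                }
                dst[(ch*dst_h + y)*dst_w + x] = sum;
            }
        }
    }
}
```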
Memory layout

Having channels/depth most contiguous in memory allows for better vectorization. It also improves memory access for im2col in regular 2D convolutions, and can avoid many costly ggml_cont(ggml_permute(...)) calls. Since the default for 2D ops in the API seems to be spatial dimensions first, this is kept in place, and the opportunity to use the channels-first kernel is detected from the strides. This could also be made more explicit.
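One way the stride-based detection could be made more explicit (a hypothetical helper, assuming that a channels-first tensor is a permuted view whose channel dimension has the smallest byte stride):

```c
#include "ggml.h"

// In ggml, ne[i] is the size and nb[i] the byte stride of dimension i.
// Default layout:       ne = {W, H, C, N}, nb[0] == element size (W contiguous).
// Channels-first view:  same ne order, but C (dimension 2) is the contiguous one.
static bool conv_2d_dw_channels_first(const struct ggml_tensor * t) {
    return t->nb[2] == ggml_type_size(t->type);
}
```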
Background

I've implemented MobileSAM (a fast SAM variant with TinyViT as the image encoder) here. Runtime was ~2.1s initially, with depthwise convolution eating a sizeable chunk. After changing the memory layout and optimizing conv2d it now runs in 570ms (PyTorch: 608ms, ONNX: 549ms).
Ryzen 5 5600X (6-core, AVX2), Windows, OpenBLAS