
Depthwise 2D convolution #1152


Merged · 5 commits into ggml-org:master · Apr 17, 2025

Conversation

@Acly (Contributor) commented Mar 20, 2025

This PR adds kernels for depthwise 2D convolution (CPU only for now).

There is an existing ggml_conv_2d_dw based on im2col + mul_mat, but it has high overhead. That approach makes sense for regular conv2d, which can benefit from fast GEMM, but depthwise convolution is much simpler, and I think im2col will always slow it down.
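To make the comparison concrete, here is a minimal scalar sketch (not the PR's kernels) of what a direct depthwise 2D convolution computes; the layout, names, and single stride/pad parameters are simplifying assumptions:

```c
// Naive direct depthwise conv: each channel is convolved with its own
// KW x KH filter, so there is no cross-channel GEMM for im2col to speed up.
static void depthwise_conv_2d_naive(
        const float * src,   // W x H x C input, w fastest (WHC)
        const float * knl,   // KW x KH x C kernel
        float       * dst,   // W_out x H_out x C output
        int w, int h, int c,
        int kw, int kh,
        int stride, int pad) {
    const int w_out = (w + 2*pad - kw) / stride + 1;
    const int h_out = (h + 2*pad - kh) / stride + 1;
    for (int ic = 0; ic < c; ++ic) {
        for (int oy = 0; oy < h_out; ++oy) {
            for (int ox = 0; ox < w_out; ++ox) {
                float sum = 0.0f;
                for (int ky = 0; ky < kh; ++ky) {
                    const int sy = oy*stride + ky - pad;
                    if (sy < 0 || sy >= h) continue;
                    for (int kx = 0; kx < kw; ++kx) {
                        const int sx = ox*stride + kx - pad;
                        if (sx < 0 || sx >= w) continue;
                        sum += knl[(ic*kh + ky)*kw + kx] * src[(ic*h + sy)*w + sx];
                    }
                }
                dst[(ic*h_out + oy)*w_out + ox] = sum;
            }
        }
    }
}
```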

Timings (W=256, H=256, C=256)

| Method | Layout | Time |
| --- | --- | --- |
| ggml_conv_2d_dw | WHCN | 320 ms ± 25 |
| ggml_depthwise_conv_2d | WHCN | 25 ms ± 5 |
| ggml_depthwise_conv_2d | CWHN | 8 ms ± 0.5 |

Timings (W=1024, H=1024, C=3)

| Method | Layout | Time |
| --- | --- | --- |
| ggml_conv_2d_dw | WHCN | 54.6 ms ± 5 |
| ggml_depthwise_conv_2d | WHCN | 8.4 ms ± 2 |
| ggml_depthwise_conv_2d | CWHN | 5.2 ms ± 1 |

I didn't replace ggml_conv_2d_dw because it supports more backends (and dilation).

Memory layout

Having channels/depth innermost (most contiguous) in memory allows for better vectorization. It also improves memory access for im2col in regular 2D convolutions, and can avoid many costly ggml_cont(ggml_permute(...)) calls. Since the API default for 2D ops appears to be spatial-dimensions-first, that is kept in place, and the opportunity to use the channels-first kernel is detected from the tensor strides. This could also be made more explicit.
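As an illustration of that detection, a sketch of what the stride check might look like (the helper name is an assumption; per ggml convention, ne = {W, H, C, N} and nb[i] is the byte stride of dimension i):

```c
#include "ggml.h"

// A tensor with logical dims {W, H, C, N} is channels-first (CWHN) in memory
// when the channel stride equals the element size and W/H strides are larger.
static bool is_contiguous_channels(const struct ggml_tensor * t) {
    return t->nb[0] > t->nb[2]                    // W stride > C stride
        && t->nb[1] > t->nb[0]                    // H stride > W stride
        && t->nb[2] == ggml_type_size(t->type);   // C is the innermost dimension
}
```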

Background

I've implemented MobileSAM (a fast SAM variant with TinyViT as image encoder) here. Runtime was initially ~2.1 s, with depthwise convolution eating a sizeable chunk. After changing the memory layout and optimizing conv2d, it now runs in 570 ms (PyTorch: 608 ms, ONNX: 549 ms).

Ryzen 5 5600X (6 cores, AVX2), Windows, OpenBLAS

@cmdr2 (Collaborator) commented Apr 9, 2025

As an idea, could the operator call the appropriate implementation function, depending on whether the op params require dilation etc?

This way, we'd keep a single operator for depthwise conv, but it would use the appropriate implementation dynamically (for the given params).

Similar to how PyTorch dispatches to different conv implementations depending on the params. Thanks!

@ggerganov (Member):

> As an idea, could the operator call the appropriate implementation function, depending on whether the op params require dilation etc?

We can do dynamic dispatch at the ggml_compute_forward_... level (e.g. ggml_compute_forward_unary is something very similar).

But in this case it would dispatch to having different ops in the graph. In other words, when dilation is required we will have GGML_OP_IM2COL + GGML_MUL_MAT nodes and if it is not required - we will have GGML_OP_DEPTHWISE_CONV_2D nodes. It might be fine, but not 100% sure atm, as I don't think we have other ops that do this currently.
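A hypothetical sketch of that build-time dispatch (the wrapper name is invented, and ggml_depthwise_conv_2d's signature is assumed to mirror ggml_conv_2d_dw's):

```c
// Emit either a single direct depthwise node, or the im2col + mul_mat
// subgraph, depending on whether dilation is actually requested.
struct ggml_tensor * ggml_conv_2d_dw_auto(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,    // kernel
        struct ggml_tensor  * b,    // input
        int s0, int s1,             // stride
        int p0, int p1,             // padding
        int d0, int d1) {           // dilation
    if (d0 == 1 && d1 == 1) {
        // one GGML_OP_DEPTHWISE_CONV_2D node
        return ggml_depthwise_conv_2d(ctx, a, b, s0, s1, p0, p1);
    }
    // GGML_OP_IM2COL + GGML_OP_MUL_MAT nodes
    return ggml_conv_2d_dw(ctx, a, b, s0, s1, p0, p1, d0, d1);
}
```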

@slaren (Member) commented Apr 9, 2025

The conv2d function already has too many parameters; I don't think we need to merge everything together. The biggest issue is that there isn't a seamless way to introduce convolution kernels that do not depend on IM2COL without introducing new ops, because it would require applications to choose which function to use depending on the backend that they are using. We also cannot let backends determine which implementation to use, because IM2COL requires more memory to store the transformed kernel, and some backends cannot handle that.

@cmdr2 (Collaborator) commented Apr 9, 2025

> The conv2d function already has too many parameters; I don't think we need to merge everything together.

True, I was only talking about the existing depthwise function (ggml_conv_2d_dw), not the broader ggml_conv_2d function, because the current depthwise conv and the proposed one are basically just different implementations of the same operation.

I understand the reason why it's defined as a different function in this PR, but was trying to see if there's a way to dynamically switch between the two implementations of depthwise conv.

> In other words, when dilation is required we will have GGML_OP_IM2COL + GGML_MUL_MAT nodes and if it is not required - we will have GGML_OP_DEPTHWISE_CONV_2D nodes

Yeah, that's the idea. But yes, the challenge is that we'll have to implement GGML_OP_DEPTHWISE_CONV_2D in all the backends, to avoid breaking backends that already use ggml_conv_2d_dw without dilation.

But yeah that's out-of-scope for this PR, so no worries :)

--

I'm actually facing a very similar problem. I'm experimenting with a custom CUDA kernel for conv_2d, which seems to be decently faster than im2col+mm (pending a lot more testing). I implemented a new CONV_2D operator for this. But I want it to continue using im2col+mm for the other backends, to allow ggml_conv_2d to continue working on them.

Not sure how to do that.

--

To be clear, I've no objections to this PR. Just trying to see if there's a way to dispatch to different implementations dynamically, without implementing it on every backend.

@slaren (Member) commented Apr 9, 2025

Ideally we would have one GGML_OP_CONV_2D operation, and it would be the responsibility of backends to implement it in any way they want. The main issue is that some backends, most importantly Metal, do not currently have a way to allocate a temporary buffer to store the result of an IM2COL operation. Nonetheless, it may be better to fix this once and for all, and temporarily break Metal support, rather than going through the mess of forcing applications to deal with this in their code.

@cmdr2 (Collaborator) commented Apr 9, 2025

Agreed, this definitely shouldn't be the application's responsibility.

Doing this completely on the backend side is the cleaner way, and makes more sense. But like you said, temporary memory allocation is a challenge.

The other approach (which I don't like) will require the graph creation functions in ggml.c to know about the backend and check whether the backend supports the pure operation, and use im2col+mm as a fallback.

But that opens up other messy situations, because test-backend-ops will fail since the number of graph nodes will be different for different backends (for that operator function).

@slaren (Member) commented Apr 9, 2025

> The other approach will require the graph creation functions in ggml.c to know about the backend and check whether the backend supports the pure operation, and use im2col+mm as a fallback.

It's not possible to do this in ggml.c, that information is not available at that point. At best it could be done in ggml_backend_sched, but ggml graphs are not really graphs, they are lists of operations, and manipulating them in this way is too costly.

@ggerganov (Member):

I think the Metal backend can create a memory pool using MTLHeap and allocate temporary buffers from it - will take a look and try to add support for this.

> But that opens up other messy situations, because test-backend-ops will fail since the number of graph nodes will be different for different backends (for that operator function).

The nodes would be the same (i.e. GGML_OP_CONV_2D), but the underlying implementation will be able to call 2 kernels for example (im2col into a temp buffer and then mul_mat using that temp buffer). From the graph PoV, everything would be the same. @slaren correct?
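In pseudocode, such a backend-internal implementation could look like the following; every helper here (pool allocation, im2col and mul_mat launches) is hypothetical, and only the two-kernel structure is the point:

```c
// One GGML_OP_CONV_2D node in the graph, two kernel launches in the backend.
static void backend_compute_conv_2d(struct backend_ctx * ctx, struct ggml_tensor * dst) {
    struct ggml_tensor * knl = dst->src[0];   // convolution weights
    struct ggml_tensor * inp = dst->src[1];   // input activations

    // kernel 1: unfold the input into a temporary buffer from the backend's pool
    void * tmp = backend_pool_alloc(ctx, im2col_nbytes(inp, knl));
    backend_launch_im2col(ctx, inp, knl, tmp);

    // kernel 2: GEMM between the unfolded input and the weights
    backend_launch_mul_mat(ctx, knl, tmp, dst);

    backend_pool_free(ctx, tmp);
}
```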

@slaren (Member) commented Apr 9, 2025

> @slaren correct?

Yes, but I think what @cmdr2 was suggesting was creating different graphs in the ggml_conv_2d etc. functions depending on the backend being used, but as pointed out, that's not really possible, and it would introduce a whole new list of problems.

@Acly (Contributor, Author) commented Apr 9, 2025

> At best it could be done in ggml_backend_sched, but ggml graphs are not really graphs, they are lists of operations, and manipulating them in this way is too costly.

That seemed like a good place to me: Create a new graph, dup nodes over from the input graph and conditionally replace them with "low level" OPs if they're not supported by the backend. Maintain an additional buffer in the schedule if required. Is that what you consider too costly, or am I missing something?

Some more places where this would be useful that I've stumbled upon:

  • GGML_OP_WIN_PART and GGML_OP_WIN_UNPART, currently CPU-only, but they can be substituted with a combination of view/pad/reshape/permute, which work in many backends
  • Things like batch_norm_2d, which is just sub+div+mul+add (see the sketch after this list), but is used a lot in some vision transformers, so a fused kernel helps
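For the batch_norm_2d case, a sketch of the unfused form using existing ggml elementwise ops (the wrapper is illustrative; the per-channel parameters are assumed to be shaped {1, 1, C, 1} so ggml's broadcasting applies, with sqrt(var + eps) precomputed):

```c
// y = ((x - mean) / sqrt(var + eps)) * gamma + beta, per channel
static struct ggml_tensor * batch_norm_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,      // W x H x C x N activations
        struct ggml_tensor  * mean,   // 1 x 1 x C
        struct ggml_tensor  * std,    // 1 x 1 x C, precomputed sqrt(var + eps)
        struct ggml_tensor  * gamma,  // 1 x 1 x C
        struct ggml_tensor  * beta) { // 1 x 1 x C
    x = ggml_sub(ctx, x, mean);   // broadcasts over W, H, N
    x = ggml_div(ctx, x, std);
    x = ggml_mul(ctx, x, gamma);
    return ggml_add(ctx, x, beta);
}
```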

Dealing with those substitutions individually in all backends may end up being a lot of work. By now I have some experience with the Vulkan backend, and while it's technically capable of allocating a temp buffer, doing these substitutions is... architecturally difficult. Although you could argue that it shouldn't be.

@slaren (Member) commented Apr 9, 2025

> That seemed like a good place to me: Create a new graph, dup nodes over from the input graph and conditionally replace them with "low level" OPs if they're not supported by the backend. Maintain an additional buffer in the schedule if required. Is that what you consider too costly, or am I missing something?

It's costly because you have to make a copy of the list of nodes. Manipulating a graph in this way would be easy: you could create a new subgraph, replace some pointers, and be done. But since we are operating on a list rather than a graph, you effectively have to duplicate the entire graph. If it were something that only needed to be done once, that wouldn't be a problem, but in practice graphs can rarely be reused, and need to be reconstructed on every evaluation.

It would also add a significant amount of complexity to code that is already bordering on too complex, and should be rewritten in a more clean way before adding more complexity on top.

@Acly force-pushed the depthwise-conv-2d branch from 0d5d3df to 352c9c0 on April 9, 2025
@ggerganov (Member):

@slaren In ggml-org/llama.cpp#12850 I prototyped a way for Metal to allocate and use temporary buffers. I think it works correctly and should be possible to use it for the convolutions and for other use cases that require temporary data. Atm, it's probably not implemented in the best possible way (it requires an extra pass over the graph nodes to determine the necessary amount of temporary memory and reallocate the heap from which the temporary buffers are allocated, if it's more than what is currently available), but it should be good enough to consider combining the convolutions under a single op type. I have a few more ideas I want to try for the Metal implementation and will look to merge it at some point in the next days.

@slaren (Member) left a review comment:

We could merge this implementation now to avoid more merge conflicts, and over time we will work on porting the conv2d implementations to use a single op.

include/ggml.h (outdated):

```c
// a: KW KH 1 C convolution kernel
// b: W H C N input data
// res: W_out H_out C N
GGML_API struct ggml_tensor * ggml_depthwise_conv_2d(
```
Member:

Could we rename this to something closer to the current naming scheme? Something like ggml_conv_2d_dw_direct or similar. It would be a temporary name until we unify the different conv2d implementations into a single op.

@Acly (Author):

Will do.

Is there a preference for which order to use for tensor dimensions in comments and function names? When working on ggml I find it less confusing to stick with tensor->ne order, but things like NCHW and NHWC are commonly used to describe memory layout across frameworks and in the literature.

```c
        const struct ggml_tensor * src,
        const struct ggml_tensor * kernel,
        struct ggml_tensor * dst,
        const struct ggml_depthwise_conv_2d_params p) {
```
Member:

Suggested change:

```diff
-        const struct ggml_depthwise_conv_2d_params p) {
+        const ggml_depthwise_conv_2d_params & p) {
```

```c
        const struct ggml_tensor * src,
        const struct ggml_tensor * kernel,
        struct ggml_tensor * dst,
        const struct ggml_depthwise_conv_2d_params p) {
```
Member:

Suggested change:

```diff
-        const struct ggml_depthwise_conv_2d_params p) {
+        const ggml_depthwise_conv_2d_params & p) {
```

```c
    for (int64_t dst_y = 0; dst_y < p.dst_h; ++dst_y) {
        for (int64_t dst_x = 0; dst_x < p.dst_w; ++dst_x) {

            float sum = 0.0f;
```
Member:

Suggested change:

```diff
-float sum = 0.0f;
+float sum = 0.0f;
```

@Acly (Author) commented Apr 15, 2025

> We could merge this implementation now to avoid more merge conflicts, and over time we will work on porting the conv2d implementations to use a single op.

Sounds good. I could add dilation to the new kernels too - it's only a small change to the index computation, and might make the transition easier.
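For reference, a sketch of that index computation with dilation folded in (variable names are illustrative, following the loop structure from the review snippet above):

```c
// Inner accumulation for one output pixel (dst_x, dst_y) of one channel;
// dilation spaces out the kernel taps over the input.
float sum = 0.0f;
for (int64_t knl_y = 0; knl_y < knl_h; ++knl_y) {
    const int64_t src_y = dst_y*stride_y + knl_y*dilation_y - pad_y;
    if (src_y < 0 || src_y >= src_h) continue;
    for (int64_t knl_x = 0; knl_x < knl_w; ++knl_x) {
        const int64_t src_x = dst_x*stride_x + knl_x*dilation_x - pad_x;
        if (src_x < 0 || src_x >= src_w) continue;
        sum += knl_data[knl_y*knl_w + knl_x] * src_data[src_y*src_w + src_x];
    }
}
```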

```c
@@ -6064,6 +6064,178 @@ void ggml_compute_forward_conv_transpose_2d(
    }
}

// ggml_compute_forward_depthwise_conv_2d

struct ggml_depthwise_conv_2d_params {
```
Member:

I think it's preferable to also rename the symbols throughout - although not really necessary as these are temporary, it's mainly to keep consistency and ease fuzzy searches:

  • ggml_depthwise_conv_2d_params -> ggml_conv_2d_dw_params
  • GGML_OP_DEPTHWISE_CONV_2D -> GGML_OP_CONV_2D_DEPTHWISE or GGML_OP_CONV_2D_DW
  • ggml_compute_forward_depthwise_conv_2d -> ggml_compute_forward_conv_2d_dw
  • test-depthwise-conv2d.cpp -> test-conv2d-dw.cpp
  • etc.

@Acly (Author):

Done - I liked being able to search for "depthwise", but yes, consistency and finding all occurrences with one search is nice.

@ggerganov merged commit eb22d6d into ggml-org:master on Apr 17, 2025 · 3 checks passed