Depthwise 2D convolution #1152
Conversation
As an idea, could the operator call the appropriate implementation function, depending on whether the op params require dilation etc.? This way, we'd keep a single operator for depthwise conv, but it would use the appropriate implementation dynamically (for the given params). Similar to how PyTorch dispatches to different conv implementations depending on the params. Thanks!
We can do dynamic dispatch when the op is created. But in this case it would dispatch to having different ops in the graph. In other words, when dilation is required we would end up with the existing im2col + mul_mat ops instead of the new one.
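For illustration, such a dispatch could look roughly like this (a minimal sketch; the helper name and the exact parameter list of the new op are assumptions, not part of this PR):

```c
#include "ggml.h"

// Hypothetical application-side helper: keep one entry point for depthwise conv
// and pick the implementation from the parameters. ggml_conv_2d_dw is the
// existing im2col + mul_mat based op; the direct op's signature is assumed here.
static struct ggml_tensor * conv_2d_dw_auto(
        struct ggml_context * ctx,
        struct ggml_tensor  * kernel,   // a
        struct ggml_tensor  * input,    // b
        int s0, int s1, int p0, int p1, int d0, int d1) {
    if (d0 == 1 && d1 == 1) {
        // no dilation -> the new direct kernel can handle it
        return ggml_depthwise_conv_2d(ctx, kernel, input, s0, s1, p0, p1);
    }
    // dilation required -> fall back to the existing op (different nodes in the graph)
    return ggml_conv_2d_dw(ctx, kernel, input, s0, s1, p0, p1, d0, d1);
}
```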
The conv2d function already has too many parameters, and I don't think we need to merge everything together. The biggest issue is that there isn't a good way to seamlessly introduce convolution kernels that do not depend on IM2COL without introducing new ops, because it would require applications to choose which function to use depending on the backend that they are using. We also cannot let backends determine which implementation to use, because IM2COL requires more memory to store the transformed input, and some backends cannot handle that.
True, I was only talking about the existing depthwise function (ggml_conv_2d_dw). I understand the reason why it's defined as a separate function in this PR, but was trying to see if there's a way to dynamically switch between the two implementations of depthwise conv.
Yeah, that's the idea. But yes, the challenge is that we'd have to implement the new kernel on every backend. That's out of scope for this PR though, so no worries :)

I'm actually facing a very similar problem: I'm experimenting with a custom CUDA kernel for one of the existing ops and would like to dispatch to it dynamically. Not sure how to do that.

To be clear, I have no objections to this PR. I'm just trying to see if there's a way to dispatch to different implementations dynamically, without implementing it on every backend.
Ideally we would have a single conv 2D op and let each backend pick the best implementation internally; the difficulty is allocating the temporary memory that the IM2COL-based path needs.
Agreed, this definitely shouldn't be the application's responsibility. Doing this completely on the backend side is the cleaner way, and makes more sense. But like you said, temporary memory allocation is a challenge.

The other approach (which I don't like) would require the graph creation functions in ggml to know about the backend that will run the graph. But that opens up other messy situations.
It's not possible to do this in ggml_backend_sched without making it too costly.
I think the Metal backend can create a memory pool using MTLHeap and allocate temporary buffers from it - I'll take a look and try to add support for this.
The nodes would be the same (i.e. the graph itself wouldn't change); only the implementation chosen by the backend would differ.
That seemed like a good place to me: create a new graph, duplicate nodes over from the input graph and conditionally replace them with "low level" ops if they're not supported by the backend. Maintain an additional buffer in the schedule if required. Is that what you consider too costly, or am I missing something? I've also stumbled upon some more places where this would be useful.
Dealing with those substitutions individually in all backends may end up being a lot of work. By now I have some experience with the Vulkan backend, and while it's technically capable of allocating a temp buffer, doing these substitutions is... architecturally difficult. Although you could argue that it shouldn't be.
It's costly because you have to make a copy of the list of nodes. Manipulating a graph in this way would be easy: you create a new subgraph, replace some pointers, and you are done. But since we are operating on a list rather than a graph, effectively you have to duplicate the entire graph. If it were something that only needed to be done once that wouldn't be a problem, but in practice graphs can rarely be reused and need to be reconstructed on every evaluation. It would also add a significant amount of complexity to code that is already bordering on too complex, and that should be rewritten in a cleaner way before adding more complexity on top.
@slaren In ggml-org/llama.cpp#12850 I prototyped a way for Metal to allocate and use temporary buffers. I think it works correctly, and it should be possible to use it for the convolutions and for other use cases that require temporary data. At the moment it's probably not implemented in the best possible way (it requires an extra pass over the graph nodes to determine the necessary amount of temporary memory, and to reallocate the heap from which the temporary buffers are allocated if that is more than what is currently available), but it should be good enough to consider combining the convolutions under a single op type. I have a few more ideas I want to try for the Metal implementation and will look to merge it at some point in the coming days.
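Roughly, the extra pass could look like this (a sketch only; the choice of op to check and the scratch-size formula are illustrative assumptions, not the actual Metal code from that PR):

```c
#include "ggml.h"

// Walk the graph once and compute an upper bound on the temporary memory an
// im2col-based conv path would need, so the heap can be reallocated up front.
static size_t conv_tmp_mem_upper_bound(struct ggml_cgraph * gf) {
    size_t need = 0;
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        const struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op != GGML_OP_CONV_2D_DW) {  // which ops need scratch is illustrative
            continue;
        }
        const struct ggml_tensor * kernel = node->src[0];
        // im2col stores KW*KH*IC values per output element
        const size_t per_out = (size_t) kernel->ne[0]*kernel->ne[1]*kernel->ne[2];
        const size_t n_out   = (size_t) node->ne[0]*node->ne[1]*node->ne[3];
        const size_t tmp     = per_out*n_out*sizeof(float);
        if (tmp > need) {
            need = tmp;
        }
    }
    return need; // reallocate the temporary heap if this exceeds its current size
}
```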
We could merge this implementation now to avoid more merge conflicts, and over time we will work on porting the conv2d implementations to use a single op.
include/ggml.h
// a: KW KH 1 C convolution kernel
// b: W H C N input data
// res: W_out H_out C N
GGML_API struct ggml_tensor * ggml_depthwise_conv_2d(
Could we rename this to something closer to the current naming scheme? Something like ggml_conv_2d_dw_direct or similar. It would be a temporary name until we unify the different conv2d implementations into a single op.
Will do.

Is there a preference on which order to use for tensor dimensions in comments and function names? When working on ggml I find it less confusing to stick with the tensor->ne order, but things like NCHW and NHWC are commonly used to describe memory layout across frameworks and in the literature.
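For a concrete example of the two conventions (assuming an initialized ggml_context ctx; purely illustrative):

```c
// A batch of one 256x256 RGB image with the standard NCHW memory layout.
// ggml lists dimensions fastest-first in tensor->ne, so the same tensor is
//   ne = {W, H, C, N} = {256, 256, 3, 1},
// while "NCHW" in the literature lists the dimensions slowest-first.
struct ggml_tensor * img = ggml_new_tensor_4d(ctx, GGML_TYPE_F32,
        /*W =*/ 256, /*H =*/ 256, /*C =*/ 3, /*N =*/ 1);
```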
src/ggml-cpu/ops.cpp
const struct ggml_tensor * src,
const struct ggml_tensor * kernel,
struct ggml_tensor * dst,
const struct ggml_depthwise_conv_2d_params p) {
- const struct ggml_depthwise_conv_2d_params p) {
+ const ggml_depthwise_conv_2d_params & p) {
src/ggml-cpu/ops.cpp
const struct ggml_tensor * src,
const struct ggml_tensor * kernel,
struct ggml_tensor * dst,
const struct ggml_depthwise_conv_2d_params p) {
- const struct ggml_depthwise_conv_2d_params p) {
+ const ggml_depthwise_conv_2d_params & p) {
src/ggml-cpu/ops.cpp
for (int64_t dst_y = 0; dst_y < p.dst_h; ++dst_y) {
    for (int64_t dst_x = 0; dst_x < p.dst_w; ++dst_x) {

        float sum = 0.0f;
Sounds good. I could add dilation to the new kernels too - it's only a small change to the index computation, and might make the transition easier.
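For reference, the index computation in question would look roughly like this (illustrative names, not the PR's actual code):

```c
#include <stdint.h>

// With dilation, the kernel taps are spread apart by the dilation factor;
// dilation == 1 reduces to the computation the direct kernel already does.
static inline int64_t conv_src_index(int64_t dst_i, int64_t knl_i,
                                     int64_t stride, int64_t pad, int64_t dilation) {
    return dst_i*stride - pad + knl_i*dilation;
}
```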
…words, pass by ref, whitespace
src/ggml-cpu/ops.cpp
@@ -6064,6 +6064,178 @@ void ggml_compute_forward_conv_transpose_2d(
    }
}

// ggml_compute_forward_depthwise_conv_2d

struct ggml_depthwise_conv_2d_params {
I think it's preferable to also rename the symbols throughout - although not strictly necessary since these are temporary, it's mainly to keep consistency and ease fuzzy searches:
- ggml_depthwise_conv_2d_params -> ggml_conv_2d_dw_params
- GGML_OP_DEPTHWISE_CONV_2D -> GGML_OP_CONV_2D_DEPTHWISE or GGML_OP_CONV_2D_DW
- ggml_compute_forward_depthwise_conv_2d -> ggml_compute_forward_conv_2d_dw
- test-depthwise-conv2d.cpp -> test-conv2d-dw.cpp
- etc.
Done - I liked being able to search for "depthwise", but yes, consistency and finding all the places with one search is nice.
This PR adds kernels for depthwise 2D convolution (CPU only for now).
There is an existing ggml_conv_2d_dw based on im2col + mul_mat, but it has high overhead. That approach makes sense for regular conv2d since it can profit from fast GEMM, but depthwise convolution is much simpler, and I think im2col will always slow it down.

Timings (W=256, H=256, C=256)
- ggml_conv_2d_dw
- ggml_depthwise_conv_2d
- ggml_depthwise_conv_2d
Timings (W=1024, H=1024, C=3)
- ggml_conv_2d_dw
- ggml_depthwise_conv_2d
- ggml_depthwise_conv_2d
I didn't replace ggml_conv_2d_dw because it supports more backends (and dilation).
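To illustrate why a direct kernel can skip the im2col machinery entirely: depthwise conv2d is an independent 2D correlation per channel, so there is nothing for a GEMM to batch up. A minimal scalar sketch (WHCN layout, single batch element, unit stride, no padding; not the PR's actual code):

```c
// Reference depthwise conv2d: every channel is filtered with its own kw x kh kernel.
static void depthwise_conv2d_ref(
        const float * src, const float * knl, float * dst,
        int w, int h, int c, int kw, int kh) {
    const int dst_w = w - kw + 1;
    const int dst_h = h - kh + 1;
    for (int ch = 0; ch < c; ++ch) {
        for (int y = 0; y < dst_h; ++y) {
            for (int x = 0; x < dst_w; ++x) {
                float sum = 0.0f;
                for (int ky = 0; ky < kh; ++ky) {
                    for (int kx = 0; kx < kw; ++kx) {
                        sum += src[(ch*h + y + ky)*w + x + kx] *
                               knl[(ch*kh + ky)*kw + kx];
                    }
                }
                dst[(ch*dst_h + y)*dst_w + x] = sum;
            }
        }
    }
}
```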
Memory layout

Having channels/depth most contiguous in memory allows for better vectorization. It also improves memory access for im2col in regular 2D convolutions, and can avoid many costly ggml_cont(ggml_permute(...)) calls. Since the default for 2D ops in the API seems to be spatial dimensions first, this is kept in place, and the opportunity to use the channels-first kernel is detected from the strides. This could also be made more explicit.
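One way the stride-based detection could be made more explicit (a hypothetical helper, assuming that a channels-first tensor is a permuted view whose channel dimension has the smallest byte stride):

```c
#include "ggml.h"

// In ggml, ne[i] is the size and nb[i] the byte stride of dimension i.
// Default layout:       ne = {W, H, C, N}, nb[0] == element size (W contiguous).
// Channels-first view:  same ne order, but C (dimension 2) is the contiguous one.
static bool conv_2d_dw_channels_first(const struct ggml_tensor * t) {
    return t->nb[2] == ggml_type_size(t->type);
}
```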
Background

I've implemented MobileSAM (a fast SAM variant with TinyViT as the image encoder) here. Runtime was ~2.1s initially, with depthwise convolution eating a sizeable chunk. After changing the memory layout and optimizing conv2d it now runs in 570ms (PyTorch: 608ms, ONNX: 549ms).
Ryzen 5 5600X (6-core, AVX2), Windows, OpenBLAS