Add Deformable Convolution 2D (deform_conv2d) Support #3292
Summary
This PR adds support for Deformable Convolution v2 (DCNv2) to Candle, implementing the operation across CPU, Metal, and CUDA backends. The implementation follows the torchvision reference and maintains numerical consistency with PyTorch.
Motivation
Deformable Convolution augments a standard convolution with learned per-position sampling offsets, giving the operation an adaptive receptive field. It is widely used in:
References:
API
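As a rough sketch of the parameter surface (only the `ParamsDeformConv2D` and `Tensor::deform_conv2d()` names are taken from this PR; the fields and defaults shown here are illustrative assumptions, not the actual Candle API):

```rust
// Hypothetical sketch of the parameter struct. Field names and defaults are
// assumptions for illustration; only the struct name comes from the PR.
#[derive(Debug, Clone)]
struct ParamsDeformConv2D {
    stride: usize,
    padding: usize,
    dilation: usize,
    // Offset groups let different channel groups use independent offsets,
    // mirroring torchvision's `offset_groups` semantics (assumed here).
    offset_groups: usize,
}

impl Default for ParamsDeformConv2D {
    fn default() -> Self {
        Self { stride: 1, padding: 0, dilation: 1, offset_groups: 1 }
    }
}

fn main() {
    let p = ParamsDeformConv2D::default();
    println!("{p:?}");
}
```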
Implementation Details
Architecture
The implementation follows Candle's existing `conv2d` pattern: a `deform_im2col` kernel generates the columns matrix with deformable sampling, which is then combined with the weights via a matmul; each backend provides its own `deform_im2col` implementation to generate the columns.
Backend Support
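As a rough, CPU-only illustration of the im2col-plus-matmul decomposition (this is a plain convolution, i.e. all deformable offsets fixed at zero, single channel, stride 1, no padding; the real kernel additionally shifts each sampling position by a learned offset and interpolates):

```rust
// Simplified im2col: one kernel-tap row per (ki, kj), one column per output
// position. The deformable variant would add a learned (dy, dx) offset to
// (oi + ki, oj + kj) and bilinearly interpolate instead of indexing directly.
fn im2col(input: &[f32], h: usize, w: usize, kh: usize, kw: usize) -> Vec<f32> {
    let (oh, ow) = (h - kh + 1, w - kw + 1);
    let mut cols = vec![0.0; kh * kw * oh * ow];
    for ki in 0..kh {
        for kj in 0..kw {
            for oi in 0..oh {
                for oj in 0..ow {
                    cols[(ki * kw + kj) * oh * ow + oi * ow + oj] =
                        input[(oi + ki) * w + (oj + kj)];
                }
            }
        }
    }
    cols
}

fn main() {
    let (h, w, kh, kw) = (3, 3, 2, 2);
    let input: Vec<f32> = (0..(h * w)).map(|v| v as f32).collect();
    // One output channel: a 2x2 kernel picking the top-left and bottom-right taps.
    let weight = [1.0f32, 0.0, 0.0, 1.0];
    let cols = im2col(&input, h, w, kh, kw);
    let (oh, ow) = (h - kh + 1, w - kw + 1);
    // Output = weight (1 x kh*kw) . cols (kh*kw x oh*ow).
    let out: Vec<f32> = (0..oh * ow)
        .map(|c| (0..kh * kw).map(|r| weight[r] * cols[r * oh * ow + c]).sum())
        .collect();
    println!("{out:?}"); // in[oi][oj] + in[oi+1][oj+1] at each output position
}
```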
Key Algorithm: Bilinear Interpolation
Since offsets are floating-point values, bilinear interpolation is used to sample from the input feature map:
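A minimal sketch of that sampling step (zero contribution from out-of-bounds taps is assumed here, following the torchvision reference behaviour):

```rust
// Bilinear sampling at a fractional position (y, x), as produced by adding a
// learned offset to an integer grid location: blend the four surrounding
// pixels, weighting by the fractional parts. Out-of-bounds taps read as zero.
fn bilinear_sample(map: &[f32], h: i64, w: i64, y: f32, x: f32) -> f32 {
    let (y0, x0) = (y.floor() as i64, x.floor() as i64);
    let (dy, dx) = (y - y0 as f32, x - x0 as f32);
    let at = |yy: i64, xx: i64| -> f32 {
        if yy < 0 || xx < 0 || yy >= h || xx >= w { 0.0 }
        else { map[(yy * w + xx) as usize] }
    };
    at(y0, x0) * (1.0 - dy) * (1.0 - dx)
        + at(y0, x0 + 1) * (1.0 - dy) * dx
        + at(y0 + 1, x0) * dy * (1.0 - dx)
        + at(y0 + 1, x0 + 1) * dy * dx
}

fn main() {
    // 2x2 map; sampling at the centre (0.5, 0.5) averages all four values.
    let map = [1.0f32, 2.0, 3.0, 4.0];
    let v = bilinear_sample(&map, 2, 2, 0.5, 0.5);
    println!("{v}"); // prints 2.5
}
```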
CUDA Half-Precision Handling
For `__half` and `__nv_bfloat16` types, intermediate calculations are performed in `float` to avoid type-ambiguity issues with CUDA's constructor overloads. This follows the same pattern used in `upsample_bilinear2d`.
Files Changed
candle-core
- `src/conv.rs` - Added `ParamsDeformConv2D` struct and `Tensor::deform_conv2d()` method
- `src/backend.rs` - Added `deform_conv2d` to `BackendStorage` trait
- `src/storage.rs` - Added storage layer dispatch
- `src/cpu_backend/mod.rs` - CPU backend implementation
- `src/cpu_backend/deform_conv2d.rs` - CPU kernel implementation (new file)
- `src/metal_backend/mod.rs` - Metal backend implementation
- `src/cuda_backend/mod.rs` - CUDA backend implementation
- `src/dummy_cuda_backend.rs` - Dummy CUDA backend stub
- `src/dummy_metal_backend.rs` - Dummy Metal backend stub
- `tests/deform_conv2d_tests.rs` - Comprehensive test suite (new file)
- `benches/benchmarks/deform_conv2d.rs` - Performance benchmark (new file)

candle-kernels
- `src/deform_conv2d.cu` - CUDA kernel implementation (new file)
- `src/lib.rs` - Kernel registration

candle-metal-kernels
- `src/metal_src/deform_conv2d.metal` - Metal shader implementation (new file)
- `src/kernels/deform_conv2d.rs` - Metal kernel bindings (new file)
- `src/kernels/mod.rs` - Module registration
- `src/kernel.rs` - Kernel registration
- `src/lib.rs` - Export registration
- `src/source.rs` - Source registration

Testing
Test Cases
The test suite includes:
Numerical Consistency
Test data was generated using PyTorch/torchvision:
All tests pass with max absolute error < 1e-4.
Test Results
CPU Tests
Metal Tests
CUDA Tests
Usage Example
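One thing a caller has to get right is the shape bookkeeping: the offset tensor needs one `(dy, dx)` pair per kernel tap per offset group, and the optional mask one value per tap per group. A small sketch of that arithmetic (the channel layout shown follows the torchvision convention and is an assumption here):

```rust
// Standard convolution output-size formula, shared by deform_conv2d.
fn conv_out_dim(size: usize, k: usize, pad: usize, stride: usize, dilation: usize) -> usize {
    (size + 2 * pad - dilation * (k - 1) - 1) / stride + 1
}

fn main() {
    let (h, w) = (32, 32);
    let (kh, kw) = (3, 3);
    let (pad, stride, dilation) = (1, 1, 1);
    let offset_groups = 1;

    let oh = conv_out_dim(h, kh, pad, stride, dilation);
    let ow = conv_out_dim(w, kw, pad, stride, dilation);
    // With padding 1 and stride 1, a 3x3 kernel preserves spatial size.
    assert_eq!((oh, ow), (32, 32));

    // Assumed torchvision-style layout: offset carries a (dy, dx) pair per
    // kernel tap per offset group; mask carries one scalar per tap per group.
    let offset_channels = 2 * offset_groups * kh * kw;
    let mask_channels = offset_groups * kh * kw;
    println!("offset: [N, {offset_channels}, {oh}, {ow}], mask: [N, {mask_channels}, {oh}, {ow}]");
}
```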
Limitations
Performance Benchmarks
Benchmarks were run using `cargo bench --bench bench_main -- deform_conv2d`.
CPU Performance
Config: `[1, 64, 32, 32]` input, `[64, 64, 3, 3]` weight, with mask
Metal Performance (Apple M4 Pro)
Config: `[1, 256, 64, 64]` input, `[256, 256, 3, 3]` weight, with mask
CUDA Performance (NVIDIA RTX A6000)
Config: `[1, 256, 64, 64]` input, `[256, 256, 3, 3]` weight, with mask
Checklist