
stereo: performance optimization of sgm on Window-ARM64#4055

Merged
asmorkalov merged 2 commits into opencv:4.x from pratham-mcw:sgm-neon-optimization
Jan 5, 2026


@pratham-mcw
Contributor

Pull Request Readiness Checklist

  • This PR adds an ARM64 NEON intrinsics-based optimization for the computeDisparityBinarySGBM function in stereo_binary_sgbm.cpp.
  • The new implementation uses NEON vector instructions (e.g., vld1q_s16, vminq_s16, vqaddq_s16), allowing for efficient parallel computation. This is guarded under the CV_NEON macro and does not affect other platforms.
  • This change is similar to existing SSE2 optimizations for x64 and brings the same performance benefits to ARM64.
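The per-disparity update that these intrinsics vectorize can be sketched as follows. This is a minimal illustration and not the code merged in this PR: the function names (`aggregateScalar`, `aggregateNeon`, `aggregatePath`), the single-direction layout, and the padding convention (one extra element on each side of the previous-cost buffer) are assumptions made for the example. The NEON branch is compiled only on ARM targets; elsewhere the scalar reference is used, so the sketch builds on any platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
#include <arm_neon.h>
#define SKETCH_HAS_NEON 1
#else
#define SKETCH_HAS_NEON 0
#endif

// Clamp an int to the int16_t range, mirroring NEON saturating arithmetic.
static int16_t sat16(int v)
{
    return (int16_t)std::min<int>(std::max<int>(v, INT16_MIN), INT16_MAX);
}

// Scalar reference for one pixel along one path direction.
// prev points at disparity 0 of a buffer padded so that prev[-1] and prev[D] are readable.
// delta is (minimum of the previous path costs) + P2.
static int16_t aggregateScalar(const int16_t* cost, const int16_t* prev, int16_t* out,
                               int D, int16_t P1, int16_t delta)
{
    int16_t minL = INT16_MAX;
    for (int d = 0; d < D; ++d)
    {
        int best = std::min<int>(prev[d], delta);
        best = std::min<int>(best, sat16(prev[d - 1] + P1));
        best = std::min<int>(best, sat16(prev[d + 1] + P1));
        out[d] = sat16(sat16(best - delta) + cost[d]);
        minL = std::min(minL, out[d]);
    }
    return minL;
}

#if SKETCH_HAS_NEON
// Same update, eight disparities per iteration. Requires D to be a multiple of 8.
static int16_t aggregateNeon(const int16_t* cost, const int16_t* prev, int16_t* out,
                             int D, int16_t P1, int16_t delta)
{
    assert(D % 8 == 0);
    const int16x8_t vP1 = vdupq_n_s16(P1);
    const int16x8_t vDelta = vdupq_n_s16(delta);
    int16x8_t vMin = vdupq_n_s16(INT16_MAX);
    for (int d = 0; d < D; d += 8)
    {
        int16x8_t vC = vld1q_s16(cost + d);
        int16x8_t vL = vld1q_s16(prev + d);
        int16x8_t vLm1 = vld1q_s16(prev + d - 1);   // neighbours at d-1
        int16x8_t vLp1 = vld1q_s16(prev + d + 1);   // neighbours at d+1
        int16x8_t best = vminq_s16(vL, vDelta);
        best = vminq_s16(best, vqaddq_s16(vLm1, vP1));
        best = vminq_s16(best, vqaddq_s16(vLp1, vP1));
        int16x8_t vOut = vqaddq_s16(vqsubq_s16(best, vDelta), vC);
        vst1q_s16(out + d, vOut);
        vMin = vminq_s16(vMin, vOut);
    }
    int16_t lanes[8];
    vst1q_s16(lanes, vMin);                         // horizontal minimum via a store
    return *std::min_element(lanes, lanes + 8);
}
#endif

// Picks the NEON path when it is compiled in and D fits; otherwise the scalar path.
static int16_t aggregatePath(const int16_t* cost, const int16_t* prev, int16_t* out,
                             int D, int16_t P1, int16_t delta)
{
#if SKETCH_HAS_NEON
    if (D % 8 == 0)
        return aggregateNeon(cost, prev, out, D, P1, delta);
#endif
    return aggregateScalar(cost, prev, out, D, P1, delta);
}
```

Both paths use saturating 16-bit arithmetic in the same order, so they should produce identical results for the same inputs; the returned value is the minimum of the newly written costs, which the next pixel along the path would need.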

Performance Improvements:

  • The optimization significantly improves the performance of sgm on Windows ARM64 targets.
  • The table below shows timing comparisons before and after the optimization:
[image: timing comparison table, not captured in this text]

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on code under GPL or another license that is incompatible with OpenCV.
  • The PR is proposed to the proper branch

@pratham-mcw changed the title from "stereo: performance optimization of stereo on Window-ARM64" to "stereo: performance optimization of sgm on Window-ARM64" on Dec 19, 2025
@asmorkalov
Contributor

The modified code is not called in the existing perf tests. Please extend perf test coverage.

@pratham-mcw
Contributor Author

> The modified code is not called in the existing perf tests. Please extend perf test coverage.

Hi @asmorkalov,
Thanks for the feedback! I've added debug statements in the added NEON code to verify the code execution.

#elif CV_NEON
    if ( useSIMD )
    {
        printf("NEON is supported\n");
        int16x8_t vP1 = vdupq_n_s16((short)P1);
        int16x8_t vDelta0 = vdupq_n_s16((short)delta0);
        int16x8_t vDelta1 = vdupq_n_s16((short)delta1);
        int16x8_t vDelta2 = vdupq_n_s16((short)delta2);
        int16x8_t vDelta3 = vdupq_n_s16((short)delta3);
        int16x8_t vMinL0 = vdupq_n_s16((short)MAX_COST);
        int16x8_t vCpd, vL0, vL1, vL2, vL3, vL0m1, vL0p1;
        int16x8_t vL1m1, vL1p1, vL2m1, vL2p1, vL3m1, vL3p1;
        for ( d = 0; d < D; d += 8 )
        {
            vCpd = vld1q_s16(Cp + d);
            // ... rest of the NEON implementation
        }
    }
#endif

When running opencv_perf_stereo.exe, the debug statements are printed, confirming that the NEON-optimized path is being executed by the existing performance tests.
I've attached the log file containing the output of opencv_perf_stereo.exe: Steroe_Perf_Output.txt

@asmorkalov
Contributor

Perf results for my Jetson board:

Geometric mean (ms)

                             Name of Test                              4.x-1  patch-1  patch-1 vs 4.x-1 (x-factor)
bm_perf::s_bm::(320x240, 8UC1, CV_8U)                                  7.073   6.840     1.03   
bm_perf::s_bm::(512x383, 8UC1, CV_8U)                                  18.183 17.796     1.02   
census_sparse_descriptor::descript_params::(127x61, 8UC1, CV_32S)      0.037   0.036     1.04   
census_sparse_descriptor::descript_params::(127x61, 8UC1, UNKNOWN)     0.037   0.035     1.03   
census_sparse_descriptor::descript_params::(640x480, 8UC1, CV_32S)     1.105   1.102     1.00   
census_sparse_descriptor::descript_params::(640x480, 8UC1, UNKNOWN)    1.107   1.104     1.00   
census_sparse_descriptor::descript_params::(1280x720, 8UC1, CV_32S)    3.294   3.305     1.00   
census_sparse_descriptor::descript_params::(1280x720, 8UC1, UNKNOWN)   3.293   3.297     1.00   
census_sparse_descriptor::descript_params::(1920x1080, 8UC1, CV_32S)   7.381   7.385     1.00   
census_sparse_descriptor::descript_params::(1920x1080, 8UC1, UNKNOWN)  7.383   7.384     1.00   
center_symetric_census::descript_params::(127x61, 8UC1, CV_32S)        0.043   0.043     1.00   
center_symetric_census::descript_params::(127x61, 8UC1, UNKNOWN)       0.044   0.043     1.02   
center_symetric_census::descript_params::(640x480, 8UC1, CV_32S)       1.330   1.319     1.01   
center_symetric_census::descript_params::(640x480, 8UC1, UNKNOWN)      1.319   1.321     1.00   
center_symetric_census::descript_params::(1280x720, 8UC1, CV_32S)      3.939   3.982     0.99   
center_symetric_census::descript_params::(1280x720, 8UC1, UNKNOWN)     3.942   3.945     1.00   
center_symetric_census::descript_params::(1920x1080, 8UC1, CV_32S)     8.857   8.869     1.00   
center_symetric_census::descript_params::(1920x1080, 8UC1, UNKNOWN)    8.851   8.859     1.00   
modified_census_transform::descript_params::(127x61, 8UC1, CV_32S)     0.118   0.119     1.00   
modified_census_transform::descript_params::(127x61, 8UC1, UNKNOWN)    0.118   0.119     0.99   
modified_census_transform::descript_params::(640x480, 8UC1, CV_32S)    4.780   4.787     1.00   
modified_census_transform::descript_params::(640x480, 8UC1, UNKNOWN)   4.770   4.818     0.99   
modified_census_transform::descript_params::(1280x720, 8UC1, CV_32S)   14.365 14.522     0.99   
modified_census_transform::descript_params::(1280x720, 8UC1, UNKNOWN)  14.355 14.496     0.99   
modified_census_transform::descript_params::(1920x1080, 8UC1, CV_32S)  32.361 32.632     0.99   
modified_census_transform::descript_params::(1920x1080, 8UC1, UNKNOWN) 32.331 32.602     0.99   
sgm_perf::s_bm::(320x240, 8UC1, CV_8U)                                 23.401 14.695     1.59   
sgm_perf::s_bm::(320x240, 8UC1, CV_16S)                                23.392 14.567     1.61   
sgm_perf::s_bm::(512x283, 8UC1, CV_8U)                                 44.305 27.594     1.61   
sgm_perf::s_bm::(512x283, 8UC1, CV_16S)                                44.492 28.086     1.58   
star_census_transform::descript_params::(127x61, 8UC1, CV_32S)         0.041   0.041     1.00   
star_census_transform::descript_params::(127x61, 8UC1, UNKNOWN)        0.041   0.041     1.01   
star_census_transform::descript_params::(640x480, 8UC1, CV_32S)        1.296   1.294     1.00   
star_census_transform::descript_params::(640x480, 8UC1, UNKNOWN)       1.296   1.294     1.00   
star_census_transform::descript_params::(1280x720, 8UC1, CV_32S)       3.893   3.892     1.00   
star_census_transform::descript_params::(1280x720, 8UC1, UNKNOWN)      3.885   3.883     1.00   
star_census_transform::descript_params::(1920x1080, 8UC1, CV_32S)      8.740   8.744     1.00   
star_census_transform::descript_params::(1920x1080, 8UC1, UNKNOWN)     8.739   8.739     1.00   
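
The x-factor column appears to be the baseline time divided by the patched time (for example, 23.401 / 14.695 ≈ 1.59 for the first sgm_perf row). The sketch below recomputes it for the four sgm_perf rows and combines them with a geometric mean. The `Timing` struct and the function names are made up for this example, and the combined figure is an illustrative aggregate, not a number reported in the thread.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One table row: geometric-mean time in ms before (4.x-1) and after (patch-1).
struct Timing
{
    double baselineMs;
    double patchedMs;
};

// x-factor, assuming it is defined as baseline / patched (values > 1 mean faster).
double xFactor(const Timing& t)
{
    return t.baselineMs / t.patchedMs;
}

// Geometric mean of the per-row x-factors, computed in log space.
double geometricMeanSpeedup(const std::vector<Timing>& rows)
{
    assert(!rows.empty());
    double logSum = 0.0;
    for (const Timing& t : rows)
        logSum += std::log(xFactor(t));
    return std::exp(logSum / static_cast<double>(rows.size()));
}

// The four sgm_perf rows copied from the table above.
std::vector<Timing> sgmRows()
{
    return {
        {23.401, 14.695},  // 320x240, CV_8U
        {23.392, 14.567},  // 320x240, CV_16S
        {44.305, 27.594},  // 512x283, CV_8U
        {44.492, 28.086},  // 512x283, CV_16S
    };
}
```

Calling `geometricMeanSpeedup(sgmRows())` gives roughly 1.6, in line with the per-row values of 1.58 to 1.61, while the other tests in the table stay within a few percent of 1.00.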

@asmorkalov self-requested a review January 5, 2026 10:25
@asmorkalov self-assigned this Jan 5, 2026
@asmorkalov merged commit b545e2f into opencv:4.x Jan 5, 2026
35 of 37 checks passed