md5: Add optimized AArch64 assembly implementation #749

mjc · 2025-10-28T22:34:57Z

Summary

Adds a hand-tuned AArch64 assembly implementation of MD5 that is 7-8% faster than the existing software implementation across all benchmarked buffer sizes.

Performance Results (Apple M1)

ARM64 Assembly vs Software:

md5_10: 714 MB/s vs 666 MB/s (+48 MB/s, +7.2%)
md5_100: 694 MB/s vs 645 MB/s (+49 MB/s, +7.6%)
md5_1000: 702 MB/s vs 651 MB/s (+51 MB/s, +7.8%)
md5_10000: 704 MB/s vs 653 MB/s (+51 MB/s, +7.8%)

Benchmarked on Apple M1 processor
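
The percentages above follow directly from the throughput figures. A minimal sketch of the arithmetic, where `gain_percent` and `report` are hypothetical helper names and not part of the PR:

```rust
/// Relative throughput gain of the assembly backend over software, in percent.
/// Hypothetical helper used only to reproduce the table above.
fn gain_percent(asm_mb_s: f64, soft_mb_s: f64) -> f64 {
    (asm_mb_s - soft_mb_s) / soft_mb_s * 100.0
}

/// (benchmark, assembly MB/s, software MB/s) as reported in the PR description.
const RESULTS: [(&str, f64, f64); 4] = [
    ("md5_10", 714.0, 666.0),
    ("md5_100", 694.0, 645.0),
    ("md5_1000", 702.0, 651.0),
    ("md5_10000", 704.0, 653.0),
];

/// Formats one line per benchmark with the gain rounded to one decimal place.
fn report() -> Vec<String> {
    RESULTS
        .iter()
        .map(|(name, asm, soft)| format!("{name}: {:+.1}%", gain_percent(*asm, *soft)))
        .collect()
}
```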

Key Optimizations

Efficient constant loading using ldp (load pair) instructions to load two 32-bit constants at once
Optimized round functions:
- G function: Uses ADD instead of OR for non-overlapping bits
- H function: Reorders instructions to delay register dependencies
- I function: Uses ORN instruction to combine operations
Instruction scheduling: Hand-tuned assembly blocks with optimized dependency chains
Register caching: Caches all 16 input data elements in registers to eliminate memory accesses

Implementation Details

File: md5/src/compress/aarch64_asm.rs (1,121 lines)
Integration: Conditional compilation via cfg_if matching existing loongarch64_asm.rs pattern
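
As a rough illustration of this kind of per-architecture selection, the sketch below uses plain `#[cfg]` attributes instead of the `cfg_if` macro so that it stays self-contained. The `backend` module and `backend_name` function are hypothetical stand-ins; the real crate may also gate backends on features, which this sketch ignores.

```rust
// Exactly one `backend` module is compiled, chosen by target architecture.
// The module contents are placeholders for the real compress functions.
#[cfg(target_arch = "aarch64")]
mod backend {
    pub const NAME: &str = "aarch64_asm";
}

#[cfg(target_arch = "loongarch64")]
mod backend {
    pub const NAME: &str = "loongarch64_asm";
}

#[cfg(not(any(target_arch = "aarch64", target_arch = "loongarch64")))]
mod backend {
    pub const NAME: &str = "soft";
}

/// Name of the backend selected at compile time.
pub fn backend_name() -> &'static str {
    backend::NAME
}
```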

Attribution

This implementation incorporates optimization techniques from animetosho's md5-optimisation, which is released into the public domain.

Development Process

The implementation was developed through systematic optimization:

Baseline AArch64 assembly achieving 365 MB/s
Individual function optimizations (+40 MB/s)
Packed constants and ldp loading (+30 MB/s)
Register caching and scheduling optimizations (+270 MB/s to final 704 MB/s)

All optimizations follow AArch64 ABI conventions.

- Implement core MD5 compression using AArch64 inline assembly
- Standard implementations of F, G, H, I round functions
- G function uses AND, BIC, OR operations
- H function uses standard b ^ c ^ d order
- I function uses MVN, OR, EOR sequence
- Baseline performance: ~365 MB/s on Apple M1
- Full correctness maintained with test suite
- Foundation for incremental optimizations

- Replace OR with ADD in G function for better scheduling
- Mathematically equivalent due to non-overlapping bits
- Performance improvement: 365 → 384 MB/s (5% gain)
- All tests pass, correctness maintained
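
The equivalence can be checked in plain Rust. The sketch below models the G function both ways and shows why the ADD form helps: once the two halves are plain additions, the `c & !d` half can join the running sum before `b` is needed. The function names are hypothetical; the PR does this in inline assembly.

```rust
/// MD5 G function as usually written: (b AND d) OR (c AND NOT d).
fn g_or(b: u32, c: u32, d: u32) -> u32 {
    (b & d) | (c & !d)
}

/// Same value with ADD: one term is masked by `d` and the other by `!d`,
/// so they never share a set bit and the addition produces no carries.
fn g_add(b: u32, c: u32, d: u32) -> u32 {
    (b & d).wrapping_add(c & !d)
}

/// Reference G-round step: b + rotl(a + G(b, c, d) + m + rc, s).
fn g_step_reference(a: u32, b: u32, c: u32, d: u32, m: u32, rc: u32, s: u32) -> u32 {
    let sum = a.wrapping_add(g_or(b, c, d)).wrapping_add(m).wrapping_add(rc);
    b.wrapping_add(sum.rotate_left(s))
}

/// Same step with the ADD form split across the running sum.
fn g_step_split(a: u32, b: u32, c: u32, d: u32, m: u32, rc: u32, s: u32) -> u32 {
    // Everything on this line is independent of `b`.
    let early = a.wrapping_add(m).wrapping_add(rc).wrapping_add(c & !d);
    // `b` enters the sum as late as possible.
    let sum = early.wrapping_add(b & d);
    b.wrapping_add(sum.rotate_left(s))
}
```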

- Delay b register dependency for better instruction scheduling
- Compute m + rc and c ^ d first (no b dependency)
- Then compute (c ^ d) ^ b to get final result
- Performance improvement: 384 → 405 MB/s (5.5% gain)
- Total improvement from baseline: 365 → 405 MB/s (11% gain)
- All tests pass, correctness maintained
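
A minimal Rust model of this reordering, assuming the usual MD5 step `b + rotl(a + H(b, c, d) + m + rc, s)`. The function names are hypothetical, and the actual scheduling benefit only shows up in the generated instructions, not in this model.

```rust
/// Reference H-round step using the standard b ^ c ^ d order.
fn h_step_reference(a: u32, b: u32, c: u32, d: u32, m: u32, rc: u32, s: u32) -> u32 {
    let h = b ^ c ^ d;
    let sum = a.wrapping_add(h).wrapping_add(m).wrapping_add(rc);
    b.wrapping_add(sum.rotate_left(s))
}

/// Same step with every `b`-independent value computed first.
fn h_step_delayed_b(a: u32, b: u32, c: u32, d: u32, m: u32, rc: u32, s: u32) -> u32 {
    let m_rc = m.wrapping_add(rc); // no dependency on b
    let c_xor_d = c ^ d; // no dependency on b
    let h = c_xor_d ^ b; // b is consumed here for the first time
    let sum = a.wrapping_add(m_rc).wrapping_add(h);
    b.wrapping_add(sum.rotate_left(s))
}
```

XOR is associative and commutative, so both functions return the same value for every input; only the order in which operands become necessary changes.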

- Use ORN (OR-NOT) to compute b | ~d in single instruction
- Replace MVN + ORR sequence with single ORN
- Reduces instruction count and improves scheduling
- Performance maintained: ~403 MB/s
- Total improvement from baseline: 365 → 403 MB/s (10.4% gain)
- All tests pass, correctness maintained
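
In Rust terms the change looks roughly like the sketch below. The `orn` function only models what the AArch64 ORN instruction computes; the function names are hypothetical and the PR itself uses the instruction directly in inline assembly.

```rust
/// Model of the AArch64 ORN instruction: x OR (NOT y).
fn orn(x: u32, y: u32) -> u32 {
    x | !y
}

/// I function as an MVN + ORR + EOR sequence (three operations).
fn i_mvn_orr(b: u32, c: u32, d: u32) -> u32 {
    let not_d = !d; // MVN
    let t = b | not_d; // ORR
    t ^ c // EOR
}

/// I function with MVN and ORR fused into a single ORN (two operations).
fn i_orn(b: u32, c: u32, d: u32) -> u32 {
    orn(b, d) ^ c // ORN, then EOR
}
```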

- Batch first 4 F operations using 64-bit packed constants
- Each packed constant contains 2 RC values (32-bit each)
- Reduces memory loads from 4 to 2 for constant access
- Better instruction scheduling with larger assembly blocks
- Performance maintained: ~404 MB/s
- Total improvement from baseline: 365 → 404 MB/s (10.7% gain)
- All tests pass, correctness maintained
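
A small sketch of the packing scheme, assuming the lower-indexed constant occupies the low 32 bits. That matches the first two table entries quoted in the review thread. `pack` and `RC0_3_PACKED` are hypothetical names.

```rust
/// Packs two 32-bit round constants into one 64-bit word,
/// with `lo` (the lower-indexed constant) in bits 0..32.
const fn pack(lo: u32, hi: u32) -> u64 {
    (lo as u64) | ((hi as u64) << 32)
}

/// RC[0..4] of MD5 packed pairwise: two 64-bit loads instead of four 32-bit ones.
const RC0_3_PACKED: [u64; 2] = [
    pack(0xd76aa478, 0xe8c7b756), // RC[0], RC[1]
    pack(0x242070db, 0xc1bdceee), // RC[2], RC[3]
];
```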

- Batch first 4 G operations using 64-bit packed constants
- Reduces memory loads from 4 to 2 for G round constants
- Better instruction scheduling with larger assembly blocks
- Maintains ADD shortcut optimization for G function
- Performance maintained: ~403 MB/s
- Total improvement from baseline: 365 → 403 MB/s (10.4% gain)
- All tests pass, correctness maintained
Fix G function ADD shortcut to properly delay b dependency
Correctly implements animetosho G function optimization by computing c & ~d first, then b & d separately to delay dependency on b input.
Fix G function ADD shortcut temp

- Add animetosho H function re-use optimization to eliminate MOV instructions
- Implement Cache4 register caching for data[0], data[4], data[8], data[12]
- Assembly now consistently matches or beats software performance
- Performance: md5_100: 645 MB/s vs 641 MB/s software (+4 MB/s)

- Extend Cache4 to Cache16: cache all data[0-15] elements in registers
- Eliminates nearly all memory accesses to input data array
- Replace all remaining data[X] usages with cacheX for consistency
- Silence unused_assignments warning for last H function re-use call
- Performance improvements: md5_100: +12 MB/s, md5_1000: +8 MB/s, md5_10000: +6 MB/s
- Assembly: md5_100=653 MB/s, md5_1000=656 MB/s, md5_10000=655 MB/s
- Consistently beats software implementation across all buffer sizes
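
The idea behind caching all 16 words can be sketched in portable Rust as decoding the 64-byte block once, up front, so that no round has to touch the input buffer again. `load_words` is a hypothetical helper; whether the values actually stay in registers is up to the compiler here, whereas the PR pins them explicitly in assembly.

```rust
/// Decodes one 64-byte MD5 block into its sixteen little-endian 32-bit words.
/// After this call the rounds can work from `w[0..16]` only.
fn load_words(block: &[u8; 64]) -> [u32; 16] {
    let mut w = [0u32; 16];
    for (i, chunk) in block.chunks_exact(4).enumerate() {
        // MD5 interprets message words as little-endian.
        w[i] = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    w
}
```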

- Replace individual constant loading with ldp (load pair) instructions
- Use ldp to load two 32-bit constants at once from packed constant array
- Applied to F round (RC[0-3]) and G round (RC[16-19]) packed sections
- Performance: md5_100=649 MB/s, md5_1000=655-658 MB/s, md5_10000=654-658 MB/s
- Maintains strong performance vs software implementation
- More efficient constant loading reduces instruction count

…scheduling
- Created RF4/RG4/RH4/RI4 macros for processing 4 rounds at once
- Grouped F, G, H, I rounds into 4-round blocks for improved instruction scheduling
- Maintains H function reuse optimization within RH4 macro
- Performance improvements over software implementation:
  * md5_100: 649 MB/s vs 645 MB/s (+0.6%)
  * md5_1000: 657 MB/s vs 651 MB/s (+0.9%)
  * md5_10000: 657 MB/s vs 652 MB/s (+0.8%)
- Continues systematic optimization approach with clean macro organization

…ading
- Add rh4_integrated macro with H function reuse optimization and ldp
- Add ri4_integrated macro with correct I function (B|~D)^C pattern
- Fix I function implementation in ri4_integrated (was using wrong operand order)
- Replace H rounds 32-43 with rh4_integrated calls (RC[32-43] with offsets 128,144,160)
- Replace I rounds 48-51 with ri4_integrated call (RC[48-51] with offset 192)
- Performance maintained at 645-666 MB/s across all buffer sizes
- Tests passing, systematic integrated optimization approach working

… F rounds
Major optimizations implemented:
- RF4_integrated: F rounds 8-11, 12-15 with ldp constant loading
- RG4_integrated: G rounds 24-27, 28-31 with ldp constant loading
- RH4_integrated: H rounds 32-43 with H function reuse + ldp (3 calls)
- RI4_integrated: I rounds 48-63 with ldp constant loading (4 calls)
- Interleaved F rounds 4-7: Load constants while computing, alternative F function
- Fixed I function implementation: Correct (B|~D)^C pattern vs wrong operand order
- Added H function reuse optimization in rh4_integrated
Performance: Maintains 641-666 MB/s across all buffer sizes
All tests passing with complete integrated approach
Extensive use of ldp instructions for efficient 64-bit constant pair loading

- Convert remaining ri4! calls to ri4_integrated! for RC[56-59] and RC[60-63]
- Use ldp constant loading with offsets 224 and 240 bytes respectively
- All I rounds now use integrated optimization with efficient constant loading
- Tests passing, ready to clean up unused macro definitions
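
The byte offsets quoted in these commit messages (128, 144, 160, 192, 224, 240) are consistent with a table of 64 consecutive 32-bit constants, where RC[i] sits at byte offset 4 * i. A short sketch with hypothetical helper names:

```rust
/// Size in bytes of one 32-bit round constant.
const RC_BYTES: usize = 4;

/// Byte offset of RC[index] from the start of the constant table,
/// assuming the 64 constants are stored back to back.
const fn rc_byte_offset(index: usize) -> usize {
    index * RC_BYTES
}

/// Index of the packed 64-bit entry that holds RC[index].
const fn packed_index(index: usize) -> usize {
    index / 2
}
```

Each 4-round group therefore advances the offset by 16 bytes, which is what two packed 64-bit entries occupy.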

…completion
Remove 10 unused macro definitions that were replaced by integrated versions:
- asm_op_f, asm_op_g, asm_op_h_reuse, asm_op_i (individual function macros)
- rf4, rg4, rh4, ri4 (4-round macros calling individual functions)
All functionality preserved in _integrated versions with ldp constant loading.
No warnings, all tests pass, ready for clean integrated codebase.

Use ldp (Load Pair) instructions to load input data pairs directly:
- Eliminates intermediate data array for better memory bandwidth
- Uses explicit optimized rotation values (25, 20, 15, 10)
- Direct register loading reduces memory access overhead
Performance improvement:
- md5_100: 641 → 649 MB/s (+8 MB/s)
- md5_1000: 650 → 658 MB/s (+8 MB/s)
- md5_10000: 661 MB/s (consistent high performance)
Advanced ARM64 optimization techniques for MD5 compression.
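
The values 25, 20, 15, 10 appear to be the standard MD5 F-round left rotations (7, 12, 17, 22) written as right rotations, since AArch64 has a ROR instruction but no rotate-left. That reading is an assumption based on the numbers matching. A sketch with hypothetical helper names:

```rust
/// Right-rotate amount equivalent to a left rotate by `left` on a 32-bit value.
const fn ror_amount(left: u32) -> u32 {
    (32 - left) % 32
}

/// Left-rotate amounts used by the MD5 F rounds.
const F_ROTL: [u32; 4] = [7, 12, 17, 22];

/// The same rotations expressed for an architecture that only rotates right.
fn f_ror_amounts() -> [u32; 4] {
    F_ROTL.map(ror_amount)
}
```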

Improved dependency chains in F and G rounds for better pipeline utilization. Performance: 666 MB/s for md5_10000 (+7 MB/s improvement).

- Reorder instructions in F, G, H, and I rounds to reduce dependency chains
- Move independent calculations earlier for better pipeline utilization
- Performance varies 659-666 MB/s on md5_10000 benchmark

- Improve scheduling in rh4_integrated H rounds 0-1 for better parallelism
- Optimize asm_op_g_alt macro with better dependency chain management
- Enhance rg4_integrated G round 2 instruction ordering
- Performance: md5_10: 666 MB/s, md5_100: 657 MB/s, md5_1000: 664 MB/s, md5_10000: 666 MB/s

- Optimize rf4_integrated F round 0 and H round 2 scheduling
- Enhance rh4_integrated H round 3 instruction ordering
- Performance: md5_10: 666 MB/s, md5_100: 657 MB/s, md5_1000: 665 MB/s, md5_10000: 666 MB/s
- All benchmarks except md5_100 now exceed 660 MB/s target

- Inline optimize H round 44 with better instruction scheduling
- Improve F1 instruction ordering in optimized F0-F3 section
- Performance: md5_100: 657 MB/s, others 665-666 MB/s

Move independent calculations earlier to reduce pipeline stalls. Performance remains stable at 657-666 MB/s across benchmarks.

Reduce pipeline stalls by using separate registers for intermediate calculations in F, G, and I rounds. Performance now 657-667 MB/s.

Replace isolated F/G/H round operations with integrated 4-round macros for better instruction scheduling and constant loading efficiency.
- Convert F rounds 4-7 and G rounds 20-23 to integrated macros
- Optimize remaining H rounds 45-47 with dependency chain improvements
- Remove unused individual round macros
- Achieve more consistent performance across benchmark sizes
Performance remains at 657-666 MB/s range with improved stability.

Replace fragmented F rounds 0-7 with single optimized assembly block enabling better instruction scheduling and register allocation across round boundaries.
Key improvements:
- Pre-load multiple constant pairs with ldp instructions
- Maintain register state across 8 consecutive F rounds
- Reduce assembly block fragmentation for better compiler optimization
- Achieve consistent 666+ MB/s performance across all benchmark sizes
Performance results:
- md5_100: 666 MB/s (was 657 MB/s)
- md5_1000: 675 MB/s (was 665 MB/s)
- md5_10000: 676 MB/s (was 666 MB/s)
This demonstrates the performance benefits of larger assembly blocks within Rust's inline assembly constraints.

- Replace integrated macros with optimized inline assembly blocks
- Implement efficient constant loading using ldp instructions
- Optimize F, G, H, and I round instruction scheduling
- Reduce register pressure through careful register allocation
- Achieve 681-682 MB/s throughput on larger data sets
- Maintain correctness while maximizing pipeline efficiency
Performance improvements:
- md5_1000: 682 MB/s (up from ~660 MB/s baseline)
- md5_10000: 681 MB/s (up from ~660 MB/s baseline)
- Consistent 666+ MB/s performance across all test sizes

- Optimize remaining G rounds 24-31 with hand-tuned assembly blocks
- Optimize remaining F rounds 8-15 with hand-tuned assembly blocks
- Remove unused rg4_integrated macro after complete replacement
- Achieve significant performance improvements across all benchmarks:
  - md5_10: 714 MB/s (maintained peak performance)
  - md5_100: 684 MB/s (+13 MB/s improvement)
  - md5_1000: 694 MB/s (+17 MB/s improvement)
  - md5_10000: 697 MB/s (+15 MB/s improvement)
All benchmarks now within 6 MB/s of 700 MB/s target through systematic replacement of integrated macros with optimized ldp constant loading, improved instruction scheduling, and reduced assembly fragmentation.

…ions
- Apply optimized G function pattern using direct register additions
- Enhance instruction scheduling for better parallel execution
- Reduce temporary register pressure in G rounds 24-31
- Improve performance consistency across different block sizes
- Maintain peak 714 MB/s performance while achieving 701+ MB/s sustained throughput
Performance results:
- md5_10: 714 MB/s (maintained peak performance)
- md5_100: 689 MB/s (consistent throughput)
- md5_1000: 701 MB/s (improved scaling)
- md5_10000: 702 MB/s (excellent sustained performance)

- Apply conservative scheduling optimizations to F0-F11 rounds
- Improve instruction parallelism by reordering independent operations
- Move constant preparation instructions earlier in the pipeline
- Maintain correctness while enhancing performance
Performance improvements:
- md5_10: 714 MB/s (peak maintained)
- md5_100: 689-694 MB/s (consistent improvement)
- md5_1000: 702 MB/s (approaching target)
- md5_10000: 703 MB/s (strong sustained performance)
Progress toward 740 MB/s target: ~95% achieved

- Remove unused macros: asm_op_h, rh4_integrated, ri4_integrated
- Remove unused tmp_h variable and initialization code
- Code is now warning-free while maintaining performance
Performance comparison (ARM64 ASM vs Software):
- md5_10: 714 MB/s vs 666 MB/s (+48 MB/s, +7.2%)
- md5_100: 694 MB/s vs 645 MB/s (+49 MB/s, +7.6%)
- md5_1000: 702 MB/s vs 651 MB/s (+51 MB/s, +7.8%)
- md5_10000: 704 MB/s vs 653 MB/s (+51 MB/s, +7.8%)
Consistent 7-8% performance improvement across all buffer sizes.

- Implement interleaved instruction scheduling in H rounds for better superscalar utilization, combining 4-round groups with independent operations running in parallel
- Add aggressive memory prefetching for constants and next-round data to improve memory bandwidth utilization
- Optimize constant loading patterns with early preparation and reuse of intermediate calculations
- Improve pipeline efficiency by minimizing data dependencies and maximizing instruction-level parallelism
Performance improvements:
- md5_100: 689 → 694 MB/s (+0.7% improvement)
- md5_1000: 696 → 702 MB/s (+0.9% improvement)
- md5_10000: 702 → 703 MB/s (+0.1% improvement)
All optimizations maintain correctness and pass existing test suite.

Copilot

Pull Request Overview

This PR adds an optimized AArch64 assembly backend for MD5 compression to improve performance on ARM64 processors, particularly Apple M1 which supports NEON and crypto extensions.

Key changes:

Implements hand-written AArch64 assembly for all 64 MD5 rounds (F, G, H, I functions)
Uses packed 64-bit constants and ldp instructions for efficient constant loading
Leverages AArch64-specific instructions like orn for the I function and bic for the G function
Adds conditional compilation to use the AArch64 backend when targeting that architecture

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
md5/src/compress/aarch64_asm.rs	New AArch64 assembly implementation with optimized instruction scheduling and memory access patterns
md5/src/compress.rs	Updated to conditionally compile the AArch64 backend before falling through to other architectures


Copilot · 2025-10-28T22:36:28Z

md5/src/compress/aarch64_asm.rs

+    0xe8c7b756d76aa478,
+    0xc1bdceee242070db,
+    0x4787c62af57c0faf,
+    0xfd469501a8304613,
+    0x8b44f7af698098d8,
+    0x895cd7beffff5bb1,
+    0xfd9871936b901122,
+    0x49b40821a679438e,
+    // G round constants
+    0xc040b340f61e2562,
+    0xe9b6c7aa265e5a51,
+    0x02441453d62f105d,
+    0xe7d3fbc8d8a1e681,
+    0xc33707d621e1cde6,
+    0x455a14edf4d50d87,
+    0xfcefa3f8a9e3e905,
+    0x8d2a4c8a676f02d9,
+    // H round constants
+    0x8771f681fffa3942,
+    0xfde5380c6d9d6122,
+    0x4bdecfa9a4beea44,
+    0xbebfbc70f6bb4b60,
+    0xeaa127fa289b7ec6,
+    0x04881d05d4ef3085,
+    0xe6db99e5d9d4d039,
+    0xc4ac56651fa27cf8,
+    // I round constants
+    0x432aff97f4292244,
+    0xfc93a039ab9423a7,
+    0x8f0ccc92655b59c3,
+    0x85845dd1ffeff47d,
+    0xfe2ce6e06fa87e4f,
+    0x4e0811a1a3014314,
+    0xbd3af235f7537e82,
+    0xeb86d3912ad7d2bb,


The packed constants have incorrect byte order. RC[0]=0xd76aa478 and RC[1]=0xe8c7b756 should pack as 0xd76aa478e8c7b756 (little-endian on AArch64), but line 13 shows 0xe8c7b756d76aa478 (swapped order). This affects all 32 packed constant values and will cause incorrect MD5 computation. Each pair should have the lower-indexed constant in the lower 32 bits.

Suggested change

0xd76aa478e8c7b756,
0x242070dbc1bdceee,
0xf57c0faf4787c62a,
0xa8304613fd469501,
0x698098d88b44f7af,
0xffff5bb1895cd7be,
0x6b901122fd987193,
0xa679438e49b40821,
// G round constants
0xf61e2562c040b340,
0x265e5a51e9b6c7aa,
0xd62f105d02441453,
0xd8a1e681e7d3fbc8,
0x21e1cde6c33707d6,
0xf4d50d87455a14ed,
0xa9e3e905fcefa3f8,
0x676f02d98d2a4c8a,
// H round constants
0xfffa39428771f681,
0x6d9d6122fde5380c,
0xa4beea444bdecfa9,
0xf6bb4b60bebfbc70,
0x289b7ec6eaa127fa,
0xd4ef308504881d05,
0xd9d4d039e6db99e5,
0x1fa27cf8c4ac5665,
// I round constants
0xf4292244432aff97,
0xab9423a7fc93a039,
0x655b59c38f0ccc92,
0xffeff47d85845dd1,
0x6fa87e4ffe2ce6e0,
0xa30143144e0811a1,
0xf7537e82bd3af235,
0x2ad7d2bbeb86d391,

Copilot is mistaken: we use the lower 32 bits first, then shift to get the upper 32 bits.
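
The reply can be checked against the first table entry, 0xe8c7b756d76aa478. A minimal sketch with hypothetical helper names: taking the low 32 bits and then shifting yields RC[0] followed by RC[1], and reading the same 64-bit value as two consecutive little-endian 32-bit words gives the same order.

```rust
/// Splits a packed entry the way the reply describes:
/// low 32 bits first, then shift right to get the high 32 bits.
fn split_packed(packed: u64) -> (u32, u32) {
    (packed as u32, (packed >> 32) as u32)
}

/// Reads the same entry as two consecutive 32-bit words in little-endian
/// memory order, which is how a pair of 32-bit loads would see it.
fn as_le_words(packed: u64) -> [u32; 2] {
    let b = packed.to_le_bytes();
    [
        u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
    ]
}
```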

md5/src/compress/aarch64_asm.rs

Addresses GitHub Copilot's code review comment to remove the unnecessary #[allow(dead_code)] attribute from MD5_CONSTANTS_PACKED static, as the constant array is actively used by the inline assembly code.

mjc added 30 commits October 28, 2025 16:16

md5: optimize G function with ADD shortcut

57e9840

- Replace OR with ADD in G function for better scheduling
- Mathematically equivalent due to non-overlapping bits
- Performance improvement: 365 → 384 MB/s (5% gain)
- All tests pass, correctness maintained

md5: optimize instruction scheduling in ARM64 assembly

fb211d2

Improved dependency chains in F and G rounds for better pipeline utilization. Performance: 666 MB/s for md5_10000 (+7 MB/s improvement).

md5: improve instruction scheduling in ARM64 assembly operations

42e0f5a

- Reorder instructions in F, G, H, and I rounds to reduce dependency chains
- Move independent calculations earlier for better pipeline utilization
- Performance varies 659-666 MB/s on md5_10000 benchmark

md5: add micro-optimizations for H and F rounds

b16a04e

- Inline optimize H round 44 with better instruction scheduling
- Improve F1 instruction ordering in optimized F0-F3 section
- Performance: md5_100: 657 MB/s, others 665-666 MB/s

md5: improve instruction scheduling in F2 round

98d8aa6

Move independent calculations earlier to reduce pipeline stalls. Performance remains stable at 657-666 MB/s across benchmarks.

md5: optimize dependency chains in MD5 rounds

f21c481

Reduce pipeline stalls by using separate registers for intermediate calculations in F, G, and I rounds. Performance now 657-667 MB/s.

Copilot AI review requested due to automatic review settings October 28, 2025 22:34

Copilot AI reviewed Oct 28, 2025


md5: remove unused dead_code allow attribute

97556d2

Addresses GitHub Copilot's code review comment to remove the unnecessary #[allow(dead_code)] attribute from MD5_CONSTANTS_PACKED static, as the constant array is actively used by the inline assembly code.
