md5: Add optimized AArch64 assembly implementation #749
- Implement core MD5 compression using AArch64 inline assembly - Standard implementations of F, G, H, I round functions - G function uses AND, BIC, OR operations - H function uses standard b ^ c ^ d order - I function uses MVN, OR, EOR sequence - Baseline performance: ~365 MB/s on Apple M1 - Full correctness maintained with test suite - Foundation for incremental optimizations
- Replace OR with ADD in G function for better scheduling - Mathematically equivalent due to non-overlapping bits - Performance improvement: 365 → 384 MB/s (5% gain) - All tests pass, correctness maintained
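The OR-to-ADD substitution can be checked in plain Rust. This is a minimal sketch rather than the PR's assembly, and the function names `g_or` and `g_add` are made up for illustration. Because addition is associative, each masked term can also be folded into the running round sum on its own.

```rust
/// MD5 G function in its textbook form: take `b` where `d` is 1, else `c`.
fn g_or(b: u32, c: u32, d: u32) -> u32 {
    (b & d) | (c & !d)
}

/// Same function with the OR replaced by a wrapping ADD.
/// `b & d` only has bits where `d` is 1 and `c & !d` only where `d` is 0,
/// so no bit position is set in both terms and the addition never carries.
fn g_add(b: u32, c: u32, d: u32) -> u32 {
    (b & d).wrapping_add(c & !d)
}
```

Both functions return the same value for every input, since the two masked terms never overlap.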
- Delay b register dependency for better instruction scheduling - Compute m + rc and c ^ d first (no b dependency) - Then compute (c ^ d) ^ b to get final result - Performance improvement: 384 → 405 MB/s (5.5% gain) - Total improvement from baseline: 365 → 405 MB/s (11% gain) - All tests pass, correctness maintained
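A portable sketch of the reordering described above, assuming it applies to an H-round step. The names `h_step` and `h_step_reference` are hypothetical and the real change is made in inline assembly. In MD5 each step's result becomes the next step's `b`, so `b` is the last input to become available.

```rust
/// One MD5 H-round step, ordered so that `b` is consumed as late as possible.
/// `m` is the message word, `rc` the round constant, `s` the left-rotation amount.
fn h_step(a: u32, b: u32, c: u32, d: u32, m: u32, rc: u32, s: u32) -> u32 {
    let sum = a.wrapping_add(m).wrapping_add(rc); // no dependency on b
    let cd = c ^ d; // no dependency on b
    let h = cd ^ b; // b enters the computation here
    b.wrapping_add(sum.wrapping_add(h).rotate_left(s))
}

/// The same step written in the straightforward order, for comparison.
fn h_step_reference(a: u32, b: u32, c: u32, d: u32, m: u32, rc: u32, s: u32) -> u32 {
    let h = b ^ c ^ d;
    b.wrapping_add(
        a.wrapping_add(h)
            .wrapping_add(m)
            .wrapping_add(rc)
            .rotate_left(s),
    )
}
```

The two orderings compute the same value; only the point at which `b` is first needed differs.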
- Use ORN (OR-NOT) to compute b | ~d in a single instruction - Replace MVN + ORR sequence with single ORN - Reduces instruction count and improves scheduling - Performance maintained: ~403 MB/s - Total improvement from baseline: 365 → 403 MB/s (10.4% gain) - All tests pass, correctness maintained
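For reference, the I function in portable Rust, as a sketch with hypothetical function names. The comments note which AArch64 instructions each expression corresponds to in the commit's description. Whether a compiler emits exactly these instructions is not guaranteed.

```rust
/// MD5 I function: c ^ (b | !d).
fn i_func(b: u32, c: u32, d: u32) -> u32 {
    let t = b | !d; // a single ORN on AArch64 (Rd = Rn | ~Rm)
    t ^ c // EOR
}

/// The same function split the way the earlier MVN + ORR + EOR sequence did it.
fn i_func_three_ops(b: u32, c: u32, d: u32) -> u32 {
    let not_d = !d; // MVN
    let t = b | not_d; // ORR
    t ^ c // EOR
}
```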
- Batch first 4 F operations using 64-bit packed constants - Each packed constant contains 2 RC values (32-bit each) - Reduces memory loads from 4 to 2 for constant access - Better instruction scheduling with larger assembly blocks - Performance maintained: ~404 MB/s - Total improvement from baseline: 365 → 404 MB/s (10.7% gain) - All tests pass, correctness maintained
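A sketch of how two 32-bit round constants can share one 64-bit word. The helper `pack` and the two-entry table `PACKED` are illustrative names, not the PR's; the four constants are the first MD5 round constants from RFC 1321.

```rust
/// First four MD5 round constants, RC[0..4].
const RC: [u32; 4] = [0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee];

/// Pack a pair so the lower-indexed constant sits in the low 32 bits.
const fn pack(lo: u32, hi: u32) -> u64 {
    ((hi as u64) << 32) | lo as u64
}

/// Hypothetical packed table covering the first four F rounds:
/// two 64-bit loads instead of four 32-bit loads.
const PACKED: [u64; 2] = [pack(RC[0], RC[1]), pack(RC[2], RC[3])];
```

With this layout the first entry is `0xe8c7b756d76aa478` and the second is `0xc1bdceee242070db`.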
- Batch first 4 G operations using 64-bit packed constants - Reduces memory loads from 4 to 2 for G round constants - Better instruction scheduling with larger assembly blocks - Maintains ADD shortcut optimization for G function - Performance maintained: ~403 MB/s - Total improvement from baseline: 365 → 403 MB/s (10.4% gain) - All tests pass, correctness maintained
Fix G function ADD shortcut to properly delay b dependency: correctly implements the animetosho G function optimization by computing c & ~d first, then b & d separately, to delay the dependency on the b input.
Fix G function ADD shortcut temp
- Add animetosho H function re-use optimization to eliminate MOV instructions - Implement Cache4 register caching for data[0], data[4], data[8], data[12] - Assembly now consistently matches or beats software performance - Performance: md5_100: 645 MB/s vs 641 MB/s software (+4 MB/s)
- Extend Cache4 to Cache16: cache all data[0-15] elements in registers - Eliminates nearly all memory accesses to input data array - Replace all remaining data[X] usages with cacheX for consistency - Silence unused_assignments warning for last H function re-use call - Performance improvements: md5_100: +12 MB/s, md5_1000: +8 MB/s, md5_10000: +6 MB/s - Assembly: md5_100=653 MB/s, md5_1000=656 MB/s, md5_10000=655 MB/s - Consistently beats software implementation across all buffer sizes
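A portable analogue of the Cache16 idea, assuming the usual MD5 convention that a 64-byte block is read as sixteen little-endian 32-bit words. Rust cannot pin values to registers outside inline assembly, so this sketch only shows the one-time load into locals; `load_block` is a hypothetical helper.

```rust
/// Read a 64-byte MD5 block into sixteen little-endian words once,
/// so later rounds index the local array instead of re-reading the input.
fn load_block(block: &[u8; 64]) -> [u32; 16] {
    let mut words = [0u32; 16];
    for (i, chunk) in block.chunks_exact(4).enumerate() {
        words[i] = u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
    }
    words
}
```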
- Replace individual constant loading with ldp (load pair) instructions - Use ldp to load two 32-bit constants at once from packed constant array - Applied to F round (RC[0-3]) and G round (RC[16-19]) packed sections - Performance: md5_100=649 MB/s, md5_1000=655-658 MB/s, md5_10000=654-658 MB/s - Maintains strong performance vs software implementation - More efficient constant loading reduces instruction count
…scheduling - Created RF4/RG4/RH4/RI4 macros for processing 4 rounds at once - Grouped F, G, H, I rounds into 4-round blocks for improved instruction scheduling - Maintains H function reuse optimization within RH4 macro - Performance improvements over software implementation: * md5_100: 649 MB/s vs 645 MB/s (+0.6%) * md5_1000: 657 MB/s vs 651 MB/s (+0.9%) * md5_10000: 657 MB/s vs 652 MB/s (+0.8%) - Continues systematic optimization approach with clean macro organization
…ading - Add rh4_integrated macro with H function reuse optimization and ldp - Add ri4_integrated macro with correct I function (B|~D)^C pattern - Fix I function implementation in ri4_integrated (was using wrong operand order) - Replace H rounds 32-43 with rh4_integrated calls (RC[32-43] with offsets 128,144,160) - Replace I rounds 48-51 with ri4_integrated call (RC[48-51] with offset 192) - Performance maintained at 645-666 MB/s across all buffer sizes - Tests passing, systematic integrated optimization approach working
… F rounds Major optimizations implemented: - RF4_integrated: F rounds 8-11, 12-15 with ldp constant loading - RG4_integrated: G rounds 24-27, 28-31 with ldp constant loading - RH4_integrated: H rounds 32-43 with H function reuse + ldp (3 calls) - RI4_integrated: I rounds 48-63 with ldp constant loading (4 calls) - Interleaved F rounds 4-7: Load constants while computing, alternative F function - Fixed I function implementation: Correct (B|~D)^C pattern vs wrong operand order - Added H function reuse optimization in rh4_integrated Performance: Maintains 641-666 MB/s across all buffer sizes All tests passing with complete integrated approach Extensive use of ldp instructions for efficient 64-bit constant pair loading
- Convert remaining ri4! calls to ri4_integrated! for RC[56-59] and RC[60-63] - Use ldp constant loading with offsets 224 and 240 bytes respectively - All I rounds now use integrated optimization with efficient constant loading - Tests passing, ready to clean up unused macro definitions
…completion Remove 10 unused macro definitions that were replaced by integrated versions: - asm_op_f, asm_op_g, asm_op_h_reuse, asm_op_i (individual function macros) - rf4, rg4, rh4, ri4 (4-round macros calling individual functions) All functionality preserved in _integrated versions with ldp constant loading. No warnings, all tests pass, ready for clean integrated codebase.
Use ldp (Load Pair) instructions to load input data pairs directly: - Eliminates intermediate data array for better memory bandwidth - Uses explicit optimized rotation values (25, 20, 15, 10) - Direct register loading reduces memory access overhead Performance improvement: - md5_100: 641 → 649 MB/s (+8 MB/s) - md5_1000: 650 → 658 MB/s (+8 MB/s) - md5_10000: 661 MB/s (consistent high performance) Advanced ARM64 optimization techniques for MD5 compression.
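The rotation values 25, 20, 15, 10 are 32 minus the F-round left-rotation amounts 7, 12, 17, 22 from RFC 1321. That is consistent with writing each left rotation as a rotate-right, which is the rotate form AArch64 provides. A small sketch of the correspondence, with hypothetical names:

```rust
/// Left-rotation amounts for the four F-round positions (RFC 1321).
const F_ROL: [u32; 4] = [7, 12, 17, 22];

/// Rotate-right amount that has the same effect as rotating left by `rol`
/// on a 32-bit word.
fn ror_amount(rol: u32) -> u32 {
    (32 - rol) % 32
}

/// Rotate left expressed purely through a rotate-right.
fn rol_via_ror(x: u32, rol: u32) -> u32 {
    x.rotate_right(ror_amount(rol))
}
```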
Improved dependency chains in F and G rounds for better pipeline utilization. Performance: 666 MB/s for md5_10000 (+7 MB/s improvement).
- Reorder instructions in F, G, H, and I rounds to reduce dependency chains - Move independent calculations earlier for better pipeline utilization - Performance varies 659-666 MB/s on md5_10000 benchmark
- Improve scheduling in rh4_integrated H rounds 0-1 for better parallelism - Optimize asm_op_g_alt macro with better dependency chain management - Enhance rg4_integrated G round 2 instruction ordering - Performance: md5_10: 666 MB/s, md5_100: 657 MB/s, md5_1000: 664 MB/s, md5_10000: 666 MB/s
- Optimize rf4_integrated F round 0 and H round 2 scheduling - Enhance rh4_integrated H round 3 instruction ordering - Performance: md5_10: 666 MB/s, md5_100: 657 MB/s, md5_1000: 665 MB/s, md5_10000: 666 MB/s - All benchmarks except md5_100 now exceed 660 MB/s target
- Inline optimize H round 44 with better instruction scheduling - Improve F1 instruction ordering in optimized F0-F3 section - Performance: md5_100: 657 MB/s, others 665-666 MB/s
Move independent calculations earlier to reduce pipeline stalls. Performance remains stable at 657-666 MB/s across benchmarks.
Reduce pipeline stalls by using separate registers for intermediate calculations in F, G, and I rounds. Performance now 657-667 MB/s.
Replace isolated F/G/H round operations with integrated 4-round macros for better instruction scheduling and constant loading efficiency. - Convert F rounds 4-7 and G rounds 20-23 to integrated macros - Optimize remaining H rounds 45-47 with dependency chain improvements - Remove unused individual round macros - Achieve more consistent performance across benchmark sizes Performance remains at 657-666 MB/s range with improved stability.
Replace fragmented F rounds 0-7 with single optimized assembly block enabling better instruction scheduling and register allocation across round boundaries. Key improvements: - Pre-load multiple constant pairs with ldp instructions - Maintain register state across 8 consecutive F rounds - Reduce assembly block fragmentation for better compiler optimization - Achieve consistent 666+ MB/s performance across all benchmark sizes Performance results: - md5_100: 666 MB/s (was 657 MB/s) - md5_1000: 675 MB/s (was 665 MB/s) - md5_10000: 676 MB/s (was 666 MB/s) This demonstrates the performance benefits of larger assembly blocks within Rust's inline assembly constraints.
- Replace integrated macros with optimized inline assembly blocks - Implement efficient constant loading using ldp instructions - Optimize F, G, H, and I round instruction scheduling - Reduce register pressure through careful register allocation - Achieve 681-682 MB/s throughput on larger data sets - Maintain correctness while maximizing pipeline efficiency Performance improvements: - md5_1000: 682 MB/s (up from ~660 MB/s baseline) - md5_10000: 681 MB/s (up from ~660 MB/s baseline) - Consistent 666+ MB/s performance across all test sizes
- Optimize remaining G rounds 24-31 with hand-tuned assembly blocks - Optimize remaining F rounds 8-15 with hand-tuned assembly blocks - Remove unused rg4_integrated macro after complete replacement - Achieve significant performance improvements across all benchmarks: - md5_10: 714 MB/s (maintained peak performance) - md5_100: 684 MB/s (+13 MB/s improvement) - md5_1000: 694 MB/s (+17 MB/s improvement) - md5_10000: 697 MB/s (+15 MB/s improvement) All benchmarks now within 6 MB/s of 700 MB/s target through systematic replacement of integrated macros with optimized ldp constant loading, improved instruction scheduling, and reduced assembly fragmentation.
…ions - Apply optimized G function pattern using direct register additions - Enhance instruction scheduling for better parallel execution - Reduce temporary register pressure in G rounds 24-31 - Improve performance consistency across different block sizes - Maintain peak 714 MB/s performance while achieving 701+ MB/s sustained throughput Performance results: - md5_10: 714 MB/s (maintained peak performance) - md5_100: 689 MB/s (consistent throughput) - md5_1000: 701 MB/s (improved scaling) - md5_10000: 702 MB/s (excellent sustained performance)
- Apply conservative scheduling optimizations to F0-F11 rounds - Improve instruction parallelism by reordering independent operations - Move constant preparation instructions earlier in the pipeline - Maintain correctness while enhancing performance Performance improvements: - md5_10: 714 MB/s (peak maintained) - md5_100: 689-694 MB/s (consistent improvement) - md5_1000: 702 MB/s (approaching target) - md5_10000: 703 MB/s (strong sustained performance) Progress toward 740 MB/s target: ~95% achieved
- Remove unused macros: asm_op_h, rh4_integrated, ri4_integrated - Remove unused tmp_h variable and initialization code - Code is now warning-free while maintaining performance Performance comparison (ARM64 ASM vs Software): - md5_10: 714 MB/s vs 666 MB/s (+48 MB/s, +7.2%) - md5_100: 694 MB/s vs 645 MB/s (+49 MB/s, +7.6%) - md5_1000: 702 MB/s vs 651 MB/s (+51 MB/s, +7.8%) - md5_10000: 704 MB/s vs 653 MB/s (+51 MB/s, +7.8%) Consistent 7-8% performance improvement across all buffer sizes.
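The quoted percentages follow from the raw throughput figures. A quick arithmetic check, using a hypothetical helper `gain_percent`:

```rust
/// Relative speed-up of `asm` over `soft`, in percent.
fn gain_percent(asm: f64, soft: f64) -> f64 {
    (asm / soft - 1.0) * 100.0
}

/// Same value rounded to tenths of a percent, returned as an integer
/// (72 means 7.2%) to avoid comparing floating-point values directly.
fn gain_tenths(asm: f64, soft: f64) -> i64 {
    (gain_percent(asm, soft) * 10.0).round() as i64
}
```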
- Implement interleaved instruction scheduling in H rounds for better superscalar utilization, combining 4-round groups with independent operations running in parallel - Add aggressive memory prefetching for constants and next-round data to improve memory bandwidth utilization - Optimize constant loading patterns with early preparation and reuse of intermediate calculations - Improve pipeline efficiency by minimizing data dependencies and maximizing instruction-level parallelism Performance improvements: - md5_100: 689 → 694 MB/s (+0.7% improvement) - md5_1000: 696 → 702 MB/s (+0.9% improvement) - md5_10000: 702 → 703 MB/s (+0.1% improvement) All optimizations maintain correctness and pass existing test suite.
Pull Request Overview
This PR adds an optimized AArch64 assembly backend for MD5 compression to improve performance on ARM64 processors, particularly Apple M1 which supports NEON and crypto extensions.
Key changes:
- Implements hand-written AArch64 assembly for all 64 MD5 rounds (F, G, H, I functions)
- Uses packed 64-bit constants and `ldp` instructions for efficient constant loading
- Leverages AArch64-specific instructions like `orn` for the I function and `bic` for the G function
- Adds conditional compilation to use the AArch64 backend when targeting that architecture
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| md5/src/compress/aarch64_asm.rs | New AArch64 assembly implementation with optimized instruction scheduling and memory access patterns |
| md5/src/compress.rs | Updated to conditionally compile the AArch64 backend before falling through to other architectures |
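
Architecture-based backend selection can be sketched with plain `#[cfg]` attributes. This is an illustration only: the module name `backend` and its contents are hypothetical, and the crate itself may use `cfg_if` and extra feature gates that are not shown here.

```rust
// Chosen when compiling for AArch64.
#[cfg(target_arch = "aarch64")]
mod backend {
    pub const NAME: &str = "aarch64_asm";
}

// Fallback for every other target.
#[cfg(not(target_arch = "aarch64"))]
mod backend {
    pub const NAME: &str = "soft";
}

/// Name of the backend that was compiled in.
pub fn backend_name() -> &'static str {
    backend::NAME
}
```

Exactly one of the two modules exists in any given build, so callers never branch at run time.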
Copilot AI, Oct 28, 2025
The packed constants have incorrect byte order. RC[0]=0xd76aa478 and RC[1]=0xe8c7b756 should pack as 0xd76aa478e8c7b756 (little-endian on AArch64), but line 13 shows 0xe8c7b756d76aa478 (swapped order). This affects all 32 packed constant values and will cause incorrect MD5 computation. Each pair should have the lower-indexed constant in the lower 32 bits.
Suggested change

```diff
- 0xe8c7b756d76aa478,
- 0xc1bdceee242070db,
- 0x4787c62af57c0faf,
- 0xfd469501a8304613,
- 0x8b44f7af698098d8,
- 0x895cd7beffff5bb1,
- 0xfd9871936b901122,
- 0x49b40821a679438e,
- // G round constants
- 0xc040b340f61e2562,
- 0xe9b6c7aa265e5a51,
- 0x02441453d62f105d,
- 0xe7d3fbc8d8a1e681,
- 0xc33707d621e1cde6,
- 0x455a14edf4d50d87,
- 0xfcefa3f8a9e3e905,
- 0x8d2a4c8a676f02d9,
- // H round constants
- 0x8771f681fffa3942,
- 0xfde5380c6d9d6122,
- 0x4bdecfa9a4beea44,
- 0xbebfbc70f6bb4b60,
- 0xeaa127fa289b7ec6,
- 0x04881d05d4ef3085,
- 0xe6db99e5d9d4d039,
- 0xc4ac56651fa27cf8,
- // I round constants
- 0x432aff97f4292244,
- 0xfc93a039ab9423a7,
- 0x8f0ccc92655b59c3,
- 0x85845dd1ffeff47d,
- 0xfe2ce6e06fa87e4f,
- 0x4e0811a1a3014314,
- 0xbd3af235f7537e82,
- 0xeb86d3912ad7d2bb,
+ 0xd76aa478e8c7b756,
+ 0x242070dbc1bdceee,
+ 0xf57c0faf4787c62a,
+ 0xa8304613fd469501,
+ 0x698098d88b44f7af,
+ 0xffff5bb1895cd7be,
+ 0x6b901122fd987193,
+ 0xa679438e49b40821,
+ // G round constants
+ 0xf61e2562c040b340,
+ 0x265e5a51e9b6c7aa,
+ 0xd62f105d02441453,
+ 0xd8a1e681e7d3fbc8,
+ 0x21e1cde6c33707d6,
+ 0xf4d50d87455a14ed,
+ 0xa9e3e905fcefa3f8,
+ 0x676f02d98d2a4c8a,
+ // H round constants
+ 0xfffa39428771f681,
+ 0x6d9d6122fde5380c,
+ 0xa4beea444bdecfa9,
+ 0xf6bb4b60bebfbc70,
+ 0x289b7ec6eaa127fa,
+ 0xd4ef308504881d05,
+ 0xd9d4d039e6db99e5,
+ 0x1fa27cf8c4ac5665,
+ // I round constants
+ 0xf4292244432aff97,
+ 0xab9423a7fc93a039,
+ 0x655b59c38f0ccc92,
+ 0xffeff47d85845dd1,
+ 0x6fa87e4ffe2ce6e0,
+ 0xa30143144e0811a1,
+ 0xf7537e82bd3af235,
+ 0x2ad7d2bbeb86d391,
```
Copilot is mistaken: we use the lower 32 bits first, then shift to get the upper 32 bits.
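The two candidate encodings of the first pair can be compared directly. This sketch uses hypothetical helper names and checks both the shift-based extraction described in the reply and the order in which the two 32-bit words appear in little-endian memory.

```rust
/// Extract (first, second) by taking the low 32 bits, then shifting
/// right to get the upper 32 bits.
fn unpack_by_shift(packed: u64) -> (u32, u32) {
    (packed as u32, (packed >> 32) as u32)
}

/// The two 32-bit words in the order they appear in memory when the
/// 64-bit value is stored little-endian.
fn words_in_le_memory(packed: u64) -> [u32; 2] {
    let b = packed.to_le_bytes();
    [
        u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
    ]
}
```

With the value from the PR, `0xe8c7b756d76aa478`, both views give RC[0] = `0xd76aa478` first and RC[1] = `0xe8c7b756` second. The suggested value `0xd76aa478e8c7b756` gives them in the opposite order.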
Addresses GitHub Copilot's code review comment to remove the unnecessary #[allow(dead_code)] attribute from MD5_CONSTANTS_PACKED static, as the constant array is actively used by the inline assembly code.
Summary
Adds a high-performance AArch64 assembly implementation for MD5 that achieves 7-8% performance improvement over the standard implementation across all benchmarked buffer sizes.
Performance Results (Apple M1)
ARM64 Assembly vs Software:
Benchmarked on Apple M1 processor
Key Optimizations
- `ldp` (load pair) instructions to load two 32-bit constants at once
Implementation Details
- `md5/src/compress/aarch64_asm.rs` (1,121 lines)
- `cfg_if` matching existing `loongarch64_asm.rs` pattern
Attribution
This implementation incorporates optimization techniques from animetosho's md5-optimisation, which is released under public domain license.
Development Process
The implementation was developed through systematic optimization:
All optimizations follow AArch64 ABI conventions.