verify Bitpacker4x Neon implementation is actually useful #59

@trinity-1686a

Benchmarking bitpacking on an Apple M3 Max laptop suggests that the handwritten NEON code is actually detrimental to performance.
Below is a run of cargo bench. The reference is current main; the results of this run are with anything "neon"- or "aarch"-specific removed. There is little impact on plain bitpacking, but the delta and strict-delta variants improve across the board: decompression time drops by roughly 48-53% for delta and 24-29% for strict-delta, and compression time for both variants drops by roughly 6-38% depending on bit width.
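
For readers unfamiliar with the variants, here is a minimal scalar sketch of what delta and strict-delta coding compute, assuming the usual definitions (gaps between consecutive sorted values, and gaps minus one for strictly increasing values). The function names are hypothetical and this is not the crate's implementation. It only illustrates the extra step that the delta benchmarks exercise on top of plain bit packing.

```rust
/// Delta-encode a sorted slice: each output is the gap to the previous value.
/// `initial` is the value that precedes the block (hypothetical parameter name).
fn delta_encode(initial: u32, values: &[u32]) -> Vec<u32> {
    let mut prev = initial;
    values
        .iter()
        .map(|&v| {
            let gap = v.wrapping_sub(prev);
            prev = v;
            gap
        })
        .collect()
}

/// Inverse of `delta_encode`: a running (prefix) sum over the gaps.
fn delta_decode(initial: u32, gaps: &[u32]) -> Vec<u32> {
    let mut acc = initial;
    gaps.iter()
        .map(|&g| {
            acc = acc.wrapping_add(g);
            acc
        })
        .collect()
}

/// Strict variant: values strictly increase, so every gap is at least 1
/// and `gap - 1` is stored instead.
fn strict_delta_encode(initial: u32, values: &[u32]) -> Vec<u32> {
    let mut prev = initial;
    values
        .iter()
        .map(|&v| {
            let gap = v.wrapping_sub(prev).wrapping_sub(1);
            prev = v;
            gap
        })
        .collect()
}

/// Inverse of `strict_delta_encode`: running sum that adds the implicit 1 back.
fn strict_delta_decode(initial: u32, gaps: &[u32]) -> Vec<u32> {
    let mut acc = initial;
    gaps.iter()
        .map(|&g| {
            acc = acc.wrapping_add(g).wrapping_add(1);
            acc
        })
        .collect()
}
```

With an initial value of 0, the values 3, 7, 8, 20 delta-encode to 3, 4, 1, 12 and strict-delta-encode to 2, 3, 0, 11. Decoding carries a dependency from each element to the next through the running sum. That may be where the handwritten and compiler-generated code differ, but this is a guess rather than something measured here.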

It would be interesting if someone could reproduce this on a different ARM-powered device.

bench results
BitPacker4x/decompress-1                                                                            
                        time:   [52.879 ns 52.905 ns 52.941 ns]
                        thrpt:  [24.178 Gelem/s 24.194 Gelem/s 24.206 Gelem/s]
                 change:
                        time:   [-1.0221% -0.8966% -0.7573%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7631% +0.9047% +1.0326%]
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

BitPacker4x/decompress-delta-1                                                                             
                        time:   [665.92 ns 666.13 ns 666.32 ns]
                        thrpt:  [1.9210 Gelem/s 1.9215 Gelem/s 1.9222 Gelem/s]
                 change:
                        time:   [-53.512% -53.470% -53.426%] (p = 0.00 < 0.05)
                        thrpt:  [+114.71% +114.92% +115.11%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  4 (4.00%) high severe

BitPacker4x/decompress-strict-delta-1                                                                             
                        time:   [921.91 ns 923.87 ns 926.28 ns]
                        thrpt:  [1.3819 Gelem/s 1.3855 Gelem/s 1.3884 Gelem/s]
                 change:
                        time:   [-29.550% -29.334% -29.123%] (p = 0.00 < 0.05)
                        thrpt:  [+41.090% +41.511% +41.945%]
                        Performance has improved.

BitPacker4x/compress-1  time:   [104.32 ns 104.44 ns 104.59 ns]                                   
                        thrpt:  [12.239 Gelem/s 12.255 Gelem/s 12.270 Gelem/s]
                 change:
                        time:   [-0.9624% -0.7354% -0.5054%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5080% +0.7408% +0.9718%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/compress-delta-1                                                                            
                        time:   [154.11 ns 154.40 ns 154.66 ns]
                        thrpt:  [8.2765 Gelem/s 8.2900 Gelem/s 8.3057 Gelem/s]
                 change:
                        time:   [-20.762% -20.633% -20.503%] (p = 0.00 < 0.05)
                        thrpt:  [+25.791% +25.997% +26.202%]
                        Performance has improved.

BitPacker4x/compress-strict-delta-1                                                                            
                        time:   [176.85 ns 177.13 ns 177.49 ns]
                        thrpt:  [7.2117 Gelem/s 7.2264 Gelem/s 7.2377 Gelem/s]
                 change:
                        time:   [-7.9960% -7.7879% -7.5765%] (p = 0.00 < 0.05)
                        thrpt:  [+8.1976% +8.4456% +8.6909%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

BitPacker4x/decompress-2                                                                            
                        time:   [52.683 ns 52.861 ns 53.024 ns]
                        thrpt:  [24.140 Gelem/s 24.214 Gelem/s 24.296 Gelem/s]
                 change:
                        time:   [-1.9552% -1.7086% -1.4512%] (p = 0.00 < 0.05)
                        thrpt:  [+1.4726% +1.7383% +1.9942%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

BitPacker4x/decompress-delta-2                                                                             
                        time:   [661.34 ns 662.67 ns 664.36 ns]
                        thrpt:  [1.9267 Gelem/s 1.9316 Gelem/s 1.9355 Gelem/s]
                 change:
                        time:   [-47.969% -47.723% -47.464%] (p = 0.00 < 0.05)
                        thrpt:  [+90.346% +91.289% +92.192%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  10 (10.00%) high severe

BitPacker4x/decompress-strict-delta-2                                                                             
                        time:   [917.22 ns 921.65 ns 926.67 ns]
                        thrpt:  [1.3813 Gelem/s 1.3888 Gelem/s 1.3955 Gelem/s]
                 change:
                        time:   [-25.226% -24.839% -24.468%] (p = 0.00 < 0.05)
                        thrpt:  [+32.394% +33.048% +33.737%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

BitPacker4x/compress-2  time:   [102.32 ns 102.80 ns 103.41 ns]                                   
                        thrpt:  [12.378 Gelem/s 12.452 Gelem/s 12.510 Gelem/s]
                 change:
                        time:   [-2.3590% -1.9400% -1.4818%] (p = 0.00 < 0.05)
                        thrpt:  [+1.5041% +1.9784% +2.4160%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

BitPacker4x/compress-delta-2                                                                            
                        time:   [155.74 ns 156.14 ns 156.57 ns]
                        thrpt:  [8.1755 Gelem/s 8.1980 Gelem/s 8.2190 Gelem/s]
                 change:
                        time:   [-8.8859% -8.6692% -8.4502%] (p = 0.00 < 0.05)
                        thrpt:  [+9.2302% +9.4921% +9.7524%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

BitPacker4x/compress-strict-delta-2                                                                            
                        time:   [177.30 ns 177.97 ns 178.71 ns]
                        thrpt:  [7.1623 Gelem/s 7.1923 Gelem/s 7.2196 Gelem/s]
                 change:
                        time:   [-5.9149% -5.6137% -5.3176%] (p = 0.00 < 0.05)
                        thrpt:  [+5.6163% +5.9476% +6.2867%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

BitPacker4x/decompress-24                                                                            
                        time:   [60.974 ns 61.069 ns 61.173 ns]
                        thrpt:  [20.924 Gelem/s 20.960 Gelem/s 20.993 Gelem/s]
                 change:
                        time:   [-0.6847% -0.4837% -0.2973%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2982% +0.4860% +0.6895%]
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/decompress-delta-24                                                                             
                        time:   [533.76 ns 534.65 ns 535.61 ns]
                        thrpt:  [2.3898 Gelem/s 2.3941 Gelem/s 2.3981 Gelem/s]
                 change:
                        time:   [-52.624% -52.456% -52.290%] (p = 0.00 < 0.05)
                        thrpt:  [+109.60% +110.33% +111.08%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild

BitPacker4x/decompress-strict-delta-24                                                                             
                        time:   [795.94 ns 799.04 ns 802.41 ns]
                        thrpt:  [1.5952 Gelem/s 1.6019 Gelem/s 1.6082 Gelem/s]
                 change:
                        time:   [-27.538% -27.277% -27.006%] (p = 0.00 < 0.05)
                        thrpt:  [+36.997% +37.509% +38.004%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe

BitPacker4x/compress-24 time:   [105.41 ns 105.60 ns 105.81 ns]                                    
                        thrpt:  [12.097 Gelem/s 12.121 Gelem/s 12.143 Gelem/s]
                 change:
                        time:   [+0.3718% +0.6612% +0.9934%] (p = 0.00 < 0.05)
                        thrpt:  [-0.9836% -0.6568% -0.3705%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

BitPacker4x/compress-delta-24                                                                            
                        time:   [130.25 ns 130.53 ns 130.85 ns]
                        thrpt:  [9.7819 Gelem/s 9.8061 Gelem/s 9.8272 Gelem/s]
                 change:
                        time:   [-28.061% -27.890% -27.711%] (p = 0.00 < 0.05)
                        thrpt:  [+38.333% +38.677% +39.006%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

BitPacker4x/compress-strict-delta-24                                                                            
                        time:   [147.13 ns 147.37 ns 147.60 ns]
                        thrpt:  [8.6721 Gelem/s 8.6853 Gelem/s 8.6997 Gelem/s]
                 change:
                        time:   [-21.304% -21.093% -20.890%] (p = 0.00 < 0.05)
                        thrpt:  [+26.407% +26.731% +27.072%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

BitPacker4x/decompress-31                                                                            
                        time:   [71.861 ns 71.909 ns 71.960 ns]
                        thrpt:  [17.788 Gelem/s 17.800 Gelem/s 17.812 Gelem/s]
                 change:
                        time:   [-1.0571% -0.7500% -0.4758%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4781% +0.7556% +1.0684%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

BitPacker4x/decompress-delta-31                                                                             
                        time:   [635.70 ns 636.28 ns 636.88 ns]
                        thrpt:  [2.0098 Gelem/s 2.0117 Gelem/s 2.0135 Gelem/s]
                 change:
                        time:   [-49.275% -49.152% -49.047%] (p = 0.00 < 0.05)
                        thrpt:  [+96.258% +96.664% +97.142%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

BitPacker4x/decompress-strict-delta-31                                                                             
                        time:   [941.63 ns 944.23 ns 947.10 ns]
                        thrpt:  [1.3515 Gelem/s 1.3556 Gelem/s 1.3593 Gelem/s]
                 change:
                        time:   [-24.421% -24.203% -23.991%] (p = 0.00 < 0.05)
                        thrpt:  [+31.563% +31.932% +32.312%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/compress-31 time:   [123.71 ns 124.11 ns 124.57 ns]                                    
                        thrpt:  [10.275 Gelem/s 10.313 Gelem/s 10.347 Gelem/s]
                 change:
                        time:   [-1.2531% -0.9205% -0.5680%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5712% +0.9290% +1.2690%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe

BitPacker4x/compress-delta-31                                                                            
                        time:   [112.19 ns 112.36 ns 112.53 ns]
                        thrpt:  [11.374 Gelem/s 11.392 Gelem/s 11.409 Gelem/s]
                 change:
                        time:   [-38.633% -38.431% -38.246%] (p = 0.00 < 0.05)
                        thrpt:  [+61.932% +62.420% +62.955%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

BitPacker4x/compress-strict-delta-31                                                                            
                        time:   [128.39 ns 128.65 ns 128.94 ns]
                        thrpt:  [9.9272 Gelem/s 9.9495 Gelem/s 9.9699 Gelem/s]
                 change:
                        time:   [-30.464% -30.310% -30.148%] (p = 0.00 < 0.05)
                        thrpt:  [+43.160% +43.494% +43.811%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
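
As a sanity check on the figures above, the time and throughput changes criterion reports are two views of the same ratio, and the element count per iteration can be recovered from time and throughput. The helpers below are a small sketch with hypothetical names. The count of about 1280 elements per iteration is inferred from the reported numbers, not read from the benchmark source.

```rust
/// Relative throughput change implied by a relative time change.
/// Both are fractions, e.g. -0.5347 for a -53.47% time change.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

/// Elements processed per iteration, given the time per iteration in
/// nanoseconds and the throughput in Gelem/s (the 1e-9 and 1e9 cancel).
fn elements_per_iter(time_ns: f64, gelem_per_s: f64) -> f64 {
    time_ns * gelem_per_s
}

/// Time per iteration before the change, given the new time and the
/// relative time change.
fn previous_time_ns(new_time_ns: f64, time_change: f64) -> f64 {
    new_time_ns / (1.0 + time_change)
}
```

For decompress-delta-1, a time change of -53.470% gives a throughput change of about +114.9%, matching the reported +114.92%. The same run's 666.13 ns at 1.9215 Gelem/s works out to roughly 1280 elements per iteration, and implies about 1432 ns per iteration on the reference build.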
