verify Bitpacker4x SSE compression is actually useful #60

@trinity-1686a

Benchmarking bitpacking on a laptop with a Ryzen PRO 5850U, it seems the handwritten SSE code is vital to decompression (the delta variants get roughly 5 to 8 times slower without it) but mildly detrimental to compression (the delta variants get about 5 to 9% faster without it).
Below is a run of cargo bench: the reference is current main, and the results of this run are with anything "sse"- or "x86_64"-specific removed.
This suggests we may be better off ditching or rewriting the compression part, and that the fallback implementation of decompression could be improved to be kinder to auto-vectorisation, making it faster on platforms without hand-optimized code.
It would be interesting if someone could reproduce this on other x86 devices.
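
To make the auto-vectorisation idea concrete, here is a minimal sketch of a scalar delta decode written in fixed 4-lane steps. It is not the crate's code: `integrate_deltas` is a hypothetical helper, and it assumes each delta is the difference between consecutive values. The intent is only to show a shape (fixed-size chunks, a log-step prefix sum, no data-dependent branches) that may give the compiler a better chance to vectorise; whether it actually does would need to be checked in the generated assembly and in the benchmarks.

```rust
/// Hypothetical helper (not part of the bitpacking crate).
/// Rebuilds values from consecutive deltas: out[i] = initial + deltas[0] + ... + deltas[i].
pub fn integrate_deltas(initial: u32, deltas: &[u32], out: &mut [u32]) {
    assert_eq!(deltas.len(), out.len());
    let mut carry = initial;
    let mut d_chunks = deltas.chunks_exact(4);
    let mut o_chunks = out.chunks_exact_mut(4);
    for (d, o) in (&mut d_chunks).zip(&mut o_chunks) {
        // Step 1: add the neighbour one lane to the left.
        let s1 = [
            d[0],
            d[1].wrapping_add(d[0]),
            d[2].wrapping_add(d[1]),
            d[3].wrapping_add(d[2]),
        ];
        // Step 2: add the partial sum two lanes to the left.
        let s2 = [
            s1[0],
            s1[1],
            s1[2].wrapping_add(s1[0]),
            s1[3].wrapping_add(s1[1]),
        ];
        // Add the running total carried over from the previous chunk.
        for i in 0..4 {
            o[i] = s2[i].wrapping_add(carry);
        }
        carry = o[3];
    }
    // Plain scalar loop for the last (fewer than 4) elements.
    for (d, o) in d_chunks.remainder().iter().zip(o_chunks.into_remainder()) {
        carry = carry.wrapping_add(*d);
        *o = carry;
    }
}
```

For example, with `initial = 10` and deltas `[1, 2, 3, 4, 5, 6]`, the output buffer ends up as `[11, 13, 16, 20, 25, 31]`.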

bench results
BitPacker4x/decompress-1
                        time:   [81.526 ns 81.683 ns 81.884 ns]
                        thrpt:  [15.632 Gelem/s 15.670 Gelem/s 15.701 Gelem/s]
                 change:
                        time:   [+5.9196% +6.2317% +6.5342%] (p = 0.00 < 0.05)
                        thrpt:  [-6.1334% -5.8661% -5.5888%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking BitPacker4x/decompress-delta-1: Collecting 100 samples in
estimated 5.0005 s (2.9M iterations

BitPacker4x/decompress-delta-1
                        time:   [1.6949 µs 1.6966 µs 1.6986 µs]
                        thrpt:  [753.58 Melem/s 754.47 Melem/s 755.21 Melem/s]
                 change:
                        time:   [+661.96% +665.09% +668.56%] (p = 0.00 < 0.05)
                        thrpt:  [-86.989% -86.930% -86.876%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe

Benchmarking BitPacker4x/decompress-strict-delta-1: Collecting 100
samples in estimated 5.0092 s (2.8M ite

BitPacker4x/decompress-strict-delta-1
                        time:   [1.8215 µs 1.8243 µs 1.8275 µs]
                        thrpt:  [700.42 Melem/s 701.65 Melem/s 702.71 Melem/s]
                 change:
                        time:   [+647.48% +650.01% +652.44%] (p = 0.00 < 0.05)
                        thrpt:  [-86.710% -86.667% -86.622%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

BitPacker4x/compress-1  time:   [141.59 ns 141.72 ns 141.88 ns]
                        thrpt:  [9.0220 Gelem/s 9.0318 Gelem/s 9.0403 Gelem/s]
                 change:
                        time:   [+3.1783% +3.4735% +3.7609%] (p = 0.00 < 0.05)
                        thrpt:  [-3.6246% -3.3569% -3.0804%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

BitPacker4x/compress-delta-1
                        time:   [219.40 ns 219.70 ns 220.06 ns]
                        thrpt:  [5.8165 Gelem/s 5.8261 Gelem/s 5.8342 Gelem/s]
                 change:
                        time:   [-7.2601% -6.6117% -5.9881%] (p = 0.00 < 0.05)
                        thrpt:  [+6.3695% +7.0798% +7.8284%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-1: Collecting 100
samples in estimated 5.0012 s (20M iterat

BitPacker4x/compress-strict-delta-1
                        time:   [246.39 ns 246.82 ns 247.32 ns]
                        thrpt:  [5.1755 Gelem/s 5.1860 Gelem/s 5.1950 Gelem/s]
                 change:
                        time:   [-9.8551% -9.1131% -8.3893%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1576% +10.027% +10.932%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

BitPacker4x/decompress-2
                        time:   [80.723 ns 80.803 ns 80.891 ns]
                        thrpt:  [15.824 Gelem/s 15.841 Gelem/s 15.857 Gelem/s]
                 change:
                        time:   [+1.7119% +2.4564% +3.0279%] (p = 0.00 < 0.05)
                        thrpt:  [-2.9389% -2.3975% -1.6831%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Benchmarking BitPacker4x/decompress-delta-2: Collecting 100 samples in
estimated 5.0083 s (2.9M iterations

BitPacker4x/decompress-delta-2
                        time:   [1.7146 µs 1.7182 µs 1.7227 µs]
                        thrpt:  [743.03 Melem/s 744.96 Melem/s 746.53 Melem/s]
                 change:
                        time:   [+717.28% +720.67% +724.08%] (p = 0.00 < 0.05)
                        thrpt:  [-87.865% -87.815% -87.764%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking BitPacker4x/decompress-strict-delta-2: Collecting 100
samples in estimated 5.0073 s (2.7M ite

BitPacker4x/decompress-strict-delta-2
                        time:   [1.8311 µs 1.8353 µs 1.8402 µs]
                        thrpt:  [695.56 Melem/s 697.42 Melem/s 699.04 Melem/s]
                 change:
                        time:   [+631.17% +633.52% +635.97%] (p = 0.00 < 0.05)
                        thrpt:  [-86.413% -86.367% -86.323%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

BitPacker4x/compress-2  time:   [140.01 ns 140.24 ns 140.53 ns]
                        thrpt:  [9.1084 Gelem/s 9.1270 Gelem/s 9.1423 Gelem/s]
                 change:
                        time:   [+0.3380% +0.5743% +0.8256%] (p = 0.00 < 0.05)
                        thrpt:  [-0.8189% -0.5710% -0.3369%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

BitPacker4x/compress-delta-2
                        time:   [223.31 ns 223.66 ns 224.07 ns]
                        thrpt:  [5.7125 Gelem/s 5.7231 Gelem/s 5.7319 Gelem/s]
                 change:
                        time:   [-5.6500% -5.0903% -4.5382%] (p = 0.00 < 0.05)
                        thrpt:  [+4.7539% +5.3633% +5.9883%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-2: Collecting 100
samples in estimated 5.0004 s (20M iterat

BitPacker4x/compress-strict-delta-2
                        time:   [250.31 ns 250.79 ns 251.37 ns]
                        thrpt:  [5.0921 Gelem/s 5.1039 Gelem/s 5.1138 Gelem/s]
                 change:
                        time:   [-6.7716% -6.0884% -5.4438%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7572% +6.4831% +7.2634%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

BitPacker4x/decompress-24
                        time:   [98.654 ns 98.799 ns 98.986 ns]
                        thrpt:  [12.931 Gelem/s 12.956 Gelem/s 12.975 Gelem/s]
                 change:
                        time:   [+4.1156% +4.3488% +4.5934%] (p = 0.00 < 0.05)
                        thrpt:  [-4.3917% -4.1675% -3.9529%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

Benchmarking BitPacker4x/decompress-delta-24: Collecting 100 samples
in estimated 5.0056 s (4.3M iteration

BitPacker4x/decompress-delta-24
                        time:   [1.1541 µs 1.1571 µs 1.1608 µs]
                        thrpt:  [1.1027 Gelem/s 1.1062 Gelem/s 1.1090 Gelem/s]
                 change:
                        time:   [+435.63% +436.93% +438.26%] (p = 0.00 < 0.05)
                        thrpt:  [-81.422% -81.376% -81.330%]
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  17 (17.00%) high severe

Benchmarking BitPacker4x/decompress-strict-delta-24: Collecting 100
samples in estimated 5.0008 s (3.8M it

BitPacker4x/decompress-strict-delta-24
                        time:   [1.3251 µs 1.3279 µs 1.3315 µs]
                        thrpt:  [961.33 Melem/s 963.90 Melem/s 965.98 Melem/s]
                 change:
                        time:   [+451.99% +453.78% +455.70%] (p = 0.00 < 0.05)
                        thrpt:  [-82.005% -81.942% -81.884%]
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  9 (9.00%) high mild
  12 (12.00%) high severe

BitPacker4x/compress-24 time:   [153.40 ns 153.55 ns 153.74 ns]
                        thrpt:  [8.3258 Gelem/s 8.3360 Gelem/s 8.3445 Gelem/s]
                 change:
                        time:   [-0.5924% -0.4011% -0.2184%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2189% +0.4028% +0.5959%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

BitPacker4x/compress-delta-24
                        time:   [190.28 ns 190.61 ns 190.99 ns]
                        thrpt:  [6.7020 Gelem/s 6.7152 Gelem/s 6.7271 Gelem/s]
                 change:
                        time:   [-5.6890% -5.4981% -5.2935%] (p = 0.00 < 0.05)
                        thrpt:  [+5.5894% +5.8179% +6.0322%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-24: Collecting 100
samples in estimated 5.0004 s (21M itera

BitPacker4x/compress-strict-delta-24
                        time:   [234.94 ns 235.30 ns 235.75 ns]
                        thrpt:  [5.4296 Gelem/s 5.4399 Gelem/s 5.4482 Gelem/s]
                 change:
                        time:   [-7.5641% -7.3679% -7.1672%] (p = 0.00 < 0.05)
                        thrpt:  [+7.7205% +7.9539% +8.1831%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

BitPacker4x/decompress-31
                        time:   [119.15 ns 119.39 ns 119.63 ns]
                        thrpt:  [10.699 Gelem/s 10.722 Gelem/s 10.743 Gelem/s]
                 change:
                        time:   [-0.6080% -0.3993% -0.1623%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1625% +0.4009% +0.6117%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

Benchmarking BitPacker4x/decompress-delta-31: Collecting 100 samples
in estimated 5.0038 s (2.8M iteration

BitPacker4x/decompress-delta-31
                        time:   [1.7672 µs 1.7729 µs 1.7799 µs]
                        thrpt:  [719.15 Melem/s 721.97 Melem/s 724.33 Melem/s]
                 change:
                        time:   [+609.60% +611.46% +613.44%] (p = 0.00 < 0.05)
                        thrpt:  [-85.983% -85.944% -85.908%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) high mild
  12 (12.00%) high severe

Benchmarking BitPacker4x/decompress-strict-delta-31: Collecting 100
samples in estimated 5.0033 s (2.6M it

BitPacker4x/decompress-strict-delta-31
                        time:   [1.9287 µs 1.9305 µs 1.9327 µs]
                        thrpt:  [662.30 Melem/s 663.02 Melem/s 663.67 Melem/s]
                 change:
                        time:   [+593.50% +598.06% +601.80%] (p = 0.00 < 0.05)
                        thrpt:  [-85.751% -85.675% -85.580%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  2 (2.00%) high mild
  16 (16.00%) high severe

BitPacker4x/compress-31 time:   [167.44 ns 167.73 ns 168.06 ns]
                        thrpt:  [7.6165 Gelem/s 7.6315 Gelem/s 7.6447 Gelem/s]
                 change:
                        time:   [-1.2386% -1.0825% -0.9272%] (p = 0.00 < 0.05)
                        thrpt:  [+0.9358% +1.0943% +1.2541%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

BitPacker4x/compress-delta-31
                        time:   [175.49 ns 175.73 ns 176.01 ns]
                        thrpt:  [7.2724 Gelem/s 7.2841 Gelem/s 7.2940 Gelem/s]
                 change:
                        time:   [-5.4460% -5.2640% -5.0936%] (p = 0.00 < 0.05)
                        thrpt:  [+5.3670% +5.5565% +5.7597%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) high mild
  11 (11.00%) high severe

Benchmarking BitPacker4x/compress-strict-delta-31: Collecting 100
samples in estimated 5.0007 s (22M itera

BitPacker4x/compress-strict-delta-31
                        time:   [225.08 ns 225.44 ns 225.88 ns]
                        thrpt:  [5.6668 Gelem/s 5.6779 Gelem/s 5.6869 Gelem/s]
                 change:
                        time:   [-8.5164% -8.2743% -8.0374%] (p = 0.00 < 0.05)
                        thrpt:  [+8.7398% +9.0207% +9.3092%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
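
For anyone comparing their own runs against these numbers, the time and throughput figures above are related by simple arithmetic, sketched below. The function names are hypothetical and only illustrate the calculation: a relative time change t corresponds to a throughput change of 1/(1+t) - 1, and throughput multiplied by time gives the number of elements processed per iteration.

```rust
/// Relative throughput change implied by a relative time change.
/// Both values are fractions, e.g. 6.6509 stands for +665.09%.
pub fn thrpt_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}

/// Elements processed per iteration, from throughput (elements/s) and time (s).
pub fn elems_per_iter(thrpt_elem_per_s: f64, time_s: f64) -> f64 {
    thrpt_elem_per_s * time_s
}
```

Plugging in decompress-delta-1, `thrpt_change(6.6509)` gives about -0.8693, matching the reported -86.930%. Plugging in decompress-1, `elems_per_iter(15.670e9, 81.683e-9)` gives about 1280, so each iteration appears to process 1280 integers.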
