Benchmarking bitpacking on a laptop with a Ryzen PRO 5850U, it seems the handwritten SSE code is vital to decompression but detrimental to compression: without it, the delta decompression benchmarks take roughly 5 to 8 times longer, while the delta compression benchmarks get about 5% to 9% faster.
Below is a run of cargo bench. The reference is current main, and the results shown are for a build with everything "sse"- or "x86_64"-specific removed.
This suggests we may be better off ditching or rewriting the compression part, and that the fallback implementation of decompression could be made friendlier to auto-vectorisation, so that it is faster on platforms without hand-optimized code.
It would be interesting if someone could reproduce this on other x86 devices.
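To illustrate what "friendlier to auto-vectorisation" could mean for the delta decompression path, here is a minimal sketch. It is not the crate's code: the function names `delta_decode_scalar` and `delta_decode_chunked` are hypothetical, and the block size of 4 is only assumed to match the 4-lane layout. The first function is a plain running sum, which carries a dependency from every element to the next. The second computes the same result on fixed-size groups of four using two shift-and-add steps, the shape a SIMD prefix sum usually takes. Whether the compiler actually vectorises it depends on the compiler version and target, so the generated assembly and a benchmark would need to be checked.

```rust
/// Reference version: sequential delta decoding (a running sum).
/// Every output depends on the previous one, which limits vectorisation.
pub fn delta_decode_scalar(initial: u32, deltas: &[u32], out: &mut [u32]) {
    assert_eq!(deltas.len(), out.len());
    let mut prev = initial;
    for (o, &d) in out.iter_mut().zip(deltas) {
        prev = prev.wrapping_add(d);
        *o = prev;
    }
}

/// Same result, computed four values at a time on fixed-size arrays.
/// Assumes the input length is a multiple of 4 (hypothetical constraint).
pub fn delta_decode_chunked(initial: u32, deltas: &[u32], out: &mut [u32]) {
    assert_eq!(deltas.len(), out.len());
    assert_eq!(deltas.len() % 4, 0);
    let mut carry = initial;
    for (o, d) in out.chunks_exact_mut(4).zip(deltas.chunks_exact(4)) {
        let a = [d[0], d[1], d[2], d[3]];
        // Step 1: each lane adds the lane one position to its left.
        let b = [
            a[0],
            a[1].wrapping_add(a[0]),
            a[2].wrapping_add(a[1]),
            a[3].wrapping_add(a[2]),
        ];
        // Step 2: each lane adds the lane two positions to its left.
        // After this, c[i] is the sum of a[0..=i].
        let c = [
            b[0],
            b[1],
            b[2].wrapping_add(b[0]),
            b[3].wrapping_add(b[1]),
        ];
        // Add the last decoded value of the previous group to all lanes.
        for i in 0..4 {
            o[i] = c[i].wrapping_add(carry);
        }
        carry = o[3];
    }
}
```

Both functions should produce identical output for the same input, so the scalar one can serve as a test oracle while experimenting with the chunked form.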
bench results
BitPacker4x/decompress-1
time: [81.526 ns 81.683 ns 81.884 ns]
thrpt: [15.632 Gelem/s 15.670 Gelem/s 15.701 Gelem/s]
change:
time: [+5.9196% +6.2317% +6.5342%] (p = 0.00 < 0.05)
thrpt: [-6.1334% -5.8661% -5.5888%]
Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
Benchmarking BitPacker4x/decompress-delta-1: Collecting 100 samples in
estimated 5.0005 s (2.9M iterations
BitPacker4x/decompress-delta-1
time: [1.6949 µs 1.6966 µs 1.6986 µs]
thrpt: [753.58 Melem/s 754.47 Melem/s 755.21 Melem/s]
change:
time: [+661.96% +665.09% +668.56%] (p = 0.00 < 0.05)
thrpt: [-86.989% -86.930% -86.876%]
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
4 (4.00%) high mild
14 (14.00%) high severe
Benchmarking BitPacker4x/decompress-strict-delta-1: Collecting 100
samples in estimated 5.0092 s (2.8M ite
BitPacker4x/decompress-strict-delta-1
time: [1.8215 µs 1.8243 µs 1.8275 µs]
thrpt: [700.42 Melem/s 701.65 Melem/s 702.71 Melem/s]
change:
time: [+647.48% +650.01% +652.44%] (p = 0.00 < 0.05)
thrpt: [-86.710% -86.667% -86.622%]
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
BitPacker4x/compress-1 time: [141.59 ns 141.72 ns 141.88 ns]
thrpt: [9.0220 Gelem/s 9.0318 Gelem/s 9.0403 Gelem/s]
change:
time: [+3.1783% +3.4735% +3.7609%] (p = 0.00 < 0.05)
thrpt: [-3.6246% -3.3569% -3.0804%]
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
BitPacker4x/compress-delta-1
time: [219.40 ns 219.70 ns 220.06 ns]
thrpt: [5.8165 Gelem/s 5.8261 Gelem/s 5.8342 Gelem/s]
change:
time: [-7.2601% -6.6117% -5.9881%] (p = 0.00 < 0.05)
thrpt: [+6.3695% +7.0798% +7.8284%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) high mild
5 (5.00%) high severe
Benchmarking BitPacker4x/compress-strict-delta-1: Collecting 100
samples in estimated 5.0012 s (20M iterat
BitPacker4x/compress-strict-delta-1
time: [246.39 ns 246.82 ns 247.32 ns]
thrpt: [5.1755 Gelem/s 5.1860 Gelem/s 5.1950 Gelem/s]
change:
time: [-9.8551% -9.1131% -8.3893%] (p = 0.00 < 0.05)
thrpt: [+9.1576% +10.027% +10.932%]
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
BitPacker4x/decompress-2
time: [80.723 ns 80.803 ns 80.891 ns]
thrpt: [15.824 Gelem/s 15.841 Gelem/s 15.857 Gelem/s]
change:
time: [+1.7119% +2.4564% +3.0279%] (p = 0.00 < 0.05)
thrpt: [-2.9389% -2.3975% -1.6831%]
Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
5 (5.00%) high mild
3 (3.00%) high severe
Benchmarking BitPacker4x/decompress-delta-2: Collecting 100 samples in
estimated 5.0083 s (2.9M iterations
BitPacker4x/decompress-delta-2
time: [1.7146 µs 1.7182 µs 1.7227 µs]
thrpt: [743.03 Melem/s 744.96 Melem/s 746.53 Melem/s]
change:
time: [+717.28% +720.67% +724.08%] (p = 0.00 < 0.05)
thrpt: [-87.865% -87.815% -87.764%]
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
Benchmarking BitPacker4x/decompress-strict-delta-2: Collecting 100
samples in estimated 5.0073 s (2.7M ite
BitPacker4x/decompress-strict-delta-2
time: [1.8311 µs 1.8353 µs 1.8402 µs]
thrpt: [695.56 Melem/s 697.42 Melem/s 699.04 Melem/s]
change:
time: [+631.17% +633.52% +635.97%] (p = 0.00 < 0.05)
thrpt: [-86.413% -86.367% -86.323%]
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
BitPacker4x/compress-2 time: [140.01 ns 140.24 ns 140.53 ns]
thrpt: [9.1084 Gelem/s 9.1270 Gelem/s 9.1423 Gelem/s]
change:
time: [+0.3380% +0.5743% +0.8256%] (p = 0.00 < 0.05)
thrpt: [-0.8189% -0.5710% -0.3369%]
Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
BitPacker4x/compress-delta-2
time: [223.31 ns 223.66 ns 224.07 ns]
thrpt: [5.7125 Gelem/s 5.7231 Gelem/s 5.7319 Gelem/s]
change:
time: [-5.6500% -5.0903% -4.5382%] (p = 0.00 < 0.05)
thrpt: [+4.7539% +5.3633% +5.9883%]
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
8 (8.00%) high mild
3 (3.00%) high severe
Benchmarking BitPacker4x/compress-strict-delta-2: Collecting 100
samples in estimated 5.0004 s (20M iterat
BitPacker4x/compress-strict-delta-2
time: [250.31 ns 250.79 ns 251.37 ns]
thrpt: [5.0921 Gelem/s 5.1039 Gelem/s 5.1138 Gelem/s]
change:
time: [-6.7716% -6.0884% -5.4438%] (p = 0.00 < 0.05)
thrpt: [+5.7572% +6.4831% +7.2634%]
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
BitPacker4x/decompress-24
time: [98.654 ns 98.799 ns 98.986 ns]
thrpt: [12.931 Gelem/s 12.956 Gelem/s 12.975 Gelem/s]
change:
time: [+4.1156% +4.3488% +4.5934%] (p = 0.00 < 0.05)
thrpt: [-4.3917% -4.1675% -3.9529%]
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
Benchmarking BitPacker4x/decompress-delta-24: Collecting 100 samples
in estimated 5.0056 s (4.3M iteration
BitPacker4x/decompress-delta-24
time: [1.1541 µs 1.1571 µs 1.1608 µs]
thrpt: [1.1027 Gelem/s 1.1062 Gelem/s 1.1090 Gelem/s]
change:
time: [+435.63% +436.93% +438.26%] (p = 0.00 < 0.05)
thrpt: [-81.422% -81.376% -81.330%]
Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
17 (17.00%) high severe
Benchmarking BitPacker4x/decompress-strict-delta-24: Collecting 100
samples in estimated 5.0008 s (3.8M it
BitPacker4x/decompress-strict-delta-24
time: [1.3251 µs 1.3279 µs 1.3315 µs]
thrpt: [961.33 Melem/s 963.90 Melem/s 965.98 Melem/s]
change:
time: [+451.99% +453.78% +455.70%] (p = 0.00 < 0.05)
thrpt: [-82.005% -81.942% -81.884%]
Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
9 (9.00%) high mild
12 (12.00%) high severe
BitPacker4x/compress-24 time: [153.40 ns 153.55 ns 153.74 ns]
thrpt: [8.3258 Gelem/s 8.3360 Gelem/s 8.3445 Gelem/s]
change:
time: [-0.5924% -0.4011% -0.2184%] (p = 0.00 < 0.05)
thrpt: [+0.2189% +0.4028% +0.5959%]
Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
4 (4.00%) high mild
3 (3.00%) high severe
BitPacker4x/compress-delta-24
time: [190.28 ns 190.61 ns 190.99 ns]
thrpt: [6.7020 Gelem/s 6.7152 Gelem/s 6.7271 Gelem/s]
change:
time: [-5.6890% -5.4981% -5.2935%] (p = 0.00 < 0.05)
thrpt: [+5.5894% +5.8179% +6.0322%]
Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
2 (2.00%) low severe
3 (3.00%) low mild
5 (5.00%) high mild
6 (6.00%) high severe
Benchmarking BitPacker4x/compress-strict-delta-24: Collecting 100
samples in estimated 5.0004 s (21M itera
BitPacker4x/compress-strict-delta-24
time: [234.94 ns 235.30 ns 235.75 ns]
thrpt: [5.4296 Gelem/s 5.4399 Gelem/s 5.4482 Gelem/s]
change:
time: [-7.5641% -7.3679% -7.1672%] (p = 0.00 < 0.05)
thrpt: [+7.7205% +7.9539% +8.1831%]
Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
3 (3.00%) high mild
6 (6.00%) high severe
BitPacker4x/decompress-31
time: [119.15 ns 119.39 ns 119.63 ns]
thrpt: [10.699 Gelem/s 10.722 Gelem/s 10.743 Gelem/s]
change:
time: [-0.6080% -0.3993% -0.1623%] (p = 0.00 < 0.05)
thrpt: [+0.1625% +0.4009% +0.6117%]
Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe
Benchmarking BitPacker4x/decompress-delta-31: Collecting 100 samples
in estimated 5.0038 s (2.8M iteration
BitPacker4x/decompress-delta-31
time: [1.7672 µs 1.7729 µs 1.7799 µs]
thrpt: [719.15 Melem/s 721.97 Melem/s 724.33 Melem/s]
change:
time: [+609.60% +611.46% +613.44%] (p = 0.00 < 0.05)
thrpt: [-85.983% -85.944% -85.908%]
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
4 (4.00%) high mild
12 (12.00%) high severe
Benchmarking BitPacker4x/decompress-strict-delta-31: Collecting 100
samples in estimated 5.0033 s (2.6M it
BitPacker4x/decompress-strict-delta-31
time: [1.9287 µs 1.9305 µs 1.9327 µs]
thrpt: [662.30 Melem/s 663.02 Melem/s 663.67 Melem/s]
change:
time: [+593.50% +598.06% +601.80%] (p = 0.00 < 0.05)
thrpt: [-85.751% -85.675% -85.580%]
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
2 (2.00%) high mild
16 (16.00%) high severe
BitPacker4x/compress-31 time: [167.44 ns 167.73 ns 168.06 ns]
thrpt: [7.6165 Gelem/s 7.6315 Gelem/s 7.6447 Gelem/s]
change:
time: [-1.2386% -1.0825% -0.9272%] (p = 0.00 < 0.05)
thrpt: [+0.9358% +1.0943% +1.2541%]
Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
1 (1.00%) low mild
7 (7.00%) high mild
3 (3.00%) high severe
BitPacker4x/compress-delta-31
time: [175.49 ns 175.73 ns 176.01 ns]
thrpt: [7.2724 Gelem/s 7.2841 Gelem/s 7.2940 Gelem/s]
change:
time: [-5.4460% -5.2640% -5.0936%] (p = 0.00 < 0.05)
thrpt: [+5.3670% +5.5565% +5.7597%]
Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
7 (7.00%) high mild
11 (11.00%) high severe
Benchmarking BitPacker4x/compress-strict-delta-31: Collecting 100
samples in estimated 5.0007 s (22M itera
BitPacker4x/compress-strict-delta-31
time: [225.08 ns 225.44 ns 225.88 ns]
thrpt: [5.6668 Gelem/s 5.6779 Gelem/s 5.6869 Gelem/s]
change:
time: [-8.5164% -8.2743% -8.0374%] (p = 0.00 < 0.05)
thrpt: [+8.7398% +9.0207% +9.3092%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
1 (1.00%) low mild
4 (4.00%) high mild
3 (3.00%) high severe