Performance improvements and simplifications for fixed size row-based rolling windows #17623

wence- · 2024-12-18T17:37:33Z

Description

Having introduced benchmarks of rolling window performance in #17613, we can now look at what happens when cleaning up some the implementation.

Breaking change

Previously, the most general "variable-sized" rolling window function applied a transform iterator (sometimes multiple times) to the user-provided preceding and following columns. We now require, and document, that the user provides values that are always in bounds.

This simplifies the book-keeping, and allows us to remove some (now redundant) checks from the rolling kernel.

Performance improvements

In various places, we accept fixed size integers as row-based offsets for rolling windows. We used to materialise these into columns representing the clamped preceding and following columns. However, we did so inconsistently with respect to which side we would clamp (after the change to allow negative window offsets).

To fix this, rationalise this clamping in one place, and just use a transform iterator. We now do this consistently for both grouped and ungrouped fixed-size windows.

Benchmark results

Comparing da4accc (the tip of #17613) and this branch:

nvbench compare output

Headline numbers:

group-aware fixed rolling windows between 10 and 30% faster, use 30% less memory
ungrouped fixed rolling windows between 10 and 50% faster, use 50% less memory
ungrouped variable rolling windows basically no change (so no regression)

All the data

latest/_deps/nvbench-src/scripts/nvbench_compare.py da4accc.json c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json
['da4acccf6b76a93b1e9395b51950a17c45da99c3.json', 'c3973aa36c373e013c6c90f8d7bd58fe0aed6880.json']

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	min_periods	cardinality	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I32	2^14	1	2	1	10	123.556 us	3.90%	111.199 us	4.09%	-12.357 us	-10.00%	FAIL
I32	2^28	1	2	1	10	24.945 ms	0.13%	17.126 ms	0.26%	-7819.548 us	-31.35%	FAIL
I32	2^14	10	2	1	10	122.737 us	4.09%	109.055 us	4.40%	-13.682 us	-11.15%	FAIL
I32	2^28	10	2	1	10	24.944 ms	0.18%	17.151 ms	0.20%	-7793.093 us	-31.24%	FAIL
I32	2^14	1	2	1	100	119.947 us	4.07%	108.432 us	5.18%	-11.515 us	-9.60%	FAIL
I32	2^28	1	2	1	100	24.954 ms	0.15%	17.149 ms	0.20%	-7805.558 us	-31.28%	FAIL
I32	2^14	10	2	1	100	123.919 us	3.85%	109.812 us	4.63%	-14.107 us	-11.38%	FAIL
I32	2^28	10	2	1	100	24.955 ms	0.14%	17.173 ms	0.18%	-7782.547 us	-31.19%	FAIL
I32	2^14	1	2	1	1000000	121.896 us	4.59%	109.115 us	4.57%	-12.781 us	-10.49%	FAIL
I32	2^28	1	2	1	1000000	25.291 ms	0.15%	17.460 ms	0.21%	-7830.713 us	-30.96%	FAIL
I32	2^14	10	2	1	1000000	126.450 us	3.65%	111.787 us	3.37%	-14.662 us	-11.60%	FAIL
I32	2^28	10	2	1	1000000	25.295 ms	0.13%	17.593 ms	0.19%	-7701.923 us	-30.45%	FAIL
I32	2^14	1	2	1	100000000	122.268 us	4.42%	109.338 us	5.17%	-12.930 us	-10.57%	FAIL
I32	2^28	1	2	1	100000000	28.409 ms	0.12%	20.422 ms	0.16%	-7987.039 us	-28.11%	FAIL
I32	2^14	10	2	1	100000000	127.440 us	2.96%	113.009 us	4.54%	-14.431 us	-11.32%	FAIL
I32	2^28	10	2	1	100000000	28.395 ms	0.12%	20.789 ms	0.17%	-7606.146 us	-26.79%	FAIL
F64	2^14	1	2	1	10	119.900 us	4.27%	110.024 us	5.21%	-9.876 us	-8.24%	FAIL
F64	2^28	1	2	1	10	26.547 ms	0.14%	18.790 ms	0.19%	-7756.825 us	-29.22%	FAIL
F64	2^14	10	2	1	10	127.211 us	3.62%	112.724 us	4.63%	-14.487 us	-11.39%	FAIL
F64	2^28	10	2	1	10	30.087 ms	0.31%	23.886 ms	0.19%	-6201.201 us	-20.61%	FAIL
F64	2^14	1	2	1	100	120.433 us	2.42%	109.777 us	4.76%	-10.656 us	-8.85%	FAIL
F64	2^28	1	2	1	100	26.568 ms	0.14%	18.797 ms	0.18%	-7770.905 us	-29.25%	FAIL
F64	2^14	10	2	1	100	126.702 us	3.34%	112.492 us	4.27%	-14.210 us	-11.22%	FAIL
F64	2^28	10	2	1	100	30.106 ms	0.24%	24.001 ms	0.46%	-6105.382 us	-20.28%	FAIL
F64	2^14	1	2	1	1000000	119.557 us	4.07%	110.355 us	5.38%	-9.203 us	-7.70%	FAIL
F64	2^28	1	2	1	1000000	26.906 ms	0.15%	19.126 ms	0.18%	-7779.215 us	-28.91%	FAIL
F64	2^14	10	2	1	1000000	128.546 us	3.89%	113.429 us	2.36%	-15.117 us	-11.76%	FAIL
F64	2^28	10	2	1	1000000	30.898 ms	0.18%	24.737 ms	0.45%	-6161.068 us	-19.94%	FAIL
F64	2^14	1	2	1	100000000	120.927 us	4.36%	109.903 us	4.67%	-11.024 us	-9.12%	FAIL
F64	2^28	1	2	1	100000000	29.997 ms	0.12%	21.968 ms	0.16%	-8028.453 us	-26.76%	FAIL
F64	2^14	10	2	1	100000000	127.789 us	3.78%	112.861 us	3.81%	-14.928 us	-11.68%	FAIL
F64	2^28	10	2	1	100000000	36.555 ms	0.15%	29.882 ms	0.12%	-6672.173 us	-18.25%	FAIL

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	min_periods	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I32	2^14	1	2	1	31.751 us	4.87%	17.603 us	6.85%	-14.148 us	-44.56%	FAIL
I32	2^22	1	2	1	206.365 us	1.51%	95.830 us	2.39%	-110.534 us	-53.56%	FAIL
I32	2^28	1	2	1	10.883 ms	0.20%	4.767 ms	0.25%	-6115.576 us	-56.20%	FAIL
I32	2^14	10	2	1	31.956 us	5.91%	17.559 us	7.14%	-14.397 us	-45.05%	FAIL
I32	2^22	10	2	1	206.654 us	1.65%	95.575 us	2.29%	-111.079 us	-53.75%	FAIL
I32	2^28	10	2	1	10.872 ms	0.12%	4.758 ms	0.29%	-6114.139 us	-56.24%	FAIL
I32	2^14	100	2	1	40.611 us	4.58%	26.636 us	9.61%	-13.976 us	-34.41%	FAIL
I32	2^22	100	2	1	293.156 us	1.54%	235.445 us	2.23%	-57.710 us	-19.69%	FAIL
I32	2^28	100	2	1	17.109 ms	0.42%	14.613 ms	0.48%	-2496.488 us	-14.59%	FAIL
I32	2^14	1	2	20	32.786 us	6.05%	18.115 us	10.13%	-14.671 us	-44.75%	FAIL
I32	2^22	1	2	20	206.530 us	1.61%	96.530 us	2.79%	-109.999 us	-53.26%	FAIL
I32	2^28	1	2	20	10.883 ms	0.16%	4.767 ms	0.27%	-6115.283 us	-56.19%	FAIL
I32	2^14	10	2	20	31.927 us	4.21%	17.784 us	15.89%	-14.144 us	-44.30%	FAIL
I32	2^22	10	2	20	206.093 us	1.58%	95.657 us	2.92%	-110.436 us	-53.59%	FAIL
I32	2^28	10	2	20	10.872 ms	0.12%	4.757 ms	0.06%	-6115.435 us	-56.25%	FAIL
I32	2^14	100	2	20	40.361 us	3.87%	26.359 us	3.42%	-14.002 us	-34.69%	FAIL
I32	2^22	100	2	20	292.868 us	1.80%	235.340 us	2.69%	-57.528 us	-19.64%	FAIL
I32	2^28	100	2	20	17.075 ms	0.47%	14.613 ms	0.53%	-2462.145 us	-14.42%	FAIL
F64	2^14	1	2	1	35.335 us	11.59%	18.771 us	13.65%	-16.564 us	-46.88%	FAIL
F64	2^22	1	2	1	230.322 us	1.42%	121.412 us	3.23%	-108.910 us	-47.29%	FAIL
F64	2^28	1	2	1	12.460 ms	0.09%	6.311 ms	0.11%	-6149.789 us	-49.35%	FAIL
F64	2^14	10	2	1	39.365 us	8.03%	24.712 us	14.97%	-14.653 us	-37.22%	FAIL
F64	2^22	10	2	1	283.241 us	1.25%	224.296 us	1.86%	-58.945 us	-20.81%	FAIL
F64	2^28	10	2	1	15.983 ms	0.13%	13.039 ms	0.86%	-2943.223 us	-18.42%	FAIL
F64	2^14	100	2	1	46.514 us	8.38%	28.323 us	10.77%	-18.191 us	-39.11%	FAIL
F64	2^22	100	2	1	1.739 ms	0.16%	1.668 ms	0.53%	-70.890 us	-4.08%	FAIL
F64	2^28	100	2	1	108.509 ms	0.06%	104.891 ms	0.04%	-3618.662 us	-3.33%	FAIL
F64	2^14	1	2	20	31.566 us	6.77%	17.406 us	8.84%	-14.160 us	-44.86%	FAIL
F64	2^22	1	2	20	230.307 us	1.47%	121.044 us	1.90%	-109.263 us	-47.44%	FAIL
F64	2^28	1	2	20	12.462 ms	0.08%	6.315 ms	0.21%	-6147.508 us	-49.33%	FAIL
F64	2^14	10	2	20	39.277 us	6.21%	23.033 us	16.63%	-16.244 us	-41.36%	FAIL
F64	2^22	10	2	20	281.374 us	1.09%	219.307 us	1.12%	-62.067 us	-22.06%	FAIL
F64	2^28	10	2	20	15.944 ms	0.47%	12.882 ms	0.72%	-3062.193 us	-19.21%	FAIL
F64	2^14	100	2	20	46.079 us	8.03%	28.287 us	11.55%	-17.792 us	-38.61%	FAIL
F64	2^22	100	2	20	1.740 ms	0.19%	1.664 ms	0.36%	-75.602 us	-4.35%	FAIL
F64	2^28	100	2	20	108.475 ms	0.04%	104.920 ms	0.04%	-3555.241 us	-3.28%	FAIL

row_variable_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I32	2^14	10	2	20.407 us	17.95%	17.981 us	9.19%	-2.427 us	-11.89%	FAIL
I32	2^22	10	2	145.353 us	1.96%	145.458 us	1.82%	0.105 us	0.07%	PASS
I32	2^28	10	2	7.850 ms	0.17%	7.863 ms	0.21%	12.240 us	0.16%	PASS
I32	2^14	100	2	26.130 us	6.01%	26.324 us	3.86%	0.194 us	0.74%	PASS
I32	2^22	100	2	234.618 us	2.86%	245.878 us	3.30%	11.260 us	4.80%	FAIL
I32	2^28	100	2	14.624 ms	0.79%	15.221 ms	0.68%	596.983 us	4.08%	FAIL
F64	2^14	10	2	26.256 us	2.86%	26.508 us	6.46%	0.252 us	0.96%	PASS
F64	2^22	10	2	223.001 us	1.96%	222.131 us	1.85%	-0.870 us	-0.39%	PASS
F64	2^28	10	2	12.896 ms	0.58%	12.893 ms	0.78%	-3.356 us	-0.03%	PASS
F64	2^14	100	2	33.438 us	6.71%	34.323 us	1.75%	0.885 us	2.65%	FAIL
F64	2^22	100	2	1.667 ms	0.49%	1.672 ms	0.26%	5.138 us	0.31%	FAIL
F64	2^28	100	2	104.673 ms	0.05%	104.982 ms	0.04%	309.519 us	0.30%	FAIL

Summary

Total Matches: 80
- Pass (diff <= min_noise): 6
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 74

Before

Benchmark Results

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	min_periods	cardinality	Samples	CPU Time	Noise	GPU Time	Noise	Mrows/s	peak_memory_usage
I32	2^14 = 16384	1	2	1	10	840x	127.585 us	5.09%	123.556 us	3.90%	132	450.031 KiB
I32	2^28 = 268435456	1	2	1	10	601x	24.948 ms	0.13%	24.945 ms	0.13%	10761	7.031 GiB
I32	2^14 = 16384	10	2	1	10	1020x	126.638 us	5.18%	122.737 us	4.09%	133	450.031 KiB
I32	2^28 = 268435456	10	2	1	10	601x	24.947 ms	0.18%	24.944 ms	0.18%	10761	7.031 GiB
I32	2^14 = 16384	1	2	1	100	630x	123.913 us	5.30%	119.947 us	4.07%	136	450.031 KiB
I32	2^28 = 268435456	1	2	1	100	601x	24.958 ms	0.15%	24.954 ms	0.15%	10757	7.031 GiB
I32	2^14 = 16384	10	2	1	100	984x	127.839 us	4.98%	123.919 us	3.85%	132	450.031 KiB
I32	2^28 = 268435456	10	2	1	100	601x	24.958 ms	0.14%	24.955 ms	0.14%	10756	7.031 GiB
I32	2^14 = 16384	1	2	1	1000000	802x	125.851 us	5.65%	121.896 us	4.59%	134	450.031 KiB
I32	2^28 = 268435456	1	2	1	1000000	593x	25.294 ms	0.15%	25.291 ms	0.15%	10613	7.031 GiB
I32	2^14 = 16384	10	2	1	1000000	992x	130.416 us	4.81%	126.450 us	3.65%	129	450.031 KiB
I32	2^28 = 268435456	10	2	1	1000000	593x	25.298 ms	0.14%	25.295 ms	0.13%	10612	7.031 GiB
I32	2^14 = 16384	1	2	1	100000000	820x	126.300 us	5.56%	122.268 us	4.42%	134	450.031 KiB
I32	2^28 = 268435456	1	2	1	100000000	528x	28.413 ms	0.12%	28.409 ms	0.12%	9448	7.031 GiB
I32	2^14 = 16384	10	2	1	100000000	752x	131.420 us	4.30%	127.440 us	2.96%	128	450.031 KiB
I32	2^28 = 268435456	10	2	1	100000000	528x	28.398 ms	0.12%	28.395 ms	0.12%	9453	7.031 GiB
F64	2^14 = 16384	1	2	1	10	860x	123.857 us	5.40%	119.900 us	4.27%	136	450.031 KiB
F64	2^28 = 268435456	1	2	1	10	565x	26.551 ms	0.14%	26.547 ms	0.14%	10111	7.031 GiB
F64	2^14 = 16384	10	2	1	10	706x	131.193 us	4.81%	127.211 us	3.62%	128	450.031 KiB
F64	2^28 = 268435456	10	2	1	10	499x	30.091 ms	0.31%	30.087 ms	0.31%	8921	7.031 GiB
F64	2^14 = 16384	1	2	1	100	598x	124.407 us	4.10%	120.433 us	2.42%	136	450.031 KiB
F64	2^28 = 268435456	1	2	1	100	565x	26.572 ms	0.14%	26.568 ms	0.14%	10103	7.031 GiB
F64	2^14 = 16384	10	2	1	100	778x	130.659 us	4.55%	126.702 us	3.34%	129	450.031 KiB
F64	2^28 = 268435456	10	2	1	100	498x	30.109 ms	0.25%	30.106 ms	0.24%	8916	7.031 GiB
F64	2^14 = 16384	1	2	1	1000000	716x	123.521 us	5.27%	119.557 us	4.07%	137	450.031 KiB
F64	2^28 = 268435456	1	2	1	1000000	557x	26.909 ms	0.15%	26.906 ms	0.15%	9976	7.031 GiB
F64	2^14 = 16384	10	2	1	1000000	802x	132.551 us	5.00%	128.546 us	3.89%	127	450.031 KiB
F64	2^28 = 268435456	10	2	1	1000000	486x	30.902 ms	0.18%	30.898 ms	0.18%	8687	7.031 GiB
F64	2^14 = 16384	1	2	1	100000000	818x	124.957 us	5.54%	120.927 us	4.36%	135	450.031 KiB
F64	2^28 = 268435456	1	2	1	100000000	500x	30.000 ms	0.12%	29.997 ms	0.12%	8948	7.031 GiB
F64	2^14 = 16384	10	2	1	100000000	810x	131.756 us	4.90%	127.789 us	3.78%	128	450.031 KiB
F64	2^28 = 268435456	10	2	1	100000000	411x	36.558 ms	0.15%	36.555 ms	0.15%	7343	7.031 GiB

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	min_periods	Samples	CPU Time	Noise	GPU Time	Noise	Mrows/s	peak_memory_usage
I32	2^14 = 16384	1	2	1	404x	35.776 us	13.79%	31.751 us	4.87%	516	258.016 KiB
I32	2^22 = 4194304	1	2	1	760x	210.368 us	2.49%	206.365 us	1.51%	20324	64.500 MiB
I32	2^28 = 268435456	1	2	1	1032x	10.886 ms	0.20%	10.883 ms	0.20%	24666	4.031 GiB
I32	2^14 = 16384	10	2	1	382x	35.926 us	13.76%	31.956 us	5.91%	512	258.016 KiB
I32	2^22 = 4194304	10	2	1	780x	210.618 us	2.52%	206.654 us	1.65%	20296	64.500 MiB
I32	2^28 = 268435456	10	2	1	938x	10.876 ms	0.12%	10.872 ms	0.12%	24690	4.031 GiB
I32	2^14 = 16384	100	2	1	370x	44.559 us	10.75%	40.611 us	4.58%	403	258.016 KiB
I32	2^22 = 4194304	100	2	1	782x	297.286 us	2.11%	293.156 us	1.54%	14307	64.500 MiB
I32	2^28 = 268435456	100	2	1	876x	17.113 ms	0.42%	17.109 ms	0.42%	15689	4.031 GiB
I32	2^14 = 16384	1	2	20	568x	36.961 us	14.10%	32.786 us	6.05%	499	258.016 KiB
I32	2^22 = 4194304	1	2	20	808x	210.503 us	2.51%	206.530 us	1.61%	20308	64.500 MiB
I32	2^28 = 268435456	1	2	20	1124x	10.886 ms	0.16%	10.883 ms	0.16%	24666	4.031 GiB
I32	2^14 = 16384	10	2	20	408x	35.851 us	12.88%	31.927 us	4.21%	513	258.016 KiB
I32	2^22 = 4194304	10	2	20	756x	210.040 us	2.50%	206.093 us	1.58%	20351	64.500 MiB
I32	2^28 = 268435456	10	2	20	1004x	10.876 ms	0.13%	10.872 ms	0.12%	24690	4.031 GiB
I32	2^14 = 16384	100	2	20	496x	44.365 us	10.65%	40.361 us	3.87%	405	258.016 KiB
I32	2^22 = 4194304	100	2	20	694x	296.938 us	2.29%	292.868 us	1.80%	14321	64.500 MiB
I32	2^28 = 268435456	100	2	20	878x	17.079 ms	0.47%	17.075 ms	0.47%	15720	4.031 GiB
F64	2^14 = 16384	1	2	1	622x	39.469 us	16.54%	35.335 us	11.59%	463	258.016 KiB
F64	2^22 = 4194304	1	2	1	784x	234.374 us	2.31%	230.322 us	1.42%	18210	64.500 MiB
F64	2^28 = 268435456	1	2	1	1060x	12.464 ms	0.09%	12.460 ms	0.09%	21543	4.031 GiB
F64	2^14 = 16384	10	2	1	434x	43.384 us	13.04%	39.365 us	8.03%	416	258.016 KiB
F64	2^22 = 4194304	10	2	1	648x	287.190 us	1.86%	283.241 us	1.25%	14808	64.500 MiB
F64	2^28 = 268435456	10	2	1	938x	15.986 ms	0.13%	15.983 ms	0.13%	16795	4.031 GiB
F64	2^14 = 16384	100	2	1	1006x	50.493 us	11.98%	46.514 us	8.38%	352	258.016 KiB
F64	2^22 = 4194304	100	2	1	678x	1.743 ms	0.28%	1.739 ms	0.16%	2411	64.500 MiB
F64	2^28 = 268435456	100	2	1	139x	108.511 ms	0.06%	108.509 ms	0.06%	2473	4.031 GiB
F64	2^14 = 16384	1	2	20	572x	35.501 us	14.15%	31.566 us	6.77%	519	258.016 KiB
F64	2^22 = 4194304	1	2	20	792x	234.279 us	2.27%	230.307 us	1.47%	18211	64.500 MiB
F64	2^28 = 268435456	1	2	20	1112x	12.466 ms	0.09%	12.462 ms	0.08%	21539	4.031 GiB
F64	2^14 = 16384	10	2	20	430x	43.210 us	11.79%	39.277 us	6.21%	417	258.016 KiB
F64	2^22 = 4194304	10	2	20	690x	285.352 us	1.80%	281.374 us	1.09%	14906	64.500 MiB
F64	2^28 = 268435456	10	2	20	886x	15.948 ms	0.47%	15.944 ms	0.47%	16835	4.031 GiB
F64	2^14 = 16384	100	2	20	618x	50.056 us	11.92%	46.079 us	8.03%	355	258.016 KiB
F64	2^22 = 4194304	100	2	20	710x	1.744 ms	0.30%	1.740 ms	0.19%	2411	64.500 MiB
F64	2^28 = 268435456	100	2	20	139x	108.477 ms	0.04%	108.475 ms	0.04%	2474	4.031 GiB

row_variable_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	Samples	CPU Time	Noise	GPU Time	Noise	Mrows/s	peak_memory_usage
I32	2^14 = 16384	10	2	436x	24.361 us	26.44%	20.407 us	17.95%	802	130.016 KiB
I32	2^22 = 4194304	10	2	714x	149.553 us	5.30%	145.353 us	1.96%	28856	32.500 MiB
I32	2^28 = 268435456	10	2	974x	7.854 ms	0.17%	7.850 ms	0.17%	34193	2.031 GiB
I32	2^14 = 16384	100	2	364x	30.150 us	16.63%	26.130 us	6.01%	627	130.016 KiB
I32	2^22 = 4194304	100	2	628x	238.815 us	3.44%	234.618 us	2.86%	17877	32.500 MiB
I32	2^28 = 268435456	100	2	624x	14.628 ms	0.79%	14.624 ms	0.79%	18355	2.031 GiB
F64	2^14 = 16384	10	2	440x	30.546 us	16.87%	26.256 us	2.86%	624	130.016 KiB
F64	2^22 = 4194304	10	2	574x	227.037 us	2.68%	223.001 us	1.96%	18808	32.500 MiB
F64	2^28 = 268435456	10	2	930x	12.900 ms	0.58%	12.896 ms	0.58%	20814	2.031 GiB
F64	2^14 = 16384	100	2	412x	37.479 us	13.85%	33.438 us	6.71%	489	130.016 KiB
F64	2^22 = 4194304	100	2	792x	1.671 ms	0.55%	1.667 ms	0.49%	2516	32.500 MiB
F64	2^28 = 268435456	100	2	144x	104.675 ms	0.05%	104.673 ms	0.05%	2564	2.031 GiB

After

Benchmark Results

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	min_periods	cardinality	Samples	CPU Time	Noise	GPU Time	Noise	Mrows/s	peak_memory_usage
I32	2^14 = 16384	1	2	1	10	610x	116.114 us	4.38%	112.111 us	2.55%	146	322.031 KiB
I32	2^28 = 268435456	1	2	1	10	875x	17.128 ms	0.22%	17.124 ms	0.22%	15676	5.031 GiB
I32	2^14 = 16384	10	2	1	10	1038x	113.661 us	12.55%	109.758 us	12.03%	149	322.031 KiB
I32	2^28 = 268435456	10	2	1	10	874x	17.147 ms	0.19%	17.143 ms	0.19%	15658	5.031 GiB
I32	2^14 = 16384	1	2	1	100	784x	112.206 us	5.86%	108.231 us	4.51%	151	322.031 KiB
I32	2^28 = 268435456	1	2	1	100	874x	17.144 ms	0.19%	17.140 ms	0.19%	15661	5.031 GiB
I32	2^14 = 16384	10	2	1	100	742x	114.504 us	5.52%	110.542 us	4.21%	148	322.031 KiB
I32	2^28 = 268435456	10	2	1	100	872x	17.189 ms	0.19%	17.185 ms	0.19%	15620	5.031 GiB
I32	2^14 = 16384	1	2	1	1000000	660x	113.501 us	5.42%	109.536 us	4.03%	149	322.031 KiB
I32	2^28 = 268435456	1	2	1	1000000	857x	17.490 ms	0.30%	17.486 ms	0.30%	15351	5.031 GiB
I32	2^14 = 16384	10	2	1	1000000	596x	115.135 us	5.20%	111.178 us	3.78%	147	322.031 KiB
I32	2^28 = 268435456	10	2	1	1000000	851x	17.608 ms	0.18%	17.604 ms	0.18%	15248	5.031 GiB
I32	2^14 = 16384	1	2	1	100000000	646x	113.959 us	4.91%	109.990 us	3.34%	148	322.031 KiB
I32	2^28 = 268435456	1	2	1	100000000	734x	20.436 ms	0.19%	20.432 ms	0.18%	13138	5.031 GiB
I32	2^14 = 16384	10	2	1	100000000	418x	116.388 us	4.26%	112.387 us	2.31%	145	322.031 KiB
I32	2^28 = 268435456	10	2	1	100000000	721x	20.796 ms	0.15%	20.792 ms	0.15%	12910	5.031 GiB
F64	2^14 = 16384	1	2	1	10	744x	113.892 us	5.18%	109.941 us	3.75%	149	322.031 KiB
F64	2^28 = 268435456	1	2	1	10	797x	18.798 ms	0.17%	18.794 ms	0.17%	14283	5.031 GiB
F64	2^14 = 16384	10	2	1	10	694x	114.918 us	5.14%	110.968 us	3.69%	147	322.031 KiB
F64	2^28 = 268435456	10	2	1	10	626x	23.972 ms	0.41%	23.968 ms	0.41%	11199	5.031 GiB
F64	2^14 = 16384	1	2	1	100	754x	114.013 us	5.51%	110.094 us	4.22%	148	322.031 KiB
F64	2^28 = 268435456	1	2	1	100	797x	18.809 ms	0.19%	18.805 ms	0.18%	14274	5.031 GiB
F64	2^14 = 16384	10	2	1	100	624x	114.799 us	5.12%	110.850 us	3.64%	147	322.031 KiB
F64	2^28 = 268435456	10	2	1	100	624x	24.035 ms	0.53%	24.031 ms	0.53%	11170	5.031 GiB
F64	2^14 = 16384	1	2	1	1000000	588x	114.699 us	5.00%	110.743 us	3.46%	147	322.031 KiB
F64	2^28 = 268435456	1	2	1	1000000	783x	19.138 ms	0.17%	19.134 ms	0.17%	14029	5.031 GiB
F64	2^14 = 16384	10	2	1	1000000	560x	114.636 us	5.26%	110.662 us	3.82%	148	322.031 KiB
F64	2^28 = 268435456	10	2	1	1000000	606x	24.757 ms	0.45%	24.753 ms	0.45%	10844	5.031 GiB
F64	2^14 = 16384	1	2	1	100000000	780x	113.344 us	5.66%	109.376 us	4.32%	149	322.031 KiB
F64	2^28 = 268435456	1	2	1	100000000	682x	21.974 ms	0.15%	21.970 ms	0.15%	12218	5.031 GiB
F64	2^14 = 16384	10	2	1	100000000	568x	115.521 us	4.91%	111.565 us	3.34%	146	322.031 KiB
F64	2^28 = 268435456	10	2	1	100000000	502x	29.905 ms	0.19%	29.901 ms	0.19%	8977	5.031 GiB

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	min_periods	Samples	CPU Time	Noise	GPU Time	Noise	Mrows/s	peak_memory_usage
I32	2^14 = 16384	1	2	1	474x	21.860 us	26.56%	17.797 us	12.02%	920	130.016 KiB
I32	2^22 = 4194304	1	2	1	728x	100.117 us	5.62%	96.093 us	3.38%	43648	32.500 MiB
I32	2^28 = 268435456	1	2	1	824x	4.772 ms	0.21%	4.768 ms	0.19%	56303	2.031 GiB
I32	2^14 = 16384	10	2	1	424x	21.741 us	26.94%	17.721 us	14.31%	924	130.016 KiB
I32	2^22 = 4194304	10	2	1	660x	99.712 us	5.91%	95.725 us	4.13%	43816	32.500 MiB
I32	2^28 = 268435456	10	2	1	962x	4.763 ms	0.28%	4.759 ms	0.26%	56409	2.031 GiB
I32	2^14 = 16384	100	2	1	592x	30.530 us	19.38%	26.492 us	11.86%	618	130.016 KiB
I32	2^22 = 4194304	100	2	1	558x	238.409 us	2.91%	234.292 us	2.29%	17902	32.500 MiB
I32	2^28 = 268435456	100	2	1	814x	14.617 ms	0.47%	14.612 ms	0.47%	18370	2.031 GiB
I32	2^14 = 16384	1	2	20	474x	22.231 us	24.96%	17.994 us	7.92%	910	130.016 KiB
I32	2^22 = 4194304	1	2	20	776x	99.747 us	4.85%	95.737 us	2.39%	43810	32.500 MiB
I32	2^28 = 268435456	1	2	20	866x	4.771 ms	0.11%	4.767 ms	0.08%	56317	2.031 GiB
I32	2^14 = 16384	10	2	20	404x	21.898 us	25.01%	17.827 us	9.79%	919	130.016 KiB
I32	2^22 = 4194304	10	2	20	692x	99.264 us	4.72%	95.282 us	2.15%	44019	32.500 MiB
I32	2^28 = 268435456	10	2	20	706x	4.762 ms	0.29%	4.758 ms	0.28%	56419	2.031 GiB
I32	2^14 = 16384	100	2	20	584x	30.366 us	15.81%	26.306 us	3.31%	622	130.016 KiB
I32	2^22 = 4194304	100	2	20	540x	237.905 us	2.62%	233.802 us	1.91%	17939	32.500 MiB
I32	2^28 = 268435456	100	2	20	616x	14.618 ms	0.50%	14.614 ms	0.49%	18368	2.031 GiB
F64	2^14 = 16384	1	2	1	414x	23.969 us	28.36%	19.710 us	17.67%	831	130.016 KiB
F64	2^22 = 4194304	1	2	1	716x	125.282 us	4.01%	121.248 us	2.18%	34592	32.500 MiB
F64	2^28 = 268435456	1	2	1	1082x	6.315 ms	0.22%	6.311 ms	0.21%	42537	2.031 GiB
F64	2^14 = 16384	10	2	1	458x	29.235 us	19.23%	25.181 us	10.50%	650	130.016 KiB
F64	2^22 = 4194304	10	2	1	668x	229.256 us	2.39%	225.265 us	1.60%	18619	32.500 MiB
F64	2^28 = 268435456	10	2	1	1145x	13.088 ms	0.94%	13.084 ms	0.94%	20516	2.031 GiB
F64	2^14 = 16384	100	2	1	470x	36.907 us	14.33%	32.953 us	8.02%	497	130.016 KiB
F64	2^22 = 4194304	100	2	1	704x	1.699 ms	1.58%	1.695 ms	1.56%	2473	32.500 MiB
F64	2^28 = 268435456	100	2	1	143x	105.210 ms	0.37%	105.206 ms	0.37%	2551	2.031 GiB
F64	2^14 = 16384	1	2	20	416x	21.304 us	24.50%	17.376 us	9.38%	942	130.016 KiB
F64	2^22 = 4194304	1	2	20	584x	124.586 us	3.85%	120.627 us	1.91%	34770	32.500 MiB
F64	2^28 = 268435456	1	2	20	998x	6.318 ms	0.23%	6.314 ms	0.22%	42516	2.031 GiB
F64	2^14 = 16384	10	2	20	632x	28.582 us	20.04%	24.602 us	11.91%	665	130.016 KiB
F64	2^22 = 4194304	10	2	20	664x	222.036 us	2.17%	218.110 us	1.20%	19230	32.500 MiB
F64	2^28 = 268435456	10	2	20	1000x	12.880 ms	0.76%	12.876 ms	0.76%	20847	2.031 GiB
F64	2^14 = 16384	100	2	20	356x	33.704 us	17.99%	29.671 us	11.54%	552	130.016 KiB
F64	2^22 = 4194304	100	2	20	528x	1.668 ms	0.47%	1.664 ms	0.40%	2520	32.500 MiB
F64	2^28 = 268435456	100	2	20	143x	104.902 ms	0.04%	104.898 ms	0.04%	2559	2.031 GiB

row_variable_rolling_sum

[0] NVIDIA RTX A6000

T	num_rows	preceding_size	following_size	Samples	CPU Time	Noise	GPU Time	Noise	Mrows/s	peak_memory_usage
I32	2^14 = 16384	10	2	392x	21.847 us	24.65%	17.969 us	12.83%	911	130.016 KiB
I32	2^22 = 4194304	10	2	786x	149.751 us	3.19%	145.831 us	1.77%	28761	32.500 MiB
I32	2^28 = 268435456	10	2	1186x	7.867 ms	0.21%	7.863 ms	0.20%	34140	2.031 GiB
I32	2^14 = 16384	100	2	406x	29.974 us	16.38%	26.012 us	5.87%	629	130.016 KiB
I32	2^22 = 4194304	100	2	588x	246.351 us	3.57%	242.230 us	3.12%	17315	32.500 MiB
I32	2^28 = 268435456	100	2	612x	15.222 ms	0.75%	15.218 ms	0.75%	17639	2.031 GiB
F64	2^14 = 16384	10	2	410x	30.346 us	16.77%	26.087 us	3.69%	628	130.016 KiB
F64	2^22 = 4194304	10	2	632x	226.719 us	2.91%	222.736 us	2.26%	18830	32.500 MiB
F64	2^28 = 268435456	10	2	1058x	12.882 ms	0.82%	12.877 ms	0.82%	20845	2.031 GiB
F64	2^14 = 16384	100	2	414x	38.118 us	12.55%	34.123 us	4.52%	480	130.016 KiB
F64	2^22 = 4194304	100	2	658x	1.677 ms	0.33%	1.673 ms	0.24%	2507	32.500 MiB
F64	2^28 = 268435456	100	2	143x	104.991 ms	0.04%	104.987 ms	0.04%	2556	2.031 GiB

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

This way, the responsibility of a producing an iterator that is in bounds falls on the caller, and we don't have to do extra work in cases where the input is guaranteed to be in bounds already.

When passing column_view objects for preceding and following windows, the caller is responsible for ensuring that any resulting indexing is in-bounds.

The caller must provide in-bounds columns for the preceding and following windows.

Now that the requirement for producing in-bounds data has moved to the caller, we must adapt the tests to produce the correct inputs.

These invariants are guaranteed by the constructor.

Now neither the ungrouped nor the grouped fixed size rolling window calculations need to materialise the preceding and following columns.

wence- · 2024-12-18T17:39:15Z

Sits on top of #17613, so don't merge until that is in and this is rebased/has trunk back-merged.

wence-

Signposts

wence- · 2024-12-18T17:41:06Z

cpp/benchmarks/CMakeLists.txt

+# ##################################################################################################
+# * rolling benchmark
+# ---------------------------------------------------------------------------------
+ConfigureNVBench(ROLLING_NVBENCH rolling/grouped_rolling_sum.cpp rolling/rolling_sum.cpp)
+
 add_custom_target(
  run_benchmarks
  DEPENDS CUDF_BENCHMARKS


Ignore this, or review in #17613.

wence- · 2024-12-18T17:41:20Z

cpp/benchmarks/rolling/grouped_rolling_sum.cpp

Part of #17613

wence- · 2024-12-18T17:41:27Z

cpp/benchmarks/rolling/rolling_sum.cpp

Part of #17613

wence- · 2024-12-18T17:41:48Z

cpp/src/rolling/detail/rolling.cuh

@@ -1010,7 +1010,7 @@ class rolling_aggregation_postprocessor final : public cudf::detail::aggregation
 * @param[out] output_valid_count Output count of valid values
 * @param[in] device_operator The operator used to perform a single window operation
 * @param[in] preceding_window_begin Rolling window size iterator, accumulates from
- *            in_col[i-preceding_window] to in_col[i] inclusive
+ *            in_col[i-preceding_window + 1] to in_col[i] inclusive


Fix off by one.

wence- · 2024-12-18T17:42:21Z

cpp/src/rolling/detail/rolling.cuh


    // aggregate
    // TODO: We should explore using shared memory to avoid redundant loads.
    //       This might require separating the kernel into a special version
    //       for dynamic and static sizes.

-    volatile bool output_is_valid = false;


I think this is a leftover from a cuda 10-era bug.

wence- · 2024-12-18T17:45:36Z

cpp/src/rolling/grouped_rolling.cu

+    namespace utils = cudf::detail::rolling;
+    auto groups     = utils::grouped{group_labels.data(), group_offsets.data()};
+    auto preceding =
+      utils::make_clamped_window_iterator<utils::direction::PRECEDING>(preceding_window, groups);
+    auto following =
+      utils::make_clamped_window_iterator<utils::direction::FOLLOWING>(following_window, groups);
+    return cudf::detail::rolling_window(
+      input, default_outputs, preceding, following, min_periods, aggr, stream, mr);


Using the new utilities, so we can delete a bunch of code.

wence- · 2024-12-18T17:45:53Z

cpp/tests/rolling/rolling_test.cpp

@@ -660,7 +661,7 @@ TEST_F(RollingErrorTest, WindowArraySizeMismatch)
  cudf::test::fixed_width_column_wrapper<cudf::size_type> input(
    col_data.begin(), col_data.end(), col_valid.begin());

-  std::vector<cudf::size_type> five({2, 1, 2, 1, 4});
+  std::vector<cudf::size_type> five({1, 1, 2, 1, 0});


Needs adapted to conform to new caller requirement.

wence- · 2024-12-18T17:46:08Z

cpp/tests/rolling/rolling_test.cpp

-// this is a special test to check the volatile count variable issue (see rolling.cu for detail)
-TYPED_TEST(RollingTest, VolatileCount)


All those volatiles are long-gone (cuda 10.1 bug)

wence- · 2024-12-18T17:46:22Z

cpp/tests/rolling/rolling_test.cpp

+  auto it = thrust::make_counting_iterator<cudf::size_type>(0);
+  std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) {
+    auto p = window_rng.generate();
+    return std::min(i + 1, std::max(p, i + 1 - num_rows));
+  });
+  std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) {
+    auto f = window_rng.generate();
+    return std::max(-i - 1, std::min(f, num_rows - i - 1));
+  });


Ensuring the randomly generated values conform to interface requirements

wence- · 2024-12-18T17:46:29Z

cpp/tests/rolling/rolling_test.cpp

+  auto it = thrust::make_counting_iterator<cudf::size_type>(0);
+  std::transform(it, it + num_rows, preceding_window.begin(), [&window_rng, num_rows](auto i) {
+    auto p = window_rng.generate();
+    return std::min(i + 1, std::max(p, i + 1 - num_rows));
+  });
+  std::transform(it, it + num_rows, following_window.begin(), [&window_rng, num_rows](auto i) {
+    auto f = window_rng.generate();
+    return std::max(-i - 1, std::min(f, num_rows - i - 1));
+  });


wence- · 2024-12-18T17:47:36Z

Note that I haven't touched the jit rolling window kernel yet (some of the same changes will apply there eventually).

@mythrocks This changes the API contract for to rolling_window, so spark may need to adapt.

mythrocks · 2024-12-18T19:29:34Z

@mythrocks This changes the API contract for to rolling_window, so spark may need to adapt.

I'm trying to work out what of this has impact on spark-rapids. Spark calls into cudf::grouped_rolling*() with fixed preceding/following offsets, which might exceed group bounds at the margins.

wence- · 2024-12-19T10:23:17Z

@mythrocks This changes the API contract for to rolling_window, so spark may need to adapt.

I'm trying to work out what of this has impact on spark-rapids. Spark calls into cudf::grouped_rolling*() with fixed preceding/following offsets, which might exceed group bounds at the margins.

I think this is a call into

cudf/cpp/src/rolling/grouped_rolling.cu

Lines 47 to 55 in 030ae7a

    
           std::unique_ptr<column> grouped_rolling_window(table_view const& group_keys, 
        
                                                          column_view const& input, 
        
                                                          column_view const& default_outputs, 
        
                                                          window_bounds preceding_window_bounds, 
        
                                                          window_bounds following_window_bounds, 
        
                                                          size_type min_periods, 
        
                                                          rolling_aggregation const& aggr, 
        
                                                          rmm::cuda_stream_view stream, 
        
                                                          rmm::device_async_resource_ref mr)

If so, those fixed offsets are still clamped to the group bounds, via a transform iterator:

cudf/cpp/src/rolling/grouped_rolling.cu

Lines 127 to 133 in 030ae7a

    
           auto groups     = utils::grouped{group_labels.data(), group_offsets.data()}; 
        
           auto preceding = 
        
             utils::make_clamped_window_iterator<utils::direction::PRECEDING>(preceding_window, groups); 
        
           auto following = 
        
             utils::make_clamped_window_iterator<utils::direction::FOLLOWING>(following_window, groups); 
        
           return cudf::detail::rolling_window( 
        
             input, default_outputs, preceding, following, min_periods, aggr, stream, mr);

mythrocks · 2024-12-19T22:41:34Z

If so, those fixed offsets are still clamped to the group bounds, via a transform iterator:

That's what I thought. It seems to me like this really ought to work for spark-rapids without code change, if the transform iterator's doing the clamping.

I'll try have a go at integrating this with spark-rapids, and running some tests.

wence- added 14 commits December 18, 2024 11:26

New rolling window benchmark

6606024

New grouped rolling window benchmark

da4accc

Fix typo

cb2530f

Require that window offsets to rolling GPU kernel are in bounds

7bec1e4

This way, the responsibility of a producing an iterator that is in bounds falls on the caller, and we don't have to do extra work in cases where the input is guaranteed to be in bounds already.

Note in-bounds requirements when calling cudf::rolling_window

9c80201

When passing column_view objects for preceding and following windows, the caller is responsible for ensuring that any resulting indexing is in-bounds.

Remove unnecessary transform iterators

f9ec460

The caller must provide in-bounds columns for the preceding and following windows.

Rolling collect list no longer needs to fix up window boundaries

730db6f

Adapt rolling tests that relied on kernel-side in-bounds clamping

a849c7e

Now that the requirement for producing in-bounds data has moved to the caller, we must adapt the tests to produce the correct inputs.

Remove stray volatile

55940da

Remove test for bug that was fixed in CUDA 10.1

ed8fa3b

No need for int64s

fcf7c34

Remove assertion of post-conditions on groupby helper

3d3e80b

These invariants are guaranteed by the constructor.

Abstract and share code for fixed window clamped iterators

57d318c

Now neither the ungrouped nor the grouped fixed size rolling window calculations need to materialise the preceding and following columns.

Lead/lag nested postprocessing no longer needs to clamp

e60bcfd

wence- requested a review from a team as a code owner December 18, 2024 17:37

wence- requested review from shrshi and mhaseeb123 December 18, 2024 17:37

github-actions bot assigned wence- Dec 18, 2024

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Dec 18, 2024

wence- added improvement Improvement / enhancement to an existing function breaking Breaking change 5 - DO NOT MERGE Hold off on merging; see PR for details labels Dec 18, 2024

wence- commented Dec 18, 2024

View reviewed changes

mythrocks self-requested a review December 18, 2024 18:19

wence- added 2 commits December 19, 2024 10:11

Adapt stream test

8c9dde5

Also adapt java variable-size rolling tests

030ae7a

wence- requested a review from a team as a code owner December 19, 2024 10:15

github-actions bot added the Java Affects Java cuDF API. label Dec 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements and simplifications for fixed size row-based rolling windows #17623

Performance improvements and simplifications for fixed size row-based rolling windows #17623

wence- commented Dec 18, 2024

wence- commented Dec 18, 2024

wence- left a comment

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- Dec 18, 2024

wence- commented Dec 18, 2024

mythrocks commented Dec 18, 2024

wence- commented Dec 19, 2024

mythrocks commented Dec 19, 2024

		// this is a special test to check the volatile count variable issue (see rolling.cu for detail)
		TYPED_TEST(RollingTest, VolatileCount)

Performance improvements and simplifications for fixed size row-based rolling windows #17623

Are you sure you want to change the base?

Performance improvements and simplifications for fixed size row-based rolling windows #17623

Conversation

wence- commented Dec 18, 2024

Description

Breaking change

Performance improvements

Benchmark results

nvbench compare output

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

row_variable_rolling_sum

[0] NVIDIA RTX A6000

Summary

Benchmark Results

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

row_variable_rolling_sum

[0] NVIDIA RTX A6000

Benchmark Results

row_grouped_rolling_sum

[0] NVIDIA RTX A6000

row_fixed_rolling_sum

[0] NVIDIA RTX A6000

row_variable_rolling_sum

[0] NVIDIA RTX A6000

Checklist

wence- commented Dec 18, 2024

wence- left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wence- commented Dec 18, 2024

mythrocks commented Dec 18, 2024

wence- commented Dec 19, 2024

mythrocks commented Dec 19, 2024