
Add support for Hyper Log Log PLus Plus (HLL++) [databricks] #11638

Merged
res-life merged 29 commits into NVIDIA:branch-25.04 from res-life:hll
Mar 26, 2025

Conversation

@res-life
Collaborator

@res-life res-life commented Oct 21, 2024

closes #5199

depends on

Description

Spark approx_count_distinct description link
Spark accepts one column (which can be a nested column) and a double literal relativeSD.

Perf test

memory settings:

--conf spark.executor.memory=20G 
--conf spark.driver.memory=10G
// group by (10 groups)
import org.apache.spark.sql.functions
spark.range(10000000).repartition(16).withColumn("m", functions.expr("id % 10")).createOrReplaceTempView("tab")
spark.time(spark.sql("select m, APPROX_COUNT_DISTINCT(id) from tab group by m").show())

// group by (1,000,000 groups)
spark.range(10000000).repartition(16).withColumn("m", functions.expr("id % 1000000")).createOrReplaceTempView("tab")
spark.time(spark.sql("select m, APPROX_COUNT_DISTINCT(id) from tab group by m").show())

// reduction
spark.range(10000000).repartition(16).createOrReplaceTempView("tab")
spark.time(spark.sql("select APPROX_COUNT_DISTINCT(id) from tab ").show())
| precision | num_groups | CPU time (hot runs) ms | GPU time (hot runs) ms | speedup |
|-----------|------------|------------------------|------------------------|---------|
| 9 (default) | 10 | 1176+1080+1093 | 349+295+290 | 3.59 |
| 9 (default) | 1,000,000 | 2734+2809+2637 | 1527+1477+1414 | 1.85 |
| 9 (default) | reduction | 874+884+922 | 198+195+181 | 4.67 |
| 10 | 10 | 1261+969+830 | 276+267+258 | 3.82 |
| 10 | 1,000,000 | 5766+5624+5562 | 2406+2519+2425 | 2.31 |
| 10 | reduction | 729+747+726 | 319+334+325 | 2.25 |
| 11 | 10 | 881+871+877 | 339+343+370 | 2.50 |
| 11 | 1,000,000 | 9627+9734+9777 | 4459+4651+4546 | 2.13 |
| 11 | reduction | 764+758+789 | 510+478+481 | 1.57 |
| 12 | 10 | 987+982+917 | 476+478+517 | 1.96 |
| 12 | 1,000,000 | 18071 | 9016 | 2.00 |
| 12 | reduction | 871+844+886 | 850+840+848 | 1.02 |
| 13 | 10 | 1060+1090+1089 | 848+801+855 | 1.29 |
| 13 | 1,000,000 | 35076 | 17777 | 1.97 |
| 13 | reduction | 1094+1131+1064 | 1622+1635+1605 | 0.68 |
| 14 | 10 | 1598+1624+1567 | 1550+1492+1537 | 1.05 |
| 14 | 1,000,000 | 65569 | 66556 | 0.99 |
| 14 | reduction | 1433+1410+1400 | 3086+3096+3247 | 0.45 |
| 15 (not supported on GPU yet) | 10 | 2510+2483+2494 | 3154+3177+3346 | 0.77 |
| 15 (not supported on GPU yet) | 1,000,000 | 129837 | GPU OutOfMemory | NULL |
| 15 (not supported on GPU yet) | reduction | 2088+2118+2058 | 6469+6440+6367 | 0.32 |

Correctness

The results are identical between CPU and GPU.

Limitations

The maximum supported precision is 14 (the default is 9). The formula for precision is:

Math.ceil(2.0d * Math.log(1.106d / rsd) / Math.log(2.0d)).toInt

rsd is an abbreviation for relative standard deviation.
It also means the minimum supported rsd is about 0.0087 (= 1.106 / 2^(14/2)); smaller rsd values yield precision 15 or higher.
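As a sanity check on this formula, here is a minimal self-contained Scala sketch (`computePrecision` here is a local helper mirroring the formula above, not Spark's API) mapping a few rsd values to precisions:

```scala
// Local re-implementation of the precision formula, for illustration only.
def computePrecision(rsd: Double): Int =
  math.ceil(2.0d * math.log(1.106d / rsd) / math.log(2.0d)).toInt

println(computePrecision(0.05))  // default rsd -> 9 (the default precision)
println(computePrecision(0.02))  // -> 12
println(computePrecision(0.01))  // -> 14, the largest precision the GPU supports
println(computePrecision(0.008)) // -> 15, falls back to the CPU
```

Note the 0.05 default rsd maps to the default precision of 9.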

Followup

Signed-off-by: Chong Gao res_life@163.com

@res-life res-life requested a review from ttnghia October 21, 2024 12:46
@res-life res-life force-pushed the hll branch 2 times, most recently from d42d80a to 1945192 on October 23, 2024 01:34
@res-life res-life changed the title [Do not review] Add Hyper Log Log PLus Plus(HLL++) [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Oct 24, 2024
@res-life res-life force-pushed the hll branch 4 times, most recently from 0a4939f to eb00c2b on October 30, 2024 12:37
@res-life res-life changed the title [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Add support for Hyper Log Log PLus Plus(HLL++) Oct 31, 2024
Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53
@res-life
Collaborator Author

Ready to review except test cases.

Collaborator

@revans2 revans2 left a comment


Looks good

@res-life res-life changed the title Add support for Hyper Log Log PLus Plus(HLL++) [WIP] Add support for Hyper Log Log PLus Plus(HLL++) Dec 13, 2024
@res-life
Collaborator Author

An explanation of HLLPP:
In general, an HLLPP sketch is a block of memory used to estimate the number of distinct values; it contains a number of integer registers. The number of registers is determined by the precision parameter:
num_of_registers_in_a_sketch = pow(2, precision)
e.g.: precision = 9, then num_of_registers_in_a_sketch = 2^9 = 512
Each register stores the number of leading zero bits in a hash code.
Because Spark uses xxhash64 to compute hash codes, a hash code is 64 bits,
so the maximum value of a register is 64. Refer to the comment in Spark's source:

  /**
   * The number of bits that is required per register.
   *
   * This number is determined by the maximum number of leading binary zeros a hashcode can
   * produce. This is equal to the number of bits the hashcode returns. The current
   * implementation uses a 64-bit hashcode, this means 6-bits are (at most) needed to store the
   * number of leading zeros.
   */
  val REGISTER_SIZE = 6

Six bits are enough to store a register value.
Spark uses long columns to store the HLLPP sketch.
e.g.: precision = 9, num_of_registers_in_a_sketch = 512
Because a register value takes at most 6 bits, one long can hold 10 register values.
So Spark uses 512/10 + 1 = 52 long columns to store the HLLPP sketch column.
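The register bookkeeping described above can be sketched with simple arithmetic (the value names here are illustrative, not taken from the PR):

```scala
val precision = 9
val numRegisters = 1 << precision  // 2^9 = 512 registers per sketch
val registersPerLong = 64 / 6      // 6-bit registers -> 10 per 64-bit long
val numLongs = numRegisters / registersPerLong + 1 // 512/10 + 1 = 52 long columns

// Each register tracks leading zeros of 64-bit xxhash64 codes; for example:
val hash = 0x00FFFFFFFFFFFFFFL
val leadingZeros = java.lang.Long.numberOfLeadingZeros(hash) // 8

println(s"$numRegisters registers packed into $numLongs longs")
```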
This PR handles the conversions between the two layouts:

cuDF uses a Struct&lt;long, ..., long&gt; column to do the aggregation
Convert the long columns to a Struct&lt;long, ..., long&gt; column
Convert the Struct&lt;long, ..., long&gt; column back to long columns
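For illustration only (this is not the PR's actual conversion code), packing and unpacking 6-bit register values within one 64-bit word, ten per word as described above, could look like:

```scala
val REGISTER_SIZE = 6
val MASK = (1L << REGISTER_SIZE) - 1 // 0x3F, a full 6-bit register

// Write a 6-bit register value into slot `slot` (0..9) of a 64-bit word.
def setRegister(word: Long, slot: Int, value: Int): Long = {
  val shift = slot * REGISTER_SIZE
  (word & ~(MASK << shift)) | ((value.toLong & MASK) << shift)
}

// Read the 6-bit register value back out of slot `slot`.
def getRegister(word: Long, slot: Int): Int =
  ((word >>> (slot * REGISTER_SIZE)) & MASK).toInt

val w = setRegister(0L, 3, 17)
println(getRegister(w, 3)) // 17
```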

TODO:
Add more test cases
Support nested types: after #11859

@revans2 could you have a look first?

@res-life
Collaborator Author

  • [DONE] Add more test cases
  • [DONE] Support nested types
  • [DONE] Check the stack depth GPU will use does not exceed threshold

@res-life
Collaborator Author

build

@res-life
Collaborator Author

res-life commented Mar 25, 2025

The premerge without Databricks testing passed.
@ttnghia please help review, and approve if it looks good to you.

For the following, I'll ask someone else to review:
The premerge for Databricks failed in a previous run. All the HLLPP cases fell back to CPU on Databricks; I do not know the reason yet. I'll update the test cases to skip Databricks, file a follow-up issue, and document this limitation in the markdown docs.

@res-life
Collaborator Author

@revans2 Could you please take another look at this PR?

Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life changed the title Add support for Hyper Log Log PLus Plus(HLL++) Add support for Hyper Log Log PLus Plus(HLL++) [databricks] Mar 26, 2025
@res-life
Collaborator Author

build

0.02, # precision 12
0.015, # precision 13
0.01, # precision 14
# 0.008, # precision 15 Refer to bug: https://github.com/NVIDIA/spark-rapids/issues/12347
Collaborator


If this can be detected during planning, we can choose to fall back to the CPU instead?

Collaborator Author


Yes:

          // Spark already checked: precision >= 4, no need to check again.
          val precision = GpuHyperLogLogPlusPlus.computePrecision(a.relativeSD)
          // Spark supports precision range: [4, Infinity)
          // Spark-Rapids only supports precision range: [4, 14]
          if (precision > 14) {
            // Info: cuCollections supports precision range [4, 18].
            // Due to https://github.com/NVIDIA/spark-rapids/issues/12347, Spark-Rapids
            // supports fewer precisions than cuCollections: range [4, 14].
            willNotWorkOnGpu(s"The precision $precision from relativeSD ${a.relativeSD} is" +
              s" bigger than 14; the GPU only supports precision less than or equal to 14.")
          }

Collaborator


Then we can just remove these comments and add fallback tests for them.

Collaborator Author


Done

Collaborator Author


The fallback test case is done, but why remove these comments?

Collaborator

@firestarman firestarman Mar 26, 2025


The comment is likely to cause confusion, suggesting this is still an issue to be fixed, while we already handle these cases by falling back to the CPU.

I am fine if others agree to keep it.

// HyperLogLogPlusPlus depends on XxHash64
// HyperLogLogPlusPlus supports all the types that XxHash64 supports
Seq(ParamCheck("input", XxHash64Shims.supportedTypes, TypeSig.all))),
(a, conf, p, r) => new UnaryExprMeta[HyperLogLogPlusPlus](a, conf, p, r) {
Collaborator


Better to create a new named meta class for this. See #10838

Collaborator Author


Will file a follow-up PR after this is merged.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build


Chong Gao added 2 commits March 26, 2025 16:59
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Collaborator

@firestarman firestarman left a comment


LGTM, but better to have more reviews since I am not very familiar with this Hyper Log Log Plus Plus operator.

@firestarman
Collaborator

BTW, the doc for 400 is still missing. If this PR is merged as it is, other PRs are also likely to have this failure in premerge.

@res-life
Collaborator Author

BTW, the doc for 400 is still missing. If this PR is merged as it is, other PRs are also likely to have this failure in premerge.

Thanks for the reminder.
I hit an error when building 400 locally, so I cannot generate the doc for Spark 400.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

Updated the doc for 400 successfully after merging branch-25.04.

@res-life
Collaborator Author

build

Collaborator

@ttnghia ttnghia left a comment


Do we know the reason of #12347?

@res-life
Collaborator Author

Do we know the reason of #12347?
We do not know the root cause currently.

@res-life res-life merged commit 7e19dbc into NVIDIA:branch-25.04 Mar 26, 2025
54 checks passed
@res-life res-life deleted the hll branch March 26, 2025 23:14
@sameerz sameerz added the feature request New feature or request label Mar 31, 2025

Labels

feature request New feature or request


Development

Successfully merging this pull request may close these issues.

[FEA]Support function approx_count_distinct

6 participants