Add support for HyperLogLog Plus Plus (HLL++) [databricks] #11638
res-life merged 29 commits into NVIDIA:branch-25.04
Conversation
Force-pushed d42d80a to 1945192
Resolved review threads on sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/GpuHLL.scala (outdated)
Force-pushed 0a4939f to eb00c2b
Signed-off-by: Chong Gao <res_life@163.com>
Ready to review except test cases.
Resolved review thread on sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala (outdated)
Explanation for HLL++: 6 bits are enough to store a register value. TODO: @revans2 could you have a look first?
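As a rough illustration of why 6 bits suffice (this is a sketch, not the plugin's actual code): an HLL register stores the maximum "leading zeros + 1" rank of a 64-bit hash, which never exceeds 64 and therefore fits in 6 bits (values 0–63), so ten registers can be packed into one 64-bit word:

```python
# Illustrative sketch only (not the spark-rapids implementation):
# pack 6-bit HLL registers into 64-bit words, 10 registers per word.
REGISTER_BITS = 6
REGISTERS_PER_WORD = 64 // REGISTER_BITS  # 10
REGISTER_MASK = (1 << REGISTER_BITS) - 1  # 0x3F

def set_register(words, idx, value):
    # A register holds a leading-zero rank of a 64-bit hash, so value < 64.
    assert 0 <= value <= REGISTER_MASK
    word, slot = divmod(idx, REGISTERS_PER_WORD)
    shift = slot * REGISTER_BITS
    words[word] = (words[word] & ~(REGISTER_MASK << shift)) | (value << shift)

def get_register(words, idx):
    word, slot = divmod(idx, REGISTERS_PER_WORD)
    return (words[word] >> (slot * REGISTER_BITS)) & REGISTER_MASK
```

For `m = 2^p` registers this layout needs only `ceil(m / 10)` longs per sketch.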
build
The premerge without Databricks testing passed. Next, I'll trigger another build for review.
@revans2 Could you please take another look at this PR?
build
Resolved review threads on integration_tests/src/main/python/hyper_log_log_plus_plus_test.py (outdated)
```python
0.02,   # precision 12
0.015,  # precision 13
0.01,   # precision 14
# 0.008, # precision 15. Refer to bug: https://github.com/NVIDIA/spark-rapids/issues/12347
```
If this can be detected during planning, we can choose to fall back to CPU instead?
Yes:

```scala
// Spark already checked: precision >= 4, no need to check again.
val precision = GpuHyperLogLogPlusPlus.computePrecision(a.relativeSD)
// Spark supports precision range: [4, Infinity)
// Spark-Rapids only supports precision range: [4, 14]
if (precision > 14) {
  // Info: cuCollections supports precision range [4, 18].
  // Due to https://github.com/NVIDIA/spark-rapids/issues/12347, Spark-Rapids supports
  // fewer precisions than cuCollections: range [4, 14].
  willNotWorkOnGpu(s"The precision $precision from relativeSD ${a.relativeSD} is bigger" +
    s" than 14, GPU only supports precision less than or equal to 14.")
}
```
Then we can just remove these comments and add fallback tests for them.

The fallback test case is done; but why remove these comments?

The comment is likely to cause confusion that this is still an open issue, when we already handle these cases by falling back to the CPU. I am fine if others agree to keep it.
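For reference, the relativeSD-to-precision mapping assumed above (matching Spark's HyperLogLogPlusPlus, which I believe computes `ceil(2 * log2(1.106 / rsd))`; `computePrecision` is assumed to mirror it) reproduces the precisions annotated on the test values:

```python
import math

def compute_precision(rsd):
    # Assumed mapping from relativeSD to precision, as in Spark's
    # HyperLogLogPlusPlus: precision = ceil(2 * log2(1.106 / rsd)).
    return math.ceil(2.0 * math.log2(1.106 / rsd))

# The rsd values from the test file, with their commented precisions:
for rsd, expected in [(0.02, 12), (0.015, 13), (0.01, 14), (0.008, 15)]:
    assert compute_precision(rsd) == expected

# rsd = 0.008 maps to precision 15 > 14, so that case falls back to the CPU.
```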
```scala
// HyperLogLogPlusPlus depends on XxHash64.
// HyperLogLogPlusPlus supports all the types that XxHash64 supports.
Seq(ParamCheck("input", XxHash64Shims.supportedTypes, TypeSig.all))),
(a, conf, p, r) => new UnaryExprMeta[HyperLogLogPlusPlus](a, conf, p, r) {
```
Better to create a new named meta class for this. See #10838

Will file a follow-up PR after this is merged.
Resolved review threads on sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/GpuHyperLogLogPlusPlus.scala (outdated)
build
build
firestarman left a comment:

LGTM. Better to have more reviews, since I am not well versed in this HyperLogLog Plus Plus operator.
BTW, the doc for 400 is still missing. If this PR is merged as it is, other PRs are also likely to hit this failure in premerge.

Thanks for the reminder.
build
Updated the doc for 400 successfully after merging branch-25.04.

build
closes #5199
depends on

Description

Spark `approx_count_distinct` (see the Spark description link): it accepts one column (which can be a nested column) and a double literal `relativeSD`.

Perf test
memory settings:
Correctness

The results are identical between CPU and GPU.
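For intuition about what is being computed, here is a minimal, self-contained sketch of the core HyperLogLog estimate (plain HLL with the standard raw formula only, without HLL++'s bias and range corrections, and with a SHA-1-based stand-in hash instead of Spark's XxHash64):

```python
import hashlib
import math

P = 14                              # precision: 2^14 = 16384 registers (the PR's maximum)
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)    # standard HLL bias constant for large M

def hash64(x):
    # Deterministic 64-bit hash; a stand-in for Spark's XxHash64.
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def estimate_distinct(items):
    regs = [0] * M
    for x in items:
        h = hash64(x)
        idx = h >> (64 - P)                    # top P bits choose the register
        w = h & ((1 << (64 - P)) - 1)          # remaining 50 bits
        rank = (64 - P) - w.bit_length() + 1   # leading zeros + 1; <= 51, fits in 6 bits
        if regs[idx] < rank:
            regs[idx] = rank
    # Raw HLL estimate: alpha * m^2 / sum(2^-register).
    return ALPHA * M * M / sum(2.0 ** -r for r in regs)

est = estimate_distinct(range(100_000))
# Theoretical rsd at precision 14 is about 1.04 / sqrt(2^14), i.e. ~0.8%.
print(f"estimate={est:.0f}, relative error={abs(est - 100_000) / 100_000:.4f}")
```

The real HLL++ additionally applies bias correction and switches to linear counting for small cardinalities; this sketch only shows where the per-register ranks and the 6-bit bound come from.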
Limitations

The maximum supported precision is 14 (the default is 9). The formula for precision is:

precision = ceil(2 * log2(1.106 / rsd))

`rsd` is an abbreviation of relative standard deviation. It also means the minimum supported rsd is 0.0061.

Followup

Signed-off-by: Chong Gao res_life@163.com