
Add support for Hyper Log Log PLus Plus (HLL++) [databricks] #11638

Merged
res-life merged 29 commits into NVIDIA:branch-25.04 from res-life:hll
Mar 26, 2025

Conversation

@res-life
Collaborator

@res-life res-life commented Oct 21, 2024

closes #5199

depends on

Description

Spark approx_count_distinct description link
Spark accepts one column (which can be a nested column) and a double literal relativeSD.

Perf test

memory settings:

--conf spark.executor.memory=20G 
--conf spark.driver.memory=10G
// group by (10 groups)
import org.apache.spark.sql.functions
spark.range(10000000).repartition(16).withColumn("m", functions.expr("id % 10")).createOrReplaceTempView("tab")
spark.time(spark.sql("select m, APPROX_COUNT_DISTINCT(id) from tab group by m").show())

// group by (1,000,000 groups)
spark.range(10000000).repartition(16).withColumn("m", functions.expr("id % 1000000")).createOrReplaceTempView("tab")
spark.time(spark.sql("select m, APPROX_COUNT_DISTINCT(id) from tab group by m").show())

// reduction
spark.range(10000000).repartition(16).createOrReplaceTempView("tab")
spark.time(spark.sql("select APPROX_COUNT_DISTINCT(id) from tab ").show())
| precision | num_groups | CPU time (hot runs) ms | GPU time (hot runs) ms | speedup |
|-----------|------------|------------------------|------------------------|---------|
| 9 (default) | 10 | 1176+1080+1093 | 349+295+290 | 3.59 |
| 9 (default) | 1,000,000 | 2734+2809+2637 | 1527+1477+1414 | 1.85 |
| 9 (default) | reduction | 874+884+922 | 198+195+181 | 4.67 |
| 10 | 10 | 1261+969+830 | 276+267+258 | 3.82 |
| 10 | 1,000,000 | 5766+5624+5562 | 2406+2519+2425 | 2.31 |
| 10 | reduction | 729+747+726 | 319+334+325 | 2.25 |
| 11 | 10 | 881+871+877 | 339+343+370 | 2.50 |
| 11 | 1,000,000 | 9627+9734+9777 | 4459+4651+4546 | 2.13 |
| 11 | reduction | 764+758+789 | 510+478+481 | 1.57 |
| 12 | 10 | 987+982+917 | 476+478+517 | 1.96 |
| 12 | 1,000,000 | 18071 | 9016 | 2.00 |
| 12 | reduction | 871+844+886 | 850+840+848 | 1.02 |
| 13 | 10 | 1060+1090+1089 | 848+801+855 | 1.29 |
| 13 | 1,000,000 | 35076 | 17777 | 1.97 |
| 13 | reduction | 1094+1131+1064 | 1622+1635+1605 | 0.68 |
| 14 | 10 | 1598+1624+1567 | 1550+1492+1537 | 1.05 |
| 14 | 1,000,000 | 65569 | 66556 | 0.99 |
| 14 | reduction | 1433+1410+1400 | 3086+3096+3247 | 0.45 |
| 15 (not supported on GPU yet) | 10 | 2510+2483+2494 | 3154+3177+3346 | 0.77 |
| 15 (not supported on GPU yet) | 1,000,000 | 129837 | GPU OutOfMemory | NULL |
| 15 (not supported on GPU yet) | reduction | 2088+2118+2058 | 6469+6440+6367 | 0.32 |

Correctness

The results are identical between CPU and GPU.

Limitations

The maximum supported precision is 14 (the default is 9). The formula for precision is:

Math.ceil(2.0d * Math.log(1.106d / rsd) / Math.log(2.0d)).toInt

rsd is an abbreviation for relative standard deviation.
It also means the minimum supported rsd is about 0.0087 (= 1.106 / 2^(14/2)); smaller rsd values yield precision 15 or higher.
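As a sanity check on this formula, here is a minimal self-contained Scala sketch (`computePrecision` here is a local helper mirroring the formula above, not Spark's API) mapping a few rsd values to precisions:

```scala
// Local re-implementation of the precision formula, for illustration only.
def computePrecision(rsd: Double): Int =
  math.ceil(2.0d * math.log(1.106d / rsd) / math.log(2.0d)).toInt

println(computePrecision(0.05))  // default rsd -> 9 (the default precision)
println(computePrecision(0.02))  // -> 12
println(computePrecision(0.01))  // -> 14, the largest precision the GPU supports
println(computePrecision(0.008)) // -> 15, falls back to the CPU
```

Note the 0.05 default rsd maps to the default precision of 9.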

Followup

Signed-off-by: Chong Gao res_life@163.com

@res-life res-life requested a review from ttnghia October 21, 2024 12:46
@res-life res-life force-pushed the hll branch 2 times, most recently from d42d80a to 1945192 on October 23, 2024 01:34
@res-life res-life changed the title [Do not review] Add Hyper Log Log PLus Plus(HLL++) [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Oct 24, 2024
@res-life res-life force-pushed the hll branch 4 times, most recently from 0a4939f to eb00c2b on October 30, 2024 12:37
@res-life res-life changed the title [Do not review] Add support for Hyper Log Log PLus Plus(HLL++) Add support for Hyper Log Log PLus Plus(HLL++) Oct 31, 2024
Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life changed the base branch from branch-24.12 to branch-25.02 November 25, 2024 09:53
@res-life
Collaborator Author

Ready to review except test cases.

Collaborator

@revans2 revans2 left a comment


Looks good

@res-life res-life changed the title Add support for Hyper Log Log PLus Plus(HLL++) [WIP] Add support for Hyper Log Log PLus Plus(HLL++) Dec 13, 2024
@res-life
Collaborator Author

An explanation of HLLPP:
In general, an HLLPP sketch is a block of memory used to estimate the number of distinct values; it contains a number of integer registers. The number of registers is determined by the precision parameter:
num_of_registers_in_a_sketch = pow(2, precision)
e.g.: precision = 9, then num_of_registers_in_a_sketch = 2^9 = 512
Each register stores the number of leading zero bits in a hash code.
Because Spark uses xxhash64 to compute hash codes, a hash code is 64 bits,
so the maximum value of a register is 64. Refer to the comment in Spark's source:

  /**
   * The number of bits that is required per register.
   *
   * This number is determined by the maximum number of leading binary zeros a hashcode can
   * produce. This is equal to the number of bits the hashcode returns. The current
   * implementation uses a 64-bit hashcode, this means 6-bits are (at most) needed to store the
   * number of leading zeros.
   */
  val REGISTER_SIZE = 6

Six bits are enough to store a register value.
Spark uses long columns to store the HLLPP sketch.
e.g.: precision = 9, num_of_registers_in_a_sketch = 512
Because a register value takes at most 6 bits, one long can hold 10 register values.
So Spark uses 512/10 + 1 = 52 long columns to store the HLLPP sketch column.
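The register bookkeeping described above can be sketched with simple arithmetic (the value names here are illustrative, not taken from the PR):

```scala
val precision = 9
val numRegisters = 1 << precision  // 2^9 = 512 registers per sketch
val registersPerLong = 64 / 6      // 6-bit registers -> 10 per 64-bit long
val numLongs = numRegisters / registersPerLong + 1 // 512/10 + 1 = 52 long columns

// Each register tracks leading zeros of 64-bit xxhash64 codes; for example:
val hash = 0x00FFFFFFFFFFFFFFL
val leadingZeros = java.lang.Long.numberOfLeadingZeros(hash) // 8

println(s"$numRegisters registers packed into $numLongs longs")
```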
This PR handles the conversions between the two layouts:

cuDF uses a Struct&lt;long, ..., long&gt; column to do the aggregation
Convert the long columns to a Struct&lt;long, ..., long&gt; column
Convert the Struct&lt;long, ..., long&gt; column back to long columns
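For illustration only (this is not the PR's actual conversion code), packing and unpacking 6-bit register values within one 64-bit word, ten per word as described above, could look like:

```scala
val REGISTER_SIZE = 6
val MASK = (1L << REGISTER_SIZE) - 1 // 0x3F, a full 6-bit register

// Write a 6-bit register value into slot `slot` (0..9) of a 64-bit word.
def setRegister(word: Long, slot: Int, value: Int): Long = {
  val shift = slot * REGISTER_SIZE
  (word & ~(MASK << shift)) | ((value.toLong & MASK) << shift)
}

// Read the 6-bit register value back out of slot `slot`.
def getRegister(word: Long, slot: Int): Int =
  ((word >>> (slot * REGISTER_SIZE)) & MASK).toInt

val w = setRegister(0L, 3, 17)
println(getRegister(w, 3)) // 17
```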

TODO:
Add more test cases
Support nested types: after #11859

@revans2 could you have a look first?

@res-life
Collaborator Author

  • [DONE] Add more test cases
  • [DONE] Support nested types
  • [DONE] Check the stack depth GPU will use does not exceed threshold

@res-life
Collaborator Author

build

@res-life
Collaborator Author

res-life commented Mar 25, 2025

The premerge without Databricks testing passed.
@ttnghia please help review, and approve if it looks good to you.

For the following, I'll ask someone else to review:
The premerge for Databricks failed in a previous run. All the HLLPP cases fell back to CPU on Databricks; I do not know the reason yet. I'll update the test cases to skip Databricks, file a follow-up issue, and document this limitation in the markdown docs.

@res-life
Collaborator Author

@revans2 Could you please take another look at this PR?

Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life changed the title Add support for Hyper Log Log PLus Plus(HLL++) Add support for Hyper Log Log PLus Plus(HLL++) [databricks] Mar 26, 2025
@res-life
Collaborator Author

build

0.02, # precision 12
0.015, # precision 13
0.01, # precision 14
# 0.008, # precision 15 Refer to bug: https://github.com/NVIDIA/spark-rapids/issues/12347
Collaborator


If this can be detected during planning, we can choose to fall back to the CPU instead?

Collaborator Author


Yes:

          // Spark already checked: precision >= 4, no need to check again.
          val precision = GpuHyperLogLogPlusPlus.computePrecision(a.relativeSD)
          // Spark supports precision range: [4, Infinity)
          // Spark-Rapids only supports precision range: [4, 14]
          if (precision > 14) {
            // Info: cuCollections supports precision range [4, 18].
            // Due to https://github.com/NVIDIA/spark-rapids/issues/12347, Spark-Rapids
            // supports fewer precisions than cuCollections: range [4, 14].
            willNotWorkOnGpu(s"The precision $precision from relativeSD ${a.relativeSD} is" +
              s" bigger than 14; the GPU only supports precision less than or equal to 14.")
          }

Collaborator


Then we can just remove these comments and add fallback tests for them.

Collaborator Author


Done

Collaborator Author


The fallback test case is done, but why remove these comments?

Collaborator

@firestarman firestarman Mar 26, 2025


The comment is likely to cause confusion, suggesting this is still an issue to be fixed, while we already handle these cases by falling back to the CPU.

I am fine if others agree to keep it.

// HyperLogLogPlusPlus depends on XxHash64
// HyperLogLogPlusPlus supports all the types that XxHash64 supports
Seq(ParamCheck("input", XxHash64Shims.supportedTypes, TypeSig.all))),
(a, conf, p, r) => new UnaryExprMeta[HyperLogLogPlusPlus](a, conf, p, r) {
Collaborator


Better to create a new named meta class for this. See #10838

Collaborator Author


Will file a follow-up PR after this is merged.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build


Chong Gao added 2 commits March 26, 2025 16:59
Signed-off-by: Chong Gao <res_life@163.com>
Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

Collaborator

@firestarman firestarman left a comment


LGTM, but better to have more reviews since I am not very familiar with this Hyper Log Log Plus Plus operator.

@firestarman
Collaborator

BTW, the doc for 400 is still missing. If this PR is merged as it is, other PRs are also likely to have this failure in premerge.

@res-life
Collaborator Author

BTW, the doc for 400 is still missing. If this PR is merged as it is, other PRs are also likely to have this failure in premerge.

Thanks for the reminder.
I hit an error when building 400 locally, so I cannot generate the doc for Spark 400.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life
Collaborator Author

build

@res-life
Collaborator Author

Updated the doc for 400 successfully after merging branch-25.04.

@res-life
Collaborator Author

build

Collaborator

@ttnghia ttnghia left a comment


Do we know the reason of #12347?

@res-life
Collaborator Author

Do we know the reason of #12347?
We do not know the root cause currently.

@res-life res-life merged commit 7e19dbc into NVIDIA:branch-25.04 Mar 26, 2025
54 checks passed
@res-life res-life deleted the hll branch March 26, 2025 23:14
@sameerz sameerz added the feature request New feature or request label Mar 31, 2025

Labels

feature request New feature or request


Development

Successfully merging this pull request may close these issues.

[FEA]Support function approx_count_distinct

6 participants