Describe the bug
When running NDS queries, e.g. query67 (or just a full power run), in local mode with the off-heap limit set sufficiently low, queries fail with the following error (see expanded details):
Details:
[1499267][14:38:59:382345][error ] [A][Stream 0x1][Upstream 20971520B][FAILURE maximum pool size exceeded: Not enough room to grow, current/max/try size = 3.765625 GiB, 3.765625 GiB, 20.000000 MiB]
[1499366][14:38:59:382374][error ] [A][Stream 0x1][Upstream 20971520B][FAILURE maximum pool size exceeded: Not enough room to grow, current/max/try size = 3.765625 GiB, 3.765625 GiB, 20.000000 MiB]
25/07/22 14:38:59 INFO HostAlloc: Spilled 31622 bytes from the host store
25/07/22 14:38:59 INFO HostAlloc: Spilled 34758 bytes from the host store
25/07/22 14:38:59 WARN HostAlloc: Host store exhausted, unable to allocate 20971520 bytes. Total host allocated is 3934693450 bytes. Attempt 1. Attempting a retry.
25/07/22 14:38:59 INFO HostAlloc: Spilled 35530 bytes from the host store
25/07/22 14:38:59 INFO HostAlloc: Spilled 34902 bytes from the host store
25/07/22 14:38:59 INFO HostAlloc: Spilled 35582 bytes from the host store
25/07/22 14:38:59 INFO HostAlloc: Spilled 36694 bytes from the host store
25/07/22 14:38:59 INFO Executor: Executor interrupted and killed task 15.0 in stage 449.0 (TID 13331), reason: Stage cancelled: Job aborted due to stage failure: Task 2 in stage 449.0 failed 1 times, most recent failure: Lost task 2.0 in stage 449.0 (TID 13318) (10.110.46.230 executor driver): com.nvidia.spark.rapids.jni.CpuSplitAndRetryOOM: CPU OutOfMemory: could not split inputs and retry
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$NoInputSpliterator.split(RmmRapidsRetryIterator.scala:404)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:656)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:578)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:295)
    at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:189)
    at com.nvidia.spark.rapids.KudoSerializedBatchIterator.allocateHostWithRetry(GpuColumnarBatchSerializer.scala:701)
    at com.nvidia.spark.rapids.KudoSerializedBatchIterator.$anonfun$readNextBatch$5(GpuColumnarBatchSerializer.scala:726)
    at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
    at com.nvidia.spark.rapids.KudoSerializedBatchIterator.$anonfun$readNextBatch$4(GpuColumnarBatchSerializer.scala:706)
    at com.nvidia.spark.rapids.GpuMetric.ns(GpuMetrics.scala:321)
    at com.nvidia.spark.rapids.KudoSerializedBatchIterator.readNextBatch(GpuColumnarBatchSerializer.scala:706)
    at com.nvidia.spark.rapids.KudoSerializedBatchIterator.next(GpuColumnarBatchSerializer.scala:772)
    at com.nvidia.spark.rapids.KudoSerializedBatchIterator.next(GpuColumnarBatchSerializer.scala:649)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at com.nvidia.spark.rapids.HostCoalesceIteratorBase.bufferNextBatch(GpuShuffleCoalesceExec.scala:331)
    at com.nvidia.spark.rapids.HostCoalesceIteratorBase.hasNext(GpuShuffleCoalesceExec.scala:354)
    at com.nvidia.spark.rapids.GpuShuffleCoalesceIterator.hasNext(GpuShuffleCoalesceExec.scala:421)
    at com.nvidia.spark.rapids.DynamicGpuPartialAggregateIterator.$anonfun$hasNext$4(GpuAggregateExec.scala:2085)
    at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
    at scala.Option.getOrElse(Option.scala:189)
    at com.nvidia.spark.rapids.DynamicGpuPartialAggregateIterator.hasNext(GpuAggregateExec.scala:2085)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.hasNext(GpuSortExec.scala:323)
    at com.nvidia.spark.rapids.shims.GpuWindowGroupLimitingIterator.hasNext(GpuWindowGroupLimitExec.scala:104)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:390)
    at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:413)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
We would expect that, with a 1 GiB batch size and a 4 GiB off-heap limit, some tasks would be able to spill and block while others proceed, but instead the whole query fails. We want to understand better why this is happening, whether there is a bug, and what is consuming the memory.
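For reference, a back-of-the-envelope check of the numbers in the log above (a minimal illustrative sketch in Scala; the 1 GiB figure is the assumed target batch size, the other values are taken directly from the failure messages):

```scala
// Back-of-the-envelope numbers from the failure above (illustrative only).
object OffHeapNumbers {
  val MiB: Double = 1024 * 1024
  val GiB: Double = 1024 * MiB

  val offHeapLimit  = 4.0 * GiB        // spark.rapids.memory.host.offHeapLimit.size=4g
  val poolCurrent   = 3.765625 * GiB   // "current" size reported by the pool failure
  val poolMax       = 3.765625 * GiB   // "max" size reported by the pool failure
  val failedRequest = 20.0 * MiB       // the grow attempt that failed (20971520 bytes)
  val hostAllocated = 3934693450.0     // "Total host allocated" from the HostAlloc warning
  val targetBatch   = 1.0 * GiB        // assumed target batch size (not from this report)

  def main(args: Array[String]): Unit = {
    // The pool is already at its maximum, so even a 20 MiB grow cannot succeed.
    println(f"pool headroom:        ${(poolMax - poolCurrent) / MiB}%.1f MiB")
    // Yet the host store only reports ~3.66 GiB allocated against the 4 GiB limit.
    println(f"host store allocated: ${hostAllocated / GiB}%.2f GiB of ${offHeapLimit / GiB}%.0f GiB")
    // With 1 GiB batches one would expect roughly this many to fit before tasks must block.
    println(f"batches under limit:  ${offHeapLimit / targetBatch}%.0f")
  }
}
```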
Steps/Code to reproduce bug
This can be reproduced on a local NDS run with spark.rapids.memory.host.offHeapLimit.size=4g configured.
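For concreteness, a minimal sketch of how the session for such a run might be configured; only spark.rapids.memory.host.offHeapLimit.size=4g comes from this report, while the plugin setup and the batch size value are assumptions about a typical NDS local run:

```scala
import org.apache.spark.sql.SparkSession

// Minimal local-mode session sketch for the repro. Only the offHeapLimit.size
// value is taken from this report; the remaining settings are assumed.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nds-offheap-limit-repro")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.memory.host.offHeapLimit.size", "4g")
  .config("spark.rapids.sql.batchSizeBytes", "1073741824") // assumed 1 GiB target batch size
  .getOrCreate()
```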
Expected behavior
The query can succeed by spilling and retrying as needed.
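To make that expectation concrete, here is a hedged illustrative model (not the plugin's actual code; tryAlloc and spillSomething are hypothetical stand-ins for the host allocator and the spill framework) of the spill-and-retry behavior a host allocation is expected to follow before giving up:

```scala
import scala.annotation.tailrec

// Illustrative model of the expected spill-and-retry behavior described above.
// This is NOT the spark-rapids implementation.
object SpillRetryModel {
  @tailrec
  def allocateWithSpillAndRetry(
      requestBytes: Long,
      tryAlloc: Long => Option[Array[Byte]],
      spillSomething: () => Long): Array[Byte] = {
    tryAlloc(requestBytes) match {
      case Some(buffer) =>
        buffer                           // allocation fit under the off-heap limit
      case None =>
        val spilled = spillSomething()   // free host memory by spilling buffers
        if (spilled == 0L) {
          // Nothing left to spill and (as in the trace above) the input cannot be
          // split further, so the task fails with a CPU OOM instead of blocking.
          throw new OutOfMemoryError(s"could not allocate $requestBytes bytes after spilling")
        }
        allocateWithSpillAndRetry(requestBytes, tryAlloc, spillSomething)
    }
  }
}
```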