[BUG] nds run illegal access seen last night #13154

@abellina

This is a build that used cuDF sha rapidsai/cudf@2ee89db

So it includes the recent fix that @ttnghia helped with, rapidsai/cudf#19414.

I wonder if this is another manifestation of the same problem in other cuDF sorts.

The stacks tend to point to code around group by:

25/07/23 11:02:46 INFO RmmRapidsRetryIterator: got a throwable in RmmRapidsRetryIterator.next():
ai.rapids.cudf.CudaFatalException: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
	at ai.rapids.cudf.Table.orderBy(Native Method)
	at ai.rapids.cudf.Table.orderBy(Table.java:2376)
	at com.nvidia.spark.rapids.GpuSorter.$anonfun$sort$2(SortUtils.scala:210)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuSorter.$anonfun$sort$1(SortUtils.scala:209)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuSorter.sort(SortUtils.scala:208)
	at com.nvidia.spark.rapids.GpuSorter.$anonfun$appendProjectedAndSort$1(SortUtils.scala:357)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuSorter.appendProjectedAndSort(SortUtils.scala:356)
	at com.nvidia.spark.rapids.GpuSpillableProjectedSortEachBatchIterator$.$anonfun$apply$5(GpuSortExec.scala:223)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuSpillableProjectedSortEachBatchIterator$.$anonfun$apply$4(GpuSortExec.scala:222)
	at com.nvidia.spark.rapids.GpuMetric.ns(GpuMetrics.scala:316)
	at com.nvidia.spark.rapids.GpuSpillableProjectedSortEachBatchIterator$.$anonfun$apply$3(GpuSortExec.scala:221)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:537)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:690)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:577)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:618)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:298)

Labels: ? - Needs Triage, bug
