Skip to content

Make RETRY_COVERAGE_TRACKING also able to detect nested withRetry#14138

Open
thirtiseven wants to merge 3 commits intoNVIDIA:mainfrom
thirtiseven:nested_withretry
Open

Make RETRY_COVERAGE_TRACKING also able to detect nested withRetry#14138
thirtiseven wants to merge 3 commits intoNVIDIA:mainfrom
thirtiseven:nested_withretry

Conversation

@thirtiseven
Copy link
Collaborator

Contributes to #14136

Description

This PR adds a new type, NESTED, to SPARK_RAPIDS_RETRY_COVERAGE_TRACKING, enabling the detection of nested withRetry calls, which can indicate potential performance issues.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

@thirtiseven thirtiseven self-assigned this Jan 13, 2026
@thirtiseven
Copy link
Collaborator Author

@greptileai full review

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Overview

Greptile Summary

This PR extends the SPARK_RAPIDS_RETRY_COVERAGE_TRACKING debug feature to detect nested withRetry calls. When allocations occur within nested retry blocks (depth > 1), they are now logged as "NESTED" instead of being ignored as properly covered allocations.

Key Changes:

  • Added getRetryBlockDepth() method to RetryStateTracker to expose the current retry block nesting depth
  • Modified AllocationRetryCoverageTracker.checkAllocationInternal() to detect and log nested retry scenarios
  • Nested retry blocks can cause cascading retry attempts and reduced performance, making them important to track

Implementation:

  • Uses existing localRetryBlockDepth ThreadLocal tracking infrastructure
  • Logs allocations with depth = 0 as uncovered (HOST/DEVICE)
  • Logs allocations with depth > 1 as NESTED
  • Returns early for depth = 1 (properly covered, no logging needed)

The implementation correctly leverages the existing retry block depth tracking mechanism that was already in place for the coverage tracker.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is straightforward and leverages existing infrastructure. The new getRetryBlockDepth() method simply exposes already-tracked state, and the modified logging logic in AllocationRetryCoverageTracker correctly handles the three cases (uncovered, nested, and properly covered). This is a debug-only feature gated behind an environment variable, so it won't affect production behavior when disabled.
  • No files require special attention

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/AllocationRetryCoverageTracker.scala Added nested retry block detection: logs allocations with depth > 1 as NESTED
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RmmRapidsRetryIterator.scala Added getRetryBlockDepth method to expose retry block nesting depth

Sequence Diagram

sequenceDiagram
    participant User as User Code
    participant Outer as Outer withRetry
    participant Tracker as RetryStateTracker
    participant Alloc as AllocationRetryCoverageTracker
    participant Inner as Inner withRetry
    
    User->>Outer: withRetry(data) { fn }
    Outer->>Tracker: enterRetryBlock()
    Note over Tracker: depth = 1
    Outer->>User: execute fn(data)
    User->>Inner: withRetry(data2) { fn2 }
    Inner->>Tracker: enterRetryBlock()
    Note over Tracker: depth = 2 (NESTED!)
    Inner->>User: execute fn2(data2)
    User->>Alloc: checkAllocation(kind)
    Alloc->>Tracker: getRetryBlockDepth()
    Tracker-->>Alloc: depth = 2
    Note over Alloc: depth > 1, log as NESTED
    Alloc->>Alloc: writeToFile("NESTED,...")
    User-->>Inner: return result
    Inner->>Tracker: exitRetryBlock()
    Note over Tracker: depth = 1
    Inner-->>User: return result
    User-->>Outer: return result
    Outer->>Tracker: exitRetryBlock()
    Note over Tracker: depth = 0
    Outer-->>User: return result
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

…tryCoverageTracker.scala

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@nvauto
Copy link
Collaborator

nvauto commented Jan 26, 2026

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

@thirtiseven
Copy link
Collaborator Author

@greptileai full review

Signed-off-by: Haoyang Li <[email protected]>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@thirtiseven thirtiseven marked this pull request as ready for review January 29, 2026 05:04
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@sameerz sameerz added the task Work required that improves the product but is not user facing label Feb 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

task Work required that improves the product but is not user facing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants