[controller] Increase dead store stats prefetch timeout to 5 minutes #2232

MikeDafi · 2025-10-22T19:27:06Z

Problem Statement

We have been seeing timeouts because the prefetch task allows only 60 seconds for the controller to become leader. This logic runs in an async thread (DeadStoreStatsPreFetchTask), so it is exposed to communication and timing delays. The timeout is too short to accommodate:

Controller leadership election time
Network latency when fetching from livenice aggregated stats
Trino query execution time for aggregated dead store statistics

The current 60-second timeout causes the prefetch task to fail before data can be retrieved, resulting in incomplete dead store stats and potential false negatives in store lifecycle management.

Solution

Increased the timeout from 60 seconds to 300 seconds (5 minutes) in DeadStoreStatsPreFetchTask.java. This gives sufficient time for:

Controller to become leader
Livenice aggregated stats API to be queried via Trino
Network roundtrips and query processing

The change is conservative and aligns with typical Trino query execution times for aggregated data.
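
For illustration, the deadline-based wait described above can be sketched as follows. This is a minimal, self-contained sketch and not the actual DeadStoreStatsPreFetchTask code: the class name, the `waitForLeadership` method, and the `BooleanSupplier` leadership check are hypothetical stand-ins, and the timeout is passed in as a parameter only so the behavior is easy to exercise.

```java
import java.util.function.BooleanSupplier;

final class LeaderWaitSketch {
  /**
   * Polls the (hypothetical) leadership check until it returns true or the timeout elapses.
   * Returns true if leadership was observed, false on timeout or interruption.
   */
  static boolean waitForLeadership(BooleanSupplier isLeader, long timeoutMs, long pollIntervalMs) {
    long deadlineNs = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!isLeader.getAsBoolean()) {
      long remainingMs = (deadlineNs - System.nanoTime()) / 1_000_000L;
      if (remainingMs <= 0) {
        return false; // timed out before the controller became leader
      }
      try {
        // Never sleep past the deadline, so the total wait stays close to timeoutMs.
        Thread.sleep(Math.min(pollIntervalMs, remainingMs));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }
}
```

With this shape, moving from 60 seconds to 5 minutes only changes the `timeoutMs` argument (for example `waitForLeadership(check, 300_000, 10_000)`); the polling interval and the interrupt behavior stay the same.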

Code changes

Added new code behind a config: No. This is a constant timeout value change.
Introduced new log lines: No. Only comments were updated for clarity.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues. (No concurrency logic changed)
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed. (Not applicable - only timeout value changed)
No blocking calls inside critical sections that could lead to deadlocks or performance degradation. (No changes to blocking behavior)
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList). (Not applicable - no collection changes)
Validated proper exception handling in multi-threaded code to avoid silent thread termination. (Existing exception handling unchanged)

How was this PR tested?

Modified or extended existing tests. Existing tests in DeadStoreStatsPreFetchTaskTest still pass with the increased timeout.
Verified backward compatibility (if applicable). This is a timeout increase, fully backward compatible.
Manual testing: Verified that the prefetch task now has sufficient time to complete when fetching from the livenice aggregated stats endpoint.

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.

majisourav99 · 2025-10-22T22:41:19Z

...nice-controller/src/main/java/com/linkedin/venice/controller/DeadStoreStatsPreFetchTask.java

        if (System.currentTimeMillis() > deadline) {
          throw new VeniceException("Timed out waiting for controller to become leader for cluster: " + clusterName);
        }
        Utils.sleep(10_000); // sleep for 10 seconds


Check the return value of sleep to track whether the thread has been interrupted; otherwise it will block shutdown.

Actually, you pointed out a key finding: we need it below, otherwise we attempt a prefetch unnecessarily after returning from sleep.

We don't need it here because isRunning is checked below and would return false.

No, my point was that if we call shutdown, it will raise an InterruptedException, which will be swallowed by Utils.sleep(10_000), and the loop will keep iterating for 5 minutes.

makes sense, made the change as well
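
The concern raised in this thread can be sketched in isolation. The example below is an assumption-laden illustration, not Venice code: `sleepQuietly` is a hypothetical stand-in for a sleep helper that swallows InterruptedException and reports it through its return value, and `pollUntilDeadline` is a hypothetical polling loop. It shows that checking the return value lets the loop stop on the first interrupt instead of spinning until the deadline.

```java
final class InterruptAwareLoopSketch {
  /**
   * Hypothetical helper: sleeps, swallows InterruptedException, and returns false
   * if the sleep was interrupted. The interrupt flag is restored for callers.
   */
  static boolean sleepQuietly(long ms) {
    try {
      Thread.sleep(ms);
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Polls until the deadline, but exits early if a sleep is interrupted. Returns the iteration count. */
  static int pollUntilDeadline(long timeoutMs, long pollMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    int iterations = 0;
    while (System.currentTimeMillis() <= deadline) {
      iterations++;
      // If the return value were ignored here, an interrupted thread would
      // keep looping until the deadline, which delays shutdown.
      if (!sleepQuietly(pollMs)) {
        break;
      }
    }
    return iterations;
  }
}
```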

majisourav99 · 2025-10-22T23:15:24Z

...nice-controller/src/main/java/com/linkedin/venice/controller/DeadStoreStatsPreFetchTask.java

      try {
-        Utils.sleep(refreshIntervalMs);
+        // Check return value of sleep to detect thread interruption for graceful shutdown
+        if (!Utils.sleep(refreshIntervalMs)) {


It's cleaner to call Thread.sleep and catch the InterruptedException, without this extra check on Utils.sleep.
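
A minimal sketch of the suggested shape, assuming a refresh-then-sleep loop: the class name, the `refresh` callback, and the `running` flag are hypothetical and only illustrate the pattern, not the actual task implementation.

```java
final class RefreshLoopSketch implements Runnable {
  private final long refreshIntervalMs;
  private final Runnable refresh; // hypothetical stand-in for the prefetch work
  private volatile boolean running = true;

  RefreshLoopSketch(long refreshIntervalMs, Runnable refresh) {
    this.refreshIntervalMs = refreshIntervalMs;
    this.refresh = refresh;
  }

  /** Requests a stop; the owner would typically also interrupt the worker thread. */
  void stop() {
    running = false;
  }

  @Override
  public void run() {
    while (running) {
      refresh.run();
      try {
        Thread.sleep(refreshIntervalMs);
      } catch (InterruptedException e) {
        // Restore the flag and leave the loop so shutdown is not delayed.
        Thread.currentThread().interrupt();
        break;
      }
    }
  }
}
```

Catching the exception directly keeps the interrupt handling next to the sleep call, so there is no boolean return value to remember to check.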

github-actions · 2025-11-22T02:08:08Z

Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.

[controller] increase dead store stats prefetch timeout to 5 minutes

4d579c4

eldernewborn previously approved these changes Oct 22, 2025

Skip if no longer leader

d838fd1

MikeDafi dismissed eldernewborn’s stale review via d838fd1 October 22, 2025 22:38

feat: increase dead store stats prefetch timeout to 5 minutes to acco…

d544e57

…mmodate aggregated stats fetching

majisourav99 reviewed Oct 22, 2025

feat: handle thread interruption in dead store stats prefetch for gra…

592ded0

…ceful shutdown

MikeDafi enabled auto-merge (squash) October 22, 2025 23:09

Adding interrupt check

ccf1894

majisourav99 reviewed Oct 22, 2025

refactor: use Thread.sleep with InterruptedException for cleaner inte…

e51b210

…rrupt handling

majisourav99 approved these changes Oct 22, 2025

github-actions bot added the stale label Nov 22, 2025
