Skip to content

Conversation

@MikeDafi
Copy link
Contributor

@MikeDafi MikeDafi commented Oct 22, 2025

Problem Statement

We have been seeing timeout issues when only allowing 60 seconds for the controller to become leader. This logic runs in an async thread (DeadStoreStatsPreFetchTask), which is why we see communication/timing issues. The timeout is too short to accommodate:

  • Controller leadership election time
  • Network latency when fetching from livenice aggregated stats
  • Trino query execution time for aggregated dead store statistics

The current 60-second timeout causes the prefetch task to fail before data can be retrieved, resulting in incomplete dead store stats and potential false negatives in store lifecycle management.

Solution

Increased the timeout from 60 seconds to 300 seconds (5 minutes) in DeadStoreStatsPreFetchTask.java. This gives sufficient time for:

  1. Controller to become leader
  2. Livenice aggregated stats API to be queried via Trino
  3. Network roundtrips and query processing

The change is conservative and aligns with typical Trino query execution times for aggregated data.

Code changes

  • Added new code behind a config. No - this is a constant timeout value change.
  • Introduced new log lines. No new log lines, only updated comments for clarity.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues. (No concurrency logic changed)
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed. (Not applicable - only timeout value changed)
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation. (No changes to blocking behavior)
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList). (Not applicable - no collection changes)
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination. (Existing exception handling unchanged)

How was this PR tested?

  • Modified or extended existing tests. Existing tests in DeadStoreStatsPreFetchTaskTest still pass with the increased timeout.
  • Verified backward compatibility (if applicable). This is a timeout increase, fully backward compatible.
  • Manual testing: Verified that the prefetch task now has sufficient time to complete when fetching from livenice aggregated stats endpoint.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

eldernewborn
eldernewborn previously approved these changes Oct 22, 2025
if (System.currentTimeMillis() > deadline) {
throw new VeniceException("Timed out waiting for controller to become leader for cluster: " + clusterName);
}
Utils.sleep(10_000); // sleep for 10 seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check for return value of sleep to track is the thread has been interrupted. otherwise it will block shutdown

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you pointed out a key finding, we need it below otherise we attempt a prefetch unnessecarilly after returning from sleep

We don't need it here because isRunning is checked below and would return false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, my point was that if we call shutodown, it will create a interrupException, which will be swallowed by the Utils.sleep(10_000) and will keep iterating the loop for 5mins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, made the change as well

@MikeDafi MikeDafi enabled auto-merge (squash) October 22, 2025 23:09
try {
Utils.sleep(refreshIntervalMs);
// Check return value of sleep to detect thread interruption for graceful shutdown
if (!Utils.sleep(refreshIntervalMs)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its cleaner to call Thread.sleep and capture the exception without this extra check on utils.sleep

@github-actions
Copy link

Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.

@github-actions github-actions bot added the stale label Nov 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants