[controller] Increase dead store stats prefetch timeout to 5 minutes #2232
…mmodate aggregated stats fetching
if (System.currentTimeMillis() > deadline) {
  throw new VeniceException("Timed out waiting for controller to become leader for cluster: " + clusterName);
}
Utils.sleep(10_000); // sleep for 10 seconds
Check the return value of sleep to track whether the thread has been interrupted; otherwise it will block shutdown.
Actually, you pointed out a key finding: we need it below, otherwise we attempt a prefetch unnecessarily after returning from sleep.
We don't need it here because isRunning is checked below and would return false.
No, my point was that if we call shutdown, it will raise an InterruptedException, which will be swallowed by Utils.sleep(10_000), and the loop will keep iterating for 5 minutes.
makes sense, made the change as well
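The concern raised in this thread can be sketched roughly as follows. This is not the PR's code: sleepQuietly is a hypothetical stand-in for a sleep helper that returns false on interruption instead of throwing (which is how the reviewers describe Utils.sleep), and waitForLeadership, isLeader, and the IllegalStateException are illustrative names chosen for this example.

```java
import java.util.function.BooleanSupplier;

final class LeaderWaitSketch {
  /** Hypothetical stand-in for a sleep helper that reports interruption instead of throwing. */
  static boolean sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt flag visible to callers
      return false;
    }
  }

  /**
   * Polls until isLeader reports true.
   * Returns false if the thread was interrupted while waiting; throws on timeout.
   */
  static boolean waitForLeadership(BooleanSupplier isLeader, long timeoutMs, long pollMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!isLeader.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("Timed out waiting for controller to become leader");
      }
      if (!sleepQuietly(pollMs)) {
        // Interrupted (for example by shutdown): stop waiting instead of
        // looping until the deadline expires.
        return false;
      }
    }
    return true;
  }
}
```

If the return value were ignored, an interrupt delivered during shutdown would be swallowed and the loop would keep polling until the full timeout elapsed, which is the behavior the reviewer warns about.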
try {
  Utils.sleep(refreshIntervalMs);
  // Check return value of sleep to detect thread interruption for graceful shutdown
  if (!Utils.sleep(refreshIntervalMs)) {
It's cleaner to call Thread.sleep and catch the exception, without this extra check on Utils.sleep.
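A minimal sketch of what this suggestion might look like, assuming a refresh loop that sleeps first and then prefetches. PrefetchLoopSketch, its fields, and the prefetch callback are hypothetical names for illustration and do not come from DeadStoreStatsPreFetchTask.

```java
final class PrefetchLoopSketch implements Runnable {
  private final long refreshIntervalMs;
  private final Runnable prefetch; // hypothetical stand-in for the stats prefetch call
  private volatile boolean running = true;

  PrefetchLoopSketch(long refreshIntervalMs, Runnable prefetch) {
    this.refreshIntervalMs = refreshIntervalMs;
    this.prefetch = prefetch;
  }

  /** Requests a stop; callers would typically also interrupt the worker thread. */
  void stop() {
    running = false;
  }

  @Override
  public void run() {
    while (running) {
      try {
        Thread.sleep(refreshIntervalMs);
      } catch (InterruptedException e) {
        // Restore the flag and leave the loop so that no prefetch runs after an interrupt.
        Thread.currentThread().interrupt();
        break;
      }
      prefetch.run();
    }
  }
}
```

Catching InterruptedException directly keeps the shutdown path in one place: the loop exits as soon as the sleep is interrupted, with no separate boolean return value to check.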
Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.
Problem Statement
We have been seeing timeout issues when only allowing 60 seconds for the controller to become leader. This logic runs in an async thread (DeadStoreStatsPreFetchTask), which is why we see communication/timing issues. The timeout is too short to accommodate:
The current 60-second timeout causes the prefetch task to fail before data can be retrieved, resulting in incomplete dead store stats and potential false negatives in store lifecycle management.
Solution
Increased the timeout from 60 seconds to 300 seconds (5 minutes) in DeadStoreStatsPreFetchTask.java. This gives sufficient time for:
The change is conservative and aligns with typical Trino query execution times for aggregated data.
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized, RWLock) are used where needed. (Not applicable - only timeout value changed)
ConcurrentHashMap, CopyOnWriteArrayList). (Not applicable - no collection changes)
How was this PR tested?
DeadStoreStatsPreFetchTaskTest still pass with the increased timeout.
Does this PR introduce any user-facing or breaking changes?