Development: Improve Hazelcast cluster stability and failure detection
#12050
base: develop
This commit addresses cluster instability issues observed during exams with large numbers of concurrent users. When individual build agents become unresponsive, the entire cluster could experience cascading failures due to default timeout values being too conservative.

Changes to CacheConfiguration.java:
- Switch from deadline to phi-accrual failure detector (adaptive to network conditions and GC pauses)
- Reduce MAX_NO_HEARTBEAT_SECONDS from 60s to 15s for faster detection
- Reduce OPERATION_CALL_TIMEOUT_MILLIS from 60s to 15s to prevent thread blocking on unresponsive members
- Reduce INVOCATION_MAX_RETRY_COUNT from ~250 to 5 for faster failure
- Enable slow operation detection with 5s threshold for diagnostics

Changes to HazelcastConnection.java:
- Add detection and logging of stale/zombie members (in Hazelcast but not in service registry)
- Improve efficiency by using Set instead of List for membership checks

Co-Authored-By: Claude Opus 4.5 <[email protected]>
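As a rough illustration of the settings listed in this commit message, the sketch below collects them as Hazelcast property overrides. The property keys are Hazelcast's documented names for these settings; the class name `HazelcastTuningSketch` and the method `tunedProperties` are hypothetical, and how the PR actually wires the values into CacheConfiguration.java is not shown on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HazelcastTuningSketch {

    // Collects the overrides named in the commit message as key/value pairs.
    // Assumption: the values are applied as plain Hazelcast properties.
    public static Map<String, String> tunedProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        // Adaptive failure detector instead of the fixed deadline detector
        props.put("hazelcast.heartbeat.failuredetector.type", "phi-accrual");
        // Treat a member as failed after 15s without heartbeats (was 60s)
        props.put("hazelcast.max.no.heartbeat.seconds", "15");
        // Give up on a blocked operation after 15s (was 60s)
        props.put("hazelcast.operation.call.timeout.millis", "15000");
        // Retry a failed invocation at most 5 times (was ~250)
        props.put("hazelcast.invocation.max.retry.count", "5");
        // Log operations that run longer than 5s
        props.put("hazelcast.slow.operation.detector.enabled", "true");
        props.put("hazelcast.slow.operation.detector.threshold.millis", "5000");
        return props;
    }

    public static void main(String[] args) {
        // With Hazelcast on the classpath, each entry could be passed to
        // com.hazelcast.config.Config#setProperty(String, String).
        tunedProperties().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Keeping the values in one map makes it easy to log the effective tuning at startup; the sketch deliberately avoids a Hazelcast dependency so it runs on its own.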
@krusche Test coverage could not be fully measured because some tests failed. Please check the workflow logs for details.
Walkthrough
This pull request enhances Hazelcast cluster stability by adding detailed configuration parameters for failure detection, operation timeouts, and retry behavior, alongside improvements to cluster member synchronization logic to detect and handle missing or stale members.
Estimated code review effort: 3 (Moderate), ~22 minutes. Pre-merge checks: 3 passed.
Summary
Improves Hazelcast cluster stability by implementing more aggressive failure detection and timeout configurations. This addresses potential cluster instability in large-scale exams with thousands of concurrent users, where unresponsive build agents caused cascading failures across the entire cluster.
Checklist
General
Server
Motivation and Context
During large-scale exams, the Hazelcast cluster can become unstable when build agents stop sending heartbeats for a long time but remain listed in the JHipster Registry.
The root issue is the default 60-second heartbeat timeout combined with the deadline-based failure detector, which doesn't adapt to network conditions or GC pauses.
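To make the difference between the two detector types concrete, here is a minimal sketch of the phi-accrual idea. It is not Hazelcast's implementation: it assumes heartbeat intervals follow an exponential distribution and tracks their mean with a simple moving average, and the class `PhiAccrualSketch` and its smoothing factor are hypothetical. A deadline detector compares the silence against a fixed limit, whereas this one scales the suspicion level by the intervals it has actually observed.

```java
public class PhiAccrualSketch {

    private static final double SMOOTHING = 0.2; // hypothetical weight of the newest interval

    private double meanIntervalMillis;
    private long lastHeartbeatMillis;

    public PhiAccrualSketch(double initialMeanIntervalMillis, long nowMillis) {
        this.meanIntervalMillis = initialMeanIntervalMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    // Records a heartbeat and folds the observed interval into the running mean.
    public void heartbeat(long nowMillis) {
        long interval = nowMillis - lastHeartbeatMillis;
        meanIntervalMillis = (1 - SMOOTHING) * meanIntervalMillis + SMOOTHING * interval;
        lastHeartbeatMillis = nowMillis;
    }

    // Suspicion level: phi = -log10(P(interval > elapsed)).
    // Under the exponential assumption this is elapsed / (mean * ln 10).
    public double phi(long nowMillis) {
        double elapsed = nowMillis - lastHeartbeatMillis;
        return elapsed / (meanIntervalMillis * Math.log(10.0));
    }

    public static void main(String[] args) {
        PhiAccrualSketch detector = new PhiAccrualSketch(1000.0, 0L);
        System.out.println("phi after 10s of silence, 1s mean interval: " + detector.phi(10_000L));
        detector.heartbeat(5_000L); // a late heartbeat raises the expected interval
        System.out.println("phi after 10s of silence, slower heartbeats: " + detector.phi(15_000L));
    }
}
```

The same 10 seconds of silence yields a lower phi once the detector has seen slower heartbeats, which is the sense in which such a detector adapts to network conditions or GC pauses. A member would be suspected once phi crosses a configured threshold.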
Description
Changes to CacheConfiguration.java:
- Reduce MAX_NO_HEARTBEAT_SECONDS from 60s to 15s for 4x faster failure detection
- Reduce OPERATION_CALL_TIMEOUT_MILLIS from 60s to 15s to prevent threads blocking on unresponsive members
- Reduce INVOCATION_MAX_RETRY_COUNT from ~250 to 5 to fail faster instead of retrying up to ~250 times

Changes to HazelcastConnection.java:
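The commit message describes the HazelcastConnection.java change as detecting stale members, meaning members that are still in Hazelcast but no longer in the service registry, and using a Set for membership checks. A minimal sketch of that comparison follows; the class `StaleMemberSketch`, the method `findStaleMembers`, and the use of `host:port` strings as member identifiers are assumptions for illustration, not the PR's actual code.

```java
import java.util.Set;
import java.util.TreeSet;

public class StaleMemberSketch {

    // Returns the addresses the cluster still lists but the registry no longer knows.
    // Set lookups are constant time on average, unlike List.contains, which scans the list.
    public static Set<String> findStaleMembers(Set<String> clusterMembers, Set<String> registryInstances) {
        Set<String> stale = new TreeSet<>(clusterMembers); // sorted for stable log output
        stale.removeAll(registryInstances);
        return stale;
    }

    public static void main(String[] args) {
        // Hypothetical addresses for illustration only
        Set<String> cluster = Set.of("10.0.0.1:5701", "10.0.0.2:5701", "10.0.0.3:5701");
        Set<String> registry = Set.of("10.0.0.1:5701", "10.0.0.3:5701");
        for (String address : findStaleMembers(cluster, registry)) {
            System.out.println("Stale member (in cluster, not in registry): " + address);
        }
    }
}
```

Copying the cluster view before calling `removeAll` leaves both input sets untouched, so the same snapshots can be reused for the opposite check (registry instances that are missing from the cluster).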
Expected improvements:
Steps for Testing
Prerequisites:
Review Progress
Code Review
Manual Tests
Summary by CodeRabbit
Bug Fixes
Performance