
Conversation

@krusche (Member) commented Jan 27, 2026

Summary

Improves Hazelcast cluster stability by implementing more aggressive failure detection and timeout configurations. This addresses potential cluster instability in large-scale exams with thousands of concurrent users, where unresponsive build agents caused cascading failures across the entire cluster.

Checklist

General

Server

  • Important: I implemented the changes with a very good performance and prevented too many (unnecessary) and too complex database calls.
  • I strictly followed the principle of data economy for all database calls.
  • I strictly followed the server coding and design guidelines and the REST API guidelines.
  • I added multiple integration tests (Spring) related to the features (with a high test coverage).
  • I documented the Java code using JavaDoc style.

Motivation and Context

During large-scale exams, the Hazelcast cluster can become unstable when build agents stop sending heartbeats for a long time but remain listed in the JHipster registry.

The root issue is the default 60-second heartbeat timeout combined with the deadline-based failure detector, which doesn't adapt to network conditions or GC pauses.
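
For background (summarizing the published phi-accrual design rather than code in this PR): instead of a fixed deadline, a phi-accrual detector learns the distribution of recent heartbeat inter-arrival times and computes a continuous suspicion level, roughly phi = -log10(probability that a heartbeat this overdue would still arrive), suspecting a member only once phi crosses a configurable threshold. A node on a jittery network or under GC pressure therefore gets proportionally more slack than one whose heartbeats normally arrive like clockwork.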

Description

Changes to CacheConfiguration.java (see the configuration sketch after this list):

  • Switch from the deadline to the phi-accrual failure detector, which adapts to network conditions and tolerates temporary GC pauses without false positives
  • Reduce MAX_NO_HEARTBEAT_SECONDS from 60s to 15s for 4x faster failure detection
  • Reduce OPERATION_CALL_TIMEOUT_MILLIS from 60s to 15s to prevent threads blocking on unresponsive members
  • Reduce INVOCATION_MAX_RETRY_COUNT from ~250 to 5 to fail faster instead of retrying indefinitely
  • Enable slow operation detection with 5s threshold for better diagnostics
  • Add socket connection timeout of 5s
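
A minimal sketch of these settings, assuming they are applied as cluster properties on the Hazelcast `Config` object (the property names below are Hazelcast's documented cluster properties; the exact wiring in CacheConfiguration.java may differ):

```java
import com.hazelcast.config.Config;

public class HazelcastStabilityConfigSketch {

    /** Applies the stability-related cluster properties described above. */
    public static void applyStabilityProperties(Config config) {
        // Adaptive phi-accrual failure detector instead of the fixed deadline detector
        config.setProperty("hazelcast.heartbeat.failuredetector.type", "phi-accrual");
        // Suspect a member after 15s without heartbeats (default: 60s)
        config.setProperty("hazelcast.max.no.heartbeat.seconds", "15");
        // Fail operations targeting unresponsive members after 15s (default: 60s)
        config.setProperty("hazelcast.operation.call.timeout.millis", "15000");
        // Give up after 5 invocation retries instead of retrying ~250 times
        config.setProperty("hazelcast.invocation.max.retry.count", "5");
        // Log operations that run longer than 5s to aid diagnostics
        config.setProperty("hazelcast.slow.operation.detector.enabled", "true");
        config.setProperty("hazelcast.slow.operation.detector.threshold.millis", "5000");
        // Abort socket connection attempts after 5s
        config.setProperty("hazelcast.socket.connect.timeout.seconds", "5");
    }
}
```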

Changes to HazelcastConnection.java (see the detection sketch after this list):

  • Add detection and warning logging for stale/zombie members (members in Hazelcast but not in service registry)
  • Improve efficiency by using Set instead of List for membership checks
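
A minimal sketch of the stale-member check, assuming the registry exposes its known members as a set of address strings (class, method, and parameter names here are illustrative, and the real implementation additionally normalizes IPv6 addresses before comparing):

```java
import java.util.Set;
import java.util.stream.Collectors;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.hazelcast.core.HazelcastInstance;

public class StaleMemberCheckSketch {

    private static final Logger log = LoggerFactory.getLogger(StaleMemberCheckSketch.class);

    /** Warns about members that Hazelcast still lists but the service registry no longer knows. */
    public static void warnAboutStaleMembers(HazelcastInstance hazelcast, Set<String> registryAddresses) {
        // Collect into a Set so membership checks are O(1) rather than O(n) as with a List
        Set<String> hazelcastAddresses = hazelcast.getCluster().getMembers().stream()
                .map(member -> member.getAddress().toString())
                .collect(Collectors.toSet());
        for (String address : hazelcastAddresses) {
            if (!registryAddresses.contains(address)) {
                log.warn("Member {} is in the Hazelcast cluster but not in the service registry - potential stale/zombie member", address);
            }
        }
    }
}
```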

Expected improvements:

  • Individual member failures are detected in ~15-20 seconds instead of 60+ seconds
  • Cascading failures should be contained much faster
  • Better logging for diagnosing cluster issues

Steps for Testing

Prerequisites:

  • A multi-node Artemis deployment with at least 2 core nodes and multiple build agents
  • Access to server logs
  1. Deploy the changes to a test environment
  2. Verify that the Hazelcast cluster forms correctly (check logs for member join events)
  3. Simulate a build agent failure by stopping one build agent container so that the agent becomes unresponsive
  4. Observe that:
    • The failed agent is suspected within ~15-20 seconds (check logs for "Suspecting Member")
    • Core nodes remain stable and responsive
    • There is no cascade of false positives for other healthy members
  5. Restart the build agent and verify that it rejoins the cluster
  6. Check the logs for "stale/zombie member" warnings when a member is in Hazelcast but not in the registry

Review Progress

Code Review

  • Code Review 1
  • Code Review 2

Manual Tests

  • Test 1
  • Test 2

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced cluster stability detection and failure handling
    • Improved identification and management of stale cluster members
    • Better synchronization of cluster member information across the system
  • Performance

    • Optimized cluster communication timeouts and retry behavior for improved reliability


This commit addresses cluster instability issues observed during exams
with large numbers of concurrent users. When individual build agents
become unresponsive, the entire cluster could experience cascading
failures due to default timeout values being too conservative.

Changes to CacheConfiguration.java:
- Switch from deadline to phi-accrual failure detector (adaptive to
  network conditions and GC pauses)
- Reduce MAX_NO_HEARTBEAT_SECONDS from 60s to 15s for faster detection
- Reduce OPERATION_CALL_TIMEOUT_MILLIS from 60s to 15s to prevent
  thread blocking on unresponsive members
- Reduce INVOCATION_MAX_RETRY_COUNT from ~250 to 5 for faster failure
- Enable slow operation detection with 5s threshold for diagnostics

Changes to HazelcastConnection.java:
- Add detection and logging of stale/zombie members (in Hazelcast
  but not in service registry)
- Improve efficiency by using Set instead of List for membership checks

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-project-automation github-project-automation bot moved this to Work In Progress in Artemis Development Jan 27, 2026
@github-actions github-actions bot added the server (Pull requests that update Java code. Added automatically!) and core (Pull requests that affect the corresponding module) labels Jan 27, 2026
@krusche krusche changed the title to Development: Improve Hazelcast cluster stability and failure detection Jan 27, 2026
@github-actions

@krusche Test coverage could not be fully measured because some tests failed. Please check the workflow logs for details.

@coderabbitai coderabbitai bot (Contributor) commented Jan 27, 2026

Walkthrough

This pull request enhances Hazelcast cluster stability by adding detailed configuration parameters for failure detection, operation timeouts, and retry behavior, alongside improvements to cluster member synchronization logic to detect and handle missing or stale members.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Hazelcast Cluster Stability Configuration**<br>`src/main/java/de/tum/cit/aet/artemis/core/config/CacheConfiguration.java` | Added 45 lines of cluster stability parameters: phi-accrual failure detector configuration (threshold, sample size, deviation), heartbeat tuning, operation call and backup timeouts, invocation retry settings, and slow-operation detection thresholds. |
| **Hazelcast Member Connection Management**<br>`src/main/java/de/tum/cit/aet/artemis/core/config/HazelcastConnection.java` | Switched task scheduling from `fixedRate` to `fixedDelay`; introduced registry-to-Hazelcast member address synchronization with detection of missing members (not in Hazelcast but in registry) and stale members (in Hazelcast but not in registry); added IPv6 address normalization and warning logs for potential zombie members. Updated imports to support `Set` operations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main objective of the pull request: improving Hazelcast cluster stability and failure detection through configuration changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



@github-actions

End-to-End (E2E) Test Results Summary

|  | Tests | Passed ☑️ | Skipped ⚠️ | Failed ❌️ | Time ⏱ |
| --- | --- | --- | --- | --- | --- |
| End-to-End (E2E) Test Report | 223 ran | 221 passed | 1 skipped | 1 failed | 1h 40m 26s 435ms |

| Test | Result | Time ⏱ |
| --- | --- | --- |
| `e2e/atlas/LearningPathManagement.spec.ts` › Learning Path Management › Instructor disables learning paths via course settings | ❌ failure | 41s 168ms |

