
Conversation

@krusche (Member) commented Jan 27, 2026

Summary

Improves Hazelcast cluster stability by implementing more aggressive failure detection and timeout configurations. This addresses potential cluster instability in large-scale exams with thousands of concurrent users, where unresponsive build agents caused cascading failures across the entire cluster.

Checklist

General

Server

  • Important: I implemented the changes with a very good performance and prevented too many (unnecessary) and too complex database calls.
  • I strictly followed the principle of data economy for all database calls.
  • I strictly followed the server coding and design guidelines and the REST API guidelines.
  • I added multiple integration tests (Spring) related to the features (with a high test coverage).
  • I documented the Java code using JavaDoc style.

Motivation and Context

During large-scale exams, the Hazelcast cluster can become unstable when build agents stop sending heartbeats for a long time but remain listed in the JHipster registry.

The root issue is the default 60-second heartbeat timeout combined with the deadline-based failure detector, which doesn't adapt to network conditions or GC pauses.
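
For background (summarizing the published phi-accrual design rather than code in this PR): instead of a fixed deadline, a phi-accrual detector learns the distribution of recent heartbeat inter-arrival times and computes a continuous suspicion level, roughly phi = -log10(probability that a heartbeat this overdue would still arrive), suspecting a member only once phi crosses a configurable threshold. A node on a jittery network or under GC pressure therefore gets proportionally more slack than one whose heartbeats normally arrive like clockwork.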

Description

Changes to CacheConfiguration.java (see the configuration sketch after this list):

  • Switch from the deadline to the phi-accrual failure detector, which adapts to network conditions and tolerates temporary GC pauses without false positives
  • Reduce MAX_NO_HEARTBEAT_SECONDS from 60s to 15s for 4x faster failure detection
  • Reduce OPERATION_CALL_TIMEOUT_MILLIS from 60s to 15s to prevent threads blocking on unresponsive members
  • Reduce INVOCATION_MAX_RETRY_COUNT from ~250 to 5 to fail faster instead of retrying indefinitely
  • Enable slow operation detection with 5s threshold for better diagnostics
  • Add socket connection timeout of 5s
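
A minimal sketch of these settings, assuming they are applied as cluster properties on the Hazelcast `Config` object (the property names below are Hazelcast's documented cluster properties; the exact wiring in CacheConfiguration.java may differ):

```java
import com.hazelcast.config.Config;

public class HazelcastStabilityConfigSketch {

    /** Applies the stability-related cluster properties described above. */
    public static void applyStabilityProperties(Config config) {
        // Adaptive phi-accrual failure detector instead of the fixed deadline detector
        config.setProperty("hazelcast.heartbeat.failuredetector.type", "phi-accrual");
        // Suspect a member after 15s without heartbeats (default: 60s)
        config.setProperty("hazelcast.max.no.heartbeat.seconds", "15");
        // Fail operations targeting unresponsive members after 15s (default: 60s)
        config.setProperty("hazelcast.operation.call.timeout.millis", "15000");
        // Give up after 5 invocation retries instead of retrying ~250 times
        config.setProperty("hazelcast.invocation.max.retry.count", "5");
        // Log operations that run longer than 5s to aid diagnostics
        config.setProperty("hazelcast.slow.operation.detector.enabled", "true");
        config.setProperty("hazelcast.slow.operation.detector.threshold.millis", "5000");
        // Abort socket connection attempts after 5s
        config.setProperty("hazelcast.socket.connect.timeout.seconds", "5");
    }
}
```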

Changes to HazelcastConnection.java (see the detection sketch after this list):

  • Add detection and warning logging for stale/zombie members (members in Hazelcast but not in service registry)
  • Improve efficiency by using Set instead of List for membership checks
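
A minimal sketch of the stale-member check, assuming the registry exposes its known members as a set of address strings (class, method, and parameter names here are illustrative, and the real implementation additionally normalizes IPv6 addresses before comparing):

```java
import java.util.Set;
import java.util.stream.Collectors;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.hazelcast.core.HazelcastInstance;

public class StaleMemberCheckSketch {

    private static final Logger log = LoggerFactory.getLogger(StaleMemberCheckSketch.class);

    /** Warns about members that Hazelcast still lists but the service registry no longer knows. */
    public static void warnAboutStaleMembers(HazelcastInstance hazelcast, Set<String> registryAddresses) {
        // Collect into a Set so membership checks are O(1) rather than O(n) as with a List
        Set<String> hazelcastAddresses = hazelcast.getCluster().getMembers().stream()
                .map(member -> member.getAddress().toString())
                .collect(Collectors.toSet());
        for (String address : hazelcastAddresses) {
            if (!registryAddresses.contains(address)) {
                log.warn("Member {} is in the Hazelcast cluster but not in the service registry - potential stale/zombie member", address);
            }
        }
    }
}
```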

Expected improvements:

  • Individual member failures are detected in ~15-20 seconds instead of 60+ seconds
  • Cascading failures should be contained much faster
  • Better logging for diagnosing cluster issues

Steps for Testing

Prerequisites:

  • A multi-node Artemis deployment with at least 2 core nodes and multiple build agents
  • Access to server logs
  1. Deploy the changes to a test environment
  2. Verify that the Hazelcast cluster forms correctly (check logs for member join events)
  3. Simulate a build agent failure by stopping one build agent container so that the agent becomes unresponsive
  4. Observe that:
    • The failed agent is suspected within ~15-20 seconds (check logs for "Suspecting Member")
    • Core nodes remain stable and responsive
    • There is no cascade of false positives for other healthy members
  5. Restart the build agent and verify that it rejoins the cluster
  6. Check the logs for "stale/zombie member" warnings when a member is in Hazelcast but not in the registry

Review Progress

Code Review

  • Code Review 1
  • Code Review 2

Manual Tests

  • Test 1
  • Test 2

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced cluster stability detection and failure handling
    • Improved identification and management of stale cluster members
    • Better synchronization of cluster member information across the system
  • Performance

    • Optimized cluster communication timeouts and retry behavior for improved reliability


This commit addresses cluster instability issues observed during exams
with large numbers of concurrent users. When individual build agents
become unresponsive, the entire cluster could experience cascading
failures due to default timeout values being too conservative.

Changes to CacheConfiguration.java:
- Switch from deadline to phi-accrual failure detector (adaptive to
  network conditions and GC pauses)
- Reduce MAX_NO_HEARTBEAT_SECONDS from 60s to 15s for faster detection
- Reduce OPERATION_CALL_TIMEOUT_MILLIS from 60s to 15s to prevent
  thread blocking on unresponsive members
- Reduce INVOCATION_MAX_RETRY_COUNT from ~250 to 5 for faster failure
- Enable slow operation detection with 5s threshold for diagnostics

Changes to HazelcastConnection.java:
- Add detection and logging of stale/zombie members (in Hazelcast
  but not in service registry)
- Improve efficiency by using Set instead of List for membership checks

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@github-project-automation github-project-automation bot moved this to Work In Progress in Artemis Development Jan 27, 2026
@github-actions github-actions bot added the server (Pull requests that update Java code. Added automatically!) and core (Pull requests that affect the corresponding module) labels Jan 27, 2026
@krusche krusche changed the title to Development: Improve Hazelcast cluster stability and failure detection Jan 27, 2026
@github-actions

@krusche Test coverage could not be fully measured because some tests failed. Please check the workflow logs for details.

@coderabbitai coderabbitai bot (Contributor) commented Jan 27, 2026

Walkthrough

This pull request enhances Hazelcast cluster stability by adding detailed configuration parameters for failure detection, operation timeouts, and retry behavior, alongside improvements to cluster member synchronization logic to detect and handle missing or stale members.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Hazelcast Cluster Stability Configuration**<br>`src/main/java/de/tum/cit/aet/artemis/core/config/CacheConfiguration.java` | Added 45 lines of cluster stability parameters: phi-accrual failure detector configuration (threshold, sample size, deviation), heartbeat tuning, operation call and backup timeouts, invocation retry settings, and slow-operation detection thresholds. |
| **Hazelcast Member Connection Management**<br>`src/main/java/de/tum/cit/aet/artemis/core/config/HazelcastConnection.java` | Switched task scheduling from `fixedRate` to `fixedDelay`; introduced registry-to-Hazelcast member address synchronization with detection of missing members (not in Hazelcast but in registry) and stale members (in Hazelcast but not in registry); added IPv6 address normalization and warning logs for potential zombie members. Updated imports to support `Set` operations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main objective of the pull request: improving Hazelcast cluster stability and failure detection through configuration changes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



@github-actions

End-to-End (E2E) Test Results Summary

|  | Tests | Passed ☑️ | Skipped ⚠️ | Failed ❌️ | Time ⏱ |
| --- | --- | --- | --- | --- | --- |
| End-to-End (E2E) Test Report | 223 ran | 221 passed | 1 skipped | 1 failed | 1h 40m 26s 435ms |

| Test | Result | Time ⏱ |
| --- | --- | --- |
| `e2e/atlas/LearningPathManagement.spec.ts` › Learning Path Management › Instructor disables learning paths via course settings | ❌ failure | 41s 168ms |

