@ctbrennan

Why I'm doing:

Problem

Observer FEs had stale WorkerId mappings in their StarOSAgent cache, causing backup Compute Nodes (CNs) to be selected instead of primary CNs even when the primaries were healthy. This manifested as:

  1. Unexpected cn_selected_for_backup_tablet_scan metric increments
  2. NullPointerException in SHOW PROC '/tablet_mapping'
  3. Inconsistent WorkerId values between Leader and Observer FEs

Restarting the Observer FE resolved the issue only temporarily.

Root Cause

In shared-data mode, when CNs restart they receive new WorkerId assignments from StarMgr. The heartbeat processing logic allowed cache updates only during live processing (!isReplay), not during journal replay (isReplay).

Failure sequence:

  1. CN restarts and gets new WorkerId from StarMgr
  2. Leader FE processes heartbeat live and updates its cache
  3. Observer FE replays heartbeat from journal but skips cache update
  4. Observer cache contains stale WorkerId
  5. TabletComputeNodeMapper uses stale WorkerId, CN appears unavailable
  6. System falls back to backup CN selection
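
The failure sequence can be reproduced in miniature. The following is a minimal sketch, not the StarRocks code: `FeWorkerCache`, `seed`, `onHeartbeat`, the node name `cn-1`, and the WorkerId values 100 and 200 are all hypothetical stand-ins chosen for illustration. It assumes both FEs start with the same mapping and that the old gating skips the cache update whenever `isReplay` is true.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for one FE's local node -> WorkerId cache.
class FeWorkerCache {
    private final Map<String, Long> workerIdByNode = new HashMap<>();

    // Assumption: the mapping is populated once, e.g. when the FE starts.
    void seed(String node, long workerId) {
        workerIdByNode.put(node, workerId);
    }

    // Mirrors the old gating: the cache is updated only when !isReplay.
    void onHeartbeat(String node, long workerId, boolean isReplay) {
        if (!isReplay) {
            workerIdByNode.put(node, workerId);
        }
    }

    Long workerIdOf(String node) {
        return workerIdByNode.get(node);
    }
}

class StaleWorkerIdDemo {
    // Returns {leaderWorkerId, observerWorkerId} as seen after the CN restart.
    static long[] simulateRestart() {
        FeWorkerCache leader = new FeWorkerCache();
        FeWorkerCache observer = new FeWorkerCache();
        leader.seed("cn-1", 100L);
        observer.seed("cn-1", 100L);

        // The CN restarts and is assigned WorkerId 200.
        leader.onHeartbeat("cn-1", 200L, false);  // live processing on the Leader
        observer.onHeartbeat("cn-1", 200L, true); // journal replay on the Observer

        return new long[] {leader.workerIdOf("cn-1"), observer.workerIdOf("cn-1")};
    }

    public static void main(String[] args) {
        long[] ids = simulateRestart();
        System.out.println("leader=" + ids[0] + " observer=" + ids[1]);
    }
}
```

In this sketch the Leader ends up with the new WorkerId while the Observer keeps the old one, which is the mismatch described in steps 3 and 4.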

What I'm doing:

Solution

Allow non-leader FEs to update their StarOSAgent cache during journal replay.

Logic:

  • Live heartbeat processing: All FEs update cache (unchanged)
  • Journal replay: Only Observer/Follower FEs update cache (new)
  • Leader FEs never update cache during replay (they are the source of truth)
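
The three rules above reduce to a single condition: update the cache unless this FE is the Leader and the heartbeat is being replayed. The following is a minimal sketch under that reading. `RecordingAgent`, `HeartbeatHandler`, `shouldUpdateCache`, and the one-argument `addWorker(long)` are hypothetical names and signatures used only for illustration; the real StarOSAgent API may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the agent: records addWorker calls
// instead of touching a real cache or StarMgr.
class RecordingAgent {
    final List<Long> addedWorkerIds = new ArrayList<>();

    void addWorker(long workerId) {
        addedWorkerIds.add(workerId);
    }
}

class HeartbeatHandler {
    private final RecordingAgent agent;
    private final boolean isLeader;

    HeartbeatHandler(RecordingAgent agent, boolean isLeader) {
        this.agent = agent;
        this.isLeader = isLeader;
    }

    // Live processing: always update.
    // Journal replay: update only on non-leader (Observer/Follower) FEs.
    static boolean shouldUpdateCache(boolean isReplay, boolean isLeader) {
        return !isReplay || !isLeader;
    }

    void handle(long workerId, boolean isReplay) {
        if (shouldUpdateCache(isReplay, isLeader)) {
            agent.addWorker(workerId);
        }
    }
}
```

For example, `new HeartbeatHandler(agent, false).handle(200L, true)` records one `addWorker` call (Observer during replay), while the same call with `isLeader` set to `true` records none.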

Testing

Added unit tests:

  • testObserverFeCallsAddWorkerDuringJournalReplay: Verifies cache updates during replay
  • testLeaderFeSkipsAddWorkerDuringJournalReplay: Ensures leaders skip replay updates
  • testLiveHeartbeatAlwaysCallsAddWorker: Confirms live processing unchanged

Safety

This change is safe because:

  • addWorker() is idempotent when called with the current WorkerId
  • StarMgr handles duplicate worker registrations gracefully
  • No additional RPC calls or performance overhead
  • Leader FEs retain authoritative role in heartbeat processing
  • Limited scope to shared-data mode only

The fix ensures all FE types maintain synchronized WorkerId mappings without requiring restarts as a workaround.

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

@ctbrennan force-pushed the cbrennan/fix_workerid_staleness branch from af78389 to 1f0af04 on October 22, 2025 03:09