Fix stale workerid after CN restart #57

ctbrennan · 2025-09-09T03:43:47Z

Why I'm doing:

Problem

Observer FEs had stale WorkerId mappings in their StarOSAgent cache, causing backup Compute Nodes to be selected instead of primary CNs even when primaries were healthy. This manifested as:

Unexpected cn_selected_for_backup_tablet_scan metric increments
NullPointerException in SHOW PROC '/tablet_mapping'
Inconsistent WorkerId values between Leader and Observer FEs
Issue resolved temporarily by Observer FE restart

Root Cause

In shared-data mode, CNs receive new WorkerId assignments from StarMgr when they restart. The heartbeat processing logic only allowed cache updates during live processing (!isReplay), not during journal replay (isReplay).

Failure sequence:
CN restarts and gets new WorkerId from StarMgr
Leader FE processes heartbeat live and updates its cache
Observer FE replays heartbeat from journal but skips cache update
Observer cache contains stale WorkerId
TabletComputeNodeMapper looks up the stale WorkerId, so the CN appears unavailable
System falls back to backup CN selection
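
The failure sequence above can be reproduced with a small model. This is only a sketch under stated assumptions: `FeCache`, `handleHeartbeatOld`, and the node and worker ids are hypothetical stand-ins, not the actual StarOSAgent code. It shows how a guard that skips replayed heartbeats leaves the Observer with the pre-restart WorkerId.

```java
import java.util.HashMap;
import java.util.Map;

class StaleWorkerIdSketch {
    // Hypothetical stand-in for an FE's StarOSAgent cache: CN node id -> WorkerId.
    static final class FeCache {
        final Map<Long, Long> workerIdByNode = new HashMap<>();

        // Old behavior as described in Root Cause: the cache is refreshed only
        // for live heartbeats, never for heartbeats replayed from the journal.
        void handleHeartbeatOld(long nodeId, long workerId, boolean isReplay) {
            if (!isReplay) {
                workerIdByNode.put(nodeId, workerId);
            }
        }
    }

    // Returns {leaderWorkerId, observerWorkerId} after a simulated CN restart.
    static long[] simulateRestart() {
        FeCache leader = new FeCache();
        FeCache observer = new FeCache();
        long nodeId = 10001L; // made-up CN node id

        // Before the restart both FEs agree on WorkerId 1.
        leader.workerIdByNode.put(nodeId, 1L);
        observer.workerIdByNode.put(nodeId, 1L);

        // The CN restarts and is assigned WorkerId 2.
        leader.handleHeartbeatOld(nodeId, 2L, false);  // Leader: live processing
        observer.handleHeartbeatOld(nodeId, 2L, true); // Observer: journal replay

        return new long[] {
            leader.workerIdByNode.get(nodeId),
            observer.workerIdByNode.get(nodeId)
        };
    }

    public static void main(String[] args) {
        long[] ids = simulateRestart();
        System.out.println("leader=" + ids[0] + " observer=" + ids[1]);
    }
}
```

In this model the Leader ends with WorkerId 2 while the Observer still holds 1, which matches the inconsistency listed under Problem.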

What I'm doing:

Solution

Allow non-leader FEs to update their StarOSAgent cache during journal replay.

Logic:
Live heartbeat processing: All FEs update cache (unchanged)
Journal replay: Only Observer/Follower FEs update cache (new)
Leader FEs never update cache during replay (they are the source of truth)
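
The three rules above reduce to a single predicate. The following is a minimal sketch, assuming the FE's role and the replay flag are both available where the heartbeat is handled; `shouldUpdateCache` and its parameters are hypothetical names, not identifiers taken from the patch.

```java
class ReplayCachePolicySketch {
    // Hypothetical helper: decides whether this FE refreshes its WorkerId cache
    // for a given heartbeat.
    static boolean shouldUpdateCache(boolean isReplay, boolean isLeader) {
        if (!isReplay) {
            return true;      // live heartbeat: every FE updates (unchanged)
        }
        return !isLeader;     // journal replay: only Observer/Follower FEs update
    }

    public static void main(String[] args) {
        System.out.println("live, leader:       " + shouldUpdateCache(false, true));
        System.out.println("live, non-leader:   " + shouldUpdateCache(false, false));
        System.out.println("replay, leader:     " + shouldUpdateCache(true, true));
        System.out.println("replay, non-leader: " + shouldUpdateCache(true, false));
    }
}
```

Only the replay-on-leader combination returns false, so the previous live-processing behavior is preserved for every FE role.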

Testing

Added unit tests:
testObserverFeCallsAddWorkerDuringJournalReplay: Verifies cache updates during replay
testLeaderFeSkipsAddWorkerDuringJournalReplay: Ensures leaders skip replay updates
testLiveHeartbeatAlwaysCallsAddWorker: Confirms live processing unchanged
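
For readers without the test sources at hand, the three cases can be approximated without a mocking framework. This is an illustrative sketch only: `RecordingAgent`, `handleHeartbeat`, and `callsFor` are hypothetical, and the real tests presumably exercise the actual heartbeat handler rather than this stand-in.

```java
import java.util.ArrayList;
import java.util.List;

class HeartbeatReplayTestSketch {
    // Hypothetical fake that records addWorker calls instead of contacting StarMgr.
    static final class RecordingAgent {
        final List<Long> addedNodeIds = new ArrayList<>();

        void addWorker(long nodeId) {
            addedNodeIds.add(nodeId);
        }
    }

    // Hypothetical handler: update on live heartbeats, and on replay only
    // when this FE is not the leader.
    static void handleHeartbeat(RecordingAgent agent, long nodeId,
                                boolean isReplay, boolean isLeader) {
        if (!isReplay || !isLeader) {
            agent.addWorker(nodeId);
        }
    }

    // Number of addWorker calls triggered by one heartbeat in the given mode.
    static int callsFor(boolean isReplay, boolean isLeader) {
        RecordingAgent agent = new RecordingAgent();
        handleHeartbeat(agent, 10001L, isReplay, isLeader);
        return agent.addedNodeIds.size();
    }

    static void check(boolean condition, String name) {
        if (!condition) {
            throw new AssertionError(name + " failed");
        }
    }

    public static void main(String[] args) {
        check(callsFor(true, false) == 1, "testObserverFeCallsAddWorkerDuringJournalReplay");
        check(callsFor(true, true) == 0, "testLeaderFeSkipsAddWorkerDuringJournalReplay");
        check(callsFor(false, true) == 1 && callsFor(false, false) == 1,
                "testLiveHeartbeatAlwaysCallsAddWorker");
        System.out.println("all checks passed");
    }
}
```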

Safety

This change is safe because:
addWorker() is idempotent when called with the current WorkerId
StarMgr handles duplicate worker registrations gracefully
No additional RPC calls or performance overhead
Leader FEs retain their authoritative role in heartbeat processing
Scope is limited to shared-data mode

The fix keeps WorkerId mappings synchronized across all FE types, so restarting an Observer FE is no longer needed as a workaround.

Fixes #issue

What type of PR is this:

ctbrennan added 2 commits October 21, 2025 23:08

replay heartbeat in non-observer to keep workerids fresh

8206c67

logging of unavailable CN

1f0af04

ctbrennan force-pushed the cbrennan/fix_workerid_staleness branch from af78389 to 1f0af04 Compare October 22, 2025 03:09
