Fix stale workerid after CN restart #57
Open
Why I'm doing:
Problem
Observer FEs had stale WorkerId mappings in their StarOSAgent cache, causing backup Compute Nodes to be selected instead of primary CNs even when primaries were healthy. This manifested as:
Issue resolved temporarily by Observer FE restart
Root Cause
In shared-data mode, a CN that restarts receives a new WorkerId assignment from StarMgr. The heartbeat processing logic updated the StarOSAgent cache only during live processing (!isReplay) and skipped the update during journal replay (isReplay).
Failure sequence:
CN restarts and gets new WorkerId from StarMgr
Leader FE processes heartbeat live and updates its cache
Observer FE replays heartbeat from journal but skips cache update
Observer cache contains stale WorkerId
TabletComputeNodeMapper looks up the stale WorkerId, so the healthy primary CN appears unavailable
System falls back to backup CN selection
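The failure sequence above can be reproduced with a small model. This is only a sketch: FeWorkerCache, seed, onHeartbeat, lookup, the address "cn-1:9050", and the WorkerId values 100 and 200 are all hypothetical stand-ins, not the real StarOSAgent API. It assumes the pre-fix guard behaves as described under Root Cause.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one FE's CN-address -> WorkerId cache (not the real StarOSAgent).
final class FeWorkerCache {
    private final Map<String, Long> workerIdByCn = new HashMap<>();

    // Models whatever populated the cache before the CN restarted.
    void seed(String cnAddress, long workerId) {
        workerIdByCn.put(cnAddress, workerId);
    }

    // Pre-fix guard as described under Root Cause: replayed heartbeats never touch the cache.
    void onHeartbeat(String cnAddress, long workerId, boolean isReplay) {
        if (!isReplay) {
            workerIdByCn.put(cnAddress, workerId);
        }
    }

    // Returns the cached WorkerId, or -1 if the CN is unknown.
    long lookup(String cnAddress) {
        return workerIdByCn.getOrDefault(cnAddress, -1L);
    }
}

final class StaleWorkerIdDemo {
    // Returns {leader's WorkerId, observer's WorkerId} after the CN restart.
    static long[] run() {
        String cn = "cn-1:9050"; // illustrative address
        FeWorkerCache leader = new FeWorkerCache();
        FeWorkerCache observer = new FeWorkerCache();
        leader.seed(cn, 100L);
        observer.seed(cn, 100L);

        // The CN restarts and is assigned a new WorkerId (200 in this model).
        leader.onHeartbeat(cn, 200L, false);  // leader processes the heartbeat live
        observer.onHeartbeat(cn, 200L, true); // observer replays it from the journal

        return new long[] {leader.lookup(cn), observer.lookup(cn)};
    }
}
```

In this model the leader ends up with the new WorkerId while the observer still holds the old one, which matches the stale-cache symptom described in Problem.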
What I'm doing:
Solution
Allow non-leader FEs to update their StarOSAgent cache during journal replay.
Logic:
Live heartbeat processing: All FEs update cache (unchanged)
Journal replay: Only Observer/Follower FEs update cache (new)
Leader FEs never update cache during replay (they are the source of truth)
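The three rules above reduce to a single predicate. The following is a minimal sketch of that decision, not the code in this PR: the class WorkerCacheUpdatePolicy, the method name, and the three boolean parameters are hypothetical, and the shared-data check reflects the scope stated under Safety.

```java
// Hypothetical helper that encodes the decision table from the Logic list.
final class WorkerCacheUpdatePolicy {
    private WorkerCacheUpdatePolicy() {
    }

    static boolean shouldUpdateWorkerCache(boolean isSharedDataMode,
                                           boolean isReplay,
                                           boolean isLeader) {
        if (!isSharedDataMode) {
            return false;  // the change is scoped to shared-data mode
        }
        if (!isReplay) {
            return true;   // live heartbeat: every FE updates its cache
        }
        return !isLeader;  // journal replay: only Observer/Follower FEs update
    }
}
```

Written this way, the only input combination whose result differs from the old guard is replay on a non-leader FE in shared-data mode.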
Testing
Added unit tests:
testObserverFeCallsAddWorkerDuringJournalReplay: Verifies cache updates during replay
testLeaderFeSkipsAddWorkerDuringJournalReplay: Ensures leaders skip replay updates
testLiveHeartbeatAlwaysCallsAddWorker: Confirms live processing unchanged
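For readers who want to see the shape of these checks without the project's test harness, here is a rough sketch using a hand-written recording fake. WorkerRegistry, RecordingRegistry, and HeartbeatHandler are hypothetical names invented for illustration; the actual tests in this PR may be structured differently and may use a mocking framework instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical seam standing in for the agent's addWorker call.
interface WorkerRegistry {
    void addWorker(long nodeId, String workerAddress);
}

// Fake that records every addWorker call so a test can inspect it.
final class RecordingRegistry implements WorkerRegistry {
    final List<String> calls = new ArrayList<>();

    @Override
    public void addWorker(long nodeId, String workerAddress) {
        calls.add(nodeId + "@" + workerAddress);
    }
}

// Hypothetical handler applying the post-fix rule for shared-data mode.
final class HeartbeatHandler {
    private final WorkerRegistry registry;
    private final boolean isLeader;

    HeartbeatHandler(WorkerRegistry registry, boolean isLeader) {
        this.registry = registry;
        this.isLeader = isLeader;
    }

    void handle(long nodeId, String workerAddress, boolean isReplay) {
        // Live heartbeats always update; replayed ones only on non-leader FEs.
        if (!isReplay || !isLeader) {
            registry.addWorker(nodeId, workerAddress);
        }
    }
}
```

Each of the three listed tests then corresponds to one combination of isLeader and isReplay, followed by a check on how many calls the fake recorded.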
Safety
This change is safe because:
addWorker() is idempotent when called with current WorkerId
StarMgr handles duplicate worker registrations gracefully
No additional RPC calls or performance overhead
Leader FEs retain authoritative role in heartbeat processing
Scope is limited to shared-data mode
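The idempotency claim can be illustrated with a toy cache. This sketch assumes the cache behaves like a map keyed by CN address; IdempotentWorkerCache and its methods are hypothetical and say nothing about how the real addWorker() or StarMgr are implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical map-backed cache used only to illustrate idempotent updates.
final class IdempotentWorkerCache {
    private final Map<String, Long> workerIdByAddress = new HashMap<>();

    // Stores the mapping and reports whether the cached WorkerId actually changed.
    boolean addWorker(String workerAddress, long workerId) {
        Long previous = workerIdByAddress.put(workerAddress, workerId);
        return previous == null || previous != workerId;
    }

    // Returns the cached WorkerId, or -1 if the address is unknown.
    long workerIdOf(String workerAddress) {
        return workerIdByAddress.getOrDefault(workerAddress, -1L);
    }
}
```

Under this assumption, replaying a heartbeat that carries the WorkerId already in the cache leaves the state unchanged, while a heartbeat carrying a new WorkerId replaces the stale entry.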
The fix ensures all FE types maintain synchronized WorkerId mappings without requiring restarts as a workaround.
Fixes #issue
What type of PR is this: