Fix node restart bad signature #1051
Draft
+302
−20
Fix BadSignature Error During Channel Reestablishment
Problem Description
Issue 1: Duplicate Retryable Operations Causing Panic
During channel reestablishment after node restart, the system would panic with:
Root Cause: Due to the complex message buffering and replay mechanism during reestablishment, certain TLC operations could be added to the retry queue multiple times, causing duplicates.
Issue 2: BadSignature Error (Primary Issue)
The more critical issue was intermittent Musig2VerifyError(BadSignature) errors when processing CommitmentSigned messages during reestablishment. This occurred in approximately 66% of test runs before the fix.
Root Cause: Race condition in the reestablishment synchronization barrier mechanism. Specifically, for CommitmentSigned messages:
- When the commitment numbers matched (my_local_commitment_number == peer_remote_commitment_number && my_remote_commitment_number == peer_local_commitment_number), the code would call resend_tlcs_on_reestablish(true), which could send a CommitmentSigned
- The reestablish_syncing flag was not set to true in that case
- As a result, the RevokeAndAck response and subsequent messages would not be buffered, but processed immediately
The synchronization barrier requires two flags:
- reestablishing: Indicates the channel is in reestablishment mode
- reestablish_syncing: Indicates we sent a CommitmentSigned and are waiting for RevokeAndAck to complete synchronization
When reestablish_syncing is true, all incoming messages are buffered until the RevokeAndAck is received, ensuring proper message ordering and state consistency.
Solution
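The synchronization barrier described in the problem description can be sketched roughly as follows. This is a simplified illustration, not the code in channel.rs: the ChannelSketch struct, the PeerMessage enum, the has_pending_tlcs parameter, and the String error type are all hypothetical stand-ins. It also assumes that RevokeAndAck itself is processed on arrival and the buffered messages are replayed afterwards.

```rust
use std::collections::VecDeque;

// Hypothetical, simplified message type; the real messages carry payloads.
#[derive(Debug, Clone, PartialEq)]
pub enum PeerMessage {
    CommitmentSigned,
    RevokeAndAck,
    AddTlc(u64),
}

// Hypothetical stand-in for the channel state relevant to the barrier.
#[derive(Debug, Default)]
pub struct ChannelSketch {
    pub reestablishing: bool,
    pub reestablish_syncing: bool,
    pub commitment_signed_sent: u32,
    pub buffered: VecDeque<PeerMessage>,
    pub processed: Vec<PeerMessage>,
}

impl ChannelSketch {
    /// Stand-in for resend_tlcs_on_reestablish: Ok(true) means a
    /// CommitmentSigned was sent. The error type is simplified to String.
    fn resend_tlcs_on_reestablish(&mut self, has_pending_tlcs: bool) -> Result<bool, String> {
        if has_pending_tlcs {
            self.commitment_signed_sent += 1;
        }
        Ok(has_pending_tlcs)
    }

    /// Call site: raise the barrier whenever a CommitmentSigned went out.
    pub fn on_reestablish(&mut self, has_pending_tlcs: bool) -> Result<(), String> {
        self.reestablishing = true;
        if self.resend_tlcs_on_reestablish(has_pending_tlcs)? {
            self.reestablish_syncing = true;
        }
        Ok(())
    }

    /// While the barrier is up, buffer everything except RevokeAndAck.
    pub fn on_message(&mut self, msg: PeerMessage) {
        if self.reestablishing && self.reestablish_syncing {
            if msg == PeerMessage::RevokeAndAck {
                // Barrier released: process the ack, then replay in order.
                self.processed.push(msg);
                self.reestablish_syncing = false;
                while let Some(m) = self.buffered.pop_front() {
                    self.processed.push(m);
                }
            } else {
                self.buffered.push_back(msg);
            }
        } else {
            self.processed.push(msg);
        }
    }
}
```

In this sketch, a message that arrives after on_reestablish(true) but before RevokeAndAck stays in the buffer, which is the ordering guarantee the missing flag failed to provide.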
Fix for Issue 1: Duplicate Operations
Added proactive deduplication logic in the apply_retryable_tlc_operations function (channel.rs:1939-1960), so duplicate entries in the retry queue no longer cause a panic.
Fix for Issue 2: BadSignature Error
Modified the reestablishment flow to ensure the synchronization barrier is always properly set:
- Changed resend_tlcs_on_reestablish signature (channel.rs:7013, 7082): it now returns Result<bool, ProcessingChannelError>, where the boolean indicates whether CommitmentSigned was sent
- Updated all call sites (channel.rs:6957-6975): when resend_tlcs_on_reestablish returns true, set reestablish_syncing = true, so the barrier is in place whenever we send CommitmentSigned during reestablishment
- Added comprehensive logging (channel.rs:770-787, 7067-7070) covering CommitmentSigned messages, including those exchanged during reestablishment
Testing
Test Infrastructure Changes
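- Added #[ignore] attribute to test_node_restart to exclude it from default test runs
- Added a CI job, test-reestablishment, that:
  - Checks the logs for BadSignature keywords and fails if detected
  - Checks the logs for panic keywords and fails if detected
Test Results
Before Fix:
- BadSignature during CommitmentSigned verification (in approximately 66% of test runs, as noted above)
After Fix: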
Verification Strategy
The test relies on log-based verification rather than explicit assertions: the run fails if BadSignature or panic keywords appear in the logs.
Files Changed
crates/fiber-lib/src/fiber/channel.rs:
- CommitmentSigned handling
- resend_tlcs_on_reestablish return type
crates/fiber-lib/src/fiber/tests/channel.rs:
- Added #[ignore] attribute to test_node_restart
.github/workflows/ci.yml:
- Added test-reestablishment job
Impact
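This fix ensures reliable channel reestablishment under stress conditions by deduplicating retryable TLC operations and by setting the synchronization barrier whenever CommitmentSigned is sent during reestablishment.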
The changes are backward compatible and do not affect normal channel operations.
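For reviewers, the deduplication described under Fix for Issue 1 can be illustrated with a minimal sketch. It is not the logic in apply_retryable_tlc_operations: the RetryableTlcOperation enum, its variants, and the dedup_retryable_operations helper are hypothetical. It assumes two operations count as duplicates when they compare equal, and that the first occurrence should be kept.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a retryable TLC operation.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum RetryableTlcOperation {
    AddTlc { tlc_id: u64 },
    RemoveTlc { tlc_id: u64 },
}

/// Drops repeated operations, keeping the first occurrence of each and
/// preserving the original queue order.
pub fn dedup_retryable_operations(
    ops: Vec<RetryableTlcOperation>,
) -> Vec<RetryableTlcOperation> {
    let mut seen = HashSet::new();
    ops.into_iter()
        // HashSet::insert returns false when the value was already present.
        .filter(|op| seen.insert(op.clone()))
        .collect()
}
```

Deduplicating before the operations are applied means the later processing never sees the same operation twice, whichever replay path queued the second copy.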