dumpling: support keyspace-level GC for keyspace clusters (#66883)#67248
dumpling: support keyspace-level GC for keyspace clusters (#66883)#67248ti-chi-bot wants to merge 16 commits intopingcap:release-nextgen-20251011from
Conversation
|
This cherry pick PR is for a release branch and has not yet been approved by triage owners. To merge this cherry pick:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 WalkthroughWalkthroughThis PR adds keyspace-level garbage collection support for premium TiDB clusters in dumpling. It introduces new CLI flags Changes
Sequence Diagram(s)sequenceDiagram
participant Dumper
participant TiDB as TiDB Database
participant PDClient as PD Client
participant GCStates as GC States API
Dumper->>TiDB: Query KEYSPACE_META
TiDB-->>Dumper: keyspaceName, keyspaceID
alt Premium (Keyspace) Cluster
Dumper->>PDClient: Create with keyspace API v2
Dumper->>GCStates: SetGCBarrier(snapshotTS - 1)
GCStates-->>Dumper: barrier set
Note over Dumper,GCStates: Periodic updates during dump
Dumper->>GCStates: SetGCBarrier(current protection TS)
GCStates-->>Dumper: updated
Dumper->>GCStates: DeleteGCBarrier (cleanup on done)
GCStates-->>Dumper: barrier removed
else Classical Cluster
Dumper->>PDClient: Discover endpoints via GetPdAddrs
Dumper->>PDClient: UpdateServiceGCSafePoint(snapshotTS - 1)
PDClient-->>Dumper: safe point updated
Note over Dumper,PDClient: Periodic updates during dump
Dumper->>PDClient: UpdateServiceGCSafePoint(current protection TS)
PDClient-->>Dumper: updated
Dumper->>PDClient: UpdateServiceGCSafePoint(0, 0) (cleanup)
PDClient-->>Dumper: cleaned up
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.3)Command failed Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: GMHDBJD The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
|
@ti-chi-bot: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
dumpling/export/dump_test.go (2)
402-406: Remove unused code.Lines 403-404 assign
origUpdatebut it's immediately discarded with_ = origUpdate. This appears to be leftover from an abandoned approach.🧹 Remove dead code
- // Wrap UpdateServiceGCSafePoint to count calls including failures. - origUpdate := mockPD.UpdateServiceGCSafePoint - _ = origUpdate // ensure the method exists - // We can't easily wrap the method, so instead track via the mock's counter - // and poll it. The mock already counts calls.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dumpling/export/dump_test.go` around lines 402 - 406, Remove the dead assignment that captures a reference to mockPD.UpdateServiceGCSafePoint (origUpdate := mockPD.UpdateServiceGCSafePoint) and the immediate discard (_ = origUpdate); these lines are leftover from an abandoned wrapping approach and should be deleted so the test relies solely on the mock's existing call counter as described in the surrounding comment.
254-256: Consider usingrequire.Eventuallyinstead oftime.Sleepfor waiting on the first call.The fixed 200ms sleep may cause flakiness under load. Consider using a polling pattern similar to what's already used later in the test.
♻️ Suggested improvement
- // Give the background goroutine a moment to make its first call. - time.Sleep(200 * time.Millisecond) + // Wait for the background goroutine to make its first call. + if tc.expectBarrierAPI { + require.Eventually(t, func() bool { + mockPD.gcStatesClient.mu.Lock() + defer mockPD.gcStatesClient.mu.Unlock() + return mockPD.gcStatesClient.setCalls > 0 + }, 5*time.Second, 50*time.Millisecond) + } else { + require.Eventually(t, func() bool { + mockPD.mu.Lock() + defer mockPD.mu.Unlock() + return mockPD.updateSafePointCalls > 0 + }, 5*time.Second, 50*time.Millisecond) + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dumpling/export/dump_test.go` around lines 254 - 256, Replace the fixed time.Sleep(200 * time.Millisecond) with a polling assertion using require.Eventually so the test waits until the background goroutine makes its first call without flakiness; specifically, remove the time.Sleep and call require.Eventually(t, func() bool { /* return true when the mock/spy call count or condition indicating "first call" is met */ }, time.Second, 10*time.Millisecond) (adjust timeout/interval as appropriate) to poll the mock/counter used in the test instead of sleeping.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@dumpling/export/dump_test.go`:
- Around line 402-406: Remove the dead assignment that captures a reference to
mockPD.UpdateServiceGCSafePoint (origUpdate := mockPD.UpdateServiceGCSafePoint)
and the immediate discard (_ = origUpdate); these lines are leftover from an
abandoned wrapping approach and should be deleted so the test relies solely on
the mock's existing call counter as described in the surrounding comment.
- Around line 254-256: Replace the fixed time.Sleep(200 * time.Millisecond) with
a polling assertion using require.Eventually so the test waits until the
background goroutine makes its first call without flakiness; specifically,
remove the time.Sleep and call require.Eventually(t, func() bool { /* return
true when the mock/spy call count or condition indicating "first call" is met */
}, time.Second, 10*time.Millisecond) (adjust timeout/interval as appropriate) to
poll the mock/counter used in the test instead of sleeping.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: dc0987a9-d10b-46b2-8a5b-688e4c741562
📒 Files selected for processing (6)
dumpling/export/BUILD.bazeldumpling/export/config.godumpling/export/dump.godumpling/export/dump_test.godumpling/export/sql.godumpling/export/util_for_test.go
This is an automated cherry-pick of #66883
What problem does this PR solve?
Issue Number: close #66882
Problem Summary:
Dumpling needs keyspace-aware GC protection in keyspace (premium) clusters. Cloud control passes PD endpoints and keyspace name, and Dumpling should validate via
information_schema.KEYSPACE_METAand keep GC below the dump snapshot TS.What changed and how does it work?
--pdand--keyspace-namefor keyspace clusters.information_schema.KEYSPACE_METAand validate--keyspace-namematchesKEYSPACE_NAME(mismatch -> error).--pd/--keyspace-name.SetGCBarrier) below snapshot TS during dump; classical cluster keeps service GC safepoint.Check List
Tests
Test details:
make failpoint-enable && go test ./dumpling/export && make failpoint-disablemake lintmake bazel_preparemake bazel_lintSide effects
Documentation
Release note
Summary by CodeRabbit
New Features
--pd,--cluster-ssl-ca,--cluster-ssl-cert, and--cluster-ssl-keyCLI flags to configure garbage collection and cluster-specific TLS settingsTests