
dumpling: support keyspace-level GC for keyspace clusters (#66883) #67248

Open
ti-chi-bot wants to merge 16 commits into pingcap:release-nextgen-20251011 from ti-chi-bot:cherry-pick-66883-to-release-nextgen-20251011

@ti-chi-bot ti-chi-bot commented Mar 24, 2026

This is an automated cherry-pick of #66883

What problem does this PR solve?

Issue Number: close #66882

Problem Summary:
Dumpling needs keyspace-aware GC protection in keyspace (premium) clusters. Cloud control passes PD endpoints and keyspace name, and Dumpling should validate via information_schema.KEYSPACE_META and keep GC below the dump snapshot TS.

What changed and how does it work?

  • Add --pd and --keyspace-name for keyspace clusters.
  • Read information_schema.KEYSPACE_META and validate --keyspace-name matches KEYSPACE_NAME (mismatch -> error).
  • If both keyspace names are empty, treat as classical cluster and reject --pd/--keyspace-name.
  • Keep a keyspace GC barrier (PD GCStates SetGCBarrier) below snapshot TS during dump; classical cluster keeps service GC safepoint.
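
The validation rules in the list above can be sketched as a small decision function. This is a minimal illustration, not Dumpling's code: `resolveClusterMode`, `clusterMode`, and the error messages are hypothetical names, and the sketch assumes that a non-empty `--keyspace-name` on a classical cluster is rejected through the same mismatch check. Whether `--pd` is mandatory on keyspace clusters is not stated in the description, so the sketch does not enforce it.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterMode says which GC protection path a dump would use.
type clusterMode int

const (
	classicalCluster clusterMode = iota // service GC safepoint path
	keyspaceCluster                     // keyspace GC barrier path
)

// resolveClusterMode is a hypothetical helper. flagKeyspace stands for the
// --keyspace-name value, metaKeyspace for KEYSPACE_NAME as read from
// information_schema.KEYSPACE_META, and pdAddr for the --pd value.
func resolveClusterMode(flagKeyspace, metaKeyspace, pdAddr string) (clusterMode, error) {
	// Any difference between the flag and the cluster's own name is an error.
	// This also rejects --keyspace-name on a classical cluster (empty meta name).
	if flagKeyspace != metaKeyspace {
		return classicalCluster, fmt.Errorf(
			"keyspace name mismatch: flag %q, cluster %q", flagKeyspace, metaKeyspace)
	}
	if metaKeyspace == "" {
		// Both names are empty: classical cluster, keyspace-only flags are rejected.
		if pdAddr != "" {
			return classicalCluster, errors.New("--pd is only valid for keyspace clusters")
		}
		return classicalCluster, nil
	}
	return keyspaceCluster, nil
}

func main() {
	mode, err := resolveClusterMode("ks1", "ks1", "127.0.0.1:2379")
	fmt.Println(mode == keyspaceCluster, err)
}
```

Keeping the decision in one pure function makes each rule easy to cover with a table-driven unit test, independent of any database or PD connection.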

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Test details:

  • make failpoint-enable && go test ./dumpling/export && make failpoint-disable
  • make lint
  • make bazel_prepare
  • make bazel_lint

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Dumpling supports keyspace-level GC protection for keyspace clusters by keeping a keyspace GC barrier during dump.

Summary by CodeRabbit

  • New Features

    • Added --pd, --cluster-ssl-ca, --cluster-ssl-cert, and --cluster-ssl-key CLI flags to configure garbage collection and cluster-specific TLS settings
    • Enhanced support for garbage collection operations across different cluster configurations
  • Tests

    • Added comprehensive test coverage for keyspace metadata resolution, GC dispatcher logic, TLS option selection, and GC updater control flows

@ti-chi-bot ti-chi-bot added component/dumpling This is related to Dumpling of TiDB. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-nextgen-20251011 labels Mar 24, 2026
ti-chi-bot bot commented Mar 24, 2026

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be LGTMed and approved by the reviewers first.
  2. For pull requests to TiDB-x branches, it must have no failed tests.
  3. AFTER it has lgtm and approved labels, please wait for the cherry-pick merging approval from triage owners.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai bot commented Mar 24, 2026

Walkthrough

This PR adds keyspace-level garbage collection support for premium TiDB clusters in dumpling. It introduces new CLI flags --pd and cluster-specific TLS overrides, implements keyspace metadata resolution from the database, and provides dual GC protection paths: keyspace-level barrier updates via PD for premium clusters and service-level safe-point updates for classical clusters.
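
As a rough illustration of the metadata resolution step, the sketch below reads one row from `information_schema.KEYSPACE_META` and maps SQL NULLs to zero values. It is not the PR's `queryCurrentKeyspaceNameAndID`: the names `keyspaceMeta`, `normalizeKeyspaceMeta`, and `queryKeyspaceMeta` are hypothetical, and the `KEYSPACE_ID` column name is an assumption (only `KEYSPACE_NAME` appears in the PR description).

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// keyspaceMeta holds the values read from one KEYSPACE_META row.
type keyspaceMeta struct {
	Name  string
	ID    uint32
	HasID bool // false when the ID column was NULL
}

// normalizeKeyspaceMeta maps nullable column values (nil means SQL NULL)
// to plain Go values, so callers never dereference a nil pointer.
func normalizeKeyspaceMeta(name *string, id *int64) keyspaceMeta {
	var meta keyspaceMeta
	if name != nil {
		meta.Name = *name
	}
	if id != nil && *id >= 0 {
		meta.ID = uint32(*id)
		meta.HasID = true
	}
	return meta
}

// queryKeyspaceMeta is a hypothetical query helper. The KEYSPACE_ID column
// name is an assumption, not taken from the PR text.
func queryKeyspaceMeta(ctx context.Context, db *sql.DB) (keyspaceMeta, error) {
	var name *string // database/sql leaves these nil when the column is NULL
	var id *int64
	const q = "SELECT KEYSPACE_NAME, KEYSPACE_ID FROM information_schema.KEYSPACE_META"
	if err := db.QueryRowContext(ctx, q).Scan(&name, &id); err != nil {
		return keyspaceMeta{}, err
	}
	return normalizeKeyspaceMeta(name, id), nil
}

func main() {
	name, id := "ks1", int64(7)
	fmt.Printf("%+v\n", normalizeKeyspaceMeta(&name, &id))
	fmt.Printf("%+v\n", normalizeKeyspaceMeta(nil, nil))
}
```

Scanning into pointer destinations is one way to handle NULL; `sql.NullString` and `sql.NullInt64` would work equally well.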

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build Configuration (dumpling/export/BUILD.bazel) | Added test dependencies on pflag, pd-client, and pd-client/clients/gc to support keyspace GC testing. |
| Configuration (dumpling/export/config.go) | Added CLI flags --pd, --cluster-ssl-ca, --cluster-ssl-cert, --cluster-ssl-key and corresponding Config struct fields (PDAddr, ClusterSSLCA, ClusterSSLCert, ClusterSSLKey) with fallback behavior to existing TLS flags. |
| Core Keyspace GC Logic (dumpling/export/dump.go) | Implemented keyspace metadata resolution step, dual GC paths based on cluster type (keyspace vs classical), PD client setup with keyspace-aware API context v2, and separate update routines for keyspace GC barriers (SetGCBarrier/DeleteGCBarrier) vs service safe-point updates. Added helper functions pdSecurityOptionForGC and firstNonEmpty for TLS configuration reuse. |
| SQL Utilities (dumpling/export/sql.go) | Added helper function queryCurrentKeyspaceNameAndID() to query keyspace metadata from information_schema.KEYSPACE_META with proper NULL handling. |
| Test Infrastructure (dumpling/export/util_for_test.go) | Added mock implementations mockGCStatesClient and mockPDClientForGC with call tracking and error injection for testing GC protection flows. |
| Comprehensive Test Suite (dumpling/export/dump_test.go) | Added extensive test coverage for keyspace metadata resolution (including classical/premium validation), GC dispatch logic, TLS override behavior, and concurrent GC updater control flows with retry and cleanup semantics for both keyspace barriers and service safe-points. |
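
The TLS fallback described for config.go and dump.go might look roughly like the following. The helper name `firstNonEmpty` comes from the summary above, but its signature here is a guess; `tlsPaths` and `pickClusterTLS` are hypothetical, and the per-field fallback is an assumption about how the cluster-specific flags override the existing ones.

```go
package main

import "fmt"

// firstNonEmpty returns the first argument that is not the empty string,
// or "" when every argument is empty. The signature is assumed.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// tlsPaths is a hypothetical container for certificate file paths.
type tlsPaths struct {
	CA, Cert, Key string
}

// pickClusterTLS prefers each cluster-specific path (--cluster-ssl-*) and
// falls back, field by field, to the general TLS setting when it is unset.
func pickClusterTLS(cluster, general tlsPaths) tlsPaths {
	return tlsPaths{
		CA:   firstNonEmpty(cluster.CA, general.CA),
		Cert: firstNonEmpty(cluster.Cert, general.Cert),
		Key:  firstNonEmpty(cluster.Key, general.Key),
	}
}

func main() {
	general := tlsPaths{CA: "ca.pem", Cert: "client.pem", Key: "client.key"}
	cluster := tlsPaths{CA: "cluster-ca.pem"}
	fmt.Printf("%+v\n", pickClusterTLS(cluster, general))
}
```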

Sequence Diagram(s)

sequenceDiagram
    participant Dumper
    participant TiDB as TiDB Database
    participant PDClient as PD Client
    participant GCStates as GC States API

    Dumper->>TiDB: Query KEYSPACE_META
    TiDB-->>Dumper: keyspaceName, keyspaceID
    
    alt Premium (Keyspace) Cluster
        Dumper->>PDClient: Create with keyspace API v2
        Dumper->>GCStates: SetGCBarrier(snapshotTS - 1)
        GCStates-->>Dumper: barrier set
        
        Note over Dumper,GCStates: Periodic updates during dump
        Dumper->>GCStates: SetGCBarrier(current protection TS)
        GCStates-->>Dumper: updated
        
        Dumper->>GCStates: DeleteGCBarrier (cleanup on done)
        GCStates-->>Dumper: barrier removed
    else Classical Cluster
        Dumper->>PDClient: Discover endpoints via GetPdAddrs
        Dumper->>PDClient: UpdateServiceGCSafePoint(snapshotTS - 1)
        PDClient-->>Dumper: safe point updated
        
        Note over Dumper,PDClient: Periodic updates during dump
        Dumper->>PDClient: UpdateServiceGCSafePoint(current protection TS)
        PDClient-->>Dumper: updated
        
        Dumper->>PDClient: UpdateServiceGCSafePoint(0, 0) (cleanup)
        PDClient-->>Dumper: cleaned up
    end
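
The keyspace branch of the diagram (set, refresh periodically, delete on completion) can be sketched as one loop. This is an illustration under stated assumptions rather than the PR's updater: `barrierClient`, `keepBarrier`, and `fakeClient` are hypothetical stand-ins for the PD GC states client, the refresh simply re-sends the same timestamp, and a failed refresh is ignored where a real updater would presumably log and retry.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// barrierClient is a hypothetical stand-in for the PD GC states client.
type barrierClient interface {
	SetBarrier(ctx context.Context, ts uint64) error
	DeleteBarrier(ctx context.Context) error
}

// keepBarrier sets a barrier just below snapshotTS, refreshes it on every
// tick, and removes it once ctx is done (the dump finished or was cancelled).
func keepBarrier(ctx context.Context, c barrierClient, snapshotTS uint64, every time.Duration) error {
	protectTS := snapshotTS - 1 // caller must pass snapshotTS > 0
	if err := c.SetBarrier(ctx, protectTS); err != nil {
		return err
	}
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// ctx is already cancelled, so cleanup needs a fresh context.
			cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
			defer cancel()
			return c.DeleteBarrier(cleanupCtx)
		case <-ticker.C:
			// Refresh errors are dropped in this sketch.
			_ = c.SetBarrier(ctx, protectTS)
		}
	}
}

// fakeClient records calls so the control flow can be observed.
type fakeClient struct {
	sets, deletes int
	lastTS        uint64
}

func (f *fakeClient) SetBarrier(_ context.Context, ts uint64) error {
	f.sets++
	f.lastTS = ts
	return nil
}

func (f *fakeClient) DeleteBarrier(context.Context) error {
	f.deletes++
	return nil
}

// runDemo runs the loop for about 50ms and reports what the fake saw.
func runDemo() (sets, deletes int, lastTS uint64) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	fake := &fakeClient{}
	_ = keepBarrier(ctx, fake, 100, 10*time.Millisecond)
	return fake.sets, fake.deletes, fake.lastTS
}

func main() {
	sets, deletes, lastTS := runDemo()
	fmt.Println(sets >= 1, deletes, lastTS)
}
```

The classical branch has the same shape, with the service safe-point update taking the place of the barrier calls.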

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

approved, lgtm

Suggested reviewers

  • joechenrh
  • D3Hunter

Poem

🐰 A keyspace GC tale unfolds
With dual paths, classical and bold
Barriers dance, safe-points align
Premium clusters now can shine
TiDB's dumpling, safe and fine!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding keyspace-level GC support for keyspace clusters in Dumpling. |
| Description check | ✅ Passed | The description follows the template with all required sections: problem statement, changes explanation, test checklist completion, documentation checklist, and release note. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #66882 are met: CLI flags added, keyspace validation implemented, GC barrier support added with proper fallback for classical clusters. |
| Out of Scope Changes check | ✅ Passed | All changes directly support keyspace GC requirements; Bazel dependency additions are necessary for test compilation and the new SQL helper function supports keyspace metadata querying. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.3)

Command failed



ti-chi-bot bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: GMHDBJD

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 24, 2026
ti-chi-bot bot commented Mar 24, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-24 02:27:22.272402711 +0000 UTC m=+235238.308472961: ☑️ agreed by GMHDBJD.

ti-chi-bot bot commented Mar 24, 2026

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-unit-test-next-gen | f1399e5 | link | true | /test pull-unit-test-next-gen |
| pull-build-next-gen | f1399e5 | link | true | /test pull-build-next-gen |

Full PR test history. Your PR dashboard.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
dumpling/export/dump_test.go (2)

402-406: Remove unused code.

Lines 403-404 assign origUpdate but it's immediately discarded with _ = origUpdate. This appears to be leftover from an abandoned approach.

🧹 Remove dead code
-	// Wrap UpdateServiceGCSafePoint to count calls including failures.
-	origUpdate := mockPD.UpdateServiceGCSafePoint
-	_ = origUpdate // ensure the method exists
-	// We can't easily wrap the method, so instead track via the mock's counter
-	// and poll it. The mock already counts calls.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dumpling/export/dump_test.go` around lines 402 - 406, Remove the dead
assignment that captures a reference to mockPD.UpdateServiceGCSafePoint
(origUpdate := mockPD.UpdateServiceGCSafePoint) and the immediate discard (_ =
origUpdate); these lines are leftover from an abandoned wrapping approach and
should be deleted so the test relies solely on the mock's existing call counter
as described in the surrounding comment.

254-256: Consider using require.Eventually instead of time.Sleep for waiting on the first call.

The fixed 200ms sleep may cause flakiness under load. Consider using a polling pattern similar to what's already used later in the test.

♻️ Suggested improvement
-			// Give the background goroutine a moment to make its first call.
-			time.Sleep(200 * time.Millisecond)
+			// Wait for the background goroutine to make its first call.
+			if tc.expectBarrierAPI {
+				require.Eventually(t, func() bool {
+					mockPD.gcStatesClient.mu.Lock()
+					defer mockPD.gcStatesClient.mu.Unlock()
+					return mockPD.gcStatesClient.setCalls > 0
+				}, 5*time.Second, 50*time.Millisecond)
+			} else {
+				require.Eventually(t, func() bool {
+					mockPD.mu.Lock()
+					defer mockPD.mu.Unlock()
+					return mockPD.updateSafePointCalls > 0
+				}, 5*time.Second, 50*time.Millisecond)
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dumpling/export/dump_test.go` around lines 254 - 256, Replace the fixed
time.Sleep(200 * time.Millisecond) with a polling assertion using
require.Eventually so the test waits until the background goroutine makes its
first call without flakiness; specifically, remove the time.Sleep and call
require.Eventually(t, func() bool { /* return true when the mock/spy call count
or condition indicating "first call" is met */ }, time.Second,
10*time.Millisecond) (adjust timeout/interval as appropriate) to poll the
mock/counter used in the test instead of sleeping.
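
To make the polling suggestion concrete without depending on testify, the sketch below implements the same wait-until-true idea using only the standard library. `eventually`, `waitForFirstCall`, and `waitForNothing` are hypothetical helpers; this mirrors the behaviour that `require.Eventually` provides in spirit and is not its implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// eventually polls cond every tick until it returns true or the timeout
// elapses. It reports whether cond became true in time.
func eventually(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// waitForFirstCall simulates a background goroutine that makes its first
// call after a short delay, and waits for it by polling a counter instead
// of sleeping for a fixed duration.
func waitForFirstCall() bool {
	var calls atomic.Int64
	go func() {
		time.Sleep(20 * time.Millisecond)
		calls.Add(1)
	}()
	return eventually(func() bool { return calls.Load() > 0 }, 2*time.Second, 5*time.Millisecond)
}

// waitForNothing shows the timeout path: the condition never holds.
func waitForNothing() bool {
	return eventually(func() bool { return false }, 30*time.Millisecond, 5*time.Millisecond)
}

func main() {
	fmt.Println(waitForFirstCall(), waitForNothing())
}
```

Compared with a fixed sleep, polling returns as soon as the condition holds and tolerates slow CI machines up to the chosen timeout.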

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: dc0987a9-d10b-46b2-8a5b-688e4c741562

📥 Commits

Reviewing files that changed from the base of the PR and between 77d7b1b and f1399e5.

📒 Files selected for processing (6)
  • dumpling/export/BUILD.bazel
  • dumpling/export/config.go
  • dumpling/export/dump.go
  • dumpling/export/dump_test.go
  • dumpling/export/sql.go
  • dumpling/export/util_for_test.go

Labels

  • approved
  • component/dumpling: This is related to Dumpling of TiDB.
  • do-not-merge/cherry-pick-not-approved
  • needs-1-more-lgtm: Indicates a PR needs 1 more LGTM.
  • ok-to-test: Indicates a PR is ready to be tested.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
  • type/cherry-pick-for-release-nextgen-20251011


3 participants