@0xPoe
Member

@0xPoe 0xPoe commented Oct 19, 2025

What problem does this PR solve?

Issue Number: close #61273

Problem Summary:

What changed and how does it work?

Although we shouldn’t encourage users to execute this command multiple times concurrently, we need to ensure that it’s safe to do so.

In this PR, I removed the non-thread-safe initialization and replaced it with session-pool-based initialization for statistics.
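
To illustrate the idea, here is a minimal sketch of why borrowing a session per call is safer than sharing one dedicated session. It is not the code in this PR: `session`, `sessionPool`, `withSession`, and `refreshStats` are hypothetical names, and the real pool in `pkg/session/syssession` does considerably more.

```go
package main

import (
	"fmt"
	"sync"
)

// session is a hypothetical stand-in for an internal SQL session.
// Like most session objects, it is assumed to be unsafe for concurrent use.
type session struct {
	id int
}

// sessionPool hands out sessions so that each caller has exclusive use of one.
type sessionPool struct {
	mu     sync.Mutex
	idle   []*session
	nextID int
}

// withSession borrows a session, runs fn with it, and always returns it to the pool.
func (p *sessionPool) withSession(fn func(s *session) error) error {
	p.mu.Lock()
	var s *session
	if n := len(p.idle); n > 0 {
		s = p.idle[n-1]
		p.idle = p.idle[:n-1]
	} else {
		p.nextID++
		s = &session{id: p.nextID}
	}
	p.mu.Unlock()

	defer func() {
		p.mu.Lock()
		p.idle = append(p.idle, s)
		p.mu.Unlock()
	}()
	return fn(s)
}

// refreshStats simulates several concurrent stats initializations and
// returns how many of them completed.
func refreshStats(p *sessionPool, workers int) int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done int
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = p.withSession(func(s *session) error {
				// Only s is used here; no session is shared between goroutines.
				mu.Lock()
				done++
				mu.Unlock()
				return nil
			})
		}()
	}
	wg.Wait()
	return done
}

func main() {
	p := &sessionPool{}
	fmt.Println(refreshStats(p, 8)) // prints 8
}
```

Because every caller receives its own session for the duration of the callback, no per-handle session field has to be guarded or serialized.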

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. sig/planner SIG: Planner labels Oct 19, 2025
@0xPoe 0xPoe force-pushed the poe-patch-refresh-stats-concurrently branch from 7dff835 to 7aaa6da Compare October 19, 2025 09:26
@codecov

codecov bot commented Oct 19, 2025

Codecov Report

❌ Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.7345%. Comparing base (a84aea0) to head (3d60017).
⚠️ Report is 64 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #64034        +/-   ##
================================================
+ Coverage   72.7425%   74.7345%   +1.9920%     
================================================
  Files          1854       1885        +31     
  Lines        501719     524448     +22729     
================================================
+ Hits         364963     391944     +26981     
+ Misses       114581     108648      -5933     
- Partials      22175      23856      +1681     
| Flag | Coverage Δ |
|---|---|
| integration | 47.9823% <52.5423%> (?) |
| unit | 72.5600% <88.5245%> (+0.2591%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
|---|---|
| dumpling | 52.8700% <ø> (ø) |
| parser | ∅ <ø> (∅) |
| br | 60.5897% <50.0000%> (+14.2044%) ⬆️ |

@0xPoe
Member Author

0xPoe commented Oct 19, 2025

/retest

Copilot AI left a comment

Pull Request Overview

This PR addresses concurrency safety issues in statistics initialization by removing non-thread-safe session management and implementing session pool-based initialization. The changes enable safe concurrent execution of statistics refresh operations.

  • Removed global maxTidRecord state and dedicated initStatsCtx session field
  • Introduced session pool wrapper methods to ensure GC blocking during stats initialization
  • Refactored stats loading methods to accept session context parameters instead of using a shared session

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| pkg/statistics/handle/handle.go | Removed initStatsCtx field from Handle struct |
| pkg/statistics/handle/bootstrap.go | Refactored to pass session context as parameter; removed global maxTidRecord tracking |
| pkg/session/syssession/pool.go | Added WithForceBlockGCSession method with GC blocking logic |
| pkg/domain/domain.go | Updated CreateStatsHandle and LoadAndUpdateStatsLoop signatures to remove initStatsCtx parameter |
| pkg/session/session.go | Removed dedicated initStatsCtx creation in bootstrapSessionImpl |
| pkg/statistics/handle/handletest/initstats/init_stats_test.go | Added maxPhysicalTableID helper to replace GetMaxTidRecordForTest |
| pkg/executor/simple_test.go | Added concurrent REFRESH STATS test |
| pkg/server/stat_test.go | Added ForceBlockGCInTest failpoint enablement |
| pkg/domain/infosync/info.go | Renamed ContainsInternalSessionForTest to ContainsInternalSession |
| Multiple test files | Updated NewHandle/CreateStatsHandle calls to remove initStatsCtx argument |

@0xPoe 0xPoe force-pushed the poe-patch-refresh-stats-concurrently branch from 55f9aa3 to 4f29b18 Compare October 25, 2025 11:15
@ti-chi-bot ti-chi-bot bot added component/statistics size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 25, 2025
@0xPoe 0xPoe force-pushed the poe-patch-refresh-stats-concurrently branch from 2ac55ae to 67c75ea Compare October 25, 2025 11:18
@0xPoe 0xPoe force-pushed the poe-patch-refresh-stats-concurrently branch from 67c75ea to 1b92e21 Compare October 25, 2025 11:18
@0xPoe
Member Author

0xPoe commented Oct 27, 2025

/retest

Member Author

@0xPoe 0xPoe left a comment

🔢 Self-check (PR reviewed by myself and ready for feedback)

  • Code compiles successfully

  • Unit tests added

  • All tests pass

  • Bazel files updated

  • Comments added where necessary

  • PR title and description updated

  • Documentation PR created (or confirmed not needed)

  • PR size is reasonable

/cc @mjonss @elsa0520

@ti-chi-bot ti-chi-bot bot requested review from elsa0520 and mjonss October 27, 2025 11:04
}()

// Make sure the internal session is registered to the session manager to block GC.
const retryInterval = 100 * time.Millisecond
Member Author

/cc @lcwangchao

Not sure if you’re interested in taking a look at this change. If you have any thoughts on it, please feel free to let me know.

I think the fact that StoreInternalSession is not guaranteed to succeed is a bad design. I’ll try to create an issue for it to explore if we can change it in the future.

@ti-chi-bot ti-chi-bot bot requested a review from lcwangchao October 27, 2025 11:08
@0xPoe
Member Author

0xPoe commented Oct 27, 2025

/retest

@mjonss
Contributor

mjonss commented Oct 27, 2025

Is the following test failure issue in mysql-test related to this PR?

[2025/10/27 19:25:48.011 +08:00] [INFO] [conn.go:1200] ["command dispatched failed"] [conn=2098692] [session_alias=] [connInfo="id:2098692, addr:127.0.0.1:43836 status:10, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="ANALYZE TABLE a,d,bb,cc,dd;"] [txn_mode=PESSIMISTIC] [timestamp=461783524398596096] [err="pessimistic lock retry limit reached\ngithub.com/pingcap/tidb/pkg/executor.(*AnalyzeExec).handleResultsErrorWithConcurrency\n\t/home/jenkins/agent/workspace/pingcap/tidb/ghpr_mysql_test/tidb/pkg/executor/analyze.go:485\ngithub.com/pingcap/tidb/pkg/executor.(*AnalyzeExec).handleResultsError\n\t/home/jenkins/agent/workspace/pingcap/tidb/ghpr_mysql_test/tidb/pkg/executor/analyze.go:419\ngithub.com/pingcap/tidb/pkg/executor.(*AnalyzeExec).Next.func2\n\t/home/jenkins/agent/workspace/pingcap/tidb/ghpr_mysql_test/tidb/pkg/executor/analyze.go:133\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:93\nruntime.goexit\n\t/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1700"]

@0xPoe
Member Author

0xPoe commented Oct 27, 2025

/retest

@elsa0520
Contributor

elsa0520 commented Nov 3, 2025

Did we write in the documentation that we do not recommend users use the concurrent refresh stats function?

@0xPoe
Member Author

0xPoe commented Nov 3, 2025

Did we write in the documentation that we do not recommend users use the concurrent refresh stats function?

See: pingcap/docs@1e76212 (#21873)

// In most cases, the session manager is not set, so this step will be skipped.
// It is only enabled explicitly in tests through a failpoint.
forceBlockGCInTest := !intest.InTest
failpoint.Inject("ForceBlockGCInTest", func(val failpoint.Value) {
Contributor

Or this is better

	if intest.InTest {
		var forceBlockGCInTest bool
		failpoint.Inject("ForceBlockGCInTest", func(val failpoint.Value) {
			forceBlockGCInTest = val.(bool)
		})
		if !forceBlockGCInTest {
			break
		}
	}

Contributor

@elsa0520 elsa0520 left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Nov 7, 2025
Contributor

@mjonss mjonss left a comment

LGTM, only some minor questions and curiosity :)

- func (h *Handle) initStatsMeta(ctx context.Context, tableIDs ...int64) (statstypes.StatsCache, int64, error) {
+ func (h *Handle) initStatsMeta(ctx context.Context, sctx sessionctx.Context, tableIDs ...int64) (statstypes.StatsCache, int64, error) {
ctx = kv.WithInternalSourceType(ctx, kv.InternalTxnStats)
sql := genInitStatsMetaSQL(tableIDs...)
Contributor

Not related to PR, but I am curious, what is the max number of tableIDs that can come in this call? Could it be an issue that the SQL string becomes too long?

Member Author

It shouldn't be too many, because normal users would usually never use this command, and BR will only use REFRESH *.*.
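
If statement length ever did become a concern, one possible mitigation would be to split the IDs into bounded batches. The sketch below is only an illustration of that idea: `chunkIDLists` is a hypothetical helper, the table and column names in the example query are placeholders, and none of this reflects how `genInitStatsMetaSQL` is actually implemented.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chunkIDLists splits ids into comma-separated IN-list bodies that each
// contain at most batchSize IDs. A non-positive batchSize is treated as 1.
func chunkIDLists(ids []int64, batchSize int) []string {
	if batchSize <= 0 {
		batchSize = 1
	}
	var lists []string
	for start := 0; start < len(ids); start += batchSize {
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		parts := make([]string, 0, end-start)
		for _, id := range ids[start:end] {
			parts = append(parts, strconv.FormatInt(id, 10))
		}
		lists = append(lists, strings.Join(parts, ","))
	}
	return lists
}

func main() {
	for _, list := range chunkIDLists([]int64{101, 102, 103, 104, 105}, 2) {
		// example_stats and table_id are placeholder names for illustration.
		fmt.Printf("SELECT * FROM example_stats WHERE table_id IN (%s)\n", list)
	}
}
```

Each batch then yields a statement of bounded size, at the cost of more round trips.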

func (h *Handle) initStatsLiteWithSession(ctx context.Context, sctx sessionctx.Context, tableIDs ...int64) (err error) {
defer func() {
- _, err1 := util.Exec(h.initStatsCtx, "commit")
+ _, err1 := util.Exec(sctx, "commit")
Contributor

Not directly related to this PR, but should it not do "rollback" in case of error?
And if only doing SELECTs, why not default to "rollback"? Or even use auto_commit mode, unless we really need snapshot consistency of the statistics, which I assume could be the case?

Member Author

I think we always do it here: https://github.com/0xPoe/tidb/blob/poe-patch-refresh-stats-concurrently/pkg/session/syssession/pool.go#L220. But you made a good point; we should consider rolling back here.

Member Author

I’ll create an issue for it. I’d like to separate it from this PR.

@ti-chi-bot

ti-chi-bot bot commented Nov 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elsa0520, mjonss
Once this PR has been reviewed and has the lgtm label, please assign 3pointer, d3hunter for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 7, 2025
@ti-chi-bot

ti-chi-bot bot commented Nov 7, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-11-07 10:39:34.254505542 +0000 UTC m=+439423.697535411: ☑️ agreed by elsa0520.
  • 2025-11-07 12:40:37.496922478 +0000 UTC m=+446686.939952358: ☑️ agreed by mjonss.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 7, 2025
@tiprow

tiprow bot commented Nov 7, 2025

@0xPoe: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| fast_test_tiprow | 3d60017 | link | true | /test fast_test_tiprow |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@0xPoe
Member Author

0xPoe commented Nov 7, 2025

/assign gmhdbjd

@0xPoe
Member Author

0xPoe commented Nov 7, 2025

/assign Leavrth

Labels

component/statistics lgtm release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tracking issue: Refresh Stats Command

5 participants