planner: add slow query QPS metric and show it in TiDB QPS panel #67249

Open
qw4990 wants to merge 3 commits into pingcap:master from qw4990:slow-query-qps

Conversation

@qw4990
Contributor

@qw4990 qw4990 commented Mar 24, 2026

What problem does this PR solve?

Issue Number: close #67247

Problem Summary: planner: add slow query QPS metric and show it in TiDB QPS panel

What changed and how does it work?

  • Add a new TiDB server counter metric: tidb_server_slow_query_total (label: sql_type).
  • Increment this counter when a statement is logged as a slow query (both general and internal SQL paths).
  • Expose slow query QPS in METRICS_SCHEMA via tidb_slow_query_qps.
  • Update Grafana pkg/metrics/grafana/tidb.json QPS panel to include a SlowQuery series:
    • sum(rate(tidb_server_slow_query_total{...}[1m]))
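
To make the counter behaviour in the list above concrete, here is a minimal sketch that uses only the Go standard library. It is a stand-in, not the code in this PR: the real metric is a Prometheus counter, and `labeledCounter`, `recordIfSlow`, the `"internal"` label value, and the 300 ms threshold are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// labeledCounter is a hypothetical stand-in for a counter vector:
// one monotonically increasing count per label value (here, sql_type).
type labeledCounter struct {
	mu     sync.Mutex
	counts map[string]float64
}

func newLabeledCounter() *labeledCounter {
	return &labeledCounter{counts: make(map[string]float64)}
}

// Inc adds one to the series for the given label value.
func (c *labeledCounter) Inc(label string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[label]++
}

// Value returns the current count for the label (0 if never incremented).
func (c *labeledCounter) Value(label string) float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[label]
}

// recordIfSlow increments the counter only when the statement took at least
// threshold, choosing the label from whether the statement is internal.
// It reports whether the statement was counted.
func recordIfSlow(c *labeledCounter, cost, threshold time.Duration, internal bool) bool {
	if cost < threshold {
		return false
	}
	label := "general"
	if internal {
		label = "internal" // assumed label value for internal SQL
	}
	c.Inc(label)
	return true
}

func main() {
	slow := newLabeledCounter()
	threshold := 300 * time.Millisecond // assumed threshold for this example

	recordIfSlow(slow, 500*time.Millisecond, threshold, false) // slow, general
	recordIfSlow(slow, 10*time.Millisecond, threshold, false)  // fast, ignored
	recordIfSlow(slow, 2*time.Second, threshold, true)         // slow, internal

	fmt.Println("general:", slow.Value("general"), "internal:", slow.Value("internal"))
}
```

Keeping one series per sql_type lets a dashboard either sum the series into a single SlowQuery line or split them by label later without changing the instrumentation.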

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • Added slow-query metrics with separate counters for general and internal slow queries.
    • Added a slow query QPS metric and updated Grafana dashboards to show SlowQuery series.
  • Tests

    • Added test coverage asserting slow-query counter increments alongside existing slow-log assertions.

ti-chi-bot bot added the release-note-none label (Denotes a PR that doesn't merit a release note.) on Mar 24, 2026
@pantheon-ai

pantheon-ai bot commented Mar 24, 2026

Review Complete

Findings: 1 issue
Posted: 1
Duplicates/Skipped: 1

ℹ️ Learn more details on Pantheon AI.

ti-chi-bot bot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Mar 24, 2026
@tiprow

tiprow bot commented Mar 24, 2026

Hi @qw4990. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai bot commented Mar 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 245d8f1f-c0f7-4e3b-9c56-003c0928e6cd

📥 Commits

Reviewing files that changed from the base of the PR and between 64235bb and 84bd6a6.

📒 Files selected for processing (2)
  • pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json
  • pkg/metrics/nextgengrafana/tidb_worker.json

📝 Walkthrough

Adds a slow-query counter metric and wires it through server metrics, executor increments (internal vs. general), metric table definition, Grafana panels, and tests to expose slow-query QPS.

Changes

Cohort / File(s) and summary:

  • Server metric declaration & registration (pkg/metrics/server.go, pkg/metrics/metrics.go): Introduce SlowQueryCounter (*prometheus.CounterVec) and register it in global metrics registration.
  • Executor metric binding & increments (pkg/executor/adapter.go, pkg/executor/metrics/metrics.go): Add executor-level counters SlowQueryCounterGeneral and SlowQueryCounterInternal; increment the appropriate counter in ExecStmt.LogSlowQuery.
  • Executor tests (pkg/executor/adapter_test.go): Extend TestWriteSlowLog to assert slow-query counter values before/after execution (expect 0.0 or 1.0).
  • Metric table & Grafana panels (pkg/infoschema/metric_table_def.go, pkg/metrics/grafana/tidb.json, pkg/metrics/nextgengrafana/tidb_with_keyspace_name.json, pkg/metrics/nextgengrafana/tidb_worker.json): Add tidb_slow_query_qps metric definition and add Grafana query targets that surface tidb_server_slow_query_total rate (legend "SlowQuery").
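
For readers unfamiliar with how a `_total` counter becomes a QPS line, the sketch below approximates the idea behind `sum(rate(...[1m]))` in plain Go. It is a simplification under stated assumptions: the `sample` type, `perSecondRate`, and the sample values are hypothetical, and Prometheus's own `rate()` additionally extrapolates to the window boundaries, which is omitted here.

```go
package main

import "fmt"

// sample is one scrape of a monotonically increasing counter.
type sample struct {
	unixSeconds float64
	value       float64
}

// perSecondRate approximates a per-second rate over a window of samples
// ordered oldest first. It sums the increases between consecutive samples,
// treats a decrease as a counter reset, and divides by the elapsed time.
func perSecondRate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	elapsed := window[len(window)-1].unixSeconds - window[0].unixSeconds
	if elapsed <= 0 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(window); i++ {
		delta := window[i].value - window[i-1].value
		if delta < 0 {
			// Reset: the counter restarted from zero, so everything
			// counted since the restart is the increase.
			delta = window[i].value
		}
		increase += delta
	}
	return increase / elapsed
}

func main() {
	// Hypothetical one-minute windows from two TiDB instances.
	instanceA := []sample{{0, 100}, {15, 103}, {30, 109}, {45, 112}, {60, 118}}
	instanceB := []sample{{0, 40}, {30, 46}, {60, 52}}

	// Summing the per-instance rates mirrors the sum(...) in the panel query.
	total := perSecondRate(instanceA) + perSecondRate(instanceB)
	fmt.Printf("slow query QPS: %.2f\n", total)
}
```

Here instance A grows by 18 and instance B by 12 over 60 seconds, so the summed rate is 0.50 slow queries per second.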

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

size/L

Suggested reviewers

  • yibin87
  • XuHuaiyu
  • nolouch

Poem

🐰
I hopped through code with counters bright,
Counting slow queries in day and night,
Internal, general — each gets a cheer,
Grafana paints what I hold dear,
Tiny paws, big metrics delight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding a slow query QPS metric and integrating it into the TiDB QPS Grafana panel.
Description check ✅ Passed The PR description includes the required Issue Number (close #67247), problem summary, detailed explanation of what changed, and a completed checklist with unit test marked as done.
Linked Issues check ✅ Passed The code changes fully implement the requirements from #67247: adding a Prometheus counter metric for slow queries, incrementing it on slow-query paths, exposing it in METRICS_SCHEMA, and integrating it into the Grafana dashboard.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing the slow-query QPS metric feature; no out-of-scope modifications were introduced.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

@hawkingrei
Member

/ok-to-test

ti-chi-bot bot added the ok-to-test label (Indicates a PR is ready to be tested.) on Mar 24, 2026
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
pkg/executor/adapter_test.go (1)

417-433: Extend this test to cover the internal SQL counter path too.

The new assertions validate only sql_type=general. Since production code now increments both general and internal counters, please add an internal-SQL assertion in this test (or a companion test) to cover the restricted branch as well.

✅ Minimal extension sketch
- readSlowQueryCounter := func() float64 {
-   counter := metrics.SlowQueryCounter.WithLabelValues(metrics.LblGeneral)
+ readSlowQueryCounter := func(sqlType string) float64 {
+   counter := metrics.SlowQueryCounter.WithLabelValues(sqlType)
    pb := &dto.Metric{}
    require.NoError(t, counter.Write(pb))
    return pb.GetCounter().GetValue()
  }

+ // existing general-path checks
- before := readSlowQueryCounter()
+ before := readSlowQueryCounter(metrics.LblGeneral)
  tk.MustExec(sql)
- after := readSlowQueryCounter()
+ after := readSlowQueryCounter(metrics.LblGeneral)

+ // add internal-path check (example)
+ beforeInternal := readSlowQueryCounter(metrics.LblInternal)
+ rs, err := tk.Session().ExecuteInternal(context.Background(), sql)
+ require.NoError(t, err)
+ _, err = session.ResultSetToStringSlice(context.Background(), tk.Session(), rs)
+ require.NoError(t, err)
+ afterInternal := readSlowQueryCounter(metrics.LblInternal)
+ require.Equal(t, 1.0, afterInternal-beforeInternal)

Based on learnings: "For SQL behavior changes in executor, perform targeted unit test plus relevant integration test."
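
The before/after pattern in the sketch above matters because metric counters are process-wide and keep whatever earlier tests left in them, so asserting on a delta is safer than asserting on an absolute value. A minimal stdlib-only illustration follows; `slowQueries`, `runStatement`, and `deltaAround` are hypothetical helpers and not part of TiDB or its test suite.

```go
package main

import "fmt"

// slowQueries is a hypothetical process-wide counter store keyed by sql_type.
// Like a global metric registry, it keeps its values across test cases.
var slowQueries = map[string]float64{}

// runStatement stands in for executing one slow statement on either the
// general or the internal path; it bumps only the matching series.
func runStatement(internal bool) {
	if internal {
		slowQueries["internal"]++
		return
	}
	slowQueries["general"]++
}

// deltaAround reports how much the series for label grew while run executed.
func deltaAround(label string, run func()) float64 {
	before := slowQueries[label]
	run()
	return slowQueries[label] - before
}

func main() {
	// Simulate a value left behind by earlier tests in the same process.
	slowQueries["general"] = 7

	general := deltaAround("general", func() { runStatement(false) })
	internal := deltaAround("internal", func() { runStatement(true) })
	untouched := deltaAround("internal", func() { runStatement(false) })

	fmt.Println(general, internal, untouched)
}
```

The three deltas are 1, 1, and 0: each path moves only its own series, and the leftover value of 7 does not affect the assertions.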

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/adapter_test.go` around lines 417 - 433, The test currently only
reads the general slow-query metric; extend it to also assert the internal-SQL
counter path by adding a reader for the internal label (e.g., add a function
similar to readSlowQueryCounter that accepts or uses metrics.LblInternal and
calls metrics.SlowQueryCounter.WithLabelValues(metrics.LblInternal)), then
update or add assertions in checkWriteSlowLog (or create
checkWriteSlowLogInternal) to capture the before/after delta for the internal
counter and validate expected increments when expectWrite is true and no change
when false; reference readSlowQueryCounter, checkWriteSlowLog,
metrics.SlowQueryCounter, metrics.LblInternal to find where to add the new
reader and assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cc8ae555-a8f3-4b25-bf5d-7da4b44bfffb

📥 Commits

Reviewing files that changed from the base of the PR and between dffbd3e and 64235bb.

📒 Files selected for processing (7)
  • pkg/executor/adapter.go
  • pkg/executor/adapter_test.go
  • pkg/executor/metrics/metrics.go
  • pkg/infoschema/metric_table_def.go
  • pkg/metrics/grafana/tidb.json
  • pkg/metrics/metrics.go
  • pkg/metrics/server.go

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.2846%. Comparing base (8c83d3f) to head (84bd6a6).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67249        +/-   ##
================================================
+ Coverage   77.7724%   79.2846%   +1.5122%     
================================================
  Files          2022       1968        -54     
  Lines        554420     545905      -8515     
================================================
+ Hits         431186     432819      +1633     
+ Misses       121492     111753      -9739     
+ Partials       1742       1333       -409     
Flag Coverage Δ
integration 46.2751% <100.0000%> (-1.8513%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <ø> (ø)
parser ∅ <ø> (∅)
br 62.2266% <ø> (+1.3639%) ⬆️

ti-chi-bot bot added the needs-1-more-lgtm label (Indicates a PR needs 1 more LGTM.) on Mar 24, 2026
ti-chi-bot bot added the lgtm label and removed the needs-1-more-lgtm label on Mar 24, 2026
@ti-chi-bot

ti-chi-bot bot commented Mar 24, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-24 03:10:53 UTC: ☑️ agreed by AilinKid.
  • 2026-03-24 03:14:16 UTC: ☑️ agreed by guo-shaoge.

@ti-chi-bot

ti-chi-bot bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AilinKid, GMHDBJD, guo-shaoge
Once this PR has been reviewed and has the lgtm label, please assign zimulala for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qw4990
Contributor Author

qw4990 commented Mar 24, 2026

/retest

"refId": "B"
},
{
"expr": "sum(rate(tidb_server_slow_query_total{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", instance=~\"$instance\"}[1m]))",
Contributor

Do we need to add it to tidb_with_keyspace_name.json?

@tiprow

tiprow bot commented Mar 24, 2026

@qw4990: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • Test name: tidb_parser_test
  • Commit: 84bd6a6
  • Required: true
  • Rerun command: /test tidb_parser_test

Full PR test history. Your PR dashboard.

Labels

  • lgtm
  • ok-to-test: Indicates a PR is ready to be tested.
  • release-note-none: Denotes a PR that doesn't merit a release note.
  • size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

planner: a new Slow Query QPS metric

6 participants