@yibin87 yibin87 commented Jan 7, 2026

TopSQL v2 Source and Delta Lake Sinks Implementation

Overview

This PR implements a complete TopSQL data collection and storage solution, including:

  1. topsql_v2 source - Next-generation TopSQL data source that collects TopSQL data from TiDB and TiKV clusters
  2. topsql_data_deltalake sink - Writes TopSQL data (tidb_topsql, tikv_topsql, tikv_topregion) to Delta Lake
  3. topsql_meta_deltalake sink - Writes TopSQL metadata (SQL meta and Plan meta) to Delta Lake with deduplication

Key Features

1. TopSQL v2 Source (topsql_v2)

Data Output

  • TiDB TopSQL Data: tidb_topsql table containing SQL execution statistics
  • TiKV TopSQL Data: tikv_topsql table containing TiKV-level SQL statistics
  • TiKV TopRegion Data: tikv_topregion table containing Region-level statistics
  • SQL Meta: topsql_sql_meta table containing SQL text and metadata
  • Plan Meta: topsql_plan_meta table containing execution plan information

Configuration Options

sources:
  topsql:
    type: "topsql_v2"
    pd_address: "127.0.0.1:2379"
    tidb_group: "tidb-cluster"           # Optional: Filter TiDB instance group
    label_k8s_instance: "tidb"            # Optional: Kubernetes label filter
    downsampling_interval: 60             # Downsampling interval (seconds)
    top_n: 100                            # Keep Top N SQL statements
    topology_fetch_interval_seconds: 30   # Topology fetch interval
    init_retry_delay_seconds: 1.0         # Initial retry delay
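
The `top_n` option keeps only the heaviest statements per downsampling window. A minimal sketch of that selection, assuming ranking is by accumulated CPU time (the ranking metric and the `(digest, cpu_time_ms)` pair shape are illustrative assumptions, not confirmed by this PR):

```rust
/// Keep the top `n` statements by accumulated CPU time.
/// Ranking metric and tuple shape are assumptions for illustration.
fn top_n(mut stats: Vec<(String, u64)>, n: usize) -> Vec<(String, u64)> {
    // Sort descending by CPU time, then keep the first n entries.
    stats.sort_by(|a, b| b.1.cmp(&a.1));
    stats.truncate(n);
    stats
}

fn main() {
    let stats = vec![
        ("sql_a".to_string(), 120),
        ("sql_b".to_string(), 900),
        ("sql_c".to_string(), 40),
    ];
    let kept = top_n(stats, 2);
    assert_eq!(kept[0].0, "sql_b"); // heaviest statement survives
    assert_eq!(kept.len(), 2);      // everything past top_n is dropped
    println!("top_n ok");
}
```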

2. TopSQL Data Delta Lake Sink (topsql_data_deltalake)

Core Features

  • Multi-Table Support: Inspects each event's source_table field and routes it to the corresponding Delta Lake table
  • Date Partitioning: Automatically partitions data by date for easier data management and query optimization
  • Batch Processing: Batches writes to improve write throughput
  • Time-Based Flush: Automatically flushes based on a configurable time interval

Supported Data Types

  • tidb_topsql: TiDB TopSQL data
  • tikv_topsql: TiKV TopSQL data
  • tikv_topregion: TiKV TopRegion data
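
As a sketch of how the source_table value could map to a date-partitioned path under base_path (the Hive-style `date=` directory layout below is an assumption, not the sink's confirmed on-disk layout):

```rust
/// Route an event to its table directory under `base`, partitioned by date.
/// The `date=YYYY-MM-DD` layout is an assumed illustration.
fn partition_path(base: &str, source_table: &str, date: &str) -> Option<String> {
    match source_table {
        // Only the three data tables are accepted by this sink.
        "tidb_topsql" | "tikv_topsql" | "tikv_topregion" => {
            Some(format!("{base}/{source_table}/date={date}"))
        }
        // Meta tables and unknown values are not handled here.
        _ => None,
    }
}

fn main() {
    assert_eq!(
        partition_path("s3://bucket-name/topsql-data", "tidb_topsql", "2024-01-01"),
        Some("s3://bucket-name/topsql-data/tidb_topsql/date=2024-01-01".to_string())
    );
    assert_eq!(partition_path("s3://b", "topsql_sql_meta", "2024-01-01"), None);
    println!("partition_path ok");
}
```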

Schema Definition

The schema contains the full set of TopSQL metric fields:

  • Timestamp and date fields
  • SQL/Plan digest
  • Database and table information
  • Performance metrics: CPU time, execution count, network traffic, logical reads/writes, etc.
  • TiKV-specific fields: Region ID, read/write keys, logical read/write bytes, etc.

3. TopSQL Meta Delta Lake Sink (topsql_meta_deltalake)

Core Features

  • Deduplication Mechanism: LRU Cache-based deduplication to avoid writing duplicate metadata
    • Maintains separate caches for SQL Meta and Plan Meta
    • Deduplication key format: {digest}_{date} (e.g., sql_digest_2024-01-01)
    • Supports configurable cache capacity (default 10000)
  • Event Buffering: Implements event buffering mechanism for batch writes
  • Auto Flush:
    • Based on time interval (max_delay_secs, default 180 seconds)
    • Based on buffer size (EVENT_BUFFER_MAX_SIZE, default 1000)
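
The deduplication described above can be sketched with a small LRU cache keyed by `{digest}_{date}` (a sketch only; the sink's actual cache type and method names may differ):

```rust
use std::collections::{HashSet, VecDeque};

/// Minimal LRU-style dedup cache; `DedupCache` is a hypothetical name.
struct DedupCache {
    capacity: usize,
    seen: HashSet<String>,
    order: VecDeque<String>, // front = least recently used
}

impl DedupCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, seen: HashSet::new(), order: VecDeque::new() }
    }

    /// Returns true if this `{digest}_{date}` key is new and should be written.
    fn check_and_insert(&mut self, digest: &str, date: &str) -> bool {
        let key = format!("{digest}_{date}");
        if self.seen.contains(&key) {
            // Duplicate: refresh recency so hot keys are evicted last.
            if let Some(pos) = self.order.iter().position(|k| k == &key) {
                let k = self.order.remove(pos).unwrap();
                self.order.push_back(k);
            }
            return false;
        }
        if self.seen.len() >= self.capacity {
            // Evict the least recently used key.
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(key.clone());
        self.seen.insert(key);
        true
    }
}

fn main() {
    let mut cache = DedupCache::new(10_000); // default meta_cache_capacity
    assert!(cache.check_and_insert("sql_digest", "2024-01-01"));  // first sight: write
    assert!(!cache.check_and_insert("sql_digest", "2024-01-01")); // duplicate: skip
    assert!(cache.check_and_insert("sql_digest", "2024-01-02"));  // new date: new key
    println!("dedup ok");
}
```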

Supported Data Types

  • topsql_sql_meta: SQL metadata (normalized_sql, sql_digest, etc.)
  • topsql_plan_meta: Plan metadata (normalized_plan, encoded_normalized_plan, etc.)
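
The two auto-flush triggers (time interval and buffer size) reduce to a single predicate, sketched below with hypothetical names mirroring max_delay_secs and EVENT_BUFFER_MAX_SIZE:

```rust
use std::time::{Duration, Instant};

/// Flush decision for the event buffer; field names are assumptions.
struct FlushPolicy {
    max_delay: Duration, // mirrors max_delay_secs (default 180 s)
    max_events: usize,   // mirrors EVENT_BUFFER_MAX_SIZE (default 1000)
}

impl FlushPolicy {
    /// Flush when either the buffer is full or the delay has elapsed.
    fn should_flush(&self, buffered: usize, last_flush: Instant) -> bool {
        buffered >= self.max_events || last_flush.elapsed() >= self.max_delay
    }
}

fn main() {
    let policy = FlushPolicy {
        max_delay: Duration::from_secs(180),
        max_events: 1000,
    };
    assert!(policy.should_flush(1000, Instant::now())); // size trigger fires
    assert!(!policy.should_flush(1, Instant::now()));   // neither trigger fires
    println!("flush policy ok");
}
```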

Configuration Examples

Complete Configuration Example

sources:
  topsql:
    type: "topsql_v2"
    pd_address: "127.0.0.1:2379"
    downsampling_interval: 30
    top_n: 80

transforms:
  route_by_type:
    type: "route"
    inputs:
      - "topsql"
    route:
      non_meta_logs: '.source_table == "tidb_topsql" || .source_table == "tikv_topsql" || .source_table == "tikv_topregion"'
      meta_logs: '.source_table == "topsql_sql_meta" || .source_table == "topsql_plan_meta"'

sinks:
  # TopSQL data write
  topsql_data:
    type: "topsql_data_deltalake"
    inputs:
      - "route_by_type.non_meta_logs"
    base_path: "s3://bucket-name/topsql-data"
    batch_size: 10000
    max_delay_secs: 60
    timeout_secs: 30
    bucket: "bucket-name"
    region: "us-east-1"
    auth:
      assume_role: "arn:aws:iam::123456789012:role/vector-role"
    acknowledgements:
      enabled: true

  # TopSQL metadata write (with deduplication)
  topsql_meta:
    type: "topsql_meta_deltalake"
    inputs:
      - "route_by_type.meta_logs"
    base_path: "s3://bucket-name/topsql-meta"
    batch_size: 1000
    max_delay_secs: 180
    meta_cache_capacity: 10000
    bucket: "bucket-name"
    region: "us-east-1"
    endpoint: "https://oss-cn-hangzhou.aliyuncs.com"  # Alibaba Cloud OSS
    force_path_style: false
    auth:
      # Use environment variables to configure Alibaba Cloud RRSA
      # ALIBABA_CLOUD_OIDC_TOKEN_FILE=/var/run/secrets/tokens/oidc-token
      # ALIBABA_CLOUD_ROLE_ARN=acs:ram::123456789012:role/vector-role
    acknowledgements:
      enabled: true

@ti-chi-bot ti-chi-bot bot added the size/XXL label Jan 7, 2026
@yibin87 yibin87 force-pushed the add_topsql_deltalake branch from 6532cd4 to 9dc07fd on January 7, 2026 00:25
Signed-off-by: yibin87 <[email protected]>
use std::path::PathBuf;

use vector::{
    aws::{AwsAuthentication, RegionOrEndpoint},
Collaborator

Do we only support AWS and Alibaba Cloud right now?

Collaborator Author

Yes, that's right.

Collaborator

The meta and data information only differ in storage format, but most of mod.rs is duplicated between them. Is it necessary to keep them separate?

Collaborator Author
@yibin87 yibin87 Jan 7, 2026

Good point. The benefit of keeping them separate for now is that the two kinds of data are fully isolated, so their batches don't interfere with each other, and their actual processing logic still differs. Later we could consider extracting the per-cloud adaptation logic into its own module, like deltalake_writer; at that point the two mods would basically just be thin shells.

Collaborator

The current version's format shouldn't need to change, right? Why add new fields?

Collaborator Author
@yibin87 yibin87 Jan 7, 2026

Because the protocol changed, some code paths needed to be filled in. Also, adding fields doesn't actually affect the data ultimately stored in VM, because metrics that are 0 get filtered out. I'll remove the code-update logic in the parts that don't affect compilation.

Collaborator Author

Updated: only the changes needed to pass compilation are kept now, plus some changes in tests. The original topsql-to-VM data will not change at all.

Signed-off-by: yibin87 <[email protected]>