@yibin87 yibin87 commented Jan 7, 2026

TopSQL v2 Source and Delta Lake Sinks Implementation

Overview

This PR implements a complete TopSQL data collection and storage solution, including:

  1. topsql_v2 source - Next-generation TopSQL data source that collects TopSQL data from TiDB and TiKV clusters
  2. topsql_data_deltalake sink - Writes TopSQL data (tidb_topsql, tikv_topsql, tikv_topregion) to Delta Lake
  3. topsql_meta_deltalake sink - Writes TopSQL metadata (SQL meta and Plan meta) to Delta Lake with deduplication

Key Features

1. TopSQL v2 Source (topsql_v2)

Data Output

  • TiDB TopSQL Data: tidb_topsql table containing SQL execution statistics
  • TiKV TopSQL Data: tikv_topsql table containing TiKV-level SQL statistics
  • TiKV TopRegion Data: tikv_topregion table containing Region-level statistics
  • SQL Meta: topsql_sql_meta table containing SQL text and metadata
  • Plan Meta: topsql_plan_meta table containing execution plan information

Configuration Options

sources:
  topsql:
    type: "topsql_v2"
    pd_address: "127.0.0.1:2379"
    tidb_group: "tidb-cluster"           # Optional: Filter TiDB instance group
    label_k8s_instance: "tidb"            # Optional: Kubernetes label filter
    downsampling_interval: 60             # Downsampling interval (seconds)
    top_n: 100                            # Keep Top N SQL statements
    topology_fetch_interval_seconds: 30   # Topology fetch interval
    init_retry_delay_seconds: 1.0         # Initial retry delay
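
The `top_n` option keeps only the heaviest statements per downsampling window. A minimal sketch of that selection, assuming ranking is by accumulated CPU time (the ranking metric and the `(digest, cpu_time_ms)` pair shape are illustrative assumptions, not confirmed by this PR):

```rust
/// Keep the top `n` statements by accumulated CPU time.
/// Ranking metric and tuple shape are assumptions for illustration.
fn top_n(mut stats: Vec<(String, u64)>, n: usize) -> Vec<(String, u64)> {
    // Sort descending by CPU time, then keep the first n entries.
    stats.sort_by(|a, b| b.1.cmp(&a.1));
    stats.truncate(n);
    stats
}

fn main() {
    let stats = vec![
        ("sql_a".to_string(), 120),
        ("sql_b".to_string(), 900),
        ("sql_c".to_string(), 40),
    ];
    let kept = top_n(stats, 2);
    assert_eq!(kept[0].0, "sql_b"); // heaviest statement survives
    assert_eq!(kept.len(), 2);      // everything past top_n is dropped
    println!("top_n ok");
}
```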

2. TopSQL Data Delta Lake Sink (topsql_data_deltalake)

Core Features

  • Multi-Table Support: Inspects each event's source_table field and routes it to the corresponding Delta Lake table
  • Date Partitioning: Automatically partitions data by date for easier data management and query optimization
  • Batch Processing: Batches writes to improve write throughput
  • Time-Based Flush: Automatically flushes based on a configurable time interval

Supported Data Types

  • tidb_topsql: TiDB TopSQL data
  • tikv_topsql: TiKV TopSQL data
  • tikv_topregion: TiKV TopRegion data
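
As a sketch of how the source_table value could map to a date-partitioned path under base_path (the Hive-style `date=` directory layout below is an assumption, not the sink's confirmed on-disk layout):

```rust
/// Route an event to its table directory under `base`, partitioned by date.
/// The `date=YYYY-MM-DD` layout is an assumed illustration.
fn partition_path(base: &str, source_table: &str, date: &str) -> Option<String> {
    match source_table {
        // Only the three data tables are accepted by this sink.
        "tidb_topsql" | "tikv_topsql" | "tikv_topregion" => {
            Some(format!("{base}/{source_table}/date={date}"))
        }
        // Meta tables and unknown values are not handled here.
        _ => None,
    }
}

fn main() {
    assert_eq!(
        partition_path("s3://bucket-name/topsql-data", "tidb_topsql", "2024-01-01"),
        Some("s3://bucket-name/topsql-data/tidb_topsql/date=2024-01-01".to_string())
    );
    assert_eq!(partition_path("s3://b", "topsql_sql_meta", "2024-01-01"), None);
    println!("partition_path ok");
}
```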

Schema Definition

The schema contains the full set of TopSQL metric fields:

  • Timestamp and date fields
  • SQL/Plan digest
  • Database and table information
  • Performance metrics: CPU time, execution count, network traffic, logical reads/writes, etc.
  • TiKV-specific fields: Region ID, read/write keys, logical read/write bytes, etc.

3. TopSQL Meta Delta Lake Sink (topsql_meta_deltalake)

Core Features

  • Deduplication Mechanism: LRU Cache-based deduplication to avoid writing duplicate metadata
    • Maintains separate caches for SQL Meta and Plan Meta
    • Deduplication key format: {digest}_{date} (e.g., sql_digest_2024-01-01)
    • Supports configurable cache capacity (default 10000)
  • Event Buffering: Implements event buffering mechanism for batch writes
  • Auto Flush:
    • Based on time interval (max_delay_secs, default 180 seconds)
    • Based on buffer size (EVENT_BUFFER_MAX_SIZE, default 1000)
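
The deduplication described above can be sketched with a small LRU cache keyed by `{digest}_{date}` (a sketch only; the sink's actual cache type and method names may differ):

```rust
use std::collections::{HashSet, VecDeque};

/// Minimal LRU-style dedup cache; `DedupCache` is a hypothetical name.
struct DedupCache {
    capacity: usize,
    seen: HashSet<String>,
    order: VecDeque<String>, // front = least recently used
}

impl DedupCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, seen: HashSet::new(), order: VecDeque::new() }
    }

    /// Returns true if this `{digest}_{date}` key is new and should be written.
    fn check_and_insert(&mut self, digest: &str, date: &str) -> bool {
        let key = format!("{digest}_{date}");
        if self.seen.contains(&key) {
            // Duplicate: refresh recency so hot keys are evicted last.
            if let Some(pos) = self.order.iter().position(|k| k == &key) {
                let k = self.order.remove(pos).unwrap();
                self.order.push_back(k);
            }
            return false;
        }
        if self.seen.len() >= self.capacity {
            // Evict the least recently used key.
            if let Some(oldest) = self.order.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.order.push_back(key.clone());
        self.seen.insert(key);
        true
    }
}

fn main() {
    let mut cache = DedupCache::new(10_000); // default meta_cache_capacity
    assert!(cache.check_and_insert("sql_digest", "2024-01-01"));  // first sight: write
    assert!(!cache.check_and_insert("sql_digest", "2024-01-01")); // duplicate: skip
    assert!(cache.check_and_insert("sql_digest", "2024-01-02"));  // new date: new key
    println!("dedup ok");
}
```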

Supported Data Types

  • topsql_sql_meta: SQL metadata (normalized_sql, sql_digest, etc.)
  • topsql_plan_meta: Plan metadata (normalized_plan, encoded_normalized_plan, etc.)
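
The two auto-flush triggers (time interval and buffer size) reduce to a single predicate, sketched below with hypothetical names mirroring max_delay_secs and EVENT_BUFFER_MAX_SIZE:

```rust
use std::time::{Duration, Instant};

/// Flush decision for the event buffer; field names are assumptions.
struct FlushPolicy {
    max_delay: Duration, // mirrors max_delay_secs (default 180 s)
    max_events: usize,   // mirrors EVENT_BUFFER_MAX_SIZE (default 1000)
}

impl FlushPolicy {
    /// Flush when either the buffer is full or the delay has elapsed.
    fn should_flush(&self, buffered: usize, last_flush: Instant) -> bool {
        buffered >= self.max_events || last_flush.elapsed() >= self.max_delay
    }
}

fn main() {
    let policy = FlushPolicy {
        max_delay: Duration::from_secs(180),
        max_events: 1000,
    };
    assert!(policy.should_flush(1000, Instant::now())); // size trigger fires
    assert!(!policy.should_flush(1, Instant::now()));   // neither trigger fires
    println!("flush policy ok");
}
```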

Configuration Examples

Complete Configuration Example

sources:
  topsql:
    type: "topsql_v2"
    pd_address: "127.0.0.1:2379"
    downsampling_interval: 30
    top_n: 80

transforms:
  route_by_type:
    type: "route"
    inputs:
      - "topsql"
    route:
      non_meta_logs: '.source_table == "tidb_topsql" || .source_table == "tikv_topsql" || .source_table == "tikv_topregion"'
      meta_logs: '.source_table == "topsql_sql_meta" || .source_table == "topsql_plan_meta"'

sinks:
  # TopSQL data write
  topsql_data:
    type: "topsql_data_deltalake"
    inputs:
      - "route_by_type.non_meta_logs"
    base_path: "s3://bucket-name/topsql-data"
    batch_size: 10000
    max_delay_secs: 60
    timeout_secs: 30
    bucket: "bucket-name"
    region: "us-east-1"
    auth:
      assume_role: "arn:aws:iam::123456789012:role/vector-role"
    acknowledgements:
      enabled: true

  # TopSQL metadata write (with deduplication)
  topsql_meta:
    type: "topsql_meta_deltalake"
    inputs:
      - "route_by_type.meta_logs"
    base_path: "s3://bucket-name/topsql-meta"
    batch_size: 1000
    max_delay_secs: 180
    meta_cache_capacity: 10000
    bucket: "bucket-name"
    region: "us-east-1"
    endpoint: "https://oss-cn-hangzhou.aliyuncs.com"  # Alibaba Cloud OSS
    force_path_style: false
    auth:
      # Use environment variables to configure Alibaba Cloud RRSA
      # ALIBABA_CLOUD_OIDC_TOKEN_FILE=/var/run/secrets/tokens/oidc-token
      # ALIBABA_CLOUD_ROLE_ARN=acs:ram::123456789012:role/vector-role
    acknowledgements:
      enabled: true

@ti-chi-bot ti-chi-bot bot added the size/XXL label Jan 7, 2026
@yibin87 yibin87 force-pushed the add_topsql_deltalake branch from 6532cd4 to 9dc07fd on January 7, 2026 00:25
Signed-off-by: yibin87 <[email protected]>
use std::path::PathBuf;

use vector::{
    aws::{AwsAuthentication, RegionOrEndpoint},
Collaborator

Do we only support AWS and Alibaba Cloud right now?

Collaborator Author

Yes, that's right.

Collaborator

The meta and data information only differ in storage format, but most of mod.rs is duplicated between them. Is it necessary to keep them separate?

Collaborator Author
@yibin87 yibin87 Jan 7, 2026

Good point. The benefit of keeping them separate for now is that the two kinds of data are fully isolated, so their batches don't interfere with each other, and their actual processing logic still differs. Later we could consider extracting the per-cloud adaptation logic into its own module, like deltalake_writer; at that point the two mods would basically just be thin shells.

Collaborator

The current version's format shouldn't need to change, right? Why add new fields?

Collaborator Author
@yibin87 yibin87 Jan 7, 2026

Because the protocol changed, some code paths needed to be filled in. Also, adding fields doesn't actually affect the data ultimately stored in VM, because metrics that are 0 get filtered out. I'll remove the code-update logic in the parts that don't affect compilation.

Collaborator Author

Updated: only the changes needed to pass compilation are kept now, plus some changes in tests. The original topsql-to-VM data will not change at all.

Signed-off-by: yibin87 <[email protected]>