Skip to content

Conversation

@dantengsky
Copy link
Member

@dantengsky dantengsky commented Jan 22, 2026

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Add heuristic-based block-level shuffle for better load balancing when tables have few segments relative to cluster size.

Background

In distributed query scenarios, when a table has few segments relative to the cluster size, the original segment-level Mod distribution strategy causes uneven load balancing.

Problem scenarios:

  • 3 segments, 4 nodes, 300 blocks → Segments per node: 1-1-1-0 (1 node idle, max 2x workload difference)
  • 10 segments, 4 nodes, 1000 blocks → Segments per node: 3-3-2-2 (50% workload difference)

Solution

Introduce an automatic block-level distribution heuristic:

  1. Trigger condition: Activates when segment_count < cluster_nodes * threshold
  2. Distribution method: All segments assign to all nodes; each node filters blocks by block_idx % num_nodes == node_idx
  3. setting: auto_block_shuffle_threshold (default: 5, set to 0 to disable)

How it improves

┌───────────────────────────────────┬────────────────────────────────────────────┬─────────────────────────┐
│             Scenario              │               Original (Mod)               │     New (BlockMod)      │
├───────────────────────────────────┼────────────────────────────────────────────┼─────────────────────────┤
│ 3 segments, 4 nodes, 300 blocks   │ Segments: 1-1-1-0, blocks: 100-100-100-0   │ Blocks: 75-75-75-75     │
├───────────────────────────────────┼────────────────────────────────────────────┼─────────────────────────┤
│ 10 segments, 4 nodes, 1000 blocks │ Segments: 3-3-2-2, blocks: 300-300-200-200 │ Blocks: 250-250-250-250 │
└───────────────────────────────────┴────────────────────────────────────────────┴─────────────────────────┘

With block-level distribution, workload is evenly distributed regardless of segment count.

New Settings

-- View current threshold
SELECT value FROM system.settings WHERE name = 'auto_block_shuffle_threshold';

-- Adjust threshold (block-level distribution when segment < nodes * threshold)
SET auto_block_shuffle_threshold = 5; -- default

-- Disable automatic block-level distribution
SET auto_block_shuffle_threshold = 0;

Changes

  • Add BlockMod shuffle kind for block-level distribution
  • Add auto_block_shuffle_threshold setting (default=5, 0 to disable)
  • When segment_count < nodes * threshold, use block-level shuffle
  • Each executor filters blocks by block_idx % num_executors == executor_idx
  • Add info logging for shuffle strategy selection

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jan 22, 2026
@dantengsky dantengsky added the ci-cloud Build docker image for cloud test label Jan 22, 2026
@github-actions
Copy link
Contributor

Docker Image for PR

  • tag: pr-19311-14cf1ad-1769051528

note: this image tag is only available for internal use.

@dantengsky dantengsky force-pushed the feat/block-level-partition-shuffle branch from 5c7075e to c0a0d92 Compare January 22, 2026 06:30
@dantengsky dantengsky added the ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits label Jan 22, 2026
@github-actions
Copy link
Contributor

Docker Image for PR

  • tag: pr-19311-27fa899-1769066148

note: this image tag is only available for internal use.

@dantengsky dantengsky force-pushed the feat/block-level-partition-shuffle branch 2 times, most recently from ade09d8 to 9764f7a Compare January 22, 2026 08:33
@github-actions
Copy link
Contributor

github-actions bot commented Jan 22, 2026

🤖 CI Job Analysis

Workflow: 21243076599

⛔️ CANCELLED

Higher priority request detected - retry cancelled to avoid conflicts.

View Workflow

Add heuristic-based block-level shuffle for better load balancing when
tables have few segments relative to cluster size.

Changes:
- Add BlockMod shuffle kind for block-level distribution
- Add auto_block_shuffle_threshold setting (default=5, 0 to disable)
- When segment_count < nodes * threshold, use block-level shuffle
- Each executor filters blocks by block_idx % num_executors == executor_idx
- Add info logging for shuffle strategy selection
- Preserve partition kind during reshuffle to prevent data duplication
@dantengsky dantengsky force-pushed the feat/block-level-partition-shuffle branch from 9764f7a to cedc7b5 Compare January 22, 2026 08:35
Move block_slot computation from executor-side (prune_segments_with_pipeline)
to coordinator-side (redistribute_source_fragment). This ensures all executors
use the same cluster view that was determined when the plan was created,
preventing data duplication or loss if cluster membership changes.

Changes:
- Add block_slot field to DataSourcePlan
- Compute block_slot in redistribute_source_fragment for BlockMod shuffle
- Pass block_slot through plan instead of computing at execution time
…castWarehouse

Block filtering is now controlled by plan.block_slot, not by partition kind.
After reshuffle, all executors just process partitions sequentially.

Also revert incorrect change in memory_table.rs (should use BroadcastCluster).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits ci-cloud Build docker image for cloud test pr-feature this PR introduces a new feature to the codebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant