[RFC] Region-level isolation #121

# RFC: Region-Level Resource Isolation in TiKV

## Summary

This RFC proposes enhancements to TiKV's resource control to provide region-level isolation, preventing hot regions from overwhelming tenant resources. The design extends the existing resource group-based priority system with region-level RU (Resource Unit) tracking and introduces traffic moderation mechanisms.

## Motivation

### Current State

TiKV implements resource control at the **resource group level**:
- Resource groups represent tenants and track RU (Resource Unit) consumption
- Each resource group has a configured `group_priority` (LOW, MEDIUM, HIGH)
- The `ResourceController` uses the mClock algorithm to prioritize requests. It maintains a virtual time (VT) per resource group for fairness. VT increases as a group consumes resources, so groups with higher VT have consumed more and receive lower scheduling priority
- Tasks are ordered by priority: `concat_priority_vt(group_priority, resource_group_virtual_time)`. Lower values are scheduled first, so high-priority groups with low VT run first (see the sketch below)
- VT is periodically normalized to prevent starvation: lagging groups (low VT) are pulled toward the leader (highest VT), and all VTs are reset when nearing overflow
- The unified read pool uses yatp's priority queue (implemented with a SkipMap)
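
The following is a minimal illustration of how such a combined `u64` priority can be packed so that plain numeric comparison yields this ordering; the bit widths and the function body are assumptions for illustration, not the exact TiKV implementation.

```rust
/// Illustrative only: pack group priority and virtual time into one u64.
/// Assumed layout: top 8 bits hold the (inverted) group priority, the
/// remaining 56 bits hold the group's virtual time.
fn concat_priority_vt(group_priority: u8, virtual_time: u64) -> u64 {
    // Higher group_priority must sort *earlier* (lower value), so invert it
    // before placing it in the most significant bits.
    let inverted = (u8::MAX - group_priority) as u64;
    (inverted << 56) | (virtual_time & ((1u64 << 56) - 1))
}

// A group with priority HIGH (say 16) and low VT produces a smaller key than
// a MEDIUM group with the same VT, so it is scheduled first.
```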

## Goals

These functionalities are missing in the current implementation.

1. **Region-level fairness**: Hot regions (with hot keys or large scans) should be deprioritized to prevent resource monopolization within a tenant

**Contributor:** If I understand correctly, the goal is to solve the case where a large query exhausts the tenant's resources and causes high latency on small queries. In this case, I doubt that "region-level resource control" can really solve the problem, as the high resource consumption can come either from large scans on a small number of regions or from medium/small scans on a large number of regions. So maybe "query-level resource control" is what you want: automatically deprioritize read requests (from the same SQL) once their resource consumption (total execution CPU time) exceeds a certain threshold. https://github.com/tikv/yatp/blob/793be4d789d4bd15292fe4d06e38063b4ec9d48e/src/queue/multilevel.rs#L576 yatp's multilevel queue already has an implementation that degrades the level of long-running tasks.

BTW, in my experience, the current queue-based scheduling doesn't really add the value we expected, even though the theory seems plausible. Maybe you can first test with two resource groups of different priority and run the small/large SQL under the two groups (large SQL under the low-priority group), then see the impact of the large SQL on the small SQLs. This should be the best case that can be gained from resource control for a single tenant.

**Member (Author):** The goal here is to prevent a group of regions in a resource group from consuming all the CPUs. It generally happens in these scenarios.

I tested the mClock queue approach for resource groups. It works as expected, ensuring that all resource groups get their fair share of resources. I am extending the same fairness algorithm to regions.

**Member (Author):** @glorv does it make sense?

**Member:** I have the same question as @glorv. First, the proposed solution might be counter-productive for complex queries. For example, if a query spawns 100 coprocessor tasks where one is a heavy scan and 99 are light, the heavy scan is the natural bottleneck. If our resource control logic deprioritizes this "hot" task to prevent resource saturation, we are effectively stretching the tail latency of the entire query. In this case, "fairness" becomes a penalty for the user. I think request/query-level fairness makes more sense? I am not sure what region-level fairness can bring us. Do you have an example where QoS is impacted if we don't do region-level fairness?

**Member (Author):** The goal here is not to deprioritize a long-running query. Our goal is to deprioritize the traffic on hot regions. Let's say the system is healthy and regions 1, 2, 3, and 4 are each getting 5k QPS. Suddenly regions 1 and 2 start getting 20k QPS and CPU reaches 100%. Without region-level fairness, all regions (including regions 3 and 4) will be impacted until split and scatter happens. With region-level fairness, regions 3 and 4 won't be impacted.

**Member (Author):** And our goal is to deprioritize traffic on regions 1 and 2 only when the system is under resource pressure.

**Member:** I wonder if we can use benchmarks like sysbench to demonstrate that without region-level fairness, tail latency suffers, and that implementing fairness can stabilize those latencies (even if total throughput remains the same). I feel it could be challenging. If we can show this, the technical merit is clear. However, I suspect it is extremely difficult to "prove" the necessity of this change; I want to look past the technical implementation and see the actual value for the end user.

**Member (Author):** It is very easily reproducible. We have to reproduce hot regions by reducing the unified read pool queue size and the number of cores, then observe the impact on non-hot regions.

**Member (Author):** We need to create two tables in the same resource group, with one table very hot. We then see the latency on the other table getting impacted. The goal here is not to stabilize tail latency; the goal is to penalize the one consuming more resources in a multi-tenant system.

3. **Traffic Moderation**: In a multi-tenant SOA environment, setting correct rate limits is challenging - limits that are too tight reject valid traffic, while limits that are too loose allow overload. Instead of hard rate limits, implement adaptive traffic moderation that responds to sudden spikes on hot regions by gracefully deprioritizing rather than outright rejecting requests

**Contributor:** nit: this is numbered 3rd when it is the 2nd item. This point sounds more like a non-goal; move it to the alternative approaches section? Separately, do you want to add a goal around not interfering with region split?

**Member (Author):** This is a goal. If we don't have this goal, the system will become overloaded after a split.

**Member:** If there is a hot region that receives sudden high QPS, the load-based split will be triggered for this region maybe 30 seconds later, and the hot-region scheduler will transfer its leader to other TiKV nodes. The problem we want to resolve here is preventing the sudden hot region from eating too much CPU and impacting other normal traffic (whether other tenants' traffic or the same tenant's traffic) before the load-based split takes effect. How about just introducing some simple rules for region-level resource consumption, e.g. a single region can't eat more than 60% of CPU, so that the TiKV node always reserves some resources to keep node-level health (similar to DynamoDB, where on each storage node a single tenant can consume at most 50% of the node's resources as a node-level self-protection mechanism)? I'm afraid adding a region-level fairness mechanism on top of the resource-group-level fairness/priority mechanism may introduce unexpected complexity and behaviors.

**Member (Author):** There are two scenarios where there could be more than one hot region.

**Member:** If I understand right, in the case of multiple hot regions we expect the PD hot scheduler to take effect and balance some hot regions' leaders to other nodes; your concern is the case where sudden traffic causes multiple hot regions on the same node before the hot scheduler takes effect? I don't know if deprioritizing hot regions' requests can really prevent resource monopolization within a tenant; is it possible that these hot regions' requests (like index scans) can't tolerate the higher latency caused by deprioritizing them? In that case, how about going back to the essence of "tenant resource usage" and setting a tenant-level/resource-group-level resource usage throttle in the storage node, like DynamoDB, to make sure we reserve some resources and prevent the node from being exhausted by a single tenant?

**Member (Author):** Are you suggesting using a region-level throttle like DynamoDB (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/throttling-key-range-limit-exceeded-mitigation.html#key-range-additional-resources)?

**Member:** From the latest DynamoDB paper (https://www.usenix.org/system/files/atc22-elhemali.pdf), I think there are some ideas we can refer to:

**Member (Author):** Once we hit the node token limit, traffic on all regions is proportionally throttled. Our goal is to penalize the one which caused the system overload. We are talking about a scenario within a resource group which doesn't have any rate limit set.

4. **Queue Fairness**: Ensure the unified read pool queue maintains fairness across tenants/regions/background traffic. In the existing system, any one tenant or background traffic can consume the entire queue.

## Design

### Overview

Introduce **region-level virtual time (VT)** alongside the existing **resource group VT**. Each request's priority is determined by three factors in hierarchical order:

1. **group_priority**: Tenant priority (HIGH/MEDIUM/LOW) - tenant isolation
2. **group_vt**: Resource group virtual time - tenant fairness
3. **region_vt**: Region virtual time - region fairness

### Priority Structure

Replace the current 64-bit `u64` priority with a struct:

```rust
struct TaskPriority {
    group_priority: u8, // 1-16 (HIGH/MEDIUM/LOW from PD)
    group_vt: u64,      // Resource group virtual time
    region_vt: u64,     // Region virtual time
}
```

**Comparison order** (most significant first; see the sketch below):
1. `group_priority`: Higher value = higher priority (tenant isolation)
2. `group_vt`: Lower value = higher priority (tenant fairness)
3. `region_vt`: Lower value = higher priority (region fairness)
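
A minimal sketch of this hierarchical comparison, written so that "smaller key is scheduled first" still holds: higher `group_priority` sorts earlier, lower VTs sort earlier. It is illustrative, not the exact trait implementation proposed for yatp.

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq)]
struct TaskPriority {
    group_priority: u8,
    group_vt: u64,
    region_vt: u64,
}

impl Ord for TaskPriority {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher group_priority schedules first, so compare it in reverse;
        // lower group_vt / region_vt schedule first, so compare them directly.
        other
            .group_priority
            .cmp(&self.group_priority)
            .then(self.group_vt.cmp(&other.group_vt))
            .then(self.region_vt.cmp(&other.region_vt))
    }
}

impl PartialOrd for TaskPriority {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```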

**On task scheduling**:
- Group VT increases by `vt_delta_for_get` (fixed per group)
- Region VT increases by `vt_delta_for_get` (varies based on region hotness)
- Hot regions accumulate VT faster → pushed back in queue

**On task completion**:
- Group VT increases by actual CPU time consumed
- Region VT increases by actual CPU time consumed

**Periodic normalization** (every ~1 second; sketched below):
- Find min/max VT across all groups/regions
- Pull lagging entities toward leader (prevent starvation)
- Reset all VTs if near overflow
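
A sketch of what the normalization pass could look like, assuming VTs are stored as `AtomicU64`. The overflow threshold and the "pull laggards up to the midpoint" rule are illustrative choices; the RFC only requires that laggards are pulled toward the leader and that VTs are reset before overflow.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn normalize_vts(vts: &[AtomicU64]) {
    let values: Vec<u64> = vts.iter().map(|v| v.load(Ordering::Relaxed)).collect();
    let (min_vt, max_vt) = match (values.iter().min(), values.iter().max()) {
        (Some(&min), Some(&max)) => (min, max),
        _ => return,
    };

    // Reset everything if the leader is close to overflowing.
    if max_vt > u64::MAX / 2 {
        for vt in vts {
            vt.store(0, Ordering::Relaxed);
        }
        return;
    }

    // Pull lagging entities toward the leader so they cannot later starve
    // others by sitting on an artificially low VT.
    let floor = min_vt + (max_vt - min_vt) / 2;
    for vt in vts {
        if vt.load(Ordering::Relaxed) < floor {
            vt.store(floor, Ordering::Relaxed);
        }
    }
}
```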

### Traffic moderation and split/scatter

Currently, split/scatter is non-deterministic when a node is overloaded - it depends on how many requests on the region succeed. With this design, hot regions accumulate high VT and get deprioritized, which slows down split decisions that are based on served QPS.

**Contributor:** If split depends on requests succeeding, then deprioritization of such requests will delay the split. Should it use scheduled/dropped QPS rather than succeeded QPS instead?

**Member (Author):** Yes, we can use scheduled/dropped QPS. I will modify the design.

#### VT Handling for Split Regions

When a region splits, the VT behavior depends on CPU utilization:

**When CPU utilization > 80% (system overloaded)**:
- Split regions share a **common VT** inherited from the parent region
- Both child regions contribute to and read from the same VT tracker
- This maintains strong traffic moderation - even after splitting, the hot key/region group remains deprioritized as a unit
- The common VT continues accumulating based on the combined traffic to both regions
- This prevents the split from immediately bypassing the backpressure that delayed the split in the first place

**When CPU utilization drops < 80% (system has capacity)**:
- Split regions transition to **independent VTs**
- Each region gets its own VT tracker, initialized to the common VT value at the time of transition
- From this point forward, each region accumulates VT based on its own traffic patterns
- This allows natural load balancing - if traffic shifts to one split region, only that region gets deprioritized

**Implementation**:
- Track CPU utilization as a rolling average (e.g., over the last 10 seconds)
- On region split, create a `RegionGroup` if CPU > 80%, linking child regions to a shared VT
- Periodically check CPU utilization (every 1-5 seconds)
- When CPU drops < 80%, dissolve region groups and transition to independent VTs
- Store region group membership in `RegionVtTracker` with an atomic reference to the shared VT state

This adaptive approach provides stronger traffic moderation when the system is overloaded (maintaining backpressure across splits), while allowing normal load balancing when the system has capacity.
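
A small sketch of the shared-VT wiring under the assumptions above (an `Arc<AtomicU64>` holds the common VT while a region group exists). The field and method names here are illustrative, not the final API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

struct RegionVtTracker {
    virtual_time: AtomicU64,
    // Present only while the children of a hot split share one VT.
    parent_vt: Option<Arc<AtomicU64>>,
}

impl RegionVtTracker {
    /// Read the VT that scheduling should use: the shared parent VT while a
    /// region group exists, the region's own VT otherwise.
    fn current_vt(&self) -> u64 {
        match &self.parent_vt {
            Some(shared) => shared.load(Ordering::Relaxed),
            None => self.virtual_time.load(Ordering::Relaxed),
        }
    }

    /// Dissolve the group once CPU utilization falls below the threshold:
    /// copy the shared value into the region's own VT and drop the link.
    fn make_independent(&mut self) {
        if let Some(shared) = self.parent_vt.take() {
            self.virtual_time
                .store(shared.load(Ordering::Relaxed), Ordering::Relaxed);
        }
    }
}
```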

### Background Task Demotion

Background tasks (GC, compaction, statistics) use LOW `group_priority` regardless of their resource group's configured priority:
```
group_priority = LOW  // instead of the resource group's configured priority
```

This ensures foreground traffic is always prioritized over background traffic.

**Contributor** (on lines +97 to +104): Is the proposal here to create a virtual resource group for background tasks with low priority, to schedule them relative to other traffic as well?

**Member (Author):** Tagged you below in the implementation.

### Queue Eviction

When the queue is full:
1. Calculate the priority of the incoming task
2. Compare with the lowest-priority task in the queue

**Contributor:** That is probably hard to implement efficiently. Will it require one more priority queue with inverse priority? Or does the current SkipMap implementation allow popping from both ends efficiently?

**Member (Author):** Yes, the current SkipMap allows it. It is O(log N) complexity.

3. If the incoming task has higher priority: evict the lowest-priority task, enqueue the incoming task
4. Else: reject the incoming task with ServerIsBusy

Evicted tasks are failed with a ServerIsBusy error.

## Implementation

### 1. yatp Modifications

**Change priority type from `u64` to struct**:

```rust
// In yatp/src/queue/priority.rs

struct TaskPriority {
    group_priority: u8,
    group_vt: u64,
    region_vt: u64,
}

// Implement Ord with hierarchical comparison
// Update TaskPriorityProvider trait
trait TaskPriorityProvider {
    fn priority_of(&self, extras: &Extras) -> TaskPriority;
}

// Update MapKey
struct MapKey {
    priority: TaskPriority,
    sequence: u64,
}
```
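
Building on the `TaskPriority` and `MapKey` definitions above (and the `Ord` sketch in the Priority Structure section), one way the map key could keep FIFO order among equal-priority tasks is to fall back to the insertion sequence. This mirrors how the existing `u64` key appends a sequence number and is an illustrative sketch rather than the exact yatp change.

```rust
use std::cmp::Ordering;

// Illustrative: order first by TaskPriority, then by insertion sequence so
// tasks of equal priority keep FIFO order in the SkipMap.
impl Ord for MapKey {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            .then(self.sequence.cmp(&other.sequence))
    }
}

impl PartialOrd for MapKey {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```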

### 2. TaskMetadata Changes

Add `region_id` and `is_background` fields:

```rust
// In components/tikv_util/src/resource_control.rs

const REGION_ID_MASK: u8 = 0b0000_0100;
const IS_BACKGROUND_MASK: u8 = 0b0000_1000;

impl TaskMetadata {
    fn region_id(&self) -> u64 {
        // Extract from metadata bytes
    }

    fn is_background(&self) -> bool {
        self.mask & IS_BACKGROUND_MASK != 0
    }
}
```
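
For illustration, a self-contained sketch of one possible encoding of the region id in the metadata bytes (a mask byte followed by a little-endian `u64` when the bit is set). The real `TaskMetadata` layout may differ, so treat the byte offsets here as assumptions.

```rust
// Hypothetical, self-contained illustration of the encode/decode pair.
const REGION_ID_MASK: u8 = 0b0000_0100;

struct TaskMetadata {
    // byte 0: mask, bytes 1..9: region id (only if the mask bit is set)
    bytes: Vec<u8>,
}

impl TaskMetadata {
    fn with_region_id(mut mask: u8, region_id: u64) -> Self {
        mask |= REGION_ID_MASK;
        let mut bytes = vec![mask];
        bytes.extend_from_slice(&region_id.to_le_bytes());
        TaskMetadata { bytes }
    }

    fn region_id(&self) -> u64 {
        if self.bytes.first().map_or(false, |m| m & REGION_ID_MASK != 0) {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(&self.bytes[1..9]);
            u64::from_le_bytes(buf)
        } else {
            0 // no region id encoded
        }
    }
}
```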

### 3. Region VT Tracker

Create a new component for region-level tracking:

```rust
// In components/resource_control/src/region_tracker.rs

struct RegionResourceTracker {
    region_vts: DashMap<u64, RegionVtTracker>,
    cpu_utilization: AtomicU64, // Rolling average, encoded as u64
}

struct RegionVtTracker {
    virtual_time: AtomicU64,
    vt_delta_for_get: AtomicU64,
    parent_vt: Option<Arc<AtomicU64>>, // Shared parent VT if CPU > 80% at split
}

impl RegionResourceTracker {
    fn get_and_increment_vt(&self, region_id: u64) -> u64 {
        // If parent_vt exists, use the shared parent VT
        // Otherwise use the independent VT
        // Similar to ResourceGroup::get_priority()
    }

    fn on_region_split(&self, parent_id: u64, child1_id: u64, child2_id: u64) {
        // Get the parent VT value
        // If cpu_utilization > 80%:
        //   Create Arc<AtomicU64> with the parent VT
        //   Both children share a reference to parent_vt
        // Else:
        //   Both children get independent VTs initialized to the parent VT
        //   parent_vt = None
        // Remove the parent tracker
    }

    fn check_and_transition_to_independent(&self) {
        // If cpu_utilization < 80%:
        //   For each region with parent_vt:
        //     Copy the parent_vt value to virtual_time
        //     Set parent_vt to None
    }

    fn update_vt_deltas(&self) {
        // Periodically adjust vt_delta based on region hotness
        // ratio = region_ru / avg_ru
        // delta = base_delta * ratio
    }

    fn normalize_region_vts(&self) {
        // Periodically normalize VTs (like update_min_virtual_time)
        // Pull lagging regions forward, reset if near overflow
    }

    fn consume(&self, region_id: u64, cpu_time: Duration, keys: u64, bytes: u64) {
        // Update EMA metrics
        // Increment VT based on actual consumption
        // If parent_vt exists, increment the shared parent VT
        // Otherwise increment the independent VT
    }

    fn update_cpu_utilization(&self, cpu_util: f64) {
        // Update rolling average (EMA over ~10 seconds)
    }

    fn cleanup_inactive_regions(&self) {
        // Periodically remove regions with no recent VT updates
        // For each region:
        //   If virtual_time hasn't changed in the last N seconds:
        //     Remove it from the region_vts map
        // This reduces memory usage for cold/deleted regions
    }
}
```
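
A concrete sketch of the `update_vt_deltas` rule above, assuming per-region and average RU are tracked as exponential moving averages; `base_delta` and the clamp bounds are illustrative choices, not values from the RFC.

```rust
/// Illustrative only: scale each region's per-request VT increment by how hot
/// it is relative to the average region, so hot regions age faster in the queue.
fn compute_vt_delta(base_delta: u64, region_ru_ema: f64, avg_ru_ema: f64) -> u64 {
    if avg_ru_ema <= 0.0 {
        return base_delta;
    }
    // ratio = region_ru / avg_ru; delta = base_delta * ratio
    let ratio = region_ru_ema / avg_ru_ema;
    // Clamp so a single extremely hot region cannot blow up the VT arithmetic.
    let clamped = ratio.clamp(0.1, 100.0);
    (base_delta as f64 * clamped) as u64
}
```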

### 4. Priority Calculation

Update `ResourceController` to return `TaskPriority`:

```rust
// In components/resource_control/src/resource_group.rs

impl TaskPriorityProvider for ResourceController {
    fn priority_of(&self, extras: &Extras) -> TaskPriority {
        let metadata = TaskMetadata::from(extras.metadata());

        // 1. Get group VT
        let group_vt = self.resource_group(metadata.group_name())
            .get_group_vt(level, override_priority);

        // 2. Get region VT
        let region_id = metadata.region_id();
        let region_vt = self.region_tracker.get_and_increment_vt(region_id);

        // 3. Use LOW priority for background tasks
        let group_priority = if metadata.is_background() {
            LOW
        } else {
            base_priority // from the resource group config
        };

        TaskPriority { group_priority, group_vt, region_vt }
    }

    fn approximate_priority_of(&self, extras: &Extras) -> TaskPriority {
        // Read VT without incrementing (for eviction check)
    }
}
```

**Member (Author)** (on the background-task branch above): @Tema this is how priority is decided for background tasks.

### 5. Queue Eviction

Extend yatp to support eviction:

```rust
// In yatp/src/queue/priority.rs

impl QueueCore {
    fn try_evict_for_priority(&self, incoming_priority: TaskPriority) -> bool {
        if let Some(lowest_entry) = self.pq.back() {
            if incoming_priority < lowest_entry.priority {
                // Evict the lowest-priority task
                if let Some(entry) = self.pq.pop_back() {
                    // Send eviction signal via oneshot channel
                    entry.eviction_handle.evict();
                    return true;
                }
            }
        }
        false
    }
}
```

Wrap futures with eviction notification:

```rust
// In src/read_pool.rs

struct EvictableFuture<F> {
    future: F,
    eviction_rx: oneshot::Receiver<()>,
}

// On eviction: send signal via oneshot channel
// Future polls eviction_rx and returns ServerIsBusy if signaled
```
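
A sketch of how the wrapper could poll both the eviction signal and the inner future, using `futures::channel::oneshot`. The struct is repeated so the example is self-contained; the `ServerIsBusy` placeholder stands in for whatever error the read pool actually returns, and the `Unpin` bound is an assumption to keep the example short.

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::channel::oneshot;
use futures::Future;

struct ServerIsBusy; // stand-in for the real error type

struct EvictableFuture<F> {
    future: F,
    eviction_rx: oneshot::Receiver<()>,
}

impl<F, T> Future for EvictableFuture<F>
where
    F: Future<Output = T> + Unpin,
{
    type Output = Result<T, ServerIsBusy>;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // If the queue evicted us, fail fast with ServerIsBusy instead of
        // ever running the wrapped read task.
        if let Poll::Ready(Ok(())) = Pin::new(&mut self.eviction_rx).poll(cx) {
            return Poll::Ready(Err(ServerIsBusy));
        }
        // Otherwise drive the inner future as usual.
        Pin::new(&mut self.future).poll(cx).map(Ok)
    }
}
```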

Update `ReadPoolHandle::spawn()`:

```rust
impl ReadPoolHandle {
    fn spawn(...) -> Result<(), ReadPoolError> {
        // 1. Calculate approximate priority (without VT increment)
        let approx_priority = resource_ctl.approximate_priority_of(&extras);

        // 2. Check whether the queue is full
        if running_tasks >= max_tasks {
            // 3. Try eviction
            if !remote.try_evict_for_priority(approx_priority) {
                return Err(ReadPoolError::UnifiedReadPoolFull);
            }
        }

        // 4. Spawn the task (actual priority calculated by yatp)
        remote.spawn(task_cell);
        Ok(())
    }
}
```

### 6. Tracking Integration

Wire region tracking into the execution paths:

```rust
// In src/storage/mod.rs and src/coprocessor/

// After a task completes:
region_tracker.consume(
    region_id,
    cpu_time,
    keys_scanned,
    bytes_read,
);
```

### 7. Background Task

Periodic normalization and delta updates:

```rust
// Run every 1 second
fn periodic_region_maintenance() {
    region_tracker.normalize_region_vts();
    region_tracker.update_vt_deltas();
}
```
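
In terms of the `RegionResourceTracker` sketched in section 3, the maintenance loop could be driven by a dedicated thread as below; in TiKV it would more likely hang off an existing background worker, so this is only a shape sketch under that assumption.

```rust
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Assumes the RegionResourceTracker type sketched in section 3.
fn spawn_region_maintenance(region_tracker: Arc<RegionResourceTracker>) {
    thread::spawn(move || loop {
        region_tracker.normalize_region_vts();
        region_tracker.update_vt_deltas();
        thread::sleep(Duration::from_secs(1));
    });
}
```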

## Configuration

New configuration options in `tikv.toml`:

```toml
[resource-control]
# Enable region-level resource tracking
enable-region-tracking = true
```

## Drawbacks

1. **Temporary traffic moderation**: The VT-based traffic moderation is temporary. It does not survive a node reboot after regions are split.
2. **Shared region fairness issues**: When multiple resource groups access the same region, two fairness problems arise:
   - **Innocent tenant penalized**: Tenant A's heavy usage increases the region's VT, penalizing Tenant B's requests to that region even though Tenant B didn't cause the hotness
   - **Hot region stays hot**: If Tenant A and B alternate requests to a shared region, each tenant's group_vt stays low (they're taking turns), so the region never gets properly deprioritized despite being continuously hot

**Contributor** (on lines +376 to +378): Did you consider having a region tracker per group? What are the trade-offs of that approach instead?

**Member (Author):** What is a region tracker per group?

**Contributor:** The current design tracks CPU per region across all tenants, which results in the highlighted issues. I'm asking if you considered tracking it per tenant.

**Member (Author):** The existing design is that if a region is split into r1 and r2, they will share the same VT if CPU utilization is more than 80%. Having a region tracker per group would complicate this design.

This can be mitigated by ensuring resource groups don't share tables; regions are generally created at table boundaries if the table is big enough.

**Comment (from @jiadebin):** If I understand correctly, the goal of this RFC is to address the issues of system avalanches and starvation caused by hot regions. However, it carries a potential side effect: when the system is not overloaded, the excessive pursuit of region-level fairness in business hotspot scenarios may harm the performance of critical business operations.

This highlights a core conflict in defining "resource scheduling fairness": request fairness vs. region fairness.

Consider a scenario with two regions, `region_1` and `region_2`. 90% of requests target `region_1`, while 10% target `region_2`. All requests are small queries (TP requests), and the system is under high load but has not yet exhausted all resources (e.g., at 98% utilization). Under the proposed scheme, requests for `region_1` would be deprioritized and delayed (causing the RT for some `region_1` requests to increase). However, from the user's perspective, they likely expect requests from both regions to be treated equally (processed in order of arrival, i.e., first-in-first-served). In the above example (where 90% of requests are on region 1), this typically represents a genuine business hotspot (e.g., a flash-sale item table or a hot account). The user's (business owner's) expectation is: "This is my most critical business; the system should dedicate maximum effort to processing these requests."

If TiKV forcibly suppresses requests from region 1 for the sake of "fairness", allowing the 10% of "cold" requests to cut the line, the result is an increase in P99 latency for the core business and throttled throughput. In single-tenant, purely business-oriented scenarios, this "robbing the rich to feed the poor" approach could indeed be perceived as a performance regression.

For small queries, processing time is extremely short. In such cases, the overhead of maintaining a complex priority queue (calculating VT, reordering tasks), combined with artificial queuing delays, may outweigh the benefits. Users intuitively believe that first-come-first-served is the fairest approach, as no single request is maliciously monopolizing time slices (unlike a full table scan).

Therefore, the scope and triggering conditions for this RFC require careful consideration: region-level isolation is beneficial primarily when system resources are really nearing saturation (how do we assess whether a system will avalanche?) or when there is a distinct mixed workload (TP mixed with large, long-running queries).

**Comment:** @jiadebin I would argue that for a genuine business hotspot you need to create a separate resource group. This proposal is to arbitrate regions within one resource group.

**Comment:** @jiadebin I understand your concern. I think TiDB doesn't typically have a scenario where one region gets 90% of the traffic and another region gets 10% while the system has spare resources. When region splits are successful, traffic is generally distributed more evenly across regions. If a single region is handling 90% of the traffic on a node, it usually indicates one of two issues. This RFC assumes that regions generally have moderately uniform traffic if there is no single hot key.