-
Notifications
You must be signed in to change notification settings - Fork 129
[Darft WIP]Add Best-Effort Routing Fallback When All Backends in Default Group Are Unhealthy #810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -54,6 +54,7 @@ public abstract class BaseRoutingManager | |
| private final GatewayBackendManager gatewayBackendManager; | ||
| private final ConcurrentHashMap<String, TrinoStatus> backendToStatus; | ||
| private final String defaultRoutingGroup; | ||
| private final boolean bestEffortRouting; | ||
| private final QueryHistoryManager queryHistoryManager; | ||
| private final LoadingCache<String, String> queryIdBackendCache; | ||
| private final LoadingCache<String, String> queryIdRoutingGroupCache; | ||
|
|
@@ -63,6 +64,7 @@ public BaseRoutingManager(GatewayBackendManager gatewayBackendManager, QueryHist | |
| { | ||
| this.gatewayBackendManager = gatewayBackendManager; | ||
| this.defaultRoutingGroup = routingConfiguration.getDefaultRoutingGroup(); | ||
| this.bestEffortRouting = routingConfiguration.isBestEffortRouting(); | ||
| this.queryHistoryManager = queryHistoryManager; | ||
| this.queryIdBackendCache = buildCache(this::findBackendForUnknownQueryId); | ||
| this.queryIdRoutingGroupCache = buildCache(this::findRoutingGroupForUnknownQueryId); | ||
|
|
@@ -92,10 +94,16 @@ public void setRoutingGroupForQueryId(String queryId, String routingGroup) | |
| */ | ||
| public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user) | ||
| { | ||
| List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveDefaultBackends().stream() | ||
| List<ProxyBackendConfiguration> activeDefaults = gatewayBackendManager.getActiveDefaultBackends(); | ||
| List<ProxyBackendConfiguration> healthyDefaults = activeDefaults.stream() | ||
| .filter(backEnd -> isBackendHealthy(backEnd.getName())) | ||
| .toList(); | ||
| return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero")); | ||
| // If no healthy defaults, optionally route among all active defaults when enabled | ||
| List<ProxyBackendConfiguration> candidates = !healthyDefaults.isEmpty() | ||
| ? healthyDefaults | ||
| : (bestEffortRouting ? activeDefaults : healthyDefaults); | ||
| return selectBackend(candidates, user) | ||
| .orElseThrow(() -> new IllegalStateException("Number of active backends found zero")); | ||
|
Comment on lines
+97
to
+106
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm a bit concerned with the readability of the ternary branch. Shall we change it to something like this ⬇️ We can also benefit from different exception messages List<ProxyBackendConfiguration> activeDefaults = gatewayBackendManager.getActiveDefaultBackends();
List<ProxyBackendConfiguration> healthyDefaults = activeDefaults.stream()
.filter(backEnd -> isBackendHealthy(backEnd.getName()))
.toList();
if (bestEffortRouting && healthyDefaults.isEmpty()) {
return selectBackend(activeDefaults, user)
.orElseThrow(() -> new IllegalStateException(
"No active default backend found under best-effort routing"));
}
return selectBackend(healthyDefaults, user)
.orElseThrow(() -> new IllegalStateException(
"No healthy default backend found")); |
||
| } | ||
|
|
||
| /** | ||
|
|
||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if this new standalone test class is necessary. Shall we integrate these tests in |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| /* | ||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||
| * you may not use this file except in compliance with the License. | ||
| * You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package io.trino.gateway.ha.router; | ||
|
|
||
| import io.trino.gateway.ha.clustermonitor.TrinoStatus; | ||
| import io.trino.gateway.ha.config.ProxyBackendConfiguration; | ||
| import io.trino.gateway.ha.config.RoutingConfiguration; | ||
| import io.trino.gateway.ha.persistence.JdbcConnectionManager; | ||
| import org.junit.jupiter.api.Test; | ||
|
|
||
| import static io.trino.gateway.ha.TestingJdbcConnectionManager.createTestingJdbcConnectionManager; | ||
| import static org.assertj.core.api.Assertions.assertThat; | ||
|
|
||
| final class TestBestEffortRouting | ||
| { | ||
| @Test | ||
| void testBestEffortRoutingEnabledAllUnhealthy() | ||
| { | ||
| JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager(); | ||
| RoutingConfiguration routingConfiguration = new RoutingConfiguration(); | ||
| routingConfiguration.setBestEffortRouting(true); | ||
| GatewayBackendManager backendMgr = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration); | ||
| RoutingManager rm = new StochasticRoutingManager(backendMgr, new HaQueryHistoryManager(connectionManager.getJdbi(), false), routingConfiguration); | ||
|
|
||
| String group = "adhoc"; | ||
| addActiveBackend(backendMgr, group, "trino-1"); | ||
| addActiveBackend(backendMgr, group, "trino-2"); | ||
|
|
||
| rm.updateBackEndHealth("trino-1", TrinoStatus.UNHEALTHY); | ||
| rm.updateBackEndHealth("trino-2", TrinoStatus.UNHEALTHY); | ||
|
|
||
| ProxyBackendConfiguration selected = rm.provideBackendConfiguration(group, "user"); | ||
| assertThat(selected.getName()).isIn("trino-1", "trino-2"); | ||
| assertThat(selected.getRoutingGroup()).isEqualTo(group); | ||
| } | ||
|
|
||
| @Test | ||
| void testFallsBackWhenAllUnhealthyInGroup() | ||
| { | ||
| JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager(); | ||
| RoutingConfiguration routingConfiguration = new RoutingConfiguration(); | ||
| routingConfiguration.setBestEffortRouting(true); | ||
| routingConfiguration.setDefaultRoutingGroup("adhoc"); | ||
| GatewayBackendManager backendMgr = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration); | ||
| RoutingManager rm = new StochasticRoutingManager(backendMgr, new HaQueryHistoryManager(connectionManager.getJdbi(), false), routingConfiguration); | ||
|
|
||
| // Non-default group with all unhealthy | ||
| String vipGroup = "vip"; | ||
| addActiveBackend(backendMgr, vipGroup, "vip-1"); | ||
| addActiveBackend(backendMgr, vipGroup, "vip-2"); | ||
| rm.updateBackEndHealth("vip-1", TrinoStatus.UNHEALTHY); | ||
| rm.updateBackEndHealth("vip-2", TrinoStatus.UNHEALTHY); | ||
|
|
||
| // Default group with one healthy and one unhealthy | ||
| addActiveBackend(backendMgr, "adhoc", "adhoc-1"); | ||
| addActiveBackend(backendMgr, "adhoc", "adhoc-2"); | ||
| rm.updateBackEndHealth("adhoc-1", TrinoStatus.HEALTHY); | ||
| rm.updateBackEndHealth("adhoc-2", TrinoStatus.UNHEALTHY); | ||
|
|
||
| ProxyBackendConfiguration selected = rm.provideBackendConfiguration(vipGroup, "user"); | ||
| assertThat(selected.getRoutingGroup()).isEqualTo("adhoc"); | ||
| assertThat(selected.getName()).isEqualTo("adhoc-1"); | ||
| } | ||
|
|
||
| private static void addActiveBackend(GatewayBackendManager mgr, String group, String name) | ||
| { | ||
| ProxyBackendConfiguration backend = new ProxyBackendConfiguration(); | ||
| backend.setActive(true); | ||
| backend.setRoutingGroup(group); | ||
| backend.setName(name); | ||
| backend.setProxyTo(name + ".trino.example.com"); | ||
| backend.setExternalUrl("trino.example.com"); | ||
| mgr.addBackend(backend); | ||
| } | ||
| } |
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to how we are introducing
strictRoutingin #804, I am wondering if this config should be a per-rule basis instead of global. We can think of each routing rule as having a varying level of "strictness", from least to most:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @Peiyingy thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different levels of strictness makes sense. However, if we move that to routing rule definition, is won't cover the default routing case at all. Maybe we can keep this global config, and allow per-rule config to override it? WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should move it so that the default rule is specified in the rule configs along the other rules, instead of being treated separately? Kind of like resource group definitions in Trino -- where the last selector is typically a "catch all" that you fall back to if no other rules match (example). Not sure if that totally makes sense for routing rules, though -- I forget all the places that the default is used.
Global config with override makes sense to me as well, if we make it a
strictnessLevelconfig (same as the per-rule config)Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand correctly, the best effort will not make query choose a unhealthy target within the customized group, it only apply to adhoc/default group. What do we need the override for?
If we set best effort at global level, and the rule targeted group has no healthy cluster:
strictRouting, query goes to adhoc, may fail if adhocs are also unhealthy.strictRouting, query fail at target group unhealthy clusters.if we override the best effort per this rule, I think the result is gonna be the same?
But I agree that thinking in terms of routing “strictness levels” (STRICT / DEFAULT / BEST_EFFORT) is the right long-term model, and that it maps well to how we want routing behavior to evolve.
As for moving the rule to same fallback as resource group, I don't think MVEL has any blocker on this maybe it will add one more default rule to traverse every time we inspect a rule. Maybe we can align more on this and make a separate PR if needed since that remodels the rule behavior a bit, then come back and generalize the strictness.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying that. Yes that would be the same here. I guess in rules, we can still decide that when there's fail back to default, do we want to make it
bestEffortor not. Say globally that's disabled, but for this rule, if it falls back to defaults, and defaults are unhealthy, we still want it to fall back to unhealthy clusters. I'm not sure if this granularity is necessary, though.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a implicit default routing rule makes sense to me. I don't know if it's necessary to specify that.
I think under the current design, we've already handled all possible fallbacks:
If there are no available active healthy cluster for the selected routing group, fallback to ActiveDefaultBackends (ie. active healthy cluster in defaultRoutingGroup).
-> Handled by
StrictRoutingIf there's no If there are no available active healthy cluster for the default routing group, fallback to unhealthy ones
-> Handled by
BestEffortRoutingI agree we should unify the naming, but the current code design makes sense to me.