[Draft WIP] Add Best-Effort Routing Fallback When All Backends in Default Group Are Unhealthy #810

xkrogen · 2025-12-15T16:25:15Z

Similar to how we are introducing strictRouting in #804, I am wondering if this config should be set on a per-rule basis instead of globally. We can think of each routing rule as having a varying level of "strictness", from most to least strict:

Strict: Route only to the target group

Default: Fall back to default group if target group is all unhealthy

Best Effort: Fall back to unhealthy clusters if all of target group + default group are unhealthy

cc @Peiyingy thoughts?
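
One way to picture the three levels is as an enum that bounds how far the candidate pool may widen. This is only a rough sketch: `RoutingStrictness`, `StrictnessSketch`, and `candidates` are hypothetical names, backends are reduced to plain strings, and it assumes that best effort falls back to the unhealthy clusters of the default group rather than those of the target group.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical enum mirroring the three levels above; not part of the gateway code.
enum RoutingStrictness
{
    STRICT,      // route only to the target group
    DEFAULT,     // fall back to the default group if the target group is all unhealthy
    BEST_EFFORT  // additionally fall back to unhealthy clusters
}

final class StrictnessSketch
{
    private StrictnessSketch() {}

    // Returns the candidate pool; an empty list means the caller should raise a routing error.
    static List<String> candidates(
            RoutingStrictness level,
            List<String> activeTarget,
            List<String> activeDefaults,
            Predicate<String> isHealthy)
    {
        List<String> healthyTarget = activeTarget.stream().filter(isHealthy).toList();
        if (!healthyTarget.isEmpty() || level == RoutingStrictness.STRICT) {
            return healthyTarget;
        }
        List<String> healthyDefaults = activeDefaults.stream().filter(isHealthy).toList();
        if (!healthyDefaults.isEmpty() || level == RoutingStrictness.DEFAULT) {
            return healthyDefaults;
        }
        // Assumption: BEST_EFFORT widens to every active default backend, healthy or not.
        return activeDefaults;
    }
}
```

Each level only adds one more fallback step on top of the previous one, which is what would make a single ordered setting workable in place of separate flags.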

Different levels of strictness make sense. However, if we move that to the routing rule definition, it won't cover the default routing case at all. Maybe we can keep this global config, and allow per-rule config to override it? WDYT?

I wonder if we should move it so that the default rule is specified in the rule configs along the other rules, instead of being treated separately? Kind of like resource group definitions in Trino -- where the last selector is typically a "catch all" that you fall back to if no other rules match (example). Not sure if that totally makes sense for routing rules, though -- I forget all the places that the default is used.

Global config with override makes sense to me as well, if we make it a strictnessLevel config (same as the per-rule config)
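
A minimal sketch of the global-plus-override idea, assuming a hypothetical `StrictnessLevel` enum and an optional per-rule value. Nothing like this exists in the PR, which only has a boolean `bestEffortRouting`; all names here are for illustration. The default routing case, where no rule matched, has no override and takes the global level.

```java
import java.util.Optional;

// Hypothetical type; the PR itself only has a boolean bestEffortRouting.
enum StrictnessLevel { STRICT, DEFAULT, BEST_EFFORT }

final class StrictnessOverrideSketch
{
    private StrictnessOverrideSketch() {}

    // ruleLevel is empty when the matched rule sets no level, or when no rule matched at all.
    static StrictnessLevel effectiveLevel(StrictnessLevel globalLevel, Optional<StrictnessLevel> ruleLevel)
    {
        return ruleLevel.orElse(globalLevel);
    }
}
```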

> Maybe we can keep this global config, and allow per-rule config to override it?

Not sure I understand correctly: best effort will not make a query choose an unhealthy target within the customized group; it only applies to the adhoc/default group. What do we need the override for?

If we set best effort at the global level, and the rule's target group has no healthy cluster:

Without strictRouting, the query goes to adhoc, and may fail if the adhoc clusters are also unhealthy.

With strictRouting, the query fails at the target group's unhealthy clusters.

if we override the best effort per this rule, I think the result is gonna be the same?
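
To make the comparison concrete, here is a small sketch of the case above (a non-default target group with no healthy cluster). It assumes that strictRouting from #804 never leaves the target group and that the default group has at least one active backend. `TargetGroupDownSketch` and `Outcome` are made-up names, not gateway types.

```java
final class TargetGroupDownSketch
{
    private TargetGroupDownSketch() {}

    // Where the query ends up once the target group has no healthy cluster.
    enum Outcome { FAILS_AT_TARGET_GROUP, HEALTHY_DEFAULT, UNHEALTHY_DEFAULT, NO_BACKEND_ERROR }

    static Outcome whenTargetGroupAllUnhealthy(
            boolean strictRouting,
            boolean bestEffortRouting,
            boolean anyDefaultHealthy)
    {
        if (strictRouting) {
            // Assumption: strict routing never falls back, so best effort is never consulted.
            return Outcome.FAILS_AT_TARGET_GROUP;
        }
        if (anyDefaultHealthy) {
            return Outcome.HEALTHY_DEFAULT;
        }
        // Best effort only matters once the default group is also fully unhealthy.
        return bestEffortRouting ? Outcome.UNHEALTHY_DEFAULT : Outcome.NO_BACKEND_ERROR;
    }
}
```

Under these assumptions, flipping bestEffortRouting changes the outcome only when strictRouting is off and the default group has no healthy cluster.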

But I agree that thinking in terms of routing “strictness levels” (STRICT / DEFAULT / BEST_EFFORT) is the right long-term model, and that it maps well to how we want routing behavior to evolve.
As for moving the default rule to the same catch-all fallback model as resource groups, I don't think MVEL has any blocker on this, though it may add one more default rule to traverse every time we evaluate the rules. Maybe we can align more on this and make a separate PR if needed, since that remodels the rule behavior a bit, and then come back and generalize the strictness.

> if we override the best effort per this rule, I think the result is gonna be the same?

Thanks for clarifying that. Yes, that would be the same here. I guess that in rules, we can still decide whether a fallback to the default group should be best-effort or not. Say it's disabled globally, but for this rule, if it falls back to the defaults and the defaults are unhealthy, we still want it to fall back to unhealthy clusters. I'm not sure if this granularity is necessary, though.

> I wonder if we should move it so that the default rule is specified in the rule configs along the other rules, instead of being treated separately?

Having an implicit default routing rule makes sense to me. I don't know if it's necessary to specify it explicitly.

I think under the current design, we've already handled all possible fallbacks:

If there is no available active healthy cluster for the selected routing group, fall back to ActiveDefaultBackends (i.e. active healthy clusters in defaultRoutingGroup).
-> Handled by StrictRouting

If there is no available active healthy cluster for the default routing group, fall back to unhealthy ones.
-> Handled by BestEffortRouting

I agree we should unify the naming, but the current code design makes sense to me.

Peiyingy · 2025-12-13T02:18:03Z

I'm a bit concerned about the readability of the ternary branch. Shall we change it to something like this ⬇️? We would also benefit from distinct exception messages.

    List<ProxyBackendConfiguration> activeDefaults = gatewayBackendManager.getActiveDefaultBackends();
    List<ProxyBackendConfiguration> healthyDefaults = activeDefaults.stream()
            .filter(backEnd -> isBackendHealthy(backEnd.getName()))
            .toList();
    if (bestEffortRouting && healthyDefaults.isEmpty()) {
        return selectBackend(activeDefaults, user)
                .orElseThrow(() -> new IllegalStateException(
                        "No active default backend found under best-effort routing"));
    }
    return selectBackend(healthyDefaults, user)
            .orElseThrow(() -> new IllegalStateException(
                    "No healthy default backend found"));

Peiyingy · 2025-12-13T02:25:36Z

I'm not sure if this new standalone test class is necessary. Shall we integrate these tests into TestStochasticRoutingManager? At the least, we can create some common test util methods for both of them.

@@ Expand Up @@
     #### NOOP
     This option disables health checks.
+    ### Best-effort routing when all backends are unhealthy
+    By default, routing only selects backends that are both ACTIVE and HEALTHY.
+    However, in environments where health checks may occasionally be flaky,
+    this behavior can result in “Number of active backends found zero” errors—even when
+    viable clusters technically exist.
+    In reality, if a cluster is truly unhealthy, the query will fail regardless of whether
+    the gateway routes to it or not. To prevent unnecessary immediate failures, you can enable
+    best-effort routing.
+    When best-effort mode is enabled, if all active backends in the routing group are marked
+    UNHEALTHY, the router will still choose among them as a last resort, rather than failing
+    the routing decision outright.
+    ```yaml
+    routing:
+      bestEffortRouting: true
+    ```

@@ -25,6 +25,10 @@ public class RoutingConfiguration
         private String defaultRoutingGroup = "adhoc";
+        // When true, if all active backends are unhealthy, route among active backends anyway (best-effort).
+        // Default is false for backward compatibility (strict: healthy-only).
+        private boolean bestEffortRouting;
         public Duration getAsyncTimeout()
         {
             return asyncTimeout;
@@ Expand Down Expand Up @@
         {
             this.defaultRoutingGroup = defaultRoutingGroup;
         }
+        public boolean isBestEffortRouting()
+        {
+            return bestEffortRouting;
+        }
+        public void setBestEffortRouting(boolean bestEffortRouting)
+        {
+            this.bestEffortRouting = bestEffortRouting;
+        }
     }

@@ -54,6 +54,7 @@ public abstract class BaseRoutingManager
         private final GatewayBackendManager gatewayBackendManager;
         private final ConcurrentHashMap<String, TrinoStatus> backendToStatus;
         private final String defaultRoutingGroup;
+        private final boolean bestEffortRouting;
         private final QueryHistoryManager queryHistoryManager;
         private final LoadingCache<String, String> queryIdBackendCache;
         private final LoadingCache<String, String> queryIdRoutingGroupCache;
@@ Expand All @@
         {
             this.gatewayBackendManager = gatewayBackendManager;
             this.defaultRoutingGroup = routingConfiguration.getDefaultRoutingGroup();
+            this.bestEffortRouting = routingConfiguration.isBestEffortRouting();
             this.queryHistoryManager = queryHistoryManager;
             this.queryIdBackendCache = buildCache(this::findBackendForUnknownQueryId);
             this.queryIdRoutingGroupCache = buildCache(this::findRoutingGroupForUnknownQueryId);
@@ Expand Down Expand Up @@
          */
         public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user)
         {
-            List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveDefaultBackends().stream()
+            List<ProxyBackendConfiguration> activeDefaults = gatewayBackendManager.getActiveDefaultBackends();
+            List<ProxyBackendConfiguration> healthyDefaults = activeDefaults.stream()
                     .filter(backEnd -> isBackendHealthy(backEnd.getName()))
                     .toList();
-            return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
+            // If no healthy defaults, optionally route among all active defaults when enabled
+            List<ProxyBackendConfiguration> candidates = !healthyDefaults.isEmpty()
+                    ? healthyDefaults
+                    : (bestEffortRouting ? activeDefaults : healthyDefaults);
+            return selectBackend(candidates, user)
+                    .orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
         }
         /**
@@ Expand Down @@

@@ -0,0 +1,86 @@
+    /*
+     * Licensed under the Apache License, Version 2.0 (the "License");
+     * you may not use this file except in compliance with the License.
+     * You may obtain a copy of the License at
+     *
+     *     http://www.apache.org/licenses/LICENSE-2.0
+     *
+     * Unless required by applicable law or agreed to in writing, software
+     * distributed under the License is distributed on an "AS IS" BASIS,
+     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+     * See the License for the specific language governing permissions and
+     * limitations under the License.
+     */
+    package io.trino.gateway.ha.router;
+    import io.trino.gateway.ha.clustermonitor.TrinoStatus;
+    import io.trino.gateway.ha.config.ProxyBackendConfiguration;
+    import io.trino.gateway.ha.config.RoutingConfiguration;
+    import io.trino.gateway.ha.persistence.JdbcConnectionManager;
+    import org.junit.jupiter.api.Test;
+    import static io.trino.gateway.ha.TestingJdbcConnectionManager.createTestingJdbcConnectionManager;
+    import static org.assertj.core.api.Assertions.assertThat;
+    final class TestBestEffortRouting
+    {
+        @Test
+        void testBestEffortRoutingEnabledAllUnhealthy()
+        {
+            JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
+            RoutingConfiguration routingConfiguration = new RoutingConfiguration();
+            routingConfiguration.setBestEffortRouting(true);
+            GatewayBackendManager backendMgr = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration);
+            RoutingManager rm = new StochasticRoutingManager(backendMgr, new HaQueryHistoryManager(connectionManager.getJdbi(), false), routingConfiguration);
+            String group = "adhoc";
+            addActiveBackend(backendMgr, group, "trino-1");
+            addActiveBackend(backendMgr, group, "trino-2");
+            rm.updateBackEndHealth("trino-1", TrinoStatus.UNHEALTHY);
+            rm.updateBackEndHealth("trino-2", TrinoStatus.UNHEALTHY);
+            ProxyBackendConfiguration selected = rm.provideBackendConfiguration(group, "user");
+            assertThat(selected.getName()).isIn("trino-1", "trino-2");
+            assertThat(selected.getRoutingGroup()).isEqualTo(group);
+        }
+        @Test
+        void testFallsBackWhenAllUnhealthyInGroup()
+        {
+            JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
+            RoutingConfiguration routingConfiguration = new RoutingConfiguration();
+            routingConfiguration.setBestEffortRouting(true);
+            routingConfiguration.setDefaultRoutingGroup("adhoc");
+            GatewayBackendManager backendMgr = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration);
+            RoutingManager rm = new StochasticRoutingManager(backendMgr, new HaQueryHistoryManager(connectionManager.getJdbi(), false), routingConfiguration);
+            // Non-default group with all unhealthy
+            String vipGroup = "vip";
+            addActiveBackend(backendMgr, vipGroup, "vip-1");
+            addActiveBackend(backendMgr, vipGroup, "vip-2");
+            rm.updateBackEndHealth("vip-1", TrinoStatus.UNHEALTHY);
+            rm.updateBackEndHealth("vip-2", TrinoStatus.UNHEALTHY);
+            // Default group with one healthy and one unhealthy
+            addActiveBackend(backendMgr, "adhoc", "adhoc-1");
+            addActiveBackend(backendMgr, "adhoc", "adhoc-2");
+            rm.updateBackEndHealth("adhoc-1", TrinoStatus.HEALTHY);
+            rm.updateBackEndHealth("adhoc-2", TrinoStatus.UNHEALTHY);
+            ProxyBackendConfiguration selected = rm.provideBackendConfiguration(vipGroup, "user");
+            assertThat(selected.getRoutingGroup()).isEqualTo("adhoc");
+            assertThat(selected.getName()).isEqualTo("adhoc-1");
+        }
+        private static void addActiveBackend(GatewayBackendManager mgr, String group, String name)
+        {
+            ProxyBackendConfiguration backend = new ProxyBackendConfiguration();
+            backend.setActive(true);
+            backend.setRoutingGroup(group);
+            backend.setName(name);
+            backend.setProxyTo(name + ".trino.example.com");
+            backend.setExternalUrl("trino.example.com");
+            mgr.addBackend(backend);
+        }
+    }
