Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions docs/installation.md
Original file line number Diff line number Diff line change
Expand Up @@ -557,3 +557,23 @@ a username and password using `backendState` as with the `JDBC` option.
#### NOOP

This option disables health checks.

### Best-effort routing when all backends are unhealthy

By default, routing only selects backends that are both ACTIVE and HEALTHY.
However, in environments where health checks may occasionally be flaky,
this behavior can result in “Number of active backends found zero” errors—even when
viable clusters technically exist.

In reality, if a cluster is truly unhealthy, the query will fail regardless of whether
the gateway routes to it or not. To prevent unnecessary immediate failures, you can enable
best-effort routing.

When best-effort mode is enabled, if all active backends in the routing group are marked
UNHEALTHY, the router will still choose among them as a last resort, rather than failing
the routing decision outright.

```yaml
routing:
bestEffortRouting: true
```
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,10 @@ public class RoutingConfiguration

private String defaultRoutingGroup = "adhoc";

// When true, if all active backends are unhealthy, route among active backends anyway (best-effort).
// Default is false for backward compatibility (strict: healthy-only).
private boolean bestEffortRouting;

public Duration getAsyncTimeout()
{
return asyncTimeout;
Expand Down Expand Up @@ -54,4 +58,14 @@ public void setDefaultRoutingGroup(String defaultRoutingGroup)
{
this.defaultRoutingGroup = defaultRoutingGroup;
}

public boolean isBestEffortRouting()
{
return bestEffortRouting;
}

public void setBestEffortRouting(boolean bestEffortRouting)
{
this.bestEffortRouting = bestEffortRouting;
}
Comment on lines +61 to +70
Copy link
Member

@xkrogen xkrogen Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to how we are introducing strictRouting in #804, I am wondering if this config should be a per-rule basis instead of global. We can think of each routing rule as having a varying level of "strictness", from least to most:

  • Strict: Route only to the target group
  • Default: Fall back to default group if target group is all unhealthy
  • Best Effort: Fall back to unhealthy clusters if all of target group + default group are unhealthy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @Peiyingy thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different levels of strictness makes sense. However, if we move that to routing rule definition, is won't cover the default routing case at all. Maybe we can keep this global config, and allow per-rule config to override it? WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should move it so that the default rule is specified in the rule configs along the other rules, instead of being treated separately? Kind of like resource group definitions in Trino -- where the last selector is typically a "catch all" that you fall back to if no other rules match (example). Not sure if that totally makes sense for routing rules, though -- I forget all the places that the default is used.

Global config with override makes sense to me as well, if we make it a strictnessLevel config (same as the per-rule config)

Copy link
Contributor Author

@felicity3786 felicity3786 Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can keep this global config, and allow per-rule config to override it?

Not sure I understand correctly, the best effort will not make query choose a unhealthy target within the customized group, it only apply to adhoc/default group. What do we need the override for?

If we set best effort at global level, and the rule targeted group has no healthy cluster:

  • without strictRouting, query goes to adhoc, may fail if adhocs are also unhealthy.
  • with strictRouting, query fail at target group unhealthy clusters.

if we override the best effort per this rule, I think the result is gonna be the same?

But I agree that thinking in terms of routing “strictness levels” (STRICT / DEFAULT / BEST_EFFORT) is the right long-term model, and that it maps well to how we want routing behavior to evolve.
As for moving the rule to same fallback as resource group, I don't think MVEL has any blocker on this maybe it will add one more default rule to traverse every time we inspect a rule. Maybe we can align more on this and make a separate PR if needed since that remodels the rule behavior a bit, then come back and generalize the strictness.

Copy link
Member

@Peiyingy Peiyingy Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we override the best effort per this rule, I think the result is gonna be the same?

Thanks for clarifying that. Yes that would be the same here. I guess in rules, we can still decide that when there's fail back to default, do we want to make it bestEffort or not. Say globally that's disabled, but for this rule, if it falls back to defaults, and defaults are unhealthy, we still want it to fall back to unhealthy clusters. I'm not sure if this granularity is necessary, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should move it so that the default rule is specified in the rule configs along the other rules, instead of being treated separately?

Having a implicit default routing rule makes sense to me. I don't know if it's necessary to specify that.

I think under the current design, we've already handled all possible fallbacks:

  1. If there are no available active healthy cluster for the selected routing group, fallback to ActiveDefaultBackends (ie. active healthy cluster in defaultRoutingGroup).
    -> Handled by StrictRouting

  2. If there's no If there are no available active healthy cluster for the default routing group, fallback to unhealthy ones
    -> Handled by BestEffortRouting

I agree we should unify the naming, but the current code design makes sense to me.

}
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@ public abstract class BaseRoutingManager
private final GatewayBackendManager gatewayBackendManager;
private final ConcurrentHashMap<String, TrinoStatus> backendToStatus;
private final String defaultRoutingGroup;
private final boolean bestEffortRouting;
private final QueryHistoryManager queryHistoryManager;
private final LoadingCache<String, String> queryIdBackendCache;
private final LoadingCache<String, String> queryIdRoutingGroupCache;
Expand All @@ -63,6 +64,7 @@ public BaseRoutingManager(GatewayBackendManager gatewayBackendManager, QueryHist
{
this.gatewayBackendManager = gatewayBackendManager;
this.defaultRoutingGroup = routingConfiguration.getDefaultRoutingGroup();
this.bestEffortRouting = routingConfiguration.isBestEffortRouting();
this.queryHistoryManager = queryHistoryManager;
this.queryIdBackendCache = buildCache(this::findBackendForUnknownQueryId);
this.queryIdRoutingGroupCache = buildCache(this::findRoutingGroupForUnknownQueryId);
Expand Down Expand Up @@ -92,10 +94,16 @@ public void setRoutingGroupForQueryId(String queryId, String routingGroup)
*/
public ProxyBackendConfiguration provideDefaultBackendConfiguration(String user)
{
List<ProxyBackendConfiguration> backends = gatewayBackendManager.getActiveDefaultBackends().stream()
List<ProxyBackendConfiguration> activeDefaults = gatewayBackendManager.getActiveDefaultBackends();
List<ProxyBackendConfiguration> healthyDefaults = activeDefaults.stream()
.filter(backEnd -> isBackendHealthy(backEnd.getName()))
.toList();
return selectBackend(backends, user).orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
// If no healthy defaults, optionally route among all active defaults when enabled
List<ProxyBackendConfiguration> candidates = !healthyDefaults.isEmpty()
? healthyDefaults
: (bestEffortRouting ? activeDefaults : healthyDefaults);
return selectBackend(candidates, user)
.orElseThrow(() -> new IllegalStateException("Number of active backends found zero"));
Comment on lines +97 to +106
Copy link
Member

@Peiyingy Peiyingy Dec 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned with the readability of the ternary branch. Shall we change it to something like this ⬇️ We can also benefit from different exception messages

List<ProxyBackendConfiguration> activeDefaults = gatewayBackendManager.getActiveDefaultBackends();
List<ProxyBackendConfiguration> healthyDefaults = activeDefaults.stream()
        .filter(backEnd -> isBackendHealthy(backEnd.getName()))
        .toList();

if (bestEffortRouting && healthyDefaults.isEmpty()) {
    return selectBackend(activeDefaults, user)
            .orElseThrow(() -> new IllegalStateException(
                    "No active default backend found under best-effort routing"));
}

return selectBackend(healthyDefaults, user)
        .orElseThrow(() -> new IllegalStateException(
                "No healthy default backend found"));

}

/**
Expand Down
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this new standalone test class is necessary. Shall we integrate these tests in TestStochasticRoutingManager? At least we can create some common test util methods for both of them.

Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
/*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package io.trino.gateway.ha.router;

import io.trino.gateway.ha.clustermonitor.TrinoStatus;
import io.trino.gateway.ha.config.ProxyBackendConfiguration;
import io.trino.gateway.ha.config.RoutingConfiguration;
import io.trino.gateway.ha.persistence.JdbcConnectionManager;
import org.junit.jupiter.api.Test;

import static io.trino.gateway.ha.TestingJdbcConnectionManager.createTestingJdbcConnectionManager;
import static org.assertj.core.api.Assertions.assertThat;

final class TestBestEffortRouting
{
@Test
void testBestEffortRoutingEnabledAllUnhealthy()
{
JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
RoutingConfiguration routingConfiguration = new RoutingConfiguration();
routingConfiguration.setBestEffortRouting(true);
GatewayBackendManager backendMgr = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration);
RoutingManager rm = new StochasticRoutingManager(backendMgr, new HaQueryHistoryManager(connectionManager.getJdbi(), false), routingConfiguration);

String group = "adhoc";
addActiveBackend(backendMgr, group, "trino-1");
addActiveBackend(backendMgr, group, "trino-2");

rm.updateBackEndHealth("trino-1", TrinoStatus.UNHEALTHY);
rm.updateBackEndHealth("trino-2", TrinoStatus.UNHEALTHY);

ProxyBackendConfiguration selected = rm.provideBackendConfiguration(group, "user");
assertThat(selected.getName()).isIn("trino-1", "trino-2");
assertThat(selected.getRoutingGroup()).isEqualTo(group);
}

@Test
void testFallsBackWhenAllUnhealthyInGroup()
{
JdbcConnectionManager connectionManager = createTestingJdbcConnectionManager();
RoutingConfiguration routingConfiguration = new RoutingConfiguration();
routingConfiguration.setBestEffortRouting(true);
routingConfiguration.setDefaultRoutingGroup("adhoc");
GatewayBackendManager backendMgr = new HaGatewayManager(connectionManager.getJdbi(), routingConfiguration);
RoutingManager rm = new StochasticRoutingManager(backendMgr, new HaQueryHistoryManager(connectionManager.getJdbi(), false), routingConfiguration);

// Non-default group with all unhealthy
String vipGroup = "vip";
addActiveBackend(backendMgr, vipGroup, "vip-1");
addActiveBackend(backendMgr, vipGroup, "vip-2");
rm.updateBackEndHealth("vip-1", TrinoStatus.UNHEALTHY);
rm.updateBackEndHealth("vip-2", TrinoStatus.UNHEALTHY);

// Default group with one healthy and one unhealthy
addActiveBackend(backendMgr, "adhoc", "adhoc-1");
addActiveBackend(backendMgr, "adhoc", "adhoc-2");
rm.updateBackEndHealth("adhoc-1", TrinoStatus.HEALTHY);
rm.updateBackEndHealth("adhoc-2", TrinoStatus.UNHEALTHY);

ProxyBackendConfiguration selected = rm.provideBackendConfiguration(vipGroup, "user");
assertThat(selected.getRoutingGroup()).isEqualTo("adhoc");
assertThat(selected.getName()).isEqualTo("adhoc-1");
}

private static void addActiveBackend(GatewayBackendManager mgr, String group, String name)
{
ProxyBackendConfiguration backend = new ProxyBackendConfiguration();
backend.setActive(true);
backend.setRoutingGroup(group);
backend.setName(name);
backend.setProxyTo(name + ".trino.example.com");
backend.setExternalUrl("trino.example.com");
mgr.addBackend(backend);
}
}