-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PENDING type to healthchecks #360
Conversation
gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some small nitpicks, mostly around use of a 'healthy' variable that has > 2 states
@@ -46,7 +47,7 @@ static class LocalStats | |||
{ | |||
private int runningQueryCount; | |||
private int queuedQueryCount; | |||
private boolean healthy; | |||
private TrinoHealthStateType healthy; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: what are your thoughts on having this be heathState
instead of healthy
?
as well as respective places where healthy
is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. healthy
kinda has a binary implication
@@ -34,7 +34,7 @@ public HealthChecker(Notifier notifier) | |||
public void observe(List<ClusterStats> clustersStats) | |||
{ | |||
for (ClusterStats clusterStats : clustersStats) { | |||
if (!clusterStats.healthy()) { | |||
if (clusterStats.healthy() == TrinoHealthStateType.UNHEALTHY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: feels weird to read does healthy() = UNHEALTHY ?
, would healthState() == UNHEALTHY
read better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe just clusterStats.Health()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, open to whichever naming - seeing the ‘y’ suffix (to me) on healthy implies boolean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like healthState() == UNHEALTHY
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about TrinoHealthStateType
to TrinoHealthType
? i find it quite a redundant to say healthState as health is state of some status.
@@ -34,7 +34,7 @@ public HealthChecker(Notifier notifier) | |||
public void observe(List<ClusterStats> clustersStats) | |||
{ | |||
for (ClusterStats clusterStats : clustersStats) { | |||
if (!clusterStats.healthy()) { | |||
if (clusterStats.healthy() == TrinoHealthStateType.UNHEALTHY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe just clusterStats.Health()
?
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStatsInfoApiMonitor.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please resolve conversation after fixes has been made.
It makes it easier to PR (know that fix has been made for the commend)
LGTM 👍
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/router/TestStochasticRoutingManager.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/router/TestStochasticRoutingManager.java
Outdated
Show resolved
Hide resolved
f1a2261
to
bca99f9
Compare
/** | ||
* PENDING is for ui/observability purpose and functionally it's unhealthy | ||
* We should use PENDING when Trino clusters are still spinning up | ||
* HEALTHY is when health checks report clusters as up | ||
* UNHEALTHY is when health checks report clusters as down | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be added to the docs. As to placement I think a section on health checks should be added, and linked to from the routing logic and operation sections. Wdyt @mosabua ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we want to put this in now that we have new doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with @willmostly .. probably should go into the routing rules page and have a separate section about the Trino cluster status .. and that can then contain this info
Sorry for the confusion. In The default healthstate when clusters are first added to the gateway is trino-gateway/gateway-ha/src/test/java/io/trino/gateway/ha/TestGatewayHaSingleBackend.java Lines 65 to 70 in d45c64f
PENDING when the test cases run. Unless we wait until the first round of healthcheck kicks in and changes the states from PENDING to HEALTHY , no clusters are available.
|
bca99f9
to
2a00b88
Compare
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Andy Su (Apps).
|
2d6291b
to
a2b693a
Compare
a2b693a
to
fe451b3
Compare
fe451b3
to
96fe7db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to be merged. Does anyone have any further comments? We can wait for a day or two and merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one potential change to the new test modifications, otherwise lg2m as a first step
while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (lastExecutionTime - startTime) < timeout) { | ||
// check the state of newly added cluster every second | ||
if (System.currentTimeMillis() - lastExecutionTime <= 1000) { | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: should we be adding some form of yield here so the while loop isn't burning through CPU cycles? maybe a 100ms sleep or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in Thread.sleep(100);
? In that case do we still need this while loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 you should refactor more like:
int timeout = 10 * 1000;
while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (System.currentTimeMillis() - startTime) < timeout) {
// do whatever logic
Thread.sleep(100);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you'd still need the while loop right, as you want to check the cluster healthy every 1 second or so?
you could update this to be:
if (System.currentTimeMillis() - lastExecutionTime <= 1000) {
Thread.sleep(System.currentTimeMillis() - lastExecutionTime)
}
Or later on for the if (response.isSuccessful()) {
call you could have an else Thread.sleep(1000) and remove this current if-condition
@@ -86,8 +95,13 @@ public Response updateEntity( | |||
ProxyBackendConfiguration backend = | |||
OBJECT_MAPPER.readValue(jsonPayload, ProxyBackendConfiguration.class); | |||
gatewayBackendManager.updateBackend(backend); | |||
log.info("Setting up the backend %s with healthy state", backend.getName()); | |||
routingManager.updateBackEndHealth(backend.getName(), backend.isActive()); | |||
log.info("Turning cluster %s %s", backend.getName(), backend.isActive() ? "on" : "off"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on/off may be a little confusing, let's stick to the same terminology used in the code?
TrinoHealthStateType healthState = backend.isActive() ? TrinoHealthStateType.PENDING : TrinoHealthStateType.UNHEALTHY;
log.info("Marking cluster '%s' with health state %s", backend.getName(), healthState);
routingManager.updateBackEndHealth(backend.getName(), healthState);
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PENDING is more like an internal state that's managed by healthcheck. In this case since it's turned on/off by the UI I think it's fine here. Also, if we use the healthState
, it's going to say "Marking cluster cluster_a with health state PENDING", which is a bit weird IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @xkrogen .. the error message is is misleading .. the cluster is not shut down or anything .. its just flagged with a specific status from the point of view of the Trino Gateway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about "Marking the cluster active/inactive"?
while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (lastExecutionTime - startTime) < timeout) { | ||
// check the state of newly added cluster every second | ||
if (System.currentTimeMillis() - lastExecutionTime <= 1000) { | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 you should refactor more like:
int timeout = 10 * 1000;
while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (System.currentTimeMillis() - startTime) < timeout) {
// do whatever logic
Thread.sleep(100);
}
ce2d321
to
6bf8f6d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lg2m
/** | ||
* PENDING is for ui/observability purpose and functionally it's unhealthy | ||
* We should use PENDING when Trino clusters are still spinning up | ||
* HEALTHY is when health checks report clusters as up | ||
* UNHEALTHY is when health checks report clusters as down | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with @willmostly .. probably should go into the routing rules page and have a separate section about the Trino cluster status .. and that can then contain this info
@@ -86,8 +95,13 @@ public Response updateEntity( | |||
ProxyBackendConfiguration backend = | |||
OBJECT_MAPPER.readValue(jsonPayload, ProxyBackendConfiguration.class); | |||
gatewayBackendManager.updateBackend(backend); | |||
log.info("Setting up the backend %s with healthy state", backend.getName()); | |||
routingManager.updateBackEndHealth(backend.getName(), backend.isActive()); | |||
log.info("Turning cluster %s %s", backend.getName(), backend.isActive() ? "on" : "off"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @xkrogen .. the error message is is misleading .. the cluster is not shut down or anything .. its just flagged with a specific status from the point of view of the Trino Gateway
monitorType: INFO_API | ||
|
||
monitor: | ||
taskDelaySeconds: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we look at using Airlift duration instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can I put this in a separate PR? This will require some code changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure .. ideally soon though so it can ship in the same release .. otherwise it will be a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This config has been there for a while now so it will be a breaking change for this or next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay .. then it should definitely be a separate change. Please file an issue for that work so we dont forget.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created #520
* HEALTHY is when health checks report clusters as up | ||
* UNHEALTHY is when health checks report clusters as down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing HEALTHY/UNHEALTHY to UP/DOWN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think UP/DOWN has the implication of server being up or down. HEALTHY could be a better candidate here because it means it passed the healthcheck. On the other hand, UNHEALTHY means it failed the healthcheck, but the trino cluster is still UP
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java
Outdated
Show resolved
Hide resolved
Thread.sleep(1000); | ||
} | ||
} | ||
assertThat(newClusterHealthState).isEqualTo(TrinoHealthStateType.HEALTHY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a utility class that manages helper methods. I don't understand why this check & assertion exist here. Could you extract into a dedicated test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I follow. This check and assert is added to make sure when setupBackend
is called, the backends state are healthy before returning to the caller.
gateway-ha/src/test/java/io/trino/gateway/ha/TestGatewayHaMultipleBackend.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java
Outdated
Show resolved
Hide resolved
{ | ||
Request request = prepareGet() | ||
.setUri(uriBuilderFrom(URI.create(baseUrl)).appendPath("/v1/info").build()) | ||
.build(); | ||
try { | ||
ServerInfo serverInfo = client.execute(request, SERVER_INFO_JSON_RESPONSE_HANDLER); | ||
return !serverInfo.isStarting(); | ||
return serverInfo.isStarting() ? TrinoHealthStateType.PENDING : TrinoHealthStateType.HEALTHY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why "pending" is "ready" status. It worth leaving a code comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if trino is still starting, its status should be "pending", otherwise its status will be healthy
gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStatsInfoApiMonitor.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java
Outdated
Show resolved
Hide resolved
gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java
Outdated
Show resolved
Hide resolved
2b76c56
to
1bb4748
Compare
gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java
Show resolved
Hide resolved
gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java
Outdated
Show resolved
Hide resolved
Please fix CI failures. |
Description
Resolves #222 part 1
Additional context and related issues
Release notes
(X) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:
*