Add PENDING type to healthchecks #360

andythsu · 2024-05-24T21:12:01Z

Description

Resolves #222 part 1

Additional context and related issues

Release notes

(X) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java

rdsarvar

some small nitpicks, mostly around use of a 'healthy' variable that has > 2 states

rdsarvar · 2024-05-29T19:15:53Z

gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java

@@ -46,7 +47,7 @@ static class LocalStats
    {
        private int runningQueryCount;
        private int queuedQueryCount;
-        private boolean healthy;
+        private TrinoHealthStateType healthy;


question: what are your thoughts on having this be healthState instead of healthy?

as well as respective places where healthy is used

I agree. healthy kinda has a binary implication

rdsarvar · 2024-05-29T19:17:50Z

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/HealthChecker.java

@@ -34,7 +34,7 @@ public HealthChecker(Notifier notifier)
    public void observe(List<ClusterStats> clustersStats)
    {
        for (ClusterStats clusterStats : clustersStats) {
-            if (!clusterStats.healthy()) {
+            if (clusterStats.healthy() == TrinoHealthStateType.UNHEALTHY) {


nitpick: feels weird to read does healthy() = UNHEALTHY ?, would healthState() == UNHEALTHY read better?

or maybe just clusterStats.Health() ?

Sure, open to whichever naming - seeing the ‘y’ suffix (to me) on healthy implies boolean

I like healthState() == UNHEALTHY

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java

Chaho12

How about renaming TrinoHealthStateType to TrinoHealthType? I find it quite redundant to say healthState, as health is already the state of some status.

Chaho12 · 2024-05-30T00:38:16Z

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/HealthChecker.java

@@ -34,7 +34,7 @@ public HealthChecker(Notifier notifier)
    public void observe(List<ClusterStats> clustersStats)
    {
        for (ClusterStats clusterStats : clustersStats) {
-            if (!clusterStats.healthy()) {
+            if (clusterStats.healthy() == TrinoHealthStateType.UNHEALTHY) {


or maybe just clusterStats.Health() ?

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStatsInfoApiMonitor.java

Chaho12

Please resolve conversations after the fixes have been made.
It makes the PR easier to review (we know that the fix has been made for the comment).

LGTM 👍

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java

gateway-ha/src/main/java/io/trino/gateway/baseapp/BaseApp.java

gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java

gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java

gateway-ha/src/test/java/io/trino/gateway/ha/router/TestStochasticRoutingManager.java

gateway-ha/src/main/java/io/trino/gateway/baseapp/BaseApp.java

willmostly · 2024-06-06T15:09:35Z

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java

+/**
+ * PENDING is for ui/observability purpose and functionally it's unhealthy
+ * We should use PENDING when Trino clusters are still spinning up
+ * HEALTHY is when health checks report clusters as up
+ * UNHEALTHY is when health checks report clusters as down
+ */


this should be added to the docs. As to placement I think a section on health checks should be added, and linked to from the routing logic and operation sections. Wdyt @mosabua ?

Where do we want to put this in now that we have new doc?

Agreed with @willmostly .. probably should go into the routing rules page and have a separate section about the Trino cluster status .. and that can then contain this info

andythsu · 2024-06-06T15:20:07Z

@willmostly

TestGatewayMultipleBackend uses the TrinoContainer from TestContainers for trino1 and trino2, which does not finish its startup() until SELECT 1 returns. So the health check should succeed. customBackend is a MockWebServer so you will need to add an endpoint to satisfy the healthcheck. I do not believe you should need to set health status manually through injection

Sorry for the confusion.

In TestGatewayMultipleBackend and TestGatewaySingleBackend, the clusters are added by calling the post api

The default health state when clusters are first added to the gateway is PENDING (and should be). Because PENDING is functionally treated as unhealthy, the test cases will fail (for example,

trino-gateway/gateway-ha/src/test/java/io/trino/gateway/ha/TestGatewayHaSingleBackend.java

Lines 65 to 70 in d45c64f

    
           Request request =
                   new Request.Builder()
                           .url("http://localhost:" + routerPort + "/v1/statement")
                           .addHeader("X-Trino-User", "test")
                           .post(requestBody)
                           .build();

) since all clusters' states are still in PENDING when the test cases run. Unless we wait until the first round of healthcheck kicks in and changes the states from PENDING to HEALTHY, no clusters are available.
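
The behaviour described above, where a freshly added cluster starts as PENDING and only HEALTHY clusters can receive queries, can be sketched roughly as follows. The three enum values come from this PR; `HealthStateSketch`, `Backend`, and `routable` are hypothetical names used for illustration only and are not the gateway's actual classes or methods.

```java
import java.util.List;

public class HealthStateSketch
{
    // Values as described in the PR's Javadoc for TrinoHealthStateType
    public enum TrinoHealthStateType { PENDING, HEALTHY, UNHEALTHY }

    // Hypothetical stand-in for a backend registered with the gateway
    public record Backend(String name, TrinoHealthStateType healthState) {}

    // Only HEALTHY backends are eligible for routing;
    // PENDING is functionally treated the same as UNHEALTHY.
    public static List<Backend> routable(List<Backend> backends)
    {
        return backends.stream()
                .filter(backend -> backend.healthState() == TrinoHealthStateType.HEALTHY)
                .toList();
    }
}
```

Under this assumption, a test that sends a query right after adding clusters sees an empty list of eligible backends until the first health check flips a state from PENDING to HEALTHY.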

cla-bot · 2024-09-25T13:45:02Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Andy Su (Apps).
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

.

vishalya

I think this is good to be merged. Does anyone have any further comments? We can wait for a day or two and merge it.

rdsarvar

one potential change to the new test modifications, otherwise lg2m as a first step

rdsarvar · 2024-09-25T18:38:28Z

gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java

+        while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (lastExecutionTime - startTime) < timeout) {
+            // check the state of newly added cluster every second
+            if (System.currentTimeMillis() - lastExecutionTime <= 1000) {
+                continue;


question: should we be adding some form of yield here so the while loop isn't burning through CPU cycles? maybe a 100ms sleep or something

as in Thread.sleep(100);? In that case do we still need this while loop?

+1 you should refactor more like:

    int timeout = 10 * 1000;
    while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (System.currentTimeMillis() - startTime) < timeout) {
        // do whatever logic
        Thread.sleep(100);
    }

you'd still need the while loop right, as you want to check the cluster healthy every 1 second or so?

you could update this to be:

    if (System.currentTimeMillis() - lastExecutionTime <= 1000) {
        Thread.sleep(System.currentTimeMillis() - lastExecutionTime);
    }

Or later on for the if (response.isSuccessful()) { call you could have an else Thread.sleep(1000) and remove this current if-condition
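
A minimal sketch of the sleep-based polling suggested in this thread, assuming the cluster state can be read through a supplier. `HealthPollSketch`, `awaitHealthy`, and its parameters are hypothetical names, and the enum is redeclared locally so the example is self-contained; the actual helper in HaGatewayTestUtils may look different.

```java
import java.util.function.Supplier;

public class HealthPollSketch
{
    public enum TrinoHealthStateType { PENDING, HEALTHY, UNHEALTHY }

    // Polls the supplier until it reports HEALTHY or the timeout elapses,
    // sleeping between attempts instead of spinning on the CPU.
    public static TrinoHealthStateType awaitHealthy(
            Supplier<TrinoHealthStateType> stateSupplier,
            long timeoutMillis,
            long pollIntervalMillis)
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        TrinoHealthStateType state = stateSupplier.get();
        while (state != TrinoHealthStateType.HEALTHY && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(pollIntervalMillis);
            }
            catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting early
                Thread.currentThread().interrupt();
                return state;
            }
            state = stateSupplier.get();
        }
        return state;
    }
}
```

The caller can then assert on the returned state, which keeps the wait loop and the assertion separate.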

xkrogen · 2024-09-30T17:51:26Z

gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java

@@ -86,8 +95,13 @@ public Response updateEntity(
                    ProxyBackendConfiguration backend =
                            OBJECT_MAPPER.readValue(jsonPayload, ProxyBackendConfiguration.class);
                    gatewayBackendManager.updateBackend(backend);
-                    log.info("Setting up the backend %s with healthy state", backend.getName());
-                    routingManager.updateBackEndHealth(backend.getName(), backend.isActive());
+                    log.info("Turning cluster %s %s", backend.getName(), backend.isActive() ? "on" : "off");


on/off may be a little confusing, let's stick to the same terminology used in the code?

    TrinoHealthStateType healthState = backend.isActive() ? TrinoHealthStateType.PENDING : TrinoHealthStateType.UNHEALTHY;
    log.info("Marking cluster '%s' with health state %s", backend.getName(), healthState);
    routingManager.updateBackEndHealth(backend.getName(), healthState);
    ...

PENDING is more like an internal state that's managed by healthcheck. In this case since it's turned on/off by the UI I think it's fine here. Also, if we use the healthState, it's going to say "Marking cluster cluster_a with health state PENDING", which is a bit weird IMO

I agree with @xkrogen .. the error message is misleading .. the cluster is not shut down or anything .. it's just flagged with a specific status from the point of view of the Trino Gateway

what about "Marking the cluster active/inactive"?

xkrogen · 2024-09-30T17:58:19Z

gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java

+        while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (lastExecutionTime - startTime) < timeout) {
+            // check the state of newly added cluster every second
+            if (System.currentTimeMillis() - lastExecutionTime <= 1000) {
+                continue;


+1 you should refactor more like:

    int timeout = 10 * 1000;
    while (newClusterHealthState != TrinoHealthStateType.HEALTHY && (System.currentTimeMillis() - startTime) < timeout) {
        // do whatever logic
        Thread.sleep(100);
    }

rdsarvar

lg2m

mosabua · 2024-10-03T20:34:01Z

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java

+/**
+ * PENDING is for ui/observability purpose and functionally it's unhealthy
+ * We should use PENDING when Trino clusters are still spinning up
+ * HEALTHY is when health checks report clusters as up
+ * UNHEALTHY is when health checks report clusters as down
+ */


Agreed with @willmostly .. probably should go into the routing rules page and have a separate section about the Trino cluster status .. and that can then contain this info

mosabua · 2024-10-03T20:37:23Z

gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java

@@ -86,8 +95,13 @@ public Response updateEntity(
                    ProxyBackendConfiguration backend =
                            OBJECT_MAPPER.readValue(jsonPayload, ProxyBackendConfiguration.class);
                    gatewayBackendManager.updateBackend(backend);
-                    log.info("Setting up the backend %s with healthy state", backend.getName());
-                    routingManager.updateBackEndHealth(backend.getName(), backend.isActive());
+                    log.info("Turning cluster %s %s", backend.getName(), backend.isActive() ? "on" : "off");


I agree with @xkrogen .. the error message is misleading .. the cluster is not shut down or anything .. it's just flagged with a specific status from the point of view of the Trino Gateway

mosabua · 2024-10-03T20:39:54Z

gateway-ha/src/test/resources/test-config-template.yml

+  monitorType: INFO_API
+
+monitor:
+  taskDelaySeconds: 1


could we look at using Airlift duration instead?

can I put this in a separate PR? This will require some code changes

Sure .. ideally soon though so it can ship in the same release .. otherwise it will be a breaking change.

This config has been there for a while now so it will be a breaking change for this or next release.

Okay .. then it should definitely be a separate change. Please file an issue for that work so we don't forget.

created #520

ebyhr · 2024-10-03T22:30:13Z

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java

+ * HEALTHY is when health checks report clusters as up
+ * UNHEALTHY is when health checks report clusters as down


How about changing HEALTHY/UNHEALTHY to UP/DOWN?

I think UP/DOWN has the implication of server being up or down. HEALTHY could be a better candidate here because it means it passed the healthcheck. On the other hand, UNHEALTHY means it failed the healthcheck, but the trino cluster is still UP

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/TrinoHealthStateType.java

gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java

ebyhr · 2024-10-03T22:35:12Z

gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java

+                Thread.sleep(1000);
+            }
+        }
+        assertThat(newClusterHealthState).isEqualTo(TrinoHealthStateType.HEALTHY);


This is a utility class that manages helper methods. I don't understand why this check & assertion exist here. Could you extract into a dedicated test?

Not sure if I follow. This check and assert are added to make sure that when setupBackend is called, the backends' states are healthy before returning to the caller.

gateway-ha/src/test/java/io/trino/gateway/ha/TestGatewayHaMultipleBackend.java

gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java

ebyhr · 2024-10-03T22:48:59Z

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStatsInfoApiMonitor.java

    {
        Request request = prepareGet()
                .setUri(uriBuilderFrom(URI.create(baseUrl)).appendPath("/v1/info").build())
                .build();
        try {
            ServerInfo serverInfo = client.execute(request, SERVER_INFO_JSON_RESPONSE_HANDLER);
-            return !serverInfo.isStarting();
+            return serverInfo.isStarting() ? TrinoHealthStateType.PENDING : TrinoHealthStateType.HEALTHY;


I don't understand why "pending" is "ready" status. It's worth leaving a code comment.

if trino is still starting, its status should be "pending", otherwise its status will be healthy
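
The mapping described in this reply can be sketched as below. The diff above only shows the success path, so treating a failed or unreachable `/v1/info` request as UNHEALTHY is an assumption here, modelled as an empty `Optional`. `InfoApiStateSketch` and `fromStartingFlag` are hypothetical names and are not part of ClusterStatsInfoApiMonitor.

```java
import java.util.Optional;

public class InfoApiStateSketch
{
    public enum TrinoHealthStateType { PENDING, HEALTHY, UNHEALTHY }

    // starting: the "starting" flag from the info response,
    // or empty if the request failed (assumption, see lead-in)
    public static TrinoHealthStateType fromStartingFlag(Optional<Boolean> starting)
    {
        if (starting.isEmpty()) {
            return TrinoHealthStateType.UNHEALTHY;
        }
        // A cluster that is still starting is PENDING, otherwise it passed the check
        return starting.get() ? TrinoHealthStateType.PENDING : TrinoHealthStateType.HEALTHY;
    }
}
```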

gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStatsInfoApiMonitor.java

gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java

gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java

gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java

docs/routing-rules.md

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java

gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java

gateway-ha/src/main/java/io/trino/gateway/ha/router/RoutingManager.java

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java

gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java

gateway-ha/src/main/java/io/trino/gateway/ha/clustermonitor/ClusterStats.java

docs/routing-rules.md

gateway-ha/src/main/java/io/trino/gateway/ha/resource/EntityEditorResource.java

gateway-ha/src/test/java/io/trino/gateway/ha/HaGatewayTestUtils.java

ebyhr · 2024-10-11T06:13:46Z

Please fix CI failures.

cla-bot bot added the cla-signed label May 24, 2024

andythsu commented May 24, 2024

rdsarvar suggested changes May 29, 2024

Chaho12 reviewed May 30, 2024

andythsu force-pushed the healthstate branch from 45ee624 to 14a2d0f Compare May 30, 2024 16:04

Chaho12 approved these changes May 31, 2024

ebyhr previously requested changes May 31, 2024

andythsu force-pushed the healthstate branch 2 times, most recently from f1a2261 to bca99f9 Compare May 31, 2024 17:27

willmostly requested changes Jun 6, 2024

andythsu force-pushed the healthstate branch from bca99f9 to 2a00b88 Compare September 20, 2024 23:18

cla-bot bot removed the cla-signed label Sep 25, 2024

andythsu force-pushed the healthstate branch from 2d6291b to a2b693a Compare September 25, 2024 13:45

cla-bot bot added the cla-signed label Sep 25, 2024

andythsu force-pushed the healthstate branch from a2b693a to fe451b3 Compare September 25, 2024 13:47

andythsu requested review from Chaho12, rdsarvar, willmostly and ebyhr September 25, 2024 13:48

andythsu force-pushed the healthstate branch from fe451b3 to 96fe7db Compare September 25, 2024 13:52

ebyhr requested review from oneonestar and removed request for ebyhr September 25, 2024 14:03

vishalya approved these changes Sep 25, 2024

rdsarvar suggested changes Sep 25, 2024

xkrogen reviewed Sep 30, 2024

andythsu force-pushed the healthstate branch 2 times, most recently from ce2d321 to 6bf8f6d Compare October 3, 2024 04:07

rdsarvar approved these changes Oct 3, 2024

mosabua requested changes Oct 3, 2024

ebyhr reviewed Oct 3, 2024

Fix log when cluster is turned on/off

4f65808

andythsu force-pushed the healthstate branch from 6bf8f6d to 66daf9e Compare October 4, 2024 20:27

ebyhr reviewed Oct 5, 2024

andythsu force-pushed the healthstate branch from 48b34d1 to ee7313c Compare October 7, 2024 22:07

ebyhr reviewed Oct 7, 2024

andythsu force-pushed the healthstate branch from ee7313c to e960471 Compare October 7, 2024 23:14

ebyhr reviewed Oct 7, 2024

andythsu force-pushed the healthstate branch 2 times, most recently from 2b76c56 to 1bb4748 Compare October 8, 2024 03:16

ebyhr approved these changes Oct 11, 2024

andythsu force-pushed the healthstate branch from 1bb4748 to c54e8f2 Compare October 11, 2024 06:06

andythsu force-pushed the healthstate branch from c54e8f2 to 52fc706 Compare October 11, 2024 06:39

Add PENDING type to healthchecks

b2215f0

ebyhr force-pushed the healthstate branch from 52fc706 to b2215f0 Compare October 11, 2024 11:12

ebyhr merged commit 98c8e03 into trinodb:main Oct 11, 2024
3 checks passed

github-actions bot added this to the 12 milestone Oct 11, 2024

This was referenced Oct 15, 2024

Add release notes for Trino Gateway 12 and related changes #473

Merged

Improve grammar and wording #525

Merged

