Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(prometheus): expose controlplane connectivity state as a gauge #14020

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

aryan9600
Copy link
Member

@aryan9600 aryan9600 commented Dec 13, 2024

Summary

Add a new Prometheus gauge metric control_plane_connected. Similar to datastore_reachable gauge, 0 means the connection is not healthy; 1 means that the connection is healthy. We mark the connection as unhealthy under the following circumstances:

  • Failure while establihing a websocket connection
  • Failure while sending basic information to controlplane
  • Failure while sending ping to controlplane
  • Failure while receiving a packet from the websocket connection

This is helpful for users running a signficant number of gateways to be alerted about potential issues any gateway(s) may be facing while talking to the controlplane.

Checklist

  • The Pull Request has tests
  • A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

Issue reference

Fix #[issue number]

@github-actions github-actions bot added core/clustering plugins/prometheus cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee labels Dec 13, 2024
@aryan9600 aryan9600 force-pushed the cp-conn-prom-metric branch 3 times, most recently from c2d6278 to 697c1e6 Compare December 23, 2024 13:32
@pull-request-size pull-request-size bot added size/L and removed size/M labels Dec 23, 2024
@aryan9600 aryan9600 marked this pull request as ready for review December 23, 2024 16:21
@aryan9600 aryan9600 force-pushed the cp-conn-prom-metric branch from 697c1e6 to de4a868 Compare January 2, 2025 11:08
@RobSerafini RobSerafini requested review from gszr and flrgh January 7, 2025 19:22
Add a new Prometheus gauge metric `control_plane_reachable`. Similar to
`datastore_reachable` gauge, 0 means the connection is not healthy; 1
means that the connection is healthy. We mark the connection as
unhealthy under the following circumstances:
* Failure while establihing a websocket connection
* Failure while sending basic information to controlplane
* Failure while sending ping to controlplane
* Failure while receiving a packet from the websocket connection

This is helpful for users running a signficant number of gateways to be
alerted about potential issues any gateway(s) may be facing while
talking to the controlplane.

Signed-off-by: Sanskar Jaiswal <[email protected]>
@aryan9600 aryan9600 force-pushed the cp-conn-prom-metric branch from de4a868 to ee922f2 Compare January 10, 2025 10:24
ngx_log(ngx_INFO, _log_prefix, "sent ping frame to control plane with hash: ", hash, log_suffix)
else
set_control_plane_connected(true)
if hash ~= prev_hash then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep the code as is, I suggest moving the comment string on line 155 so that it continues to point to the correct context

Suggested change
if hash ~= prev_hash then
-- only log a ping if the hash changed
if hash ~= prev_hash then

local function set_control_plane_connected(reachable)
local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable)
if not ok then
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to " .. reachable .. " :", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use string concatenation in ngx.log() (docs)

Suggested change
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to " .. reachable .. " :", err)
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to ", reachable, " :", err)

@@ -74,11 +74,20 @@ function _M.new(clustering)
end


local function set_control_plane_connected(reachable)
local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable)
Copy link
Contributor

@flrgh flrgh Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_control_plane_connected() is called in a lot of places. There are a couple risks to consider in this implementation:

  1. Since we can't reliably test all of the code paths that lead to a set_control_plane_connected(false) call, and it would be easy for a future maintainer to refactor/modify the code and forget to call set_control_plane_connected(false) in some crucial path.
  2. An unhandled exception (easy to accidentally introduce these as well) could cause a call to set_control_plane_connected(false) could be missed.

Either of these could lead to a scenario where the gauge is set to 1 even though control plane connectivity is unhealthy.

I think we should consider using a TTL ("exptime" in the docs) for this value. This will have the effect of causing the metric to become 0 if set_control_plane_connected(true) has not been called within the last $TTL seconds, because the value has expired. The value of the TTL should probably be derived from PING_INTERVAL or PING_WAIT.

@@ -65,6 +66,14 @@ local function init()
"0 is unreachable",
nil,
prometheus.LOCAL_STORAGE)
if role == "data_plane" then
metrics.cp_connected = prometheus:gauge("control_plane_connected",
"Kong connected to controlplane, " ..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Kong connected to controlplane, " ..
"Kong connected to control plane, " ..

Comment on lines +591 to +596
local body = get_metrics()
assert.matches('kong_control_plane_connected 1', body, nil, true)

helpers.stop_kong("prom_cp")
local body = get_metrics()
assert.matches('kong_control_plane_connected 0', body, nil, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my eyes, these look like eventual conditions (i.e. get_metrics() may not return the expected value on the first call). You should retry them.

Suggested change
local body = get_metrics()
assert.matches('kong_control_plane_connected 1', body, nil, true)
helpers.stop_kong("prom_cp")
local body = get_metrics()
assert.matches('kong_control_plane_connected 0', body, nil, true)
assert.eventually(function()
local body = get_metrics()
assert.matches('kong_control_plane_connected 1', body, nil, true)
end).has_no_error("metric kong_control_plane_connected => 1")
helpers.stop_kong("prom_cp")
assert.eventually(function()
local body = get_metrics()
assert.matches('kong_control_plane_connected 0', body, nil, true)
end).has_no_error("metric kong_control_plane_connected => 0")

local body = get_metrics()
assert.matches('kong_control_plane_connected 1', body, nil, true)

helpers.stop_kong("prom_cp")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this line of code may be skipped if the prior assertion fails, which could leak the prom_cp instance and potentially interfere with other running tests. You should always ensure that it is stopped by code within your teardown() function. This may or may not require tracking whether or not you reached the helpers.stop_kong("prom_cp") call in the test code--I'm not sure.

helpers.stop_kong("prom_dp")
end)

it("exposes controlplane connectivity status", function ()
Copy link
Contributor

@flrgh flrgh Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea: instead of starting prom_cp in setup(), what if you defer startup until it("exposes control plane connectivity status"):

  1. assert kong_control_plane_connected 0
  2. start prom_cp
  3. assert kong_control_plane_connected 1
  4. stop prom_cp
  5. assert kong_control_plane_connected 0

This would be a trivial change to make, but it gives us more test coverage.

helpers.stop_kong("prom_dp")
end)

it("exposes controlplane connectivity status", function ()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it("exposes controlplane connectivity status", function ()
it("exposes control plane connectivity status", function ()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick kong-ee schedule this PR for cherry-picking to kong/kong-ee core/clustering plugins/prometheus size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants