-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(prometheus): expose controlplane connectivity state as a gauge #14020
base: master
Are you sure you want to change the base?
Conversation
c2d6278
to
697c1e6
Compare
697c1e6
to
de4a868
Compare
Add a new Prometheus gauge metric `control_plane_reachable`. Similar to `datastore_reachable` gauge, 0 means the connection is not healthy; 1 means that the connection is healthy. We mark the connection as unhealthy under the following circumstances: * Failure while establihing a websocket connection * Failure while sending basic information to controlplane * Failure while sending ping to controlplane * Failure while receiving a packet from the websocket connection This is helpful for users running a signficant number of gateways to be alerted about potential issues any gateway(s) may be facing while talking to the controlplane. Signed-off-by: Sanskar Jaiswal <[email protected]>
…viour Signed-off-by: Sanskar Jaiswal <[email protected]>
Signed-off-by: Sanskar Jaiswal <[email protected]>
de4a868
to
ee922f2
Compare
ngx_log(ngx_INFO, _log_prefix, "sent ping frame to control plane with hash: ", hash, log_suffix) | ||
else | ||
set_control_plane_connected(true) | ||
if hash ~= prev_hash then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep the code as is, I suggest moving the comment string on line 155 so that it continues to point to the correct context
if hash ~= prev_hash then | |
-- only log a ping if the hash changed | |
if hash ~= prev_hash then |
local function set_control_plane_connected(reachable) | ||
local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable) | ||
if not ok then | ||
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to " .. reachable .. " :", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not use string concatenation in ngx.log()
(docs)
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to " .. reachable .. " :", err) | |
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to ", reachable, " :", err) |
@@ -74,11 +74,20 @@ function _M.new(clustering) | |||
end | |||
|
|||
|
|||
local function set_control_plane_connected(reachable) | |||
local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_control_plane_connected()
is called in a lot of places. There are a couple risks to consider in this implementation:
- Since we can't reliably test all of the code paths that lead to a
set_control_plane_connected(false)
call, and it would be easy for a future maintainer to refactor/modify the code and forget to callset_control_plane_connected(false)
in some crucial path. - An unhandled exception (easy to accidentally introduce these as well) could cause a call to
set_control_plane_connected(false)
could be missed.
Either of these could lead to a scenario where the gauge is set to 1
even though control plane connectivity is unhealthy.
I think we should consider using a TTL ("exptime" in the docs) for this value. This will have the effect of causing the metric to become 0
if set_control_plane_connected(true)
has not been called within the last $TTL seconds, because the value has expired. The value of the TTL should probably be derived from PING_INTERVAL or PING_WAIT.
@@ -65,6 +66,14 @@ local function init() | |||
"0 is unreachable", | |||
nil, | |||
prometheus.LOCAL_STORAGE) | |||
if role == "data_plane" then | |||
metrics.cp_connected = prometheus:gauge("control_plane_connected", | |||
"Kong connected to controlplane, " .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Kong connected to controlplane, " .. | |
"Kong connected to control plane, " .. |
local body = get_metrics() | ||
assert.matches('kong_control_plane_connected 1', body, nil, true) | ||
|
||
helpers.stop_kong("prom_cp") | ||
local body = get_metrics() | ||
assert.matches('kong_control_plane_connected 0', body, nil, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my eyes, these look like eventual conditions (i.e. get_metrics()
may not return the expected value on the first call). You should retry them.
local body = get_metrics() | |
assert.matches('kong_control_plane_connected 1', body, nil, true) | |
helpers.stop_kong("prom_cp") | |
local body = get_metrics() | |
assert.matches('kong_control_plane_connected 0', body, nil, true) | |
assert.eventually(function() | |
local body = get_metrics() | |
assert.matches('kong_control_plane_connected 1', body, nil, true) | |
end).has_no_error("metric kong_control_plane_connected => 1") | |
helpers.stop_kong("prom_cp") | |
assert.eventually(function() | |
local body = get_metrics() | |
assert.matches('kong_control_plane_connected 0', body, nil, true) | |
end).has_no_error("metric kong_control_plane_connected => 0") |
local body = get_metrics() | ||
assert.matches('kong_control_plane_connected 1', body, nil, true) | ||
|
||
helpers.stop_kong("prom_cp") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: this line of code may be skipped if the prior assertion fails, which could leak the prom_cp
instance and potentially interfere with other running tests. You should always ensure that it is stopped by code within your teardown()
function. This may or may not require tracking whether or not you reached the helpers.stop_kong("prom_cp")
call in the test code--I'm not sure.
helpers.stop_kong("prom_dp") | ||
end) | ||
|
||
it("exposes controlplane connectivity status", function () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idea: instead of starting prom_cp
in setup()
, what if you defer startup until it("exposes control plane connectivity status")
:
- assert
kong_control_plane_connected 0
- start
prom_cp
- assert
kong_control_plane_connected 1
- stop
prom_cp
- assert
kong_control_plane_connected 0
This would be a trivial change to make, but it gives us more test coverage.
helpers.stop_kong("prom_dp") | ||
end) | ||
|
||
it("exposes controlplane connectivity status", function () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it("exposes controlplane connectivity status", function () | |
it("exposes control plane connectivity status", function () |
Summary
Add a new Prometheus gauge metric
control_plane_connected
. Similar todatastore_reachable
gauge, 0 means the connection is not healthy; 1 means that the connection is healthy. We mark the connection as unhealthy under the following circumstances:This is helpful for users running a signficant number of gateways to be alerted about potential issues any gateway(s) may be facing while talking to the controlplane.
Checklist
changelog/unreleased/kong
orskip-changelog
label added on PR if changelog is unnecessary. README.mdIssue reference
Fix #[issue number]