Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(prometheus): expose controlplane connectivity state as a gauge #14020

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
message: |
**Prometheus**: Added gauge to expose connectivity state to controlplane.
type: feature
scope: Plugin
25 changes: 22 additions & 3 deletions kong/clustering/data_plane.lua
Original file line number Diff line number Diff line change
Expand Up @@ -74,11 +74,20 @@ function _M.new(clustering)
end


local function set_control_plane_connected(reachable)
local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable)
Copy link
Contributor

@flrgh flrgh Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_control_plane_connected() is called in a lot of places. There are a couple risks to consider in this implementation:

  1. Since we can't reliably test all of the code paths that lead to a set_control_plane_connected(false) call, and it would be easy for a future maintainer to refactor/modify the code and forget to call set_control_plane_connected(false) in some crucial path.
  2. An unhandled exception (easy to accidentally introduce these as well) could cause a call to set_control_plane_connected(false) could be missed.

Either of these could lead to a scenario where the gauge is set to 1 even though control plane connectivity is unhealthy.

I think we should consider using a TTL ("exptime" in the docs) for this value. This will have the effect of causing the metric to become 0 if set_control_plane_connected(true) has not been called within the last $TTL seconds, because the value has expired. The value of the TTL should probably be derived from PING_INTERVAL or PING_WAIT.

if not ok then
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to " .. reachable .. " :", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use string concatenation in ngx.log() (docs)

Suggested change
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to " .. reachable .. " :", err)
ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to ", reachable, " :", err)

end
end


function _M:init_worker(basic_info)
-- ROLE = "data_plane"

self.plugins_list = basic_info.plugins
self.filters = basic_info.filters
set_control_plane_connected(false)

local function start_communicate()
assert(ngx.timer.at(0, function(premature)
Expand Down Expand Up @@ -139,13 +148,17 @@ local function send_ping(c, log_suffix)

local _, err = c:send_ping(hash)
if err then
set_control_plane_connected(false)
ngx_log(is_timeout(err) and ngx_NOTICE or ngx_WARN, _log_prefix,
"unable to send ping frame to control plane: ", err, log_suffix)

-- only log a ping if the hash changed
elseif hash ~= prev_hash then
prev_hash = hash
ngx_log(ngx_INFO, _log_prefix, "sent ping frame to control plane with hash: ", hash, log_suffix)
else
set_control_plane_connected(true)
if hash ~= prev_hash then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep the code as is, I suggest moving the comment string on line 155 so that it continues to point to the correct context

Suggested change
if hash ~= prev_hash then
-- only log a ping if the hash changed
if hash ~= prev_hash then

prev_hash = hash
ngx_log(ngx_INFO, _log_prefix, "sent ping frame to control plane with hash: ", hash, log_suffix)
end
end
end

Expand Down Expand Up @@ -197,6 +210,7 @@ function _M:communicate(premature)

local c, uri, err = clustering_utils.connect_cp(self, "/v1/outlet")
if not c then
set_control_plane_connected(false)
ngx_log(ngx_WARN, _log_prefix, "connection to control plane ", uri, " broken: ", err,
" (retrying after ", reconnection_delay, " seconds)", log_suffix)

Expand Down Expand Up @@ -229,6 +243,7 @@ function _M:communicate(premature)
filters = self.filters,
labels = labels, }))
if err then
set_control_plane_connected(false)
ngx_log(ngx_ERR, _log_prefix, "unable to send basic information to control plane: ", uri,
" err: ", err, " (retrying after ", reconnection_delay, " seconds)", log_suffix)

Expand All @@ -238,6 +253,7 @@ function _M:communicate(premature)
end))
return
end
set_control_plane_connected(true)

local config_semaphore = semaphore.new(0)

Expand Down Expand Up @@ -344,16 +360,19 @@ function _M:communicate(premature)
local data, typ, err = c:recv_frame()
if err then
if not is_timeout(err) then
set_control_plane_connected(false)
return nil, "error while receiving frame from control plane: " .. err
end

local waited = ngx_time() - last_seen
if waited > PING_WAIT then
set_control_plane_connected(false)
return nil, "did not receive pong frame from control plane within " .. PING_WAIT .. " seconds"
end

goto continue
end
set_control_plane_connected(true)

if typ == "close" then
ngx_log(ngx_DEBUG, _log_prefix, "received close frame from control plane", log_suffix)
Expand Down
18 changes: 18 additions & 0 deletions kong/plugins/prometheus/exporter.lua
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ local lower = string.lower
local ngx_timer_pending_count = ngx.timer.pending_count
local ngx_timer_running_count = ngx.timer.running_count
local get_all_upstreams = balancer.get_all_upstreams

if not balancer.get_all_upstreams then -- API changed since after Kong 2.5
get_all_upstreams = require("kong.runloop.balancer.upstreams").get_all_upstreams
end
Expand Down Expand Up @@ -65,6 +66,14 @@ local function init()
"0 is unreachable",
nil,
prometheus.LOCAL_STORAGE)
if role == "data_plane" then
metrics.cp_connected = prometheus:gauge("control_plane_connected",
"Kong connected to controlplane, " ..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Kong connected to controlplane, " ..
"Kong connected to control plane, " ..

"0 is unconnected",
nil,
prometheus.LOCAL_STORAGE)
end

metrics.node_info = prometheus:gauge("node_info",
"Kong Node metadata information",
{"node_id", "version"},
Expand Down Expand Up @@ -449,6 +458,15 @@ local function metric_data(write_fn)
kong.log.err("prometheus: failed to reach database while processing",
"/metrics endpoint: ", err)
end

if role == "data_plane" then
local cp_reachable = ngx.shared.kong:get("control_plane_connected")
if cp_reachable then
metrics.cp_connected:set(1)
else
metrics.cp_connected:set(0)
end
end
end

local phase = get_phase()
Expand Down
67 changes: 67 additions & 0 deletions spec/03-plugins/26-prometheus/04-status_api_spec.lua
Original file line number Diff line number Diff line change
Expand Up @@ -529,3 +529,70 @@ describe("Plugin: prometheus (access) granular metrics switch", function()

end)
end

describe("CP/DP connectivity state #", function ()
local status_client

local function get_metrics()
if not status_client then
status_client = helpers.http_client("127.0.0.1", tcp_status_port, 20000)
status_client.reopen = true -- retry on a closed connection
end

local res, err = status_client:get("/metrics")

assert.is_nil(err, "failed GET /metrics: " .. tostring(err))
return assert.res_status(200, res)
end

setup(function()
local bp = helpers.get_db_utils()

bp.plugins:insert {
protocols = { "http", "https", "grpc", "grpcs", "tcp", "tls" },
name = "prometheus",
}

assert(helpers.start_kong({
role = "control_plane",
prefix = "prom_cp",
cluster_cert = "spec/fixtures/kong_clustering.crt",
cluster_cert_key = "spec/fixtures/kong_clustering.key",
cluster_listen = "127.0.0.1:9005",
plugins = "bundled, prometheus",
}))

assert(helpers.start_kong({
role = "data_plane",
database = "off",
prefix = "prom_dp",
cluster_cert = "spec/fixtures/kong_clustering.crt",
cluster_cert_key = "spec/fixtures/kong_clustering.key",
cluster_control_plane = "127.0.0.1:9005",
proxy_listen = "0.0.0.0:9000",
worker_state_update_frequency = 1,
status_listen = "0.0.0.0:" .. tcp_status_port,
nginx_worker_processes = 1,
dedicated_config_processing = "on",
plugins = "bundled, prometheus",
}))
status_client = helpers.http_client("127.0.0.1", tcp_status_port, 20000)
end)

teardown(function()
if status_client then
status_client:close()
end

helpers.stop_kong("prom_dp")
end)

it("exposes controlplane connectivity status", function ()
Copy link
Contributor

@flrgh flrgh Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea: instead of starting prom_cp in setup(), what if you defer startup until it("exposes control plane connectivity status"):

  1. assert kong_control_plane_connected 0
  2. start prom_cp
  3. assert kong_control_plane_connected 1
  4. stop prom_cp
  5. assert kong_control_plane_connected 0

This would be a trivial change to make, but it gives us more test coverage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it("exposes controlplane connectivity status", function ()
it("exposes control plane connectivity status", function ()

local body = get_metrics()
assert.matches('kong_control_plane_connected 1', body, nil, true)

helpers.stop_kong("prom_cp")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this line of code may be skipped if the prior assertion fails, which could leak the prom_cp instance and potentially interfere with other running tests. You should always ensure that it is stopped by code within your teardown() function. This may or may not require tracking whether or not you reached the helpers.stop_kong("prom_cp") call in the test code--I'm not sure.

local body = get_metrics()
assert.matches('kong_control_plane_connected 0', body, nil, true)
Comment on lines +591 to +596
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my eyes, these look like eventual conditions (i.e. get_metrics() may not return the expected value on the first call). You should retry them.

Suggested change
local body = get_metrics()
assert.matches('kong_control_plane_connected 1', body, nil, true)
helpers.stop_kong("prom_cp")
local body = get_metrics()
assert.matches('kong_control_plane_connected 0', body, nil, true)
assert.eventually(function()
local body = get_metrics()
assert.matches('kong_control_plane_connected 1', body, nil, true)
end).has_no_error("metric kong_control_plane_connected => 1")
helpers.stop_kong("prom_cp")
assert.eventually(function()
local body = get_metrics()
assert.matches('kong_control_plane_connected 0', body, nil, true)
end).has_no_error("metric kong_control_plane_connected => 0")

end)
end)
Loading