This is a Zeek plugin which will track performance and health metrics in real-time, and run a web server which can be scraped with Prometheus.
$ zkg install zeek-exporter
The scripts will assign each node a unique port. By default, it will bind to all addresses (0.0.0.0).
A Prometheus scrape target example for file discovery is available in prometheus/target.yml.
The easiest way to get the list of ports is via zeekctl
:
$ zeekctl print Exporter::bind_port
zeek-logger Exporter::bind_port = 9101/tcp
zeek-manager Exporter::bind_port = 9102/tcp
zeek-proxy Exporter::bind_port = 9103/tcp
zeek-worker-1 Exporter::bind_port = 9104/tcp
zeek-worker-2 Exporter::bind_port = 9105/tcp
zeek-worker-3 Exporter::bind_port = 9106/tcp
zeek-worker-4 Exporter::bind_port = 9107/tcp
Scrapes can occur as often as you need, and you can track how long scrapes take in the exposer_request_latencies
metric.
Our scrapes take about 0.2 seconds, and we scrape every 10 seconds, to detect some micro-bursts.
The cost of frequent scrapes is disk space on the Prometheus system, and increased memory/computation when generating the graphs.
A Grafana dashboard is included in prometheus/dashboard.json.
To get the necessary visibility, each node in a Zeek cluster will run this plugin and a Prometheus exporter.
Metrics are tracked individually for each node.
The plugin uses some of the available hooks to inspect function calls, log writes, and even itself and other plugin hooks.
The following are instrumented at a high granularity:
- Scripts,
- Built In Functions (BIFs),
- Plugins implemented via hooks
While we have total CPU run-time, we don't have fine-grained visibility into:
- The core I/O loop,
- Protocol and file analyzers,
Don't know, but you can measure it!
On ESnet systems, the impact ranges from 0.05% to 15%. Workers are impacted most heavily, and the impact is determined by the volume of event and function calls. The function call hook is the most expensive, adding about 2 microseconds to every function call. However, on a very busy system, with > 100k calls/second, this will add up to 200ms each second.
For increased visibility into some functions, simply having the function name isn't enough. For instance, for SumStats,
we'd like to be able to measure the cost of each SumStat (detect-sqli-victims
versus detect-ssh-bruteforcing
). To do this,
we augment the metrics with additional labels for Prometheus.
There's a Zeek option available, Exporter::addl_functions
which allows you to define which events and functions to add labels for.
There are two labels available, arg
and addl
.
This supports the case of, for instance, for the unknown_protocol
weird, grabbing the addl
value telling you which protocol was unknown.
For more information, see the Zeek script documentation.
-
zeek_start_time_seconds
The epoch timestamp of when the process was started. Used to detect periodic crashes. -
zeek_total_cpu_time_seconds
The total amount of CPU time spent in this process. This uses the standard libraryclock()
function call, which returns an approximation. This can give an idea of how much "headroom" there is before more resources are needed.Note: The logger node runs multiple threads, which will result in an inaccurate count in most cases, as they get scheduled on different CPUs and execute in parallel.
-
zeek_function_calls_total
The number of times Zeek functions were called, by function and parent function. This can be used to identify the most common execution path in scripts. -
zeek_hooks_total
The number of times Zeek plugin hooks were called. This can be used to identify the most common execution paths in plugins. -
zeek_log_writes_total
The number of log writes per log, writer and filter. This is mainly used to understand the network traffic profile.
-
zeek_cpu_time_per_function_type_seconds
The amount of time spent in Zeek functions, by type. Types are Built In Functions (BIFs), and the three script-land function types: events, hooks, and functions. Note that script hooks are different from plugin hooks. -
zeek_hook_cpu_time_seconds
The amount of time spent in Zeek plugin hooks, by hook name.
There are two very similar metrics, one which includes execution time in child functions, and one which doesn't.
-
zeek_cpu_time_per_function_seconds
The amount of time spent in Zeek functions, by function and parent function. Note that this includes the time any child functions take to execute, and as such will count those executions multiple times when summed. -
zeek_absolute_cpu_time_per_function_seconds
The "absolute" amount of time spent in Zeek functions. These metrics do not include the time spent in child functions, and thus will give valid data when summed.