# Monitoring: Essential Metrics for Custom Dashboards

## Summary

The curated list of "100 Essential CockroachDB Metrics" enables operators to build custom dashboards with tools like Grafana and Datadog. All metrics in this list are important to observe, i.e. visualize in a custom dashboard. Each metric in this list "tells a story": the Narrative column explains how to make practical, actionable use of each metric in a production deployment.

A subset of these metrics should also be alerted on. The alerts are listed in the "Monitoring and Alerting" section of this runbook template.

The list is current as of CockroachDB version v23.1. The Datadog agent version that includes all essential CockroachDB metrics will be provided here later.

## Essential Monitoring Metrics

| CockroachDB 23.1 Metric Name | Metric Name in Datadog (add cockroachdb. leading prefix) | Category | Description | Narrative |
|---|---|---|---|---|
| sys.cpu.combined.percent-normalized | sys.cpu.combined.percent.normalized | Platform | Current user+system CPU percentage consumed by the CRDB process, normalized by number of cores | Provides an "at-a-glance" view into the level of CPU utilization by the CRDB node process. If it reaches 100%, the CPU is overloaded. CRDB nodes should not run with over 80% utilization for prolonged periods of time (hours). This metric is used in the DB Console's "CPU Percent" graph. |
| sys.cpu.host.combined.percent-normalized | [SUBMITTED TO DATADOG] | Platform | Current user+system CPU percentage for all processes in the OS or container, normalized by number of cores | Provides a practical "at-a-glance" view into the level of CPU utilization of the underlying server, virtual server, or container hosting the CRDB node process, including non-CRDB usage. If it reaches 100%, the CPU is overloaded. CRDB nodes should not run in this environment/state for prolonged periods of time (hours). This metric is used in the DB Console's "Host CPU Percent" graph. |
| sys.rss | sys.rss | Platform | Current process memory (RSS) | The amount of RAM occupied by the CRDB node process. Persistently low values over an extended period of time suggest there is underutilized memory that can be put to work with an adjusted --cache and/or --max-sql-memory. Conversely, high utilization, even a temporary spike, indicates an increased risk of an OOM kill by the OS (particularly since swap is generally disabled). |
| sql.mem.root.current | sql.mem.root.current | Platform | Current sql statement memory usage for root | Shows how the memory set aside for temporary materializations (hash tables, intermediary result sets, etc.) is utilized. Enables an operator to optimize memory allocations based on long-term observations. The maximum amount is set with --max-sql-memory. If the utilization of SQL memory is persistently low, perhaps some portion of this allocation can be shifted to --cache. |
| sys.host.disk.write.bytes | sys.host.disk.write.bytes | Platform | Bytes written to all disks since this process started | Reports the effective storage device write throughput (MB/s). To confirm the storage is sufficiently provisioned, assess the I/O performance (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.write.count | [CUSTOM MAPPING REQUIRED] | Platform | Disk write operations across all disks since this process started | Reports the effective storage device write IOPS rate. To confirm the storage is sufficiently provisioned, assess the I/O performance (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.read.bytes | sys.host.disk.read.bytes | Platform | Bytes read from all disks since this process started | Reports the effective storage device read throughput (MB/s). To confirm the storage is sufficiently provisioned, assess the I/O performance (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.read.count | [SUBMITTED TO DATADOG] | Platform | Disk read operations across all disks since this process started | Reports the effective storage device read IOPS rate. To confirm the storage is sufficiently provisioned, assess the I/O performance (IOPS and MBPS) in the context of the sys.host.disk.iopsinprogress metric. |
| sys.host.disk.iopsinprogress | sys.host.disk.iopsinprogress | Platform | IO operations currently in progress on this host | This is the storage device average queue length, which characterizes the storage device's performance capability. All I/O performance metrics are Linux counters and match what is reported by iostat; this metric corresponds to avgqu-sz in the iostat output. View the device queue graph in the context of the actual read/write IOPS and MBPS metrics that show the actual device utilization. If the device is not keeping up, the queue grows. Over 10 is bad; over 5 means the device is working hard, trying to keep up. For internal (on-chassis) NVMe devices the queue is typically 0. For network-connected devices, like AWS EBS volumes, the normal operating range is 1-2. Spikes are OK: they mean there was an I/O spike and the device fell behind but caught up; end-users may experience inconsistent response times, but there should be no cluster stability issues. If the queue is over 5 for an extended period of time while IOPS or MBPS are low, the storage is most likely not provisioned per Cockroach Labs guidance; in AWS EBS this is commonly an EBS type not suitable as database primary storage, like gp2. If both I/O and the queue are low, the most likely scenario is that the CPU is lacking and not driving I/O; with 2 vCPUs (not supported for production deployments) this is a very likely scenario. There are quite a few background processes in the database that take CPU away from the workload, so the workload may simply not be getting the CPU. Review this runbook template article. |
| sys.host.net.recv.bytes | sys.host.net.recv.bytes | Platform | Bytes received on all network interfaces since this process started | Monitor the node's ingress/egress network transfer rates for flat sections, which may indicate insufficiently provisioned networking or high error rates (CRDB uses a reliable TCP/IP protocol, so errors result in delivery retries that create a "slow network" effect). |
| sys.host.net.send.bytes | sys.host.net.send.bytes | Platform | Bytes sent on all network interfaces since this process started | Monitor the node's ingress/egress network transfer rates for flat sections, which may indicate insufficiently provisioned networking or high error rates (CRDB uses a reliable TCP/IP protocol, so errors result in delivery retries that create a "slow network" effect). |
| clock-offset.meannanos | clock.offset.meannanos | Platform | Mean clock offset with other nodes | In a well-configured environment the actual clock skew would be in the sub-millisecond range. A skew exceeding 5 ms is likely due to an NTP service misconfiguration. An alert may be set if the clock offset exceeds 60% of --max-offset. Reducing the actual clock skew reduces the probability of uncertainty-related conflicts and the corresponding retries, which has a positive impact on workload performance. Conversely, a larger actual clock skew increases the probability of retries due to uncertainty conflicts, with potentially measurable adverse effects on workload performance. Review this runbook template article. |
| capacity | capacity.total | Storage | Total storage capacity | Total/used/available storage capacity measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). An alert on this rule is recommended (a threshold-check sketch follows this table). |
| capacity.available | capacity.available | Storage | Available storage capacity | Total/used/available storage capacity measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). An alert on this rule is recommended. |
| capacity.used | capacity.used | Storage | Used storage capacity | Total/used/available storage capacity measurements should comply with the following rule: CockroachDB storage volumes should not be utilized more than 60% (40% free space). An alert on this rule is recommended. |
| storage.write-stalls | [SUBMITTED TO DATADOG] | Storage | Number of instances of intentional write stalls to backpressure incoming writes | This metric records actual disk stall events. Arguably, all reports of disk stalls should be investigated by an operator. As a rule of thumb, one stall per minute is not likely to have a material impact on the workload (beyond an occasional increase in response time). However, one stall per second should be viewed as problematic and investigated vigorously. It is particularly problematic if the rate persists over an extended period of time and, worse, is increasing. |
| rocksdb.compactions | rocksdb.compactions.total | Storage | Number of SST compactions | This metric reports the number of the node's LSM compactions. If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
| rocksdb.block.cache.hits | rocksdb.block.cache.hits | Storage | Count of block cache hits | Enables an operator to optimize memory allocations based on long-term observations. The block cache is reserved memory: it is allocated upon node process start (--cache flag) and never shrinks. By observing block cache hits and misses, an operator can fine-tune memory allocations in the node process for the demands of the workload. |
| rocksdb.block.cache.misses | rocksdb.block.cache.misses | Storage | Count of block cache misses | Enables an operator to optimize memory allocations based on long-term observations. The block cache is reserved memory: it is allocated upon node process start (--cache flag) and never shrinks. By observing block cache hits and misses, an operator can fine-tune memory allocations in the node process for the demands of the workload. |
| rocksdb.read-amplification | rocksdb.read.amplification | Storage | Number of disk reads per query | Measures the worst-case/maximum number of files to read to serve a single key. An alert on this metric is recommended. |
| sys.uptime | sys.uptime | Health | Process uptime | |
| admission.io.overload | [SUBMITTED TO DATADOG] | Health | 1-normalized float indicating whether IO admission control considers the store as overloaded with respect to compaction out of L0 (considers sub-level and file counts) | If exceeding 1, indicates overload. An operator can also look at storage.l0-num-files, storage.l0-sublevels, or rocksdb.read-amplification directly. A healthy LSM shape is defined as "read-amp < 20" and "L0-files < 1000" (per the admission.l0_sub_level_count_overload_threshold and admission.l0_file_count_overload_threshold settings respectively). |
| admission.wait_durations.kv | admission.wait.durations.kv | Health | Wait time durations for requests that waited (append _p75) | Shows whether the CPU utilization-based admission control feature is working effectively or is potentially overzealous. This is a latency histogram of how much delay was added to the workload due to throttling by CPU admission control. If waits over 100ms are observed for over 5 seconds while there was excess CPU capacity available, then admission control is overly aggressive. |
| admission.wait_durations.kv-stores | admission.wait.durations.kv_stores | Health | Wait time durations for requests that waited (append _p75) | Shows whether the I/O utilization-based admission control feature is working effectively or is potentially overzealous. This is a latency histogram of how much delay was added to the workload due to throttling by I/O admission control. If waits over 100ms are observed for over 5 seconds while there was excess I/O capacity available, then admission control is overly aggressive. |
| sys.runnable.goroutines.per.cpu | [SUBMITTED TO DATADOG] | Health | Average number of goroutines that are waiting to run, normalized by number of cores | A value over 30 indicates a CPU overload (see the threshold-check sketch after this table). If the condition lasts only a short period of time (a few seconds), database users are likely to experience inconsistent response times. If the condition persists for an extended period of time (tens of seconds, or minutes), the cluster may start developing stability issues. Review this runbook template article. |
| seconds.until.enterprise.license.expiry | [SUBMITTED TO DATADOG] | Expiration | Seconds until enterprise license expiry (0 if no license present or running without enterprise features) | See Description |
| security.certificate.expiration.ca | [SUBMITTED TO DATADOG] | Expiration | | |
| security.certificate.expiration.client-ca | [SUBMITTED TO DATADOG] | Expiration | | |
| security.certificate.expiration.ui-ca | [SUBMITTED TO DATADOG] | Expiration | | |
| security.certificate.expiration.node | [SUBMITTED TO DATADOG] | Expiration | | |
| security.certificate.expiration.node-client | [SUBMITTED TO DATADOG] | Expiration | | |
| security.certificate.expiration.ui | [SUBMITTED TO DATADOG] | Expiration | | |
| liveness.heartbeatfailures | liveness.heartbeatfailures | KV - distributed | The number of node liveness heartbeat misses | A node reports a heartbeat failure if it is unable to complete a heartbeat within 3 seconds. Occasional, isolated heartbeat misses are expected, particularly in cloud deployments. Frequent or correlated heartbeat misses across multiple nodes indicate a serious cluster issue. Review this runbook template article. |
| liveness.heartbeatlatency | liveness.heartbeatlatency | KV - distributed | Node liveness heartbeat latency (append _p95) | If this metric exceeds 1 second, it is a sign of cluster instability. Review this runbook template article. |
| liveness.livenodes | liveness.livenodes | KV - distributed | Number of live nodes in the cluster (will be 0 if this node is not itself live) | Tracks the live nodes in the cluster. Review this runbook template article. |
| distsender.rpc.sent.nextreplicaerror | distsender.rpc.sent.nextreplicaerror | KV - distributed | Number of replica-addressed RPCs sent due to per-replica errors | RPC call errors do not necessarily indicate a problem. This metric tracks remote procedure calls that return a status value other than "success". A non-success status of an RPC call should not be misconstrued as a network transport issue; it is database code logic executed on another cluster node, and the non-success status is the result of an orderly execution of an RPC call that reports a specific logical condition. |
| distsender.errors.notleaseholder | distsender.errors.notleaseholder | KV - distributed | Number of NotLeaseHolderErrors encountered from replica-addressed RPCs | These errors are normal during elastic cluster topology changes when lease holders are actively rebalancing. They are automatically retried. However, they may create occasional response time spikes, in which case this metric may provide an explanation of the cause. |
| leases.transfers.success | leases.transfers.success | KV - replication | Number of successful lease transfers | A high number of lease transfers is not a negative or positive signal in itself; it is a reflection of elastic cluster activities. For example, this metric is high during cluster topology changes. A high value is often the reason for NotLeaseHolderErrors (which are normal/expected during rebalancing), so observing this metric may provide a confirmation of the cause. |
| rebalancing.queriespersecond | [SUBMITTED TO DATADOG] | KV - replication | Number of kv-level requests received per second by the store, considering the last 30 minutes, as used in rebalancing decisions | This metric shows hotspots along the QPS dimension, providing insight into ongoing rebalancing activities. |
| ranges | ranges | KV - replication | Number of ranges | Provides a measure of the data size scale. |
| replicas | replicas.total | KV - replication | Number of replicas | Provides an operator with an essential characterization of the data distribution across cluster nodes. |
| replicas.leaseholders | replicas.leaseholders | KV - replication | Number of lease holders | Provides an operator with an essential characterization of the data processing points across cluster nodes. |
| ranges.underreplicated | ranges.underreplicated | KV - replication | Number of ranges with fewer live replicas than the replication target | Gives an operator an idea of whether the cluster has data that is not conforming to resilience goals. The next step is to determine where these under-replicated ranges live (which database, table, etc.) and whether this is temporarily expected. |
| ranges.unavailable | ranges.unavailable | KV - replication | Number of ranges with fewer live replicas than needed for quorum (should probably be using replicas, not ranges) | Provides an operator with an indicator that the cluster is unhealthy and that the workload can be impacted. If an entire range is unavailable, it will be unable to process queries. |
| queue.replicate.replacedecommissioningreplica.error | [SUBMITTED TO DATADOG] | KV - replication | Number of failed decommissioning replica replacements processed by the replicate queue | |
| range.adds | range.adds | KV - replication | Number of range additions | What's the difference with range splits? |
| range.splits | range.splits.total | KV - replication | Number of range splits | Metric indicating how fast a workload is scaling. Abnormal patterns such as spikes can indicate resource hot spots (the split heuristic is based on QPS). Consider correlating with metrics such as CPU, or using the Hot Ranges page, to determine definitively whether hot spots are an issue and where (tables/indexes) they are occurring. |
| range.merges | [SUBMITTED TO DATADOG] | KV - replication | Number of range merges | DELETE metric? This is a CRDB performance optimization. It indicates that there have been deletes in the workload, but how is this actionable? |
| sql.conns | sql.conns | SQL | Number of active SQL connections | Operators should monitor the number of connections as well as the distribution (balancing) of connections across cluster nodes. An imbalance can lead to nodes becoming overloaded. Review this runbook template article. |
| sql.new_conns | [SUBMITTED TO DATADOG] | SQL | Number of new connection attempts | The rate of this metric shows how frequently new connections are being established. This can be useful in determining whether a high rate of incoming new connections is causing additional load on the server due to a misconfigured application. |
| sql.txns.open | sql.txns.open | SQL | Number of currently open user SQL transactions | The number of open transactions should roughly correspond to the number of CPUs * 4. If this metric is consistently larger, scale out the cluster. |
| sql.statements.active | sql.statements.active | SQL | Number of currently active user SQL statements | High-level indicator of the current statement concurrency in the workload. If abnormal patterns emerge, use the metric time range and filter the SQL Activity page(s) to identify the statements involved. |
| sql.failure.count | sql.failure | SQL | Number of statements resulting in a planning or runtime error | High-level indicator of workload (and application) degradation with query failures. Use the Insights page to find failed executions (and their error code) to troubleshoot, or application-level logs (if instrumented) to determine the cause of error. |
| sql.full.scan.count | sql.full.scan | SQL | Number of full table or index scans | High-level indicator of potentially suboptimal query plans in the workload that may require index tuning and maintenance. Use SHOW FULL TABLE SCANS or the SQL Activity page with the corresponding metric time frame to identify the statements with a suboptimal plan. The SQL Activity page also includes the plan(s) and index recommendations. Not all full scans are necessarily bad, especially over smaller tables. |
| sql.insert.count | sql.insert.count | SQL | Number of SQL INSERT statements successfully executed | High-level metric reflecting workload volume. Monitor this metric to identify abnormal application behavior/patterns over time. If abnormal patterns emerge, use the metric time range and filter the SQL Activity page(s) for interesting outliers or patterns. For instance, use the Transactions/Statements page or the Sessions page to sort on the transaction count. Find problematic sessions with high transaction counts and trace back to a user or application. |
| sql.update.count | sql.update.count | SQL | Number of SQL UPDATE statements successfully executed | See above |
| sql.delete.count | sql.delete.count | SQL | Number of SQL DELETE statements successfully executed | See above |
| sql.select.count | sql.select.count | SQL | Number of SQL SELECT statements successfully executed | See above |
| sql.ddl.count | sql.ddl.count | SQL | Number of SQL DDL statements successfully executed | See above |
| sql.txn.begin.count | sql.txn.begin.count | SQL | Number of SQL transaction BEGIN statements successfully executed | Counts the number of explicit transactions, reflecting workload volume. Determine whether explicit transactions can be refactored as implicit transactions. |
| sql.txn.commit.count | sql.txn.commit.count | SQL | Number of SQL transaction COMMIT statements successfully executed | Counts the number of explicit transactions that completed successfully. This metric can be used as a proxy to measure the number of successful explicit transactions. |
| sql.txn.rollback.count | sql.txn.rollback.count | SQL | Number of SQL transaction ROLLBACK statements successfully executed | This metric shows the number of orderly transaction rollbacks. A persistently high number of rollbacks may negatively impact workload performance and needs to be investigated. |
| sql.txn.abort.count | sql.txn.abort.count | SQL | Number of SQL transaction abort errors | A persistently high number of SQL transaction abort errors may negatively impact workload performance and needs to be investigated. |
| sql.service.latency | sql.service.latency | SQL | Latency of SQL request execution (append _p90/_p99) | High-level metric reflecting workload performance. Monitor this metric to understand P90 latency over time. If abnormal patterns emerge, use the metric time range and filter the SQL Activity page(s) for interesting outliers or patterns. In 23.1, SQL statements report P90 latency to enable correlation with this metric. |
| sql.txn.latency | sql.txn.latency | SQL | Latency of SQL transactions (append _p90/_p99) | A latency histogram of all executed SQL transactions. Provides a bird's-eye view of the current SQL workload. |
| txnwaitqueue.deadlocks_total | [SUBMITTED TO DATADOG] | SQL | Number of deadlocks detected by the transaction wait queue | Alert on this metric if > 0, especially if transaction throughput is lower than expected. Although applications should be able to detect and recover from deadlock errors, performance and throughput can be maximized if the application logic avoids deadlock conditions in the first place, for example by keeping transactions as short as possible. |
| sql.distsql.contended_queries.count | sql.distsql.contended.queries | SQL | Number of SQL queries that experienced contention | This metric is incremented whenever there is a non-trivial amount of contention experienced by a statement (read-write or write-write). Monitor this metric to correlate possible workload performance issues to contention conflicts. |
| sql.conn.latency | sql.conn.latency | SQL | Latency to establish and authenticate a SQL connection (append _p90/_p99) | Characterizes the database connection latency, which can affect application performance, e.g. startup times. |
| txn.restarts.serializable | txn.restarts.serializable | SQL | Number of restarts due to a forwarded commit timestamp and isolation=SERIALIZABLE | This is one measure of the impact of contention conflicts on workload performance. For detailed guidance on contention conflicts, see this runbook template article. Tens of restarts per minute may be a high value, a signal of an elevated degree of contention in the workload, which should attract an operator's attention. |
| txn.restarts.writetooold | txn.restarts.writetooold | SQL | Number of restarts due to a concurrent writer committing first | See above |
| txn.restarts.writetoooldmulti | [SUBMITTED TO DATADOG] | SQL | Number of restarts due to multiple concurrent writers committing first | See above |
| txn.restarts.unknown | [SUBMITTED TO DATADOG] | SQL | Number of restarts due to unknown reasons | See above |
| txn.restarts.txnpush | [SUBMITTED TO DATADOG] | SQL | Number of restarts due to a transaction push failure | See above |
| txn.restarts.txnaborted | [SUBMITTED TO DATADOG] | SQL | Number of restarts due to an abort by a concurrent transaction | These errors are generally due to deadlocks. Deadlocks can often be prevented with a considerate transaction design. Identify the conflicting/deadlocking transactions and consider redesigning deadlock-prone business logic. |
| jobs.auto_create_stats.resume_failed | [SUBMITTED TO DATADOG] | Jobs - Table Statistics | Number of auto_create_stats jobs which failed with a non-retriable error | High-level indicator that auto-create statistics is failing, which can leave the query optimizer running with stale statistics. This can cause suboptimal query plans to be selected, leading to poor query performance. |
| jobs.auto_create_stats.currently_running | [SUBMITTED TO DATADOG] | Jobs - Table Statistics | Number of auto_create_stats jobs currently running | Number of active auto-create statistics jobs that could also be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating with SQL latency and query volume metrics. |
| jobs.auto_create_stats.currently_paused | [SUBMITTED TO DATADOG] | Jobs - Table Statistics | Number of auto_create_stats jobs currently considered Paused | High-level indicator that auto-create statistics is paused, which can leave the query optimizer running with stale statistics. This can cause suboptimal query plans to be selected, leading to poor query performance. |
| jobs.create_stats.currently_running | [SUBMITTED TO DATADOG] | Jobs - Table Statistics | Number of create_stats jobs currently running | Number of active create statistics jobs that could also be consuming resources. Ensure that foreground SQL traffic is not impacted by correlating with SQL latency and query volume metrics. |
| jobs.backup.currently_running | jobs.backup.currently_running | Jobs - Backup/Restore | Number of backup jobs currently running | Self-explanatory |
| jobs.backup.currently_paused | [SUBMITTED TO DATADOG] | Jobs - Backup/Restore | Number of backup jobs currently considered Paused | Monitoring and alerting on this metric is a hedge against an inadvertent operational error that leaves a backup job in a paused state for an extended period of time, in functional areas where a paused job can tie up resources, have a concurrency impact, or cause some other negative consequence. A paused backup may break the RPO. |
| schedules.BACKUP.failed | schedules.backup.failed | Jobs - Backup/Restore | Number of BACKUP jobs failed | Monitor the number of backup job failures. |
| schedules.BACKUP.last-completed-time | schedules.backup.last_completed_time | Jobs - Backup/Restore | The unix timestamp of the most recently completed backup by a schedule specified as maintaining this metric | Monitor this metric to know that backups are meeting the RPO. |
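
As a practical starting point, a couple of the thresholds called out above (the 40% free-space rule for capacity and the runnable-goroutines-per-CPU overload signal) can be checked directly against a node's Prometheus endpoint (/_status/vars on the HTTP port, 8080 by default). The following is a minimal Python sketch, not a production monitoring solution: it assumes the usual translation of metric names to the Prometheus format (dots and dashes become underscores), so verify the exported names on your own cluster before relying on it.

```python
# Minimal sketch: scrape one node's Prometheus endpoint and apply two of the
# thresholds from the table above. Assumes default HTTP port 8080 and the
# usual dot/dash-to-underscore translation of metric names.
import re
import urllib.request
from collections import defaultdict

NODE_URL = "http://localhost:8080/_status/vars"  # adjust host/port for your node

LINE_RE = re.compile(r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)")

def scrape(url: str) -> dict:
    """Fetch the Prometheus text exposition and sum values per metric family."""
    totals = defaultdict(float)
    with urllib.request.urlopen(url, timeout=10) as resp:
        for raw in resp.read().decode("utf-8").splitlines():
            if raw.startswith("#"):
                continue
            m = LINE_RE.match(raw)
            if m:
                try:
                    totals[m.group("name")] += float(m.group("value"))
                except ValueError:
                    pass
    return totals

def check(metrics: dict) -> list[str]:
    findings = []
    # Storage rule from the table: keep at least 40% of capacity free.
    total, avail = metrics.get("capacity", 0.0), metrics.get("capacity_available", 0.0)
    if total and avail / total < 0.40:
        findings.append(f"only {avail / total:.0%} of storage capacity is free (target >= 40%)")
    # CPU overload signal from the table: runnable goroutines per CPU over 30.
    runnable = metrics.get("sys_runnable_goroutines_per_cpu", 0.0)
    if runnable > 30:
        findings.append(f"runnable goroutines per CPU is {runnable:.1f} (> 30 indicates CPU overload)")
    return findings

if __name__ == "__main__":
    for finding in check(scrape(NODE_URL)) or ["no threshold breaches detected"]:
        print(finding)
```

In practice these checks belong in your Grafana/Datadog alerting rather than in a standalone script; the sketch is only meant to make the thresholds concrete.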

## Essential Monitoring Metrics for Changefeeds

If changefeeds are created in a CockroachDB cluster, monitor these additional changefeed-related metrics in your custom dashboards:

| CockroachDB 23.1 Metric Name | Metric Name in Datadog (add cockroachdb. leading prefix) | Category | Description | Narrative |
|---|---|---|---|---|
| changefeed.running | changefeed.running | Jobs - Changefeeds | Number of currently running changefeeds, including sinkless | This tracks the total number of all running changefeeds. |
| jobs.changefeed.currently_paused | [SUBMITTED TO DATADOG] | Jobs - Changefeeds | Number of changefeed jobs currently considered Paused | Changefeed jobs should not be paused for a long time because the protected timestamp prevents garbage collection. This is a safety catch to guard against an inadvertent pause. Consider setting an alert as a hedge against an oversight/operational error. |
| changefeed.failures | changefeed.failures | Jobs - Changefeeds | Total number of changefeed jobs which have failed | This tracks the permanent failures (that the jobs system will not try to restart) across all changefeeds. Any increase in this counter should prompt an operator's investigative action. An alert is recommended. |
| changefeed.error_retries | changefeed.error.retries | Jobs - Changefeeds | Total retriable errors encountered by all changefeeds | Transient changefeed errors; alert on "too many" of them. For example, during a rolling upgrade this counter will increase because the changefeed jobs restart following node restarts. There is an exponential backoff, up to 10 minutes. If there is no maintenance and the error rate is high, investigate. An alert is recommended. |
| changefeed.emitted_messages | changefeed.emitted.messages | Jobs - Changefeeds | Messages emitted by all feeds | Provides an operator with useful context when assessing the state of changefeeds. This characterizes the rate of changes being streamed from the CRDB cluster. |
| changefeed.emitted_bytes | [SUBMITTED TO DATADOG] | Jobs - Changefeeds | Bytes emitted by all feeds | Provides an operator with useful context when assessing the state of changefeeds. This characterizes the throughput of bytes being streamed from the CRDB cluster. |
| changefeed.commit_latency | changefeed.commit.latency | Jobs - Changefeeds | The difference between an event's MVCC timestamp and the time it was acknowledged by the downstream sink. If the sink batches events, then the difference between the oldest event in the batch and the acknowledgement is recorded. Excludes latency during backfill | Provides an operator with useful context when assessing the state of changefeeds. This characterizes the end-to-end lag between a committed change and that change being applied at the destination. An alert is recommended. |
| jobs.changefeed.protected_age_sec | [SUBMITTED TO DATADOG] | Jobs - Changefeeds | Alert on too much data being protected | Changefeeds protect data from being garbage collected. The operator should ensure the protected timestamp does not significantly exceed the GC TTL zone configuration. It is recommended to alert if the protected timestamp is greater than 3 times the GC TTL (see the sketch after this table). |
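
To make the changefeed guidance concrete, here is a minimal sketch reusing the scrape() helper from the sketch above, with the same assumptions about exported metric names. It flags permanently failed changefeeds and a protected timestamp age exceeding three times the GC TTL. GC_TTL_SECONDS is a placeholder assumption; set it to the gc.ttlseconds value of the zone configurations your changefeeds cover.

```python
# Minimal sketch applying the changefeed rules from the table above to a dict
# of scraped metrics (see the scrape() helper earlier in this document).
# GC_TTL_SECONDS is an assumed placeholder -- use your actual zone config TTL.
GC_TTL_SECONDS = 4 * 60 * 60  # example value only

def check_changefeeds(metrics: dict) -> list[str]:
    findings = []
    # changefeed.failures is a cumulative counter of permanent failures.
    if metrics.get("changefeed_failures", 0.0) > 0:
        findings.append("one or more changefeeds have failed permanently")
    # Alert when the protected timestamp age exceeds 3x the GC TTL.
    protected_age = metrics.get("jobs_changefeed_protected_age_sec", 0.0)
    if protected_age > 3 * GC_TTL_SECONDS:
        findings.append(
            f"changefeed protected timestamp age is {protected_age:.0f}s "
            f"(> 3x GC TTL of {GC_TTL_SECONDS}s)"
        )
    return findings
```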

## Essential Monitoring Metrics for Row-Level TTL

If Row-Level TTL is configured for any table(s) in a CockroachDB cluster, monitor these additional Row-Level TTL-related metrics in your custom dashboards:

| CockroachDB 23.1 Metric Name | Metric Name in Datadog (add cockroachdb. leading prefix) | Category | Description | Narrative |
|---|---|---|---|---|
| jobs.row_level_ttl.resume_completed | [SUBMITTED TO DATADOG] | Jobs - Row TTL | Number of row_level_ttl jobs which successfully resumed to completion | If TTL is enabled, this should be nonzero and correspond to the ttl_cron setting that was chosen. If it is zero, the job is not running. |
| jobs.row_level_ttl.resume_failed | [SUBMITTED TO DATADOG] | Jobs - Row TTL | Number of row_level_ttl jobs which failed with a non-retriable error | This should remain at zero. Repeated errors mean the TTL job is not deleting data. |
| jobs.row_level_ttl.rows_selected | [SUBMITTED TO DATADOG] | Jobs - Row TTL | Number of rows selected for deletion by the row-level TTL job | Correlate with the metric below to ensure all the rows that should be deleted are getting deleted (see the sketch after this table). |
| jobs.row_level_ttl.rows_deleted | [SUBMITTED TO DATADOG] | Jobs - Row TTL | Number of rows deleted by the row-level TTL job | Correlate with the metric above to ensure all the rows that should be deleted are getting deleted. |
| jobs.row_level_ttl.currently_paused | [SUBMITTED TO DATADOG] | Jobs - Row TTL | Number of row_level_ttl jobs currently considered Paused | Check this to ensure the TTL job does not remain paused inadvertently for an extended period. |
| jobs.row_level_ttl.currently_running | [SUBMITTED TO DATADOG] | Jobs - Row TTL | Number of row_level_ttl jobs currently running | Monitor this to ensure there are not too many TTL jobs running at the same time. Generally, this should be in the low single digits. |
| schedules.scheduled-row-level-ttl-executor.failed | [SUBMITTED TO DATADOG] | Jobs - Row TTL | Number of scheduled-row-level-ttl-executor jobs failed | Check this to ensure the TTL job is running. If it is non-zero, it means the scheduled job could not be created. |
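
The rows_selected/rows_deleted correlation above can also be sketched as a simple check, again reusing the scrape() helper and the same metric-name assumptions. A temporary gap between the two counters is normal while a TTL job run is in progress; only a persistent gap warrants investigation.

```python
# Minimal sketch correlating the two row-level TTL counters called out in this
# table and flagging TTL jobs that failed with non-retriable errors. Metric
# names assume the usual dot/underscore translation at /_status/vars.
def check_row_level_ttl(metrics: dict) -> list[str]:
    findings = []
    if metrics.get("jobs_row_level_ttl_resume_failed", 0.0) > 0:
        findings.append("row-level TTL jobs have failed with non-retriable errors")
    selected = metrics.get("jobs_row_level_ttl_rows_selected", 0.0)
    deleted = metrics.get("jobs_row_level_ttl_rows_deleted", 0.0)
    if selected and deleted < selected:
        findings.append(
            f"{selected - deleted:.0f} rows selected for TTL deletion have not been "
            "deleted yet (re-check after the current TTL job run completes)"
        )
    return findings
```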

## Additional Resources

- CRDB Metrics Doc Page: https://www.cockroachlabs.com/docs/stable/metrics.html
- Metrics Available in Datadog: https://docs.datadoghq.com/integrations/cockroachdb/
- CRDB REST API: https://www.cockroachlabs.com/docs/api/cluster/v2
- CRDB Example Dashboards (Grafana, Splunk) plus Prometheus Alerts: https://github.com/cockroachdb/cockroach/tree/master/monitoring
- CRDB Source Code - DB Console metrics-to-graphs mappings (in *.tsx files): https://github.com/cockroachdb/cockroach/tree/master/pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards
- CRDB Doc Alert Suggestions: https://www.cockroachlabs.com/docs/stable/monitoring-and-alerting#events-to-alert-on
- Runbook Template - Monitoring and Alerting: https://github.com/cockroachlabs/cockroachdb-runbook-template/tree/main/monitoring-alerts