This repository was archived by the owner on Dec 5, 2023. It is now read-only.

Commit 6c3d56c

initial commit
0 parents  commit 6c3d56c

20 files changed: +18739 -0 lines changed

.gitignore

+1
@@ -0,0 +1 @@
.DS_Store

README.md

+79
@@ -0,0 +1,79 @@
# Redis Cloud Dashboards

These dashboards graphically present standard metrics for every level of a Redis Enterprise installation. The alert configuration files
help you send notifications when any of a number of key values leaves its expected range. Lastly, the metrics description files
provide information about additional values that can be monitored.

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See Deployment for
notes on how to deploy the project on a live system.

### Prerequisites

You will need to install the following software packages. Depending on your distribution there may be different ways of installing them; choose the
packaging style with which you are most familiar.

```
Prometheus
Grafana
```

### Installing

Once both Prometheus and Grafana have been installed, you will need to modify Prometheus' config file and point it at Redis' metrics endpoint. Once
that has been done, you must create a Prometheus data source in Grafana's administration console. You should name the data source 'Redis-Enterprise';
if you decide to name it something else, you will need to change the data source names in the individual dashboard JSON files. Please follow the
instructions on the following page:

```
https://docs.redis.com/latest/rs/clusters/monitoring/prometheus-integration/
```
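
As a rough sketch of the Prometheus change described above, a scrape job along these lines could be added to `prometheus.yml`. The job name, the `cluster.example.com` host, port 8070, and the TLS settings are assumptions made for illustration; take the real values from the linked integration page and from your own cluster.

```yaml
# prometheus.yml (fragment): hypothetical scrape job for a Redis Enterprise cluster.
scrape_configs:
  - job_name: redis-enterprise        # assumed name; any unique job name works
    scrape_interval: 30s
    scrape_timeout: 30s               # must not exceed scrape_interval
    metrics_path: /
    scheme: https
    tls_config:
      insecure_skip_verify: true      # assumption: the cluster uses a self-signed certificate
    static_configs:
      - targets: ["cluster.example.com:8070"]   # placeholder host; the port is an assumption
```

After editing the file, restart or reload Prometheus and check its targets page to confirm the endpoint is being scraped.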

Once this has been done, you can use the Grafana administration console to import the files in dashboards/.
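
The data source can also be declared in a Grafana provisioning file instead of being created by hand in the console. The following is a minimal sketch, assuming Prometheus listens on `http://localhost:9090` and that Grafana reads provisioning files from its `provisioning/datasources/` directory; the file name and URL are assumptions, not settings taken from this project. Only the name `Redis-Enterprise` comes from the instructions above.

```yaml
# provisioning/datasources/redis-enterprise.yml (hypothetical file name)
apiVersion: 1
datasources:
  - name: Redis-Enterprise          # must match the data source name used in the dashboard JSON files
    type: prometheus
    access: proxy
    url: http://localhost:9090      # assumed Prometheus address
```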

## Running the tests

In order to run the alerting tests, you will need to copy the rules/ and tests/ folders to your Prometheus installation. Once they have been copied,
you can execute the tests as follows:

```
promtool test rules tests/*
```

### Modifying the alerts

Alerts can, and probably should, be modified to correspond to your environment and its configuration. Additional alerts can be created
following Prometheus' alerting guidelines. It is strongly recommended that you create unit tests for each of your alerts to ensure they perform as expected.

Further details can be found [here](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/).
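
As an illustration of adapting an alert, the sketch below adds a stricter variant of the existing connection-count rule. The file name, the group name, the alert name `ShardConnectionCountCritical`, and the 1000-connection threshold are all hypothetical; choose values that match your own workload.

```yaml
# rules/custom.yml (hypothetical file): a stricter variant of ShardConnectionCount.
groups:
  - name: custom database alerts
    rules:
      - alert: ShardConnectionCountCritical     # hypothetical alert name
        expr: redis_connected_clients > 1000    # assumed threshold
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Shards - Critical Connection Count
          description: Shard {{ $labels.instance }} exceeded 1000 connections for 5 minutes
```

A matching unit test can follow the same layout as the files in tests/, with `rule_files` pointing at the new rules file.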

## Deployment

Open Grafana's dashboard tab, click on the blue 'New' button on the far right and select 'Import', then click on the 'Upload JSON file' button and
navigate to the dashboard files included with the project in the 'dashboards' folder.

## Authors

* **John Burke** - *Initial work* - [Redis](https://github.com/redis-field-engineering)

See also the list of [contributors](https://github.com/redis-field-engineering/redis-cloud-dashboards/graphs/contributors) who participated in this
project.

## Support
Redis Cloud Dashboards is supported by Redis, Inc. on a good faith effort basis. To report bugs, request features, or receive assistance, please
file an [issue](https://github.com/redis-field-engineering/redis-cloud-dashboards/issues).

## License
Redis Cloud Dashboards is licensed under the MIT License. Copyright © 2023 Redis, Inc. See the [LICENSE.md](LICENSE.md) file for details.

## Acknowledgments

* Hat tip to anyone whose code was used
* Inspiration
* etc

alerts/rules/alerts.yml

+65
@@ -0,0 +1,65 @@
groups:
  - name: database alerts
    rules:
      - alert: ClusterDown
        expr: min(redis_cluster_state) == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: Redis cluster is down
          description: Redis cluster state is {{ $value }}

      - alert: InstanceDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: Redis instance is down
          description: Redis instance {{ $labels.instance }} is down

      - alert: ShardConnectionCount
        expr: redis_connected_clients > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Shards - Excessive Connection Count
          description: Shard {{ $labels.instance }} exceeded 500 connections for 5 minutes

      - alert: ReplicaSync
        expr: bdb_replicaof_syncer_status != 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Replication - Resynchronization Requests (Status)
          description: Replication on {{ $labels.instance }} not synchronized in 10 minutes

      - alert: CRDTSync
        expr: bdb_crdt_syncer_status > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Replication - Unsynchronized (CRDT)
          description: CRDT Replication on {{ $labels.instance }} not synchronized for 5 minutes

      - alert: ReplicaLag
        expr: bdb_replicaof_syncer_local_ingress_lag_time > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Replication - High Latency (Status)
          description: Replication Latency {{ $labels.instance }} exceeded 500ms for 10 minutes

      - alert: CRDTLag
        expr: bdb_crdt_syncer_local_ingress_lag_time > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Replication - High Latency (CRDT)
          description: CRDT Replication Latency on {{ $labels.instance }} exceeded 500ms for 10 minutes

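For Prometheus to evaluate these rules at runtime, the rules file has to be listed in the server configuration. A minimal sketch, assuming the file was copied to a `rules/` directory next to `prometheus.yml` (the path and the 1m evaluation interval are assumptions):

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m     # assumed; how often alerting rules are evaluated
rule_files:
  - rules/alerts.yml          # assumed location, resolved relative to prometheus.yml
```
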
alerts/tests/cluster.yml

+28
@@ -0,0 +1,28 @@
# This is the main input for unit testing.
# Only this file is passed as command line argument.

rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'redis_cluster_state{job="redis", instance="localhost:6379"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 1m
        alertname: ClusterDown
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: page
            exp_annotations:
              summary: Redis cluster is down
              description: Redis cluster state is 0

alerts/tests/connection.yml

+30
@@ -0,0 +1,30 @@
# This is the main input for unit testing.
# Only this file is passed as command line argument.

rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'redis_connected_clients{job="redis", instance="localhost:6379"}'
        values: '420 480 520 540 560 580 600 520 480 520 460 440 380 400 500'

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 7m
        alertname: ShardConnectionCount
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: warning
              instance: localhost:6379
              job: redis
            exp_annotations:
              summary: "Shards - Excessive Connection Count"
              description: Shard localhost:6379 exceeded 500 connections for 5 minutes

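A complementary case worth sketching is the moment before the `for: 5m` window has elapsed. With the input series below, the 500-connection threshold is first crossed at 2m, so at 4m the alert should still be pending and promtool should find no firing alerts. This standalone test file is an illustrative addition under those assumptions, not part of the original commit; the file name is hypothetical.

```yaml
# tests/connection_pending.yml (hypothetical file name)
rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'redis_connected_clients{job="redis", instance="localhost:6379"}'
        values: '420 480 520 540 560 580 600 520'

    alert_rule_test:
      # The threshold is first crossed at 2m, so at 4m the alert is only pending.
      - eval_time: 4m
        alertname: ShardConnectionCount
        exp_alerts: []
```
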
alerts/tests/crdt_lag.yml

+30
@@ -0,0 +1,30 @@
# This is the main input for unit testing.
# Only this file is passed as command line argument.

rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'bdb_crdt_syncer_local_ingress_lag_time{job="redis", instance="localhost:6379"}'
        values: '420 580 520 540 560 580 600 520 510 520 530 550 580 400 500'

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 12m
        alertname: CRDTLag
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: warning
              instance: localhost:6379
              job: redis
            exp_annotations:
              summary: Replication - High Latency (CRDT)
              description: CRDT Replication Latency on localhost:6379 exceeded 500ms for 10 minutes

alerts/tests/crdt_sync.yml

+30
@@ -0,0 +1,30 @@
# This is the main input for unit testing.
# Only this file is passed as command line argument.

rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'bdb_crdt_syncer_status{job="redis", instance="localhost:6379"}'
        values: '0 1 2 2 1 2 1 1 2 2 2 1 2 0 1'

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 10m
        alertname: CRDTSync
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: warning
              instance: localhost:6379
              job: redis
            exp_annotations:
              summary: Replication - Unsynchronized (CRDT)
              description: CRDT Replication on localhost:6379 not synchronized for 5 minutes

alerts/tests/instance.yml

+30
@@ -0,0 +1,30 @@
# This is the main input for unit testing.
# Only this file is passed as command line argument.

rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'redis_up{job="redis", instance="localhost:6379"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 1m
        alertname: InstanceDown
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: page
              instance: localhost:6379
              job: redis
            exp_annotations:
              summary: Redis instance is down
              description: Redis instance localhost:6379 is down

alerts/tests/rep_lag.yml

+30
@@ -0,0 +1,30 @@
# This is the main input for unit testing.
# Only this file is passed as command line argument.

rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'bdb_replicaof_syncer_local_ingress_lag_time{job="redis", instance="localhost:6379"}'
        values: '420 580 520 540 560 580 600 520 510 520 530 550 580 400 500'

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 12m
        alertname: ReplicaLag
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: warning
              instance: localhost:6379
              job: redis
            exp_annotations:
              summary: Replication - High Latency (Status)
              description: Replication Latency localhost:6379 exceeded 500ms for 10 minutes

alerts/tests/rep_sync.yml

+30
@@ -0,0 +1,30 @@
# This is the main input for unit testing.
# Only this file is passed as command line argument.

rule_files:
  - ../rules/alerts.yml

evaluation_interval: 1m

tests:
  # Test 1.
  - interval: 1m
    # Series data.
    input_series:
      - series: 'bdb_replicaof_syncer_status{job="redis", instance="localhost:6379"}'
        values: '0 1 2 2 1 2 1 1 2 2 2 1 2 0 1'

    # Unit test for alerting rules.
    alert_rule_test:
      # Unit test 1.
      - eval_time: 11m
        alertname: ReplicaSync
        exp_alerts:
          # Alert 1.
          - exp_labels:
              severity: warning
              instance: localhost:6379
              job: redis
            exp_annotations:
              summary: Replication - Resynchronization Requests (Status)
              description: Replication on localhost:6379 not synchronized in 10 minutes
