Awesome Observability

Monitoring, as defined by the Oxford dictionary, is to "observe and check the progress or quality of (something) over a period of time; keep under systematic review".

For systems monitoring, that means being able to give an overview of the state of a system by exposing key metrics about it. Monitoring can be implemented in different ways:

one can push metrics from a service,
pull metrics from it or
use a hybrid approach that combines both
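
The difference between pushing and pulling can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the API of any tool listed here: `Service`, `MonitoringBackend`, `receive` and `scrape` are hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """A toy service that counts the requests it has handled."""
    name: str
    requests_total: int = 0

    def handle_request(self) -> None:
        self.requests_total += 1

    def metrics(self) -> dict:
        # The service only exposes its current state here;
        # who calls this method decides whether it is push or pull.
        return {"requests_total": self.requests_total}


@dataclass
class MonitoringBackend:
    """Hypothetical backend that keeps the latest sample per service."""
    samples: dict = field(default_factory=dict)

    def receive(self, service_name: str, metrics: dict) -> None:
        # Push model: the service calls this whenever it wants to report.
        self.samples[service_name] = dict(metrics)

    def scrape(self, service: Service) -> None:
        # Pull model: the backend asks the service for its metrics.
        self.samples[service.name] = service.metrics()


backend = MonitoringBackend()
checkout = Service("checkout")

checkout.handle_request()
backend.receive(checkout.name, checkout.metrics())  # push: service reports

checkout.handle_request()
backend.scrape(checkout)                            # pull: backend collects
```

A hybrid setup would simply use both paths, for example pulling from long-running services and accepting pushes from short-lived jobs.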

Furthermore, observability can be seen as a superset of monitoring: monitoring is one part of giving visibility into the system, while observability as a whole provides the ability to reason about system health in a better way.

It can be said to consist of three parts:

event logs (can take different forms, such as plain text, structured or binary, and in general record what happened at a certain time),
metrics (measurements over time, for example the number of failed requests) and
tracing (represents related and distributed events together as a request flows through a system)
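
To make the three parts concrete, here is a small sketch of a single request producing all three signals. It uses only the Python standard library; `handle_request` and the in-memory `metrics`, `log_lines` and `spans` containers are hypothetical stand-ins for real logging, metrics and tracing backends.

```python
import json
import time
import uuid
from collections import Counter

metrics = Counter()   # metrics: numbers aggregated over time
log_lines = []        # event log: what happened, and when
spans = []            # trace: timed steps that share one trace id


def handle_request(path: str, fail: bool = False) -> str:
    trace_id = uuid.uuid4().hex
    start = time.time()
    status = 500 if fail else 200

    # Metric: count all requests, and failed requests separately.
    metrics["requests_total"] += 1
    if fail:
        metrics["requests_failed_total"] += 1

    # Structured event log entry, serialized as JSON.
    log_lines.append(json.dumps(
        {"ts": start, "trace_id": trace_id, "path": path, "status": status}
    ))

    # Span: one timed unit of work, tied to the request by its trace id.
    spans.append({
        "trace_id": trace_id,
        "name": f"GET {path}",
        "duration_s": time.time() - start,
    })
    return trace_id


first = handle_request("/cart")
second = handle_request("/pay", fail=True)
```

The shared `trace_id` is what lets you jump from a log line to the matching span, while the counters answer the aggregate question of how many requests failed.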

Metrics, Logs and Traces: The Golden Triangle of Observability in Monitoring

This repo is not only about monitoring. As Adrian Cole said in his talk "Observability 3 Ways", we are going to focus on the three types of systems necessary to understand how your applications behave: Logging, Metrics & Tracing.

1. Best Practices

2. General Tools

Before starting with a huge observability solution: if you just need to check some aspects of an application, visualize how your system is working, or identify a problem, it may be useful to start with one application, or a collection of applications, that helps you get this information in an easy and cheap way.

In addition, starting with tools that tell you whether your system is working well can help you define the final stack if you later want to install a corporate solution for a project. I know some stories about people who installed, configured and even evolved monitoring tools as a corporate solution and, once the solution was in production, realized that the tools did not cover everything needed to control their applications :-D

Below is an interesting post from Netflix, written by Brendan Gregg, that shows this very clearly.

https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55

In the article you can see how, with a few tools and in a short time, you can get a lot of information about your system ;-)

 $ uptime
 $ dmesg | tail
 $ vmstat 1
 $ mpstat -P ALL 1
 $ pidstat 1
 $ iostat -xz 1
 $ free -m
 $ sar -n DEV 1
 $ sar -n TCP,ETCP 1
 $ top

There are many more commands and methodologies you can apply to drill deeper.
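
If you want to script checks on top of these commands, a rough sketch like the following can pull the 1, 5 and 15 minute load averages out of a line of `uptime` output. The function name `parse_load_averages` and the sample line are assumptions for illustration; the exact output format varies between systems and locales, so treat the regular expression as a starting point.

```python
import re


def parse_load_averages(uptime_output: str) -> tuple:
    """Return the (1 min, 5 min, 15 min) load averages as floats."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)",
        uptime_output,
    )
    if match is None:
        raise ValueError("no load averages found in uptime output")
    return tuple(float(value) for value in match.groups())


# Sample line in the usual Linux format (assumed, not captured from a host).
sample = " 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02"
one, five, fifteen = parse_load_averages(sample)

# A 1 minute value above the 15 minute value suggests load is rising.
rising = one > fifteen
```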

3. Collect

Get any data – metrics, events, logs, traces – from everywhere – systems, sensors, queues, databases and networks.

Metrics

top - Allows users to monitor processes and system resource usage on Linux. It is one of the most useful tools in a sysadmin’s toolbox, and it comes pre-installed on most distributions
htop - Command line utility that allows you to interactively monitor your system’s vital resources or server’s processes in real time
ctop - Top-like interface for container metrics
OpenTelemetry - OpenTelemetry is made up of an integrated set of APIs and libraries as well as a collection mechanism via an agent and collector
OpenCensus - OpenCensus is a set of libraries for various languages that allow you to collect application metrics and distributed traces, then transfer the data to a backend of your choice in real time
OpenTracing - Vendor-neutral APIs and instrumentation for distributed tracing
OpenMetrics - An effort to create an open standard for transmitting metrics at scale, with support for both text representation and Protocol Buffers
Micrometer - Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics
cAdvisor - cAdvisor (Container Advisor) provides container users an understanding of the resource usage and performance characteristics of their running containers
Node-exporter - Prometheus stack, Exporter for machine metrics
Beats - Lightweight shippers for Elasticsearch & Logstash, Elastic stack
Collectd - The system statistics collection daemon
Tcollector - Data collection framework for OpenTSDB
Performance Co-Pilot -
inspectIT Ocelot - Java agent for collecting performance, tracing and business data
Kamon - Monitoring applications running on the JVM

Tracing

Sleuth - Spring Cloud Sleuth implements a distributed tracing solution for Spring Cloud, borrowing heavily from Dapper, Zipkin and HTrace
Jaeger - Monitor and troubleshoot transactions in complex distributed systems

Logging

Loki - Prometheus-inspired logging for cloud natives

Events

Nothing for the moment :-P

4. Load Generators & Synthetic Traffic

JMeter - Java application designed to load test functional behavior and measure performance. It was originally designed for testing Web Applications but has since expanded to other test functions
Yandex Tank - Yandex.Tank is an extensible open source load testing tool for advanced Linux users which is especially good as a part of an automated load testing suite
ghz - Simple gRPC benchmarking and load testing tool inspired by hey and grpcurl
Taurus - Taurus relies on JMeter, Gatling, Locust.io, Grinder and Selenium WebDriver as its underlying tools. Free and open source under Apache 2.0 License
Locust - Locust is an easy-to-use, distributed, user load testing tool. It is intended for load-testing web sites (or other systems) and figuring out how many concurrent users a system can handle
Pandora - Pandora is a high-performance load generator in Go language. It has built-in HTTP(S) and HTTP/2 support and you can write your own load scenarios in Go, compiling them just before your test
Gatling - Load test as code
Vegeta - HTTP load testing tool built out of a need to drill HTTP services with a constant request rate. It can be used both as a command line utility and a library
GoReplay - Open-source tool for capturing and replaying live HTTP traffic into a test environment in order to continuously test your system with real data
phantom - Evgeniy Mamchits' phantom is a very fast (100 000+ RPS) shooter written in C++ (default)
BFG - A modular tool and framework for load generation that supports HTTP/2
Bender - Bender makes it easy to build load testing applications for services using protocols like HTTP, Thrift, Protocol Buffers and many more. Bender provides a library of flexible, powerful primitives that can be combined (with plain Go code) to build load testers customized to any use case and that evolve with your service over time

5. Transport

The transport tools simply serve as transport pipelines for data. This includes messaging systems, proprietary protocols and exchange formats.

Apache Kafka - Publish-subscribe messaging rethought as a distributed commit log
Redis - Redis is an open source, in-memory data structure store, used as a database, cache and message broker. It supports many different data structures such as strings, hashes, lists, etc.
Rsyslog - RSYSLOG is the rocket-fast system for log processing
ØMQ - Brokerless intelligent transport layer
ActiveMQ - Powerful open source messaging and integration patterns server
Aeron - Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Apollo - Faster, more reliable, easier to maintain messaging broker built from the foundations of the original ActiveMQ
Ascoltatori - Pub/sub library for Node
Beanstalk - Simple, fast work queue
Disque - Distributed message broker
Eventuate - A platform for developing asynchronous microservices solving the distributed data management problems
Malamute - ZeroMQ enterprise messaging broker
Mist - A distributed, tag-based pub/sub service
Mosca - MQTT broker as a module
Mosquitto - Open source message broker that implements the MQTT protocol
Nanomsg - Socket library that provides several common communication patterns for building distributed systems
NATS - Open source, high-performance, lightweight cloud messaging system
NSQ - A realtime distributed messaging platform
Pulsar - Distributed pub-sub messaging system
Qpid - Cross-platform messaging components built on AMQP
RabbitMQ - Open source Erlang-based message broker that just works
RocketMQ - A low latency, reliable, scalable, easy to use message oriented middleware born from Alibaba's massive messaging business
VerneMQ - Open source software, extendable, and enterprise support is available

6. Collector

Receive data from the agents or instrumentation frameworks. The received data is usually persisted to some kind of storage or piped to another tool.

Depending on the collector type, enhancing and modifying performance data is also possible inside the collector.

In addition, collectors can have other responsibilities. For example, some expose a data access API, configuration points for the agents, or a user interface for interacting with the stored data.
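
The responsibilities described above can be sketched as a tiny in-process collector. This is a simplified illustration, not how any tool in this section is implemented: the `Collector` class, its `receive` and `stats` methods and the `static_tags` parameter are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class Collector:
    """Hypothetical collector: receives records, enriches them, forwards them."""

    def __init__(self, sink: Callable[[Record], None],
                 static_tags: Dict[str, str]) -> None:
        self._sink = sink                # storage or the next tool in the pipe
        self._static_tags = static_tags  # data added to every record
        self.received = 0

    def receive(self, record: Record) -> None:
        self.received += 1
        enriched = {**record, **self._static_tags}  # enhancement step
        self._sink(enriched)                        # persist or pipe onward

    def stats(self) -> Dict[str, int]:
        # Stand-in for the extra responsibilities, e.g. a data access API.
        return {"received": self.received}


storage: List[Record] = []
collector = Collector(sink=storage.append, static_tags={"env": "test"})
collector.receive({"metric": "cpu_user", "value": 0.42, "host": "web-1"})
```

Passing the sink in as a callable keeps the sketch agnostic about whether data ends up in storage or in another tool.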

Metrics

Telegraf - TICK stack, The plugin-driven server agent for collecting & reporting metrics
Prometheus - The Prometheus monitoring system and time series database

Tracing

Zipkin - A distributed tracing system
Jaeger - Monitor and troubleshoot transactions in complex distributed systems

Logging

Graylog - Simply great centralized log management
Loki - Horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus
Brubeck - Statsd-compatible stats aggregator written in C
GoAccess - GoAccess is an open source real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser. It provides fast and valuable HTTP statistics for system administrators that require a visual server report on the fly

Events

Nothing for the moment :-P

7. Storage

Data storage

Thanos - Highly available Prometheus setup with long term storage capabilities
Cortex - Horizontally scalable, highly available, multi-tenant, long term storage for Prometheus
Apache HBase - Apache HBase is the Hadoop database, a distributed, scalable, big data store

Time Series Database

InfluxDB - InfluxDB is an open-source time series database developed by InfluxData
Prometheus - The Prometheus monitoring system and time series database
OpenTSDB - OpenTSDB, written in Java
KairosDB - Fast Time Series Database on Cassandra
Graphite - Store numeric time-series data and render graphs of this data on demand
M3DB - Fully open source metrics platform built on M3DB, a distributed timeseries database
Netflix Atlas - Atlas features in-memory data storage, allowing it to gather and report very large numbers of metrics, very quickly
TimescaleDB - PostgreSQL for time‑series

Search Engine

Graph Database

SQL Database

MySQL - Relational database management system
MariaDB - Open source relational database
PostgreSQL - Open source relational database

NoSQL Database

Apache Cassandra - Scalability and high availability with linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure
MongoDB - MongoDB is a document database with the scalability and flexibility that you want with the querying and indexing that you need
Redis - Multi-model NoSQL database server enables search, messaging, streaming, graph, and other capabilities

8. Visualization

General

Dashboarding

Grafana - The first really good dashboard for displaying metrics
Chronograf - Chronograf is the user interface and administrative component of the InfluxDB platform
Kibana - Elastic stack
Prometheus - The Prometheus monitoring system and time series database
PromViz - Promviz is an application that helps you visualize the traffic of your cluster from Prometheus data
Vizceral - Vizceral is a component for displaying traffic data on a WebGL canvas. If a graph of nodes and edges with data about traffic volume is provided, it will render a traffic graph animating the connection volume between nodes
Trickster - Trickster is a reverse proxy cache for the Prometheus HTTP APIv1 that dramatically accelerates dashboard rendering times for any series queried from Prometheus
Stagemonitor -
Scouter -
Uchiwa -
Alerta web UI -
netdata -
Netflix Vector
Netflix Atlas -
Pinpoint Web -
Java Flame Graph
Nagios Core - Computer system, network and infrastructure monitoring software application

Tracing

Zipkin - A distributed tracing system
Jaeger - Monitor and troubleshoot transactions in complex distributed systems
Kieker Trace Analysis - Reconstruct and visualize architectural representations of the monitored systems from trace information collected at runtime. Currently supported architectural representations include

Uptime

Monitive - Free for 1 service, checked every 10 minutes with unlimited email & twitter alerts
UptimeRobot - Free for 50 monitors, checked every 5 minutes
OverOps - OverOps provides Automated Root Cause (ARC) analysis to reduce the time to identify and fix critical production application errors

9. Processing & Analyze & Act

Tools for processing the system data.

Pipeline tools that receive system data in one format, buffer or generate additional value on the raw data, and usually output or store it in another or the same format
Usually ingests the data from a multitude of sources and also sends the results to different destinations
Can set up alerts with a simple click or perform complex anomaly detection based on machine learning algorithms
Send alerts to popular services like Slack, Email, SMS or PagerDuty
Create custom triggers to perform any action. Integration with corporate systems such as Jira, CI/CD environments, source code analysis tools, etc.
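
As a rough sketch of the "analyze and act" part, the snippet below evaluates simple threshold rules against one sample of metrics and hands any alert to a notifier. `AlertRule`, `evaluate` and the metric names are hypothetical; a real setup would replace `sent.append` with a call to Slack, email, PagerDuty or similar.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float


def evaluate(rules: List[AlertRule], sample: Dict[str, float],
             notify: Callable[[str], None]) -> List[str]:
    """Fire every rule whose metric exceeds its threshold in this sample."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            notify(f"[ALERT] {rule.name}: {rule.metric}={value} > {rule.threshold}")
            fired.append(rule.name)
    return fired


sent: List[str] = []  # stands in for a real notification channel
rules = [
    AlertRule("High error rate", "error_rate", 0.05),
    AlertRule("High latency", "p99_latency_ms", 500.0),
]
fired = evaluate(rules, {"error_rate": 0.12, "p99_latency_ms": 180.0}, sent.append)
```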

Processing

Logstash - Logstash is an open source, server-side data processing pipeline that ingests data from a multitude of sources simultaneously, transforms it, and then sends it to your favorite "stash."
Fluentd - Fluentd is an open source data collector for unified logging layer
Vector - Vector is a high-performance observability data router. It makes collecting, transforming, and sending logs, metrics, and events easy. It decouples data collection & routing from your services, giving you control and data ownership, among many other benefits
Haystack -
icingabeat -
Kapacitor -
Kieker -
Metric Tank -
Nagios Remote Data Processor -
OpenCensus Collector -
Scouter -
Sensu -
statsd -

Alerts

Bosun
Checkmk Server
ElastAlert
Grafana
HayStack
Icinga 2 server
Kapacitor - TICK stack, written in Go
Nagios Core - Computer system, network and infrastructure monitoring software application
Pinpoint Web
Prometheus Alertmanager - Prometheus stack, Prometheus Alertmanager, written in Go
Scouter Collector
Sensu
Collectd
Netdata
Stagemonitor Alerter
X-Pack - Elastic stack
Bosun - Time Series Alerting Framework
Moira - Most powerful alerting system, backed by Graphite
Alerta - Distributed, scalable and flexible monitoring system
Flapjack - Monitoring notification routing & event processing system
Seyren - An alerting dashboard for Graphite

Triggers

WIP

Anomalies Detection

Failure Mode and Effects Analysis (FMEA) - Documents current knowledge and actions about the risks of failures, for use in continuous improvement.
Banshee - Real-time anomalies(outliers) detection system for periodic metrics
Project Scorpio - Log Anomaly Detector
Anomaly Detection in Prometheus Metrics - Prototype for a Prometheus Anomaly Detector (PAD) which can be deployed on OpenShift. The PAD is a framework to deploy a metric prediction model to detect anomalies in Prometheus metrics.
Prophet - Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.
Anomaly Detection Toolkit (ADTK) - Python package for unsupervised / rule-based time series anomaly detection.
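
For a feel of what the simplest rule-based detection looks like, here is a z-score sketch using only the Python standard library. It is not taken from any of the projects above; `zscore_outliers` and the sample latencies are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_outliers(values: Sequence[float], threshold: float = 3.0) -> List[int]:
    """Return the indexes of values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []          # all values identical: nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
outliers = zscore_outliers(latencies_ms, threshold=2.5)
```

The threshold is set below the conventional 3 because, in such a small sample, the outlier itself inflates the standard deviation. Periodic metrics usually need something smarter, which is where the tools above come in.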

10. Application performance monitoring (APM)

SkyWalking - Application performance monitor tool for distributed systems, especially designed for microservices, cloud native and container-based (Docker, K8s, Mesos) architectures
PinPoint - Open source APM tool for large-scale distributed systems written in Java
Falcon Plus - An open-source and enterprise-level monitoring system
dynatrace APM - Best-in-class APM from the category leader. Ensure application performance, innovate faster, collaborate efficiently, and deliver more value with dramatically less effort
Elastic APM - Application performance monitoring system built on the Elastic Stack
DataDog - Unified Monitoring For Metrics, Traces, & Logs
NewRelic - Complete view of your applications and operating environment
AppDynamics - Business and application performance monitoring
SPM - solutions for performance monitoring.
Instrumental - Real-time application and server monitoring

11. Observability as a Service

Kiali - Observability console for Istio with service mesh configuration capabilities. It helps you to understand the structure of your service mesh by inferring the topology, and also provides the health of your mesh
NexClipper - NexClipper is open source software for cloud native monitoring and operation, especially for Kubernetes, to support enterprise environments and integrate with Prometheus
Sysdig Prometheus - Cloud scale monitoring solution with full Prometheus compatibility

12. References

13. License

14. Contributing

Contributions welcome! Read the contribution guidelines first.

Feel free to open an issue or create a pull request with your additions.

Thank you!

solaim/awesome-observability
