NOTE: Only needed if you are doing this demo on your own cluster, i.e., setting up the demo env from scratch.
- Install a Kafka cluster on Kubernetes (the demo uses KUDO to install it)
- Install the Litmus (Portal) Control Plane. You can use the steps provided in the Portal Documentation
- Set up the Prometheus exporters for Kafka (via helm) & LitmusChaos (optional, as it comes with the portal installation)
- Set up the ServiceMonitors for the Kafka & LitmusChaos Prometheus exporters (a minimal sketch follows this list)
- Set up the Prometheus monitoring infra via the Prometheus Operator with a suitable configuration
- Apply the Prometheus CR to scrape metrics from the Kafka & LitmusChaos exporters
- (Optional) Set up Grafana to view the Kafka & LitmusChaos metrics. This sample dashboard JSON can be used.
- Update the Prometheus endpoint at all places inside the workflow if it differs from the provided default endpoint.
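For reference, the ServiceMonitor for the Kafka exporter can look roughly like the sketch below. This is a minimal example, assuming the exporter runs in a `kafka` namespace with an `app: kafka-exporter` label and a `metrics` port name; adjust these to match the Service created by your exporter installation, and make sure the labels match the `serviceMonitorSelector` of the Prometheus CR applied in the following step. A similar object can be created for the LitmusChaos chaos-exporter.

```yaml
# Minimal ServiceMonitor sketch for the Kafka Prometheus exporter.
# Namespace, labels, and port name are assumptions -- align them with
# the Service created by your exporter installation.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-exporter
  namespace: monitoring
  labels:
    release: prometheus            # should match the Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: kafka-exporter          # assumed label on the exporter Service
  namespaceSelector:
    matchNames:
      - kafka                      # assumed namespace of the Kafka exporter
  endpoints:
    - port: metrics                # assumed metrics port name on the Service
      interval: 30s
```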
NOTE
- The ChaosCarnival bootcamp has an optional lab session for which a prepackaged cluster environment will be provided to interested users for the duration of the lab
- The aforementioned prerequisites have already been set up for these clusters
- Reach out to Udit Gaurav or Shubham Chaudhary for on-demand lab sessions
The demo involves using Litmus to inject chaos on a Kafka statefulset in order to hypothesize about & arrive at the correct value for a deployment attribute (the message timeout on the Kafka consumer) for a given cluster environment (3 Kafka brokers using the default/standard storage class), while observing & validating some critical Kafka metrics. More details on what the scenario entails are provided below:
- The experiment begins by generating load on the Kafka cluster via a test producer/consumer pair, with the consumer configured with a desired message timeout. The message stream itself is very simple and just prints a timestamped string. It consists of a single topic with a single partition, replicated across the 3 brokers.
- The partition leader amongst the Kafka brokers is identified (via a Kafka client) and killed, triggering a failover that results in a new partition leader. It is possible that the broker pod targeted for failure happens to be the "Controller Broker" (also called the "Active Controller") that facilitates the failovers for partition leaders and talks to ZooKeeper, among other things. In that case, a new active controller is elected as well. All of this can stall data transfer for a while, and the time taken may depend upon the infrastructure.
- The chosen message timeout may be good enough to tolerate this disruption in data traffic OR might be insufficient, resulting in a broken message stream on the consumer. You are expected to play around with the timeout values (specified as an ENV variable in the Chaos Workflow spec; a sketch of this override follows this list) to arrive at an eventual value that succeeds.
- In addition to the core constraint of an unbroken message stream, the experiment also factors in other important considerations around application behavior. In this experiment, we also expect that, despite chaos, there should be no:
  - OfflinePartitions (unavailable data-stores) throughout the procedure
  - UnderReplicatedPartitions (we use 3 replicas by default) before beginning the experiment (to avoid running chaos on a degraded setup) & once the chaos ends (as a check that all brokers are back to an optimal state)

  courtesy: Key Kafka Metrics to Monitor
- This is ensured by setting up the right litmus Probes (an illustrative probe sketch also follows this list)
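As a rough illustration of where the consumer timeout is tuned, the snippet below shows an env override inside the kafka-broker-pod-failure experiment section of the Chaos Workflow's embedded ChaosEngine. The env name `KAFKA_CONSUMER_TIMEOUT` and the millisecond unit are assumptions here; confirm the exact name & unit against the experiment CR / workflow YAML used in the demo.

```yaml
# Sketch of the consumer timeout override inside the workflow's ChaosEngine spec.
# Env name and unit (milliseconds) are assumptions -- verify against the
# kafka-broker-pod-failure experiment CR bundled with the workflow.
experiments:
  - name: kafka-broker-pod-failure
    spec:
      components:
        env:
          - name: KAFKA_CONSUMER_TIMEOUT
            value: "30000"   # adjust across runs until the message stream survives the failover
```

For illustration, the two probe constraints above could map to Litmus promProbes roughly as sketched below. The metric names, the Prometheus endpoint, and the run properties are assumptions and may differ from the probes actually shipped with the workflow; `Continuous` mode evaluates the probe throughout the chaos duration, while `Edge` mode evaluates it before chaos starts and after it ends.

```yaml
# Illustrative promProbe definitions for the two constraints above.
# Metric names, the Prometheus endpoint, and run properties are assumptions --
# align them with the metrics your exporters actually expose.
probe:
  - name: check-offline-partitions
    type: promProbe
    mode: Continuous                 # enforced throughout the chaos duration
    promProbe/inputs:
      endpoint: http://prometheus-operated.monitoring.svc.cluster.local:9090
      query: kafka_controller_kafkacontroller_offlinepartitionscount
      comparator:
        criteria: "=="
        value: "0"
    runProperties:
      probeTimeout: 5
      interval: 5
      retry: 1
  - name: check-under-replicated-partitions
    type: promProbe
    mode: Edge                       # checked before chaos begins and once it ends
    promProbe/inputs:
      endpoint: http://prometheus-operated.monitoring.svc.cluster.local:9090
      query: sum(kafka_topic_partition_under_replicated_partition)
      comparator:
        criteria: "=="
        value: "0"
    runProperties:
      probeTimeout: 5
      interval: 5
      retry: 1
```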
NOTE: To learn more about Litmus Portal setup and usage, refer to the Litmus Portal Docs
- Log in to the Litmus Portal with the right credentials
- Schedule the kafka-broker-pod-failure workflow from the Litmus Portal (using the upload YAML option)
- Visualize the workflow progress from the Litmus Portal
- Observe app statistics in the Grafana dashboard during the active chaos period
- Verify the ChaosEngine & ChaosResult statuses to know the experiment verdict post workflow completion. Sample results (including experiment pod logs & chaosresult dumps for success & failure cases) can be found here
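For reference, a passing run is reflected in the ChaosResult status roughly as below. This is an illustrative sketch (not captured output), assuming the result name follows the usual `<chaosengine-name>-<experiment-name>` pattern; field names may vary slightly across Litmus versions.

```yaml
# Illustrative (not actual) ChaosResult for a passing run -- check the verdict
# and probe success fields once the workflow completes.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: kafka-chaos-kafka-broker-pod-failure   # assumed engine/experiment names
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                    # a Fail verdict usually points to a violated probe constraint
    probeSuccessPercentage: "100"
```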