This code is designed for testing purposes only and should never be used in production environments.
- This code will DELETE TOPICS on your Kafka clusters
- Data loss will occur during testing
- Changes are irreversible
Use this code ONLY on:
- Dedicated test clusters
- Development environments
- Isolated sandbox environments
NEVER use this code on:
- Production clusters
- Live data environments
- Shared development clusters
- Any system containing important data
This performance testing framework evaluates the maximum throughput of a Kafka cluster, and compares the put latency of different broker, producer, and consumer configurations.
To run a test, you specify the parameters that should be tested, and the tool iterates through all combinations of those parameters.
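To see how the parameter lists expand into individual test runs, here is a minimal Python sketch of the combination logic (illustrative only; the parameter names mirror the test specification shown later, and the framework's actual iteration code may differ):
from itertools import product
# Each key maps to a list of values to test; the framework runs one
# test per combination, i.e. the Cartesian product of all the lists.
parameters = {
    'cluster_throughput_mb_per_sec': [12, 24, 36],
    'num_producers': [6],
    'num_partitions': [72],
}
keys = list(parameters)
combinations = [dict(zip(keys, values)) for values in product(*parameters.values())]
print(len(combinations))  # 3 * 1 * 1 = 3 individual tests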
The test results can be visualized in a comparative performance graph similar to the graph below.
The performance testing framework is based on the kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh tools that are part of the Apache Kafka distribution.
This framework adds AWS-powered automation and visualization around Kafka's native performance testing utilities, leveraging AWS Step Functions for scheduling and orchestration, and AWS Batch for running individual tests.
To deploy the CDK template and run the performance test, you have two options: use your local machine, or use an Amazon EC2 instance. For the first option, ensure you have the following tools installed and configured on your local machine:
- docker >= 24.0.0
- node >= 18.0.0
- typescript >= 5.0.0
- cdk >= 2.171.1
- npm >= 10.9.1
- git >= 2.40.1
If you prefer not to configure your local environment, you can use an Amazon EC2 instance. The create environment section contains detailed instructions on how to install the dependencies required to successfully deploy the CDK template.
Clone the remote repository to create a local copy in your development environment:
# Clone the remote repository
git clone https://github.com/aws-samples/sample-apache-kafka-performance-testing-framework.git
# Navigate to project directory
cd sample-apache-kafka-performance-testing-framework
# Verify repository contents
ls -la
Once your environment is configured, you need to edit the defaults variable in cdk/bin/cdk.ts so that it matches your AWS account details:
# Navigate to 'cdk/bin'
cd cdk/bin
# Edit the `cdk.ts` file using the nano text editor
nano cdk.ts
Update the AWS account ID and target AWS Region:
const defaults : { 'env': Environment } = {
env: {
account: '123456789012', // Replace with your AWS account ID
region: 'us-east-1' // Replace with your target region
}
};
Finally, you need to install all dependencies of the CDK template by running the following commands in the cdk folder.
# Navigate back to the 'cdk' folder
cd ..
# Install CDK template dependencies
npm install
# Bootstrap the CDK environment, if you have never executed CDK in this AWS account and region
cdk bootstrap
You can now start deploying the CDK stacks to create the resources in your AWS account.
# List the available stacks
cdk ls
You should see 9 CDK stacks:
MSK-Perf-Test-VPC
MSK-Perf-Test-M7g-Large
MSK-Perf-Test-M7g-XLarge
MSK-Perf-Test-M7g-2XLarge
MSK-Perf-Test-M7g-2XLarge-PST (Provisioned Storage Throughput enabled at 312 MB/s)
MSK-Perf-Test-M7g-4XLarge
MSK-Perf-Test-M7g-4XLarge-PST (Provisioned Storage Throughput enabled at 625 MB/s)
MSK-Perf-Test-M7g-8XLarge
MSK-Perf-Test-M7g-8XLarge-PST (Provisioned Storage Throughput enabled at 1000 MB/s)
The VPC stack will create a new VPC in your AWS account with a 10.0.0.0/16 CIDR, 3 public subnets, 3 private subnets, 3 NAT Gateways and 1 Internet Gateway.
The VPC stack is deployed by the MSK stacks as a dependency, so you don't need to deploy it independently. However, if you'd like to deploy the VPC stack individually, you can run the command below:
# Deploy the CDK VPC stack
cdk deploy MSK-Perf-Test-VPC
The CDK template can either create a new MSK cluster for the performance tests, or you can specify an existing MSK cluster. The framework creates and deletes topics on the MSK cluster to run the performance tests, so it is highly recommended to use a dedicated cluster to avoid any data loss.
The default configuration of the CDK template includes eight MSK provisioned clusters leveraging standard brokers backed by AWS Graviton processors. You can customize the CDK stack name, the count of brokers, the instance type, instance size, the Apache Kafka version, and more, by editing the cdk/bin/cdk.ts file.
new CdkStack(app, 'MSK-Perf-Test-M7g-Large', {
...defaults,
vpc: vpc,
clusterProps: {
numberOfBrokerNodes: 1,
instanceType: InstanceType.of(InstanceClass.M7G, InstanceSize.LARGE),
ebsStorageInfo: {
volumeSize: 6000
},
encryptionInTransit: {
enableInCluster: false,
clientBroker: ClientBrokerEncryption.PLAINTEXT
},
kafkaVersion: KafkaVersion.V3_6_0,
},
});
To use an existing MSK cluster instead, provide its cluster name, bootstrap broker string, and security group (the '...' placeholders are intentionally left for your own values):
new CdkStack(app, 'stack-prefix--', {
...defaults,
vpc: vpc,
clusterName: '...',
bootstrapBrokerString: '...',
sg: '...',
});
You can then deploy the MSK CDK stacks to create the performance test resources in your account.
# List the available stacks
cdk ls
# Deploy a specific stack
cdk deploy MSK-Perf-Test-M7g-Large
- Resources created by the CDK stacks will incur AWS charges immediately upon creation
- Costs continue to accumulate even when no tests are running
After the infrastructure is deployed, you can start the performance tests using an AWS Step Functions workflow that was created during deployment. The workflow uses a simplified test specification that controls how each individual test will run and what parameters will be used.
In the Step Functions console, navigate to State machines, and select {CDK-stack-name}-State-Machine-main. Start the execution by providing the following input:
{
"test_specification": {
"parameters": {
"cluster_throughput_mb_per_sec": [ 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168 ],
"consumer_groups" : [ { "num_groups": 0, "size": 0 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
"num_producers": [ 6 ],
"client_props": [
{
"producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
"consumer": "security.protocol=PLAINTEXT"
}
],
"num_partitions": [ 72 ],
"record_size_byte": [ 1024 ],
"replication_factor": [ 3 ],
"duration_sec": [ 3600 ]
}
}
}
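If you prefer to start the execution programmatically instead of through the console, a minimal boto3 sketch follows (the state machine ARN and input file name are placeholders you must replace with your own values):
import json
import boto3
sfn = boto3.client('stepfunctions')
# Load the test specification shown above from a local file
with open('test-specification.json') as f:
    test_specification = json.load(f)
# Replace the ARN with that of {CDK-stack-name}-State-Machine-main
sfn.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:123456789012:stateMachine:your-state-machine',
    input=json.dumps(test_specification),
)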
Using the test specification above, the workflow will execute 42 performance tests to measure put latency under varying throughput conditions.
The test configuration uses the following parameters:
- Producers: 6
- Partitions: 72
- Record Size: 1024 bytes
- Replication Factor: 3
- Duration: 3600 seconds (1 hour)
- Throughput: 12 → 168 MB/sec
- Consumer Groups:
- 0 groups (no consumers)
- 1 group (6 consumers)
- 2 groups (6 consumers each)
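The 42-test figure is simply the product of the parameter list lengths, since every other parameter contains a single value. A quick sanity check in Python:
# 14 throughput values x 3 consumer group settings x 1 value for
# every remaining parameter = 42 individual tests
throughput_values = 14
consumer_group_settings = 3
print(throughput_values * consumer_group_settings)  # 42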
Phase 1: No consumer groups
- Throughput: 12 → 168 MB/sec
- Measure put latency
Phase 2: One consumer group
- Throughput: 12 → 168 MB/sec
- 6 consumers active
- Measure put latency
Phase 3: Two consumer groups
- Throughput: 12 → 168 MB/sec
- 12 consumers total (6 per group)
- Measure put latency
The framework supports additional test parameters through list configuration. Any parameter defined as a list will be automatically included in the test iterations.
Since you may not know a cluster's maximum throughput, you can specify a stop condition. This condition tells the test framework when to reset the cluster throughput to the lowest value and select the next test option (or stop if there are no more options).
To automatically stop increasing throughput when the cluster becomes saturated, start the {CDK-stack-name}-State-Machine-main execution by providing the following input:
{
"test_specification": {
"parameters": {
"cluster_throughput_mb_per_sec": [ 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168 ],
"consumer_groups" : [ { "num_groups": 0, "size": 0 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
"num_producers": [ 6 ],
"client_props": [
{
"producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
"consumer": "security.protocol=PLAINTEXT"
}
],
"num_partitions": [ 72 ],
"record_size_byte": [ 1024 ],
"replication_factor": [ 3 ],
"duration_sec": [ 3600 ]
},
"skip_remaining_throughput": {
"less-than": [ "sent_div_requested_mb_per_sec", 0.995 ]
}
}
}
The framework will skip remaining throughput values if the actual throughput sent to the cluster by producers falls below 99.5% of the specified value.
The testing process follows this pattern:
- The first test uses 0 consumer groups and sends 12 MB/s into the cluster.
- The next test maintains 0 consumer groups but increases throughput to 24 MB/s.
- This pattern continues, incrementing throughput, until the cluster becomes saturated.
For example, if the cluster can only absorb 154 MB/s of the specified 156 MB/s, the ratio is roughly 98.7%, below the 99.5% threshold, so the skip condition is met.
When this occurs:
- The framework stops increasing throughput for the current consumer group setting.
- It resets the throughput to 12 MB/s.
- It increments to the next consumer group setting (in this case, from 0 to 1).
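The skip condition itself is a simple ratio check. A Python sketch of the evaluation (illustrative only; the function and parameter names are not the framework's actual code):
def skip_remaining_throughput(sent_mb_per_sec, requested_mb_per_sec, threshold=0.995):
    # 'less-than': ['sent_div_requested_mb_per_sec', 0.995] means:
    # skip the remaining throughput values if the achieved/requested
    # ratio falls below the threshold.
    return sent_mb_per_sec / requested_mb_per_sec < threshold
print(skip_remaining_throughput(154, 156))  # True: 154/156 = 0.987 < 0.995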
Amazon EC2 networking, Amazon EBS, and Amazon EBS networking all leverage a credit system that allows performance to burst above a given baseline for a limited time. This means the cluster can temporarily ingest much more throughput than its baseline sustains.
To obtain stable results for the baseline performance of a cluster, the test framework can deplete credits before a test is carried out. To this end, it generates traffic that exceeds the peak performance of the cluster and waits until the throughput drops below configurable thresholds:
{
"test_specification": {
"parameters": {
"cluster_throughput_mb_per_sec": [ 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168 ],
"consumer_groups" : [ { "num_groups": 0, "size": 0 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
"num_producers": [ 6 ],
"client_props": [
{
"producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
"consumer": "security.protocol=PLAINTEXT"
}
],
"num_partitions": [ 72 ],
"record_size_byte": [ 1024 ],
"replication_factor": [ 3 ],
"duration_sec": [ 3600 ]
},
"skip_remaining_throughput": {
"less-than": [ "sent_div_requested_mb_per_sec", 0.995 ]
},
"depletion_configuration": {
"upper_threshold": {
"mb_per_sec": 185
},
"lower_threshold": {
"mb_per_sec": 150
},
"approximate_timeout_hours": 1
}
}
}
With this configuration, the framework will generate a high load to consume burst credits.
Then, it monitors the throughput to detect when credits are depleted by observing two conditions in sequence:
- First, the cluster throughput must exceed the upper_threshold, indicating the cluster is operating at burst performance levels
- Then, the throughput must drop below the lower_threshold, indicating burst credits have been depleted
Only after both conditions are met (exceeding upper threshold, then falling below lower threshold) will the actual performance test begin. This ensures the test measures baseline performance without the influence of burst credits.
The approximate_timeout_hours value of 1 hour sets the maximum duration for the credit depletion phase.
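Conceptually, the depletion phase is a small state machine over the observed throughput. A simplified Python sketch using the thresholds above (illustrative only, not the framework's implementation):
def credits_depleted(throughput_samples_mb_per_sec, upper=185, lower=150):
    # Condition 1: throughput must first exceed the upper threshold
    # (the cluster is operating at burst performance). Condition 2:
    # it must then drop below the lower threshold (credits depleted).
    seen_burst = False
    for sample in throughput_samples_mb_per_sec:
        if sample > upper:
            seen_burst = True
        elif seen_burst and sample < lower:
            return True
    return False
print(credits_depleted([120, 190, 210, 170, 140]))  # True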
- upper_threshold:
  - Set to approximately 70-80% of your cluster's maximum burst throughput
  - Should be achievable during burst credit usage
  - Example: If your cluster can burst to 250 MB/sec, set to ~185 MB/sec
- lower_threshold:
  - Set to approximately 80% of your upper threshold
  - Should be above your expected baseline performance
  - Example: If upper threshold is 185 MB/sec, set to ~150 MB/sec
- For clusters with different broker counts:
  - Scale thresholds linearly with broker count
  - Example: For 6 brokers, multiply the 3-broker values by 2 (see the sketch after this list)
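The linear-scaling guidance can be expressed as a small helper (a sketch assuming the 3-broker example values above):
def scale_thresholds(broker_count, base_upper=185, base_lower=150, base_brokers=3):
    # Scale the 3-broker example thresholds linearly with broker count
    factor = broker_count / base_brokers
    return base_upper * factor, base_lower * factor
print(scale_thresholds(6))  # (370.0, 300.0): double the 3-broker values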
You can visualize the performance test results through Jupyter notebooks once the AWS Step Functions workflow is running. You can find the URL for the SageMaker Notebook in the Outputs section of the CloudFormation stack.
Note: An initial test execution is required before any output becomes visible. With credit depletion enabled, this process may take several hours to complete.
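If you prefer the SDK over the console, a boto3 sketch for reading the stack outputs (the stack name is a placeholder; the exact output key for the notebook URL depends on the deployed template):
import boto3
cfn = boto3.client('cloudformation')
# Replace with the name of your deployed stack
stack = cfn.describe_stacks(StackName='MSK-Perf-Test-M7g-Large')['Stacks'][0]
for output in stack.get('Outputs', []):
    print(output['OutputKey'], '=', output['OutputValue'])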
To install the dependencies, run the 01-install-dependencies.ipynb Jupyter notebook.
To create a visualization for the performance test results:
- Open the 02-create-visualization-template.ipynb Jupyter notebook.
- Add the execution ARN of your Step Functions workflow to the second cell:
test_params = []
test_params.extend([
{'execution_arn': 'your-execution-arn-here'} # Replace with your actual ARN
])
- Execute all cells in the notebook sequentially.
The logic in the Jupyter notebook:
- Retrieves the performance test results from CloudWatch Logs for one or more state machine execution ARNs
- Applies grouping and aggregation to the performance test results
- Creates grid-based visualizations where:
- Tests in the same row have identical producer/consumer settings
- Tests in the same column have identical consumer numbers
- Different colors represent different broker configurations and partition numbers
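Conceptually, the grouping step behaves like a pandas group-by over the flattened test results. A simplified sketch with hypothetical column names (not the notebook's actual code):
import pandas as pd
# Hypothetical flattened results; the notebook retrieves these from
# CloudWatch Logs.
results = pd.DataFrame([
    {'num_groups': 0, 'throughput_mb': 12, 'put_latency_ms_p99': 18.0},
    {'num_groups': 0, 'throughput_mb': 24, 'put_latency_ms_p99': 21.0},
    {'num_groups': 1, 'throughput_mb': 12, 'put_latency_ms_p99': 20.0},
])
# One sub-diagram per group; here rows are keyed on the consumer
# group count, mirroring the row_keys idea below.
for key, group in results.groupby('num_groups'):
    print(key, group[['throughput_mb', 'put_latency_ms_p99']].to_dict('records'))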
You can adapt the grouping of different test runs by changing the partitions variable. By adapting row_keys and column_keys you can create a grid of diagrams, and all measurements within the same sub-diagram will have identical test settings for the specified keys.
partitions = {
'ignore_keys': [ 'topic_id', 'cluster_name', 'test_id', 'client_props.consumer', 'cluster_id', 'duration_sec', 'provisioned_throughput', 'throughput_series_id', 'brokers_type_numeric', ],
'title_keys': [ 'kafka_version', 'broker_storage', 'in_cluster_encryption', 'producer.security.protocol', ],
'row_keys': [ 'num_producers', 'consumer_groups.num_groups', ],
'column_keys': [ 'producer.acks', 'producer.batch.size', 'num_partitions', ],
'metric_color_keys': [ 'brokers_type_numeric', 'brokers', ],
}
In this case, all diagrams in the same row will have identical producer and consumer settings. Within a column, all diagrams will have the same number of consumers. And within a diagram, tests with different brokers and numbers of partitions will have different colors.
To remove all CDK deployed resources for the performance test and avoid unnecessary AWS charges, follow these steps:
- Run the destroy command:
cdk destroy {stack-name}
- Important notes:
  - CloudWatch log streams containing experiment outputs will be retained. These logs are preserved for future analysis and visualization
  - To completely eliminate all costs, manually delete the following (see the sketch after this list):
    - CloudWatch Log Groups
    - Any remaining CloudWatch Log Streams
    - Any manually created resources not managed by CDK
- Verify cleanup:
  - Check your AWS Console to ensure all resources are properly removed
  - Pay special attention to regions where you deployed resources
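For the manual CloudWatch cleanup, a boto3 sketch (the log group name prefix is a hypothetical placeholder; verify the prefix your deployment actually uses before deleting anything):
import boto3
logs = boto3.client('logs')
# Replace the prefix with the one used by your deployment
paginator = logs.get_paginator('describe_log_groups')
for page in paginator.paginate(logGroupNamePrefix='MSK-Perf-Test'):
    for group in page['logGroups']:
        print('Deleting', group['logGroupName'])
        logs.delete_log_group(logGroupName=group['logGroupName'])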
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.