Metric for exceeding limits #2112

dtwilliamsWork · 2025-04-02T09:50:48Z

Is there a specific Karpenter metric I can use to monitor if a node pool's limit is exceeded?
https://karpenter.sh/docs/reference/metrics/

I would like to set up a Prometheus Alertmanager rule to monitor it.

If not, which error message should I look out for in the Karpenter logs?
I can set up a CloudWatch metric instead if it's not possible with Karpenter metrics.

dtwilliamsWork · 2025-04-10T14:11:57Z

I've managed to test this myself and found the relevant message in the logs.

all available instance types exceed limits for nodepool:

Our Karpenter pod logs are exported to CloudWatch using FluentBit, so we're able to add a metric filter on our log group with a matching alarm.

Here's the Terraform code I used, in case it's helpful for anyone. I've configured an SNS topic to receive our alerts, which then creates them on our OpsGenie platform.

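# Publishes a CloudWatch metric whenever Karpenter logs the "exceed limits" error.
resource "aws_cloudwatch_log_metric_filter" "karpenter_node_limit_exceeded" {
  name           = "Karpenter Node Limits Exceeded - ${var.cluster_name}"
  log_group_name = "/aws/containerinsights/${var.cluster_name}/application"
  # Match the error text, but only in log events that come from Karpenter pods.
  pattern        = "{ $.log_processed.error = %all available instance types exceed limits for nodepool:% && $.kubernetes.pod_name = %karpenter% }"

  metric_transformation {
    name          = "Karpenter Node Limits Exceeded - ${var.cluster_name}"
    namespace     = "Karpenter"
    value         = "1" # each matching log event counts as 1
    default_value = "0" # emitted when a log event does not match the pattern
    unit          = "Count"
  }

  # FluentBit ships the Karpenter logs into this log group, so deploy it first.
  depends_on = [
    kubectl_manifest.fluent-bit-daemonset
  ]
}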

data "aws_sns_topic" "opsgenie-alerts" {
  name = "opsgenie-alerts"
}

# Alarms on the metric published by the log metric filter and notifies the SNS topic.
resource "aws_cloudwatch_metric_alarm" "karpenter_node_limit_exceeded" {
  alarm_name        = "karpenter_node_limit_exceeded_alarm_${var.cluster_name}"
  alarm_description = "Alarm triggered by CloudWatch Logs for Karpenter Node Limits Exceeded"
  alarm_actions     = [data.aws_sns_topic.opsgenie-alerts.arn]

  # metric_name and namespace must match the filter's metric_transformation.
  metric_name         = "Karpenter Node Limits Exceeded - ${var.cluster_name}"
  namespace           = "Karpenter"
  # Breach when at least one match is counted in a 60-second period,
  # for 5 consecutive periods.
  threshold           = 0
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 5
  period              = 60

  depends_on = [
    kubectl_manifest.fluent-bit-daemonset
  ]
}
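
For anyone copying this into their own module: the snippet references `var.cluster_name` without showing where it is declared. A minimal sketch of that declaration might look like the following. The description text is my own wording and is only an assumption about how the variable is used, so adjust it to your setup. The `kubectl_manifest.fluent-bit-daemonset` resource and the `opsgenie-alerts` SNS topic are also assumed to exist already.

```hcl
# Hypothetical declaration for the variable used above (not part of the original snippet).
variable "cluster_name" {
  description = "Name of the EKS cluster; used in the log group path and in the metric and alarm names."
  type        = string
}
```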
