We've deployed the Nomad autoscaler as a Nomad job (running docker.io/hashicorp/nomad-autoscaler:0.4.6 in a Podman container) to do horizontal cluster autoscaling. We have a couple of AWS Auto Scaling groups for Nomad client node pools, and we're using the Nomad APM plugin to get metrics about allocated CPU and memory on the nodes. Our agent config looks like this (a rough sketch of the job that runs the agent follows after it):
# Not using the HA mode at the moment
high_availability {
  enabled        = false
  lock_namespace = "default"
  lock_path      = "nomad-autoscaler/lock_for_europe-1"
  lock_ttl       = "30s"
  lock_delay     = "15s"
}

http {
  bind_address = "0.0.0.0"
  bind_port    = 28865
}

policy {
  dir = "/etc/autoscaler/policies"
}

nomad {
  address = "unix:///etc/autoscaler-secrets/api.sock"
  region  = "europe-1"
}

apm "nomad-apm" {
  driver = "nomad-apm"
}

target "aws-asg-in-eu-west-1" {
  driver = "aws-asg"
  config = {
    aws_region            = "eu-west-1"
    aws_access_key_id     = "redacted"
    aws_secret_access_key = "redacted"
    aws_session_token     = "redacted"
  }
}

strategy "target-value" {
  driver = "target-value"
}
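For context, the job that runs the agent looks roughly like the sketch below. This is a simplified illustration rather than our exact spec: the datacenter name, task name, host paths, and config file name are placeholders, and it assumes the podman task driver with the image's default entrypoint being the nomad-autoscaler binary.

job "nomad-autoscaler" {
  region      = "europe-1"
  datacenters = ["dc1"] # placeholder datacenter
  type        = "service"

  group "autoscaler" {
    count = 1

    network {
      port "http" {
        static = 28865
      }
    }

    task "agent" {
      driver = "podman"

      config {
        image = "docker.io/hashicorp/nomad-autoscaler:0.4.6"
        # Assumes the image entrypoint is the nomad-autoscaler binary.
        args  = ["agent", "-config", "/etc/autoscaler/config.hcl"]
        ports = ["http"]

        # Host paths are placeholders; the agent config, the policy directory,
        # and the Nomad API socket are mounted into the container.
        volumes = [
          "/opt/autoscaler/config:/etc/autoscaler",
          "/opt/autoscaler/secrets:/etc/autoscaler-secrets",
        ]
      }
    }
  }
}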
The policy in /etc/autoscaler/policies/aws-eu-west-1-default-asg.hcl for scaling the AWS group is:
scaling "autoscaling-aws-eu-west-1-default-asg" {
enabled = true
min = 3
max = 9
policy {
check "nomad_allocated_cpu" {
source = "nomad-apm"
query = "percentage-allocated_cpu"
strategy "target-value" {
target = 70
max_scale_up = 2
max_scale_down = 1
}
}
check "nomad_allocated_memory" {
source = "nomad-apm"
query = "percentage-allocated_memory"
strategy "target-value" {
target = 70
max_scale_up = 2
max_scale_down = 1
}
}
target "aws-asg-in-eu-west-1" {
aws_asg_name = "cluster-europe-1-pool-default"
node_class = "aws-eu-west-1-default-asg"
node_purge = true
node_selector_strategy = "empty_ignore_system"
}
}
}
The Nomad ACL policy used with the autoscaler is:
namespace "*" {
policy = "scale"
capabilities = ["read-job"]
}
namespace "default" {
policy = "scale"
capabilities = ["read-job"]
variables {
path "nomad-autoscaler/lock_for_europe-1" {
capabilities = ["write"]
}
}
}
# Node write access is needed for the autoscaler to be able to drain and purge nodes.
node {
policy = "write"
}
# If running Nomad Autoscaler Enterprise, the following ACL policy addition is needed to ensure it can read the Nomad Enterprise license:
operator {
policy = "read"
}
The AWS permissions for the role we give to the autoscaler are:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeInstanceRefreshes",
        "autoscaling:DescribeAutoScalingGroups"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "autoscaling:UpdateAutoScalingGroup",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "autoscaling:CreateOrUpdateTags"
      ],
      "Effect": "Allow",
      "Resource": [
        "...ARNs of the ASGs here..."
      ]
    }
  ]
}
The autoscaler seems to work, but we get these warning messages in the logs after it performs a scale-out:
2025-06-04T03:29:49.852Z [INFO] policy_eval.worker: scaling target: id=5be0c2ab-3407-77e0-e4ce-d123b5dba165 policy_id=4ea06909-cf09-0230-b6f6-25d8847396c5 queue=cluster target=aws-asg-in-eu-west-1 from=5 to=7 reason="scaling up because factor is 1.650362" meta=map[nomad_policy_id:4ea06909-cf09-0230-b6f6-25d8847396c5]
2025-06-04T03:30:10.451Z [INFO] internal_plugin.aws-asg-in-eu-west-1: successfully performed and verified scaling out: action=scale_out asg_name=cluster-europe-1-pool-default desired_count=7
2025-06-04T03:34:48.950Z [WARN] policy_eval.broker: eval delivery limit reached: eval_id=85b235ed-0763-45c0-7168-499ae224acb7 policy_id=4ea06909-cf09-0230-b6f6-25d8847396c5 token=0dbdf3b8-3879-03d9-aba9-5b55d71db140 count=1 limit=1
2025-06-04T03:35:06.504Z [WARN] policy_eval.worker: failed to ACK policy evaluation: eval_id=85b235ed-0763-45c0-7168-499ae224acb7 eval_token=0dbdf3b8-3879-03d9-aba9-5b55d71db140 id=5be0c2ab-3407-77e0-e4ce-d123b5dba165 policy_id=4ea06909-cf09-0230-b6f6-25d8847396c5 queue=cluster error="evaluation ID not found"
2025-06-04T03:35:07.750Z [INFO] policy_eval.worker: scaling target: id=aa3c1652-f3d5-ad68-e1da-5168b7c5fe62 policy_id=4ea06909-cf09-0230-b6f6-25d8847396c5 queue=cluster target=aws-asg-in-eu-west-1 from=7 to=9 reason="scaling up because factor is 1.530879" meta=map[nomad_policy_id:4ea06909-cf09-0230-b6f6-25d8847396c5]
2025-06-04T03:35:28.502Z [INFO] internal_plugin.aws-asg-in-eu-west-1: successfully performed and verified scaling out: action=scale_out asg_name=cluster-europe-1-pool-default desired_count=9
2025-06-04T03:40:06.505Z [WARN] policy_eval.broker: eval delivery limit reached: eval_id=7b6f16ec-938e-9b7b-2fbf-4df78ee71cb9 policy_id=4ea06909-cf09-0230-b6f6-25d8847396c5 token=b0dff010-a68a-0020-d189-9163fd0aa627 count=1 limit=1
2025-06-04T03:40:36.441Z [WARN] policy_eval.worker: failed to ACK policy evaluation: eval_id=7b6f16ec-938e-9b7b-2fbf-4df78ee71cb9 eval_token=b0dff010-a68a-0020-d189-9163fd0aa627 id=aa3c1652-f3d5-ad68-e1da-5168b7c5fe62 policy_id=4ea06909-cf09-0230-b6f6-25d8847396c5 queue=cluster error="evaluation ID not found"
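If we understand the target-value strategy correctly, the scaling arithmetic itself looks right: with 5 nodes and a factor of 1.650362 the strategy would want roughly 5 × 1.65 ≈ 8 nodes, which max_scale_up = 2 caps at 7 (matching the from=5 to=7 line), and the next evaluation with factor 1.530879 is similarly capped at the policy max of 9. So the scale-outs themselves look healthy; it's only the broker warnings afterwards that we're unsure about.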
I found issue #343 about similar messages, but it doesn't seem to be the same problem: there the autoscaler had stopped working, whereas ours seems to be working fine otherwise (it successfully scaled out twice in the log snippet above).
What do these warning messages indicate? Are they a problem, and if so, how can I fix them?
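In case it's relevant: if we're reading the agent docs right, evaluation delivery is governed by the agent's policy_eval block, and the limit=1 in the warnings looks like the default delivery_limit, while the roughly five minutes between the scale action and the warning looks like the default ack_timeout. One thing we considered trying, sketched below with illustrative values, is raising ack_timeout so the worker has longer to ACK after a slow scale-out, but we'd like to understand whether the warnings are actually harmful before tuning anything.

# Illustrative only; we have not tried this yet.
policy_eval {
  ack_timeout    = "10m"
  delivery_limit = 1
}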