
The ASG is not getting scaled when using the same node_class in 2 different Nomad datacenters #938

@sandi91

Description

We have a very strange issue and we cannot figure out what we are doing wrong.
We have 2 datacenters in Nomad, aws-eun-1 and aws-euw-1, but the same node_class is set for a group of nodes in both of them.

When we configure an autoscaling policy and filter by both settings, the calculation takes into account all the nodes matching the node_class. The result is also unchanged when we filter by node_class alone, which shows that the datacenter setting is ignored when calculating the node count.
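
To illustrate the setup, the client configuration is roughly the following sketch (the names are simplified placeholders, not our literal config):

# Clients in the first datacenter
datacenter = "aws-eun-1"
client {
  enabled    = true
  node_class = "test-stage"
}

# Clients in the second datacenter share the same node_class
datacenter = "aws-euw-1"
client {
  enabled    = true
  node_class = "test-stage"
}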

So here is the example configuration we are trying to apply:

scaling "nomad_worker_test_stage_policy" {
  enabled = true
  min     = 1
  max     = 15
  policy {
    cooldown            = "3m"
    evaluation_interval = "1m"

    check "memory_allocated_percentage" {
      source       = "nomad-apm"
      query        = "percentage-allocated_memory"
      query_window = "1m"
      strategy "target-value" {
        target = 80.0
      }
    }

    target "aws-asg-euw" {
      aws_asg_name                  = "test-stage"
      datacenter                    = "aws-euw-1"
      node_class                    = "test-stage"
      dry-run                       = true
      node_drain_deadline           = "5m"
      node_purge                    = "true"
      node_drain_ignore_system_jobs = "false"
    }
  }
}

The result of this would be:

myapp-1  | 2024-07-25T09:01:39.241Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=45790 allocated_memory=53368 allocatable_cpu=70000 allocatable_memory=101465 
myapp-1  | 2024-07-25T09:01:39.241Z [TRACE] policy_eval.worker.check_handler: metric result: check=memory_allocated_percentage id=6fea318a-2c45-0540-1514-5c996cd179e3 policy_id=4caf85ae-eaea-9718-24ce-9bd26bbe911e queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw ts="2024-07-25 09:01:39.241342385 +0000 UTC m=+908.747117582" value=52.597447395653674 
myapp-1  | 2024-07-25T09:01:39.241Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=memory_allocated_percentage id=6fea318a-2c45-0540-1514-5c996cd179e3 policy_id=4caf85ae-eaea-9718-24ce-9bd26bbe911e queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw count=2 
myapp-1  | 2024-07-25T09:01:39.241Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=memory_allocated_percentage current_count=2 new_count=2 metric_value=52.597447395653674 metric_time="2024-07-25 09:01:39.241342385 +0000 UTC m=+908.747117582" factor=0.657468092445671 direction=down max_scale_up="+Inf" max_scale_down=-Inf

If we delete the datacenter setting, the result is still the same, as the allocatable_cpu and allocatable_memory values show:

myapp-1  | 2024-07-25T09:27:15.989Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=45790 allocated_memory=53368 allocatable_cpu=70000 allocatable_memory=101465 
myapp-1  | 2024-07-25T09:27:15.989Z [TRACE] policy_eval.worker.check_handler: metric result: check=memory_allocated_percentage id=b7fc6c61-fce4-df96-f013-afc46a63da83 policy_id=9d72db2d-5383-363f-a613-004df4866ab9 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw ts="2024-07-25 09:27:15.98928497 +0000 UTC m=+65.658597573" value=52.597447395653674 
myapp-1  | 2024-07-25T09:27:15.989Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=memory_allocated_percentage id=b7fc6c61-fce4-df96-f013-afc46a63da83 policy_id=9d72db2d-5383-363f-a613-004df4866ab9 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw count=2 
myapp-1  | 2024-07-25T09:27:15.989Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=memory_allocated_percentage current_count=2 new_count=2 metric_value=52.597447395653674 metric_time="2024-07-25 09:27:15.98928497 +0000 UTC m=+65.658597573" factor=0.657468092445671 direction=down max_scale_up="+Inf" max_scale_down=-Inf

BUT if we change the node_class to something unique between the 2 datacenters, it works properly, which is also visible in the allocatable_cpu and allocatable_memory values:

myapp-1  | 2024-07-25T09:21:55.781Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=500 allocated_memory=2176 allocatable_cpu=20000 allocatable_memory=28990 
myapp-1  | 2024-07-25T09:21:55.781Z [TRACE] policy_eval.worker.check_handler: metric result: check=memory_allocated_percentage id=b0656bf6-fa07-f97f-bd5d-145cc79e4573 policy_id=15eadd6f-7491-d0d4-9353-2a7a6ad8ff06 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw ts="2024-07-25 09:21:55.781273586 +0000 UTC m=+64.251790905" value=7.506036564332528 
myapp-1  | 2024-07-25T09:21:55.781Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=memory_allocated_percentage id=b0656bf6-fa07-f97f-bd5d-145cc79e4573 policy_id=15eadd6f-7491-d0d4-9353-2a7a6ad8ff06 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw count=2 
myapp-1  | 2024-07-25T09:21:55.781Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=memory_allocated_percentage current_count=2 new_count=1 metric_value=7.506036564332528 metric_time="2024-07-25 09:21:55.781273586 +0000 UTC m=+64.251790905" factor=0.0938254570541566 direction=down max_scale_up="+Inf" max_scale_down=-Inf
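
For reference, this working case corresponds to something like the following, with a datacenter-specific node_class on the clients and the matching value in the policy target (the class name here is just a placeholder we made up):

# Clients in aws-euw-1 only
client {
  enabled    = true
  node_class = "test-stage-euw"
}

target "aws-asg-euw" {
  aws_asg_name = "test-stage"
  node_class   = "test-stage-euw"
  # remaining settings as in the policy above
}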

Is this a limitation of nomad-apm, and do we just have to use other, more advanced queries where we filter by datacenter instead? Or is this something we are doing wrong?
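
For example, this is roughly what we mean by a more advanced query, assuming the Prometheus APM plugin is configured and Nomad's client metrics are scraped with datacenter and node_class labels (the metric and label names below are our assumption, not a verified query):

check "memory_allocated_percentage" {
  source = "prometheus"
  query  = "sum(nomad_client_allocated_memory{datacenter=\"aws-euw-1\",node_class=\"test-stage\"}) * 100 / (sum(nomad_client_allocated_memory{datacenter=\"aws-euw-1\",node_class=\"test-stage\"}) + sum(nomad_client_unallocated_memory{datacenter=\"aws-euw-1\",node_class=\"test-stage\"}))"
  strategy "target-value" {
    target = 80.0
  }
}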

Thank you in advance for any answer.
