Few child runners not getting terminated #1298

@Divyaansh313

Describe the bug

Some child runner EC2 instances never reach the terminated state, even though the CloudTrail logs show them transitioning from stopping to shutting-down. As a result, these instances remain in the stopped state for a long time, and we have to delete them manually whenever we roll out runners.

To Reproduce

Steps to reproduce the behavior:

  1. We are unable to reproduce this issue on demand. We use the same module to spin up multiple runners, and the issue occurs only intermittently.
  2. In the AWS Console, filter the EC2 instances by the tag gitlab-runner-parent-id=<parent> and look for instances in the Stopped state.
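
The console check in step 2 can also be scripted. The following is a minimal sketch, assuming Python with boto3 and that the child instances carry the gitlab-runner-parent-id tag mentioned above. The function names are hypothetical and not part of the module.

```python
def stopped_child_filters(parent_id):
    """Build EC2 filters for stopped instances tagged with the given parent ID."""
    return [
        # Tag key taken from the issue text; adjust if your setup differs.
        {"Name": "tag:gitlab-runner-parent-id", "Values": [parent_id]},
        {"Name": "instance-state-name", "Values": ["stopped"]},
    ]


def find_stopped_children(ec2_client, parent_id):
    """Return the IDs of stopped child runner instances for one parent."""
    paginator = ec2_client.get_paginator("describe_instances")
    instance_ids = []
    for page in paginator.paginate(Filters=stopped_child_filters(parent_id)):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_ids.append(instance["InstanceId"])
    return instance_ids
```

With credentials configured, it could be called as `find_stopped_children(boto3.client("ec2"), "<parent>")`, where `<parent>` is the parent ID used in the tag.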

Expected behavior

  • No instances should be in the Stopped state. Every child runner EC2 instance should be in the Terminated state.
  • This hinders the rollout of new runners. We destroy every resource (security groups, etc.), but because the stopped instances still have the security groups attached, the destroy fails. We then have to terminate those instances manually, after which the resources are destroyed and recreated.
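
Until the root cause is found, the manual termination step could be scripted before running the destroy. This is only a sketch of a workaround, assuming Python with boto3 and a list of leftover instance IDs collected beforehand. The function name is hypothetical, and this is not how the module itself terminates child runners.

```python
def terminate_and_wait(ec2_client, instance_ids):
    """Terminate the given instances and block until EC2 reports them terminated."""
    if not instance_ids:
        # Nothing to clean up; avoid calling the API with an empty list.
        return []
    response = ec2_client.terminate_instances(InstanceIds=instance_ids)
    # The waiter polls DescribeInstances until every instance is terminated.
    ec2_client.get_waiter("instance_terminated").wait(InstanceIds=instance_ids)
    return [item["InstanceId"] for item in response["TerminatingInstances"]]
```

Waiting for the terminated state matters here because the security groups cannot be deleted while an instance still references them.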

Additional context

I checked with AWS Support because I thought this might be an issue on their end, but they replied with this:

Issue Summary:

Child instances from one specific ASG are not completing termination
Instances reach "shutting-down" state but don't complete termination
Other ASGs with identical configurations are working as expected
AWS services (API calls, permissions) working as expected

Architecture Overview:

ASG launches parent EC2 instances (GitLab Runners)
Parent instances create child instances for jobs
After job completion, child instances should terminate
Parent instances remain running for future jobs



Our Investigation:

==============

We've verified AWS service functionality:
CloudTrail logs show successful TerminateInstances API calls
IAM permissions are working correctly (evidenced by other ASGs)
No termination protection issues (confirmed disabled)
ASG and lifecycle hook configurations are identical across groups


Key Findings:
As per cloudtrail, termination API requests are being called.
Instances successfully transition to "shutting-down" state
No API throttling or permission issues detected
Infrastructure-level configurations are consistent across ASGs

==> Since AWS services are functioning as expected, we recommend:

Engaging your application team to investigate instance-level shutdown behavior
Reviewing instance termination calls from parent to child instance.
Analyzing application logs on parent during termination attempts.

As part of AWS Support's scope, we can provide comprehensive guidance on AWS infrastructure components, while application-specific troubleshooting such as GitLab Runner configurations or instance-level processes would be best addressed by your application team. Hence please check the code related components with the development team as they would be best to provide an expertise on that.

If you have any feedback on something I may have misunderstood or misconfigured, I'm more than happy to read it.

Best regards & keep up the good work!
