-
-
Notifications
You must be signed in to change notification settings - Fork 346
Description
Describe the bug
Some of the Child runners EC2 instances are not going to terminated state even after it is showing in the CloudTrail logs that it is transitioning from stopping to shutting-down state. This is keeping some of the Child runners EC2 instances in the stopped state for a long period of time and during the roll out of runners, there is a need to manually delete that.
To Reproduce
Steps to reproduce the behavior:
- Unable to reproduce this issue manually since we are using same module for spinning up multiple runners but it is happening intermittently.
- Try going to the AWS Console and filter out EC2 instances using the
gitlab-runner-parent-id=<parent>
and search for Stopped instances.
Expected behavior
- No instances should be in Stopped state. Every child runner EC2 instances should be in Terminated state.
- This hinders the roll out of new runners since we destroy every resources (security groups, etc.) and since the instances are in stopped state, security groups are attached and the destroy resources fails. Then, we have to manually terminate those instances after which the resources are destroyed and recreated.
Additional context
I checked with the AWS Support folks as I thought this may be an issue from their end but they replied with this:
Issue Summary:
Child instances from one specific ASG are not completing termination
Instances reach "shutting-down" state but don't complete termination
Other ASGs with identical configurations are working as expected
AWS services (API calls, permissions) working as expected
Architecture Overview:
ASG launches parent EC2 instances (GitLab Runners)
Parent instances create child instances for jobs
After job completion, child instances should terminate
Parent instances remain running for future jobs
Our Investigation:
==============
We've verified AWS service functionality:
CloudTrail logs show successful TerminateInstances API calls
IAM permissions are working correctly (evidenced by other ASGs)
No termination protection issues (confirmed disabled)
ASG and lifecycle hook configurations are identical across groups
Key Findings:
As per cloudtrail, termination API requests are being called.
Instances successfully transition to "shutting-down" state
No API throttling or permission issues detected
Infrastructure-level configurations are consistent across ASGs
==> Since AWS services are functioning as expected, we recommend:
Engaging your application team to investigate instance-level shutdown behavior
Reviewing instance termination calls from parent to child instance.
Analyzing application logs on parent during termination attempts.
As part of AWS Support's scope, we can provide comprehensive guidance on AWS infrastructure components, while application-specific troubleshooting such as GitLab Runner configurations or instance-level processes would be best addressed by your application team. Hence please check the code related components with the development team as they would be best to provide an expertise on that.
If you guys have any feedback on something I may have misunderstood/misconfigured I'm more than happy to read it.
Best regards & keep up the good work!