feat(agent): add an agent PID management module #940
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main     #940      +/-   ##
==========================================
+ Coverage   73.85%   74.09%   +0.24%
==========================================
  Files         108      109       +1
  Lines       10311    10362      +51
  Branches      886      889       +3
==========================================
+ Hits         7615     7678      +63
+ Misses       2508     2496      -12
  Partials      188      188
    except (
        OSError,
        yaml.YAMLError,
        voluptuous.MultipleInvalid,
These changes seem unrelated to the original issue -- I don't see any new code that is using voluptuous other than exception handling.
It was just me trying to remove the broad exception, but there are some other broad exceptions that need to be addressed first. Since OSError was raised when the port is already in use, I added the other two that can also be raised on configuration errors. I can remove those two if preferred.
Why are we moving forward with a solution that has agents killing agents rather than a higher-level process management tool ensuring that is happening? Isn't that the whole point of having a "supervisor"? I do not think this is implemented in the correct place -- the issue seems to be that an agent that is being restarted isn't properly being killed off first -- this is exactly a supervisory activity IMO
I agree that this should be handled by Supervisor. Ideally (if possible) we would increase the memory in the agent charm unit so this doesn't happen in the first place, but this requires a (short?) maintenance window to replace the unit with one with more memory... If this is not possible, this workaround should at least remove the need to manually locate and terminate the orphaned processes.
Closing as per comment
Description
This PR adds resiliency to agents spawned by Supervisor.
If Supervisor crashes or is OOM-killed, it can lose track of the old agent process, leaving the agent orphaned and requiring manual intervention. The long-term fix is to increase the charm unit memory per deployment, but the proposed module should ensure that only one PID exists per agent.
The agent now creates a PID file on start. If Supervisor tries to restart the agent after the old process has become orphaned, the new agent reads the old PID from the file and terminates the old process on startup.
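For illustration only, here is a minimal sketch of the PID-file flow described above. It is not the module added in this PR: the function names (`read_pid`, `terminate_stale`, `claim_pid_file`) and the choice of SIGTERM are assumptions made for the example, and it only uses standard-library calls on a POSIX system.

```python
import os
import signal
from pathlib import Path
from typing import Optional


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if it is missing or malformed."""
    try:
        return int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def terminate_stale(pid: int) -> bool:
    """Send SIGTERM to pid. Return True if the signal was delivered.

    Hypothetical helper: it does not verify that pid still belongs to an
    agent, and a PermissionError (process owned by another user) propagates.
    """
    if pid == os.getpid():
        # The file already points at this process; nothing to terminate.
        return False
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        # The old process is already gone.
        return False
    return True


def claim_pid_file(pid_file: Path) -> None:
    """Terminate the previously recorded process, then record our own PID."""
    old_pid = read_pid(pid_file)
    if old_pid is not None:
        terminate_stale(old_pid)
    pid_file.write_text(f"{os.getpid()}\n")
```

Under these assumptions, an agent would call `claim_pid_file(Path(...))` once at startup, before binding any ports. Because operating systems reuse PIDs, a stale file can point at an unrelated process, so a production version would likely want to check the process identity (for example its command line) before signalling it.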
Resolved issues
Resolves #939
Resolves CERTTF-843
Documentation
Web service API changes
Tests
Added unit tests that should cover all of the module scenarios.
Quick test on staging just to verify the PID was getting rewritten properly:
After agent restart: