feat(agent): add an agent PID management module #940

Closed
rene-oromtz wants to merge 3 commits into main from feat/add-agent-pid

Conversation

@rene-oromtz
Contributor

@rene-oromtz rene-oromtz commented Feb 27, 2026

Description

This PR is meant to add more resiliency to agents spawned by Supervisor.

If Supervisor crashes or is OOM-killed, it can lose track of the old process, leaving agents orphaned and requiring manual intervention. The long-term fix is to increase the charm unit memory per deployment, but the proposed module ensures that only one PID exists per agent.
The agent now creates a PID file on start. If Supervisor tries to restart the agent after the old process became orphaned, the agent reads the old PID and terminates the old process on startup.
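The PID-file guard described above can be sketched roughly as follows. This is an illustrative sketch, not the actual testflinger module; the function name and file layout are assumptions:

```python
# Hypothetical sketch of the PID-file guard: terminate a stale agent
# recorded in the PID file, then record our own PID there.
import os
import signal
from pathlib import Path


def claim_pid_file(pid_file: Path) -> None:
    """Kill any stale agent process named in pid_file, then write our PID."""
    if pid_file.exists():
        try:
            old_pid = int(pid_file.read_text().strip())
        except ValueError:
            old_pid = None  # corrupt PID file; just overwrite it
        # Never signal ourselves, and tolerate PIDs that no longer exist
        if old_pid and old_pid != os.getpid():
            try:
                os.kill(old_pid, signal.SIGTERM)  # ask the old agent to exit
            except ProcessLookupError:
                pass  # old process is already gone
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(os.getpid()))
```

On startup the agent would call `claim_pid_file(Path("testflinger/audino/logs/audino.pid"))` (path taken from the staging test below), so a Supervisor restart after an OOM kill cleans up the orphan instead of leaving two agents running.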

Resolved issues

Resolves #939
Resolves CERTTF-843

Documentation

Web service API changes

Tests

Added unit tests that should cover all of the module scenarios.
A quick test on staging verified that the PID file was being rewritten properly:

sudo supervisorctl status audino
audino                           RUNNING   pid 1522718, uptime 0:33:25

cat testflinger/audino/logs/audino.pid 
1522718

After agent restart:

sudo supervisorctl status audino
audino                           RUNNING   pid 1527087, uptime 0:51:53

cat testflinger/audino/logs/audino.pid 
1527087

@codecov

codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.09%. Comparing base (787bfb2) to head (2fb4584).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #940      +/-   ##
==========================================
+ Coverage   73.85%   74.09%   +0.24%     
==========================================
  Files         108      109       +1     
  Lines       10311    10362      +51     
  Branches      886      889       +3     
==========================================
+ Hits         7615     7678      +63     
+ Misses       2508     2496      -12     
  Partials      188      188              
Flag Coverage Δ *Carryforward flag
agent 76.31% <100.00%> (+1.90%) ⬆️
cli 89.56% <ø> (ø) Carriedforward from 787bfb2
device 59.84% <ø> (ø) Carriedforward from 787bfb2
server 87.85% <ø> (ø) Carriedforward from 787bfb2

*This pull request uses carry forward flags. Click here to find out more.

Components Coverage Δ
Agent 76.31% <100.00%> (+1.90%) ⬆️
CLI 89.56% <ø> (ø)
Common ∅ <ø> (∅)
Device Connectors 59.84% <ø> (ø)
Server 87.85% <ø> (ø)

@rene-oromtz rene-oromtz marked this pull request as ready for review February 27, 2026 23:58
@rene-oromtz rene-oromtz requested a review from ajzobro February 27, 2026 23:58
except (
OSError,
yaml.YAMLError,
voluptuous.MultipleInvalid,
Collaborator

These changes seem unrelated to the original issue -- I don't see any new code that is using voluptuous other than exception handling.

Contributor Author

It was just me trying to remove the broad exception, but there are some other broad exceptions that need to be addressed first. Since OSError was raised when the port is already in use, I added the other two exceptions that can also be raised on configuration errors. I can remove those two if preferred.

@ajzobro
Collaborator

ajzobro commented Mar 2, 2026

Why are we moving forward with a solution that has agents killing agents rather than a higher-level process management tool ensuring that is happening? Isn't that the whole point of having a "supervisor"?

I do not think this is implemented in the correct place -- the issue seems to be that an agent that is being restarted isn't properly being killed off first -- this is exactly a supervisory activity IMO

@rene-oromtz
Contributor Author

I agree that this should be handled by Supervisor. Ideally (if possible) we would increase the memory in the agent charm unit so this doesn't happen in the first place, but that requires a (short?) maintenance window to replace the unit with one that has more memory. If that is not possible, at least this workaround avoids having to manually locate and terminate the orphaned processes.

@rene-oromtz
Contributor Author

Closing as per the comment above.

@rene-oromtz rene-oromtz closed this Mar 3, 2026

Development

Successfully merging this pull request may close these issues.

Supervisor managed agents shows as EXITED after OOM Kill
