feat(agent): add an agent PID management module #940

Closed
rene-oromtz wants to merge 3 commits into main from feat/add-agent-pid

Conversation

@rene-oromtz
Contributor

@rene-oromtz rene-oromtz commented Feb 27, 2026

Description

This PR is meant to add more resiliency to agents spawned by Supervisor.

If Supervisor crashes or is OOM-killed, it can lose track of the old process, leaving agents orphaned and requiring manual intervention. The long-term fix is to increase the charm unit memory per deployment, but the proposed module ensures that only one PID exists per agent.
The agent now creates a PID file on start. If Supervisor tries to restart the agent after the old process became orphaned, the agent reads the old PID and terminates the old process on startup.
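The PID-file guard described above can be sketched roughly as follows. This is an illustrative sketch, not the actual testflinger module; the function name and file layout are assumptions:

```python
# Hypothetical sketch of the PID-file guard: terminate a stale agent
# recorded in the PID file, then record our own PID there.
import os
import signal
from pathlib import Path


def claim_pid_file(pid_file: Path) -> None:
    """Kill any stale agent process named in pid_file, then write our PID."""
    if pid_file.exists():
        try:
            old_pid = int(pid_file.read_text().strip())
        except ValueError:
            old_pid = None  # corrupt PID file; just overwrite it
        # Never signal ourselves, and tolerate PIDs that no longer exist
        if old_pid and old_pid != os.getpid():
            try:
                os.kill(old_pid, signal.SIGTERM)  # ask the old agent to exit
            except ProcessLookupError:
                pass  # old process is already gone
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(os.getpid()))
```

On startup the agent would call `claim_pid_file(Path("testflinger/audino/logs/audino.pid"))` (path taken from the staging test below), so a Supervisor restart after an OOM kill cleans up the orphan instead of leaving two agents running.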

Resolved issues

Resolves #939
Resolves CERTTF-843

Documentation

Web service API changes

Tests

Added unit tests that should cover all of the module scenarios.
A quick test on staging verified that the PID file was being rewritten properly:

sudo supervisorctl status audino
audino                           RUNNING   pid 1522718, uptime 0:33:25

cat testflinger/audino/logs/audino.pid 
1522718

After agent restart:

sudo supervisorctl status audino
audino                           RUNNING   pid 1527087, uptime 0:51:53

cat testflinger/audino/logs/audino.pid 
1527087

@codecov

codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.09%. Comparing base (787bfb2) to head (2fb4584).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #940      +/-   ##
==========================================
+ Coverage   73.85%   74.09%   +0.24%     
==========================================
  Files         108      109       +1     
  Lines       10311    10362      +51     
  Branches      886      889       +3     
==========================================
+ Hits         7615     7678      +63     
+ Misses       2508     2496      -12     
  Partials      188      188              
Flag Coverage Δ *Carryforward flag
agent 76.31% <100.00%> (+1.90%) ⬆️
cli 89.56% <ø> (ø) Carriedforward from 787bfb2
device 59.84% <ø> (ø) Carriedforward from 787bfb2
server 87.85% <ø> (ø) Carriedforward from 787bfb2

*This pull request uses carry forward flags. Click here to find out more.

Components Coverage Δ
Agent 76.31% <100.00%> (+1.90%) ⬆️
CLI 89.56% <ø> (ø)
Common ∅ <ø> (∅)
Device Connectors 59.84% <ø> (ø)
Server 87.85% <ø> (ø)

@rene-oromtz rene-oromtz marked this pull request as ready for review February 27, 2026 23:58
@rene-oromtz rene-oromtz requested a review from ajzobro February 27, 2026 23:58
except (
OSError,
yaml.YAMLError,
voluptuous.MultipleInvalid,
Collaborator

These changes seem unrelated to the original issue -- I don't see any new code that is using voluptuous other than exception handling.

Contributor Author

It was just me trying to remove the broad exception, but there are some other broad exceptions that need to be addressed first. Since OSError was raised when the port is already in use, I added the other two exceptions that can also be raised on configuration errors. I can remove those two if preferred.

@ajzobro
Collaborator

ajzobro commented Mar 2, 2026

Why are we moving forward with a solution that has agents killing agents rather than a higher-level process management tool ensuring that is happening? Isn't that the whole point of having a "supervisor"?

I do not think this is implemented in the correct place -- the issue seems to be that an agent that is being restarted isn't properly being killed off first -- this is exactly a supervisory activity IMO

@rene-oromtz
Contributor Author

I agree that this should be handled by Supervisor. Ideally (if possible) we would increase the memory in the agent charm unit so this doesn't happen in the first place, but that requires a (short?) maintenance window to replace the unit with one that has more memory. If that is not possible, at least this workaround avoids having to manually locate and terminate the orphaned processes.

@rene-oromtz
Contributor Author

Closing as per the comment above.

@rene-oromtz rene-oromtz closed this Mar 3, 2026

Development

Successfully merging this pull request may close these issues.

Supervisor managed agents shows as EXITED after OOM Kill
