-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
test: use process groups for robust child cleanup #21261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Currently, the test cleanup logic only targets direct child processes (`ppid`). This is fragile as it fails to terminate grandchildren or detached processes, potentially leaving zombie processes in the CI environment. This change switches the strategy to target the entire process group (`pgid`). Safety measures implemented: - The runner temporarily ignores `SIGTERM` before sending it to the group to avoid terminating itself. - When force-killing remaining processes (`SIGKILL`), the runner explicitly filters out its own PID from the list returned by `pgrep`. Signed-off-by: Martin Lopatář <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR replaces the ppid-based child process cleanup logic with a pgid-based (process group) approach to ensure more comprehensive test cleanup, including grandchildren and detached processes.
Key Changes:
- Switches from
pkill -P pidtoProcess.kill('TERM', -pgid)for terminating child processes - Adds signal trapping to protect the test runner from self-termination during SIGTERM
- Implements a two-phase cleanup: SIGTERM followed by SIGKILL after a 1-second grace period
- Replaces external
pkillcalls with directProcess.killfor the final SIGKILL phase
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pgid = Process.getpgrp | ||
|
|
||
| $stderr.puts 'Terminating child processes...' | ||
| trap_saved = Signal.trap('TERM', 'IGNORE') | ||
| Process.kill('TERM', -pgid) rescue nil | ||
| Signal.trap('TERM', trap_saved) |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical issue with process group termination logic. When calling Process.kill('TERM', -pgid), this will send SIGTERM to ALL processes in the process group, including the test runner itself. The SIGTERM trap on line 72 will not fully protect the runner because the signal is sent to the entire process group simultaneously. This could cause the test runner to terminate unexpectedly.
Additionally, Process.getpgrp returns the process group ID of the calling process, which means if this script was launched as part of an existing process group, it will attempt to kill all processes in that group, not just child processes spawned by the test.
Consider using Process.setsid at the start of the script to create a new process session/group, ensuring child processes are in their own group that can be safely terminated without affecting the parent test runner.
| pgrep = '/usr/bin/pgrep' | ||
| if File.executable?(pgrep) |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path is hardcoded to /usr/bin/pgrep, but the startup code at line 36 uses the which function to check for pgrep availability. This creates an inconsistency. Consider using which('pgrep') to find the actual location of pgrep or using the same path resolution strategy used elsewhere in the codebase.
| pgrep = '/usr/bin/pgrep' | |
| if File.executable?(pgrep) | |
| pgrep = which("pgrep") | |
| if pgrep && File.executable?(pgrep) |
| begin | ||
| pgid = Process.getpgrp | ||
|
|
||
| $stderr.puts 'Terminating child processes...' | ||
| trap_saved = Signal.trap('TERM', 'IGNORE') | ||
| Process.kill('TERM', -pgid) rescue nil | ||
| Signal.trap('TERM', trap_saved) | ||
|
|
||
| sleep 1 | ||
| system pgrep, "-9", "-P", pid | ||
|
|
||
| pgrep = '/usr/bin/pgrep' | ||
| if File.executable?(pgrep) | ||
| $stderr.puts 'Killing child processes...' | ||
| pids = IO.popen([pgrep, '-g', pgid.to_s]) { |io| io.read.split.map(&:to_i) } rescue [] | ||
| pids.each do |pid| | ||
| next if pid == Process.pid | ||
| Process.kill('KILL', pid) rescue nil | ||
| end | ||
| end |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleanup code runs unconditionally in the ensure block even when tests succeed without spawning child processes. This adds unnecessary overhead (signal trapping, process group query, 1 second sleep) to all test runs. Consider checking if any child processes actually exist before running the cleanup logic, or at minimum, make the sleep conditional on whether processes were found.
| begin | ||
| pgid = Process.getpgrp | ||
|
|
||
| $stderr.puts 'Terminating child processes...' | ||
| trap_saved = Signal.trap('TERM', 'IGNORE') | ||
| Process.kill('TERM', -pgid) rescue nil | ||
| Signal.trap('TERM', trap_saved) | ||
|
|
||
| sleep 1 | ||
| system pgrep, "-9", "-P", pid | ||
|
|
||
| pgrep = '/usr/bin/pgrep' | ||
| if File.executable?(pgrep) | ||
| $stderr.puts 'Killing child processes...' | ||
| pids = IO.popen([pgrep, '-g', pgid.to_s]) { |io| io.read.split.map(&:to_i) } rescue [] | ||
| pids.each do |pid| | ||
| next if pid == Process.pid | ||
| Process.kill('KILL', pid) rescue nil | ||
| end | ||
| end | ||
| rescue Exception => cleanup_ex | ||
| $stderr.puts "Cleanup failed: #{cleanup_ex.message}" | ||
| end |
Copilot
AI
Dec 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new process group cleanup logic lacks test coverage. Given the critical nature of this change (switching from ppid-based to pgid-based process termination) and the complexity involved (signal handling, process group management), tests should be added to verify:
- Cleanup properly terminates child processes
- The test runner itself is not terminated
- Cleanup handles cases where no child processes exist
- Error handling in the cleanup rescue block works correctly
The repository has comprehensive test coverage for other files (e.g., test_runner_formula_spec.rb), suggesting tests should be added for this functionality as well.
MikeMcQuaid
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide some reproducible test cases for the case this fixes? I'm not sure it's worth all the extra code here as-is.
Ok, I will try. I understand, it's really a lot of extra code. Is there a test for the the current (by parent pid) cleanup logic? I didn't investigating it much, but I think it can currently catch only rear cases. Because if descendant process is completely detached, it's not a child and if it is not completely detached, it would be killed as soon as |
Sorry, to clarify, on this I more mean "formulae/tests/commands I can run locally to verify" rather than "please write a test for this in Ruby". |
This PR implements the architectural improvement suggested in the previous bug report regarding test cleanup stability.
Currently, the test cleanup logic only targets direct child processes (
ppid). This is fragile as it fails to terminate grandchildren or detached processes, potentially leaving zombie processes in the CI environment.As suggested earlier:
Implementation Details
This change switches the cleanup strategy to target the entire Process Group.
To ensure safety and prevent the test runner from killing itself:
SIGTERMbefore sending it to the group (Process.kill('TERM', -pgid)).pgrepfor a force-kill, the runner explicitly filters out its own PID from the list.This ensures a completely clean state even if tests spawn complex process trees.
brew lgtm(style, typechecking and tests) with your changes locally?