Skip to content

Conversation

@loplex
Copy link
Contributor

@loplex loplex commented Dec 16, 2025

This PR implements the architectural improvement suggested in the previous bug report regarding test cleanup stability.

Currently, the test cleanup logic only targets direct child processes (ppid). This is fragile as it fails to terminate grandchildren or detached processes, potentially leaving zombie processes in the CI environment.

As suggested earlier:

"Architectural suggestion: Killing by ppid is fragile as it does not cover grandchildren processes. It would be more robust to kill the entire process group (pgid) to ensure a clean test environment shutdown."

Implementation Details

This change switches the cleanup strategy to target the entire Process Group.

To ensure safety and prevent the test runner from killing itself:

  1. SIGTERM: The runner temporarily traps/ignores SIGTERM before sending it to the group (Process.kill('TERM', -pgid)).
  2. SIGKILL: When identifying remaining processes via pgrep for a force-kill, the runner explicitly filters out its own PID from the list.

This ensures a completely clean state even if tests spawn complex process trees.

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your changes? Here's an example.
  • Have you successfully run brew lgtm (style, typechecking and tests) with your changes locally?

Currently, the test cleanup logic only targets direct child processes (`ppid`).
This is fragile as it fails to terminate grandchildren or detached processes,
potentially leaving zombie processes in the CI environment.

This change switches the strategy to target the entire process group (`pgid`).

Safety measures implemented:
- The runner temporarily ignores `SIGTERM` before sending it to the group to
  avoid terminating itself.
- When force-killing remaining processes (`SIGKILL`), the runner explicitly
  filters out its own PID from the list returned by `pgrep`.

Signed-off-by: Martin Lopatář <[email protected]>
Copilot AI review requested due to automatic review settings December 16, 2025 03:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the ppid-based child process cleanup logic with a pgid-based (process group) approach to ensure more comprehensive test cleanup, including grandchildren and detached processes.

Key Changes:

  • Switches from pkill -P pid to Process.kill('TERM', -pgid) for terminating child processes
  • Adds signal trapping to protect the test runner from self-termination during SIGTERM
  • Implements a two-phase cleanup: SIGTERM followed by SIGKILL after a 1-second grace period
  • Replaces external pkill calls with direct Process.kill for the final SIGKILL phase

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +69 to +74
pgid = Process.getpgrp

$stderr.puts 'Terminating child processes...'
trap_saved = Signal.trap('TERM', 'IGNORE')
Process.kill('TERM', -pgid) rescue nil
Signal.trap('TERM', trap_saved)
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical issue with process group termination logic. When calling Process.kill('TERM', -pgid), this will send SIGTERM to ALL processes in the process group, including the test runner itself. The SIGTERM trap on line 72 will not fully protect the runner because the signal is sent to the entire process group simultaneously. This could cause the test runner to terminate unexpectedly.

Additionally, Process.getpgrp returns the process group ID of the calling process, which means if this script was launched as part of an existing process group, it will attempt to kill all processes in that group, not just child processes spawned by the test.

Consider using Process.setsid at the start of the script to create a new process session/group, ensuring child processes are in their own group that can be safely terminated without affecting the parent test runner.

Copilot uses AI. Check for mistakes.
Comment on lines +78 to +79
pgrep = '/usr/bin/pgrep'
if File.executable?(pgrep)
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path is hardcoded to /usr/bin/pgrep, but the startup code at line 36 uses the which function to check for pgrep availability. This creates an inconsistency. Consider using which('pgrep') to find the actual location of pgrep or using the same path resolution strategy used elsewhere in the codebase.

Suggested change
pgrep = '/usr/bin/pgrep'
if File.executable?(pgrep)
pgrep = which("pgrep")
if pgrep && File.executable?(pgrep)

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +86
begin
pgid = Process.getpgrp

$stderr.puts 'Terminating child processes...'
trap_saved = Signal.trap('TERM', 'IGNORE')
Process.kill('TERM', -pgid) rescue nil
Signal.trap('TERM', trap_saved)

sleep 1
system pgrep, "-9", "-P", pid

pgrep = '/usr/bin/pgrep'
if File.executable?(pgrep)
$stderr.puts 'Killing child processes...'
pids = IO.popen([pgrep, '-g', pgid.to_s]) { |io| io.read.split.map(&:to_i) } rescue []
pids.each do |pid|
next if pid == Process.pid
Process.kill('KILL', pid) rescue nil
end
end
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup code runs unconditionally in the ensure block even when tests succeed without spawning child processes. This adds unnecessary overhead (signal trapping, process group query, 1 second sleep) to all test runs. Consider checking if any child processes actually exist before running the cleanup logic, or at minimum, make the sleep conditional on whether processes were found.

Copilot uses AI. Check for mistakes.
Comment on lines +68 to 89
begin
pgid = Process.getpgrp

$stderr.puts 'Terminating child processes...'
trap_saved = Signal.trap('TERM', 'IGNORE')
Process.kill('TERM', -pgid) rescue nil
Signal.trap('TERM', trap_saved)

sleep 1
system pgrep, "-9", "-P", pid

pgrep = '/usr/bin/pgrep'
if File.executable?(pgrep)
$stderr.puts 'Killing child processes...'
pids = IO.popen([pgrep, '-g', pgid.to_s]) { |io| io.read.split.map(&:to_i) } rescue []
pids.each do |pid|
next if pid == Process.pid
Process.kill('KILL', pid) rescue nil
end
end
rescue Exception => cleanup_ex
$stderr.puts "Cleanup failed: #{cleanup_ex.message}"
end
Copy link

Copilot AI Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new process group cleanup logic lacks test coverage. Given the critical nature of this change (switching from ppid-based to pgid-based process termination) and the complexity involved (signal handling, process group management), tests should be added to verify:

  • Cleanup properly terminates child processes
  • The test runner itself is not terminated
  • Cleanup handles cases where no child processes exist
  • Error handling in the cleanup rescue block works correctly

The repository has comprehensive test coverage for other files (e.g., test_runner_formula_spec.rb), suggesting tests should be added for this functionality as well.

Copilot uses AI. Check for mistakes.
Copy link
Member

@MikeMcQuaid MikeMcQuaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide some reproducible test cases for the case this fixes? I'm not sure it's worth all the extra code here as-is.

@loplex
Copy link
Contributor Author

loplex commented Dec 16, 2025

Can you provide some reproducible test cases for the case this fixes? I'm not sure it's worth all the extra code here as-is.

Ok, I will try. I understand, it's really a lot of extra code. Is there a test for the the current (by parent pid) cleanup logic? I didn't investigating it much, but I think it can currently catch only rear cases. Because if descendant process is completely detached, it's not a child and if it is not completely detached, it would be killed as soon as test.rb ends anyway. The only exception is if process is not executed using shell, running in background and have no connection to parent process, ...
e.g. spawn("nohup", "sleep", "42", :in => File::NULL, :out => File::NULL, :err => File::NULL)

@MikeMcQuaid
Copy link
Member

Ok, I will try.

Sorry, to clarify, on this I more mean "formulae/tests/commands I can run locally to verify" rather than "please write a test for this in Ruby".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants