@kevin85421 (Member) commented Oct 11, 2025

Why are these changes needed?

I've encountered an issue where Ray sends SIGKILL only to the child processes launched by a Ray actor; their own children (the actor's grandchild processes) do not receive the signal. Because SIGKILL cannot be caught, the child process has no chance to gracefully clean up its children, so the grandchild processes of the actor leak.
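The leak described above can be reproduced without Ray. The following is a minimal, POSIX-only sketch using the standard library; the helper names (`spawn_child`, `pid_alive`, `kill_child_only`, `kill_process_group`) are hypothetical and not part of Ray. It shows that SIGKILL sent to the direct child leaves the grandchild running, while signalling the whole process group reaches both. Process-group signalling is one common remedy and is shown here as an assumption, not as a description of what #56476 implements.

```python
import os
import signal
import subprocess
import sys
import time

# Source of the child: it spawns a sleeping grandchild, reports its PID, then sleeps.
CHILD_SRC = (
    "import subprocess, sys, time\n"
    "g = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(g.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_child():
    """Start the child in its own session/process group; return (Popen, grandchild_pid)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SRC],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,  # the child becomes leader of a new process group
    )
    grandchild_pid = int(child.stdout.readline())
    return child, grandchild_pid


def pid_alive(pid):
    """Return True if the PID exists and (on Linux) is not a zombie."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    try:
        with open(f"/proc/{pid}/stat") as f:
            state = f.read().rsplit(")", 1)[1].split()[0]
        return state != "Z"
    except OSError:
        return True  # no /proc (e.g. macOS) or a transient race; callers poll


def wait_until_dead(pid, timeout=5.0):
    """Poll until the PID is gone or the timeout expires; return True if it is gone."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return True
        time.sleep(0.05)
    return not pid_alive(pid)


def kill_child_only():
    """SIGKILL only the direct child; return True if the grandchild leaked."""
    child, grandchild_pid = spawn_child()
    child.kill()  # SIGKILL cannot be caught, so the child cannot clean up
    child.wait()
    child.stdout.close()
    time.sleep(0.5)
    leaked = pid_alive(grandchild_pid)
    if leaked:
        os.kill(grandchild_pid, signal.SIGKILL)  # tidy up after the demo
        wait_until_dead(grandchild_pid)
    return leaked


def kill_process_group():
    """SIGKILL the child's whole process group; return True if the grandchild died."""
    child, grandchild_pid = spawn_child()
    os.killpg(os.getpgid(child.pid), signal.SIGKILL)
    child.wait()
    child.stdout.close()
    return wait_until_dead(grandchild_pid)


if __name__ == "__main__":
    print("grandchild leaked after killing child only:", kill_child_only())
    print("grandchild died after killing process group:", kill_process_group())
```

The sketch relies on the grandchild inheriting the child's process group, so it would not catch a grandchild that moves itself into a new session or group.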

I'm glad to see #56476 by @codope; I had also built a similar solution myself. This PR adds a test for the case I ran into.

@codope why not enable this feature by default?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421 requested a review from a team as a code owner October 11, 2025 02:31
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request adds a test for nested subprocess cleanup. The test verifies that when an actor is killed, a subprocess spawned by another subprocess of that actor is also cleaned up.

My review focuses on improving the robustness of the new test. I've identified a potential race condition that could lead to test flakiness and provided a suggestion to fix it.
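The review summary does not spell out the race condition. One common source of flakiness in process-cleanup tests is asserting on process state immediately after a kill, before the operating system has finished tearing the process down. A minimal sketch of the usual remedy, polling with a deadline instead of sleeping for a fixed time, is shown below. The helper name `wait_for_condition` and its parameters are assumptions made for illustration; this is not the reviewer's actual suggestion.

```python
import threading
import time


def wait_for_condition(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns True; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


def demo():
    """Wait for an event that a background timer sets shortly after the call starts."""
    done = threading.Event()
    timer = threading.Timer(0.3, done.set)
    timer.start()
    try:
        wait_for_condition(done.is_set, timeout=5.0, interval=0.05)
    finally:
        timer.cancel()
    return done.is_set()
```

In a cleanup test, the predicate would typically check that a recorded PID no longer exists, so the assertion passes as soon as the process is gone and fails only after the full timeout.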

Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421

cc @codope @jjyao @edoakes: Could we enable this by default? The process leak is pretty annoying. Imagine users who wrap an inference engine in a Ray actor with DP > 1. When they launch the Ray job again, they will see GPU OOMs caused by the leaked DP processes, and the root cause is not obvious to most users.

@kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) on Oct 11, 2025
@ray-gardener (bot) added the "core" (Issues that should be addressed in Ray Core) and "community-contribution" (Contributed by the community) labels on Oct 11, 2025
@edoakes (Collaborator) commented Oct 13, 2025

@kevin85421 we plan to enable it by default; we're just being careful and testing it first with the users who requested it, since it is technically a breaking change.

@codope (Contributor) left a comment

@kevin85421 Thanks for adding this test! Approved with some minor comments.

Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421

> we plan to enable it by default

Sounds good.

Thanks @edoakes @codope for the review.

Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421 enabled auto-merge (squash) October 13, 2025 18:17