-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[core] Add a test for nested sub-processes cleanup #57638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[core] Add a test for nested sub-processes cleanup #57638
Conversation
Signed-off-by: Kai-Hsun Chen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds a test for nested subprocess cleanup. The test verifies that when an actor is killed, a subprocess spawned by another subprocess of that actor is also cleaned up.
My review focuses on improving the robustness of the new test. I've identified a potential race condition that could lead to test flakiness and provided a suggestion to fix it.
Signed-off-by: Kai-Hsun Chen <[email protected]>
cc @codope @jjyao @edoakes: Could we enable this by default? The process leak is pretty annoying. Imagine that some users encapsulate inference engines in a Ray actor with DP > 1. When users launch the Ray job again, they will observe GPU OOMs due to the DP process leaks. It's not straightforward for most users. |
@kevin85421 we plan to enable it by default, just being careful and testing with users who requested it first since it is technically a breaking change. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevin85421 Thanks for adding this test! Approved with some minor comments.
Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Why are these changes needed?
I've encountered an issue where Ray sends SIGKILL to child processes (grandchild will not receive the signal) launched by a Ray actor. As a result, the subprocess cannot catch the signal to gracefully clean up its child processes. Therefore, the grandchild processes of the actor will leak.
I'm glad to see #56476 by @codope, and I also built a similar solution myself. This PR adds the case where I met.
@codope why not enable this feature by default?
Related issue number
Checks
git commit -s
) in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.