-
Notifications
You must be signed in to change notification settings - Fork 6.4k
[core] fix detached actor being unexpectedly killed #53562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[core] fix detached actor being unexpectedly killed #53562
Conversation
By replacing the inaccurate `worker->IsDetachedActor()` with `worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor()`. Signed-off-by: Rueian <[email protected]>
Test the new behaivor with a private static function so that we don't need to create a node manager in tests. Signed-off-by: Rueian <[email protected]>
8fc1fa5
to
21ffca5
Compare
open an issue to track the progress: ray-project/kuberay#3700 |
21ffca5
to
6b3205a
Compare
6b3205a
to
d6e49c3
Compare
KillWorkersOwnedByNodeID( | ||
leased_workers_, | ||
[this](const std::shared_ptr<WorkerInterface> &worker) { KillWorker(worker); }, | ||
node_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR essentially replaces worker->IsDetachedActor()
with worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor()
.
To make the changes be tested, I followed the practice in TestHandleReportWorkerBacklog to extract the loop on the leased_workers_
to static methods KillWorkersOwnedByNodeID
and KillWorkersOwnedByWorkerID
for unit testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fully remove worker.IsDetachedActor()
and/or replace its implementation with what you have here? Looks like it's likely to cause similar bugs in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Worker::IsDetachedActor
will no longer be used after this PR. I was originally planning to remove it in a follow-up PR, but now the removal is included here.
@kevin85421 Done. This PR is also ready for review. Please take a look. Thanks! |
…Actor Signed-off-by: Rueian <[email protected]>
06aecc7
to
3ed4f15
Compare
/// This is created for unit test purpose so that we don't need to create | ||
/// a node manager in order to test KillWorkersOwnedByNodeID. | ||
static void KillWorkersOwnedByNodeID( | ||
const absl::flat_hash_map<WorkerID, std::shared_ptr<WorkerInterface>> | ||
&leased_workers, | ||
const std::function<void(const std::shared_ptr<WorkerInterface> &)> &kill_worker, | ||
const NodeID &node_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an anti-pattern, we should be writing tests against the public interface of the relevant class (in this case, NodeManager
.
Is it possible to rewrite it in that way instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I think we're going to have to setup a NodeManager with the right workers and then call the public API.
@israbbani can you help review this PR please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent find and root cause. I left a few comments.
void MarkDetachedActor() override { is_detached_actor_ = true; } | ||
bool IsDetachedActor() const override { return is_detached_actor_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code would be a simpler if you deleted MarkDetachedActor()
and changed the implementation of IsDetachedActor
to do the right thing i.e.
bool IsDetachedActor() const override {
return assigned_task_.GetTaskSpecification().IsDetachedActor();
}
That way the rest of the code will work as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When calling a chain of functions like that, it's also worth making sure that each function in the chain returns a valid object i.e. GetTaskSpecification() always returns a properly constructed TaskSpec object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When calling a chain of functions like that, it's also worth making sure that each function in the chain returns a valid object i.e. GetTaskSpecification() always returns a properly constructed TaskSpec object.
Hi @israbbani, are you suggesting that I add a check in the GetTaskSpecification() method to ensure that the underlying specification is properly constructed, or make the method fail if it’s not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rueian. I looked at the code and in this case RayTask and TaskSpec both have default no-args constructors so we should be okay to just write
bool IsDetachedActor() const override {
return assigned_task_.GetTaskSpecification().IsDetachedActor();
}
/// This is created for unit test purpose so that we don't need to create | ||
/// a node manager in order to test KillWorkersOwnedByNodeID. | ||
static void KillWorkersOwnedByNodeID( | ||
const absl::flat_hash_map<WorkerID, std::shared_ptr<WorkerInterface>> | ||
&leased_workers, | ||
const std::function<void(const std::shared_ptr<WorkerInterface> &)> &kill_worker, | ||
const NodeID &node_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I think we're going to have to setup a NodeManager with the right workers and then call the public API.
By replacing the inaccurate
worker->IsDetachedActor()
withworker->GetAssignedTask().GetTaskSpecification().IsDetachedActor()
.Why are these changes needed?
In the previous PR #14184, the

worker.MarkDetachedActor()
that happened on assigning a task to a worker was deleted.And that causes a leased worker for a detached worker can be killed by HandleUnexpectedWorkerFailure, as mentioned in #40864, which is also even triggered by a normal exit of driver. The reproducible scripts can be found in the comment.
I think actually
Worker::IsDetachedActor
andWorker::MarkDetachedActor
are redundant and better be removed because we can access the info of whether the worker is detached or not through its assigned task.The info is first ready after
worker->SetAssignedTask(task)
(L962) duringLocalTaskManager::Dispatch
and then the worker is inserted into theleased_workers
map (L972).ray/src/ray/raylet/local_task_manager.cc
Lines 962 to 972 in 118c370
Therefore, we can access the info through
worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor()
safely while looping over theleased_workers_
in theNodeManager
. By doing that, we don't need to worry about we could missworker.MarkDetachedActor()
sometimes.Related issue number
Closes #40864
Related to ray-project/kuberay#3701 and ray-project/kuberay#3700
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.