[core] fix detached actor being unexpectedly killed #53562

Open. Wants to merge 4 commits into base: master.

@rueian (Contributor) commented Jun 4, 2025

This PR fixes the bug by replacing the inaccurate worker->IsDetachedActor() with worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor().

Why are these changes needed?

The previous PR #14184 deleted the worker.MarkDetachedActor() call that was made when a task was assigned to a worker.
As a result, a leased worker running a detached actor can be killed by HandleUnexpectedWorkerFailure, as mentioned in #40864. This is triggered even by a normal driver exit. The reproduction scripts can be found in the comment.

I think Worker::IsDetachedActor and Worker::MarkDetachedActor are actually redundant and are better removed, because we can tell whether the worker is detached through its assigned task.

The information first becomes available after worker->SetAssignedTask(task) (L962) during LocalTaskManager::Dispatch, and only then is the worker inserted into the leased_workers map (L972).

worker->SetAssignedTask(task);
// Pass the contact info of the worker to use.
reply->set_worker_pid(worker->GetProcess().GetId());
reply->mutable_worker_address()->set_ip_address(worker->IpAddress());
reply->mutable_worker_address()->set_port(worker->Port());
reply->mutable_worker_address()->set_worker_id(worker->WorkerId().Binary());
reply->mutable_worker_address()->set_raylet_id(self_node_id_.Binary());
RAY_CHECK(leased_workers.find(worker->WorkerId()) == leased_workers.end());
leased_workers[worker->WorkerId()] = worker;

Therefore, we can safely access the information through worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor() while looping over leased_workers_ in the NodeManager. That way, we no longer need to worry about a missing worker.MarkDetachedActor() call.
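
The difference between the two approaches can be sketched in isolation. The following is a minimal illustration under stated assumptions, not Ray's actual code: TaskSpec, Task, FlagWorker, and DerivedWorker are simplified stand-ins invented for this example. It shows how a separately maintained flag goes stale when the mark call is missed, while a value derived from the assigned task does not.

```cpp
#include <utility>

// Hypothetical stand-ins for illustration only; these are not Ray's real types.
struct TaskSpec {
  bool detached_actor = false;
  bool IsDetachedActor() const { return detached_actor; }
};

struct Task {
  TaskSpec spec;
  const TaskSpec &GetTaskSpecification() const { return spec; }
};

// Approach 1: a separate flag that some caller must remember to set.
class FlagWorker {
 public:
  void SetAssignedTask(Task task) { task_ = std::move(task); }
  void MarkDetachedActor() { is_detached_actor_ = true; }
  bool IsDetachedActor() const { return is_detached_actor_; }

 private:
  Task task_;
  bool is_detached_actor_ = false;
};

// Approach 2: derive the answer from the assigned task on every call.
class DerivedWorker {
 public:
  void SetAssignedTask(Task task) { task_ = std::move(task); }
  bool IsDetachedActor() const {
    return task_.GetTaskSpecification().IsDetachedActor();
  }

 private:
  Task task_;
};

// Builds a task whose spec says "detached actor".
Task MakeDetachedActorTask() {
  Task task;
  task.spec.detached_actor = true;
  return task;
}

// Assigns a detached-actor task but never calls MarkDetachedActor(),
// mimicking the missed call described above.
bool FlagWorkerSeesDetached() {
  FlagWorker worker;
  worker.SetAssignedTask(MakeDetachedActorTask());
  return worker.IsDetachedActor();
}

// Same assignment; no extra bookkeeping call is required.
bool DerivedWorkerSeesDetached() {
  DerivedWorker worker;
  worker.SetAssignedTask(MakeDetachedActorTask());
  return worker.IsDetachedActor();
}
```

In this sketch, FlagWorkerSeesDetached() reports false for a detached actor because nothing set the flag, whereas DerivedWorkerSeesDetached() reports true.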

Related issue number

Closes #40864
Related to ray-project/kuberay#3701 and ray-project/kuberay#3700

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rueian added 2 commits June 4, 2025 13:39
By replacing the inaccurate `worker->IsDetachedActor()` with
`worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor()`.

Signed-off-by: Rueian <[email protected]>
Test the new behavior with a private static function
so that we don't need to create a node manager in tests.

Signed-off-by: Rueian <[email protected]>
@rueian rueian force-pushed the fix-detached-actor-killed-on-owner-exit branch from 8fc1fa5 to 21ffca5 Compare June 6, 2025 03:22
@kevin85421 (Member)

open an issue to track the progress: ray-project/kuberay#3700

@rueian rueian force-pushed the fix-detached-actor-killed-on-owner-exit branch from 21ffca5 to 6b3205a Compare June 6, 2025 05:29
@rueian rueian force-pushed the fix-detached-actor-killed-on-owner-exit branch from 6b3205a to d6e49c3 Compare June 6, 2025 05:41
KillWorkersOwnedByNodeID(
    leased_workers_,
    [this](const std::shared_ptr<WorkerInterface> &worker) { KillWorker(worker); },
    node_id);
@rueian (Contributor Author), Jun 6, 2025

This PR essentially replaces worker->IsDetachedActor() with worker->GetAssignedTask().GetTaskSpecification().IsDetachedActor().

To make the change testable, I followed the practice in TestHandleReportWorkerBacklog and extracted the loops over leased_workers_ into the static methods KillWorkersOwnedByNodeID and KillWorkersOwnedByWorkerID for unit testing.
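
The shape of such an extracted helper can be sketched as follows. This is a minimal illustration, not the PR's code: FakeWorker, its fields, the string IDs, and the kill rule (skip detached actors, kill the remaining workers owned by the given node) are assumptions made for this example. The point is that the map and the kill action are both injected, so the loop can be exercised without constructing a manager object.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified worker; field names are assumptions for this sketch.
struct FakeWorker {
  std::string worker_id;
  std::string owner_node_id;
  bool detached_actor_task = false;  // stands in for the assigned task's spec
};

using WorkerMap = std::unordered_map<std::string, std::shared_ptr<FakeWorker>>;
using KillFn = std::function<void(const std::shared_ptr<FakeWorker> &)>;

// Free function with the same shape as the extracted static helper:
// leased workers, a kill callback, and the ID of the node that went away.
void KillWorkersOwnedByNode(const WorkerMap &leased_workers,
                            const KillFn &kill_worker,
                            const std::string &node_id) {
  for (const auto &entry : leased_workers) {
    const auto &worker = entry.second;
    // Assumed rule: detached actors survive; other workers owned by node_id die.
    if (!worker->detached_actor_task && worker->owner_node_id == node_id) {
      kill_worker(worker);
    }
  }
}

std::shared_ptr<FakeWorker> MakeWorker(const std::string &id,
                                       const std::string &owner,
                                       bool detached) {
  auto worker = std::make_shared<FakeWorker>();
  worker->worker_id = id;
  worker->owner_node_id = owner;
  worker->detached_actor_task = detached;
  return worker;
}

// Runs the helper over three workers and returns the IDs that were "killed".
std::vector<std::string> KilledWhenNodeDies(const std::string &node_id) {
  WorkerMap leased;
  leased["w1"] = MakeWorker("w1", "nodeA", false);  // regular, owned by nodeA
  leased["w2"] = MakeWorker("w2", "nodeA", true);   // detached, owned by nodeA
  leased["w3"] = MakeWorker("w3", "nodeB", false);  // regular, owned by nodeB

  std::vector<std::string> killed;
  KillWorkersOwnedByNode(
      leased,
      [&killed](const std::shared_ptr<FakeWorker> &w) {
        killed.push_back(w->worker_id);
      },
      node_id);
  return killed;
}
```

With these inputs, only the regular worker owned by the removed node is passed to the kill callback; the detached worker on the same node is left alone.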

Collaborator

Should we fully remove worker.IsDetachedActor() and/or replace its implementation with what you have here? Looks like it's likely to cause similar bugs in the future.

Contributor Author

Sure. Worker::IsDetachedActor will no longer be used after this PR. I was originally planning to remove it in a follow-up PR, but now the removal is included here.

@rueian (Contributor Author) commented Jun 6, 2025

> open an issue to track the progress: ray-project/kuberay#3700

@kevin85421 Done. This PR is also ready for review. Please take a look. Thanks!

@edoakes edoakes requested a review from a team June 6, 2025 22:07
@edoakes edoakes self-assigned this Jun 6, 2025
@rueian rueian force-pushed the fix-detached-actor-killed-on-owner-exit branch from 06aecc7 to 3ed4f15 Compare June 6, 2025 23:10
Comment on lines +316 to +322
/// This is created for unit test purpose so that we don't need to create
/// a node manager in order to test KillWorkersOwnedByNodeID.
static void KillWorkersOwnedByNodeID(
    const absl::flat_hash_map<WorkerID, std::shared_ptr<WorkerInterface>>
        &leased_workers,
    const std::function<void(const std::shared_ptr<WorkerInterface> &)> &kill_worker,
    const NodeID &node_id);
Collaborator

This is an anti-pattern. We should be writing tests against the public interface of the relevant class (in this case, NodeManager).

Is it possible to rewrite it in that way instead?

Contributor

+1. I think we're going to have to set up a NodeManager with the right workers and then call the public API.
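
As a rough illustration of that suggestion, a test can construct the manager with an injected kill action, populate it through its public methods, and then trigger the public entry point. The sketch below is an assumption-laden miniature, not Ray's NodeManager: MiniNodeManager, LeaseWorker, HandleNodeRemoved, and the string IDs are all hypothetical names chosen for this example.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical miniature manager; names are assumptions, not Ray's NodeManager API.
class MiniNodeManager {
 public:
  using KillFn = std::function<void(const std::string &worker_id)>;

  // The kill action is injected so a test can observe it.
  explicit MiniNodeManager(KillFn kill_worker)
      : kill_worker_(std::move(kill_worker)) {}

  // Public API: record that a worker was leased for a task.
  void LeaseWorker(const std::string &worker_id,
                   const std::string &owner_node_id,
                   bool detached_actor) {
    leased_workers_[worker_id] = Lease{owner_node_id, detached_actor};
  }

  // Public API: react to the owner's node going away.
  void HandleNodeRemoved(const std::string &node_id) {
    for (const auto &entry : leased_workers_) {
      const Lease &lease = entry.second;
      if (!lease.detached_actor && lease.owner_node_id == node_id) {
        kill_worker_(entry.first);
      }
    }
  }

 private:
  struct Lease {
    std::string owner_node_id;
    bool detached_actor;
  };

  KillFn kill_worker_;
  std::unordered_map<std::string, Lease> leased_workers_;
};

// Test-style driver: only public methods are called, no private helper is exposed.
std::vector<std::string> KilledAfterOwnerNodeRemoved() {
  std::vector<std::string> killed;
  MiniNodeManager manager(
      [&killed](const std::string &id) { killed.push_back(id); });
  manager.LeaseWorker("regular", "nodeA", false);
  manager.LeaseWorker("detached", "nodeA", true);
  manager.HandleNodeRemoved("nodeA");
  return killed;
}
```

The trade-off is more setup in the test in exchange for not widening the class interface with a helper that exists only for testing.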

@edoakes (Collaborator) commented Jun 9, 2025

@israbbani can you help review this PR please

@israbbani (Contributor) left a comment

Excellent find and root cause. I left a few comments.

Comment on lines -122 to -123
void MarkDetachedActor() override { is_detached_actor_ = true; }
bool IsDetachedActor() const override { return is_detached_actor_; }
Contributor

The code would be simpler if you deleted MarkDetachedActor() and changed the implementation of IsDetachedActor to do the right thing, i.e.:

bool IsDetachedActor() const override {
    return assigned_task_.GetTaskSpecification().IsDetachedActor();
}

That way the rest of the code will work as is.

Contributor

When calling a chain of functions like that, it's also worth making sure that each function in the chain returns a valid object, i.e., that GetTaskSpecification() always returns a properly constructed TaskSpec object.

@rueian (Contributor Author), Jun 9, 2025

> When calling a chain of functions like that, it's also worth making sure that each function in the chain returns a valid object i.e. GetTaskSpecification() always returns a properly constructed TaskSpec object.

Hi @israbbani, are you suggesting that I add a check in the GetTaskSpecification() method to ensure that the underlying specification is properly constructed, or make the method fail if it’s not?

Contributor

Hi @rueian. I looked at the code, and in this case RayTask and TaskSpec both have default no-args constructors, so we should be okay to just write:

bool IsDetachedActor() const override {
    return assigned_task_.GetTaskSpecification().IsDetachedActor();
}
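
Why default constructors make the chained call safe can be sketched with simplified stand-ins. This is an illustration under assumptions, not Ray's real classes: SpecMessage, TaskSpec, Task, and Worker below are hypothetical, and the choice to have the no-args constructor allocate an empty message is an assumption made for the example. Under that assumption, a worker that has never been assigned a task still answers the query without dereferencing anything invalid.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-ins; not Ray's real classes.
struct SpecMessage {
  bool is_detached = false;
};

class TaskSpec {
 public:
  // Assumed behavior: the no-args constructor allocates an empty message,
  // so accessors never dereference a null pointer.
  TaskSpec() : message_(std::make_shared<SpecMessage>()) {}
  explicit TaskSpec(bool detached) : message_(std::make_shared<SpecMessage>()) {
    message_->is_detached = detached;
  }
  bool IsDetachedActor() const { return message_->is_detached; }

 private:
  std::shared_ptr<SpecMessage> message_;
};

class Task {
 public:
  Task() = default;
  explicit Task(TaskSpec spec) : spec_(std::move(spec)) {}
  const TaskSpec &GetTaskSpecification() const { return spec_; }

 private:
  TaskSpec spec_;
};

class Worker {
 public:
  void SetAssignedTask(const Task &task) { assigned_task_ = task; }
  // Every link in the chain returns a fully constructed object.
  bool IsDetachedActor() const {
    return assigned_task_.GetTaskSpecification().IsDetachedActor();
  }

 private:
  Task assigned_task_;  // default-constructed until a task is assigned
};

// A worker with no assigned task: the chain still evaluates, and yields false.
bool IdleWorkerIsDetached() {
  Worker worker;
  return worker.IsDetachedActor();
}

// A worker that was assigned a detached-actor task.
bool AssignedWorkerIsDetached() {
  Worker worker;
  worker.SetAssignedTask(Task(TaskSpec(true)));
  return worker.IsDetachedActor();
}
```

If a type in the chain could be left in an invalid state by default construction, the accessor would need a guard instead; the sketch only covers the case the comment describes.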

Development

Successfully merging this pull request may close these issues.

[core] Detached actor being killed when its parent actor crashes
4 participants