[core] Fix raylet shutdown race(s) #57198
Conversation
Code Review
This pull request effectively addresses race conditions during raylet shutdown by introducing a unified atomic `shutting_down` flag. This change simplifies the shutdown logic and prevents potential crashes. The introduction of this flag and its atomic usage in `shutdown_raylet_gracefully` and `HandleShutdownRaylet` is a solid improvement. Additionally, the pull request includes numerous valuable cleanups, such as using modern C++ features, optimizing for performance by avoiding copies and reserving vector capacity, and improving code style, which all contribute to better code quality and maintainability. I have one suggestion to improve the linkage of a helper function.
@codope can you help review? You touched this recently.
Just wondering if there's a way to reduce the flag count, to reduce confusion and more subtle bugs in the future. We could replace the two flags with something like a process-wide `std::once_flag shutdown_once` (keep `shutting_down`) and use `std::call_once(shutdown_once, [&]{ … do internal graceful shutdown … });` to protect `shutdown_raylet_gracefully_internal`. Another option is to use an enum like `RayletState` and a single `std::atomic<RayletState>` (drop all the boolean flags).
Wdyt?
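For illustration, a minimal sketch of the `std::call_once` option described above; the function and flag names here are placeholders, not the actual raylet code:

```cpp
#include <mutex>

// Process-wide guard: std::call_once guarantees the lambda body runs at most once,
// even if several shutdown requests race with each other.
std::once_flag shutdown_once;

void ShutdownRayletGracefully(/* const ray::rpc::NodeDeathInfo &node_death_info */) {
  std::call_once(shutdown_once, [&] {
    // ... do internal graceful shutdown (shutdown_raylet_gracefully_internal) ...
  });
}
```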
src/ray/raylet/node_manager.cc (outdated)

```diff
-  if (is_shutting_down_) {
-    RAY_LOG(INFO) << "Node already has received the shutdown request. The shutdown "
-                     "request RPC is ignored.";
+  if (shutting_down_.exchange(true)) {
```
should we use `compare_exchange_strong` instead?
So after the enum update, I think `exchange` is simpler for what we want in this case, because SHUTTING_DOWN is the terminal state and we always want to keep going if it wasn't shutting down before.
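Roughly, the `exchange`-based guard after the enum update amounts to the sketch below; the enum name and handler body are simplified stand-ins, not the actual patch:

```cpp
#include <atomic>

// Hypothetical enum mirroring the states discussed in this PR.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

std::atomic<ShutdownState> shutting_down_{ShutdownState::ALIVE};

void HandleShutdownRaylet() {
  // exchange() unconditionally stores SHUTTING_DOWN and returns the previous value,
  // so the request proceeds from either ALIVE or SHUTDOWN_QUEUED and is only ignored
  // if shutdown had already started: SHUTTING_DOWN is the terminal state.
  if (shutting_down_.exchange(ShutdownState::SHUTTING_DOWN) ==
      ShutdownState::SHUTTING_DOWN) {
    return;  // Duplicate request; shutdown is already in progress.
  }
  // ... start the actual shutdown work ...
}
```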
src/ray/raylet/main.cc (outdated)

```cpp
auto shutdown_raylet_gracefully =
    [&main_service, &shutting_down, shutdown_raylet_gracefully_internal](
        const ray::rpc::NodeDeathInfo &node_death_info) {
      if (shutting_down.exchange(true)) {
```
should we use `compare_exchange_strong` instead?
Moved this to `compare_exchange_strong` after the enum update, since ALIVE -> SHUTDOWN_QUEUED is the only state transition we want here.
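For contrast with the `exchange` sketch above, a `compare_exchange_strong` version that only permits the single ALIVE -> SHUTDOWN_QUEUED transition might look like this (again, names and body are illustrative assumptions):

```cpp
#include <atomic>

// Same hypothetical enum as in the earlier sketch.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

std::atomic<ShutdownState> shutdown_state{ShutdownState::ALIVE};

void ShutdownRayletGracefully(/* const ray::rpc::NodeDeathInfo &node_death_info */) {
  ShutdownState expected = ShutdownState::ALIVE;
  // compare_exchange_strong only succeeds for ALIVE -> SHUTDOWN_QUEUED; if shutdown
  // was already queued or has already started, the call fails and we return early.
  if (!shutdown_state.compare_exchange_strong(expected,
                                              ShutdownState::SHUTDOWN_QUEUED)) {
    return;
  }
  // ... post the graceful-shutdown work (shutdown_raylet_gracefully_internal) ...
}
```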
```cpp
.WillOnce([&](const gcs::SubscribeCallback<NodeID, rpc::GcsNodeInfo> &subscribe,
              const gcs::StatusCallback &done) {
  publish_node_change_callback = subscribe;
});
```
In this lambda, maybe call `done(gcs::Status::OK());` after capturing `subscribe`? This avoids subtle hangs or unmet expectations in future refactors.
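For concreteness, the suggestion amounts to something like the fragment below, based on the snippet above; whether `done` can actually be invoked this way depends on the mock GCS client, as the reply that follows explains:

```cpp
.WillOnce([&](const gcs::SubscribeCallback<NodeID, rpc::GcsNodeInfo> &subscribe,
              const gcs::StatusCallback &done) {
  publish_node_change_callback = subscribe;
  // Complete the subscription right away so nothing waits on `done` indefinitely.
  done(gcs::Status::OK());
});
```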
So I tried this, but the current mock GCS client doesn't give us access to the underlying GCS RPC client's functions, which this ends up calling directly. Otherwise we'd need to stub out the ray syncer with a fake syncer for this to work in the test. This isn't strictly necessary for this test, and the GCS client and its mock/fake are already getting reworked, so holding off on it.
```cpp
});
node_manager_->RegisterGcs();

shutting_down_ = true;
```
can we also pair/add a death test where we don't set `shutting_down_` and expect a fatal when not shutting down?
Yup, good idea. Parameterized the test with all 3 enum values, and in the ALIVE case we assert death.
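A rough, standalone sketch of the parameterized death-test shape being described; the enum and the `HandleSelfNodeRemoved` helper are hypothetical stand-ins for the real NodeManager code, with `std::abort()` standing in for `RAY_LOG(FATAL)`:

```cpp
#include <gtest/gtest.h>

#include <cstdlib>

// Hypothetical stand-ins for the real NodeManager state and node-removed handler.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

void HandleSelfNodeRemoved(ShutdownState state) {
  if (state == ShutdownState::ALIVE) {
    // The real code RAY_LOG(FATAL)s here; abort() gives the same death semantics.
    std::abort();
  }
  // Otherwise shutdown was already queued/started, so the publish is expected.
}

class NodeRemovedDeathTest : public ::testing::TestWithParam<ShutdownState> {};

TEST_P(NodeRemovedDeathTest, SelfNodeRemoved) {
  const auto state = GetParam();
  if (state == ShutdownState::ALIVE) {
    // Receiving our own death publish while still ALIVE should be fatal.
    EXPECT_DEATH(HandleSelfNodeRemoved(state), "");
  } else {
    // Once shutdown has been queued or started, the publish should be a no-op.
    HandleSelfNodeRemoved(state);
  }
}

INSTANTIATE_TEST_SUITE_P(AllShutdownStates,
                         NodeRemovedDeathTest,
                         ::testing::Values(ShutdownState::ALIVE,
                                           ShutdownState::SHUTDOWN_QUEUED,
                                           ShutdownState::SHUTTING_DOWN));
```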
Ya, we need at least 2 since there are 2 phases to shutdown and we want to shortcut to phase 2 in the SIGTERM case. But ya, I like the enum idea, it makes it more understandable. Moved to that.
Bug: Test Fragility Due to Missing Wait Loop
Removing the wait loop for `publish_node_change_callback` makes this test fragile. It now relies on the mock's synchronous behavior to set the callback, risking use of an uninitialized callback if the mock ever becomes asynchronous, which could lead to crashes or flakiness.
src/ray/raylet/tests/node_manager_test.cc, lines 590 to 600 (b566e52):

```cpp
});
node_manager_->RegisterGcs();

// Preparing a detached actor creation task spec for the later RequestWorkerLease rpc.
const auto owner_node_id = NodeID::FromRandom();
rpc::Address owner_address;
owner_address.set_node_id(owner_node_id.Binary());
const auto actor_id =
    ActorID::Of(JobID::FromInt(1), TaskID::FromRandom(JobID::FromInt(1)), 0);
const auto lease_spec = DetachedActorCreationLeaseSpec(owner_address, actor_id);
```
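One way to restore such a guard, sketched under the assumption that `publish_node_change_callback` is a `std::function` member on the test fixture and that `<chrono>` and `<thread>` are available, is a bounded wait before the callback is first used:

```cpp
// Sketch only: poll (with a timeout) until the mock has captured the subscribe
// callback, so the test never invokes an empty std::function.
const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
while (!publish_node_change_callback &&
       std::chrono::steady_clock::now() < deadline) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
ASSERT_TRUE(static_cast<bool>(publish_node_change_callback));
```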
Problem

Currently there are two different shutdown flags on the raylet. There's `shutted_down` in main.cc, which tracks whether `shutdown_raylet_gracefully_internal` has been executed yet, and then there's `is_shutting_down_` in node_manager.cc, which tracks whether the shutdown RPC has been received. This leads to two races:

1. `shutted_down` isn't atomically checked + changed inside `shutdown_raylet_gracefully_internal`, so it's possible for the internal shutdown path to happen twice.
2. `shutdown_raylet_gracefully_internal` only sets `shutted_down` in main.cc and not `is_shutting_down_` in node_manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead while `is_shutting_down_` is still false. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See [core] fix test state api and dashboard flakiness #56966 for more context.

Solution
The solution is to introduce a `shutdown_state` enum that's created in main.cc as ALIVE and passed down to NodeManager. It's set to SHUTDOWN_QUEUED in `shutdown_raylet_gracefully` so that shutdown is only queued once and isn't queued if we went straight to SHUTTING_DOWN. The enum is set to SHUTTING_DOWN in `shutdown_raylet_gracefully_internal` for when shutdown actually starts; in the SIGTERM case we'll go directly to this state. `shutdown_state` is also checked in the pubsub NodeRemoved callback so the raylet won't crash and RAY_LOG(FATAL) itself when it gets its own node death publish and shutdown has already started.

Also a bit of miscellaneous C++ cleanup while I was there...
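For illustration only, the NodeRemoved-side check described in the solution boils down to something like the following sketch; the enum, variable, and function names are stand-ins, not the actual patch:

```cpp
#include <atomic>
#include <cstdlib>

// Hypothetical names mirroring the description above.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

std::atomic<ShutdownState> shutdown_state{ShutdownState::ALIVE};

// Invoked when the GCS publishes that this raylet's node has been removed.
void OnSelfNodeRemoved() {
  if (shutdown_state.load() != ShutdownState::ALIVE) {
    // Shutdown was already queued or started (e.g. we sent UnregisterSelf ourselves),
    // so receiving our own death publish is expected: don't crash.
    return;
  }
  // Genuinely unexpected removal while still alive: fail loudly, as before.
  // (RAY_LOG(FATAL) in the real code; std::abort() stands in here.)
  std::abort();
}
```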