
Conversation


@dayshah dayshah commented Oct 5, 2025

Problem

Currently there are two different shutdown flags on the raylet. There's shutted_down in main.cc, which tracks whether shutdown_raylet_gracefully_internal has been executed yet, and then there's is_shutting_down_ in node_manager.cc. This leads to two problems:

  1. shutted_down isn't atomically checked and set inside shutdown_raylet_gracefully_internal, so it's possible for the internal shutdown path to run twice.
  2. When the raylet gets SIGTERMed it calls shutdown_raylet_gracefully_internal, which only sets shutted_down in main.cc and not is_shutting_down_ in node_manager.cc. So we could end up sending an UnregisterSelf to the GCS and then receiving the publish that our own node is dead while is_shutting_down_ is still false. This results in a RAY_LOG(FATAL) where the raylet crashes itself. See [core] fix test state api and dashboard flakiness #56966 for more context.

Solution

The solution is to introduce a shutdown_state enum that's created in main.cc as ALIVE and passed down to NodeManager. It's set to SHUTDOWN_QUEUED in shutdown_raylet_gracefully, so that shutdown is only queued once and isn't queued if we went straight to SHUTTING_DOWN. The enum is set to SHUTTING_DOWN in shutdown_raylet_gracefully_internal, where shutdown actually starts; in the SIGTERM case we go directly to this state. shutdown_state is also checked in the pubsub NodeRemoved callback, so the raylet won't crash and RAY_LOG(FATAL) itself when it gets its own node-death publish after shutdown has already started.
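
For orientation, here's a minimal sketch of the state machine described above. The enum values follow the description, but the free functions are illustrative stand-ins for the lambdas in main.cc and the NodeManager callback, not the actual raylet code:

```cpp
#include <atomic>

enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

std::atomic<ShutdownState> shutdown_state{ShutdownState::ALIVE};

// shutdown_raylet_gracefully: queue shutdown at most once, and only if we
// haven't already jumped straight to SHUTTING_DOWN (e.g. on SIGTERM).
bool TryQueueShutdown() {
  auto expected = ShutdownState::ALIVE;
  return shutdown_state.compare_exchange_strong(expected,
                                                ShutdownState::SHUTDOWN_QUEUED);
}

// shutdown_raylet_gracefully_internal: SHUTTING_DOWN is terminal, so proceed
// whenever the previous state was anything else.
bool TryStartShutdown() {
  return shutdown_state.exchange(ShutdownState::SHUTTING_DOWN) !=
         ShutdownState::SHUTTING_DOWN;
}

// NodeRemoved pubsub callback: once shutdown is queued or running, our own
// node-death publish is expected and must not be treated as fatal.
bool SelfNodeDeathIsExpected() {
  return shutdown_state.load() != ShutdownState::ALIVE;
}
```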

Also a bit of miscellaneous cpp cleanup while I was there...

@dayshah dayshah requested a review from a team as a code owner October 5, 2025 06:16
@dayshah added the "go" label (add ONLY when ready to merge, run all tests) Oct 5, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses race conditions during raylet shutdown by introducing a unified atomic shutting_down flag. This change simplifies the shutdown logic and prevents potential crashes. The introduction of this flag and its atomic usage in shutdown_raylet_gracefully and HandleShutdownRaylet is a solid improvement. Additionally, the pull request includes numerous valuable cleanups, such as using modern C++ features, optimizing for performance by avoiding copies and reserving vector capacity, and improving code style, which all contribute to better code quality and maintainability. I have one suggestion to improve the linkage of a helper function.

@ray-gardener bot added the "core" label (Issues that should be addressed in Ray Core) Oct 5, 2025

edoakes commented Oct 5, 2025

@codope can you help review? You touched this recently.

@codope codope self-assigned this Oct 6, 2025
@codope left a comment

Just wondering if there's a way to reduce the flag count to reduce confusion and more subtle bugs in future. We can replace the two flags by something like std::once_flag shutdown_once (keep shutting_down) process-wide and use std::call_once(shutdown_once, [&]{ … do internal graceful shutdown … }); to protect shutdown_raylet_gracefully_internal. Another option is to use enum like RayletState and a single std::atomic<RayletState> (drop all boolean flags).

Wdyt?
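
For reference, a minimal sketch of the std::call_once option floated here; the names are illustrative and the shutdown body is elided, so this isn't the raylet's actual wiring:

```cpp
#include <mutex>  // std::once_flag, std::call_once

std::once_flag shutdown_once;

void ShutdownRayletGracefullyInternal() {
  std::call_once(shutdown_once, [] {
    // ... perform the internal graceful shutdown; this body runs at most once
    // even if the RPC path and the SIGTERM path both reach here ...
  });
}
```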

```diff
-  if (is_shutting_down_) {
-    RAY_LOG(INFO) << "Node already has received the shutdown request. The shutdown "
-                     "request RPC is ignored.";
+  if (shutting_down_.exchange(true)) {
```

should we use compare_exchange_strong instead?

@dayshah (author) Oct 13, 2025

So after the enum update, I think exchange is simpler for what we want in this case, because SHUTTING_DOWN is the terminal state and we always want to keep going if it wasn't shutting down before.

```cpp
auto shutdown_raylet_gracefully =
    [&main_service, &shutting_down, shutdown_raylet_gracefully_internal](
        const ray::rpc::NodeDeathInfo &node_death_info) {
      if (shutting_down.exchange(true)) {
```

should we use compare_exchange_strong instead?

@dayshah (author) Oct 13, 2025

Moved this to compare_exchange_strong after the enum update, since ALIVE -> SHUTDOWN_QUEUED is the only state transition we want here.
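
For readers weighing the two primitives, a standalone demo of the distinction discussed in these two threads; this is plain std::atomic behavior, the enum names mirror the PR, and none of it is raylet code:

```cpp
#include <atomic>
#include <cassert>

int main() {
  enum class S { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

  // exchange: unconditionally stores the new value and returns the previous
  // one. A good fit when the target state is terminal and any prior state may
  // proceed (ALIVE or SHUTDOWN_QUEUED -> SHUTTING_DOWN).
  std::atomic<S> a{S::SHUTDOWN_QUEUED};
  const S prev = a.exchange(S::SHUTTING_DOWN);
  assert(prev == S::SHUTDOWN_QUEUED);

  // compare_exchange_strong: stores only if the current value equals
  // `expected`. A good fit when exactly one transition should succeed
  // (ALIVE -> SHUTDOWN_QUEUED) and everything else should be rejected.
  std::atomic<S> b{S::SHUTTING_DOWN};
  S expected = S::ALIVE;
  const bool queued = b.compare_exchange_strong(expected, S::SHUTDOWN_QUEUED);
  assert(!queued && expected == S::SHUTTING_DOWN);  // rejected; `expected` now holds the actual value
  return 0;
}
```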

Comment on lines +1227 to +1230
```cpp
.WillOnce([&](const gcs::SubscribeCallback<NodeID, rpc::GcsNodeInfo> &subscribe,
              const gcs::StatusCallback &done) {
  publish_node_change_callback = subscribe;
});
```

In this lambda, maybe call done(gcs::Status::OK()); after capturing subscribe? This avoids subtle hangs or unmet expectations in future refactors.

@dayshah (author)

So I tried this, but the current mock GCS client doesn't give us access to the underlying GCS RPC client's functions, which this ends up directly calling. Or we'd need to stub out the ray syncer with a fake syncer for this to work in the test. It isn't strictly necessary for this test, and the GCS client and its mock/fake are already being reworked, so holding off on it.

```cpp
  });
  node_manager_->RegisterGcs();

  shutting_down_ = true;
```

Can we also add a paired death test where we don't set shutting_down_ and expect a fatal when the node is not shutting down?

@dayshah (author) Oct 13, 2025

Yup, good idea. Parameterized the test over all three enum values, and in the ALIVE case we assert death.
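
A hedged sketch of what such a parameterized death test could look like; the enum is from the PR, but the fixture and the SimulateSelfNodeRemoved helper are illustrative stand-ins for the real NodeManager test (assumes linking against gtest_main):

```cpp
#include <cstdlib>
#include <gtest/gtest.h>

enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

// Stand-in for delivering the raylet's own node-death publish at a given state.
void SimulateSelfNodeRemoved(ShutdownState state) {
  if (state == ShutdownState::ALIVE) {
    std::abort();  // mirrors the RAY_LOG(FATAL) path when shutdown never started
  }
  // Otherwise the publish is expected and ignored.
}

class NodeRemovedTest : public ::testing::TestWithParam<ShutdownState> {};

TEST_P(NodeRemovedTest, SelfNodeDeathPublish) {
  if (GetParam() == ShutdownState::ALIVE) {
    ASSERT_DEATH(SimulateSelfNodeRemoved(GetParam()), "");
  } else {
    SimulateSelfNodeRemoved(GetParam());  // must not crash
  }
}

INSTANTIATE_TEST_SUITE_P(AllShutdownStates,
                         NodeRemovedTest,
                         ::testing::Values(ShutdownState::ALIVE,
                                           ShutdownState::SHUTDOWN_QUEUED,
                                           ShutdownState::SHUTTING_DOWN));
```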

@dayshah dayshah requested a review from codope October 13, 2025 00:24


dayshah commented Oct 13, 2025

> Just wondering if there's a way to reduce the flag count to reduce confusion and more subtle bugs in future. We can replace the two flags by something like std::once_flag shutdown_once (keep shutting_down) process-wide and use std::call_once(shutdown_once, [&]{ … do internal graceful shutdown … }); to protect shutdown_raylet_gracefully_internal. Another option is to use enum like RayletState and a single std::atomic<RayletState> (drop all boolean flags).
>
> Wdyt?

Ya, we need at least two since there are two phases to shutdown and we want to shortcut to phase 2 in the SIGTERM case. But ya, I like the enum idea, it makes it more understandable. Moved to that.

@cursor bot left a comment

Bug: Test Fragility Due to Missing Wait Loop

Removing the wait loop for publish_node_change_callback makes this test fragile. It now relies on the mock's synchronous behavior to set the callback, risking use of an uninitialized callback if the mock ever becomes asynchronous, which could lead to crashes or flakiness.

src/ray/raylet/tests/node_manager_test.cc#L590-L600

```cpp
  });
  node_manager_->RegisterGcs();
  // Preparing a detached actor creation task spec for the later RequestWorkerLease rpc.
  const auto owner_node_id = NodeID::FromRandom();
  rpc::Address owner_address;
  owner_address.set_node_id(owner_node_id.Binary());
  const auto actor_id =
      ActorID::Of(JobID::FromInt(1), TaskID::FromRandom(JobID::FromInt(1)), 0);
  const auto lease_spec = DetachedActorCreationLeaseSpec(owner_address, actor_id);
```
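
A sketch of the kind of wait loop the bot is pointing at (not the test's actual code): poll until the mock has handed back the subscribe callback, so the test doesn't rely on the mock being synchronous. The 5s timeout is an arbitrary assumption, and if the callback were ever set from another thread it would also need proper synchronization (e.g. a mutex):

```cpp
#include <chrono>
#include <thread>

template <typename Callback>
bool WaitForCallback(const Callback &callback,
                     std::chrono::milliseconds timeout = std::chrono::seconds(5)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!callback && std::chrono::steady_clock::now() < deadline) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return static_cast<bool>(callback);
}
```

In the test body this would be used as something like ASSERT_TRUE(WaitForCallback(publish_node_change_callback)); before driving the callback.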
