[core] Fix raylet shutdown race(s) #57198
Conversation
Code Review
This pull request effectively addresses race conditions during raylet shutdown by introducing a unified atomic `shutting_down` flag. This change simplifies the shutdown logic and prevents potential crashes. The introduction of this flag and its atomic usage in `shutdown_raylet_gracefully` and `HandleShutdownRaylet` is a solid improvement. Additionally, the pull request includes numerous valuable cleanups, such as using modern C++ features, optimizing for performance by avoiding copies and reserving vector capacity, and improving code style, which all contribute to better code quality and maintainability. I have one suggestion to improve the linkage of a helper function.
@codope can you help review? You touched this recently.
Just wondering if there's a way to reduce the flag count, to reduce confusion and more subtle bugs in the future. We could replace the two flags with something like a process-wide `std::once_flag shutdown_once` (keep `shutting_down`) and use `std::call_once(shutdown_once, [&]{ … do internal graceful shutdown … });` to protect `shutdown_raylet_gracefully_internal`. Another option is to use an enum like `RayletState` and a single `std::atomic<RayletState>` (drop all the boolean flags).
Wdyt?
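For illustration, a minimal sketch of the `std::call_once` option described above; the function and flag names here are placeholders, not the actual raylet code:

```cpp
#include <mutex>

// Process-wide guard: std::call_once guarantees the lambda body runs at most once,
// even if several shutdown requests race with each other.
std::once_flag shutdown_once;

void ShutdownRayletGracefully(/* const ray::rpc::NodeDeathInfo &node_death_info */) {
  std::call_once(shutdown_once, [&] {
    // ... do internal graceful shutdown (shutdown_raylet_gracefully_internal) ...
  });
}
```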
src/ray/raylet/node_manager.cc (outdated)

```diff
-  if (is_shutting_down_) {
-    RAY_LOG(INFO) << "Node already has received the shutdown request. The shutdown "
-                     "request RPC is ignored.";
+  if (shutting_down_.exchange(true)) {
```
should we use `compare_exchange_strong` instead?
So after the enum update, I think `exchange` is simpler for what we want in this case, because SHUTTING_DOWN is the terminal state and we always want to keep going if it wasn't shutting down before.
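Roughly, the `exchange`-based guard after the enum update amounts to the sketch below; the enum name and handler body are simplified stand-ins, not the actual patch:

```cpp
#include <atomic>

// Hypothetical enum mirroring the states discussed in this PR.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

std::atomic<ShutdownState> shutting_down_{ShutdownState::ALIVE};

void HandleShutdownRaylet() {
  // exchange() unconditionally stores SHUTTING_DOWN and returns the previous value,
  // so the request proceeds from either ALIVE or SHUTDOWN_QUEUED and is only ignored
  // if shutdown had already started: SHUTTING_DOWN is the terminal state.
  if (shutting_down_.exchange(ShutdownState::SHUTTING_DOWN) ==
      ShutdownState::SHUTTING_DOWN) {
    return;  // Duplicate request; shutdown is already in progress.
  }
  // ... start the actual shutdown work ...
}
```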
src/ray/raylet/main.cc (outdated)

```cpp
auto shutdown_raylet_gracefully =
    [&main_service, &shutting_down, shutdown_raylet_gracefully_internal](
        const ray::rpc::NodeDeathInfo &node_death_info) {
      if (shutting_down.exchange(true)) {
```
should we use `compare_exchange_strong` instead?
Moved this to `compare_exchange_strong` after the enum update, since ALIVE -> SHUTDOWN_QUEUED is the only state transition we want here.
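For contrast with the `exchange` sketch above, a `compare_exchange_strong` version that only permits the single ALIVE -> SHUTDOWN_QUEUED transition might look like this (again, names and body are illustrative assumptions):

```cpp
#include <atomic>

// Same hypothetical enum as in the earlier sketch.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

std::atomic<ShutdownState> shutdown_state{ShutdownState::ALIVE};

void ShutdownRayletGracefully(/* const ray::rpc::NodeDeathInfo &node_death_info */) {
  ShutdownState expected = ShutdownState::ALIVE;
  // compare_exchange_strong only succeeds for ALIVE -> SHUTDOWN_QUEUED; if shutdown
  // was already queued or has already started, the call fails and we return early.
  if (!shutdown_state.compare_exchange_strong(expected,
                                              ShutdownState::SHUTDOWN_QUEUED)) {
    return;
  }
  // ... post the graceful-shutdown work (shutdown_raylet_gracefully_internal) ...
}
```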
```cpp
.WillOnce([&](const gcs::SubscribeCallback<NodeID, rpc::GcsNodeInfo> &subscribe,
              const gcs::StatusCallback &done) {
  publish_node_change_callback = subscribe;
});
```
In this lambda, maybe call `done(gcs::Status::OK());` after capturing `subscribe`? This avoids subtle hangs or unmet expectations in future refactors.
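For concreteness, the suggestion amounts to something like the fragment below, based on the snippet above; whether `done` can actually be invoked this way depends on the mock GCS client, as the reply that follows explains:

```cpp
.WillOnce([&](const gcs::SubscribeCallback<NodeID, rpc::GcsNodeInfo> &subscribe,
              const gcs::StatusCallback &done) {
  publish_node_change_callback = subscribe;
  // Complete the subscription right away so nothing waits on `done` indefinitely.
  done(gcs::Status::OK());
});
```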
So I tried this, but the current mock GCS client doesn't give us access to the underlying GCS RPC client's functions, which this ends up calling directly. Otherwise we'd need to stub out the ray syncer with a fake syncer for this to work in the test. This isn't strictly necessary for this test, and the GCS client and its mock/fake are already getting reworked, so holding off on it.
```cpp
});
node_manager_->RegisterGcs();

shutting_down_ = true;
```
can we also pair/add a death test where we don't set `shutting_down_` and expect a fatal when not shutting down?
Yup, good idea. Parameterized the test with all 3 enum values, and in the ALIVE case we assert death.
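A rough, standalone sketch of the parameterized death-test shape being described; the enum and the `HandleSelfNodeRemoved` helper are hypothetical stand-ins for the real NodeManager code, with `std::abort()` standing in for `RAY_LOG(FATAL)`:

```cpp
#include <gtest/gtest.h>

#include <cstdlib>

// Hypothetical stand-ins for the real NodeManager state and node-removed handler.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

void HandleSelfNodeRemoved(ShutdownState state) {
  if (state == ShutdownState::ALIVE) {
    // The real code RAY_LOG(FATAL)s here; abort() gives the same death semantics.
    std::abort();
  }
  // Otherwise shutdown was already queued/started, so the publish is expected.
}

class NodeRemovedDeathTest : public ::testing::TestWithParam<ShutdownState> {};

TEST_P(NodeRemovedDeathTest, SelfNodeRemoved) {
  const auto state = GetParam();
  if (state == ShutdownState::ALIVE) {
    // Receiving our own death publish while still ALIVE should be fatal.
    EXPECT_DEATH(HandleSelfNodeRemoved(state), "");
  } else {
    // Once shutdown has been queued or started, the publish should be a no-op.
    HandleSelfNodeRemoved(state);
  }
}

INSTANTIATE_TEST_SUITE_P(AllShutdownStates,
                         NodeRemovedDeathTest,
                         ::testing::Values(ShutdownState::ALIVE,
                                           ShutdownState::SHUTDOWN_QUEUED,
                                           ShutdownState::SHUTTING_DOWN));
```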
Ya, we need at least 2 since there are 2 phases to shutdown and we want to shortcut to phase 2 in the SIGTERM case. But ya, I like the enum idea, it makes it more understandable. Moved to that.
Bug: Test Fragility Due to Missing Wait Loop
Removing the wait loop for `publish_node_change_callback` makes this test fragile. It now relies on the mock's synchronous behavior to set the callback, risking use of an uninitialized callback if the mock ever becomes asynchronous, which could lead to crashes or flakiness.
src/ray/raylet/tests/node_manager_test.cc, lines 590 to 600 (b566e52):

```cpp
});
node_manager_->RegisterGcs();

// Preparing a detached actor creation task spec for the later RequestWorkerLease rpc.
const auto owner_node_id = NodeID::FromRandom();
rpc::Address owner_address;
owner_address.set_node_id(owner_node_id.Binary());
const auto actor_id =
    ActorID::Of(JobID::FromInt(1), TaskID::FromRandom(JobID::FromInt(1)), 0);
const auto lease_spec = DetachedActorCreationLeaseSpec(owner_address, actor_id);
```
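One way to restore such a guard, sketched under the assumption that `publish_node_change_callback` is a `std::function` member on the test fixture and that `<chrono>` and `<thread>` are available, is a bounded wait before the callback is first used:

```cpp
// Sketch only: poll (with a timeout) until the mock has captured the subscribe
// callback, so the test never invokes an empty std::function.
const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
while (!publish_node_change_callback &&
       std::chrono::steady_clock::now() < deadline) {
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
ASSERT_TRUE(static_cast<bool>(publish_node_change_callback));
```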
Problem

Currently there are two different shutdown flags on the raylet. There's `shutted_down` in main.cc, which tracks whether `shutdown_raylet_gracefully_internal` has been executed yet, and then there's `is_shutting_down_` in node_manager.cc, which tracks whether the shutdown RPC has been received. This leads to two races:

1. `shutted_down` isn't atomically checked + changed inside `shutdown_raylet_gracefully_internal`, so it's possible for the internal shutdown path to happen twice.
2. `shutdown_raylet_gracefully_internal` only sets `shutted_down` in main.cc and not `is_shutting_down_` in node_manager.cc. So we could end up in a case where we send an UnregisterSelf to the GCS and get the publish back that we're dead while `is_shutting_down_` is still false. This will result in a RAY_LOG(FATAL) where the raylet will crash itself. See [core] fix test state api and dashboard flakiness #56966 for more context.

Solution
The solution is to introduce a `shutdown_state` enum that's created in main.cc as ALIVE and passed down to NodeManager. It's set to SHUTDOWN_QUEUED in `shutdown_raylet_gracefully` so that shutdown is only queued once and isn't queued if we went straight to SHUTTING_DOWN. The enum is set to SHUTTING_DOWN in `shutdown_raylet_gracefully_internal` for when shutdown actually starts; in the SIGTERM case we'll go directly to this state. `shutdown_state` is also checked in the pubsub NodeRemoved callback so the raylet won't crash and RAY_LOG(FATAL) itself when it gets its own node death publish and shutdown has already started.

Also a bit of miscellaneous C++ cleanup while I was there...
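For illustration only, the NodeRemoved-side check described in the solution boils down to something like the following sketch; the enum, variable, and function names are stand-ins, not the actual patch:

```cpp
#include <atomic>
#include <cstdlib>

// Hypothetical names mirroring the description above.
enum class ShutdownState { ALIVE, SHUTDOWN_QUEUED, SHUTTING_DOWN };

std::atomic<ShutdownState> shutdown_state{ShutdownState::ALIVE};

// Invoked when the GCS publishes that this raylet's node has been removed.
void OnSelfNodeRemoved() {
  if (shutdown_state.load() != ShutdownState::ALIVE) {
    // Shutdown was already queued or started (e.g. we sent UnregisterSelf ourselves),
    // so receiving our own death publish is expected: don't crash.
    return;
  }
  // Genuinely unexpected removal while still alive: fail loudly, as before.
  // (RAY_LOG(FATAL) in the real code; std::abort() stands in here.)
  std::abort();
}
```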