machichima
Contributor

@machichima machichima commented Sep 30, 2025

Why are these changes needed?

If the connection to the Ray head breaks while tail_job_logs() is running after job submission, tail_job_logs() does not raise any error. It should raise one.

This PR queries the job status each time a message is received, and raises an error if the connection closes while the job is not in a terminal state.

Related issue number

Closes: #57002

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Enhances tail_job_logs to track job status and raise errors if the WebSocket closes or errors before the job reaches a terminal state.

  • Job SDK (python/ray/dashboard/modules/job/sdk.py):
    • Track job status on each log message via get_job_info and store job_status.
    • On WebSocket CLOSED or ERROR, raise RuntimeError if job_status is not terminal; otherwise exit cleanly.
    • Update Raises docstring to include unexpected connection closure before terminal state.

Written by Cursor Bugbot for commit c011722. This will update automatically on new commits. Configure here.
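
The behavior summarized in this note can be sketched without Ray or aiohttp. This is a minimal illustration only, not the code in sdk.py: MsgType, TERMINAL_STATES, tail_logs, and the get_status callback are hypothetical stand-ins, and the set of terminal states is an assumption made for the example.

```python
import asyncio
from enum import Enum


class MsgType(Enum):
    """Hypothetical stand-in for aiohttp.WSMsgType."""
    TEXT = "text"
    CLOSED = "closed"
    ERROR = "error"


# Assumed set of terminal job states, for illustration only.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


async def tail_logs(messages, get_status):
    """Yield log lines; raise if the stream ends before the job is terminal."""
    job_status = None
    for msg_type, data in messages:
        await asyncio.sleep(0)  # stand-in for awaiting ws.receive()
        if msg_type is MsgType.TEXT:
            job_status = get_status()  # refresh the status on every log message
            yield data
        else:  # CLOSED or ERROR
            if job_status not in TERMINAL_STATES:
                raise RuntimeError(
                    f"Connection closed while job status was {job_status}"
                )
            return


async def collect(messages, statuses):
    """Drain tail_logs, feeding it one canned status per log message."""
    status_iter = iter(statuses)
    return [line async for line in tail_logs(messages, lambda: next(status_iter))]


def run_normal():
    msgs = [(MsgType.TEXT, "1"), (MsgType.TEXT, "2"), (MsgType.CLOSED, None)]
    return asyncio.run(collect(msgs, ["RUNNING", "SUCCEEDED"]))


def run_broken():
    msgs = [(MsgType.TEXT, "1"), (MsgType.CLOSED, None)]
    try:
        asyncio.run(collect(msgs, ["RUNNING"]))
    except RuntimeError as exc:
        return str(exc)
    return None
```

In this sketch, run_normal() returns the log lines because the last observed status is terminal, while run_broken() returns the error message raised when the stream closes with the job still RUNNING.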

@machichima machichima requested a review from a team as a code owner September 30, 2025 12:59
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request fixes an issue where tail_job_logs() would not raise an error if the connection to the Ray head node was lost while a job was still running. The change introduces a check for the job's status when the log-tailing websocket is closed.

My review focuses on improving the performance and correctness of this new logic. The current implementation introduces a blocking, synchronous call inside an async function, and it inefficiently queries the job status on every log message. I've provided a suggestion to make the call non-blocking and to only query the status when necessary, which improves both performance and robustness.
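
One common way to keep a synchronous call from stalling an async function is to hand it to a worker thread. The sketch below is an assumption about what such a change could look like, not the suggestion actually posted on this PR: get_job_status_blocking is a hypothetical stand-in for the synchronous status query, and asyncio.to_thread requires Python 3.9 or newer.

```python
import asyncio
import time


def get_job_status_blocking(job_id):
    """Hypothetical synchronous status query (stands in for an HTTP request)."""
    time.sleep(0.05)
    return "RUNNING"


async def heartbeat(ticks):
    """Append a tick on every loop turn to show the event loop stays responsive."""
    while True:
        ticks.append(1)
        await asyncio.sleep(0.01)


async def query_without_blocking(job_id):
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    # Run the blocking call in a worker thread so the event loop keeps running.
    status = await asyncio.to_thread(get_job_status_blocking, job_id)
    ticks_during_query = len(ticks)
    hb.cancel()
    await asyncio.gather(hb, return_exceptions=True)  # swallow the cancellation
    return status, ticks_during_query


def demo():
    return asyncio.run(query_without_blocking("job-123"))
```

If get_job_status_blocking were called directly inside the coroutine, the heartbeat task would not get a chance to run until the call returned, so ticks_during_query would be 0.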

Signed-off-by: machichima <[email protected]>
@machichima machichima force-pushed the 57002-tail-log-error-handle branch from 31cde67 to aea0717 Compare September 30, 2025 13:01
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Sep 30, 2025
Comment on lines 499 to 505
# Query job status after receiving each message to track state
try:
    job_info = self.get_job_info(job_id)
    job_status = job_info.status
except Exception as e:
    raise RuntimeError(f"Failed to get job status for {job_id}.") from e

Member

Should we query the job info and job status outside of the loop?
That way we only have to query once.

Contributor Author

We need the up-to-date status from just before the connection closed, which is why the query has to be in the while loop.
The job info query runs each time we get a new message, which is not that frequent.

Member

Is it possible to check the msg to detect loss of connection?

Contributor Author

I tried checking msg and ws, but cannot really identify any difference between a normal close and an abnormal one.

This is the output for normal close:

❯ python reproduce_tail_logs_issue.py
Job submitted with ID: raysubmit_SVbekdrqykf3mz7c
Starting to tail job logs...
2025-10-02 20:49:50,188 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=2025-10-02 20:49:47,066 INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
, extra=
2025-10-02 20:49:50,188 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 2025-10-02 20:49:47,066    INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
job status: RUNNING
2025-10-02 20:49:51,190 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=3
, extra=
2025-10-02 20:49:51,191 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 3
job status: RUNNING
2025-10-02 20:49:52,190 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=4
, extra=
2025-10-02 20:49:52,190 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 4
job status: RUNNING
2025-10-02 20:49:53,192 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=5
, extra=
2025-10-02 20:49:53,192 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 5
job status: SUCCEEDED
2025-10-02 20:49:54,193 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=Job completed
, extra=
2025-10-02 20:49:54,193 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: Job completed
job status: SUCCEEDED
2025-10-02 20:49:57,197 INFO sdk.py:502 -- [DEBUG] msg attributes: type=8, data=1000, extra=
2025-10-02 20:49:57,197 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
2025-10-02 20:49:57,197 INFO sdk.py:502 -- [DEBUG] msg attributes: type=257, data=None, extra=None
2025-10-02 20:49:57,197 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
tail_job_logs() returned normally (no exception)

This is the output when the Ray head is terminated before the job finishes:

❯ python reproduce_tail_logs_issue.py
Job submitted with ID: raysubmit_DFgSjCpApJcswQFT
Starting to tail job logs...
2025-10-02 20:49:28,838 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=2025-10-02 20:49:25,729 INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
, extra=
2025-10-02 20:49:28,838 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 2025-10-02 20:49:25,729    INFO job_manager.py:568 -- Runtime env is setting up.
Job started
1
2
job status: RUNNING
2025-10-02 20:49:29,847 INFO sdk.py:502 -- [DEBUG] msg attributes: type=1, data=3
, extra=
2025-10-02 20:49:29,847 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=False, close_code=None, protocol=None
LOG: 3
job status: RUNNING
2025-10-02 20:49:30,076 INFO sdk.py:502 -- [DEBUG] msg attributes: type=8, data=1000, extra=
2025-10-02 20:49:30,076 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
2025-10-02 20:49:30,076 INFO sdk.py:502 -- [DEBUG] msg attributes: type=257, data=None, extra=None
2025-10-02 20:49:30,076 INFO sdk.py:505 -- [DEBUG] ws attributes: closed=True, close_code=1000, protocol=None
tail_job_logs() returned normally (no exception)
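
The two runs above end with the same final pair of events. A small sketch that encodes those two tails makes the comparison concrete. The values are copied from the logs; the CloseEvent class and its field names are hypothetical. As far as I know, types 8 and 257 correspond to aiohttp's WSMsgType.CLOSE and WSMsgType.CLOSED.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloseEvent:
    """Hypothetical record of one '[DEBUG] msg/ws attributes' pair from the logs."""
    msg_type: int
    data: object
    ws_closed: bool
    close_code: int


# Final two events from the normal-close run.
normal_close = [
    CloseEvent(msg_type=8, data=1000, ws_closed=True, close_code=1000),
    CloseEvent(msg_type=257, data=None, ws_closed=True, close_code=1000),
]

# Final two events from the run where the Ray head was terminated.
head_terminated = [
    CloseEvent(msg_type=8, data=1000, ws_closed=True, close_code=1000),
    CloseEvent(msg_type=257, data=None, ws_closed=True, close_code=1000),
]

# Last "job status:" line printed in each run.
last_status = {"normal": "SUCCEEDED", "head_terminated": "RUNNING"}

# The close events are identical; only the last observed job status differs.
same_close_events = normal_close == head_terminated
status_differs = last_status["normal"] != last_status["head_terminated"]
```

In these two runs, the only signal that separates them is the job status observed before the close.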

)

while True:
    msg = await ws.receive()
Contributor

Won’t this ws.receive raise an exception when the connection is broken? Isn’t that enough? I actually think there is no need to query the job status.

Contributor Author

@machichima machichima Oct 1, 2025

Actually, it will not raise an error when the connection is lost. I think it does in older versions (what we use in python/ray/dashboard/modules/job/tests/backwards_compatibility_scripts/test_backwards_compatibility.sh), but in newer versions it does not and just closes.

Even the close_code is no different from that of a normal close.

cursor[bot]

This comment was marked as outdated.

Comment on lines 502 to 513
    print(f"Close code: {ws.close_code}")
    if ws.close_code == aiohttp.WSCloseCode.ABNORMAL_CLOSURE:
        raise RuntimeError(
            f"WebSocket connection closed unexpectedly while job with close code {ws.close_code}"
        )
    break
elif msg.type == aiohttp.WSMsgType.ERROR:
    pass
    # Old Ray versions may send ERROR on connection close
    raise RuntimeError(
        f"WebSocket error while tailing logs for job {job_id}. Err: {ws.exception()}"
    )
    break
Member

I LOVE THIS!

if msg.type == aiohttp.WSMsgType.TEXT:
    yield msg.data
elif msg.type == aiohttp.WSMsgType.CLOSED:
    print(f"Close code: {ws.close_code}")
Collaborator

Delete this or use the logger.

Comment on lines 509 to 512
# Old Ray versions may send ERROR on connection close
raise RuntimeError(
    f"WebSocket error while tailing logs for job {job_id}. Err: {ws.exception()}"
)
Collaborator

Why do we need to handle old Ray versions here? I think we only support a job client and job server of the same version.

Collaborator

The client only uses HTTP/websocket protocol so the compatibility requirements are looser than that. We don't give an exact guarantee though.

cursor[bot]

This comment was marked as outdated.

Raises:
    RuntimeError: If the job does not exist or if the request to the
        job server fails.
    RuntimeError: If the job does not exist, if the request to the
Collaborator

Is it easy to write a test for it?

Contributor Author

Let me have a try!

Contributor Author

Added in 899c05d
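
For readers without access to that commit, here is a rough idea of how the close-code branch could be unit tested with a fake WebSocket. This is not the test added in 899c05d. FakeWebSocket, tail, msg, and drain are hypothetical names, tail is a simplified stand-in for the loop under review, and the integer constants mirror what I believe are the values of aiohttp.WSMsgType.TEXT, aiohttp.WSMsgType.CLOSED, and aiohttp.WSCloseCode.ABNORMAL_CLOSURE.

```python
import asyncio
from types import SimpleNamespace

TEXT, CLOSED = 1, 257      # assumed values of aiohttp.WSMsgType.TEXT / CLOSED
ABNORMAL_CLOSURE = 1006    # assumed value of aiohttp.WSCloseCode.ABNORMAL_CLOSURE


class FakeWebSocket:
    """Hypothetical test double that replays queued messages."""

    def __init__(self, messages, close_code):
        self._messages = list(messages)
        self.close_code = close_code

    async def receive(self):
        return self._messages.pop(0)


async def tail(ws):
    """Simplified stand-in for the loop under review."""
    while True:
        message = await ws.receive()
        if message.type == TEXT:
            yield message.data
        elif message.type == CLOSED:
            if ws.close_code == ABNORMAL_CLOSURE:
                raise RuntimeError(
                    f"WebSocket closed unexpectedly with close code {ws.close_code}"
                )
            break


def msg(type_, data=None):
    """Build a message object with the two attributes the loop reads."""
    return SimpleNamespace(type=type_, data=data)


async def drain(ws):
    return [line async for line in tail(ws)]


def test_normal_close_returns_lines():
    ws = FakeWebSocket([msg(TEXT, "1"), msg(CLOSED)], close_code=1000)
    assert asyncio.run(drain(ws)) == ["1"]


def test_abnormal_close_raises():
    ws = FakeWebSocket([msg(TEXT, "1"), msg(CLOSED)], close_code=ABNORMAL_CLOSURE)
    try:
        asyncio.run(drain(ws))
    except RuntimeError as exc:
        assert "1006" in str(exc)
    else:
        raise AssertionError("expected RuntimeError")
```

A fake like this only exercises the branch logic. Whether a real head-node failure surfaces as close code 1006 still needs an integration test against a running cluster.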

cursor[bot]

This comment was marked as outdated.

Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core)

Development

Successfully merging this pull request may close these issues.

[Dashboard] tail_job_logs() exits normally when WebSocket connection is lost unexpectedly

5 participants