Make message receive and handling async #1140

halleysfifthinc · 2025-01-14T23:32:07Z

Motivation

All messages from the front-end/server are received and handled synchronously, including custom comm messages (comm_open, comm_msg, and comm_close). So, any currently executing cell blocks the IJulia kernel from receiving and handling any IOPub/comm messages. For example, in the following WebIO MWE, a JS function updates an "output" Observable, and the JS function is triggered by setting an ("input") observable:

using WebIO, Observables
s = Scope()
s["in"] = Observable{String}("")
s["out"] = Observable{String}("")
onjs(s["in"], js"""
function (val)
    _webIOScope.setObservableValue("out",val);
end""")

you can't observe a new s["out"] value (aka the result of the JS function) during execution of the same cell that set s["in"] (which triggers the JS function).

Example Julia function that fails (hangs) without async comms

function julia_js_julia(_in, out, str)
    ch = Channel{String}()
    obsf = on(out) do val
        put!(ch, val)
    end
    t = @async take!(ch)
    _in[] = str
    out = fetch(t)
    off(obsf)
        
    return out
end

*This example function isn't thread-safe. (The scp["in"] observable isn't locked, so concurrently setting it could lead to interleaved/mismatched updates to the scp["out"] observable.)

One example of an actual use-case/benefit is PlotlyJS.to_image, which uses the same
Julia => JS => Julia observable setup to retrieve the results of a plotly.js function call.
Currently, the PlotlyJS.to_image function soft-fails because the observable that holds the generated
image is only updated after the current cell finishes execution (when IJulia can process the
comm_msg from WebIO in the Jupyter frontend/client).

Testing

I've manually tested that the above WebIO MWE works with this PR, and that interrupting still works. I realize this is a fairly fundamental rearchitecturing of the message receiving/handling, but I'm not sure what else to test and/or if there is a good way to test any of this in CI. I'm open to any hints/pointers if you want more thorough testing/test cases.

Fixes #858.

P.S. Breadcrumb for the future: This new architecture has a lot of parallels (easily adapted) to the new subshells feature that was recently implemented in ipython/ipykernel#1249.

JamesWrigley · 2025-01-28T12:02:14Z

This sounds like a good idea, but it absolutely needs tests before merging. At some point I'll start writing tests for more of the internals which you should be able to modify for this PR, but feel free to have a go already if you have time :)

halleysfifthinc · 2025-01-28T22:07:55Z

👍 I will wait until you've added more internals tests before I do anything further. I am/have been running IJulia with this PR to give any bugs the opportunity to surface.

JamesWrigley · 2025-08-22T13:48:30Z

If you rebase this on master I think we can continue with it 🙂 Couple things:

We should use Threads.@spawn instead of @async.
We should run CI with multiple threads by default to try to catch any race conditions.

halleysfifthinc · 2025-08-22T16:28:36Z

Will do! The use of @async was actually intentional. My goal was to keep IJulia specific activity on the interactive thread. We could potentially go even further and @spawn cell execution on non-interactive threads to more intentionally separate user and IJulia activity. That could theoretically be helpful in some situations, but this PR will already be a(nother) significant rearchitecture of a core part of IJulia. (And I can't think of a specific motivating example.)

JamesWrigley · 2025-08-22T16:42:56Z

Keeping it on the interactive threads make sense, but for that we should use Threads.@spawn :interactive. @async has the unfortunate side-effect of pinning the parent task to the same thread so it's kinda discouraged now.

halleysfifthinc · 2025-08-22T17:24:09Z

pinning the parent task to the same thread

Right.. Is that not equivalent to Threads.@spawn :interactive? The rest of the IJulia kernel is synchronous/not using tasks, so it will always be on the first (aka interactive) thread, and we want the rest of the IJulia activity to stay on that thread too, just allowed to be asynchronous/concurrent?

Happy to learn more if I'm wrong, this was my first serious foray into async/concurrent programming!

JamesWrigley · 2025-08-22T19:07:19Z

That is technically true, but @async is still deprecated so I'd prefer we stick with Threads.@spawn and explicitly specifying the threadpool. One other advantage is that if there's multiple threads in the interactive threadpool then we can use all of them instead of one.

codecov · 2025-10-02T18:41:05Z

Codecov Report

❌ Patch coverage is 80.59701% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.96%. Comparing base (ab44427) to head (382b660).

Files with missing lines	Patch %	Lines
src/eventloop.jl	80.00%	12 Missing ⚠️
src/handlers.jl	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1140      +/-   ##
==========================================
+ Coverage   68.65%   68.96%   +0.30%     
==========================================
  Files          16       16              
  Lines        1056     1089      +33     
==========================================
+ Hits          725      751      +26     
- Misses        331      338       +7

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

halleysfifthinc

I've left some comments to explain some design decisions and/or about open questions I have.

I'm still unsure how to add tests for this, and I'd welcome any brainstorming.

src/eventloop.jl

halleysfifthinc · 2025-10-02T18:45:19Z

src/eventloop.jl

@@ -76,12 +125,14 @@
            # send interrupts (user SIGINT) to the code-execution task
            if isa(e, InterruptException)
                @async Base.throwto(kernel.requests_task[], e)
+                @async Base.throwto(kernel.iopub_task[], e)
            else
                rethrow()
            end
        finally
            wait(control_task)
            wait(kernel.requests_task[])
+            wait(kernel.iopub_task[])
        end
    end


I'm not sure that this needs to be in a while loop vs something like

Suggested change

try

waitall([control_task, kernel.requests_task[], kernel.iopub_task[]])

catch

# send interrupts (user SIGINT) to the code-execution task

if isa(e, InterruptException)

@async Base.throwto(kernel.requests_task[], e)

@async Base.throwto(kernel.iopub_task[], e)

else

rethrow()

end

finally

wait(kernel.close_event)

end

And maybe not even the finally clause? Basically, with the wait, this task shouldn't be scheduled again unless one of the message handling tasks fails, which we aren't trying to recover from. So if we do get back here, its because we want to/have to stop.

Yeah I agree, I was looking at this recently and thought the control flow was a bit strange 😅

src/eventloop.jl

halleysfifthinc · 2025-10-02T18:49:20Z

src/handlers.jl

+const iopub_handlers = Dict{String,Function}(
+    "comm_open" => comm_open,
+    "comm_msg" => comm_msg,
+    "comm_close" => comm_close,


I am now wondering if the async handling should be expanded to most messages besides "execute_request"? In particular, "complete_request" and "inspect_request" are (should be?) side-effect free, and would be really convenient to be able to e.g. see the docs for a functions when writing a new cell while another cell is mid-execution.

src/init.jl

src/eventloop.jl

JamesWrigley · 2025-10-06T08:05:32Z

Sorry I missed this 🙈 I'll try to review it this week but feel free to ping me if I forget.

JamesWrigley

I'm not quite convinced that what we're doing here is safe. If I understand correctly the reasoning is:

ZMQ sockets are not thread-safe.
Thus we use @async to ensure that all tasks are running on the same thread.
Thus we can safely recv/send in different tasks as long as we lock appropriately to prevent one recv being interleaved with another recv (likewise for send)

But that's making the assumption that ZMQ.jl's recv and send don't do anything to the socket internally that may conflict with each other, and I don't think that's true. Imagine this sequence:

Task 1 is sending and yields immediately after calling zmq_msg_send(): https://github.com/JuliaInterop/ZMQ.jl/blob/1e1b458180311b19127937e8dd0befa79a93d54f/src/comm.jl#L8
Let's say that zmq_msg_send() fails because we have to try again (EAGAIN).
Task 2 is receiving and yields immediately after calling zmq_msg_recv(): https://github.com/JuliaInterop/ZMQ.jl/blob/1e1b458180311b19127937e8dd0befa79a93d54f/src/comm.jl#L80
Let's say it fails for some non-EAGAIN reason (maybe a corrupted message or something). This overwrites the internal error code from zmq_msg_send().
Control switches back to Task 1 which calls zmq_errno(), which returns the error code from the call to zmq_msg_recv() and thus incorrectly fails instead of trying again.

Now I'm pretty sure that neither send() or recv() will yield in those places so in practice this particular situation couldn't happen right now, but that's an implementation detail of ZMQ and certainly not something we can rely on. But I also can't think of a good alternative yet 🤔

Also, I fixed some lingering-task issues in #1190 which seems to have caused some merge conflicts, sorry about that 🙈

src/eventloop.jl

JamesWrigley · 2025-10-11T14:51:27Z

src/eventloop.jl

@@ -76,12 +125,14 @@
            # send interrupts (user SIGINT) to the code-execution task
            if isa(e, InterruptException)
                @async Base.throwto(kernel.requests_task[], e)
+                @async Base.throwto(kernel.iopub_task[], e)
            else
                rethrow()
            end
        finally
            wait(control_task)
            wait(kernel.requests_task[])
+            wait(kernel.iopub_task[])
        end
    end


Yeah I agree, I was looking at this recently and thought the control flow was a bit strange 😅

JamesWrigley · 2025-10-11T15:39:27Z

Hmm a nice design would be to use a poller that could poll the iopub socket and an internal inproc socket that we send messages to. But ZMQ doesn't have a poller yet... JuliaInterop/ZMQ.jl#52

JamesWrigley · 2025-10-11T15:47:14Z

Using timeouts would also work, but myeh 🤷

halleysfifthinc · 2025-10-13T18:21:10Z

ZMQ sockets are not thread-safe.

Thus we use @async to ensure that all tasks are running on the same thread.

Thus we can safely recv/send in different tasks as long as we lock appropriately to prevent one recv being interleaved with another recv (likewise for send)

So the @async keeping things on the same thread is unrelated to the ZMQ sockets.

The actual motivating factor behind splitting the socket locks into read/write is because the read channel/task yields (waiting to read from the socket) while holding the lock. This caused a deadlock when another task tries to send, even though the socket is otherwise quiet (not actively receiving).

To avoid the split locks, we need a way to (in the receive channel/task) release the lock on a yielding wait (i.e. the socket doesn't have anything to read so the task yields). I couldn't figure out how to do that back when I first made this PR. I'll take another look to see if I can figure it out now.

JamesWrigley · 2025-11-23T11:04:39Z

Ok, now ZMQ.jl has a poller 😅 Not in a release yet but you can dev ZMQ for now and I'll tag a release before this is merged. I would propose this implementation for all sockets that need to be used by multiple tasks:

One socket that connects to the Jupyter ZMQ address, and another inproc socket that IJulia tasks can send messages to to be forwarded to the Jupyter socket.
The Jupyter socket is only touched by a single task that polls both the Jupyter socket and its corresponding inproc socket. Received messages from the Jupyter socket are put on a Channel and received messages from the inproc socket are forwarded to the Jupyter socket. Only the inproc socket needs to be locked (for both sends and recvs).

I believe that's fully threadsafe 🤞 What do you think?

JamesWrigley · 2025-11-26T11:05:19Z

I realized today that the Poller uses tasks internally, so they also need to be robust against InterruptException's. Will check that later. I think the cleanest way would be to bubble up the exception to wait(::Poller), so here we would just need to wrap the polling loop in a try-catch.

JamesWrigley · 2025-12-12T12:11:46Z

Gentle bump, did you have any luck with this?

halleysfifthinc · 2025-12-15T22:40:41Z

Taking another look now.. IIUC, the main/only(?) reason to use an inproc for the second, internal socket (instead of e.g. a Channel) is so the "conductor" task can have a single/unified thing to wait on (e.g. ZMQ.Poller), that is woken by either an incoming message on the Jupyter socket from the client, or an outgoing message to forward on the internal socket?

JamesWrigley · 2025-12-15T23:19:27Z

Yep exactly. That way we can guarantee that each socket is only ever touched by a single IJulia task at a time and also have a proper event-driven loop.

Also, I take back what I said about the inproc sockets needing to have a lock for sends and recvs. The recv socket will only ever be used from the conductor task so only the send socket needs a lock in case multiple tasks try to send stuff.

halleysfifthinc force-pushed the async-comms branch from 5e55f55 to 6e2a01b Compare January 15, 2025 00:54

halleysfifthinc changed the title ~~WIP: Make message receive and handling async~~ Make message receive and handling async Jan 15, 2025

JamesWrigley mentioned this pull request Feb 5, 2025

Giant refactor to move all state into a Kernel struct #1145

Merged

halleysfifthinc added 2 commits September 16, 2025 09:41

Separate socket locks into send & recv locks

c954180

Refactor eventloop to enable async comms

382b660

halleysfifthinc marked this pull request as draft October 2, 2025 17:56

halleysfifthinc force-pushed the async-comms branch from 6e2a01b to 382b660 Compare October 2, 2025 18:36

halleysfifthinc commented Oct 2, 2025

View reviewed changes

JamesWrigley reviewed Oct 11, 2025

View reviewed changes

JamesWrigley mentioned this pull request Oct 31, 2025

Execution should stop after first error #376

Open

Uh oh!

Make message receive and handling async #1140

Are you sure you want to change the base?

Make message receive and handling async #1140

Conversation

halleysfifthinc commented Jan 14, 2025

Motivation

Testing

Uh oh!

JamesWrigley commented Jan 28, 2025

Uh oh!

halleysfifthinc commented Jan 28, 2025

Uh oh!

JamesWrigley commented Aug 22, 2025

Uh oh!

halleysfifthinc commented Aug 22, 2025

Uh oh!

JamesWrigley commented Aug 22, 2025

Uh oh!

halleysfifthinc commented Aug 22, 2025

Uh oh!

JamesWrigley commented Aug 22, 2025

Uh oh!

codecov bot commented Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

halleysfifthinc left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

halleysfifthinc Oct 2, 2025

Choose a reason for hiding this comment

Uh oh!

JamesWrigley Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

halleysfifthinc Oct 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

JamesWrigley commented Oct 6, 2025

Uh oh!

JamesWrigley left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JamesWrigley Oct 11, 2025

Choose a reason for hiding this comment

Uh oh!

JamesWrigley commented Oct 11, 2025

Uh oh!

JamesWrigley commented Oct 11, 2025

Uh oh!

halleysfifthinc commented Oct 13, 2025

Uh oh!

JamesWrigley commented Nov 23, 2025

Uh oh!

JamesWrigley commented Nov 26, 2025

Uh oh!

JamesWrigley commented Dec 12, 2025

Uh oh!

halleysfifthinc commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JamesWrigley commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Oct 2, 2025 •

edited

Loading

halleysfifthinc commented Dec 15, 2025 •

edited

Loading

JamesWrigley commented Dec 15, 2025 •

edited

Loading