New Kernels API leveraging the kernel client abstraction for kernel comms #1570
Conversation
Introduces a next-generation kernel API (v3) that can be enabled via the `--kernels-v3` flag or by setting `kernels_api_version=3` in config. Key improvements:

- Shared kernel client per kernel: a single client instance is shared across all websocket connections, reducing resource usage and improving consistency
- Pre-created clients with automatic lifecycle management: clients connect on kernel start and disconnect on shutdown/restart
- Enhanced message routing: channel and cell ID encoding in message IDs enables precise message delivery to originating cells
- Improved kernel monitoring: better execution state tracking and heartbeat monitoring for both local and gateway kernels
- Backward compatible: defaults to the v2 API; v3 is opt-in

The v3 classes are swapped in automatically when enabled, requiring no manual configuration of individual components.
Pinging @rgbkrk @vidartf since we discussed this at JupyterCon 2025. @krassowski, since I mentioned it on the JupyterLab call. I realize this is a ton of code! 😅 I don't know of a better way to do this, so I'm opening it here in its full form; we can always split it into smaller PRs if necessary.
rgbkrk
left a comment
Since you made it a flagged change, it seems easy enough to get this out there soon; we don't have to compare and contrast with the old API. I love that this fixes multiple surface areas that affect the consistency of server-side models.
Looking forward to reviewing more in depth shortly.
This PR is coming soon after ;)
3coins
left a comment
@Zsailer
This looks great! I left some suggestions; none of them should block merging, since this is an opt-in API change.
| channel_name = "shell" | ||
|
|
||
|
|
||
| class ControlChannel(AsyncZMQSocketChannel): |
Should this extend from `NamedAsyncZMQSocketChannel` if encoding the channel name is desired?
| channel_name = "control" | ||
|
|
||
|
|
||
| class StdinChannel(AsyncZMQSocketChannel): |
Should this extend from `NamedAsyncZMQSocketChannel` if encoding the channel name is desired?
```python
gateway_enabled = getattr(self, "gateway_config", None) and getattr(
    self.gateway_config, "gateway_enabled", False
)
if gateway_enabled:
    return "jupyter_server.gateway.v3.managers.GatewayMultiKernelManager"
return "jupyter_server.services.kernels.v3.kernelmanager.AsyncMappingKernelManager"

# ... the same check appears again elsewhere in the diff:
gateway_enabled = getattr(self, "gateway_config", None) and getattr(
    self.gateway_config, "gateway_enabled", False
)
if gateway_enabled:
```
Would be cleaner to extract this into a property:

```python
@property
def is_gateway_enabled(self) -> bool:
    """Check if gateway is configured and enabled."""
    return bool(
        getattr(self, "gateway_config", None)
        and getattr(self.gateway_config, "gateway_enabled", False)
    )
```

```python
# Process queued messages
asyncio.create_task(self._process_queued_messages())
```
Should we track these as background tasks for proper cleanup if the client shuts down?
```python
task = asyncio.create_task(self._process_queued_messages())
if not hasattr(self, "_background_tasks"):
    self._background_tasks = []
self._background_tasks.append(task)
```
It seems that this is inspired by Jupyverse; I opened an issue to make that change in Jupyter Server four years ago.
How will this affect other Jupyter Server implementations like Jupyverse? Does it require a JEP?
I would also like to know how this differs from server-side execution, which Jupyverse already supports. Cell output recovery has been supported with server-side execution for two years.
Yes, though I know folks have discussed this approach long before Jupyverse existed. Unfortunately, no one ever took the plunge to change this in Jupyter Server because it's quite challenging here—many people depend on the existing manager paradigm for complex kernel lifecycle systems (e.g., remote kernels that communicate over a gateway), so implementing this change in jupyter_server requires more careful consideration of backwards compatibility. I'm glad you pioneered this approach in Jupyverse; it's a great validation that the architecture works well. Jupyverse's ability to move quickly without legacy constraints is definitely an advantage.
This shouldn't impact other implementations at all. The message ID overloading is entirely internal to this implementation. It doesn't affect the protocol or the REST interface; no contracts are changed or affected. The routing and mapping are handled within this implementation and removed before the message "goes out the door" back to the frontend. Because of this, a JEP should not be needed.
I think it's worth asking this question over on the jupyter-server-documents repo, since we can speak more directly to the implementation differences there. One minor difference I'm aware of: server-side execution requires a small expansion of the REST API (a new endpoint), while this implementation works entirely within the existing APIs. Server-side execution is absolutely a valid approach, maybe even better in some ways. I don't have a strong opinion on the relative merits, but I found I could achieve the same goals without needing a new endpoint.

That said, ideally we can keep this review focused on solving the core issue of multiple ZMQ connections per kernel. I mentioned jupyter-server-documents mainly to provide additional motivation for why we want one ZMQ client per kernel, but the architectural benefits stand on their own.

Another major issue this PR addresses (and I'm sure you've solved in Jupyverse) is that kernel execution state conflates the kernel status on the control AND shell channels. The reason is that we don't know which channel triggered an IOPub (status) message. By prepending the channel in the message ID, we can resolve the channel in the server and ensure only shell-triggered IOPub messages update the kernel status.
This PR introduces a new kernel API that addresses some fundamental architectural issues with how Jupyter Server currently manages kernel communication.
Background
The current kernel architecture (I'm calling it v2) creates separate ZMQ connections for each websocket client connected to a kernel. This means if you have a notebook open in multiple browser tabs, or multiple frontends connected to the same kernel, each one establishes its own set of ZMQ sockets. This works, but it's inefficient and can lead to subtle inconsistencies since each connection is independently managing its view of kernel state.
Additionally, tracking which messages belong to which cell execution has been challenging. When a kernel sends back results, we need to route them to the correct cell, but the message routing logic has been scattered and inconsistent.
What's in this new version (v3)
The v3 API takes a different approach: each kernel gets a single, pre-created kernel client that's shared across all websocket connections. When a client connects via websocket, it registers as a listener on this shared client rather than creating its own ZMQ connections. The shared client handles all the actual kernel communication, and broadcasts messages to all registered listeners.
This architecture enables some nice improvements. Message routing becomes more precise because we can encode channel and cell ID information directly in message IDs. Kernel state tracking is more consistent since there's a single source of truth. And resource usage goes down since we're not multiplying ZMQ connections.
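As a rough sketch of the shared-client idea: a single per-kernel client fans messages out to registered listeners instead of each websocket opening its own sockets. The names below (`SharedKernelClient`, `add_listener`, `broadcast`) are illustrative, not the actual v3 API surface:

```python
import asyncio
from typing import Awaitable, Callable

# A listener receives the channel name and the deserialized message dict.
Listener = Callable[[str, dict], Awaitable[None]]


class SharedKernelClient:
    """One client per kernel; each websocket connection registers a
    listener instead of opening its own ZMQ sockets."""

    def __init__(self) -> None:
        self._listeners: set[Listener] = set()

    def add_listener(self, listener: Listener) -> None:
        self._listeners.add(listener)

    def remove_listener(self, listener: Listener) -> None:
        self._listeners.discard(listener)

    async def broadcast(self, channel: str, msg: dict) -> None:
        # Fan a single kernel message out to every registered listener.
        await asyncio.gather(*(fn(channel, msg) for fn in self._listeners))
```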
The implementation works for both local kernels and gateway kernels, so the benefits apply regardless of your deployment setup.
Overloading the message ID for more accurate routing
One of the more significant (and potentially controversial) aspects of this implementation is how we handle message routing. The v3 API encodes both the parent channel name and the source cell ID directly into Jupyter protocol message IDs using a structured format: `{channel}:{base_msg_id}#{src_id}`. For example, a message might have an ID like `shell:a1b2c3d4_12345_0#cell-abc123`. This solves a longstanding ambiguity in the Jupyter protocol described in jupyter/jupyter_client#839: when an IOPub status message arrives, there's no standard way to determine whether it originated from a shell channel request or a control channel request. By encoding the channel in the message ID server-side, we can track execution state more accurately and route messages to the correct destination without maintaining separate message caches or parsing metadata.

The cell ID encoding similarly enables precise output routing: when kernel results come back, we know exactly which cell to deliver them to. We leverage this in jupyter-server-documents to route messages to a server-side document model for each notebook. The encoding is stripped before messages reach the frontend, so clients see standard Jupyter protocol message IDs.
This might be overloading the message ID field a bit; however, I would argue that channel and source ID (or cell_id more specifically) are two integral parts of a message's ID and helpful to include.
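As a rough illustration of how such an encoding can be packed and unpacked (the helper names here are hypothetical; the actual v3 code may be structured differently):

```python
from typing import Optional, Tuple


def encode_msg_id(channel: str, base_msg_id: str, src_id: Optional[str] = None) -> str:
    """Build '{channel}:{base_msg_id}#{src_id}', omitting '#src_id' when absent."""
    encoded = f"{channel}:{base_msg_id}"
    return f"{encoded}#{src_id}" if src_id else encoded


def decode_msg_id(msg_id: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split an encoded ID back into (channel, base_msg_id, src_id)."""
    channel, sep, rest = msg_id.partition(":")
    if not sep:  # not encoded; a plain protocol message ID
        channel, rest = None, msg_id
    base, _, src_id = rest.partition("#")
    return channel, base, src_id or None


# "shell:a1b2c3d4_12345_0#cell-abc123" -> ("shell", "a1b2c3d4_12345_0", "cell-abc123")
assert decode_msg_id("shell:a1b2c3d4_12345_0#cell-abc123") == (
    "shell",
    "a1b2c3d4_12345_0",
    "cell-abc123",
)
```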
The jupyter-server-documents use case
A major motivator for this work is enabling server-side document state management, which is what jupyter-server-documents provides. That extension gives you real-time collaboration by maintaining the notebook document state on the server rather than just in the browser. This fixes a long-standing bug where kernel execution state and cell outputs are lost when a notebook is closed during execution.
With the v2 kernel API, kernel messages flow to each websocket client, which then updates its local document model and syncs changes back to the server. This means the server's view of the document is always slightly behind, and there's no way for the server to intercept and process kernel outputs before they reach clients.
The v3 API changes this completely. The shared kernel client architecture means we can register custom listeners that intercept kernel messages server-side. The `DocumentAwareKernelClient` in jupyter-server-documents does exactly this: it extends `JupyterServerKernelClient` and adds listeners that route execution results, outputs, and state changes directly into the server-side collaborative document (a YRoom). The server processes and stores outputs, updates cell execution states, and manages kernel status, all before broadcasting to clients. This is what enables features like smart output separation, better memory management, and more reliable collaboration.

You can see this integration in action in jupyter-server-documents#170, where these APIs were originally developed as a standalone library and are now being upstreamed here.
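As a sketch of what a server-side listener might look like under this architecture (the class, the `append_output` call, and the message filtering are illustrative, not the actual jupyter-server-documents implementation):

```python
class DocumentRoutingListener:
    """Illustrative listener that routes IOPub outputs into a server-side
    document model before they are broadcast to websocket clients."""

    def __init__(self, document_model) -> None:
        self.document = document_model  # e.g. a wrapper around a YRoom

    async def __call__(self, channel: str, msg: dict) -> None:
        # The parent msg_id carries the encoded channel and cell ID, so the
        # document model can place the output in the correct cell.
        if channel == "iopub" and msg.get("msg_type") in ("stream", "execute_result", "display_data"):
            parent_id = msg["parent_header"]["msg_id"]
            self.document.append_output(parent_id, msg["content"])
```

An instance of such a listener would be registered on the shared kernel client alongside the websocket listeners, so outputs are persisted server-side regardless of which clients are connected.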
Current layout of submodules
All the v3 classes live in `v3/` subdirectories (`jupyter_server/services/kernels/v3/` and `jupyter_server/gateway/v3/`) to maintain a clean separation from the existing v2 implementation. This makes it easy to see what's new, compare implementations, and eventually transition. The goal is to reach feature parity with v2 while keeping the distinction clear, then potentially make v3 the default in Jupyter Server 3.0.

This PR is low-risk to merge because the v3 code is completely isolated and opt-in. The existing v2 code paths are unchanged; users get the current behavior by default. The v3 implementation only activates when explicitly enabled via the `--kernels-v3` flag or config setting. This allows us to ship the code, gather real-world feedback, and iterate on the v3 implementation over time without affecting existing deployments. We can move components from v3 to mainline incrementally as they mature.

Trying it out
To enable the v3 API, you can start JupyterLab with the `--kernels-v3` flag:
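```bash
# Assuming the flag is exposed on the `jupyter lab` entry point:
jupyter lab --kernels-v3
```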
Or if you prefer to set it in your config file:
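A sketch, assuming the `kernels_api_version` trait is defined on `ServerApp`:

```python
# In jupyter_server_config.py; the trait's location on ServerApp is an assumption.
c.ServerApp.kernels_api_version = 3
```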
You can also set it directly via the trait:
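For example, when configuring the server programmatically (again assuming the trait lives on `ServerApp`):

```python
# Hypothetical programmatic equivalent; the trait location is an assumption.
from jupyter_server.serverapp import ServerApp

app = ServerApp()
app.kernels_api_version = 3
```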
The default remains v2, so this is completely opt-in. The v3 classes get swapped in automatically when you enable the flag - you don't need to manually configure individual components.
What's next
I've tested this with basic kernel operations (start, execute, shutdown, restart) and it's working well with jupyter-server-documents. The v3 classes inherit from the same base classes as v2, so they should be compatible with existing extensions and customizations, but broader testing would be valuable. The immediate goal is reaching feature parity with v2 while gathering feedback, with an eye toward making this the default in a future major release.