Conversation

@Zsailer Zsailer commented Nov 13, 2025

This PR introduces a new kernel API that addresses some fundamental architectural issues with how Jupyter Server currently manages kernel communication.

Background

The current kernel architecture (I'm calling it v2) creates separate ZMQ connections for each websocket client connected to a kernel. This means if you have a notebook open in multiple browser tabs, or multiple frontends connected to the same kernel, each one establishes its own set of ZMQ sockets. This works, but it's inefficient and can lead to subtle inconsistencies, since each connection independently manages its own view of kernel state.

Additionally, tracking which messages belong to which cell execution has been challenging. When a kernel sends back results, we need to route them to the correct cell, but the message routing logic has been scattered and inconsistent.

What's in this new version (v3)

The v3 API takes a different approach: each kernel gets a single, pre-created kernel client that's shared across all websocket connections. When a client connects via websocket, it registers as a listener on this shared client rather than creating its own ZMQ connections. The shared client handles all the actual kernel communication, and broadcasts messages to all registered listeners.

This architecture enables some nice improvements. Message routing becomes more precise because we can encode channel and cell ID information directly in message IDs. Kernel state tracking is more consistent since there's a single source of truth. And resource usage goes down since we're not multiplying ZMQ connections.

The implementation works for both local kernels and gateway kernels, so the benefits apply regardless of your deployment setup.
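
To make that concrete, here is a minimal sketch of the shared-client broadcast model, assuming illustrative method names (`add_listener`, `remove_listener`, `broadcast`) rather than the exact v3 API:

```python
import asyncio
from typing import Awaitable, Callable

Listener = Callable[[dict], Awaitable[None]]


class SharedKernelClient:
    """One instance per kernel, shared by every websocket connection."""

    def __init__(self) -> None:
        self._listeners: set = set()

    def add_listener(self, listener: Listener) -> None:
        """Register a websocket handler (or extension) as a listener."""
        self._listeners.add(listener)

    def remove_listener(self, listener: Listener) -> None:
        self._listeners.discard(listener)

    async def broadcast(self, msg: dict) -> None:
        # Messages from the single set of ZMQ sockets fan out to all
        # registered listeners, instead of each client owning its own sockets.
        await asyncio.gather(*(listener(msg) for listener in self._listeners))
```

Each websocket handler registers itself on connect and deregisters on disconnect, so the number of ZMQ sockets stays constant no matter how many tabs are open.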

Overload the message ID for more accurate routing

One of the more significant (and potentially controversial) aspects of this implementation is how we handle message routing. The v3 API encodes both the parent channel name and the source cell ID directly into Jupyter protocol message IDs using a structured format: {channel}:{base_msg_id}#{src_id}. For example, a message might have an ID like "shell:a1b2c3d4_12345_0#cell-abc123". This solves a longstanding ambiguity in the Jupyter protocol described in jupyter/jupyter_client#839: when an IOPub status message arrives, there's no standard way to determine whether it originated from a shell channel request or a control channel request. By encoding the channel in the message ID server-side, we can track execution state more accurately and route messages to the correct destination without maintaining separate message caches or parsing metadata.

The cell ID encoding similarly enables precise output routing - when kernel results come back, we know exactly which cell to deliver them to. We leverage this in jupyter-server-documents to route messages to a server-side document model for each notebook. The encoding is stripped before messages reach the frontend, so clients see standard Jupyter protocol message IDs.

This might be overloading the message ID field a bit; however, I would argue that channel and source ID (or cell_id more specifically) are two integral parts of a message's ID and helpful to include.
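
For illustration, helpers implementing the `{channel}:{base_msg_id}#{src_id}` scheme might look like the sketch below; the actual v3 function names may differ:

```python
from typing import Optional, Tuple


def encode_msg_id(channel: str, base_msg_id: str, src_id: Optional[str] = None) -> str:
    """Embed the channel and source (cell) ID into a message ID."""
    msg_id = f"{channel}:{base_msg_id}"
    if src_id is not None:
        msg_id += f"#{src_id}"
    return msg_id


def decode_msg_id(msg_id: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split an encoded ID back into (channel, base_msg_id, src_id)."""
    channel, sep, rest = msg_id.partition(":")
    if not sep:  # a plain, un-encoded message ID
        return None, msg_id, None
    base_msg_id, _, src_id = rest.partition("#")
    return channel, base_msg_id, src_id or None


# The example from above round-trips cleanly:
assert decode_msg_id("shell:a1b2c3d4_12345_0#cell-abc123") == (
    "shell",
    "a1b2c3d4_12345_0",
    "cell-abc123",
)
```

Because the encoding is stripped before messages reach the frontend, clients only ever see the plain `base_msg_id`.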

The jupyter-server-documents use case

A major motivator for this work is enabling server-side document state management, which is what jupyter-server-documents provides. That extension gives you real-time collaboration by maintaining the notebook document state on the server rather than just in the browser. This fixes a long-standing bug where kernel execution state and cell outputs are lost when a notebook is closed during execution.

With the v2 kernel API, kernel messages flow to each websocket client, which then updates its local document model and syncs changes back to the server. This means the server's view of the document is always slightly behind, and there's no way for the server to intercept and process kernel outputs before they reach clients.

The v3 API changes this completely. The shared kernel client architecture means we can register custom listeners that intercept kernel messages server-side. The DocumentAwareKernelClient in jupyter-server-documents does exactly this - it extends JupyterServerKernelClient and adds listeners that route execution results, outputs, and state changes directly into the server-side collaborative document (a YRoom). The server processes and stores outputs, updates cell execution states, and manages kernel status all before broadcasting to clients. This is what enables features like smart output separation, better memory management, and more reliable collaboration.
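
As a rough sketch of that interception pattern (building on the illustrative `SharedKernelClient` above; the YRoom interface shown here is a simplified stand-in, and the real class extends `JupyterServerKernelClient` in jupyter-server-documents):

```python
class DocumentAwareKernelClient(SharedKernelClient):
    """Routes kernel output into a server-side document before broadcasting."""

    OUTPUT_TYPES = ("execute_result", "stream", "display_data", "error")

    def __init__(self, yroom) -> None:
        super().__init__()
        self._yroom = yroom
        self.add_listener(self._route_to_document)

    async def _route_to_document(self, msg: dict) -> None:
        # Persist outputs into the collaborative document server-side,
        # so nothing is lost if every browser tab closes mid-execution.
        if msg.get("msg_type") in self.OUTPUT_TYPES:
            cell_id = msg.get("cell_id")  # recovered from the encoded msg_id (illustrative)
            self._yroom.append_output(cell_id, msg.get("content"))
```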

You can see this integration in action in jupyter-server-documents#170, where these APIs were originally developed as a standalone library and are now being upstreamed here.

Current layout of submodules

All the v3 classes live in v3/ subdirectories (jupyter_server/services/kernels/v3/ and jupyter_server/gateway/v3/) to maintain a clean separation from the existing v2 implementation. This makes it easy to see what's new, compare implementations, and eventually transition. The goal is to reach feature parity with v2 while keeping the distinction clear, then potentially make v3 the default in Jupyter Server 3.0.

This PR is low-risk to merge because the v3 code is completely isolated and opt-in. The existing v2 code paths are unchanged - users get the current behavior by default. The v3 implementation only activates when explicitly enabled via the --kernels-v3 flag or config setting. This allows us to ship the code, gather real-world feedback, and iterate on the v3 implementation over time without affecting existing deployments. We can move components from v3 to mainline incrementally as they mature.

Trying it out

To enable the v3 API, you can start JupyterLab with the --kernels-v3 flag:

jupyter lab --kernels-v3

Or if you prefer to set it in your config file:

# jupyter_server_config.py
c.ServerApp.kernels_api_version = 3

You can also set it directly via the trait:

jupyter lab --ServerApp.kernels_api_version=3

The default remains v2, so this is completely opt-in. The v3 classes get swapped in automatically when you enable the flag - you don't need to manually configure individual components.

What's next

I've tested this with basic kernel operations (start, execute, shutdown, restart) and it's working well with jupyter-server-documents. The v3 classes inherit from the same base classes as v2, so they should be compatible with existing extensions and customizations, but broader testing would be valuable. The immediate goal is reaching feature parity with v2 while gathering feedback, with an eye toward making this the default in a future major release.

Zsailer and others added 2 commits November 13, 2025 12:23
Introduces a next-generation kernel API (v3) that can be enabled via the
--kernels-v3 flag or by setting kernels_api_version=3 in config.

Key improvements:
- Shared kernel client per kernel: Single client instance shared across all
  websocket connections, reducing resource usage and improving consistency
- Pre-created clients with automatic lifecycle management: Clients connect
  on kernel start and disconnect on shutdown/restart
- Enhanced message routing: Channel and cell ID encoding in message IDs
  enables precise message delivery to originating cells
- Improved kernel monitoring: Better execution state tracking and heartbeat
  monitoring for both local and gateway kernels
- Backward compatible: Defaults to v2 API; v3 is opt-in

The v3 classes are swapped in automatically when enabled, requiring no
manual configuration of individual components.
@Zsailer Zsailer changed the title New Kernels API New Kernels API leveraging the kernel client abstraction for kernel comms Nov 13, 2025
@Zsailer Zsailer requested a review from vidartf November 13, 2025 20:52

@Zsailer Zsailer commented Nov 13, 2025

Pinging @rgbkrk @vidartf since we discussed this at JupyterCon 2025. @krassowski, since I mentioned it on the JupyterLab call.

I realize this is a ton of code! 😅 I don't know of a better way to do this, so I'm opening it here in its full form, and we can always split it into smaller PRs if necessary.

@rgbkrk rgbkrk left a comment

Since you made it a flagged change, it seems easy enough to just get this out there soon, since we don't have to compare and contrast with the old API. I love that this fixes multiple surface areas that affect consistency of server-side models.

Looking forward to reviewing more in depth shortly.

@Zsailer Zsailer commented Nov 13, 2025

server-side models.

This PR is coming soon after ;)

@3coins 3coins left a comment

@Zsailer
This looks great! I left some suggestions; none of them should block merging, as this is an opt-in API change.

channel_name = "shell"


class ControlChannel(AsyncZMQSocketChannel):

Should this extend from NamedAsyncZMQSocketChannel if encoding the channel name is desired?

channel_name = "control"


class StdinChannel(AsyncZMQSocketChannel):

Should this extend from NamedAsyncZMQSocketChannel if encoding the channel name is desired?

Comment on lines +1655 to +1664
gateway_enabled = getattr(self, "gateway_config", None) and getattr(
    self.gateway_config, "gateway_enabled", False
)
if gateway_enabled:
    return "jupyter_server.gateway.v3.managers.GatewayMultiKernelManager"
return "jupyter_server.services.kernels.v3.kernelmanager.AsyncMappingKernelManager"

gateway_enabled = getattr(self, "gateway_config", None) and getattr(
    self.gateway_config, "gateway_enabled", False
)
if gateway_enabled:

Would be cleaner to extract this into a property.

@property
def is_gateway_enabled(self) -> bool:
    """Check if gateway is configured and enabled."""
    return bool(
        getattr(self, "gateway_config", None)
        and getattr(self.gateway_config, "gateway_enabled", False)
    )

Comment on lines +153 to +154
# Process queued messages
asyncio.create_task(self._process_queued_messages())

Should we track these as background tasks for proper cleanup if the client shuts down?

task = asyncio.create_task(self._process_queued_messages())
# Keep a strong reference so the task is not garbage-collected and can
# be cancelled when the client shuts down.
if not hasattr(self, "_background_tasks"):
    self._background_tasks = []
self._background_tasks.append(task)

@davidbrochart davidbrochart commented Nov 14, 2025

The v3 API takes a different approach: each kernel gets a single, pre-created kernel client that's shared across all websocket connections. When a client connects via websocket, it registers as a listener on this shared client rather than creating its own ZMQ connections. The shared client handles all the actual kernel communication, and broadcasts messages to all registered listeners.

It seems that this is inspired by Jupyverse; I had opened an issue to make that change in Jupyter Server 4 years ago.

One of the more significant (and potentially controversial) aspects of this implementation is how we handle message routing. The v3 API encodes both the parent channel name and the source cell ID directly into Jupyter protocol message IDs using a structured format: {channel}:{base_msg_id}#{src_id}

How will that affect other Jupyter server implementations like Jupyverse? Does it require a JEP?

A major motivator for this work is enabling server-side document state management, which is what jupyter-server-documents provides. That extension gives you real-time collaboration by maintaining the notebook document state on the server rather than just in the browser. This fixes a long-standing bug where kernel execution state and cell outputs are lost when a notebook is closed during execution.

I would like to know how this differs from server-side execution, which Jupyverse already supports. Cell output recovery has been supported with server-side execution for 2 years.

@Zsailer Zsailer commented Nov 14, 2025

It seems that this is inspired by Jupyverse; I had opened an issue to make that change in Jupyter server #658.

Yes, though I know folks have discussed this approach long before Jupyverse existed. Unfortunately, no one ever took the plunge to change this in Jupyter Server because it's quite challenging here—many people depend on the existing manager paradigm for complex kernel lifecycle systems (e.g., remote kernels that communicate over a gateway), so implementing this change in jupyter_server requires more careful consideration of backwards compatibility.

I'm glad you pioneered this approach in Jupyverse; it's a great validation that the architecture works well. Jupyverse's ability to move quickly without legacy constraints is definitely an advantage.

The v3 API encodes both the parent channel name and the source cell ID directly into Jupyter protocol message IDs using a structured format: {channel}:{base_msg_id}#{src_id}

How will that affect other Jupyter server implementations like Jupyverse? Does it require a JEP?

This shouldn't impact other implementations at all. The message ID overloading is entirely internal to this implementation. It doesn't affect the protocol or the REST interface: no contracts are changed or affected. The routing and mapping are handled within this implementation and removed before messages "go out the door" back to the frontend. Because of this, a JEP should not be needed.
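
Concretely, the stripping step can be as small as this sketch (reusing the illustrative `decode_msg_id` helper from the PR description, which is an assumption, not the exact v3 API):

```python
def strip_msg_id(msg_id: str) -> str:
    """Restore the plain Jupyter msg_id before it leaves the server."""
    _channel, base_msg_id, _src_id = decode_msg_id(msg_id)
    return base_msg_id
```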

I would like to know how this differs from server-side execution, that Jupyverse already supports? Cell output recovery has been supported with server-side execution for 2 years.

I think it's worth asking this question over on the jupyter-server-documents repo, since we can speak more directly to the implementation differences there. One minor difference I'm aware of—server-side execution requires a small expansion of the REST API (a new endpoint), while this implementation works entirely within the existing APIs. Server-side execution is absolutely a valid approach—maybe even better in some ways. I don't have a strong opinion on the relative merits, but I found I could achieve the same goals without needing a new endpoint.

That said, ideally we can keep this review focused on solving the core issue of multiple ZMQ connections per kernel. I mentioned jupyter-server-documents mainly to provide additional motivation for why we want one ZMQ client per kernel, but the architectural benefits stand on their own.

Another major issue this PR addresses (and I'm sure you've solved in Jupyverse) is that kernel execution state conflates the kernel status on the control AND shell channels. The reason is that we don't know which channel triggered an IOPub (status) message. By prepending the channel in the message ID, we can resolve the channel in the server and ensure only shell-triggered IOPub messages update the kernel status.
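
In sketch form, again using the illustrative `decode_msg_id` helper from the PR description:

```python
def on_iopub_status(msg: dict, kernel_state: dict) -> None:
    """Update execution state only for shell-triggered status messages."""
    parent_id = msg["parent_header"]["msg_id"]
    channel, _base_msg_id, _src_id = decode_msg_id(parent_id)
    if channel == "shell":
        kernel_state["execution_state"] = msg["content"]["execution_state"]
    # Status messages parented to control-channel requests (interrupt,
    # shutdown, debug) no longer pollute the kernel's execution state.
```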
