4 changes: 3 additions & 1 deletion CHANGES.md
@@ -15,7 +15,9 @@
[Unreleased](https://github.com/bird-house/birdhouse-deploy/tree/master) (latest)
------------------------------------------------------------------------------------------------------------------

[//]: # (list changes here, using '-' for each new entry, remove this when items are added)
## Documentation

- Document how to allow Jupyterlab containers to access GPUs on the host machine

[2.18.4](https://github.com/bird-house/birdhouse-deploy/tree/2.18.4) (2025-10-01)
------------------------------------------------------------------------------------------------------------------
39 changes: 39 additions & 0 deletions birdhouse/components/README.rst
@@ -821,6 +821,45 @@ Usage
The service is available at ``${BIRDHOUSE_PROXY_SCHEME}://${BIRDHOUSE_FQDN_PUBLIC}/jupyter``. Users are able to log in to Jupyterhub using the
same user name and password as Magpie. They will then be able to launch a personal jupyterlab server.

GPU support for jupyterlab containers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the host machine has GPUs and you want to make them available to the docker containers running Jupyterlab:

1. ensure that the GPU drivers on the host machine are up to date
2. install the `NVIDIA container toolkit`_ package on the host machine
Member:
I assume enabling it like this makes it available to all containers on the server.

It doesn't have to be mentioned here, but just pointing it out FYI, it would be relevant for weaver-worker as well to run GPU jobs. I'm just not sure if the device syntax is the same in docker-compose since it's been 2-3 years since I've checked this.

If it does indeed work like this, maybe a note about all server resources sharing the GPUs could be relevant. They would not (necessarily) be dedicated to jupyter kernels.

Collaborator Author (@mishaschwartz, Oct 3, 2025):
> I assume enabling it like this makes it available to all containers on the server.

Yes

> It doesn't have to be mentioned here, but just pointing it out FYI, it would be relevant for weaver-worker as well to run GPU jobs.

Yes, I definitely want to figure out how to make this work with weaver as well. Since the weaver-worker container is not dynamically created, I think we can just add this directly to the weaver-worker definition in the relevant docker-compose-extra.yml file. But that's something I'll have to figure out/work on next.

3. `configure docker`_ to use the NVIDIA container runtime
4. restart docker for the changes to take effect
5. add the following to the ``JUPYTERHUB_CONFIG_OVERRIDE`` variable in your local environment file:

.. code-block:: python

# enable GPU support
import docker

c.DockerSpawner.extra_host_config["device_requests"] = [
docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
]
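
Before restarting JupyterHub, you may want to check from the host that docker can actually expose the GPUs with this kind of device request. The snippet below is a minimal sketch (not part of the deployment) that assumes the ``docker`` Python SDK is installed on the host; the CUDA image name is only an example and any GPU-enabled image will do:

.. code-block:: python

    # Run nvidia-smi in a throwaway container, using the same kind of
    # device request that DockerSpawner will pass to docker.
    import docker

    client = docker.from_env()
    output = client.containers.run(
        "nvidia/cuda:12.4.1-base-ubuntu22.04",  # example image only
        "nvidia-smi",
        device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
        remove=True,
    )
    print(output.decode())

If this prints the usual ``nvidia-smi`` table, the host-side setup from steps 1-4 is working.
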
Comment on lines +840 to +842
Member:

Does that mean that every single user notebook automatically gets assigned a GPU?
If so, won't that create a bottleneck very quickly?

Is there a way to have both GPU/CPU-only simultaneously (adding more DeviceRequest variants to the list?), and have it selected somehow by the user when starting the kernel?
Maybe even more specific GPU definitions, like providing the ones with just 8GB VRAM vs others with 48GB separately?

I would be interested in that specific multi-config example, and how users interact with it to request (or us limiting them) appropriate resources.

Collaborator Author:

> Does that mean that every single user notebook automatically gets assigned a GPU?

No, it means that every container has access to all GPUs on the host. This PR doesn't introduce any solutions for allocating different GPU resources to different users. That is a much more complex thing that I'll have to try to figure out at a later date (because I don't really understand it yet).

> Is there a way to have both GPU/CPU-only simultaneously (adding more DeviceRequest variants to the list?), and have it selected somehow by the user when starting the kernel?
> Maybe even more specific GPU definitions, like providing the ones with just 8GB VRAM vs others with 48GB separately?

I think so; it would require a pretty good understanding of the nvidia toolkit and docker settings. I'm still reading about it and I can continue to update these docs as we figure out different possible configurations.

Member:

I wonder if something like https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.SwarmSpawner.group_overrides could be used to dynamically apply the GPU request on specific users/conditions, therefore allowing having GPU or CPU-only setup.

Documentation is very sparse, so definitely very hard to figure out 😅
All promising features nonetheless. Thanks for looking into them.



This will allow the docker containers to access all GPUs on the host machine. To limit the number of GPUs you want to make available,
Collaborator:

So all Jupyter users will have access to the GPU? And if they happen to all use the GPU at the same time, will they step on each other's feet?

Collaborator Author:

Yup, they will definitely step on each other's feet, just the same as if a user hogs any other resource (CPU, memory, etc.)

We definitely need a better way to manage resource over-use but the problem isn't specific to GPUs.

you can change the ``count`` value to a positive integer, or you can specify the ``device_ids`` key instead. ``device_ids`` takes a list
Comment on lines +845 to +846
Member:

count=-1: That was my impression even before reading this part.

Requesting all available GPUs would essentially lock out any second user trying to use a kernel.

The example should probably use count=1 instead and make a stronger warning about this situation, to let server maintainers know not to raise the value too much unless they have a really big GPU cluster.

Collaborator Author:

Let me clarify... this gives each container access to all GPUs on the host, and they share them as a resource, the same way they're sharing access to the CPUs, memory, etc.
This will not stop a user from starting up their container.

Note that if user A proceeds to max out some of the GPUs, then user B can't use them until they are done, but managing that goes beyond the scope of this documentation so far.

Member:

Unless you have big GPUs like A100 to actually provide virtual-GPU VRAM, it won't take that much to cause all users to crash their respective processes from OutOfMemoryError.

I'm nowhere near an expert on the matter, but I know that our clusters leverage some vGPUs to allow some kind of splitting this way. I don't know if that would play nice with multiple dockers trying to access the same GPU. Doesn't it do some kind of lock/reservation when assigned to a particular kernel?

Collaborator Author:

I don't think that you need a vGPU setup for docker. Since there is no true VM (no hypervisor) with docker, the containers can just access the GPU directly.

One thing you could do is use MIG (Multi-Instance GPU) or MPS (Multi-Process Service) to split up resources but not all GPUs support these (none of ours do unfortunately).

> Doesn't it do some kind of lock/reservation when assigned to a particular kernel?

As far as I know it manages context-switching the same way a CPU would that was running multiple processes/threads. So everything gets slowed down and/or on-GPU memory gets filled up but it won't necessarily just immediately fail from a user perspective.

of device ID strings identifying the devices (GPUs) that you want to enable. Device IDs for each GPU can be inspected by running the ``nvidia-smi``
command on the host machine.

The `driver capabilities`_ setting indicates that this device request is for GPUs (as opposed to other devices that may be
available such as TPUs).

For example, to make only the GPUs with IDs 1 and 4 available, you would set:

.. code-block:: python

    docker.types.DeviceRequest(device_ids=["1", "4"], capabilities=[["gpu"]])
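
Similarly, if you prefer the ``count`` approach mentioned above, a sketch limiting each container to a single GPU would be:

.. code-block:: python

    docker.types.DeviceRequest(count=1, capabilities=[["gpu"]])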

.. _NVIDIA container toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
.. _configure docker: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker
.. _driver capabilities: https://docs.docker.com/reference/compose-file/deploy/#capabilities

How to Enable the Component
---------------------------
