16 changes: 15 additions & 1 deletion CHANGES.md
@@ -15,7 +15,21 @@
[Unreleased](https://github.com/bird-house/birdhouse-deploy/tree/master) (latest)
------------------------------------------------------------------------------------------------------------------

[//]: # (list changes here, using '-' for each new entry, remove this when items are added)
## Changes

- Set GPU access on Jupyterlab containers based on Magpie user or group name

  Extends the feature that lets resource allocations for Jupyterlab containers be assigned based on username or
  group membership.

  New settings for the `JUPYTERHUB_RESOURCE_LIMITS` variable are `gpu_ids` and `gpu_count`.
  `gpu_ids` is a comma-separated list of the GPU ids available on the host that you want to make available to
  the user or group. GPU ids can typically be discovered by running the `nvidia-smi` command.
  If `gpu_count` is also specified, it is an integer indicating how many of those GPUs to make available to that
  user or group.

For example, if `gpu_ids=gpu1,gpu2,gpu6` and `gpu_count=2` then two GPUs will be randomly selected from the
`gpu_ids` list.
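
  As a rough illustration, the selection step amounts to the following Python sketch (values taken from the
  example above; names are illustrative only):

  ```python
  import random

  gpu_ids = ["gpu1", "gpu2", "gpu6"]  # the configured gpu_ids
  gpu_count = 2                       # the configured gpu_count

  # shuffle, then keep the first gpu_count entries, so the chosen GPUs vary from spawn to spawn
  random.shuffle(gpu_ids)
  selected = gpu_ids[:gpu_count]
  print(selected)  # e.g. ['gpu6', 'gpu1']
  ```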

[2.19.0](https://github.com/bird-house/birdhouse-deploy/tree/2.19.0) (2025-12-05)
------------------------------------------------------------------------------------------------------------------
14 changes: 11 additions & 3 deletions birdhouse/components/jupyterhub/default.env
@@ -80,16 +80,24 @@ export JUPYTERHUB_ADMIN_USERS='{\"${MAGPIE_ADMIN_USERNAME}\"}' # python set syn
# export JUPYTERHUB_RESOURCE_LIMITS="
# user:user1:mem_limit=30G
# group:group1:mem_limit=10G:cpu_limit=1
# group:group2:cpu_limit=3
# group:group2:cpu_limit=3:gpu_ids=gpu1,gpu2,gpu3
# user:user2:gpu_ids=gpu1,gpu2,gpu3:gpu_count=2
# "
#
# Supported limits are: `mem_limit` and `cpu_limit`. See the Jupyterhub Dockerspawner documentation
# for details and supported values.
# Supported limits are: `mem_limit`, `cpu_limit`, `gpu_count`, `gpu_ids`. See the Jupyterhub Dockerspawner documentation
# for details and supported values for mem_limit and cpu_limit.
# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.cpu_limit
# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.mem_limit
# gpu_ids is a comma-separated list of the GPU ids available on the host that you want to make available to
# the user or group. GPU ids can typically be discovered by running the `nvidia-smi` command.
# If gpu_count is also specified, it is an integer indicating how many of those GPUs to make available to that user or group.
# For example, if gpu_ids=gpu1,gpu2,gpu6 and gpu_count=2 then two GPUs will be randomly selected from the gpu_ids list.
Member

If omitted, is it default 1 or "all"?

I personally think 1 would be safer for fair/shared-use and avoid over-allocating, but the default should be indicated either way.

Collaborator (@tlvu, Dec 11, 2025)

If this is on a group and the number of users in the group exceeds the number of GPUs in gpu_ids, what happens if all the users of the group log in to Jupyter?!

If, by mistake when writing the JUPYTERHUB_RESOURCE_LIMITS block, we give exactly the same gpu_ids to 2 users, what happens if both users log in at the same time? This case will happen with the current code if we forget gpu_count when defining a group and the group has more than 1 user.

Collaborator Author

> If omitted, is it default 1 or "all"?

Currently it's "all".

> what happens if all the users of the group log in to Jupyter?!
> what happens if both users log in at the same time?

See #594 (comment).

Users have to share in the same way they have to share memory. If we want to get smart about this and create a system where users will never be able to affect others by over-allocating resources, we can do that.

But that's a much more complex configuration that I'm still working out the details of.

Member (@fmigneault, Dec 11, 2025)

> If this is on a group

I am so used to thinking in terms of a single user requesting the GPU for their job that I didn't consider the group allocation, which realistically should be more than 1 if possible to avoid a big user queue over a single resource.

So it seems a reasonable default should be user/group-based? 1 if users, "all" if group.
Would that seem more confusing?

If so, "all" could remain valid for both (edited following the comment below: probably better to have "1" everywhere...).

I think the use-case of a "user reserving everything and leaving none for others" should be strongly warned against in the doc, so that the maintainer considers explicit gpu_count values for user-based allocations.

Member

Following #616 (comment), I also see that group allocations could easily be misinterpreted if gpu_count is omitted. The same problem would happen, with a single user holding all resources offered to the group.

For group-based allocations, I think the typical use-case would more often be to provide many GPUs to meet demand, but still distribute/limit them across their users.

# Note that this will not create the groups in Magpie, that must be done manually.
# Note that if a user belongs to multiple groups, later values in `JUPYTERHUB_RESOURCE_LIMITS` will take
# precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
# - mem_limit=10G (because group1 is later in the list)
# - cpu_limit=3 (because group2 is later in the list)
# - gpu_ids=gpu1,gpu2,gpu3
export JUPYTERHUB_RESOURCE_LIMITS=
Collaborator

@mishaschwartz

Thinking about this last night, can you just restore the default empty value here?

All default values should be in a corresponding default.env, like all other vars, to not confuse users and to maintain consistency. We should not expect the user to search the code for that default value.

The commented-out value in env.local.example is just an example; it should not be considered a default value.

If you want to avoid duplicating all the documentation for that var in default.env, you can simply refer the user to env.local.example.

Collaborator Author

Sure, we really need to write down some of these policies somewhere

Collaborator

> Sure, we really need to write down some of these policies somewhere

Absolutely agree, added #620 before I forget.


export DELAYED_EVAL="
19 changes: 17 additions & 2 deletions birdhouse/components/jupyterhub/jupyterhub_config.py.template
@@ -1,8 +1,10 @@
import os
from os.path import join
import logging
import random
import subprocess

import docker
from dockerspawner import DockerSpawner

c = get_config() # noqa # can be called directly without import because injected by IPython
@@ -137,7 +139,7 @@ if os.environ['WORKSPACE_DIR'] != jupyterhub_data_dir:
container_gdrive_settings_path = join(container_home_dir, ".jupyter/lab/user-settings/@jupyterlab/google-drive/drive.jupyterlab-settings")
host_gdrive_settings_path = os.environ['JUPYTER_GOOGLE_DRIVE_SETTINGS']

# resource_limits: dict[tuple[Literal["user", "group"], str], dict[Literal["cpu_limit", "mem_limit"], str]]
# resource_limits: dict[tuple[Literal["user", "group"], str], dict[Literal["cpu_limit", "mem_limit", "gpu_count", "gpu_ids"], str]]
resource_limits = {tuple(lim[:2]): dict(li.split("=") for li in lim[2:] if "=" in li)
for limit in """${JUPYTERHUB_RESOURCE_LIMITS}""".strip().split()
if (lim := limit.split(":"))}
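# e.g., with the commented example value from default.env, this produces:
#   {("user", "user1"): {"mem_limit": "30G"},
#    ("group", "group1"): {"mem_limit": "10G", "cpu_limit": "1"},
#    ("group", "group2"): {"cpu_limit": "3", "gpu_ids": "gpu1,gpu2,gpu3"},
#    ("user", "user2"): {"gpu_ids": "gpu1,gpu2,gpu3", "gpu_count": "2"}}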
@@ -183,14 +185,27 @@ def limit_resource_hook(spawner):
spawner.mem_limit = os.environ['JUPYTER_DEMO_USER_MEM_LIMIT']

user_groups = {g.name for g in spawner.user.groups}
gpu_ids = []
gpu_count = None
for (name_type, name), limits in resource_limits.items():
if (name_type == "user" and name == spawner.user.name) or (name_type == "group" and name in user_groups):
for limit, value in limits.items():
if limit == "cpu_limit":
spawner.cpu_limit = float(value)
elif limit == "mem_limit":
spawner.mem_limit = value

elif limit == "gpu_ids":
gpu_ids = value.split(",")
elif limit == "gpu_count":
gpu_count = int(value)
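# note: if gpu_count is not specified, every GPU listed in gpu_ids is made available below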
if gpu_ids:
if gpu_count is not None:
# randomly assign GPUs in an attempt to evenly distribute GPU resources
random.shuffle(gpu_ids)
gpu_ids = gpu_ids[:gpu_count]
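# pass the selected devices to the spawned container via a Docker GPU device request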
spawner.extra_host_config["device_requests"] = [
docker.types.DeviceRequest(device_ids=gpu_ids, capabilities=[["gpu"]])
Collaborator

Oh! So if we forget to specify gpu_count, all gpu_ids are given to the user! I guess we had better not forget gpu_count for group definitions then!

Please document this default behavior.

Collaborator

Now the reverse.

With the new default gpu_count = 1, even for users with gpu_ids=0,2,3, if we do not set gpu_count=3, the user will only have 1 GPU?

So we have to remember to set gpu_count for a user definition if we give that user more than one gpu_id?

We should add this default behavior to the documentation, or keep the default of "all" for the user case.

Collaborator Author

> With the new default gpu_count = 1, even for users with gpu_ids=0,2,3, if we do not set gpu_count=3, the user will only have 1 GPU?
> So we have to remember to set gpu_count for a user definition if we give that user more than one gpu_id?

Correct.

> We should add this default behavior to the documentation, or keep the default of "all" for the user case.

It's extra confusing if the default behaviour is different for users and groups; we should be consistent.
It is documented. See:

# If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
# If gpu_count is not specified, then exactly one GPU will be randomly selected.

Collaborator

> It's extra confusing if the default behaviour is different for users and groups.

Agreed, keep the same default behavior for consistency. I just find it more natural that if we give multiple gpu_ids to a user definition, we intend for the user to have all of them. Now we also have to remember to give gpu_count to a user definition.

But it's fine. Keep it that way. There is no perfect solution.

Member

> I just find it more natural that if we give multiple gpu_ids to a user definition, we intend for the user to have all of them.

This is because you are thinking in terms of "allocation", but GPUs are usually configured in terms of "availability", because it is very expensive to assign these resources and have them sitting there locked and unused by a single user.

Typically, GPU requests are conservative (if any provided at all by default), and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case.

If we were adding a $ tag to these GPU invocations, you can be sure users would be unhappy that they got over-allocated unrequested resources.

Collaborator Author

> and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case.

Yes I'd eventually like to make use of an options form where users can request up to a certain amount of resources instead of just automatically giving them the maximum they're allowed according to these rules.

We could also free up resources early by setting limits on how long a user can keep a resource (i.e. user X is allowed to request 3 GPUs but we'll kill their container after 2 hours). Think of this as similar to "salloc -p archiveshort" on scinet to get synchronous access to one of the compute nodes for a short period of time.

I've got lots of ideas for how to extend this and try to make it "fair" to users who are all sharing resources. My main goal is to give the node administrator the freedom to set the resources however they want. BUT we should provide documentation that gives good advice and a reasonable starting configuration.

For this PR, the goal is simply to incorporate GPUs into the jupyterhub resource allocation mechanism

]

def pre_spawn_hook(spawner):
create_dir_hook(spawner)
12 changes: 9 additions & 3 deletions birdhouse/env.local.example
@@ -388,18 +388,24 @@ export GEOSERVER_ADMIN_PASSWORD="${__DEFAULT__GEOSERVER_ADMIN_PASSWORD}"
# export JUPYTERHUB_RESOURCE_LIMITS="
# user:user1:mem_limit=30G
# group:group1:mem_limit=10G:cpu_limit=1
# group:group2:cpu_limit=3
# group:group2:cpu_limit=3:gpu_ids=gpu1,gpu2,gpu3
# user:user2:gpu_ids=gpu1,gpu2,gpu3:gpu_count=2
# "
#
# Supported limits are: `mem_limit` and `cpu_limit`. See the Jupyterhub Dockerspawner documentation
# for details and supported values.
# Supported limits are: `mem_limit`, `cpu_limit`, `gpu_count`, `gpu_ids`. See the Jupyterhub Dockerspawner documentation
# for details and supported values for mem_limit and cpu_limit.
# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.cpu_limit
# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.mem_limit
# gpu_ids is a comma-separated list of the GPU ids available on the host that you want to make available to
# the user or group. GPU ids can typically be discovered by running the `nvidia-smi` command.
# If gpu_count is also specified, it is an integer indicating how many of those GPUs to make available to that user or group.
# For example, if gpu_ids=gpu1,gpu2,gpu6 and gpu_count=2 then two GPUs will be randomly selected from the gpu_ids list.
# Note that this will not create the groups in Magpie, that must be done manually.
# Note that if a user belongs to multiple groups, later values in `JUPYTERHUB_RESOURCE_LIMITS` will take
# precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
# - mem_limit=10G (because group1 is later in the list)
# - cpu_limit=3 (because group2 is later in the list)
# - gpu_ids=gpu1,gpu2,gpu3
#export JUPYTERHUB_RESOURCE_LIMITS=

# Allow for adding new config or override existing config in