19 changes: 18 additions & 1 deletion CHANGES.md
@@ -15,7 +15,24 @@
[Unreleased](https://github.com/bird-house/birdhouse-deploy/tree/master) (latest)
------------------------------------------------------------------------------------------------------------------

[//]: # (list changes here, using '-' for each new entry, remove this when items are added)
## Changes

- Set GPU access on JupyterLab containers based on Magpie user or group name

Extends the existing feature that lets resource limits for JupyterLab containers be assigned based on username or
group membership.

New settings for the `JUPYTERHUB_RESOURCE_LIMITS` variable are `gpu_ids` and `gpu_count`.

`gpu_ids` is an array of GPU UUIDs or zero-based indexes identifying the GPUs to make available to the user or group.
GPU UUIDs and indexes can be discovered by running the `nvidia-smi --list-gpus` command or a similar one
(such as `amd-smi list` for AMD GPUs). UUIDs are preferred since they remain stable across the life of the GPU.
Mixing indexes and UUIDs is possible but discouraged, since it makes it possible to select the same GPU multiple times.
If `gpu_count` is also specified, this is an integer indicating how many GPUs to make available to that user or group.
If `gpu_count` is not specified, then exactly one GPU will be randomly selected.
For example, with `{"gpu_ids": [1, 2, 6], "gpu_count": 2}`, two GPUs will be randomly selected from the `gpu_ids` list.

Also changes the format of `JUPYTERHUB_RESOURCE_LIMITS` to a YAML or JSON string.
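
As a quick illustration (not taken from the repository), the new value can be parsed with `yaml.safe_load`, which also accepts JSON since JSON is a subset of YAML; the user and group names below are made up:

```python
import yaml  # PyYAML, already imported by the JupyterHub config template

# Hypothetical JUPYTERHUB_RESOURCE_LIMITS value; names and limits are examples only.
raw = """
- {type: user,  name: user1,  limits: {mem_limit: 30G}}
- {type: group, name: group2, limits: {gpu_ids: [1, 2, 6], gpu_count: 2}}
"""

rules = yaml.safe_load(raw) or []  # an empty/unset variable yields an empty rule list
for rule in rules:
    print(rule["type"], rule["name"], rule["limits"])
```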

[2.20.1](https://github.com/bird-house/birdhouse-deploy/tree/2.20.1) (2025-12-16)
------------------------------------------------------------------------------------------------------------------
19 changes: 1 addition & 18 deletions birdhouse/components/jupyterhub/default.env
@@ -72,24 +72,7 @@ export JUPYTERHUB_AUTHENTICATOR_REFRESH_AGE=60
# Usernames that should be given admin access in jupyterhub
export JUPYTERHUB_ADMIN_USERS='{\"${MAGPIE_ADMIN_USERNAME}\"}' # python set syntax

# Resource limits for JupyterLab containers. Resource limits can be set per Magpie user or group.
# The value for this variable is a whitespace delimited string. Each section is delimited by colons (:)
# where the first element is either `group` or `user` and the second element is the name of the user or group
# to apply the limits to. The rest are resource limits of the form `limit=amount`. For example:
#
# export JUPYTERHUB_RESOURCE_LIMITS="
# user:user1:mem_limit=30G
# group:group1:mem_limit=10G:cpu_limit=1
# group:group2:cpu_limit=3
# "
#
# Supported limits are: `mem_limit` and `cpu_limit`. See the Jupyterhub Dockerspawner documentation
# for details and supported values.
# Note that this will not create the groups in Magpie, that must be done manually.
# Note that if a user belongs to multiple groups, later values in `JUPYTERHUB_RESOURCE_LIMITS` will take
# precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
# - mem_limit=10G (because group1 is later in the list)
# - cpu_limit=3 (because group2 is later in the list)
# See description in env.local.example for details
export JUPYTERHUB_RESOURCE_LIMITS=
Collaborator

@mishaschwartz

Thinking about this last night, can you just restore the default empty value here?

All default values should be in a corresponding default.env, like all other vars, to not confuse users and to maintain consistency. We should not expect the user to search the code for that default value.

The commented-out value in env.local.example is just an example; it should not be considered a default value.

If you want to avoid duplicating all the documentation for that var in default.env, you can simply refer the user to env.local.example.

Collaborator Author

Sure, we really need to write down some of these policies somewhere

Collaborator

> Sure, we really need to write down some of these policies somewhere

Absolutely agree, added #620 before I forget.


export DELAYED_EVAL="
43 changes: 34 additions & 9 deletions birdhouse/components/jupyterhub/jupyterhub_config.py.template
@@ -1,8 +1,11 @@
import os
from os.path import join
import logging
import random
import subprocess

import docker
import yaml
from dockerspawner import DockerSpawner

c = get_config()  # noqa # can be called directly without import because injected by IPython
@@ -137,10 +140,18 @@ if os.environ['WORKSPACE_DIR'] != jupyterhub_data_dir:
container_gdrive_settings_path = join(container_home_dir, ".jupyter/lab/user-settings/@jupyterlab/google-drive/drive.jupyterlab-settings")
host_gdrive_settings_path = os.environ['JUPYTER_GOOGLE_DRIVE_SETTINGS']

# resource_limits: dict[tuple[Literal["user", "group"], str], dict[Literal["cpu_limit", "mem_limit"], str]]
resource_limits = {tuple(lim[:2]): dict(li.split("=") for li in lim[2:] if "=" in li)
                   for limit in """${JUPYTERHUB_RESOURCE_LIMITS}""".strip().split()
                   if (lim := limit.split(":"))}
# class LimitDict(TypedDict):
#     mem_limit: NotRequired[str | int]
#     cpu_limit: NotRequired[str | float | int]
#     gpu_ids: NotRequired[list[int | str]]
#     gpu_count: NotRequired[int]
#
# class LimitRule(TypedDict):
#     type: Literal["user", "group"]
#     name: str
#     limits: LimitDict
# resource_limits: list[LimitRule]
resource_limits = yaml.safe_load("""${JUPYTERHUB_RESOURCE_LIMITS}""") or []

if len(host_gdrive_settings_path) > 0:
    c.DockerSpawner.volumes[host_gdrive_settings_path] = {
@@ -183,14 +194,28 @@ def limit_resource_hook(spawner):
        spawner.mem_limit = os.environ['JUPYTER_DEMO_USER_MEM_LIMIT']

    user_groups = {g.name for g in spawner.user.groups}
    for (name_type, name), limits in resource_limits.items():
        if (name_type == "user" and name == spawner.user.name) or (name_type == "group" and name in user_groups):
            for limit, value in limits.items():
    gpu_ids = []
    gpu_count = 1
    for rule in resource_limits:
        rule_type = rule["type"]
        name = rule["name"]
        if rule_type == "user" and name == spawner.user.name or rule_type == "group" and name in user_groups:
            for limit, value in rule["limits"].items():
                if limit == "cpu_limit":
                    spawner.cpu_limit = float(value)
                    spawner.cpu_limit = value
                elif limit == "mem_limit":
                    spawner.mem_limit = value

                elif limit == "gpu_ids":
                    gpu_ids = value
                elif limit == "gpu_count":
                    gpu_count = value
    if gpu_ids:
        # randomly assign GPUs in an attempt to evenly distribute GPU resources
        random.shuffle(gpu_ids)
        gpu_ids = gpu_ids[:gpu_count]
        spawner.extra_host_config["device_requests"] = [
            docker.types.DeviceRequest(device_ids=gpu_ids, capabilities=[["gpu"]])
Collaborator

Oh! So if we forget to specify gpu_count, all gpu_ids are given to the user! I guess we'd better not forget gpu_count for group definitions then!

Please document this default behavior.

Collaborator

Now the reverse.

With the new default gpu_count = 1, even for users with gpu_ids=0,2,3, if we do not set gpu_count=3, the user will only get 1 GPU?

So we have to remember to set gpu_count for a user definition if we give that user more than one gpu_id?

We should either add this default behavior to the documentation or keep the default of all GPUs for the user case.

Collaborator Author

> With the new default gpu_count = 1, even for users with gpu_ids=0,2,3, if we do not set gpu_count=3, the user will only get 1 GPU?
> So we have to remember to set gpu_count for a user definition if we give that user more than one gpu_id?

correct

> We should either add this default behavior to the documentation or keep the default of all GPUs for the user case.

It's extra confusing if the default behaviour is different for users and groups; we should be consistent.
It is documented. See:

# If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
# If gpu_count is not specified, then exactly one GPU will be randomly selected.

Collaborator

> It's extra confusing if the default behaviour is different for users and groups; we should be consistent.

Agreed, keep the same default behavior for consistency. I just find it more natural that if we give multiple gpu_ids to a user definition, we intend for the user to have all of them. Now we also have to remember to give gpu_count to a user definition.

But it's fine. Keep it that way. There is no perfect solution.

Member

> I just find it more natural that if we give multiple gpu_ids to a user definition, we intend for the user to have all of them.

This is because you are thinking in terms of "allocation", but GPUs are usually configured in terms of "availability", because it is very expensive to assign these resources and have them sitting there locked and unused by a single user.

Typically, GPU requests are conservative (if any are provided at all by default), and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case.

If we were adding a $ tag to these GPU invocations, you can be sure users would be unhappy that they got over-allocated unrequested resources.

Collaborator Author

> and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case.

Yes, I'd eventually like to make use of an options form where users can request up to a certain amount of resources instead of just automatically giving them the maximum they're allowed according to these rules.

We could also free up resources early by setting limits on how long a user can keep a resource (e.g. user X is allowed to request 3 GPUs, but we'll kill their container after 2 hours). Think of this as similar to "salloc -p archiveshort" on SciNet to get synchronous access to one of the compute nodes for a short period of time.

I've got lots of ideas for how to extend this and try to make it "fair" to users who are all sharing resources. My main goal is to give the node administrator the freedom to set the resources however they want. BUT we should provide documentation that gives good advice and a reasonable starting configuration.

For this PR, the goal is simply to incorporate GPUs into the JupyterHub resource allocation mechanism.

        ]

def pre_spawn_hook(spawner):
    create_dir_hook(spawner)
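To make the `gpu_count` discussion above concrete, here is a small standalone sketch (not code from this PR) of the shuffle-then-slice selection the hook performs; the GPU indexes are made up, and the device IDs are cast to strings because the Docker API represents them as strings:

```python
import random

import docker


def select_gpus(gpu_ids, gpu_count=1):
    """Pick gpu_count GPUs at random from gpu_ids (default: exactly one)."""
    chosen = list(gpu_ids)
    random.shuffle(chosen)  # spread load across the listed GPUs
    return chosen[:gpu_count]


# Made-up indexes; with the default gpu_count=1 only one of them is exposed.
selected = select_gpus([0, 2, 3])
device_request = docker.types.DeviceRequest(
    device_ids=[str(i) for i in selected],  # Docker expects string device IDs
    capabilities=[["gpu"]],
)
print(selected, device_request)
```

This mirrors the behaviour debated in the thread: with `gpu_ids` set to `[0, 2, 3]` and no `gpu_count`, exactly one GPU is selected.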
47 changes: 47 additions & 0 deletions birdhouse/components/jupyterhub/resource-limit.schema.json
@@ -0,0 +1,47 @@
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "type": {
        "type": "string",
        "enum": [
          "user",
          "group"
        ]
      },
      "name": {
        "type": "string",
        "pattern": "^.+$"
      },
      "limits": {
        "type": "object",
        "properties": {
          "mem_limit": {
            "type": "string"
          },
          "cpu_limit": {
            "type": "number"
          },
          "gpu_ids": {
            "type": "array"
          },
          "gpu_count": {
            "type": "number"
          }
        },
        "dependentRequired": {
          "gpu_count": [
            "gpu_ids"
          ]
        },
        "additionalProperties": false
      }
    },
    "required": [
      "type",
      "name",
      "limits"
    ]
  }
}
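
The PR does not show how this schema is consumed; as an illustrative check only (assuming the `jsonschema` package is available and using the file path below), a configuration could be validated against it like this:

```python
import json

import jsonschema  # assumed to be available; not a dependency shown in this diff
import yaml

# Hypothetical path and value, for illustration only.
with open("birdhouse/components/jupyterhub/resource-limit.schema.json") as f:
    schema = json.load(f)

config = yaml.safe_load("""
- {type: group, name: group2, limits: {gpu_ids: [0, 3, 4], gpu_count: 2}}
""")

# Raises jsonschema.ValidationError if the structure does not match the schema.
jsonschema.validate(instance=config, schema=schema)
```

Note that `dependentRequired` (gpu_count requires gpu_ids) is a Draft 2019-09 keyword, so a validator supporting that draft or newer is needed.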
35 changes: 23 additions & 12 deletions birdhouse/env.local.example
@@ -381,25 +381,36 @@ export GEOSERVER_ADMIN_PASSWORD="${__DEFAULT__GEOSERVER_ADMIN_PASSWORD}"
#export JUPYTERHUB_AUTHENTICATOR_REFRESH_AGE=60

# Resource limits for JupyterLab containers. Resource limits can be set per Magpie user or group.
# The value for this variable is a whitespace delimited string. Each section is delimited by colons (:)
# where the first element is either `group` or `user` and the second element is the name of the user or group
# to apply the limits to. The rest are resource limits of the form `limit=amount`. For example:
#
# export JUPYTERHUB_RESOURCE_LIMITS="
# user:user1:mem_limit=30G
# group:group1:mem_limit=10G:cpu_limit=1
# group:group2:cpu_limit=3
# "
#
# Supported limits are: `mem_limit` and `cpu_limit`. See the Jupyterhub Dockerspawner documentation
# for details and supported values.
# The value for this variable is a YAML or JSON array of mappings with the following keys: "type" (either "user"
# or "group"), "name" (the name of the group or user to apply the limits to), and "limits" (see below). For example:
# export JUPYTERHUB_RESOURCE_LIMITS='
# [
# {"type": "user", "name": "user1", "limits": {"mem_limit": "30G"}},
# {"type": "group", "name": "group1", "limits": {"mem_limit": "10G", "cpu_limit": 1}},
# {"type": "group", "name": "group2", "limits": {"cpu_limit": 3, "gpu_ids": [0, 3, 4]}},
# {"type": "user", "name": "user2", "limits": {"gpu_ids": [1, 2, 3], "gpu_count": 2}}
# ]
#'
# Supported limits are: "mem_limit", "cpu_limit", "gpu_count", "gpu_ids".
# For a JSON schema describing the structure of this JSON array see
# birdhouse/components/jupyterhub/resource-limit.schema.json
# See the Jupyterhub Dockerspawner documentation
# for details and supported values for mem_limit and cpu_limit.
# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.cpu_limit
# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.mem_limit
# gpu_ids is an array of GPU UUIDs or zero-based indexes identifying the GPUs to make available
# to the user or group. GPU UUIDs and indexes can be discovered by running the `nvidia-smi --list-gpus` command or a similar
# one (such as `amd-smi list` for AMD GPUs). UUIDs are preferred since they remain stable across the life of the GPU.
# Mixing indexes and UUIDs is possible but discouraged, since it makes it possible to select the same GPU multiple times.
# If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
# If gpu_count is not specified, then exactly one GPU will be randomly selected.
# For example, with {"gpu_ids": [1, 2, 6], "gpu_count": 2}, two GPUs will be randomly selected from the gpu_ids list.
# Note that this will not create the groups in Magpie; that must be done manually.
# Note that if a user belongs to multiple groups, later values in `JUPYTERHUB_RESOURCE_LIMITS` will take
# precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
# - mem_limit=10G (because group1 is later in the list)
# - cpu_limit=3 (because group2 is later in the list)
# - gpu_ids=0,3,4
#export JUPYTERHUB_RESOURCE_LIMITS=

# Allow for adding new config or override existing config in
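As a rough illustration of the precedence rule described above (not code from the repository), later matching entries simply overwrite earlier ones because the rules are applied in order; the group names and memberships are made up:

```python
import yaml

rules = yaml.safe_load("""
- {type: group, name: group1, limits: {mem_limit: 10G, cpu_limit: 1}}
- {type: group, name: group2, limits: {cpu_limit: 3, gpu_ids: [0, 3, 4]}}
""")

user_groups = {"group1", "group2"}  # hypothetical memberships for "user1"
effective = {}
for rule in rules:  # applied in order, so later rules win on conflicting keys
    if rule["type"] == "group" and rule["name"] in user_groups:
        effective.update(rule["limits"])

print(effective)  # {'mem_limit': '10G', 'cpu_limit': 3, 'gpu_ids': [0, 3, 4]}
```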