Set GPU access on Jupyterlab containers based on Magpie user or group name #616
base: master
Changes from 3 commits
```diff
@@ -73,24 +73,35 @@ export JUPYTERHUB_AUTHENTICATOR_REFRESH_AGE=60
 export JUPYTERHUB_ADMIN_USERS='{\"${MAGPIE_ADMIN_USERNAME}\"}' # python set syntax

 # Resource limits for JupyterLab containers. Resource limits can be set per Magpie user or group.
-# The value for this variable is a whitespace delimited string. Each section is delimited by colons (:)
-# where the first element is either `group` or `user` and the second element is the name of the user or group
-# to apply the limits to. The rest are resource limits of the form `limit=amount`. For example:
-#
-# export JUPYTERHUB_RESOURCE_LIMITS="
-#   user:user1:mem_limit=30G
-#   group:group1:mem_limit=10G:cpu_limit=1
-#   group:group2:cpu_limit=3
-# "
-#
-# Supported limits are: `mem_limit` and `cpu_limit`. See the Jupyterhub Dockerspawner documentation
-# for details and supported values.
-# Note that this will not create the groups in Magpie, that must be done manually.
-# Note that if a user belongs to multiple groups, later values in `JUPYTERHUB_RESOURCE_LIMITS` will take
-# precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
+# The value for this variable is a yaml or JSON array of mappings with the following keys: "type" (either "user"
+# or "group"), "name" (the name of the group or user to apply the limits to) and "limits" (see below). For example:
+# export JUPYTERHUB_RESOURCE_LIMITS='
+# [
+#   {"type": "user", "name": "user1", "limits": {"mem_limit": "30G"}},
+#   {"type": "group", "name": "group1", "limits": {"mem_limit": "10G", "cpu_limit": 1}},
+#   {"type": "group", "name": "group2", "limits": {"cpu_limit": 3, "gpu_ids": [0, 3, 4]}},
+#   {"type": "user", "name": "user2", "limits": {"gpu_ids": [1, 2, 3], "gpu_count": 2}}
+# ]
+# '
+# Supported limits are: "mem_limit", "cpu_limit", "gpu_count", "gpu_ids".
+# For a JSON schema describing the structure of this JSON array see
+# birdhouse/components/jupyterhub/resource-limit.schema.json
+# See the Jupyterhub Dockerspawner documentation
+# for details and supported values for mem_limit and cpu_limit.
+# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.cpu_limit
+# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.mem_limit
+# gpu_ids are an array of the GPU uuids or zero based indexes of the GPUs that you want to make available
+# to the user or group. GPU uuids and indexes can be discovered by running the `nvidia-smi --list-gpus` command.
+# If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
+# If gpu_count is not specified, then exactly one GPU will be randomly selected.
+# For example, if {"gpu_ids": [1,2,6], "gpu_count": 2} then two GPUs will be randomly selected from the gpu_ids list.
+# Note that this will not create the groups in Magpie, that must be done manually.
+# Note that if a user belongs to multiple groups, later values in `JUPYTERHUB_RESOURCE_LIMITS` will take
+# precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
 # - mem_limit=10G (because group1 is later in the list)
 # - cpu_limit=3 (because group2 is later in the list)
-export JUPYTERHUB_RESOURCE_LIMITS=
```
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thinking about this last night, can you just restore the default empty value here? All default values should be in a corresponding The commented out value in If you want to avoid duplicating all the documentations for that var in
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, we really need to write down some of these policies somewhere
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Absolutely agree, added #620 before I forget. |
```diff
+# - gpu_ids=0,3,4
+export JUPYTERHUB_RESOURCE_LIMITS='[]'
```
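The precedence rule documented above (later entries in `JUPYTERHUB_RESOURCE_LIMITS` override earlier ones, per limit key) can be sketched in plain Python. This is an illustration only, not the actual spawner hook; the user name, group memberships, and rules below are hypothetical examples:

```python
import json

# The documented example: user1 belongs to both group1 and group2
rules = json.loads("""
[
  {"type": "group", "name": "group1", "limits": {"mem_limit": "10G", "cpu_limit": 1}},
  {"type": "group", "name": "group2", "limits": {"cpu_limit": 3}}
]
""")

user_name, user_groups = "user1", {"group1", "group2"}

effective = {}
for rule in rules:
    matches = (rule["type"] == "user" and rule["name"] == user_name) or \
              (rule["type"] == "group" and rule["name"] in user_groups)
    if matches:
        # Later rules override earlier ones, but only for the keys they set
        effective.update(rule["limits"])

# mem_limit comes from group1, cpu_limit from the later group2 rule
assert effective == {"mem_limit": "10G", "cpu_limit": 3}
```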
|
```diff
 export DELAYED_EVAL="
 $DELAYED_EVAL
```
```diff
@@ -1,8 +1,11 @@
 import os
 from os.path import join
 import logging
+import random
 import subprocess

+import docker
+import yaml
 from dockerspawner import DockerSpawner

 c = get_config()  # noqa # can be called directy without import because injected by IPython

@@ -137,10 +140,18 @@ if os.environ['WORKSPACE_DIR'] != jupyterhub_data_dir:
 container_gdrive_settings_path = join(container_home_dir, ".jupyter/lab/user-settings/@jupyterlab/google-drive/drive.jupyterlab-settings")
 host_gdrive_settings_path = os.environ['JUPYTER_GOOGLE_DRIVE_SETTINGS']

-# resource_limits: dict[tuple[Literal["user", "group"], str], dict[Literal["cpu_limit", "mem_limit"], str]]
-resource_limits = {tuple(lim[:2]): dict(li.split("=") for li in lim[2:] if "=" in li)
-                   for limit in """${JUPYTERHUB_RESOURCE_LIMITS}""".strip().split()
-                   if (lim := limit.split(":"))}
+# class LimitDict(TypedDict):
+#     mem_limit: NotRequired[str | int]
+#     cpu_limit: NotRequired[str | float | int]
+#     gpu_ids: NotRequired[list[int | str]]
+#     gpu_count: NotRequired[int]
+#
+# class LimitRule(TypedDict):
+#     type: Literal["user", "group"]
+#     name: str
+#     limits: LimitDict
+# resource_limits: LimitRule
+resource_limits = yaml.safe_load("""${JUPYTERHUB_RESOURCE_LIMITS}""")
```
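Because the value is parsed with `yaml.safe_load`, the variable accepts either JSON or YAML syntax (JSON is a subset of YAML 1.2). A quick sketch, assuming PyYAML is installed:

```python
import yaml

# JSON form, as shown in the documentation above
rules = yaml.safe_load('[{"type": "user", "name": "user1", "limits": {"mem_limit": "30G"}}]')
assert rules[0]["limits"]["mem_limit"] == "30G"

# The equivalent YAML block form parses to the same structure
rules_yaml = yaml.safe_load("""
- type: user
  name: user1
  limits: {mem_limit: 30G}
""")
assert rules_yaml == rules
```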
```diff

 if len(host_gdrive_settings_path) > 0:
     c.DockerSpawner.volumes[host_gdrive_settings_path] = {
```
|
|
```diff
@@ -183,14 +194,28 @@ def limit_resource_hook(spawner):
         spawner.mem_limit = os.environ['JUPYTER_DEMO_USER_MEM_LIMIT']

     user_groups = {g.name for g in spawner.user.groups}
-    for (name_type, name), limits in resource_limits.items():
-        if (name_type == "user" and name == spawner.user.name) or (name_type == "group" and name in user_groups):
-            for limit, value in limits.items():
+    gpu_ids = []
+    gpu_count = 1
+    for rule in resource_limits:
+        rule_type = rule["type"]
+        name = rule["name"]
+        if rule_type == "user" and name == spawner.user.name or rule_type == "group" and name in user_groups:
+            for limit, value in rule["limits"].items():
                 if limit == "cpu_limit":
-                    spawner.cpu_limit = float(value)
+                    spawner.cpu_limit = value
                 elif limit == "mem_limit":
                     spawner.mem_limit = value
+                elif limit == "gpu_ids":
+                    gpu_ids = value
+                elif limit == "gpu_count":
+                    gpu_count = value
+    if gpu_ids:
+        # randomly assign GPUs in an attempt to evenly distribute GPU resources
+        random.shuffle(gpu_ids)
+        gpu_ids = gpu_ids[:gpu_count]
+        spawner.extra_host_config["device_requests"] = [
+            docker.types.DeviceRequest(device_ids=gpu_ids, capabilities=[["gpu"]])
```
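The selection step above (shuffle the allowed IDs, then take the first `gpu_count`) can be exercised in isolation. `select_gpus` is a hypothetical helper extracted for illustration, not a function that exists in this config:

```python
import random

def select_gpus(gpu_ids, gpu_count=1):
    # Shuffle a copy so repeated spawns tend to spread users across
    # the allowed GPUs, then hand out at most gpu_count of them.
    ids = list(gpu_ids)
    random.shuffle(ids)
    return ids[:gpu_count]

# Matches the documented example: pick 2 GPUs out of [1, 2, 6]
chosen = select_gpus([1, 2, 6], gpu_count=2)
assert len(chosen) == 2
assert set(chosen) <= {1, 2, 6}
```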
|
Collaborator:
Oh! So if we forgot to specify Please document this default behavior.

Collaborator:
Now the reverse. With the new default So we have to remember to set Should add this default behavior to the documentation or keep the default to

Collaborator (Author):
correct

It's extra confusing if the default behaviour is different for users and groups, we should be consistent.

Collaborator:
Agreed to keep same default behavior for consistency. I just find it more natural if we are giving multiples But it's fine. Keep it that way. There is no perfect solution.

Member:
This is because you are thinking in terms of "allocation", but GPUs are usually configured in terms of "availability", because it is very expensive to assign these resources and have them sitting there locked and unused by a single user. Typically, GPU requests are conservative (if any provided at all by default), and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case. If we were adding a

Collaborator (Author):
Yes I'd eventually like to make use of an options form where users can request up to a certain amount of resources instead of just automatically giving them the maximum they're allowed according to these rules. We could also free up resources early by setting limits on how long a user can keep a resource (i.e. user X is allowed to request 3 GPUs but we'll kill their container after 2 hours). Think of this as similar to "salloc -p archiveshort" on scinet to get synchronous access to one of the compute nodes for a short period of time. I've got lots of ideas for how to extend this and try to make it "fair" to users who are all sharing resources. My main goal is to give the node administrator the freedom to set the resources however they want. BUT we should provide documentation that gives good advice and a reasonable starting configuration. For this PR, the goal is simply to incorporate GPUs into the jupyterhub resource allocation mechanism
```diff
+        ]

 def pre_spawn_hook(spawner):
     create_dir_hook(spawner)
```
New file `birdhouse/components/jupyterhub/resource-limit.schema.json` (the schema referenced in the variable documentation):

```json
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "type": {
        "type": "string",
        "enum": ["user", "group"]
      },
      "name": {
        "type": "string",
        "pattern": "^.+$"
      },
      "limits": {
        "type": "object",
        "properties": {
          "mem_limit": {"type": "string"},
          "cpu_limit": {"type": "number"},
          "gpu_ids": {"type": "array"},
          "gpu_count": {"type": "number"}
        },
        "dependentRequired": {
          "gpu_count": ["gpu_ids"]
        },
        "additionalProperties": false
      }
    },
    "required": ["type", "name", "limits"]
  }
}
```
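The `dependentRequired` clause encodes the rule that `gpu_count` is only meaningful alongside `gpu_ids`. A minimal stdlib sketch of that one constraint (hypothetical `check_limits` helper; real validation of the full schema would use a JSON Schema library such as `jsonschema`):

```python
def check_limits(limits):
    """Return error messages for a single "limits" mapping."""
    errors = []
    allowed = {"mem_limit", "cpu_limit", "gpu_ids", "gpu_count"}
    for key in limits:
        # mirrors "additionalProperties": false
        if key not in allowed:
            errors.append("unknown limit: " + key)
    # mirrors "dependentRequired": {"gpu_count": ["gpu_ids"]}
    if "gpu_count" in limits and "gpu_ids" not in limits:
        errors.append("gpu_count requires gpu_ids")
    return errors

assert check_limits({"gpu_ids": [0, 1], "gpu_count": 2}) == []
assert check_limits({"gpu_count": 2}) == ["gpu_count requires gpu_ids"]
```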