Set GPU access on Jupyterlab containers based on Magpie user or group name #616

The first file in the diff documents the `JUPYTERHUB_RESOURCE_LIMITS` environment variable format:

```diff
@@ -80,16 +80,25 @@ export JUPYTERHUB_ADMIN_USERS='{\"${MAGPIE_ADMIN_USERNAME}\"}' # python set syn
 # export JUPYTERHUB_RESOURCE_LIMITS="
 # user:user1:mem_limit=30G
 # group:group1:mem_limit=10G:cpu_limit=1
-# group:group2:cpu_limit=3
+# group:group2:cpu_limit=3:gpu_ids=0,3,4
+# user:user2:gpu_ids=1,2,3:gpu_count=2
 # "
 #
-# Supported limits are: `mem_limit` and `cpu_limit`. See the Jupyterhub Dockerspawner documentation
-# for details and supported values.
+# Supported limits are: `mem_limit`, `cpu_limit`, `gpu_count`, `gpu_ids`. See the Jupyterhub Dockerspawner documentation
+# for details and supported values for mem_limit and cpu_limit.
+# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.cpu_limit
+# - https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.DockerSpawner.mem_limit
+# gpu_ids are a comma separated list of the GPU uuids or zero based indexes available on the host that you want to make available
+# to the user or group. GPU uuids and indexes can be discovered by running the `nvidia-smi --list-gpus` command.
+# If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
+# If gpu_count is not specified, then exactly one GPU will be randomly selected.
+# For example, if gpu_ids=1,2,6 and gpu_count=2 then two GPUs will be randomly selected from the gpu_ids list.
 # Note that this will not create the groups in Magpie, that must be done manually.
 # Note that if a user belongs to multiple groups, later values in `JUPYTERHUB_RESOURCE_LIMITS` will take
 # precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
 # - mem_limit=10G (because group1 is later in the list)
 # - cpu_limit=3 (because group2 is later in the list)
+# - gpu_ids=0,3,4
 export JUPYTERHUB_RESOURCE_LIMITS=
```
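
To make the precedence rule concrete, here is a small standalone Python sketch (not part of this PR; the `effective_limits` helper is purely illustrative) that resolves the documented example for a user1 belonging to both group1 and group2:

```python
# Standalone sketch (not part of the PR): resolve the effective limits for one user,
# following the "later entries take precedence" rule documented above.
EXAMPLE = """
user:user1:mem_limit=30G
group:group1:mem_limit=10G:cpu_limit=1
group:group2:cpu_limit=3:gpu_ids=0,3,4
"""

def effective_limits(config: str, username: str, groups: set) -> dict:
    merged = {}
    for entry in config.strip().split():
        kind, name, *limits = entry.split(":")
        if (kind == "user" and name == username) or (kind == "group" and name in groups):
            # later entries overwrite earlier ones
            merged.update(l.split("=") for l in limits if "=" in l)
    return merged

print(effective_limits(EXAMPLE, "user1", {"group1", "group2"}))
# {'mem_limit': '10G', 'cpu_limit': '3', 'gpu_ids': '0,3,4'}
```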

Review thread on the `export JUPYTERHUB_RESOURCE_LIMITS=` default value:

**Collaborator:** Thinking about this last night, can you just restore the default empty value here? All default values should be in a corresponding [...]. The commented out value in [...]. If you want to avoid duplicating all the documentation for that var in [...].

**Collaborator (PR author):** Sure, we really need to write down some of these policies somewhere.

**Collaborator:** Absolutely agree, added #620 before I forget.

The second file in the diff is the JupyterHub spawner configuration script (DockerSpawner-based). Its import section after the change:

```diff
@@ -1,8 +1,10 @@
 import os
 from os.path import join
 import logging
 import random
 import subprocess

 import docker
 from dockerspawner import DockerSpawner

 c = get_config()  # noqa # can be called directy without import because injected by IPython
```
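
As a side note, the `nvidia-smi --list-gpus` lookup mentioned in the documentation above can also be scripted. The following standalone sketch is not from this PR and assumes the usual `GPU <index>: <name> (UUID: GPU-...)` output format:

```python
import re
import subprocess

def list_host_gpus():
    """Return [(index, uuid), ...] reported by nvidia-smi, or [] if unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return []  # no NVIDIA driver or GPU on this host
    # Assumed line format: "GPU 0: NVIDIA A100 (UUID: GPU-xxxxxxxx-...)"
    return re.findall(r"^GPU (\d+): .*\(UUID: (\S+)\)", out, flags=re.MULTILINE)

print(list_host_gpus())
```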

The parsing of `JUPYTERHUB_RESOURCE_LIMITS` gains the two new limit names in its type comment:

```diff
@@ -137,7 +139,7 @@ if os.environ['WORKSPACE_DIR'] != jupyterhub_data_dir:
 container_gdrive_settings_path = join(container_home_dir, ".jupyter/lab/user-settings/@jupyterlab/google-drive/drive.jupyterlab-settings")
 host_gdrive_settings_path = os.environ['JUPYTER_GOOGLE_DRIVE_SETTINGS']

-# resource_limits: dict[tuple[Literal["user", "group"], str], dict[Literal["cpu_limit", "mem_limit"], str]]
+# resource_limits: dict[tuple[Literal["user", "group"], str], dict[Literal["cpu_limit", "mem_limit", "gpu_count", "gpu_ids"], str]]
 resource_limits = {tuple(lim[:2]): dict(li.split("=") for li in lim[2:] if "=" in li)
                    for limit in """${JUPYTERHUB_RESOURCE_LIMITS}""".strip().split()
                    if (lim := limit.split(":"))}
```
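
For the example value documented earlier, the comprehension above (shown here as a standalone sketch with the `${JUPYTERHUB_RESOURCE_LIMITS}` template placeholder already substituted) produces a mapping keyed by `(kind, name)` tuples:

```python
# Sketch of what the resource_limits comprehension yields for the documented example.
limits_str = """
user:user1:mem_limit=30G
group:group1:mem_limit=10G:cpu_limit=1
group:group2:cpu_limit=3:gpu_ids=0,3,4
user:user2:gpu_ids=1,2,3:gpu_count=2
"""

resource_limits = {tuple(lim[:2]): dict(li.split("=") for li in lim[2:] if "=" in li)
                   for limit in limits_str.strip().split()
                   if (lim := limit.split(":"))}

# {('user', 'user1'): {'mem_limit': '30G'},
#  ('group', 'group1'): {'mem_limit': '10G', 'cpu_limit': '1'},
#  ('group', 'group2'): {'cpu_limit': '3', 'gpu_ids': '0,3,4'},
#  ('user', 'user2'): {'gpu_ids': '1,2,3', 'gpu_count': '2'}}
```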

The `limit_resource_hook` then collects `gpu_ids` and `gpu_count` from the matching user or group entries and translates them into a Docker device request:

```diff
@@ -183,14 +185,26 @@ def limit_resource_hook(spawner):
         spawner.mem_limit = os.environ['JUPYTER_DEMO_USER_MEM_LIMIT']

     user_groups = {g.name for g in spawner.user.groups}
+    gpu_ids = []
+    gpu_count = 1
     for (name_type, name), limits in resource_limits.items():
         if (name_type == "user" and name == spawner.user.name) or (name_type == "group" and name in user_groups):
             for limit, value in limits.items():
                 if limit == "cpu_limit":
                     spawner.cpu_limit = float(value)
                 elif limit == "mem_limit":
                     spawner.mem_limit = value
+                elif limit == "gpu_ids":
+                    gpu_ids = value.split(",")
+                elif limit == "gpu_count":
+                    gpu_count = int(value)
+    if gpu_ids:
+        # randomly assign GPUs in an attempt to evenly distribute GPU resources
+        random.shuffle(gpu_ids)
+        gpu_ids = gpu_ids[:gpu_count]
+        spawner.extra_host_config["device_requests"] = [
+            docker.types.DeviceRequest(device_ids=gpu_ids, capabilities=[["gpu"]])
+        ]


 def pre_spawn_hook(spawner):
     create_dir_hook(spawner)
```

Review discussion on the GPU assignment block:

**Collaborator:** Oh! So if we forgot to specify [...] Please document this default behavior.

**Collaborator:** Now the reverse. With the new default [...] So we have to remember to set [...]. Should we add this default behavior to the documentation, or keep the default to [...]?

**Collaborator (PR author):** Correct. It's extra confusing if the default behaviour is different for users and groups; we should be consistent.

**Collaborator:** Agreed to keep the same default behavior for consistency. I just find it more natural if we are giving multiple [...]. But it's fine, keep it that way. There is no perfect solution.

**Member:** This is because you are thinking in terms of "allocation", but GPUs are usually configured in terms of "availability", because it is very expensive to assign these resources and have them sitting there locked and unused by a single user. Typically, GPU requests are conservative (if any are provided at all by default), and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case. If we were adding a [...]

**Collaborator (PR author):** Yes, I'd eventually like to make use of an options form where users can request up to a certain amount of resources instead of just automatically giving them the maximum they're allowed according to these rules. We could also free up resources early by setting limits on how long a user can keep a resource (i.e. user X is allowed to request 3 GPUs but we'll kill their container after 2 hours). Think of this as similar to `salloc -p archiveshort` on scinet to get synchronous access to one of the compute nodes for a short period of time. I've got lots of ideas for how to extend this and try to make it "fair" to users who are all sharing resources. My main goal is to give the node administrator the freedom to set the resources however they want, BUT we should provide documentation that gives good advice and a reasonable starting configuration. For this PR, the goal is simply to incorporate GPUs into the JupyterHub resource allocation mechanism.
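
For context on the `device_requests` entry set by the hook: docker-py's `DeviceRequest` is the programmatic counterpart of `docker run --gpus`. The sketch below is not part of the PR; it assumes a host with the NVIDIA Container Toolkit configured and a local `ubuntu` image:

```python
import docker

client = docker.from_env()

# Roughly equivalent to: docker run --rm --gpus '"device=0,3"' ubuntu nvidia-smi -L
output = client.containers.run(
    "ubuntu",
    "nvidia-smi -L",
    device_requests=[
        docker.types.DeviceRequest(device_ids=["0", "3"], capabilities=[["gpu"]])
    ],
    remove=True,
)
print(output.decode())
```

Passing `device_ids=["0", "3"]` makes only those two host GPUs visible inside the container, which is what the hook above achieves per user or group.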