Set GPU access on Jupyterlab containers based on Magpie user or group name #616
base: master
Conversation
# If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
# For example, if gpu_ids=gpu1,gpu2,gpu6 and gpu_count=2 then two GPUs will be randomly selected from the gpu_ids list.
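To make the selection rule above concrete, here is a minimal sketch, assuming the GPU ids are the literal strings from the example (this is an illustration only, not necessarily how the PR implements the selection):

```python
# Pick gpu_count ids at random from the configured gpu_ids list.
import random

gpu_ids = "gpu1,gpu2,gpu6".split(",")
gpu_count = 2
selected = random.sample(gpu_ids, gpu_count)  # two distinct ids, random order
print(selected)  # e.g. ['gpu6', 'gpu2']
```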
If omitted, is it default 1 or "all"?
I personally think 1 would be safer for fair/shared use and to avoid over-allocating, but the default should be indicated either way.
If this is on a group and the number of users in the group exceeds the number of GPUs in gpu_ids, what happens if all the users of the group log in to Jupyter?!
If, by mistake, when writing the JUPYTERHUB_RESOURCE_LIMITS block we give exactly the same gpu_ids to 2 users, what happens if both users log in at the same time? This case will happen with the current code if we forget gpu_count when defining a group and the group has more than one user.
If omitted, is it default 1 or "all"?
Currently it's "all".
what happens if all the users of the group log in to Jupyter?!
what happens if both users log in at the same time?
See #594 (comment).
Users have to share in the same way they have to share memory. If we want to get smart about this and create a system where users will never be able to affect others by over-allocating resources, we can do that.
But that's a much more complex configuration that I'm still working out the details of.
If this is on a group
I am so used to thinking in terms of a single user requesting a GPU for their job that I didn't consider group allocations, which realistically should be more than 1 if possible, to avoid a big user queue over a single resource.
So it seems a reasonable default should be user/group-based? 1 if user, "all" if group.
Would that seem more confusing?
If so, "all" could remain valid for both. (Edit following the comment below: probably better to have "1" everywhere...)
I think the use-case of a "user reserving everything and leaving none for others" should be strongly called out in the doc, so that the maintainer considers explicit gpu_count values for user-based allocations.
Following #616 (comment), I also see that group allocations could easily be misinterpreted if gpu_count is omitted. The same problem of a single user holding all resources offered to the group would happen.
For group-based allocations, I think the typical use-case would more often be to provide many GPUs to meet demand, but still distribute/limit them across the group's users.
tlvu
left a comment
Please add documentation for corner/error cases so the admin can avoid mistakes when allocating GPUs.
Otherwise looks good to me, but I will not be able to test this since I do not have access to any machines with GPUs. Hope @fmigneault you have some GPUs!
random.shuffle(gpu_ids)
gpu_ids = gpu_ids[:gpu_count]
spawner.extra_host_config["device_requests"] = [
    docker.types.DeviceRequest(device_ids=gpu_ids, capabilities=[["gpu"]])
]
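For readers of this thread, here is a self-contained sketch of the kind of hook the excerpt above comes from. This is hedged: the helper name, the configuration source, and the gpu_count default are assumptions for illustration, not the PR's actual code.

```python
# Sketch of a DockerSpawner pre-spawn hook that exposes a subset of GPUs.
import random
import docker.types


def assign_gpus(spawner, gpu_ids, gpu_count=None):
    """Attach a random selection of gpu_ids to the user's container."""
    gpu_ids = list(gpu_ids)
    random.shuffle(gpu_ids)
    if gpu_count is not None:
        gpu_ids = gpu_ids[:gpu_count]  # otherwise every listed GPU is exposed
    spawner.extra_host_config["device_requests"] = [
        docker.types.DeviceRequest(device_ids=gpu_ids, capabilities=[["gpu"]])
    ]

# Usage (e.g. from c.Spawner.pre_spawn_hook in jupyterhub_config.py):
#   assign_gpus(spawner, ["0", "2", "3"], gpu_count=2)
```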
Oh! So if we forget to specify gpu_count, all gpu_ids are given to the user! I guess we had better not forget gpu_count for group definitions then!
Please document this default behavior.
Now the reverse.
With the new default gpu_count = 1, even for users with gpu_ids=0,2,3, if we do not set gpu_count=3, the user will only have 1 GPU?
So we have to remember to set gpu_count for a user definition if we give that user more than one gpu_ids entry?
We should add this default behavior to the documentation, or keep the default of "all" for the user case.
With the new default gpu_count = 1, even for users with gpu_ids=0,2,3, if we do not set gpu_count=3, the user will only have 1 GPU?
So we have to remember to set gpu_count for a user definition if we give that user more than one gpu_ids entry?
Correct.
We should add this default behavior to the documentation, or keep the default of "all" for the user case.
It's extra confusing if the default behaviour is different for users and groups; we should be consistent.
It is documented. See:
# If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
# If gpu_count is not specified, then exactly one GPU will be randomly selected.
It's extra confusing if the default behaviour is different for users and groups; we should be consistent.
Agreed, keep the same default behavior for consistency. I just find it more natural that if we are giving multiple gpu_ids to a user definition, we intend for the user to have all of them. Now we also have to remember to give gpu_count to a user definition.
But it's fine. Keep it that way. There is no perfect solution.
I just find it more natural that if we are giving multiple gpu_ids to a user definition, we intend for the user to have all of them.
This is because you are thinking in terms of "allocation", but GPUs are usually configured in terms of "availability", because it is very expensive to assign these resources and have them sitting there locked and unused by a single user.
Typically, GPU requests are conservative (if any provided at all by default), and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case.
If we were adding a $ cost to these GPU allocations, you can be sure users would be unhappy that they got over-allocated unrequested resources.
and users have to explicitly ask for one/many and/or specific capabilities/VRAM according to their use case.
Yes I'd eventually like to make use of an options form where users can request up to a certain amount of resources instead of just automatically giving them the maximum they're allowed according to these rules.
We could also free up resources early by setting limits on how long a user can keep a resource (i.e. user X is allowed to request 3 GPUs but we'll kill their container after 2 hours). Think of this as similar to "salloc -p archiveshort" on scinet to get synchronous access to one of the compute nodes for a short period of time.
I've got lots of ideas for how to extend this and try to make it "fair" to users who are all sharing resources. My main goal is to give the node administrator the freedom to set the resources however they want. BUT we should provide documentation that gives good advice and a reasonable starting configuration.
For this PR, the goal is simply to incorporate GPUs into the JupyterHub resource allocation mechanism.
E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3878/
Result: 🆘 ABORTED
Parameters: BIRDHOUSE_DEPLOY_BRANCH=gpu-resource-allocation, DACCS_IAC_BRANCH=master, DACCS_CONFIGS_BRANCH=master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH=master, PAVICS_SDI_BRANCH=master, DESTROY_INFRA_ON_EXIT=true
PAVICS_HOST: https://host-140-133.rdext.crim.ca

E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3888/
Result: ✅ SUCCESS
Parameters: BIRDHOUSE_DEPLOY_BRANCH=gpu-resource-allocation, DACCS_IAC_BRANCH=master, DACCS_CONFIGS_BRANCH=master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH=master, PAVICS_SDI_BRANCH=master, DESTROY_INFRA_ON_EXIT=true
PAVICS_HOST: https://host-140-216.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/566/
NOTEBOOK TEST RESULTS
tlvu
left a comment
Still missing some very minor details, but much better overall, thanks!
# - cpu_limit=3 (because group2 is later in the list)
# - gpu_ids=0,3,4
export JUPYTERHUB_RESOURCE_LIMITS=
export JUPYTERHUB_RESOURCE_LIMITS='[]'
The default cannot be an empty string? Will an empty string cause a yaml.safe_load() error?
It could be an empty string too. Either way we can make it work.
It could be either undefined (null, i.e. None on the Python side), '', {} or []. If the code does if resource_limits: ..., it can handle any of these variants.
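A quick sketch of that check, assuming the value arrives through an environment variable (illustrative only; the PR's actual parsing code may differ):

```python
# '', unset, '[]' and '{}' all normalize to "no limits" with one truthiness check.
import os
import yaml  # PyYAML

raw = os.environ.get("JUPYTERHUB_RESOURCE_LIMITS", "")  # unset falls back to ''
resource_limits = yaml.safe_load(raw) if raw.strip() else None
# yaml.safe_load('') -> None, yaml.safe_load('[]') -> [], yaml.safe_load('{}') -> {};
# all of these are falsy, so a single check covers every variant.
if not resource_limits:
    resource_limits = []  # treat "not configured" as an empty list of rules

for rule in resource_limits:
    print(rule)
```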
Can the default be an empty string then? All existing empty defaults are empty strings, so this one just jumped out at me!
It could also be export JUPYTERHUB_RESOURCE_LIMITS=.
No preference on my end, as long as the YAML parsing doesn't rely explicitly on a given structure to do its work.
Can the default for JUPYTERHUB_RESOURCE_LIMITS be '' so it looks like all the other empty defaults?
E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3893/
Result: ❌ FAILURE
Parameters: BIRDHOUSE_DEPLOY_BRANCH=gpu-resource-allocation, DACCS_IAC_BRANCH=master, DACCS_CONFIGS_BRANCH=master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH=master, PAVICS_SDI_BRANCH=master, DESTROY_INFRA_ON_EXIT=true
PAVICS_HOST: https://host-140-91.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/570/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3895/
Result: ❌ FAILURE
Parameters: BIRDHOUSE_DEPLOY_BRANCH=gpu-resource-allocation, DACCS_IAC_BRANCH=master, DACCS_CONFIGS_BRANCH=master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH=master, PAVICS_SDI_BRANCH=master, DESTROY_INFRA_ON_EXIT=true
PAVICS_HOST: https://host-140-216.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/571/
NOTEBOOK TEST RESULTS
fmigneault
left a comment
Waiting on remaining discussions to conclude...
E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3901/
Result: ❌ FAILURE
Parameters: BIRDHOUSE_DEPLOY_BRANCH=gpu-resource-allocation, DACCS_IAC_BRANCH=master, DACCS_CONFIGS_BRANCH=master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH=master, PAVICS_SDI_BRANCH=master, DESTROY_INFRA_ON_EXIT=true
PAVICS_HOST: https://host-140-216.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/577/
NOTEBOOK TEST RESULTS
tlvu
left a comment
All good for me at this stage.
# precedence. For example, if a user named user1 belongs to group1 and group2 then the following limits will apply:
# - mem_limit=10G (because group1 is later in the list)
# - cpu_limit=3 (because group2 is later in the list)
export JUPYTERHUB_RESOURCE_LIMITS=
Thinking about this last night, can you just restore the default empty value here?
All default values should be in a corresponding default.env, like all other vars, to not confuse users and to maintain consistency. We should not expect the user to search the code for that default value.
The commented-out value in env.local.example is just an example; it should not be considered a default value.
If you want to avoid duplicating all the documentation for that var in default.env, you can simply point the user to env.local.example.
Sure, we really need to write down some of these policies somewhere.
Sure, we really need to write down some of these policies somewhere.
Absolutely agree, added #620 before I forget.
E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3902/
Result: ❌ FAILURE
Parameters: BIRDHOUSE_DEPLOY_BRANCH=gpu-resource-allocation, DACCS_IAC_BRANCH=master, DACCS_CONFIGS_BRANCH=master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH=master, PAVICS_SDI_BRANCH=master, DESTROY_INFRA_ON_EXIT=true
PAVICS_HOST: https://host-140-216.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/578/
NOTEBOOK TEST RESULTS

E2E Test Results
DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3912/
Result: ❌ FAILURE
Parameters: BIRDHOUSE_DEPLOY_BRANCH=gpu-resource-allocation, DACCS_IAC_BRANCH=master, DACCS_CONFIGS_BRANCH=master, PAVICS_E2E_WORKFLOW_TESTS_BRANCH=master, PAVICS_SDI_BRANCH=master, DESTROY_INFRA_ON_EXIT=true
PAVICS_HOST: https://host-140-216.rdext.crim.ca
PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/585/
NOTEBOOK TEST RESULTS
Overview
Extends the feature that lets resource allocations for Jupyterlab containers be assigned based on username or group membership.
New settings for the JUPYTERHUB_RESOURCE_LIMITS variable are gpu_ids and gpu_count.
gpu_ids is a comma-separated list of the GPU ids available on the host that you want to make available to the user or group. GPU ids can typically be discovered by running the nvidia-smi command.
If gpu_count is also specified, this is an integer indicating how many GPUs to make available to that user or group.
For example, if gpu_ids=gpu1,gpu2,gpu6 and gpu_count=2 then two GPUs will be randomly selected from the gpu_ids list.
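As a concrete illustration, here is a hypothetical value for the variable, parsed the way the deployment might handle it. The "group"/"user" key names are assumptions for illustration; only mem_limit, cpu_limit, gpu_ids, gpu_count and the "later entries take precedence" rule come from the documentation quoted in this PR.

```python
# Hypothetical JUPYTERHUB_RESOURCE_LIMITS content; later entries take precedence.
import yaml

example = """
- group: group1
  mem_limit: 10G
  gpu_ids: gpu1,gpu2,gpu6
  gpu_count: 2
- user: user1
  cpu_limit: 3
  gpu_ids: "0"
"""
for rule in yaml.safe_load(example):
    ids = str(rule["gpu_ids"]).split(",") if rule.get("gpu_ids") else []
    count = int(rule.get("gpu_count", 1))  # default of one GPU, per the review discussion
    print(rule.get("user") or rule.get("group"), ids, count)
```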
Changes
Non-breaking changes
Breaking changes
Related Issue / Discussion
Additional Information
CI Operations
birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false