Document how to allow Jupyterlab containers to access GPUs #594
base: master
Conversation
tlvu
left a comment
LGTM, except the part that overrides JUPYTERHUB_CONFIG_OVERRIDE: it should append to it instead.
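For example, something along these lines in env.local would append to any existing value instead of replacing it (a rough sketch; the exact quoting depends on how JUPYTERHUB_CONFIG_OVERRIDE is already set on the target server):

```bash
# env.local -- append the GPU settings to whatever JUPYTERHUB_CONFIG_OVERRIDE already
# contains, instead of overwriting it. The Python lines are the same ones added by this PR.
export JUPYTERHUB_CONFIG_OVERRIDE="$JUPYTERHUB_CONFIG_OVERRIDE"'
c.DockerSpawner.extra_host_config["device_requests"] = [
    docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
]
'
```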
]
'
This will allow the docker containers to access all GPUs on the host machine. To limit the number of GPUs you want to make available
So all Jupyter users will have access to the GPU? And if they happen to all use the GPU at the same time, will they step on each other's feet?
Yup, they will definitely step on each other's feet, just the same as when a user hogs any other resource (CPU, memory, etc.).
We definitely need a better way to manage resource overuse, but the problem isn't specific to GPUs.
E2E Test Results

DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3708/
Result: ❌ FAILURE
BIRDHOUSE_DEPLOY_BRANCH: gpu-support-documentation
DACCS_IAC_BRANCH: master
DACCS_CONFIGS_BRANCH: master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master
PAVICS_SDI_BRANCH: master
DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-216.rdext.crim.ca

PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/465/
NOTEBOOK TEST RESULTS
fmigneault
left a comment
Nice addition. Looking forward to advanced applications of it.
c.DockerSpawner.extra_host_config["device_requests"] = [
    docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
]
Does that mean that every single user notebook automatically gets assigned a GPU?
If so, won't that create a bottleneck very quickly?
Is there a way to offer both GPU and CPU-only options simultaneously (by adding more DeviceRequest variants to the list?), and have the user select one somehow when starting the kernel?
Maybe even more specific GPU definitions, like providing the ones with just 8GB VRAM vs others with 48GB separately?
I would be interested in that specific multi-config example, and in how users would interact with it to request appropriate resources (or how we would limit them).
Does that mean that every single user notebook automatically gets assigned a GPU?
No, it means that every container has access to all GPUs on the host. This PR doesn't introduce any solutions for allocating different GPU resources to different users. That is a much more complex thing that I'll have to try to figure out at a later date (because I don't really understand it yet).
Is there a way to have both GPU/CPU-only simultaneously (adding more DeviceRequest variants to the list?), and have it selected somehow by the user when starting the kernel?
Maybe even more specific GPU definitions, like providing the ones with just 8GB VRAM vs others with 48GB separately?
I think so; it would require a pretty good understanding of the NVIDIA toolkit and Docker settings. I'm still reading about it and I can continue to update these docs as we figure out different possible configurations.
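For instance, here is a completely untested sketch of a more targeted configuration using the `device_ids` option of `docker.types.DeviceRequest` (the GPU indices below are made up, and this still exposes the same fixed set of devices to every user rather than letting them choose):

```python
# Hypothetical sketch for jupyterhub_config.py (where the config object `c` is defined).
# Instead of count=-1 (all GPUs), expose only specific devices to the spawned containers,
# e.g. the two small 8GB cards, keeping the 48GB ones for other uses.
import docker.types

c.DockerSpawner.extra_host_config["device_requests"] = [
    docker.types.DeviceRequest(device_ids=["0", "1"], capabilities=[["gpu"]])
]
```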
I wonder if something like https://jupyterhub-dockerspawner.readthedocs.io/en/latest/api/index.html#dockerspawner.SwarmSpawner.group_overrides could be used to dynamically apply the GPU request on specific users/conditions, therefore allowing having GPU or CPU-only setup.
Documentation is very sparse, so definitely very hard to figure out 😅
All promising features nonetheless. Thanks for looking into them.
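Something along these lines is what I had in mind (entirely untested, and assuming `group_overrides` maps JupyterHub group names to spawner attribute overrides as the linked docs suggest; the group name is made up):

```python
# Untested sketch based on the dockerspawner `group_overrides` docs linked above.
# The idea: users in a hypothetical "gpu-users" JupyterHub group get the GPU device
# request, while everyone else keeps the default CPU-only container.
import docker.types

c.DockerSpawner.group_overrides = {
    "gpu-users": {
        "extra_host_config": {
            "device_requests": [
                docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
            ]
        }
    }
}
```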
This will allow the docker containers to access all GPUs on the host machine. To limit the number of GPUs you want to make available
you can change the ``count`` value to a positive integer or you can specify the ``device_ids`` key instead. ``device_ids`` takes a list
count=-1: That was my impression even before reading this part.
Requesting all available GPUs would essentially lock out any second user trying to use a kernel.
The example should probably use count=1 instead, with a stronger warning about this situation, to let server maintainers know not to raise the value too much unless they have a really big GPU cluster.
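i.e. the same snippet as in the docs, only with the count changed:

```python
# Variant of the documented example: each container requests a single GPU instead of all
# of them. (As clarified in the reply below, this limits how many GPUs a container can
# see; it is not an exclusive reservation, so containers can still share the same GPU.)
c.DockerSpawner.extra_host_config["device_requests"] = [
    docker.types.DeviceRequest(count=1, capabilities=[["gpu"]])
]
```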
Let me clarify... this gives each container access to all GPUs on the host, and they share them as a resource, the same way they share access to the CPUs, memory, etc.
This will not stop a user from starting up their container.
Note that if user A proceeds to max out some of the GPUs, then user B can't use them until user A is done, but managing that goes beyond the scope of this documentation so far.
Unless you have big GPUs like the A100 that can actually provide virtual-GPU VRAM, it won't take much for all users to crash their respective processes with OutOfMemoryError.
I'm nowhere near an expert on the matter, but I know that our clusters leverage some vGPUs to allow some kind of splitting this way. I don't know if that would play nice with multiple Docker containers trying to access the same GPU. Doesn't it do some kind of lock/reservation when assigned to a particular kernel?
I don't think you need a vGPU setup for Docker. Since there is no true VM (no hypervisor) with Docker, the containers can just access the GPU directly.
One thing you could do is use MIG (Multi-Instance GPU) or MPS (Multi-Process Service) to split up resources but not all GPUs support these (none of ours do unfortunately).
Doesn't it do some kind of lock/reservation when assigned to a particular kernel?
As far as I know, it manages context switching the same way a CPU running multiple processes/threads would. So everything gets slowed down and/or on-GPU memory fills up, but it won't necessarily just fail immediately from a user's perspective.
If the host machine has GPUs and you want to make them available to the docker containers running Jupyterlab:
1. ensure that the GPU drivers on the host machine are up to date
2. install the `NVIDIA container toolkit`_ package on the host machine
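(For reference, step 2 on an apt-based host typically looks something like the sketch below; the package and command names come from NVIDIA's own install guide, so double-check it for your distro.)

```bash
# Sketch for an apt-based host; assumes NVIDIA's apt repository has already been added
# as described in their install guide.
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Quick sanity check that containers can see the GPUs (the image tag is just an example)
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
```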
I assume enabling it like this makes it available to all containers on the server.
It doesn't have to be mentioned here, but just pointing it out FYI, it would be relevant for weaver-worker as well to run GPU jobs. I'm just not sure if the device syntax is the same in docker-compose since it's been 2-3 years since I've checked this.
If it does indeed work like this, maybe a note that all services on the server share the GPUs could be relevant; they would not (necessarily) be dedicated to Jupyter kernels.
I assume enabling it like this makes it available to all containers on the server.
Yes
It doesn't have to be mentioned here, but just pointing it out FYI, it would be relevant for weaver-worker as well to run GPU jobs.
Yes, I definitely want to figure out how to make this work with Weaver as well. Since the weaver-worker container is not dynamically created, I think we can just add it directly to the weaver-worker definition in the relevant docker-compose-extra.yml file. But that's something I'll have to figure out and work on next.
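My (unverified) understanding is that a recent docker compose accepts the device reservation syntax directly, so the override might look something like this hypothetical docker-compose-extra.yml fragment (service name and values to be confirmed):

```yaml
# Hypothetical, untested override for the weaver-worker service using the Compose
# "device reservations" syntax for GPU access.
services:
  weaver-worker:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all            # or an integer, or device_ids: ["0"]
              capabilities: [gpu]
```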
E2E Test Results

DACCS-iac Pipeline Results
Build URL: http://daccs-jenkins.crim.ca:80/job/DACCS-iac-birdhouse/3710/
Result: ❌ FAILURE
BIRDHOUSE_DEPLOY_BRANCH: gpu-support-documentation
DACCS_IAC_BRANCH: master
DACCS_CONFIGS_BRANCH: master
PAVICS_E2E_WORKFLOW_TESTS_BRANCH: master
PAVICS_SDI_BRANCH: master
DESTROY_INFRA_ON_EXIT: true
PAVICS_HOST: https://host-140-216.rdext.crim.ca

PAVICS-e2e-workflow-tests Pipeline Results
Tests URL: http://daccs-jenkins.crim.ca:80/job/PAVICS-e2e-workflow-tests/job/master/466/
NOTEBOOK TEST RESULTS
Overview
Documentation update describing the steps to enable GPU support for jupyterhub.
Changes
Non-breaking changes
Breaking changes
Related Issue / Discussion
Additional Information
Links to other issues or sources.
CI Operations
birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false