Fix issue 390 support different machine types on gcp #451
Conversation
dask_cloudprovider/gcp/instances.py
Outdated
if ngpus is not None:
    self.scheduler_options["ngpus"] = 0
    self.scheduler_options["gpu_type"] = None
    self.scheduler_options["gpu_instance"] = False
@gmiasnychenko should we always set the scheduler GPU settings, regardless of the number of GPUs?
Also, please leave a comment noting that we don't run tasks on the scheduler, so we don't need a GPU there.
As for setting the GPU settings, I believe the answer is yes. All the settings go into self.options, which is the base for the later self.scheduler_options and self.worker_options. If we don't override the scheduler GPU settings, they keep the values set above, and we end up with the same configuration for both the scheduler and the workers.
I can move the override outside the if statement, if that's what you mean. It provides more clarity, but it should be functionally the same.
I agree with providing more documentation. I will add it to the ngpus and gpu_type argument descriptions.
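For concreteness, a minimal sketch of what moving that override outside the if statement might look like. The attribute names (self.options, self.scheduler_options, self.worker_options) come from this discussion; the placement and surrounding code are assumptions about the implementation, not the actual patch:

# Fragment from inside __init__: both option dicts start from the shared
# self.options, per the discussion above.
self.scheduler_options = {**self.options}
self.worker_options = {**self.options}

# Assumption at this point in the thread: no tasks run on the scheduler, so it
# is kept CPU-only by unconditionally overriding its GPU settings, regardless
# of the worker GPU configuration.
self.scheduler_options["ngpus"] = 0
self.scheduler_options["gpu_type"] = None
self.scheduler_options["gpu_instance"] = False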
@@ -603,7 +613,14 @@ def __init__(
            bootstrap if bootstrap is not None else self.config.get("bootstrap")
        )
        self.machine_type = machine_type or self.config.get("machine_type")
        self.gpu_instance = "gpu" in self.machine_type or bool(ngpus)
        if machine_type is None:
@gmiasnychenko it would be great if we could check that machine_type is set XOR scheduler_machine_type/worker_machine_type; otherwise, we should throw an error. It should be a backward-compatible check.
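A hedged sketch of such a check, assuming the constructor parameters discussed in this thread (machine_type, scheduler_machine_type, worker_machine_type); the placement and error wording are illustrative only:

# Inside __init__: reject mixing the single machine_type with the per-role
# options, while leaving the existing machine_type-only path untouched.
if machine_type is not None and (
    scheduler_machine_type is not None or worker_machine_type is not None
):
    raise ValueError(
        "Specify either machine_type or "
        "scheduler_machine_type/worker_machine_type, not both."
    )

Because the error is raised only when both styles are mixed, existing configurations that set only machine_type keep working, which keeps the check backward compatible.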
- added info on GPU logic in docs
- adjusted scheduler GPU logic
- fixed the machine type checker
Thanks for this, seems like a great improvement!
dask_cloudprovider/gcp/instances.py
Outdated
@@ -445,10 +453,11 @@ class GCPCluster(VMCluster):
    extra_bootstrap: list[str] (optional)
        Extra commands to be run during the bootstrap phase.
    ngpus: int (optional)
        The number of GPUs to attach to the instance.
        The number of GPUs to attach to the worker instance. No work is expected to be done on the scheduler, so no GPUs are attached to it.
This isn't true. Due to the way that Dask uses pickle to move things around, there are cases where the scheduler might deserialize a meta object, which may try to allocate a small amount of GPU memory. It's always recommended to have a small GPU available on the scheduler.
https://docs.rapids.ai/deployment/stable/guides/scheduler-gpu-requirements/
Thank you for the feedback!
While it makes sense to have a GPU on the scheduler to avoid these issues, I think it would be beneficial to allow some flexibility in the configuration. Some users might want different GPU configurations (e.g., a smaller/cheaper GPU on the scheduler vs. more powerful ones on workers), or in some cases might want to explicitly disable scheduler GPUs for cost reasons despite the potential pickle issues.
I've updated the PR to support both approaches:
- Unified configuration (existing behavior): ngpus and gpu_type apply to both the scheduler and the workers
- Separate configuration (new): scheduler_ngpus/scheduler_gpu_type and worker_ngpus/worker_gpu_type for fine-grained control
The default behavior remains the same (the same GPU configuration for both), but users now have the flexibility to choose different configurations when needed. I've also updated the documentation to mention the scheduler GPU requirements you referenced.
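A usage sketch, assuming the separate options above end up as GCPCluster constructor arguments with those names; the GPU types, counts, and worker count are illustrative only:

from dask_cloudprovider.gcp import GCPCluster

# Illustrative only: a small, cheap GPU on the scheduler (enough for occasional
# deserialization of GPU-backed objects) and larger GPUs on the workers.
cluster = GCPCluster(
    scheduler_ngpus=1,
    scheduler_gpu_type="nvidia-tesla-t4",
    worker_ngpus=2,
    worker_gpu_type="nvidia-tesla-v100",
    n_workers=2,
)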
Yup totally agree with all of that!
- Maintain existing GPU logic
- Add ability to specify different GPU configurations for workers and scheduler
One small nitpick.
- Updating the documentation to remove outdated GPU references
As per #390, there is a feature request to allow choosing different machine types on GCP. Here I tried to implement that, and to have only the workers use GPUs.
I used #369 as a reference.
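A usage sketch for the machine-type part of the feature, assuming scheduler_machine_type and worker_machine_type are exposed as GCPCluster constructor arguments as discussed in the review; the machine type values and worker count are illustrative only:

from dask_cloudprovider.gcp import GCPCluster

# Illustrative only: a lightweight scheduler VM and larger worker VMs.
cluster = GCPCluster(
    scheduler_machine_type="n1-standard-4",
    worker_machine_type="n1-standard-16",
    n_workers=4,
)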