
Fix issue 390 support different machine types on gcp #451


Conversation

gmiasnychenko
Contributor

As per #390, there is a feature request to allow choosing different machine types on GCP. Here I tried to implement that and to make only the workers use a GPU.

I used #369 as a reference

if ngpus is not None:
    self.scheduler_options["ngpus"] = 0
    self.scheduler_options["gpu_type"] = None
    self.scheduler_options["gpu_instance"] = False
Collaborator


@gmiasnychenko should we always set the scheduler GPU settings, regardless of the number of GPUs?

Also, please leave a comment noting that we don't run tasks on the scheduler, so we don't need a GPU there.

Contributor Author


As for setting the GPU settings, I believe the answer is yes. All the settings go into self.options, which is the base for the later self.scheduler_options and self.worker_options. If we don't override the scheduler GPU settings, they keep the values set above, and we end up with the same configuration for both the scheduler and the workers.

I can move the overriding outside the if statement, if that's what you mean. It provides more clarity, but functionally it should be the same.

I agree with providing more documentation. I will add it for the ngpus and gpu_type argument descriptions.
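
To illustrate the flow described above, here is a minimal sketch of how a base options dict can feed both the scheduler and worker options, with the scheduler GPU settings overridden unconditionally; the attribute and key names follow this discussion, and the real dask-cloudprovider code may differ in detail:

# Minimal sketch of the options flow discussed above (illustrative only).
base_options = {
    "machine_type": "n1-standard-8",
    "ngpus": 2,
    "gpu_type": "nvidia-tesla-t4",
    "gpu_instance": True,
}

# Scheduler and worker options both start as copies of the base options.
scheduler_options = dict(base_options)
worker_options = dict(base_options)

# Override outside any `if ngpus` check so the intent is explicit:
# no tasks are expected to run on the scheduler in this setup.
scheduler_options["ngpus"] = 0
scheduler_options["gpu_type"] = None
scheduler_options["gpu_instance"] = False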

@@ -603,7 +613,14 @@ def __init__(
            bootstrap if bootstrap is not None else self.config.get("bootstrap")
        )
        self.machine_type = machine_type or self.config.get("machine_type")
        self.gpu_instance = "gpu" in self.machine_type or bool(ngpus)
        if machine_type is None:
Collaborator


@gmiasnychenko it would be great if we could check that machine_type is set XOR scheduler/worker_machine_type; otherwise, we should throw an error. It should be a backwards-compatible (BC) safe check.
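
A possible shape for that check, as a sketch only (the scheduler_machine_type/worker_machine_type names follow this thread; the actual implementation in the PR may differ):

def validate_machine_type_args(machine_type, scheduler_machine_type, worker_machine_type):
    # Accept either the single machine_type (old behaviour, backwards compatible)
    # or the per-role scheduler/worker machine types, but not both.
    per_role = scheduler_machine_type is not None or worker_machine_type is not None
    if machine_type is not None and per_role:
        raise ValueError(
            "Specify either machine_type or "
            "scheduler_machine_type/worker_machine_type, not both."
        )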

- added info on GPU logic in docs
- adjusted scheduler GPU logic
- fixed the machine type checker
Member

@jacobtomlinson jacobtomlinson left a comment


Thanks for this, seems like a great improvement!

@@ -445,10 +453,11 @@ class GCPCluster(VMCluster):
    extra_bootstrap: list[str] (optional)
        Extra commands to be run during the bootstrap phase.
    ngpus: int (optional)
        The number of GPUs to attach to the instance.
        The number of GPUs to attach to the worker instance. No work is expected to be done on scheduler, so no …
Member


This isn't true. Due to the way that Dask uses pickle to move things around, there are cases where the scheduler might deserialize a meta object, which may try to allocate a small amount of GPU memory. It's always recommended to have a small GPU available on the scheduler.

https://docs.rapids.ai/deployment/stable/guides/scheduler-gpu-requirements/

Contributor Author


Thank you for the feedback!

While it makes sense to have a GPU on the scheduler to avoid these issues, I think it would be beneficial to allow some flexibility in the configuration. Some users might want different GPU configurations (e.g., a smaller/cheaper GPU on the scheduler vs. more powerful ones on workers), or in some cases might want to explicitly disable scheduler GPUs for cost reasons despite the potential pickle issues.

I've updated the PR to support both approaches:

  • Unified configuration (existing behavior): ngpus and gpu_type apply to both scheduler and workers
  • Separate configuration (new): scheduler_ngpus/scheduler_gpu_type and worker_ngpus/worker_gpu_type for fine-grained control

The default behavior remains the same (same GPU config for both), but now users have the flexibility to choose different configurations when needed. I've also updated the documentation to mention the scheduler GPU requirements you referenced.
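
To make the two modes concrete, usage could look roughly like this (the scheduler_ngpus/scheduler_gpu_type and worker_ngpus/worker_gpu_type keyword names are taken from the list above; the exact signature merged in the PR may differ, and the GPU type strings are just examples):

from dask_cloudprovider.gcp import GCPCluster

# Unified configuration (existing behaviour): the same GPU settings
# apply to both the scheduler and the workers.
cluster = GCPCluster(ngpus=1, gpu_type="nvidia-tesla-t4")

# Separate configuration (new): a small, cheap GPU on the scheduler
# and more powerful GPUs on the workers.
cluster = GCPCluster(
    scheduler_ngpus=1,
    scheduler_gpu_type="nvidia-tesla-t4",
    worker_ngpus=2,
    worker_gpu_type="nvidia-tesla-v100",
)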

Member


Yup totally agree with all of that!

- Maintain existing GPU logic
- Add ability to specify different GPU configurations for workers and scheduler
Member

@jacobtomlinson jacobtomlinson left a comment


One small nitpick.

- Updating the documentation to get rid of old GPUs
@jacobtomlinson jacobtomlinson merged commit 0e2bdbf into dask:main Jun 3, 2025
5 of 7 checks passed