
Scheduler GPU requirement #236

Open

jacobtomlinson opened this issue Mar 31, 2023 · 2 comments

jacobtomlinson commented Mar 31, 2023

Just a heads-up that with recent changes in distributed (dask/distributed#7564) a GPU is now mandatory on the scheduler if the client/workers have GPUs.

However, it can be a lesser GPU, provided it has a compatible CUDA compute capability (RAPIDS needs 6.0+; other libraries may vary). So I can see folks configuring workers with A100s and schedulers with T4s to optimize cost.
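For anyone wanting to check whether a given GPU clears that bar, here's a minimal sketch using numba.cuda (just one way to do it; pynvml or nvidia-smi would work too):

```python
# Minimal sketch: query the CUDA compute capability of the local GPU,
# e.g. to verify it meets the RAPIDS 6.0+ floor mentioned above.
# Requires a CUDA-capable GPU and numba installed.
from numba import cuda

major, minor = cuda.get_current_device().compute_capability
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (6, 0), "GPU too old for RAPIDS"
```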

Currently, the scheduler_gpu kwarg in coiled.Cluster is a boolean, so in theory you could set worker_gpu=1, scheduler_gpu=False, which will break things when trying to use that cluster going forward.
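For illustration, this is the combination that would now produce a broken cluster (a sketch using the kwargs as they exist today):

```python
import coiled

# Currently accepted, but with the new distributed requirement this
# yields GPU workers paired with a CPU-only scheduler, which breaks
# GPU workloads on the cluster.
cluster = coiled.Cluster(
    worker_gpu=1,
    scheduler_gpu=False,
)
```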

I would suggest that if worker_gpu is set then scheduler_gpu must effectively always be True, so maybe that kwarg should be removed altogether.

It would be nice to add a new argument called scheduler_gpu_type instead so that users could set something like worker_gpu=1, worker_gpu_type="nvidia-tesla-a100", scheduler_gpu_type="nvidia-tesla-t4".
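As a sketch of how that proposal could look (scheduler_gpu_type is the suggested new kwarg, not something that exists yet):

```python
import coiled

# Proposed API sketch, not the current coiled.Cluster signature.
# The idea: pair a cheap scheduler GPU with high-end worker GPUs.
cluster = coiled.Cluster(
    worker_gpu=1,
    worker_gpu_type="nvidia-tesla-a100",
    scheduler_gpu_type="nvidia-tesla-t4",  # proposed kwarg
)
```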

ntabris (Member) commented Mar 31, 2023

Thanks! @fjetter also gave me a heads up about this.

Assuming you're using the cluster in normal ways (not, e.g., using the scheduler as a notebook host), is there any reason that a T4 wouldn't be good enough? Our scheduler_gpu kwarg always adds a T4.

I very much look forward to the public docs where this is explained! People will definitely want to understand this requirement.

jacobtomlinson (Author) commented
> is there any reason that T4 wouldn't be good enough?

I can only speak for RAPIDS, but a T4 on the scheduler is probably going to be a good bet for the majority of users, so making it the default would be totally reasonable. The H100 and L4 are on the horizon, though, so I expect once those are generally available it will be common to pair a T4 with V100/A100 and an L4 with H100, due to CUDA compute capability compatibility (what a mouthful).

I'm not sure whether there would be implications with pairing a T4 with an H100. But we can worry about that later.

As you say, if you set jupyter=True you might want the scheduler GPU to also be high-end, but not necessarily. I could imagine folks setting n_workers=0, jupyter=True to get a T4 for some initial lower-performance interactive exploration of subsets of data, then calling cluster.scale(n) when it's time to use the full dataset and really push things.
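A sketch of that workflow (kwargs as discussed above; the worker count is arbitrary):

```python
import coiled

# Start with no workers: the scheduler's GPU (a T4, per the discussion
# above) hosts Jupyter for low-cost exploration of a data subset.
cluster = coiled.Cluster(n_workers=0, jupyter=True, worker_gpu=1)

# ...explore interactively on the scheduler's GPU...

# Scale out to GPU workers when it's time for the full dataset.
cluster.scale(10)  # hypothetical worker count
```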

> I very much look forward to the public docs where this is explained! People will definitely want to understand this requirement.

Right now we have this page, which is useful but will need updating for the new, stricter requirements. @fjetter, @rjzamora and I are also drafting a blog post announcing this change, which will go on https://blog.dask.org, along with some docs for distributed to go with it. I expect these will be available with the Dask 2023.4.0 release.
