Scheduler GPU requirement #236
Thanks! @fjetter also gave me a heads up about this. Assuming you're using the cluster in normal ways (not, e.g., using the scheduler as a notebook host), is there any reason a T4 wouldn't be good enough? I very much look forward to the public docs where this is explained! People will definitely want to understand this requirement.
I can only speak for RAPIDS, but a T4 on the scheduler is probably a good bet for the majority of users, so making it the default would be totally reasonable. The H100 and L4 are on the horizon though, so I expect once those are generally available it would be common to pair a T4 scheduler with V100/A100 workers and an L4 scheduler with H100 workers, due to CUDA compute capability compatibility (what a mouthful). I'm not sure whether there would be implications with pairing a T4 with an H100, but we can worry about that later.
Right now we have this page, which is useful but will need updating with the new, harder requirements. @fjetter, @rjzamora and I are also drafting a blog post which will go on https://blog.dask.org announcing this change, along with some docs for distributed to go with it. I expect these should be available with the Dask release.
Just a heads up that with recent changes in distributed (dask/distributed#7564) a GPU is now mandatory on the scheduler if the client/workers have GPUs.
However, it can be a lesser GPU provided that it has compatible CUDA compute capabilities (RAPIDS needs CCC 6.0+, other libraries may vary). So I can see folks configuring workers with A100s and schedulers with T4s to optimize cost.
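To make the compatibility rule concrete, here is a minimal sketch of the check being described. The compute capability values come from NVIDIA's published specs; the 6.0 floor is the RAPIDS requirement mentioned above, and the helper name and dict are illustrative, not part of any real API:

```python
# Hypothetical helper: check whether a scheduler GPU meets a minimum
# CUDA compute capability (RAPIDS requires 6.0+; other libraries may vary).
COMPUTE_CAPABILITY = {
    "nvidia-tesla-t4": 7.5,
    "nvidia-tesla-v100": 7.0,
    "nvidia-tesla-a100": 8.0,
    "nvidia-l4": 8.9,
    "nvidia-h100": 9.0,
}


def scheduler_gpu_ok(gpu_type: str, minimum: float = 6.0) -> bool:
    """Return True if the named GPU meets the minimum compute capability."""
    return COMPUTE_CAPABILITY[gpu_type] >= minimum
```

Under this rule a T4 (compute capability 7.5) comfortably clears the RAPIDS 6.0 floor, which is why pairing A100 workers with a cheaper T4 scheduler works.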
Currently, the `scheduler_gpu` kwarg in `coiled.Cluster` is a boolean, and in theory you could set `worker_gpu=1, scheduler_gpu=False`, which will break things when trying to use that cluster going forwards. I would suggest that if `worker_gpu` is set then `scheduler_gpu` must be set to `True`, so maybe that kwarg should be removed altogether.

It would be nice to add a new argument called `scheduler_gpu_type` instead, so that users could set something like `worker_gpu=1, worker_gpu_type="nvidia-tesla-a100", scheduler_gpu_type="nvidia-tesla-t4"`.
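The validation being proposed can be sketched as follows. This is not the actual `coiled.Cluster` implementation, just a hypothetical standalone check mirroring the rule "GPU workers imply a GPU scheduler":

```python
# Hypothetical validation mirroring the proposal above (not the real
# coiled API): reject GPU workers paired with an explicitly GPU-less
# scheduler, and otherwise default the scheduler GPU on.

def validate_scheduler_gpu(worker_gpu=0, scheduler_gpu=None):
    """Return whether the scheduler should get a GPU.

    Raises ValueError for the broken combination worker_gpu + scheduler_gpu=False.
    """
    if worker_gpu and scheduler_gpu is False:
        raise ValueError(
            "worker_gpu requires a scheduler GPU; "
            "scheduler_gpu=False would break this cluster"
        )
    # If workers have GPUs, the scheduler must have one too.
    return bool(worker_gpu) or bool(scheduler_gpu)
```

With a rule like this in place, the boolean kwarg becomes redundant, which supports the suggestion of replacing it with a `scheduler_gpu_type` argument.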