
Launching GPU with nvidia runtime #284

Open
aimran-adroll opened this issue Jun 11, 2024 · 12 comments

@aimran-adroll

aimran-adroll commented Jun 11, 2024

I would like to be able to launch notebooks using containers with nvidia runtime.

It'd be good to know if it's supported before I spend time preparing an image with the additional Dask requirements.

@mrocklin
Member

Hey @aimran-adroll , I suspect that the answer is "yes", although you might also be interested in recent GPU developments in Coiled over the last couple of months (package sync works, better GPU metrics, etc.). If you're game, it might be good to have you talk to @jrbourbeau, who did a bunch of this work. I'll bet that he could point you in some fruitful directions. If that's interesting, send me a note offline and we'll set something up.

cc'ing @ntabris to give the definitive "yes that's fine" to your stated question though

@ntabris
Member

ntabris commented Jun 11, 2024

Yes, that's fine. The VMs have the NVIDIA Container Toolkit installed, so your containers can see and use the GPU via the NVIDIA driver + CUDA.
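If you want a quick local sanity check that GPU passthrough works on a machine with the toolkit installed, something like this is a common smoke test (the CUDA base image tag here is just an example; any CUDA image with `nvidia-smi` works):

```shell
# Ask Docker to expose all GPUs to the container and run nvidia-smi inside it.
# If the NVIDIA Container Toolkit and driver are set up correctly, this prints
# the familiar driver/GPU table; otherwise it errors out.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```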

@aimran-adroll
Author

aimran-adroll commented Jun 11, 2024

Thanks both @ntabris and @mrocklin

I will give it a go. I suspect my first attempt failed since it did not have the obvious dask/jupyter related packages 🤦🏽‍♂️

Super exciting to be able to launch gpu notebooks

@ntabris
Member

ntabris commented Jun 11, 2024

FYI this doc says what our docker run command needs, so you can validate the container locally if you want.

@aimran-adroll
Author

This little Dockerfile did not work:

FROM nvcr.io/nvidia/merlin/merlin-tensorflow:nightly

WORKDIR /src

RUN pip install -U pip
RUN pip install dask coiled ipykernel ipython dask-labextension jupyterlab matplotlib

Locally it passed the check that @ntabris mentioned

❯ docker run --rm nvidia-merlin python -m distributed.cli.dask_spec \
        --spec '{"cls":"dask.distributed.Scheduler", "opts":{}}'

Command to launch the notebook:

coiled notebook start --vm-type g5.xlarge --container redacted.dkr.ecr.us-west-2.amazonaws.com/aitest/nv-merlin:latest --region us-west-2 --name ai-tf

Gist of the error:

coiled.errors.ClusterCreationError: Cluster status is error (reason: Scheduler Stopped -> Software environment exited with error code 1.) (cluster_id: 494802)

@ntabris
Member

ntabris commented Jun 12, 2024

Ah, sorry, this isn't easy to spot, but I think the problem is a mismatch between the image and VM architecture. When I dig into the (not super easy to find) logs, I see this:

dask The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64/v3) and no specific platform was requested
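A rough sketch of how to confirm and fix an image/VM architecture mismatch (the `nvidia-merlin` tag mirrors the one used above; the `buildx` invocation assumes a reasonably recent Docker):

```shell
# Check which platform the local image was built for
docker image inspect nvidia-merlin --format '{{.Os}}/{{.Architecture}}'

# Rebuild explicitly for the VM's architecture. For example, on an
# Apple Silicon laptop the default build is linux/arm64, while the
# g5.xlarge VM is linux/amd64.
docker buildx build --platform linux/amd64 -t nvidia-merlin .
```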

@aimran-adroll
Author

Thanks for the quick debugging. 🚀

aside: We need a cloud startup that lets you modify/build/push Docker images in the cloud on just the right machine 😄

By the time you are done pushing a 7 GB image over a residential network, you have forgotten what you wanted to do in the first place.

@mrocklin
Member

I'd be curious to learn more about why you want to use Docker in the first place. My guess is that either there's a piece of software that you're trying to distribute that isn't in a convenient conda repository, or that it's just very culturally entrenched. If that wasn't the reason, I'd probably want to question the choice of Docker and see if there is some other approach we could facilitate.

@aimran-adroll
Author

Great question.

It's a fairly typical workflow for us/me. I want to try a new ML (or whatever) package. I have no idea what the dependencies are (especially since it involves CUDA and a magical mix of different packages). The exact source recipe is not always easy to track down. I also have to weigh the upfront time investment.

In these scenarios, a Docker container is the perfect answer to my conundrum: quick and easy to evaluate something new.

@mrocklin
Member

So, for common ML packages (PyTorch, TensorFlow, XGBoost, ...) we've been teaching package sync how to do the translation between CPU and GPU versions. So if your package mostly depends on those (say you want to use some Hugging Face transformers package), then the answer is that you just conda install it on your local machine and then have Coiled spin up a cluster with GPUs attached. Coiled notices the change in architecture, swaps out the relevant packages, and has the conda solver fill in any gaps.

It's pretty magical.

If there were some other baseline GPU package that you needed (say, JAX) that didn't already have this treatment, then we could add it. The main reason not to use package sync in this case is if there is some GPU package for which there is no CPU equivalent, and that you couldn't install on a non-GPU machine.
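As a rough sketch of that workflow: install the package locally with conda, then request a GPU cluster and let package sync translate the environment. The keyword arguments below follow Coiled's `coiled.Cluster` API, but treat the specific values (worker count, instance type, region) as placeholders, not recommendations:

```python
import coiled

# Locally: `conda install pytorch` (the CPU build is fine on a laptop).
# Package sync replicates the local environment on the cluster VMs,
# swapping in GPU builds where they exist.
cluster = coiled.Cluster(
    n_workers=2,
    worker_vm_types=["g5.xlarge"],  # placeholder GPU instance type
    region="us-west-2",
)
client = cluster.get_client()
```

Note this requires a Coiled account and cloud credentials, so it won't run standalone.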

@aimran-adroll
Author

wow. that does sound magical

🏃🏽‍♂️ trying it now
