Launching GPU with nvidia runtime #284
Hey @aimran-adroll, I suspect the answer is "yes", although you might also be interested in recent GPU developments in Coiled over the last couple of months (package sync works, better GPU metrics, etc.). If you're game, it might be good to have you talk to @jrbourbeau, who did a bunch of this work. I'll bet he could point you in some fruitful directions. If that's interesting, send me a note offline and we'll set something up. cc'ing @ntabris to give the definitive "yes, that's fine" to your stated question, though.
Yes, that's fine. The VMs have the NVIDIA Container Toolkit, so you can use containers that see and use the GPU via the NVIDIA driver + CUDA.
FYI, this doc describes what our docker run command needs, so you can validate the container locally if you want.
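For a quick local sanity check along those lines, something like the following sketch runs `nvidia-smi` inside a CUDA base image via the NVIDIA Container Toolkit. The image tag and helper name are illustrative assumptions, not the exact command from the linked doc:

```python
import subprocess

# Illustrative local check: run `nvidia-smi` inside a CUDA base image to
# confirm the container can see the GPU. Requires Docker plus the NVIDIA
# Container Toolkit on the host; nothing runs until the function is called.
CHECK_CMD = [
    "docker", "run", "--rm", "--gpus", "all",
    "nvidia/cuda:12.2.0-base-ubuntu22.04",  # example tag; pick one matching your driver
    "nvidia-smi",
]

def container_sees_gpu() -> bool:
    """Return True if nvidia-smi ran successfully inside the container."""
    return subprocess.run(CHECK_CMD).returncode == 0
```

If `nvidia-smi` prints its usual GPU table, the runtime wiring works locally; `--gpus all` is the standard flag the NVIDIA Container Toolkit hooks into.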
This little Dockerfile did not work.
Locally it passed the check that @ntabris mentioned.
Command to launch the notebook:
Gist of the error:
Ah, sorry, this isn't easy to spot, but I think the problem is a mismatch between the image and VM architectures. When I dig into the (not super easy to find) logs, I see this:
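That kind of image/VM architecture mismatch (e.g. an arm64 image built on an Apple Silicon laptop, run on an amd64 VM) can be caught before pushing. A minimal sketch, with the mapping table as my own assumption rather than anything from the thread:

```python
import platform

# Docker names architectures amd64/arm64, while platform.machine() reports
# x86_64/aarch64 (or arm64 on macOS); normalize so the two can be compared.
MACHINE_TO_DOCKER_ARCH = {
    "x86_64": "amd64",
    "AMD64": "amd64",    # Windows
    "aarch64": "arm64",
    "arm64": "arm64",    # macOS on Apple Silicon
}

def host_docker_arch() -> str:
    """Docker-style architecture name for the current build host."""
    return MACHINE_TO_DOCKER_ARCH.get(platform.machine(), "unknown")
```

To force a build for the VM's architecture regardless of the host, `docker build --platform linux/amd64 ...` (or `docker buildx`) is the usual fix.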
Thanks for the quick debugging. 🚀 Aside: we need a cloud startup that lets you modify/build/push a Docker image in the cloud on just the right machine 😄 By the time I'm done pushing a 7 GB image over a residential network, I've forgotten what I wanted to do in the first place.
I'd be curious to learn more about why you want to use Docker in the first place. My guess is that either there's a piece of software you're trying to distribute that isn't in a convenient conda repository, or it's just very culturally entrenched. If that wasn't the reason, I'd probably want to question the choice of Docker and see if there is some other approach we could facilitate.
Great question. It's a fairly typical workflow for us/me: I want to try a new ML (or whatever) package. I have no idea what the dependencies are (especially when CUDA is involved, with its magical mix of different packages). The exact source recipe is not always easy to track down, and I have to weigh the upfront time investment. In these scenarios, a Docker container is a perfect answer to my conundrum: quick and easy to evaluate something new.
So, for common ML packages (PyTorch, TensorFlow, XGBoost, ...) we've been teaching package sync how to translate between CPU and GPU versions. If your package mostly depends on those (say you want to use some Hugging Face transformers package), then you just conda install it on your local machine and have Coiled spin up a cluster with GPUs attached. Coiled notices the change in architecture, swaps out the relevant packages, and has the conda solver fill in any gaps. It's pretty magical. If there were some other baseline GPU package you needed (say, JAX) that didn't already have this treatment, we could add it. The main reason not to use package sync in this case is if there is some GPU package for which there is no CPU equivalent, which you couldn't install on a non-GPU machine.
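The workflow described above might look roughly like this sketch. The `worker_gpu` parameter is assumed from Coiled's public `Cluster` API and is not verbatim from the thread, so check the current docs before relying on it:

```python
def launch_gpu_cluster(n_workers: int = 2):
    """Sketch: spin up a Coiled cluster with one GPU per worker and let
    package sync mirror the local conda environment onto the workers.

    Assumption: `worker_gpu` is taken from Coiled's documented Cluster
    API; verify against current Coiled docs.
    """
    import coiled  # imported lazily so the sketch doesn't require coiled installed

    cluster = coiled.Cluster(n_workers=n_workers, worker_gpu=1)
    return cluster
```

You'd then connect a `dask.distributed.Client` to the returned cluster and submit work as usual; package sync handles the CPU-to-GPU package translation described in the comment above.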
Wow, that does sound magical 🏃🏽♂️ Trying it now.
I would like to be able to launch notebooks using containers with the nvidia runtime. It'd be good to know if it's supported before I spend time preparing an image with additional Dask requirements.