
Implementations of PyTorchLightningEstimator.create_predictor do not allow explicit device selection #3208

Open
tmwly opened this issue Jul 31, 2024 · 1 comment
Labels
bug Something isn't working

tmwly commented Jul 31, 2024

Description

Implementations of PyTorchLightningEstimator.create_predictor, such as DeepAREstimator.create_predictor and SimpleFeedForwardEstimator.create_predictor, pass device="auto" to the constructor of PyTorchPredictor.

This means that if a model is trained using the cpu accelerator, but a GPU is present and set up on the machine, the returned predictor will be loaded onto the GPU.

This can be observed when constructing the estimator and then calling Estimator.train().

What happens is a call to Estimator.train_model, which finishes with a call to create_predictor; a sketch of that path's behaviour is below.
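
The crux is that "auto" is resolved when the predictor is constructed, based on what hardware is visible rather than on which accelerator was used for training. Roughly (a minimal sketch of that resolution behaviour; gluonts' internal logic may differ in detail):

import torch

# Minimal sketch of how a device="auto" setting typically resolves;
# not the exact gluonts source.
def resolve_device(device: str) -> torch.device:
    if device == "auto":
        # picks the GPU whenever one is visible, regardless of which
        # accelerator was used for training
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.device(device)

print(resolve_device("auto"))  # -> "cuda" on any machine with a visible GPU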

To Reproduce

This can be shown using the following code sample (adapted from the gluonts tutorials) and nvidia-smi.

from gluonts.dataset.repository import get_dataset
from gluonts.torch import SimpleFeedForwardEstimator

dataset = get_dataset("m4_hourly")

# Train explicitly on the CPU accelerator
estimator = SimpleFeedForwardEstimator(
    prediction_length=dataset.metadata.prediction_length,
    context_length=100,
    trainer_kwargs={
        "max_epochs": 1,
        "accelerator": "cpu",
        "devices": "auto",
    },
)

predictor = estimator.train(dataset.train)

Error message or code output

Calling nvidia-smi after running this code will indicate that the process is running on a GPU, something like the following:

Wed Jul 31 14:54:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   28C    P0    37W / 150W |    721MiB /  7618MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P8    15W / 150W |      6MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                  3MiB |
|    0   N/A  N/A      6479      C   ...ython/ts_torch/bin/python      714MiB |
|    1   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                  3MiB |
+-----------------------------------------------------------------------------+
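
The same can be checked from Python (hedged: this assumes PyTorchPredictor stores its network as prediction_net, which matches the versions I looked at, though attribute names may differ between releases):

# Inspect the device of the trained network's parameters
print(next(predictor.prediction_net.parameters()).device)
# -> cuda:0 here, even though training used accelerator="cpu"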

Workaround

Setting the CUDA_VISIBLE_DEVICES environment variable to -1 before code execution should prevent the auto-detection of any CUDA GPU.

This has side effects, as no further code in the process can detect GPUs.

This might not work for other device types; I have not tested it myself.

This must be done before gluonts is imported.
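
For example (a minimal sketch; per the note above, the variable has to be set before gluonts is imported):

import os

# Hide all CUDA devices from this process; this has to run before
# gluonts (and therefore torch) is imported
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from gluonts.torch import SimpleFeedForwardEstimator  # safe to import now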

Possible fix

I have had a quick look at the codebase (as a new arrival here, so I might be missing some stuff), and from a naive perspective it seems that this could be a fairly light-touch fix:

  • Adding a device parameter with a default value of "auto" to the create_predictor method
  • Either:
    1: passing the kwargs present in the Estimator.train() function args down through to train_model(), and adding a specific kwarg for the predictor device (predictor_device)
    2: doing something smart with the (if present) trainer_kwargs.accelerator inside the train_model() function

I'm not certain that option 2 is good, as I don't think there is a 1-to-1 mapping between lightning accelerators and torch device types; a sketch of option 1 is below.
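
As a rough sketch of option 1 (every name here is hypothetical; the device parameter and predictor_device kwarg do not exist in the current gluonts API):

# Hypothetical signatures only, not the current gluonts API
class SomeEstimator(PyTorchLightningEstimator):
    def create_predictor(self, transformation, module, device="auto"):
        return PyTorchPredictor(
            ...,  # existing predictor arguments unchanged
            device=device,  # forwarded instead of hard-coded "auto"
        )

Estimator.train(..., predictor_device="cpu") would then thread the requested device down through train_model() to this call.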

I'm happy to do the work to implement the fix, but I'd first be curious to know whether anyone has ideas for a nicer approach.

Environment

  • Operating system: Ubuntu 20.04.6 LTS (Focal Fossa)
  • Python version: 3.9.17
  • GluonTS version: tested with 0.13.9 (which detected the device inline, but the logic is the same for recently released versions)
  • MXNet version: N/A

Let me know if you have any questions about my setup; I'll be happy to help.

Thanks!

tmwly added the bug label on Jul 31, 2024
tmwly commented Jul 31, 2024

(Apologies, depending on how you look at it, this might actually be more of a feature request than a bug)
