
Implementations of PyTorchLightningEstimator.create_predictor do not allow explicit device selection #3208

Open
tmwly opened this issue Jul 31, 2024 · 1 comment
Labels
bug Something isn't working

tmwly commented Jul 31, 2024

Description

Implementations of PyTorchLightningEstimator.create_predictor, such as DeepAREstimator.create_predictor and SimpleFeedForwardEstimator.create_predictor, pass device="auto" to the constructor of PyTorchPredictor.

This means that if a model is trained using the cpu accelerator, but a GPU is present and set up on the machine, the returned predictor will be loaded onto the GPU.

This can be observed when constructing the estimator and then calling Estimator.train().

What happens is a call to Estimator.train_model, which finishes with a call to create_predictor; a sketch of that path's behaviour is below.
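
The crux is that "auto" is resolved when the predictor is constructed, based on what hardware is visible rather than on which accelerator was used for training. Roughly (a minimal sketch of that resolution behaviour; gluonts' internal logic may differ in detail):

import torch

# Minimal sketch of how a device="auto" setting typically resolves;
# not the exact gluonts source.
def resolve_device(device: str) -> torch.device:
    if device == "auto":
        # picks the GPU whenever one is visible, regardless of which
        # accelerator was used for training
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")
    return torch.device(device)

print(resolve_device("auto"))  # -> "cuda" on any machine with a visible GPU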

To Reproduce

This can be shown using the following code sample (adapted from the gluonts tutorials) and nvidia-smi.

from gluonts.dataset.repository import get_dataset
from gluonts.torch import SimpleFeedForwardEstimator

dataset = get_dataset("m4_hourly")

# Train explicitly on the CPU accelerator
estimator = SimpleFeedForwardEstimator(
    prediction_length=dataset.metadata.prediction_length,
    context_length=100,
    trainer_kwargs={
        "max_epochs": 1,
        "accelerator": "cpu",
        "devices": "auto",
    },
)

predictor = estimator.train(dataset.train)

Error message or code output

Calling nvidia-smi after running this code will indicate that the process is running on a GPU, something like the following:

Wed Jul 31 14:54:10 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   28C    P0    37W / 150W |    721MiB /  7618MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P8    15W / 150W |      6MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                  3MiB |
|    0   N/A  N/A      6479      C   ...ython/ts_torch/bin/python      714MiB |
|    1   N/A  N/A      1186      G   /usr/lib/xorg/Xorg                  3MiB |
+-----------------------------------------------------------------------------+
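
The same can be checked from Python (hedged: this assumes PyTorchPredictor stores its network as prediction_net, which matches the versions I looked at, though attribute names may differ between releases):

# Inspect the device of the trained network's parameters
print(next(predictor.prediction_net.parameters()).device)
# -> cuda:0 here, even though training used accelerator="cpu"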

Workaround

Setting the CUDA_VISIBLE_DEVICES environment variable to -1 before code execution should prevent the auto-detection of any CUDA GPU.

This has side effects, as no further code in the process can detect GPUs.

This might not work for other device types; I have not tested it myself.

This must be done before gluonts is imported.
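
For example (a minimal sketch; per the note above, the variable has to be set before gluonts is imported):

import os

# Hide all CUDA devices from this process; this has to run before
# gluonts (and therefore torch) is imported
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

from gluonts.torch import SimpleFeedForwardEstimator  # safe to import now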

Possible fix

I have had a quick look at the codebase (as a new arrival here, so I might be missing some stuff), and from a naive perspective it seems that this could be a fairly light-touch fix:

  • Adding a device parameter with a default value of "auto" to the create_predictor method
  • Either:
    1: passing the kwargs present in the Estimator.train() function args down through to train_model(), and adding a specific kwarg for the predictor device (predictor_device)
    2: doing something smart with the (if present) trainer_kwargs.accelerator inside the train_model() function

I'm not certain that option 2 is good, as I don't think there is a 1-to-1 mapping between lightning accelerators and torch device types; a sketch of option 1 is below.
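
As a rough sketch of option 1 (every name here is hypothetical; the device parameter and predictor_device kwarg do not exist in the current gluonts API):

# Hypothetical signatures only, not the current gluonts API
class SomeEstimator(PyTorchLightningEstimator):
    def create_predictor(self, transformation, module, device="auto"):
        return PyTorchPredictor(
            ...,  # existing predictor arguments unchanged
            device=device,  # forwarded instead of hard-coded "auto"
        )

Estimator.train(..., predictor_device="cpu") would then thread the requested device down through train_model() to this call.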

I'm happy to do the work to implement the fix, but I'd first be curious to know whether anyone has ideas for a nicer approach.

Environment

  • Operating system: Ubuntu 20.04.6 LTS (Focal Fossa)
  • Python version: 3.9.17
  • GluonTS version: tested with 0.13.9 (which detected the device inline, but the logic is the same for recently released versions)
  • MXNet version: N/A

Let me know if you have any questions about my setup; I'll be happy to help.

Thanks!

tmwly added the bug label on Jul 31, 2024
tmwly commented Jul 31, 2024

(Apologies, depending on how you look at it, this might actually be more of a feature request than a bug)
