4/5 trial fails due to lack of memory #4010
Comments
Hey @diegotxegp, Are you able to try setting Regarding GPU usage - is your |
Thank you for your quick response. The point is that I am running it automatically with AutoML. After the error was raised, I added "os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"" as you suggested, and I set "max_current_trials" as follows, but it made little difference: Code: from ludwig.automl import auto_train AutoML config: { 'eval_split': 'validation', |
Describe the bug
4 out of 5 trials fail due to lack of memory. I have 4 RTX 2080 Super GPUs (8 GB each) and 64 GB of RAM, but AutoML does not seem to recognize my GPUs or make full use of them.
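A quick way to check what the training process can actually see is to print the value of CUDA_VISIBLE_DEVICES alongside the device count that PyTorch reports. The sketch below is only a diagnostic aid: the helper name describe_gpu_visibility is hypothetical, and it assumes PyTorch is the backend in use. Note that CUDA_VISIBLE_DEVICES generally has to be set before any CUDA-using library is imported for it to take effect.

```python
import os


def describe_gpu_visibility():
    """Report which GPUs this process is allowed to see (hypothetical helper)."""
    # None means the variable is unset, i.e. no restriction is applied.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    try:
        import torch  # assumption: PyTorch is installed alongside Ludwig
    except ImportError:
        return {"cuda_visible_devices": visible, "torch_device_count": None}
    return {
        "cuda_visible_devices": visible,
        "torch_device_count": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    print(describe_gpu_visibility())
```

On the machine described here one would expect a device count of 4. The log line "Logical resource usage: 0/20 CPUs, 1.0/4 GPUs" also suggests that Ray Tune registered four GPUs, so the devices may be visible even though only one trial ran.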
To Reproduce
Use the "Rotten Tomatoes" example from the Ludwig AI website. If you have more than one GPU, you should be able to reproduce this error.
from ludwig.automl import auto_train

# self.df holds the Rotten Tomatoes dataset (an attribute of the reporter's own class)
auto_train_results = auto_train(
    dataset=self.df,
    target="recommended",
    time_limit_s=7200,
)
Expected behavior
All 5 trials should run and produce different results, rather than a single completed trial and 4 errors due to lack of memory.
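One pattern is visible in the trial table of the log output: the only trial that finished used trainer.batch_size 64, while all four errored trials used 1024. A minimal sketch that groups the reported trials by status, with the values copied by hand from that table (the helper name batch_sizes_by_status is hypothetical):

```python
from collections import defaultdict

# Values copied by hand from the trial table in the log output.
TRIALS = [
    ("trial_78e53127", "TERMINATED", 64),
    ("trial_6a7803f9", "ERROR", 1024),
    ("trial_6950b4ec", "ERROR", 1024),
    ("trial_6225efbb", "ERROR", 1024),
    ("trial_5d372a47", "ERROR", 1024),
]


def batch_sizes_by_status(trials):
    """Map each trial status to the sorted batch sizes seen with it."""
    grouped = defaultdict(set)
    for _name, status, batch_size in trials:
        grouped[status].add(batch_size)
    return {status: sorted(sizes) for status, sizes in grouped.items()}


print(batch_sizes_by_status(TRIALS))
# → {'TERMINATED': [64], 'ERROR': [1024]}
```

This correlation does not prove that the batch size is the cause; the per-trial error.txt files would be needed to confirm it.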
Screenshots
Trial trial_78e53127 completed after 11 iterations at 2024-05-28 13:24:32. Total running time: 21min 24s
Trial status: 4 ERROR | 1 TERMINATED
Current time: 2024-05-28 13:24:32. Total running time: 21min 24s
Logical resource usage: 0/20 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:G)
Current best trial: 78e53127 with metric_score=0.9420865774154663 and params={'trainer.learning_rate': 2.2103375806114728e-05, 'trainer.batch_size': 64, 'combiner.num_fc_layers': 1, 'combiner.output_size': 128, 'combiner.dropout': 0.012855425737772442}
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name       status       ...ner.learning_rate   trainer.batch_size   ...ner.num_fc_layers   combiner.output_size   combiner.dropout   iter   total time (s)   metric_score │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trial_78e53127   TERMINATED            2.21034e-05                   64                      1                    128          0.0128554     11           1259.7       0.942087 │
│ trial_6a7803f9   ERROR                 3.10601e-05                 1024                      3                    256          0.0093055                                        │
│ trial_6950b4ec   ERROR                 0.000337902                 1024                      2                    256           0.086701                                        │
│ trial_6225efbb   ERROR                 0.000705436                 1024                      1                    128          0.0393212                                        │
│ trial_5d372a47   ERROR                 0.000517778                 1024                      3                    128          0.0782563                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Number of errored trials: 4
╭────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name       # failures   error file                                                               │
├────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trial_6a7803f9            1   /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6a7803f9/error.txt │
│ trial_6950b4ec            1   /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6950b4ec/error.txt │
│ trial_6225efbb            1   /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6225efbb/error.txt │
│ trial_5d372a47            1   /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_5d372a47/error.txt │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────╯
2024-05-28 13:24:32,620 ERROR tune.py:1144 -- Trials did not complete: [trial_6a7803f9, trial_6950b4ec, trial_6225efbb, trial_5d372a47]
2024-05-28 13:24:32,631 WARNING experiment_analysis.py:916 -- Failed to read the results for 4 trials:
/home/diego/.local/lib/python3.10/site-packages/ludwig/automl/automl.py:286: UserWarning: There was an error running the experiment. A trial failed to start. Consider increasing the time budget for experiment.
warnings.warn(
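The summary above only says that the trials failed; the actual exception for each trial should be in the error.txt files listed in the error table. A small sketch for printing the tail of each of those files, assuming the directory layout shown in the log (hyperopt/trial_*/error.txt). The helper name collect_trial_errors and the tail_lines parameter are hypothetical:

```python
from pathlib import Path


def collect_trial_errors(hyperopt_dir, tail_lines=20):
    """Return {trial directory name: last lines of its error.txt}."""
    root = Path(hyperopt_dir)
    if not root.is_dir():
        return {}
    errors = {}
    for error_file in sorted(root.glob("trial_*/error.txt")):
        lines = error_file.read_text(errors="replace").splitlines()
        # Keep only the end of the traceback, where the exception usually is.
        errors[error_file.parent.name] = "\n".join(lines[-tail_lines:])
    return errors


if __name__ == "__main__":
    # Path taken from the error table in the log; adjust to your own output directory.
    hyperopt_dir = "/home/diego/VSProjects/Rotten-Tomatoes/hyperopt"
    for trial, tail in collect_trial_errors(hyperopt_dir).items():
        print(f"=== {trial} ===\n{tail}\n")
```

Posting the tail of one of these files would show whether the failure is a CUDA out-of-memory error or something else.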
Environment (please complete the following information):
Additional context
The idea is to use AutoML because it makes automatic configuration easy.