Replies: 1 comment
Are the GPUs at 100% or 0%? Please check the GPU logs. And could you try running it on a single GPU? Just select a single one in the environment settings.
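As a minimal sketch for that check (not part of h2o-llmstudio itself; it assumes the `pynvml` package from `nvidia-ml-py` is installed), you could poll per-GPU utilization while the experiment appears to hang:

```python
# Diagnostic sketch: is each GPU sitting at 0% or 100% while the run stalls?
# Assumes the pynvml package (nvidia-ml-py) is installed.
import time
import pynvml

pynvml.nvmlInit()
try:
    n = pynvml.nvmlDeviceGetCount()
    for _ in range(10):  # sample roughly every 5 seconds for ~50 seconds
        for i in range(n):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: util={util.gpu}% mem={mem.used / 1024**3:.1f} GiB")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```

Running `nvidia-smi` in a loop gives the same information if you prefer the CLI.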
-
I repeatedly get the logs below after the model, dataset, etc. are set up and started successfully when running a fine-tuning experiment on h2o-llmstudio.
Running this on 4x/6x A100s for a 30B model produces the log below and then a stale session, as the ETA keeps increasing.
2023-10-28 18:25:34,405 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2023-10-28 18:25:34,434 - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
2023-10-28 18:25:34,443 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-28 18:25:34,445 - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
2023-10-28 18:25:34,445 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,446 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,454 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,454 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,486 - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-28 18:25:34,487 - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
2023-10-28 18:25:34,487 - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
2023-10-28 18:25:34,487 - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
2023-10-28 18:25:34,487 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,487 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total: 4 local rank: 3.
2023-10-28 18:25:34,497 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,497 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
2023-10-28 18:25:34,497 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,497 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,497 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total: 4 local rank: 1.
2023-10-28 18:25:34,497 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
2023-10-28 18:25:35,475 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,475 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,475 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,476 - INFO: Global random seed: 409062
2023-10-28 18:25:35,476 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,476 - INFO: Preparing the data...
2023-10-28 18:25:35,476 - INFO: Setting up automatic validation split...
2023-10-28 18:25:35,752 - INFO: Preparing train and validation data
2023-10-28 18:25:35,753 - INFO: Loading train dataset...
2023-10-28 18:25:37,635 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958])]
2023-10-28 18:25:37,790 - INFO: Loading validation dataset...
2023-10-28 18:25:38,235 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958])]
2023-10-28 18:25:38,237 - INFO: Number of observations in train dataset: 14289
2023-10-28 18:25:38,238 - INFO: Number of observations in validation dataset: 145
2023-10-28 18:25:38,371 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,375 - INFO: Using int4 for backbone
2023-10-28 18:25:38,394 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,399 - INFO: Using int4 for backbone
2023-10-28 18:25:38,409 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,413 - INFO: Using int4 for backbone
2023-10-28 18:25:38,848 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958], device='cuda:0')]
2023-10-28 18:25:38,848 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,852 - INFO: Using int4 for backbone
2023-10-28 18:34:24,786 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2023-10-28 18:34:25,117 - INFO: Enough space available for saving model weights.
2023-10-28 18:34:25,856 - INFO: Training Epoch: 1 / 7
2023-10-28 18:34:25,857 - INFO: train loss: 0%| | 0/1786 [00:00
2023-10-28 18:34:26,849 - INFO: Evaluation step: 1786
2023-10-28 18:34:26,849 - INFO: Evaluation step: 1786
2023-10-28 18:34:26,916 - INFO: Evaluation step: 1786
2023-10-28 18:34:26,954 - INFO: Evaluation step: 1786
2023-10-28 18:34:27,059 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958])]
2023-10-28 18:34:32,200 - INFO: Reducer buckets have been rebuilt in this iteration.
2023-10-28 18:34:32,199 - INFO: Reducer buckets have been rebuilt in this iteration.
2023-10-28 18:34:32,200 - INFO: Reducer buckets have been rebuilt in this iteration.
2023-10-28 18:34:32,200 - INFO: Reducer buckets have been rebuilt in this iteration.
After this, no more entries appear for hours. Any ideas? Is there anything that needs to be configured beyond the defaults when using multiple GPUs?
With a different model on a single GPU, the workflow works fine.
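One thing I could try to narrow this down (a sketch using standard NCCL / PyTorch environment variables, not LLM Studio settings) is enabling verbose distributed logging in the environment that launches the experiment, so a hang in the NCCL collectives would show up in the logs:

```python
# Sketch only: turn on verbose distributed logging before launching the
# experiment. These are generic NCCL / PyTorch environment variables,
# not h2o-llmstudio configuration options.
import os

os.environ["NCCL_DEBUG"] = "INFO"                  # log NCCL init and collective calls
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"   # extra DDP consistency checks
```

The same variables can be exported in the shell before starting LLM Studio so they propagate to the training processes.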