Replies: 1 comment
Are the GPUs at 100% or 0%? Please check the GPU logs. And could you try running it on a single GPU? Just select a single one in the environment settings.
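As a minimal sketch for that check (not part of h2o-llmstudio itself; it assumes the `pynvml` package from `nvidia-ml-py` is installed), you could poll per-GPU utilization while the experiment appears to hang:

```python
# Diagnostic sketch: is each GPU sitting at 0% or 100% while the run stalls?
# Assumes the pynvml package (nvidia-ml-py) is installed.
import time
import pynvml

pynvml.nvmlInit()
try:
    n = pynvml.nvmlDeviceGetCount()
    for _ in range(10):  # sample roughly every 5 seconds for ~50 seconds
        for i in range(n):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            print(f"GPU {i}: util={util.gpu}% mem={mem.used / 1024**3:.1f} GiB")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()
```

Running `nvidia-smi` in a loop gives the same information if you prefer the CLI.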
-
I repeatedly get the logs below after the model, dataset, etc. are set up and started successfully when running a fine-tuning experiment on h2o-llmstudio.
Running this on 4x/6x A100s for a 30B model produces the log below and then a stale session, as the ETA keeps increasing.
2023-10-28 18:25:34,405 - INFO: Added key: store_based_barrier_key:1 to store for rank: 1
2023-10-28 18:25:34,434 - INFO: Added key: store_based_barrier_key:1 to store for rank: 3
2023-10-28 18:25:34,443 - INFO: Added key: store_based_barrier_key:1 to store for rank: 0
2023-10-28 18:25:34,445 - INFO: Added key: store_based_barrier_key:1 to store for rank: 2
2023-10-28 18:25:34,445 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,446 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,454 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,454 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
2023-10-28 18:25:34,486 - INFO: Added key: store_based_barrier_key:2 to store for rank: 0
2023-10-28 18:25:34,487 - INFO: Added key: store_based_barrier_key:2 to store for rank: 1
2023-10-28 18:25:34,487 - INFO: Added key: store_based_barrier_key:2 to store for rank: 2
2023-10-28 18:25:34,487 - INFO: Added key: store_based_barrier_key:2 to store for rank: 3
2023-10-28 18:25:34,487 - INFO: Rank 3: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,487 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 3, total: 4 local rank: 3.
2023-10-28 18:25:34,497 - INFO: Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,497 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total: 4 local rank: 0.
2023-10-28 18:25:34,497 - INFO: Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,497 - INFO: Rank 2: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
2023-10-28 18:25:34,497 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total: 4 local rank: 1.
2023-10-28 18:25:34,497 - INFO: Training in distributed mode with multiple processes, 1 GPU per process. Process 2, total: 4 local rank: 2.
2023-10-28 18:25:35,475 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,475 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,475 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,476 - INFO: Global random seed: 409062
2023-10-28 18:25:35,476 - WARNING: No OpenAI API Key set. Setting metric to BLEU.
2023-10-28 18:25:35,476 - INFO: Preparing the data...
2023-10-28 18:25:35,476 - INFO: Setting up automatic validation split...
2023-10-28 18:25:35,752 - INFO: Preparing train and validation data
2023-10-28 18:25:35,753 - INFO: Loading train dataset...
2023-10-28 18:25:37,635 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958])]
2023-10-28 18:25:37,790 - INFO: Loading validation dataset...
2023-10-28 18:25:38,235 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958])]
2023-10-28 18:25:38,237 - INFO: Number of observations in train dataset: 14289
2023-10-28 18:25:38,238 - INFO: Number of observations in validation dataset: 145
2023-10-28 18:25:38,371 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,375 - INFO: Using int4 for backbone
2023-10-28 18:25:38,394 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,399 - INFO: Using int4 for backbone
2023-10-28 18:25:38,409 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,413 - INFO: Using int4 for backbone
2023-10-28 18:25:38,848 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958], device='cuda:0')]
2023-10-28 18:25:38,848 - INFO: Setting pretraining_tp of model config to 1.
2023-10-28 18:25:38,852 - INFO: Using int4 for backbone
2023-10-28 18:34:24,786 - INFO: Lora module names: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
2023-10-28 18:34:25,117 - INFO: Enough space available for saving model weights.
2023-10-28 18:34:25,856 - INFO: Training Epoch: 1 / 7
2023-10-28 18:34:25,857 - INFO: train loss: 0%| | 0/1786 [00:00
2023-10-28 18:34:26,849 - INFO: Evaluation step: 1786
2023-10-28 18:34:26,849 - INFO: Evaluation step: 1786
2023-10-28 18:34:26,916 - INFO: Evaluation step: 1786
2023-10-28 18:34:26,954 - INFO: Evaluation step: 1786
2023-10-28 18:34:27,059 - INFO: Stop token ids: [tensor([ 529, 29989, 5205, 29989, 29958])]
2023-10-28 18:34:32,200 - INFO: Reducer buckets have been rebuilt in this iteration.
2023-10-28 18:34:32,199 - INFO: Reducer buckets have been rebuilt in this iteration.
2023-10-28 18:34:32,200 - INFO: Reducer buckets have been rebuilt in this iteration.
2023-10-28 18:34:32,200 - INFO: Reducer buckets have been rebuilt in this iteration.
After this, no more entries appear for hours. Any ideas? Is there anything that needs to be configured beyond the defaults when using multiple GPUs?
With a different model on a single GPU, the workflow works fine.
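One thing I could try to narrow this down (a sketch using standard NCCL / PyTorch environment variables, not LLM Studio settings) is enabling verbose distributed logging in the environment that launches the experiment, so a hang in the NCCL collectives would show up in the logs:

```python
# Sketch only: turn on verbose distributed logging before launching the
# experiment. These are generic NCCL / PyTorch environment variables,
# not h2o-llmstudio configuration options.
import os

os.environ["NCCL_DEBUG"] = "INFO"                  # log NCCL init and collective calls
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"   # extra DDP consistency checks
```

The same variables can be exported in the shell before starting LLM Studio so they propagate to the training processes.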