
Multi-GPU training cannot match single-GPU training #17

Open · hxu105 opened this issue Sep 10, 2024 · 5 comments

@hxu105 commented Sep 10, 2024

Howdy,

I have tried to reproduce the experiments and encountered some issues with the training setup. I chose "lmsys/vicuna-7b-v1.5-16k" as my base model, integrated it with the ND or HO template (a 2-layer MLP as the adapter), and trained the model on the Cora dataset for 1 epoch. However, single-GPU training is substantially better than multi-GPU training. The results are in the table below.

| Template | Single GPU | Multi-GPU |
| --- | --- | --- |
| ND | 0.7265 | 0.6368 |
| HO | 0.7618 | 0.6985 |
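For concreteness, here is a minimal sketch of what I mean by the 2-layer MLP adapter (the dimensions are placeholders I picked for illustration, not the repo's actual values):

```python
import torch
import torch.nn as nn

# Minimal sketch of a 2-layer MLP adapter that projects graph-token
# embeddings into the LLM's input embedding space.
# graph_dim, hidden_dim, and llm_dim are placeholder assumptions.
class MLPAdapter(nn.Module):
    def __init__(self, graph_dim=1024, hidden_dim=4096, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(graph_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, graph_tokens: torch.Tensor) -> torch.Tensor:
        # graph_tokens: (batch, num_graph_tokens, graph_dim)
        return self.proj(graph_tokens)
```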

All other training configurations are identical between the single- and multi-GPU settings: lr = 2e-3, per-device batch size = 12 on a single GPU, and per-device batch size = 4 on 3 GPUs.
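To make the arithmetic explicit (assuming gradient_accumulation_steps is 1 in both runs, which I have not verified from the configs):

```python
# Effective batch size per optimizer step = per_device_batch * num_gpus * grad_accum.
# grad_accum = 1 is an assumption here, not a value confirmed from the configs.
single_gpu = 12 * 1 * 1
multi_gpu = 4 * 3 * 1
assert single_gpu == multi_gpu == 12  # both setups should see 12 samples per step
```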

Could you help take a look at it?

Many thanks,

HX

@ChenRunjin (Collaborator)

In my previous experiments, I usually obtained very similar results when training on single or multiple GPUs. However, when training on the Cora dataset, I typically trained for 3-5 epochs (you can set the dataset name to cora.3). I suspect the issue may arise because the model hasn't fully converged when you only train for 1 epoch on smaller datasets, though I'm not entirely certain about this.

@hxu105 (Author) commented Sep 16, 2024

I see. However, the PubMed dataset suffers from the same problem: the model can reach more than 90% accuracy when trained on a single GPU, but only around 85% when trained on multiple GPUs. What number of epochs would you suggest for the multi-GPU setting? Many thanks!

@ChenRunjin (Collaborator)

Hi, it's a bit strange—I haven't encountered this issue before. On my end, the PubMed dataset can achieve 95% accuracy on node classification with just 1 epoch, but it takes about 5 epochs for link prediction to reach top performance. In my case, the performance between multi-GPU and single-GPU setups is nearly the same. Have you experienced similar issues in other DeepSpeed experiments?
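One thing that might be worth double-checking is whether the DeepSpeed batch settings are really equivalent between the two runs, since DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size. A quick sanity check along these lines (the config filename and world size are placeholders for your actual setup):

```python
import json

# Sanity-check a DeepSpeed config against the intended effective batch size.
# "ds_config.json" and world_size = 3 are placeholders for the actual setup.
with open("ds_config.json") as f:
    cfg = json.load(f)

world_size = 3
micro = cfg["train_micro_batch_size_per_gpu"]
accum = cfg.get("gradient_accumulation_steps", 1)
print("effective batch size:", micro * accum * world_size)
print("declared train_batch_size:", cfg.get("train_batch_size"))
```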

@hxu105 (Author) commented Sep 17, 2024

Got it! I was referring to the link prediction task. So for link prediction, does the model usually need more epochs to converge to a good optimum? Does it also need more epochs for larger datasets like Products and Arxiv? Many thanks!

@ChenRunjin (Collaborator)

For the link prediction task, if you are training on a single small dataset, I recommend 8 epochs for Cora and 5 epochs for PubMed. For larger datasets like Arxiv and Products, 1 epoch is sufficient. If you are training on multiple datasets, the model can leverage information across datasets, so I would suggest using the combination arxiv-products-pubmed-cora.3.
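In case the naming convention is unclear, here is a sketch of how I understand the dataset spec (illustrative only, not the repo's actual parser): "-" joins datasets, and a ".N" suffix repeats that dataset N times in the training mix, which is how small graphs like Cora get extra passes:

```python
# Sketch of the dataset-spec convention (illustrative, not the repo's parser):
# "-" joins datasets; a ".N" suffix repeats that dataset N times.
def parse_dataset_spec(spec: str) -> list[str]:
    mix = []
    for part in spec.split("-"):
        name, _, rep = part.partition(".")
        mix.extend([name] * (int(rep) if rep else 1))
    return mix

print(parse_dataset_spec("arxiv-products-pubmed-cora.3"))
# -> ['arxiv', 'products', 'pubmed', 'cora', 'cora', 'cora']
```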
