20B pretrained model inference OOM on 8xA100 40GB #901
Comments
Hey @Mutinifni, here's a snippet for DeepSpeed:
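A minimal sketch of the DeepSpeed-MII route, assuming the EleutherAI/gpt-neox-20b HuggingFace checkpoint and the legacy MII deploy/query API; the deployment name, tensor-parallel degree, and generation parameters here are illustrative, and config keys vary across MII versions:

```python
import mii

# Shard the fp16 weights across 2 GPUs: ~40 GB of parameters split two
# ways fits on A100 40GB cards with room left for activations.
mii.deploy(
    task="text-generation",
    model="EleutherAI/gpt-neox-20b",
    deployment_name="gpt-neox-20b-deploy",  # hypothetical name
    mii_config={"dtype": "fp16", "tensor_parallel": 2},
)

# Query the local deployment.
generator = mii.mii_query_handle("gpt-neox-20b-deploy")
result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=64)
print(result)
```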
and for accelerate, something like:
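A minimal sketch of the accelerate route, again assuming the EleutherAI/gpt-neox-20b HuggingFace checkpoint; `device_map="auto"` requires accelerate to be installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# device_map="auto" lets accelerate place layers across every visible GPU,
# spilling to CPU RAM only if the GPUs still don't have enough memory.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Inputs go on the first device; accelerate's hooks move activations
# between shards during generation.
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```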
Thanks @satpalsr! DeepSpeed MII worked for me (with just 2 GPUs). I'd like to ask a follow-up question to understand this a bit better: based on the DeepSpeed MII latency/cost analysis, it looks like DeepSpeed MII performs much better than the baseline (presumably HuggingFace Transformers). Is there any reason to prefer the HuggingFace model for deployment? Do DeepSpeed MII or accelerate underperform with larger GPU deployments, or are they strictly better?
Our understanding is that it's strictly better. We're working on replacing our current inference backend with it.
Describe the bug
Inference using the 20B pretrained model from the README, with slim weights and the 20B.yml config, runs out of memory on 8xA100 40GB GPUs. I tried varying pipe-parallel-size and model-parallel-size over [1, 2, 4], but no combination worked. I also ran into an OOM with the huggingface 20B model, based on the code in this repo (from #782).
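For reference, the parallelism knobs I was varying live in the 20B.yml config; a sketch of the relevant lines (the values shown are just one of the combinations I tried):

```yaml
# parallelism settings in configs/20B.yml
"pipe-parallel-size": 4,
"model-parallel-size": 2,
```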
To Reproduce
python ./deepy.py generate.py -d configs 20B.yml local_setup.yml text_generation.yml
Expected behavior
Inference should not OOM (training OOM is expected).
Proposed solution
Not sure.
Environment (please complete the following information):
Should I try any different configuration options to get these to work? Thanks!