
Discrepancy in predicted runtimes for some configs #3

Open
wants to merge 2 commits into base: main
Conversation


@hakesh729 hakesh729 commented Jun 24, 2024

Context: We are trying to obtain Proteus runtime predictions for a V100, 32-GPU setup (4 nodes with 8 GPUs per node) for the GPT-3 2.7B model, across different configurations of PP degree, TP degree (aka MP degree), and ZeRO. We also made some changes to support a higher PP degree in the PP strategy, treated macro-batches as micro-batches, etc. More details about our changes are given in the other comments.
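
For concreteness, here is a minimal sketch (not from the repo; the candidate degrees are illustrative) of the (TP, PP, DP) configuration space implied by this 32-GPU setup:

```python
# Hedged sketch: enumerate (TP, PP, DP) splits of a 32-GPU cluster (4 nodes x 8 GPUs),
# i.e. the configuration space referred to above. The candidate degree sets are
# illustrative assumptions, not taken from the repo.
WORLD_SIZE = 4 * 8
for tp in (1, 2, 4, 8):                 # tensor (model) parallel degree
    for pp in (1, 2, 4, 8, 16, 32):     # pipeline parallel degree
        if WORLD_SIZE % (tp * pp) == 0:
            dp = WORLD_SIZE // (tp * pp)  # data parallel degree fills the remainder
            print(f"TP={tp} PP={pp} DP={dp}")
```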

We also added the relevant H100 topology files to test Proteus with H100 as well; please ignore those files and the other external/nccl file changes.

With these changes to the megatron_gpt code, we are observing a roughly 10x gap between the actual runtime and the predicted time. The megatron_gpt Python scripts specifying the config details are given in examples/fail_config1.sh and examples/fail_config2.sh. A screenshot of the observed result for one of these configs is attached:

[screenshot: observed vs. predicted runtime for one of the failing configs]

@AgrawalAmey

@JF-D Hakesh has been trying to obtain Proteus predictions for our paper. We have incorporated the fixes etc. discussed over email. The only additional changes here are related to supporting pipeline parallelism (with virtual stages) and micro-batching. Please share any inputs you have. We want to ensure that we are not introducing any errors with our changes. Thank you!

@JF-D
Owner

JF-D commented Jun 25, 2024

Have you tried checking the dumped trace? Proteus will export a trace that can be visualized in chrome tracing by setting profile=True (here). From the trace you can check whether the partitioning and scheduling are as you expected.
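
As a reference, a minimal sketch of how such a dump could be inspected offline. The file name and the assumption that the export follows the standard Chrome trace event format are mine, not confirmed by the repo:

```python
# Hedged sketch: summarize a Chrome-tracing JSON dump per process/thread (device/stream).
# "proteus_trace.json" is an assumed file name; the Chrome trace format uses "X"
# (complete) events carrying a duration "dur" in microseconds.
import json
from collections import defaultdict

with open("proteus_trace.json") as f:
    trace = json.load(f)

# Chrome traces are either a bare list of events or {"traceEvents": [...]}.
events = trace["traceEvents"] if isinstance(trace, dict) else trace

busy_us = defaultdict(float)
for ev in events:
    if ev.get("ph") == "X":  # complete event: begin timestamp "ts" plus duration "dur"
        busy_us[(ev.get("pid"), ev.get("tid"))] += ev.get("dur", 0.0)

for (pid, tid), us in sorted(busy_us.items()):
    print(f"pid={pid} tid={tid}: busy {us / 1e6:.2f} s")
```

Comparing per-stream busy time against the predicted iteration time can reveal whether the schedule contains unexpected idle gaps or duplicated work.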

Could you also provide a copy of the cost model profiling result? I don't have access to V100 or H100 GPUs currently.

@hakesh729
Author

hakesh729 commented Sep 4, 2024

Hi @JF-D,

I apologize for the delayed response. Due to some summer commitments, we had to take a break from the project. We have now resumed work on the issue and would like to share the Proteus trace for a simple configuration (different from the one described in the issue) where we observe a 5x discrepancy between the predicted and actual runtime.

Drive link for traces and other config related details:
https://drive.google.com/drive/folders/1g0zJVgt5LUGnQ1QvbieOMjsIulExUh2U?usp=sharing

I’ve provided a public Google Drive link that includes the Proteus trace for the V100 configuration. Specifically, the Proteus runtime is approximately 166 seconds per iteration, while the actual runtime is around 30 seconds per iteration—about a 5x difference for the config mentioned. The trace covers five steps (similar to Proteus iterations), so the total time you’ll see in the trace is around 150 seconds. I’ve also included our own trace for comparison, which I believe is easier to interpret, along with a PDF detailing the configuration.

Unfortunately, the Proteus Chrome trace is challenging to analyze, so your assistance in pinpointing where things might be going wrong would be invaluable. If you need any further information to understand our trace, please let me know. We are eager to resolve this issue and greatly appreciate your help.

Thank you!

