
Discrepancy in predicted runtimes for some configs #3

Open
wants to merge 2 commits into base: main
Conversation


@hakesh729 hakesh729 commented Jun 24, 2024

Context: We are trying to obtain Proteus runtime predictions for a V100, 32-GPU setup (4 nodes with 8 GPUs per node) for the GPT-3 2.7B model, across different configurations of PP degree, TP degree (aka MP degree), and ZeRO. We also made some changes to support a higher PP degree in the PP strategy, treated macro-batches as micro-batches, etc. More details about our changes are given in the other comments.
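
For concreteness, here is a minimal sketch (not from the repo; the candidate degrees are illustrative) of the (TP, PP, DP) configuration space implied by this 32-GPU setup:

```python
# Hedged sketch: enumerate (TP, PP, DP) splits of a 32-GPU cluster (4 nodes x 8 GPUs),
# i.e. the configuration space referred to above. The candidate degree sets are
# illustrative assumptions, not taken from the repo.
WORLD_SIZE = 4 * 8
for tp in (1, 2, 4, 8):                 # tensor (model) parallel degree
    for pp in (1, 2, 4, 8, 16, 32):     # pipeline parallel degree
        if WORLD_SIZE % (tp * pp) == 0:
            dp = WORLD_SIZE // (tp * pp)  # data parallel degree fills the remainder
            print(f"TP={tp} PP={pp} DP={dp}")
```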

We also added the relevant H100 topology files to test Proteus with H100 as well; please ignore those files and the other external/nccl file changes.

With these changes to the megatron_gpt code, we are observing a roughly 10x gap between the actual runtime and the predicted time. The megatron_gpt Python scripts specifying the config details are given in examples/fail_config1.sh and examples/fail_config2.sh. A screenshot of the observed result for one of these configs is attached:

[screenshot: observed vs. predicted runtime for one of the failing configs]

@AgrawalAmey

@JF-D Hakesh has been trying to obtain Proteus predictions for our paper. We have incorporated the fixes etc. discussed over email. The only additional changes here are related to supporting pipeline parallelism (with virtual stages) and micro-batching. Please share any inputs you have. We want to ensure that we are not introducing any errors with our changes. Thank you!

@JF-D
Owner

JF-D commented Jun 25, 2024

Have you tried checking the dumped trace? Proteus will export a trace that can be visualized in chrome tracing by setting profile=True (here). From the trace you can check whether the partitioning and scheduling are as you expected.
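
As a reference, a minimal sketch of how such a dump could be inspected offline. The file name and the assumption that the export follows the standard Chrome trace event format are mine, not confirmed by the repo:

```python
# Hedged sketch: summarize a Chrome-tracing JSON dump per process/thread (device/stream).
# "proteus_trace.json" is an assumed file name; the Chrome trace format uses "X"
# (complete) events carrying a duration "dur" in microseconds.
import json
from collections import defaultdict

with open("proteus_trace.json") as f:
    trace = json.load(f)

# Chrome traces are either a bare list of events or {"traceEvents": [...]}.
events = trace["traceEvents"] if isinstance(trace, dict) else trace

busy_us = defaultdict(float)
for ev in events:
    if ev.get("ph") == "X":  # complete event: begin timestamp "ts" plus duration "dur"
        busy_us[(ev.get("pid"), ev.get("tid"))] += ev.get("dur", 0.0)

for (pid, tid), us in sorted(busy_us.items()):
    print(f"pid={pid} tid={tid}: busy {us / 1e6:.2f} s")
```

Comparing per-stream busy time against the predicted iteration time can reveal whether the schedule contains unexpected idle gaps or duplicated work.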

Could you also provide a copy of the cost model profiling result? I don't have access to V100 or H100 GPUs currently.

@hakesh729
Author

hakesh729 commented Sep 4, 2024

Hi @JF-D,

I apologize for the delayed response. Due to some summer commitments, we had to take a break from the project. We have now resumed work on the issue and would like to share the Proteus trace for a simple configuration (different from the one described in the issue) where we observe a 5x discrepancy between the predicted and actual runtime.

Drive link for traces and other config related details:
https://drive.google.com/drive/folders/1g0zJVgt5LUGnQ1QvbieOMjsIulExUh2U?usp=sharing

I’ve provided a public Google Drive link that includes the Proteus trace for the V100 configuration. Specifically, the Proteus runtime is approximately 166 seconds per iteration, while the actual runtime is around 30 seconds per iteration—about a 5x difference for the config mentioned. The trace covers five steps (similar to Proteus iterations), so the total time you’ll see in the trace is around 150 seconds. I’ve also included our own trace for comparison, which I believe is easier to interpret, along with a PDF detailing the configuration.

Unfortunately, the Proteus Chrome trace is challenging to analyze, so your assistance in pinpointing where things might be going wrong would be invaluable. If you need any further information to understand our trace, please let me know. We are eager to resolve this issue and greatly appreciate your help.

Thank you!

