Discrepancy in predicted runtimes for some configs #3
Context: We are trying to get Proteus runtime predictions for a V100, 32-GPU load (4 nodes with 8 GPUs per node) for the GPT3-2.7B model, with different configurations involving PP degree, TP degree (aka MP degree), and ZeRO. We also made some changes to support a higher PP degree in the PP strategy, used macro-batches as micro-batches, etc. More details about the changes we made will be given in other comments.
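For reference, the parallelism degrees in any of these configurations must multiply out to the 32-GPU world size; a minimal sketch of that constraint (the specific TP/PP values below are illustrative, not our exact failing configs):

```python
NODES, GPUS_PER_NODE = 4, 8
WORLD_SIZE = NODES * GPUS_PER_NODE  # 32 GPUs total

def data_parallel_degree(tp, pp, world_size=WORLD_SIZE):
    """Remaining data-parallel degree once TP and PP degrees are fixed."""
    assert world_size % (tp * pp) == 0, "TP * PP must divide the world size"
    return world_size // (tp * pp)

# e.g. TP=2 and PP=4 leave DP=4 model replicas across the 32 GPUs
print(data_parallel_degree(2, 4))  # -> 4
```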
Also, we added the relevant H100 topology files so we can test Proteus with H100 as well; please ignore those files and the other external/nccl file changes.
With these changes to the megatron_gpt code, we are observing a 10x gap between the actual runtime and the predicted time. The megatron_gpt Python invocations specifying the config details are given in examples/fail_config1.sh and examples/fail_config2.sh. A screenshot of the observed result for one of these configs is attached: