
Improving communication overlap for the case of multi kernel queue usage #1308

Open
wants to merge 12 commits into main

Conversation

youngeunkwon0405

Description

The current TP overlap relies on a single kernel queue (CUDA_DEVICE_MAX_CONNECTIONS=1) to order kernel launches and thereby control compute-communication overlap; when multiple kernel queues are used, this ordering is no longer guaranteed and the overlap is lost.

This PR instead enforces launch ordering between the communication kernel and the compute kernel using the launch completion event feature, ensuring the overlap regardless of the number of kernel queues.

This feature is specific to Hopper and applies only to bulk overlap cases.
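
For reference, here is a minimal, hedged sketch of how a launch completion event can enforce this ordering with the CUDA runtime launch-attribute API (available since CUDA 12.3). The kernel names, streams, and launch sizes below are made up for illustration; the SETUP_LAUNCH_CONFIG_WITH_COMPLETION_EVENT macro in this PR is assumed to wrap a similar mechanism, not necessarily this exact code:

#include <cuda_runtime.h>

__global__ void comm_kernel() { /* communication work (illustrative) */ }
__global__ void compute_kernel() { /* GEMM-like compute work (illustrative) */ }

void launch_with_ordering(cudaStream_t comm_stream, cudaStream_t compute_stream) {
  // Event that fires once the communication kernel has *launched*
  // (not completed). Must be created without timing.
  cudaEvent_t comm_launch_event;
  cudaEventCreateWithFlags(&comm_launch_event, cudaEventDisableTiming);

  // Attach the launch completion event to the communication kernel launch.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeLaunchCompletionEvent;
  attr.val.launchCompletionEvent.event = comm_launch_event;

  cudaLaunchConfig_t cfg{};
  cfg.gridDim = dim3(16);        // e.g. number of communication SMs
  cfg.blockDim = dim3(32 * 32);  // e.g. warps * 32 threads
  cfg.stream = comm_stream;
  cfg.attrs = &attr;
  cfg.numAttrs = 1;
  cudaLaunchKernelEx(&cfg, comm_kernel);

  // The compute stream only waits until the communication kernel has been
  // scheduled onto the device, so both kernels can still run concurrently
  // even when multiple hardware queues (CUDA_DEVICE_MAX_CONNECTIONS > 1)
  // would otherwise allow the compute kernel to start first.
  cudaStreamWaitEvent(compute_stream, comm_launch_event, 0);
  compute_kernel<<<112, 128, 0, compute_stream>>>();

  cudaDeviceSynchronize();
  cudaEventDestroy(comm_launch_event);
}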

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@youngeunkwon0405
Author

@erhoo82 Hi Sangkug, this is a PR for launch ordering work. Could you please assign a reviewer?

Comment on lines +1905 to +1911
if (comm_launch_event) {
  SETUP_LAUNCH_CONFIG_WITH_COMPLETION_EVENT(sms, warps * 32, stream, comm_launch_event);
  callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)
} else {
  SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);
  callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)
}
Collaborator

Same here for duplicated kernel launch code.

Suggested change

(current)
if (comm_launch_event) {
  SETUP_LAUNCH_CONFIG_WITH_COMPLETION_EVENT(sms, warps * 32, stream, comm_launch_event);
  callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)
} else {
  SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);
  callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)
}

(suggested)
if (comm_launch_event) {
  SETUP_LAUNCH_CONFIG_WITH_COMPLETION_EVENT(sms, warps * 32, stream, comm_launch_event);
} else {
  SETUP_LAUNCH_CONFIG(sms, warps * 32, stream);
}
callranks_rs_oop_fp8(2) callranks_rs_oop_fp8(4) callranks_rs_oop_fp8(8)

Author

Hi @denera, the suggested coding style causes a compile error, which is why I had to duplicate the kernel launch.
Since both SETUP_LAUNCH_CONFIG and callranks_** are preprocessor macros, there is a variable scope issue: the compute kernel call must sit in the same scope as (or a scope nested inside) the one in which the SETUP macro declares its launch configuration. The same issue applies to the other comments. If you have a better solution, please let me know.
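
To illustrate the scope problem, here is a simplified, hypothetical sketch (not the actual TE macros; SETUP_CONFIG and CALL_RANKS below stand in for SETUP_LAUNCH_CONFIG* and callranks_*): the SETUP macro declares a local launch configuration that the call macro references by name, so hoisting the call out of the if/else leaves it referring to a variable that is no longer in scope.

#include <cuda_runtime.h>

__global__ void my_kernel() {}

// Hypothetical stand-ins for the real macros: SETUP_CONFIG declares a local
// `cfg`, and CALL_RANKS expands to a launch that uses that same `cfg`.
#define SETUP_CONFIG(sms, threads, stream) \
  cudaLaunchConfig_t cfg{};                \
  cfg.gridDim = dim3(sms);                 \
  cfg.blockDim = dim3(threads);            \
  cfg.stream = (stream)

#define CALL_RANKS() cudaLaunchKernelEx(&cfg, my_kernel)

void example(int sms, int warps, cudaStream_t stream, bool use_event) {
  if (use_event) {
    SETUP_CONFIG(sms, warps * 32, stream);
    // ... attach the completion event to cfg here ...
    CALL_RANKS();   // OK: `cfg` is visible in this branch
  } else {
    SETUP_CONFIG(sms, warps * 32, stream);
    CALL_RANKS();   // OK: a separate `cfg` local to this branch
  }
  // CALL_RANKS();  // would not compile: `cfg` went out of scope with the braces
}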

@denera
Collaborator

denera commented Nov 4, 2024

@youngeunkwon0405 The TP overlap unit tests explicitly set CUDA_DEVICE_MAX_CONNECTIONS=1 in tests/pytorch/distributed/test_comm_gemm_overlap.py:43. Could you update this to not set the environment variable for Hopper so the changes in this PR are tested in our CI?

Also please launch the L1 tests with /te-ci pytorch L1 when you update the unit tests. Thanks!

@youngeunkwon0405
Author

youngeunkwon0405 commented Nov 7, 2024


Hi @denera, I have updated the test_comm_gemm_overlap.py file in the latest commit. Does it meet your expectations?

Also, could you please elaborate on the following? I am new to writing tests and to the CI process.

please launch the L1 tests with /te-ci pytorch L1 when you update the unit tests.

I have run only the modified test case; the result is below.
============================= test session starts ==============================
platform linux -- Python 3.10.12, pytest-8.1.1, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/workspace/.hypothesis/examples')
rootdir: /lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk
plugins: xdoctest-1.0.2, typeguard-4.3.0, xdist-3.6.1, shard-0.1.2, rerunfailures-14.0, mock-3.14.0, flakefinder-1.1.0, hypothesis-5.35.1, hydra-core-1.3.2, anyio-4.4.0
collecting ... collected 6 items
Running 6 items in this shard: tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 1 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 1 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 1 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 8 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 8 connections], tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 8 connections]

../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 1 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 1 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 1 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[ALL-GATHER - BF16 - 8 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - BF16 - 8 connections] PASSED
../lustre/fsw/coreai_dlalgo_llm/youngeunk/nemo/nemo.dev/mount/TransformerEngine-youngeunk/tests/pytorch/distributed/test_comm_gemm_overlap.py::test_bulk_overlaps[REDUCE-SCATTER - FP8 - 8 connections] PASSED

========================= 6 passed in 93.35s (0:01:33) =========================

@denera
Collaborator

denera commented Nov 14, 2024

/te-ci pytorch L1

@denera left a comment (Collaborator)

LGTM, pending rebase on latest TE/main and clean CI results.

@youngeunkwon0405
Author

@denera Rebased onto the latest main. Could you please let me know what the next step would be?
