
add memory snapshot callback #2788

Merged · 32 commits merged into mosaicml:dev · Feb 1, 2024

Conversation

@cli99 (Contributor) commented on Dec 19, 2023

This PR adds a callback to capture a memory snapshot. When the callback is enabled, an HTML file with an interactive visualization is generated. https://github.com/pytorch/pytorch.github.io/blob/site/assets/images/understanding-gpu-memory-1/snapshot.html shows an example visualization.

To enable the callback, add it to the callbacks section of the YAML config file (this also requires the accompanying change in llm-foundry, mosaicml/llm-foundry#810):

callbacks:
    memory_snapshot:
        skip_batches: 2
        interval: 3
        folder: traces
        overwrite: true
        filename: "rank{rank}.{batch}.pt.trace.memory_snapshot.html"
        remote_file_name: "oci://bucket_name/{run_name}/torch_memory_snapshots/rank{rank}.{batch}.pt.trace.memory_snapshot.html"
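
For context, here is a minimal sketch of the underlying PyTorch memory-snapshot APIs that this kind of callback builds on. It is not the Composer implementation; the dump_memory_snapshot helper, the output path, and the max_entries value are illustrative assumptions.

# Minimal sketch of the PyTorch memory-snapshot workflow (an approximation, not Composer's code).
import torch
from torch.cuda._memory_viz import trace_plot  # renders a snapshot as interactive HTML

def dump_memory_snapshot(path: str = "traces/rank0.3.pt.trace.memory_snapshot.html") -> None:
    # Start recording allocator events (with stack traces) for later inspection.
    torch.cuda.memory._record_memory_history(max_entries=100_000)

    # ... run the batches to capture here (e.g. skip `skip_batches` batches, then
    # record for `interval` batches, mirroring the config above) ...

    # Take the snapshot, stop recording, and write the interactive HTML view.
    snapshot = torch.cuda.memory._snapshot()
    torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
    with open(path, "w") as f:
        f.write(trace_plot(snapshot))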

Below is an example memory snapshot over three batches of MPT-7B with a micro batch size of 4, FSDP full shard, on 8xH100:
[screenshot: interactive memory snapshot visualization for the three captured batches]

For more details on the memory snapshot, refer to the PyTorch "Understanding GPU Memory" blog post linked above.

@cli99 marked this pull request as ready for review on December 19, 2023 00:32
@j316chuck (Contributor) left a comment

Thanks for adding this Cheng! Overall, this PR looks good to me barring some style suggestions.

Additionally, here are two higher-level use cases to consider for this PR:

  • First, should we add this callback to the profiler instead so we support having N memory traces instead of just one memory trace? I can see pros and cons for both, so I just wanted to make sure we are aligned on this decision cc: @mvpatel2000
  • Second, is it possible to add this memory trace when a run OOMs?

Here's the original ticket that this PR addresses; it has some ideas on the second point:
https://databricks.atlassian.net/browse/GRT-2231

(Inline review comments on composer/callbacks/memory_snapshot.py and tests/callbacks/test_memory_snapshot.py; all resolved.)
@cli99 (Contributor, Author) commented on Dec 19, 2023

> Thanks for adding this Cheng! Overall, this PR looks good to me barring some style suggestions.
>
> Additionally, here are two higher-level use cases to consider for this PR:
>
>   • First, should we add this callback to the profiler instead so we support having N memory traces instead of just one memory trace? I can see pros and cons for both, so I just wanted to make sure we are aligned on this decision cc: @mvpatel2000
>   • Second, is it possible to add this memory trace when a run OOMs?
>
> Here's the original ticket that this PR addresses; it has some ideas on the second point: https://databricks.atlassian.net/browse/GRT-2231

My thoughts on the two questions:

  1. The Composer profiler aligns with what the torch profiler supports and takes its schedule parameters; the memory snapshot callback does not rely on any torch profiler APIs (nor its schedule) and should live on its own.
  2. Generating a memory snapshot right before an OOM is a separate feature. It would generate a snapshot at a single timestamp (not over a time interval), and I think the best visualization for that would be a flame graph. Two options here: a) add it to the current memory monitor callback (in its init function), or b) implement it as a separate callback; see the rough sketch below. What do you think? @j316chuck @mvpatel2000
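
For option (b), here is a minimal, hypothetical sketch of an OOM-triggered snapshot using PyTorch's out-of-memory observer hook. The attach_oom_snapshot helper, the output path, and the folder/max_entries parameters are illustrative assumptions, not the callback that was eventually merged.

# Hypothetical sketch of an OOM-triggered snapshot dump (not the merged Composer code).
import torch

def attach_oom_snapshot(folder: str = "traces", max_entries: int = 100_000) -> None:
    # Record allocator history up front so the snapshot contains stack traces.
    torch.cuda.memory._record_memory_history(max_entries=max_entries)

    def on_oom(device, alloc, device_alloc, device_free):
        # Dump the allocator state at the moment the OOM is raised; the pickle
        # can be rendered as a flame graph with the torch.cuda._memory_viz tools.
        torch.cuda.memory._dump_snapshot(f"{folder}/oom_snapshot.pickle")

    # The observer fires when the CUDA caching allocator fails an allocation.
    torch._C._cuda_attach_out_of_memory_observer(on_oom)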

@cli99 requested a review from j316chuck on December 19, 2023 20:39
@mvpatel2000 (Contributor) left a comment

LGTM. Can you please provide a manual test and a screenshot just to show it works on a real run?

  1. I agree this should be separate from the profiler. It's convenient to have modularity, especially because the profiler slows things down a lot.
  2. I think merging OOM capture into this callback is reasonable given it's the same torch system. I suggest a follow-on PR since this one is basically done.

@j316chuck (Contributor) commented on Dec 19, 2023

Agreed with 1 and 2 from Mihir. Up to you whether you want to create a follow-up PR or add the OOM callback to this PR 👍

@cli99 merged commit 55d594a into mosaicml:dev on Feb 1, 2024
14 checks passed
@cli99 deleted the add-memory-snapshot branch on February 1, 2024 00:30
@cli99 restored the add-memory-snapshot branch on February 1, 2024 04:59
j316chuck added a commit that referenced this pull request Feb 2, 2024
* add memory snapshot callback

* fix check

* fix check

* Update composer/callbacks/memory_snapshot.py

Co-authored-by: Charles Tang <[email protected]>

* address comments

* fix upload filename print

* fix cpu check

* fix cpu check

* add pt version check

* add pt version check

* fix remote upload

* fix test

* fix cpu test

* fix gpu test

* fix gpu test

* fix gpu test

* fix gpu test

* fix gpu test

* do plotting before saving

* fix test

* fix test

* fix test

---------

Co-authored-by: Charles Tang <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
j316chuck added another commit that referenced this pull request Feb 2, 2024 (same commit message as above)