add memory snapshot callback #2788
Conversation
Thanks for adding this, Cheng! Overall, this PR looks good to me barring some style suggestions.
Additionally, here are two higher level use cases to consider for this PR:
- First, should we add this callback to the profiler instead, so we can support N memory traces rather than just one? I can see pros and cons for both, so I just want to make sure we are aligned on this decision. cc: @mvpatel2000
- Second, is it possible to capture this memory trace when a run OOMs?
Here's the original ticket that this PR addresses, which has some ideas on the second point:
https://databricks.atlassian.net/browse/GRT-2231
My thoughts on the two questions:
LGTM. Can you please provide a manual test and a screenshot just to show it works on a real run?
- I agree this should be separate from the profiler. It's convenient to have the modularity, especially because the profiler slows things down a lot.
- I think merging OOM capture into this callback is reasonable, given it's the same torch system. I suggest a follow-on PR since this one is basically done (see the sketch after this thread).
Agreed with 1 and 2 from Mihir. Up to you whether or not you want to create a follow-up PR or add the OOM capture to this PR 👍
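For context on the OOM point, here is a minimal, hypothetical sketch of how torch's snapshot machinery could dump a trace when a step OOMs. It assumes PyTorch >= 2.1 and uses the private `torch.cuda.memory._record_memory_history` / `_dump_snapshot` APIs; the function and file names are placeholders, not what a follow-on PR would necessarily do.

```python
import torch

def run_with_oom_snapshot(step_fn, dataloader, out_path="oom_snapshot.pickle"):
    """Run training steps and dump a memory snapshot if an OOM occurs (sketch)."""
    # Start recording allocator events so a later snapshot carries full history.
    # Private API in PyTorch >= 2.1; the signature may change between releases.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    try:
        for batch in dataloader:
            step_fn(batch)  # forward/backward/optimizer step (user-provided)
    except torch.cuda.OutOfMemoryError:
        # Persist the recorded history; the pickle can be inspected with the
        # interactive viewer at https://pytorch.org/memory_viz.
        torch.cuda.memory._dump_snapshot(out_path)
        raise
    finally:
        # Stop recording to avoid ongoing overhead.
        torch.cuda.memory._record_memory_history(enabled=None)
```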
Merged commits (squashed):
* add memory snapshot callback
* fix check
* fix check
* Update composer/callbacks/memory_snapshot.py (Co-authored-by: Charles Tang <[email protected]>)
* address comments
* fix upload filename print
* fix cpu check
* fix cpu check
* add pt version check
* add pt version check
* fix remote upload
* fix test
* fix cpu test
* fix gpu test
* fix gpu test
* fix gpu test
* fix gpu test
* fix gpu test
* do plotting before saving
* fix test
* fix test
* fix test

Co-authored-by: Charles Tang <[email protected]>
Co-authored-by: Mihir Patel <[email protected]>
This PR adds a callback to capture a memory snapshot. When enabled, it generates an HTML file with an interactive visualization; https://github.com/pytorch/pytorch.github.io/blob/site/assets/images/understanding-gpu-memory-1/snapshot.html shows an example visualization.
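For reference, a rough sketch of how such an HTML view can be produced from a raw snapshot using PyTorch's private visualization helper (`torch.cuda._memory_viz.trace_plot`, present in recent releases). This illustrates the assumed mechanism only, not the exact code in the callback.

```python
import torch
from torch.cuda._memory_viz import trace_plot  # private API; may move between releases

# Record allocator history, run some workload, then snapshot and render to HTML.
torch.cuda.memory._record_memory_history()
# ... run a few training batches here ...
snapshot = torch.cuda.memory._snapshot()  # raw allocator state + event history
with open("memory_snapshot.html", "w") as f:
    f.write(trace_plot(snapshot))  # interactive timeline like the example linked above
torch.cuda.memory._record_memory_history(enabled=None)
```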
To enable the callback from a YAML config file, the corresponding change in llm-foundry is also needed: mosaicml/llm-foundry#810.
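For anyone enabling it from Python rather than YAML, a minimal sketch is below. It assumes the callback is exported as `composer.callbacks.MemorySnapshot` (see composer/callbacks/memory_snapshot.py) with a no-argument constructor; `model` and `train_dataloader` are placeholders for your own objects.

```python
from composer import Trainer
from composer.callbacks import MemorySnapshot  # assumed export from this PR

trainer = Trainer(
    model=model,                        # your ComposerModel (placeholder)
    train_dataloader=train_dataloader,  # placeholder dataloader
    max_duration="3ba",                 # a few batches is enough to capture a snapshot
    callbacks=[MemorySnapshot()],       # enable the new callback
)
trainer.fit()
```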
Below is an example memory snapshot over three batches of MPT-7B with micro batch size 4 and FSDP full shard, on 8xH100.
For more details on the memory snapshot, refer to