min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged #3373

JackZ-db · 2024-06-06T06:42:46Z

What does this PR do?

Added logging for GPU power usage and removed all GPU metrics already covered by other callbacks (used, free, and total memory).

The following GPU metrics pertaining to Straggler Detection are logged:

    gpu_percentage: Occupancy rate, percent of time over the past sampling period during
                    which one or more kernels was executing on the GPU.
    memory_percentage: Percent of time over the past sampling period during which
                    global (device) memory was being read or written.
    gpu_temperature_C: Temperature of device, in Celsius.
    gpu_power_usage_W: Power usage of device, in Watts.
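
As a rough illustration of where these four values can come from, here is a minimal sketch that reads them through NVML using the `pynvml` bindings. It is not the code from this PR: the helper names `to_metric_dict` and `read_gpu_metrics` are hypothetical, and the sketch assumes `pynvml` and an NVIDIA driver are installed. Note that NVML reports power in milliwatts, so it is converted to watts to match `gpu_power_usage_W`.

```python
from typing import Dict


def to_metric_dict(gpu_util: int, mem_util: int, temp_c: int, power_mw: int) -> Dict[str, float]:
    """Map raw NVML readings onto the metric names listed above (hypothetical helper)."""
    return {
        'gpu_percentage': float(gpu_util),
        'memory_percentage': float(mem_util),
        'gpu_temperature_C': float(temp_c),
        # NVML reports power draw in milliwatts; convert to watts.
        'gpu_power_usage_W': power_mw / 1000.0,
    }


def read_gpu_metrics(device_index: int = 0) -> Dict[str, float]:
    """Query one GPU through NVML (assumes pynvml and an NVIDIA driver are available)."""
    import pynvml  # imported lazily so to_metric_dict works on machines without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # has .gpu and .memory, in percent
        temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)
        return to_metric_dict(util.gpu, util.memory, temp_c, power_mw)
    finally:
        pynvml.nvmlShutdown()
```

Keeping the unit conversion in a separate pure function makes it testable without GPU hardware.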

Added clearer documentation for the SystemMetricsMonitor class and removed the user-supplied boolean parameter that indicated whether the model uses GPUs (gpu_available).

Added a boolean flag, log_all_data (default False), that lets users choose between logging only the min/max values of the GPU metrics listed above together with their corresponding ranks, or logging all values for all ranks.
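
The difference between the two logging modes can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: `select_metrics` and the output key names are hypothetical, and the per-rank values are assumed to have already been gathered into a list indexed by rank (in a distributed run that gathering step would be a collective operation).

```python
from typing import Dict, List


def select_metrics(name: str, values_by_rank: List[float], log_all_data: bool = False) -> Dict[str, float]:
    """Choose what to log for one metric, given one value per rank (hypothetical helper)."""
    if not values_by_rank:
        raise ValueError('expected at least one rank')

    if log_all_data:
        # Log every rank's value under its own key (key format is an assumption).
        return {f'{name}/rank_{rank}': value for rank, value in enumerate(values_by_rank)}

    # Otherwise log only the extremes and the ranks they came from.
    ranks = range(len(values_by_rank))
    min_rank = min(ranks, key=values_by_rank.__getitem__)
    max_rank = max(ranks, key=values_by_rank.__getitem__)
    return {
        f'{name}/min': values_by_rank[min_rank],
        f'{name}/min_rank': float(min_rank),
        f'{name}/max': values_by_rank[max_rank],
        f'{name}/max_rank': float(max_rank),
    }


# Example: four ranks reporting GPU utilization.
utilization = [71.0, 64.0, 90.0, 64.0]
summary = select_metrics('gpu_percentage', utilization)
everything = select_metrics('gpu_percentage', utilization, log_all_data=True)
```

With min/max logging, the number of logged values per metric stays constant regardless of world size, while the recorded rank still points at the likely straggler. On ties this sketch reports the lowest rank.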

Before submitting

Have you read the contributor guidelines?
Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
Did you update any related docs and document your change?
Did you update any related tests and add any new tests related to your change? (see testing)
Did you run the tests locally to make sure they pass?
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

mvpatel2000

LGTM, mostly minor style comments :)

composer/callbacks/system_metrics_monitor.py

mvpatel2000

Last wave of nits!

composer/callbacks/system_metrics_monitor.py

mvpatel2000

LGTM!

composer/callbacks/system_metrics_monitor.py

…ant, necessary gpu metrics logged (mosaicml#3373)

* implemented min_max flag
* fixed string parsing
* refactoring compute_system_metrics for all_reduce
* keep track of rank within dict
* added compute_min_max
* added flag for both min_max and all_logging
* corrected min_max call with model_device
* removing total bytes (always going ot be constant)
* handled no gpu case in min_max flag
* removed unnecessary imports, patched unit tests
* fixed assert statement for with gpu case, world size 1
* case min_rank and max_rank as int to guarantee them working as indices
* fixed indent issue from fixing font
* made docs more concise and readable
* fixing unexpected unindent
* fixing unit test device
* modifying device to equal model_device.type
* reverting to device=model_device
* setting device in unit test = 'gpu'
* setting device = 'cuda' in unit testing
* reverting to next(state.model.parameters()).device
* removed torch as a dependecy for unit_testing
* cleaned up UI to be consistent + removed calling next to obtain device

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Charles Tang <[email protected]>

JackZ-db added 11 commits June 5, 2024 17:15

implemented min_max flag

1eb8e4f

fixed string parsing

659c905

refactoring compute_system_metrics for all_reduce

1f3b289

keep track of rank within dict

d9bfca3

added compute_min_max

a2bd879

added flag for both min_max and all_logging

5553e0a

corrected min_max call with model_device

36623bc

removing total bytes (always going ot be constant)

0de15b4

handled no gpu case in min_max flag

e8bf93c

removed unnecessary imports, patched unit tests

b3b859c

fixed assert statement for with gpu case, world size 1

d652633

JackZ-db requested review from j316chuck and mvpatel2000 June 6, 2024 06:42

JackZ-db self-assigned this Jun 6, 2024

JackZ-db added 2 commits June 6, 2024 09:15

case min_rank and max_rank as int to guarantee them working as indices

d795550

fixed indent issue from fixing font

8958c40

mvpatel2000 reviewed Jun 6, 2024

mvpatel2000 and others added 4 commits June 6, 2024 14:46

Merge branch 'dev' into jz/metrics_monitor

5cd20c0

Merge branch 'mosaicml:dev' into jz/metrics_monitor

8f7d273

made docs more concise and readable

15cdb09

fixing unexpected unindent

3746bab

JackZ-db requested a review from mvpatel2000 June 6, 2024 21:40

JackZ-db added 7 commits June 6, 2024 18:01

fixing unit test device

3988f41

modifying device to equal model_device.type

988195d

reverting to device=model_device

80df26c

setting device in unit test = 'gpu'

9e41dc8

setting device = 'cuda' in unit testing

1d0ad04

reverting to next(state.model.parameters()).device

91f1ac4

removed torch as a dependecy for unit_testing

56db297

mvpatel2000 reviewed Jun 7, 2024

cleaned up UI to be consistent + removed calling next to obtain device

0477b4a

JackZ-db requested a review from mvpatel2000 June 7, 2024 18:46

mvpatel2000 approved these changes Jun 7, 2024

mvpatel2000 reviewed Jun 7, 2024

composer/callbacks/system_metrics_monitor.py

j316chuck approved these changes Jun 10, 2024

Merge branch 'dev' into jz/metrics_monitor

bfd0a12

JackZ-db merged commit cca51e2 into mosaicml:dev Jun 17, 2024
17 checks passed
