Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

min/max flagging added to system_metrics_monitor with only non-redundant, necessary gpu metrics logged #3373

Merged
merged 26 commits into from
Jun 17, 2024

Conversation

JackZ-db
Copy link
Contributor

@JackZ-db JackZ-db commented Jun 6, 2024

What does this PR do?

Added logging for gpu power and removed all gpu metrics covered in other callbacks (used, free, total memory).

The following GPU metrics pertaining to Straggler Detection are logged:

    gpu_percentage: Occupancy rate, percent of time over the past sampling period during
                    which one or more kernels was executing on the GPU.
    memory_percentage: Percent of time over the past sampling period during which
                    global (device) memory was being read or written.
    gpu_temperature_C: Temperature of device, in Celcius.
    gpu_power_usage_W: Power usage of device, in Watts.

Added clearer documentation for the SystemsMetricMonitor class and removed the user inputted boolean parameter for whether the model uses GPU's or not (gpu_available).

Added a boolean flag for users to select whether to log the min/max values for the GPU metrics listed above and their corresponding their ranks, or log all values for all ranks (log_all_data, default set to false).

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@JackZ-db JackZ-db self-assigned this Jun 6, 2024
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, mostly minor style comments :)

composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
@JackZ-db JackZ-db requested a review from mvpatel2000 June 6, 2024 21:40
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last wave of nits!

composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
composer/callbacks/system_metrics_monitor.py Outdated Show resolved Hide resolved
@JackZ-db JackZ-db requested a review from mvpatel2000 June 7, 2024 18:46
Copy link
Contributor

@mvpatel2000 mvpatel2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@JackZ-db JackZ-db merged commit cca51e2 into mosaicml:dev Jun 17, 2024
17 checks passed
mvpatel2000 added a commit to mvpatel2000/composer that referenced this pull request Jul 21, 2024
…ant, necessary gpu metrics logged (mosaicml#3373)

* implemented min_max flag

* fixed string parsing

* refactoring compute_system_metrics for all_reduce

* keep track of rank within dict

* added compute_min_max

* added flag for both min_max and all_logging

* corrected min_max call with model_device

* removing total bytes (always going ot be constant)

* handled no gpu case in min_max flag

* removed unnecessary imports, patched unit tests

* fixed assert statement for with gpu case, world size 1

* case min_rank and max_rank as int to guarantee them working as indices

* fixed indent issue from fixing font

* made docs more concise and readable

* fixing unexpected unindent

* fixing unit test device

* modifying device to equal model_device.type

* reverting to device=model_device

* setting device in unit test = 'gpu'

* setting device = 'cuda' in unit testing

* reverting to next(state.model.parameters()).device

* removed torch as a dependecy for unit_testing

* cleaned up UI to be consistent + removed calling next to obtain device

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Charles Tang <[email protected]>
mvpatel2000 added a commit that referenced this pull request Jul 21, 2024
…ant, necessary gpu metrics logged (#3373)

* implemented min_max flag

* fixed string parsing

* refactoring compute_system_metrics for all_reduce

* keep track of rank within dict

* added compute_min_max

* added flag for both min_max and all_logging

* corrected min_max call with model_device

* removing total bytes (always going ot be constant)

* handled no gpu case in min_max flag

* removed unnecessary imports, patched unit tests

* fixed assert statement for with gpu case, world size 1

* case min_rank and max_rank as int to guarantee them working as indices

* fixed indent issue from fixing font

* made docs more concise and readable

* fixing unexpected unindent

* fixing unit test device

* modifying device to equal model_device.type

* reverting to device=model_device

* setting device in unit test = 'gpu'

* setting device = 'cuda' in unit testing

* reverting to next(state.model.parameters()).device

* removed torch as a dependecy for unit_testing

* cleaned up UI to be consistent + removed calling next to obtain device

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Charles Tang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants