Name the forward pass thread in the trainer loop #895

Closed

Conversation

JKSenthil
Contributor

Summary:
Internal

Context

With the sched_ext effort we are building custom Linux schedulers that provide a small performance boost for AI training and improve resource isolation on the trainer hosts. The latter is necessary to avoid cases where noisy-neighbor processes, such as data loaders, slow down GPU training.

More details in this note: https://fb.workplace.com/notes/1118655556176038

By naming the forward pass thread we can match it by name and assign it a higher priority at the Linux scheduler level. The backward pass thread is already named inside the PyTorch implementation, but the forward pass thread needs to be named at the application level.
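
To make the mechanism concrete, here is a minimal userspace sketch of how a named thread can be located and prioritized. The actual sched_ext schedulers match the comm name inside a BPF scheduler; this illustration instead scans /proc and uses the standard scheduling syscalls. The find_tids_by_name helper and the SCHED_FIFO priority of 10 are illustrative assumptions, not part of this diff.

```
# Illustrative only: locate threads by comm name and raise their priority.
# Requires Linux and CAP_SYS_NICE; the real sched_ext path matches the
# comm name inside a BPF scheduler instead of doing this in userspace.
import os

def find_tids_by_name(pid: int, name: str) -> list[int]:
    tids = []
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/comm") as f:
                if f.read().strip() == name:
                    tids.append(int(tid))
        except FileNotFoundError:
            continue  # thread exited while we were scanning
    return tids

for tid in find_tids_by_name(os.getpid(), "trainer_main"):
    # SCHED_FIFO at priority 10 is an arbitrary illustrative choice.
    os.sched_setscheduler(tid, os.SCHED_FIFO, os.sched_param(10))
```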

We did the same thing in PyPer, APS, and MVAI, the largest trainer frameworks for reco models, which together consume 70%+ of fleet-level GPU hours for recommender systems.

This Diff

Adds the core lines

```
if torch.multiprocessing._get_thread_name() != "trainer_main":
    torch.multiprocessing._set_thread_name("trainer_main")
```

to the train/eval/predict entry points. We check the preexisting name to avoid renaming a thread that has already been named.
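
Since _get_thread_name/_set_thread_name are private torch.multiprocessing APIs, a sketch of how the snippet might be wrapped defensively is shown below; the _maybe_name_thread helper and the hasattr version guard are illustrative assumptions, not code from this diff.

```
import torch.multiprocessing

def _maybe_name_thread(name: str = "trainer_main") -> None:
    # Illustrative guard: _get/_set_thread_name are private PyTorch APIs,
    # so skip naming on builds that do not expose them.
    if not hasattr(torch.multiprocessing, "_set_thread_name"):
        return
    if torch.multiprocessing._get_thread_name() != name:
        torch.multiprocessing._set_thread_name(name)
```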

Reviewed By: diego-urgell

Differential Revision: D61924982

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D61924982

JKSenthil added a commit to JKSenthil/tnt that referenced this pull request Sep 9, 2024
