Name the forward pass thread in the trainer loop #895

Closed

Conversation

JKSenthil
Contributor

Summary:
Internal

Context

With the sched_ext effort we are building custom Linux schedulers that provide a small performance boost for AI training and improve resource isolation on the trainer hosts. The latter is necessary to avoid cases where noisy-neighbor processes, such as data loaders, slow down GPU training.

More details in this note: https://fb.workplace.com/notes/1118655556176038

By naming the forward pass thread we can match it by name and assign it a higher priority at the Linux scheduler level. The backward pass thread is already named inside the PyTorch implementation, but the forward pass thread needs to be named at the application level.
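
To make the mechanism concrete, here is a minimal userspace sketch of how a named thread can be located and prioritized. The actual sched_ext schedulers match the comm name inside a BPF scheduler; this illustration instead scans /proc and uses the standard scheduling syscalls. The find_tids_by_name helper and the SCHED_FIFO priority of 10 are illustrative assumptions, not part of this diff.

```
# Illustrative only: locate threads by comm name and raise their priority.
# Requires Linux and CAP_SYS_NICE; the real sched_ext path matches the
# comm name inside a BPF scheduler instead of doing this in userspace.
import os

def find_tids_by_name(pid: int, name: str) -> list[int]:
    tids = []
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/comm") as f:
                if f.read().strip() == name:
                    tids.append(int(tid))
        except FileNotFoundError:
            continue  # thread exited while we were scanning
    return tids

for tid in find_tids_by_name(os.getpid(), "trainer_main"):
    # SCHED_FIFO at priority 10 is an arbitrary illustrative choice.
    os.sched_setscheduler(tid, os.SCHED_FIFO, os.sched_param(10))
```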

We did the same thing in PyPer, APS, and MVAI, the largest trainer frameworks for reco models, which together consume 70%+ of fleet-level GPU hours for recommender systems.

This Diff

Adds the core lines

```
if torch.multiprocessing._get_thread_name() != "trainer_main":
    torch.multiprocessing._set_thread_name("trainer_main")
```

to the train/eval/predict entry points. We check the preexisting name to avoid renaming a thread that has already been named.
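
Since _get_thread_name/_set_thread_name are private torch.multiprocessing APIs, a sketch of how the snippet might be wrapped defensively is shown below; the _maybe_name_thread helper and the hasattr version guard are illustrative assumptions, not code from this diff.

```
import torch.multiprocessing

def _maybe_name_thread(name: str = "trainer_main") -> None:
    # Illustrative guard: _get/_set_thread_name are private PyTorch APIs,
    # so skip naming on builds that do not expose them.
    if not hasattr(torch.multiprocessing, "_set_thread_name"):
        return
    if torch.multiprocessing._get_thread_name() != name:
        torch.multiprocessing._set_thread_name(name)
```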

Reviewed By: diego-urgell

Differential Revision: D61924982

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D61924982

JKSenthil added a commit to JKSenthil/tnt that referenced this pull request Sep 9, 2024
