Incorrect Imagenet evals with pytorch_eval_num_workers > 0 #732

Open
priyakasimbeg opened this issue Mar 27, 2024 · 2 comments
Labels: 🐛 Bug (Something isn't working), 🔥 PyTorch (Issue that mainly deals with the PyTorch version of the code)

priyakasimbeg (Contributor) commented Mar 27, 2024

The AlgoPerf submitter team reports that they are no longer able to reproduce the NAdam baseline results in PyTorch on the ImageNet workloads (both ResNet and ViT) using the current repo.
See the plot below for the differences in training/validation loss and accuracy between the given NAdam JAX results and the current run's results on ImageNet ViT.

They did not see a change in OGBG and FastMRI.

The commits merged in this range span 389fe3f823a5016289b55b48aa8061a37b18b401 to 79ccc5e860d7928cf896ffe12ec686c72fd840d4.

[Plot: training/validation loss and accuracy, NAdam JAX reference vs. current PyTorch run on ImageNet ViT]

Steps to Reproduce

Run the submission runner with eval_num_workers=4 (the default was recently changed to this value to help speed up evals).

Source or Possible Fix

Setting the eval_num_workers to 0 resolves the discrepancy in evals. We are still investigating why.
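For context, the flag presumably feeds the num_workers argument of the PyTorch eval DataLoader. Below is a minimal sketch of the two settings being compared, using a placeholder dataset rather than the actual ImageNet input pipeline (batch size, tensor shapes, and variable names here are illustrative, not taken from the repo):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Placeholder eval data; the real workload uses the ImageNet input pipeline.
    eval_dataset = TensorDataset(
        torch.randn(64, 3, 224, 224),   # images
        torch.randint(0, 1000, (64,)),  # labels
    )

    # Setting that reproduces the discrepancy (the default at the time of the report).
    loader_multi = DataLoader(eval_dataset, batch_size=16, shuffle=False, num_workers=4)

    # Workaround: load data in the main process.
    loader_single = DataLoader(eval_dataset, batch_size=16, shuffle=False, num_workers=0)

    # With a deterministic in-memory dataset the two loaders yield identical batches,
    # so eval metrics computed over them should match; the reported bug is that the
    # ImageNet evals nevertheless differ when num_workers > 0.
    for (x_m, y_m), (x_s, y_s) in zip(loader_multi, loader_single):
        assert torch.equal(x_m, x_s) and torch.equal(y_m, y_s)
```

The sketch only shows where the flag acts; the actual discrepancy is specific to the ImageNet pipeline and, per this issue, is still under investigation.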

priyakasimbeg changed the title from "Incorrect Imagenet evals for PyTorch data loader num workers > 0" to "Incorrect Imagenet evals with pytorch_eval_num_workers > 0" on Mar 27, 2024

priyakasimbeg commented Mar 27, 2024

Changed the default number of workers for the PyTorch data loaders to 0.
Important update: for the speech workloads, the pytorch_eval_num_workers flag to submission_runner.py has to be set to a value > 0 to prevent a data loader crash in the JAX code.
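A hedged sketch of what such a workload-dependent default could look like; the helper name and the speech workload names below are assumptions for illustration, not the actual submission_runner.py logic:

```python
from typing import Optional

# Assumed speech workload names, used only for illustration.
SPEECH_WORKLOADS = {"librispeech_conformer", "librispeech_deepspeech"}


def pick_eval_num_workers(workload_name: str,
                          flag_value: Optional[int] = None) -> int:
    """Hypothetical helper: choose num_workers for the PyTorch eval DataLoader."""
    if flag_value is not None:
        # An explicitly passed --pytorch_eval_num_workers always wins.
        return flag_value
    # New default is 0; speech workloads need > 0 to avoid the data loader crash.
    return 4 if workload_name in SPEECH_WORKLOADS else 0
```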

runame added the 🔥 PyTorch (Issue that mainly deals with the PyTorch version of the code) and 🐛 Bug (Something isn't working) labels on Mar 28, 2024

runame commented Apr 3, 2024

I tried reproducing the issue by running the target setting run on the current dev branch with pytorch_eval_num_workers=4, but I don't see the drop in eval metrics compared to an older reference run (this one).

If someone can share the exact command and commit they used to produce the run in the plot, I will try to run that instead.
