Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metrics] ValueError: Expected to find locked file from process x but it doesn't exist. #607

Open
raghavm1 opened this issue Jul 11, 2024 · 1 comment

Comments

@raghavm1
Copy link

raghavm1 commented Jul 11, 2024

When calling .compute in distributed multi-node setting, I get this error -

[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2750, in _evaluate
[rank1]:     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank1]:     return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3641, in evaluate
[rank1]:     output = eval_loop(
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3924, in evaluation_loop
[rank1]:     metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
[rank1]:   File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 486, in <lambda>
[rank1]:     compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer, accelerator),
[rank1]:   File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 119, in compute_metrics
[rank1]:     return metric.compute(predictions=decoded_preds, references=references)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 455, in compute
[rank1]:     self.add_batch(**inputs)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 515, in add_batch
[rank1]:     self._init_writer()
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 664, in _init_writer
[rank1]:     self._check_rendez_vous()  # wait for master to be ready and to let everyone go
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 377, in _check_rendez_vous
[rank1]:     raise ValueError(f"Couldn't acquire lock on {lock_file_name} from process {self.process_id}.") from None
[rank1]: ValueError: Couldn't acquire lock on /scratch/rm6418/gemt5_cache/sacrebleu/default/gemt5_exp1-12-rdv.lock from process 1.

I've looked at #481, #542
but this issue still seems to be happening on the current latest released versions

All metrics are loaded with the same experiment_id, and with the correct num_process arguments.
All the files (the lock files) are present in the cache directory.

Environment

evaluate - 0.4.2
accelerate - 0.31.0
datasets - 2.20.0
transformers - 4.42.3

Any suggestions appreciated!

@ffrancesco94
Copy link

I ended up fixing it by computing metrics only on the main process. I used accelerator.gather_for_metrics() and then the following:

if accelerator.is_main_process:
    metrics.compute()

Only workaround I could find until it gets fixed upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants