[Metrics] ValueError: Expected to find locked file from process x but it doesn't exist. #607

raghavm1 · 2024-07-11T15:26:20Z

When calling .compute in a distributed multi-node setting, I get this error:

[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2750, in _evaluate
[rank1]:     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank1]:     return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3641, in evaluate
[rank1]:     output = eval_loop(
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3924, in evaluation_loop
[rank1]:     metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
[rank1]:   File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 486, in <lambda>
[rank1]:     compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer, accelerator),
[rank1]:   File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 119, in compute_metrics
[rank1]:     return metric.compute(predictions=decoded_preds, references=references)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 455, in compute
[rank1]:     self.add_batch(**inputs)
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 515, in add_batch
[rank1]:     self._init_writer()
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 664, in _init_writer
[rank1]:     self._check_rendez_vous()  # wait for master to be ready and to let everyone go
[rank1]:   File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 377, in _check_rendez_vous
[rank1]:     raise ValueError(f"Couldn't acquire lock on {lock_file_name} from process {self.process_id}.") from None
[rank1]: ValueError: Couldn't acquire lock on /scratch/rm6418/gemt5_cache/sacrebleu/default/gemt5_exp1-12-rdv.lock from process 1.

I've looked at #481 and #542, but this issue still occurs with the latest released versions.

All metrics are loaded with the same experiment_id, and with the correct num_process arguments.
All the files (the lock files) are present in the cache directory.

Environment

evaluate - 0.4.2
accelerate - 0.31.0
datasets - 2.20.0
transformers - 4.42.3

Any suggestions appreciated!

ffrancesco94 · 2024-11-04T09:38:22Z

I ended up fixing it by computing metrics only on the main process. I used accelerator.gather_for_metrics() and then the following:

if accelerator.is_main_process:
    metrics.compute()

Only workaround I could find until it gets fixed upstream.
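
For anyone landing here, a minimal sketch of that workaround follows. It is not the library's official fix. The helper name compute_on_main_process and the optional decode callable are hypothetical. It assumes accelerator behaves like accelerate.Accelerator, that metric comes from evaluate.load(...), and that the metric was loaded in the default single-process mode (no num_process / process_id), so the main process does not wait on lock files from other ranks.

```python
from typing import Any, Callable, Dict, Optional


def compute_on_main_process(
    accelerator: Any,
    metric: Any,
    predictions: Any,
    references: Any,
    decode: Optional[Callable[[Any], Any]] = None,
) -> Optional[Dict[str, Any]]:
    """Gather per-rank outputs, then compute the metric on the main process only.

    Hypothetical helper: `accelerator` is duck-typed against accelerate.Accelerator
    and `metric` against an evaluate module. `decode` is an optional callable
    (for example, token IDs to strings) applied after gathering.
    """
    # Every rank has to take part in the gather, so it stays outside the `if`.
    all_predictions, all_references = accelerator.gather_for_metrics(
        (predictions, references)
    )

    # Non-main ranks stop here and never touch the metric's cache or lock files.
    if not accelerator.is_main_process:
        return None

    if decode is not None:
        all_predictions = decode(all_predictions)
        all_references = decode(all_references)

    return metric.compute(predictions=all_predictions, references=all_references)
```

The function returns None on non-main ranks, so callers should only log or save the result where it is not None. For a text metric such as sacrebleu, decode is where the gathered token IDs would be turned back into strings.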
