When calling .compute in a distributed multi-node setting, I get this error:
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2750, in _evaluate
[rank1]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank1]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3641, in evaluate
[rank1]: output = eval_loop(
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3924, in evaluation_loop
[rank1]: metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
[rank1]: File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 486, in <lambda>
[rank1]: compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer, accelerator),
[rank1]: File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 119, in compute_metrics
[rank1]: return metric.compute(predictions=decoded_preds, references=references)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 455, in compute
[rank1]: self.add_batch(**inputs)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 515, in add_batch
[rank1]: self._init_writer()
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 664, in _init_writer
[rank1]: self._check_rendez_vous() # wait for master to be ready and to let everyone go
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 377, in _check_rendez_vous
[rank1]: raise ValueError(f"Couldn't acquire lock on {lock_file_name} from process {self.process_id}.") from None
[rank1]: ValueError: Couldn't acquire lock on /scratch/rm6418/gemt5_cache/sacrebleu/default/gemt5_exp1-12-rdv.lock from process 1.
I've looked at #481 and #542, but this issue still seems to happen on the latest released versions.
All metrics are loaded with the same experiment_id, and with the correct num_process arguments.
All the files (the lock files) are present in the cache directory.
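For illustration, the loading setup described above can be sketched roughly as follows. This is not the actual script: the helper names `metric_load_kwargs` and `load_metric` are hypothetical, and the sketch assumes the launcher (for example torchrun) exports `RANK` and `WORLD_SIZE`. The keyword arguments `process_id`, `num_process`, `experiment_id` and `cache_dir` are those accepted by `evaluate.load`.

```python
import os


def metric_load_kwargs(experiment_id, cache_dir, env=None):
    """Build the keyword arguments passed to evaluate.load on each process.

    Assumption: the launcher exports RANK (global rank) and WORLD_SIZE
    (total number of processes across all nodes).
    """
    env = os.environ if env is None else env
    return {
        # Total number of processes taking part in the metric computation.
        "num_process": int(env.get("WORLD_SIZE", 1)),
        # Global rank of this process, not the per-node local rank.
        "process_id": int(env.get("RANK", 0)),
        # Must be identical on every process of the same run.
        "experiment_id": experiment_id,
        # Directory where the cache and lock files are written.
        "cache_dir": cache_dir,
    }


def load_metric(name, experiment_id, cache_dir):
    """Hypothetical wrapper: load a metric with the distributed settings."""
    import evaluate  # imported lazily so the helper above has no dependency

    return evaluate.load(name, **metric_load_kwargs(experiment_id, cache_dir))
```

Each process would call something like `load_metric("sacrebleu", experiment_id, cache_dir)` with the same `experiment_id` and `cache_dir`. Because the rendezvous relies on lock files in the cache directory, one thing worth double-checking in a multi-node run is that `cache_dir` resolves to the same shared filesystem on every node.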
Environment
evaluate - 0.4.2
accelerate - 0.31.0
datasets - 2.20.0
transformers - 4.42.3
Any suggestions appreciated!