
[ARC E2E] Huggingface accuracy failed #911

Open
mengfei25 opened this issue Sep 13, 2024 · 5 comments
Comments


mengfei25 commented Sep 13, 2024

🐛 Describe the bug

Accuracy failed for key name albert.embeddings.token_type_embeddings.weight.grad

| Category | Model | Accuracy |
| --- | --- | --- |
| huggingface_amp_bf16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_amp_fp16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_bfloat16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_float16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_float32_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_amp_bf16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_amp_fp16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_bfloat16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_float16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_float32_training | AlbertForQuestionAnswering | fail_accuracy |

Versions

torch-xpu-ops: 7e3d00a
pytorch: 0d1d69fd25fdc096763bfe85f4d379e27ea1c9f8
device: ARC 24.04
driver: 24.31.30508.7


mengfei25 commented Sep 18, 2024

Comparison with Ubuntu 22.04:

| Category | Model | Ubuntu 24.04 | Ubuntu 22.04 |
| --- | --- | --- | --- |
| huggingface_amp_bf16_training | AlbertForMaskedLM | fail_accuracy | pass |
| huggingface_bfloat16_training | AlbertForMaskedLM | fail_accuracy | pass |
| huggingface_float32_training | AlbertForMaskedLM | fail_accuracy | pass |
| huggingface_float32_training | AlbertForQuestionAnswering | fail_accuracy | pass |

@jianyizh jianyizh self-assigned this Sep 24, 2024

jianyizh commented Sep 25, 2024

@mengfei25
I can reproduce this on Ubuntu 22.04. For example, with AlbertForMaskedLM fp16 training I get:

W0925 06:28:19.597000 315406 site-packages/torch/_dynamo/utils.py:1723] Similarity score=0.9181798100471497
E0925 06:28:19.598000 315406 site-packages/torch/_dynamo/utils.py:1674] Accuracy failed for key name albert.embeddings.token_type_embeddings.weight.grad

The similarity score is exactly the same as in your test log. However, if I modify the patch and compare the results directly, it passes: you do not need to put fp64_outputs on XPU, and you can move new_result and correct_result to CPU for the comparison. https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2898
This cosine-similarity failure is related to layer norm backward.
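The cosine-similarity gate that produces the `Similarity score=0.918...` / `fail_accuracy` lines above can be sketched in plain Python. This is a minimal sketch only: the `0.99` threshold and the helper names are assumptions for illustration, not the actual `torch._dynamo.utils` implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def check_accuracy(result_grad, reference_grad, threshold=0.99):
    # Hypothetical threshold: if the similarity drops below it, the
    # benchmark would report fail_accuracy for that parameter key.
    score = cosine_similarity(result_grad, reference_grad)
    return score >= threshold, score

# Small elementwise deviation keeps the direction intact,
# so the score stays close to 1.0 and the check passes.
ok, score = check_accuracy([1.0, 2.0, 3.0], [1.0, 2.1, 2.7])
```

A reported score of 0.918 for `token_type_embeddings.weight.grad` would fall below any threshold in this neighborhood, which is why the whole model run is marked `fail_accuracy` even if most other gradients match.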

jianyizh commented

CUDA also fails on cosine similarity:

cuda train AlbertForMaskedLM
W0927 14:50:50.191000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_inductor/debug.py:434] [3/0_1] AlbertForMaskedLM__0_forward_1 debug trace: /home/sparse/jianyi/pytorch/inductor_log/huggingface/AlbertForMaskedLM/training/amp_fp16/22/torch_compile_debug/torch_compile_debug/run_2024_09_27_14_50_33_656959-pid_3628133/torchinductor/AlbertForMaskedLM__0_forward_1.0
W0927 14:51:04.399000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_inductor/debug.py:434] [3/0_1] AlbertForMaskedLM__0_backward_3 debug trace: /home/sparse/jianyi/pytorch/inductor_log/huggingface/AlbertForMaskedLM/training/amp_fp16/22/torch_compile_debug/torch_compile_debug/run_2024_09_27_14_50_33_656959-pid_3628133/torchinductor/AlbertForMaskedLM__0_backward_3.1
W0927 14:51:05.220000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_dynamo/utils.py:1723] Similarity score=0.6160069704055786
E0927 14:51:05.221000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_dynamo/utils.py:1674] Accuracy failed for key name albert.embeddings.token_type_embeddings.weight.grad
fail_accuracy

chuanqi129 commented

@jianyizh does this mean both XPU and CUDA will fail on cosine similarity for those models?

jianyizh commented

> @jianyizh does this mean both XPU and CUDA will fail on cosine similarity for those models?

Yes, and they can pass the normal accuracy test once the fp64 patch is corrected.
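One way a gradient can pass a direct elementwise-tolerance comparison yet fail the cosine-similarity check (a minimal sketch of the mechanism, not a confirmed root cause for this issue, which the thread attributes to layer norm backward) is when the gradient tensor is near zero: tiny numerical noise then dominates the direction, collapsing the cosine score even though every element agrees within tolerance. The `allclose` helper below is a simplified stand-in for a `torch.allclose`-style check.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def allclose(a, b, atol=1e-7):
    # Simplified elementwise absolute-tolerance comparison.
    return all(abs(x - y) <= atol for x, y in zip(a, b))

# Near-zero gradients: every element matches within tolerance,
# but the noise determines the direction entirely.
g_eager = [1e-8, 0.0, 0.0]
g_compiled = [0.0, 1e-8, 0.0]

assert allclose(g_eager, g_compiled)                  # direct comparison passes
assert cosine_similarity(g_eager, g_compiled) == 0.0  # cosine check fails
```

This is consistent with the observation that comparing `new_result` and `correct_result` directly on CPU passes while the similarity-based path reports `fail_accuracy`.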
