
[ARC E2E] Huggingface accuracy failed #911

Open
mengfei25 opened this issue Sep 13, 2024 · 5 comments
Comments


mengfei25 commented Sep 13, 2024

🐛 Describe the bug

Accuracy failed for key name albert.embeddings.token_type_embeddings.weight.grad

| Category | Model | Accuracy |
| --- | --- | --- |
| huggingface_amp_bf16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_amp_fp16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_bfloat16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_float16_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_float32_training | AlbertForMaskedLM | fail_accuracy |
| huggingface_amp_bf16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_amp_fp16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_bfloat16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_float16_training | AlbertForQuestionAnswering | fail_accuracy |
| huggingface_float32_training | AlbertForQuestionAnswering | fail_accuracy |

Versions

torch-xpu-ops: 7e3d00a
pytorch: 0d1d69fd25fdc096763bfe85f4d379e27ea1c9f8
device: ARC 24.04
driver: 24.31.30508.7


mengfei25 commented Sep 18, 2024

Comparison with Ubuntu 22.04:

| Category | Model | Ubuntu 24.04 | Ubuntu 22.04 |
| --- | --- | --- | --- |
| huggingface_amp_bf16_training | AlbertForMaskedLM | fail_accuracy | pass |
| huggingface_bfloat16_training | AlbertForMaskedLM | fail_accuracy | pass |
| huggingface_float32_training | AlbertForMaskedLM | fail_accuracy | pass |
| huggingface_float32_training | AlbertForQuestionAnswering | fail_accuracy | pass |

@jianyizh jianyizh self-assigned this Sep 24, 2024

jianyizh commented Sep 25, 2024

@mengfei25
I can reproduce this on Ubuntu 22.04. For example, with AlbertForMaskedLM fp16 training I get:

W0925 06:28:19.597000 315406 site-packages/torch/_dynamo/utils.py:1723] Similarity score=0.9181798100471497
E0925 06:28:19.598000 315406 site-packages/torch/_dynamo/utils.py:1674] Accuracy failed for key name albert.embeddings.token_type_embeddings.weight.grad

The similarity score is exactly the same as in your test log. However, if I modify the patch and compare the results directly, it passes: you do not need to put fp64_outputs on XPU, and you can move new_result and correct_result to CPU for the comparison. https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L2898
This cosine-similarity failure is related to layer norm backward.
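The cosine-similarity gate that produces the `Similarity score=0.918...` / `fail_accuracy` lines above can be sketched in plain Python. This is a minimal sketch only: the `0.99` threshold and the helper names are assumptions for illustration, not the actual `torch._dynamo.utils` implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def check_accuracy(result_grad, reference_grad, threshold=0.99):
    # Hypothetical threshold: if the similarity drops below it, the
    # benchmark would report fail_accuracy for that parameter key.
    score = cosine_similarity(result_grad, reference_grad)
    return score >= threshold, score

# Small elementwise deviation keeps the direction intact,
# so the score stays close to 1.0 and the check passes.
ok, score = check_accuracy([1.0, 2.0, 3.0], [1.0, 2.1, 2.7])
```

A reported score of 0.918 for `token_type_embeddings.weight.grad` would fall below any threshold in this neighborhood, which is why the whole model run is marked `fail_accuracy` even if most other gradients match.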

jianyizh commented

CUDA also fails on cosine similarity:

cuda train AlbertForMaskedLM
W0927 14:50:50.191000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_inductor/debug.py:434] [3/0_1] AlbertForMaskedLM__0_forward_1 debug trace: /home/sparse/jianyi/pytorch/inductor_log/huggingface/AlbertForMaskedLM/training/amp_fp16/22/torch_compile_debug/torch_compile_debug/run_2024_09_27_14_50_33_656959-pid_3628133/torchinductor/AlbertForMaskedLM__0_forward_1.0
W0927 14:51:04.399000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_inductor/debug.py:434] [3/0_1] AlbertForMaskedLM__0_backward_3 debug trace: /home/sparse/jianyi/pytorch/inductor_log/huggingface/AlbertForMaskedLM/training/amp_fp16/22/torch_compile_debug/torch_compile_debug/run_2024_09_27_14_50_33_656959-pid_3628133/torchinductor/AlbertForMaskedLM__0_backward_3.1
W0927 14:51:05.220000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_dynamo/utils.py:1723] Similarity score=0.6160069704055786
E0927 14:51:05.221000 3628133 /mnt/tmp/home/sparse/miniforge3/envs/jianyi/lib/python3.10/site-packages/torch/_dynamo/utils.py:1674] Accuracy failed for key name albert.embeddings.token_type_embeddings.weight.grad
fail_accuracy

chuanqi129 commented

@jianyizh does this mean both XPU and CUDA will fail on cosine similarity for those models?

jianyizh commented

> @jianyizh does this mean both XPU and CUDA will fail on cosine similarity for those models?

Yes, and they can pass the normal accuracy test once the fp64 patch is corrected.
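One way a gradient can pass a direct elementwise-tolerance comparison yet fail the cosine-similarity check (a minimal sketch of the mechanism, not a confirmed root cause for this issue, which the thread attributes to layer norm backward) is when the gradient tensor is near zero: tiny numerical noise then dominates the direction, collapsing the cosine score even though every element agrees within tolerance. The `allclose` helper below is a simplified stand-in for a `torch.allclose`-style check.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def allclose(a, b, atol=1e-7):
    # Simplified elementwise absolute-tolerance comparison.
    return all(abs(x - y) <= atol for x, y in zip(a, b))

# Near-zero gradients: every element matches within tolerance,
# but the noise determines the direction entirely.
g_eager = [1e-8, 0.0, 0.0]
g_compiled = [0.0, 1e-8, 0.0]

assert allclose(g_eager, g_compiled)                  # direct comparison passes
assert cosine_similarity(g_eager, g_compiled) == 0.0  # cosine check fails
```

This is consistent with the observation that comparing `new_result` and `correct_result` directly on CPU passes while the similarity-based path reports `fail_accuracy`.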
