Fail to reproduce the perplexity of Llama-2 7B on wikitext #2301

Open
Yonghao-Tan opened this issue Sep 15, 2024 · 10 comments

@Yonghao-Tan

Hi, when I use the command for evaluating Llama-2 7B on wikitext2:
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks wikitext --device cuda:0 --batch_size 1
The result is
[screenshot of lm-eval wikitext results: word_perplexity ≈ 8.7071]
However, the FP16 result reported in many papers is 5.47. Another confusing point is that for the other tasks, like piqa, winogrande, arc-e, arc-c, etc., I get exactly the same results as the papers report. Thanks!

@baberabb
Contributor

Hi! Can you provide a source? I'll check.

@Yonghao-Tan
Author

Thanks for your reply! The Llama-2 7B wikitext2 numbers come from several SOTA quantization works:
https://arxiv.org/pdf/2306.00978 (page 7)
https://github.com/qwopqwop200/GPTQ-for-LLaMa (Llama-1 only)
https://arxiv.org/pdf/2308.13137 (page 7)
They all report PPL 5.68 for Llama-1-7b and 5.47 for Llama-2-7b as the FP16 baseline, which is far from the 8.7071 I get with lm-eval.
Thank you in advance.

@lonleyodd

lonleyodd commented Sep 26, 2024


Hello, how did you test the zero-shot tasks like piqa, arc, etc.? Here are my results; compared to the results in the SpinQuant paper, something seems wrong. BTW, what's the difference between acc and acc_norm? I don't know which one to compare against.
[two screenshots of zero-shot task results]

@huweim

huweim commented Sep 27, 2024

For me, I use the GPTQ or AWQ codebase to run the wikitext evaluation.

@Yonghao-Tan
Author

I think the FP16 baseline in GPTQ or AWQ for wikitext2 is correct. However, lm-eval covers most of the datasets, so I'd like to use it for wikitext2 as well. The result is confusing because the other tasks, like the common-sense ones, match, and only wikitext2 fails to align with the other repos.

@Yonghao-Tan
Author

Is it possible that the metric used here for wikitext2 differs from the one used in other codebases? All the papers report roughly the same FP16 baseline for Llama-2-7b on wikitext2 (5.47).

@huweim

huweim commented Sep 27, 2024


Indeed. AWQ splits the wikitext evaluation off from lm-eval, and so does QuaRot (see the AWQ and QuaRot repos). That way you can reproduce the PPL results in most papers (5.47 for Llama-2-7b and 5.69 for Llama-7b).
Therefore, I chose to manually load the dataset and run the PPL calculation on wikitext and C4 myself.

For other tasks, lm-eval is good.
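
For reference, what those repos do is roughly: concatenate the raw wikitext-2 test split into one token stream, slice it into fixed-length windows (2048 tokens for Llama), and exponentiate the mean per-token NLL. Below is a minimal sketch of that procedure, assuming a Hugging Face causal LM and tokenizer; the function name `wikitext2_ppl` is illustrative, not taken from AWQ or QuaRot. Note that this gives token-level perplexity, whereas lm-eval's wikitext task reports word-level perplexity, which is presumably why the numbers look so different.

```python
import torch
from datasets import load_dataset


@torch.no_grad()
def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda"):
    # Concatenate the raw wikitext-2 test split into one long token stream,
    # the same way the GPTQ/AWQ evaluation scripts do.
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)

    nsamples = input_ids.numel() // seqlen
    nlls = []
    for i in range(nsamples):
        batch = input_ids[:, i * seqlen : (i + 1) * seqlen]
        # HF causal LMs return the mean token NLL when labels are provided.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)

    # Token-level perplexity over all full windows (what the papers report).
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()
```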

@Yonghao-Tan
Author


Thanks. Do you mean manually loading the dataset and running the PPL calculation on wikitext and C4 inside lm-eval (i.e., changing the lm-eval code)?

@huweim

huweim commented Sep 27, 2024


Yes. Just refer to the implementation of AWQ and QuaRot.

Maybe there is a better way:)
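
For what it's worth, a hypothetical way to run the sketch above on the model from this issue (the helper name and call are illustrative, not part of AWQ or QuaRot):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="cuda:0"
).eval()

# The quantization papers report ~5.47 for this FP16 baseline.
print(wikitext2_ppl(model, tok, device="cuda:0"))
```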

@Yonghao-Tan
Author

Thanks a lot! I'll try that
