Fail to reproduce the perplexity of Llama-2 7B on wikitext #2301

Open
Yonghao-Tan opened this issue Sep 15, 2024 · 10 comments

@Yonghao-Tan

Hi, when I use the command for evaluating Llama-2 7B on wikitext2:
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks wikitext --device cuda:0 --batch_size 1
The result is
[screenshot of lm-eval wikitext results: word_perplexity ≈ 8.7071]
However, the FP16 result reported in many papers is 5.47. Another confusing point is that for the other tasks, like piqa, winogrande, arc-e, arc-c, etc., I get exactly the same results as the papers report. Thanks!

@baberabb
Contributor

Hi! Can you provide a source? I'll check.

@Yonghao-Tan
Author

Thanks for your reply! The Llama-2 7B wikitext2 numbers come from several SOTA quantization works:
https://arxiv.org/pdf/2306.00978 (page 7)
https://github.com/qwopqwop200/GPTQ-for-LLaMa (Llama-1 only)
https://arxiv.org/pdf/2308.13137 (page 7)
They all report PPL 5.68 for Llama-1-7b and 5.47 for Llama-2-7b as the FP16 baseline, which is far from the 8.7071 I get with lm-eval.
Thank you in advance.

@lonleyodd

lonleyodd commented Sep 26, 2024


Hello, how did you test the zero-shot tasks like piqa, arc, etc.? Here are my results; compared to the results in the SpinQuant paper, something seems wrong. BTW, what's the difference between acc and acc_norm? I don't know which one to compare against.
[two screenshots of zero-shot task results]

@huweim

huweim commented Sep 27, 2024

For me, I use the GPTQ or AWQ codebase to run the wikitext evaluation.

@Yonghao-Tan
Author

I think the FP16 baseline in GPTQ or AWQ for wikitext2 is correct. However, lm-eval covers most of the datasets, so I'd like to use it for wikitext2 as well. The result is confusing because the other tasks, like the common-sense ones, match, and only wikitext2 fails to align with the other repos.

@Yonghao-Tan
Author

Is it possible that the metric used here for wikitext2 differs from the one used in other codebases? All the papers report roughly the same FP16 baseline for Llama-2-7b on wikitext2 (5.47).

@huweim

huweim commented Sep 27, 2024


Indeed. AWQ splits the wikitext evaluation off from lm-eval, and so does QuaRot (see the AWQ and QuaRot repos). That way you can reproduce the PPL results in most papers (5.47 for Llama-2-7b and 5.69 for Llama-7b).
Therefore, I chose to manually load the dataset and run the PPL calculation on wikitext and C4 myself.

For other tasks, lm-eval is good.
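
For reference, what those repos do is roughly: concatenate the raw wikitext-2 test split into one token stream, slice it into fixed-length windows (2048 tokens for Llama), and exponentiate the mean per-token NLL. Below is a minimal sketch of that procedure, assuming a Hugging Face causal LM and tokenizer; the function name `wikitext2_ppl` is illustrative, not taken from AWQ or QuaRot. Note that this gives token-level perplexity, whereas lm-eval's wikitext task reports word-level perplexity, which is presumably why the numbers look so different.

```python
import torch
from datasets import load_dataset


@torch.no_grad()
def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda"):
    # Concatenate the raw wikitext-2 test split into one long token stream,
    # the same way the GPTQ/AWQ evaluation scripts do.
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)

    nsamples = input_ids.numel() // seqlen
    nlls = []
    for i in range(nsamples):
        batch = input_ids[:, i * seqlen : (i + 1) * seqlen]
        # HF causal LMs return the mean token NLL when labels are provided.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)

    # Token-level perplexity over all full windows (what the papers report).
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()
```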

@Yonghao-Tan
Author


Thanks. Do you mean manually loading the dataset and running the PPL calculation on wikitext and C4 inside lm-eval (i.e., changing the lm-eval code)?

@huweim

huweim commented Sep 27, 2024


Yes. Just refer to the implementation of AWQ and QuaRot.

Maybe there is a better way:)
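
For what it's worth, a hypothetical way to run the sketch above on the model from this issue (the helper name and call are illustrative, not part of AWQ or QuaRot):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="cuda:0"
).eval()

# The quantization papers report ~5.47 for this FP16 baseline.
print(wikitext2_ppl(model, tok, device="cuda:0"))
```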

@Yonghao-Tan
Author

Thanks a lot! I'll try that
