Can you provide some examples of llama and gemma on common benchmarks? #1978

Open
pass-lin opened this issue Nov 9, 2024 · 3 comments
Labels
Gemma Gemma model specific issues

Comments

pass-lin commented Nov 9, 2024

I am unable to reproduce the published performance of the Llama 3 and Gemma 2 models implemented in Keras Hub on the GSM8K benchmark.
Paper refs: https://arxiv.org/pdf/2407.21783 and https://arxiv.org/pdf/2408.00118

Could the Keras team please provide an example that replicates the published results, and also compare performance across the different backends?

@github-actions github-actions bot added the Gemma Gemma model specific issues label Nov 9, 2024
@Gopi-Uppari

Hi @pass-lin,

The links provided aren't working.

To reproduce these models with Keras, follow these steps:

  1. Loading the model.
  2. Preparing the dataset.
  3. Evaluating the model's performance.
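The evaluation step above is where most reproduction attempts diverge. As a minimal sketch of GSM8K-style scoring (the helper names here are illustrative, not from the gist): gold answers in GSM8K end with a `#### <number>` line, and a common convention is to take the last number in the model's generated text as its prediction, then compute exact-match accuracy.

```python
import re

def extract_gold_answer(answer_text):
    """GSM8K gold answers end with a '#### <number>' line."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", answer_text)
    return match.group(1).replace(",", "") if match else None

def extract_predicted_answer(generated_text):
    """Take the last number in the model's output as its final answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generated_text)
    return numbers[-1].replace(",", "") if numbers else None

def score(predictions, references):
    """Exact-match accuracy over the extracted numeric answers."""
    correct = sum(
        1 for pred, ref in zip(predictions, references)
        if extract_predicted_answer(pred) == extract_gold_answer(ref)
    )
    return correct / len(references)
```

Reported GSM8K numbers are sensitive to exactly these extraction choices, so any comparison against the papers should state them explicitly.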

Please refer to this gist file.

Could you please refer to this issue link, which contains detailed information about reproducing the models on the GSM8K dataset? Also see LLaMA3-Quantization.

Please note that the actual performance may vary based on the specific implementation details, such as prompt formatting and answer extraction methods.
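To make the prompt-formatting dependence concrete, here is one hypothetical few-shot prompt builder; the `Question:`/`Answer:` template is an illustrative choice, not the format used in the referenced papers.

```python
def build_gsm8k_prompt(question, shots):
    """Build a simple few-shot prompt for GSM8K.

    `shots` is a list of (question, worked_answer) pairs used as
    in-context examples; the final question is left for the model
    to complete after the trailing 'Answer:'.
    """
    parts = []
    for q, a in shots:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)
```

Changing this template (number of shots, delimiters, whether a chain-of-thought is shown) can shift GSM8K accuracy by several points, which is why results are hard to reproduce without the exact prompt.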

Thank you.


pass-lin commented Nov 11, 2024


I found some errors in your code:

```python
import os
os.environ["KERAS_BACKEND"] = "jax"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import keras_nlp
model = keras_nlp.models.Llama3CausalLM.from_preset("hf://NousResearch/Meta-Llama-3.1-8B")

from datasets import load_dataset
dataset = load_dataset("gsm8k", "main", split="test")

from tqdm import tqdm

# Count how often generation produces nothing beyond the input prompt.
incorrect = 0
total = len(dataset)

for data in tqdm(dataset):
    question = data["question"]
    answer = data["answer"]

    # Generate a response (the output includes the prompt).
    generated_answer = model.generate(question, max_length=100)

    # If the output equals the prompt, no new tokens were generated.
    if generated_answer == question:
        incorrect += 1

failure_rate = incorrect / total

print(f"LLAMA-3 fail on GSM8K: {failure_rate * 100:.2f}%")
```

You will find the model cannot generate normally: the generated answer is identical to the input question.
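One thing worth separating out when debugging this: `keras_nlp` causal LMs return the prompt plus the continuation by default, so comparing the full output to the question only detects whether any new tokens were produced at all. A small helper like the following (an illustrative sketch, not part of the keras_nlp API) isolates just the continuation for inspection:

```python
def strip_prompt(generated, prompt):
    """Return only the newly generated continuation.

    Assumes `generated` begins with `prompt`, which is how a
    causal LM's generate() typically returns its output.
    """
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated
```

If `strip_prompt(generated_answer, question)` is the empty string, the model really did emit zero new tokens (e.g. an immediate end-of-sequence), which is a different failure mode from emitting a wrong answer.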

@Gopi-Uppari

Hi @pass-lin,

Try adjusting the temperature and top-k sampling parameters; this should help the model generate the expected output or answer. Currently, the generated answer is the same as the input answer, not the question.
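For context on what those two knobs actually do to the next-token distribution, here is a minimal NumPy sketch (illustrative only, not the keras_nlp sampler implementation): temperature rescales the logits, top-k discards all but the k largest before renormalizing.

```python
import numpy as np

def top_k_temperature_probs(logits, k, temperature):
    """Apply temperature scaling, keep only the top-k logits,
    and renormalize into a sampling distribution."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Mask everything outside the k largest logits.
    cutoff = np.sort(logits)[-k]
    masked = np.where(logits >= cutoff, logits, -np.inf)
    # Softmax over the surviving logits.
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()
```

In keras_nlp these parameters are typically set by compiling the model with a sampler object (check your version's `keras_nlp.samplers` API for the exact signature) rather than implemented by hand as above.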

Thank you.
