Can you provide some examples of llama and gemma on common benchmarks? #1978

Open
pass-lin opened this issue Nov 9, 2024 · 3 comments
Labels
Gemma Gemma model specific issues

Comments

pass-lin commented Nov 9, 2024

I am unable to reproduce the published performance of the Llama 3 and Gemma 2 models implemented in Keras Hub on the GSM8K benchmark.
Paper refs: https://arxiv.org/pdf/2407.21783 and https://arxiv.org/pdf/2408.00118

Could the Keras team please provide an example that replicates the published results, and also compare performance across the different backends?

@github-actions github-actions bot added the Gemma Gemma model specific issues label Nov 9, 2024
@Gopi-Uppari

Hi @pass-lin,

The links provided aren't working.

To reproduce these models with Keras, follow these steps:

  1. Loading the model.
  2. Preparing the dataset.
  3. Evaluating the model's performance.
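The evaluation step above is where most reproduction attempts diverge. As a minimal sketch of GSM8K-style scoring (the helper names here are illustrative, not from the gist): gold answers in GSM8K end with a `#### <number>` line, and a common convention is to take the last number in the model's generated text as its prediction, then compute exact-match accuracy.

```python
import re

def extract_gold_answer(answer_text):
    """GSM8K gold answers end with a '#### <number>' line."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", answer_text)
    return match.group(1).replace(",", "") if match else None

def extract_predicted_answer(generated_text):
    """Take the last number in the model's output as its final answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generated_text)
    return numbers[-1].replace(",", "") if numbers else None

def score(predictions, references):
    """Exact-match accuracy over the extracted numeric answers."""
    correct = sum(
        1 for pred, ref in zip(predictions, references)
        if extract_predicted_answer(pred) == extract_gold_answer(ref)
    )
    return correct / len(references)
```

Reported GSM8K numbers are sensitive to exactly these extraction choices, so any comparison against the papers should state them explicitly.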

Please refer to this gist file.

Could you please refer to this issue link, which contains detailed information about reproducing the models on the GSM8K dataset? Also see LLaMA3-Quantization.

Please note that the actual performance may vary based on the specific implementation details, such as prompt formatting and answer extraction methods.
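To make the prompt-formatting dependence concrete, here is one hypothetical few-shot prompt builder; the `Question:`/`Answer:` template is an illustrative choice, not the format used in the referenced papers.

```python
def build_gsm8k_prompt(question, shots):
    """Build a simple few-shot prompt for GSM8K.

    `shots` is a list of (question, worked_answer) pairs used as
    in-context examples; the final question is left for the model
    to complete after the trailing 'Answer:'.
    """
    parts = []
    for q, a in shots:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)
```

Changing this template (number of shots, delimiters, whether a chain-of-thought is shown) can shift GSM8K accuracy by several points, which is why results are hard to reproduce without the exact prompt.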

Thank you.


pass-lin commented Nov 11, 2024


I found some errors in your code:

```python
import os
os.environ["KERAS_BACKEND"] = "jax"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import keras_nlp
model = keras_nlp.models.Llama3CausalLM.from_preset("hf://NousResearch/Meta-Llama-3.1-8B")

from datasets import load_dataset
dataset = load_dataset("gsm8k", "main", split="test")

from tqdm import tqdm

# Count how often generation produces nothing beyond the input prompt.
incorrect = 0
total = len(dataset)

for data in tqdm(dataset):
    question = data["question"]
    answer = data["answer"]

    # Generate a response (the output includes the prompt).
    generated_answer = model.generate(question, max_length=100)

    # If the output equals the prompt, no new tokens were generated.
    if generated_answer == question:
        incorrect += 1

failure_rate = incorrect / total

print(f"LLAMA-3 fail on GSM8K: {failure_rate * 100:.2f}%")
```

You will find the model cannot generate normally: the generated answer is identical to the input question.
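One thing worth separating out when debugging this: `keras_nlp` causal LMs return the prompt plus the continuation by default, so comparing the full output to the question only detects whether any new tokens were produced at all. A small helper like the following (an illustrative sketch, not part of the keras_nlp API) isolates just the continuation for inspection:

```python
def strip_prompt(generated, prompt):
    """Return only the newly generated continuation.

    Assumes `generated` begins with `prompt`, which is how a
    causal LM's generate() typically returns its output.
    """
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated
```

If `strip_prompt(generated_answer, question)` is the empty string, the model really did emit zero new tokens (e.g. an immediate end-of-sequence), which is a different failure mode from emitting a wrong answer.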

@Gopi-Uppari

Hi @pass-lin,

Try adjusting the temperature and top-k sampling parameters; this should help the model generate the expected output or answer. Currently, the generated answer is the same as the input answer, not the question.
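For context on what those two knobs actually do to the next-token distribution, here is a minimal NumPy sketch (illustrative only, not the keras_nlp sampler implementation): temperature rescales the logits, top-k discards all but the k largest before renormalizing.

```python
import numpy as np

def top_k_temperature_probs(logits, k, temperature):
    """Apply temperature scaling, keep only the top-k logits,
    and renormalize into a sampling distribution."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Mask everything outside the k largest logits.
    cutoff = np.sort(logits)[-k]
    masked = np.where(logits >= cutoff, logits, -np.inf)
    # Softmax over the surviving logits.
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()
```

In keras_nlp these parameters are typically set by compiling the model with a sampler object (check your version's `keras_nlp.samplers` API for the exact signature) rather than implemented by hand as above.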

Thank you.
