
Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version) #2232

Merged: 68 commits, merged on Sep 10, 2024


Malikeh97
Contributor

Contributors: @shahrzads @Malikeh97

In this PR, we fully implemented and tested the following two task sets and added them to the lm-evaluation-harness repo.

  • All the benchmarks in the Open Arabic LLM Leaderboard, following their lighteval implementation: Link

    •   The tasks are under `lm_eval/tasks/arabic_leaderboard_complete` folder
      
  • A light version of the Open Arabic LLM Leaderboard benchmarks that uses 10% of the training sets and the full validation sets, for quick evaluation of models.

    •   The tasks are under `lm_eval/tasks/arabic_leaderboard_light` folder
      

    Note: Further details on the groups, tasks, and subtasks are given in a README file in each of the two folders.
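For readers who want to try either task set, here is a minimal sketch that assembles an `lm_eval` command line for a Hugging Face model. It assumes the group names passed to `--tasks` match the folder names (`arabic_leaderboard_complete` and `arabic_leaderboard_light`), which may not hold; the README in each folder lists the actual group names. The helper `build_lm_eval_command` and the model id `my-org/my-arabic-model` are hypothetical.

```python
import shlex
from typing import List, Optional

# ASSUMPTION: group names equal the folder names from this PR.
# Check each folder's README for the exact group/task names.
LEADERBOARD_GROUPS = {
    "complete": "arabic_leaderboard_complete",
    "light": "arabic_leaderboard_light",
}


def build_lm_eval_command(
    pretrained: str,
    version: str = "light",
    batch_size: int = 8,
    output_path: Optional[str] = None,
) -> List[str]:
    """Return an argv list for the lm_eval CLI using the Hugging Face ("hf") backend."""
    if version not in LEADERBOARD_GROUPS:
        raise ValueError(f"version must be one of {sorted(LEADERBOARD_GROUPS)}")
    cmd = [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", LEADERBOARD_GROUPS[version],
        "--batch_size", str(batch_size),
    ]
    # Only write results to disk when a path is given.
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


if __name__ == "__main__":
    # Dry run with a hypothetical model id: print the command instead of executing it.
    print(shlex.join(build_lm_eval_command("my-org/my-arabic-model")))
```

Starting with the light version is the cheaper way to sanity-check a model before running the complete set; the returned list can be passed to `subprocess.run` once the group names are confirmed.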

@Malikeh97
Contributor Author

Hi, thanks very much for the substantial contribution!

I've left a review--the main comment I have is that 1) the task names should be more clearly named, and 2) also, we should be very clear what tasks are machine-translated and what are not.

(perhaps using a prefix like arabic_leaderboard for these tasks' names is also desirable?)

Hi @haileyschoelkopf
Thanks for your constructive comments. My colleague @shahrzads addressed all the comments and fully tested both the light and full versions on sample models. It would be great if you could do a final review of the PR and let us know if anything is still missing.

Thanks for creating such a fantastic repo and it was our pleasure to contribute!

@shahrzads
Contributor

Hi @haileyschoelkopf,
Thank you for your thorough review and valuable feedback. We have carefully addressed all the comments and made the necessary adjustments to the code. We also tested both light and full versions on sample models. Please let us know if there are any further changes required. Otherwise, we believe the updates are ready for merge and look forward to your approval. Your prompt action would be greatly appreciated.

Thanks for the great repo @haileyschoelkopf, @lintangsutawika, and @baberabb. It was a pleasure to contribute to this amazing project!

@haileyschoelkopf
Collaborator

Thank you very much @shahrzads @Malikeh97 , I will review again today!

Collaborator

@haileyschoelkopf haileyschoelkopf left a comment


Hi @shahrzads @Malikeh97 , I reviewed again! Just a few more requests re: standardizing task naming (sorry...) then we'd be good to go!

Also, would it be possible to run the precommit on all changed files? (pip install pre-commit, pre-commit install, then re-commit changes)?

Resolved review threads (now outdated): lm_eval/tasks/README.md (2 threads), lm_eval/tasks/arabic_leaderboard_complete/README.md (5 threads)
@shahrzads
Contributor

Hi @haileyschoelkopf,

Thanks a lot for the review and your great comments! I have addressed all the comments, including standardizing the task naming.

I have also run the pre-commit on all the changed files, as requested.

Additionally, to align the benchmark more closely with OALL (https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard) and lighteval (https://github.com/huggingface/lighteval), we added the Arabic mt_mmlu to the benchmark. The details are also added to the README file. Please take a look and let me know if any further changes are needed.

Thanks again for your guidance!

@shahrzads
Contributor

Hi @haileyschoelkopf,

Hope you are doing well!

I would like to follow up on how we addressed your feedback and ask for your comments on the new changes. We would appreciate it if you could review the new version of the code and let us know your thoughts.

@haileyschoelkopf
Collaborator

Hi @shahrzads @Malikeh97 , thank you for your hard work on this and sorry for the delayed approval! Ready to merge, conditional on tests passing.

@Malikeh97
Contributor Author

Hi @haileyschoelkopf
Thanks for all the constructive comments! Our PR is ready to merge. However, I see 2 workflows awaiting approval. Is there anything we should do on our end?

Best,
Malikeh

@lintangsutawika lintangsutawika merged commit decc533 into EleutherAI:main Sep 10, 2024
7 of 9 checks passed
@lintangsutawika
Contributor

@Malikeh97 thanks for the PR!

jmercat pushed a commit to TRI-ML/lm-evaluation-harness that referenced this pull request Sep 25, 2024
…leutherAI#2232)

* arabic leaderboard yaml file is added

* arabic toxigen is implemented

* Dataset library is imported

* arabic sciq is added

* util file of arabic toxigen is updated

* arabic race is added

* arabic piqa is implemented

* arabic open qa is added

* arabic copa is implemented

* arabic boolq is added

* arabic arc easy is added

* arabic arc challenge is added

* arabic exams benchmark is implemented

* arabic hellaswag is added

* arabic leaderboard yaml file metrics are updated

* arabic mmlu benchmarks are added

* arabic mmlu group yaml file is updated

* alghafa benchmarks are added

* acva benchmarks are added

* acva utils.py is updated

* light version of arabic leaderboard benchmarks are added

* bugs fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* bug fixed

* library import bug is fixed

* doc to target updated

* bash file is deleted

* results folder is deleted

* leaderboard groups are added

* full arabic leaderboard groups are added, plus some bug fixes to the light version

* Create README.md

README.md for arabic_leaderboard_complete

* Create README.md

README.md for arabic_leaderboard_light

* Delete lm_eval/tasks/arabic_leaderboard directory

* Update README.md

* Update README.md

adding the Arabic leaderboards to the library

* Update README.md

10% of the training set

* Update README.md

10% of the training set

* revert .gitignore to prev version

* Update lm_eval/tasks/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* updated main README.md

* Update lm_eval/tasks/README.md

* specify machine translated benchmarks (complete)

* specify machine translated benchmarks (light version)

* add alghafa to the related task names (complete and light)

* add 'acva' to the related task names (complete and light)

* add 'arabic_leaderboard' to all the groups (complete and light)

* all dataset - not a random sample

* added more accurate details to the readme file

* added mt_mmlu from okapi

* Update lm_eval/tasks/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* Update lm_eval/tasks/README.md

Co-authored-by: Hailey Schoelkopf <[email protected]>

* updated mt_mmlu readme

* renaming 'alghafa' full and light

* renaming 'arabic_mmlu' light and full

* renaming 'acva' full and light

* update readme and standardize dir/file names

* running pre-commit

---------

Co-authored-by: shahrzads <[email protected]>
Co-authored-by: shahrzads <[email protected]>
Co-authored-by: Hailey Schoelkopf <[email protected]>
5 participants