Benchmark evaluation for language models. #615

Open
mina58 opened this issue Aug 19, 2024 · 1 comment

mina58 commented Aug 19, 2024

Not sure whether this feature belongs in this library or would require a completely separate one. I am proposing a library where LLM benchmarks can be run, for example evaluating a model on HumanEval. Such a library would make evaluating LLMs much easier. I liked the dockerized approach used at https://github.com/NVlabs/verilog-eval to safely evaluate generated code. Evaluating the math and reasoning skills of LLMs could also be beneficial.
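
To make the idea concrete, a minimal sketch of such an evaluation loop could look like the following. It uses plain Python with a subprocess timeout instead of a real Docker sandbox; the `PROBLEM` dict and `run_candidate` helper are hypothetical placeholders, and a real harness would load the actual HumanEval problems and isolate execution in a container, as verilog-eval does:

```python
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical HumanEval-style problem: a prompt the model completes plus unit tests.
PROBLEM = {
    "prompt": "def add(a, b):\n",
    "test": textwrap.dedent("""
        assert add(1, 2) == 3
        assert add(-1, 1) == 0
    """),
}

def run_candidate(completion: str, problem: dict, timeout: float = 5.0) -> bool:
    """Run a model completion against the problem's tests in a separate Python
    process with a timeout, so untrusted generated code cannot hang or crash
    the evaluator (a real setup would add container isolation on top)."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], timeout=timeout,
                                capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# pass@1: fraction of problems whose single sampled completion passes the tests.
completions = ["    return a + b"]  # in practice these come from the LLM
passed = [run_candidate(c, PROBLEM) for c in completions]
print(f"pass@1 = {sum(passed) / len(passed):.2f}")
```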

Vipitis commented Aug 20, 2024

consider https://github.com/bigcode-project/bigcode-evaluation-harness, which includes a sandbox to run HumanEval (on Linux only).
