wylupek/LLM_evaluation

Collection of popular solutions for evaluating large language model responses
Objective

The goal of this project is to develop and implement a mechanism for verifying the correctness of responses generated by chatbots and large language models (LLMs). This involves:

  1. Evaluating Mechanisms:

    • Implementing and deploying methods to assess the quality and accuracy of chatbot responses.
    • Utilizing established benchmarks and evaluation tools for the assessment.
  2. Custom Evaluation Criteria:

    • Designing and testing criteria tailored to specific requirements such as grammatical correctness, relevance, truthfulness, and toxicity.
    • Ensuring the criteria are adapted to meet the particular needs of different chatbot and LLM applications.
  3. Analysis:

    • Gathering and examining various evaluation methods.
    • Assessing their capabilities and effectiveness in measuring diverse criteria.

GLUE

The General Language Understanding Evaluation (GLUE) benchmark is a comprehensive suite of tasks designed to evaluate and compare the performance of natural language understanding models. It covers a variety of language tasks that probe different aspects of language understanding and reasoning. The main tasks used here are listed below; a usage sketch follows the list:

  • Corpus of Linguistic Acceptability (CoLA): Checks whether the response is grammatically correct.
  • Question Natural Language Inference (QNLI): Checks whether the response pertains to the question that was asked.
  • Multi-Genre Natural Language Inference (MNLI): Detects whether one sentence in a pair contradicts the other.
  • Microsoft Research Paraphrase Corpus (MRPC): Checks whether one sentence is a paraphrase of another; useful for verifying sentences extracted from the knowledge base.
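
A minimal sketch of scoring one of these criteria, grammatical acceptability, with an off-the-shelf classifier fine-tuned on CoLA. The checkpoint name "textattack/bert-base-uncased-CoLA" and its label mapping are assumptions; any CoLA-fine-tuned model from the Hugging Face Hub can be substituted:

```python
from transformers import pipeline

# Assumed checkpoint: a BERT model fine-tuned on CoLA from the Hugging Face Hub
cola_classifier = pipeline(
    "text-classification",
    model="textattack/bert-base-uncased-CoLA",
)

response = "The chatbot have answered the question correct."
result = cola_classifier(response)[0]

# For this checkpoint, LABEL_1 is assumed to mean "acceptable" and LABEL_0 "unacceptable"
print(result["label"], round(result["score"], 3))
```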

Prometheus

Prometheus is an open-source project that offers an alternative to GPT-4-based evaluation, enabling detailed assessment of large language models. Built on Llama-2-Chat and fine-tuned on 100,000 feedback samples from the Feedback Collection dataset, Prometheus evaluates language models against various custom criteria, such as readability for children, cultural sensitivity, and creativity. It can also serve as a reward model in reinforcement learning from human feedback (RLHF).

Prometheus allows for precise evaluation: well-formulated prompts check how well responses meet the criteria of interest, including truthfulness, conciseness, and completeness. The evaluation requires four input components: the instruction, the response to evaluate, the score rubric, and a reference answer. The score rubric ranges from 1 to 5, with a detailed description of the requirements for each score level. A significant open challenge is the automatic selection of an appropriate reference answer.
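
A minimal sketch of assembling those four inputs into a single evaluation prompt and running it through a text-generation pipeline. The template below is a simplified approximation and the checkpoint name "kaist-ai/prometheus-13b-v1.0" is an assumption; the exact prompt format and model should be taken from the official Prometheus repository:

```python
from transformers import pipeline

# Simplified approximation of the Prometheus evaluation prompt format
PROMPT_TEMPLATE = """###Task Description:
Given an instruction, a response, a reference answer and a score rubric,
write feedback and assign an integer score from 1 to 5.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
{rubric}

###Feedback:"""

# Assumed checkpoint name; replace with the model from the Prometheus repository
evaluator = pipeline("text-generation", model="kaist-ai/prometheus-13b-v1.0")

prompt = PROMPT_TEMPLATE.format(
    instruction="Explain photosynthesis to a ten-year-old.",
    response="Plants use sunlight, water and air to make their own food.",
    reference_answer="Photosynthesis is how plants turn sunlight, water and "
                     "carbon dioxide into sugar they use as food.",
    rubric="1: incorrect or confusing ... 5: accurate, complete and child-friendly",
)

# The generated feedback ends with the assigned score (1-5)
print(evaluator(prompt, max_new_tokens=256)[0]["generated_text"])
```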


DeepEval

DeepEval is an open-source framework for evaluating large language models using the LLM-as-a-judge approach. It supports writing unit tests in a Pytest style and ships ready-made implementations of many evaluation metrics for comprehensive analysis and assessment of chatbot responses. One of its key advantages is how easily custom measures can be added. The framework also supports testing models against popular benchmarks such as MMLU and HellaSwag. DeepEval's standout feature is its flexibility: any language model can be used as the evaluator.

Example metrics include (a Pytest-style sketch follows the list):

  • G-Eval: Similar to Prometheus-style evaluation; given any evaluation criterion or a list of evaluation steps, it produces a score between 0 and 1.
  • Summarization: Assesses how well a response summarizes a document from the knowledge base and whether the summary is sufficient and comprehensive. Optionally, it can incorporate questions that the knowledge-base document answers and that we want the chatbot to address.
  • Hallucination: Checks whether the response is factually consistent with the provided context (a document from the knowledge base).
  • Toxicity: Detects toxic and harmful content in the chatbot's response.
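
A minimal sketch of a Pytest-style DeepEval test combining two of these metrics, assuming the deepeval package is installed and an evaluation model (e.g. an OpenAI API key) is configured; the thresholds, criteria, and example texts are illustrative:

```python
from deepeval import assert_test
from deepeval.metrics import GEval, HallucinationMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams


def test_chatbot_answer():
    test_case = LLMTestCase(
        input="What is the refund period?",
        actual_output="You can request a refund within 30 days of purchase.",
        # Context plays the role of the knowledge-base document
        context=["Refunds are accepted up to 30 days after purchase."],
    )

    hallucination = HallucinationMetric(threshold=0.5)
    relevance = GEval(
        name="Relevance",
        criteria="Does the response directly answer the user's question?",
        evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    )

    # assert_test fails the test if any metric does not pass its threshold
    assert_test(test_case, [hallucination, relevance])
```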

Useful Links
