Commit

fixed error
bat-kryptonyte committed Jul 24, 2023
1 parent 2226a12 commit aa077a1
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/screens/ARB/Arb.tsx
@@ -285,7 +285,7 @@ const Arb: React.FC = () => {
Evaluation Results
</Heading>
<Text textAlign="justify" mt={4}>
-Our evaluation of current large language models (LLMs) focuses on text-only problems, with no multimodal tasks, using models including ChatGPT, GPT 3.5, GPT-4, and Claude. Each question type is assessed with task-specific instructions and chain of thought; for multiple-choice questions, the model's choice is compared with the correct answer, while numerical, symbolic, and proof-like problems require extraction and parsing of the model's answer, often requiring mathematical libraries and manual grading due to their complexity. We also tested two model-based approaches for grading, including GPT-4's ability to grade equivalence of two symbolic expressions and a rubric-based evaluation method, which showed promising results, facilitating the evaluation of increasingly unstructured answers.
+Our evaluation of current large language models (LLMs) focuses on text-only problems, with no multimodal tasks, using models including ChatGPT, GPT 3.5, GPT-4, and Claude. Each question type is assessed with task-specific instructions and chain of thought; for multiple-choice questions, the model&apos;s choice is compared with the correct answer, while numerical, symbolic, and proof-like problems require extraction and parsing of the model&apos;s answer, often requiring mathematical libraries and manual grading due to their complexity. We also tested two model-based approaches for grading, including GPT-4&apos;s ability to grade equivalence of two symbolic expressions and a rubric-based evaluation method, which showed promising results, facilitating the evaluation of increasingly unstructured answers.
</Text>
<Box mt={4}>
<Image
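The only difference between the removed and added lines is that each raw apostrophe in the JSX text becomes the entity `&apos;`. The commit message does not say which error this fixed. A plausible reading, and only an assumption here, is a lint rule such as `react/no-unescaped-entities`, which flags raw `'`, `"`, `>`, and `}` in JSX text. The sketch below shows the same substitution as a standalone function. `escapeJsxText`, `unescapeJsxText`, and `JSX_ENTITY_MAP` are hypothetical names for illustration and do not appear in the repository.

```typescript
// Hypothetical helper, not part of the repository: escapes characters that
// are commonly flagged when they appear raw inside JSX text.
const JSX_ENTITY_MAP: Record<string, string> = {
  "'": "&apos;",
  '"': "&quot;",
  ">": "&gt;",
  "}": "&#125;",
};

function escapeJsxText(text: string): string {
  // Replace each flagged character with its entity from the map.
  return text.replace(/['">}]/g, (ch) => JSX_ENTITY_MAP[ch]);
}

// Inverse mapping, used only to check that escaping loses no information.
function unescapeJsxText(text: string): string {
  return text.replace(/&apos;|&quot;|&gt;|&#125;/g, (entity) => {
    const original = Object.keys(JSX_ENTITY_MAP).find(
      (ch) => JSX_ENTITY_MAP[ch] === entity
    );
    return original ?? entity;
  });
}

// A fragment of the sentence changed in this commit.
const before = "the model's choice is compared with the correct answer";
const after = escapeJsxText(before);

console.log(after);
// prints: the model&apos;s choice is compared with the correct answer
```

JSX decodes the entity when it compiles the text, so the rendered paragraph reads the same before and after the change; only the source text differs.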

