A Domain-Agnostic Method for Procedurally Generating LLM Evaluations
This is a supporting repository for our paper titled "Understanding Social Reasoning in LLMs with LLMs". We develop a method that uses large language models (LLMs) to procedurally generate evaluations for other LLMs. We apply this method to assess the performance of LLMs in a subdomain of social reasoning (Theory-of-Mind). Please check out our paper for further details.
```
├── code
│   ├── analysis
│   ├── prolific-exp-1
│   ├── prolific-exp-2
│   ├── prompt_instructions
│   ├── scripts
│   └── src
├── data
│   ├── bigtom
│   ├── expert_data
│   ├── social_iqa
│   └── prolific
├── .gitignore
├── LICENSE
└── requirements.txt
```
Download and install Miniconda:

```bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
```

Close and reopen your terminal, then reload your shell configuration:

```bash
source ~/.bashrc
```

or

```bash
source ~/.zshrc
```

Create and activate a conda environment, then install the dependencies:

```bash
conda create --name name-of-my-env python==3.10
conda activate name-of-my-env
pip install -r requirements.txt
```
The prompt for generating BigToM is in `code/prompt_instructions/bigtom.txt`, and the Python script is at `code/src/bigtom.py`. To generate BigToM, run the following commands:

```bash
cd code/src
python bigtom.py
python generate_conditions.py
```
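To quickly inspect the generated items, a minimal sketch like the one below can help. It assumes the generation scripts write CSV files under `data/bigtom/`; the file name `stories.csv` and the column names are hypothetical placeholders, so check the actual output of `bigtom.py` and `generate_conditions.py`.

```python
# Minimal sketch for inspecting generated BigToM items.
# Assumption: generation wrote a CSV under data/bigtom/ -- the file name
# "stories.csv" is a hypothetical placeholder.
import pandas as pd

df = pd.read_csv("../../data/bigtom/stories.csv")  # path relative to code/src
print(df.shape)                # number of generated items and fields
print(df.columns.tolist())     # available columns
print(df.head(3))              # peek at the first few generated stories
```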
We provide code for three kinds of human experiments:

- Expert Ratings: `code/src/expert_evaluate.py`
- Prolific Experiment for Rating Generated Stories: `code/prolific-exp-1`
- Prolific Experiment for Testing Human Participants: `code/prolific-exp-2`
We provide code to evaluate models on BigToM in `code/src/evaluate_conditions.py`. More specific experiment scripts are available in `code/scripts`.
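For orientation, the sketch below illustrates the general shape of a condition-level evaluation loop: present each story and question to a model, collect its answer, and score it against the correct option. This is not the repository's `evaluate_conditions.py`; the CSV path, column names, and `query_model` function are hypothetical placeholders you would replace with the actual script's arguments or your own model client.

```python
# Illustrative evaluation loop (not the repo's evaluate_conditions.py).
# The CSV path, column names, and query_model are hypothetical placeholders.
import pandas as pd

def query_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM call (e.g., an OpenAI or Anthropic client).
    return ""

def evaluate(csv_path: str) -> float:
    df = pd.read_csv(csv_path)
    correct = 0
    for _, row in df.iterrows():
        prompt = f"{row['story']}\nQuestion: {row['question']}\nAnswer:"
        answer = query_model(prompt)
        # Count a hit if the model's answer mentions the correct option.
        if str(row['correct_answer']).lower() in answer.lower():
            correct += 1
    return correct / len(df)

if __name__ == "__main__":
    acc = evaluate("../../data/bigtom/stories.csv")  # hypothetical path
    print(f"Accuracy: {acc:.2%}")
```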