A Domain-Agnostic Method for Procedurally Generating LLM Evaluations
This is a supporting repository for our paper titled "Understanding Social Reasoning in LLMs with LLMs". We develop a method that uses large language models (LLMs) to procedurally generate evaluations for other LLMs. We apply this method to assess the performance of LLMs in a subdomain of social reasoning (Theory-of-Mind). Please check out our paper for further details.
```
├── code
│   ├── analysis
│   ├── prolific-exp-1
│   ├── prolific-exp-2
│   ├── prompt_instructions
│   ├── scripts
│   └── src
├── data
│   ├── bigtom
│   ├── expert_data
│   ├── social_iqa
│   └── prolific
├── .gitignore
├── LICENSE
└── requirements.txt
```
Download and install Miniconda:

```bash
curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
```

Close and reopen your terminal, then reload your shell configuration:

```bash
source ~/.bashrc
```

or

```bash
source ~/.zshrc
```

Create and activate a conda environment, then install the dependencies:

```bash
conda create --name name-of-my-env python==3.10
conda activate name-of-my-env
pip install -r requirements.txt
```
The prompt for generating BigToM is in `code/prompt_instructions/bigtom.txt`, and the Python script is at `code/src/bigtom.py`. To generate BigToM, run the following commands:

```bash
cd code/src
python bigtom.py
python generate_conditions.py
```
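To quickly inspect the generated items, a minimal sketch like the one below can help. It assumes the generation scripts write CSV files under `data/bigtom/`; the file name `stories.csv` and the column names are hypothetical placeholders, so check the actual output of `bigtom.py` and `generate_conditions.py`.

```python
# Minimal sketch for inspecting generated BigToM items.
# Assumption: generation wrote a CSV under data/bigtom/ -- the file name
# "stories.csv" is a hypothetical placeholder.
import pandas as pd

df = pd.read_csv("../../data/bigtom/stories.csv")  # path relative to code/src
print(df.shape)                # number of generated items and fields
print(df.columns.tolist())     # available columns
print(df.head(3))              # peek at the first few generated stories
```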
We provide code for three kinds of human experiments:

- Expert Ratings: `code/src/expert_evaluate.py`
- Prolific Experiment for Rating Generated Stories: `code/prolific-exp-1`
- Prolific Experiment for Testing Human Participants: `code/prolific-exp-2`
We provide code to evaluate models on BigToM in `code/src/evaluate_conditions.py`. More specific experiment scripts are available in `code/scripts`.
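For orientation, the sketch below illustrates the general shape of a condition-level evaluation loop: present each story and question to a model, collect its answer, and score it against the correct option. This is not the repository's `evaluate_conditions.py`; the CSV path, column names, and `query_model` function are hypothetical placeholders you would replace with the actual script's arguments or your own model client.

```python
# Illustrative evaluation loop (not the repo's evaluate_conditions.py).
# The CSV path, column names, and query_model are hypothetical placeholders.
import pandas as pd

def query_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM call (e.g., an OpenAI or Anthropic client).
    return ""

def evaluate(csv_path: str) -> float:
    df = pd.read_csv(csv_path)
    correct = 0
    for _, row in df.iterrows():
        prompt = f"{row['story']}\nQuestion: {row['question']}\nAnswer:"
        answer = query_model(prompt)
        # Count a hit if the model's answer mentions the correct option.
        if str(row['correct_answer']).lower() in answer.lower():
            correct += 1
    return correct / len(df)

if __name__ == "__main__":
    acc = evaluate("../../data/bigtom/stories.csv")  # hypothetical path
    print(f"Accuracy: {acc:.2%}")
```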