Code release for our paper Debating with More Persuasive LLMs Leads to More Truthful Answers
- Python 3.11
- Virtual environment tool (e.g., virtualenv)
- Create and Activate a Virtual Environment:
virtualenv --python python3.11 .venv source .venv/bin/activate
- Install Required Packages:
pip install -r requirements.txt
- Install Pre-Commit Hooks:
make hooks
- Create a SECRETS file
API_KEY=<KEY> ANTHROPIC_API_KEY=<KEY> DEFAULT_ORG=
- If you want to use the frontend also add the following
DB_USER=psqluser DB_PASSWORD=password DB_HOST=localhost DB_PORT=5432 DB_NAME=debate
We recommend our minimal script to run everything end-to-end: ./scripts/reproduce_minimal.sh
. This will run all protocols (blind, consultancy, debate, interactive debate, expert judge) with GPT-4-Turbo expert and non-experts on 10 questions from QuALITY. Then run notebook scripts/plot_minimal.ipynb
to generate a minimal plot similar to Figure 1 in the paper.
Running ./scripts/run_figure1.sh
will reproduce the LLM results in Figure 1 but uses best-of-16 so watch your OpenAI usage limit.
# Run a debate and judgement on one question from the train-set of QuALITY
python -m core.main exp_dir='./exp/test1' +experiment='debate' +index=0 +swap=False
# Use a Claude 2.1 debater using best-of-8 and only 2 rounds of debate
python3 -m core.main exp_dir='./exp/test2' +experiment='debate' +index=0 +swap=False \
++correct_debater.BoN=8 \
++incorrect_debater.BoN=8 \
++correct_debater.language_model.model='claude-2.1' \
++incorrect_debater.language_model.model='claude-2.1' \
++rollout.num_steps=2 \
++anthropic_num_threads=10
# Debate with GPT-4-Turbo models and best-of-4 followed by judging and scoring on the train split of QuALITY
python -m core.debate \
exp_dir='./exp/test3' \
+experiment='debate'\
++correct_debater.language_model.model='gpt-4-1106-preview' \
++incorrect_debater.language_model.model='gpt-4-1106-preview' \
++correct_debater.BoN=4 \
++incorrect_debater.BoN=4 \
++max_num_from_same_story=5 \
++split=train
python -m core.judge \
exp_dir='./exp/test3' \
+experiment='debate'\
++judge.language_model.model='gpt-4-1106-preview' \
++judge_name='gpt-4-turbo'
python -m core.scoring.accuracy \
exp_dir='./exp/test3' \
+experiment='debate'\
++judge_name='gpt-4-turbo'
# Correct Consultantacy
python -m core.debate \
exp_dir='./exp/test3' \
+experiment='consultancy' \
method_type='correct'
# Incorrect Consultantacy
python -m core.debate \
exp_dir='./exp/test3' \
+experiment='consultancy' \
method_type='incorrect'
We can also call methods within other python scripts:
from core.debate import main as debate
from hydra import compose, initialize
with initialize(version_base=None,
config_path="../core/config/quality",
job_name="test4"):
cfg = compose(
config_name="config",
overrides=["method=debate","exp_dir=./exp/test4"])
debate(cfg)
T_h
: Testset used in the human trial with 47 questions. Use options:++max_num_from_same_story=5 ++split=both ++human_experiments=8
T_l
: Testset used by LLM experiments with 400 questions. Use options++max_num_from_same_story=5 ++split=train
D_l
: Development set used by LLM experiments with 291 questions. Use options++max_num_from_same_story=5 ++split=dev
Each csv requires has following fields. This csv file is generated by load/quality.py and it is then populated with debate transcripts by debate.py
and judge answers by judge.py
.
Field | Type | Description |
---|---|---|
id | str |
uuid from dataset |
question | str |
question (does not include answers) |
correct answer | str |
the answer to the question |
negative answer | str |
misleading or incorrect answer |
complete | bool |
used by pipelines to check stage is complete |
transcript | str |
entire debate transcript, generated by debate |
answer | str |
predicted answer generated by judge |
cd web/frontend && npm install && npm run dev
uvicorn web.backend.main:app --reload
core/main.py
: Script for testing debate and consultancy protocols on one question.core/debate.py
: Script for running debate protocol on a set of questions.core/judge.py
: Script for running a judge on the output transcripts of debate.py.core/scoring/accuracy.py
: Script for computing accuracy of judge answers.core/tournament.py
: Script for running a cross-play tournament with many debaters.core/scoring/rating.py
: Script for computing aggregate Elo rating debaters.core/llm_api
: Directory containing modules for async LLM inferencescripts
: Contains the scripts to reproduce the main figures in the paper and otehr minimal examples.scripts/human_trial_example
: Contains the script and config to reproduce a human experiment.data
: zip files containing LLM debates from the human trialweb
: Contains the code for the web frontend and backend to host a human trial or view debate transcripts.
Contributions to this repository are welcome. Please follow the standard procedures for submitting issues and pull requests.
If you would like to cite our work or data, you may use this citation:
@misc{khan2024debating,
title={Debating with More Persuasive LLMs Leads to More Truthful Answers},
author={Akbir Khan and John Hughes and Dan Valentine and Laura Ruis and Kshitij Sachan and Ansh Radhakrishnan and Edward Grefenstette and Samuel R. Bowman and Tim Rocktäschel and Ethan Perez},
year={2024},
eprint={2402.06782},
archivePrefix={arXiv},
primaryClass={cs.AI}
}