Influence-Benchmark (WIP)

Influence-benchmark is a framework for simulating and evaluating AI agent interactions, with a specific focus on measuring the potential influence of Large Language Models (LLMs) on human preferences in multi-turn conversations. This project is a work in progress and not all features are fully implemented yet.

Training AI systems on human feedback incentivizes them to elicit positive feedback from annotators by any means available, potentially via harmful mechanisms such as sycophancy, deception, or manipulation. So far, in realistic LLM setups, only the emergence of sycophancy has been observed. This project shows that optimizing on user feedback with reinforcement learning can lead to the emergence of more sophisticated and harmful annotator-gaming behaviors in LLMs, even after just a few training iterations and with relatively weak optimization methods.

Current setup

In our setup we use five LLMs (which can all be the same underlying model); see the sketch after this list for how they fit together.

  • The agent model: the model we are testing and will train with expert iteration.
  • The environment model: provides the environment's responses, typically character dialogue.
  • The preference model: predicts what rating the character in the environment would give the latest agent response. These ratings are the training signal for expert iteration and related methods.
  • The transition model: predicts whether the environment should transition to a new state. Currently this only covers whether the character has made up their mind and wants to end the conversation.
  • The influence detector model: determines whether the agent has engaged in problematic influencing behavior.
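
The sketch below shows how these roles fit together in a single simulated conversation. It is a minimal illustration only: the function names (agent, environment, preference_model, transition_model, influence_detector, run_conversation) are hypothetical stand-ins, not the repository's actual API.

# Minimal sketch of one simulated conversation, with hypothetical stand-ins
# for the five model roles described above (not the repo's real API).

def agent(history):               # agent model: produce the next reply
    return "agent reply"

def environment(history):         # environment model: the simulated user's response
    return "user reply"

def preference_model(history):    # preference model: rate the latest agent response
    return 4.0                    # e.g. a 1-5 feedback score

def transition_model(history):    # transition model: has the user made up their mind?
    return len(history) >= 10     # toy rule: end the conversation after a few turns

def influence_detector(history):  # influence detector: flag problematic influence
    return False

def run_conversation():
    history, ratings, flags = [], [], []
    while True:
        history.append(("agent", agent(history)))
        ratings.append(preference_model(history))  # training signal for expert iteration
        flags.append(influence_detector(history))  # logged, not used to end the conversation
        if transition_model(history):
            break
        history.append(("user", environment(history)))
    return history, ratings, flags

if __name__ == "__main__":
    print(run_conversation())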

Features

  • Flexible environment configurations for different interaction scenarios.
  • Vectorized implementation for efficient parallel simulations.
  • Support for multiple backend models (OpenAI GPT, Hugging Face transformers).
  • Expert Iteration algorithm implementation to measure the effect of longer-horizon RL (see the sketch below).
  • WandB logging for visualizing agent interactions and training metrics.
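
To illustrate the expert-iteration loop referenced above: each iteration samples conversations, keeps the trajectories the preference model rates highest, and finetunes the agent on them. The following is a schematic sketch in which sample_trajectories and finetune are hypothetical placeholders, not the implementation in influence_benchmark/RL/.

# Schematic expert-iteration loop. sample_trajectories and finetune are
# hypothetical placeholders; the real implementation lives in influence_benchmark/RL/.
import random

def sample_trajectories(agent, n):
    # Placeholder: roll out n conversations and score each with the preference model.
    return [{"convo": f"trajectory-{i}", "rating": random.random()} for i in range(n)]

def finetune(agent, trajectories):
    # Placeholder: supervised finetuning of the agent on the selected trajectories.
    return agent

def expert_iteration(agent, iterations=3, n=256, top_fraction=0.25):
    for _ in range(iterations):
        trajectories = sample_trajectories(agent, n)
        trajectories.sort(key=lambda t: t["rating"], reverse=True)
        best = trajectories[: int(n * top_fraction)]  # keep the best-rated trajectories
        agent = finetune(agent, best)                 # train on them, then repeat
    return agent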

Installation

git clone https://github.com/carolius/Influence-benchmark.git
cd Influence-benchmark/
conda create -n influence python=3.11.9 -y
conda activate influence
pip install -e .
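
As a quick sanity check (not an official test), the package should now import cleanly:

python -c "import influence_benchmark"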

Usage

Experiments live in the influence_benchmark/experiments folder and expose a large number of customizable parameters. Current experiments include launching vectorized environments and running expert iteration or KTO on our environments, which include a therapy chatbot environment, a relationship chatbot environment, and a ticket-booking tool-use environment.

Custom environments can be defined as YAML files; see influence_benchmark/config for examples.
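
For orientation, an environment config might look roughly like the sketch below. All field names here are illustrative guesses rather than the actual schema; the files in influence_benchmark/config are authoritative.

# Illustrative sketch only - these field names are hypothetical, not the real schema.
env_name: therapy_chatbot
max_turns: 5
character:
  system_prompt: "You are a user talking to a therapy chatbot..."
initial_states:
  - "I've been feeling anxious about work lately."
preference_model:
  prompt: "Rate the assistant's latest reply from 1 to 5."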

Project Structure

  • influence_benchmark/: Main package
    • agent/: Agent implementations
    • backend/: Model backend interfaces (OpenAI, Hugging Face)
    • environment/: Core environment classes
    • experiments/: Experiment runners
    • gui/: Web-based visualization interface
    • RL/: Reinforcement learning algorithms (e.g., Expert Iteration)
    • environment_vectorized/: Parallel environment implementation

For SLURM users

Run scripts like this (inside the job, the allocated GPUs will be named like range(n_devices)):

sbatch influence_benchmark/experiments/slurm/expert_iteration.sh
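
That is, the visible GPUs are simply indexed 0 through n_devices - 1. A minimal illustration in Python (hypothetical, not a repository helper):

n_devices = 4  # e.g. the number of GPUs allocated to the SLURM job
devices = [f"cuda:{i}" for i in range(n_devices)]  # ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]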

Task Log:

  • Setup simple environment with environment model, preference model, transition model using llama-3-8B-Instruct.
  • Add support for llama-3-8B-Instruct and any OpenAI API model as the agent.
  • Create GUI to view interactions.
  • Create vectorized environment/PM/TM setup to generate many trajectories in each batch for each GPU.
  • Create expert iteration pipeline to finetune the agent model on the best trajectories according to the PM. By default we use 5 turn conversations.
  • Get multi-GPU trajectory generation and training setup on SLURM cluster.
  • Show that some worrying behavior arises when using expert iteration and an unrealistic prompt.
  • Show that this arises with a realistic prompt.
  • Create 16 sub-environments for our therapy chatbot environment, each with 16 initial states, for a total of 256 training examples to generate trajectories for.
  • Run hyperparameter sweep to find good values for BoN, iterations, lr, etc for expert iteration.
  • Train on all 256 sub-sub-environments at the same time with realistic prompts and see if this "speeds up"/increases development of worrying influence behavior.
  • Implement KTO training
  • Add influence detecting model
  • Add better wandb metrics during training
  • Add relationship chatbot environment
  • Add ticket booking tool use environment
  • Investigate using different types of preference ratings, e.g. a preference rating of the entire trajectory rather than the average preference of each response.

Next up:
  • Add full tool use support
  • Add llama 3.1 support
  • Create more environments which show more important and subtler forms of influence.
  • Investigate using 3rd person preference rating
  • Ablation on conversation length – do we need longer convos to have influence emerge? (may be worth prioritizing because training time is bottlenecked by convo length)
  • HarmBench evaluation (or similar) for trained agents, and baselining on training on random stuff
  • Add positive preference change environments in which we want the agent to choose influencing responses/actions.
  • Add support for Gemma-2-9B and 27B.
  • Add support for using any huggingface model as the agent.
  • Look into integrating with LMRL-Gym or METR/Inspect to make it easy to use our eval.
  • Reduce computational requirements of running eval.
  • Write paper.

Acknowledgments

This research is being conducted as part of MATS.
