Sycophancy Activation Steering

Setup

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Then create a .env file with the following variables (see .env.example):

HF_TOKEN=huggingface_token_with_access_to_llama2
# Optional: only needed for LLM-enabled eval
CLAUDE_API_KEY=api_key_for_claude
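
As a rough illustration of how a script might read these variables, the sketch below parses a .env-style file using only the standard library. The helper name load_env_file is hypothetical and not part of this repository; the repository's own scripts may load the file differently (for example, with a package such as python-dotenv).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


if __name__ == "__main__":
    env = load_env_file()
    # Values already exported in the shell take precedence over the file.
    hf_token = os.environ.get("HF_TOKEN", env.get("HF_TOKEN"))
    claude_key = os.environ.get("CLAUDE_API_KEY", env.get("CLAUDE_API_KEY"))
    print("HF_TOKEN set:", hf_token is not None)
    print("CLAUDE_API_KEY set (optional):", claude_key is not None)
```

Only HF_TOKEN is required for the steps below; CLAUDE_API_KEY can be left unset unless the LLM-enabled eval is used.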

Available commands

# Format datasets for generating steering vector and testing effect
python make_datasets.py --generate_test_split 0.8 --anthropic_custom_split 0.6 --n_datapoints 1200
# Generate steering vectors and optionally save full activations
python generate_vectors.py --layers 15 20 25 --save_activations
# Optionally, plot projected activations
python plot_activations.py --activations_pos_file activations/activations_pos_15.pt --activations_neg_file activations/activations_neg_15.pt --fname activations_proj_15.png --title "Activations layer 15"
# Apply steering vectors to the model and test the effect (--type can be one of "in_distribution", "out_of_distribution", "truthful_qa"; --few_shot can be one of "positive", "negative", "unbiased", "none")
python prompting_with_steering.py --type in_distribution --layers 15 20 25 --multipliers -1.5 -1 0 1 1.5 --few_shot positive
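
For orientation, here is a minimal sketch of the idea behind these commands, assuming the common mean-difference formulation of activation steering: the steering vector for a layer is the mean activation on positive (sycophantic) examples minus the mean activation on negative examples, and steering adds that vector, scaled by a multiplier, to the layer's activations. The function names and toy numbers are illustrative only and are not taken from generate_vectors.py or prompting_with_steering.py, which may differ in detail.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos_activations, neg_activations):
    """Mean-difference vector: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(pos_activations)
    neg_mean = mean_vector(neg_activations)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(activation, vector, multiplier):
    """Add the scaled steering vector to one activation vector."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


if __name__ == "__main__":
    # Toy 3-dimensional "activations" standing in for one layer's output.
    pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
    neg = [[0.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
    vec = steering_vector(pos, neg)
    print("steering vector:", vec)
    # Same multiplier values as in the example command above.
    for multiplier in (-1.5, -1, 0, 1, 1.5):
        print(multiplier, apply_steering([0.0, 0.0, 0.0], vec, multiplier))
```

Under this formulation a multiplier of 0 leaves the activations unchanged and so serves as a baseline, while negative multipliers push activations away from the positive direction.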

Full replicable experiments

Scripts that can be run to replicate the experiments are in the scripts/ folder.
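
As an illustration of what such a script might do, the sketch below builds one prompting_with_steering.py command for every combination of --type and --few_shot, using only the flags and values listed under Available commands. It prints the commands rather than running them. The actual contents of scripts/ are not shown here and may be organized differently, and the layer and multiplier values are simply those from the example command.

```python
import itertools
import shlex

# Values copied from the example command and its comment above.
EVAL_TYPES = ["in_distribution", "out_of_distribution", "truthful_qa"]
FEW_SHOT_MODES = ["positive", "negative", "unbiased", "none"]
LAYERS = [15, 20, 25]
MULTIPLIERS = [-1.5, -1, 0, 1, 1.5]


def build_commands():
    """Return one argument list per (eval type, few-shot mode) pair."""
    commands = []
    for eval_type, few_shot in itertools.product(EVAL_TYPES, FEW_SHOT_MODES):
        cmd = ["python", "prompting_with_steering.py", "--type", eval_type]
        cmd += ["--layers"] + [str(layer) for layer in LAYERS]
        cmd += ["--multipliers"] + [str(m) for m in MULTIPLIERS]
        cmd += ["--few_shot", few_shot]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_commands():
        # shlex.join quotes arguments so each line can be pasted into a shell.
        print(shlex.join(cmd))
```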

Analysis / charts

analysis/ contains scripts for Claude-enabled eval of out-of-distribution steering + plotting of result charts.

Running tests

A few unit tests cover some of the utility functions. To run them:

pytest

TODO

Test layer transference on more layers
Adapt for llama-13b (!!)
Add MMLU dataset and eval
Add reward hacking dataset and eval
Add Jupyter notebook for examining and visualizing vector similarity between layers and tokens + transferring vectors between layers
Adapt for llama-70b

Folders and files

analysis
llm_generated_data
results
scripts
utils
vectors
.env.example
.gitignore
README.md
generate_vectors.py
llama_wrapper.py
make_datasets.py
plot_activations.py
prompting_with_steering.py
requirements.txt

andyrdt/SycophancySteering
