Refusal in Language Models Is Mediated by a Single Direction

Content warning: This repository contains text that is offensive, harmful, or otherwise inappropriate in nature.

This repository contains code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction". In the spirit of scientific reproducibility, we provide code to reproduce the main results from the paper.

Setup

git clone https://github.com/andyrdt/refusal_direction.git
cd refusal_direction
source setup.sh

The setup script will prompt you for a HuggingFace token (required to access gated models) and a Together AI token (required to access the Together AI API, which is used for evaluating jailbreak safety scores). It will then set up a virtual environment and install the required packages.

Reproducing main results

To reproduce the main results from the paper, run the following command:

python3 -m pipeline.run_pipeline --model_path {model_path}

where {model_path} is the path to a HuggingFace model. For example, for Llama-3 8B Instruct, the model path would be meta-llama/Meta-Llama-3-8B-Instruct.
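
To run the pipeline over several models in a row, the command above can be wrapped in a small script. The following is a minimal sketch, not part of the repository: the helper names `build_pipeline_command` and `run_pipeline_for_models` are hypothetical, and the script assumes it is launched from the repository root with the virtual environment active. It defaults to a dry run that only prints the commands.

```python
import subprocess


def build_pipeline_command(model_path: str) -> list[str]:
    """Return the argv list for one pipeline run on `model_path`."""
    return ["python3", "-m", "pipeline.run_pipeline", "--model_path", model_path]


def run_pipeline_for_models(model_paths: list[str], dry_run: bool = True) -> list[list[str]]:
    """Print (and, unless dry_run is set, execute) one pipeline command per model."""
    commands = [build_pipeline_command(path) for path in model_paths]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # Assumption: the current working directory is the repository root.
            # check=True stops the loop if a run exits with a non-zero status.
            subprocess.run(command, check=True)
    return commands


# Dry run: prints the command without executing it.
commands = run_pipeline_for_models(["meta-llama/Meta-Llama-3-8B-Instruct"])
```

Passing `dry_run=False` executes each command in sequence; a failed run raises `subprocess.CalledProcessError` rather than silently continuing to the next model.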

The pipeline performs the following steps:

1. Extract candidate refusal directions.
   - Artifacts will be saved in pipeline/runs/{model_alias}/generate_directions
2. Select the most effective refusal direction.
   - Artifacts will be saved in pipeline/runs/{model_alias}/select_direction
   - The selected refusal direction will be saved as pipeline/runs/{model_alias}/direction.pt
3. Generate completions over harmful prompts, and evaluate refusal metrics.
   - Artifacts will be saved in pipeline/runs/{model_alias}/completions
4. Generate completions over harmless prompts, and evaluate refusal metrics.
   - Artifacts will be saved in pipeline/runs/{model_alias}/completions
5. Evaluate cross-entropy (CE) loss metrics.
   - Artifacts will be saved in pipeline/runs/{model_alias}/loss_evals
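
To give some intuition for the first steps, here is a toy sketch of the idea described in the paper: a candidate direction is the difference between the mean activation on harmful prompts and the mean activation on harmless prompts, and ablating a direction means removing an activation's component along it. This is an illustration under stated assumptions, not the pipeline's implementation: it uses plain Python lists instead of model activations, the function names are hypothetical, and the toy vectors are made up.

```python
import math

Vector = list[float]


def mean_vector(vectors: list[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def difference_in_means(harmful: list[Vector], harmless: list[Vector]) -> Vector:
    """Mean harmful activation minus mean harmless activation."""
    return [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]


def normalize(vector: Vector) -> Vector:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ablate_direction(activation: Vector, unit_direction: Vector) -> Vector:
    """Remove the component of `activation` along `unit_direction`: x - (x . r) r."""
    coefficient = sum(x * r for x, r in zip(activation, unit_direction))
    return [x - coefficient * r for x, r in zip(activation, unit_direction)]


# Toy "activations" (made up): the two groups differ only along the first axis.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = normalize(difference_in_means(harmful, harmless))
ablated = ablate_direction([5.0, 2.0, -1.0], direction)
```

In this toy case the two groups differ only along the first coordinate, so the candidate direction points along that axis and ablation zeroes the first coordinate while leaving the others unchanged.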

For convenience, we have included pipeline artifacts for the smallest model in each model family.
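
To check which artifacts are present for a given run, the directory layout listed in the pipeline steps can be encoded in a small helper. This is a sketch only: the function names and the alias `example-model` are hypothetical, and the paths are taken directly from the layout described above, relative to the repository root.

```python
from pathlib import Path


def expected_artifacts(model_alias: str, runs_root: str = "pipeline/runs") -> dict[str, Path]:
    """Map each pipeline output to its expected path for `model_alias`."""
    run_dir = Path(runs_root) / model_alias
    return {
        "generate_directions": run_dir / "generate_directions",
        "select_direction": run_dir / "select_direction",
        "direction": run_dir / "direction.pt",
        "completions": run_dir / "completions",
        "loss_evals": run_dir / "loss_evals",
    }


def missing_artifacts(model_alias: str, runs_root: str = "pipeline/runs") -> list[str]:
    """Return the names of expected artifacts that do not exist on disk."""
    artifacts = expected_artifacts(model_alias, runs_root)
    return [name for name, path in artifacts.items() if not path.exists()]


# "example-model" is a placeholder alias, not a model shipped with the repository.
artifacts = expected_artifacts("example-model")
for name, path in artifacts.items():
    print(f"{name}: {path}")
```

Calling `missing_artifacts` with a real alias from the repository root returns an empty list once a full pipeline run has produced every output.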

Minimal demo Colab

As part of our blog post, we included a minimal demo of bypassing refusal. This demo is available as a Colab notebook.

As featured in

Since publishing our initial blog post in April 2024, our methodology has been independently reproduced and used many times. In particular, we acknowledge FailSpy for their work in reproducing and extending our methodology.

Our work has been featured in:

Citing this work

If you find this work useful in your research, please consider citing our paper:

@article{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Andy Arditi and Oscar Obeso and Aaquib Syed and Daniel Paleka and Nina Panickssery and Wes Gurnee and Neel Nanda},
  journal={arXiv preprint arXiv:2406.11717},
  year={2024}
}
