# Unveiling the Implicit Toxicity in LLMs

This repository contains the data and code for our EMNLP 2023 paper:

**Unveiling the Implicit Toxicity in Large Language Models**

In this work, we show that large language models (LLMs) can generate diverse implicitly toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting. We further propose an RL-based method to induce implicit toxicity in LLMs by optimizing a reward that prefers implicitly toxic outputs over explicitly toxic and non-toxic ones.
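The intuition behind the reward can be pictured with a toy sketch: a response should earn a high reward only when it is toxic yet evades an existing toxicity detector. The function and thresholds below are hypothetical illustrations of this preference, not the repository's actual reward code.

```python
# Toy illustration (not the repo's code): a reward that prefers implicitly
# toxic text, i.e., text that is toxic but slips past a toxicity detector.
def implicit_toxicity_reward(is_toxic: bool, detector_score: float) -> float:
    """is_toxic: ground-truth toxicity label of the response;
    detector_score: a detector's toxicity probability in [0, 1]
    (both are hypothetical interfaces for illustration)."""
    if not is_toxic:
        return -1.0              # non-toxic: no reward
    return 1.0 - detector_score  # toxic but undetected -> highest reward

# Example: an explicit slur is flagged (score ~0.95) -> reward ~0.05,
# while an implicitly toxic insinuation (score ~0.10) -> reward ~0.90.
```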

## 1. Install

```bash
conda create -n implicit python=3.10
conda activate implicit
pip install -r requirements.txt
```

## 2. Prepare Data

Training and test data can be found at [huggingface.co/datasets/jiaxin-wen/Implicit-Toxicity](https://huggingface.co/datasets/jiaxin-wen/Implicit-Toxicity) (see the download sketch after the list):

- training data
  - `sft-train.json`: training data for supervised learning
  - `reward-train.json`: training data for reward model training and RL
  - `aug-train.json`: the 4K human-labeled training data
- test data
  - `test.json`: the implicitly toxic test data (generated by zero-shot prompting on ChatGPT and by the RL-tuned LLaMA-13B)
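If you prefer to fetch the files programmatically rather than downloading them by hand, here is a minimal sketch using `huggingface_hub`. It assumes each file is a plain JSON array sitting at the root of the dataset repo; the record schema is not documented here, so the code only inspects it.

```python
import json

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download one split from the dataset repo and peek at its structure.
path = hf_hub_download(
    repo_id="jiaxin-wen/Implicit-Toxicity",
    filename="sft-train.json",  # assumed to sit at the repo root
    repo_type="dataset",
)
with open(path) as f:
    records = json.load(f)  # assumed to be a JSON array of records
print(len(records), "records; first record keys:", list(records[0].keys()))
```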

## 3. Inducing Implicit Toxicity in LLMs via Reinforcement Learning

### 3.1 Supervised Learning

```bash
cd sft
bash train.sh
```
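`train.sh` wraps the supervised fine-tuning run. For orientation, the sketch below shows the general shape of such a step with Hugging Face `transformers`; the base model, data field names, and hyperparameters are placeholders, not the script's actual settings.

```python
# Minimal causal-LM fine-tuning sketch (assumed shape of sft/train.sh).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "huggyllama/llama-7b"  # placeholder base model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

ds = load_dataset("json", data_files="sft-train.json")["train"]

def to_features(ex):
    # "prompt" and "response" are assumed field names; adjust to the data.
    text = ex["prompt"] + ex["response"]
    return tok(text, truncation=True, max_length=512)

ds = ds.map(to_features, remove_columns=ds.column_names)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```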

### 3.2 Reward Model Training

```bash
cd reward_model
bash train.sh
```
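Reward model training here amounts to learning to rank an implicitly toxic response above the alternatives. A common formulation for such preference data is the pairwise (Bradley-Terry) ranking loss sketched below; whether the repository uses exactly this loss is an assumption.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_preferred: torch.Tensor,
                         r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the scalar reward of the preferred
    (implicitly toxic) response above that of the rejected one."""
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Example with dummy reward scores for a batch of 3 preference pairs:
loss = pairwise_reward_loss(torch.tensor([0.8, 0.2, 1.1]),
                            torch.tensor([0.1, 0.5, 0.3]))
print(loss.item())
```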

### 3.3 Reinforcement Learning

```bash
# start the reward-model services first (e.g., in separate shells or in the
# background); both are pinned to GPU 7 here
CUDA_VISIBLE_DEVICES=7 python reward_api.py
CUDA_VISIBLE_DEVICES=7 python attack_reward_api.py
# then launch RL training
bash train.sh
```
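The two `*_reward_api.py` scripts appear to serve the reward models as local services that the RL trainer queries when scoring rollouts. The client sketch below assumes a simple HTTP/JSON interface; the endpoint, port, and payload format are guesses for illustration, not the scripts' actual API.

```python
import requests  # illustrative client; endpoint, port, payload are assumptions

def get_rewards(texts: list[str],
                url: str = "http://localhost:8000/reward") -> list[float]:
    """Query a locally served reward model for a batch of responses."""
    resp = requests.post(url, json={"texts": texts}, timeout=60)
    resp.raise_for_status()
    return resp.json()["rewards"]

# During RL training, each batch of sampled responses would be scored like:
# rewards = get_rewards(generated_responses)
```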

## 4. Citation

```bibtex
@article{wen2023implicit,
  title={Unveiling the Implicit Toxicity in Large Language Models},
  author={Wen, Jiaxin and Ke, Pei and Sun, Hao and Zhang, Zhexin and Li, Chengfei and Bai, Jinfeng and Huang, Minlie},
  journal={arXiv preprint arXiv:2311.17391},
  year={2023}
}
```