Functional categories: HarmBench contains the following 4 functional categories of behavior:
- 200 standard behaviors
- 100 copyright behaviors
- 100 contextual behaviors
- 110 multimodal behaviors
Figure 1: Attack success rate (ASR) averaged across all attacks and open-source models for standard, contextual, and copyright behaviors. ASR is considerably higher for contextual behaviors than for standard behaviors. This is concerning, as contextual behaviors represent more specific harmful tasks whose answers would be hard to look up on a search engine. Thus, the behaviors that would be more differentially harmful for LLMs to exhibit are easier to elicit with red teaming methods.
Semantic categories: HarmBench contains the following 7 semantic categories of behavior: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm.
Figure 2: ASR per semantic category varies substantially across models. For example, on Llama 2 and GPT models the Misinformation & Disinformation category has the highest ASR, but for Baichuan 2 and Starling the Chemical & Biological Weapons/Drugs category has the highest ASR. This suggests that training distributions can greatly influence the kinds of behaviors that are harder to elicit.
Behavior Classifier
We also release a high-quality classifier that is built on top of the Llama 2 13B chat model and achieves comparable performance to GPT-4 on a human-labeled validation set. The classifier is available on Hugging Face at cais/HarmBench-Llama-2-13b-cls. More details about the classifier's training process are reported in Appendix B of our paper.
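The snippet below is a minimal sketch of loading the classifier with the Hugging Face transformers library. The `PROMPT` template and the `classify` helper are simplified stand-ins for illustration; the official judging prompt template ships with the HarmBench codebase.

```python
# Minimal sketch: loading the HarmBench behavior classifier from Hugging Face.
# NOTE: PROMPT below is a simplified stand-in, not the official template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "cais/HarmBench-Llama-2-13b-cls"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical judging prompt; the real template (with full instructions and
# formatting) is defined in the HarmBench repository.
PROMPT = (
    "Behavior: {behavior}\n"
    "Generation: {generation}\n"
    "Was the behavior exhibited? Answer 'yes' or 'no':"
)

def classify(behavior: str, generation: str) -> str:
    """Return the classifier's raw verdict for one (behavior, generation) pair."""
    inputs = tokenizer(
        PROMPT.format(behavior=behavior, generation=generation),
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    return tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```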
| Behavior Type | AdvBench | GPTFuzz | Llama-Guard | GPT-4 Judge (PAIR) | HarmBench (Ours) |
|---|---|---|---|---|---|
| Standard | 71.14 | 77.36 | 68.41 | 89.8 | 94.53 |
| Contextual | 67.5 | 71.5 | 64 | 85.5 | 90.5 |
| Average (↑) | 69.93 | 75.42 | 66.94 | 88.37 | 93.19 |

Table 1: Agreement rates (%) between previous metrics and classifiers compared to human judgments on our manually labeled validation set.
Adversarial Training
We propose a new adversarial training method for robust refusal, called Robust Refusal Dynamic Defense (R2D2). As opposed to fine-tuning on a static dataset of harmful prompts, our method fine-tunes LLMs on a dynamic pool of test cases that is continually updated by a strong optimization-based red teaming method. Implementation details are provided in Appendix A of our paper.
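To make the structure of this loop concrete, here is a minimal sketch of dynamic adversarial training as described above: a pool of adversarial test cases is periodically refreshed by an optimization-based attack, and the model is fine-tuned to refuse on prompts drawn from that pool. The `attack_update` stub (standing in for a GCG-style step), the `REFUSAL` string, the base model ID, and the update schedule are illustrative assumptions, not the paper's implementation; see Appendix A for the actual procedure.

```python
# Structural sketch of dynamic adversarial training for robust refusal.
# NOTE: attack_update() is a hypothetical stand-in for an optimization-based
# red-teaming step (e.g. GCG); the real R2D2 procedure is in Appendix A.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"   # illustrative base model
REFUSAL = "Sorry, I can't help with that."  # illustrative refusal target

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def attack_update(model, tokenizer, test_case: str) -> str:
    """Hypothetical stub: one optimization-based attack step that makes the
    test case more adversarial against the current model."""
    return test_case  # placeholder

# Dynamic pool of adversarial test cases, seeded from harmful behaviors.
pool = ["<harmful behavior prompt 1>", "<harmful behavior prompt 2>"]

for step in range(1000):
    # Periodically refresh part of the pool with the red teaming method.
    if step % 10 == 0:
        idx = random.randrange(len(pool))
        pool[idx] = attack_update(model, tokenizer, pool[idx])

    # Fine-tune the model to refuse on a sampled adversarial test case.
    prompt = random.choice(pool)
    batch = tokenizer(prompt + "\n" + REFUSAL, return_tensors="pt").to(model.device)
    labels = batch["input_ids"].clone()
    # Supervise only the refusal tokens, not the adversarial prompt
    # (prompt length here is approximate; real code should align tokenization).
    prompt_len = tokenizer(prompt + "\n", return_tensors="pt")["input_ids"].shape[1]
    labels[:, :prompt_len] = -100

    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```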
Figure 3: A comparison of the average ASR across the GCG, GCG (Multi), and GCG (Transfer) attacks on different target LLMs. Our adversarial training method, R2D2, is the most robust by a wide margin. Compared to Llama 2 13B, the second most robust LLM against GCG attacks, the ASR on our Zephyr + R2D2 model is 4× lower.
| Model | MT-Bench | GCG ASR | Average ASR |
|---|---|---|---|
| Mistral Tiny | 6.5 | 68.5 | 53.9 |
| Koala 13B | 5.4 | 61.1 | 47.4 |
| Zephyr + R2D2 (Ours) | 6.0 | 5.9 | 15.8 |

Table 2: MT-Bench comparison for our adversarially trained model.
Robustness is independent of model size*
While findings in prior work suggested that larger models would be harder to red team, we find no correlation between robustness and model size within model families when red teaming at scale. We do observe a substantial difference in robustness between model families, which suggests that procedures and data used during training are far more important than model size in determining robustness to jailbreaks.
Figure 4: We find that attack success rate is highly stable within model families, but highly variable across model families. This suggests that training data and algorithms are far more important than model size in determining LLM robustness, emphasizing the importance of model-level defenses.