Functional categories: HarmBench contains the following 4 functional categories of behavior:
- 200 standard behaviors
- 100 copyright behaviors
- 100 contextual behaviors
- 110 multimodal behaviors
Figure 1: Attack success rate (ASR) averaged across all attacks and open-source models for standard, contextual, and copyright behaviors. ASR is considerably higher for contextual behaviors than for standard behaviors. This is concerning, as contextual behaviors represent more specific harmful tasks whose answers would be hard to look up on a search engine. Thus, the behaviors that would be more differentially harmful for LLMs to exhibit are easier to elicit with red teaming methods.
Semantic categories: HarmBench contains the following 7 semantic categories of behavior: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm.
Figure 2: ASR per semantic category varies substantially across models. For example, on Llama 2 and GPT models the Misinformation & Disinformation category has the highest ASR, but for Baichuan 2 and Starling the Chemical & Biological Weapons/Drugs category has the highest ASR. This suggests that training distributions can greatly influence the kinds of behaviors that are harder to elicit.
Behavior Classifier
We also release a high-quality classifier that is built on top of the Llama 2 13B chat model and achieves comparable performance to GPT-4 on a human-labeled validation set. The classifier is available on Hugging Face at cais/HarmBench-Llama-2-13b-cls. More details about the classifier's training process are reported in Appendix B of our paper.
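The snippet below is a minimal sketch of loading the classifier with the Hugging Face transformers library. The `PROMPT` template and the `classify` helper are simplified stand-ins for illustration; the official judging prompt template ships with the HarmBench codebase.

```python
# Minimal sketch: loading the HarmBench behavior classifier from Hugging Face.
# NOTE: PROMPT below is a simplified stand-in, not the official template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "cais/HarmBench-Llama-2-13b-cls"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical judging prompt; the real template (with full instructions and
# formatting) is defined in the HarmBench repository.
PROMPT = (
    "Behavior: {behavior}\n"
    "Generation: {generation}\n"
    "Was the behavior exhibited? Answer 'yes' or 'no':"
)

def classify(behavior: str, generation: str) -> str:
    """Return the classifier's raw verdict for one (behavior, generation) pair."""
    inputs = tokenizer(
        PROMPT.format(behavior=behavior, generation=generation),
        return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    return tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```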
| Behavior Type | AdvBench | GPTFuzz | Llama-Guard | GPT-4 Judge (PAIR) | HarmBench (Ours) |
|---|---|---|---|---|---|
| Standard | 71.14 | 77.36 | 68.41 | 89.8 | 94.53 |
| Contextual | 67.5 | 71.5 | 64 | 85.5 | 90.5 |
| Average (↑) | 69.93 | 75.42 | 66.94 | 88.37 | 93.19 |

Table 1: Agreement rates (%) between previous metrics and classifiers compared to human judgments on our manually labeled validation set.
Adversarial Training
We propose a new adversarial training method for robust refusal, called Robust Refusal Dynamic Defense (R2D2). As opposed to fine-tuning on a static dataset of harmful prompts, our method fine-tunes LLMs on a dynamic pool of test cases that is continually updated by a strong optimization-based red teaming method. Implementation details are provided in Appendix A of our paper.
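To make the structure of this loop concrete, here is a minimal sketch of dynamic adversarial training as described above: a pool of adversarial test cases is periodically refreshed by an optimization-based attack, and the model is fine-tuned to refuse on prompts drawn from that pool. The `attack_update` stub (standing in for a GCG-style step), the `REFUSAL` string, the base model ID, and the update schedule are illustrative assumptions, not the paper's implementation; see Appendix A for the actual procedure.

```python
# Structural sketch of dynamic adversarial training for robust refusal.
# NOTE: attack_update() is a hypothetical stand-in for an optimization-based
# red-teaming step (e.g. GCG); the real R2D2 procedure is in Appendix A.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HuggingFaceH4/zephyr-7b-beta"   # illustrative base model
REFUSAL = "Sorry, I can't help with that."  # illustrative refusal target

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def attack_update(model, tokenizer, test_case: str) -> str:
    """Hypothetical stub: one optimization-based attack step that makes the
    test case more adversarial against the current model."""
    return test_case  # placeholder

# Dynamic pool of adversarial test cases, seeded from harmful behaviors.
pool = ["<harmful behavior prompt 1>", "<harmful behavior prompt 2>"]

for step in range(1000):
    # Periodically refresh part of the pool with the red teaming method.
    if step % 10 == 0:
        idx = random.randrange(len(pool))
        pool[idx] = attack_update(model, tokenizer, pool[idx])

    # Fine-tune the model to refuse on a sampled adversarial test case.
    prompt = random.choice(pool)
    batch = tokenizer(prompt + "\n" + REFUSAL, return_tensors="pt").to(model.device)
    labels = batch["input_ids"].clone()
    # Supervise only the refusal tokens, not the adversarial prompt
    # (prompt length here is approximate; real code should align tokenization).
    prompt_len = tokenizer(prompt + "\n", return_tensors="pt")["input_ids"].shape[1]
    labels[:, :prompt_len] = -100

    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```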
Figure 3: A comparison of the average ASR across the GCG, GCG (Multi), and GCG (Transfer) attacks on different target LLMs. Our adversarial training method, R2D2, is the most robust by a wide margin. Compared to Llama 2 13B, the second most robust LLM against GCG attacks, the ASR on our Zephyr + R2D2 model is 4× lower.
| Model | MT-Bench | GCG ASR | Average ASR |
|---|---|---|---|
| Mistral Tiny | 6.5 | 68.5 | 53.9 |
| Koala 13B | 5.4 | 61.1 | 47.4 |
| Zephyr + R2D2 (Ours) | 6.0 | 5.9 | 15.8 |

Table 2: MT-Bench comparison for our adversarially trained model.
Robustness is independent of model size*
While findings in prior work suggested that larger models would be harder to red team, we find no correlation between robustness and model size within model families when red teaming at scale. We do observe a substantial difference in robustness between model families, which suggests that procedures and data used during training are far more important than model size in determining robustness to jailbreaks.
Figure 4: We find that attack success rate is highly stable within model families, but highly variable across model families. This suggests that training data and algorithms are far more important than model size in determining LLM robustness, emphasizing the importance of model-level defenses.