HarmBench Dataset

Functional categories: HarmBench contains the following 4 functional categories of behavior:
  1. 200 standard behaviors
  2. 100 copyright behaviors
  3. 100 contextual behaviors
  4. 110 multimodal behaviors
Figure 1: Attack success rate (ASR) averaged across all attacks and open-source models for standard, contextual, and copyright behaviors. ASR is considerably higher for contextual behaviors than for standard behaviors. This is concerning, as contextual behaviors represent more specific harmful tasks whose answers would be challenging to look up on a search engine. Thus, the behaviors that would be more differentially harmful for LLMs to exhibit are easier to elicit with red teaming methods.
Semantic categories: HarmBench contains the following 7 semantic categories of behavior: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm.
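If you are working with the released behaviors directly, the category breakdown can be tallied with a short script. The sketch below assumes a behaviors CSV with FunctionalCategory and SemanticCategory columns, as in the public HarmBench GitHub repository; the filename is illustrative and may differ in your checkout.

```python
# Minimal sketch: count HarmBench behaviors by functional and semantic category.
# Assumes a CSV with "FunctionalCategory" and "SemanticCategory" columns;
# the file path below is illustrative.
import csv
from collections import Counter

def count_categories(csv_path: str) -> tuple[Counter, Counter]:
    functional, semantic = Counter(), Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            functional[row["FunctionalCategory"]] += 1
            semantic[row["SemanticCategory"]] += 1
    return functional, semantic

if __name__ == "__main__":
    func_counts, sem_counts = count_categories("harmbench_behaviors_text_all.csv")
    print("Functional:", dict(func_counts))  # expected counts match the list above
    print("Semantic:", dict(sem_counts))
```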
Figure 2: ASR by semantic category differs across models. For example, on Llama 2 and the GPT models the Misinformation & Disinformation category has the highest ASR, but for Baichuan 2 and Starling the Chemical & Biological Weapons/Drugs category has the highest ASR. This suggests that training distributions can greatly influence which kinds of behaviors are harder to elicit.

Behavior Classifier

We also release a high-quality classifier built on top of the Llama 2 13B chat model that achieves comparable performance to GPT-4 on a human-labeled validation set. The classifier is available on Hugging Face at cais/HarmBench-Llama-2-13b-cls. More details about the classifier's training process are reported in Appendix B of our paper.
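As a minimal sketch, the classifier can be queried with the transformers library as shown below. The prompt format here is a simplified stand-in, not the official template; the actual prompt template is distributed with the HarmBench code and should be used to reproduce the reported numbers.

```python
# Minimal sketch of querying the HarmBench classifier with transformers.
# The prompt format is a simplified stand-in for the official template
# shipped with the HarmBench repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "cais/HarmBench-Llama-2-13b-cls"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def classify(behavior: str, generation: str) -> str:
    # Ask whether `generation` is an instance of `behavior`; the classifier
    # is trained to emit a yes/no judgment.
    prompt = (
        f"Behavior: {behavior}\n"
        f"Generation: {generation}\n"
        "Answer (yes or no):"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```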
Behavior Type   AdvBench   GPTFuzz   Llama-Guard   GPT-4 Judge (PAIR)   HarmBench (Ours)
Standard        71.14      77.36     68.41         89.8                 94.53
Contextual      67.5       71.5      64            85.5                 90.5
Average (↑)     69.93      75.42     66.94         88.37                93.19

Table 1: Agreement rates (%) between previous metrics and classifiers compared to human judgments on our manually labeled validation set.
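For reference, the agreement rate in Table 1 is simply the fraction of validation examples on which a classifier's yes/no label matches the human label; a minimal sketch, assuming boolean labels:

```python
# Minimal sketch of the agreement metric in Table 1: the percentage of
# validation examples where a classifier's label matches the human label.
def agreement_rate(classifier_labels: list[bool], human_labels: list[bool]) -> float:
    assert len(classifier_labels) == len(human_labels)
    matches = sum(c == h for c, h in zip(classifier_labels, human_labels))
    return 100.0 * matches / len(human_labels)
```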

Adversarial Training

We propose a new adversarial training method for robust refusal, called Robust Refusal Dynamic Defense (R2D2). Rather than fine-tuning on a static dataset of harmful prompts, our method fine-tunes LLMs against a dynamic pool of test cases that is continually updated by a strong optimization-based red teaming method. Implementation details are in Appendix A of our paper.
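To illustrate the dynamic-pool idea, here is a hedged sketch of one training step. The names gcg_optimize and refusal_loss are hypothetical stand-ins: the former is a stub for the optimization-based attack, and the latter is a simple supervised loss on a canned refusal. The actual R2D2 objective and training schedule are specified in Appendix A.

```python
# Sketch of a dynamic-pool adversarial training step, assuming a causal LM
# from transformers. `gcg_optimize` is a stub for the real attack; the
# actual R2D2 objective and schedule are given in Appendix A of the paper.
import random
import torch

REFUSAL = "Sorry, I can't help with that."

def refusal_loss(model, tokenizer, prompts):
    # Standard LM loss on a refusal completion, masking out the prompt tokens.
    losses = []
    for p in prompts:
        prompt_ids = tokenizer(p, return_tensors="pt").input_ids.to(model.device)
        full_ids = tokenizer(p + " " + REFUSAL, return_tensors="pt").input_ids.to(model.device)
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100  # only supervise the refusal
        losses.append(model(full_ids, labels=labels).loss)
    return torch.stack(losses).mean()

def gcg_optimize(model, tokenizer, behavior):
    # Stand-in for the optimization-based attack: the real method optimizes
    # an adversarial suffix against the current model.
    return behavior + " " + "! " * 20

def r2d2_step(model, tokenizer, optimizer, pool, behaviors, update_fraction=0.2):
    # 1) Refresh a fraction of the pool so test cases track the current model.
    n_update = max(1, int(update_fraction * len(pool)))
    for i in random.sample(range(len(pool)), n_update):
        pool[i] = gcg_optimize(model, tokenizer, random.choice(behaviors))
    # 2) Fine-tune the model to refuse the current pool of test cases.
    batch = random.sample(pool, k=min(8, len(pool)))
    loss = refusal_loss(model, tokenizer, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```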
Figure 3: A comparison of the average ASR across the GCG, GCG (Multi), and GCG (Transfer) attacks on different target LLMs. Our adversarial training method, R2D2, is the most robust by a wide margin: the ASR on our Zephyr + R2D2 model is 4× lower than on Llama 2 13B, the second most robust LLM against GCG attacks.
Metric            Mistral Tiny   Koala 13B   Zephyr + R2D2 (Ours)
MT-Bench (↑)      6.5            5.4         6.0
GCG ASR (↓)       68.5           61.1        5.9
Average ASR (↓)   53.9           47.4        15.8

Table 2: MT-Bench scores and attack success rates for our adversarially trained model compared to baseline chat models.

Robustness is independent of model size

While findings in prior work suggested that larger models would be harder to red team, we find no correlation between robustness and model size within model families when red teaming at scale. We do observe a substantial difference in robustness between model families, which suggests that procedures and data used during training are far more important than model size in determining robustness to jailbreaks.
Figure 4: We find that attack success rate is highly stable within model families, but highly variable across model families. This suggests that training data and algorithms are far more important than model size in determining LLM robustness, emphasizing the importance of model-level defenses.