pyrit.datasets.fetch_adv_bench_dataset

fetch_adv_bench_dataset(cache: bool = True, main_categories: List[Literal['Autonomy', 'Physical', 'Psychological', 'Reputational', 'Financial and Business', 'Human Rights and Civil Liberties', 'Societal and Cultural', 'Political and Economic', 'Environmental']] | None = None, sub_categories: List[str] | None = None) → SeedPromptDataset

Retrieve AdvBench examples enhanced with categories from a collaborative and human-centered harms taxonomy.

This function fetches a dataset that extends the original AdvBench dataset by adding harm types to each prompt. The prompts were categorized with Anthropic’s Claude 3.7 Sonnet model, following the Collaborative, Human-Centred Taxonomy of AI, Algorithmic, and Automation Harms (https://arxiv.org/abs/2407.01294v2). Each entry includes at least one main category and one subcategory, which makes the dataset easier to filter and analyze.

Useful link: https://arxiv.org/html/2407.01294v2/x2.png (Overview of the Harms Taxonomy)

Parameters:
  • cache (bool) – Whether to cache the fetched examples. Defaults to True.

  • main_categories (Optional[List[str]]) – A list of main harm categories to search for in the dataset. For descriptions of each category, see the paper: arXiv:2407.01294v2. Defaults to None, which includes all 9 main categories.

  • sub_categories (Optional[List[str]]) – A list of harm subcategories to search for in the dataset. For the complete list of all subcategories, see the paper: arXiv:2407.01294v2. Defaults to None, which includes all subcategories.
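As a minimal sketch of how the filter parameters might be used, the snippet below checks requested main categories against the nine values listed in the signature before calling the function. The helper names `check_main_categories` and `fetch_filtered` are hypothetical and not part of PyRIT; how the library itself handles unknown category names is not stated here.

```python
from typing import List, Optional

# The nine main categories listed in the function signature.
MAIN_CATEGORIES = (
    "Autonomy",
    "Physical",
    "Psychological",
    "Reputational",
    "Financial and Business",
    "Human Rights and Civil Liberties",
    "Societal and Cultural",
    "Political and Economic",
    "Environmental",
)


def check_main_categories(requested: Optional[List[str]]) -> Optional[List[str]]:
    """Hypothetical helper: reject names that are not in the signature's list."""
    if requested is None:
        return None  # None means "include all 9 main categories"
    unknown = [name for name in requested if name not in MAIN_CATEGORIES]
    if unknown:
        raise ValueError(f"Unknown main categories: {unknown}")
    return list(requested)


def fetch_filtered(main_categories: Optional[List[str]] = None):
    """Hypothetical wrapper: fetch AdvBench prompts for the given main categories."""
    # Imported lazily so the check above can be used without PyRIT installed.
    from pyrit.datasets import fetch_adv_bench_dataset

    return fetch_adv_bench_dataset(
        cache=True,
        main_categories=check_main_categories(main_categories),
    )
```

With PyRIT installed, a call such as `fetch_filtered(["Physical", "Psychological"])` would return a SeedPromptDataset restricted to those two categories. Category names are matched exactly as spelled in the signature, so `"physical"` is rejected by the check.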

Returns:

A SeedPromptDataset containing the examples.

Return type:

SeedPromptDataset
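One way to inspect the returned dataset is to tally how often each harm category appears. The sketch below assumes the dataset exposes a `prompts` sequence whose items carry a `harm_categories` list; both attribute names are assumptions not confirmed on this page, so the code uses `getattr` with fallbacks and runs against a stand-in object instead of a real fetch.

```python
from collections import Counter
from types import SimpleNamespace


def count_harm_categories(dataset) -> Counter:
    """Tally harm categories across a dataset's prompts.

    Assumes `dataset.prompts` is iterable and each prompt may carry a
    `harm_categories` list; both attribute names are assumptions.
    """
    counts = Counter()
    for prompt in getattr(dataset, "prompts", []):
        for category in getattr(prompt, "harm_categories", None) or []:
            counts[category] += 1
    return counts


# Stand-in object so the sketch runs without PyRIT or network access.
fake_dataset = SimpleNamespace(
    prompts=[
        SimpleNamespace(value="example 1", harm_categories=["Physical"]),
        SimpleNamespace(value="example 2", harm_categories=["Physical", "Autonomy"]),
    ]
)
category_counts = count_harm_categories(fake_dataset)
```

If the assumed attributes hold, passing the result of `fetch_adv_bench_dataset()` in place of `fake_dataset` would give a quick overview of how the prompts are distributed across categories.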

Note

For more information and access to the original dataset and related materials, see the llm-attacks/llm-attacks repository. The dataset is based on the research in https://arxiv.org/abs/2307.15043 by Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson.

The categorization approach was proposed by @paulinek13, who suggested using the Collaborative, Human-Centred Taxonomy of AI, Algorithmic, and Automation Harms (arXiv:2407.01294v2) to classify the AdvBench examples. @paulinek13 then used Anthropic’s Claude 3.7 Sonnet model to perform the categorization based on the taxonomy’s descriptions.