pyrit.datasets.fetch_sorry_bench_dataset
- fetch_sorry_bench_dataset(*, cache_dir: str | None = None, categories: List[str] | None = None, prompt_style: str | None = None, token: str | None = None) → SeedDataset
Fetch the Sorry-Bench dataset from Hugging Face (updated 2025/03 version).
The Sorry-Bench dataset contains adversarial prompts designed to test LLM safety across 44 categories with 21 different prompt styles (base + 20 linguistic mutations).
Reference: https://arxiv.org/abs/2406.14598
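A minimal usage sketch follows. It assumes the HUGGINGFACE_TOKEN environment variable is set (the dataset is gated on Hugging Face) and that the returned SeedDataset exposes its entries via a prompts attribute whose items carry the prompt text in value, as other PyRIT seed datasets do.

```python
# Minimal sketch: fetch the base-style prompts. Assumes HUGGINGFACE_TOKEN
# is set in the environment, since the dataset is gated on Hugging Face.
from pyrit.datasets import fetch_sorry_bench_dataset

dataset = fetch_sorry_bench_dataset()  # prompt_style defaults to "base"

# Assumption: SeedDataset exposes entries via `prompts`, each with a
# `value` holding the prompt text, as other PyRIT seed datasets do.
print(f"Fetched {len(dataset.prompts)} prompts")
print(dataset.prompts[0].value)
```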
- Parameters:
cache_dir (Optional[str]) – Optional cache directory for Hugging Face datasets.
categories (Optional[List[str]]) – Optional list of categories to filter by. The full list is in: https://huggingface.co/datasets/sorry-bench/sorry-bench-202503/blob/main/meta_info.py
prompt_style (Optional[str]) – Optional prompt style to filter by. Available styles include “base”, “ascii”, “caesar”, “slang”, and “authority_endorsement”. Defaults to “base” (base prompts only, no mutations). Full list: https://huggingface.co/datasets/sorry-bench/sorry-bench-202503
token (Optional[str]) – Hugging Face authentication token. If not provided, the function attempts to read the HUGGINGFACE_TOKEN environment variable. A token is required to access gated datasets on Hugging Face.
- Returns:
A SeedDataset containing Sorry-Bench prompts annotated with their harm categories.
- Return type:
SeedDataset
- Raises:
ValueError – If invalid categories or prompt_style are provided.
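For illustration, here is a hedged sketch of filtered fetching and the ValueError path. The category label below is hypothetical; consult meta_info.py (linked above) for the exact names of the 44 categories.

```python
# Hedged sketch: filter by category and prompt style, passing a token
# explicitly instead of relying on the HUGGINGFACE_TOKEN variable.
from pyrit.datasets import fetch_sorry_bench_dataset

try:
    dataset = fetch_sorry_bench_dataset(
        categories=["fraud"],   # hypothetical label; see meta_info.py
        prompt_style="caesar",  # one of the 20 linguistic mutations
        token="hf_...",         # placeholder; substitute your own HF token
    )
except ValueError as err:
    # Raised when a category or prompt_style is not recognized
    print(f"Invalid filter: {err}")
```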