Dataset fetching and loading utilities for various red teaming and safety evaluation datasets.
SeedDatasetFilter
Filter for discovering datasets by metadata criteria.
Supports two construction patterns:
Simple (flat kwargs, wrapped into a single SeedDatasetMetadata criterion)::

    f = SeedDatasetFilter(tags={"safety"}, size={"small", "large"})

Composed (explicit criteria list; OR across criteria, AND within each)::

    f = SeedDatasetFilter(criteria=[
        SeedDatasetMetadata(size={"small"}, modalities={"text"}),
        SeedDatasetMetadata(size={"large"}, modalities={"image"}),
    ])

Passing both flat kwargs and criteria raises ValueError.
Special tags:
“all”: Returns every dataset and overrides every other field passed to the filter.
“default”: Matches datasets with “default” in their tags. With strict_match=True, it loses its shortcut behavior and is treated as a normal tag.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
| criteria | Optional[list[SeedDatasetMetadata]] | Explicit list of SeedDatasetMetadata to OR-match against. Defaults to None. |
| strict_match | bool | If True, within-axis matching uses AND instead of OR. Defaults to False. |
| **kwargs | Any | Flat metadata fields passed to SeedDatasetMetadata. Defaults to {}. |
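The matching rules described above can be modelled in a few lines. The following is a minimal sketch, not the library's implementation: plain dicts of sets stand in for SeedDatasetMetadata, the function names are hypothetical, strict matching is assumed to mean that every requested value must be present on the dataset, and the “default” tag shortcut is not modelled.

```python
# Hypothetical stand-in for SeedDatasetMetadata: field name -> set of values.
Metadata = dict[str, set[str]]


def axis_matches(wanted: set[str], actual: set[str], strict_match: bool) -> bool:
    """Compare one field (axis) of a criterion against a dataset."""
    if strict_match:
        # Assumption: AND within the axis means every wanted value is present.
        return wanted <= actual
    # Default: OR within the axis, so any shared value is enough.
    return bool(wanted & actual)


def criterion_matches(criterion: Metadata, dataset: Metadata, strict_match: bool) -> bool:
    """AND across the fields that the criterion sets."""
    return all(
        axis_matches(wanted, dataset.get(field, set()), strict_match)
        for field, wanted in criterion.items()
    )


def filter_matches(
    criteria: list[Metadata], dataset: Metadata, strict_match: bool = False
) -> bool:
    """OR across criteria; the "all" tag overrides every other field."""
    if any("all" in criterion.get("tags", set()) for criterion in criteria):
        return True
    return any(criterion_matches(c, dataset, strict_match) for c in criteria)


small_text = {"size": {"small"}, "modalities": {"text"}, "tags": {"safety"}}
composed = [
    {"size": {"small"}, "modalities": {"text"}},
    {"size": {"large"}, "modalities": {"image"}},
]
print(filter_matches(composed, small_text))
```

In this model a flat-kwargs filter is simply a criteria list with one entry, which mirrors the "wraps into a single criterion" wording above.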
SeedDatasetLoadTime
Bases: Enum
Approximate time to load a dataset. Used to skip slow datasets in fast runs.
SeedDatasetMetadata
Unified schema for dataset metadata and filter criteria.
All fields are optional sets. When used for real dataset metadata, parsers wrap singular values into single-element sets. When used as filter criteria, multiple values per field express “match any of these” (OR within axis).
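The wrapping of singular values into single-element sets can be illustrated as follows. This is a hedged sketch: `as_set` is a hypothetical helper name, and the actual parsers may normalize values differently.

```python
from typing import Iterable, Optional, Union


def as_set(value: Union[str, Iterable[str], None]) -> Optional[set[str]]:
    """Normalize a metadata value to an optional set of strings."""
    if value is None:
        # An unset field stays unset.
        return None
    if isinstance(value, str):
        # Check for str first: set("small") would split it into characters.
        return {value}
    return set(value)


print(as_set("small"))
```

With every field stored as a set, the same structure can describe one dataset (single-element sets) or a filter criterion (several acceptable values per field).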
SeedDatasetProvider
Bases: ABC
Abstract base class for providing seed datasets with automatic registration.
All concrete subclasses are automatically registered and can be discovered via the get_all_providers() class method. This enables automatic discovery of both local and remote dataset providers.
Subclasses must implement:
fetch_dataset(): Fetch and return the dataset as a SeedDataset
dataset_name property: Human-readable name for the dataset
All subclasses also have a _metadata property. It is optional, which makes adding a dataset easier, but leaving it incomplete makes downstream analysis more difficult.
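One common way to get this kind of automatic registration in Python is the `__init_subclass__` hook. The sketch below shows the general pattern under stated assumptions: `ProviderBase`, `RemoteProviderBase`, and `ToyProvider` are hypothetical names, `fetch_dataset` returns a plain list instead of a SeedDataset, and nothing here is taken from the library's source.

```python
import inspect
from abc import ABC, abstractmethod


class ProviderBase(ABC):
    """Hypothetical stand-in for SeedDatasetProvider."""

    _registry: dict[str, type["ProviderBase"]] = {}

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Record every subclass as soon as it is defined.
        ProviderBase._registry[cls.__name__] = cls

    @property
    @abstractmethod
    def dataset_name(self) -> str:
        """Human-readable name for the dataset."""

    @abstractmethod
    def fetch_dataset(self, cache: bool = True) -> list[str]:
        """Fetch the dataset (a list of prompts in this sketch)."""

    @classmethod
    def get_all_providers(cls) -> dict[str, type["ProviderBase"]]:
        # Filter at lookup time so only concrete classes are exposed.
        return {
            name: provider
            for name, provider in cls._registry.items()
            if not inspect.isabstract(provider)
        }


class RemoteProviderBase(ProviderBase, ABC):
    """Still abstract: implements neither required member."""


class ToyProvider(RemoteProviderBase):
    @property
    def dataset_name(self) -> str:
        return "toy"

    def fetch_dataset(self, cache: bool = True) -> list[str]:
        return ["example prompt"]


print(list(ProviderBase.get_all_providers()))
```

Defining `ToyProvider` is enough to make it discoverable; no explicit registration call is needed, which is the property the description above relies on.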
Methods:
fetch_dataset
fetch_dataset(cache: bool = True) → SeedDataset
Fetch the dataset and return it as a SeedDataset.
| Parameter | Type | Description |
|---|---|---|
| cache | bool | Whether to cache the fetched dataset. Remote datasets use DB_DATA_PATH for caching. Defaults to True. |
Returns:
SeedDataset: The fetched dataset with prompts.
Raises:
Exception: If the dataset cannot be fetched or processed.
fetch_datasets_async
fetch_datasets_async(dataset_names: Optional[list[str]] = None, cache: bool = True, max_concurrency: int = 5) → list[SeedDataset]
Fetch all registered datasets with optional filtering and caching.
Datasets are fetched concurrently for improved performance.
| Parameter | Type | Description |
|---|---|---|
| dataset_names | Optional[list[str]] | Dataset names to fetch. If None, fetches all. Names should match the dataset_name property of providers. Defaults to None. |
| cache | bool | Whether to cache the fetched datasets. Remote datasets use DB_DATA_PATH for caching. Defaults to True. |
| max_concurrency | int | Maximum number of datasets to fetch concurrently. Set to 1 for fully sequential execution. Defaults to 5. |
Returns:
list[SeedDataset]: List of all fetched datasets.
Raises:
ValueError: If any requested dataset_name does not exist.
Exception: If any dataset fails to load.
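The combination of name validation and a concurrency cap can be sketched with `asyncio.Semaphore`. This is an illustration of the general technique rather than the library's code: `fetch_all`, the `fetchers` mapping, and `_toy_fetch` are hypothetical, caching is left out, and each dataset is a plain list of strings.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Fetcher = Callable[[], Awaitable[list[str]]]


async def fetch_all(
    fetchers: dict[str, Fetcher],
    dataset_names: Optional[list[str]] = None,
    max_concurrency: int = 5,
) -> list[list[str]]:
    """Fetch the selected datasets with at most max_concurrency in flight."""
    if dataset_names is None:
        selected = list(fetchers)
    else:
        unknown = [name for name in dataset_names if name not in fetchers]
        if unknown:
            # Fail before any fetch starts if a requested name is unknown.
            raise ValueError(f"Unknown dataset names: {unknown}")
        selected = list(dataset_names)

    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(name: str) -> list[str]:
        async with semaphore:  # blocks while max_concurrency fetches are running
            return await fetchers[name]()

    # gather preserves the order of `selected` and propagates the first error.
    return await asyncio.gather(*(fetch_one(name) for name in selected))


async def _toy_fetch(name: str) -> list[str]:
    await asyncio.sleep(0)  # stands in for network or disk I/O
    return [f"{name}-prompt"]


fetchers = {"a": lambda: _toy_fetch("a"), "b": lambda: _toy_fetch("b")}
results = asyncio.run(fetch_all(fetchers, dataset_names=["b"], max_concurrency=1))
print(results)
```

With `max_concurrency=1` the semaphore admits one fetch at a time, which corresponds to the fully sequential execution mentioned in the parameter table.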
get_all_dataset_names_async
get_all_dataset_names_async(filters: Optional[SeedDatasetFilter] = None) → list[str]
Get the names of all registered datasets.
| Parameter | Type | Description |
|---|---|---|
| filters | Optional[SeedDatasetFilter] | Filter to apply to the registered datasets. Defaults to None. |
Returns:
list[str]: List of dataset names from all registered providers.
Raises:
ValueError: If no providers are registered or if providers cannot be instantiated.
get_all_providers
get_all_providers() → dict[str, type[SeedDatasetProvider]]
Get all registered dataset provider classes.
Returns:
dict[str, type[SeedDatasetProvider]]: Dictionary mapping class names to provider classes.
TextJailBreak
A class that manages jailbreak templates (such as DAN).
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
| template_path | str | Full path to a YAML template file. Defaults to None. |
| template_file_name | str | Name of a template file in the datasets/jailbreak directory. Defaults to None. |
| string_template | str | A string template to use directly. Defaults to None. |
| random_template | bool | Whether to use a random template from datasets/jailbreak. Defaults to False. |
| **kwargs | Any | Additional parameters to apply to the template. The ‘prompt’ parameter will be preserved for later use in get_jailbreak(). Defaults to {}. |
Methods:
get_jailbreak
get_jailbreak(prompt: str) → str
Render the jailbreak template with the provided user prompt.
| Parameter | Type | Description |
|---|---|---|
| prompt | str | The user prompt to insert into the jailbreak template. |
Returns:
str: The rendered jailbreak template with the prompt parameter filled in.
Raises:
ValueError: If the template fails to render.
get_jailbreak_system_prompt
get_jailbreak_system_prompt() → str
Get the jailbreak template as a system prompt without a specific user prompt.
Returns:
str: The rendered jailbreak template with an empty prompt parameter.
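The two-stage rendering described for TextJailBreak (constructor kwargs applied first, the prompt placeholder preserved until later) can be sketched with the standard library. Everything here is an assumption made for illustration: `ToyTemplate` is a hypothetical class, the `{{ name }}` placeholder syntax is chosen for the example, and the real class may use a different template engine.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


class ToyTemplate:
    """Hypothetical two-stage template: kwargs now, prompt later."""

    def __init__(self, string_template: str, **kwargs: str) -> None:
        # Drop any early "prompt" value so that placeholder survives this pass.
        kwargs.pop("prompt", None)
        self._template = self._render(string_template, kwargs, keep_missing=True)

    @staticmethod
    def _render(template: str, values: dict[str, str], keep_missing: bool) -> str:
        def substitute(match: re.Match) -> str:
            name = match.group(1)
            if name in values:
                return values[name]
            if keep_missing:
                return match.group(0)  # leave the placeholder untouched
            raise ValueError(f"Missing template parameter: {name}")

        return _PLACEHOLDER.sub(substitute, template)

    def get_jailbreak(self, prompt: str) -> str:
        # Final pass: every remaining placeholder must now resolve.
        return self._render(self._template, {"prompt": prompt}, keep_missing=False)

    def get_jailbreak_system_prompt(self) -> str:
        # Same rendering path with an empty user prompt.
        return self.get_jailbreak(prompt="")


template = ToyTemplate("You are {{ persona }}. {{ prompt }}", persona="a test persona")
print(template.get_jailbreak("Summarize this text."))
```

In this model a template that still contains an unfilled parameter at the final pass raises ValueError, which matches the failure mode documented for get_jailbreak.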
get_jailbreak_templates
get_jailbreak_templates(num_templates: Optional[int] = None) → list[str]
Retrieve all jailbreaks from the JAILBREAK_TEMPLATES_PATH.
| Parameter | Type | Description |
|---|---|---|
| num_templates | Optional[int] | Number of jailbreak templates to return, or None to return all. Defaults to None. |
Returns:
list[str]: List of jailbreak template file names.
Raises:
ValueError: If no jailbreak templates are found in the jailbreak directory.
ValueError: If num_templates is larger than the number of templates that exist.
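The selection logic and its two error cases can be sketched as below. This is an assumption-laden illustration: `select_templates` is a hypothetical function that works on an in-memory list of file names, and it assumes a random subset is returned when num_templates is given, which the text above does not state.

```python
import random
from typing import Optional


def select_templates(
    template_names: list[str], num_templates: Optional[int] = None
) -> list[str]:
    """Return all template names, or a subset of the requested size."""
    if not template_names:
        raise ValueError("No jailbreak templates found.")
    if num_templates is None:
        return list(template_names)
    if num_templates > len(template_names):
        raise ValueError(
            f"Requested {num_templates} templates but only "
            f"{len(template_names)} exist."
        )
    # Assumption: the subset is drawn at random, without repeats.
    return random.sample(template_names, num_templates)


names = ["a.yaml", "b.yaml", "c.yaml"]
print(select_templates(names))
```

Checking the size before sampling gives a clearer error message than the one `random.sample` would raise on its own for an oversized request.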