pyrit.datasets

Dataset fetching and loading utilities for various red teaming and safety evaluation datasets.

SeedDatasetFilter

Filter for discovering datasets by metadata criteria.

Supports two construction patterns:

Simple (flat kwargs, wrapped into a single SeedDatasetMetadata criterion)::

    f = SeedDatasetFilter(tags={"safety"}, size={"small", "large"})

Composed (explicit criteria list: OR across criteria, AND within each)::

    f = SeedDatasetFilter(criteria=[
        SeedDatasetMetadata(size={"small"}, modalities={"text"}),
        SeedDatasetMetadata(size={"large"}, modalities={"image"}),
    ])

Passing both flat kwargs and criteria raises ValueError.

Special tags:

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| criteria | Optional[list[SeedDatasetMetadata]] | Explicit list of SeedDatasetMetadata to OR-match against. Defaults to None. |
| strict_match | bool | If True, within-axis matching uses AND instead of OR. Defaults to False. |
| **kwargs | Any | Flat metadata fields passed to SeedDatasetMetadata. Defaults to {}. |
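
The matching rules described above (OR across criteria, AND across the fields of one criterion, OR or AND within a single field) can be sketched with plain dictionaries of sets. This is an illustrative model only, not PyRIT's implementation: the helper names are hypothetical, and the strict rule shown here (every requested value must be present on the dataset) is an assumption about what "AND within an axis" means.

```python
from typing import Dict, List, Set

Metadata = Dict[str, Set[str]]  # field name -> set of values


def axis_matches(wanted: Set[str], actual: Set[str], strict: bool) -> bool:
    """Compare one field (axis) of a criterion against a dataset's values."""
    if strict:
        # Assumed strict semantics: every requested value must be present.
        return wanted <= actual
    # Default: any overlap is enough (OR within the axis).
    return bool(wanted & actual)


def criterion_matches(criterion: Metadata, dataset: Metadata, strict: bool = False) -> bool:
    """AND across the fields that the criterion actually sets."""
    return all(
        axis_matches(wanted, dataset.get(field, set()), strict)
        for field, wanted in criterion.items()
    )


def filter_matches(criteria: List[Metadata], dataset: Metadata, strict: bool = False) -> bool:
    """OR across criteria: one matching criterion is sufficient."""
    return any(criterion_matches(c, dataset, strict) for c in criteria)


criteria = [
    {"size": {"small"}, "modalities": {"text"}},
    {"size": {"large"}, "modalities": {"image"}},
]
small_text = {"size": {"small"}, "modalities": {"text"}, "tags": {"safety"}}
small_image = {"size": {"small"}, "modalities": {"image"}}
```

In this model `filter_matches(criteria, small_text)` holds because the first criterion matches on both fields, while `small_image` satisfies neither criterion in full.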

SeedDatasetLoadTime

Bases: Enum

Approximate time to load a dataset. Used to skip slow datasets in fast runs.

SeedDatasetMetadata

Unified schema for dataset metadata and filter criteria.

All fields are optional sets. When used for real dataset metadata, parsers wrap singular values into single-element sets. When used as filter criteria, multiple values per field express “match any of these” (OR within axis).
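
A minimal sketch of the "wrap singular values into single-element sets" step, assuming a hypothetical helper named `as_set` (it is not part of the documented API):

```python
from typing import Any, Optional, Set


def as_set(value: Any) -> Optional[Set[Any]]:
    """Wrap a singular value in a one-element set; turn collections into sets."""
    if value is None:
        return None  # field left unset
    if isinstance(value, (set, frozenset, list, tuple)):
        return set(value)
    # Strings are treated as single values, not as iterables of characters.
    return {value}
```

Normalizing both real metadata and filter criteria to sets lets one comparison routine handle either role.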

SeedDatasetProvider

Bases: ABC

Abstract base class for providing seed datasets with automatic registration.

All concrete subclasses are registered automatically and can be discovered via the get_all_providers() class method. This enables automatic discovery of both local and remote dataset providers.
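
One common way to get this kind of automatic registration in Python is the `__init_subclass__` hook. The sketch below is a stand-in written for illustration, not PyRIT's source: `ProviderBase`, `RemoteProviderBase`, and `ExampleProvider` are hypothetical names, and only `dataset_name` and `get_all_providers` are borrowed from this page.

```python
import inspect
from abc import ABC, abstractmethod
from typing import Dict, Type


class ProviderBase(ABC):
    """Hypothetical stand-in for an auto-registering provider base class."""

    _registry: Dict[str, Type["ProviderBase"]] = {}

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Register only concrete classes; abstract intermediates are skipped.
        if not inspect.isabstract(cls):
            ProviderBase._registry[cls.__name__] = cls

    @property
    @abstractmethod
    def dataset_name(self) -> str:
        ...

    @classmethod
    def get_all_providers(cls) -> Dict[str, Type["ProviderBase"]]:
        # Return a copy so callers cannot mutate the registry.
        return dict(cls._registry)


class RemoteProviderBase(ProviderBase):
    """Still abstract: dataset_name is not implemented, so it is not registered."""


class ExampleProvider(RemoteProviderBase):
    @property
    def dataset_name(self) -> str:
        return "example"
```

Defining `ExampleProvider` is enough to make it appear in `ProviderBase.get_all_providers()`; no explicit registration call is needed.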

Subclasses must implement:

All subclasses also have a _metadata property. It is optional, which makes adding a dataset easier, but leaving it incomplete makes downstream analysis more difficult.

Methods:

fetch_dataset

fetch_dataset(cache: bool = True) → SeedDataset

Fetch the dataset and return as a SeedDataset.

| Parameter | Type | Description |
| --- | --- | --- |
| cache | bool | Whether to cache the fetched dataset. Remote datasets use DB_DATA_PATH for caching. Defaults to True. |

Returns:

Raises:

fetch_datasets_async

fetch_datasets_async(dataset_names: Optional[list[str]] = None, cache: bool = True, max_concurrency: int = 5) → list[SeedDataset]

Fetch all registered datasets with optional filtering and caching.

Datasets are fetched concurrently for improved performance.

| Parameter | Type | Description |
| --- | --- | --- |
| dataset_names | Optional[list[str]] | Optional list of dataset names to fetch. If None, fetches all. Names should match the dataset_name property of providers. Defaults to None. |
| cache | bool | Whether to cache the fetched datasets. Remote datasets use DB_DATA_PATH for caching. Defaults to True. |
| max_concurrency | int | Maximum number of datasets to fetch concurrently. Set to 1 for fully sequential execution. Defaults to 5. |
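
A bounded-concurrency fetch like the one `max_concurrency` describes is often built on `asyncio.Semaphore`. The following is a sketch under that assumption, not the library's actual code; `fetch_all`, `demo`, and the fake fetchers are hypothetical names used only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def fetch_all(
    fetchers: List[Callable[[], Awaitable[str]]],
    max_concurrency: int = 5,
) -> List[str]:
    """Run the fetchers concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(fetch: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # waits here once the limit is reached
            return await fetch()

    # gather returns results in the same order as its inputs.
    return list(await asyncio.gather(*(bounded(f) for f in fetchers)))


async def demo(max_concurrency: int) -> Tuple[List[str], int]:
    """Fetch eight fake datasets and record the peak number running at once."""
    in_flight = 0
    peak = 0

    def make_fetcher(name: str) -> Callable[[], Awaitable[str]]:
        async def fetch() -> str:
            nonlocal in_flight, peak
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network or disk I/O
            in_flight -= 1
            return name

        return fetch

    names = [f"dataset_{i}" for i in range(8)]
    results = await fetch_all([make_fetcher(n) for n in names], max_concurrency)
    return results, peak
```

Calling `asyncio.run(demo(1))` runs the fake fetches one at a time, which mirrors the "set to 1 for fully sequential execution" note in the table.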

Returns:

Raises:

get_all_dataset_names_async

get_all_dataset_names_async(filters: Optional[SeedDatasetFilter] = None) → list[str]

Get the names of all registered datasets.

| Parameter | Type | Description |
| --- | --- | --- |
| filters | Optional[SeedDatasetFilter] | Filter to apply. Defaults to None. |

Returns:

Raises:

get_all_providers

get_all_providers() → dict[str, type[SeedDatasetProvider]]

Get all registered dataset provider classes.

Returns:

TextJailBreak

A class that manages jailbreak datasets (like DAN, etc.).

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| template_path | str | Full path to a YAML template file. Defaults to None. |
| template_file_name | str | Name of a template file in datasets/jailbreak directory. Defaults to None. |
| string_template | str | A string template to use directly. Defaults to None. |
| random_template | bool | Whether to use a random template from datasets/jailbreak. Defaults to False. |
| **kwargs | Any | Additional parameters to apply to the template. The 'prompt' parameter will be preserved for later use in get_jailbreak(). Defaults to {}. |
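
The two-stage rendering implied by the **kwargs description (fill the other parameters at construction, keep the prompt placeholder for get_jailbreak()) can be illustrated with the standard library. This sketch assumes `$name` placeholders from `string.Template`, which is not necessarily the placeholder syntax the real template files use, and `JailbreakTemplateSketch` is a hypothetical class, not TextJailBreak itself.

```python
from string import Template


class JailbreakTemplateSketch:
    """Hypothetical stand-in; uses $name placeholders from string.Template."""

    def __init__(self, string_template: str, **kwargs: str) -> None:
        # Never fill the prompt at construction time, even if it was passed.
        kwargs.pop("prompt", None)
        # safe_substitute fills the known names and leaves the rest
        # (including $prompt) untouched.
        self._template = Template(string_template).safe_substitute(**kwargs)

    def get_jailbreak(self, prompt: str) -> str:
        # Second stage: insert the user prompt into the preserved placeholder.
        return Template(self._template).substitute(prompt=prompt)


sketch = JailbreakTemplateSketch(
    "You are $persona. Respond to: $prompt",
    persona="a test persona",
)
rendered = sketch.get_jailbreak("hello")
```

Keeping the prompt placeholder unresolved lets a single configured template be reused across many user prompts.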

Methods:

get_jailbreak

get_jailbreak(prompt: str) → str

Render the jailbreak template with the provided user prompt.

| Parameter | Type | Description |
| --- | --- | --- |
| prompt | str | The user prompt to insert into the jailbreak template. |

Returns:

Raises:

get_jailbreak_system_prompt

get_jailbreak_system_prompt() → str

Get the jailbreak template as a system prompt without a specific user prompt.

Returns:

get_jailbreak_templates

get_jailbreak_templates(num_templates: Optional[int] = None) → list[str]

Retrieve all jailbreaks from the JAILBREAK_TEMPLATES_PATH.

| Parameter | Type | Description |
| --- | --- | --- |
| num_templates | Optional[int] | Number of jailbreak templates to return. None to get all. Defaults to None. |

Returns:

Raises: