pyrit.datasets.SeedDatasetProvider#

class SeedDatasetProvider[source]#

Bases: ABC

Abstract base class for providing seed datasets with automatic registration.

All concrete subclasses are automatically registered and can be discovered via the get_all_providers() class method. This enables automatic discovery of both local and remote dataset providers.

Subclasses must implement:

  • fetch_dataset() – Fetch and return the dataset as a SeedDataset.

  • dataset_name property – Human-readable name for the dataset.
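
The registration behavior can be illustrated with a minimal sketch of a concrete provider. The class name, the SeedDataset import path, and the constructor arguments below are assumptions for illustration, not the library's confirmed API:

>>> from pyrit.datasets import SeedDatasetProvider
>>> from pyrit.models import SeedDataset  # import path assumed for illustration
>>>
>>> class ExampleLocalDatasetProvider(SeedDatasetProvider):
...     @property
...     def dataset_name(self) -> str:
...         return "Example Local Dataset"
...
...     async def fetch_dataset(self, *, cache: bool = True) -> SeedDataset:
...         # A real provider would build the dataset from local or remote
...         # data; the constructor arguments here are assumed.
...         return SeedDataset(prompts=[])
...
>>> # Defining the subclass registers it automatically, so it is discoverable:
>>> "ExampleLocalDatasetProvider" in SeedDatasetProvider.get_all_providers()
True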

__init__()#

Methods

__init__()

fetch_dataset(*[, cache])

Fetch the dataset and return as a SeedDataset.

fetch_datasets_async(*[, dataset_names, ...])

Fetch all registered datasets with optional filtering and caching.

get_all_dataset_names()

Get the names of all registered datasets.

get_all_providers()

Get all registered dataset provider classes.

Attributes

dataset_name

Return the human-readable name of the dataset.

abstract property dataset_name: str#

Return the human-readable name of the dataset.

Returns:

The dataset name (e.g., “HarmBench”, “JailbreakBench JBB-Behaviors”)

Return type:

str

abstract async fetch_dataset(*, cache: bool = True) → SeedDataset[source]#

Fetch the dataset and return as a SeedDataset.

Parameters:

cache – Whether to cache the fetched dataset. Defaults to True. Remote datasets will use DB_DATA_PATH for caching.

Returns:

The fetched dataset with prompts.

Return type:

SeedDataset

Raises:

Exception – If the dataset cannot be fetched or processed.
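
Example

A usage sketch with a hypothetical concrete subclass (the class name is illustrative; any registered provider is called the same way):

>>> provider = ExampleLocalDatasetProvider()  # hypothetical subclass
>>> dataset = await provider.fetch_dataset(cache=True)
>>> # dataset is a SeedDataset containing the provider's prompts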

async classmethod fetch_datasets_async(*, dataset_names: List[str] | None = None, cache: bool = True, max_concurrency: int = 5) → list[SeedDataset][source]#

Fetch all registered datasets with optional filtering and caching.

Datasets are fetched concurrently for improved performance.

Parameters:
  • dataset_names – Optional list of dataset names to fetch. If None, fetches all. Names should match the dataset_name property of providers.

  • cache – Whether to cache the fetched datasets. Defaults to True. This uses DB_DATA_PATH for caching remote datasets.

  • max_concurrency – Maximum number of datasets to fetch concurrently. Defaults to 5. Set to 1 for fully sequential execution.

Returns:

List of the fetched datasets: all registered datasets when dataset_names is None, otherwise only those requested.

Return type:

List[SeedDataset]

Raises:
  • ValueError – If any requested dataset_name does not exist.

  • Exception – If any dataset fails to load.

Example

>>> # Fetch all datasets
>>> all_datasets = await SeedDatasetProvider.fetch_datasets_async()
>>>
>>> # Fetch specific datasets
>>> specific = await SeedDatasetProvider.fetch_datasets_async(
...     dataset_names=["harmbench", "DarkBench"]
... )

classmethod get_all_dataset_names() → List[str][source]#

Get the names of all registered datasets.

Returns:

List of dataset names from all registered providers.

Return type:

List[str]

Raises:

ValueError – If no providers are registered or if providers cannot be instantiated.

Example

>>> names = SeedDatasetProvider.get_all_dataset_names()
>>> print(f"Available datasets: {', '.join(names)}")

classmethod get_all_providers() → Dict[str, Type[SeedDatasetProvider]][source]#

Get all registered dataset provider classes.

Returns:

Dictionary mapping class names to provider classes.

Return type:

Dict[str, Type[SeedDatasetProvider]]
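
Example

A short usage sketch based on the documented return type:

>>> providers = SeedDatasetProvider.get_all_providers()
>>> for class_name, provider_cls in providers.items():
...     print(f"{class_name}: {provider_cls.__module__}")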