pyrit.datasets.SeedDatasetProvider#

class SeedDatasetProvider[source]#

Bases: ABC

Abstract base class for providing seed datasets with automatic registration.

All concrete subclasses are automatically registered and can be discovered via the get_all_providers() class method. This enables automatic discovery of both local and remote dataset providers.

Subclasses must implement:

  • fetch_dataset() – Fetch and return the dataset as a SeedDataset.

  • dataset_name property – Human-readable name for the dataset.
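
The registration behavior can be illustrated with a minimal sketch of a concrete provider. The class name, the SeedDataset import path, and the constructor arguments below are assumptions for illustration, not the library's confirmed API:

>>> from pyrit.datasets import SeedDatasetProvider
>>> from pyrit.models import SeedDataset  # import path assumed for illustration
>>>
>>> class ExampleLocalDatasetProvider(SeedDatasetProvider):
...     @property
...     def dataset_name(self) -> str:
...         return "Example Local Dataset"
...
...     async def fetch_dataset(self, *, cache: bool = True) -> SeedDataset:
...         # A real provider would build the dataset from local or remote
...         # data; the constructor arguments here are assumed.
...         return SeedDataset(prompts=[])
...
>>> # Defining the subclass registers it automatically, so it is discoverable:
>>> "ExampleLocalDatasetProvider" in SeedDatasetProvider.get_all_providers()
True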

__init__()#

Methods

__init__()

fetch_dataset(*[, cache])

Fetch the dataset and return as a SeedDataset.

fetch_datasets_async(*[, dataset_names, ...])

Fetch all registered datasets with optional filtering and caching.

get_all_dataset_names()

Get the names of all registered datasets.

get_all_providers()

Get all registered dataset provider classes.

Attributes

dataset_name

Return the human-readable name of the dataset.

abstract property dataset_name: str#

Return the human-readable name of the dataset.

Returns:

The dataset name (e.g., “HarmBench”, “JailbreakBench JBB-Behaviors”)

Return type:

str

abstract async fetch_dataset(*, cache: bool = True) → SeedDataset[source]#

Fetch the dataset and return as a SeedDataset.

Parameters:

cache – Whether to cache the fetched dataset. Defaults to True. Remote datasets will use DB_DATA_PATH for caching.

Returns:

The fetched dataset with prompts.

Return type:

SeedDataset

Raises:

Exception – If the dataset cannot be fetched or processed.
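
Example

A usage sketch with a hypothetical concrete subclass (the class name is illustrative; any registered provider is called the same way):

>>> provider = ExampleLocalDatasetProvider()  # hypothetical subclass
>>> dataset = await provider.fetch_dataset(cache=True)
>>> # dataset is a SeedDataset containing the provider's prompts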

async classmethod fetch_datasets_async(*, dataset_names: List[str] | None = None, cache: bool = True, max_concurrency: int = 5) → list[SeedDataset][source]#

Fetch all registered datasets with optional filtering and caching.

Datasets are fetched concurrently for improved performance.

Parameters:
  • dataset_names – Optional list of dataset names to fetch. If None, fetches all. Names should match the dataset_name property of providers.

  • cache – Whether to cache the fetched datasets. Defaults to True. This uses DB_DATA_PATH for caching remote datasets.

  • max_concurrency – Maximum number of datasets to fetch concurrently. Defaults to 5. Set to 1 for fully sequential execution.

Returns:

List of the fetched datasets: all registered datasets when dataset_names is None, otherwise only those requested.

Return type:

List[SeedDataset]

Raises:
  • ValueError – If any requested dataset_name does not exist.

  • Exception – If any dataset fails to load.

Example

>>> # Fetch all datasets
>>> all_datasets = await SeedDatasetProvider.fetch_datasets_async()
>>>
>>> # Fetch specific datasets
>>> specific = await SeedDatasetProvider.fetch_datasets_async(
...     dataset_names=["harmbench", "DarkBench"]
... )

classmethod get_all_dataset_names() → List[str][source]#

Get the names of all registered datasets.

Returns:

List of dataset names from all registered providers.

Return type:

List[str]

Raises:

ValueError – If no providers are registered or if providers cannot be instantiated.

Example

>>> names = SeedDatasetProvider.get_all_dataset_names()
>>> print(f"Available datasets: {', '.join(names)}")

classmethod get_all_providers() → Dict[str, Type[SeedDatasetProvider]][source]#

Get all registered dataset provider classes.

Returns:

Dictionary mapping class names to provider classes.

Return type:

Dict[str, Type[SeedDatasetProvider]]
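
Example

A short usage sketch based on the documented return type:

>>> providers = SeedDatasetProvider.get_all_providers()
>>> for class_name, provider_cls in providers.items():
...     print(f"{class_name}: {provider_cls.__module__}")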