pyrit.datasets.SeedDatasetProvider
- class SeedDatasetProvider
Bases: ABC
Abstract base class for providing seed datasets with automatic registration.
All concrete subclasses are registered automatically and can be discovered via the get_all_providers() class method. This enables automatic discovery of both local and remote dataset providers.
Subclasses must implement:
- fetch_dataset(): Fetch and return the dataset as a SeedDataset
- dataset_name property: Human-readable name for the dataset
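The registration behaviour described above can be illustrated with a minimal, self-contained sketch. It does not import PyRIT and is not PyRIT's implementation: SketchProvider, RemoteSketchProvider, ToyProvider, and the _registry attribute are hypothetical names, a plain list of strings stands in for SeedDataset, and the use of __init_subclass__ is an assumption about how such a registry might be built.

```python
import asyncio
import inspect
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class SketchProvider(ABC):
    """Hypothetical stand-in for a self-registering provider base class."""

    # Maps class name -> class for every subclass defined so far.
    _registry: Dict[str, Type["SketchProvider"]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Record each subclass at definition time; abstract ones are
        # filtered out later, when the registry is queried.
        SketchProvider._registry[cls.__name__] = cls

    @property
    @abstractmethod
    def dataset_name(self) -> str:
        """Human-readable dataset name."""

    @abstractmethod
    async def fetch_dataset(self, *, cache: bool = True) -> List[str]:
        """Return the dataset (a list of strings in this sketch)."""

    @classmethod
    def get_all_providers(cls) -> Dict[str, Type["SketchProvider"]]:
        # Only concrete subclasses (no unimplemented abstract members) count.
        return {
            name: provider
            for name, provider in cls._registry.items()
            if not inspect.isabstract(provider)
        }


class RemoteSketchProvider(SketchProvider):
    """Intermediate base that is still abstract, so it is not reported."""


class ToyProvider(SketchProvider):
    @property
    def dataset_name(self) -> str:
        return "Toy"

    async def fetch_dataset(self, *, cache: bool = True) -> List[str]:
        return ["prompt one", "prompt two"]


providers = SketchProvider.get_all_providers()
names = [provider().dataset_name for provider in providers.values()]
prompts = asyncio.run(ToyProvider().fetch_dataset())
```

In this sketch, defining ToyProvider is enough to make it discoverable; no explicit registration call is needed, which mirrors the behaviour the class description promises.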
- __init__()
Methods
__init__()
fetch_dataset(*[, cache]): Fetch the dataset and return as a SeedDataset.
fetch_datasets_async(*[, dataset_names, ...]): Fetch all registered datasets with optional filtering and caching.
get_all_dataset_names(): Get the names of all registered datasets.
get_all_providers(): Get all registered dataset provider classes.
Attributes
dataset_name: Return the human-readable name of the dataset.
- abstract property dataset_name: str
Return the human-readable name of the dataset.
- Returns:
The dataset name (e.g., “HarmBench”, “JailbreakBench JBB-Behaviors”)
- Return type:
str
- abstract async fetch_dataset(*, cache: bool = True) -> SeedDataset
Fetch the dataset and return as a SeedDataset.
- Parameters:
cache – Whether to cache the fetched dataset. Defaults to True. Remote datasets will use DB_DATA_PATH for caching.
- Returns:
The fetched dataset with prompts.
- Return type:
SeedDataset
- Raises:
Exception – If the dataset cannot be fetched or processed.
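To make the cache flag concrete, here is a minimal sketch of one way a fetch helper could honour it. This is an illustration under stated assumptions, not PyRIT's caching code: load_with_cache, fake_download, CACHE_DIR, and the example URL are hypothetical, a temporary directory stands in for DB_DATA_PATH, and treating cache=False as "neither read nor write the cache" is an assumption.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List

# Hypothetical stand-in for a cache directory such as DB_DATA_PATH.
CACHE_DIR = Path(tempfile.mkdtemp())


def load_with_cache(
    source: str,
    download: Callable[[str], List[str]],
    *,
    cache: bool = True,
) -> List[str]:
    """Return data for `source`, reusing a cached copy when allowed."""
    # Derive a stable file name from the source identifier.
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.json"

    if cache and cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    data = download(source)
    if cache:
        cache_file.write_text(json.dumps(data), encoding="utf-8")
    return data


calls: List[str] = []


def fake_download(source: str) -> List[str]:
    # Simulated remote fetch that records how often it is invoked.
    calls.append(source)
    return ["example prompt"]


first = load_with_cache("https://example.invalid/data.json", fake_download)
second = load_with_cache("https://example.invalid/data.json", fake_download)
```

The second call is served from the cache file, so the simulated download runs only once.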
- async classmethod fetch_datasets_async(*, dataset_names: List[str] | None = None, cache: bool = True, max_concurrency: int = 5) -> list[SeedDataset]
Fetch all registered datasets with optional filtering and caching.
Datasets are fetched concurrently for improved performance.
- Parameters:
dataset_names – Optional list of dataset names to fetch. If None, fetches all. Names should match the dataset_name property of providers.
cache – Whether to cache the fetched datasets. Defaults to True. This uses DB_DATA_PATH for caching remote datasets.
max_concurrency – Maximum number of datasets to fetch concurrently. Defaults to 5. Set to 1 for fully sequential execution.
- Returns:
List of all fetched datasets.
- Return type:
List[SeedDataset]
- Raises:
ValueError – If any requested dataset_name does not exist.
Exception – If any dataset fails to load.
Example
>>> # Fetch all datasets
>>> all_datasets = await SeedDatasetProvider.fetch_datasets_async()
>>>
>>> # Fetch specific datasets
>>> specific = await SeedDatasetProvider.fetch_datasets_async(
...     dataset_names=["harmbench", "DarkBench"]
... )
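The combination of name filtering and max_concurrency can be sketched with the standard library alone. The following is a hedged illustration of bounded concurrent fetching, not the PyRIT implementation: fetch_one, fetch_many, and FAKE_SOURCES are hypothetical, and the use of asyncio.Semaphore with asyncio.gather is an assumption about one reasonable way to cap concurrency.

```python
import asyncio
from typing import Dict, List, Optional

# Hypothetical stand-in data: dataset name -> prompts.
FAKE_SOURCES: Dict[str, List[str]] = {
    "alpha": ["a1", "a2"],
    "beta": ["b1"],
    "gamma": ["g1", "g2", "g3"],
}


async def fetch_one(name: str) -> List[str]:
    await asyncio.sleep(0.01)  # simulate network or disk latency
    return FAKE_SOURCES[name]


async def fetch_many(
    *,
    dataset_names: Optional[List[str]] = None,
    max_concurrency: int = 5,
) -> List[List[str]]:
    names = list(FAKE_SOURCES) if dataset_names is None else dataset_names

    # Reject unknown names up front, before any fetch starts.
    unknown = [name for name in names if name not in FAKE_SOURCES]
    if unknown:
        raise ValueError(f"Unknown dataset names: {unknown}")

    # At most `max_concurrency` fetches hold the semaphore at once;
    # a value of 1 makes the fetches run one after another.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(name: str) -> List[str]:
        async with semaphore:
            return await fetch_one(name)

    # gather() returns results in the order the awaitables were passed.
    return await asyncio.gather(*(bounded(name) for name in names))


results = asyncio.run(
    fetch_many(dataset_names=["alpha", "gamma"], max_concurrency=1)
)
```

Because gather() preserves argument order, the returned list lines up with the requested names regardless of which fetch finishes first.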
- classmethod get_all_dataset_names() -> List[str]
Get the names of all registered datasets.
- Returns:
List of dataset names from all registered providers.
- Return type:
List[str]
- Raises:
ValueError – If no providers are registered or if providers cannot be instantiated.
Example
>>> names = SeedDatasetProvider.get_all_dataset_names()
>>> print(f"Available datasets: {', '.join(names)}")
- classmethod get_all_providers() -> Dict[str, Type[SeedDatasetProvider]]
Get all registered dataset provider classes.
- Returns:
Dictionary mapping class names to provider classes.
- Return type:
Dict[str, Type[SeedDatasetProvider]]