pyrit.datasets.fetch_pku_safe_rlhf_dataset

fetch_pku_safe_rlhf_dataset(include_safe_prompts: bool = True, filter_harm_categories: List[Literal['Animal Abuse', 'Copyright Issues', 'Cybercrime', 'Discriminatory Behavior', 'Disrupting Public Order', 'Drugs', 'Economic Crime', 'Endangering National Security', 'Endangering Public Health', 'Environmental Damage', 'Human Trafficking', 'Insulting Behavior', 'Mental Manipulation', 'Physical Harm', 'Privacy Violation', 'Psychological Harm', 'Sexual Content', 'Violence', 'White-Collar Crime']] | None = None) → SeedPromptDataset

Fetch PKU-SafeRLHF examples and create a SeedPromptDataset.

Parameters:
  • include_safe_prompts (bool) – If True, all prompts in the dataset are returned. If False, only the unsafe subset is returned, using the dataset's RLHF annotations that mark responses as unsafe.

  • filter_harm_categories – Optional list of harm categories used to filter the examples. Defaults to None, in which case prompts from all categories are included; otherwise, only prompts matching at least one of the given categories are returned.

Returns:

A SeedPromptDataset containing the examples.

Return type:

SeedPromptDataset

Note

For more information and access to the original dataset and related materials, visit: https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF. Based on research in the paper https://arxiv.org/pdf/2406.15513 by Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, and Yaodong Yang.
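
Example

A minimal usage sketch, assuming SeedPromptDataset exposes its SeedPrompt objects via a prompts attribute and each prompt's text via a value attribute; the harm category names come from the Literal list in the signature above.

    from pyrit.datasets import fetch_pku_safe_rlhf_dataset

    # Fetch only prompts marked unsafe, restricted to two harm categories.
    dataset = fetch_pku_safe_rlhf_dataset(
        include_safe_prompts=False,
        filter_harm_categories=["Cybercrime", "Privacy Violation"],
    )

    # Assumes SeedPromptDataset.prompts is a list of SeedPrompt objects
    # whose .value holds the prompt text.
    for prompt in dataset.prompts[:5]:
        print(prompt.value)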