pyrit.orchestrator.PAIROrchestrator#
- class PAIROrchestrator(*, verbose: bool = False, objective_target: PromptChatTarget, desired_target_response_prefix: str, adversarial_chat: PromptChatTarget, conversation_objective: str, number_of_conversation_streams: int = 20, max_conversation_depth: int = 3, stop_on_first_success: bool = True, scorer: Scorer, scorer_sensitivity: float = 1.0, prompt_converters: list[PromptConverter] | None = None, single_turn_jailbreak_only: bool = True)[source]#
Bases:
Orchestrator
This orchestrator implements the Prompt Automatic Iterative Refinement (PAIR) algorithm
This algorithm was published and described in the paper: Chao, Patrick, et al. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419, arXiv, 13 Oct. 2023. arXiv.org, http://arxiv.org/abs/2310.08419
The authors published a reference implementation in the following repository: patrickrchao/ JailbreakingLLMs/blob/main/system_prompts.py
- __init__(*, verbose: bool = False, objective_target: PromptChatTarget, desired_target_response_prefix: str, adversarial_chat: PromptChatTarget, conversation_objective: str, number_of_conversation_streams: int = 20, max_conversation_depth: int = 3, stop_on_first_success: bool = True, scorer: Scorer, scorer_sensitivity: float = 1.0, prompt_converters: list[PromptConverter] | None = None, single_turn_jailbreak_only: bool = True) None [source]#
Initializes the PAIR orchestrator to run the PAIR algorithm against two targets.
- Parameters:
verbose – Whether to print debug information. Defaults to False.
objective_target – The target model to jailbreak.
desired_target_response_prefix – An example of a desired response from the target. This is used to compare against the target’s actual response to determine if the target has been compromised. E.g., “Sure, here is code for a keylogger.”
adversarial_chat – The attacker model to use to jailbreak the target.
conversation_objective – The objective for the attacker to achieve, as a string. E.g. “Generate a phishing email” or “Generate code for a keylogger”, etc. Defaults to “Sure, I’d be happy to assist with that”.
max_conversation_depth – The maximum number of turns in a conversation. Defaults to 3 based on the PAIR’s paper.
number_of_conversation_streams – The number of independent conversation streams to run for PAIR algorithm. defaults to 20 based on the PAIR’s paper.
stop_on_first_success – Whether to stop the algorithm after the first successful attack. Defaults to True. If False, the algorithm will continue to run until all conversation streams are exhausted possibly finding multiple successful jailbreaks.
scorer (Score, Optional) – The scorer to use for scoring the target’s response.
scorer_sensitivity (float, Optional) – The sensitivity of the scorer. This value indicates the scorer’s tolerance for scores below the maximum score value. Note that the sensitivity refers to the normalized score (value-min)/(max-min). For example, for a scoring scale 1-10, the scores map as follows: 1 -> 0.0, 2 -> 0.11, 3 -> 0.22, …, 10 -> 1.0. A sensitivity of 0.7 would consider a score of 7 or above as a successful jailbreak. Defaults to 1.0 which is equivalent to 10/10 in the scale of the PAIR paper.
prompt_converters (list[PromptConverter], Optional) – List of prompt converters. These are stacked in the order they are provided. The default PAIR implementation does not use any converters.
Methods
__init__
(*[, verbose, ...])Initializes the PAIR orchestrator to run the PAIR algorithm against two targets.
dispose_db_engine
()Dispose database engine to release database connections and resources.
get_identifier
()get_memory
()Retrieves the memory associated with this orchestrator.
get_score_memory
()Retrieves the scores of the PromptRequestPieces associated with this orchestrator.
print
(*[, normalized_score_threshold])Prints the conversations that have been processed by the orchestrator.
run
()Runs the PAIR algorithm against the target and attacker.
- print(*, normalized_score_threshold: float | None = None) None [source]#
Prints the conversations that have been processed by the orchestrator.
This method retrieves all the prompt pieces from the memory, filters them based on the orchestrator class and optionally based on the orchestrator instance and successful jailbreaks. The filtered pieces are then grouped by conversation ID and printed.
- Parameters:
normalized_score_threshold – The score threshold to filter the successful jailbreaks. If None, it defaults to the orchestrator’s scorer sensitivity. Defaults to None. This must be a value between 0 and 1.
- async run() list[PromptRequestResponse] [source]#
Runs the PAIR algorithm against the target and attacker.