Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

pyrit.executor.attack

Attack executor module.

Functions

generate_simulated_conversation_async

generate_simulated_conversation_async(objective: str, adversarial_chat: PromptChatTarget, objective_scorer: TrueFalseScorer, num_turns: int = 3, starting_sequence: int = 0, adversarial_chat_system_prompt_path: Union[str, Path], simulated_target_system_prompt_path: Optional[Union[str, Path]] = None, next_message_system_prompt_path: Optional[Union[str, Path]] = None, attack_converter_config: Optional[AttackConverterConfig] = None, memory_labels: Optional[dict[str, str]] = None) → list[SeedPrompt]

Generate a simulated conversation between an adversarial chat and a target.

This utility runs a RedTeamingAttack with score_last_turn_only=True against a simulated target (the same LLM as adversarial_chat, optionally configured with a system prompt). The resulting conversation is returned as a list of SeedPrompts that can be merged with other SeedPrompts in a SeedGroup for use as prepended_conversation and next_message.

Use cases:

ParameterTypeDescription
objectivestrThe objective for the adversarial chat to work toward.
adversarial_chatPromptChatTargetThe adversarial LLM that generates attack prompts. This same LLM is also used as the simulated target.
objective_scorerTrueFalseScorerScorer to evaluate the final turn.
num_turnsintNumber of conversation turns to generate. Defaults to 3. Defaults to 3.
starting_sequenceintThe starting sequence number for the generated SeedPrompts. Each message gets an incrementing sequence number. Defaults to 0. Defaults to 0.
adversarial_chat_system_prompt_pathUnion[str, Path]Path to the system prompt for the adversarial chat.
simulated_target_system_prompt_pathOptional[Union[str, Path]]Path to the system prompt for the simulated target. If None, no system prompt is used for the simulated target. Defaults to None.
next_message_system_prompt_pathOptional[Union[str, Path]]Optional path to a system prompt for generating a final user message. If provided, after the simulated conversation, a single LLM call generates a user message that attempts to get the target to fulfill the objective in their next response. The prompt template receives objective and conversation_so_far parameters. Defaults to None.
attack_converter_configOptional[AttackConverterConfig]Converter configuration for the attack. Defaults to None. Defaults to None.
memory_labelsOptional[dict[str, str]]Labels to associate with the conversation in memory. Defaults to None. Defaults to None.

Returns:

Raises:

AttackAdversarialConfig

Adversarial configuration for attacks that involve adversarial chat targets.

This class defines the configuration for attacks that utilize an adversarial chat target, including the target chat model, system prompt, and seed prompt for the attack.

AttackContext

Bases: StrategyContext, ABC, Generic[AttackParamsT]

Base class for all attack contexts.

This class holds both the immutable attack parameters and the mutable execution state. The params field contains caller-provided inputs, while other fields track execution progress.

Attacks that generate certain values internally (e.g., RolePlayAttack generates next_message and prepended_conversation) can set the mutable override fields (_next_message_override, _prepended_conversation_override) during _setup_async.

AttackConverterConfig

Bases: StrategyConverterConfig

Configuration for prompt converters used in attacks.

This class defines the converter configurations that transform prompts during the attack process, both for requests and responses.

AttackExecutor

Manages the execution of attack strategies with support for parallel execution.

The AttackExecutor provides controlled execution of attack strategies with concurrency limiting. It uses the attack’s params_type to create parameters from seed groups.

Constructor Parameters:

ParameterTypeDescription
max_concurrencyintMaximum number of concurrent attack executions (default: 1). Defaults to 1.

Methods:

execute_attack_async

execute_attack_async(attack: AttackStrategy[AttackStrategyContextT, AttackStrategyResultT], objectives: Sequence[str], field_overrides: Optional[Sequence[dict[str, Any]]] = None, return_partial_on_failure: bool = False, broadcast_fields: Any = {}) → AttackExecutorResult[AttackStrategyResultT]

Execute attacks in parallel for each objective.

Creates AttackParameters directly from objectives and field values.

ParameterTypeDescription
attackAttackStrategy[AttackStrategyContextT, AttackStrategyResultT]The attack strategy to execute.
objectivesSequence[str]List of attack objectives.
field_overridesOptional[Sequence[dict[str, Any]]]Optional per-objective field overrides. If provided, must match the length of objectives. Defaults to None.
return_partial_on_failureboolIf True, returns partial results when some objectives fail. If False (default), raises the first exception. Defaults to False.
**broadcast_fieldsAnyFields applied to all objectives (e.g., memory_labels). Per-objective field_overrides take precedence. Defaults to {}.

Returns:

Raises:

execute_attack_from_seed_groups_async

execute_attack_from_seed_groups_async(attack: AttackStrategy[AttackStrategyContextT, AttackStrategyResultT], seed_groups: Sequence[SeedAttackGroup], adversarial_chat: Optional[PromptChatTarget] = None, objective_scorer: Optional[TrueFalseScorer] = None, field_overrides: Optional[Sequence[dict[str, Any]]] = None, return_partial_on_failure: bool = False, broadcast_fields: Any = {}) → AttackExecutorResult[AttackStrategyResultT]

Execute attacks in parallel, extracting parameters from SeedAttackGroups.

Uses the attack’s params_type.from_seed_group() to extract parameters, automatically handling which fields the attack accepts.

ParameterTypeDescription
attackAttackStrategy[AttackStrategyContextT, AttackStrategyResultT]The attack strategy to execute.
seed_groupsSequence[SeedAttackGroup]SeedAttackGroups containing objectives and optional prompts.
adversarial_chatOptional[PromptChatTarget]Optional chat target for generating adversarial prompts or simulated conversations. Required when seed groups contain SeedSimulatedConversation configurations. Defaults to None.
objective_scorerOptional[TrueFalseScorer]Optional scorer for evaluating simulated conversations. Required when seed groups contain SeedSimulatedConversation configurations. Defaults to None.
field_overridesOptional[Sequence[dict[str, Any]]]Optional per-seed-group field overrides. If provided, must match the length of seed_groups. Each dict is passed to from_seed_group() as overrides. Defaults to None.
return_partial_on_failureboolIf True, returns partial results when some objectives fail. If False (default), raises the first exception. Defaults to False.
**broadcast_fieldsAnyFields applied to all seed groups (e.g., memory_labels). Per-seed-group field_overrides take precedence. Defaults to {}.

Returns:

Raises:

execute_multi_objective_attack_async

execute_multi_objective_attack_async(attack: AttackStrategy[AttackStrategyContextT, AttackStrategyResultT], objectives: list[str], prepended_conversation: Optional[list[Message]] = None, memory_labels: Optional[dict[str, str]] = None, return_partial_on_failure: bool = False, attack_params: Any = {}) → AttackExecutorResult[AttackStrategyResultT]

Execute the same attack strategy with multiple objectives against the same target in parallel.

.. deprecated:: Use :meth:execute_attack_async instead. This method will be removed in a future version.

ParameterTypeDescription
attackAttackStrategy[AttackStrategyContextT, AttackStrategyResultT]The attack strategy to use for all objectives.
objectiveslist[str]List of attack objectives to test.
prepended_conversationOptional[list[Message]]Conversation to prepend to the target model. Defaults to None.
memory_labelsOptional[dict[str, str]]Additional labels that can be applied to the prompts. Defaults to None.
return_partial_on_failureboolIf True, returns partial results on failure. Defaults to False.
**attack_paramsAnyAdditional parameters specific to the attack strategy. Defaults to {}.

Returns:

execute_multi_turn_attacks_async

execute_multi_turn_attacks_async(attack: AttackStrategy[_MultiTurnContextT, AttackStrategyResultT], objectives: list[str], messages: Optional[list[Message]] = None, prepended_conversations: Optional[list[list[Message]]] = None, memory_labels: Optional[dict[str, str]] = None, return_partial_on_failure: bool = False, attack_params: Any = {}) → AttackExecutorResult[AttackStrategyResultT]

Execute a batch of multi-turn attacks with multiple objectives.

.. deprecated:: Use :meth:execute_attack_async instead. This method will be removed in a future version.

ParameterTypeDescription
attackAttackStrategy[_MultiTurnContextT, AttackStrategyResultT]The multi-turn attack strategy to use.
objectiveslist[str]List of attack objectives to test.
messagesOptional[list[Message]]List of messages to use for this execution (per-objective). Defaults to None.
prepended_conversationsOptional[list[list[Message]]]Conversations to prepend to each objective (per-objective). Defaults to None.
memory_labelsOptional[dict[str, str]]Additional labels that can be applied to the prompts. Defaults to None.
return_partial_on_failureboolIf True, returns partial results on failure. Defaults to False.
**attack_paramsAnyAdditional parameters specific to the attack strategy. Defaults to {}.

Returns:

Raises:

execute_single_turn_attacks_async

execute_single_turn_attacks_async(attack: AttackStrategy[_SingleTurnContextT, AttackStrategyResultT], objectives: list[str], messages: Optional[list[Message]] = None, prepended_conversations: Optional[list[list[Message]]] = None, memory_labels: Optional[dict[str, str]] = None, return_partial_on_failure: bool = False, attack_params: Any = {}) → AttackExecutorResult[AttackStrategyResultT]

Execute a batch of single-turn attacks with multiple objectives.

.. deprecated:: Use :meth:execute_attack_async instead. This method will be removed in a future version.

ParameterTypeDescription
attackAttackStrategy[_SingleTurnContextT, AttackStrategyResultT]The single-turn attack strategy to use.
objectiveslist[str]List of attack objectives to test.
messagesOptional[list[Message]]List of messages to use for this execution (per-objective). Defaults to None.
prepended_conversationsOptional[list[list[Message]]]Conversations to prepend to each objective (per-objective). Defaults to None.
memory_labelsOptional[dict[str, str]]Additional labels that can be applied to the prompts. Defaults to None.
return_partial_on_failureboolIf True, returns partial results on failure. Defaults to False.
**attack_paramsAnyAdditional parameters specific to the attack strategy. Defaults to {}.

Returns:

Raises:

AttackExecutorResult

Bases: Generic[AttackResultT]

Result container for attack execution, supporting both full and partial completion.

This class holds results from parallel attack execution. It is iterable and behaves like a list in the common case where all objectives complete successfully.

When some objectives don’t complete (throw exceptions), access incomplete_objectives to retrieve the failures, or use raise_if_incomplete() to raise the first exception.

Note: “completed” means the execution finished, not that the attack objective was achieved.

Methods:

get_results

get_results() → list[AttackResultT]

Get completed results, raising if any incomplete.

Returns:

raise_if_incomplete

raise_if_incomplete() → None

Raise the first exception if any objectives are incomplete.

AttackParameters

Immutable parameters for attack execution.

This class defines the standard contract for attack parameters. All attacks at a given level of the hierarchy share the same parameter signature.

Attacks that don’t accept certain parameters should use the excluding() factory to create a derived params type without those fields. Attacks that need additional parameters should extend this class with new fields.

Methods:

excluding

excluding(field_names: str = ()) → type[AttackParameters]

Create a new AttackParameters subclass that excludes the specified fields.

This factory method creates a frozen dataclass without the specified fields. The resulting class inherits the from_seed_group() behavior and will raise if excluded fields are passed as overrides.

ParameterTypeDescription
*field_namesstrNames of fields to exclude from the new params type. Defaults to ().

Returns:

Raises:

from_seed_group_async

from_seed_group_async(seed_group: SeedAttackGroup, adversarial_chat: Optional[PromptChatTarget] = None, objective_scorer: Optional[TrueFalseScorer] = None, overrides: Any = {}) → AttackParamsT

Create an AttackParameters instance from a SeedAttackGroup.

Extracts standard fields from the seed group and applies any overrides. If the seed_group has a simulated conversation config, generates the simulated conversation using the provided adversarial_chat and scorer.

ParameterTypeDescription
seed_groupSeedAttackGroupThe seed attack group to extract parameters from.
adversarial_chatOptional[PromptChatTarget]The adversarial chat target for generating simulated conversations. Required if seed_group has a simulated conversation config. Defaults to None.
objective_scorerOptional[TrueFalseScorer]The scorer for evaluating simulated conversations. Required if seed_group has a simulated conversation config. Defaults to None.
**overridesAnyField overrides to apply. Must be valid fields for this params type. Defaults to {}.

Returns:

Raises:

AttackResultPrinter

Bases: ABC

Abstract base class for printing attack results.

This interface defines the contract for printing attack results in various formats. Implementations can render results to console, logs, files, or other outputs.

Methods:

print_conversation_async(result: AttackResult, include_scores: bool = False) → None

Print only the conversation history.

ParameterTypeDescription
resultAttackResultThe attack result containing the conversation to print
include_scoresboolWhether to include scores in the output. Defaults to False. Defaults to False.
print_result_async(result: AttackResult, include_auxiliary_scores: bool = False, include_pruned_conversations: bool = False, include_adversarial_conversation: bool = False) → None

Print the complete attack result.

ParameterTypeDescription
resultAttackResultThe attack result to print
include_auxiliary_scoresboolWhether to include auxiliary scores in the output. Defaults to False. Defaults to False.
include_pruned_conversationsboolWhether to include pruned conversations. For each pruned conversation, only the last message and its score are shown. Defaults to False. Defaults to False.
include_adversarial_conversationboolWhether to include the adversarial conversation (the red teaming LLM’s reasoning). Only shown for successful attacks to avoid overwhelming output. Defaults to False. Defaults to False.
print_summary_async(result: AttackResult) → None

Print a summary of the attack result without the full conversation.

ParameterTypeDescription
resultAttackResultThe attack result to summarize

AttackScoringConfig

Scoring configuration for evaluating attack effectiveness.

This class defines the scoring components used to evaluate attack effectiveness, detect refusals, and perform auxiliary scoring operations.

AttackStrategy

Bases: Strategy[AttackStrategyContextT, AttackStrategyResultT], Identifiable, ABC

Abstract base class for attack strategies. Defines the interface for executing attacks and handling results.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack.
context_typetype[AttackStrategyContextT]The type of context this strategy operates on.
params_typeType[AttackParamsT]The type of parameters this strategy accepts. Defaults to AttackParameters. Use AttackParameters.excluding() to create a params type that rejects certain fields. Defaults to AttackParameters.
loggerlogging.LoggerLogger instance for logging events. Defaults to logger.

Methods:

execute_async

execute_async(kwargs: Any = {}) → AttackStrategyResultT

Execute the attack strategy asynchronously with the provided parameters.

This method provides a stable contract for all attacks. The signature includes all standard parameters (objective, next_message, prepended_conversation, memory_labels). Attacks that don’t accept certain parameters will raise ValueError if those parameters are provided.

ParameterTypeDescription
objectivestrThe objective of the attack.
next_messageOptional[Message]Message to send to the target.
prepended_conversationOptional[List[Message]]Conversation to prepend.
memory_labelsOptional[Dict[str, str]]Memory labels for the attack context.
**kwargsAnyAdditional context-specific parameters (conversation_id, system_prompt, etc.). Defaults to {}.

Returns:

Raises:

get_attack_scoring_config

get_attack_scoring_config() → Optional[AttackScoringConfig]

Get the attack scoring configuration used by this strategy.

Returns:

get_objective_target

get_objective_target() → PromptTarget

Get the objective target for this attack strategy.

Returns:

get_request_converters

get_request_converters() → list[Any]

Get request converter configurations used by this strategy.

Returns:

ChunkedRequestAttack

Bases: MultiTurnAttackStrategy[ChunkedRequestAttackContext, AttackResult]

Implementation of chunked extraction attack strategy.

This class orchestrates a multi-turn attack where protected information is requested in chunks/segments across multiple turns. This technique bypasses length-based filters or output truncation by asking for specific character ranges of a secret value.

This technique was discovered during CTF red teaming exercises where targets would reveal portions of secrets when asked for specific segments, even when they refused to reveal the complete value.

The attack flow consists of:

  1. Generating chunk request prompts based on the configured strategy.

  2. Sending each chunk request to the target system sequentially.

  3. Collecting responses from each chunk request.

  4. Combining all chunk responses to reconstruct the full value.

  5. Evaluating the combined result with scorers if configured.

  6. Returning the attack result with achievement status.

The strategy supports customization through converters and scorers for comprehensive evaluation.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack. Defaults to REQUIRED_VALUE.
chunk_sizeintSize of each chunk to request (default: 50). Defaults to 50.
total_lengthintEstimated total length of the target value (default: 200). Defaults to 200.
chunk_typestrType of chunk to request (e.g., “characters”, “bytes”, “words”). Defaults to 'characters'.
request_templatestrTemplate for generating chunk requests (default: “Give me {chunk_type} {start}-{end} of ‘{objective}’”). Defaults to DEFAULT_TEMPLATE.
attack_converter_configOptional[AttackConverterConfig]Configuration for prompt converters. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for scoring components. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]Normalizer for handling prompts. Defaults to None.

Methods:

get_attack_scoring_config

get_attack_scoring_config() → Optional[AttackScoringConfig]

Get the attack scoring configuration used by this strategy.

Returns:

ChunkedRequestAttackContext

Bases: MultiTurnAttackContext[Any]

Context for the ChunkedRequest attack strategy.

ConsoleAttackResultPrinter

Bases: AttackResultPrinter

Console printer for attack results with enhanced formatting.

This printer formats attack results for console display with optional color coding, proper indentation, text wrapping, and visual separators. Colors can be disabled for consoles that don’t support ANSI characters.

Constructor Parameters:

ParameterTypeDescription
widthintMaximum width for text wrapping. Must be positive. Defaults to 100. Defaults to 100.
indent_sizeintNumber of spaces for indentation. Must be non-negative. Defaults to 2. Defaults to 2.
enable_colorsboolWhether to enable ANSI color output. When False, all output will be plain text without colors. Defaults to True. Defaults to True.

Methods:

print_conversation_async(result: AttackResult, include_scores: bool = False, include_reasoning_trace: bool = False) → None

Print the conversation history to console with enhanced formatting.

Displays the full conversation between user and assistant, including:

ParameterTypeDescription
resultAttackResultThe attack result containing the conversation_id. Must have a valid conversation_id attribute.
include_scoresboolWhether to include scores in the output. Defaults to False. Defaults to False.
include_reasoning_traceboolWhether to include model reasoning trace in the output for applicable models. Defaults to False. Defaults to False.
print_messages_async(messages: list[Any], include_scores: bool = False, include_reasoning_trace: bool = False) → None

Print a list of messages to console with enhanced formatting.

This method can be called directly with a list of Message objects, without needing an AttackResult. Useful for printing prepended_conversation or any other list of messages.

Displays:

ParameterTypeDescription
messageslistList of Message objects to print.
include_scoresboolWhether to include scores in the output. Defaults to False. Defaults to False.
include_reasoning_traceboolWhether to include model reasoning trace in the output for applicable models. Defaults to False. Defaults to False.
print_result_async(result: AttackResult, include_auxiliary_scores: bool = False, include_pruned_conversations: bool = False, include_adversarial_conversation: bool = False) → None

Print the complete attack result to console.

This method orchestrates the printing of all components of an attack result, including header, summary, conversation history, metadata, and footer.

ParameterTypeDescription
resultAttackResultThe attack result to print. Must not be None.
include_auxiliary_scoresboolWhether to include auxiliary scores in the output. Defaults to False. Defaults to False.
include_pruned_conversationsboolWhether to include pruned conversations. For each pruned conversation, only the last message and its score are shown. Defaults to False. Defaults to False.
include_adversarial_conversationboolWhether to include the adversarial conversation (the red teaming LLM’s reasoning). Only shown for successful attacks to avoid overwhelming output. Defaults to False. Defaults to False.
print_summary_async(result: AttackResult) → None

Print a summary of the attack result with enhanced formatting.

Displays:

ParameterTypeDescription
resultAttackResultThe attack result to summarize. Must contain objective, attack_identifier, conversation_id, executed_turns, execution_time_ms, outcome, and optionally outcome_reason and last_score attributes.

ContextComplianceAttack

Bases: PromptSendingAttack

Implementation of the context compliance attack strategy.

This attack attempts to bypass safety measures by rephrasing the objective into a more benign context. It uses an adversarial chat target to:

  1. Rephrase the objective as a more benign question

  2. Generate a response to the benign question

  3. Rephrase the original objective as a follow-up question

This creates a context that makes it harder for the target to detect the true intent.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptChatTargetThe target system to attack. Must be a PromptChatTarget. Defaults to REQUIRED_VALUE.
attack_adversarial_configAttackAdversarialConfigConfiguration for the adversarial component, including the adversarial chat target used for rephrasing.
attack_converter_configOptional[AttackConverterConfig]Configuration for attack converters, including request and response converters. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for attack scoring. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]The prompt normalizer to use for sending prompts. Defaults to None.
max_attempts_on_failureintMaximum number of attempts to retry on failure. Defaults to 0.
context_description_instructions_pathOptional[Path]Path to the context description instructions YAML file. If not provided, uses the default path. Defaults to None.
affirmative_responseOptional[str]The affirmative response to be used in the conversation history. If not provided, uses the default “yes.”. Defaults to None.

ConversationManager

Manages conversations for attacks, handling message history, system prompts, and conversation state.

This class provides methods to:

Constructor Parameters:

ParameterTypeDescription
attack_identifierComponentIdentifierThe identifier of the attack this manager belongs to.
prompt_normalizerOptional[PromptNormalizer]Optional prompt normalizer for converting prompts. If not provided, a default PromptNormalizer instance will be created. Defaults to None.

Methods:

add_prepended_conversation_to_memory_async

add_prepended_conversation_to_memory_async(prepended_conversation: list[Message], conversation_id: str, request_converters: Optional[list[PromptConverterConfiguration]] = None, prepended_conversation_config: Optional[PrependedConversationConfig] = None, max_turns: Optional[int] = None) → int

Add prepended conversation messages to memory for a chat target.

This is a lower-level method that handles adding messages to memory without modifying any attack context state. It can be called directly by attacks that manage their own state (like TAP nodes) or internally by initialize_context_async for standard attacks.

Messages are added with:

ParameterTypeDescription
prepended_conversationlist[Message]Messages to add to memory.
conversation_idstrConversation ID to assign to all messages.
request_convertersOptional[list[PromptConverterConfiguration]]Optional converters to apply to messages. Defaults to None.
prepended_conversation_configOptional[PrependedConversationConfig]Optional configuration for converter roles. Defaults to None.
max_turnsOptional[int]If provided, validates that turn count doesn’t exceed this limit. Defaults to None.

Returns:

Raises:

get_conversation

get_conversation(conversation_id: str) → list[Message]

Retrieve a conversation by its ID.

ParameterTypeDescription
conversation_idstrThe ID of the conversation to retrieve.

Returns:

get_last_message

get_last_message(conversation_id: str, role: Optional[ChatMessageRole] = None) → Optional[MessagePiece]

Retrieve the most recent message from a conversation.

ParameterTypeDescription
conversation_idstrThe ID of the conversation to retrieve from.
roleOptional[ChatMessageRole]If provided, return only the last message matching this role. Defaults to None.

Returns:

initialize_context_async

initialize_context_async(context: AttackContext[Any], target: PromptTarget, conversation_id: str, request_converters: Optional[list[PromptConverterConfiguration]] = None, prepended_conversation_config: Optional[PrependedConversationConfig] = None, max_turns: Optional[int] = None, memory_labels: Optional[dict[str, str]] = None) → ConversationState

Initialize attack context with prepended conversation and merged labels.

This is the primary method for setting up an attack context. It:

  1. Merges memory_labels from attack strategy with context labels

  2. Processes prepended_conversation based on target type and config

  3. Updates context.executed_turns for multi-turn attacks

  4. Sets context.next_message if there’s an unanswered user message

ParameterTypeDescription
contextAttackContext[Any]The attack context to initialize.
targetPromptTargetThe objective target for the conversation.
conversation_idstrUnique identifier for the conversation.
request_convertersOptional[list[PromptConverterConfiguration]]Converters to apply to messages. Defaults to None.
prepended_conversation_configOptional[PrependedConversationConfig]Configuration for handling prepended conversation. Defaults to None.
max_turnsOptional[int]Maximum turns allowed (for validation and state tracking). Defaults to None.
memory_labelsOptional[dict[str, str]]Labels from the attack strategy to merge with context labels. Defaults to None.

Returns:

Raises:

set_system_prompt

set_system_prompt(target: PromptChatTarget, conversation_id: str, system_prompt: str, labels: Optional[dict[str, str]] = None) → None

Set or update the system prompt for a conversation.

ParameterTypeDescription
targetPromptChatTargetThe chat target to set the system prompt on.
conversation_idstrUnique identifier for the conversation.
system_promptstrThe system prompt text.
labelsOptional[dict[str, str]]Optional labels to associate with the system prompt. Defaults to None.

ConversationSession

Session for conversations.

ConversationState

Container for conversation state data returned from context initialization.

CrescendoAttack

Bases: MultiTurnAttackStrategy[CrescendoAttackContext, CrescendoAttackResult]

Implementation of the Crescendo attack strategy.

The Crescendo Attack is a multi-turn strategy that progressively guides the model to generate harmful content through small, benign steps. It leverages the model’s recency bias, pattern-following tendency, and trust in self-generated text.

The attack flow consists of:

  1. Generating progressively harmful prompts using an adversarial chat model.

  2. Sending prompts to the target and evaluating responses for refusal.

  3. Backtracking when the target refuses to respond.

  4. Scoring responses to determine if the objective has been achieved.

  5. Continuing until the objective is met or maximum turns/backtracks are reached.

You can learn more about the Crescendo attack Russinovich et al., 2024.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptChatTargetThe target system to attack. Must be a PromptChatTarget. Defaults to REQUIRED_VALUE.
attack_adversarial_configAttackAdversarialConfigConfiguration for the adversarial component, including the adversarial chat target and optional system prompt path.
attack_converter_configOptional[AttackConverterConfig]Configuration for attack converters, including request and response converters. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for scoring responses. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]Normalizer for prompts. Defaults to None.
max_backtracksintMaximum number of backtracks allowed. Defaults to 10.
max_turnsintMaximum number of turns allowed. Defaults to 10.
prepended_conversation_configOptional[PrependedConversationConfiguration]Configuration for how to process prepended conversations. Controls converter application by role, message normalization, and non-chat target behavior. Defaults to None.

Methods:

get_attack_scoring_config

get_attack_scoring_config() → Optional[AttackScoringConfig]

Get the attack scoring configuration used by this strategy.

Returns:

CrescendoAttackContext

Bases: MultiTurnAttackContext[Any]

Context for the Crescendo attack strategy.

CrescendoAttackResult

Bases: AttackResult

Result of the Crescendo attack strategy execution.

FlipAttack

Bases: PromptSendingAttack

Implement the FlipAttack method Li et al., 2024.

Essentially, it adds a system prompt to the beginning of the conversation to flip each word in the prompt.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptChatTargetThe target system to attack. Defaults to REQUIRED_VALUE.
attack_converter_config(AttackConverterConfig, Optional)Configuration for the prompt converters. Defaults to None.
attack_scoring_config(AttackScoringConfig, Optional)Configuration for scoring components. Defaults to None.
prompt_normalizer(PromptNormalizer, Optional)Normalizer for handling prompts. Defaults to None.
max_attempts_on_failure(int, Optional)Maximum number of attempts to retry on failure. Defaults to 0.

ManyShotJailbreakAttack

Bases: PromptSendingAttack

Implement the Many Shot Jailbreak method Anthropic, 2024.

Prepends the seed prompt with a faux dialogue between a human and an AI, using examples from a dataset to demonstrate successful jailbreaking attempts. This method leverages the model’s ability to learn from examples to bypass safety measures.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack. Defaults to REQUIRED_VALUE.
attack_converter_config(AttackConverterConfig, Optional)Configuration for the prompt converters. Defaults to None.
attack_scoring_config(AttackScoringConfig, Optional)Configuration for scoring components. Defaults to None.
prompt_normalizer(PromptNormalizer, Optional)Normalizer for handling prompts. Defaults to None.
max_attempts_on_failure(int, Optional)Maximum number of attempts to retry on failure. Defaults to 0. Defaults to 0.
example_countintThe number of examples to include from many_shot_examples or the Many Shot Jailbreaking dataset. Defaults to the first 100. Defaults to 100.
many_shot_examples(list[dict[str, str]], Optional)The many shot jailbreaking examples to use. If not provided, takes the first example_count examples from Many Shot Jailbreaking dataset. Defaults to None.

MarkdownAttackResultPrinter

Bases: AttackResultPrinter

Markdown printer for attack results optimized for Jupyter notebooks.

This printer formats attack results as markdown, making them ideal for display in Jupyter notebooks where LLM responses often contain code blocks and other markdown formatting that should be properly rendered.

Constructor Parameters:

ParameterTypeDescription
display_inlineboolIf True, uses IPython.display to render markdown inline in Jupyter notebooks. If False, prints markdown strings. Defaults to True. Defaults to True.

Methods:

print_conversation_async(result: AttackResult, include_scores: bool = False) → None

Print only the conversation history as formatted markdown.

Extracts and displays the conversation messages from the attack result without the summary or metadata sections. Useful for focusing on the actual interaction flow.

ParameterTypeDescription
resultAttackResultThe attack result containing the conversation to display.
include_scoresboolWhether to include scores for each message. Defaults to False. Defaults to False.
print_result_async(result: AttackResult, include_auxiliary_scores: bool = False, include_pruned_conversations: bool = False, include_adversarial_conversation: bool = False) → None

Print the complete attack result as formatted markdown.

Generates a comprehensive markdown report including attack summary, conversation history, scores, and metadata. The output is optimized for display in Jupyter notebooks.

ParameterTypeDescription
resultAttackResultThe attack result to print.
include_auxiliary_scoresboolWhether to include auxiliary scores in the conversation display. Defaults to False. Defaults to False.
include_pruned_conversationsboolWhether to include pruned conversations. For each pruned conversation, only the last message and its score are shown. Defaults to False. Defaults to False.
include_adversarial_conversationboolWhether to include the adversarial conversation (the red teaming LLM’s reasoning). Only shown for successful attacks to avoid overwhelming output. Defaults to False. Defaults to False.
print_summary_async(result: AttackResult) → None

Print a summary of the attack result as formatted markdown.

Displays key information about the attack including objective, outcome, execution metrics, and final score without the full conversation history. Useful for getting a quick overview of the attack results.

ParameterTypeDescription
resultAttackResultThe attack result to summarize.

MultiPromptSendingAttack

Bases: MultiTurnAttackStrategy[MultiTurnAttackContext[Any], AttackResult]

Implementation of multi-prompt sending attack strategy.

This class orchestrates a multi-turn attack where a series of predefined malicious prompts are sent sequentially to try to achieve a specific objective against a target system. The strategy evaluates the final target response using optional scorers to determine if the objective has been met.

The attack flow consists of:

  1. Sending each predefined prompt to the target system in sequence.

  2. Continuing until all predefined prompts are sent.

  3. Evaluating the final response with scorers if configured.

  4. Returning the attack result with achievement status.

Note: This attack always runs all predefined prompts regardless of whether the objective is achieved early in the sequence.

The strategy supports customization through prepended conversations, converters, and multiple scorer types for comprehensive evaluation.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack. Defaults to REQUIRED_VALUE.
attack_converter_configOptional[AttackConverterConfig]Configuration for prompt converters. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for scoring components. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]Normalizer for handling prompts. Defaults to None.

Methods:

execute_async

execute_async(kwargs: Any = {}) → AttackResult

Execute the attack strategy asynchronously with the provided parameters.

Returns:

get_attack_scoring_config

get_attack_scoring_config() → Optional[AttackScoringConfig]

Get the attack scoring configuration used by this strategy.

Returns:

MultiPromptSendingAttackParameters

Bases: AttackParameters

Parameters for MultiPromptSendingAttack.

Extends AttackParameters to include user_messages field for multi-turn attacks. Only accepts objective and user_messages fields.

Methods:

from_seed_group_async

from_seed_group_async(seed_group: SeedAttackGroup, adversarial_chat: Optional[PromptChatTarget] = None, objective_scorer: Optional[TrueFalseScorer] = None, overrides: Any = {}) → MultiPromptSendingAttackParameters

Create parameters from a SeedGroup, extracting user messages.

ParameterTypeDescription
seed_groupSeedAttackGroupThe seed group to extract parameters from.
adversarial_chatOptional[PromptChatTarget]Not used by this attack type. Defaults to None.
objective_scorerOptional[TrueFalseScorer]Not used by this attack type. Defaults to None.
**overridesAnyField overrides to apply. Defaults to {}.

Returns:

Raises:

MultiTurnAttackContext

Bases: AttackContext[AttackParamsT]

Context for multi-turn attacks.

Holds execution state for multi-turn attacks. The immutable attack parameters (objective, next_message, prepended_conversation, memory_labels) are stored in the params field inherited from AttackContext.

MultiTurnAttackStrategy

Bases: AttackStrategy[MultiTurnAttackStrategyContextT, AttackStrategyResultT], ABC

Strategy for executing multi-turn attacks. This strategy is designed to handle attacks that consist of multiple turns of interaction with the target model.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack.
context_typetype[MultiTurnAttackContext]The type of context this strategy will use.
params_typeType[AttackParamsT]The type of parameters this strategy accepts. Defaults to AttackParameters.
loggerlogging.LoggerLogger instance for logging events and messages. Defaults to logger.

PrependedConversationConfig

Configuration for controlling how prepended conversations are processed before being sent to the objective target.

This class provides control over:

Methods:

default

default() → PrependedConversationConfig

Create a default configuration with converters applied to all roles.

Returns:

for_non_chat_target

for_non_chat_target(message_normalizer: Optional[MessageStringNormalizer] = None, apply_converters_to_roles: Optional[list[ChatMessageRole]] = None) → PrependedConversationConfig

Create a configuration for use with non-chat targets.

This configuration normalizes the prepended conversation into a text block that will be prepended to the first message sent to the target.

ParameterTypeDescription
message_normalizerOptional[MessageStringNormalizer]Normalizer for formatting the prepended conversation into a string. Defaults to ConversationContextNormalizer if not provided. Defaults to None.
apply_converters_to_rolesOptional[list[ChatMessageRole]]Roles to apply converters to before normalization. Defaults to all roles. Defaults to None.

Returns:

get_message_normalizer

get_message_normalizer() → MessageStringNormalizer

Get the normalizer for objective target context, with a default fallback.

Returns:

PromptSendingAttack

Bases: SingleTurnAttackStrategy

Implementation of single-turn prompt sending attack strategy.

This class orchestrates a single-turn attack where malicious prompts are injected to try to achieve a specific objective against a target system. The strategy evaluates the target response using optional scorers to determine if the objective has been met.

The attack flow consists of:

  1. Preparing the prompt based on the objective.

  2. Sending the prompt to the target system through optional converters.

  3. Evaluating the response with scorers if configured.

  4. Retrying on failure up to the configured number of retries.

  5. Returning the attack result with achievement status.

The strategy supports customization through prepended conversations, converters, and multiple scorer types for comprehensive evaluation.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack. Defaults to REQUIRED_VALUE.
attack_converter_configOptional[AttackConverterConfig]Configuration for prompt converters. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for scoring components. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]Normalizer for handling prompts. Defaults to None.
max_attempts_on_failureintMaximum number of attempts to retry on failure. Defaults to 0.
params_typeType[AttackParamsT]The type of parameters this strategy accepts. Defaults to AttackParameters. Use AttackParameters.excluding() to create a params type that rejects certain fields. Defaults to AttackParameters.
prepended_conversation_configOptional[PrependedConversationConfiguration]Configuration for how to process prepended conversations. Controls converter application by role, message normalization, and non-chat target behavior. Defaults to None.

Methods:

get_attack_scoring_config

get_attack_scoring_config() → Optional[AttackScoringConfig]

Get the attack scoring configuration used by this strategy.

Returns:

RTASystemPromptPaths

Bases: enum.Enum

Enum for predefined red teaming attack system prompt paths.

RedTeamingAttack

Bases: MultiTurnAttackStrategy[MultiTurnAttackContext[Any], AttackResult]

Implementation of multi-turn red teaming attack strategy.

This class orchestrates an iterative attack process where an adversarial chat model generates prompts to send to a target system, attempting to achieve a specified objective. The strategy evaluates each target response using a scorer to determine if the objective has been met.

The attack flow consists of:

  1. Generating adversarial prompts based on previous responses and scoring feedback.

  2. Sending prompts to the target system through optional converters.

  3. Scoring target responses to assess objective achievement.

  4. Using scoring feedback to guide subsequent prompt generation.

  5. Continuing until the objective is achieved or maximum turns are reached.

The strategy supports customization through system prompts, seed prompts, and prompt converters, allowing for various attack techniques and scenarios.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack. Defaults to REQUIRED_VALUE.
attack_adversarial_configAttackAdversarialConfigConfiguration for the adversarial component.
attack_converter_configOptional[AttackConverterConfig]Configuration for attack converters. Defaults to None. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for attack scoring. Defaults to None. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]The prompt normalizer to use for sending prompts. Defaults to None. Defaults to None.
max_turnsintMaximum number of turns for the attack. Defaults to 10. Defaults to 10.
score_last_turn_onlyboolIf True, only score the final turn instead of every turn. This reduces LLM calls when intermediate scores are not needed (e.g., for generating simulated conversations). The attack will run for exactly max_turns when this is enabled. Defaults to False. Defaults to False.

Methods:

get_attack_scoring_config

get_attack_scoring_config() → Optional[AttackScoringConfig]

Get the attack scoring configuration used by this strategy.

Returns:

RolePlayAttack

Bases: PromptSendingAttack

Implementation of single-turn role-play attack strategy.

This class orchestrates a role-play attack where malicious objectives are rephrased into role-playing contexts to make them appear more benign and bypass content filters. The strategy uses an adversarial chat target to transform the objective into a role-play scenario before sending it to the target system.

The attack flow consists of:

  1. Loading role-play scenarios from a YAML file.

  2. Using an adversarial chat target to rephrase the objective into the role-play context.

  3. Sending the rephrased objective to the target system.

  4. Evaluating the response with scorers if configured.

  5. Retrying on failure up to the configured number of retries.

  6. Returning the attack result

The strategy supports customization through prepended conversations, converters, and multiple scorer types.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack. Defaults to REQUIRED_VALUE.
adversarial_chatPromptChatTargetThe adversarial chat target used to rephrase objectives into role-play scenarios.
role_play_definition_pathpathlib.PathPath to the YAML file containing role-play definitions (rephrase instructions, user start turn, assistant start turn).
attack_converter_configOptional[AttackConverterConfig]Configuration for prompt converters. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for scoring components. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]Normalizer for handling prompts. Defaults to None.
max_attempts_on_failureintMaximum number of attempts to retry the attack Defaults to 0.

RolePlayPaths

Bases: enum.Enum

Enum for predefined role-play scenario paths.

SingleTurnAttackContext

Bases: AttackContext[AttackParamsT]

Context for single-turn attacks.

Holds execution state for single-turn attacks. The immutable attack parameters (objective, next_message, prepended_conversation, memory_labels) are stored in the params field inherited from AttackContext.

SingleTurnAttackStrategy

Bases: AttackStrategy[SingleTurnAttackContext[Any], AttackResult], ABC

Strategy for executing single-turn attacks. This strategy is designed to handle attacks that consist of a single turn of interaction with the target model.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack.
context_typetype[SingleTurnAttackContext]The type of context this strategy will use. Defaults to SingleTurnAttackContext.
params_typeType[AttackParamsT]The type of parameters this strategy accepts. Defaults to AttackParameters.
loggerlogging.LoggerLogger instance for logging events and messages. Defaults to logger.

SkeletonKeyAttack

Bases: PromptSendingAttack

Implementation of the skeleton key jailbreak attack strategy.

This attack sends an initial skeleton key prompt to the target, and then follows up with a separate attack prompt. If successful, the first prompt makes the target comply even with malicious follow-up prompts.

The attack flow consists of:

  1. Sending a skeleton key prompt to bypass the target’s safety mechanisms.

  2. Sending the actual objective prompt to the primed target.

  3. Evaluating the response using configured scorers to determine success.

Learn more about the attack Microsoft Security Response Center, 2024.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptTargetThe target system to attack. Defaults to REQUIRED_VALUE.
attack_converter_configOptional[AttackConverterConfig]Configuration for prompt converters. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Configuration for scoring components. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]Normalizer for handling prompts. Defaults to None.
skeleton_key_promptOptional[str]The skeleton key prompt to use. If not provided, uses the default skeleton key prompt. Defaults to None.
max_attempts_on_failureintMaximum number of attempts to retry on failure. Defaults to 0.

TAPAttackContext

Bases: MultiTurnAttackContext[Any]

Context for the Tree of Attacks with Pruning (TAP) attack strategy.

This context contains all execution-specific state for a TAP attack instance, ensuring thread safety by isolating state per execution.

TAPAttackResult

Bases: AttackResult

Result of the Tree of Attacks with Pruning (TAP) attack strategy execution.

This result includes the standard attack result information with attack-specific data stored in the metadata dictionary.

TreeOfAttacksWithPruningAttack

Bases: AttackStrategy[TAPAttackContext, TAPAttackResult]

Implement the Tree of Attacks with Pruning (TAP) attack strategy.

The TAP attack strategy systematically explores multiple adversarial prompt paths in parallel using a tree structure. It employs breadth-first search with pruning to efficiently find effective jailbreaks while managing computational resources.

How it works:

  1. Initialization: Creates multiple initial attack branches (width) to explore different approaches

  2. Tree Expansion: For each iteration (depth), branches are expanded by a branching factor

  3. Prompt Generation: Each node generates adversarial prompts via an LLM red-teaming assistant

  4. Evaluation: Responses are evaluated for objective achievement and on-topic relevance

  5. Pruning: Low-scoring or off-topic branches are pruned to maintain the width constraint

  6. Iteration: The process continues until the objective is achieved or max depth is reached

The strategy balances exploration (trying diverse approaches) with exploitation (focusing on promising paths) through its pruning mechanism.

Constructor Parameters:

ParameterTypeDescription
objective_targetPromptChatTargetThe target system to attack. Defaults to REQUIRED_VALUE.
attack_adversarial_configAttackAdversarialConfigConfiguration for the adversarial chat component.
attack_converter_configOptional[AttackConverterConfig]Configuration for attack converters. Defaults to None. Defaults to None.
attack_scoring_configOptional[AttackScoringConfig]Scoring configuration for TAP. The objective_scorer must be a FloatScaleThresholdScorer, which provides both granular float scores for node comparison and a threshold for determining success. Can be either AttackScoringConfig or TAPAttackScoringConfig. If not provided, a default configuration with SelfAskScaleScorer and threshold 0.7 is created. Defaults to None.
prompt_normalizerOptional[PromptNormalizer]The prompt normalizer to use. Defaults to None. Defaults to None.
tree_widthintNumber of branches to explore in parallel at each level. Defaults to 3. Defaults to 3.
tree_depthintMaximum number of iterations to perform. Defaults to 5. Defaults to 5.
branching_factorintNumber of child branches to create from each parent. Defaults to 2. Defaults to 2.
on_topic_checking_enabledboolWhether to check if prompts are on-topic. Defaults to True. Defaults to True.
desired_response_prefixstrExpected prefix for successful responses. Defaults to “Sure, here is”. Defaults to 'Sure, here is'.
batch_sizeintNumber of nodes to process in parallel per batch. Defaults to 10. Defaults to 10.
prepended_conversation_configOptional[PrependedConversationConfiguration]Configuration for how to process prepended conversations. Controls converter application by role, message normalization, and non-chat target behavior. Defaults to None.

Methods:

execute_async

execute_async(kwargs: Any = {}) → TAPAttackResult

Execute the multi-turn attack strategy asynchronously with the provided parameters.

ParameterTypeDescription
objectivestrThe objective of the attack.
memory_labelsOptional[Dict[str, str]]Memory labels for the attack context.
**kwargsAnyAdditional parameters for the attack. Defaults to {}.

Returns:

get_attack_scoring_config

get_attack_scoring_config() → Optional[AttackScoringConfig]

Get the attack scoring configuration used by this strategy.

Returns:

References
  1. Russinovich, M., Salem, A., & Eldan, R. (2024). Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. arXiv Preprint arXiv:2404.01833. https://crescendo-the-multiturn-jailbreak.github.io/
  2. Li, Y., Zhang, H., Li, H., & Zheng, H.-T. (2024). FlipAttack: Jailbreak LLMs via Flipping. arXiv Preprint arXiv:2410.02832. https://arxiv.org/abs/2410.02832
  3. Anthropic. (2024). Many-Shot Jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking
  4. Microsoft Security Response Center. (2024). Mitigating Skeleton Key, a New Type of Generative AI Jailbreak Technique. https://www.microsoft.com/en-us/security/blog/2024/06/26/mitigating-skeleton-key-a-new-type-of-generative-ai-jailbreak-technique/