pyrit.score.SelfAskRefusalScorer

class SelfAskRefusalScorer(*, chat_target: PromptChatTarget)

Bases: Scorer

A self-ask scorer that detects refusal in AI responses.

This scorer asks a language model whether a response refuses to answer or comply with the given prompt, producing a true/false Score. It is useful for evaluating whether AI systems appropriately refuse harmful requests.

__init__(*, chat_target: PromptChatTarget) → None

Initialize the SelfAskRefusalScorer.

Parameters:

chat_target – The endpoint that will be used to score the prompt.
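Below is a minimal usage sketch, assuming an OpenAIChatTarget configured through environment variables serves as the judge model; the initialization helpers (initialize_pyrit, IN_MEMORY) and the exact environment variable names may differ across PyRIT versions.

    import asyncio

    from pyrit.common import IN_MEMORY, initialize_pyrit
    from pyrit.prompt_target import OpenAIChatTarget
    from pyrit.score import SelfAskRefusalScorer


    async def main():
        # Back PyRIT with an in-memory database for this session.
        initialize_pyrit(memory_db_type=IN_MEMORY)

        # The chat target is the judge model; OpenAIChatTarget reads its
        # endpoint and key from environment variables by default.
        scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

        scores = await scorer.score_text_async(
            text="I'm sorry, but I can't help with that.",
            task="Describe how to pick a lock.",
        )
        for score in scores:
            print(score.score_value, "-", score.score_rationale)


    asyncio.run(main())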

Methods

__init__(*, chat_target)

Initialize the SelfAskRefusalScorer.

get_identifier()

Returns an identifier dictionary for the scorer.

scale_value_float(value, min_value, max_value)

Scales a value from 0 to 1 based on the given min and max values.

score_async(request_response, *[, task])

Scores the prompt and determines whether the response is a refusal.

score_image_async(image_path, *[, task])

Scores the given image using the chat target.

score_prompts_with_tasks_batch_async(*, ...)

Scores a batch of prompts with their corresponding tasks.

score_response_async(*, response, scorers[, ...])

Score a response using multiple scorers in parallel.

score_response_select_first_success_async(*, ...)

Score response pieces sequentially until finding a successful score.

score_response_with_objective_async(*, response)

Score a response using both auxiliary and objective scorers.

score_responses_inferring_tasks_batch_async(*, ...)

Scores a batch of responses (ignores non-assistant messages).

score_text_async(text, *[, task])

Scores the given text based on the task using the chat target.

validate(request_response, *[, task])

Validates the request_response piece to score.

Attributes

scorer_type

async score_async(request_response: PromptRequestPiece, *, task: str | None = None) → list[Score]

Scores the prompt and determines whether the response is a refusal.

Parameters:
  • request_response (PromptRequestPiece) – The piece to score.

  • task (str, optional) – The task based on which the text should be scored (the original attacker model’s objective).

Returns:

The request_response scored; the score value indicates whether the response was a refusal.

Return type:

list[Score]
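For a single response piece, a sketch like the following applies (run inside an async context, with scorer created as in the earlier example; the PromptRequestPiece values are illustrative):

    from pyrit.models import PromptRequestPiece

    # An assistant response to check for refusal; role and original_value
    # are the essential fields here, everything else is defaulted.
    piece = PromptRequestPiece(
        role="assistant",
        original_value="I'm sorry, but I can't assist with that request.",
    )

    # task is the original attacker objective the response is judged against.
    scores = await scorer.score_async(
        request_response=piece,
        task="Explain how to pick a lock.",
    )
    print(scores[0].score_value)  # e.g. "True" when the response is a refusal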

scorer_type: ScoreType

validate(request_response: PromptRequestPiece, *, task: str | None = None) → None

Validates the request_response piece to score, since some scorers require specific PromptRequestPiece types or values.

Parameters:
  • request_response (PromptRequestPiece) – The request response to be validated.

  • task (str, optional) – The task based on which the text should be scored (the original attacker model’s objective).
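As a sketch of the contract (the exact exception type raised on an invalid piece is an assumption here), validate returns None when the piece can be scored and raises otherwise:

    # Hypothetical error handling: validate is expected to return None on
    # success and raise (e.g. ValueError) for pieces it cannot score.
    try:
        scorer.validate(piece, task="Explain how to pick a lock.")
        print("Piece is valid for scoring.")
    except ValueError as exc:
        print(f"Piece cannot be scored: {exc}")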