pyrit.score.SelfAskRefusalScorer#

class SelfAskRefusalScorer(*, chat_target: ~pyrit.prompt_target.common.prompt_chat_target.PromptChatTarget, refusal_system_prompt_path: ~pyrit.score.true_false.self_ask_refusal_scorer.RefusalScorerPaths | ~pathlib.Path | str = RefusalScorerPaths.DEFAULT, prompt_format_string: str | None = None, validator: ~pyrit.score.scorer_prompt_validator.ScorerPromptValidator | None = None, score_aggregator: ~collections.abc.Callable[[~collections.abc.Iterable[~pyrit.models.score.Score]], ~pyrit.score.score_aggregator_result.ScoreAggregatorResult] = <function _create_aggregator.<locals>.aggregator>)[source]#

Bases: TrueFalseScorer

A self-ask scorer that detects refusal in AI responses.

This scorer uses a language model to determine whether a response contains a refusal to answer or comply with the given prompt. It’s useful for evaluating whether AI systems are appropriately refusing harmful requests.

The scorer supports two modes via RefusalScorerPaths:

  • DEFAULT: Works with or without an explicit objective.

    Safe completions are NOT considered refusals.

  • STRICT: Treats safe completions as refusals. Works best with an explicit objective.
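The DEFAULT/STRICT distinction can be illustrated with a toy, keyword-based sketch. This is not the scorer's actual logic (which delegates the refusal judgment to the LLM behind `chat_target`); the marker phrases and the `toy_refusal_check` helper below are assumptions purely for illustration:

```python
def toy_refusal_check(response: str, strict: bool = False) -> bool:
    """Toy stand-in for the LLM judgment: True means 'refusal detected'."""
    # Outright refusals count as refusals in both modes.
    refusal_markers = ("i cannot", "i can't", "i won't")
    # A "safe completion" still answers the request, but in a harm-reduced way.
    safe_completion_markers = ("instead, here is a safer",)

    text = response.lower()
    if any(marker in text for marker in refusal_markers):
        return True
    if strict and any(marker in text for marker in safe_completion_markers):
        # Only STRICT mode treats safe completions as refusals.
        return True
    return False
```

Under DEFAULT semantics a safe completion is not a refusal; under STRICT semantics the same response is scored as one.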

__init__(*, chat_target: ~pyrit.prompt_target.common.prompt_chat_target.PromptChatTarget, refusal_system_prompt_path: ~pyrit.score.true_false.self_ask_refusal_scorer.RefusalScorerPaths | ~pathlib.Path | str = RefusalScorerPaths.DEFAULT, prompt_format_string: str | None = None, validator: ~pyrit.score.scorer_prompt_validator.ScorerPromptValidator | None = None, score_aggregator: ~collections.abc.Callable[[~collections.abc.Iterable[~pyrit.models.score.Score]], ~pyrit.score.score_aggregator_result.ScoreAggregatorResult] = <function _create_aggregator.<locals>.aggregator>) None[source]#

Initialize the SelfAskRefusalScorer.

Parameters:
  • chat_target (PromptChatTarget) – The endpoint that will be used to score the prompt.

  • refusal_system_prompt_path (Union[RefusalScorerPaths, Path, str]) – The path to the system prompt to use for refusal detection. Can be a RefusalScorerPaths enum value, a Path, or a string path. Defaults to RefusalScorerPaths.DEFAULT.

  • prompt_format_string (Optional[str]) – The format string for the prompt with placeholders. Use {objective} for the conversation objective and {response} for the response to evaluate. Defaults to "conversation_objective: {objective}\nresponse_to_evaluate_input: {response}".

  • validator (Optional[ScorerPromptValidator]) – Custom validator. Defaults to None.

  • score_aggregator (TrueFalseAggregatorFunc) – The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR.
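As a quick check of how the default prompt_format_string is filled, it is ordinary `str.format` substitution, nothing pyrit-specific; and the default `TrueFalseScoreAggregator.OR` aggregation behaves like Python's `any()` over the individual true/false score values (the objective/response strings below are made up for illustration):

```python
# Default format string from the docstring above, with the two placeholders.
default_fmt = (
    "conversation_objective: {objective}\n"
    "response_to_evaluate_input: {response}"
)
rendered = default_fmt.format(
    objective="Get instructions for picking a lock",
    response="I can't help with that.",
)
# rendered spans two lines: the objective, then the response to evaluate.

# OR aggregation: the result is True if at least one score is True.
individual_scores = [False, True, False]
aggregated = any(individual_scores)
```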

Methods

__init__(*, chat_target[, ...])

Initialize the SelfAskRefusalScorer.

evaluate_async([file_mapping, ...])

Evaluate this scorer against human-labeled datasets.

get_identifier()

Get the component's identifier, building it lazily on first access.

get_scorer_metrics()

Get evaluation metrics for this scorer from the configured evaluation result file.

scale_value_float(value, min_value, max_value)

Scales a value from 0 to 1 based on the given min and max values.

score_async(message, *[, objective, ...])

Score the message, add the results to the database, and return a list of Score objects.

score_image_async(image_path, *[, objective])

Score the given image using the chat target.

score_image_batch_async(*, image_paths[, ...])

Score a batch of images asynchronously.

score_prompts_batch_async(*, messages[, ...])

Score multiple prompts in batches using the provided objectives.

score_response_async(*, response[, ...])

Score a response using an objective scorer and optional auxiliary scorers.

score_response_multiple_scorers_async(*, ...)

Score a response using multiple scorers in parallel.

score_text_async(text, *[, objective])

Scores the given text based on the task using the chat target.

validate_return_scores(scores)

Validate the scores returned by the scorer.

Attributes

evaluation_file_mapping

scorer_type

Get the scorer type based on class hierarchy.