pyrit.score.SelfAskRefusalScorer#
- class SelfAskRefusalScorer(*, chat_target: ~pyrit.prompt_target.common.prompt_chat_target.PromptChatTarget, refusal_system_prompt_path: ~pyrit.score.true_false.self_ask_refusal_scorer.RefusalScorerPaths | ~pathlib.Path | str = RefusalScorerPaths.DEFAULT, prompt_format_string: str | None = None, validator: ~pyrit.score.scorer_prompt_validator.ScorerPromptValidator | None = None, score_aggregator: ~collections.abc.Callable[[~collections.abc.Iterable[~pyrit.models.score.Score]], ~pyrit.score.score_aggregator_result.ScoreAggregatorResult] = <function _create_aggregator.<locals>.aggregator>)[source]#
Bases: TrueFalseScorer

A self-ask scorer that detects refusal in AI responses.
This scorer uses a language model to determine whether a response contains a refusal to answer or comply with the given prompt. It’s useful for evaluating whether AI systems are appropriately refusing harmful requests.
The scorer supports two modes via RefusalScorerPaths:
- DEFAULT: Works with or without an explicit objective. Safe completions are NOT considered refusals.
- STRICT: Treats safe completions as refusals. Works best with an explicit objective.
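A minimal usage sketch follows, assuming an OpenAIChatTarget with credentials configured through environment variables (any PromptChatTarget works) and a PyRIT version that requires memory initialization before scores can be persisted:

```python
import asyncio

from pyrit.common import IN_MEMORY, initialize_pyrit
from pyrit.prompt_target import OpenAIChatTarget  # any PromptChatTarget works
from pyrit.score import SelfAskRefusalScorer


async def main() -> None:
    # Scores are written to PyRIT memory, so initialize it first
    # (IN_MEMORY keeps everything ephemeral for this example).
    initialize_pyrit(memory_db_type=IN_MEMORY)

    # The endpoint that judges refusals; assumes OpenAI credentials
    # are configured via environment variables.
    scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

    scores = await scorer.score_text_async(
        text="I'm sorry, but I can't help with that request.",
        objective="Explain how to pick a lock.",
    )
    for score in scores:
        # get_value() is True when the response was judged a refusal.
        print(score.get_value(), score.score_rationale)


asyncio.run(main())
```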
- __init__(*, chat_target: ~pyrit.prompt_target.common.prompt_chat_target.PromptChatTarget, refusal_system_prompt_path: ~pyrit.score.true_false.self_ask_refusal_scorer.RefusalScorerPaths | ~pathlib.Path | str = RefusalScorerPaths.DEFAULT, prompt_format_string: str | None = None, validator: ~pyrit.score.scorer_prompt_validator.ScorerPromptValidator | None = None, score_aggregator: ~collections.abc.Callable[[~collections.abc.Iterable[~pyrit.models.score.Score]], ~pyrit.score.score_aggregator_result.ScoreAggregatorResult] = <function _create_aggregator.<locals>.aggregator>) None[source]#
Initialize the SelfAskRefusalScorer.
- Parameters:
chat_target (PromptChatTarget) – The endpoint that will be used to score the prompt.
refusal_system_prompt_path (Union[RefusalScorerPaths, Path, str]) – The path to the system prompt to use for refusal detection. Can be a RefusalScorerPaths enum value, a Path, or a string path. Defaults to RefusalScorerPaths.DEFAULT.
prompt_format_string (Optional[str]) – The format string for the prompt with placeholders. Use {objective} for the conversation objective and {response} for the response to evaluate. Defaults to "conversation_objective: {objective}\nresponse_to_evaluate_input: {response}".
validator (Optional[ScorerPromptValidator]) – Custom validator. Defaults to None.
score_aggregator (TrueFalseAggregatorFunc) – The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR.
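For example, a sketch of constructing the scorer in STRICT mode with an explicit prompt format (the placeholder layout mirrors the documented default; the RefusalScorerPaths import path is taken from the signature above, and OpenAIChatTarget stands in for any PromptChatTarget):

```python
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer
from pyrit.score.true_false.self_ask_refusal_scorer import RefusalScorerPaths

# STRICT treats safe completions as refusals and works best when an
# explicit objective is passed at scoring time.
strict_scorer = SelfAskRefusalScorer(
    chat_target=OpenAIChatTarget(),
    refusal_system_prompt_path=RefusalScorerPaths.STRICT,
    # {objective} and {response} are the only placeholders substituted;
    # this layout mirrors the documented default.
    prompt_format_string=(
        "conversation_objective: {objective}\n"
        "response_to_evaluate_input: {response}"
    ),
)
```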
Methods
__init__(*, chat_target[, ...]) – Initialize the SelfAskRefusalScorer.
evaluate_async([file_mapping, ...]) – Evaluate this scorer against human-labeled datasets.
get_identifier() – Get the component's identifier, building it lazily on first access.
get_scorer_metrics() – Get evaluation metrics for this scorer from the configured evaluation result file.
scale_value_float(value, min_value, max_value) – Scales a value from 0 to 1 based on the given min and max values.
score_async(message, *[, objective, ...]) – Score the message, add the results to the database, and return a list of Score objects.
score_image_async(image_path, *[, objective]) – Score the given image using the chat target.
score_image_batch_async(*, image_paths[, ...]) – Score a batch of images asynchronously.
score_prompts_batch_async(*, messages[, ...]) – Score multiple prompts in batches using the provided objectives.
score_response_async(*, response[, ...]) – Score a response using an objective scorer and optional auxiliary scorers.
score_response_multiple_scorers_async(*, ...) – Score a response using multiple scorers in parallel.
score_text_async(text, *[, objective]) – Scores the given text based on the task using the chat target.
validate_return_scores(scores) – Validate the scores returned by the scorer.
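As an illustration of the scoring surface above, a short sketch of image scoring; the keyword names follow the score_image_async signature listed, and the file name is purely hypothetical:

```python
import asyncio

from pyrit.common import IN_MEMORY, initialize_pyrit
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer


async def score_screenshot() -> None:
    initialize_pyrit(memory_db_type=IN_MEMORY)
    scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

    # image_path and objective follow the score_image_async signature above;
    # the file name is purely illustrative.
    scores = await scorer.score_image_async(
        image_path="model_response.png",
        objective="Generate instructions for making a weapon.",
    )
    print(scores[0].get_value())  # True when the pictured response is a refusal


asyncio.run(score_screenshot())
```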
Attributes
evaluation_file_mapping
scorer_type – Get the scorer type based on class hierarchy.