pyrit.score.SelfAskRefusalScorer#
- class SelfAskRefusalScorer(*, chat_target: ~pyrit.prompt_target.common.prompt_chat_target.PromptChatTarget, refusal_system_prompt_path: ~pyrit.score.true_false.self_ask_refusal_scorer.RefusalScorerPaths | ~pathlib.Path | str = RefusalScorerPaths.DEFAULT, prompt_format_string: str | None = None, validator: ~pyrit.score.scorer_prompt_validator.ScorerPromptValidator | None = None, score_aggregator: ~collections.abc.Callable[[~collections.abc.Iterable[~pyrit.models.score.Score]], ~pyrit.score.score_aggregator_result.ScoreAggregatorResult] = <function _create_aggregator.<locals>.aggregator>)[source]#
Bases: TrueFalseScorer

A self-ask scorer that detects refusal in AI responses.
This scorer uses a language model to determine whether a response contains a refusal to answer or comply with the given prompt. It’s useful for evaluating whether AI systems are appropriately refusing harmful requests.
The scorer supports two modes via RefusalScorerPaths:
- DEFAULT: Works with or without an explicit objective. Safe completions are NOT considered refusals.
- STRICT: Treats safe completions as refusals. Works best with an explicit objective.
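A minimal usage sketch follows, assuming an OpenAIChatTarget with credentials configured through environment variables (any PromptChatTarget works) and a PyRIT version that requires memory initialization before scores can be persisted:

```python
import asyncio

from pyrit.common import IN_MEMORY, initialize_pyrit
from pyrit.prompt_target import OpenAIChatTarget  # any PromptChatTarget works
from pyrit.score import SelfAskRefusalScorer


async def main() -> None:
    # Scores are written to PyRIT memory, so initialize it first
    # (IN_MEMORY keeps everything ephemeral for this example).
    initialize_pyrit(memory_db_type=IN_MEMORY)

    # The endpoint that judges refusals; assumes OpenAI credentials
    # are configured via environment variables.
    scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

    scores = await scorer.score_text_async(
        text="I'm sorry, but I can't help with that request.",
        objective="Explain how to pick a lock.",
    )
    for score in scores:
        # get_value() is True when the response was judged a refusal.
        print(score.get_value(), score.score_rationale)


asyncio.run(main())
```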
- __init__(*, chat_target: ~pyrit.prompt_target.common.prompt_chat_target.PromptChatTarget, refusal_system_prompt_path: ~pyrit.score.true_false.self_ask_refusal_scorer.RefusalScorerPaths | ~pathlib.Path | str = RefusalScorerPaths.DEFAULT, prompt_format_string: str | None = None, validator: ~pyrit.score.scorer_prompt_validator.ScorerPromptValidator | None = None, score_aggregator: ~collections.abc.Callable[[~collections.abc.Iterable[~pyrit.models.score.Score]], ~pyrit.score.score_aggregator_result.ScoreAggregatorResult] = <function _create_aggregator.<locals>.aggregator>) None[source]#
Initialize the SelfAskRefusalScorer.
- Parameters:
chat_target (PromptChatTarget) – The endpoint that will be used to score the prompt.
refusal_system_prompt_path (Union[RefusalScorerPaths, Path, str]) – The path to the system prompt to use for refusal detection. Can be a RefusalScorerPaths enum value, a Path, or a string path. Defaults to RefusalScorerPaths.DEFAULT.
prompt_format_string (Optional[str]) – The format string for the prompt with placeholders. Use {objective} for the conversation objective and {response} for the response to evaluate. Defaults to "conversation_objective: {objective}\nresponse_to_evaluate_input: {response}".
validator (Optional[ScorerPromptValidator]) – Custom validator. Defaults to None.
score_aggregator (TrueFalseAggregatorFunc) – The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR.
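For example, a sketch of constructing the scorer in STRICT mode with an explicit prompt format (the placeholder layout mirrors the documented default; the RefusalScorerPaths import path is taken from the signature above, and OpenAIChatTarget stands in for any PromptChatTarget):

```python
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer
from pyrit.score.true_false.self_ask_refusal_scorer import RefusalScorerPaths

# STRICT treats safe completions as refusals and works best when an
# explicit objective is passed at scoring time.
strict_scorer = SelfAskRefusalScorer(
    chat_target=OpenAIChatTarget(),
    refusal_system_prompt_path=RefusalScorerPaths.STRICT,
    # {objective} and {response} are the only placeholders substituted;
    # this layout mirrors the documented default.
    prompt_format_string=(
        "conversation_objective: {objective}\n"
        "response_to_evaluate_input: {response}"
    ),
)
```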
Methods
__init__(*, chat_target[, ...]) – Initialize the SelfAskRefusalScorer.
evaluate_async([file_mapping, ...]) – Evaluate this scorer against human-labeled datasets.
get_identifier() – Get the component's identifier, building it lazily on first access.
get_scorer_metrics() – Get evaluation metrics for this scorer from the configured evaluation result file.
scale_value_float(value, min_value, max_value) – Scales a value from 0 to 1 based on the given min and max values.
score_async(message, *[, objective, ...]) – Score the message, add the results to the database, and return a list of Score objects.
score_image_async(image_path, *[, objective]) – Score the given image using the chat target.
score_image_batch_async(*, image_paths[, ...]) – Score a batch of images asynchronously.
score_prompts_batch_async(*, messages[, ...]) – Score multiple prompts in batches using the provided objectives.
score_response_async(*, response[, ...]) – Score a response using an objective scorer and optional auxiliary scorers.
score_response_multiple_scorers_async(*, ...) – Score a response using multiple scorers in parallel.
score_text_async(text, *[, objective]) – Scores the given text based on the task using the chat target.
validate_return_scores(scores) – Validate the scores returned by the scorer.
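As an illustration of the scoring surface above, a short sketch of image scoring; the keyword names follow the score_image_async signature listed, and the file name is purely hypothetical:

```python
import asyncio

from pyrit.common import IN_MEMORY, initialize_pyrit
from pyrit.prompt_target import OpenAIChatTarget
from pyrit.score import SelfAskRefusalScorer


async def score_screenshot() -> None:
    initialize_pyrit(memory_db_type=IN_MEMORY)
    scorer = SelfAskRefusalScorer(chat_target=OpenAIChatTarget())

    # image_path and objective follow the score_image_async signature above;
    # the file name is purely illustrative.
    scores = await scorer.score_image_async(
        image_path="model_response.png",
        objective="Generate instructions for making a weapon.",
    )
    print(scores[0].get_value())  # True when the pictured response is a refusal


asyncio.run(score_screenshot())
```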
Attributes
evaluation_file_mapping
scorer_type – Get the scorer type based on class hierarchy.