pyrit.score.Scorer#
- class Scorer(*, validator: ScorerPromptValidator)[source]#
Bases: ABC
Abstract base class for scorers.
- __init__(*, validator: ScorerPromptValidator)[source]#
Initialize the Scorer.
- Parameters:
validator (ScorerPromptValidator) – Validator for message pieces and scorer configuration.
Methods
__init__(*, validator) – Initialize the Scorer.
evaluate_async([file_mapping, ...]) – Evaluate this scorer against human-labeled datasets.
get_identifier() – Get an identifier dictionary for the scorer for database storage.
get_scorer_metrics() – Get evaluation metrics for this scorer from the configured evaluation result file.
scale_value_float(value, min_value, max_value) – Scales a value from 0 to 1 based on the given min and max values.
score_async(message, *[, objective, ...]) – Score the message, add the results to the database, and return a list of Score objects.
score_image_async(image_path, *[, objective]) – Score the given image using the chat target.
score_image_batch_async(*, image_paths[, ...]) – Score a batch of images asynchronously.
score_prompts_batch_async(*, messages[, ...]) – Score multiple prompts in batches using the provided objectives.
score_response_async(*, response[, ...]) – Score a response using an objective scorer and optional auxiliary scorers.
score_response_multiple_scorers_async(*, response, scorers[, ...]) – Score a response using multiple scorers in parallel.
score_text_async(text, *[, objective]) – Score the given text against the objective using the chat target.
validate_return_scores(scores) – Validate the scores returned by the scorer.
Attributes
scorer_identifier – Get the scorer identifier.
scorer_type – Get the scorer type based on class hierarchy.
- async evaluate_async(file_mapping: 'ScorerEvalDatasetFiles' | None = None, *, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior | None = None, max_concurrency: int = 10) → 'ScorerMetrics' | None[source]#
Evaluate this scorer against human-labeled datasets.
Uses file mapping to determine which datasets to evaluate and how to aggregate results.
- Parameters:
file_mapping – Optional ScorerEvalDatasetFiles configuration. If not provided, uses the scorer’s configured evaluation_file_mapping. Maps input file patterns to an output result file.
num_scorer_trials – Number of times to score each response (for measuring variance). Defaults to 3.
update_registry_behavior – Controls how existing registry entries are handled. Defaults to RegistryUpdateBehavior.SKIP_IF_EXISTS.
- SKIP_IF_EXISTS (default): Check the registry for existing results; if found, return the cached metrics.
- ALWAYS_UPDATE: Always run evaluation and overwrite any existing registry entry.
- NEVER_UPDATE: Always run evaluation but never write to the registry (for debugging).
max_concurrency – Maximum number of concurrent scoring requests. Defaults to 10.
- Returns:
The evaluation metrics, or None if no datasets found.
- Return type:
ScorerMetrics | None
- Raises:
ValueError – If no file_mapping is provided and no evaluation_file_mapping is configured.
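Example (a minimal sketch): it assumes a concrete Scorer subclass instance whose evaluation_file_mapping is already configured; the function name and printed messages are illustrative only.

```python
from pyrit.score import Scorer


async def run_evaluation(scorer: Scorer) -> None:
    # With no file_mapping argument, the scorer's configured
    # evaluation_file_mapping determines which datasets are evaluated.
    metrics = await scorer.evaluate_async(
        num_scorer_trials=3,   # score each response 3 times to measure variance
        max_concurrency=10,    # cap concurrent scoring requests
    )
    if metrics is None:
        print("No datasets found for this scorer.")
    else:
        print(metrics)
```

Run it with asyncio.run(run_evaluation(my_scorer)) or from an existing event loop.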
- get_identifier() → Dict[str, Any][source]#
Get an identifier dictionary for the scorer for database storage.
Large fields (system_prompt_template, user_prompt_template) are shortened for compact storage. Includes the computed hash of the configuration.
- Returns:
The identifier dictionary containing configuration details and hash.
- Return type:
Dict[str, Any]
- abstract get_scorer_metrics() → 'ScorerMetrics' | None[source]#
Get evaluation metrics for this scorer from the configured evaluation result file.
Looks up metrics by this scorer’s identity hash in the JSONL result file. The result file may contain entries for multiple scorer configurations.
Subclasses must implement this to return the appropriate metrics type:
- TrueFalseScorer subclasses should return ObjectiveScorerMetrics
- FloatScaleScorer subclasses should return HarmScorerMetrics
- Returns:
The metrics for this scorer, or None if not found or not configured.
- Return type:
ScorerMetrics | None
- scale_value_float(value: float, min_value: float, max_value: float) → float[source]#
Scales a value from 0 to 1 based on the given min and max values, e.g., 3 stars on a 1-to-5 star scale scales to 0.5.
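As a standalone sketch of that arithmetic (the actual method may additionally guard the degenerate max_value == min_value case):

```python
def scale_value_float(value: float, min_value: float, max_value: float) -> float:
    # Linear rescale of `value` from [min_value, max_value] to [0, 1].
    return (value - min_value) / (max_value - min_value)


assert scale_value_float(3, 1, 5) == 0.5    # 3 stars on a 1-to-5 star scale
assert scale_value_float(10, 0, 10) == 1.0
```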
- async score_async(message: Message, *, objective: str | None = None, role_filter: Literal['system', 'user', 'assistant', 'simulated_assistant', 'tool', 'developer'] | None = None, skip_on_error_result: bool = False, infer_objective_from_request: bool = False) → list[Score][source]#
Score the message, add the results to the database, and return a list of Score objects.
- Parameters:
message (Message) – The message to be scored.
objective (Optional[str]) – The task or objective based on which the message should be scored. Defaults to None.
role_filter (Optional[ChatMessageRole]) – Only score messages with this exact stored role. Use “assistant” to score only real assistant responses, or “simulated_assistant” to score only simulated responses. Defaults to None (no filtering).
skip_on_error_result (bool) – If True, skip scoring if the message contains an error. Defaults to False.
infer_objective_from_request (bool) – If True, infer the objective from the message’s previous request when objective is not provided. Defaults to False.
- Returns:
A list of Score objects representing the results.
- Return type:
list[Score]
- Raises:
PyritException – If scoring raises a PyRIT exception (re-raised with enhanced context).
RuntimeError – If scoring raises a non-PyRIT exception (wrapped with scorer context).
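Example (a sketch): the concrete scorer and the scored message are assumed to be built elsewhere; the pyrit.models import path for Message and the objective text are assumptions.

```python
from pyrit.models import Message  # assumed import path for Message
from pyrit.score import Scorer


async def score_message(scorer: Scorer, message: Message) -> None:
    scores = await scorer.score_async(
        message,
        objective="Hypothetical objective",  # placeholder objective text
        role_filter="assistant",             # score only real assistant turns
        skip_on_error_result=True,           # skip messages carrying an error
    )
    for score in scores:
        print(score.score_value, score.score_rationale)
```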
- async score_image_async(image_path: str, *, objective: str | None = None) → list[Score][source]#
Score the given image using the chat target.
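Example (a sketch; the image path and objective are placeholders):

```python
from pyrit.score import Scorer


async def score_image(scorer: Scorer) -> None:
    scores = await scorer.score_image_async(
        "generated_output.png",              # hypothetical image path
        objective="Hypothetical objective",
    )
    print(scores)
```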
- async score_image_batch_async(*, image_paths: Sequence[str], objectives: Sequence[str] | None = None, batch_size: int = 10) → list[Score][source]#
Score a batch of images asynchronously.
- Parameters:
image_paths (Sequence[str]) – Sequence of paths to image files to be scored.
objectives (Optional[Sequence[str]]) – Optional sequence of objectives corresponding to each image. If provided, must match the length of image_paths. Defaults to None.
batch_size (int) – Maximum number of images to score concurrently. Defaults to 10.
- Returns:
A list of Score objects representing the scoring results for all images.
- Return type:
list[Score]
- Raises:
ValueError – If the number of objectives does not match the number of image_paths.
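Example (a sketch; file names and objectives are placeholders):

```python
from pyrit.score import Scorer


async def score_images(scorer: Scorer) -> None:
    image_paths = ["out_0.png", "out_1.png"]     # hypothetical files
    objectives = ["Objective A", "Objective B"]  # must match len(image_paths)
    scores = await scorer.score_image_batch_async(
        image_paths=image_paths,
        objectives=objectives,
        batch_size=10,   # at most 10 images scored concurrently
    )
    print(len(scores))
```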
- async score_prompts_batch_async(*, messages: Sequence[Message], objectives: Sequence[str] | None = None, batch_size: int = 10, role_filter: Literal['system', 'user', 'assistant', 'simulated_assistant', 'tool', 'developer'] | None = None, skip_on_error_result: bool = False, infer_objective_from_request: bool = False) → list[Score][source]#
Score multiple prompts in batches using the provided objectives.
- Parameters:
messages (Sequence[Message]) – The messages to be scored.
objectives (Optional[Sequence[str]]) – The objectives/tasks based on which the prompts should be scored. If provided, must have the same length as messages. Defaults to None.
batch_size (int) – The maximum batch size for processing prompts. Defaults to 10.
role_filter (Optional[ChatMessageRole]) – If provided, only score pieces with this role. Defaults to None (no filtering).
skip_on_error_result (bool) – If True, skip scoring pieces that have errors. Defaults to False.
infer_objective_from_request (bool) – If True and objective is empty, attempt to infer the objective from the request. Defaults to False.
- Returns:
A flattened list of Score objects from all scored prompts.
- Return type:
list[Score]
- Raises:
ValueError – If objectives is empty or if the number of objectives doesn’t match the number of messages.
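Example (a sketch): the messages are assumed to be collected elsewhere; the pyrit.models import path for Message and the objective text are assumptions.

```python
from typing import Sequence

from pyrit.models import Message  # assumed import path for Message
from pyrit.score import Scorer


async def score_batch(scorer: Scorer, messages: Sequence[Message]) -> None:
    # One objective per message; a length mismatch raises ValueError.
    objectives = ["Hypothetical objective"] * len(messages)
    scores = await scorer.score_prompts_batch_async(
        messages=messages,
        objectives=objectives,
        batch_size=10,
        skip_on_error_result=True,
    )
    print(len(scores))
```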
- async static score_response_async(*, response: Message, objective_scorer: Scorer | None = None, auxiliary_scorers: List[Scorer] | None = None, role_filter: Literal['system', 'user', 'assistant', 'simulated_assistant', 'tool', 'developer'] = 'assistant', objective: str | None = None, skip_on_error_result: bool = True) → Dict[str, List[Score]][source]#
Score a response using an objective scorer and optional auxiliary scorers.
- Parameters:
response (Message) – Response containing pieces to score.
objective_scorer (Optional[Scorer]) – The main scorer to determine success. Defaults to None.
auxiliary_scorers (Optional[List[Scorer]]) – List of auxiliary scorers to apply. Defaults to None.
role_filter (ChatMessageRole) – Only score pieces with this exact stored role. Defaults to “assistant” (real responses only, not simulated).
objective (Optional[str]) – Task/objective for scoring context. Defaults to None.
skip_on_error_result (bool) – If True, skip scoring pieces that have errors. Defaults to True.
- Returns:
Dictionary with keys auxiliary_scores and objective_scores, containing lists of scores from each type of scorer.
- Return type:
Dict[str, List[Score]]
- Raises:
ValueError – If response is not provided.
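Example (a sketch): the response and both scorers are assumed to be built elsewhere; the dictionary keys match the documented return value, while the Message import path and objective text are assumptions.

```python
from pyrit.models import Message  # assumed import path for Message
from pyrit.score import Scorer


async def score_response(
    response: Message, objective_scorer: Scorer, harm_scorer: Scorer
) -> None:
    results = await Scorer.score_response_async(
        response=response,
        objective_scorer=objective_scorer,  # decides success against the objective
        auxiliary_scorers=[harm_scorer],    # extra signal; does not gate success
        objective="Hypothetical objective",
    )
    for score in results["objective_scores"]:
        print("objective:", score.score_value)
    for score in results["auxiliary_scores"]:
        print("auxiliary:", score.score_value)
```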
- async static score_response_multiple_scorers_async(*, response: Message, scorers: List[Scorer], role_filter: Literal['system', 'user', 'assistant', 'simulated_assistant', 'tool', 'developer'] = 'assistant', objective: str | None = None, skip_on_error_result: bool = True) → List[Score][source]#
Score a response using multiple scorers in parallel.
This method applies each scorer to the first scorable response piece (filtered by role and error), and returns all scores. This is typically used for auxiliary scoring where all results are needed.
- Parameters:
response (Message) – The response containing pieces to score.
scorers (List[Scorer]) – List of scorers to apply.
role_filter (ChatMessageRole) – Only score pieces with this exact stored role. Defaults to “assistant” (real responses only, not simulated).
objective (Optional[str]) – Optional objective description for scoring context. Defaults to None.
skip_on_error_result (bool) – If True, skip scoring pieces that have errors. Defaults to True.
- Returns:
All scores from all scorers.
- Return type:
List[Score]
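Example (a sketch, typically for auxiliary scoring where every result is kept): the response and scorer list are assumed to exist; the Message import path and objective text are assumptions.

```python
from typing import List

from pyrit.models import Message  # assumed import path for Message
from pyrit.score import Scorer


async def score_with_all(response: Message, scorers: List[Scorer]) -> None:
    scores = await Scorer.score_response_multiple_scorers_async(
        response=response,
        scorers=scorers,   # applied in parallel; every result is returned
        objective="Hypothetical objective",
    )
    print(len(scores))
```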
- async score_text_async(text: str, *, objective: str | None = None) → list[Score][source]#
Score the given text against the objective using the chat target.
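Example (a sketch; the text and objective are placeholders, and the scorer is any concrete Scorer instance):

```python
from pyrit.score import Scorer


async def score_text(scorer: Scorer) -> None:
    scores = await scorer.score_text_async(
        "Sure, here is how to ...",          # hypothetical model output
        objective="Hypothetical objective",
    )
    print(scores[0].score_value if scores else "no scores")
```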
- property scorer_identifier: ScorerIdentifier#
Get the scorer identifier. Built lazily on first access.
- Returns:
The identifier containing all configuration parameters.
- Return type:
ScorerIdentifier
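Example (a sketch; assumes `scorer` is any concrete Scorer instance):

```python
from pyrit.score import Scorer


def show_identifier(scorer: Scorer) -> None:
    # The first access builds the identifier; later accesses reuse the cached value.
    print(scorer.scorer_identifier)
```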