Scoring functionality for evaluating AI model responses across various dimensions including harm detection, objective completion, and content classification.
Functions¶
create_conversation_scorer¶
create_conversation_scorer(scorer: Scorer, validator: Optional[ScorerPromptValidator] = None) → Scorer
Create a ConversationScorer that inherits from the same type as the wrapped scorer.
This factory dynamically creates a ConversationScorer class that inherits from the wrapped scorer’s base class (FloatScaleScorer or TrueFalseScorer), ensuring the returned scorer is an instance of both ConversationScorer and the wrapped scorer’s type.
| Parameter | Type | Description |
|---|---|---|
scorer | Scorer | The scorer to wrap for conversation-level evaluation. Must be an instance of FloatScaleScorer or TrueFalseScorer. |
validator | Optional[ScorerPromptValidator] | Optional validator override. If not provided, uses the wrapped scorer’s validator. Defaults to None. |
Returns:
Scorer — A ConversationScorer instance that is also an instance of the wrapped scorer’s type.
Raises:
ValueError — If the scorer is not an instance of FloatScaleScorer or TrueFalseScorer.
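The dynamic-inheritance behavior described above can be sketched with plain Python. All classes here are illustrative stand-ins, not the real PyRIT types; the point is that `type()` builds a wrapper class inheriting from the wrapped scorer's own base, so `isinstance` checks pass for both:

```python
# Illustrative sketch only: stand-in classes, not the real PyRIT types.
class FloatScaleScorer:
    pass

class TrueFalseScorer:
    pass

class MyTrueFalseScorer(TrueFalseScorer):
    pass

def create_conversation_scorer(scorer):
    for base in (FloatScaleScorer, TrueFalseScorer):
        if isinstance(scorer, base):
            # Dynamically build a class inheriting from the wrapped
            # scorer's base so isinstance checks pass for both types.
            conversation_cls = type("ConversationScorer", (base,), {})
            wrapper = conversation_cls()
            wrapper.wrapped = scorer
            return wrapper
    raise ValueError("scorer must be a FloatScaleScorer or TrueFalseScorer")

wrapper = create_conversation_scorer(MyTrueFalseScorer())
print(isinstance(wrapper, TrueFalseScorer))  # True
```

Passing anything that is not an instance of one of the two bases raises ValueError, as documented.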
get_all_harm_metrics¶
get_all_harm_metrics(harm_category: str) → list[ScorerMetricsWithIdentity[HarmScorerMetrics]]
Load all harm scorer metrics for a specific harm category.
Returns a list of ScorerMetricsWithIdentity[HarmScorerMetrics] objects that wrap
the scorer’s identity information and its performance metrics, enabling clean attribute
access like entry.metrics.mean_absolute_error or entry.metrics.harm_category.
| Parameter | Type | Description |
|---|---|---|
harm_category | str | The harm category to load metrics for (e.g., “hate_speech”, “violence”). |
Returns:
list[ScorerMetricsWithIdentity[HarmScorerMetrics]] — A list of metrics with scorer identity. Access metrics via entry.metrics.mean_absolute_error, entry.metrics.harm_category, etc., and scorer info via entry.scorer_identifier.class_name, etc.
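The `ScorerMetricsWithIdentity` access pattern amounts to a thin wrapper pairing identity with metrics. A hypothetical dataclass sketch (field names mirror the documented access pattern, not the real class definitions):

```python
from dataclasses import dataclass

# Hypothetical stand-ins mirroring the documented access pattern;
# field names come from the text above, not the real classes.
@dataclass
class ScorerIdentifier:
    class_name: str

@dataclass
class HarmMetrics:
    mean_absolute_error: float
    harm_category: str

@dataclass
class MetricsWithIdentity:
    scorer_identifier: ScorerIdentifier
    metrics: HarmMetrics

entry = MetricsWithIdentity(
    scorer_identifier=ScorerIdentifier(class_name="ExampleScorer"),
    metrics=HarmMetrics(mean_absolute_error=0.12, harm_category="hate_speech"),
)
print(entry.metrics.mean_absolute_error)   # 0.12
print(entry.scorer_identifier.class_name)  # ExampleScorer
```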
get_all_objective_metrics¶
get_all_objective_metrics(file_path: Optional[Path] = None) → list[ScorerMetricsWithIdentity[ObjectiveScorerMetrics]]
Load all objective scorer metrics with full scorer identity for comparison.
Returns a list of ScorerMetricsWithIdentity[ObjectiveScorerMetrics] objects that wrap
the scorer’s identity information and its performance metrics, enabling clean attribute
access like entry.metrics.accuracy or entry.metrics.f1_score.
| Parameter | Type | Description |
|---|---|---|
file_path | Optional[Path] | Path to a specific JSONL file to load. If not provided, uses the default path SCORER_EVALS_PATH / “objective” / “objective_achieved_metrics.jsonl”. Defaults to None. |
Returns:
list[ScorerMetricsWithIdentity[ObjectiveScorerMetrics]] — A list of metrics with scorer identity. Access metrics via entry.metrics.accuracy, entry.metrics.f1_score, etc., and scorer info via entry.scorer_identifier.class_name, etc.
AudioFloatScaleScorer¶
Bases: FloatScaleScorer
A scorer that processes audio files by transcribing them and scoring the transcript.
The AudioFloatScaleScorer transcribes audio to text using Azure Speech-to-Text, then scores the transcript using a FloatScaleScorer.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
text_capable_scorer | FloatScaleScorer | A FloatScaleScorer capable of processing text. This scorer will be used to evaluate the transcribed audio content. |
validator | Optional[ScorerPromptValidator] | Validator for the scorer. Defaults to None, which uses the audio_path data type validator. |
use_entra_auth | Optional[bool] | Whether to use Entra ID authentication for Azure Speech. Defaults to None, which is treated as True. |
AudioTrueFalseScorer¶
Bases: TrueFalseScorer
A scorer that processes audio files by transcribing them and scoring the transcript.
The AudioTrueFalseScorer transcribes audio to text using Azure Speech-to-Text, then scores the transcript using a TrueFalseScorer.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
text_capable_scorer | TrueFalseScorer | A TrueFalseScorer capable of processing text. This scorer will be used to evaluate the transcribed audio content. |
validator | Optional[ScorerPromptValidator] | Validator for the scorer. Defaults to None, which uses the audio_path data type validator. |
use_entra_auth | Optional[bool] | Whether to use Entra ID authentication for Azure Speech. Defaults to None, which is treated as True. |
AzureContentFilterScorer¶
Bases: FloatScaleScorer
A scorer that uses Azure Content Safety API to evaluate text and images for harmful content.
This scorer analyzes content across multiple harm categories (hate, self-harm, sexual, violence) and returns a score for each category in the range [0, 1], where higher scores indicate more severe content. Supports both text and image inputs.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
endpoint | Optional[str] | The Azure Content Safety endpoint. Defaults to None. |
api_key | `Optional[str | Callable[[], str]]` | The API key for the Azure Content Safety resource, or a callable that returns one. Defaults to None. |
harm_categories | Optional[list[TextCategory]] | The harm categories you want to query for as defined in azure.ai.contentsafety.models.TextCategory. If not provided, defaults to all categories. Defaults to None. |
validator | Optional[ScorerPromptValidator] | Custom validator for the scorer. Defaults to None. |
Methods:
evaluate_async¶
evaluate_async(file_mapping: Optional[ScorerEvalDatasetFiles] = None, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = None, max_concurrency: int = 10) → Optional[ScorerMetrics]
Evaluate this scorer against human-labeled datasets.
AzureContentFilterScorer requires exactly one harm category to be configured for evaluation. This ensures each score corresponds to exactly one category in the ground truth dataset.
| Parameter | Type | Description |
|---|---|---|
file_mapping | Optional[ScorerEvalDatasetFiles] | Optional ScorerEvalDatasetFiles configuration. If not provided, uses the mapping based on the configured harm category. Defaults to None. |
num_scorer_trials | int | Number of times to score each response. Defaults to 3. |
update_registry_behavior | RegistryUpdateBehavior | Controls how existing registry entries are handled. SKIP_IF_EXISTS: check the registry for existing results and return cached metrics if found. ALWAYS_UPDATE: always run evaluation and overwrite any existing registry entry. NEVER_UPDATE: always run evaluation but never write to the registry (for debugging). Defaults to None, which is treated as SKIP_IF_EXISTS. |
max_concurrency | int | Maximum concurrent scoring requests. Defaults to 10. |
Returns:
Optional[ScorerMetrics] — The evaluation metrics, or None if no datasets are found.
Raises:
ValueError — If more than one harm category is configured.
BatchScorer¶
A utility class for scoring prompts in batches in a parallelizable and convenient way.
This class provides functionality to score existing prompts stored in memory without any target interaction, making it a pure scoring utility.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
batch_size | int | The maximum batch size for sending prompts. Defaults to 10. Note: if using a scorer that takes a prompt target and providing max requests per minute on the target, set this to 1 to ensure proper rate limit management. |
Methods:
score_responses_by_filters_async¶
score_responses_by_filters_async(scorer: Scorer, attack_id: Optional[str | uuid.UUID] = None, conversation_id: Optional[str | uuid.UUID] = None, prompt_ids: Optional[list[str] | list[uuid.UUID]] = None, labels: Optional[dict[str, str]] = None, sent_after: Optional[datetime] = None, sent_before: Optional[datetime] = None, original_values: Optional[list[str]] = None, converted_values: Optional[list[str]] = None, data_type: Optional[str] = None, not_data_type: Optional[str] = None, converted_value_sha256: Optional[list[str]] = None, objective: str = '') → list[Score]
Score the responses that match the specified filters.
| Parameter | Type | Description |
|---|---|---|
scorer | Scorer | The Scorer object to use for scoring. |
attack_id | `Optional[str | uuid.UUID]` | The attack ID to filter by. Defaults to None. |
conversation_id | `Optional[str | uuid.UUID]` | The conversation ID to filter by. Defaults to None. |
prompt_ids | `Optional[list[str] | list[uuid.UUID]]` | A list of prompt IDs to filter by. Defaults to None. |
labels | Optional[dict[str, str]] | A dictionary of labels to filter by. Defaults to None. |
sent_after | Optional[datetime] | Filter for prompts sent after this datetime. Defaults to None. |
sent_before | Optional[datetime] | Filter for prompts sent before this datetime. Defaults to None. |
original_values | Optional[list[str]] | A list of original values to filter by. Defaults to None. |
converted_values | Optional[list[str]] | A list of converted values to filter by. Defaults to None. |
data_type | Optional[str] | The data type to filter by. Defaults to None. |
not_data_type | Optional[str] | The data type to exclude. Defaults to None. |
converted_value_sha256 | Optional[list[str]] | A list of SHA256 hashes of converted values to filter by. Defaults to None. |
objective | str | A task used to give the scorer more context on what exactly to score. A task might be the request prompt text or the original attack model’s objective. Note: the same task is applied to all matched prompts. Defaults to ''. |
Returns:
list[Score] — A list of Score objects for responses that match the specified filters.
Raises:
ValueError— If no entries match the provided filters.
ConsoleScorerPrinter¶
Bases: ScorerPrinter
Console printer for scorer information with enhanced formatting.
This printer formats scorer details for console display with optional color coding, proper indentation, and visual hierarchy. Colors can be disabled for consoles that don’t support ANSI characters.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
indent_size | int | Number of spaces for indentation. Must be non-negative. Defaults to 2. |
enable_colors | bool | Whether to enable ANSI color output. When False, all output is plain text without colors. Defaults to True. |
Methods:
print_harm_scorer¶
print_harm_scorer(scorer_identifier: ComponentIdentifier, harm_category: str) → None
Print harm scorer information including type, nested scorers, and evaluation metrics.
This method displays:
Scorer type and identity information
Nested sub-scorers (for composite scorers)
Harm evaluation metrics (MAE, Krippendorff alpha) from the registry
| Parameter | Type | Description |
|---|---|---|
scorer_identifier | ComponentIdentifier | The scorer identifier to print information for. |
harm_category | str | The harm category for looking up metrics (e.g., “hate_speech”, “violence”). |
print_objective_scorer¶
print_objective_scorer(scorer_identifier: ComponentIdentifier) → None
Print objective scorer information including type, nested scorers, and evaluation metrics.
This method displays:
Scorer type and identity information
Nested sub-scorers (for composite scorers)
Objective evaluation metrics (accuracy, precision, recall, F1) from the registry
| Parameter | Type | Description |
|---|---|---|
scorer_identifier | ComponentIdentifier | The scorer identifier to print information for. |
ContentClassifierPaths¶
Bases: enum.Enum
Paths to content classifier YAML files.
ConversationScorer¶
Bases: Scorer, ABC
Scorer that evaluates entire conversation history rather than individual messages.
This scorer wraps another scorer (FloatScaleScorer or TrueFalseScorer) and evaluates the full conversation context. Useful for multi-turn conversations where context matters (e.g., psychosocial harms that emerge over time or persuasion/deception over many messages).
The ConversationScorer dynamically inherits from the same base class as the wrapped scorer, ensuring proper type compatibility.
Note: This class cannot be instantiated directly. Use create_conversation_scorer() factory instead.
Methods:
validate_return_scores¶
validate_return_scores(scores: list[Score]) → None
Validate scores by delegating to the wrapped scorer’s validation.
| Parameter | Type | Description |
|---|---|---|
scores | list[Score] | The scores to validate. |
DecodingScorer¶
Bases: TrueFalseScorer
Scorer that checks if the request values are in the output using a text matching strategy.
This scorer checks if any of the user request values (original_value, converted_value, or metadata decoded_text) match the response converted_value using the configured text matching strategy.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
text_matcher | Optional[TextMatching] | The text matching strategy to use. Defaults to None, which uses ExactTextMatching with case_sensitive=False. |
categories | Optional[list[str]] | Optional list of categories for the score. Defaults to None. |
aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
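The matching behavior described above (case-insensitive exact matching, OR-aggregated across request values) might look roughly like this, sketched with plain functions rather than the actual PyRIT classes:

```python
def exact_match(request_value: str, response_value: str, case_sensitive: bool = False) -> bool:
    """Exact string equality, optionally case-insensitive (an assumed reading
    of ExactTextMatching with case_sensitive=False)."""
    if not case_sensitive:
        return request_value.lower() == response_value.lower()
    return request_value == response_value

def decoding_score(request_values, response_value) -> bool:
    # OR aggregation: True if any request value matches the response.
    return any(exact_match(v, response_value) for v in request_values)

print(decoding_score(["Secret Phrase", "other"], "secret phrase"))  # True
print(decoding_score(["missing"], "something else"))                # False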
FloatScaleScoreAggregator¶
Namespace for float scale score aggregators that return a single aggregated score.
All aggregators return a list containing one ScoreAggregatorResult that combines all input scores together, preserving all categories.
FloatScaleScorer¶
Bases: Scorer
Base class for scorers that return floating-point scores in the range [0, 1].
This scorer evaluates prompt responses and returns numeric scores indicating the degree to which a response exhibits certain characteristics. Each piece in a request response is scored independently, returning one score per piece.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
validator | ScorerPromptValidator | A validator object used to validate scores. |
Methods:
get_scorer_metrics¶
get_scorer_metrics() → Optional[HarmScorerMetrics]
Get evaluation metrics for this scorer from the configured evaluation result file.
Returns:
Optional[HarmScorerMetrics] — The metrics for this scorer, or None if not found or not configured.
validate_return_scores¶
validate_return_scores(scores: list[Score]) → None
Validate that the returned scores are within the valid range [0, 1].
Raises:
ValueError — If any score is not between 0 and 1.
FloatScaleScorerAllCategories¶
Namespace for float scale score aggregators that combine all categories.
These aggregators ignore category boundaries and aggregate all scores together, returning a single ScoreAggregatorResult with all categories combined.
FloatScaleScorerByCategory¶
Namespace for float scale score aggregators that group by category.
These aggregators return multiple ScoreAggregatorResult objects (one per category). Useful for scorers like AzureContentFilterScorer that return multiple scores per item.
FloatScaleThresholdScorer¶
Bases: TrueFalseScorer
A scorer that applies a threshold to a float scale score to make it a true/false score.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
scorer | FloatScaleScorer | The underlying float scale scorer to use. |
threshold | float | The threshold value between 0 and 1. Scores >= threshold are True, otherwise False. |
float_scale_aggregator | FloatScaleAggregatorFunc | The aggregator function to use for combining multiple float scale scores. Defaults to FloatScaleScoreAggregator.MAX. |
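The threshold conversion can be sketched as a pure function: aggregate the float scores (MAX by default), then compare against the threshold. The names here are illustrative, not the library's:

```python
def threshold_score(float_scores, threshold, aggregator=max):
    """Aggregate float-scale scores in [0, 1], then apply the threshold."""
    return aggregator(float_scores) >= threshold

print(threshold_score([0.2, 0.7, 0.4], threshold=0.5))  # True  (max 0.7 >= 0.5)
print(threshold_score([0.2, 0.3], threshold=0.5))       # False (max 0.3 <  0.5)
```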
GandalfScorer¶
Bases: TrueFalseScorer
A scorer for evaluating responses in Gandalf challenges.
This scorer checks whether a password or secret has been successfully extracted from a Gandalf-style challenge by submitting the response to the Gandalf API and checking if the password is correct. Returns True if the password was successfully extracted.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
level | GandalfLevel | The Gandalf challenge level to score against. |
chat_target | PromptChatTarget | The chat target used for password extraction. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None, which uses the text data type validator. |
score_aggregator | TrueFalseAggregatorFunc | Aggregator for combining scores. Defaults to TrueFalseScoreAggregator.OR. |
HarmHumanLabeledEntry¶
Bases: HumanLabeledEntry
A class that represents a human-labeled dataset entry for a specific harm category. This class includes the Messages and a list of human scores, which are floats between 0.0 and 1.0 inclusive, representing the degree of harm severity where 0.0 is minimal and 1.0 is maximal. The harm category is a string that represents the type of harm (e.g., “hate_speech”, “misinformation”, etc.).
HarmScorerEvaluator¶
Bases: ScorerEvaluator
A class that evaluates a harm scorer against HumanLabeledDatasets of type HARM.
HarmScorerMetrics¶
Bases: ScorerMetrics
Metrics for evaluating a harm scorer against a HumanLabeledDataset.
Methods:
get_harm_definition¶
get_harm_definition() → Optional[HarmDefinition]
Load and return the HarmDefinition object for this metrics instance.
Loads the harm definition YAML file specified in harm_definition and returns it as a HarmDefinition object. The result is cached after the first load.
Returns:
Optional[HarmDefinition] — The loaded harm definition object, or None if harm_definition is not set.
Raises:
FileNotFoundError — If the harm definition file does not exist.
ValueError — If the harm definition file is invalid.
HumanInTheLoopScorerGradio¶
Bases: TrueFalseScorer
Creates scores from manual human input using Gradio and adds them to the database.
In the future this will not be a TrueFalseScorer. However, it is all that is supported currently.
Deprecated: This Gradio-based scorer is deprecated and will be removed in v0.13.0. Use the React-based GUI instead.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
open_browser | bool | If True, the scorer opens the Gradio interface in a browser instead of in PyWebview. Defaults to False. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | Aggregator for combining scores. Defaults to TrueFalseScoreAggregator.OR. |
Methods:
retrieve_score¶
retrieve_score(request_prompt: MessagePiece, objective: Optional[str] = None) → list[Score]
Retrieve a score from the human evaluator through the RPC server.
| Parameter | Type | Description |
|---|---|---|
request_prompt | MessagePiece | The message piece to be scored. |
objective | Optional[str] | The objective to evaluate against. Defaults to None. |
Returns:
list[Score] — A list containing a single Score object from the human evaluator.
HumanLabeledDataset¶
A class that represents a human-labeled dataset, including the entries and each of their corresponding human scores. This dataset is used to evaluate PyRIT scorer performance via the ScorerEvaluator class. HumanLabeledDatasets can be constructed from a CSV file.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
name | str | The name of the human-labeled dataset. For datasets of uniform type, this is often the harm category (e.g. hate_speech) or objective. It will be used in the naming of metrics (JSON) and model scores (CSV) files when evaluation is run on this dataset. |
entries | List[HumanLabeledEntry] | A list of entries in the dataset. |
metrics_type | MetricsType | The type of the human-labeled dataset, either HARM or OBJECTIVE. |
version | str | The version of the human-labeled dataset. |
harm_definition | str | Path to the harm definition YAML file for HARM datasets. Defaults to None. |
harm_definition_version | str | Version of the harm definition YAML file. Used to ensure the human labels match the scoring criteria version. Defaults to None. |
Methods:
from_csv¶
from_csv(csv_path: Union[str, Path], metrics_type: MetricsType, dataset_name: Optional[str] = None, version: Optional[str] = None, harm_definition: Optional[str] = None, harm_definition_version: Optional[str] = None) → HumanLabeledDataset
Load a human-labeled dataset from a CSV file with standard column names.
Expected CSV format:
‘assistant_response’: The assistant’s response text
‘human_score’: Human-assigned label (can have multiple columns for multiple raters)
‘objective’: For OBJECTIVE datasets, the objective being evaluated
‘data_type’: Optional data type (defaults to ‘text’ if not present)
You can optionally include a # comment line at the top of the CSV file to specify the dataset version and harm definition path. The format is:
For harm datasets: # dataset_version=x.y, harm_definition=path/to/definition.yaml, harm_definition_version=x.y
For objective datasets: # dataset_version=x.y
| Parameter | Type | Description |
|---|---|---|
csv_path | Union[str, Path] | The path to the CSV file. |
metrics_type | MetricsType | The type of the human-labeled dataset, either HARM or OBJECTIVE. |
dataset_name | Optional[str] | The name of the dataset. If not provided, it will be inferred from the CSV file name. Defaults to None. |
version | Optional[str] | The version of the dataset. If not provided here, it will be inferred from the CSV file if a dataset_version comment line is present. Defaults to None. |
harm_definition | Optional[str] | Path to the harm definition YAML file. If not provided here, it will be inferred from the CSV file if a harm_definition comment is present. Defaults to None. |
harm_definition_version | Optional[str] | Version of the harm definition YAML file. If not provided here, it will be inferred from the CSV file if a harm_definition_version comment is present. Defaults to None. |
Returns:
HumanLabeledDataset — The human-labeled dataset object.
Raises:
FileNotFoundError — If the CSV file does not exist.
ValueError — If version is not provided and not found in the CSV file.
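Parsing the optional `#` metadata comment line described above could be done along these lines (a sketch under the documented `key=value, key=value` format, not the library's actual parser):

```python
def parse_metadata_comment(line: str) -> dict:
    """Parse a leading '#' metadata comment into key/value pairs,
    e.g. '# dataset_version=1.2, harm_definition=defs/hate_speech.yaml'."""
    if not line.lstrip().startswith("#"):
        return {}
    fields = {}
    for part in line.lstrip().lstrip("#").split(","):
        key, _, value = part.strip().partition("=")
        if key and value:
            fields[key.strip()] = value.strip()
    return fields

meta = parse_metadata_comment(
    "# dataset_version=1.2, harm_definition=defs/hate_speech.yaml, harm_definition_version=2.0"
)
print(meta["dataset_version"])  # 1.2
print(meta["harm_definition"])  # defs/hate_speech.yaml
```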
get_harm_definition¶
get_harm_definition() → Optional[HarmDefinition]
Load and return the HarmDefinition object for this dataset.
For HARM datasets, this loads the harm definition YAML file specified in harm_definition and returns it as a HarmDefinition object. The result is cached after the first load.
Returns:
Optional[HarmDefinition] — The loaded harm definition object, or None if this is not a HARM dataset or harm_definition is not set.
Raises:
FileNotFoundError — If the harm definition file does not exist.
ValueError — If the harm definition file is invalid.
validate¶
validate() → None
Validate that the dataset is internally consistent.
Checks that all entries match the dataset’s metrics_type and, for HARM datasets, that all entries have the same harm_category, that harm_definition is specified, and that the harm definition file exists and is loadable.
Raises:
ValueError — If entries don’t match metrics_type, harm categories are inconsistent, or harm_definition is missing for HARM datasets.
FileNotFoundError — If the harm definition file does not exist.
HumanLabeledEntry¶
A class that represents an entry in a dataset of assistant responses that have been scored by humans. It is used to evaluate PyRIT scorer performance as measured by degree of alignment with human labels. This class includes the Messages and a list of human-assigned scores, which are floats between 0.0 and 1.0 inclusive (representing degree of severity) for harm datasets, and booleans for objective datasets.
InsecureCodeScorer¶
Bases: FloatScaleScorer
A scorer that uses an LLM to evaluate code snippets for potential security vulnerabilities. Configuration is loaded from a YAML file for dynamic prompts and instructions.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The target to use for scoring code security. |
system_prompt_path | Optional[Union[str, Path]] | Path to the YAML file containing the system prompt. Defaults to None, which uses the default insecure code scoring prompt. |
validator | Optional[ScorerPromptValidator] | Custom validator for the scorer. Defaults to None. |
LikertScaleEvalFiles¶
Configuration for evaluating a Likert scale scorer on a set of dataset files.
LikertScalePaths¶
Bases: enum.Enum
Enum containing Likert scale configurations including YAML paths and evaluation file mappings.
Each enum value is a tuple of (yaml_path, evaluation_files) where:
yaml_path: Path to the YAML file containing the Likert scale definition
evaluation_files: Optional LikertScaleEvalFiles for scorer evaluation, or None if no dataset exists
MarkdownInjectionScorer¶
Bases: TrueFalseScorer
A scorer that detects markdown injection attempts in text responses.
This scorer checks for the presence of markdown syntax patterns that could be used for injection attacks, such as links, images, or other markdown constructs that might be exploited. Returns True if markdown injection is detected.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
MetricsType¶
Bases: Enum
Enum representing the type of metrics when evaluating scorers on human-labeled datasets.
ObjectiveHumanLabeledEntry¶
Bases: HumanLabeledEntry
A class that represents a human-labeled dataset entry for a specific objective. This class includes the Messages and a list of human scores, which are booleans indicating whether the response/conversation meets the objective (e.g., 0 for not meeting the objective, 1 for meeting the objective). The objective is a string that represents the objective (e.g., "how to make a Molotov cocktail?").
ObjectiveScorerEvaluator¶
Bases: ScorerEvaluator
A class that evaluates an objective scorer against HumanLabeledDatasets of type OBJECTIVE.
ObjectiveScorerMetrics¶
Bases: ScorerMetrics
Metrics for evaluating an objective scorer against a HumanLabeledDataset.
PlagiarismMetric¶
Bases: Enum
Enum representing different plagiarism detection metrics.
PlagiarismScorer¶
Bases: FloatScaleScorer
A scorer that measures plagiarism by computing word-level similarity between the AI response and a reference text.
This scorer implements three similarity metrics:
Word-level longest common subsequence (LCS)
Word-level Levenshtein similarity
Word-level n-gram Jaccard similarity
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
reference_text | str | The reference text to compare against. |
metric | PlagiarismMetric | The plagiarism detection metric to use. Defaults to PlagiarismMetric.LCS. |
n | int | The n-gram size for n-gram similarity. Defaults to 5. |
validator | Optional[ScorerPromptValidator] | Custom validator for the scorer. Defaults to None. |
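The word-level LCS metric listed above can be illustrated with a short dynamic-programming sketch. The normalization choice here (LCS length divided by the reference word count) is an assumption, not necessarily what the scorer uses:

```python
def lcs_similarity(response: str, reference: str) -> float:
    """Word-level longest-common-subsequence similarity in [0, 1]."""
    a, b = response.split(), reference.split()
    # Classic LCS dynamic program over word sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[len(a)][len(b)] / len(b) if b else 0.0

print(lcs_similarity("the quick brown fox", "the quick red fox"))  # 0.75
```

Levenshtein and n-gram Jaccard variants follow the same word-level pattern with a different core comparison.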
PromptShieldScorer¶
Bases: TrueFalseScorer
Returns true if an attack or jailbreak has been detected by Prompt Shield.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
prompt_shield_target | PromptShieldTarget | The Prompt Shield target to use for scoring. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
QuestionAnswerScorer¶
Bases: TrueFalseScorer
A class that represents a question answering scorer.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
correct_answer_matching_patterns | list[str] | A list of patterns to check for in the response. If any pattern is found in the response, the score will be True. These patterns should be format strings that will be formatted with the correct answer metadata. Defaults to CORRECT_ANSWER_MATCHING_PATTERNS. |
category | Optional[list[str]] | Optional list of categories for the score. Defaults to None. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
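The pattern-matching behavior described above amounts to formatting each pattern with the correct answer and checking for it in the response. A sketch with hypothetical patterns and a hypothetical format key (the real CORRECT_ANSWER_MATCHING_PATTERNS and metadata keys may differ):

```python
def question_answer_score(response: str, correct_answer: str, patterns) -> bool:
    # True if any formatted pattern is found in the response.
    return any(p.format(correct_answer=correct_answer) in response for p in patterns)

# Hypothetical patterns for illustration only.
patterns = ["answer is {correct_answer}", "{correct_answer}"]
print(question_answer_score("The answer is 42.", "42", patterns))  # True
print(question_answer_score("I am not sure.", "42", patterns))     # False
```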
RefusalScorerPaths¶
Bases: enum.Enum
Paths to refusal scorer system prompt YAML files.
Each enum value represents a different refusal detection strategy:
DEFAULT: Standard refusal detection that works with or without an explicit objective. If an objective is provided, evaluates refusal against it; if not, evaluates against the implied objective. Safe completions (including partial information, redirections, asking questions, or excessive caveats) are NOT considered refusals.
STRICT: Strict refusal detection that treats “safe completions” as refusals. Works best when an explicit objective is provided.
RegistryUpdateBehavior¶
Bases: Enum
Enum representing how the evaluation registry should be updated.
Scorer¶
Bases: Identifiable, abc.ABC
Abstract base class for scorers.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
validator | ScorerPromptValidator | Validator for message pieces and scorer configuration. |
Methods:
evaluate_async¶
evaluate_async(file_mapping: Optional[ScorerEvalDatasetFiles] = None, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = None, max_concurrency: int = 10) → Optional[ScorerMetrics]
Evaluate this scorer against human-labeled datasets.
Uses file mapping to determine which datasets to evaluate and how to aggregate results.
| Parameter | Type | Description |
|---|---|---|
file_mapping | Optional[ScorerEvalDatasetFiles] | Optional ScorerEvalDatasetFiles configuration. If not provided, uses the scorer’s configured evaluation_file_mapping. Maps input file patterns to an output result file. Defaults to None. |
num_scorer_trials | int | Number of times to score each response (for measuring variance). Defaults to 3. |
update_registry_behavior | RegistryUpdateBehavior | Controls how existing registry entries are handled. SKIP_IF_EXISTS: check the registry for existing results and return cached metrics if found. ALWAYS_UPDATE: always run evaluation and overwrite any existing registry entry. NEVER_UPDATE: always run evaluation but never write to the registry (for debugging). Defaults to None, which is treated as SKIP_IF_EXISTS. |
max_concurrency | int | Maximum number of concurrent scoring requests. Defaults to 10. |
Returns:
Optional[ScorerMetrics]— The evaluation metrics, or None if no datasets found.
Raises:
ValueError— If no file_mapping is provided and no evaluation_file_mapping is configured.
get_identifier¶
get_identifier() → ComponentIdentifier
Get the scorer’s identifier with eval_hash always attached.
Overrides the base Identifiable.get_identifier() so that
to_dict() always emits the eval_hash key.
Returns:
ComponentIdentifier — The identity with eval_hash set.
get_scorer_metrics¶
get_scorer_metrics() → Optional[ScorerMetrics]
Get evaluation metrics for this scorer from the configured evaluation result file.
Looks up metrics by this scorer’s identity hash in the JSONL result file. The result file may contain entries for multiple scorer configurations.
Subclasses must implement this to return the appropriate metrics type:
TrueFalseScorer subclasses should return ObjectiveScorerMetrics
FloatScaleScorer subclasses should return HarmScorerMetrics
Returns:
Optional[ScorerMetrics] — The metrics for this scorer, or None if not found or not configured.
scale_value_float¶
scale_value_float(value: float, min_value: float, max_value: float) → floatScales a value to the range 0 to 1 based on the given min and max values, e.g., 3 stars on a 1-to-5 scale maps to 0.5.
| Parameter | Type | Description |
|---|---|---|
value | float | The value to be scaled. |
min_value | float | The minimum value of the range. |
max_value | float | The maximum value of the range. |
Returns:
float— The scaled value.
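The linear scaling described above can be sketched as follows. This is an illustrative re-implementation based on the documented signature, not the library's source; the handling of a degenerate range (min equal to max) is an assumption.

```python
def scale_value_float(value: float, min_value: float, max_value: float) -> float:
    """Linearly scale `value` from [min_value, max_value] to [0, 1]."""
    if max_value == min_value:
        return 0.0  # assumption: a degenerate range maps to 0
    return (value - min_value) / (max_value - min_value)

print(scale_value_float(3, 1, 5))  # 3 stars on a 1-to-5 scale -> 0.5
```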
score_async¶
score_async(message: Message, objective: Optional[str] = None, role_filter: Optional[ChatMessageRole] = None, skip_on_error_result: bool = False, infer_objective_from_request: bool = False) → list[Score]Score the message, add the results to the database, and return a list of Score objects.
| Parameter | Type | Description |
|---|---|---|
message | Message | The message to be scored. |
objective | Optional[str] | The task or objective based on which the message should be scored. Defaults to None. |
role_filter | Optional[ChatMessageRole] | Only score messages with this exact stored role. Use “assistant” to score only real assistant responses, or “simulated_assistant” to score only simulated responses. Defaults to None (no filtering). |
skip_on_error_result | bool | If True, skip scoring if the message contains an error. Defaults to False. |
infer_objective_from_request | bool | If True, infer the objective from the message’s previous request when objective is not provided. Defaults to False. |
Returns:
list[Score]— A list of Score objects representing the results.
Raises:
PyritException— If scoring raises a PyRIT exception (re-raised with enhanced context).RuntimeError— If scoring raises a non-PyRIT exception (wrapped with scorer context).
score_image_async¶
score_image_async(image_path: str, objective: Optional[str] = None) → list[Score]Score the given image using the chat target.
| Parameter | Type | Description |
|---|---|---|
image_path | str | The path to the image file to be scored. |
objective | Optional[str] | The objective based on which the image should be scored. Defaults to None. |
Returns:
list[Score]— A list of Score objects representing the results.
score_image_batch_async¶
score_image_batch_async(image_paths: Sequence[str], objectives: Optional[Sequence[str]] = None, batch_size: int = 10) → list[Score]Score a batch of images asynchronously.
| Parameter | Type | Description |
|---|---|---|
image_paths | Sequence[str] | Sequence of paths to image files to be scored. |
objectives | Optional[Sequence[str]] | Optional sequence of objectives corresponding to each image. If provided, must match the length of image_paths. Defaults to None. |
batch_size | int | Maximum number of images to score concurrently. Defaults to 10. |
Returns:
list[Score]— A list of Score objects representing the scoring results for all images.
Raises:
ValueError— If the number of objectives does not match the number of image_paths.
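The batching and length-validation behavior documented above can be sketched in plain Python. This is a hypothetical helper illustrating the documented contract (objectives must match image_paths, images are processed in groups of batch_size), not the library's implementation.

```python
from typing import Optional, Sequence

def make_batches(image_paths: Sequence[str],
                 objectives: Optional[Sequence[str]] = None,
                 batch_size: int = 10):
    """Pair each image with its objective and split the pairs into batches."""
    if objectives is not None and len(objectives) != len(image_paths):
        raise ValueError("Number of objectives must match number of image_paths.")
    objs = objectives if objectives is not None else [None] * len(image_paths)
    pairs = list(zip(image_paths, objs))
    # Slice into chunks of at most batch_size items each
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
```

With 25 images and the default batch_size of 10, this yields three batches of sizes 10, 10, and 5.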
score_prompts_batch_async¶
score_prompts_batch_async(messages: Sequence[Message], objectives: Optional[Sequence[str]] = None, batch_size: int = 10, role_filter: Optional[ChatMessageRole] = None, skip_on_error_result: bool = False, infer_objective_from_request: bool = False) → list[Score]Score multiple prompts in batches using the provided objectives.
| Parameter | Type | Description |
|---|---|---|
messages | Sequence[Message] | The messages to be scored. |
objectives | Optional[Sequence[str]] | The objectives/tasks based on which the prompts should be scored. If provided, must have the same length as messages. Defaults to None. |
batch_size | int | The maximum batch size for processing prompts. Defaults to 10. |
role_filter | Optional[ChatMessageRole] | If provided, only score pieces with this role. Defaults to None (no filtering). |
skip_on_error_result | bool | If True, skip scoring pieces that have errors. Defaults to False. |
infer_objective_from_request | bool | If True and objective is empty, attempt to infer the objective from the request. Defaults to False. |
Returns:
list[Score]— A flattened list of Score objects from all scored prompts.
Raises:
ValueError— If objectives is empty or if the number of objectives doesn’t match the number of messages.
score_response_async¶
score_response_async(response: Message, objective_scorer: Optional[Scorer] = None, auxiliary_scorers: Optional[list[Scorer]] = None, role_filter: ChatMessageRole = 'assistant', objective: Optional[str] = None, skip_on_error_result: bool = True) → dict[str, list[Score]]Score a response using an objective scorer and optional auxiliary scorers.
| Parameter | Type | Description |
|---|---|---|
response | Message | Response containing pieces to score. |
objective_scorer | Optional[Scorer] | The main scorer to determine success. Defaults to None. |
auxiliary_scorers | Optional[List[Scorer]] | List of auxiliary scorers to apply. Defaults to None. |
role_filter | ChatMessageRole | Only score pieces with this exact stored role. Defaults to 'assistant' (real responses only, not simulated). |
objective | Optional[str] | Task/objective for scoring context. Defaults to None. |
skip_on_error_result | bool | If True, skip scoring pieces that have errors. Defaults to True. |
Returns:
dict[str, list[Score]]— Dictionary with keys auxiliary_scores and objective_scores containing lists of scores from each type of scorer.
Raises:
ValueError— If response is not provided.
score_response_multiple_scorers_async¶
score_response_multiple_scorers_async(response: Message, scorers: list[Scorer], role_filter: ChatMessageRole = 'assistant', objective: Optional[str] = None, skip_on_error_result: bool = True) → list[Score]Score a response using multiple scorers in parallel.
This method applies each scorer to the first scorable response piece (filtered by role and error), and returns all scores. This is typically used for auxiliary scoring where all results are needed.
| Parameter | Type | Description |
|---|---|---|
response | Message | The response containing pieces to score. |
scorers | List[Scorer] | List of scorers to apply. |
role_filter | ChatMessageRole | Only score pieces with this exact stored role. Defaults to 'assistant' (real responses only, not simulated). |
objective | Optional[str] | Optional objective description for scoring context. Defaults to None. |
skip_on_error_result | bool | If True, skip scoring pieces that have errors. Defaults to True. |
Returns:
list[Score]— All scores from all scorers.
score_text_async¶
score_text_async(text: str, objective: Optional[str] = None) → list[Score]Scores the given text based on the task using the chat target.
| Parameter | Type | Description |
|---|---|---|
text | str | The text to be scored. |
objective | Optional[str] | The task based on which the text should be scored. Defaults to None. |
Returns:
list[Score]— A list of Score objects representing the results.
validate_return_scores¶
validate_return_scores(scores: list[Score]) → NoneValidate the scores returned by the scorer; some scorers require specific Score types or values.
| Parameter | Type | Description |
|---|---|---|
scores | list[Score] | The scores to be validated. |
ScorerEvalDatasetFiles¶
Configuration for evaluating a scorer on a set of dataset files.
Maps input dataset files (via glob patterns) to an output result file. Multiple files matching the patterns will be concatenated before evaluation.
ScorerEvaluator¶
Bases: abc.ABC
A class that evaluates an LLM scorer against HumanLabeledDatasets, calculating appropriate metrics and saving them to a file.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
scorer | Scorer | The scorer to evaluate. |
Methods:
evaluate_dataset_async¶
evaluate_dataset_async(labeled_dataset: HumanLabeledDataset, num_scorer_trials: int = 1, max_concurrency: int = 10) → ScorerMetricsRun the evaluation for the scorer/policy combination on the passed-in HumanLabeledDataset.
This method performs pure computation without side effects (no file writing). It can be called directly with an in-memory HumanLabeledDataset for experiments that don’t use file-based datasets (e.g., iterative rubric tuning with custom splits).
| Parameter | Type | Description |
|---|---|---|
labeled_dataset | HumanLabeledDataset | The HumanLabeledDataset to evaluate the scorer against. |
num_scorer_trials | int | The number of trials to run the scorer on all responses. Defaults to 1. |
max_concurrency | int | Maximum number of concurrent scoring requests. Defaults to 10. |
Returns:
ScorerMetrics— The metrics for the scorer. This will be either HarmScorerMetrics or ObjectiveScorerMetrics depending on the type of the HumanLabeledDataset (HARM or OBJECTIVE).
Raises:
ValueError— If the labeled_dataset is invalid.
from_scorer¶
from_scorer(scorer: Scorer, metrics_type: Optional[MetricsType] = None) → ScorerEvaluatorCreate a ScorerEvaluator based on the type of scoring.
| Parameter | Type | Description |
|---|---|---|
scorer | Scorer | The scorer to evaluate. |
metrics_type | Optional[MetricsType] | The type of scoring, either HARM or OBJECTIVE. If not provided, defaults to OBJECTIVE for true/false scorers and HARM for all other scorers. Defaults to None. |
Returns:
ScorerEvaluator— An instance of HarmScorerEvaluator or ObjectiveScorerEvaluator.
run_evaluation_async¶
run_evaluation_async(dataset_files: ScorerEvalDatasetFiles, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = RegistryUpdateBehavior.SKIP_IF_EXISTS, max_concurrency: int = 10) → Optional[ScorerMetrics]Evaluate scorer using dataset files configuration.
The update_registry_behavior parameter controls how existing registry entries are handled:
SKIP_IF_EXISTS (default): Check registry for existing results matching scorer config, dataset version, and num_scorer_trials. If found, return cached metrics. If not found, run evaluation and write to registry.
ALWAYS_UPDATE: Always run evaluation and overwrite any existing registry entry.
NEVER_UPDATE: Always run evaluation but never write to registry (for debugging).
| Parameter | Type | Description |
|---|---|---|
dataset_files | ScorerEvalDatasetFiles | ScorerEvalDatasetFiles configuration specifying glob patterns for input files and a result file name. |
num_scorer_trials | int | Number of scoring trials per response. Defaults to 3. |
update_registry_behavior | RegistryUpdateBehavior | Controls how existing registry entries are handled. Defaults to RegistryUpdateBehavior.SKIP_IF_EXISTS. |
max_concurrency | int | Maximum number of concurrent scoring requests. Defaults to 10. |
Returns:
Optional[ScorerMetrics]— ScorerMetrics if evaluation completed, None if no files found.
Raises:
ValueError— If harm_category is not specified for harm scorer evaluations.
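The three registry behaviors described above amount to a small decision tree. The sketch below is a hypothetical re-statement of that documented control flow (the enum member names mirror the documentation; `resolve`, `cached_metrics`, `run_evaluation`, and `write_to_registry` are illustrative names, not library API).

```python
from enum import Enum, auto

class RegistryUpdateBehavior(Enum):
    SKIP_IF_EXISTS = auto()   # return cached metrics when a matching entry exists
    ALWAYS_UPDATE = auto()    # always re-evaluate and overwrite the registry entry
    NEVER_UPDATE = auto()     # always re-evaluate but never write (debugging)

def resolve(behavior, cached_metrics, run_evaluation, write_to_registry):
    """Sketch of the documented registry handling for run_evaluation_async."""
    if behavior is RegistryUpdateBehavior.SKIP_IF_EXISTS and cached_metrics is not None:
        return cached_metrics  # cache hit: skip evaluation entirely
    metrics = run_evaluation()
    if behavior is not RegistryUpdateBehavior.NEVER_UPDATE:
        write_to_registry(metrics)  # SKIP_IF_EXISTS (on a miss) and ALWAYS_UPDATE both write
    return metrics
```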
ScorerMetrics¶
Base dataclass for storing scorer evaluation metrics.
This class provides methods for serializing metrics to JSON and loading them from JSON files.
Methods:
from_json¶
from_json(file_path: Union[str, Path]) → TLoad the metrics from a JSON file.
| Parameter | Type | Description |
|---|---|---|
file_path | Union[str, Path] | The path to the JSON file. |
Returns:
T— An instance of ScorerMetrics with the loaded data.
Raises:
FileNotFoundError— If the specified file does not exist.
to_json¶
to_json() → strConvert the metrics to a JSON string.
Returns:
str— The JSON string representation of the metrics.
ScorerMetricsWithIdentity¶
Bases: Generic[M]
Wrapper that combines scorer metrics with the scorer’s identity information.
This class provides a clean interface for working with evaluation results, allowing access to both the scorer configuration and its performance metrics.
Generic over the metrics type M, so:
ScorerMetricsWithIdentity[ObjectiveScorerMetrics] has metrics: ObjectiveScorerMetrics
ScorerMetricsWithIdentity[HarmScorerMetrics] has metrics: HarmScorerMetrics
ScorerPrinter¶
Bases: ABC
Abstract base class for printing scorer information.
This interface defines the contract for printing scorer details including type information, nested sub-scorers, and evaluation metrics from the registry. Implementations can render output to console, logs, files, or other outputs.
Methods:
print_harm_scorer¶
print_harm_scorer(scorer_identifier: ComponentIdentifier, harm_category: str) → NonePrint harm scorer information including type, nested scorers, and evaluation metrics.
This method displays:
Scorer type and identity information
Nested sub-scorers (for composite scorers)
Harm evaluation metrics (MAE, Krippendorff alpha) from the registry
| Parameter | Type | Description |
|---|---|---|
scorer_identifier | ComponentIdentifier | The scorer identifier to print information for. |
harm_category | str | The harm category for looking up metrics (e.g., “hate_speech”, “violence”). |
print_objective_scorer¶
print_objective_scorer(scorer_identifier: ComponentIdentifier) → NonePrint objective scorer information including type, nested scorers, and evaluation metrics.
This method displays:
Scorer type and identity information
Nested sub-scorers (for composite scorers)
Objective evaluation metrics (accuracy, precision, recall, F1) from the registry
| Parameter | Type | Description |
|---|---|---|
scorer_identifier | ComponentIdentifier | The scorer identifier to print information for. |
ScorerPromptValidator¶
Validates message pieces and scorer configurations.
This class provides validation for scorer inputs, ensuring that message pieces meet required criteria such as data types, roles, and metadata requirements.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
supported_data_types | Optional[Sequence[PromptDataType]] | Data types that the scorer supports. Defaults to None, meaning all data types are supported. |
required_metadata | Optional[Sequence[str]] | Metadata keys that must be present in message pieces. Defaults to None, meaning no metadata is required. |
supported_roles | Optional[Sequence[ChatMessageRole]] | Message roles that the scorer supports. Defaults to None, meaning all roles are supported. |
max_pieces_in_response | Optional[int] | Maximum number of pieces allowed in a response. Defaults to None (no limit). |
max_text_length | Optional[int] | Maximum character length for text data type pieces. Defaults to None (no limit). |
enforce_all_pieces_valid | Optional[bool] | Whether all pieces must be valid, or just at least one. Defaults to False. |
raise_on_no_valid_pieces | Optional[bool] | Whether to raise ValueError when no pieces are valid. Defaults to False, allowing scorers to handle empty results gracefully (e.g., returning False for blocked responses); set to True to raise an exception instead. |
is_objective_required | bool | Whether an objective must be provided for scoring. Defaults to False. |
Methods:
is_message_piece_supported¶
is_message_piece_supported(message_piece: MessagePiece) → boolCheck if a message piece is supported by this validator.
| Parameter | Type | Description |
|---|---|---|
message_piece | MessagePiece | The message piece to check. |
Returns:
bool— True if the message piece meets all validation criteria, False otherwise.
validate¶
validate(message: Message, objective: str | None) → NoneValidate a message and objective against configured requirements.
| Parameter | Type | Description |
|---|---|---|
message | Message | The message to validate. |
objective | Optional[str] | The objective to validate, if one is required by the validator. |
Raises:
ValueError— If validation fails due to unsupported pieces, exceeding max pieces, or missing objective.
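The per-piece criteria listed in the constructor table can be sketched as a single predicate. This is an illustrative stand-in that checks plain dicts rather than MessagePiece objects; the field names ("data_type", "role", "metadata", "value") are assumptions for the sketch.

```python
def is_piece_supported(piece: dict,
                       supported_data_types=None,
                       required_metadata=(),
                       supported_roles=None,
                       max_text_length=None) -> bool:
    """Sketch of the documented validation criteria for a single message piece."""
    if supported_data_types is not None and piece["data_type"] not in supported_data_types:
        return False  # unsupported data type
    if supported_roles is not None and piece["role"] not in supported_roles:
        return False  # unsupported role
    if any(key not in piece.get("metadata", {}) for key in required_metadata):
        return False  # missing a required metadata key
    if (max_text_length is not None and piece["data_type"] == "text"
            and len(piece.get("value", "")) > max_text_length):
        return False  # text too long
    return True
```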
SelfAskCategoryScorer¶
Bases: TrueFalseScorer
A class that represents a self-ask score for text classification and scoring. Given a classifier file, it scores according to these categories and returns the category the MessagePiece fits best.
There is also a false category that is used if the MessagePiece does not fit any of the categories.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The chat target to interact with. |
content_classifier_path | Union[str, Path] | The path to the classifier YAML file. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
SelfAskGeneralFloatScaleScorer¶
Bases: FloatScaleScorer
A general-purpose self-ask float-scale scorer that uses a chat target and a configurable system prompt and prompt format. The final score is normalized to [0, 1].
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The chat target used to score. |
system_prompt_format_string | str | System prompt template with placeholders for objective, prompt, and message_piece. |
prompt_format_string | Optional[str] | User prompt template with the same placeholders. Defaults to None. |
category | Optional[str] | Category for the score. Defaults to None. |
min_value | int | Minimum of the model’s native scale. Defaults to 0. |
max_value | int | Maximum of the model’s native scale. Defaults to 100. |
validator | Optional[ScorerPromptValidator] | Custom validator. If omitted, a default validator will be used requiring text input and an objective. Defaults to None. |
score_value_output_key | str | JSON key for the score value. Defaults to 'score_value'. |
rationale_output_key | str | JSON key for the rationale. Defaults to 'rationale'. |
description_output_key | str | JSON key for the description. Defaults to 'description'. |
metadata_output_key | str | JSON key for the metadata. Defaults to 'metadata'. |
category_output_key | str | JSON key for the category. Defaults to 'category'. |
SelfAskGeneralTrueFalseScorer¶
Bases: TrueFalseScorer
A general-purpose self-ask True/False scorer that uses a chat target and a configurable system prompt and prompt format.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The chat target used to score. |
system_prompt_format_string | str | System prompt template with placeholders for objective, task (alias of objective), prompt, and message_piece. |
prompt_format_string | Optional[str] | User prompt template with the same placeholders. Defaults to None. |
category | Optional[str] | Category for the score. Defaults to None. |
validator | Optional[ScorerPromptValidator] | Custom validator. If omitted, a default validator will be used requiring text input and an objective. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | Aggregator for combining scores. Defaults to TrueFalseScoreAggregator.OR. |
score_value_output_key | str | JSON key for the score value. Defaults to 'score_value'. |
rationale_output_key | str | JSON key for the rationale. Defaults to 'rationale'. |
description_output_key | str | JSON key for the description. Defaults to 'description'. |
metadata_output_key | str | JSON key for the metadata. Defaults to 'metadata'. |
category_output_key | str | JSON key for the category. Defaults to 'category'. |
SelfAskLikertScorer¶
Bases: FloatScaleScorer
A class that represents a “self-ask” score for text scoring based on a Likert scale. A Likert scale consists of ranked, ordered categories and is often on a 5 or 7 point basis, but you can configure a scale with any set of non-negative integer score values and descriptions by providing a custom YAML file.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The chat target to use for scoring. |
likert_scale | LikertScalePaths | The Likert scale configuration to use for scoring. |
validator | Optional[ScorerPromptValidator] | Custom validator for the scorer. Defaults to None. |
SelfAskQuestionAnswerScorer¶
Bases: SelfAskTrueFalseScorer
A class that represents a self-ask question answering scorer.
Usually, QuestionAnswerScorer should be used, but this scorer is useful when choices are not sent to the objective target or when you need more flexibility in determining whether the questions were answered correctly.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The chat target to use for the scorer. |
true_false_question_path | Optional[pathlib.Path] | The path to the true/false question file. Defaults to None, which uses the default question_answering.yaml file. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
SelfAskRefusalScorer¶
Bases: TrueFalseScorer
A self-ask scorer that detects refusal in AI responses.
This scorer uses a language model to determine whether a response contains a refusal to answer or comply with the given prompt. It’s useful for evaluating whether AI systems are appropriately refusing harmful requests.
The scorer supports two modes via RefusalScorerPaths:
DEFAULT: Works with or without an explicit objective. Safe completions are NOT considered refusals.
STRICT: Treats safe completions as refusals. Works best with an explicit objective.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The endpoint that will be used to score the prompt. |
refusal_system_prompt_path | Union[RefusalScorerPaths, Path, str] | The path to the system prompt to use for refusal detection. Can be a RefusalScorerPaths enum value, a Path, or a string path. Defaults to RefusalScorerPaths.DEFAULT. |
prompt_format_string | Optional[str] | The format string for the prompt, with placeholders {objective} for the conversation objective and {response} for the response to evaluate. Defaults to None, which uses “conversation_objective: {objective}\nresponse_to_evaluate_input: {response}”. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
SelfAskScaleScorer¶
Bases: FloatScaleScorer
A class that represents a “self-ask” score for text scoring for a customizable numeric scale.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The chat target to use for scoring. |
scale_arguments_path | Optional[Union[Path, str]] | Path to the YAML file containing scale definitions. Defaults to TREE_OF_ATTACKS_SCALE if not provided. Defaults to None. |
system_prompt_path | Optional[Union[Path, str]] | Path to the YAML file containing the system prompt. Defaults to GENERAL_SYSTEM_PROMPT if not provided. Defaults to None. |
validator | Optional[ScorerPromptValidator] | Custom validator for the scorer. Defaults to None. |
SelfAskTrueFalseScorer¶
Bases: TrueFalseScorer
A class that represents a self-ask true/false for scoring.
Given written descriptions of “true” and “false” (passed as a file or a TrueFalseQuestion), it returns the value that matches either description most closely.
If no descriptions are provided, it defaults to the TASK_ACHIEVED scorer.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
chat_target | PromptChatTarget | The chat target to interact with. |
true_false_question_path | Optional[Union[str, Path]] | The path to the true/false question file. Defaults to None. |
true_false_question | Optional[TrueFalseQuestion] | The true/false question object. Defaults to None. |
true_false_system_prompt_path | Optional[Union[str, Path]] | The path to the system prompt file. Defaults to None. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
SubStringScorer¶
Bases: TrueFalseScorer
Scorer that checks if a given substring is present in the text.
This scorer performs substring matching using a configurable text matching strategy. Supports both exact substring matching and approximate matching.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
substring | str | The substring to search for in the text. |
text_matcher | Optional[TextMatching] | The text matching strategy to use. Defaults to None, which uses ExactTextMatching with case_sensitive=False. |
categories | Optional[list[str]] | Optional list of categories for the score. Defaults to None. |
aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. |
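The default matching behavior described for SubStringScorer (exact substring matching, case-insensitive) can be sketched as a one-liner. This is an illustrative predicate, not the library's TextMatching implementation.

```python
def substring_matches(text: str, substring: str, case_sensitive: bool = False) -> bool:
    """Sketch of exact substring matching; case-insensitive by default, as documented."""
    if not case_sensitive:
        return substring.lower() in text.lower()
    return substring in text
```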
TrueFalseAggregatorFunc¶
TrueFalseCompositeScorer¶
Bases: TrueFalseScorer
Composite true/false scorer that aggregates results from other true/false scorers.
This scorer invokes a collection of constituent TrueFalseScorer instances and
reduces their single-score outputs into one final true/false score using the supplied
aggregation function (e.g., TrueFalseScoreAggregator.AND, TrueFalseScoreAggregator.OR,
TrueFalseScoreAggregator.MAJORITY).
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
aggregator | TrueFalseAggregatorFunc | Aggregation function to combine child scores (e.g., TrueFalseScoreAggregator.AND, TrueFalseScoreAggregator.OR, TrueFalseScoreAggregator.MAJORITY). |
scorers | List[TrueFalseScorer] | The constituent true/false scorers to invoke. |
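The three named aggregation modes reduce a list of child true/false results to one boolean. A minimal sketch of that reduction, using plain booleans and a string mode selector rather than the library's aggregator objects:

```python
def aggregate(values: list[bool], mode: str) -> bool:
    """Sketch of AND/OR/MAJORITY aggregation over child true/false scores."""
    if mode == "AND":
        return all(values)          # every child scorer must return True
    if mode == "OR":
        return any(values)          # any single True suffices
    if mode == "MAJORITY":
        return sum(values) > len(values) / 2  # strict majority of True votes
    raise ValueError(f"Unknown mode: {mode}")
```

Note that with this definition an even split (e.g., one True and one False) is not a majority, so MAJORITY returns False.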
TrueFalseInverterScorer¶
Bases: TrueFalseScorer
A scorer that inverts a true/false score.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
scorer | TrueFalseScorer | The underlying true/false scorer whose results will be inverted. |
validator | Optional[ScorerPromptValidator] | Custom validator. Defaults to None. Note: this parameter is present for signature compatibility but is not used. |
TrueFalseQuestion¶
A class that represents a true/false question.
This is sent to an LLM and can be used as an alternative to a yaml file from TrueFalseQuestionPaths.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
true_description | str | Description of what constitutes a “true” response. |
false_description | str | Description of what constitutes a “false” response. Defaults to an empty string, in which case a generic description is used. |
category | str | The category of the question. Defaults to an empty string. |
metadata | str | Additional metadata for context. Defaults to an empty string. |
TrueFalseQuestionPaths¶
Bases: enum.Enum
Paths to true/false question YAML files.
TrueFalseScoreAggregator¶
Namespace for true/false score aggregators that return a single aggregated score.
All aggregators return a list containing one ScoreAggregatorResult that combines all input scores together, preserving all categories.
TrueFalseScorer¶
Bases: Scorer
Base class for scorers that return true/false binary scores.
This scorer evaluates prompt responses and returns a single boolean score indicating whether the response meets a specific criterion. Multiple pieces in a request response are aggregated using a TrueFalseAggregatorFunc function (default: TrueFalseScoreAggregator.OR).
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
validator | ScorerPromptValidator | Custom validator. |
score_aggregator | TrueFalseAggregatorFunc | The aggregator function to use. Defaults to TrueFalseScoreAggregator.OR. |
Methods:
get_scorer_metrics¶
get_scorer_metrics() → Optional[ObjectiveScorerMetrics]Get evaluation metrics for this scorer from the configured evaluation result file.
Returns:
Optional[ObjectiveScorerMetrics]— The metrics for this scorer, or None if not found or not configured.
validate_return_scores¶
validate_return_scores(scores: list[Score]) → NoneValidate the scores returned by the scorer.
| Parameter | Type | Description |
|---|---|---|
scores | list[Score] | The scores to be validated. |
Raises:
ValueError— If the number of scores is not exactly one.ValueError— If the score value is not “true” or “false”.
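The two documented checks (exactly one score, value of "true" or "false") can be sketched directly. For illustration, scores are plain strings here rather than Score objects with a score_value attribute.

```python
def validate_true_false_scores(scores: list[str]) -> None:
    """Sketch of TrueFalseScorer.validate_return_scores as documented."""
    if len(scores) != 1:
        raise ValueError("TrueFalseScorer must return exactly one score.")
    if scores[0] not in ("true", "false"):
        raise ValueError("Score value must be 'true' or 'false'.")
```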
VideoFloatScaleScorer¶
Bases: FloatScaleScorer, _BaseVideoScorer
A scorer that processes videos by extracting frames and scoring them using a float scale image scorer.
The VideoFloatScaleScorer breaks down a video into frames and uses a float scale scoring mechanism. Frame scores are aggregated using a FloatScaleAggregatorFunc.
By default, uses FloatScaleScorerByCategory.MAX which groups scores by category (useful for scorers like AzureContentFilterScorer that return multiple scores per frame). This returns one aggregated score per category (e.g., one for “Hate”, one for “Violence”, etc.).
For scorers that return a single score per frame, or to combine all categories together, use FloatScaleScoreAggregator.MAX, FloatScaleScorerAllCategories.MAX, etc.
Optionally, an audio_scorer can be provided to also score the video’s audio track. When provided, the audio is extracted, transcribed, and scored. The audio scores are included in the aggregation.
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
image_capable_scorer | FloatScaleScorer | A FloatScaleScorer capable of processing images. |
audio_scorer | Optional[FloatScaleScorer] | Optional FloatScaleScorer for scoring the video’s audio track. When provided, audio is extracted from the video, transcribed to text, and scored. The audio scores are aggregated with frame scores. Defaults to None. |
num_sampled_frames | Optional[int] | Number of frames to extract from the video for scoring. Defaults to None, which samples 5 frames. |
validator | Optional[ScorerPromptValidator] | Validator for the scorer. Defaults to None, which uses the video_path data type validator. |
score_aggregator | FloatScaleAggregatorFunc | Aggregator for combining frame scores. Use FloatScaleScorerByCategory.MAX/AVERAGE/MIN for scorers that return multiple scores per frame (groups by category and returns one score per category); FloatScaleScorerAllCategories.MAX/AVERAGE/MIN to combine all scores regardless of category (returns a single score with all categories combined); or FloatScaleScoreAggregator.MAX/AVERAGE/MIN for simple aggregation preserving all categories (returns a single score with all categories preserved). Defaults to FloatScaleScorerByCategory.MAX. |
image_objective_template | Optional[str] | Template for formatting the objective when scoring image frames. Use {objective} as a placeholder for the actual objective; set to None to not pass an objective to the image scorer. Defaults to _BaseVideoScorer._DEFAULT_IMAGE_OBJECTIVE_TEMPLATE, which provides context about the video frame. |
audio_objective_template | Optional[str] | Template for formatting the objective when scoring audio. Use {objective} as a placeholder for the actual objective; set to None to not pass an objective to the audio scorer. Defaults to None because video objectives typically describe visual content that doesn’t apply to audio. |
VideoTrueFalseScorer¶
Bases: TrueFalseScorer, _BaseVideoScorer
A scorer that processes videos by extracting frames and scoring them using a true/false image scorer.
Aggregation logic (hard-coded):
Frame scores are aggregated using OR: if ANY frame meets the objective, the visual score is True.
When audio_scorer is provided, the final score uses AND: BOTH the visual (frame) score AND the audio score must be True for the overall video score to be True.
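The hard-coded aggregation amounts to OR over frame scores, then AND with the audio score when one exists. A minimal sketch over plain booleans (the function name and signature are illustrative, not library API):

```python
from typing import Optional

def video_true_false(frame_scores: list[bool], audio_score: Optional[bool] = None) -> bool:
    """Sketch of VideoTrueFalseScorer aggregation: OR over frames, AND with audio if present."""
    visual = any(frame_scores)      # True if ANY sampled frame meets the objective
    if audio_score is None:
        return visual               # no audio scorer: frames alone decide
    return visual and audio_score   # both frames AND audio must be True
```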
Constructor Parameters:
| Parameter | Type | Description |
|---|---|---|
image_capable_scorer | TrueFalseScorer | A TrueFalseScorer capable of processing images. |
audio_scorer | Optional[TrueFalseScorer] | Optional TrueFalseScorer for scoring the video’s audio track. When provided, audio is extracted from the video and scored, and the final score requires BOTH video frames AND audio to be True. Defaults to None. |
num_sampled_frames | Optional[int] | Number of frames to extract from the video for scoring. Defaults to None, which samples 5 frames. |
validator | Optional[ScorerPromptValidator] | Validator for the scorer. Defaults to None, which uses the video_path data type validator. |
image_objective_template | Optional[str] | Template for formatting the objective when scoring image frames. Use {objective} as a placeholder for the actual objective; set to None to not pass an objective to the image scorer. Defaults to _BaseVideoScorer._DEFAULT_IMAGE_OBJECTIVE_TEMPLATE, which provides context about the video frame. |
audio_objective_template | Optional[str] | Template for formatting the objective when scoring audio. Use {objective} as a placeholder for the actual objective; set to None to not pass an objective to the audio scorer. Defaults to None because video objectives typically describe visual content that doesn’t apply to audio. |