pyrit.score.ScorerEvaluator
- class ScorerEvaluator(scorer: Scorer)
Bases: ABC
A class that evaluates an LLM scorer against HumanLabeledDatasets, calculating appropriate metrics and saving them to a file.
- __init__(scorer: Scorer)
Initialize the ScorerEvaluator with a scorer.
- Parameters:
scorer (Scorer) – The scorer to evaluate.
Methods
__init__(scorer) – Initialize the ScorerEvaluator with a scorer.
from_scorer(scorer[, metrics_type]) – Create a ScorerEvaluator based on the type of scoring.
run_evaluation_async(*, dataset_files[, ...]) – Evaluate scorer using dataset files configuration.
Attributes
- expected_metrics_type: MetricsType
- classmethod from_scorer(scorer: Scorer, metrics_type: MetricsType | None = None) → ScorerEvaluator
Create a ScorerEvaluator based on the type of scoring.
- Parameters:
scorer (Scorer) – The scorer to evaluate.
metrics_type (MetricsType) – The type of scoring, either HARM or OBJECTIVE. If not provided, it will default to OBJECTIVE for true/false scorers and HARM for all other scorers.
- Returns:
An instance of HarmScorerEvaluator or ObjectiveScorerEvaluator.
- Return type:
ScorerEvaluator
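The defaulting rule for metrics_type described above can be illustrated with a small standalone sketch. This is not PyRIT's implementation: the MetricsType enum below is a stand-in, and resolve_metrics_type and its scorer_type string label are hypothetical names introduced only for illustration.

```python
from enum import Enum


class MetricsType(Enum):
    """Stand-in for the MetricsType referenced above; values are illustrative."""
    HARM = "harm"
    OBJECTIVE = "objective"


def resolve_metrics_type(scorer_type, metrics_type=None):
    """Apply the documented default when metrics_type is omitted.

    scorer_type is a hypothetical string label: "true_false" or anything else.
    """
    if metrics_type is not None:
        # An explicit choice always wins over the default.
        return metrics_type
    if scorer_type == "true_false":
        return MetricsType.OBJECTIVE
    return MetricsType.HARM
```

Under these assumptions, a true/false scorer resolves to OBJECTIVE, any other scorer resolves to HARM, and an explicitly passed metrics_type is returned unchanged.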
- async run_evaluation_async(*, dataset_files: ScorerEvalDatasetFiles, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = RegistryUpdateBehavior.SKIP_IF_EXISTS, max_concurrency: int = 10) → ScorerMetrics | None
Evaluate scorer using dataset files configuration.
The update_registry_behavior parameter controls how existing registry entries are handled:
- SKIP_IF_EXISTS (default): Check registry for existing results matching scorer config, dataset version, and num_scorer_trials. If found, return cached metrics. If not found, run evaluation and write to registry.
- ALWAYS_UPDATE: Always run evaluation and overwrite any existing registry entry.
- NEVER_UPDATE: Always run evaluation but never write to registry (for debugging).
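The three behaviors listed above can be sketched with an in-memory dictionary standing in for the registry. This is an illustrative model only, not PyRIT's code: the enum is a stand-in with assumed values, and evaluate_with_registry, registry, key, and run_evaluation are hypothetical names.

```python
from enum import Enum


class RegistryUpdateBehavior(Enum):
    # Stand-in enum mirroring the three documented members; values are assumed.
    SKIP_IF_EXISTS = "skip_if_exists"
    ALWAYS_UPDATE = "always_update"
    NEVER_UPDATE = "never_update"


def evaluate_with_registry(
    registry,
    key,
    run_evaluation,
    behavior=RegistryUpdateBehavior.SKIP_IF_EXISTS,
):
    """Return metrics for key, reading and writing registry per behavior.

    registry: dict acting as a hypothetical in-memory registry.
    key: hashable value, e.g. (scorer_config, dataset_version, num_scorer_trials).
    run_evaluation: zero-argument callable that computes fresh metrics.
    """
    if behavior is RegistryUpdateBehavior.SKIP_IF_EXISTS and key in registry:
        return registry[key]  # cached entry found, so no evaluation is run

    metrics = run_evaluation()

    if behavior is not RegistryUpdateBehavior.NEVER_UPDATE:
        # Written on a SKIP_IF_EXISTS miss and always for ALWAYS_UPDATE.
        registry[key] = metrics
    return metrics
```

In this model, only SKIP_IF_EXISTS can avoid running the evaluation, and only NEVER_UPDATE leaves the registry untouched.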
- Parameters:
dataset_files – ScorerEvalDatasetFiles configuration specifying glob patterns for input files and a result file name.
num_scorer_trials – Number of scoring trials per response. Defaults to 3.
update_registry_behavior – Controls how existing registry entries are handled. Defaults to RegistryUpdateBehavior.SKIP_IF_EXISTS.
max_concurrency – Maximum number of concurrent scoring requests. Defaults to 10.
- Returns:
ScorerMetrics if evaluation completed, None if no files found.
- Raises:
ValueError – If harm_category is not specified for harm scorer evaluations.
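To make num_scorer_trials and max_concurrency concrete, the following sketch shows one common way to run repeated scoring trials under a concurrency cap using asyncio.Semaphore. It is an assumption about how such a limit can be implemented, not a description of PyRIT's internals. score_all and score_one are hypothetical names.

```python
import asyncio


async def score_all(responses, score_one, num_scorer_trials=3, max_concurrency=10):
    """Score each response num_scorer_trials times with bounded concurrency.

    score_one is a hypothetical async callable that takes one response and
    returns a score. The result is one list of trial scores per response,
    in input order.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(response):
        async with semaphore:  # at most max_concurrency calls in flight
            return await score_one(response)

    # One task per (response, trial) pair; gather preserves submission order.
    tasks = [bounded(r) for r in responses for _ in range(num_scorer_trials)]
    flat = await asyncio.gather(*tasks)

    # Regroup the flat result list into per-response chunks.
    return [
        flat[i:i + num_scorer_trials]
        for i in range(0, len(flat), num_scorer_trials)
    ]
```

With the defaults shown in the signature above, each response would be scored three times, with no more than ten scoring calls running at once.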