pyrit.score.ScorerEvaluator

class ScorerEvaluator(scorer: Scorer)

Bases: ABC

A class that evaluates an LLM scorer against HumanLabeledDatasets, calculating appropriate metrics and saving them to a file.

__init__(scorer: Scorer)

Initialize the ScorerEvaluator with a scorer.

Parameters:

scorer (Scorer) – The scorer to evaluate.

Methods

__init__(scorer) – Initialize the ScorerEvaluator with a scorer.

from_scorer(scorer[, metrics_type]) – Create a ScorerEvaluator based on the type of scoring.

run_evaluation_async(*, dataset_files[, ...]) – Evaluate scorer using dataset files configuration.

Attributes

expected_metrics_type: MetricsType

classmethod from_scorer(scorer: Scorer, metrics_type: MetricsType | None = None) → ScorerEvaluator

Create a ScorerEvaluator based on the type of scoring.

Parameters:
  • scorer (Scorer) – The scorer to evaluate.

  • metrics_type (MetricsType | None) – The type of scoring, either HARM or OBJECTIVE. If omitted, it defaults to OBJECTIVE for true/false scorers and HARM for all other scorers.

Returns:

An instance of HarmScorerEvaluator or ObjectiveScorerEvaluator.

Return type:

ScorerEvaluator
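
The default rule described for metrics_type can be illustrated with a small stand-alone sketch. It does not import PyRIT: MetricsType is redefined locally, and SketchEvaluator, SketchHarmEvaluator, SketchObjectiveEvaluator, make_evaluator, and the is_true_false flag are hypothetical names introduced only for illustration. How the real from_scorer detects a true/false scorer is not shown on this page.

```python
from enum import Enum
from typing import Optional


class MetricsType(Enum):
    # Local stand-in for the MetricsType enum named in the signature above.
    HARM = "harm"
    OBJECTIVE = "objective"


class SketchEvaluator:
    """Hypothetical base class; not PyRIT's ScorerEvaluator."""

    def __init__(self, scorer: object) -> None:
        self.scorer = scorer


class SketchHarmEvaluator(SketchEvaluator):
    expected_metrics_type = MetricsType.HARM


class SketchObjectiveEvaluator(SketchEvaluator):
    expected_metrics_type = MetricsType.OBJECTIVE


def make_evaluator(
    scorer: object,
    is_true_false: bool,
    metrics_type: Optional[MetricsType] = None,
) -> SketchEvaluator:
    # With no explicit metrics type, apply the documented default:
    # OBJECTIVE for true/false scorers, HARM for all other scorers.
    if metrics_type is None:
        metrics_type = MetricsType.OBJECTIVE if is_true_false else MetricsType.HARM
    if metrics_type is MetricsType.OBJECTIVE:
        return SketchObjectiveEvaluator(scorer)
    return SketchHarmEvaluator(scorer)
```

In this sketch an explicit metrics_type always wins over the default, so a true/false scorer can still be paired with the harm evaluator if that is passed in.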

async run_evaluation_async(*, dataset_files: ScorerEvalDatasetFiles, num_scorer_trials: int = 3, update_registry_behavior: RegistryUpdateBehavior = RegistryUpdateBehavior.SKIP_IF_EXISTS, max_concurrency: int = 10) → ScorerMetrics | None

Evaluate the scorer using a dataset files configuration.

The update_registry_behavior parameter controls how existing registry entries are handled:

  • SKIP_IF_EXISTS (default): Check registry for existing results matching scorer config, dataset version, and num_scorer_trials. If found, return cached metrics. If not found, run evaluation and write to registry.

  • ALWAYS_UPDATE: Always run evaluation and overwrite any existing registry entry.

  • NEVER_UPDATE: Always run evaluation but never write to registry (for debugging).
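
The three behaviors listed above can be sketched with an in-memory dictionary standing in for the registry. This is an illustration under stated assumptions, not PyRIT's implementation: UpdateBehavior mirrors the documented member names but is defined locally, evaluate_with_registry is a hypothetical helper, and the key is assumed to be a tuple of scorer config, dataset version, and num_scorer_trials.

```python
from enum import Enum
from typing import Callable, Dict, Hashable


class UpdateBehavior(Enum):
    # Local mirror of the documented RegistryUpdateBehavior member names.
    SKIP_IF_EXISTS = "skip_if_exists"
    ALWAYS_UPDATE = "always_update"
    NEVER_UPDATE = "never_update"


def evaluate_with_registry(
    registry: Dict[Hashable, float],
    key: Hashable,
    run_evaluation: Callable[[], float],
    behavior: UpdateBehavior = UpdateBehavior.SKIP_IF_EXISTS,
) -> float:
    # SKIP_IF_EXISTS: return the cached entry without running anything.
    if behavior is UpdateBehavior.SKIP_IF_EXISTS and key in registry:
        return registry[key]

    # Every other path runs the evaluation.
    metrics = run_evaluation()

    # NEVER_UPDATE leaves the registry untouched; the other two write to it.
    if behavior is not UpdateBehavior.NEVER_UPDATE:
        registry[key] = metrics
    return metrics
```

Note that NEVER_UPDATE in this sketch also ignores any cached entry, matching the "always run evaluation" wording in the list above.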

Parameters:
  • dataset_files (ScorerEvalDatasetFiles) – Configuration specifying glob patterns for input files and a result file name.

  • num_scorer_trials (int) – Number of scoring trials per response. Defaults to 3.

  • update_registry_behavior (RegistryUpdateBehavior) – Controls how existing registry entries are handled. Defaults to RegistryUpdateBehavior.SKIP_IF_EXISTS.

  • max_concurrency (int) – Maximum number of concurrent scoring requests. Defaults to 10.

Returns:

ScorerMetrics if the evaluation completed; None if no files were found.

Raises:

ValueError – If harm_category is not specified for harm scorer evaluations.
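
A cap such as max_concurrency is commonly enforced in asyncio code with a semaphore. The sketch below shows that general pattern only; this page does not say how PyRIT limits concurrency internally, and fake_score and score_with_limit are hypothetical names with a sleep standing in for a real scoring request.

```python
import asyncio
from typing import List, Tuple


async def fake_score(response: str) -> int:
    # Placeholder for a real scoring request (assumption for illustration).
    await asyncio.sleep(0.01)
    return len(response)


async def score_with_limit(
    responses: List[str], max_concurrency: int = 10
) -> Tuple[List[int], int]:
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0  # requests currently holding the semaphore
    peak = 0       # highest number of simultaneous requests observed

    async def bounded(response: str) -> int:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_score(response)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the inputs.
    scores = await asyncio.gather(*(bounded(r) for r in responses))
    return list(scores), peak
```

Calling asyncio.run(score_with_limit(responses, max_concurrency=2)) returns the scores together with the observed peak, which never exceeds the limit passed in.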