Azure Evaluation SDK

To thoroughly assess the performance of your generative AI application when applied to a substantial dataset, you can evaluate a Generative AI application in your development environment with the Azure AI evaluation SDK.

Category Evaluator class
Performance and quality (AI-assisted) GroundednessEvaluator, GroundednessProEvaluator, RetrievalEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator
Performance and quality (NLP) F1ScoreEvaluator, RougeScoreEvaluator, GleuScoreEvaluator, BleuScoreEvaluator, MeteorScoreEvaluator
Risk and safety (AI-assisted) ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, IndirectAttackEvaluator, ProtectedMaterialEvaluator
Composite QAEvaluator, ContentSafetyEvaluator

Performance and Quality Evaluators

AI-assisted Evaluators:

GroundednessEvaluator: Assesses the accuracy of responses based on provided context. GroundednessProEvaluator: Similar to GroundednessEvaluator but uses Azure AI Content Safety. RetrievalEvaluator: Evaluates the effectiveness of retrieval-augmented generation. RelevanceEvaluator: Measures how relevant the response is to the query. CoherenceEvaluator: Checks the logical flow and consistency of the response. FluencyEvaluator: Evaluates the grammatical correctness and naturalness of the response. SimilarityEvaluator: Measures the similarity between generated responses and ground truth.

NLP Evaluators:

F1ScoreEvaluator: Calculates the F1 score for precision and recall. RougeScoreEvaluator: Measures overlap with reference texts using ROUGE metrics. GleuScoreEvaluator: Evaluates using GLEU score. BleuScoreEvaluator: Uses BLEU score for evaluating machine translations. MeteorScoreEvaluator: Measures using METEOR score.

Risk and Safety Evaluators

AI-assisted Evaluators:

ViolenceEvaluator: Detects violent content. SexualEvaluator: Identifies sexually explicit content. SelfHarmEvaluator: Detects content related to self-harm. HateUnfairnessEvaluator: Identifies hate speech and unfair content. IndirectAttackEvaluator: Evaluates vulnerability to indirect attack jailbreaks. ProtectedMaterialEvaluator: Detects protected material.

Composite Evaluators:

QAEvaluator: Combines multiple quality evaluators for comprehensive assessment. ContentSafetyEvaluator: Combines multiple safety evaluators for overall safety assessment.


Distributed by an MIT license. This hands-on lab was developed by Microsoft AI GBB (Global Black Belt).