Skip to content

Evaluation

This is the canonical page for how evaluation works in AgentOps. An evaluation runs a dataset against a target agent, scores the responses, and gates the result against thresholds. Foundry operates the agent at runtime; AgentOps turns that run into repo-side release proof.

If you want a hands-on walkthrough instead of a reference, start with the Prompt agent tutorial or the HTTP agent tutorial.

What an evaluation is

An evaluation is defined by one flat file, agentops.yaml. It connects three things: the agent (the target to evaluate), the dataset (the rows to send), and the thresholds (the quality gates that decide pass or fail).

The minimum config is three lines:

version: 1
agent: "travel-agent:1"
dataset: .agentops/data/smoke.jsonl

The AgentOps runner reads that config, sends each dataset row to the target, collects responses, scores them with evaluators, and checks the scores against your thresholds. It writes two outputs every run: results.json for automation and report.md for human review.

Where evaluations run

By default, agentops eval run is a local runner. It runs wherever you execute the command: your laptop, a dev container, GitHub Actions, or another CI host. The output is written to that workspace under .agentops/results/latest/.

Foundry visibility is opt-in:

Config What happens Foundry surface
execution: local or omitted AgentOps invokes the target and scores rows locally. Local results.json and report.md only.
execution: local plus publish: true AgentOps keeps the local run as source of truth, then uploads metrics and row results. Classic Foundry Evaluations.
execution: cloud Foundry runs the agent and evaluators server-side. New Foundry Evaluations.

execution: cloud is currently for Foundry prompt agents declared as name:version. HTTP endpoints use the local runner; if you want those local results visible in Foundry, use publish: true, which targets the Classic Foundry Evaluations upload path.

If you configure Application Insights, AgentOps also emits telemetry spans so the run can be inspected through Foundry tracing or Azure Monitor Logs. That is separate from the Evaluations page.

Exit codes are the CI contract

The runner returns 0 when every threshold passes, 2 when the run succeeded but one or more thresholds failed, and 1 for a runtime or configuration error. These three codes are the public gate contract. CI treats 2 as a hard fail so a deploy never runs on a regression.

graph TD
    A[agentops.yaml target dataset thresholds]
    B[JSONL dataset rows]
    C[AgentOps runner]
    D[Foundry target]
    E[HTTP target]
    F[Model target]
    G[Evaluators and thresholds]
    H[results.json]
    I[report.md]

    A --> C
    B --> C
    C --> D
    C --> E
    C --> F
    D --> G
    E --> G
    F --> G
    G --> H
    G --> I

Target kinds

AgentOps resolves the agent: value into one of four target kinds by its shape. You do not choose a backend by hand; the shape of agent: selects both the kind and the fields that make sense for it.

agent: value Target kind Use case
"travel-agent:1" (name:version) Foundry prompt agent Foundry Agent Service agents
"https://...services.ai.azure.com/.../agents/<id>" Foundry hosted agent A deployed agent endpoint on a Foundry domain
"https://api.example.com/chat" HTTP/JSON endpoint LangGraph, Agent Framework, ACA, AKS, custom REST
"model:gpt-4o-mini" Model-direct Raw model deployment checks

HTTP targets need request and response mapping

A custom HTTP endpoint rarely matches AgentOps defaults exactly, so you map its request and response shape with top-level fields. Use request_field and response_field (dot-paths) to point at the right JSON keys, tool_calls_field for tool output, auth_header_env to name an env var holding a Bearer token, and extra_fields for any static body fields.

version: 1
agent: https://my-aca-app.eastus2.azurecontainerapps.io/chat
dataset: .agentops/data/qa.jsonl
request_field: message            # default is "message"
response_field: text              # dot-path; default is "text"
auth_header_env: APP_API_TOKEN    # value is sent as a Bearer token

Fill agentops.yaml for HTTP endpoints

For HTTP agents, fill agentops.yaml from the shape of the request and response. Start with the defaults, then add only the fields your endpoint needs.

version: 1
agent: https://api.example.com/chat
dataset: .agentops/data/qa.jsonl
protocol: http-json
request_field: message
response_field: text
If the endpoint response is... Use this config
JSON, for example {"text": "answer"} response_mode: json or omit it. Set response_field: text if needed.
Plain text, returned all at once response_mode: text. Do not add stream:.
Plain text, streamed in chunks response_mode: text. Do not add stream: unless the first chunk is not part of the answer.
Plain text stream with a leading id or token response_mode: text plus stream.strip_leading_token: true.
Server-Sent Events with data: lines response_mode: sse.
Server-Sent Events where each data: line is JSON response_mode: sse plus stream.text_field, for example stream.text_field: choices.0.delta.content.
Server-Sent Events with a final marker response_mode: sse plus stream.done_marker, for example stream.done_marker: "[DONE]".

Examples:

# JSON response: {"answer": "..."}
response_mode: json
response_field: answer
# Plain text response, streamed or not.
response_mode: text
# GPT-RAG orchestrator: text stream where the first token is a conversation id.
response_mode: text
stream:
  strip_leading_token: true
# SSE response with JSON data frames.
response_mode: sse
stream:
  text_field: choices.0.delta.content
  done_marker: "[DONE]"

Datasets and scenarios

A dataset is a plain JSONL file, one evaluation row per line. Each row has an input prompt and usually an expected reference answer. Optional fields drive which evaluators run.

{"id": "1", "input": "What is the refund policy?", "expected": "Refunds within 30 days.", "context": "Our policy: refunds are available within 30 days."}

The presence of optional fields tells AgentOps which evaluation scenario you are running. You do not declare the scenario; the row shape implies it.

Scenario Signal in the row Purpose
Model quality model:<deployment> target plus expected Direct model checks
RAG context Grounding and retrieval checks
Conversational input plus expected Chatbot and Q&A quality
Agent workflow tool_calls plus tool_definitions Tool-use quality
Content safety Safety evaluators Responsible AI checks

Evaluators

An evaluator is a scoring function that measures one aspect of a response. They come in two flavors. AI-assisted evaluators use a judge model to score qualities like coherence, similarity, or groundedness. Local metrics are computed without a judge, such as avg_latency_seconds or F1ScoreEvaluator for exact-reference checks.

AgentOps auto-selects evaluators from the target kind and the dataset shape, so a three-line config still scores the right things. Prompt and hosted agents get answer-quality judges, context rows add the RAG set, and tool rows add the tool-use set.

Run agentops eval init after you create the dataset to see the recommendation. For HTTP, model, and other local targets, this is recommendation-only: AgentOps does not call azd or create eval.yaml. For Foundry prompt agents, the same command can also delegate to azd ai agent eval init to create Foundry-native eval assets.

Override only when you must

Set the evaluators: list in agentops.yaml only when you need to replace the auto-selection. It is an escape hatch, not the normal path. For the full catalog of evaluator names and their required inputs, see Built-in Evaluators.

Evaluation path: where the run executes

The execution: field decides where the evaluation actually runs. Local is the default and works for every target. Cloud runs a Foundry prompt agent server-side. The azd recipe path delegates to an existing azd ai agent eval flow.

Target Cloud (execution: cloud) Local runner Recommended default
Foundry prompt agent (name:version) Yes Yes Cloud for official Foundry runs; local for fast feedback
Foundry hosted agent URL No Yes Local runner; optionally publish: true
Generic HTTP/JSON endpoint No Yes Local runner; optionally publish: true
Raw model deployment (model:<name>) No Yes Local runner

For prompt-agent CI pipelines that need a merge or deploy gate, prefer cloud eval. Foundry executes the managed evaluation and AgentOps enforces thresholds, baselines, Doctor readiness, and release evidence.

Reusing an azd eval recipe

If a Foundry project already uses the public-preview azd ai agent eval recipe, set execution: azd and eval_recipe: eval.yaml. AgentOps delegates execution to azd, normalizes the metrics, binds thresholds, writes results.json, and fails closed for any threshold that has no emitted metric. Rubric evaluator dimensions are treated as first-class metric names.

Mini-glossary

The tutorials defer to these definitions, so they live here once.

Dimension

A dimension is a single named axis a rubric or evaluator scores. A Travel Agent rubric might score the dimensions helpfulness, safety, and format_adherence separately, so one response produces one score per dimension rather than a single blended number.

Rubric

A rubric is an evaluator that scores responses against a written scoring guide, usually one score per dimension. For example, a rubric can define helpfulness: 1 to 5 with a short description of what a 1 and a 5 look like, and the judge model applies that guide to each row. Rubric dimensions become metric names you can put thresholds on.

smoke-core

A smoke-core is a small, fast smoke dataset plus the minimal evaluator set that gates it. It is the quick check you run on every change to catch obvious breakage in seconds, before the larger scenario datasets run. Think of it as the few rows and one or two evaluators that must always pass.

Configuration model

agentops.yaml is the single source of truth. Keep it small and add only the fields your target needs. For the complete schema, every top-level field, and more examples, see Built-in Evaluators for evaluator config and the tutorials for end-to-end setups.

version: 1
agent: "https://api.example.com/chat"
dataset: .agentops/data/support.jsonl

request_field: message
response_field: text

thresholds:
  coherence: ">=3"
  avg_latency_seconds: "<=2"
© 2026 AgentOps Accelerator — built for shipping Foundry agents with confidence