Skip to main content

Eval Suites

Eval suites score skills and agents across multiple models so quality changes are caught at PR time, not after release. Git-Ape uses microsoft/waza as the eval runner; suites live under .github/evals/.

.github/evals/
├── manifest.yaml # CI matrix (tiers + models)
├── <skill-name>/eval.yaml # Skill evals
└── agents/<agent-name>/eval.yaml # Agent evals

Skill eval scaffold​

Every skill eval needs an eval.yaml and at least two tasks (one positive, one negative):

.github/evals/my-skill/
├── eval.yaml
└── tasks/
├── positive-001-typical-use.yaml
└── negative-001-off-topic.yaml

eval.yaml​

# yaml-language-server: $schema=https://raw.githubusercontent.com/microsoft/waza/main/schemas/eval.schema.json

name: my-skill-eval
description: "What this suite scores."
skill: my-skill
version: "0.1"

config:
trials_per_task: 2 # 2 = expanded-tier default (flake detection); bump to 3 for pilot/promotion
timeout_seconds: 60
parallel: false
executor: copilot-sdk
model: claude-sonnet-4.6 # Default — CI overrides via --model per matrix leg

metrics:
- name: trigger_precision
weight: 1.0
threshold: 0.6
description: Skill should activate on relevant prompts and stay quiet otherwise.

graders:
# Cap runaway loops / unexpected plan expansion.
- type: behavior
name: budget
config:
max_tool_calls: 30
max_duration_ms: 240000

tasks:
- "tasks/*.yaml"

Task file​

# tasks/positive-001-typical-use.yaml
name: positive-001-typical-use
description: "User asks for the canonical thing this skill does"
prompt: |
<Verbatim user prompt that should trigger the skill>

expect:
trigger: skill:my-skill
behavior:
- "describes what the skill should do procedurally"
- "second behavioral assertion"
prompt:
must_include:
- "literal substring that MUST appear in the reply"

Negative tasks omit the skill-specific expect.trigger and assert the agent should not invoke the skill:

# tasks/negative-001-off-topic.yaml
name: negative-001-off-topic
description: "Unrelated request — skill should NOT fire"
prompt: |
Help me write a sonnet about ducks.

expect:
trigger: !skill:my-skill # ! = negation

Agent eval scaffold​

Agent evals live under .github/evals/agents/<agent-name>/. They use the same waza schema with two extra pieces:

  1. A mirrored copy of the agent file as a sibling, so waza's discovery picks it up.
  2. config.skill_directories listing both the eval directory (for the agent mirror) and the real skill directory.
.github/evals/agents/my-agent/
├── eval.yaml
├── my-agent.agent.md # Mirror of .github/agents/my-agent.agent.md
└── tasks/
├── positive-001-happy-path.yaml
└── negative-001-off-topic.yaml
name: my-agent-agent-eval
description: "End-to-end eval for the my-agent agent."
skill: my-agent
version: "0.1"

config:
trials_per_task: 2
timeout_seconds: 480
parallel: false
executor: copilot-sdk
model: claude-sonnet-4.6
skill_directories:
- "." # picks up the .agent.md mirror
- "../../../skills/my-skill-the-agent-uses" # picks up the real skill

The eval-directory copy of *.agent.md must be kept in sync with the production agent. The /agent-improve and /agent-bench prompts do this resync automatically (cp .github/agents/<name>.agent.md .github/evals/agents/<name>/). Add a CI lint or pre-commit hook if you want hard enforcement.

Graders​

GraderScoresWhen to include
trigger (implicit via metrics.trigger_precision)Did the agent route to the right skill on positives, and stay quiet on negatives?Always
behaviorTool-call budget, duration, or step-order assertionsAlways (at minimum a budget config)
prompt (LLM-as-judge)Did the reply text meet a documented quality bar? Scored by claude-sonnet-4.6.Always for positive tasks
skill_invocationDid the agent invoke the specific tool calls the skill expects?Skills that wrap MCP tools or specific CLIs
tool_constraintDid the agent stay within the declared tool allowlist?Agent evals (waza auto-injects this from .agent.md tools: unless suppressed)

Per-task vs eval-root graders​

Some graders should not fire on every task. For example, prompt quality scoring on a refusal task produces low scores because the agent (correctly) gave a short refusal — there's nothing to evaluate against a quality rubric.

Pattern: declare per-task graders inside each tasks/*.yaml, and only include eval-root graders that genuinely apply to every task (typically budget).

Dual tool taxonomy​

The production tools: field on an .agent.md uses VS Code Copilot Chat IDs (read, search, execute/*, microsoftdocs/mcp/*, azure-mcp/*). The waza copilot-sdk executor emits SDK CLI short names (bash, view, edit, create, sql, task).

If you write per-task tool_constraint graders, target the SDK taxonomy. To stop waza from auto-injecting an eval-root tool_constraint that would fail against VS Code IDs the SDK never emits, declare a no-op suppressor:

graders:
- type: tool_constraint
name: _suppress_auto_inject
config:
reject_tools:
- tool: "^___never_matches___$"

Real assertions then live per-task:

# tasks/positive-001.yaml
expect_tools:
- tool: "^(bash|view|edit|create|sql|task)$"

How CI picks up your eval​

The matrix is driven by .github/evals/manifest.yaml:

skills:
- name: prereq-check
tier: pilot
# Add your skill here:
- name: my-skill
tier: expanded # Start in expanded; promote to pilot via /skill-promote

tiers:
pilot: # Full 4-model fan-out
models:
- name: claude-sonnet-4.6
- name: gpt-5.4
baseline: true
- name: gpt-5-codex
- name: claude-opus-4.6
expanded: # 2-model fan-out (lower cost)
models:
- name: claude-sonnet-4.6
- name: gpt-5-codex
TierModelsUse when
pilot4 (claude-sonnet-4.6, gpt-5.4 baseline, gpt-5-codex, claude-opus-4.6)Skill is stable; you want full cross-model signal
expanded2 (claude-sonnet-4.6, gpt-5-codex)Skill is new; cap quota cost while it stabilises

The PR workflows that consume the manifest:

Agent evals are discovered from the filesystem (every directory under .github/evals/agents/ with an eval.yaml runs); they do not need a manifest entry.

Preflight gating: both workflows include a preflight job that probes the COPILOT_GITHUB_TOKEN secret. If the token is missing or lacks access to the private microsoft/waza repo (where waza releases live), downstream jobs are skipped cleanly instead of failing red. Maintainers can validate the secret end-to-end by checking that the matrix actually runs on the next PR.

Local validation​

# Single run, single model
waza run .github/evals/my-skill/eval.yaml --no-cache

# Verbose with debug
waza run .github/evals/my-skill/eval.yaml -v --debug

# Cross-model bench (uses the bench prompt)
# In VS Code: /skill-bench skillName=my-skill

The /skill-bench and /agent-bench prompts wrap the cross-model run + waza compare for you and print a one-line winner summary.