Evaluation in Azure AI Foundry gives you systematic, repeatable ways to measure whether your copilot's outputs are accurate, coherent, safe, and grounded before and after deployment. Evaluations run against a test dataset (CSV or JSONL format). The dataset is fed to your model or agent, and the outputs are scored by evaluators — specialized tools that apply either mathematical formulas or AI-assisted (LLM-as-judge) methods.
Azure AI Foundry organizes built-in evaluators into three categories:
| Category | Examples | Judge Method |
|---|---|---|
| Quality (General Purpose) | Coherence, Fluency, Relevance, Groundedness, Similarity | LLM-as-judge: a separate GPT deployment scores each response against a rubric and returns a numerical score plus a rationale |
| Safety | Violence, Sexual, Self-Harm, Hate/Unfairness, Protected Material, Jailbreak | Dedicated automated classifiers scoring on the same 0–7 severity scale used by content filters (no judge deployment required) |
| Agent-specific | Intent Resolution, Task Adherence, Tool Call Accuracy | AI-assisted |
Choose the appropriate evaluation target, which is the source of the data being evaluated: Agent (a live deployed agent), Model (a live deployed chat model), Dataset (pre-captured responses in a file), or Traces (agent interactions from Application Insights). The choice depends on whether you are evaluating live outputs or pre-captured responses.
Core Quality Metrics
Coherence — measures logical structure, clarity, and internal consistency. A coherent answer is well-organized and free of internal contradictions.
Fluency — measures grammatical correctness and linguistic quality. A fluent response reads naturally without grammatical errors.
Relevance — measures how well the response addresses the user's actual query. A relevant response stays on topic and directly answers what was asked.
Groundedness — measures whether every claim is traceable to the source documents (context) provided to the model. The most critical metric for RAG applications.
| Metric | Ideal Result | Suboptimal Result |
|---|---|---|
| Coherence | Well-organized, logically consistent | Disjointed, self-contradictory |
| Fluency | Natural, grammatically correct | Grammatical errors, awkward phrasing |
| Relevance | Directly answers the user's query | Off-topic, misses the point |
| Groundedness | All claims supported by source context | Hallucinated or unsupported facts |
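To make the difference between groundedness and general factual correctness concrete, here is a toy Python sketch. It is not how the Foundry Groundedness evaluator works (that relies on an LLM judge); `naive_support_ratio` and the tiny stop-word list are hypothetical, and the function only measures how many content words of a response also appear in the context.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "and", "to"}

def content_words(text: str) -> set[str]:
    """Lowercase the text and return its words minus stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}

def naive_support_ratio(response: str, context: str) -> float:
    """Fraction of the response's content words that also occur in the context.

    A crude word-overlap proxy, NOT the LLM-judge rubric used in Foundry.
    """
    words = content_words(response)
    if not words:
        return 1.0  # an empty response makes no unsupported claims
    return len(words & content_words(context)) / len(words)

context = "France is a country in Western Europe."
true_but_unsupported = "Paris is the capital of France."
supported = "France is in Western Europe."

print(naive_support_ratio(true_but_unsupported, context))
print(naive_support_ratio(supported, context))
```

The first response is factually true, yet only one of its three content words appears in the context, so the proxy rates it poorly supported. The second response scores fully supported because everything it says comes from the context.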
Safety Evaluators — No Judge Required
Safety evaluators use dedicated classifiers, not a GPT judge model. They score on the same 0–7 scale used by content filters:
| Safety Evaluator | What It Measures |
|---|---|
| Hate and Unfairness | Discriminatory or dehumanizing content toward identity groups |
| Sexual | Explicit sexual content in responses |
| Violence | Physical harm, threats, extremist glorification |
| Self-Harm | Suicide instructions, self-injury encouragement |
| Protected Material | Known copyrighted text or public code in outputs |
| Direct Jailbreak | Susceptibility to prompt injection attempts made directly in the user input |
| Indirect Jailbreak | Susceptibility to prompt injection delivered through documents |
| Code Vulnerability | Insecure or exploitable code in generated outputs |
| Ungrounded Attributes | False attribution of personal characteristics to real people |
Test Dataset Format (JSONL)
The evaluation API and portal wizard accept data in JSONL format (one JSON object per line):
{"query": "What is the capital of France?", "context": "France is a country in Western Europe.", "response": "Paris is the capital of France.", "ground_truth": "Paris"}
| Field | Required By |
|---|---|
| query | Relevance, Groundedness, Coherence, Fluency |
| response | All evaluators |
| context | Groundedness (the source documents to verify against) |
| ground_truth | Similarity, QA metrics |
Coherence and Fluency do not require ground_truth: they score the response itself rather than comparing it to a reference answer.
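Before uploading a dataset, it can help to check that every row carries the fields your chosen evaluators need. The following is a minimal sketch using only the Python standard library. The `REQUIRED_FIELDS` dictionary mirrors the table above, while its lowercase keys and the `missing_fields` helper are hypothetical names, not part of any Azure SDK.

```python
import json

# Required fields per evaluator, copied from the table above.
# The lowercase evaluator keys are illustrative, not SDK identifiers.
REQUIRED_FIELDS = {
    "relevance": {"query", "response"},
    "coherence": {"query", "response"},
    "fluency": {"query", "response"},
    "groundedness": {"query", "response", "context"},
    "similarity": {"response", "ground_truth"},
}

def missing_fields(jsonl_text: str, evaluators: list[str]) -> dict[int, set[str]]:
    """Return {line_number: missing field names} for rows lacking required fields."""
    needed = set().union(*(REQUIRED_FIELDS[name] for name in evaluators))
    problems = {}
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        row = json.loads(line)
        # A field counts as missing when it is absent or empty.
        missing = {field for field in needed if not row.get(field)}
        if missing:
            problems[line_number] = missing
    return problems

sample = (
    '{"query": "q1", "context": "c1", "response": "r1"}\n'
    '{"query": "q2", "response": "r2"}\n'
)
print(missing_fields(sample, ["groundedness"]))
```

In this sample the second row has no context, so it is reported as unusable for Groundedness, while both rows would pass a Coherence-only check.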
Score Scales by Category
| Evaluator Type | Score Scale |
|---|---|
| Quality evaluators (Coherence, Fluency, Relevance, Groundedness) | 1–5 (LLM-judge rubric) |
| Safety evaluators (Violence, Sexual, etc.) | 0–7 (severity scale) |
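When post-processing results, the two scales are easy to mix up. The sketch below encodes the ranges from the table so that a score can be sanity-checked against its category. The evaluator identifiers (`self_harm`, `hate_unfairness`, and so on) and both helper functions are hypothetical names chosen for illustration.

```python
# Evaluator groupings from the table above; the lowercase names are assumptions.
QUALITY = {"coherence", "fluency", "relevance", "groundedness"}
SAFETY = {"violence", "sexual", "self_harm", "hate_unfairness"}

def score_range(evaluator: str) -> tuple[int, int]:
    """Return the (min, max) score for an evaluator based on its category."""
    name = evaluator.lower()
    if name in QUALITY:
        return (1, 5)  # LLM-judge rubric
    if name in SAFETY:
        return (0, 7)  # severity scale
    raise ValueError(f"Unknown evaluator: {evaluator}")

def is_valid_score(evaluator: str, score: int) -> bool:
    """Check that a score falls inside the range for its evaluator category."""
    low, high = score_range(evaluator)
    return low <= score <= high

print(score_range("Groundedness"))
print(is_valid_score("violence", 0))
print(is_valid_score("fluency", 0))
```

Note that the scales also point in opposite directions: on the quality rubric a higher score is better, whereas on the severity scale a higher score means more harmful content.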
Evaluation Targets
| Target | Use When |
|---|---|
| Agent | Evaluate outputs generated live by a deployed agent |
| Model | Evaluate outputs from a deployed chat/completion model |
| Dataset | Evaluate pre-existing responses already captured in a file |
| Traces | Evaluate agent interactions captured in Application Insights |
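The table can be read as a small decision procedure. The sketch below expresses it in Python. The `Target` enum and `choose_target` function are hypothetical, and the order in which the conditions are checked is an assumption made for illustration rather than portal behavior.

```python
from enum import Enum

class Target(Enum):
    AGENT = "Agent"
    MODEL = "Model"
    DATASET = "Dataset"
    TRACES = "Traces"

def choose_target(has_captured_responses: bool,
                  from_application_insights: bool,
                  is_agent: bool) -> Target:
    """Pick an evaluation target following the 'Use When' column above."""
    if from_application_insights:
        return Target.TRACES   # agent interactions already logged as traces
    if has_captured_responses:
        return Target.DATASET  # responses are in a file; nothing is called live
    # Otherwise outputs must be generated live by the deployment.
    return Target.AGENT if is_agent else Target.MODEL

print(choose_target(True, False, False).value)
print(choose_target(False, False, True).value)
```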
Manual Evaluation
In addition to automated scoring, Azure AI Foundry supports manual evaluation, in which a human reviews individual query/response pairs and assigns ratings. Manual evaluation is especially useful for applying domain-specific quality criteria that are hard to capture in rubrics, auditing automated evaluator outputs for calibration, and red-team testing with novel adversarial inputs.
Step 1 — Navigate to Evaluations. In ai.azure.com, open your project. In the left pane, select Evaluation, then select + Create.
Step 2 — Choose the evaluation target. Select Dataset if you already have model outputs to score, or Model to have the evaluation call your deployment live. For a copilot assessment, select Agent if you are evaluating an AI Foundry agent.
Step 3 — Upload or generate a test dataset. Select Add new dataset and upload a JSONL file containing query, context, and response columns. If you don't have data, use Synthetic dataset generation — specify the number of rows and a description of the scenario, and the portal generates sample rows.
Step 4 — Select evaluators. Under Quality evaluators, enable Coherence, Fluency, Relevance, and Groundedness. Under Safety evaluators, enable Violence, Sexual, Self-Harm, and Hate/Unfairness. Select the GPT deployment to use as judge (required for quality metrics only).
Step 5 — Verify data mapping. The portal auto-maps dataset columns to evaluator fields. Confirm that query, context, response, and ground_truth (if present) are mapped correctly. Unassigned required fields appear with an asterisk — manually assign them before proceeding.
Step 6 — Name and submit. Provide a run name (e.g., copilot-v2-baseline), review your configuration, and select Submit. The evaluation typically completes within a few minutes.
Step 7 — Review results. Open the run from the Evaluation page. Review aggregate scores per metric, then drill into individual rows to see which responses scored low on Groundedness or were flagged by the safety evaluators.
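If you export row-level results, the review in Step 7 can be approximated offline. The following sketch assumes a hypothetical result shape (a list of dictionaries with one numeric score per metric) and a hypothetical cut-off of 3 for "low" scores. Neither is taken from the portal or the SDK.

```python
from statistics import mean

def summarize(rows: list[dict], metric: str, low_threshold: int = 3):
    """Return (mean score, indexes of rows scoring below low_threshold).

    Both the row shape and the default threshold are assumptions.
    """
    scores = [row[metric] for row in rows]
    low_rows = [i for i, score in enumerate(scores) if score < low_threshold]
    return mean(scores), low_rows

# Hypothetical row-level results on the 1-5 quality scale.
results = [
    {"groundedness": 5, "coherence": 4},
    {"groundedness": 2, "coherence": 5},
    {"groundedness": 5, "coherence": 3},
]

average, low_rows = summarize(results, "groundedness")
print(average, low_rows)
```

Here the aggregate looks healthy, but the second row is singled out for inspection, which is the same aggregate-then-drill-down pattern the portal supports.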
AI-3016 Assessment Focus
Expect questions about which evaluators require a GPT judge (quality only, not safety), what context and ground_truth are each used for, and what a low Groundedness score specifically means (the response is not supported by the provided context, which is different from being factually incorrect).
Exam Trap
"Safety evaluators require a GPT judge deployment." Safety evaluators use dedicated classifiers, not a judge model. Only quality/performance metrics (Coherence, Fluency, Relevance, Groundedness) require a GPT judge deployment.
Exam Trap
"Groundedness measures whether the answer is factually correct in general." Groundedness measures whether the answer is supported by the provided context documents, not general world knowledge. A response can be factually true yet score low on groundedness if the context doesn't contain that information.
Exam Trap
"You need ground truth data to run Coherence and Fluency evaluations." Coherence and Fluency do not require ground truth. They assess the response alone. ground_truth is needed for Similarity and some QA metrics.
Exam Trap
"All evaluators produce a 0–7 score like content filters." Quality evaluators (Coherence, Fluency, etc.) typically return scores on a 1–5 scale per the LLM-judge rubric; safety evaluators use the 0–7 severity scale. The scales differ by category.
Exam Trap
"Evaluation results are only visible via the SDK." The Foundry portal has a dedicated Evaluations tab with dashboards, metric comparisons, and row-level drill-downs. SDK results are also logged there when you pass the project connection.
Exam Tip
For the Dataset evaluation target, the model is NOT called again — pre-captured responses are scored as-is. For Model and Agent targets, the evaluation sends live requests to the deployment.
Must Memorize
Quality evaluators need: query + response (minimum) + context (for Groundedness) + ground_truth (for Similarity). Safety evaluators need: response only.
Review Questions
Q: Which quality evaluator specifically measures whether a model's response is supported by the source documents provided in the prompt context?
Q: A developer wants to run safety evaluations (Violence, Sexual, Self-Harm). What additional Azure resource is required?
Q: What does a low Groundedness score indicate in an evaluation?
Q: Which test dataset field is required when running the Groundedness evaluator?
Q: Which evaluation target should you select to score responses already stored in a JSONL file without calling the model again?
Q: On which score scale do quality evaluators (Coherence, Fluency) return results, vs. safety evaluators (Violence, Self-Harm)?