AI-3016 Learning Portal

Concept — What & Why

Evaluation in Azure AI Foundry gives you systematic, repeatable ways to measure whether your copilot's outputs are accurate, coherent, safe, and grounded before and after deployment. Evaluations run against a test dataset (CSV or JSONL format). The dataset is fed to your model or agent, and the outputs are scored by evaluators — specialized tools that apply either mathematical formulas or AI-assisted (LLM-as-judge) methods.

Azure AI Foundry organizes built-in evaluators into three categories:

Category	Examples	Judge Method
Quality (General Purpose)	Coherence, Fluency, Relevance, Groundedness, Similarity	LLM-as-Judge (requires a GPT deployment)
Safety	Violence, Sexual, Self-Harm, Hate/Unfairness, Protected Material, Jailbreak	Safety Evaluator (dedicated classifiers, no judge deployment)
Agent-specific	Intent Resolution, Task Adherence, Tool Call Accuracy	AI-assisted

Groundedness: a quality evaluator that measures whether every claim in a model's response is traceable to and supported by the source documents provided as context. It is the key metric for RAG applications; low groundedness signals hallucination.

LLM-as-Judge: an evaluation approach where a separate GPT model evaluates each response against a rubric and returns a numerical score plus rationale. It is required for quality metrics (Coherence, Fluency, Relevance, Groundedness) but not for safety metrics.

Safety Evaluator: a built-in evaluator that uses dedicated automated classifiers (not a GPT judge) to score model outputs on the same 0–7 severity scale used by content filters. No judge deployment is required.
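The judge requirement per category can be sketched as a simple lookup. This is an illustrative helper, not part of any Azure SDK: the dictionary keys and the function name `needs_judge_deployment` are hypothetical, and treating the "AI-assisted" agent evaluators as needing a judge model is an assumption drawn from the table.

```python
# Category assignments copied from the table; the key spellings are hypothetical.
EVALUATOR_CATEGORY = {
    "coherence": "quality",
    "fluency": "quality",
    "relevance": "quality",
    "groundedness": "quality",
    "similarity": "quality",
    "violence": "safety",
    "sexual": "safety",
    "self_harm": "safety",
    "hate_unfairness": "safety",
    "protected_material": "safety",
    "jailbreak": "safety",
    "intent_resolution": "agent",
    "task_adherence": "agent",
    "tool_call_accuracy": "agent",
}

# Assumption: "AI-assisted" agent evaluators rely on a judge model,
# just like the LLM-as-judge quality evaluators.
CATEGORIES_NEEDING_JUDGE = {"quality", "agent"}


def needs_judge_deployment(selected):
    """Return True if any selected evaluator relies on an LLM judge."""
    return any(
        EVALUATOR_CATEGORY[name] in CATEGORIES_NEEDING_JUDGE for name in selected
    )


print(needs_judge_deployment(["violence", "self_harm"]))
print(needs_judge_deployment(["violence", "groundedness"]))
```

A safety-only run reports that no judge is needed, while adding a single quality evaluator flips the answer.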

Choose the appropriate evaluation target (Agent, Model, Dataset, or Traces) depending on whether you are evaluating live outputs or pre-captured responses. The evaluation target is the source of the data being evaluated: Agent (a live deployed agent), Model (a live deployed chat model), Dataset (pre-captured responses in a file), or Traces (agent interactions from Application Insights).

Core Quality Metrics

Coherence — measures logical structure, clarity, and internal consistency. A coherent answer is well-organized and free of internal contradictions.

Fluency — measures grammatical correctness and linguistic quality. A fluent response reads naturally without grammatical errors.

Relevance — measures how well the response addresses the user's actual query. A relevant response stays on topic and directly answers what was asked.

Groundedness — measures whether every claim is traceable to the source documents (context) provided to the model. The most critical metric for RAG applications.

Metric	Ideal Result	Suboptimal Result
Coherence	Well-organized, logically consistent	Disjointed, self-contradictory
Fluency	Natural, grammatically correct	Grammatical errors, awkward phrasing
Relevance	Directly answers the user's query	Off-topic, misses the point
Groundedness	All claims supported by source context	Hallucinated or unsupported facts

Deep Dive — How It Works

Safety Evaluators — No Judge Required

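Safety evaluators use dedicated classifiers, not a GPT judge model. The content-harm evaluators (Hate and Unfairness, Sexual, Violence, Self-Harm) score on the same 0–7 severity scale used by content filters. The built-in safety evaluators are: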

Safety Evaluator	What It Measures
Hate and Unfairness	Discriminatory or dehumanizing content toward identity groups
Sexual	Explicit sexual content in responses
Violence	Physical harm, threats, extremist glorification
Self-Harm	Suicide instructions, self-injury encouragement
Protected Material	Known copyrighted text or public code in outputs
Direct Jailbreak	Susceptibility to prompt injection attempts (direct)
Indirect Jailbreak	Susceptibility to prompt injection via documents
Code Vulnerability	Insecure or exploitable code in generated outputs
Ungrounded Attributes	False attribution of personal characteristics to real people
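For evaluators that report a 0–7 severity, it can help to group raw scores into named bands when reading results. The sketch below assumes the four-band grouping (very low 0–1, low 2–3, medium 4–5, high 6–7); the band names and the function `severity_band` are illustrative assumptions, not values taken from this page.

```python
# Assumed grouping of the 0-7 severity scale into four bands of two scores each.
BANDS = ("very low", "low", "medium", "high")


def severity_band(score: int) -> str:
    """Map a 0-7 severity score to its assumed band name."""
    if not 0 <= score <= 7:
        raise ValueError(f"severity must be between 0 and 7, got {score}")
    # Integer division pairs the scores: 0-1, 2-3, 4-5, 6-7.
    return BANDS[score // 2]


for score in (0, 3, 5, 7):
    print(score, severity_band(score))
```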

Test Dataset Format (JSONL)

The evaluation API and portal wizard accept data in JSONL format (one JSON object per line):

{"query": "What is the capital of France?", "context": "France is a country in Western Europe.", "response": "Paris is the capital of France.", "ground_truth": "Paris"}

Field	Required By
`query`	Relevance, Groundedness, Coherence, Fluency
`response`	All evaluators
`context`	Groundedness (the source documents to verify against)
`ground_truth`	Similarity, QA metrics

Coherence and Fluency do not require ground_truth: they judge the response on its own merits, with no reference answer to compare against.
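Before uploading a file, the field requirements above can be checked locally. This is a minimal sketch using only the standard library; the `REQUIRED_FIELDS` mapping mirrors the table, and the function name `missing_fields` is hypothetical.

```python
import json

# Field requirements copied from the table ("response" is required by all evaluators).
REQUIRED_FIELDS = {
    "relevance": {"query", "response"},
    "coherence": {"query", "response"},
    "fluency": {"query", "response"},
    "groundedness": {"query", "response", "context"},
    "similarity": {"response", "ground_truth"},
}


def missing_fields(jsonl_text, evaluators):
    """Map 1-based line numbers to the fields the chosen evaluators still need."""
    needed = set().union(*(REQUIRED_FIELDS[name] for name in evaluators))
    problems = {}
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        row = json.loads(line)
        # A field counts as missing if it is absent or empty.
        missing = sorted(field for field in needed if not row.get(field))
        if missing:
            problems[line_no] = missing
    return problems


sample = "\n".join([
    json.dumps({"query": "What is the capital of France?",
                "context": "France is a country in Western Europe.",
                "response": "Paris is the capital of France.",
                "ground_truth": "Paris"}),
    json.dumps({"query": "What is the capital of Spain?", "response": "Madrid."}),
])
print(missing_fields(sample, ["groundedness", "similarity"]))
```

The second record lacks both `context` and `ground_truth`, so it is reported for a run that includes Groundedness and Similarity but would pass a Coherence-only run.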

Score Scales by Category

Evaluator Type	Score Scale
Quality evaluators (Coherence, Fluency, Relevance, Groundedness)	1–5 (LLM-judge rubric)
Safety evaluators (Violence, Sexual, etc.)	0–7 (severity scale)
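Because the two scales point in opposite directions (a high quality score is good, a high severity score is bad), any pass/fail logic has to know which category a score belongs to. The sketch below illustrates that; the function `needs_review` and the threshold of 3 are hypothetical choices for illustration, not documented defaults from this page.

```python
# Ranges from the table above; the direction flags state which end of each scale is desirable.
SCALES = {
    "quality": {"min": 1, "max": 5, "higher_is_better": True},
    "safety": {"min": 0, "max": 7, "higher_is_better": False},
}


def needs_review(category, score, threshold=3):
    """Flag a score that falls on the undesirable side of the threshold."""
    scale = SCALES[category]
    if not scale["min"] <= score <= scale["max"]:
        raise ValueError(f"{score} is outside the {category} scale")
    if scale["higher_is_better"]:
        return score < threshold  # low quality score
    return score > threshold      # high severity score


print(needs_review("quality", 2))
print(needs_review("safety", 2))
```

The same raw value of 2 is a concern on the quality scale but not on the severity scale.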

Evaluation Targets

Target	Use When
Agent	Evaluate outputs generated live by a deployed agent
Model	Evaluate outputs from a deployed chat/completion model
Dataset	Evaluate pre-existing responses already captured in a file
Traces	Evaluate agent interactions captured in Application Insights
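The practical difference between targets is whether responses are generated during the run. The following sketch models that with a stub callable; `collect_responses` and `fake_model` are hypothetical names, and the Traces target is left out because it depends on Application Insights data.

```python
def collect_responses(target, rows, call_model=None):
    """Return rows ready for scoring, generating responses only for live targets."""
    if target == "dataset":
        # Pre-captured responses are scored as-is; nothing is called.
        return [dict(row) for row in rows]
    if target in ("model", "agent"):
        if call_model is None:
            raise ValueError("a live target needs a callable deployment")
        # Each query is sent to the deployment and the stored response is replaced.
        return [{**row, "response": call_model(row["query"])} for row in rows]
    raise ValueError(f"unsupported target: {target}")


def fake_model(query):
    """Stand-in for a deployed model; returns a canned answer."""
    return f"(live answer to: {query})"


rows = [{"query": "What is the capital of France?", "response": "Paris."}]
print(collect_responses("dataset", rows))
print(collect_responses("model", rows, fake_model))
```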

Manual Evaluation

In addition to automated scoring, Azure AI Foundry supports manual evaluation, in which a human reviews individual query/response pairs and assigns ratings. It is especially useful for domain-specific quality criteria that are hard to capture in rubrics, for auditing automated evaluator outputs for calibration, and for red-team testing of novel adversarial inputs.
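One way to audit an automated evaluator for calibration is to compare its scores with human ratings on the same rows. The sketch below assumes both sets of ratings use the same scale and are in the same row order; `calibration_report` and the tolerance of one point are hypothetical choices.

```python
def calibration_report(human, automated, tolerance=1):
    """Compare paired human and automated scores for the same rows."""
    if not human or len(human) != len(automated):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(h - a) for h, a in zip(human, automated)]
    return {
        # Average distance between the two raters.
        "mean_abs_diff": sum(diffs) / len(diffs),
        # Share of rows where the raters agree within the tolerance.
        "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
        # Row indexes worth a second look.
        "disagreements": [i for i, d in enumerate(diffs) if d > tolerance],
    }


print(calibration_report(human=[5, 4, 2, 1], automated=[5, 3, 4, 1]))
```

Rows listed under `disagreements` are natural candidates for rubric refinement or further manual review.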

Hands-On Lab

Step 1 — Navigate to Evaluations. In ai.azure.com, open your project. In the left pane, select Evaluation, then select + Create.

Step 2 — Choose the evaluation target. Select Dataset if you already have model outputs to score, or Model to have the evaluation call your deployment live. For a copilot assessment, select Agent if you are evaluating an AI Foundry agent.

Step 3 — Upload or generate a test dataset. Select Add new dataset and upload a JSONL file containing query, context, and response columns. If you don't have data, use Synthetic dataset generation — specify the number of rows and a description of the scenario, and the portal generates sample rows.

Step 4 — Select evaluators. Under Quality evaluators, enable Coherence, Fluency, Relevance, and Groundedness. Under Safety evaluators, enable Violence, Sexual, Self-Harm, and Hate/Unfairness. Select the GPT deployment to use as judge (required for quality metrics only).

Step 5 — Verify data mapping. The portal auto-maps dataset columns to evaluator fields. Confirm that query, context, response, and ground_truth (if present) are mapped correctly. Unassigned required fields appear with an asterisk — manually assign them before proceeding.

Step 6 — Name and submit. Provide a run name (e.g., copilot-v2-baseline), review your configuration, and select Submit. The evaluation typically completes within a few minutes.

Step 7 — Review results. Open the run from the Evaluation page. Review the aggregate score for each metric, then drill into individual rows to see which responses scored low on Groundedness or were flagged by a safety evaluator.
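The review in Step 7 can also be reproduced offline on exported row-level scores. This sketch assumes each row is a dictionary of metric name to score; that shape, the function names, and the threshold of 3 are illustrative assumptions rather than the portal's export format.

```python
def summarize(rows, metrics):
    """Average each metric across all rows."""
    return {m: sum(row[m] for row in rows) / len(rows) for m in metrics}


def low_scoring_rows(rows, metric, threshold=3):
    """Return indexes of rows whose score for the metric is below the threshold."""
    return [i for i, row in enumerate(rows) if row[metric] < threshold]


# Hypothetical row-level results on the 1-5 quality scale.
results = [
    {"coherence": 5, "groundedness": 5},
    {"coherence": 4, "groundedness": 1},
    {"coherence": 3, "groundedness": 3},
]
print(summarize(results, ["coherence", "groundedness"]))
print(low_scoring_rows(results, "groundedness"))
```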

Exam Angle — What AI-3016 Tests

AI-3016 Assessment Focus

Expect questions about which evaluators require a GPT judge (quality only, not safety), what context and ground_truth are each used for, and what a low Groundedness score specifically means (the response is unsupported by the provided context, which is not the same as being factually incorrect).

Exam Trap

"Safety evaluators require a GPT judge deployment." Safety evaluators use dedicated classifiers, not a judge model. Only quality/performance metrics (Coherence, Fluency, Relevance, Groundedness) require a GPT judge deployment.

Exam Trap

"Groundedness measures whether the answer is factually correct in general." Groundedness measures whether the answer is supported by the provided context documents, not general world knowledge. A response can be factually true yet score low on groundedness if the context doesn't contain that information.
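The distinction can be made concrete with a toy check. The code below is not how the Groundedness evaluator works (that is an LLM judge applying a rubric); it is a deliberately naive word-overlap proxy, with hypothetical function names, that only shows how a true statement can still lack support in the context.

```python
import re

# Small, hand-picked stopword list for the toy example.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}


def content_words(text):
    """Lowercase the text and keep the words that carry meaning."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}


def unsupported_words(response, context):
    """List response words that never appear in the context (naive proxy only)."""
    return sorted(content_words(response) - content_words(context))


response = "Paris is the capital of France."
thin_context = "France is a country in Western Europe."
rich_context = "Paris is the capital of France and its largest city."

print(unsupported_words(response, thin_context))
print(unsupported_words(response, rich_context))
```

The response is factually true in both cases, but only the second context actually supports it.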

Exam Trap

"You need ground truth data to run Coherence and Fluency evaluations." Coherence and Fluency do not require ground truth. They assess the response on its own, without a reference answer. ground_truth is needed for Similarity and some QA metrics.

Exam Trap

"All evaluators produce a 0–7 score like content filters." Quality evaluators (Coherence, Fluency, etc.) typically return scores on a 1–5 scale per the LLM-judge rubric; safety evaluators use the 0–7 severity scale. The scales differ by category.

Exam Trap

"Evaluation results are only visible via the SDK." The Foundry portal has a dedicated Evaluations tab with dashboards, metric comparisons, and row-level drill-downs. SDK results are also logged there when you pass the project connection.

Exam Tip

For the Dataset evaluation target, the model is NOT called again — pre-captured responses are scored as-is. For Model and Agent targets, the evaluation sends live requests to the deployment.

Must Memorize

Quality evaluators need: query + response (minimum) + context (for Groundedness) + ground_truth (for Similarity). Safety evaluators need: response only.

Q: Which quality evaluator specifically measures whether a model's response is supported by the source documents provided in the prompt context?
A: Groundedness.

Q: A developer wants to run safety evaluations (Violence, Sexual, Self-Harm). What additional Azure resource is required?
A: None. Safety evaluators use dedicated classifiers, so no GPT judge deployment is required.

Q: What does a low Groundedness score indicate in an evaluation?
A: The response contains claims that are not supported by the provided context, even if those claims happen to be factually true.

Q: Which test dataset field is required when running the Groundedness evaluator?
A: context (the source documents to verify against).

Q: Which evaluation target should you select to score responses already stored in a JSONL file without calling the model again?
A: Dataset.

Q: On which score scale do quality evaluators (Coherence, Fluency) return results, vs. safety evaluators (Violence, Self-Harm)?
A: Quality evaluators return 1–5; safety evaluators return 0–7.

5.2 — Evaluate Copilot Performance
