AI-3018 Learning Portal

Concept — What & Why

Testing Multi-Agent Systems

Once a connected-agent topology is wired up, you must validate that the orchestrator correctly delegates to specialists and that the end-to-end response is accurate and coherent. Azure AI Foundry provides two primary testing surfaces: the Agents playground in the portal and SDK-based run step inspection in code. SDK integration tests build on run step inspection to turn those checks into automated regression gates.

Testing Surfaces at a Glance

Surface	Best for	Visibility
Foundry portal playground	Quick smoke tests, iterating on system prompts	Conversation thread, basic run status
Run step inspection (API)	Deep debugging, seeing which tool was called and what it returned	Full tool call / tool result trace
SDK integration tests	CI/CD gate, regression coverage	Programmatic assertions on responses

Each delegation to a specialist produces a run step in the orchestrator's run. A run step is a record of one unit of work performed during an agent run. A step of type tool_calls appears when the orchestrator invokes a tool, including a connected agent; each tool call entry contains the name of the agent called and its output. A tool_calls step with a connected_agent entry confirms that delegation occurred; inspect the connected_agent.name field, whose value matches the tool name you assigned when registering the connected agent, to see which specialist was called. A message_creation step appears when the model generates its final reply. If only message_creation steps are present, with no tool_calls steps, the orchestrator answered from its own knowledge without delegating.

Deep Dive — How It Works

How Run Steps Reveal Delegation

When the orchestrator delegates to a specialist, the run on the orchestrator's thread produces a run step of type tool_calls. Each tool call entry contains:

type: "connected_agent" (distinguishing it from function calls)
connected_agent.name: the tool name of the specialist (e.g., billing_agent)
connected_agent.output: the specialist's response text

You can retrieve these steps programmatically:

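# Assumes project_client, thread, and run already exist from the agent setup.
run_steps = project_client.agents.list_run_steps(
    thread_id=thread.id,
    run_id=run.id,
)
# Some SDK versions may wrap the steps in a pageable object with a .data list.
for step in getattr(run_steps, "data", run_steps):
    if step.type != "tool_calls":
        continue  # message_creation steps carry no tool calls
    for tc in step.step_details.tool_calls:
        # Function calls and other tool types have no connected_agent payload.
        if tc.type != "connected_agent":
            continue
        print(f"Agent called: {tc.connected_agent.name}")
        print(f"Specialist output: {tc.connected_agent.output}")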
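
For automated checks it can help to collect the delegations into plain values instead of printing them. The following is a minimal sketch, assuming run steps expose the type, step_details.tool_calls, and connected_agent fields listed above. The Delegation type and the extract_delegations helper are hypothetical names, and the stub objects at the bottom stand in for real SDK objects.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Iterable, List


@dataclass(frozen=True)
class Delegation:
    """One connected-agent call found in a run (hypothetical helper type)."""
    agent_name: str
    output: str


def extract_delegations(run_steps: Iterable[Any]) -> List[Delegation]:
    """Collect connected-agent calls from run steps, in the order given."""
    found: List[Delegation] = []
    for step in run_steps:
        if getattr(step, "type", None) != "tool_calls":
            continue  # message_creation steps carry no tool calls
        for tc in step.step_details.tool_calls:
            if getattr(tc, "type", None) != "connected_agent":
                continue  # ignore function calls and other tool types
            found.append(
                Delegation(tc.connected_agent.name, tc.connected_agent.output)
            )
    return found


# Stub steps shaped like the fields described above (not real SDK objects).
_call = SimpleNamespace(
    type="connected_agent",
    connected_agent=SimpleNamespace(
        name="billing_agent", output="Duplicate charge refunded."
    ),
)
sample_steps = [
    SimpleNamespace(
        type="tool_calls", step_details=SimpleNamespace(tool_calls=[_call])
    ),
    SimpleNamespace(type="message_creation", step_details=SimpleNamespace()),
]

delegations = extract_delegations(sample_steps)
print([d.agent_name for d in delegations])  # ['billing_agent']
```

An empty result from this helper corresponds to the "no delegation" case: the orchestrator produced only message_creation steps.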

Common Multi-Agent Issues and Fixes

Symptom	Likely cause	Fix
Orchestrator answers itself instead of delegating	Tool description too broad or matches the orchestrator's own scope	Narrow the description; add "Do NOT handle this yourself" to the orchestrator system prompt
Orchestrator always delegates to the same specialist	Overlapping descriptions; one description is a superset of others	Make descriptions mutually exclusive; add explicit scope boundaries
Specialist returns a generic "I don't know"	Specialist system prompt too restrictive or its tools are missing	Review and expand the specialist's system prompt and tool configuration
End-to-end response is cut off or incoherent	Orchestrator not passing enough context to the specialist	Check what message/context the orchestrator sends; increase context in the invocation
Run never completes (timeout)	Specialist invoking a long-running tool or stuck in a loop	Check the specialist's run steps for stalled tool calls
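
For the "run never completes" row, a test harness usually needs its own deadline so that a stalled specialist fails the test instead of hanging it. The following is a rough sketch, not the SDK's own polling logic. The wait_for_run helper and its get_status callable are hypothetical; in real code get_status would re-fetch the run and return its status string. The set of terminal statuses is an assumption based on the statuses named in this lesson plus completed and cancelled.

```python
import time
from typing import Callable

# Assumed terminal run statuses; adjust to match the SDK version in use.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "expired"}


def wait_for_run(
    get_status: Callable[[], str],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until the run is terminal or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still '{status}' after {timeout_s}s")
        sleep(poll_s)


# Usage with a scripted sequence of statuses instead of a live run.
statuses = iter(["queued", "in_progress", "completed"])
final_status = wait_for_run(lambda: next(statuses), timeout_s=5.0, poll_s=0.0)
print(final_status)  # completed
```

When the helper raises TimeoutError, the next step is the one in the table: inspect the specialist's run steps for a stalled tool call.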

Crafting Test Queries That Exercise Routing

Good test queries for multi-agent systems should:

Target a single specialist — verify each specialist is reachable individually.
Span multiple specialists — verify the orchestrator can make two delegation calls in one turn.
Be ambiguous — verify the orchestrator routes to the correct specialist even when the query does not name the domain explicitly.
Include edge cases — queries that sit on the boundary between two specialists' scopes stress-test description quality.

Example test matrix for a three-specialist orchestrator (billing, HR, tech support):

Query	Expected delegation	Tests
"Why was I charged twice this month?"	billing_agent	Direct routing
"How do I request parental leave?"	hr_agent	Direct routing
"My laptop won't connect to VPN"	tech_support_agent	Direct routing
"I was charged for software I can't install"	billing_agent then tech_support_agent	Multi-hop routing
"Help me"	Orchestrator should ask a clarifying question	No premature delegation

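The matrix above translates naturally into a data-driven routing check. The following is a sketch under stated assumptions: check_routing and its route callable are hypothetical, and fake_route is a keyword-based stand-in so the example runs without a live orchestrator. In a real test, route would send the query, wait for the run, and return the connected_agent.name values found in the run steps.

```python
from typing import Callable, Dict, List, Tuple

# (query, expected delegations in order); rows copied from the matrix above.
ROUTING_CASES: List[Tuple[str, List[str]]] = [
    ("Why was I charged twice this month?", ["billing_agent"]),
    ("How do I request parental leave?", ["hr_agent"]),
    ("My laptop won't connect to VPN", ["tech_support_agent"]),
    (
        "I was charged for software I can't install",
        ["billing_agent", "tech_support_agent"],
    ),
    ("Help me", []),  # expect a clarifying question, so no delegation
]


def check_routing(
    route: Callable[[str], List[str]],
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return {query: (expected, actual)} for every case routed incorrectly."""
    failures: Dict[str, Tuple[List[str], List[str]]] = {}
    for query, expected in ROUTING_CASES:
        actual = route(query)
        if actual != expected:
            failures[query] = (expected, actual)
    return failures


def fake_route(query: str) -> List[str]:
    """Keyword-based stand-in for a real orchestrator call (illustration only)."""
    q = query.lower()
    agents: List[str] = []
    if "charged" in q:
        agents.append("billing_agent")
    if "leave" in q:
        agents.append("hr_agent")
    if "vpn" in q or "install" in q:
        agents.append("tech_support_agent")
    return agents


routing_failures = check_routing(fake_route)
print(routing_failures)  # {}
```

The comparison is order-sensitive because the multi-hop row expects billing before tech support. If the run steps come back newest first, sort them by creation time before comparing.
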
Validating End-to-End Response Quality

Multi-agent responses can fail at the seam between the specialist's output and the orchestrator's synthesis. Check for:

Factual consistency: does the orchestrator's final reply accurately reflect the specialist's output, or did it hallucinate a summary?
Completeness: did the orchestrator include all parts of the specialist's answer, or silently drop details?
Attribution: if the specialist cited sources or returned structured data, does the final response preserve them?
Latency: each additional agent hop is another LLM call and typically adds roughly 1–4 seconds; validate that total response time is within UX requirements.
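
Two of these checks lend themselves to cheap automation. The sketch below is a crude heuristic, not a substitute for human or model-based review: missing_facts flags numeric tokens (amounts, IDs, percentages) that appear in the specialist's output but not in the final reply, and within_latency_budget compares elapsed time with a per-hop allowance. Both function names, the regular expression, and the default budget values are assumptions chosen for illustration.

```python
import re
from typing import List

# Matches tokens such as 1042, 1,042, $49.99, or 15% (illustrative pattern).
_FACT_PATTERN = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")


def missing_facts(specialist_output: str, final_reply: str) -> List[str]:
    """Return numeric tokens from the specialist output absent from the reply."""
    present = set(_FACT_PATTERN.findall(final_reply))
    # dict.fromkeys de-duplicates while keeping first-seen order.
    wanted = dict.fromkeys(_FACT_PATTERN.findall(specialist_output))
    return [fact for fact in wanted if fact not in present]


def within_latency_budget(
    elapsed_s: float, hops: int, base_s: float = 2.0, per_hop_s: float = 3.0
) -> bool:
    """True if elapsed time fits a base allowance plus a per-hop allowance."""
    return elapsed_s <= base_s + hops * per_hop_s


specialist = "Invoice 1042 was charged twice: $49.99 on 3 May and $49.99 on 4 May."
final = "You were charged $49.99 twice on invoice 1042; the duplicate will be refunded."

dropped = missing_facts(specialist, final)
print(dropped)  # ['3', '4']
print(within_latency_budget(elapsed_s=7.5, hops=2))  # True
```

Here the reply kept the amount and the invoice number but dropped the two dates, which is the kind of silent omission the completeness check is meant to surface.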

Debugging Multi-Agent Flows Step by Step

Check the orchestrator run status: a status of failed or expired signals a problem before the specialist is even reached.
List run steps: look for tool_calls steps that contain connected_agent entries. Their presence confirms delegation occurred; a tool_calls step holding only function calls does not.
Inspect the tool call input: the orchestrator passes a message to the specialist; verify it contains the right context.
Inspect the tool call output: verify the specialist's reply is accurate and complete.
Check the specialist's own thread: if the specialist has a nested thread, retrieve it to see its own run steps and any sub-tool calls.
Review agent call logs: in the Foundry portal, navigate to Monitoring → Traces to see latency and error detail for each agent hop.
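
The first few steps of this checklist can be sketched as a single triage function. This is an illustrative outline only: triage_run is a hypothetical helper, the objects passed in are stubs, and it assumes the run exposes status and last_error and that run steps expose the connected_agent fields described earlier.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def triage_run(run: Any, run_steps: Iterable[Any]) -> str:
    """Return a one-line diagnosis following the checklist order."""
    # Step 1: a failed or expired run is a problem before any specialist.
    if run.status in ("failed", "expired"):
        return f"run {run.status}: {getattr(run, 'last_error', None)}"

    # Step 2: gather connected-agent calls from tool_calls steps.
    calls = [
        tc
        for step in run_steps
        if step.type == "tool_calls"
        for tc in step.step_details.tool_calls
        if tc.type == "connected_agent"
    ]
    if not calls:
        return "no delegation: orchestrator answered on its own"

    # Step 4: flag specialists that returned nothing.
    empty = [tc.connected_agent.name for tc in calls if not tc.connected_agent.output]
    if empty:
        return "empty specialist output: " + ", ".join(empty)
    return "delegated to: " + ", ".join(tc.connected_agent.name for tc in calls)


# Stub objects shaped like the fields described in this lesson.
ok_call = SimpleNamespace(
    type="connected_agent",
    connected_agent=SimpleNamespace(name="billing_agent", output="Refund issued."),
)
ok_steps = [
    SimpleNamespace(type="tool_calls", step_details=SimpleNamespace(tool_calls=[ok_call]))
]
ok_run = SimpleNamespace(status="completed", last_error=None)
failed_run = SimpleNamespace(status="failed", last_error="rate_limit_exceeded")

print(triage_run(ok_run, ok_steps))   # delegated to: billing_agent
print(triage_run(failed_run, []))     # run failed: rate_limit_exceeded
print(triage_run(ok_run, []))         # no delegation: orchestrator answered on its own
```

Checking the tool call input, the specialist's own thread, and portal traces still requires manual inspection; the function only narrows down where to look.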

Hands-On Lab

Hands-On: Validate Multi-Agent Delegation

Goal: Submit a test query to a multi-agent setup in the Foundry portal playground and verify delegation via run step inspection.

Open the orchestrator agent: in Azure AI Foundry, go to Agents and select your orchestrator.
Launch the playground: click Test in playground (top-right of the agent detail page).
Submit a targeted query: type a message that should route to one of your specialists (e.g., "I need help with my invoice from last month"). Press Enter.
Observe the response: note whether the answer contains specialist-level detail. High specificity suggests delegation occurred; a generic reply suggests the orchestrator answered itself. Treat this as a hint only, not as proof.
Retrieve run steps in code: using the Python SDK, call list_run_steps(thread_id, run_id) and print each step's type. Confirm a tool_calls step is present with connected_agent.name matching your specialist's tool name.
Test the boundary case: submit a query that sits on the scope boundary of two specialists. Verify the orchestrator delegates to the correct one; if not, refine the tool descriptions.
Test a multi-hop query: submit a query spanning two domains (e.g., "I was billed for software that is broken"). Confirm the response integrates output from both specialists and that two connected_agent tool calls appear in the run steps; they may sit in one tool_calls step or in two.

Exam Angle — What AI-3018 Tests

AI-3018 Assessment Focus

Run step inspection is the authoritative method for confirming delegation — the portal playground alone is not sufficient. Know the run step types and what their presence/absence means.

Exam Trap

"The Foundry portal playground shows which specialist was called" — The portal shows the final conversation thread but does not expose run step details. To confirm which specialist was invoked, you must use the SDK to list run steps and inspect tool call entries.

Exam Trap

"If the orchestrator responds correctly, delegation must have occurred" — The orchestrator may answer from its own knowledge without delegating, especially if its system prompt overlaps with a specialist's scope. Always verify via run step inspection.

Exam Trap

"A failed run on the orchestrator means the specialist failed" — The orchestrator run can fail for many reasons unrelated to specialists — model errors, quota limits, or invalid tool definitions. Check the run's last_error field before assuming the specialist is at fault.

Exam Trap

"Higher latency always means a routing problem" — Multi-agent responses are inherently slower than single-agent responses because each delegation is an additional LLM call. A modest latency increase of 1–4 seconds per specialist hop is expected and normal.

Exam Tip

Start testing with simple, targeted single-specialist queries before testing multi-hop or ambiguous queries. Isolate each specialist's reachability first, then add complexity.

Must Memorize

No tool_calls step in run steps = no delegation occurred. The orchestrator answered from its own knowledge. This is the definitive check — not the response text.

Q: How can you definitively confirm that the orchestrator delegated to a specific specialist?

Q: What does it mean if run step inspection shows only message_creation steps and no tool_calls steps?

Q: Which run step type indicates the orchestrator invoked a connected agent?

Q: Is increased latency after adding connected agents always a sign of a problem?

Q: What is the first query type to use when validating a new multi-agent topology?

Q: An orchestrator run fails with status 'failed'. Does this mean the specialist agent failed?

3.2 — Test Agent-to-Agent Conversation
