Testing Multi-Agent Systems
Once a connected-agent topology is wired up, you must validate that the orchestrator correctly delegates to specialists and that the end-to-end response is accurate and coherent. Azure AI Foundry provides two primary testing surfaces, the Agents playground in the portal and SDK-based run step inspection in code; SDK integration tests build on the second to automate regression checks.
Testing Surfaces at a Glance
| Surface | Best for | Visibility |
|---|---|---|
| Foundry portal playground | Quick smoke tests, iterating on system prompts | Conversation thread, basic run status |
| Run step inspection (API) | Deep debugging, seeing which tool was called and what it returned | Full tool call / tool result trace |
| SDK integration tests | CI/CD gate, regression coverage | Programmatic assertions on responses |
Each delegation to a specialist produces a run step in the orchestrator's run. A run step is a record of one unit of work performed during an agent run; delegation to a connected agent produces a run step of type `tool_calls` with `connected_agent` entries inside it, and each entry contains the name of the agent called and its output. A `tool_calls` run step therefore confirms that delegation occurred; inspect the `connected_agent.name` field to see which specialist was called (its value matches the tool name you assigned when registering the connected agent). A `message_creation` run step appears when the model generates its final reply. If only `message_creation` steps are present, with no `tool_calls` steps, the orchestrator answered from its own knowledge without delegating.
How Run Steps Reveal Delegation
When the orchestrator delegates to a specialist, the run on the orchestrator's thread produces a run step of type tool_calls. Each tool call entry contains:
- `type`: `"connected_agent"` (distinguishing it from function calls)
- `connected_agent.name`: the tool name of the specialist (e.g., `billing_agent`)
- `connected_agent.output`: the specialist's response text
You can retrieve these steps programmatically:
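```python
# Assumes project_client, thread, and run were created earlier.
# List the run steps recorded for the orchestrator's run.
run_steps = project_client.agents.list_run_steps(
    thread_id=thread.id,
    run_id=run.id,
)

for step in run_steps:
    # Only tool_calls steps can contain delegations.
    if step.type == "tool_calls":
        for tc in step.step_details.tool_calls:
            # Skip function calls and other non-delegation tool calls.
            if tc.type == "connected_agent":
                print(f"Agent called: {tc.connected_agent.name}")
                print(f"Specialist output: {tc.connected_agent.output}")
```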
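For assertions in a test suite, it can help to flatten the run steps into a list of delegated specialists. The following is a minimal sketch that assumes each step has already been converted to a plain dictionary carrying the fields listed above (`type`, `step_details.tool_calls`, `connected_agent.name`). The helper name `delegated_agents`, the dictionary shape, and the sample data are illustrative assumptions, not part of the SDK.

```python
from typing import Any


def delegated_agents(steps: list[dict[str, Any]]) -> list[str]:
    """Return the names of connected agents invoked, in the order the steps are given.

    Assumes each step is a plain dict shaped like the run step fields
    described above; this shape is an assumption for illustration.
    """
    names: list[str] = []
    for step in steps:
        if step.get("type") != "tool_calls":
            continue  # message_creation steps carry no delegation info
        for call in step.get("step_details", {}).get("tool_calls", []):
            if call.get("type") == "connected_agent":
                names.append(call["connected_agent"]["name"])
    return names


# Hypothetical sample data: one delegation followed by the final reply.
sample_steps = [
    {
        "type": "tool_calls",
        "step_details": {
            "tool_calls": [
                {
                    "type": "connected_agent",
                    "connected_agent": {
                        "name": "billing_agent",
                        "output": "Refund issued.",
                    },
                }
            ]
        },
    },
    {"type": "message_creation", "step_details": {}},
]

print(delegated_agents(sample_steps))
```

An empty list from this helper corresponds to the "no delegation" case: the orchestrator produced only `message_creation` steps.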
Common Multi-Agent Issues and Fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Orchestrator answers itself instead of delegating | Tool description too broad or matches the orchestrator's own scope | Narrow the description; add "Do NOT handle this yourself" to the orchestrator system prompt |
| Orchestrator always delegates to the same specialist | Overlapping descriptions; one description is a superset of others | Make descriptions mutually exclusive; add explicit scope boundaries |
| Specialist returns a generic "I don't know" | Specialist system prompt too restrictive or its tools are missing | Review and expand the specialist's system prompt and tool configuration |
| End-to-end response is cut off or incoherent | Orchestrator not passing enough context to the specialist | Check what message/context the orchestrator sends; increase context in the invocation |
| Run never completes (timeout) | Specialist invoking a long-running tool or stuck in a loop | Check the specialist's run steps for stalled tool calls |
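Overlapping tool descriptions are one cause in the table that can be screened before running any query. The sketch below is a rough heuristic, not a feature of Azure AI Foundry: it scores each pair of descriptions by word-set (Jaccard) overlap and flags pairs above a threshold. The agent names, descriptions, and the 0.5 threshold are made up for illustration, and lexical overlap is only a crude proxy for how a model actually interprets a description.

```python
import re
from itertools import combinations


def _words(text: str) -> set[str]:
    """Lowercased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def description_overlap(a: str, b: str) -> float:
    """Jaccard similarity of word sets: 0.0 = disjoint, 1.0 = identical."""
    wa, wb = _words(a), _words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def flag_overlaps(descriptions: dict[str, str], threshold: float = 0.5):
    """Return (agent_a, agent_b, score) for pairs at or above the threshold."""
    flagged = []
    pairs = combinations(descriptions.items(), 2)
    for (name_a, desc_a), (name_b, desc_b) in pairs:
        score = description_overlap(desc_a, desc_b)
        if score >= threshold:
            flagged.append((name_a, name_b, round(score, 2)))
    return flagged


# Hypothetical tool descriptions for three connected agents.
descriptions = {
    "billing_agent": "Handles invoices, charges, and refunds",
    "payments_agent": "Handles charges, refunds, and payment methods",
    "hr_agent": "Answers leave and benefits policy questions",
}

for name_a, name_b, score in flag_overlaps(descriptions):
    print(f"{name_a} and {name_b} overlap (score {score}); tighten their scopes")
```

A flagged pair is a prompt to review the two descriptions by hand, not proof of a routing bug.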
Crafting Test Queries That Exercise Routing
Good test queries for multi-agent systems should:
- Target a single specialist — verify each specialist is reachable individually.
- Span multiple specialists — verify the orchestrator can make two delegation calls in one turn.
- Be ambiguous — verify the orchestrator routes to the correct specialist even when the query does not name the domain explicitly.
- Include edge cases — queries that sit on the boundary between two specialists' scopes stress-test description quality.
Example test matrix for a three-specialist orchestrator (billing, HR, tech support):
| Query | Expected delegation | Tests |
|---|---|---|
| "Why was I charged twice this month?" | billing_agent | Direct routing |
| "How do I request parental leave?" | hr_agent | Direct routing |
| "My laptop won't connect to VPN" | tech_support_agent | Direct routing |
| "I was charged for software I can't install" | billing_agent then tech_support_agent | Multi-hop routing |
| "Help me" | Orchestrator should ask a clarifying question | No premature delegation |
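The matrix above can be kept as data so that the same expectations drive both manual checks and automated tests. This is a sketch under assumptions: `RoutingCase` and `check_routing` are hypothetical names, the observed delegations are assumed to come from run step inspection, and the comparison ignores call order for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingCase:
    query: str
    expected: tuple[str, ...]  # empty tuple = no delegation expected


ROUTING_MATRIX = [
    RoutingCase("Why was I charged twice this month?", ("billing_agent",)),
    RoutingCase("How do I request parental leave?", ("hr_agent",)),
    RoutingCase("My laptop won't connect to VPN", ("tech_support_agent",)),
    RoutingCase(
        "I was charged for software I can't install",
        ("billing_agent", "tech_support_agent"),
    ),
    RoutingCase("Help me", ()),
]


def check_routing(case: RoutingCase, actual: list[str]) -> list[str]:
    """Compare observed delegations with the expectation; return problems found."""
    problems = []
    missing = [a for a in case.expected if a not in actual]
    unexpected = [a for a in actual if a not in case.expected]
    if missing:
        problems.append(f"{case.query!r}: missing delegation to {missing}")
    if unexpected:
        problems.append(f"{case.query!r}: unexpected delegation to {unexpected}")
    return problems


# Hypothetical observation: the multi-hop query reached only one specialist.
observed = ["billing_agent"]
for problem in check_routing(ROUTING_MATRIX[3], observed):
    print(problem)
```

An empty problem list means the observed delegations match the expected set for that query.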
Validating End-to-End Response Quality
Multi-agent responses can fail at the seam between the specialist's output and the orchestrator's synthesis. Check for:
- Factual consistency — does the orchestrator's final reply accurately reflect the specialist's output, or did it hallucinate a summary?
- Completeness — did the orchestrator include all parts of the specialist's answer, or silently drop details?
- Attribution — if the specialist cited sources or returned structured data, does the final response preserve them?
- Latency — each additional agent hop adds ~1–3 seconds; validate that total response time is within UX requirements.
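Part of the completeness and attribution checks can be approximated mechanically. The sketch below is a heuristic, not an evaluation method provided by Azure AI Foundry: it extracts numbers and URLs from the specialist's output and reports those that do not appear in the orchestrator's final reply. The function name `dropped_facts` and the sample strings are assumptions for illustration, and the plain substring check can miss paraphrased or reformatted values.

```python
import re

# URLs are matched first so digits inside a URL are not counted separately.
_FACT_PATTERN = re.compile(r"https?://\S+|\d+(?:[.,]\d+)*")


def dropped_facts(specialist_output: str, final_reply: str) -> list[str]:
    """Return numbers and URLs found in the specialist output but not in the reply."""
    raw = _FACT_PATTERN.findall(specialist_output)
    # Trim sentence punctuation that the URL pattern may have swallowed.
    facts = [fact.rstrip(".,;)") for fact in raw]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [fact for fact in dict.fromkeys(facts) if fact not in final_reply]


# Hypothetical specialist output and orchestrator reply.
specialist = (
    "Duplicate charge of 49.99 on invoice 10432 was refunded. "
    "See https://example.com/refunds."
)
reply = "You were refunded 49.99 for the duplicate charge."

print(dropped_facts(specialist, reply))
```

Anything this check reports still needs a human look: a dropped invoice number may be a real omission, or the orchestrator may have deliberately summarized.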
Debugging Multi-Agent Flows Step by Step
- Check the orchestrator run status: a status of `failed` or `expired` signals a problem before the run even reaches the specialist.
- List run steps: look for `tool_calls` steps. Their presence confirms delegation occurred.
- Inspect the tool call input: the orchestrator passes a message to the specialist; verify it contains the right context.
- Inspect the tool call output: verify the specialist's reply is accurate and complete.
- Check the specialist's own thread: if the specialist has a nested thread, retrieve it to see its own run steps and any sub-tool calls.
- Review agent call logs: in the Foundry portal, navigate to Monitoring → Traces to see latency and error detail for each agent hop.
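The first few checks in this list can be sketched as a small triage function. It assumes the run and its steps have been converted to plain dictionaries with `status`, `last_error`, `type`, and `step_details` keys; that shape and the function name `triage` are illustrative assumptions rather than SDK types.

```python
from typing import Any


def triage(run: dict[str, Any], steps: list[dict[str, Any]]) -> str:
    """Walk the first checks in order and return the first finding."""
    status = run.get("status")
    if status in ("failed", "expired"):
        # The run stopped early; its own error comes before any specialist checks.
        error = run.get("last_error") or "no last_error recorded"
        return f"run {status}: {error}"

    # Collect every connected-agent entry from tool_calls steps.
    calls = [
        call
        for step in steps
        if step.get("type") == "tool_calls"
        for call in step.get("step_details", {}).get("tool_calls", [])
        if call.get("type") == "connected_agent"
    ]
    if not calls:
        return "no delegation: orchestrator answered without calling a specialist"

    empty = [
        c["connected_agent"]["name"]
        for c in calls
        if not c["connected_agent"].get("output")
    ]
    if empty:
        return f"delegated, but empty output from: {', '.join(empty)}"
    return "delegation ok: " + ", ".join(c["connected_agent"]["name"] for c in calls)


# Hypothetical inputs: a failed run with no steps recorded.
print(triage({"status": "failed", "last_error": "rate limit"}, []))
```

The function stops at the first finding on purpose, mirroring the order of the checklist: there is little value in inspecting specialist output when the orchestrator run itself failed.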
Hands-On: Validate Multi-Agent Delegation
Goal: Submit a test query to a multi-agent setup in the Foundry portal playground and verify delegation via run step inspection.
- Open the orchestrator agent: in Azure AI Foundry → Agents, select your orchestrator.
- Launch the playground: click Test in playground (top-right of the agent detail page).
- Submit a targeted query: type a message that should route to one of your specialists (e.g., "I need help with my invoice from last month"). Press Enter.
- Observe the response: note whether the answer contains specialist-level detail. High specificity suggests delegation occurred; a generic reply suggests the orchestrator answered itself.
- Retrieve run steps in code: using the Python SDK, call `list_run_steps(thread_id, run_id)` and print each step's type. Confirm a `tool_calls` step is present with `connected_agent.name` matching your specialist's tool name.
- Test the boundary case: submit a query that sits on the scope boundary of two specialists. Verify the orchestrator delegates to the correct one; if not, refine the tool descriptions.
- Test a multi-hop query: submit a query spanning two domains (e.g., "I was billed for software that is broken"). Confirm the response integrates output from both specialists and that two `connected_agent` tool call entries appear in the run steps, whether inside one `tool_calls` step or spread across two.
AI-3018 Assessment Focus
Run step inspection is the authoritative method for confirming delegation — the portal playground alone is not sufficient. Know the run step types and what their presence/absence means.
Exam Trap
"The Foundry portal playground shows which specialist was called" — The portal shows the final conversation thread but does not expose run step details. To confirm which specialist was invoked, you must use the SDK to list run steps and inspect tool call entries.
Exam Trap
"If the orchestrator responds correctly, delegation must have occurred" — The orchestrator may answer from its own knowledge without delegating, especially if its system prompt overlaps with a specialist's scope. Always verify via run step inspection.
Exam Trap
"A failed run on the orchestrator means the specialist failed" — The orchestrator run can fail for many reasons unrelated to specialists — model errors, quota limits, or invalid tool definitions. Check the run's last_error field before assuming the specialist is at fault.
Exam Trap
"Higher latency always means a routing problem" — Multi-agent responses are inherently slower than single-agent responses because each delegation is an additional LLM call. A modest latency increase of 1–4 seconds per specialist hop is expected and normal.
Exam Tip
Start testing with simple, targeted single-specialist queries before testing multi-hop or ambiguous queries. Isolate each specialist's reachability first, then add complexity.
Must Memorize
No tool_calls step in run steps = no delegation occurred. The orchestrator answered from its own knowledge. This is the definitive check — not the response text.
Q: How can you definitively confirm that the orchestrator delegated to a specific specialist?
Q: What does it mean if run step inspection shows only message_creation steps and no tool_calls steps?
Q: Which run step type indicates the orchestrator invoked a connected agent?
Q: Is increased latency after adding connected agents always a sign of a problem?
Q: What is the first query type to use when validating a new multi-agent topology?
Q: An orchestrator run fails with status 'failed'. Does this mean the specialist agent failed?