AI-3018 Learning Portal
Objective 3.2 | 35 min | high priority | Tags: multi-agent, run-steps, debugging, connected-agent, routing-validation, tool-calls

3.2 — Test Agent-to-Agent Conversation

Submit structured test queries to validate multi-agent handoff and delegation, and inspect run steps to confirm which specialist agent handled each query.

Prerequisites: 3.1
Concept — What & Why

Testing Multi-Agent Systems

Once a connected-agent topology is wired up, you must validate that the orchestrator correctly delegates to specialists and that the end-to-end response is accurate and coherent. Azure AI Foundry provides two primary testing surfaces: the Agents playground in the portal and the SDK-based run step inspection in code.

Testing Surfaces at a Glance

Surface | Best for | Visibility
Foundry portal playground | Quick smoke tests, iterating on system prompts | Conversation thread, basic run status
Run step inspection (API) | Deep debugging, seeing which tool was called and what it returned | Full tool call / tool result trace
SDK integration tests | CI/CD gate, regression coverage | Programmatic assertions on responses

Each delegation to a specialist produces a run step in the orchestrator's run. A run step is a record of one unit of work performed during an agent run; delegation to a connected agent produces a run step of type tool_calls with connected_agent entries inside it. A tool_calls run step appears whenever the orchestrator invokes a tool, including a connected agent, and each tool call entry contains the name of the agent called and its output. Its presence confirms that delegation occurred; inspect the connected_agent.name field, whose value matches the tool name you assigned when registering the connected agent, to see which specialist was called. A message_creation run step appears when the model generates its final reply. If only message_creation steps are present, with no tool_calls steps, the orchestrator answered from its own knowledge without delegating.

Deep Dive — How It Works

How Run Steps Reveal Delegation

When the orchestrator delegates to a specialist, the run on the orchestrator's thread produces a run step of type tool_calls. Each tool call entry contains:

  • type: "connected_agent" (distinguishing it from function calls)
  • connected_agent.name: the tool name of the specialist (e.g., billing_agent)
  • connected_agent.output: the specialist's response text

You can retrieve these steps programmatically:

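# Assumes project_client, thread, and run were created earlier.
# List the run steps recorded for the orchestrator's run.
run_steps = project_client.agents.list_run_steps(
    thread_id=thread.id,
    run_id=run.id,
)
for step in run_steps:
    if step.type == "tool_calls":
        for tc in step.step_details.tool_calls:
            # Only connected_agent entries indicate delegation; skip other tool calls.
            if tc.type != "connected_agent":
                continue
            print(f"Agent called: {tc.connected_agent.name}")
            print(f"Specialist output: {tc.connected_agent.output}")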
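
To reuse that traversal in assertions, it can be wrapped in a small helper. The sketch below is not SDK code. It assumes each run step has already been converted to a plain dict whose keys mirror the fields listed above (type, step_details.tool_calls, connected_agent.name, connected_agent.output), and the helper name delegated_agents is hypothetical. With real SDK objects you would use attribute access instead of dict lookups.

```python
from typing import Any


def delegated_agents(run_steps: list[dict[str, Any]]) -> list[str]:
    """Return the connected-agent names found in tool_calls steps, in order."""
    names: list[str] = []
    for step in run_steps:
        if step.get("type") != "tool_calls":
            continue  # message_creation steps carry no delegation info
        for call in step.get("step_details", {}).get("tool_calls", []):
            if call.get("type") == "connected_agent":
                names.append(call["connected_agent"]["name"])
    return names


# Hand-written sample data in the assumed shape (not real service output).
sample_steps = [
    {
        "type": "tool_calls",
        "step_details": {
            "tool_calls": [
                {
                    "type": "connected_agent",
                    "connected_agent": {"name": "billing_agent", "output": "..."},
                }
            ]
        },
    },
    {"type": "message_creation", "step_details": {}},
]

print(delegated_agents(sample_steps))  # ['billing_agent']
```

An empty list from this helper corresponds to the "no delegation" case: the orchestrator produced only message_creation steps.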

Common Multi-Agent Issues and Fixes

Symptom | Likely cause | Fix
Orchestrator answers itself instead of delegating | Tool description too broad or matches the orchestrator's own scope | Narrow the description; add "Do NOT handle this yourself" to the orchestrator system prompt
Orchestrator always delegates to the same specialist | Overlapping descriptions; one description is a superset of others | Make descriptions mutually exclusive; add explicit scope boundaries
Specialist returns a generic "I don't know" | Specialist system prompt too restrictive or its tools are missing | Review and expand the specialist's system prompt and tool configuration
End-to-end response is cut off or incoherent | Orchestrator not passing enough context to the specialist | Check what message/context the orchestrator sends; increase context in the invocation
Run never completes (timeout) | Specialist invoking a long-running tool or stuck in a loop | Check the specialist's run steps for stalled tool calls
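
Overlapping descriptions can be hard to spot by eye. As a rough aid, the sketch below scores each pair of tool descriptions by the Jaccard similarity of their word sets. This is a crude heuristic chosen for illustration, not something the Foundry service computes, and the function name and sample descriptions are hypothetical. A high score only suggests that two descriptions deserve a closer look.

```python
import re
from itertools import combinations


def _words(text: str) -> set[str]:
    """Lowercased alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def description_overlap(descriptions: dict[str, str]) -> dict[tuple[str, str], float]:
    """Jaccard similarity of word sets for every pair of tool descriptions."""
    scores: dict[tuple[str, str], float] = {}
    for (name_a, text_a), (name_b, text_b) in combinations(descriptions.items(), 2):
        a, b = _words(text_a), _words(text_b)
        union = a | b
        scores[(name_a, name_b)] = len(a & b) / len(union) if union else 0.0
    return scores


# Hypothetical descriptions for the three specialists used in this module.
descriptions = {
    "billing_agent": "Handles invoices, charges, refunds and payment questions",
    "hr_agent": "Handles leave requests, benefits and payroll questions",
    "tech_support_agent": "Troubleshoots laptops, VPN and software installation",
}

scores = description_overlap(descriptions)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

Word overlap ignores meaning, so two descriptions with different wording can still cover the same scope. Routing tests remain the real check.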

Crafting Test Queries That Exercise Routing

Good test queries for multi-agent systems should:

  • Target a single specialist — verify each specialist is reachable individually.
  • Span multiple specialists — verify the orchestrator can make two delegation calls in one turn.
  • Be ambiguous — verify the orchestrator routes to the correct specialist even when the query does not name the domain explicitly.
  • Include edge cases — queries that sit on the boundary between two specialists' scopes stress-test description quality.

Example test matrix for a three-specialist orchestrator (billing, HR, tech support):

Query | Expected delegation | Tests
"Why was I charged twice this month?" | billing_agent | Direct routing
"How do I request parental leave?" | hr_agent | Direct routing
"My laptop won't connect to VPN" | tech_support_agent | Direct routing
"I was charged for software I can't install" | billing_agent then tech_support_agent | Multi-hop routing
"Help me" | Orchestrator should ask a clarifying question | No premature delegation

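A test matrix like the one above can be kept as data and replayed against the system. The following is a minimal sketch under stated assumptions: RoutingCase, run_matrix, and the get_delegations callable are hypothetical names, and the canned observed dictionary stands in for real runs whose run steps you would inspect. The last canned entry misroutes on purpose to show what a failure report looks like.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RoutingCase:
    query: str
    expected_agents: tuple[str, ...]  # empty tuple = no delegation expected


TEST_MATRIX = [
    RoutingCase("Why was I charged twice this month?", ("billing_agent",)),
    RoutingCase("How do I request parental leave?", ("hr_agent",)),
    RoutingCase("My laptop won't connect to VPN", ("tech_support_agent",)),
    RoutingCase("I was charged for software I can't install",
                ("billing_agent", "tech_support_agent")),
    RoutingCase("Help me", ()),
]


def run_matrix(cases: list[RoutingCase],
               get_delegations: Callable[[str], list[str]]):
    """Return (query, expected, actual) for every case that was misrouted."""
    failures = []
    for case in cases:
        actual = tuple(get_delegations(case.query))
        if actual != case.expected_agents:  # order-sensitive comparison
            failures.append((case.query, case.expected_agents, actual))
    return failures


# Canned observations standing in for real runs (assumption, not service output).
observed = {
    "Why was I charged twice this month?": ["billing_agent"],
    "How do I request parental leave?": ["hr_agent"],
    "My laptop won't connect to VPN": ["tech_support_agent"],
    "I was charged for software I can't install": ["billing_agent", "tech_support_agent"],
    "Help me": ["billing_agent"],  # deliberate premature delegation
}

failures = run_matrix(TEST_MATRIX, lambda query: observed[query])
for query, expected, actual in failures:
    print(f"MISROUTED {query!r}: expected {expected}, got {actual}")
```

In a real suite, get_delegations would submit the query, wait for the run to finish, and extract the connected agent names from its run steps.
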
Validating End-to-End Response Quality

Multi-agent responses can fail at the seam between the specialist's output and the orchestrator's synthesis. Check for:

  • Factual consistency — does the orchestrator's final reply accurately reflect the specialist's output, or did it hallucinate a summary?
  • Completeness — did the orchestrator include all parts of the specialist's answer, or silently drop details?
  • Attribution — if the specialist cited sources or returned structured data, does the final response preserve them?
  • Latency: each additional agent hop adds roughly 1–4 seconds, because every delegation is an extra LLM call; validate that total response time is within UX requirements.
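
Two of these checks, completeness and latency, lend themselves to simple automation. The sketch below is illustrative only: missing_details does a case-insensitive substring check for key facts you pick by hand, timed is a generic stopwatch wrapper, and the sample strings and the 5-second budget are hypothetical. Factual consistency generally still needs human or model-based review.

```python
import time


def missing_details(final_reply: str, required_phrases: list[str]) -> list[str]:
    """Return the required phrases that do not appear in the final reply."""
    reply = final_reply.lower()
    return [p for p in required_phrases if p.lower() not in reply]


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical sample data, not real service output.
specialist_output = "Duplicate charge of $49.00 on 3 March. Refund issued, reference RF-1182."
final_reply = "You were charged $49.00 twice, and a refund has been issued."
required = ["$49.00", "RF-1182"]  # key facts picked by hand from specialist_output

# Here the wrapper times the check itself; in practice you would wrap the
# call that submits the query and waits for the run to finish.
dropped, elapsed = timed(missing_details, final_reply, required)
print("Dropped details:", dropped)  # Dropped details: ['RF-1182']

LATENCY_BUDGET_S = 5.0  # hypothetical UX budget
print("Within budget:", elapsed <= LATENCY_BUDGET_S)
```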

Debugging Multi-Agent Flows Step by Step

  1. Check the orchestrator run status — a status of failed or expired signals a problem before even reaching the specialist.
  2. List run steps — look for tool_calls steps. Their presence confirms delegation occurred.
  3. Inspect the tool call input — the orchestrator passes a message to the specialist; verify it contains the right context.
  4. Inspect the tool call output — verify the specialist's reply is accurate and complete.
  5. Check the specialist's own thread — if the specialist has a nested thread, retrieve it to see its own run steps and any sub-tool calls.
  6. Review agent call logs: in the Foundry portal, navigate to Monitoring → Traces to see latency and error detail for each agent hop.
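
The first few checks in this list can be expressed as an ordered triage function. This is a sketch, not SDK code: it assumes the run status is available as a string and that run steps have been converted to plain dicts mirroring the fields described earlier (type, step_details.tool_calls, connected_agent.name, connected_agent.output). The function name triage and the sample data are hypothetical.

```python
def triage(run_status: str, run_steps: list[dict]) -> str:
    """Walk the checklist in order and return the first finding."""
    # Step 1: a failed or expired run is a problem before any specialist is reached.
    if run_status in ("failed", "expired"):
        return "run did not complete: check the run's error detail first"

    # Step 2: collect connected_agent entries from tool_calls steps.
    calls = [
        call
        for step in run_steps
        if step.get("type") == "tool_calls"
        for call in step.get("step_details", {}).get("tool_calls", [])
        if call.get("type") == "connected_agent"
    ]
    if not calls:
        return "no delegation: orchestrator answered on its own"

    # Step 4: flag specialists that returned nothing.
    empty = [c["connected_agent"]["name"] for c in calls
             if not c["connected_agent"].get("output")]
    if empty:
        return f"delegated, but empty output from: {', '.join(empty)}"

    names = ", ".join(c["connected_agent"]["name"] for c in calls)
    return f"delegated to: {names}; review output and specialist thread next"


# Hand-written sample data in the assumed shape.
delegated_steps = [
    {"type": "tool_calls", "step_details": {"tool_calls": [
        {"type": "connected_agent",
         "connected_agent": {"name": "billing_agent", "output": "Refund issued."}},
    ]}},
    {"type": "message_creation", "step_details": {}},
]

print(triage("completed", delegated_steps))
print(triage("completed", [{"type": "message_creation", "step_details": {}}]))
print(triage("failed", []))
```

Inspecting the message passed to the specialist, the specialist's own thread, and portal traces (steps 3, 5, and 6) is left out because the sketch makes no assumption about how that data is exposed.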
Hands-On Lab

Hands-On: Validate Multi-Agent Delegation

Goal: Submit a test query to a multi-agent setup in the Foundry portal playground and verify delegation via run step inspection.

  1. Open the orchestrator agent in Azure AI Foundry → Agents → select your orchestrator.
  2. Launch the playground — click Test in playground (top-right of the agent detail page).
  3. Submit a targeted query — type a message that should route to one of your specialists (e.g., "I need help with my invoice from last month"). Press Enter.
  4. Observe the response — note whether the answer contains specialist-level detail. High specificity suggests delegation occurred; a generic reply suggests the orchestrator answered itself.
  5. Retrieve run steps in code — using the Python SDK, call list_run_steps(thread_id, run_id) and print each step's type. Confirm a tool_calls step is present with connected_agent.name matching your specialist's tool name.
  6. Test the boundary case — submit a query that sits on the scope boundary of two specialists. Verify the orchestrator delegates to the correct one; if not, refine the tool descriptions.
  7. Test a multi-hop query: submit a query spanning two domains (e.g., "I was billed for software that is broken"). Confirm the response integrates output from both specialists and that the run steps contain two connected_agent tool calls, one per specialist (they may appear in a single tool_calls step or in two separate steps).
Exam Angle — What AI-3018 Tests

AI-3018 Assessment Focus

Run step inspection is the authoritative method for confirming delegation — the portal playground alone is not sufficient. Know the run step types and what their presence/absence means.

Exam Trap

"The Foundry portal playground shows which specialist was called" — The portal shows the final conversation thread but does not expose run step details. To confirm which specialist was invoked, you must use the SDK to list run steps and inspect tool call entries.

Exam Trap

"If the orchestrator responds correctly, delegation must have occurred" — The orchestrator may answer from its own knowledge without delegating, especially if its system prompt overlaps with a specialist's scope. Always verify via run step inspection.

Exam Trap

"A failed run on the orchestrator means the specialist failed" — The orchestrator run can fail for many reasons unrelated to specialists — model errors, quota limits, or invalid tool definitions. Check the run's last_error field before assuming the specialist is at fault.

Exam Trap

"Higher latency always means a routing problem" — Multi-agent responses are inherently slower than single-agent responses because each delegation is an additional LLM call. A modest latency increase of 1–4 seconds per specialist hop is expected and normal.

Exam Tip

Start testing with simple, targeted single-specialist queries before testing multi-hop or ambiguous queries. Isolate each specialist's reachability first, then add complexity.

Must Memorize

No tool_calls step in run steps = no delegation occurred. The orchestrator answered from its own knowledge. This is the definitive check — not the response text.

Q: How can you definitively confirm that the orchestrator delegated to a specific specialist?

Q: What does it mean if run step inspection shows only message_creation steps and no tool_calls steps?

Q: Which run step type indicates the orchestrator invoked a connected agent?

Q: Is increased latency after adding connected agents always a sign of a problem?

Q: What is the first query type to use when validating a new multi-agent topology?

Q: An orchestrator run fails with status 'failed'. Does this mean the specialist agent failed?
