What Is a Model Deployment?
A Deployment NameThe free-form string you assign when creating a deployment (e.g., gpt4o-agent-prod). Agents reference the deployment name, not the underlying model name or version. in Azure AI Foundry is a named, managed instance of a language model that exposes an API endpoint your agents (and other applications) call. Think of it as renting a specific model version and provisioning the compute capacity to serve it. You cannot point an agent at "GPT-4o in general" — you must point it at a specific named deployment in your Foundry project.
Model deployments in Foundry serve two roles:
- The AI brain behind an agent — when an agent receives a user message it sends that message (plus system instructions, tool schemas, and thread history) to the configured model deployment.
- A standalone inference endpoint — used by applications calling the Chat Completions API directly, without agent tooling.
Quota is governed by Tokens Per Minute (TPM)The key quota metric governing how many tokens your deployment can process per minute across all requests. Exceeding TPM results in HTTP 429 Too Many Requests errors.. For reserved capacity workloads, you can provision Provisioned Throughput Units (PTU)Reserved compute capacity for Provisioned deployments. PTUs provide predictable latency and throughput but are billed hourly regardless of actual usage.. Agents depend on Function CallingA model capability required by agents to invoke tools. The model receives a tool schema and returns a structured JSON call that the Agents runtime executes. support in the underlying model to invoke tools reliably.
Supported Model Families for Agent Use
Not every model in the catalog supports agent use. Agents require function calling support (to invoke tools) and chat completion format. The primary supported families are:
| Model Family | Agent Support | Notes |
|---|---|---|
| GPT-4o | Full | Recommended for production agents; supports function calling, vision, and structured outputs |
| GPT-4o mini | Full | Lighter, cheaper; good for high-volume or simple agents |
| GPT-4 Turbo | Full | Predecessor to GPT-4o; still widely used |
| GPT-4 | Full | Original GPT-4; slower and more expensive than Turbo variants |
| GPT-3.5 Turbo | Partial | Supports function calling but less reliable for complex tool use |
| o1 / o1-mini | Limited | Reasoning-focused; limited tool use support at GA |
| Phi-3 / Phi-4 | No | Small language models; no function calling as of GA |
| Embedding models | No | Used for vector search, not for agent reasoning |
GPT-4o is the recommended choice for AI-3018 scenarios — it reliably handles multi-step tool use, structured JSON output, and long system instructions.
Deployment Types
| Deployment Type | Description | Best For |
|---|---|---|
| Standard | Pay-as-you-go; shared compute; quota tied to your subscription region | Development, testing, low-volume production |
| Global Standard | Routes traffic globally to Microsoft datacenters for best availability; still pay-as-you-go | Higher throughput needs with no data residency requirement |
| Provisioned | Reserved compute (PTU — Provisioned Throughput Units); predictable latency and throughput; billed hourly regardless of use | High-volume production with SLA requirements |
| Data Zone Standard | Routes within a specific geographic data zone (EU, US) for data residency compliance | Regulated industries needing data boundary control |
For the AI-3018 exam lab environments, Standard deployment is used. In production scenarios, Global Standard is the common recommendation for agent workloads where occasional latency spikes are acceptable.
Token Quotas and Rate Limits
Tokens Per Minute (TPM) is the key quota metric. It controls how many tokens your deployment can process per minute across all requests. For agents, this quota matters because:
- Each agent turn consumes tokens for the prompt (system instructions + thread history + user message) and the completion (agent response + any tool call JSON).
- Long thread histories and verbose system instructions consume prompt tokens on every turn.
- Tool calls add extra tokens for the function schema sent to the model and the structured JSON response.
| Quota Type | What it limits | Impact on agents |
|---|---|---|
| TPM (Tokens Per Minute) | Total tokens processed per minute | Hitting this causes 429 Too Many Requests errors |
| RPM (Requests Per Minute) | Number of API calls per minute | Hitting this throttles concurrent agent threads |
| Model quota | Subscription-wide TPM cap per model | Shared across all deployments of the same model |
Practical guidance: For agent workloads, request at least 30K–100K TPM for development.
How to Deploy a Model: Step-by-Step
From the Deployments blade (recommended for agents):
- Open your Foundry Project at ai.azure.com.
- In the left nav, click Models + endpoints → Deployments.
- Click + Deploy model → Deploy base model.
- In the model catalog, search for gpt-4o and click on it.
- Click Confirm to proceed to the deployment configuration screen.
- Set the Deployment name (e.g.,
gpt-4o-agent). This name is what your agent references — choose something meaningful. - Select the Deployment type (Standard for labs).
- Adjust Tokens per minute (TPM) quota using the slider.
- Click Deploy. The deployment is ready in seconds to minutes.
Hands-On: Deploy a GPT-4o Model for Agent Use
Goal: Deploy a GPT-4o model in your Foundry Project and verify it is available for agent use.
- Open your Foundry Project at ai.azure.com.
- In the left navigation, click Models + endpoints → Deployments.
- Click + Deploy model → Deploy base model.
- In the search box of the model catalog, type gpt-4o and press Enter.
- Click the gpt-4o card (not gpt-4o-mini), then click Confirm.
- On the deployment configuration page, set Deployment name to
gpt-4o-agent. - Leave Deployment type as Standard.
- Use the TPM slider to set 40,000 tokens per minute (sufficient for lab work).
- Click Deploy and wait for the status to show Succeeded (usually under 60 seconds).
- Navigate to Agents in the left nav and click + New agent. Confirm that
gpt-4o-agentappears in the Deployment dropdown — this confirms the deployment is usable by agents.
AI-3018 Assessment Focus
Deployment type selection and model capability are frequent exam targets. Know which models support function calling and what each deployment type offers.
Exam Trap
"You can name a deployment the same as the model" — You can, but it is not required. The deployment name is a free-form string you choose; the underlying model is a separate field. Agents reference the deployment name, not the model name.
Exam Trap
"Standard and Global Standard deployments have different token costs" — Token pricing per input/output token is the same between Standard and Global Standard. The difference is routing (global vs. regional) and availability — not per-token cost.
Exam Trap
"Provisioned deployments scale automatically" — Provisioned (PTU) deployments have fixed reserved capacity. Requests beyond that capacity are throttled unless you also have a Standard fallback deployment. They do not auto-scale like cloud VMs.
Exam Trap
"Embedding model deployments can power agents" — Embedding models produce vector representations, not natural language completions. They cannot be selected as an agent's model deployment and do not support function calling.
Exam Trap
"GPT-3.5 Turbo is equivalent to GPT-4o for agents" — GPT-3.5 Turbo technically supports the tools parameter, but it is significantly less reliable for complex multi-step tool use, structured JSON adherence, and instruction following compared to GPT-4o.
Exam Tip
For any scenario involving tool use failures with Phi-3 or other small models — the root cause is always missing function calling support. Fix = switch to GPT-4o or GPT-4 Turbo.
Question — click to flip
Q: What must an agent reference to use a model — the model name or the deployment name?
Question — click to flip
Q: Which deployment type provides reserved throughput billed hourly regardless of actual usage?
Question — click to flip
Q: Why can't a Phi-3 deployment power an agent with tools?
Question — click to flip
Q: What HTTP error indicates a deployment has exceeded its TPM quota?
Question — click to flip
Q: What is the difference between Standard and Global Standard deployment types?
Question — click to flip
Q: Can embedding model deployments be used as an agent's underlying model?