AI-3018 Learning Portal
Objective 1.2 30 minhigh prioritymodel-deploymentgpt-4oprovisionedtpmfunction-calling

1.2 — Create a New Completion Model Deployment

Deploy a completion model in Azure AI Foundry that powers your agent, understanding deployment types and quota.

Prerequisites: 1.1
Concept — What & Why

What Is a Model Deployment?

A Deployment NameThe free-form string you assign when creating a deployment (e.g., gpt4o-agent-prod). Agents reference the deployment name, not the underlying model name or version. in Azure AI Foundry is a named, managed instance of a language model that exposes an API endpoint your agents (and other applications) call. Think of it as renting a specific model version and provisioning the compute capacity to serve it. You cannot point an agent at "GPT-4o in general" — you must point it at a specific named deployment in your Foundry project.

Model deployments in Foundry serve two roles:

  1. The AI brain behind an agent — when an agent receives a user message it sends that message (plus system instructions, tool schemas, and thread history) to the configured model deployment.
  2. A standalone inference endpoint — used by applications calling the Chat Completions API directly, without agent tooling.

Quota is governed by Tokens Per Minute (TPM)The key quota metric governing how many tokens your deployment can process per minute across all requests. Exceeding TPM results in HTTP 429 Too Many Requests errors.. For reserved capacity workloads, you can provision Provisioned Throughput Units (PTU)Reserved compute capacity for Provisioned deployments. PTUs provide predictable latency and throughput but are billed hourly regardless of actual usage.. Agents depend on Function CallingA model capability required by agents to invoke tools. The model receives a tool schema and returns a structured JSON call that the Agents runtime executes. support in the underlying model to invoke tools reliably.

Deep Dive — How It Works

Supported Model Families for Agent Use

Not every model in the catalog supports agent use. Agents require function calling support (to invoke tools) and chat completion format. The primary supported families are:

Model FamilyAgent SupportNotes
GPT-4oFullRecommended for production agents; supports function calling, vision, and structured outputs
GPT-4o miniFullLighter, cheaper; good for high-volume or simple agents
GPT-4 TurboFullPredecessor to GPT-4o; still widely used
GPT-4FullOriginal GPT-4; slower and more expensive than Turbo variants
GPT-3.5 TurboPartialSupports function calling but less reliable for complex tool use
o1 / o1-miniLimitedReasoning-focused; limited tool use support at GA
Phi-3 / Phi-4NoSmall language models; no function calling as of GA
Embedding modelsNoUsed for vector search, not for agent reasoning

GPT-4o is the recommended choice for AI-3018 scenarios — it reliably handles multi-step tool use, structured JSON output, and long system instructions.

Deployment Types

Deployment TypeDescriptionBest For
StandardPay-as-you-go; shared compute; quota tied to your subscription regionDevelopment, testing, low-volume production
Global StandardRoutes traffic globally to Microsoft datacenters for best availability; still pay-as-you-goHigher throughput needs with no data residency requirement
ProvisionedReserved compute (PTU — Provisioned Throughput Units); predictable latency and throughput; billed hourly regardless of useHigh-volume production with SLA requirements
Data Zone StandardRoutes within a specific geographic data zone (EU, US) for data residency complianceRegulated industries needing data boundary control

For the AI-3018 exam lab environments, Standard deployment is used. In production scenarios, Global Standard is the common recommendation for agent workloads where occasional latency spikes are acceptable.

Token Quotas and Rate Limits

Tokens Per Minute (TPM) is the key quota metric. It controls how many tokens your deployment can process per minute across all requests. For agents, this quota matters because:

  • Each agent turn consumes tokens for the prompt (system instructions + thread history + user message) and the completion (agent response + any tool call JSON).
  • Long thread histories and verbose system instructions consume prompt tokens on every turn.
  • Tool calls add extra tokens for the function schema sent to the model and the structured JSON response.
Quota TypeWhat it limitsImpact on agents
TPM (Tokens Per Minute)Total tokens processed per minuteHitting this causes 429 Too Many Requests errors
RPM (Requests Per Minute)Number of API calls per minuteHitting this throttles concurrent agent threads
Model quotaSubscription-wide TPM cap per modelShared across all deployments of the same model

Practical guidance: For agent workloads, request at least 30K–100K TPM for development.

How to Deploy a Model: Step-by-Step

From the Deployments blade (recommended for agents):

  1. Open your Foundry Project at ai.azure.com.
  2. In the left nav, click Models + endpointsDeployments.
  3. Click + Deploy modelDeploy base model.
  4. In the model catalog, search for gpt-4o and click on it.
  5. Click Confirm to proceed to the deployment configuration screen.
  6. Set the Deployment name (e.g., gpt-4o-agent). This name is what your agent references — choose something meaningful.
  7. Select the Deployment type (Standard for labs).
  8. Adjust Tokens per minute (TPM) quota using the slider.
  9. Click Deploy. The deployment is ready in seconds to minutes.
Hands-On Lab

Hands-On: Deploy a GPT-4o Model for Agent Use

Goal: Deploy a GPT-4o model in your Foundry Project and verify it is available for agent use.

  1. Open your Foundry Project at ai.azure.com.
  2. In the left navigation, click Models + endpointsDeployments.
  3. Click + Deploy modelDeploy base model.
  4. In the search box of the model catalog, type gpt-4o and press Enter.
  5. Click the gpt-4o card (not gpt-4o-mini), then click Confirm.
  6. On the deployment configuration page, set Deployment name to gpt-4o-agent.
  7. Leave Deployment type as Standard.
  8. Use the TPM slider to set 40,000 tokens per minute (sufficient for lab work).
  9. Click Deploy and wait for the status to show Succeeded (usually under 60 seconds).
  10. Navigate to Agents in the left nav and click + New agent. Confirm that gpt-4o-agent appears in the Deployment dropdown — this confirms the deployment is usable by agents.
Exam Angle — What AI-3018 Tests

AI-3018 Assessment Focus

Deployment type selection and model capability are frequent exam targets. Know which models support function calling and what each deployment type offers.

Exam Trap

"You can name a deployment the same as the model" — You can, but it is not required. The deployment name is a free-form string you choose; the underlying model is a separate field. Agents reference the deployment name, not the model name.

Exam Trap

"Standard and Global Standard deployments have different token costs" — Token pricing per input/output token is the same between Standard and Global Standard. The difference is routing (global vs. regional) and availability — not per-token cost.

Exam Trap

"Provisioned deployments scale automatically" — Provisioned (PTU) deployments have fixed reserved capacity. Requests beyond that capacity are throttled unless you also have a Standard fallback deployment. They do not auto-scale like cloud VMs.

Exam Trap

"Embedding model deployments can power agents" — Embedding models produce vector representations, not natural language completions. They cannot be selected as an agent's model deployment and do not support function calling.

Exam Trap

"GPT-3.5 Turbo is equivalent to GPT-4o for agents" — GPT-3.5 Turbo technically supports the tools parameter, but it is significantly less reliable for complex multi-step tool use, structured JSON adherence, and instruction following compared to GPT-4o.

Exam Tip

For any scenario involving tool use failures with Phi-3 or other small models — the root cause is always missing function calling support. Fix = switch to GPT-4o or GPT-4 Turbo.

Question — click to flip

Q: What must an agent reference to use a model — the model name or the deployment name?

Question — click to flip

Q: Which deployment type provides reserved throughput billed hourly regardless of actual usage?

Question — click to flip

Q: Why can't a Phi-3 deployment power an agent with tools?

Question — click to flip

Q: What HTTP error indicates a deployment has exceeded its TPM quota?

Question — click to flip

Q: What is the difference between Standard and Global Standard deployment types?

Question — click to flip

Q: Can embedding model deployments be used as an agent's underlying model?

Sources & Further Reading