Once you identify a model in the catalog, you deploy it to make it available for inference. Azure AI Foundry offers two high-level deployment paths:
- Standard deployment in Foundry resources — preferred; supports ADMs and select partner models
- Managed compute deployment — for open-weight and custom models
The Foundry portal automatically routes you to the correct option based on the model you choose.
Key deployment settings you configure at creation time:
| Setting | Description |
|---|---|
| Deployment name | Used as the model parameter in API calls — immutable after creation |
| Deployment type | Standard, Global-Standard, Global-Batch, Provisioned-Managed, etc. |
| Tokens per Minute (TPM)A rate limit on throughput allocated from your subscription's per-region, per-model quota pool. TPM is not the model's context window — it controls how many tokens your deployment can process per minute across all callers. | Rate limit from your subscription quota; adjustable post-deployment |
| Content filter policy | The filter configuration to attach |
| Version Upgrade PolicyA deployment setting that controls what happens when a new model version is released. Options: OnceNewDefaultVersionAvailable (auto-upgrade to new default), OnceCurrentVersionExpired (upgrade only at retirement), NoAutoUpgrade (manual only — stops at retirement). | Controls auto-upgrade behavior |
You can test deployed models interactively in the Foundry PlaygroundThe interactive no-code web interface in Azure AI Foundry for testing deployed models by sending prompts and reviewing responses, with controls for system message, temperature, max response tokens, and top P. without writing any code.
Deployment Types Compared
| Deployment Type | Billing | Best For |
|---|---|---|
| Standard | Per token (TPM quota) | Development, variable workloads |
| Global-Standard | Per token, global routing | Broadest regional availability |
| Provisioned-Managed (PTU) | Reserved capacity (per hour) | High-volume, latency-sensitive production |
| Global-Batch | Per token, async | Cost-optimized batch processing; no playground |
| Managed compute | VM core-hours | Open-weight / custom models |
Version Upgrade Policies
| Policy | Behavior | Best for |
|---|---|---|
OnceNewDefaultVersionAvailable | Auto-upgrades when a new default version is set | Development environments; keep current |
OnceCurrentVersionExpired | Upgrades only when the current version is retired | Production; safest middle ground |
NoAutoUpgrade | Never auto-upgrades; deployment stops working when the pinned version is retired | Strict version locking (requires active monitoring) |
Editable vs. Fixed Post-Deployment Settings
| Setting | Can edit after deployment? |
|---|---|
| Deployment name | No — immutable; delete and redeploy |
| Tokens per Minute (TPM) | Yes — from the deployment details page |
| Content filter policy | Yes — replace policy from deployment page |
| Model version | Yes — triggers Updating provisioning state |
| Deployment type | No — fixed at creation |
| Azure region | No — fixed at creation |
Playground Parameters
| Parameter | Range | Effect |
|---|---|---|
| Temperature | 0–2 | Controls randomness; 0 = deterministic, 2 = very random |
| Max response (tokens) | 1–model max | Caps generated response length |
| Top P | 0–1 | Nucleus sampling; adjust Temperature OR Top P, not both |
| Stop sequences | String list | Tokens that halt generation |
Deploy a model from the catalog:
Navigate to Azure AI Foundry portal → Discover → Models → select a model (e.g., gpt-4o-mini) → Deploy → Custom settings.
In the deployment wizard: set Deployment name → choose Deployment type (e.g., Global-Standard) → adjust Tokens per Minute slider → assign a Content filter policy → set Version upgrade policy → select Deploy.
Wait for Provisioning state to show Succeeded on the Models + endpoints page.
Test in the playground:
From the deployment list, click the deployment name → Open in playground (or navigate to Playgrounds → Chat).
In the System message box, enter instructions (e.g., "You are a concise technical assistant.") → select Apply changes.
Type a user prompt in the chat box → press Enter to send → review the response.
Adjust Temperature (0–2 scale) or Max response tokens in the Parameters panel → resend the same prompt to observe differences.
Select View code / </> Code tab → copy the pre-populated Python snippet to validate API connectivity.
Edit an existing deployment:
Navigate to Models + endpoints → select the deployment name → Edit (pencil icon).
Increase or decrease the Tokens per Minute allocation → select Save.
To update the model version: in the Properties pane select Edit → change Model version in the dropdown → confirm. The deployment enters Updating state for a few minutes.
AI-3016 Assessment Focus
Deployment-type selection and TPM vs. context-window confusion are high-frequency exam topics. Know which settings are editable post-deployment and what happens when a pinned version is retired with NoAutoUpgrade.
Exam Trap
"Batch deployment supports the Foundry playground for testing." Global-Batch does not support playground testing. Use Standard or Global-Standard deployments for interactive validation.
Exam Trap
"You can change the deployment name after a model is deployed." The deployment name is immutable after creation. If you need a different name, delete and redeploy.
Exam Trap
"TPM quota is the same as the model's max input token limit." TPM is a throughput rate limit allocated from your subscription quota. The model's max input token limit (context window) is a fixed model property unaffected by TPM.
Exam Trap
"The NoAutoUpgrade policy keeps a deployment running indefinitely." When the pinned model version reaches its retirement date, deployments with NoAutoUpgrade will stop serving requests. Manual version update before retirement is required.
Exam Trap
"Temperature and Top P should be adjusted together for best results." Microsoft explicitly recommends adjusting either Temperature or Top P — not both simultaneously — as combining them produces unpredictable behavior.
Exam Tip
For overnight batch cost optimization: Global-Batch. For latency-sensitive high-volume production: Provisioned-Managed (PTU). For development with variable load: Standard or Global-Standard.
Must Memorize
After deployment, only TPM, content filter, and model version are editable. Deployment name, type, provider, and region are fixed.
Question — click to flip
Q: A company processes large document volumes overnight at minimum cost with no playground requirement. Which deployment type is most appropriate?
Question — click to flip
Q: With NoAutoUpgrade policy, what happens when the pinned model version reaches its retirement date?
Question — click to flip
Q: Which deployment settings can be modified after a model is deployed in Azure AI Foundry?
Question — click to flip
Q: What is the primary difference between Standard and Provisioned-Managed (PTU) deployment types?
Question — click to flip
Q: A developer wants to test a deployed model without writing code. Which Foundry feature enables this?
Question — click to flip
Q: What does the Tokens per Minute (TPM) setting control in an Azure AI Foundry deployment?