AI-3016 Learning Portal
Objective 2.2 25 minmedium prioritydeploymenttpmplaygroundversion-policyprovisioned

2.2 — Configure, Deploy, Edit, and Test Models

Configure deployment settings (TPM, content filter, version policy), deploy a model, edit post-deployment settings, and test interactively in the Foundry Playground.

Concept — What & Why

Once you identify a model in the catalog, you deploy it to make it available for inference. Azure AI Foundry offers two high-level deployment paths:

  • Standard deployment in Foundry resources — preferred; supports ADMs and select partner models
  • Managed compute deployment — for open-weight and custom models

The Foundry portal automatically routes you to the correct option based on the model you choose.

Key deployment settings you configure at creation time:

SettingDescription
Deployment nameUsed as the model parameter in API calls — immutable after creation
Deployment typeStandard, Global-Standard, Global-Batch, Provisioned-Managed, etc.
Tokens per Minute (TPM)A rate limit on throughput allocated from your subscription's per-region, per-model quota pool. TPM is not the model's context window — it controls how many tokens your deployment can process per minute across all callers.Rate limit from your subscription quota; adjustable post-deployment
Content filter policyThe filter configuration to attach
Version Upgrade PolicyA deployment setting that controls what happens when a new model version is released. Options: OnceNewDefaultVersionAvailable (auto-upgrade to new default), OnceCurrentVersionExpired (upgrade only at retirement), NoAutoUpgrade (manual only — stops at retirement).Controls auto-upgrade behavior

You can test deployed models interactively in the Foundry PlaygroundThe interactive no-code web interface in Azure AI Foundry for testing deployed models by sending prompts and reviewing responses, with controls for system message, temperature, max response tokens, and top P. without writing any code.

Deep Dive — How It Works

Deployment Types Compared

Deployment TypeBillingBest For
StandardPer token (TPM quota)Development, variable workloads
Global-StandardPer token, global routingBroadest regional availability
Provisioned-Managed (PTU)Reserved capacity (per hour)High-volume, latency-sensitive production
Global-BatchPer token, asyncCost-optimized batch processing; no playground
Managed computeVM core-hoursOpen-weight / custom models

Version Upgrade Policies

PolicyBehaviorBest for
OnceNewDefaultVersionAvailableAuto-upgrades when a new default version is setDevelopment environments; keep current
OnceCurrentVersionExpiredUpgrades only when the current version is retiredProduction; safest middle ground
NoAutoUpgradeNever auto-upgrades; deployment stops working when the pinned version is retiredStrict version locking (requires active monitoring)

Editable vs. Fixed Post-Deployment Settings

SettingCan edit after deployment?
Deployment nameNo — immutable; delete and redeploy
Tokens per Minute (TPM)Yes — from the deployment details page
Content filter policyYes — replace policy from deployment page
Model versionYes — triggers Updating provisioning state
Deployment typeNo — fixed at creation
Azure regionNo — fixed at creation

Playground Parameters

ParameterRangeEffect
Temperature0–2Controls randomness; 0 = deterministic, 2 = very random
Max response (tokens)1–model maxCaps generated response length
Top P0–1Nucleus sampling; adjust Temperature OR Top P, not both
Stop sequencesString listTokens that halt generation
Hands-On Lab

Deploy a model from the catalog:

Navigate to Azure AI Foundry portal → DiscoverModels → select a model (e.g., gpt-4o-mini) → DeployCustom settings.

In the deployment wizard: set Deployment name → choose Deployment type (e.g., Global-Standard) → adjust Tokens per Minute slider → assign a Content filter policy → set Version upgrade policy → select Deploy.

Wait for Provisioning state to show Succeeded on the Models + endpoints page.

Test in the playground:

From the deployment list, click the deployment name → Open in playground (or navigate to PlaygroundsChat).

In the System message box, enter instructions (e.g., "You are a concise technical assistant.") → select Apply changes.

Type a user prompt in the chat box → press Enter to send → review the response.

Adjust Temperature (0–2 scale) or Max response tokens in the Parameters panel → resend the same prompt to observe differences.

Select View code / </> Code tab → copy the pre-populated Python snippet to validate API connectivity.

Edit an existing deployment:

Navigate to Models + endpoints → select the deployment name → Edit (pencil icon).

Increase or decrease the Tokens per Minute allocation → select Save.

To update the model version: in the Properties pane select Edit → change Model version in the dropdown → confirm. The deployment enters Updating state for a few minutes.

Exam Angle — What AI-3016 Tests

AI-3016 Assessment Focus

Deployment-type selection and TPM vs. context-window confusion are high-frequency exam topics. Know which settings are editable post-deployment and what happens when a pinned version is retired with NoAutoUpgrade.

Exam Trap

"Batch deployment supports the Foundry playground for testing." Global-Batch does not support playground testing. Use Standard or Global-Standard deployments for interactive validation.

Exam Trap

"You can change the deployment name after a model is deployed." The deployment name is immutable after creation. If you need a different name, delete and redeploy.

Exam Trap

"TPM quota is the same as the model's max input token limit." TPM is a throughput rate limit allocated from your subscription quota. The model's max input token limit (context window) is a fixed model property unaffected by TPM.

Exam Trap

"The NoAutoUpgrade policy keeps a deployment running indefinitely." When the pinned model version reaches its retirement date, deployments with NoAutoUpgrade will stop serving requests. Manual version update before retirement is required.

Exam Trap

"Temperature and Top P should be adjusted together for best results." Microsoft explicitly recommends adjusting either Temperature or Top P — not both simultaneously — as combining them produces unpredictable behavior.

Exam Tip

For overnight batch cost optimization: Global-Batch. For latency-sensitive high-volume production: Provisioned-Managed (PTU). For development with variable load: Standard or Global-Standard.

Must Memorize

After deployment, only TPM, content filter, and model version are editable. Deployment name, type, provider, and region are fixed.

Question — click to flip

Q: A company processes large document volumes overnight at minimum cost with no playground requirement. Which deployment type is most appropriate?

Question — click to flip

Q: With NoAutoUpgrade policy, what happens when the pinned model version reaches its retirement date?

Question — click to flip

Q: Which deployment settings can be modified after a model is deployed in Azure AI Foundry?

Question — click to flip

Q: What is the primary difference between Standard and Provisioned-Managed (PTU) deployment types?

Question — click to flip

Q: A developer wants to test a deployed model without writing code. Which Foundry feature enables this?

Question — click to flip

Q: What does the Tokens per Minute (TPM) setting control in an Azure AI Foundry deployment?

Sources & Further Reading