AI-3016 Learning Portal
Objective 5.1 · 25 min · High priority · Tags: content-filter, blocklist, protected-material, prompt-shields, harm-categories

5.1 — Configure Content Filters, Blocklists, and Protected Material

Configure content filter policies with per-category input/output thresholds, add custom blocklists, enable protected material detection, and apply filter policies to model deployments.

Concept — What & Why

Azure AI Foundry includes a content filtering system that runs alongside every deployed model. It uses an ensemble of multi-class classification models to detect harmful content in both user prompts (input) and model completions (output). Filtering is configured independently for each direction through a Content Filter Policy: a named configuration object, created at the hub/resource level, that defines per-category severity thresholds, blocklists, and additional safety features for both the input and output directions. A policy is applied to one or more model deployments.

The system operates on four harm categories, each scored on a severity scale of 0–7 (0 = safe, 7 = most severe). The portal exposes three actionable severity thresholds: Low, Medium, and High. Three further features work alongside the severity thresholds:

  • Blocklist: a custom list of terms or phrases that unconditionally block (or flag) matching content regardless of the severity score. Blocklist matches bypass the severity slider logic entirely.
  • Prompt Shields: a content filter feature that detects attempts to override system instructions. Direct attack detection (jailbreak) is on by default; indirect attack detection (injected instructions hidden in documents) is off by default.
  • Protected Material Detection: binary classifiers that identify content matching known copyrighted text (songs, recipes, web content) or code from public repositories (GitHub). It can be set to Block or Annotate Only.

The Four Harm Categories

Category | API Term | What It Covers
Hate and Fairness | Hate | Discriminatory language targeting identity groups (race, gender, religion, sexual orientation, disability, etc.)
Sexual | Sexual | Explicit sexual content, adult material, CSAM, non-consensual acts
Violence | Violence | Physical harm, weapons, extremist/terrorist content, gore
Self-Harm | SelfHarm | Suicide instructions, self-injury encouragement, eating disorder promotion

Default behavior: content at Medium or High severity is blocked for all four categories on both prompts and completions. Low-severity content passes through by default.

Deep Dive — How It Works

Severity Scale Mapping

The underlying model assigns scores from 0 to 7:

Portal Label | Underlying Score Range | Meaning
Safe (not configurable) | 0–1 | Benign, no action taken
Low | 2–3 | Mild risk; some edge cases
Medium | 4–5 | Moderate risk; default block threshold
High | 6–7 | Severe harm; blocked at every threshold setting, including the default

You can move the slider to block at Low (strictest), Medium (default), or High (most lenient). Turning filters completely off requires Microsoft approval via a limited-access review form.
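The mapping and slider behavior described above can be sketched in a few lines of Python. This is a local illustration only: `severity_label` and `is_blocked` are hypothetical helper names, not part of any Azure SDK, and the sketch assumes each adjacent pair of scores shares one portal label, as in the table.

```python
# Illustrative sketch; these helpers are hypothetical, not an Azure API.
LABEL_ORDER = ["Safe", "Low", "Medium", "High"]


def severity_label(score: int) -> str:
    """Map an underlying 0-7 score to its portal label (two scores per label)."""
    if not 0 <= score <= 7:
        raise ValueError(f"score must be in 0..7, got {score}")
    return LABEL_ORDER[score // 2]


def is_blocked(score: int, threshold: str = "Medium") -> bool:
    """Return True if a slider set to `threshold` blocks content at `score`."""
    if threshold not in LABEL_ORDER[1:]:
        raise ValueError("threshold must be 'Low', 'Medium', or 'High'")
    label_rank = LABEL_ORDER.index(severity_label(score))
    return label_rank >= LABEL_ORDER.index(threshold)
```

With the default `Medium` threshold, a score of 3 (Low) passes and a score of 4 (Medium) is blocked; moving the threshold to `Low` blocks scores from 2 upward, and moving it to `High` blocks only 6 and 7.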

Additional Safety Features

Feature | Default | Applied To | What It Does
Prompt Shields (direct attacks / jailbreak) | On | Input | Detects attempts to override system instructions
Prompt Shields (indirect attacks) | Off | Input | Detects injected instructions hidden in documents
Protected material — text | On | Output | Blocks known copyrighted text (songs, recipes, web content)
Protected material — code | On | Output | Flags/blocks code matching public GitHub repositories
Groundedness detection | Off (Preview) | Output | Detects ungrounded/hallucinated claims vs. source documents
PII detection | Off (Preview) | Output | Filters personally identifiable information

Input vs. Output Filter Independence

Input filters apply to user prompts before the model processes them. Output filters apply to model completions before they reach the end user. You configure each direction independently with different thresholds per category per direction.
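One way to picture this independence is a policy object that holds two separate threshold maps, one per direction. The sketch below is a hypothetical local data model (the `FilterPolicy` class and its fields are assumptions for illustration, not the shape of the real Azure resource); the category names follow the API terms listed earlier, and the policy name is the example used in the lab.

```python
from dataclasses import dataclass, field

CATEGORIES = ("Hate", "Sexual", "Violence", "SelfHarm")
LEVELS = ("Low", "Medium", "High")


def default_thresholds() -> dict:
    """Every category starts at the default block threshold, Medium."""
    return {category: "Medium" for category in CATEGORIES}


@dataclass
class FilterPolicy:
    """Hypothetical model: one threshold map per direction."""
    name: str
    input_thresholds: dict = field(default_factory=default_thresholds)
    output_thresholds: dict = field(default_factory=default_thresholds)

    def set_threshold(self, direction: str, category: str, level: str) -> None:
        if direction not in ("input", "output"):
            raise ValueError("direction must be 'input' or 'output'")
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        target = self.input_thresholds if direction == "input" else self.output_thresholds
        target[category] = level


policy = FilterPolicy("strict-customer-policy")
policy.set_threshold("input", "Violence", "Low")    # strictest on prompts
policy.set_threshold("output", "Violence", "High")  # most lenient on completions
```

Changing the input threshold for Violence leaves the output threshold untouched, and the other categories stay at Medium in both directions.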

Blocklist Facts

  • Can be applied as an input filter, an output filter, or both
  • Multiple blocklists can be combined in a single filter policy
  • A built-in profanity blocklist is available without creating a custom list
  • Blocklist matches trigger a block action unconditionally; they bypass the severity slider logic
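The facts above can be sketched as a decision function in which any blocklist match blocks the content before the severity result is even consulted. Everything here is an assumption for illustration: the function names are hypothetical, and the case-insensitive whole-word matching is a stand-in, since the service's actual matching rules are not described in this section.

```python
import re


def matches_blocklist(text: str, terms: list) -> bool:
    """Case-insensitive whole-word match (an assumed matching rule)."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
        for term in terms
    )


def should_block(text: str, severity_blocked: bool, blocklists: list) -> bool:
    """Combine several blocklists with the severity-based decision.

    `severity_blocked` is the outcome of the category sliders; a match in
    any blocklist blocks the text regardless of that outcome.
    """
    if any(matches_blocklist(text, terms) for terms in blocklists):
        return True
    return severity_blocked


competitor_terms = ["CompetitorBrand"]
profanity_terms = ["darn"]  # placeholder standing in for a built-in list
```

For example, `should_block("Try competitorbrand today", False, [competitor_terms, profanity_terms])` returns `True` even though the severity filters passed the text, which is the additive behavior the exam traps below refer to.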

Filter Scope

A content filter configuration is created at the hub/resource level and then associated with one or more model deployments. A single deployment can only have one active filter policy at a time.
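The scoping rule can be modeled as a many-to-one mapping from deployments to policies. The sketch below uses a hypothetical `FilterRegistry` class (not an Azure API) and made-up deployment names to show that assigning a new policy replaces the old one on that deployment, while one policy can serve several deployments.

```python
from typing import Optional


class FilterRegistry:
    """Hypothetical model of policy scope at the hub/resource level."""

    def __init__(self) -> None:
        self.policies = set()   # policy names created at the resource level
        self.assignments = {}   # deployment name -> its single active policy

    def create_policy(self, name: str) -> None:
        self.policies.add(name)

    def assign(self, deployment: str, policy: str) -> Optional[str]:
        """Attach `policy` to `deployment`; return the policy it replaced, if any."""
        if policy not in self.policies:
            raise KeyError(f"policy {policy!r} does not exist")
        previous = self.assignments.get(deployment)
        self.assignments[deployment] = policy
        return previous


registry = FilterRegistry()
registry.create_policy("strict-customer-policy")
registry.create_policy("internal-policy")

# The same policy is reused across two (made-up) deployments.
registry.assign("chat-prod", "strict-customer-policy")
registry.assign("chat-staging", "strict-customer-policy")

# Re-assigning replaces the existing policy on that deployment.
replaced = registry.assign("chat-staging", "internal-policy")
```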

Hands-On Lab

Follow these steps in the Azure AI Foundry portal (classic view):

Step 1 — Navigate to Guardrails + Controls. Sign in at ai.azure.com. Open your project. In the left pane, select Guardrails + controls, then select the Content filters tab.

Step 2 — Create a new filter policy. Select + Create content filter. On the Basic information page, enter a descriptive name (e.g., strict-customer-policy) and choose the Azure OpenAI connection to associate with this filter. Select Next.

Step 3 — Configure Input filters. On the Input filters page, use the sliders to set per-category thresholds. For a strict deployment, set all four categories (Hate, Sexual, Violence, Self-harm) to Low. Toggle Prompt Shields (jailbreak) to Block (it is on by default). Select Next.

Step 4 — Configure Output filters. On the Output filters page, set your output thresholds (e.g., Violence and Sexual at Medium, Self-harm at Low). Enable Protected material — text and Protected material — code and set each to Block. Enable Streaming mode if your app streams tokens. Select Next.

Step 5 — Add a Blocklist (optional). Blocklists are attached on the Input filters or Output filters page, so do this while you are on the relevant page in Step 3 or Step 4, before selecting Next. Enable the Blocklist toggle, then select one or more custom blocklists from the dropdown, or select the built-in profanity blocklist.

Step 6 — Associate with a deployment. On the Deployment page, select the model deployment to which this filter should be applied. If the deployment already has a filter, confirm the replacement. Select Create.

Step 7 — Verify in the playground. Go to Models + endpoints, open the deployment, and select Open in playground. Submit a test prompt that would normally trigger one of the blocked categories. Confirm the response is blocked or annotated as expected.
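The same check can be made outside the playground by inspecting the response body. The sketch below assumes the Azure OpenAI chat completions response shape, in which a filtered completion reports `finish_reason` as `content_filter` and each choice carries per-category annotations under `content_filter_results`; verify the field names against the current API reference. The `summarize_filtering` helper is hypothetical, and `sample_choice` is a hand-written illustration, not captured service output.

```python
def summarize_filtering(choice: dict) -> dict:
    """Summarize which filters fired for one choice in a response."""
    results = choice.get("content_filter_results", {})
    filtered = sorted(
        name for name, result in results.items() if result.get("filtered")
    )
    # Annotate Only mode: the match is detected but the content is not filtered.
    annotated_only = sorted(
        name
        for name, result in results.items()
        if result.get("detected") and not result.get("filtered")
    )
    return {
        "blocked": choice.get("finish_reason") == "content_filter",
        "filtered_categories": filtered,
        "annotated_only": annotated_only,
    }


# Hand-written example of an annotated (not blocked) completion.
sample_choice = {
    "finish_reason": "stop",
    "content_filter_results": {
        "hate": {"filtered": False, "severity": "safe"},
        "violence": {"filtered": False, "severity": "low"},
        "protected_material_code": {"filtered": False, "detected": True},
    },
}

summary = summarize_filtering(sample_choice)
```

Here the completion is delivered, and `protected_material_code` appears under `annotated_only`, which is what you would expect to see when that feature is set to Annotate Only rather than Block.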

Exam Angle — What AI-3016 Tests

AI-3016 Assessment Focus

Expect questions about filter scope (hub vs. deployment level), the 0–7 severity scale vs. portal Low/Medium/High labels, blocklist additive behavior, and what requires Microsoft approval.

Exam Trap

"The severity scale goes from 0 to 6." The underlying model uses a 0–7 scale. The portal consolidates adjacent pairs into Safe (0–1), Low (2–3), Medium (4–5), and High (6–7).

Exam Trap

"You can disable content filters for any deployment at any time." Turning filters completely off requires Microsoft approval through a limited-access review. All customers can change thresholds (Low/Medium/High) but cannot disable entirely without approval.

Exam Trap

"Blocklists replace the harm-category filters." Blocklists are additive. They run in parallel with severity-based filters. A term on a blocklist is always blocked regardless of what the category slider is set to.

Exam Trap

"Protected material detection blocks content automatically by default." Protected material detection is on by default, but the action mode (Block vs. Annotate Only) must be explicitly chosen. Annotate Only returns a flag without blocking the content.

Exam Trap

"Input and output filters share the same configuration." They are configured independently. You can set Violence to block at Low on inputs and only at High on outputs within the same filter policy.

Exam Trap

"A content filter is scoped to a single deployment." Content filters are created at the hub/resource level and can be applied to multiple deployments. One deployment holds one filter at a time, but the same policy can be reused.

Must Memorize

Default block threshold: Medium and above (not Low, not High-only). To block everything, including Low-severity content: move the slider to Low. The one action that requires Microsoft approval: turning filters completely off.

Q: What is the default severity threshold at which Azure AI Foundry blocks content for the four harm categories?

Q: A developer wants every occurrence of 'CompetitorBrand' blocked in model outputs regardless of severity. What is the correct approach?

Q: Which action in Azure AI Foundry requires Microsoft approval through a limited-access review?

Q: What happens when Protected material — code is set to 'Annotate Only' and output matches a public repository?

Q: At which scope is a content filter policy created in Azure AI Foundry?

Q: What is the underlying severity score range used by the content filtering system, and how does it map to the portal labels?
