AZ-305 Learning Portal

Concept — What & Why

Azure Data Integration Services

Azure Data FactoryCloud-native ETL/ELT orchestration service with 90+ built-in connectors, code-free visual pipeline design, and serverless data transformation via Data Flows (powered by Spark). Supports event-driven and scheduled triggers, CI/CD with Git integration, and Self-Hosted Integration Runtime for on-premises connectivity.Azure Synapse AnalyticsA unified analytics platform that combines data integration (same engine as Data Factory), dedicated SQL pools for data warehousing, serverless SQL pools for query-on-demand, and Apache Spark pools for big data processing — all in one workspace with integrated monitoring.Integration Runtime (IR)The compute infrastructure that powers Data Factory pipeline execution. Azure IR: cloud-only, managed by Microsoft. Self-Hosted IR: installed on-premises for private network access to restricted data sources. Azure SSIS IR: runs SSIS packages in Azure.Medallion ArchitectureA data lakehouse pattern with three quality layers: Bronze (raw, unmodified data), Silver (cleaned, deduplicated, conformed data), Gold (business-ready aggregates for reporting). Implements progressive data quality and reduces redundant transformation work.Change Data Capture (CDC)A pattern that captures only the changed records from a source system rather than full extracts. Implementations: Cosmos DB Change Feed, SQL Server CDC, PostgreSQL replication slots. Enables efficient incremental loads to reduce warehouse load and latency.

Integration Pattern Selection

Pattern	Latency	Complexity	Best For
Batch ETL	Hours	Low	Nightly loads, weekly aggregations
ELT (load then transform)	Hours	Low-Medium	Cloud-native data warehouses
Streaming	Seconds	High	IoT, clickstreams, real-time dashboards
CDC Incremental	Minutes	Medium	Near-real-time sync, reducing load
Lambda Architecture	Mixed	High	Requires both batch accuracy and streaming speed

Pipeline Activity Types

Activity	Purpose
Copy	Bulk data movement between sources and sinks
Data Flow	Serverless Spark transformation (code-free)
Execute Pipeline	Modular pipelines calling child pipelines
ForEach	Iterate over collections (multiple files, tables)
Lookup	Fetch dynamic values for parameterized pipelines
If Condition / Until	Branching and retry logic

Deep Dive — How It Works

Architecture Patterns

ETL vs. ELT

ETL (Extract-Transform-Load) — transform before loading

Transform in a separate engine before writing to destination
Traditional approach (on-premises data warehouses)
Data Factory Data Flows or Databricks for transformation

ELT (Extract-Load-Transform) — load raw, transform in warehouse

Load raw data to data lake or warehouse first
Transform using warehouse compute (Synapse SQL pools, Databricks)
Preferred for cloud-native designs — leverages warehouse-scale compute

Medallion Lakehouse Architecture

Bronze: Raw data in native format — no transformation, append-only, full history
Silver: Cleaned, deduplicated, standardized schema — conformed dimensions
Gold: Business aggregates, KPIs, star-schema fact tables for BI consumption

On-Premises to Azure Connectivity

Scenario	Integration Runtime
Cloud-to-cloud (Azure to Azure)	Azure IR (default)
Restricted on-premises data source	Self-Hosted IR
Lift-and-shift SSIS packages	Azure-SSIS IR
On-premises behind firewall, no direct access	Self-Hosted IR with outbound-only connectivity

Self-Hosted IR is installed on-premises and makes outbound connections to Data Factory — no inbound firewall rules required. This is the key advantage over VPN-only solutions for restricted networks.

Streaming Architecture

Real-time streaming pipeline:

Ingestion: Event Hubs or IoT Hub
Processing: Stream Analytics (SQL-like queries) or Apache Kafka on AKS
Storage: Cosmos DB (hot queries), Data Lake (analytics), Synapse (reporting)
Alerting: Azure Monitor + Action Groups on anomalies

Stream Analytics uses SQL-like query language with windowing functions (tumbling, hopping, sliding windows) for time-based aggregations — no code required.

Data Factory vs. Synapse Integration: When to Use Which

Use Case	Recommendation
Standalone ETL pipelines only	Azure Data Factory
ETL + analytics in one workspace	Azure Synapse Analytics
Connecting to Dedicated SQL Pool	Synapse Pipelines (same engine, native integration)
Multiple independent pipelines managed centrally	Data Factory

Hands-On Lab

Hands-On: Create Data Factory Pipeline with ForEach Loop

Step 1: Create Data Factory and Pipeline

Navigate to Azure Data Factory > Create
Open Author hub in ADF Studio
Click New pipeline and name it (e.g., ingest-customer-files)

Step 2: Add GetMetadata + ForEach for Multiple Files

Add Get Metadata activity:
- Dataset: Azure Blob Storage container
- Field list: Child items (lists all files)
Add ForEach activity connected to GetMetadata output
- Items: @activity('GetMetadata').output.childItems
- Sequential: False (parallel processing)
Inside ForEach, add Copy activity:
- Source: Blob Storage file (parameterized with @item().name)
- Sink: Azure SQL Database or Data Lake

Step 3: Configure Self-Hosted Integration Runtime

Open Manage hub > Integration runtimes > New
Select Self-Hosted
Name it and click Create
Download and install the IR agent on the on-premises machine
Register the node with the displayed authentication key
Create Linked Services pointing to on-premises SQL Server using the Self-Hosted IR

Step 4: Monitor Pipeline Execution

Open Monitor hub
View Pipeline runs — check status (In Progress, Succeeded, Failed)
Click a run to see Activity runs with timing details
For failures: Click Output column for error details
Set Alert rules for pipeline failures via Azure Monitor

Step 5: Design Medallion Architecture in Data Lake

Create ADLS Gen2 containers: bronze, silver, gold
Configure pipeline to:
- Copy raw files from source to bronze/ (no transformation)
- Data Flow to clean and standardize data into silver/
- Spark notebook (via Databricks or Synapse) to aggregate to gold/
Schedule bronze load nightly; silver/gold transformations after bronze completes

Exam Angle — What AZ-305 Tests

AZ-305 Exam Focus

AZ-305 tests your ability to select the right integration pattern, integration runtime type, and pipeline design for a given scenario. The most common question patterns involve: when to use Synapse vs. Data Factory, when to use Self-Hosted vs. Azure IR, and batch vs. streaming decisions.

Exam Trap

Data Factory vs. Synapse Integration: Synapse includes the same Data Factory pipeline engine. When a scenario requires both data integration AND analytics in the same workspace, recommend Synapse. If only ETL orchestration is needed without analytics, Data Factory is simpler and more appropriate.

Exam Trap

ETL vs. ELT Confusion: Modern cloud-native architectures prefer ELT — load raw data to the lake, then transform using warehouse compute. ETL (transform before loading) is the traditional on-premises pattern. The exam expects you to recommend ELT for new cloud designs.

Exam Trap

Streaming Always Better for Fresh Data: Streaming adds complexity and cost. If the business requirement is hourly or daily data freshness, batch ETL is the right choice. Only use streaming when latency requirements are measured in seconds or sub-minutes (real-time dashboards, live alerting, IoT anomaly detection).

Exam Trap

Azure IR for On-Premises Access: Azure IR cannot reach on-premises data sources behind firewalls without explicit network configuration (managed VNet integration). Self-Hosted IR must be installed on-premises and makes outbound connections — the correct and simpler choice for restricted on-premises access.

Exam Tip

ForEach for Dynamic File Processing: When a scenario involves processing a variable number of files or tables, the ForEach activity is the answer. It dynamically iterates over a collection, avoiding the need to create N individual pipelines. Combine with GetMetadata to list files and Copy/Data Flow inside the loop.

Must Memorize

Medallion Layer Purposes: Bronze = raw, unmodified, append-only. Silver = cleaned, deduplicated, standardized. Gold = business-ready aggregates for reporting. Data moves progressively through layers, never going backwards. This is the standard lakehouse architecture pattern for AZ-305 scenarios.

Question — click to flip

Q: What is the difference between ETL and ELT, and which is preferred for cloud-native designs?

Question — click to flip

Q: When do you use Self-Hosted Integration Runtime vs. Azure Integration Runtime?

Question — click to flip

Q: What are the three layers of medallion architecture and what does each contain?

Question — click to flip

Q: When should you use Azure Synapse Analytics instead of Azure Data Factory for data integration?

Question — click to flip

Q: What is Change Data Capture (CDC) and what Azure service natively supports it for Cosmos DB?

Question — click to flip

Q: How does the ForEach activity in Data Factory help when processing a variable number of files daily?

2.3 — Design Data Integration

Azure Data Integration Services

Integration Pattern Selection

Pipeline Activity Types

Architecture Patterns

ETL vs. ELT

Medallion Lakehouse Architecture

On-Premises to Azure Connectivity

Streaming Architecture

Data Factory vs. Synapse Integration: When to Use Which

Hands-On: Create Data Factory Pipeline with ForEach Loop

Step 1: Create Data Factory and Pipeline

Step 2: Add GetMetadata + ForEach for Multiple Files

Step 3: Configure Self-Hosted Integration Runtime

Step 4: Monitor Pipeline Execution

Step 5: Design Medallion Architecture in Data Lake

AZ-305 Exam Focus