AZ-305 Learning Portal
Objective 2.3 35 minmedium prioritydata-factorysynapse-analyticsdata-lakestream-analyticsdatabricksmedallion-architectureintegration-runtimechange-data-capture

2.3 — Design Data Integration

Design data integration pipelines using Azure Data Factory, Azure Synapse Analytics, and Azure Stream Analytics to implement ETL/ELT batch and real-time streaming patterns, medallion architecture, and secure pipeline design.

Concept — What & Why

Azure Data Integration Services

Azure Data FactoryCloud-native ETL/ELT orchestration service with 90+ built-in connectors, code-free visual pipeline design, and serverless data transformation via Data Flows (powered by Spark). Supports event-driven and scheduled triggers, CI/CD with Git integration, and Self-Hosted Integration Runtime for on-premises connectivity.Azure Synapse AnalyticsA unified analytics platform that combines data integration (same engine as Data Factory), dedicated SQL pools for data warehousing, serverless SQL pools for query-on-demand, and Apache Spark pools for big data processing — all in one workspace with integrated monitoring.Integration Runtime (IR)The compute infrastructure that powers Data Factory pipeline execution. Azure IR: cloud-only, managed by Microsoft. Self-Hosted IR: installed on-premises for private network access to restricted data sources. Azure SSIS IR: runs SSIS packages in Azure.Medallion ArchitectureA data lakehouse pattern with three quality layers: Bronze (raw, unmodified data), Silver (cleaned, deduplicated, conformed data), Gold (business-ready aggregates for reporting). Implements progressive data quality and reduces redundant transformation work.Change Data Capture (CDC)A pattern that captures only the changed records from a source system rather than full extracts. Implementations: Cosmos DB Change Feed, SQL Server CDC, PostgreSQL replication slots. Enables efficient incremental loads to reduce warehouse load and latency.

Integration Pattern Selection

PatternLatencyComplexityBest For
Batch ETLHoursLowNightly loads, weekly aggregations
ELT (load then transform)HoursLow-MediumCloud-native data warehouses
StreamingSecondsHighIoT, clickstreams, real-time dashboards
CDC IncrementalMinutesMediumNear-real-time sync, reducing load
Lambda ArchitectureMixedHighRequires both batch accuracy and streaming speed

Pipeline Activity Types

ActivityPurpose
CopyBulk data movement between sources and sinks
Data FlowServerless Spark transformation (code-free)
Execute PipelineModular pipelines calling child pipelines
ForEachIterate over collections (multiple files, tables)
LookupFetch dynamic values for parameterized pipelines
If Condition / UntilBranching and retry logic
Deep Dive — How It Works

Architecture Patterns

ETL vs. ELT

ETL (Extract-Transform-Load) — transform before loading

  • Transform in a separate engine before writing to destination
  • Traditional approach (on-premises data warehouses)
  • Data Factory Data Flows or Databricks for transformation

ELT (Extract-Load-Transform) — load raw, transform in warehouse

  • Load raw data to data lake or warehouse first
  • Transform using warehouse compute (Synapse SQL pools, Databricks)
  • Preferred for cloud-native designs — leverages warehouse-scale compute

Medallion Lakehouse Architecture

  • Bronze: Raw data in native format — no transformation, append-only, full history
  • Silver: Cleaned, deduplicated, standardized schema — conformed dimensions
  • Gold: Business aggregates, KPIs, star-schema fact tables for BI consumption

On-Premises to Azure Connectivity

ScenarioIntegration Runtime
Cloud-to-cloud (Azure to Azure)Azure IR (default)
Restricted on-premises data sourceSelf-Hosted IR
Lift-and-shift SSIS packagesAzure-SSIS IR
On-premises behind firewall, no direct accessSelf-Hosted IR with outbound-only connectivity

Self-Hosted IR is installed on-premises and makes outbound connections to Data Factory — no inbound firewall rules required. This is the key advantage over VPN-only solutions for restricted networks.

Streaming Architecture

Real-time streaming pipeline:

  1. Ingestion: Event Hubs or IoT Hub
  2. Processing: Stream Analytics (SQL-like queries) or Apache Kafka on AKS
  3. Storage: Cosmos DB (hot queries), Data Lake (analytics), Synapse (reporting)
  4. Alerting: Azure Monitor + Action Groups on anomalies

Stream Analytics uses SQL-like query language with windowing functions (tumbling, hopping, sliding windows) for time-based aggregations — no code required.

Data Factory vs. Synapse Integration: When to Use Which

Use CaseRecommendation
Standalone ETL pipelines onlyAzure Data Factory
ETL + analytics in one workspaceAzure Synapse Analytics
Connecting to Dedicated SQL PoolSynapse Pipelines (same engine, native integration)
Multiple independent pipelines managed centrallyData Factory
Hands-On Lab

Hands-On: Create Data Factory Pipeline with ForEach Loop

Step 1: Create Data Factory and Pipeline

  1. Navigate to Azure Data Factory > Create
  2. Open Author hub in ADF Studio
  3. Click New pipeline and name it (e.g., ingest-customer-files)

Step 2: Add GetMetadata + ForEach for Multiple Files

  1. Add Get Metadata activity:
    • Dataset: Azure Blob Storage container
    • Field list: Child items (lists all files)
  2. Add ForEach activity connected to GetMetadata output
    • Items: @activity('GetMetadata').output.childItems
    • Sequential: False (parallel processing)
  3. Inside ForEach, add Copy activity:
    • Source: Blob Storage file (parameterized with @item().name)
    • Sink: Azure SQL Database or Data Lake

Step 3: Configure Self-Hosted Integration Runtime

  1. Open Manage hub > Integration runtimes > New
  2. Select Self-Hosted
  3. Name it and click Create
  4. Download and install the IR agent on the on-premises machine
  5. Register the node with the displayed authentication key
  6. Create Linked Services pointing to on-premises SQL Server using the Self-Hosted IR

Step 4: Monitor Pipeline Execution

  1. Open Monitor hub
  2. View Pipeline runs — check status (In Progress, Succeeded, Failed)
  3. Click a run to see Activity runs with timing details
  4. For failures: Click Output column for error details
  5. Set Alert rules for pipeline failures via Azure Monitor

Step 5: Design Medallion Architecture in Data Lake

  1. Create ADLS Gen2 containers: bronze, silver, gold
  2. Configure pipeline to:
    • Copy raw files from source to bronze/ (no transformation)
    • Data Flow to clean and standardize data into silver/
    • Spark notebook (via Databricks or Synapse) to aggregate to gold/
  3. Schedule bronze load nightly; silver/gold transformations after bronze completes
Exam Angle — What AZ-305 Tests

AZ-305 Exam Focus

AZ-305 tests your ability to select the right integration pattern, integration runtime type, and pipeline design for a given scenario. The most common question patterns involve: when to use Synapse vs. Data Factory, when to use Self-Hosted vs. Azure IR, and batch vs. streaming decisions.

Exam Trap

Data Factory vs. Synapse Integration: Synapse includes the same Data Factory pipeline engine. When a scenario requires both data integration AND analytics in the same workspace, recommend Synapse. If only ETL orchestration is needed without analytics, Data Factory is simpler and more appropriate.

Exam Trap

ETL vs. ELT Confusion: Modern cloud-native architectures prefer ELT — load raw data to the lake, then transform using warehouse compute. ETL (transform before loading) is the traditional on-premises pattern. The exam expects you to recommend ELT for new cloud designs.

Exam Trap

Streaming Always Better for Fresh Data: Streaming adds complexity and cost. If the business requirement is hourly or daily data freshness, batch ETL is the right choice. Only use streaming when latency requirements are measured in seconds or sub-minutes (real-time dashboards, live alerting, IoT anomaly detection).

Exam Trap

Azure IR for On-Premises Access: Azure IR cannot reach on-premises data sources behind firewalls without explicit network configuration (managed VNet integration). Self-Hosted IR must be installed on-premises and makes outbound connections — the correct and simpler choice for restricted on-premises access.

Exam Tip

ForEach for Dynamic File Processing: When a scenario involves processing a variable number of files or tables, the ForEach activity is the answer. It dynamically iterates over a collection, avoiding the need to create N individual pipelines. Combine with GetMetadata to list files and Copy/Data Flow inside the loop.

Must Memorize

Medallion Layer Purposes: Bronze = raw, unmodified, append-only. Silver = cleaned, deduplicated, standardized. Gold = business-ready aggregates for reporting. Data moves progressively through layers, never going backwards. This is the standard lakehouse architecture pattern for AZ-305 scenarios.

Question — click to flip

Q: What is the difference between ETL and ELT, and which is preferred for cloud-native designs?

Question — click to flip

Q: When do you use Self-Hosted Integration Runtime vs. Azure Integration Runtime?

Question — click to flip

Q: What are the three layers of medallion architecture and what does each contain?

Question — click to flip

Q: When should you use Azure Synapse Analytics instead of Azure Data Factory for data integration?

Question — click to flip

Q: What is Change Data Capture (CDC) and what Azure service natively supports it for Cosmos DB?

Question — click to flip

Q: How does the ForEach activity in Data Factory help when processing a variable number of files daily?

Sources & Further Reading