Azure Data Integration Services
Azure Data FactoryCloud-native ETL/ELT orchestration service with 90+ built-in connectors, code-free visual pipeline design, and serverless data transformation via Data Flows (powered by Spark). Supports event-driven and scheduled triggers, CI/CD with Git integration, and Self-Hosted Integration Runtime for on-premises connectivity.Azure Synapse AnalyticsA unified analytics platform that combines data integration (same engine as Data Factory), dedicated SQL pools for data warehousing, serverless SQL pools for query-on-demand, and Apache Spark pools for big data processing — all in one workspace with integrated monitoring.Integration Runtime (IR)The compute infrastructure that powers Data Factory pipeline execution. Azure IR: cloud-only, managed by Microsoft. Self-Hosted IR: installed on-premises for private network access to restricted data sources. Azure SSIS IR: runs SSIS packages in Azure.Medallion ArchitectureA data lakehouse pattern with three quality layers: Bronze (raw, unmodified data), Silver (cleaned, deduplicated, conformed data), Gold (business-ready aggregates for reporting). Implements progressive data quality and reduces redundant transformation work.Change Data Capture (CDC)A pattern that captures only the changed records from a source system rather than full extracts. Implementations: Cosmos DB Change Feed, SQL Server CDC, PostgreSQL replication slots. Enables efficient incremental loads to reduce warehouse load and latency.Integration Pattern Selection
| Pattern | Latency | Complexity | Best For |
|---|---|---|---|
| Batch ETL | Hours | Low | Nightly loads, weekly aggregations |
| ELT (load then transform) | Hours | Low-Medium | Cloud-native data warehouses |
| Streaming | Seconds | High | IoT, clickstreams, real-time dashboards |
| CDC Incremental | Minutes | Medium | Near-real-time sync, reducing load |
| Lambda Architecture | Mixed | High | Requires both batch accuracy and streaming speed |
Pipeline Activity Types
| Activity | Purpose |
|---|---|
| Copy | Bulk data movement between sources and sinks |
| Data Flow | Serverless Spark transformation (code-free) |
| Execute Pipeline | Modular pipelines calling child pipelines |
| ForEach | Iterate over collections (multiple files, tables) |
| Lookup | Fetch dynamic values for parameterized pipelines |
| If Condition / Until | Branching and retry logic |
Architecture Patterns
ETL vs. ELT
ETL (Extract-Transform-Load) — transform before loading
- Transform in a separate engine before writing to destination
- Traditional approach (on-premises data warehouses)
- Data Factory Data Flows or Databricks for transformation
ELT (Extract-Load-Transform) — load raw, transform in warehouse
- Load raw data to data lake or warehouse first
- Transform using warehouse compute (Synapse SQL pools, Databricks)
- Preferred for cloud-native designs — leverages warehouse-scale compute
Medallion Lakehouse Architecture
- Bronze: Raw data in native format — no transformation, append-only, full history
- Silver: Cleaned, deduplicated, standardized schema — conformed dimensions
- Gold: Business aggregates, KPIs, star-schema fact tables for BI consumption
On-Premises to Azure Connectivity
| Scenario | Integration Runtime |
|---|---|
| Cloud-to-cloud (Azure to Azure) | Azure IR (default) |
| Restricted on-premises data source | Self-Hosted IR |
| Lift-and-shift SSIS packages | Azure-SSIS IR |
| On-premises behind firewall, no direct access | Self-Hosted IR with outbound-only connectivity |
Self-Hosted IR is installed on-premises and makes outbound connections to Data Factory — no inbound firewall rules required. This is the key advantage over VPN-only solutions for restricted networks.
Streaming Architecture
Real-time streaming pipeline:
- Ingestion: Event Hubs or IoT Hub
- Processing: Stream Analytics (SQL-like queries) or Apache Kafka on AKS
- Storage: Cosmos DB (hot queries), Data Lake (analytics), Synapse (reporting)
- Alerting: Azure Monitor + Action Groups on anomalies
Stream Analytics uses SQL-like query language with windowing functions (tumbling, hopping, sliding windows) for time-based aggregations — no code required.
Data Factory vs. Synapse Integration: When to Use Which
| Use Case | Recommendation |
|---|---|
| Standalone ETL pipelines only | Azure Data Factory |
| ETL + analytics in one workspace | Azure Synapse Analytics |
| Connecting to Dedicated SQL Pool | Synapse Pipelines (same engine, native integration) |
| Multiple independent pipelines managed centrally | Data Factory |
Hands-On: Create Data Factory Pipeline with ForEach Loop
Step 1: Create Data Factory and Pipeline
- Navigate to Azure Data Factory > Create
- Open Author hub in ADF Studio
- Click New pipeline and name it (e.g.,
ingest-customer-files)
Step 2: Add GetMetadata + ForEach for Multiple Files
- Add Get Metadata activity:
- Dataset: Azure Blob Storage container
- Field list: Child items (lists all files)
- Add ForEach activity connected to GetMetadata output
- Items:
@activity('GetMetadata').output.childItems - Sequential: False (parallel processing)
- Items:
- Inside ForEach, add Copy activity:
- Source: Blob Storage file (parameterized with
@item().name) - Sink: Azure SQL Database or Data Lake
- Source: Blob Storage file (parameterized with
Step 3: Configure Self-Hosted Integration Runtime
- Open Manage hub > Integration runtimes > New
- Select Self-Hosted
- Name it and click Create
- Download and install the IR agent on the on-premises machine
- Register the node with the displayed authentication key
- Create Linked Services pointing to on-premises SQL Server using the Self-Hosted IR
Step 4: Monitor Pipeline Execution
- Open Monitor hub
- View Pipeline runs — check status (In Progress, Succeeded, Failed)
- Click a run to see Activity runs with timing details
- For failures: Click Output column for error details
- Set Alert rules for pipeline failures via Azure Monitor
Step 5: Design Medallion Architecture in Data Lake
- Create ADLS Gen2 containers:
bronze,silver,gold - Configure pipeline to:
- Copy raw files from source to
bronze/(no transformation) - Data Flow to clean and standardize data into
silver/ - Spark notebook (via Databricks or Synapse) to aggregate to
gold/
- Copy raw files from source to
- Schedule bronze load nightly; silver/gold transformations after bronze completes
AZ-305 Exam Focus
AZ-305 tests your ability to select the right integration pattern, integration runtime type, and pipeline design for a given scenario. The most common question patterns involve: when to use Synapse vs. Data Factory, when to use Self-Hosted vs. Azure IR, and batch vs. streaming decisions.
Exam Trap
Data Factory vs. Synapse Integration: Synapse includes the same Data Factory pipeline engine. When a scenario requires both data integration AND analytics in the same workspace, recommend Synapse. If only ETL orchestration is needed without analytics, Data Factory is simpler and more appropriate.
Exam Trap
ETL vs. ELT Confusion: Modern cloud-native architectures prefer ELT — load raw data to the lake, then transform using warehouse compute. ETL (transform before loading) is the traditional on-premises pattern. The exam expects you to recommend ELT for new cloud designs.
Exam Trap
Streaming Always Better for Fresh Data: Streaming adds complexity and cost. If the business requirement is hourly or daily data freshness, batch ETL is the right choice. Only use streaming when latency requirements are measured in seconds or sub-minutes (real-time dashboards, live alerting, IoT anomaly detection).
Exam Trap
Azure IR for On-Premises Access: Azure IR cannot reach on-premises data sources behind firewalls without explicit network configuration (managed VNet integration). Self-Hosted IR must be installed on-premises and makes outbound connections — the correct and simpler choice for restricted on-premises access.
Exam Tip
ForEach for Dynamic File Processing: When a scenario involves processing a variable number of files or tables, the ForEach activity is the answer. It dynamically iterates over a collection, avoiding the need to create N individual pipelines. Combine with GetMetadata to list files and Copy/Data Flow inside the loop.
Must Memorize
Medallion Layer Purposes: Bronze = raw, unmodified, append-only. Silver = cleaned, deduplicated, standardized. Gold = business-ready aggregates for reporting. Data moves progressively through layers, never going backwards. This is the standard lakehouse architecture pattern for AZ-305 scenarios.
Question — click to flip
Q: What is the difference between ETL and ELT, and which is preferred for cloud-native designs?
Question — click to flip
Q: When do you use Self-Hosted Integration Runtime vs. Azure Integration Runtime?
Question — click to flip
Q: What are the three layers of medallion architecture and what does each contain?
Question — click to flip
Q: When should you use Azure Synapse Analytics instead of Azure Data Factory for data integration?
Question — click to flip
Q: What is Change Data Capture (CDC) and what Azure service natively supports it for Cosmos DB?
Question — click to flip
Q: How does the ForEach activity in Data Factory help when processing a variable number of files daily?