Dataflow Gen2 – Low-Code ETL in Microsoft Fabric (Architecture, Demo & Best Practices)
In this episode of Fabric Friday, we explore Dataflow Gen2 in Microsoft Fabric: a scalable, low-code ETL solution built on Power Query, Spark, and OneLake.
Dataflow Gen2 keeps the familiar Power BI transformation experience but adds disk-based staging and Spark execution, so you can process large datasets efficiently and reliably.
🔍 What you’ll learn
Dataflow Gen2 architecture
• How Gen2 differs from Gen1
• Why staging on disk enables large-scale ETL
• How Spark and Parquet are used behind the scenes
Practical demo
• Importing data from an OData source (see the M sketch after this list)
• Transformations with Power Query
• Data preview, column quality, and diagram view
• Writing outputs to Lakehouse and Warehouse
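For reference, here is a minimal Power Query M sketch of an OData import like the one in the demo. The Northwind URL is the public demo feed; the column selection is illustrative, not the exact steps from the video:

```
let
    // Connect to the public Northwind OData feed used in the demo
    Source = OData.Feed(
        "https://services.odata.org/V4/Northwind/Northwind.svc/",
        null,
        [Implementation = "2.0"]
    ),
    // Navigate to the Customers entity set
    Customers = Source{[Name = "Customers", Signature = "table"]}[Data],
    // Keep only the columns later steps actually need
    Slim = Table.SelectColumns(Customers, {"CustomerID", "CompanyName", "Country"})
in
    Slim
```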
Multi-destination & incremental refresh
• Writing to multiple targets in one run
• Bronze / silver / gold layering
• Incremental loading using modified timestamps (sketched below)
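A minimal M sketch of the incremental pattern, assuming the source has a ModifiedDate column. The inline table and the hard-coded LastRefresh watermark are stand-ins; in a real dataflow the data comes from a connector and the watermark from a parameter or lookup:

```
let
    // Toy Orders table standing in for the real source
    Orders = #table(
        type table [OrderID = Int64.Type, ModifiedDate = datetime],
        {
            {1, #datetime(2024, 1, 5, 10, 0, 0)},
            {2, #datetime(2024, 3, 9, 14, 30, 0)}
        }
    ),
    // Watermark from the previous run (hypothetical; normally a parameter or a lookup table)
    LastRefresh = #datetime(2024, 2, 1, 0, 0, 0),
    // Load only the rows that changed since the last run
    ChangedRows = Table.SelectRows(Orders, each [ModifiedDate] > LastRefresh)
in
    ChangedRows
```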
Integration with pipelines
• Calling dataflows from pipelines
• Separating ETL logic from orchestration
Performance & best practices
• When to enable staging
• Filtering and column reduction (see the sketch after this list)
• Naming and modular query design
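To make the "filter early, reduce columns" advice concrete, a hedged M sketch against a hypothetical SQL source (server, database, and column names are made up). Both steps sit right after the navigation step so the predicate and projection can fold back to the source:

```
let
    // Hypothetical SQL source: replace with your own server and database
    Source = Sql.Database("myserver.database.windows.net", "SalesDb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Filter rows as early as possible so the work folds to the source
    Recent = Table.SelectRows(Orders, each [OrderDate] >= #date(2024, 1, 1)),
    // Keep only the columns downstream steps actually use
    Slim = Table.SelectColumns(Recent, {"OrderID", "CustomerID", "OrderDate", "Freight"})
in
    Slim
```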
⸻
⭐ Key Takeaways
• Dataflow Gen2 enables enterprise-scale, low-code ETL
• Staging + Spark are the foundation for scalability
• Multi-destination outputs simplify medallion architectures
• Incremental refresh is essential for performance
• Pipelines and dataflows work best together
00:00 Matthias is here, your trusted advisor
00:02 Let's talk about dataflows and enterprise-grade ETL pipelines without code
00:11 End-to-end architecture overview
00:26 150+ supported data sources (SQL Server, cloud databases, files, APIs, etc.)
01:02 ETL process: extract with connectors, transform with Power Query, stage to a Lakehouse, then a fast copy to the destination
02:12 Staging explained: Spark-based, fast, and scalable behind the Power Query interface
03:16 Difference between Dataflow Gen1 and Gen2 (in-memory vs. Spark + staging)
04:26 Gen2 architecture visualized
05:11 Gen1 architecture and limitations
05:42 Recommendation to always use Gen2 with staging enabled by default
06:07 Demo start: create a Dataflow Gen2 and import Northwind data with the OData connector
07:22 Data preview and transformation basics in the Power Query interface
08:01 Data profiling features (histograms, statistics, diagram view)
09:30 Data transformation: grouping and counting orders per customer (see the M sketch at the end of the chapter list)
10:10 Merging the Customers and Orders queries into a combined table
11:00 Column selection and expansion in the merged table
14:00 Adding a rank column
15:00 Setting up data destinations (Lakehouse, Warehouse, SQL DB)
16:00 Disabling the staging option for the demo (not recommended for large datasets)
18:50 Performance comparison: staging enabled vs. disabled
19:40 Multi-destination pattern in a single dataflow (bronze, silver, gold layers)
20:45 Incremental refresh pattern explained (requires a timestamp column; loads only changed data)
23:00 Integrating Dataflow Gen2 into pipelines for orchestration and scheduling
24:10 Demo: creating a pipeline and running a dataflow activity
29:07 Common community questions:
29:22 - When to use Dataflow Gen2 vs. pipelines
29:58 - How to speed up slow dataflows (enable staging, filter early, remove columns)
30:29 - Whether multiple outputs in one dataflow are possible
30:49 - Learning path (start with Dataflow Gen2 before notebooks)
31:21 Tips & tricks: transformations, naming steps, starting small with datasets
32:22 Thanks and closing remarks
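For readers who want to replay the demo steps (chapters 09:30 to 14:00) outside the UI, here is a rough M reconstruction of the grouping, merging, expanding, and ranking; the exact step names and options in the video may differ:

```
let
    // Northwind feed, as in the demo
    Source = OData.Feed(
        "https://services.odata.org/V4/Northwind/Northwind.svc/",
        null,
        [Implementation = "2.0"]
    ),
    Customers = Source{[Name = "Customers", Signature = "table"]}[Data],
    Orders = Source{[Name = "Orders", Signature = "table"]}[Data],
    // Chapter 09:30: count orders per customer
    OrderCounts = Table.Group(
        Orders, {"CustomerID"},
        {{"OrderCount", each Table.RowCount(_), Int64.Type}}
    ),
    // Chapter 10:10: merge Customers with the counts (left outer join)
    Merged = Table.NestedJoin(
        Customers, {"CustomerID"},
        OrderCounts, {"CustomerID"},
        "Counts", JoinKind.LeftOuter
    ),
    // Chapter 11:00: expand only the column we need from the nested table
    Expanded = Table.ExpandTableColumn(Merged, "Counts", {"OrderCount"}),
    // Customers without orders come back as null; treat them as zero before ranking
    Filled = Table.ReplaceValue(Expanded, null, 0, Replacer.ReplaceValue, {"OrderCount"}),
    // Chapter 14:00: rank customers by order count
    Ranked = Table.AddRankColumn(Filled, "Rank", {"OrderCount", Order.Descending})
in
    Ranked
```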