SlideShare una empresa de Scribd logo
1 de 25
Azure Data Factory: Mapping Data Flows
Performance Tuning Data Flows
v001
Sample Timings 1
Scenario 1
 Source: Delimited Text Blob Store
 Sink: Azure SQL DB
 File size: 421Mb, 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 4 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Current partitioning used throughout
Sample timings 2
 Scenario 2
 Source: Azure SQL DB Table
 Sink: Azure SQL DB Table
 Table size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 3 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Source partitioning on SQL DB Source, current partitioning
on Derived Column and Sink
Sample timings 3
 Scenario 3
 Source: Delimited Text Blob Store
 Sink: Delimited Text Blob store
 Table size: 74 columns, 887k rows
 Transforms: Single derived column to mask 3 fields
 Time: 2 mins end-to-end using memory optimized 80-core debug Azure IR
 Recommended settings: Leaving default/current partitioning throughout allows
ADF to scale-up/down partitions based on size of Azure IR (i.e. number of worker
cores)
File conversion Source->Sink property findings
 Large data sizes should use more vcores(16+) with memory optimized or
general purpose
 Compute optimized does not improve performance in this scenario
 CSV to parquet format convert has 45% time overhead in comparison with CSV
to CSV
 CSV to JSON format convert has 24% time overhead in comparison with CSV to
CSV
 CSV to JSON has better performance even though it has a lot of data to write
 CSV to parquet has a slight lag because of time spent in decompression
 Scaling V-cores improves performance for both IO and computation
File Conversion Timing
Compute type: General Purpose
• Dataset has 36 Columns of string, integer, short, double
• CSV dataset has 25 files with different file sizes
• Performance improvement scales proportionately with increase
in Vcores
• 8 Vcore to 64 Vcore performance increase is around 8 times more
SQL Database Timing
Synapse DW Timing
Compute type: General Purpose
Adding cores proportionally decreases time it takes to process data into staging files for Polybase. However, there is a
fairly static amount time that it takes to write that data from Parquet into SQL tables using Polybase.
CosmosDB Timing
Compute type: General Purpose
Window / Aggregate Timing
Compute type: General Purpose
• Performance improvement scales proportionately with
increase in Vcores
• 8 Vcore to 64 Vcore performance increase is around 5
times more
Transformation Timings
Compute type: General Purpose
Transformation recommendations
• When ranking data across entire dataset, use Rank
transformation instead of Window with rank()
• When using rowNumber() in Window to uniquely
add a row counter to each row across entire
dataset, instead use the Surrogate Key
transformation
TPCH Timings
Compute type: General Purpose
TPCH CSV in ADLS Gen 2
Azure Data Factory Data Flow Performance
* Includes cold cluster start-up time
Azure Synapse Data Flow Performance
* Includes cold cluster start-up time
Identifying bottlenecks
1. Cluster startup time
2. Sink processing time
3. Source read time
4. Transformation stage time
1. Sequential executions can
lower the cluster startup time
by setting a TTL in Azure IR
2. Total time to process the
stream from source to sink.
There is also a post-processing
time when you click on the Sink
that will show you how much
time Spark had to spend with
partition and job clean-up.
Write to single file and slow
database connections will
increase this time
3. Shows you how long it took to
read data from source.
Optimize with different source
partition strategies
4. This will show you bottlenecks
in your transformation logic.
With larger general purpose
and mem optimized IRs, most
of these operations occur in
memory in data frames and are
usually the fastest operations
in your data flow
File Partitioning
 Maintain current partitioning
 Avoid output to single file
 For manual partitioning, use number of cores from your Azure IR
and multiply by 5
 Example: transform a series of files in your ADLS folders w/32-core Azure IR, number of
partitions would be 32 x 5 = 160 partitions
 If you know data well enough to have high-cardinality columns, use those columns as Hash
partition
 If you do not know data patterns very well, use Round Robin
Best practices - Sources
 When reading from file-based sources, data flow automatically
partitions the data based on size
 ~128 MB per partition, evenly distributed
 Use current partitioning will be fastest for file-based and Synapse using PolyBase
 Enable staging for Synapse
 For Azure SQL DB, use Source partitioning on column with high
cardinality
 Improves performance, but can saturate your source database
 Reading can be limited by the I/O of your source
Optimizing transformations
 Each transformation has its own optimize tab
 Generally better to not alter -> reshuffling is a relatively slow process
 Reshuffling can occur if data is very skewed
 One node has a disproportionate amount of data
 For Joins, Exists and Lookups:
 If you have a many of these transforms, memory optimized greatly increases performance
 Can ‘Broadcast’ if the data on one side is small
 Rule of thumb: Less than 50k rows
 Use Window transformation partitioned over segments of data
 For Rank() across entire dataset, use the Rank transformation instead
 For RowNumber() across entire dataset, use the Surrogate Key transformation instead
 Transformations that require reshuffling like Sort negatively impact
performance
Best practices – Debug (Data Preview)
 Data Preview
 Data preview is inside the data flow designer transformation properties
 Uses row limits and sampling techniques to preview data from a small size of data
 Allows you to build and validate units of logic with samples of data in real time
 You have control over the size of the data limits under Debug Settings
 If you wish to test with larger datasets, set a larger compute size in the Azure IR when
switching on “Debug Mode”
 Data Preview is only a snapshot of data in memory from Spark data frames. This feature does
not write any data, so the sink drivers are not utilized and not tested in this mode.
Best practices – Debug (Pipeline Debug)
 Pipeline Debug
 Click debug button to test your data flow inside of a pipeline
 Default debug limits the execution runtime so you will want to limit data sizes
 Sampling can be applied here as well by using the “Enable Sampling” option in each Source
 Use the debug button option of “use activity IR” when you wish to use a job execution
compute environment
 This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the
default debug setting
Best practices - Sinks
 SQL:
 Disable indexes on target with pre/post SQL scripts
 Increase SQL capacity during pipeline execution
 Enable staging when using Synapse
 Use Source Partitioning on Source under Optimize
 Set number of partitions based on size of IR
 File-based sinks:
 Use current partitioning allows Spark to create output
 Output to single file is a slow operation
 Often unnecessary by whoever is consuming data
 Can set naming patterns or use data in column
 Any reshuffling of data is slow
 Cosmos DB
 Set throughput and batch size to meet performance requirements
Azure Integration Runtime
 Data Flows use JIT compute to minimize running expensive clusters
when they are mostly idle
 Generally more economical, but each cluster takes ~4 minutes to spin up
 IR specifies what cluster type and core-count to use
 Memory optimized is best, compute optimized doesn’t generally work for production workloads
 When running Sequential jobs utilize Time to Live to reuse cluster
between executions
 Keeps compute resources alive for TTL minutes after execution for new job to use
 Maximum one job per cluster
 Reduces job startup latency to ~1.5 minutes
 Rule of thumb: start small and scale up
Azure IR – General Purpose
• This was General Purpose 4+4, the default auto resolve Azure IR
• For prod workloads, GP is usually sufficient at >= 16 cores
• You get 1 driver and 1 worker node, both with 4 vcores
• Good for debugging, testing, and many production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 4 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 46s
• Transformation time: 42s
• Sink post-processing time: 45s
Azure IR – Compute Optimized
• Computed Optimized intended for smaller workloads
• 8+8, this is smallest CO option and you get 1 driver and 2
workers
• Not suitable for large production workloads
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 8 partitions
• Cluster startup time: 4.5 mins
• Sink IO writing: 20s
• Transformation time: 35s
• Sink post-processing time: 40s
• More worker nodes gave us more partitions and better perf than
General Purpose
Azure IR – Memory Optimized
• Memory Optimized well suited for large production workload
reliability with many aggregates, lookups, and joins
• 64+16 gives you 16 vcores for driver and 64 across worker nodes
• Tested with 887k row CSV file with 74 columns
• Default partitioning
• Spark chose 64 partitions
• Cluster startup time: 4.8 mins
• Sink IO writing: 19s
• Transformation time: 17s
• Sink post-processing time: 40s

Más contenido relacionado

La actualidad más candente

Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureDatabricks
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAlex Tumanoff
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2Manjeet Singh
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
 
Azure data factory security
Azure data factory securityAzure data factory security
Azure data factory securityMikeBrassil1
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with linksChris Testa-O'Neill
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWSGary Stafford
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 

La actualidad más candente (20)

Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Azure data bricks by Eugene Polonichko
Azure data bricks by Eugene PolonichkoAzure data bricks by Eugene Polonichko
Azure data bricks by Eugene Polonichko
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2Lift SSIS package to Azure Data Factory V2
Lift SSIS package to Azure Data Factory V2
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
Amazon Aurora: Under the Hood
Amazon Aurora: Under the HoodAmazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Azure data factory security
Azure data factory securityAzure data factory security
Azure data factory security
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Azure Data Factory presentation with links
Azure Data Factory presentation with linksAzure Data Factory presentation with links
Azure Data Factory presentation with links
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 

Similar a Azure Data Factory Data Flow Performance Tuning 101

Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mark Kromer
 
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...MongoDB
 
SRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraSRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraAmazon Web Services
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird databaseFabio Codebue
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...javier ramirez
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql serverIsabelle Van Campenhoudt
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Serverserge luca
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephenSteve Feldman
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Marco Tusa
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionSplunk
 
MariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB plc
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityHiromitsu Komatsu
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Speedment, Inc.
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Malin Weiss
 
Optimize SQL server performance for SharePoint
Optimize SQL server performance for SharePointOptimize SQL server performance for SharePoint
Optimize SQL server performance for SharePointserge luca
 
Fabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxFabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxMark Kromer
 

Similar a Azure Data Factory Data Flow Performance Tuning 101 (20)

Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
 
Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
 
SRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon AuroraSRV407 Deep Dive on Amazon Aurora
SRV407 Deep Dive on Amazon Aurora
 
Performance tuning in sql server
Performance tuning in sql serverPerformance tuning in sql server
Performance tuning in sql server
 
45 ways to speed up firebird database
45 ways to speed up firebird database45 ways to speed up firebird database
45 ways to speed up firebird database
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
Cómo se diseña una base de datos que pueda ingerir más de cuatro millones de ...
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql server
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Server
 
071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen071410 sun a_1515_feldman_stephen
071410 sun a_1515_feldman_stephen
 
Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2Scaling with sync_replication using Galera and EC2
Scaling with sync_replication using Galera and EC2
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
MariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and OptimizationMariaDB Performance Tuning and Optimization
MariaDB Performance Tuning and Optimization
 
Cassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra CommunityCassandra CLuster Management by Japan Cassandra Community
Cassandra CLuster Management by Japan Cassandra Community
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
Optimize SQL server performance for SharePoint
Optimize SQL server performance for SharePointOptimize SQL server performance for SharePoint
Optimize SQL server performance for SharePoint
 
Fabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxFabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptx
 

Más de Mark Kromer

Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesMark Kromer
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mark Kromer
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsMark Kromer
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsMark Kromer
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mark Kromer
 
Data Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFData Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFMark Kromer
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryMark Kromer
 
Data Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFData Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFMark Kromer
 
Data quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFData quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFMark Kromer
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryMark Kromer
 
ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2Mark Kromer
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1Mark Kromer
 
ADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationMark Kromer
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudMark Kromer
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data FlowMark Kromer
 
Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019Mark Kromer
 
Microsoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow ScenariosMicrosoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow ScenariosMark Kromer
 
Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019Mark Kromer
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekMark Kromer
 
Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018Mark Kromer
 

Más de Mark Kromer (20)

Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flows
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021
 
Data Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFData Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADF
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
 
Data Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFData Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADF
 
Data quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFData quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADF
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2
 
ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1ADF Mapping Data Flows Training Slides V1
ADF Mapping Data Flows Training Slides V1
 
ADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview Migration
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data Flow
 
Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019
 
Microsoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow ScenariosMicrosoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow Scenarios
 
Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
 
Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018Azure Data Factory for Redmond SQL PASS UG Sept 2018
Azure Data Factory for Redmond SQL PASS UG Sept 2018
 

Último

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 

Azure Data Factory Data Flow Performance Tuning 101

  • 1. Azure Data Factory: Mapping Data Flows Performance Tuning Data Flows v001
  • 2. Sample Timings 1 Scenario 1  Source: Delimited Text Blob Store  Sink: Azure SQL DB  File size: 421Mb, 74 columns, 887k rows  Transforms: Single derived column to mask 3 fields  Time: 4 mins end-to-end using memory optimized 80-core debug Azure IR  Recommended settings: Current partitioning used throughout
  • 3. Sample timings 2  Scenario 2  Source: Azure SQL DB Table  Sink: Azure SQL DB Table  Table size: 74 columns, 887k rows  Transforms: Single derived column to mask 3 fields  Time: 3 mins end-to-end using memory optimized 80-core debug Azure IR  Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink
  • 4. Sample timings 3  Scenario 3  Source: Delimited Text Blob Store  Sink: Delimited Text Blob store  Table size: 74 columns, 887k rows  Transforms: Single derived column to mask 3 fields  Time: 2 mins end-to-end using memory optimized 80-core debug Azure IR  Recommended settings: Leaving default/current partitioning throughout allows ADF to scale-up/down partitions based on size of Azure IR (i.e. number of worker cores)
  • 5. File conversion Source->Sink property findings  Large data sizes should use more vcores(16+) with memory optimized or general purpose  Compute optimized does not improve performance in this scenario  CSV to parquet format convert has 45% time overhead in comparison with CSV to CSV  CSV to JSON format convert has 24% time overhead in comparison with CSV to CSV  CSV to JSON has better performance even though it has a lot of data to write  CSV to parquet has a slight lag because of time spent in decompression  Scaling V-cores improves performance for both IO and computation
  • 6. File Conversion Timing Compute type: General Purpose • Dataset has 36 Columns of string, integer, short, double • CSV dataset has 25 files with different file sizes • Performance improvement scales proportionately with increase in Vcores • 8 Vcore to 64 Vcore performance increase is around 8 times more
  • 8. Synapse DW Timing Compute type: General Purpose Adding cores proportionally decreases time it takes to process data into staging files for Polybase. However, there is a fairly static amount time that it takes to write that data from Parquet into SQL tables using Polybase.
  • 10. Window / Aggregate Timing Compute type: General Purpose • Performance improvement scales proportionately with increase in Vcores • 8 Vcore to 64 Vcore performance increase is around 5 times more
  • 11. Transformation Timings Compute type: General Purpose Transformation recommendations • When ranking data across entire dataset, use Rank transformation instead of Window with rank() • When using rowNumber() in Window to uniquely add a row counter to each row across entire dataset, instead use the Surrogate Key transformation
  • 12. TPCH Timings Compute type: General Purpose TPCH CSV in ADLS Gen 2
  • 13. Azure Data Factory Data Flow Performance * Includes cold cluster start-up time
  • 14. Azure Synapse Data Flow Performance * Includes cold cluster start-up time
  • 15. Identifying bottlenecks 1. Cluster startup time 2. Sink processing time 3. Source read time 4. Transformation stage time 1. Sequential executions can lower the cluster startup time by setting a TTL in Azure IR 2. Total time to process the stream from source to sink. There is also a post-processing time when you click on the Sink that will show you how much time Spark had to spend with partition and job clean-up. Write to single file and slow database connections will increase this time 3. Shows you how long it took to read data from source. Optimize with different source partition strategies 4. This will show you bottlenecks in your transformation logic. With larger general purpose and mem optimized IRs, most of these operations occur in memory in data frames and are usually the fastest operations in your data flow
  • 16. File Partitioning  Maintain current partitioning  Avoid output to single file  For manual partitioning, use number of cores from your Azure IR and multiply by 5  Example: transform a series of files in your ADLS folders w/32-core Azure IR, number of partitions would be 32 x 5 = 160 partitions  If you know data well enough to have high-cardinality columns, use those columns as Hash partition  If you do not know data patterns very well, use Round Robin
  • 17. Best practices - Sources  When reading from file-based sources, data flow automatically partitions the data based on size  ~128 MB per partition, evenly distributed  Use current partitioning will be fastest for file-based and Synapse using PolyBase  Enable staging for Synapse  For Azure SQL DB, use Source partitioning on column with high cardinality  Improves performance, but can saturate your source database  Reading can be limited by the I/O of your source
  • 18. Optimizing transformations  Each transformation has its own optimize tab  Generally better to not alter -> reshuffling is a relatively slow process  Reshuffling can occur if data is very skewed  One node has a disproportionate amount of data  For Joins, Exists and Lookups:  If you have a many of these transforms, memory optimized greatly increases performance  Can ‘Broadcast’ if the data on one side is small  Rule of thumb: Less than 50k rows  Use Window transformation partitioned over segments of data  For Rank() across entire dataset, use the Rank transformation instead  For RowNumber() across entire dataset, use the Surrogate Key transformation instead  Transformations that require reshuffling like Sort negatively impact performance
  • 19. Best practices – Debug (Data Preview)  Data Preview  Data preview is inside the data flow designer transformation properties  Uses row limits and sampling techniques to preview data from a small size of data  Allows you to build and validate units of logic with samples of data in real time  You have control over the size of the data limits under Debug Settings  If you wish to test with larger datasets, set a larger compute size in the Azure IR when switching on “Debug Mode”  Data Preview is only a snapshot of data in memory from Spark data frames. This feature does not write any data, so the sink drivers are not utilized and not tested in this mode.
  • 20. Best practices – Debug (Pipeline Debug)  Pipeline Debug  Click debug button to test your data flow inside of a pipeline  Default debug limits the execution runtime so you will want to limit data sizes  Sampling can be applied here as well by using the “Enable Sampling” option in each Source  Use the debug button option of “use activity IR” when you wish to use a job execution compute environment  This option is good for debugging with larger datasets. It will not have the same execution timeout limit as the default debug setting
  • 21. Best practices - Sinks  SQL:  Disable indexes on target with pre/post SQL scripts  Increase SQL capacity during pipeline execution  Enable staging when using Synapse  Use Source Partitioning on Source under Optimize  Set number of partitions based on size of IR  File-based sinks:  Use current partitioning allows Spark to create output  Output to single file is a slow operation  Often unnecessary by whoever is consuming data  Can set naming patterns or use data in column  Any reshuffling of data is slow  Cosmos DB  Set throughput and batch size to meet performance requirements
  • 22. Azure Integration Runtime  Data Flows use JIT compute to minimize running expensive clusters when they are mostly idle  Generally more economical, but each cluster takes ~4 minutes to spin up  IR specifies what cluster type and core-count to use  Memory optimized is best, compute optimized doesn’t generally work for production workloads  When running Sequential jobs utilize Time to Live to reuse cluster between executions  Keeps compute resources alive for TTL minutes after execution for new job to use  Maximum one job per cluster  Reduces job startup latency to ~1.5 minutes  Rule of thumb: start small and scale up
  • 23. Azure IR – General Purpose • This was General Purpose 4+4, the default auto resolve Azure IR • For prod workloads, GP is usually sufficient at >= 16 cores • You get 1 driver and 1 worker node, both with 4 vcores • Good for debugging, testing, and many production workloads • Tested with 887k row CSV file with 74 columns • Default partitioning • Spark chose 4 partitions • Cluster startup time: 4.5 mins • Sink IO writing: 46s • Transformation time: 42s • Sink post-processing time: 45s
  • 24. Azure IR – Compute Optimized • Computed Optimized intended for smaller workloads • 8+8, this is smallest CO option and you get 1 driver and 2 workers • Not suitable for large production workloads • Tested with 887k row CSV file with 74 columns • Default partitioning • Spark chose 8 partitions • Cluster startup time: 4.5 mins • Sink IO writing: 20s • Transformation time: 35s • Sink post-processing time: 40s • More worker nodes gave us more partitions and better perf than General Purpose
  • 25. Azure IR – Memory Optimized • Memory Optimized well suited for large production workload reliability with many aggregates, lookups, and joins • 64+16 gives you 16 vcores for driver and 64 across worker nodes • Tested with 887k row CSV file with 74 columns • Default partitioning • Spark chose 64 partitions • Cluster startup time: 4.8 mins • Sink IO writing: 19s • Transformation time: 17s • Sink post-processing time: 40s