2. #azuresatpn
Nice to meet you
Riccardo Perico | rperico@solidq.com | @R1k91
SolidQ
Data Platform & BI Specialist
10 years working, training and speaking in Microsoft «Data Realm»
MCP: MTA, MCSA
https://www.linkedin.com/in/riccardo-perico-8b942384/
5. #azuresatpn
What is ADF, really?
• Cloud-based data integration service
• Orchestrates and automates data movement and transformation
• Allows monitoring and debugging
• Programmable
7. #azuresatpn
Sample Workflow
[Diagram: a workflow in four stages: Data sources, Ingest, Transform and analyze, Publish]
• Data sources: on-premises data mart (product table), customer web logs
• Ingest: move the customer web logs and the product table into Azure Blob storage
• Transform and analyze: map, transform and combine them into a combined input table, then analyze
• Publish: move the product recommendations to Azure DB and visualize
Data set: a collection of files, a DB table, etc.
Pipeline: a sequence of activities (logical group)
Activity: a processing step (Hadoop job, custom code, ML model, etc.)
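The three terms on this slide (data set, activity, pipeline) can be illustrated with a toy in-memory model. This is only a sketch of the concepts, not the ADF API: the class names, the `execute` method, and the sample rows are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[dict]


@dataclass
class Activity:
    """A single processing step: reads named data sets, writes one."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., Rows]


@dataclass
class Pipeline:
    """A logical group of activities, executed here in list order."""
    name: str
    activities: List[Activity] = field(default_factory=list)

    def execute(self, datasets: Dict[str, Rows]) -> Dict[str, Rows]:
        for activity in self.activities:
            args = [datasets[name] for name in activity.inputs]
            datasets[activity.output] = activity.run(*args)
        return datasets


def combine(logs: Rows, products: Rows) -> Rows:
    # Join each web-log row to its product name.
    names = {p["product_id"]: p["name"] for p in products}
    return [{"customer": r["customer"], "product": names[r["product_id"]]} for r in logs]


def analyze(combined: Rows) -> Rows:
    # Rank products by number of views, most viewed first.
    counts: Dict[str, int] = {}
    for row in combined:
        counts[row["product"]] = counts.get(row["product"], 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [{"product": p, "views": n} for p, n in ranked]


# Hypothetical sample data standing in for the ingested data sets.
datasets = {
    "customer_web_logs": [
        {"customer": "c1", "product_id": 1},
        {"customer": "c2", "product_id": 2},
        {"customer": "c1", "product_id": 2},
    ],
    "product_table": [
        {"product_id": 1, "name": "bike"},
        {"product_id": 2, "name": "helmet"},
    ],
}

pipeline = Pipeline("recommendations", [
    Activity("Combine", ["customer_web_logs", "product_table"], "combined_input_table", combine),
    Activity("Analyze", ["combined_input_table"], "product_recommendations", analyze),
])
result = pipeline.execute(datasets)
```

Each activity only names the data sets it reads and writes, which is the same separation the slide draws between data sets and the activities that process them.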
8. #azuresatpn
[Diagram: Azure services by stage: Devices, Device Connectivity, Storage, Analytics, Presentation & Action]
• Device Connectivity: Event Hubs, IoT Hubs, Service Bus, External Data Sources
• Storage: SQL Database, Table/Blob Storage, Cosmos DB, External Data Sources
• Analytics: Machine Learning, Stream Analytics, HDInsight, Data Factory, Data Lake Analytics
• Presentation & Action: App Service, Power BI, Notification Hubs, Mobile Services, BizTalk Services
11. #azuresatpn
Activities & Pipelines
An Activity is a single task in a workflow:
• Copy from input to output
• Transform
  • C#
  • Stored Procedure
  • Hadoop (Map/Reduce, Hive, Pig)
  • ML, Data Lake Analytics
  • Databricks
• Control
  • IF, ForEach, Until, Wait, Execute Pipeline
  • Web
Pipeline groups activities
[Diagram labels: SQL Server, SQL DB, SQL Server VMs]
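A pipeline with two chained activities can be sketched as the JSON document ADF stores for it, built here as a plain Python dict. The overall shape (`properties.activities`, `dependsOn`, `DatasetReference`) follows the ADF v2 pipeline JSON format; every name (`SamplePipeline`, `WebLogsBlob`, `StagingTable`, `AzureSqlDb`, `usp_MergeStaging`) is hypothetical, and the source and sink types are assumptions that depend on the connectors actually used.

```python
import json


def dataset_ref(name: str) -> dict:
    # Reference to a data set defined elsewhere in the factory.
    return {"referenceName": name, "type": "DatasetReference"}


# Copy activity: moves data from an input data set to an output data set.
copy_activity = {
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [dataset_ref("WebLogsBlob")],
    "outputs": [dataset_ref("StagingTable")],
    "typeProperties": {
        "source": {"type": "BlobSource"},  # assumed connector type
        "sink": {"type": "SqlSink"},       # assumed connector type
    },
}

# Transform activity: a stored procedure that runs only if the copy succeeded.
stored_procedure_activity = {
    "name": "MergeStaging",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        {"activity": "CopyBlobToSql", "dependencyConditions": ["Succeeded"]}
    ],
    "linkedServiceName": {
        "referenceName": "AzureSqlDb",
        "type": "LinkedServiceReference",
    },
    "typeProperties": {"storedProcedureName": "usp_MergeStaging"},
}

# The pipeline is the logical group that holds both activities.
pipeline = {
    "name": "SamplePipeline",
    "properties": {"activities": [copy_activity, stored_procedure_activity]},
}

pipeline_json = json.dumps(pipeline, indent=2)
```

The `dependsOn` entry is what turns a flat list of activities into a workflow: without it, ADF is free to start both activities at the same time.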
12. #azuresatpn
Integration Runtime
• Bridge between Activity and Linked Service
• Compute environment where an activity runs or from which it is dispatched
3 types of IR:
• IR Azure
• IR Self-hosted
• IR Azure-SSIS
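The "bridge" role can be made concrete with a linked service definition, sketched here as a Python dict. A linked service points at its integration runtime through `connectVia`; when that property is absent, ADF uses the default Azure IR, named `AutoResolveIntegrationRuntime`. The names `OnPremSqlServer` and `MySelfHostedIR` are hypothetical, and the connection string is left as a placeholder.

```python
SELF_HOSTED_IR_NAME = "MySelfHostedIR"  # hypothetical IR name
DEFAULT_AZURE_IR_NAME = "AutoResolveIntegrationRuntime"

# Linked service for an on-premises SQL Server reached through a self-hosted IR.
linked_service = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {"connectionString": "<connection string>"},
        "connectVia": {
            "referenceName": SELF_HOSTED_IR_NAME,
            "type": "IntegrationRuntimeReference",
        },
    },
}


def runtime_for(service: dict) -> str:
    """Return the name of the IR a linked service is bound to.

    Falls back to the default Azure IR when no connectVia is set.
    """
    connect_via = service["properties"].get("connectVia")
    return connect_via["referenceName"] if connect_via else DEFAULT_AZURE_IR_NAME


on_prem_runtime = runtime_for(linked_service)
cloud_runtime = runtime_for({"name": "BlobStore", "properties": {"type": "AzureBlobStorage"}})
```

An activity never names an IR directly: it references a linked service (or a data set built on one), and the linked service decides which runtime does the work.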
14. #azuresatpn
ADF Location vs IR Location
• ADF location: where the metadata is stored and where pipeline triggering starts
• IR location: where the backend compute engine runs (data movement, activity dispatch and SSIS execution)
The ADF location and the IR location can be different
IR can use “Auto Resolve”
15. #azuresatpn
Mapping Data Flows
• Based on Spark
• Uses Databricks behind the scenes
• A lot of transformations already available
• Few sources available for now
• GA announced this week!
17. #azuresatpn
Developer Tools
• Azure Portal: Create, Edit. Visual and Textual
• Visual Studio: Integrated in VS project
• PowerShell: cmdlets https://docs.microsoft.com/en-us/powershell/module/azurerm.datafactories/?view=azurermps-6.13.0
• Azure RM Template
18. #azuresatpn
Pricing
Multiple factors affect pricing
• Number of Activities run
• Volume of data moved
• SQL Server Integration Services Compute Hours
• Whether you are re-running an activity
https://azure.microsoft.com/en-us/pricing/details/data-factory/v2/
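How these factors combine can be sketched as simple arithmetic. The rates below are placeholders chosen to keep the numbers easy to follow, not real Azure prices (see the pricing page above for those), and treating a re-run as one more billed activity run is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Placeholder unit prices; real values come from the pricing page."""
    per_1000_activity_runs: float
    per_data_movement_unit: float
    per_ssis_compute_hour: float


def estimate_cost(rates: Rates, activity_runs: int, reruns: int,
                  data_movement_units: float, ssis_hours: float) -> float:
    # Assumption: a re-run is billed like any other activity run.
    orchestration = (activity_runs + reruns) / 1000 * rates.per_1000_activity_runs
    movement = data_movement_units * rates.per_data_movement_unit
    ssis = ssis_hours * rates.per_ssis_compute_hour
    return orchestration + movement + ssis


PLACEHOLDER_RATES = Rates(
    per_1000_activity_runs=1.0,
    per_data_movement_unit=0.25,
    per_ssis_compute_hour=0.5,
)

estimate = estimate_cost(
    PLACEHOLDER_RATES,
    activity_runs=10_000,
    reruns=500,
    data_movement_units=20,
    ssis_hours=8,
)
# 10.5 (orchestration) + 5.0 (movement) + 4.0 (SSIS) = 19.5 in placeholder units
```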
Enterprises have data of various types (structured, unstructured, and semi-structured) located in disparate sources, on-premises and in the cloud, all arriving at different intervals and speeds.
The first step in building an information production system is to connect to all the required sources.
Without Data Factory, enterprises must build custom data movement components, which often lack the enterprise-grade monitoring, alerting, and controls that a fully managed service can offer.
After data is present in a centralized data store in the cloud, process or transform the collected data by using compute services such as HDInsight Hadoop, Spark, Data Lake Analytics, and Machine Learning.
After the raw data has been refined into a business-ready consumable form, load the data into Azure SQL Data Warehouse, Azure SQL Database, Azure Cosmos DB, or whichever analytics engine your business users can point to from their business intelligence tools.
The Data Set is a view of input/output data
Data sets identify the data from different data stores.
Azure: publicly accessible endpoints, serverless, fully managed, pay only for use, scaled up automagically according to copy activity properties.
Self-hosted: everything works in a private network behind the corporate firewall, with outbound HTTP connections only. A Windows server is needed, with the IR installed on it. Supports active-active load balancing.
Azure-SSIS: a set of VMs that natively executes SSIS packages. Supports bring-your-own SSISDB on Azure SQL DB or Managed Instance. To reach on-premises data, use an Azure Virtual Network with a site-to-site VPN.
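The notes on the three IR types boil down to a rule of thumb, sketched below. The function and its two flags are hypothetical simplifications for illustration; real deployments often mix several runtimes in one factory.

```python
from enum import Enum


class IRType(Enum):
    AZURE = "Azure"
    SELF_HOSTED = "Self-hosted"
    AZURE_SSIS = "Azure-SSIS"


def choose_ir(runs_ssis_packages: bool, data_in_private_network: bool) -> IRType:
    """Pick an IR type from two coarse questions about the workload."""
    if runs_ssis_packages:
        # SSIS packages need the dedicated set of VMs.
        return IRType.AZURE_SSIS
    if data_in_private_network:
        # Data behind the corporate firewall needs a locally installed IR.
        return IRType.SELF_HOSTED
    # Publicly accessible endpoints can use the fully managed runtime.
    return IRType.AZURE


choice_for_on_prem_sql = choose_ir(runs_ssis_packages=False, data_in_private_network=True)
```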
Mapping Data Flows are visually designed data transformations in Azure Data Factory
Copy activity from S3 to SQL
REST to SQL
Dataset transformation with a stored procedure (SP) vs Mapping Data Flows (MDF)
Trigger & Monitor