3. The need for evolution – identified two years ago
"… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing."
– Gartner, "The State of Data Warehousing in 2012"
4. The "Traditional" Data Warehouse
Data sources feed the warehouse, but four pressures are straining the traditional model:
1. Increasing data volumes
2. New data sources & types
3. Cloud-born data
4. Real-time data and non-relational data
5. Evolving Approaches to Analytics
The classic pipeline: an ETL tool (SSIS, etc.) extracts the original data, transforms it, and loads the transformed data into an EDW (SQL Server, Teradata, etc.), which in turn feeds data marts, data lake(s), and BI tools (dashboards, apps).
6. Evolving Approaches to Analytics
The same ETL pipeline still loads transformed data into the EDW (SQL Server, Teradata, etc.), but the original data is now also ingested as-is (EL – extract and load, no upfront transform) into data lake(s), alongside the data marts, BI tools, dashboards, and apps.
7. Evolving Approaches to Analytics
Next, the original data – including streaming data – is ingested (EL) into scale-out storage & compute (HDFS, Blob Storage, etc.), transformed there, and loaded into the EDW; data marts, data lake(s), BI tools, dashboards, and apps consume the results.
8. Evolving Approaches to Analytics
The full picture adds real-time data analytics: streaming data flows through the scale-out storage & compute tier (HDFS, Blob Storage, etc.) alongside the ETL path into the EDW, and results surface through data marts, data lake(s), BI tools, dashboards, and apps.
10. Azure Data Factory Overview
• New Azure service for data developers & IT
• Compose data processing, storage, and movement services to create & manage analytics pipelines
• Initially focused on Azure & hybrid movement to/from on-premises SQL Server; over time it will expand to more storage & processing systems
• Rich, simple end-to-end pipeline monitoring and management
13. Customer Profiling – Game Usage Analytics
2277,2013-06-01 02:26:54.3943450,111,164.234.187.32,24.84.225.233,true,8,1,2058
2277,2013-06-01 03:26:23.2240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-2123-2009-2068-2166
2277,2013-06-01 04:22:39.4940000,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 05:43:54.1240000,111,164.234.187.32,24.84.225.233,true,8,1,2058-225545-2309-2068-2166
2277,2013-06-01 06:11:23.9274300,111,164.234.187.32,24.84.225.233,true,8,1,223-2123-2009-4229-9936623
2277,2013-06-01 07:37:01.3962500,111,164.234.187.32,24.84.225.233,true,8,1,
2277,2013-06-01 08:12:03.1109790,111,164.234.187.32,24.84.225.233,true,8,1,234322-2123-2234234-12432-344323
…
Log Files Snippet (10s of TBs per day in cloud storage)
User Table
UserID FirstName LastName State …
2277 Pratik Patel Oregon
664432 Dave Nettleton Washington
8853 Mike Flasko California
New User Activity Per Week By Region
profileid  day       state       duration  rank  weaponsused  interactedwith
1148       6/2/2013  Oregon      216       33    1            5
1004       6/2/2013  Missouri    22        40    6            2
292        6/1/2013  Georgia     201       137   1            5
1059       6/2/2013  Oregon      27        104   5            2
675        6/2/2013  California  65        164   3            2
1348       6/3/2013  Nebraska    21        95    5            2
14. Terminologies
• Linked Services
• Data Sets
• Pipeline
• Diagram View
Steps
• Create a Data Factory
• Add data sources
• Define tables and pipelines
• Deploy & start
• Monitor and manage
15. Example: Game Logs, Customer Profiling
On-premises SQL Server holds the "New User View"; Azure Blob Storage holds 1000's of log files. Azure Data Factory spans both stores.
16. Example: Game Logs, Customer Profiling
Azure Data Factory datasets are defined over the stores: "New Users" is a ViewOf the New User View in on-premises SQL Server, "Game Usage" is a ViewOf the 1000's of log files in Azure Blob Storage, and "New User Activity" is the eventual output.
17. Example: Game Logs, Customer Profiling
A first pipeline, "Copy NewUsers to Blob Storage", copies the on-premises "New Users" dataset (a ViewOf the New User View in SQL Server) into a "Cloud New Users" dataset in Azure Blob Storage, next to the "Game Usage" ViewOf the 1000's of log files.
18. Example: Game Logs, Customer Profiling
A second pipeline, "Mask & Geo-Code", running on HDInsight, takes the "Game Usage" logs and a "Geo Dictionary" and produces "Geo Coded Game Usage", alongside the "Copy NewUsers to Blob Storage" pipeline that produces "Cloud New Users".
19. Example: Game Logs, Customer Profiling
A third pipeline, "Join & Aggregate" (RunsOn HDInsight, like "Mask & Geo-Code"), combines "Geo Coded Game Usage" and "Cloud New Users" to produce the final "New User Activity" dataset.
24. Custom Actions
• Allow running any .NET code wrapped within an ADF activity
• Can be used to connect to new sources/destinations
• Can be used to create custom transformation activities
• Example: invoke an Azure ML model
• SDK for custom activity creation:
25. Data Factory – Available Today
Coordination:
• Rich scheduling
• Complex dependencies
• Incremental rerun
Authoring:
• JSON & PowerShell/C#
Management:
• Lineage
• Data production policies (late data, rerun, latency, etc.)
Hub: Azure Hub (HDInsight + Blob storage)
• Activities: Hive, Pig, C#
• Data connectors: Blobs, Tables, Azure DB, on-prem SQL Server, MDS [internal]
31. Using Azure Analytics Services
Data sources → Collect → Process → Deliver → Consume
Event inputs: Event Hub, Azure Blob
Process (Azure Stream Analytics): transform – temporal joins, filters, aggregates, projections, windows, etc.; enrich; correlate
Reference data: Azure Blob
Outputs: SQL Azure, Azure Blobs, Event Hub, Table Storage (backed by Azure Storage)
32. Sample Scenario: Toll Station
TollId  EntryTime                     LicensePlate  State  Make    Model    Type  Weight
1       2014-10-25T19:33:30.0000000Z  JNB7001       NY     Honda   CRV      1     3010
1       2014-10-25T19:33:31.0000000Z  YXZ1001       NY     Toyota  Camry    2     3020
3       2014-10-25T19:33:32.0000000Z  ABC1004       CT     Ford    Taurus   2     3800
2       2014-10-25T19:33:33.0000000Z  XYZ1003       CT     Toyota  Corolla  2     2900
1       2014-10-25T19:33:34.0000000Z  BNJ1007       NY     Honda   CRV      1     3400
2       2014-10-25T19:33:35.0000000Z  CDE1007       NJ     Toyota  4x4      1     3800
…
EntryStream – data about vehicles entering toll stations
TollId  ExitTime                      LicensePlate
1       2014-10-25T19:33:40.0000000Z  JNB7001
1       2014-10-25T19:33:41.0000000Z  YXZ1001
3       2014-10-25T19:33:42.0000000Z  ABC1004
2       2014-10-25T19:33:43.0000000Z  XYZ1003
…
ExitStream – data about vehicles leaving toll stations
LicensePlate  RegistrationId  Expired
SVT6023       285429838       1
XLZ3463       362715656       0
QMZ1273       876133137       1
RIV8632       992711956       0
…
ReferenceData – commercial vehicle registration data
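A typical query over this scenario correlates EntryStream and ExitStream (e.g., to compute how long each vehicle spent inside a toll station). As an illustration only, here is a minimal Python sketch of that matching logic; the tuple shapes and the `max_gap` bound standing in for the join window are assumptions, not ASA behavior:

```python
def join_entry_exit(entries, exits, max_gap):
    """Match each exit to the most recent entry for the same (toll, plate)
    that happened at most `max_gap` seconds earlier; return durations.

    entries/exits are (toll_id, license_plate, timestamp_seconds) tuples --
    a simplified stand-in for the EntryStream/ExitStream rows above."""
    durations = []
    for toll, plate, t_exit in exits:
        # entries for the same booth and plate that precede this exit
        candidates = [t for tid, p, t in entries
                      if tid == toll and p == plate and 0 <= t_exit - t <= max_gap]
        if candidates:
            durations.append((toll, plate, t_exit - max(candidates)))
    return durations

entries = [(1, "JNB7001", 30), (1, "YXZ1001", 31)]
exits = [(1, "JNB7001", 40), (1, "YXZ1001", 41)]
print(join_entry_exit(entries, exits, 300))
# [(1, 'JNB7001', 10), (1, 'YXZ1001', 10)]
```

In ASA itself this would be expressed as a JOIN with a DATEDIFF bound between the two streams rather than an explicit loop.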
33. Query Language - Overview
DML Statements
• SELECT
• FROM
• WHERE
• GROUP BY
• HAVING
• CASE
• JOINS
• UNION
Scaling Functions
• WITH
• PARTITION BY
Date and Time Functions
• DATENAME
• DATEPART
• DAY
• MONTH
• YEAR
• DATETIMEFROMPARTS
• DATEDIFF
• DATEADD
Windowing Extensions
• Tumbling Window
• Hopping Window
• Sliding Window
Aggregate Functions
• SUM
• COUNT
• AVG
• MIN
• MAX
String Functions
• LEN
• CONCAT
• SUBSTRING
• CHARINDEX
• PATINDEX
34. Tumbling Windows
SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 10)
Count the total number of vehicles entering each toll booth every interval of 10 seconds.
[Diagram: a 10-second tumbling window – events on the time axis fall into consecutive, non-overlapping 10-second buckets, and each bucket is counted once]
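To make the bucketing concrete, here is a small Python sketch (not ASA code) of tumbling-window counting; the timestamps and toll IDs are invented for illustration, and boundary handling is simplified:

```python
from collections import Counter

def tumbling_counts(events, size):
    """Count events per (key, window) where windows are consecutive,
    non-overlapping `size`-second buckets, each labeled by its end time.
    `events` is a list of (timestamp_seconds, key) pairs."""
    counts = Counter()
    for ts, key in events:
        # each event falls into exactly one bucket
        window_end = (int(ts) // size + 1) * size
        counts[(key, window_end)] += 1
    return counts

entries = [(1, "Toll1"), (3, "Toll1"), (7, "Toll2"), (12, "Toll1")]
print(dict(tumbling_counts(entries, 10)))
# {('Toll1', 10): 2, ('Toll2', 10): 1, ('Toll1', 20): 1}
```

The key property mirrored here is that every event is counted in exactly one window.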
35. Hopping Windows
SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 10, 5)
Count the number of vehicles entering each toll booth over every interval of 10 seconds; update results every 5 seconds.
[Diagram: a 10-second hopping window with a 5-second "hop" – overlapping 10-second buckets whose start advances every 5 seconds, so each event is counted in two windows]
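Again as a plain-Python sketch (not ASA code): with a hop smaller than the window size, each event now lands in size/hop overlapping windows. The window-end enumeration below is a simplified model of the hopping semantics:

```python
import math
from collections import Counter

def hopping_counts(events, size, hop):
    """Count events per (key, window_end) for overlapping `size`-second windows
    whose ends advance every `hop` seconds; a window covers (end - size, end]."""
    counts = Counter()
    for ts, key in events:
        end = math.ceil(ts / hop) * hop        # earliest window end covering ts
        while end - size < ts <= end:          # walk every overlapping window
            counts[(key, end)] += 1
            end += hop
    return counts

entries = [(1, "Toll1"), (6, "Toll1")]
print(dict(hopping_counts(entries, 10, 5)))
# {('Toll1', 5): 1, ('Toll1', 10): 2, ('Toll1', 15): 1}
```

Note that with size=10 and hop=5 each event is counted twice, once per overlapping window, which is exactly why hopping results update more often than tumbling ones.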
36. Sliding Windows
SELECT TollId, Count(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 10)
HAVING Count(*) > 10
Report each toll booth that has served more than 10 vehicles in the last 10 seconds.
[Diagram: a 10-second sliding window – the window advances with each incoming event, always covering the last 10 seconds]
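A sliding window re-evaluates whenever the window contents change, always looking back over the last 10 seconds. A minimal Python model of the HAVING-style filter above (illustrative only; real ASA also re-evaluates when events fall out of the window, not just when they arrive):

```python
def sliding_alerts(events, size, threshold):
    """At each event time t, count same-key events in (t - size, t] and report
    (time, key, count) whenever the count exceeds `threshold`.
    `events` is a list of (timestamp_seconds, key) pairs."""
    alerts = []
    for ts, key in sorted(events):
        count = sum(1 for t2, k2 in events if k2 == key and ts - size < t2 <= ts)
        if count > threshold:
            alerts.append((ts, key, count))
    return alerts

entries = [(1, "Toll1"), (5, "Toll1"), (12, "Toll1")]
print(sliding_alerts(entries, 10, 1))
# [(5, 'Toll1', 2), (12, 'Toll1', 2)]
```

The threshold of 1 here stands in for the `> 10` in the slide's query, just to keep the example data small.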
37. Intake millions of events per second – process data from connected devices/apps, integrated with a highly scalable publish-subscribe ingestor
Easy processing of continuous streams of data – transform, augment, correlate, temporal operations
Detect patterns and anomalies in streaming data – correlate streaming data with reference data
38. Programmatic Access with REST APIs
The full functionality of Azure Stream Analytics is available through REST APIs. This enables programmatic access, is useful for automation through scripting, and lets you embed Stream Analytics in other applications/tools.
Jobs Management: Create Job, Start Job, Stop Job, Update Job, List Jobs, Delete Job
Input and Output Management: Create, Delete, List, Update Input / Output
Transformations Management: Create, Delete, Get, Update Transformation
40. Scaling Concepts – Partitions
Events arrive from an Event Hub on partitions (PartitionId = 1, 2, 3); each partition is processed independently in parallel, producing Step Result 1, 2, and 3.
SELECT COUNT(*) AS Count, TollBoothId
FROM EntryStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), TollBoothId
41. ADF & ASA
• Preview services
• Offer the ability to deal with a new generation of problems in processing and analyzing data
• Scale, speed, economy
42. Recommended/related sessions
1. Inside Azure Storage – Options, Abstractions and Best Practices: Data, Sabha2, 11.00 AM – 11.55 AM tomorrow
2. Choosing the Right Platform for Big Data: Data, Sabha2, 3.00 PM – 3.55 PM tomorrow
3. Practical Machine Learning: Data, Sabha2, 4.15 PM – 5.10 PM today
43. References
Related references for you to expand your knowledge on the subject:
Azure Stream Analytics Documentation – http://azure.microsoft.com/en-in/documentation/services/stream-analytics/
Stream Analytics Query Language Reference – https://msdn.microsoft.com/en-us/library/azure/dn834998.aspx
Azure Portal – http://azure.microsoft.com
Azure Updates – http://azure.microsoft.com/blog/
Microsoft Virtual Academy – aka.ms/mva
Developer Network – msdn.microsoft.com/
44. Azure Support
Must-know resources to get online help for Azure:
Azure Support Options – http://azure.microsoft.com/en-us/support/options/
Azure Support Plans – http://azure.microsoft.com/en-us/support/plans/
Ask questions & get answers: post questions in the Azure forums and tag them with the keyword Azure.
45. Azure Vidyapeeth
A platform for learning – choose your topic, choose your time
• Register to attend Azure Vidyapeeth live webinars @ www.aka.ms/azure-vidyapeeth
• Collect a free $100 Azure gift pass by registering for our Azure Vidyapeeth series at the Expo zone!
• Point your mobile phone here to download the Azure Vidyapeeth mobile app: www.aka.ms/av-app
46. Tell us what you think
Help us shape future events by sharing your valuable feedback. Scan the QR code to evaluate this session.
< QR Code will be given 2 days before the Conference >
49. You write declarative queries in SQL
No code compilation – easy to author and deploy
Unified programming model – brings together event streams, reference data and machine learning extensions
Temporal semantics – all operators respect, and some use, the temporal properties of events
Built-in operators and functions – these should (mostly) look familiar if you know relational databases: filters, projections, joins, windowed (temporal) aggregates, text and date manipulation
50. Why Event Processing in the Cloud?
• Event data is already in the cloud
• Event data is globally distributed
• Reduced TCO
• Scale
• Managed service, not infrastructure
Bring the processing to the data, not the data to the processing!
51. Application Components
Components of an Azure Stream Analytics application:
Inputs – Azure Event Hubs and Azure Blob Storage for event streams; Azure Blob Storage for reference data
Query – runs continuously against the incoming stream of events
Outputs – Azure SQL DB, Azure Event Hubs, Azure Blob Storage
Events have a defined schema and are temporal (sequenced in time)
Editor's notes
Let us start with a statement that was made two years ago.
Gartner stated that data warehousing has reached a significant tipping point since its inception: the biggest, possibly the most elaborate, data management system in IT is changing.
The DW is not going anywhere; there are simply more tools for developers now, such as Hadoop and NoSQL.
Each pipeline has an activity, depicted by the blue box.
Economy – yes, these services are priced in mills/sec, etc.
Go to the pricing portal.
Stream Analytics is priced on two variables:
Volume of data processed
Streaming units required to process the data stream
Now let us dig deeper into what a typical ASA application looks like. An ASA application has three major components:
Input – Inputs are the sources of events. Note that the 'original' sources of streaming events are devices, machines, applications, sensors, etc. However, ASA is not intended to connect to them directly. Rather, ASA lets Azure Event Hubs be the primary interface to the wide variety of event sources. ASA is optimized to get streaming data from Azure Event Hubs and Azure Blob Storage. Azure Blob Storage is the likely place where log data is stored. The list of input sources that ASA directly integrates with may increase in the future, but Azure Event Hubs and Azure Blob Storage will be the primary sources.
Query – Queries are the main component of an ASA application. Queries implement the "analytics logic". Queries are a set of transformations that are applied to the input stream to produce another set of output events. Queries are the only thing that an ASA application developer actually 'develops'. Everything else is done through guided wizards in the Azure Portal. Note that ASA has a SQL-like query language, but unlike traditional databases, ASA queries run continuously against the stream of incoming events. The queries stop being applied only when the job itself stops.
Output – As queries execute they continuously produce results. The results can be stored in Blob Storage, Event Hubs or an Azure SQL database. Note that if the output is stored in an Event Hub or Blob Storage, it can become the input to another ASA job, so it is possible to 'chain' together multiple jobs to implement a series of transformations.