Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.
Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace, and the need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, leading to caching technologies, such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.
While Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward, and are enabling new state of the art external-facing services such as Azure Data Lake and more. I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.
DevEX - reference for building teams, processes, and platforms
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft
1. Big Data @ Microsoft
Raghu Ramakrishnan
CTO for Data, Technical Fellow
Microsoft
2. Data and Analytics – 3 Pillars
SQL 2016
Azure SQL DB
Azure SQL DW
SQL Server R services
On-prem and cloud
(Windows, Linux)
Cortana
Intelligence
Suite
Hadoop, Data Lake, Machine
learning, PowerBI, Data
Factory, Streaming,
Perceptual Intelligence
On-prem connectivity
Microsoft
R server
Hadoop
Teradata
On-prem and cloud
(Windows, Linux)
3. SQL Server 2016: Everything Built-In
The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any
vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research
organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Consistent experience from on-premises to cloud
Microsoft Tableau Oracle
$120
$480
$2,230
Self-service BI per user
In-memory across all workloads
TPC-H non-clustered 10TB
Oracle
is #4#2
SQL Server
#1
SQL Server
#3
SQL Server
built-inbuilt-in built-in built-in built-in
0
1
4
0 0
3
34
29
22
15
5
22
6
43
20
69
18
49
3
-80
-70
-60
-50
-40
-30
-20
-10
0
2010 2011 2012 2013 2014 2015
SQL Server Oracle MySQL2 SAP HANA
TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
at massive scale
National Institute of Standards and Technology Comprehensive Vulnerability Database update 5/4/2015
4. In-Database Advanced Analytics
No need to move the data
Open source R with in-
memory & massive
scale – multi-threading &
massive parallel processing
Data Scientist
Interact directly with data
R built-in to SQL Server
Data Developer/DBA
Manage data and
analytics together
Example Solutions
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
Extensibility
?
R
R Integration
Relational data
Analytic Library
T-SQL interface
010010
100100
010101
New R scripts
010010
100100
010101
010010
100100
010101
010010
100100
010101
• Credit risk protection
010010
100100
010101
Microsoft Azure Marketplace
Real-time
operational analytics
without moving the data
NEW
NEW
End-to-end mobile BI Advanced AnalyticsMission critical OLTP
5. High-performance open source R plus:
Enterprise Scale & Performance
– Scales from workstations to large clusters
– Scales to large data sizes
– Growing portfolio of Parallelized algorithms
Secure, Scalable R Deployment/Operationalization
Write Once Deploy Anywhere for multiple platforms
IDE for data scientists and developers
Enterprise Class Support
DistributedR
DeployR DevelopR
ScaleR
ConnectR
6. Cloud – SQL Server/SQL Azure
Shifting how you purchase and manage machines
Increased focus on Total Cost of Ownership and continuous improvements
Built from the same code base
We increased surface area compatibility with V12 Azure SQL Database
We’re learning how to run our own code – the good and the bad
We’re using that to improve both product and service
Microsoft is the only provider both on-premises and in the cloud
7. Order history
Name SSN Date
Jane Doe cm61ba906fd 2/28/2005
Jim Gray ox7ff654ae6d 3/18/2005
John Smith i2y36cg776rg 4/10/2005
Bill Brown nx290pldo90l 4/27/2005
Order history
Name SSN Date
Jane Doe cm61ba906fd 2/28/2005
Jim Gray ox7ff654ae6d 3/18/2005
John Smith i2y36cg776rg 4/10/2005
Bill Brown nx290pldo90l 4/27/2005
Customer data
Product data
Order History
Stretch to cloud
Stretch SQL Server into Azure
Stretch warm and cold tables to Azure with remote query processing
App
Query
Microsoft Azure
Jim Gray ox7ff654ae6d 3/18/2005
SQL Server 2016
8. Azure SQL DW
Fully managed relational data warehouse-as-a-service
First elastic cloud data warehouse with proven SQL Server capabilities
Support your smallest to your largest data storage needs
Scales to petabytes of data
Massively Parallel Processing
Instant-on compute scales in seconds
Query Relational / Non-Relational
Saas
Azure
Public
Cloud
Office 365Office 365
Get started in minutes
Integrated with Azure ML, PowerBI & ADF
Simple billing compute & storage
Pay for what you need, when you need
it with dynamic pause
AzureAzure
9.
10.
11. Store any data
relations
Do any analysis
SQL queries
Hive,
At any speed
Batch
Hive
At any scale … elastic!
Anywhere
Data to
Intelligent
Action
12. Web Logs,
Omniture logs
On-Premise
SQL Server
(customer and product data)
In-Store Activity
with
Kinect sensors
Social Data
Diagnostic
streaming
Event hubs
Machine
Learning
Stream Analytics
Azure DataLake
Data Factory: Move Data, Orchestrate, Schedule, and Monitor
HDInsight HDInsight Machine
Learning
Azure SQL
Data Warehouse
Power BI
INGEST PREPARE ANALYZE PUBLISH
Stream Analytics
CONSUMEDATA SOURCES
Cortana
Web/LOB
Dashboards
13.
14.
15.
16. Azure Data Analytics Stack
REEF library
STORAGE
YARN
HDFS/WebHDFS API
Compute-tier
Cache Clusters
(Local ENs + CSM)
RAM / SSD / HDD
WAS-based Remote Storage
Cosmos Store API
CLUSTER-WIDE RM (YARN++)
YARN + Federation
YARN + Rayon (Capacity reservation)
YARN +
Mercury
Shared micro-
services for all
metadata
(extent map,
logical name
space, secure
store) based on
Hekaton/RSL
rings
YARN +
Mercury
YARN +
Mercury
Application
Engines
Per-job RM
and runtimeM/R
U-SQL
Batch
Spark
Tez
Spark
Runtime
Spark HiveU-SQL Azure ML Azure SA
COMPUTE TIER
SQL-DW HDInsight
IaaS
Services
17. Windows
SMSG
Live
Ads
CRM/Dynamics
Windows Phone
Xbox Live
Office365
STB Malware Protection
Microsoft Stores
STBCommerceRisk
Messenger
LCA
Exchange
Yammer
Skype
Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
19. Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting &
Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand
Corporate
Strategy
Data sources
ETL
BI and analytic
Data warehouse
Gather
Requirements
Business
Requirements
Technical
Requirements
20. Ingest all data
regardless of requirements
Store all data
in native format without
schema definition
Do analysis
Using analytic engines
like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
21. What happened?
What is happening?
Why did it happen?
What are key
relationships?
What will happen?
What if?
How risky is it?
What should happen?
What is the best option?
How can I optimize?
Data sources
24. • Interactive and Real-Time Analytics requires i
• Massive data volumes require scale-out stores
using commodity servers, even archival storage
Tiered Storage
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across
different storage systems (and governance boundaries)?
25. • Many different analytic engines (OSS and
vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run
on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix
across tens of thousands of machines?
26. Azure Data Lake Store
Fully managed cloud data store designed for analytics
Supports HDFS compliant analytics applications and tools
Petabyte files, unlimited account size
High throughput for analytics performance
Low latency ingestion with read as you write
AAD-based authentication, access auditing
File and folder-level ACLs, Encryption at rest
27. ADLS Security: Encryption-at-Rest
Transparently encrypts data flowing
to and from public networks as well
as at rest
Transparent server-side encryption
User can manage their own
encryption keys or let Azure Data
Lake Store manage the key using
Azure Key Vault
28
28. ADLS Security: Role-Based Access Control
Each file and directory is associated
with an owner and a group
Files or directories have separate
permissions (read(r), write(w),
execute(x)) for owners, members of
the group, and for all other users
Fine-grained access control lists
(ACLs) can be specified for specific
named users or named groups
29
29. ADL Store: Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources
Server logs
Azure Event Hub
Apache
Flume
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
Azure Data Factory
Apache Sqoop
Azure SQL DB
Azure SQL DW
Azure tables
Table Storage
On-premises databases
SQL
30
ADL Store
Built-in
copy service
30. ADL Store: Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks
Azure SQL DB
SQL
Azure SQL DW
Azure
Tables
Table Storage
On-premises databases
Azure Data Factory
Apache Sqoop
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
31
Built-in
copy service
ADL Store
31. Extent
Metadata
Data Data Data…
Remote Storage
Naming
Service
Secret Store
1) Filename Translation
3) Find Extents
4) Data
access
Remote storage tier
builds securely on
WAS
Secure
Works with
YARN!
COMPUTE
TIER
Secure Store Service
Intelligent ingest
Massively parallel
2) Azure Access Keys
32. • Interactive and Real-Time Analytics requires i
• Massive data volumes require scale-out stores
using commodity servers, even archival storage
Tiered Storage
Scale storage independently of compute
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
Data Lifecycle Management
How can we manage this trade-off without moving data across
different storage systems (and governance boundaries)?
33. Extent
Metadata
Data Data Data…
Remote Storage
Naming
Service
Secret Store
1) Filename Translation
3) Find Extents
4) Data
access
Remote storage tier
builds securely on
WAS
Secure
Works with
YARN!
COMPUTE
TIER
Data Data Data
…
Secure Store Service
Local Storage
Intelligent ingest
Massively parallel
2) Azure Access Keys
34. Azure HDInsight—Linux and Windows
Managed, Monitored, Supported
• Cluster customization – Install your favorite project
• Harness existing .Net & Java skills to write
customer extensions
• Supports broad ecosystem of ISVs
(Hadoop and Traditional)
Full Apache Hadoop
• Batch – MapReduce, PIG, Hive, Spark
• Stream Processing and Analytics – Storm,
SparkStreaming
• Interactive SQL – Hive (Tez), and SparkSQL
• Table Serving – Hbase
• Machine Learning – SparkML, Mahout
38. Azure
Data Lake
Analytics Service
A new distributed
analytics service
Built on Apache YARN
Scales dynamically with a dial
Pay by the query
Supports Azure AD for access control,
roles, and integration with on-prem
identity systems
U-SQL language unifies the benefits of
SQL with the power of C#
Hive etc. will be added over time
Processes data across Azure
41
39. Get started
Log in to Azure Create an ADLA
account
Write and
submit an ADLA
job with U-SQL
(or Hive/Pig)
The job reads
and writes data
from storage
1 2 3 4
30 seconds
ADLS
Azure Blobs
Azure DB
…
40. ADLA Complements HDInsight
HDInsight
Dedicated managed clusters for
developers familiar with the Open
Source: Java, Eclipse, Hive, etc.
Clusters offer customization, control,
and flexibility in a managed Hadoop
cluster
ADLA
Enables customers to leverage
existing experience with C#, SQL &
PowerShell
Offers convenience, efficiency, and
automatic scale in a “job service”
form factor over a system-managed
shared resource pool
41. U-SQL A hyper-scalable, highly extensible
language for preparing, transforming
and analyzing all data
Allows users to focus on the what—
not the how—of business problems
Built on familiar languages (SQL and
C#) and supported by a fully integrated
development environment
Built for data developers & scientists
44
42. U-SQL Language Philosophy
Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins,
SQL Analytics functions
• Optimizable, scalable
Operates on unstructured & structured data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language is C#
21
User-defined functions (U-SQL and C#)
User-defined types (U-SQL/C#) (future)
User-defined aggregators (C#)
User-defined operators (UDO) (C#)
U-SQL provides the parallelization and scale-out framework for
usercode
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt“
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt“
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt
, SUM(c.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
43. Federated Queries: Query Data Where It Lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
Avoid moving large amounts of data across the
network between stores
Single view of data irrespective of physical location
Minimize data proliferation issues caused by
maintaining multiple copies
Single query language for all data
Each data store maintains its own sovereignty
Design choices based on the need
U-SQL
Query
Result
Query
46
Azure
Storage Blobs
Azure SQL
in VMs
Azure
SQL DB
Azure Data
Lake Analytics
44. Join Local (ADLS) and External Data
1. Create two tables.
• An external table ‘PurchaseOrders’ that refers to the
PurchaseOrders table in the external SQL Azure DB.
• A ‘local’ table ‘UserIdsTable’ created by ‘extracting’ User
Ids and region fields from the WebLogRecords.txt file
stored in Azure Data Lake.
2. Join the PurchaseOrders table with UserIds table on the
common UserId column.
Purchase orders table
Azure SQL DB
External
purchase orders
table
Local
user IDs
table
JOIN
(on User IDs)
Azure Data Lake
Analytics
Find sum of all purchases by users in the ‘en-us’ region
Query 9
47
WebLogRecords.txt
45. Concepts: Jobs, Stages and Vertexes
Each job is broken into a number
of vertexes
Each vertex is some work that
needs to be done
Input
Output
Output
6 Stages
8 Vertexes
Vertexes are organized into stages
– Vertexes in each stage do the same
work on the same data
– Vertex in one stage may depend on a
vertex in a earlier stage
Stages themselves are organized into
an acyclic graph
49
46. • Many different analytic engines (OSS and
vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run
on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix
across tens of thousands of machines?
47. Resource Managers for Big Data
Allocate compute containers to competing jobs
Multiple job engines shared pool
Containers
YARN: Resource manager for Hadoop2.x
Corona, Mesos, Omega
48. Shared Data and Compute
Tiered Storage
Relational
Query Engine
Machine
Learning
Compute Fabric (Resource Management)
Multiple analytic
engines sharing same
resource pool
Compute and
store/cache on
same machines
50. YARN Gaps
resource allocation SLOs
scalability limitations
• High allocation latency
• Support for specialized execution frameworks
• Interactive environments, long-running services
51. • Amoeba Rayon
• Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
• Status: Now in Apache Hadoop trunk!
• Federation
• Status: prototype and JIRA
• Framework-level Pooling
• Enable frameworks that want to take over resource allocation to support millisecond-
level response and adaptation times
• Status: spec
Microsoft Contributions to OSS Apache YARN