SlideShare a Scribd company logo
1 of 53
Download to read offline
Big Data @ Microsoft
Raghu Ramakrishnan
CTO for Data, Technical Fellow
Microsoft
Data and Analytics – 3 Pillars
SQL 2016
Azure SQL DB
Azure SQL DW
SQL Server R services
On-prem and cloud
(Windows, Linux)
Cortana
Intelligence
Suite
Hadoop, Data Lake, Machine
learning, PowerBI, Data
Factory, Streaming,
Perceptual Intelligence
On-prem connectivity
Microsoft
R server
Hadoop
Teradata
On-prem and cloud
(Windows, Linux)
SQL Server 2016: Everything Built-In
The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any
vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research
organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Consistent experience from on-premises to cloud
Microsoft Tableau Oracle
$120
$480
$2,230
Self-service BI per user
In-memory across all workloads
TPC-H non-clustered 10TB
Oracle
is #4#2
SQL Server
#1
SQL Server
#3
SQL Server
built-inbuilt-in built-in built-in built-in
0
1
4
0 0
3
34
29
22
15
5
22
6
43
20
69
18
49
3
-80
-70
-60
-50
-40
-30
-20
-10
0
2010 2011 2012 2013 2014 2015
SQL Server Oracle MySQL2 SAP HANA
TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
at massive scale
National Institute of Standards and Technology Comprehensive Vulnerability Database update 5/4/2015
In-Database Advanced Analytics
No need to move the data
Open source R with in-
memory & massive
scale – multi-threading &
massive parallel processing
Data Scientist
Interact directly with data
R built-in to SQL Server
Data Developer/DBA
Manage data and
analytics together
Example Solutions
• Sales forecasting
• Warehouse efficiency
• Predictive maintenance
Extensibility
?
R
R Integration
Relational data
Analytic Library
T-SQL interface
010010
100100
010101
New R scripts
010010
100100
010101
010010
100100
010101
010010
100100
010101
• Credit risk protection
010010
100100
010101
Microsoft Azure Marketplace
Real-time
operational analytics
without moving the data
NEW
NEW
End-to-end mobile BI Advanced AnalyticsMission critical OLTP
High-performance open source R plus:
 Enterprise Scale & Performance
– Scales from workstations to large clusters
– Scales to large data sizes
– Growing portfolio of Parallelized algorithms
 Secure, Scalable R Deployment/Operationalization
 Write Once Deploy Anywhere for multiple platforms
 IDE for data scientists and developers
 Enterprise Class Support
DistributedR
DeployR DevelopR
ScaleR
ConnectR
Cloud – SQL Server/SQL Azure
Shifting how you purchase and manage machines
Increased focus on Total Cost of Ownership and continuous improvements
Built from the same code base
We increased surface area compatibility with V12 Azure SQL Database
We’re learning how to run our own code – the good and the bad
We’re using that to improve both product and service
Microsoft is the only provider both on-premises and in the cloud
Order history
Name SSN Date
Jane Doe cm61ba906fd 2/28/2005
Jim Gray ox7ff654ae6d 3/18/2005
John Smith i2y36cg776rg 4/10/2005
Bill Brown nx290pldo90l 4/27/2005
Order history
Name SSN Date
Jane Doe cm61ba906fd 2/28/2005
Jim Gray ox7ff654ae6d 3/18/2005
John Smith i2y36cg776rg 4/10/2005
Bill Brown nx290pldo90l 4/27/2005
Customer data
Product data
Order History
Stretch to cloud
Stretch SQL Server into Azure
Stretch warm and cold tables to Azure with remote query processing
App
Query
Microsoft Azure

Jim Gray ox7ff654ae6d 3/18/2005
SQL Server 2016
Azure SQL DW
Fully managed relational data warehouse-as-a-service
First elastic cloud data warehouse with proven SQL Server capabilities
Support your smallest to your largest data storage needs
Scales to petabytes of data
Massively Parallel Processing
Instant-on compute scales in seconds
Query Relational / Non-Relational
Saas
Azure
Public
Cloud
Office 365Office 365
Get started in minutes
Integrated with Azure ML, PowerBI & ADF
Simple billing compute & storage
Pay for what you need, when you need
it with dynamic pause
AzureAzure
Store any data
relations
Do any analysis
SQL queries
Hive,
At any speed
Batch
Hive
At any scale … elastic!
Anywhere
Data to
Intelligent
Action
Web Logs,
Omniture logs
On-Premise
SQL Server
(customer and product data)
In-Store Activity
with
Kinect sensors
Social Data
Diagnostic
streaming
Event hubs
Machine
Learning
Stream Analytics
Azure DataLake
Data Factory: Move Data, Orchestrate, Schedule, and Monitor
HDInsight HDInsight Machine
Learning
Azure SQL
Data Warehouse
Power BI
INGEST PREPARE ANALYZE PUBLISH
Stream Analytics
CONSUMEDATA SOURCES
Cortana
Web/LOB
Dashboards
Azure Data Analytics Stack
REEF library
STORAGE
YARN
HDFS/WebHDFS API
Compute-tier
Cache Clusters
(Local ENs + CSM)
RAM / SSD / HDD
WAS-based Remote Storage
Cosmos Store API
CLUSTER-WIDE RM (YARN++)
YARN + Federation
YARN + Rayon (Capacity reservation)
YARN +
Mercury
Shared micro-
services for all
metadata
(extent map,
logical name
space, secure
store) based on
Hekaton/RSL
rings
YARN +
Mercury
YARN +
Mercury
Application
Engines
Per-job RM
and runtimeM/R
U-SQL
Batch
Spark
Tez
Spark
Runtime
Spark HiveU-SQL Azure ML Azure SA
COMPUTE TIER
SQL-DW HDInsight
IaaS
Services
Windows
SMSG
Live
Ads
CRM/Dynamics
Windows Phone
Xbox Live
Office365
STB Malware Protection
Microsoft Stores
STBCommerceRisk
Messenger
LCA
Exchange
Yammer
Skype
Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
Observation
Pattern
Theory
Hypothesis
What will
happen?
How can we
make it happen?
Predictive
Analytics
Prescriptive
Analytics
What
happened?
Why did
it happen?
Descriptive
Analytics
Diagnostic
Analytics
Confirmation
Theory
Hypothesis
Observation
Implement Data Warehouse
Physical Design
ETL
Development
Reporting &
Analytics
Development
Install and Tune
Reporting &
Analytics Design
Dimension Modelling
ETL Design
Setup Infrastructure
Understand
Corporate
Strategy
Data sources
ETL
BI and analytic
Data warehouse
Gather
Requirements
Business
Requirements
Technical
Requirements
Ingest all data
regardless of requirements
Store all data
in native format without
schema definition
Do analysis
Using analytic engines
like Hadoop
Interactive queries
Batch queries
Machine Learning
Data warehouse
Real-time analytics
Devices
What happened?
What is happening?
Why did it happen?
What are key
relationships?
What will happen?
What if?
How risky is it?
What should happen?
What is the best option?
How can I optimize?
Data sources
Handling failures
Sharing data, resources
Parallelism
Data-aware Optimization
Security, Compliance, Governance
Enterprise
Forrester Wave
Big Data Hadoop
Cloud Solutions
Q2 2016
• Interactive and Real-Time Analytics requires i
• Massive data volumes require scale-out stores
using commodity servers, even archival storage
Tiered Storage
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across
different storage systems (and governance boundaries)?
• Many different analytic engines (OSS and
vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run
on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix
across tens of thousands of machines?
Azure Data Lake Store
Fully managed cloud data store designed for analytics
Supports HDFS compliant analytics applications and tools
Petabyte files, unlimited account size
High throughput for analytics performance
Low latency ingestion with read as you write
AAD-based authentication, access auditing
File and folder-level ACLs, Encryption at rest
ADLS Security: Encryption-at-Rest
Transparently encrypts data flowing
to and from public networks as well
as at rest
Transparent server-side encryption
User can manage their own
encryption keys or let Azure Data
Lake Store manage the key using
Azure Key Vault
28
ADLS Security: Role-Based Access Control
Each file and directory is associated
with an owner and a group
Files or directories have separate
permissions (read(r), write(w),
execute(x)) for owners, members of
the group, and for all other users
Fine-grained access control lists
(ACLs) can be specified for specific
named users or named groups
29
ADL Store: Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources
Server logs
Azure Event Hub
Apache
Flume
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
Azure Data Factory
Apache Sqoop
Azure SQL DB
Azure SQL DW
Azure tables
Table Storage
On-premises databases
SQL
30
ADL Store
Built-in
copy service
ADL Store: Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks
Azure SQL DB
SQL
Azure SQL DW
Azure
Tables
Table Storage
On-premises databases
Azure Data Factory
Apache Sqoop
Azure Storage Blobs
Custom programs
.NET SDK
JavaScript CLI
Azure Portal
Azure PowerShell
31
Built-in
copy service
ADL Store
Extent
Metadata
Data Data Data…
Remote Storage
Naming
Service
Secret Store
1) Filename Translation
3) Find Extents
4) Data
access
Remote storage tier
builds securely on
WAS
Secure
Works with
YARN!
COMPUTE
TIER
Secure Store Service
Intelligent ingest
Massively parallel
2) Azure Access Keys
• Interactive and Real-Time Analytics requires i
• Massive data volumes require scale-out stores
using commodity servers, even archival storage
Tiered Storage
Scale storage independently of compute
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
Data Lifecycle Management
How can we manage this trade-off without moving data across
different storage systems (and governance boundaries)?
Extent
Metadata
Data Data Data…
Remote Storage
Naming
Service
Secret Store
1) Filename Translation
3) Find Extents
4) Data
access
Remote storage tier
builds securely on
WAS
Secure
Works with
YARN!
COMPUTE
TIER
Data Data Data
…
Secure Store Service
Local Storage
Intelligent ingest
Massively parallel
2) Azure Access Keys
Azure HDInsight—Linux and Windows
Managed, Monitored, Supported
• Cluster customization – Install your favorite project
• Harness existing .Net & Java skills to write
customer extensions
• Supports broad ecosystem of ISVs
(Hadoop and Traditional)
Full Apache Hadoop
• Batch – MapReduce, PIG, Hive, Spark
• Stream Processing and Analytics – Storm,
SparkStreaming
• Interactive SQL – Hive (Tez), and SparkSQL
• Table Serving – Hbase
• Machine Learning – SparkML, Mahout
Batch
MapReduce, PIG, Hive, Spark
Interactive SQL
Hive (Tez), SparkSQL
Stream Analytics
Storm, SparkStreaming
Machine Learning
SparkML, Mahout
Table Serving
Hbase
Exploratory Visualization
Jupyter, Zeppelin
Interactive SQL
SQL DW
Stream Analytics
Azure Stream Analytics
Machine Learning
Azure ML
Table Serving
Azure SQL DB
Exploratory Visualization
Power BI
• > 14 million field hours
http://www.ebird.org
Tree Swallow
Azure
Data Lake
Analytics Service
A new distributed
analytics service
Built on Apache YARN
Scales dynamically with a dial
Pay by the query
Supports Azure AD for access control,
roles, and integration with on-prem
identity systems
U-SQL language unifies the benefits of
SQL with the power of C#
Hive etc. will be added over time
Processes data across Azure
41
Get started
Log in to Azure Create an ADLA
account
Write and
submit an ADLA
job with U-SQL
(or Hive/Pig)
The job reads
and writes data
from storage
1 2 3 4
30 seconds
ADLS
Azure Blobs
Azure DB
…
ADLA Complements HDInsight
HDInsight
Dedicated managed clusters for
developers familiar with the Open
Source: Java, Eclipse, Hive, etc.
Clusters offer customization, control,
and flexibility in a managed Hadoop
cluster
ADLA
Enables customers to leverage
existing experience with C#, SQL &
PowerShell
Offers convenience, efficiency, and
automatic scale in a “job service”
form factor over a system-managed
shared resource pool
U-SQL A hyper-scalable, highly extensible
language for preparing, transforming
and analyzing all data
Allows users to focus on the what—
not the how—of business problems
Built on familiar languages (SQL and
C#) and supported by a fully integrated
development environment
Built for data developers & scientists
44
U-SQL Language Philosophy
Declarative query and transformation language:
• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins,
SQL Analytics functions
• Optimizable, scalable
Operates on unstructured & structured data
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from ground up:
• Type system is based on C#
• Expression language is C#
21
User-defined functions (U-SQL and C#)
User-defined types (U-SQL/C#) (future)
User-defined aggregators (C#)
User-defined operators (UDO) (C#)
U-SQL provides the parallelization and scale-out framework for
usercode
• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS
Expression-flow programming style:
• Easy to use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
, last_order DateTime, order_count int
, order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
FROM "/input/orders.txt“
USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
FROM "/input/customers.txt“
USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
, MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt
, SUM(c.amount) AS totalamount
FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
WHERE c.city.StartsWith("New")
&& MyNamespace.MyFunction(o.odate) > 10
GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
Federated Queries: Query Data Where It Lives
Easily query data in multiple Azure data stores without moving it to a single store
Benefits
Avoid moving large amounts of data across the
network between stores
Single view of data irrespective of physical location
Minimize data proliferation issues caused by
maintaining multiple copies
Single query language for all data
Each data store maintains its own sovereignty
Design choices based on the need
U-SQL
Query
Result
Query
46
Azure
Storage Blobs
Azure SQL
in VMs
Azure
SQL DB
Azure Data
Lake Analytics
Join Local (ADLS) and External Data
1. Create two tables.
• An external table ‘PurchaseOrders’ that refers to the
PurchaseOrders table in the external SQL Azure DB.
• A ‘local’ table ‘UserIdsTable’ created by ‘extracting’ User
Ids and region fields from the WebLogRecords.txt file
stored in Azure Data Lake.
2. Join the PurchaseOrders table with UserIds table on the
common UserId column.
Purchase orders table
Azure SQL DB
External
purchase orders
table
Local
user IDs
table
JOIN
(on User IDs)
Azure Data Lake
Analytics
Find sum of all purchases by users in the ‘en-us’ region
Query 9
47
WebLogRecords.txt
Concepts: Jobs, Stages and Vertexes
Each job is broken into a number
of vertexes
Each vertex is some work that
needs to be done
Input
Output
Output
6 Stages
8 Vertexes
Vertexes are organized into stages
– Vertexes in each stage do the same
work on the same data
– Vertex in one stage may depend on a
vertex in a earlier stage
Stages themselves are organized into
an acyclic graph
49
• Many different analytic engines (OSS and
vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run
on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix
across tens of thousands of machines?
Resource Managers for Big Data
Allocate compute containers to competing jobs
Multiple job engines shared pool
Containers
YARN: Resource manager for Hadoop2.x
Corona, Mesos, Omega
Shared Data and Compute
Tiered Storage
Relational
Query Engine
Machine
Learning
Compute Fabric (Resource Management)
Multiple analytic
engines sharing same
resource pool
Compute and
store/cache on
same machines
What’s Behind a U-SQL Query
. . .
. . .
…
…
…
YARN Gaps
resource allocation SLOs
scalability limitations
• High allocation latency
• Support for specialized execution frameworks
• Interactive environments, long-running services
• Amoeba Rayon
• Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
• Status: Now in Apache Hadoop trunk!
• Federation
• Status: prototype and JIRA
• Framework-level Pooling
• Enable frameworks that want to take over resource allocation to support millisecond-
level response and adaptation times
• Status: spec
Microsoft Contributions to OSS Apache YARN
REEF
http://ww.reef-project.org http://reef.incubator.apache.org
http://aka.ms/adltechblog/
http://ww.reef-project.org and
http://reef.incubator.apache.org

More Related Content

What's hot

Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsJames Serra
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategyJames Serra
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)James Serra
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...James Serra
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksAlberto Diaz Martin
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
What is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperWhat is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperVasu S
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage CCG
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics SolutionCortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics SolutionMSAdvAnalytics
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS CloudIdan Tohami
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBaseJames Serra
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's includedJames Serra
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Hortonworks
 

What's hot (20)

Power BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data SolutionsPower BI for Big Data and the New Look of Big Data Solutions
Power BI for Big Data and the New Look of Big Data Solutions
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 
Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)Introduction to Microsoft’s Hadoop solution (HDInsight)
Introduction to Microsoft’s Hadoop solution (HDInsight)
 
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag...
 
Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
What is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperWhat is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | Whitepaper
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Azure Big data
Azure Big data Azure Big data
Azure Big data
 
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics SolutionCortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
 
Introduction to PolyBase
Introduction to PolyBaseIntroduction to PolyBase
Introduction to PolyBase
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat...
 

Viewers also liked

The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...
The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...
The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...The Hive
 
The Hive Think Tank: Machine Learning at Pinterest by Jure Leskovec
The Hive Think Tank: Machine Learning at Pinterest by Jure LeskovecThe Hive Think Tank: Machine Learning at Pinterest by Jure Leskovec
The Hive Think Tank: Machine Learning at Pinterest by Jure LeskovecThe Hive
 
The Hive Think Tank: Heron at Twitter
The Hive Think Tank: Heron at TwitterThe Hive Think Tank: Heron at Twitter
The Hive Think Tank: Heron at TwitterThe Hive
 
The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...
The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...
The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...The Hive
 
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...The Hive
 
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra MalikDeep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra MalikThe Hive
 
The Hive Think Tank: The Future Of Customer Support - AI Driven Automation
The Hive Think Tank: The Future Of Customer Support - AI Driven AutomationThe Hive Think Tank: The Future Of Customer Support - AI Driven Automation
The Hive Think Tank: The Future Of Customer Support - AI Driven AutomationThe Hive
 
The Hive Think Tank: AI in The Enterprise by Venkat Srinivasan
The Hive Think Tank: AI in The Enterprise by Venkat SrinivasanThe Hive Think Tank: AI in The Enterprise by Venkat Srinivasan
The Hive Think Tank: AI in The Enterprise by Venkat SrinivasanThe Hive
 
SQL Server 2016 SSRS and BI
SQL Server 2016 SSRS and BISQL Server 2016 SSRS and BI
SQL Server 2016 SSRS and BIMSDEVMTL
 
The Hive Think Tank: Unpacking AI for Healthcare
The Hive Think Tank: Unpacking AI for Healthcare The Hive Think Tank: Unpacking AI for Healthcare
The Hive Think Tank: Unpacking AI for Healthcare The Hive
 
Ssis 2016 RC3
Ssis 2016 RC3Ssis 2016 RC3
Ssis 2016 RC3MSDEVMTL
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Michael Rys
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Nathan Bijnens
 
SQL Server 2016 New Features and Enhancements
SQL Server 2016 New Features and EnhancementsSQL Server 2016 New Features and Enhancements
SQL Server 2016 New Features and EnhancementsJohn Martin
 
Expt panel hive_data_rp_20130320_final-1
Expt panel hive_data_rp_20130320_final-1Expt panel hive_data_rp_20130320_final-1
Expt panel hive_data_rp_20130320_final-1The Hive
 
Leanplum_Controlled Experimentation_Panel_The Hive
Leanplum_Controlled Experimentation_Panel_The HiveLeanplum_Controlled Experimentation_Panel_The Hive
Leanplum_Controlled Experimentation_Panel_The HiveThe Hive
 

Viewers also liked (20)

The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...
The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...
The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U...
 
The Hive Think Tank: Machine Learning at Pinterest by Jure Leskovec
The Hive Think Tank: Machine Learning at Pinterest by Jure LeskovecThe Hive Think Tank: Machine Learning at Pinterest by Jure Leskovec
The Hive Think Tank: Machine Learning at Pinterest by Jure Leskovec
 
The Hive Think Tank: Heron at Twitter
The Hive Think Tank: Heron at TwitterThe Hive Think Tank: Heron at Twitter
The Hive Think Tank: Heron at Twitter
 
The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...
The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...
The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ...
 
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ...
 
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra MalikDeep Visual Understanding from Deep Learning by Prof. Jitendra Malik
Deep Visual Understanding from Deep Learning by Prof. Jitendra Malik
 
The Hive Think Tank: The Future Of Customer Support - AI Driven Automation
The Hive Think Tank: The Future Of Customer Support - AI Driven AutomationThe Hive Think Tank: The Future Of Customer Support - AI Driven Automation
The Hive Think Tank: The Future Of Customer Support - AI Driven Automation
 
The Hive Think Tank: AI in The Enterprise by Venkat Srinivasan
The Hive Think Tank: AI in The Enterprise by Venkat SrinivasanThe Hive Think Tank: AI in The Enterprise by Venkat Srinivasan
The Hive Think Tank: AI in The Enterprise by Venkat Srinivasan
 
SQL Server 2016 SSRS and BI
SQL Server 2016 SSRS and BISQL Server 2016 SSRS and BI
SQL Server 2016 SSRS and BI
 
The Hive Think Tank: Unpacking AI for Healthcare
The Hive Think Tank: Unpacking AI for Healthcare The Hive Think Tank: Unpacking AI for Healthcare
The Hive Think Tank: Unpacking AI for Healthcare
 
Ssis 2016 RC3
Ssis 2016 RC3Ssis 2016 RC3
Ssis 2016 RC3
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)Introducing U-SQL (SQLPASS 2016)
Introducing U-SQL (SQLPASS 2016)
 
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013
 
Clickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache SparkClickstream & Social Media Analysis using Apache Spark
Clickstream & Social Media Analysis using Apache Spark
 
Apache Hive - Introduction
Apache Hive - IntroductionApache Hive - Introduction
Apache Hive - Introduction
 
SQL Server 2016 New Features and Enhancements
SQL Server 2016 New Features and EnhancementsSQL Server 2016 New Features and Enhancements
SQL Server 2016 New Features and Enhancements
 
Expt panel hive_data_rp_20130320_final-1
Expt panel hive_data_rp_20130320_final-1Expt panel hive_data_rp_20130320_final-1
Expt panel hive_data_rp_20130320_final-1
 
Bizitzaren historia
Bizitzaren  historiaBizitzaren  historia
Bizitzaren historia
 
Leanplum_Controlled Experimentation_Panel_The Hive
Leanplum_Controlled Experimentation_Panel_The HiveLeanplum_Controlled Experimentation_Panel_The Hive
Leanplum_Controlled Experimentation_Panel_The Hive
 

Similar to The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft

Red Hat Summit 2017 - Intro to SQL Server on RHEL and Open Shift
Red Hat Summit 2017 - Intro to SQL Server on RHEL and Open ShiftRed Hat Summit 2017 - Intro to SQL Server on RHEL and Open Shift
Red Hat Summit 2017 - Intro to SQL Server on RHEL and Open ShiftTravis Wright
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionJames Serra
 
Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationMatthew W. Bowers
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
Azure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandyAzure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandyNilesh Shah
 
Build Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsightBuild Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsightDataWorks Summit
 
How does Microsoft solve Big Data?
How does Microsoft solve Big Data?How does Microsoft solve Big Data?
How does Microsoft solve Big Data?James Serra
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventTrivadis
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Cepta The Future of Data with Power BI
Cepta The Future of Data with Power BICepta The Future of Data with Power BI
Cepta The Future of Data with Power BIKellyn Pot'Vin-Gorman
 
Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17Nathan Bijnens
 
1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for releaseJen Stirrup
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Michael Rys
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Trivadis
 
Data Estate Modernization
Data Estate ModernizationData Estate Modernization
Data Estate ModernizationKarina Matos
 
Big Data Expo 2015 - Microsoft Transform you data into intelligent action
Big Data Expo 2015 - Microsoft Transform you data into intelligent actionBig Data Expo 2015 - Microsoft Transform you data into intelligent action
Big Data Expo 2015 - Microsoft Transform you data into intelligent actionBigDataExpo
 
SQL Server Ground to Cloud.pptx
SQL Server Ground to          Cloud.pptxSQL Server Ground to          Cloud.pptx
SQL Server Ground to Cloud.pptxsaidbilgen
 

Similar to The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft (20)

Red Hat Summit 2017 - Intro to SQL Server on RHEL and Open Shift
Red Hat Summit 2017 - Intro to SQL Server on RHEL and Open ShiftRed Hat Summit 2017 - Intro to SQL Server on RHEL and Open Shift
Red Hat Summit 2017 - Intro to SQL Server on RHEL and Open Shift
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar Presentation
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Azure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandyAzure databricks c sharp corner toronto feb 2019 heather grandy
Azure databricks c sharp corner toronto feb 2019 heather grandy
 
Build Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsightBuild Big Data Enterprise solutions faster on Azure HDInsight
Build Big Data Enterprise solutions faster on Azure HDInsight
 
How does Microsoft solve Big Data?
How does Microsoft solve Big Data?How does Microsoft solve Big Data?
How does Microsoft solve Big Data?
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Cepta The Future of Data with Power BI
Cepta The Future of Data with Power BICepta The Future of Data with Power BI
Cepta The Future of Data with Power BI
 
Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17Cloudera, Azure and Big Data at Cloudera Meetup '17
Cloudera, Azure and Big Data at Cloudera Meetup '17
 
1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release1 Introduction to Microsoft data platform analytics for release
1 Introduction to Microsoft data platform analytics for release
 
Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)Azure Data Lake Intro (SQLBits 2016)
Azure Data Lake Intro (SQLBits 2016)
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
 
Data Estate Modernization
Data Estate ModernizationData Estate Modernization
Data Estate Modernization
 
Big Data Expo 2015 - Microsoft Transform you data into intelligent action
Big Data Expo 2015 - Microsoft Transform you data into intelligent actionBig Data Expo 2015 - Microsoft Transform you data into intelligent action
Big Data Expo 2015 - Microsoft Transform you data into intelligent action
 
SQL Server Ground to Cloud.pptx
SQL Server Ground to          Cloud.pptxSQL Server Ground to          Cloud.pptx
SQL Server Ground to Cloud.pptx
 

More from The Hive

"Responsible AI", by Charlie Muirhead
"Responsible AI", by Charlie Muirhead"Responsible AI", by Charlie Muirhead
"Responsible AI", by Charlie MuirheadThe Hive
 
Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...
Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...
Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...The Hive
 
Digital Transformation; Digital Twins for Delivering Business Value in IIoT
Digital Transformation; Digital Twins for Delivering Business Value in IIoTDigital Transformation; Digital Twins for Delivering Business Value in IIoT
Digital Transformation; Digital Twins for Delivering Business Value in IIoTThe Hive
 
Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18
Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18
Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18The Hive
 
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive
 
Data Science in the Enterprise
Data Science in the EnterpriseData Science in the Enterprise
Data Science in the EnterpriseThe Hive
 
AI in Software for Augmenting Intelligence Across the Enterprise
AI in Software for Augmenting Intelligence Across the EnterpriseAI in Software for Augmenting Intelligence Across the Enterprise
AI in Software for Augmenting Intelligence Across the EnterpriseThe Hive
 
“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...
“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...
“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...The Hive
 
"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell Automation
"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell Automation"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell Automation
"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell AutomationThe Hive
 
Social Impact & Ethics of AI by Steve Omohundro
Social Impact & Ethics of AI by Steve OmohundroSocial Impact & Ethics of AI by Steve Omohundro
Social Impact & Ethics of AI by Steve OmohundroThe Hive
 
The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...
The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...
The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...The Hive
 
The Hive Think Tank: The Content Trap - Strategist's Guide to Digital Change
The Hive Think Tank: The Content Trap - Strategist's Guide to Digital ChangeThe Hive Think Tank: The Content Trap - Strategist's Guide to Digital Change
The Hive Think Tank: The Content Trap - Strategist's Guide to Digital ChangeThe Hive
 
The Hive Think Tank: Sidechains by Adam Back, President of Blockstream
The Hive Think Tank: Sidechains by Adam Back, President of BlockstreamThe Hive Think Tank: Sidechains by Adam Back, President of Blockstream
The Hive Think Tank: Sidechains by Adam Back, President of BlockstreamThe Hive
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank:  Rocking the Database World with RocksDBThe Hive Think Tank:  Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive
 
The Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQL
The Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQLThe Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQL
The Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQLThe Hive
 
The Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of Twitter
The Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of TwitterThe Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of Twitter
The Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of TwitterThe Hive
 
The Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapR
The Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapRThe Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapR
The Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapRThe Hive
 

More from The Hive (20)

"Responsible AI", by Charlie Muirhead
"Responsible AI", by Charlie Muirhead"Responsible AI", by Charlie Muirhead
"Responsible AI", by Charlie Muirhead
 
Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...
Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...
Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...
 
Digital Transformation; Digital Twins for Delivering Business Value in IIoT
Digital Transformation; Digital Twins for Delivering Business Value in IIoTDigital Transformation; Digital Twins for Delivering Business Value in IIoT
Digital Transformation; Digital Twins for Delivering Business Value in IIoT
 
Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18
Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18
Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18
 
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...
 
Data Science in the Enterprise
Data Science in the EnterpriseData Science in the Enterprise
Data Science in the Enterprise
 
AI in Software for Augmenting Intelligence Across the Enterprise
AI in Software for Augmenting Intelligence Across the EnterpriseAI in Software for Augmenting Intelligence Across the Enterprise
AI in Software for Augmenting Intelligence Across the Enterprise
 
“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...
“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...
“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...
 
"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell Automation
"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell Automation"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell Automation
"The Future of Manufacturing" by Sujeet Chand, SVP&CTO, Rockwell Automation
 
Social Impact & Ethics of AI by Steve Omohundro
Social Impact & Ethics of AI by Steve OmohundroSocial Impact & Ethics of AI by Steve Omohundro
Social Impact & Ethics of AI by Steve Omohundro
 
The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...
The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...
The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...
 
The Hive Think Tank: The Content Trap - Strategist's Guide to Digital Change
The Hive Think Tank: The Content Trap - Strategist's Guide to Digital ChangeThe Hive Think Tank: The Content Trap - Strategist's Guide to Digital Change
The Hive Think Tank: The Content Trap - Strategist's Guide to Digital Change
 
The Hive Think Tank: Sidechains by Adam Back, President of Blockstream
The Hive Think Tank: Sidechains by Adam Back, President of BlockstreamThe Hive Think Tank: Sidechains by Adam Back, President of Blockstream
The Hive Think Tank: Sidechains by Adam Back, President of Blockstream
 
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDB
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank:  Rocking the Database World with RocksDBThe Hive Think Tank:  Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDB
 
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDB
 
The Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQL
The Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQLThe Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQL
The Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQL
 
The Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of Twitter
The Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of TwitterThe Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of Twitter
The Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of Twitter
 
The Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapR
The Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapRThe Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapR
The Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapR
 

Recently uploaded

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft

  • 1. Big Data @ Microsoft Raghu Ramakrishnan CTO for Data, Technical Fellow Microsoft
  • 2. Data and Analytics – 3 Pillars SQL 2016 Azure SQL DB Azure SQL DW SQL Server R services On-prem and cloud (Windows, Linux) Cortana Intelligence Suite Hadoop, Data Lake, Machine learning, PowerBI, Data Factory, Streaming, Perceptual Intelligence On-prem connectivity Microsoft R server Hadoop Teradata On-prem and cloud (Windows, Linux)
  • 3. SQL Server 2016: Everything Built-In The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. Consistent experience from on-premises to cloud Microsoft Tableau Oracle $120 $480 $2,230 Self-service BI per user In-memory across all workloads TPC-H non-clustered 10TB Oracle is #4#2 SQL Server #1 SQL Server #3 SQL Server built-inbuilt-in built-in built-in built-in 0 1 4 0 0 3 34 29 22 15 5 22 6 43 20 69 18 49 3 -80 -70 -60 -50 -40 -30 -20 -10 0 2010 2011 2012 2013 2014 2015 SQL Server Oracle MySQL2 SAP HANA TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster at massive scale National Institute of Standards and Technology Comprehensive Vulnerability Database update 5/4/2015
  • 4. In-Database Advanced Analytics No need to move the data Open source R with in- memory & massive scale – multi-threading & massive parallel processing Data Scientist Interact directly with data R built-in to SQL Server Data Developer/DBA Manage data and analytics together Example Solutions • Sales forecasting • Warehouse efficiency • Predictive maintenance Extensibility ? R R Integration Relational data Analytic Library T-SQL interface 010010 100100 010101 New R scripts 010010 100100 010101 010010 100100 010101 010010 100100 010101 • Credit risk protection 010010 100100 010101 Microsoft Azure Marketplace Real-time operational analytics without moving the data NEW NEW End-to-end mobile BI Advanced AnalyticsMission critical OLTP
  • 5. High-performance open source R plus:  Enterprise Scale & Performance – Scales from workstations to large clusters – Scales to large data sizes – Growing portfolio of Parallelized algorithms  Secure, Scalable R Deployment/Operationalization  Write Once Deploy Anywhere for multiple platforms  IDE for data scientists and developers  Enterprise Class Support DistributedR DeployR DevelopR ScaleR ConnectR
  • 6. Cloud – SQL Server/SQL Azure Shifting how you purchase and manage machines Increased focus on Total Cost of Ownership and continuous improvements Built from the same code base We increased surface area compatibility with V12 Azure SQL Database We’re learning how to run our own code – the good and the bad We’re using that to improve both product and service Microsoft is the only provider both on-premises and in the cloud
  • 7. Order history Name SSN Date Jane Doe cm61ba906fd 2/28/2005 Jim Gray ox7ff654ae6d 3/18/2005 John Smith i2y36cg776rg 4/10/2005 Bill Brown nx290pldo90l 4/27/2005 Order history Name SSN Date Jane Doe cm61ba906fd 2/28/2005 Jim Gray ox7ff654ae6d 3/18/2005 John Smith i2y36cg776rg 4/10/2005 Bill Brown nx290pldo90l 4/27/2005 Customer data Product data Order History Stretch to cloud Stretch SQL Server into Azure Stretch warm and cold tables to Azure with remote query processing App Query Microsoft Azure  Jim Gray ox7ff654ae6d 3/18/2005 SQL Server 2016
  • 8. Azure SQL DW Fully managed relational data warehouse-as-a-service First elastic cloud data warehouse with proven SQL Server capabilities Support your smallest to your largest data storage needs Scales to petabytes of data Massively Parallel Processing Instant-on compute scales in seconds Query Relational / Non-Relational Saas Azure Public Cloud Office 365Office 365 Get started in minutes Integrated with Azure ML, PowerBI & ADF Simple billing compute & storage Pay for what you need, when you need it with dynamic pause AzureAzure
  • 9.
  • 10.
  • 11. Store any data relations Do any analysis SQL queries Hive, At any speed Batch Hive At any scale … elastic! Anywhere Data to Intelligent Action
  • 12. Web Logs, Omniture logs On-Premise SQL Server (customer and product data) In-Store Activity with Kinect sensors Social Data Diagnostic streaming Event hubs Machine Learning Stream Analytics Azure DataLake Data Factory: Move Data, Orchestrate, Schedule, and Monitor HDInsight HDInsight Machine Learning Azure SQL Data Warehouse Power BI INGEST PREPARE ANALYZE PUBLISH Stream Analytics CONSUMEDATA SOURCES Cortana Web/LOB Dashboards
  • 13.
  • 14.
  • 15.
  • 16. Azure Data Analytics Stack REEF library STORAGE YARN HDFS/WebHDFS API Compute-tier Cache Clusters (Local ENs + CSM) RAM / SSD / HDD WAS-based Remote Storage Cosmos Store API CLUSTER-WIDE RM (YARN++) YARN + Federation YARN + Rayon (Capacity reservation) YARN + Mercury Shared micro- services for all metadata (extent map, logical name space, secure store) based on Hekaton/RSL rings YARN + Mercury YARN + Mercury Application Engines Per-job RM and runtimeM/R U-SQL Batch Spark Tez Spark Runtime Spark HiveU-SQL Azure ML Azure SA COMPUTE TIER SQL-DW HDInsight IaaS Services
  • 17. Windows SMSG Live Ads CRM/Dynamics Windows Phone Xbox Live Office365 STB Malware Protection Microsoft Stores STBCommerceRisk Messenger LCA Exchange Yammer Skype Bing data managed: EBs cluster sizes: 10s of Ks # machines: 100s of Ks daily I/O: >100 PBs # internal developers: 1000s # daily jobs: 100s of Ks
  • 18. Observation Pattern Theory Hypothesis What will happen? How can we make it happen? Predictive Analytics Prescriptive Analytics What happened? Why did it happen? Descriptive Analytics Diagnostic Analytics Confirmation Theory Hypothesis Observation
  • 19. Implement Data Warehouse Physical Design ETL Development Reporting & Analytics Development Install and Tune Reporting & Analytics Design Dimension Modelling ETL Design Setup Infrastructure Understand Corporate Strategy Data sources ETL BI and analytic Data warehouse Gather Requirements Business Requirements Technical Requirements
  • 20. Ingest all data regardless of requirements Store all data in native format without schema definition Do analysis Using analytic engines like Hadoop Interactive queries Batch queries Machine Learning Data warehouse Real-time analytics Devices
  • 21. What happened? What is happening? Why did it happen? What are key relationships? What will happen? What if? How risky is it? What should happen? What is the best option? How can I optimize? Data sources
  • 22. Handling failures Sharing data, resources Parallelism Data-aware Optimization Security, Compliance, Governance Enterprise
  • 23. Forrester Wave Big Data Hadoop Cloud Solutions Q2 2016
  • 24. • Interactive and Real-Time Analytics requires i • Massive data volumes require scale-out stores using commodity servers, even archival storage Tiered Storage Seamlessly move data across tiers, mirroring life-cycle and usage patterns Schedule compute near low-latency copies of data How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
  • 25. • Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming) • Many users’ jobs (across these job types) run on the same machines (where the data lives) Resource Management with Multitenancy and SLAs Policy-driven management of vast compute pools co-located with data Schedule computation “near” data How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
  • 26. Azure Data Lake Store Fully managed cloud data store designed for analytics Supports HDFS compliant analytics applications and tools Petabyte files, unlimited account size High throughput for analytics performance Low latency ingestion with read as you write AAD-based authentication, access auditing File and folder-level ACLs, Encryption at rest
  • 27. ADLS Security: Encryption-at-Rest Transparently encrypts data flowing to and from public networks as well as at rest Transparent server-side encryption User can manage their own encryption keys or let Azure Data Lake Store manage the key using Azure Key Vault 28
  • 28. ADLS Security: Role-Based Access Control Each file and directory is associated with an owner and a group Files or directories have separate permissions (read(r), write(w), execute(x)) for owners, members of the group, and for all other users Fine-grained access control lists (ACLs) can be specified for specific named users or named groups 29
  • 29. ADL Store: Ingress Data can be ingested into Azure Data Lake Store from a variety of sources Server logs Azure Event Hub Apache Flume Azure Storage Blobs Custom programs .NET SDK JavaScript CLI Azure Portal Azure PowerShell Azure Data Factory Apache Sqoop Azure SQL DB Azure SQL DW Azure tables Table Storage On-premises databases SQL 30 ADL Store Built-in copy service
  • 30. ADL Store: Egress Data can be exported from Azure Data Lake Store into numerous targets/sinks Azure SQL DB SQL Azure SQL DW Azure Tables Table Storage On-premises databases Azure Data Factory Apache Sqoop Azure Storage Blobs Custom programs .NET SDK JavaScript CLI Azure Portal Azure PowerShell 31 Built-in copy service ADL Store
  • 31. Extent Metadata Data Data Data… Remote Storage Naming Service Secret Store 1) Filename Translation 3) Find Extents 4) Data access Remote storage tier builds securely on WAS Secure Works with YARN! COMPUTE TIER Secure Store Service Intelligent ingest Massively parallel 2) Azure Access Keys
  • 32. • Interactive and Real-Time Analytics requires i • Massive data volumes require scale-out stores using commodity servers, even archival storage Tiered Storage Scale storage independently of compute Seamlessly move data across tiers, mirroring life-cycle and usage patterns Schedule compute near low-latency copies of data Data Lifecycle Management How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
  • 33. Extent Metadata Data Data Data… Remote Storage Naming Service Secret Store 1) Filename Translation 3) Find Extents 4) Data access Remote storage tier builds securely on WAS Secure Works with YARN! COMPUTE TIER Data Data Data … Secure Store Service Local Storage Intelligent ingest Massively parallel 2) Azure Access Keys
  • 34. Azure HDInsight—Linux and Windows Managed, Monitored, Supported • Cluster customization – Install your favorite project • Harness existing .Net & Java skills to write customer extensions • Supports broad ecosystem of ISVs (Hadoop and Traditional) Full Apache Hadoop • Batch – MapReduce, PIG, Hive, Spark • Stream Processing and Analytics – Storm, SparkStreaming • Interactive SQL – Hive (Tez), and SparkSQL • Table Serving – Hbase • Machine Learning – SparkML, Mahout
  • 35. Batch MapReduce, PIG, Hive, Spark Interactive SQL Hive (Tez), SparkSQL Stream Analytics Storm, SparkStreaming Machine Learning SparkML, Mahout Table Serving Hbase Exploratory Visualization Jupyter, Zeppelin Interactive SQL SQL DW Stream Analytics Azure Stream Analytics Machine Learning Azure ML Table Serving Azure SQL DB Exploratory Visualization Power BI
  • 36. • > 14 million field hours http://www.ebird.org
  • 38. Azure Data Lake Analytics Service A new distributed analytics service Built on Apache YARN Scales dynamically with a dial Pay by the query Supports Azure AD for access control, roles, and integration with on-prem identity systems U-SQL language unifies the benefits of SQL with the power of C# Hive etc. will be added over time Processes data across Azure 41
  • 39. Get started Log in to Azure Create an ADLA account Write and submit an ADLA job with U-SQL (or Hive/Pig) The job reads and writes data from storage 1 2 3 4 30 seconds ADLS Azure Blobs Azure DB …
  • 40. ADLA Complements HDInsight HDInsight Dedicated managed clusters for developers familiar with the Open Source: Java, Eclipse, Hive, etc. Clusters offer customization, control, and flexibility in a managed Hadoop cluster ADLA Enables customers to leverage existing experience with C#, SQL & PowerShell Offers convenience, efficiency, and automatic scale in a “job service” form factor over a system-managed shared resource pool
  • 41. U-SQL A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data Allows users to focus on the what— not the how—of business problems Built on familiar languages (SQL and C#) and supported by a fully integrated development environment Built for data developers & scientists 44
  • 42. U-SQL Language Philosophy Declarative query and transformation language: • Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL Analytics functions • Optimizable, scalable Operates on unstructured & structured data • Schema on read over files • Relational metadata objects (e.g. database, table) Extensible from ground up: • Type system is based on C# • Expression language is C# 21 User-defined functions (U-SQL and C#) User-defined types (U-SQL/C#) (future) User-defined aggregators (C#) User-defined operators (UDO) (C#) U-SQL provides the parallelization and scale-out framework for usercode • EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS Expression-flow programming style: • Easy to use functional lambda composition • Composable, globally optimizable Federated query across distributed data sources (soon) REFERENCE MyDB.MyAssembly; CREATE TABLE T( cid int, first_order DateTime , last_order DateTime, order_count int , order_amount float ); @o = EXTRACT oid int, cid int, odate DateTime, amount float FROM "/input/orders.txt“ USING Extractors.Csv(); @c = EXTRACT cid int, name string, city string FROM "/input/customers.txt“ USING Extractors.Csv(); @j = SELECT c.cid, MIN(o.odate) AS firstorder , MAX(o.date) AS lastorder, COUNT(o.oid) AS ordercnt , SUM(c.amount) AS totalamount FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid WHERE c.city.StartsWith("New") && MyNamespace.MyFunction(o.odate) > 10 GROUP BY c.cid; OUTPUT @j TO "/output/result.txt" USING new MyData.Write(); INSERT INTO T SELECT * FROM @j;
  • 43. Federated Queries: Query Data Where It Lives Easily query data in multiple Azure data stores without moving it to a single store Benefits Avoid moving large amounts of data across the network between stores Single view of data irrespective of physical location Minimize data proliferation issues caused by maintaining multiple copies Single query language for all data Each data store maintains its own sovereignty Design choices based on the need U-SQL Query Result Query 46 Azure Storage Blobs Azure SQL in VMs Azure SQL DB Azure Data Lake Analytics
  • 44. Join Local (ADLS) and External Data 1. Create two tables. • An external table ‘PurchaseOrders’ that refers to the PurchaseOrders table in the external SQL Azure DB. • A ‘local’ table ‘UserIdsTable’ created by ‘extracting’ User Ids and region fields from the WebLogRecords.txt file stored in Azure Data Lake. 2. Join the PurchaseOrders table with UserIds table on the common UserId column. Purchase orders table Azure SQL DB External purchase orders table Local user IDs table JOIN (on User IDs) Azure Data Lake Analytics Find sum of all purchases by users in the ‘en-us’ region Query 9 47 WebLogRecords.txt
  • 45. Concepts: Jobs, Stages and Vertexes Each job is broken into a number of vertexes Each vertex is some work that needs to be done Input Output Output 6 Stages 8 Vertexes Vertexes are organized into stages – Vertexes in each stage do the same work on the same data – Vertex in one stage may depend on a vertex in a earlier stage Stages themselves are organized into an acyclic graph 49
  • 46. • Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming) • Many users’ jobs (across these job types) run on the same machines (where the data lives) Resource Management with Multitenancy and SLAs Policy-driven management of vast compute pools co-located with data Schedule computation “near” data How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
  • 47. Resource Managers for Big Data Allocate compute containers to competing jobs Multiple job engines shared pool Containers YARN: Resource manager for Hadoop2.x Corona, Mesos, Omega
  • 48. Shared Data and Compute Tiered Storage Relational Query Engine Machine Learning Compute Fabric (Resource Management) Multiple analytic engines sharing same resource pool Compute and store/cache on same machines
  • 49. What’s Behind a U-SQL Query . . . . . . … … …
  • 50. YARN Gaps resource allocation SLOs scalability limitations • High allocation latency • Support for specialized execution frameworks • Interactive environments, long-running services
  • 51. • Amoeba Rayon • Status: shipping in Apache Hadoop 2.6 • Mercury and Yaq • Status: Now in Apache Hadoop trunk! • Federation • Status: prototype and JIRA • Framework-level Pooling • Enable frameworks that want to take over resource allocation to support millisecond- level response and adaptation times • Status: spec Microsoft Contributions to OSS Apache YARN