SlideShare a Scribd company logo
1 of 22
Need for Massive
Compute and
Storage
Advanced
Deployment
Expertise
Volume, Variety,
Velocity of Data
Speed Scale Economics
Always Up,
Always On Open and flexible
Time to value
Azure
Storage
HDInsight
Data Factory
ML
Stream
Analytics
Database
DocumentDB
Search
Event Hubs
Microsoft’s cloud Hadoop offering
100% open source Apache Hadoop
Built on the latest releases across Hadoop (2.6)
Up and running in minutes with no hardware to deploy
Harness existing .NET and Java skills
Utilize familiar BI tools for analysis including Microsoft Excel
HDInsight Supports Hive
Hadoop 2.0
HDInsight Supports HBase
Data Node Data Node Data Node Data Node
Task Tracker Task Tracker Task Tracker Task Tracker
Name Node
Job Tracker
HMaster
Coordination
Region Server Region Server Region Server Region Server
Stream
processing
Search and query
Data analytics (Excel)
Web/thick client
dashboards
Devices to take action
RabbitMQ /
ActiveMQ
DocumentDB HDInsight & Hadoop
Microsoft Azure DocumentDB can now act as an
input source or output sink for Hive, Pig, and MapReduce jobs.
Store and query schema-less, JSON data and perform analytics with it.
 Schema-less and native JSON
 Transactional JavaScript
 Rich querying
 Tunable consistency
 Scalable storage and
throughput
For more information visit http://aka.ms/documentdb-hdinsight
Industry Use Case Type of Data
Digital Marketing &
Advertisement
Campaign Tracking and Analysis Clickstream
Customer Segment Analysis Unstructured
Social Sentiment Analysis Social Media
Retail and Consumer Goods
Clickstream Analytics Clickstream
Online Recommendation System Clickstream
Entertainment & Gaming
Analysis of game play data, player profiles and stats Server Logs
User/player engagement and retention Server Logs
Telemetry and Log Analysis Server Logs
Social Sentiment Analysis Social Media
Health Care
Predictive Analysis of Patient Health & Clinical Decision Support Unstructured
Population, risk, and Care management Semi-Structured
Real-time quality measures to assist providers reach their regulatory requirements Unstructured
Manufacturing
Analysis of equipment-generated data (Wind Turbines, Gas Dispensers) Sensor
Predictive Maintenance Sensor
Power & Utilities
Customer Engagement and Retention Analysis Semi-Structured
Metering Analysis & Control Unstructured
Government
Fraud and money laundry detection Semi-Structured
ETL for Online Invoice Systems Semi-Structured
Industry Use Case Type of Data
Food Analysis of lab and clinical trial data Structured & Unstructured
Computer Software & IT Services Telemetry and Log Analysis Server Logs
Telecommunications Analysis of Call Activity Server Logs
Professional Services
QoS Analysis Server Logs
Customer Care Analysis Unstructured
Across Multiple Industries ETL Multi-structured
• Social Data
• Server Logs
Unstructured/Mixed
• Click-stream
• Sensor data
Semi-structured
• Customer records
• Transactions
Structured
• Feature adoption and interaction
• Failure analysis
• Exception detection
Product Development
• Customer segmentation
• Engagement and retention
Customer Understanding
• Social sentiment analysis
• Recommendation systems
Marketing
• Trend discovery
• Large population analysis
Scientific
Leading Computer Manufacturer / Retailer Solution
Retail and
Consumer Goods
One of the largest computer
manufacturer / retailers in the
world. They develop, sell, repair
and support computers and
related products and services
What They Did | Analyze Clickstream and Provide Real-time Recommendations Online
Challenge
Needed to understand customer behavior on their retail website
Wanted to increase conversion rate from generic visitation to purchase
Solution
Chose Azure over Amazon AWS as part of company strategy
Use Azure HDInsight, HBase for HDInsight, SQL Database, Blobs, Websites, Event Hubs,
Machine Learning
Chose HDInsight due to faster deployment times as a cloud-based solution
Understand Visitors:
• After clicks are made by customer, site will classify the customer and personalize their shopping
experience with recommendations as page is rendering
• Use IP address and click interests/behavior to trigger guess on segmentation
• Show recommendations to help convert to purchase
• Processes Omniture clickstream data with Hadoop and segment customers
If customer leaves the site, will send targeted advertisement
• ‘Abandoned Cart Emails’, ‘User Inactivity Emails’ & ‘Discount coupon on purchase of a System
Clickstream,
Recommendation
BK1
Email Server
Leading Computer Manufacturer / Retailer
How They Did It | Analyze Clickstream and Provide Real-time Recommendations Online
How They Did It
Collect clickstream data
• In tab separated text files
• Adding 22 new files per hour ~5-18 MB/file
• Currently 1TB and growing
Spin up Hadoop
• Use Hive scripts because of SQL-like syntax
• Extracts click behavior like buys, additions to carts, reviews etc.
and assigns scores
• Jobs run hourly
• Currently 8-nodes with plans to 16
Clickstream,
Recommendation
BK1
HDInsight Cluster
AzureML
Event Hub
ASA
AzureML
Azure SQL
DE
Blob
Storage
NoSQL
Storage
IaaS VM
HBase
Targeted Email
Persisted
Storage
Blob
Storage
Blob
Storage
Visitor
Information
Service
Omniture Product
Catalog
Website.com
Industrial automation company partnering with
multinational oil company
Oil and Gas
Leading industrial automation
company who employs over
20,000 people.
partnering with
Leading multinational oil and gas
company (one of the six oil and
gas super majors) who employs
over 90,000 people.
What They Did | IoT internet-connected sensors to generate analytics for proactive maintenance
Challenge
Manage sites used for dispensing liquefied natural gas (clean fuel for commercial customers
who do heavy-duty road transportation)
Built LNG refueling stations across US interstate highway
Stations are unmanned so they built 24x7 remote management and monitoring to track
diagnostics of each station for maintenance or tuning
Built internet-connected sensors embedded in 350 dispenser sites worldwide generating
tens of thousands data points per second
• Temperature, pressure, vibration, etc.
Data needs outgrew company’s internal datacenter and data warehouse
Solution
Chose Azure HDInsight, Data Factory, SQL Database, Machine Learning
Dashboards used to detect anomalies for proactive maintenance
• Changes in performance of the components
• Energy consumption of components
• Component downtime and reliability
Future: Goal is to expand program to hundreds of thousands of dispensers
Sensor,
Analytics
BK1
Industrial automation company partnering with
multinational oil company
How They Did It | IoT internet-connected sensors to generate analytics for proactive maintenance
How They Did It
Collect data from internet-collected sensors
• Tens of thousands data points per second
• Interpolate time-series prior to analysis
• Stored raw sensor data in Blobs every 5 minutes
Use Hadoop to execute scripts and Data Factory to
orchestrate
• Hive and Pig scripts orchestrated by Data Factory
• Data resulting from scripts loaded in SQL Database
• Queries detect site anomalies to indicate maintenance/tuning
Produced dashboards with role-based reporting
• Azure Machine Learning , SSRS, Power BI for O365
• Provide users with customizable interface
• View current and historical data (day-to-day operations, asset
performance over time, etc.)
• Leveraged Azure Mobile Notification Hub for real-time
notifications, alarms, or important events
Use Azure ML to predict
• Understand which pumps, run at what speeds, maximized water
supply while minimizing energy use
Sensor,
Analytics
Hadoop in the Cloud: Common Architectural Patterns

More Related Content

What's hot

Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera
Cloudera, Inc.
 
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Edureka!
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data AnalyticsAttunity
 
Cost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationCost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationDataWorks Summit
 
McGraw-Hill Optimizes Analytics Workloads with Databricks
 McGraw-Hill Optimizes Analytics Workloads with Databricks McGraw-Hill Optimizes Analytics Workloads with Databricks
McGraw-Hill Optimizes Analytics Workloads with DatabricksAmazon Web Services
 
Part 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchPart 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchCloudera, Inc.
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteAmr Awadallah
 
Extending your Hadoop Implementation to the Cloud
Extending your Hadoop Implementation to the CloudExtending your Hadoop Implementation to the Cloud
Extending your Hadoop Implementation to the CloudDataWorks Summit
 
Cloud computing and Hadoop introduction
Cloud computing and Hadoop introductionCloud computing and Hadoop introduction
Cloud computing and Hadoop introductionchristian.perez
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupAndrei Savu
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudCloudera, Inc.
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Carole Gunst
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera, Inc.
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...Cloudera, Inc.
 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...Cloudera, Inc.
 
Self-Service Provisioning and Hadoop Management with Apache Ambari
Self-Service Provisioning and  Hadoop Management with Apache AmbariSelf-Service Provisioning and  Hadoop Management with Apache Ambari
Self-Service Provisioning and Hadoop Management with Apache AmbariDataWorks Summit
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?James Serra
 

What's hot (20)

Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera

 
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Accelerating Big Data Analytics
Accelerating Big Data AnalyticsAccelerating Big Data Analytics
Accelerating Big Data Analytics
 
Cost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationCost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop Implementation
 
McGraw-Hill Optimizes Analytics Workloads with Databricks
 McGraw-Hill Optimizes Analytics Workloads with Databricks McGraw-Hill Optimizes Analytics Workloads with Databricks
McGraw-Hill Optimizes Analytics Workloads with Databricks
 
Part 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science WorkbenchPart 1: Introducing the Cloudera Data Science Workbench
Part 1: Introducing the Cloudera Data Science Workbench
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Extending your Hadoop Implementation to the Cloud
Extending your Hadoop Implementation to the CloudExtending your Hadoop Implementation to the Cloud
Extending your Hadoop Implementation to the Cloud
 
Cloud computing and Hadoop introduction
Cloud computing and Hadoop introductionCloud computing and Hadoop introduction
Cloud computing and Hadoop introduction
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS MeetupChallenges for running Hadoop on AWS - AdvancedAWS Meetup
Challenges for running Hadoop on AWS - AdvancedAWS Meetup
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac... Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
Cloudera Data Science Workbench: sparklyr, implyr, and More - dplyr Interfac...
 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
 
Self-Service Provisioning and Hadoop Management with Apache Ambari
Self-Service Provisioning and  Hadoop Management with Apache AmbariSelf-Service Provisioning and  Hadoop Management with Apache Ambari
Self-Service Provisioning and Hadoop Management with Apache Ambari
 
Should I move my database to the cloud?
Should I move my database to the cloud?Should I move my database to the cloud?
Should I move my database to the cloud?
 

Viewers also liked

Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoopdarugar
 
Hadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-TenancyHadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-TenancyTreasure Data, Inc.
 
Configuring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseConfiguring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseCloudera, Inc.
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Remy Rosenbaum
 
The Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureThe Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureDataWorks Summit/Hadoop Summit
 
100424 teradata cloud computing 3rd party influencers2c
100424 teradata cloud computing 3rd party influencers2c100424 teradata cloud computing 3rd party influencers2c
100424 teradata cloud computing 3rd party influencers2cguest8ebe0a8
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon RedshiftAmazon Web Services
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationHortonworks
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 DataWorks Summit
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTCloudera, Inc.
 

Viewers also liked (16)

Cloud Computing: Hadoop
Cloud Computing: HadoopCloud Computing: Hadoop
Cloud Computing: Hadoop
 
Hadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-TenancyHadoop meets Cloud with Multi-Tenancy
Hadoop meets Cloud with Multi-Tenancy
 
Hadoop on the Cloud
Hadoop on the CloudHadoop on the Cloud
Hadoop on the Cloud
 
Configuring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseConfiguring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the Enterprise
 
Managing a Multi-Tenant Data Lake
Managing a Multi-Tenant Data LakeManaging a Multi-Tenant Data Lake
Managing a Multi-Tenant Data Lake
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
 
MapReduce in Cloud Computing
MapReduce in Cloud ComputingMapReduce in Cloud Computing
MapReduce in Cloud Computing
 
The Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureThe Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data Architecture
 
100424 teradata cloud computing 3rd party influencers2c
100424 teradata cloud computing 3rd party influencers2c100424 teradata cloud computing 3rd party influencers2c
100424 teradata cloud computing 3rd party influencers2c
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift
(ISM303) Migrating Your Enterprise Data Warehouse To Amazon Redshift
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Big Data & The Cloud
Big Data & The CloudBig Data & The Cloud
Big Data & The Cloud
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BT
 

Similar to Hadoop in the Cloud: Common Architectural Patterns

Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
NYC Data Amp - Microsoft Azure and Data Services Overview
NYC Data Amp - Microsoft Azure and Data Services OverviewNYC Data Amp - Microsoft Azure and Data Services Overview
NYC Data Amp - Microsoft Azure and Data Services OverviewTravis Wright
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesAmazon Web Services
 
Girish Juneja - Intel Big Data & Cloud Summit 2013
Girish Juneja - Intel Big Data & Cloud Summit 2013Girish Juneja - Intel Big Data & Cloud Summit 2013
Girish Juneja - Intel Big Data & Cloud Summit 2013IntelAPAC
 
Implementing Advanced Analytics Platform
Implementing Advanced Analytics PlatformImplementing Advanced Analytics Platform
Implementing Advanced Analytics PlatformArvind Sathi
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesAmazon Web Services
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Group
 
WebAction In-Memory Computing Summit 2015
WebAction In-Memory Computing Summit 2015WebAction In-Memory Computing Summit 2015
WebAction In-Memory Computing Summit 2015WebAction
 
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Amazon Web Services
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationMongoDB
 
Business Analytics Paradigm Change
Business Analytics Paradigm ChangeBusiness Analytics Paradigm Change
Business Analytics Paradigm ChangeDmitry Anoshin
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 
Deep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For EcommerceDeep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For EcommerceDeep.BI
 
Fast, Powerful and Scalable Analytics
Fast, Powerful and Scalable AnalyticsFast, Powerful and Scalable Analytics
Fast, Powerful and Scalable AnalyticsMariaDB plc
 
StreamCentral Technical Overview
StreamCentral Technical OverviewStreamCentral Technical Overview
StreamCentral Technical OverviewRaheel Retiwalla
 

Similar to Hadoop in the Cloud: Common Architectural Patterns (20)

Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
NYC Data Amp - Microsoft Azure and Data Services Overview
NYC Data Amp - Microsoft Azure and Data Services OverviewNYC Data Amp - Microsoft Azure and Data Services Overview
NYC Data Amp - Microsoft Azure and Data Services Overview
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Girish Juneja - Intel Big Data & Cloud Summit 2013
Girish Juneja - Intel Big Data & Cloud Summit 2013Girish Juneja - Intel Big Data & Cloud Summit 2013
Girish Juneja - Intel Big Data & Cloud Summit 2013
 
Implementing Advanced Analytics Platform
Implementing Advanced Analytics PlatformImplementing Advanced Analytics Platform
Implementing Advanced Analytics Platform
 
Modern Data Architectures for Business Outcomes
Modern Data Architectures for Business OutcomesModern Data Architectures for Business Outcomes
Modern Data Architectures for Business Outcomes
 
Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
WebAction In-Memory Computing Summit 2015
WebAction In-Memory Computing Summit 2015WebAction In-Memory Computing Summit 2015
WebAction In-Memory Computing Summit 2015
 
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...
 
Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?Big data for Telco: opportunity or threat?
Big data for Telco: opportunity or threat?
 
Microstrategy Overview
Microstrategy OverviewMicrostrategy Overview
Microstrategy Overview
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital Transformation
 
Business Analytics Paradigm Change
Business Analytics Paradigm ChangeBusiness Analytics Paradigm Change
Business Analytics Paradigm Change
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 
Deep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For EcommerceDeep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
 
Fast, Powerful and Scalable Analytics
Fast, Powerful and Scalable AnalyticsFast, Powerful and Scalable Analytics
Fast, Powerful and Scalable Analytics
 
StreamCentral Technical Overview
StreamCentral Technical OverviewStreamCentral Technical Overview
StreamCentral Technical Overview
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Hadoop in the Cloud: Common Architectural Patterns

  • 1.
  • 2.
  • 3. Need for Massive Compute and Storage Advanced Deployment Expertise Volume, Variety, Velocity of Data Speed Scale Economics Always Up, Always On Open and flexible Time to value
  • 5. Microsoft’s cloud Hadoop offering 100% open source Apache Hadoop Built on the latest releases across Hadoop (2.6) Up and running in minutes with no hardware to deploy Harness existing .NET and Java skills Utilize familiar BI tools for analysis including Microsoft Excel
  • 6.
  • 7.
  • 9. HDInsight Supports HBase Data Node Data Node Data Node Data Node Task Tracker Task Tracker Task Tracker Task Tracker Name Node Job Tracker HMaster Coordination Region Server Region Server Region Server Region Server
  • 10. Stream processing Search and query Data analytics (Excel) Web/thick client dashboards Devices to take action RabbitMQ / ActiveMQ
  • 11.
  • 12. DocumentDB HDInsight & Hadoop Microsoft Azure DocumentDB can now act as an input source or output sink for Hive, Pig, and MapReduce jobs. Store and query schema-less, JSON data and perform analytics with it.  Schema-less and native JSON  Transactional JavaScript  Rich querying  Tunable consistency  Scalable storage and throughput For more information visit http://aka.ms/documentdb-hdinsight
  • 13.
  • 14.
  • 15. Industry Use Case Type of Data Digital Marketing & Advertisement Campaign Tracking and Analysis Clickstream Customer Segment Analysis Unstructured Social Sentiment Analysis Social Media Retail and Consumer Goods Clickstream Analytics Clickstream Online Recommendation System Clickstream Entertainment & Gaming Analysis of game play data, player profiles and stats Server Logs User/player engagement and retention Server Logs Telemetry and Log Analysis Server Logs Social Sentiment Analysis Social Media Health Care Predictive Analysis of Patient Health & Clinical Decision Support Unstructured Population, risk, and Care management Semi-Structured Real-time quality measures to assist providers reach their regulatory requirements Unstructured Manufacturing Analysis of equipment-generated data (Wind Turbines, Gas Dispensers) Sensor Predictive Maintenance Sensor Power & Utilities Customer Engagement and Retention Analysis Semi-Structured Metering Analysis & Control Unstructured Government Fraud and money laundry detection Semi-Structured ETL for Online Invoice Systems Semi-Structured
  • 16. Industry Use Case Type of Data Food Analysis of lab and clinical trial data Structured & Unstructured Computer Software & IT Services Telemetry and Log Analysis Server Logs Telecommunications Analysis of Call Activity Server Logs Professional Services QoS Analysis Server Logs Customer Care Analysis Unstructured Across Multiple Industries ETL Multi-structured
  • 17. • Social Data • Server Logs Unstructured/Mixed • Click-stream • Sensor data Semi-structured • Customer records • Transactions Structured • Feature adoption and interaction • Failure analysis • Exception detection Product Development • Customer segmentation • Engagement and retention Customer Understanding • Social sentiment analysis • Recommendation systems Marketing • Trend discovery • Large population analysis Scientific
  • 18. Leading Computer Manufacturer / Retailer Solution Retail and Consumer Goods One of the largest computer manufacturer / retailers in the world. They develop, sell, repair and support computers and related products and services What They Did | Analyze Clickstream and Provide Real-time Recommendations Online Challenge Needed to understand customer behavior on their retail website Wanted to increase conversion rate from generic visitation to purchase Solution Chose Azure over Amazon AWS as part of company strategy Use Azure HDInsight, HBase for HDInsight, SQL Database, Blobs, Websites, Event Hubs, Machine Learning Chose HDInsight due to faster deployment times as a cloud-based solution Understand Visitors: • After clicks are made by customer, site will classify the customer and personalize their shopping experience with recommendations as page is rendering • Use IP address and click interests/behavior to trigger guess on segmentation • Show recommendations to help convert to purchase • Processes Omniture clickstream data with Hadoop and segment customers If customer leaves the site, will send targeted advertisement • ‘Abandoned Cart Emails’, ‘User Inactivity Emails’ & ‘Discount coupon on purchase of a System Clickstream, Recommendation BK1
  • 19. Email Server Leading Computer Manufacturer / Retailer How They Did It | Analyze Clickstream and Provide Real-time Recommendations Online How They Did It Collect clickstream data • In tab separated text files • Adding 22 new files per hour ~5-18 MB/file • Currently 1TB and growing Spin up Hadoop • Use Hive scripts because of SQL-like syntax • Extracts click behavior like buys, additions to carts, reviews etc. and assigns scores • Jobs run hourly • Currently 8-nodes with plans to 16 Clickstream, Recommendation BK1 HDInsight Cluster AzureML Event Hub ASA AzureML Azure SQL DE Blob Storage NoSQL Storage IaaS VM HBase Targeted Email Persisted Storage Blob Storage Blob Storage Visitor Information Service Omniture Product Catalog Website.com
  • 20. Industrial automation company partnering with multinational oil company Oil and Gas Leading industrial automation company who employs over 20,000 people. partnering with Leading multinational oil and gas company (one of the six oil and gas super majors) who employs over 90,000 people. What They Did | IoT internet-connected sensors to generate analytics for proactive maintenance Challenge Manage sites used for dispensing liquefied natural gas (clean fuel for commercial customers who do heavy-duty road transportation) Built LNG refueling stations across US interstate highway Stations are unmanned so they built 24x7 remote management and monitoring to track diagnostics of each station for maintenance or tuning Built internet-connected sensors embedded in 350 dispenser sites worldwide generating tens of thousands data points per second • Temperature, pressure, vibration, etc. Data needs outgrew company’s internal datacenter and data warehouse Solution Chose Azure HDInsight, Data Factory, SQL Database, Machine Learning Dashboards used to detect anomalies for proactive maintenance • Changes in performance of the components • Energy consumption of components • Component downtime and reliability Future: Goal is to expand program to hundreds of thousands of dispensers Sensor, Analytics
  • 21. BK1 Industrial automation company partnering with multinational oil company How They Did It | IoT internet-connected sensors to generate analytics for proactive maintenance How They Did It Collect data from internet-collected sensors • Tens of thousands data points per second • Interpolate time-series prior to analysis • Stored raw sensor data in Blobs every 5 minutes Use Hadoop to execute scripts and Data Factory to orchestrate • Hive and Pig scripts orchestrated by Data Factory • Data resulting from scripts loaded in SQL Database • Queries detect site anomalies to indicate maintenance/tuning Produced dashboards with role-based reporting • Azure Machine Learning , SSRS, Power BI for O365 • Provide users with customizable interface • View current and historical data (day-to-day operations, asset performance over time, etc.) • Leveraged Azure Mobile Notification Hub for real-time notifications, alarms, or important events Use Azure ML to predict • Understand which pumps, run at what speeds, maximized water supply while minimizing energy use Sensor, Analytics