Hadoop in the Cloud: Common Architectural Patterns
Need for Massive Compute and Storage
Advanced Deployment Expertise
Volume, Variety, Velocity of Data
Speed
Scale
Economics
Always Up, Always On
Open and flexible
Time to value
Microsoft’s cloud Hadoop offering
100% open source Apache Hadoop
Built on the latest releases across Hadoop (2.6)
Up and running in minutes with no hardware to deploy
Harness existing .NET and Java skills
Utilize familiar BI tools for analysis including Microsoft Excel
HDInsight Supports HBase
[Diagram: an HBase cluster on HDInsight. Master components: Name Node, Job Tracker, HMaster, Coordination. Four worker nodes, each shown with a Data Node, a Task Tracker, and a Region Server.]
DocumentDB HDInsight & Hadoop
Microsoft Azure DocumentDB can now act as an input source or output sink for Hive, Pig, and MapReduce jobs.
Store and query schema-less JSON data and run analytics on it.
Schema-less and native JSON
Transactional JavaScript
Rich querying
Tunable consistency
Scalable storage and throughput
For more information visit http://aka.ms/documentdb-hdinsight
Industry | Use Case | Type of Data
Digital Marketing & Advertisement | Campaign Tracking and Analysis | Clickstream
Digital Marketing & Advertisement | Customer Segment Analysis | Unstructured
Digital Marketing & Advertisement | Social Sentiment Analysis | Social Media
Retail and Consumer Goods | Clickstream Analytics | Clickstream
Retail and Consumer Goods | Online Recommendation System | Clickstream
Entertainment & Gaming | Analysis of game play data, player profiles and stats | Server Logs
Entertainment & Gaming | User/player engagement and retention | Server Logs
Entertainment & Gaming | Telemetry and Log Analysis | Server Logs
Entertainment & Gaming | Social Sentiment Analysis | Social Media
Health Care | Predictive Analysis of Patient Health & Clinical Decision Support | Unstructured
Health Care | Population, Risk, and Care Management | Semi-Structured
Health Care | Real-time quality measures to help providers meet their regulatory requirements | Unstructured
Manufacturing | Analysis of equipment-generated data (Wind Turbines, Gas Dispensers) | Sensor
Manufacturing | Predictive Maintenance | Sensor
Power & Utilities | Customer Engagement and Retention Analysis | Semi-Structured
Power & Utilities | Metering Analysis & Control | Unstructured
Government | Fraud and Money Laundering Detection | Semi-Structured
Government | ETL for Online Invoice Systems | Semi-Structured
Food | Analysis of lab and clinical trial data | Structured & Unstructured
Computer Software & IT Services | Telemetry and Log Analysis | Server Logs
Telecommunications | Analysis of Call Activity | Server Logs
Professional Services | QoS Analysis | Server Logs
Professional Services | Customer Care Analysis | Unstructured
Across Multiple Industries | ETL | Multi-structured
Unstructured/Mixed
• Social Data
• Server Logs
Semi-structured
• Click-stream
• Sensor data
Structured
• Customer records
• Transactions
Product Development
• Feature adoption and interaction
• Failure analysis
• Exception detection
Customer Understanding
• Customer segmentation
• Engagement and retention
Marketing
• Social sentiment analysis
• Recommendation systems
Scientific
• Trend discovery
• Large population analysis
Leading Computer Manufacturer / Retailer Solution
Retail and Consumer Goods
One of the largest computer manufacturers / retailers in the world. They develop, sell, repair, and support computers and related products and services.
What They Did | Analyze Clickstream and Provide Real-time Recommendations Online
Challenge
Needed to understand customer behavior on their retail website
Wanted to increase conversion rate from generic visitation to purchase
Solution
Chose Azure over Amazon AWS as part of company strategy
Use Azure HDInsight, HBase for HDInsight, SQL Database, Blobs, Websites, Event Hubs, and Machine Learning
Chose HDInsight due to faster deployment times as a cloud-based solution
Understand Visitors:
• After the customer clicks, the site classifies the customer and personalizes their shopping experience with recommendations as the page renders
• Use IP address and click interests/behavior to trigger a segmentation guess
• Show recommendations to help convert to purchase
• Process Omniture clickstream data with Hadoop and segment customers
If the customer leaves the site, send targeted advertisements
• ‘Abandoned Cart Emails’, ‘User Inactivity Emails’ & ‘Discount coupon on purchase of a System’
Clickstream,
Recommendation
Leading Computer Manufacturer / Retailer
How They Did It | Analyze Clickstream and Provide Real-time Recommendations Online
Collect clickstream data
• In tab separated text files
• Adding 22 new files per hour, ~5-18 MB per file
• Currently 1TB and growing
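The ingest figures above imply a rough hourly and daily data volume. The following is a back-of-the-envelope sketch in Python using only the numbers on the slide (22 files per hour, 5-18 MB per file); the function names are illustrative, and 1 GB is taken as 1000 MB.

```python
# Figures from the slide: 22 new files per hour, each roughly 5-18 MB.
FILES_PER_HOUR = 22
MIN_FILE_MB = 5
MAX_FILE_MB = 18


def hourly_volume_mb(files_per_hour: int, file_mb: float) -> float:
    """Return the data volume added per hour, in MB."""
    return files_per_hour * file_mb


def daily_volume_gb(files_per_hour: int, file_mb: float) -> float:
    """Return the data volume added per day, in GB (1 GB = 1000 MB here)."""
    return hourly_volume_mb(files_per_hour, file_mb) * 24 / 1000


low_hourly = hourly_volume_mb(FILES_PER_HOUR, MIN_FILE_MB)    # 110 MB
high_hourly = hourly_volume_mb(FILES_PER_HOUR, MAX_FILE_MB)   # 396 MB
low_daily = daily_volume_gb(FILES_PER_HOUR, MIN_FILE_MB)      # 2.64 GB
high_daily = daily_volume_gb(FILES_PER_HOUR, MAX_FILE_MB)     # about 9.5 GB
```

At these rates the store grows by roughly 2.6 to 9.5 GB per day, which is consistent with a data set that has reached the 1 TB range and keeps growing.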
Spin up Hadoop
• Use Hive scripts because of their SQL-like syntax
• Scripts extract click behavior (buys, additions to cart, reviews, etc.) and assign scores
• Jobs run hourly
• Currently 8 nodes, with plans to grow to 16
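The scoring step can be illustrated outside Hive. The sketch below is in Python, not the customer's Hive script: it reads tab-separated click records and sums a score per visitor. The column layout (visitor_id, event_type, product_id) and the weights in `EVENT_SCORES` are hypothetical, since the slide gives neither.

```python
import csv
import io
from collections import defaultdict

# Hypothetical weights per click behavior; the slide does not give real values.
EVENT_SCORES = {"buy": 5, "add_to_cart": 3, "review": 2, "view": 1}


def score_clicks(tsv_text: str) -> dict[str, int]:
    """Sum behavior scores per visitor from tab-separated click records.

    Assumed (hypothetical) columns: visitor_id, event_type, product_id.
    Unknown event types score 0.
    """
    totals = defaultdict(int)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        visitor_id, event_type = row[0], row[1]
        totals[visitor_id] += EVENT_SCORES.get(event_type, 0)
    return dict(totals)


sample = (
    "v1\tview\tp10\n"
    "v1\tadd_to_cart\tp10\n"
    "v2\treview\tp7\n"
    "v1\tbuy\tp10\n"
)
scores = score_clicks(sample)  # {"v1": 9, "v2": 2}
```

In Hive the same aggregation would typically be a GROUP BY over the click table; the Python version only shows the shape of the computation.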
[Architecture diagram. Labeled components: Website.com, Omniture, Product Catalog, Visitor Information Service, Event Hub, ASA, HDInsight Cluster, HBase, AzureML, Azure SQL, DE, IaaS VM, Blob Storage, NoSQL Storage, Persisted Storage, Targeted Email]
Industrial automation company partnering with multinational oil company
Oil and Gas
Leading industrial automation company that employs over 20,000 people, partnering with a leading multinational oil and gas company (one of the six oil and gas supermajors) that employs over 90,000 people.
What They Did | IoT internet-connected sensors to generate analytics for proactive maintenance
Challenge
Manage sites used for dispensing liquefied natural gas (a clean fuel for commercial customers in heavy-duty road transportation)
Built LNG refueling stations along US interstate highways
Stations are unmanned, so they built 24x7 remote management and monitoring to track diagnostics of each station for maintenance or tuning
Built internet-connected sensors embedded in 350 dispenser sites worldwide, generating tens of thousands of data points per second
• Temperature, pressure, vibration, etc.
Data needs outgrew the company’s internal datacenter and data warehouse
Solution
Chose Azure HDInsight, Data Factory, SQL Database, Machine Learning
Dashboards used to detect anomalies for proactive maintenance
• Changes in performance of the components
• Energy consumption of components
• Component downtime and reliability
Future: Goal is to expand program to hundreds of thousands of dispensers
Sensor,
Analytics
Industrial automation company partnering with
multinational oil company
How They Did It | IoT internet-connected sensors to generate analytics for proactive maintenance
Collect data from internet-connected sensors
• Tens of thousands of data points per second
• Interpolate time series prior to analysis
• Stored raw sensor data in Blobs every 5 minutes
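Interpolating a time series before analysis usually means placing irregular sensor readings on a regular time grid. The slide does not say which method was used; the sketch below assumes plain linear interpolation, and the function names and sample readings are made up for illustration.

```python
from bisect import bisect_left


def interpolate(samples: list[tuple[float, float]], t: float) -> float:
    """Linearly interpolate a sensor value at time t.

    samples: (timestamp_seconds, value) pairs sorted by timestamp.
    Times outside the sampled range are clamped to the nearest endpoint.
    """
    if not samples:
        raise ValueError("no samples")
    times = [ts for ts, _ in samples]
    if t <= times[0]:
        return samples[0][1]
    if t >= times[-1]:
        return samples[-1][1]
    i = bisect_left(times, t)  # first index with times[i] >= t
    t0, v0 = samples[i - 1]
    t1, v1 = samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def resample(samples, start, stop, step):
    """Return interpolated (time, value) pairs on a regular grid."""
    grid = []
    t = start
    while t <= stop:
        grid.append((t, interpolate(samples, t)))
        t += step
    return grid


# Irregular temperature readings (seconds, degrees); values are made up.
readings = [(0.0, 10.0), (4.0, 18.0), (10.0, 30.0)]
regular = resample(readings, 0.0, 10.0, 2.0)
```

Once every sensor is on the same grid, readings from different sensors at one site can be compared row by row in later Hive or Pig steps.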
Use Hadoop to execute scripts and Data Factory to orchestrate
• Hive and Pig scripts orchestrated by Data Factory
• Data resulting from scripts loaded into SQL Database
• Queries detect site anomalies to indicate maintenance/tuning
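The slide does not describe how the anomaly queries work. As one possible illustration, the sketch below flags readings that sit more than a chosen number of standard deviations from a site's mean. The threshold, the function name, and the pressure values are all assumptions, not details from the case study.

```python
from statistics import mean, pstdev


def find_anomalies(values: list[float], threshold: float = 2.0) -> list[int]:
    """Return indexes of values more than `threshold` standard deviations
    from the mean. Returns an empty list when the values do not vary."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Made-up pressure readings for one site; the spike at index 5 is the anomaly.
pressure = [101.0, 100.5, 99.8, 100.2, 100.9, 140.0, 100.1, 99.9]
flagged = find_anomalies(pressure)  # [5]
```

A production query would more likely compare each reading against a rolling or historical baseline per component, but the flag-by-deviation idea is the same.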
Produced dashboards with role-based reporting
• Azure Machine Learning, SSRS, Power BI for O365
• Provide users with a customizable interface
• View current and historical data (day-to-day operations, asset performance over time, etc.)
• Leveraged Azure Mobile Notification Hub for real-time notifications, alarms, or important events
Use Azure ML to predict
• Understand which pumps, run at what speeds, maximize water supply while minimizing energy use