Hadoop in the Cloud: Common Architectural Patterns

DataWorks Summit
April 27, 2015

  1. Need for Massive Compute and Storage; Advanced Deployment Expertise; Volume, Variety, Velocity of Data; Speed; Scale; Economics; Always Up, Always On; Open and flexible; Time to value
  2. Azure Storage; HDInsight; Data Factory; ML; Stream Analytics; Database; DocumentDB; Search; Event Hubs
  3. Microsoft’s cloud Hadoop offering
      • 100% open source Apache Hadoop
      • Built on the latest releases across Hadoop (2.6)
      • Up and running in minutes with no hardware to deploy
      • Harness existing .NET and Java skills
      • Utilize familiar BI tools for analysis including Microsoft Excel
  4. HDInsight Supports Hive Hadoop 2.0
  5. HDInsight Supports HBase
      Diagram labels: Name Node; Job Tracker; HMaster; Coordination; Data Node (×4); Task Tracker (×4); Region Server (×4)
  6. Stream processing; Search and query; Data analytics (Excel); Web/thick client dashboards; Devices to take action; RabbitMQ / ActiveMQ
  7. DocumentDB, HDInsight & Hadoop
      Microsoft Azure DocumentDB can now act as an input source or output sink for Hive, Pig, and MapReduce jobs. Store and query schema-less JSON data and perform analytics with it.
      • Schema-less and native JSON
      • Transactional JavaScript
      • Rich querying
      • Tunable consistency
      • Scalable storage and throughput
      For more information visit http://aka.ms/documentdb-hdinsight
  8. Use cases by industry

      | Industry | Use Case | Type of Data |
      |---|---|---|
      | Digital Marketing & Advertisement | Campaign Tracking and Analysis | Clickstream |
      | Digital Marketing & Advertisement | Customer Segment Analysis | Unstructured |
      | Digital Marketing & Advertisement | Social Sentiment Analysis | Social Media |
      | Retail and Consumer Goods | Clickstream Analytics | Clickstream |
      | Retail and Consumer Goods | Online Recommendation System | Clickstream |
      | Entertainment & Gaming | Analysis of game play data, player profiles and stats | Server Logs |
      | Entertainment & Gaming | User/player engagement and retention | Server Logs |
      | Entertainment & Gaming | Telemetry and Log Analysis | Server Logs |
      | Entertainment & Gaming | Social Sentiment Analysis | Social Media |
      | Health Care | Predictive Analysis of Patient Health & Clinical Decision Support | Unstructured |
      | Health Care | Population, risk, and care management | Semi-Structured |
      | Health Care | Real-time quality measures to help providers meet their regulatory requirements | Unstructured |
      | Manufacturing | Analysis of equipment-generated data (Wind Turbines, Gas Dispensers) | Sensor |
      | Manufacturing | Predictive Maintenance | Sensor |
      | Power & Utilities | Customer Engagement and Retention Analysis | Semi-Structured |
      | Power & Utilities | Metering Analysis & Control | Unstructured |
      | Government | Fraud and money laundering detection | Semi-Structured |
      | Government | ETL for Online Invoice Systems | Semi-Structured |
  9. Use cases by industry (continued)

      | Industry | Use Case | Type of Data |
      |---|---|---|
      | Food | Analysis of lab and clinical trial data | Structured & Unstructured |
      | Computer Software & IT Services | Telemetry and Log Analysis | Server Logs |
      | Telecommunications | Analysis of Call Activity | Server Logs |
      | Professional Services | QoS Analysis | Server Logs |
      | Professional Services | Customer Care Analysis | Unstructured |
      | Across Multiple Industries | ETL | Multi-structured |
  10. Data types and what they are used for
      Data types:
      • Unstructured/Mixed: Social Data, Server Logs
      • Semi-structured: Click-stream, Sensor data
      • Structured: Customer records, Transactions
      Uses:
      • Product Development: Feature adoption and interaction, Failure analysis, Exception detection
      • Customer Understanding: Customer segmentation, Engagement and retention
      • Marketing: Social sentiment analysis, Recommendation systems
      • Scientific: Trend discovery, Large population analysis
  11. Leading Computer Manufacturer / Retailer
      Industry: Retail and Consumer Goods
      One of the largest computer manufacturer / retailers in the world. They develop, sell, repair and support computers and related products and services.
      What They Did | Analyze Clickstream and Provide Real-time Recommendations Online
      Challenge:
      • Needed to understand customer behavior on their retail website
      • Wanted to increase conversion rate from generic visitation to purchase
      Solution:
      • Chose Azure over Amazon AWS as part of company strategy
      • Use Azure HDInsight, HBase for HDInsight, SQL Database, Blobs, Websites, Event Hubs, Machine Learning
      • Chose HDInsight due to faster deployment times as a cloud-based solution
      Understand Visitors:
      • After clicks are made by customer, site will classify the customer and personalize their shopping experience with recommendations as page is rendering
      • Use IP address and click interests/behavior to trigger guess on segmentation
      • Show recommendations to help convert to purchase
      • Processes Omniture clickstream data with Hadoop and segment customers
      If customer leaves the site, will send targeted advertisement:
      • ‘Abandoned Cart Emails’, ‘User Inactivity Emails’ & ‘Discount coupon on purchase of a System
      Tags: Clickstream, Recommendation
  12. Leading Computer Manufacturer / Retailer
      How They Did It | Analyze Clickstream and Provide Real-time Recommendations Online
      Collect clickstream data:
      • In tab-separated text files
      • Adding 22 new files per hour, ~5-18 MB/file
      • Currently 1 TB and growing
      Spin up Hadoop:
      • Use Hive scripts because of SQL-like syntax
      • Extracts click behavior like buys, additions to carts, reviews etc. and assigns scores
      • Jobs run hourly
      • Currently 8 nodes, with plans to grow to 16
      Tags: Clickstream, Recommendation
      Architecture diagram labels: Website.com; Omniture; Product Catalog; Visitor Information Service; Event Hub; ASA; AzureML; HDInsight Cluster; HBase; IaaS VM; NoSQL Storage; Blob Storage; Persisted Storage; Azure SQL DE; Email Server; Targeted Email
  13. Industrial automation company partnering with multinational oil company
      Industry: Oil and Gas
      A leading industrial automation company that employs over 20,000 people, partnering with a leading multinational oil and gas company (one of the six oil and gas supermajors) that employs over 90,000 people.
      What They Did | IoT internet-connected sensors to generate analytics for proactive maintenance
      Challenge:
      • Manage sites used for dispensing liquefied natural gas (clean fuel for commercial customers who do heavy-duty road transportation)
      • Built LNG refueling stations across US interstate highways
      • Stations are unmanned, so they built 24x7 remote management and monitoring to track diagnostics of each station for maintenance or tuning
      • Built internet-connected sensors embedded in 350 dispenser sites worldwide, generating tens of thousands of data points per second (temperature, pressure, vibration, etc.)
      • Data needs outgrew company’s internal datacenter and data warehouse
      Solution:
      • Chose Azure HDInsight, Data Factory, SQL Database, Machine Learning
      • Dashboards used to detect anomalies for proactive maintenance: changes in performance of the components; energy consumption of components; component downtime and reliability
      Future: Goal is to expand program to hundreds of thousands of dispensers
      Tags: Sensor, Analytics
  14. Industrial automation company partnering with multinational oil company
      How They Did It | IoT internet-connected sensors to generate analytics for proactive maintenance
      Collect data from internet-connected sensors:
      • Tens of thousands of data points per second
      • Interpolate time-series prior to analysis
      • Stored raw sensor data in Blobs every 5 minutes
      Use Hadoop to execute scripts and Data Factory to orchestrate:
      • Hive and Pig scripts orchestrated by Data Factory
      • Data resulting from scripts loaded in SQL Database
      • Queries detect site anomalies to indicate maintenance/tuning
      Produced dashboards with role-based reporting:
      • Azure Machine Learning, SSRS, Power BI for O365
      • Provide users with customizable interface
      • View current and historical data (day-to-day operations, asset performance over time, etc.)
      • Leveraged Azure Mobile Notification Hub for real-time notifications, alarms, or important events
      Use Azure ML to predict:
      • Understand which pumps, run at what speeds, maximize water supply while minimizing energy use
      Tags: Sensor, Analytics
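
Slide 12 describes hourly Hive jobs that extract click behavior (buys, additions to carts, reviews) from tab-separated text files and assign scores. The deck gives neither the column layout nor the weights, so the Python sketch below only illustrates the idea and is not the retailer's Hive script. The three-column layout (visitor, event, product), the event names, and the weights in `EVENT_SCORES` are all assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical weights: the slides say click events are scored, but not how.
EVENT_SCORES = {"buy": 5, "add_to_cart": 3, "review": 2}


def score_visitors(tsv_file):
    """Sum event scores per visitor from tab-separated rows.

    Assumed row layout (not from the slides): visitor_id, event, product_id.
    Events without an entry in EVENT_SCORES contribute 0.
    """
    totals = defaultdict(int)
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        visitor_id, event, _product = row[:3]
        totals[visitor_id] += EVENT_SCORES.get(event, 0)
    return dict(totals)


# Tiny in-memory stand-in for one hourly clickstream file.
SAMPLE = (
    "v1\tbuy\tlaptop\n"
    "v1\treview\tlaptop\n"
    "v2\tadd_to_cart\tmouse\n"
    "v2\tpage_view\tmouse\n"
)

scores = score_visitors(io.StringIO(SAMPLE))
print(scores)  # {'v1': 7, 'v2': 3}
```

For a sense of scale, the figures on the slide (22 files per hour at roughly 5 to 18 MB each) work out to about 110 to 396 MB of new clickstream data per hour, which is why the work is batched into hourly jobs on a cluster rather than run on a single machine as in this sketch.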
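
Slide 14 mentions two steps without saying how they were done: interpolating the sensor time series before analysis, and running queries that detect site anomalies. As a rough sketch under stated assumptions, the Python below resamples irregular readings onto a regular grid with linear interpolation and then flags values far from the series median. The linear method, the median rule, the `tolerance` parameter, and the sample readings are all hypothetical; the deck does not say which interpolation or detection method the project used.

```python
from bisect import bisect_left
from statistics import median


def interpolate(samples, step):
    """Linearly resample (time, value) pairs onto a regular grid.

    `samples` must be sorted by strictly increasing time.
    """
    times = [t for t, _ in samples]
    grid = []
    t = times[0]
    while t <= times[-1]:
        i = bisect_left(times, t)
        if times[i] == t:
            value = samples[i][1]  # an actual reading exists at this time
        else:
            (t0, v0), (t1, v1) = samples[i - 1], samples[i]
            value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        grid.append((t, value))
        t += step
    return grid


def find_anomalies(series, tolerance):
    """Return points whose value differs from the median by more than `tolerance`."""
    center = median(v for _, v in series)
    return [(t, v) for t, v in series if abs(v - center) > tolerance]


# Made-up pressure readings (second, value); the reading at second 1 is missing.
readings = [(0, 10.0), (2, 12.0), (3, 11.0), (4, 30.0), (5, 11.0)]

grid = interpolate(readings, step=1)
anomalies = find_anomalies(grid, tolerance=5.0)
print(grid[1])     # (1, 11.0), filled in by interpolation
print(anomalies)   # [(4, 30.0)]
```

Resampling first means every station contributes values at the same timestamps, so a downstream query can compare stations or time windows row by row without handling gaps.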