9. Source: IDC Digital Universe Study, sponsored by EMC, May 2011
10. Google’s Server Design. Source: CNET News, “Google Uncloaks Once-Secret Server,” April 2009
12. Source: McKinsey Global Institute, Big Data: The Next Frontier for Innovation, Competition and Productivity, May 2011
13. The Evolution of Science
Timeline (evolution of science over time), spanning the era of natural philosophy, the era of modern science, and the Industrial Revolution that marks the transition between them: Astronomy (Babylon, 1900 BC); Mathematics (India, 499 BC); Platonic Academy (387 BC); Scientific Revolution (1543 AD); Newton’s Laws (1687 AD); Relativity (1905 AD); Quantum Physics (1925 AD); Computing (1946 AD); DNA (1953 AD); Learning Systems (XXI century).
14. Today’s Systems – The Calculating Paradigm
Algorithms and applications; static programming; archives of structured data and text. People hypothesize, determine “what it means,” and run other applications.
15. Future Systems – The Learning Paradigm
- Training and learning engines to build models and define insight
- Hypothesis engines to understand and plan actions
- Policy engine: business, legal, and ethical rules
- Verification engines (e.g., simulations)
- Active learning through natural interfaces
- Outcome engine: actuation and validation
Inputs drawn from society, nature, institutions, and archives.
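To make the division of labor concrete, here is a minimal, hypothetical Python sketch of how these engines might compose into one decision loop. Every name, rule, and model below is invented for illustration; nothing here comes from a real IBM system.

```python
# Hypothetical composition of the learning-paradigm engines. The "model" is a
# trivial stand-in (a learned mean); real engines would be far richer.

def train(archive):
    """Training/learning engine: build a model from archived observations."""
    return sum(archive) / len(archive)

def hypothesize(model, observation):
    """Hypothesis engine: propose candidate actions with confidences."""
    return [("raise", abs(observation - model)), ("hold", 1.0)]

def policy_ok(action):
    """Policy engine: business, legal, and ethical rules as predicates."""
    return action != "forbidden"

def verify(action):
    """Verification engine: stand-in for simulating the action first."""
    return True

def decide(model, observation):
    """Outcome engine: actuate the most confident verified candidate."""
    candidates = [(a, c) for a, c in hypothesize(model, observation)
                  if policy_ok(a) and verify(a)]
    return max(candidates, key=lambda ac: ac[1])

model = train([1.0, 2.0, 3.0])
print(decide(model, observation=2.5))  # ('hold', 1.0)
```

In a full system the validated outcome would be fed back into training, closing the active-learning loop the slide emphasizes.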
16. New “Big Data” Brings New Opportunities, Requires New Analytics
Relative to traditional data warehousing and business intelligence: data scale up to 10,000 times larger, decision frequency up to 10,000 times faster, spanning data at rest through data in motion. (Chart axes: data scale from kilo- to exabytes; decision frequency from occasional, through frequent, to real time, i.e., years down to milliseconds.)

Workload | Ingest rate | Decision latency | Deep-analytics store
Telco promotions | 100,000 records/sec (6B/day) | 10 ms/decision | 270 TB
Smart traffic | 250K GPS probes/sec, 630K segments/sec | 2 ms/decision (4K vehicles) | n/a
Homeland security | 600,000 records/sec (50B/day) | 1-2 ms/decision | 320 TB
DeepQA | n/a | 3 sec/decision | 100s of GB
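As a quick sanity check, the per-second ingest rates above can be scaled to daily volumes; the quoted daily totals are consistent with sustained rates somewhat below a continuous 24x7 peak. A throwaway calculation, with rates taken from the table:

```python
# Scale the quoted per-second ingest rates to records per day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

for name, rate in [("Telco promotions", 100_000),
                   ("Homeland security", 600_000)]:
    per_day = rate * SECONDS_PER_DAY
    print(f"{name}: {rate:,} rec/s -> {per_day / 1e9:.1f}B rec/day at 24x7")
# Telco promotions: 100,000 rec/s -> 8.6B rec/day at 24x7 (slide quotes 6B)
# Homeland security: 600,000 rec/s -> 51.8B rec/day at 24x7 (slide quotes 50B)
```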
17. Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet
A Peta2 analytics appliance (reactive plus deep-analytics platform) and a big-analytics ecosystem, built on a Peta2 data-centric system, algorithms, and big-data skills, enabling solutions such as:
- DeepFriends – social network monitoring
- DeepResponse – emergency coordination
- DeepEyes – webcam fusion
- DeepCurrent – power delivery
- DeepSafety – police/security
- DeepTraffic – area traffic prediction
- DeepWater – water management
- DeepBasket – food market prediction
- DeepBreath – air quality control
- DeepPulse – political polling
- DeepThunder – local weather prediction
- DeepSoil – farm prediction
18. Watson Today: Evidence-Based Decision Support
Processes unstructured text at about 200 hypotheses per 3 seconds against a static data corpus, on roughly 3,000 cores, 100 TFLOPS, 2 TB of memory, and ~200 kW.
Pipeline: from a given “answer” clue (e.g., “A large country in the Western Hemisphere whose capital has a similar name”), hypotheses are generated as guess questions Q1, Q2, … Qi; a statistical ensemble of 600 to 800 scoring engines (S1, S2, S3, … SN) scores each one; ~30 machine learning models weigh the scores and produce a confidence 0 < P < 1 for each question; the hypothetical question with the greatest confidence is chosen (here: “What is Brazil?”).

Element | Refresh time
Data corpus | 2 weeks
Hypothesis engines | weeks to months
Scoring engines | weeks to months
Decision support engine | 4 days
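The ensemble step lends itself to a compact illustration. The sketch below, with made-up weights and scores, shows the general shape: each scoring engine rates a candidate, learned weights combine the scores into a confidence strictly between 0 and 1 (here via a logistic function, one plausible choice), and the highest-confidence candidate wins. This illustrates the technique, not Watson’s actual scoring code.

```python
import math

def confidence(scores, weights, bias=0.0):
    """Combine scorer outputs into a confidence with 0 < P < 1."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing

weights = [0.8, 1.2, 0.5]           # stand-ins for weights learned by ~30 models
candidates = {                      # per-engine scores S1..S3, invented
    "Brazil":    [0.9, 0.8, 0.7],
    "Argentina": [0.6, 0.4, 0.5],
}
ranked = {c: confidence(s, weights) for c, s in candidates.items()}
best = max(ranked, key=ranked.get)
print(best, round(ranked[best], 3))  # Brazil 0.884
```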
20. Exascale Research and Development. Source: Exascale Research and Development – Request for Information, July 2011
21. Big Data Systems Require a Data-centric Architecture for Performance
Old compute-centric model: data lives on disk and tape and is moved to the CPU as needed, through a deep storage hierarchy.
New data-centric model: data lives in persistent memory (flash, phase change) and many CPUs (manycore, FPGA) surround and use it, through a shallow, flat storage hierarchy with massive parallelism.
This is the largest change in system architecture since the System/360, with huge impact on hardware, systems software, and application design.
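One rough way to picture the shift, assuming a POSIX-style system: in the compute-centric model a program read()s bytes across to the CPU, while in the data-centric model computation is brought to data that stays resident, for which memory-mapping is a loose analogy. The file name and sizes below are invented for the demo.

```python
import mmap, os

# Create a small demo file standing in for a persistent data store.
with open("corpus.bin", "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of demo bytes

# Compute-centric: read() copies every byte over to the process.
with open("corpus.bin", "rb") as f:
    data = f.read()

# Data-centric analogy: map the bytes in place; many processes could share
# one resident copy instead of each pulling its own.
with open("corpus.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        sample = sum(m[i] for i in range(0, len(m), 4096))  # touch pages lazily

print(len(data), sample)
```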
22. Scale-in is the New Systems Battlefield
(Chart: system capacity/capability versus system density, i.e., 1/end-to-end latency, each running from single devices to device clusters and up against physical limits; exascale and Peta2 systems sit at the extreme.)
- Scale-down – maximize feature density: atom transistors, atom storage
- Scale-up – maximize device capacity: terabyte HDDs, POWER7
- Scale-out – maximize system capacity: scale-out NAS, blade servers, cloud computing
- Scale-in – maximize system density, minimize end-to-end latency: FLASH SSD, 3D chips, FPGAs, manycore, BPRAM/SCM, interconnect, in-memory databases, DAS
23. Storage Class Memory – The Tipping Point for Data-centric Systems
HDD keeps its cost advantage (about 1/10 the cost of SCM), but SCM dominates in performance, roughly 10,000x faster than HDD. Projected for SCM (phase change) in 2015: $0.05 per GB, i.e., $50K per PB (the chart also marks $0.10/GB and $0.01/GB price points).

Technology | Relative cost | Relative latency
DRAM | 100 | 1
SCM | 1 | 10
FLASH | 15 | 1,000
HDD | 0.1 | 100,000

Source: Chung Lam, IBM
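The relative numbers in the table reduce to the two headline claims. A two-line check, with values copied from the table:

```python
parts = {            # technology: (relative cost, relative latency)
    "DRAM":  (100, 1),
    "SCM":   (1, 10),
    "FLASH": (15, 1_000),
    "HDD":   (0.1, 100_000),
}
print(f"HDD is {parts['SCM'][0] / parts['HDD'][0]:.0f}x cheaper per byte than SCM")  # 10x
print(f"SCM is {parts['HDD'][1] / parts['SCM'][1]:,.0f}x faster than HDD")           # 10,000x
```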
24. Background: Three Styles of Massively Parallel Systems
- Streaming (Streams) – reactive analytics, extreme ingestion. Data in motion: high velocity, mixed variety, high volume over time. Languages: SPL, C, Java.
- Hadoop/MapReduce (BigInsights) – deep analytics, extreme scale-out. Data at rest (pre-partitioned): high volume, mixed variety, low velocity. Languages: JAQL, Java. (Diagram: mappers read input data on disk and feed reducers, which write output data; each box is a compute node.)
- Simulation (BlueGene) – generative modeling, extreme physics: long running, small input, massive output. Languages: C/C++, Fortran, MPI, OpenMP.
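For readers unfamiliar with the second style, here is a toy word count in the MapReduce shape the diagram describes: mappers emit key/value pairs, a shuffle groups values by key, and reducers aggregate each group. Real BigInsights jobs would be written in JAQL or Java; plain Python is used here only to show the data flow.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce: aggregate all counts observed for one key."""
    return word, sum(counts)

lines = ["big data big systems", "data at rest"]  # stand-in for data on disk

# Shuffle: group mapper output by key, as the framework would between phases.
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

print(dict(reducer(w, c) for w, c in groups.items()))
# {'big': 2, 'data': 2, 'systems': 1, 'at': 1, 'rest': 1}
```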
Picture 1: So let's start with the big numbers. It would not be an exaggeration to say that we've clearly entered the "Zettabyte Era". A zettabyte is a trillion gigabytes, or a billion terabytes, as you prefer. This year (2011) we're forecast to generate and consume 1.8 zettabytes of information as a society, up from an estimated 1.3 zettabytes in 2010, with 35 zettabytes forecast by the end of this decade. Indeed, the most fascinating statement comes from the subhead of the press release: the rate of information growth appears to be exceeding Moore's Law -- a powerful argument for scale-out architectures if there ever was one.

Picture 2: Go ahead, visualize this: take the total storage capacity you've got today and throw a 50x multiplier against it. Contemplate that for a moment. And remember, that's just an average; information-intensive businesses will likely see far more.

Picture 3: Or consider that all this information will be stored in 75x more "containers" (files, objects, etc.) than we are dealing with today.

Picture 4: Or that, by the end of the decade, we'll have 10x as many servers, both physical and virtual, to deal with. It makes a certain sense: more information, and ever more uses for it, means vastly more servers of all types putting that information to work.
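For what it's worth, the growth rate implied by those forecasts is easy to work out, and it lands in the neighborhood of a doubling roughly every two years:

```python
import math

start, end, years = 1.8, 35.0, 9  # ZB in 2011 -> forecast ZB by 2020
cagr = (end / start) ** (1 / years) - 1
doubling = math.log(2) / math.log(1 + cagr)
print(f"~{cagr:.0%} per year, doubling every ~{doubling:.1f} years")
# ~39% per year, doubling every ~2.1 years
```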
"The evolution of science"
1. Babylonian astronomy was probably the first real example of science in practice; the next step along this path was the invention of mathematics.
2. The Platonic Academy was the first codification of scientific principles.
3. Little happened during the Middle Ages, but the Renaissance brought a rebirth of inquiry that triggered the Scientific Revolution: Copernicus introduced a heliocentric world view, and the modern study of human anatomy was born.
4. Newton's laws marked the pinnacle of the era of natural philosophy.
5. The Industrial Revolution marked the birth of modern science, where science became broadly useful to humanity, and the transition from the era of natural philosophy to the era of modern science.
6. Modern science progressed at an exponential rate, with major advances coming quickly on the heels of one another.
7. The entire evolution of science can be viewed as an example of exponential growth.
3 CHANGES to EMPHASIZE: the dynamic nature of models, not static. (1) Active learning (hard); (2) dynamic engines (training, policy, hypothesis, outcome, verification); (3) natural interfaces.

How is our Learning System different from past machine learning approaches? Our Learning System will automatically identify key features. Feature selection is the technique of selecting a subset of relevant features for learning models; for example, key features to diagnose an illness may be a person's temperature, white blood cell count, pH level, etc. The current state of the art either has (A) humans identifying the key features for different domains, or (B) machine learning programs extracting key features based on expert rules (provided by humans) or statistical methods, which may lead to false conclusions in domains that involve semantic ambiguity. The Learning System we're building will use crowdsourcing techniques to automatically identify key features for a domain and will proactively ask humans for disambiguation, instead of waiting for humans to notice the model is erroneous (for example, models that rated questionable mortgages as AAA, or a program that deduces that Internet cookies are edible). A sketch of both ideas appears after this note.

In the same vein, another key difference in our Learning System is active, continuous verification. Current trends that provide increasing amounts of digital data (e.g., IBM Smarter Planet sensors) will enable our Learning System to modify itself and prune key features that are no longer relevant. In summary: (1) automatic extraction of key features, (2) continuous active self-verification, and (3) the ability to select the appropriate machine learning technique (statistical, genetic programming, neural networks, etc.) and modify it as conditions change. These three features have not been integrated into prior machine learning approaches.

On hypotheses: a hypothesis is necessarily about a problem that is not formalized (if the problem were formalized, no hypothesis would be required, only a formal solution). Without a formal problem, the task of formulating hypotheses becomes one of creating alternative problem representations and selecting among them, based in part on the possible solutions to each. Known systems that attempt this require a defined problem space, where the range of possible hypotheses is calculated from a range of possible system states. "Real world" problems do not emerge from a range of possible states, however; they occur when previously defined ranges (or dimensions) are violated. The only known systems capable of formulating hypotheses about arbitrary states and selecting among them are biological cognitive systems. An explanation of this is necessary before a system that "creates hypotheses" can be introduced, even as a hypothetical.
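As promised above, a minimal sketch of the first two differences, assuming scikit-learn is available: automatic selection of key features, followed by an active-learning step that routes low-confidence cases to a human instead of waiting for the model to be caught out. The data, the threshold, and the ask_human() hook are all invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic diagnosis-like data: 20 candidate features, 3 truly informative
# (think temperature, white cell count, pH among many irrelevant readings).
X, y = make_classification(n_samples=200, n_features=20, n_informative=3)

# (1) Automatic extraction of key features: keep the 3 most informative.
selector = SelectKBest(f_classif, k=3).fit(X, y)
X_key = selector.transform(X)

model = LogisticRegression().fit(X_key, y)

# (2) Active learning: proactively ask a human about cases the model finds
# ambiguous, rather than waiting for someone to notice it is wrong.
def ask_human(example):
    print("Please label this ambiguous case:", example)  # hypothetical hook
    return 0  # a real system would collect the answer and retrain

for example, p in zip(X_key, model.predict_proba(X_key)):
    if abs(p[1] - 0.5) < 0.05:  # nearly undecided -> disambiguate now
        ask_human(example)
```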