Petascale Analytics The World of Big Data Requires Big Analytics October 2011 H. J. Schick IBM Germany Research & Development  GmbH
Source:  The Evolution of Live in 60 Seconds
 
Source:  Realtime Apache Hadoop at Facebook
Quiz:  What comes after zettabyte? 1 yottabyte  = 1,000,000,000,000,000,000,000,000 bytes
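The quiz answer can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal (SI) prefixes where each step multiplies by 1,000; the names `SI_PREFIXES`, `bytes_in` and `prefix_after` are hypothetical helpers introduced only for illustration.

```python
# SI decimal prefixes in ascending order; each step multiplies by 1000.
# (Assumption: decimal units, not the binary KiB/MiB family.)
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def bytes_in(prefix: str) -> int:
    """Number of bytes in one <prefix>byte, e.g. 'kilo' -> 1000."""
    return 1000 ** (SI_PREFIXES.index(prefix) + 1)


def prefix_after(prefix: str) -> str:
    """Name of the next larger prefix in the list."""
    return SI_PREFIXES[SI_PREFIXES.index(prefix) + 1]


print(prefix_after("zetta"))          # next prefix after zetta
print(f"{bytes_in('yotta'):,}")       # bytes in one yottabyte, with separators
```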
Experiment:
Source: IDC Digital Universe Study, sponsored by EMC, May 2011
Google’s Server Design Source: cnet News, Google Uncloaks Once-Secret Server, April 2009
The Digital Universe is a Perpetual Tsunami
Source: McKinsey Global Institute, Big data: The next frontier for innovation, competition and productivity, May 2011
The Evolution of Science
Timeline (Evolution of Science over Time): Astronomy (Babylon, 1900 BC); Platonic Academy (387 BC); Mathematics (India, 499 AD); Scientific Revolution (1543 AD); Newton’s Laws (1687 AD); Relativity (1905 AD); Quantum Physics (1925 AD); Computing (1946 AD); DNA (1953 AD); Learning Systems (XXI Century)
Eras: Era of Natural Philosophy; Era of Modern Science; Industrial Revolution
Today’s Systems – The Calculating Paradigm
Algorithms and Applications: Static programming
Archives: Structured Data and Text
People: Hypothesize, Determine “what it means”, Run other applications…
Future Systems – The Learning Paradigm
Training and Learning Engines: To Build Models and Define Insight
Hypothesis Engines: To Understand and Plan Actions
Policy Engine: Business, Legal and Ethical Rules
Verification Engines (e.g. Simulations)
Active Learning (Natural Interfaces)
Outcome Engine: Actuation and Validation
Also shown: Society, Nature, Institutions, Archives
New “Big Data” Brings New Opportunities, Requires New Analytics
Up to 10,000 times larger; up to 10,000 times faster
Baseline shown: Traditional Data Warehouse and Business Intelligence
Axes: Data Scale (Kilo, Mega, Giga, Tera, Peta, Exa) against Decision Frequency (yr, mo, wk, day, hr, min, sec … ms, µs; Occasional, Frequent, Real-time); Data in Motion and Data at Rest
Telco Promotions: 100,000 records/sec, 6B/day; 10 ms/decision; 270 TB for Deep Analytics
Smart Traffic: 250K GPS probes/sec; 630K segments/sec; 2 ms/decision, 4K vehicles
Homeland Security: 600,000 records/sec, 50B/day; 1-2 ms/decision; 320 TB for Deep Analytics
DeepQA: 100s GB for Deep Analytics; 3 sec/decision
Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet
Peta 2 Analytics Appliance; Reactive + Deep Analytics Platform; Big Analytics Ecosystem; Peta 2 Data-centric System; Algorithms; Big Data Skills
DeepFriends: Social Network Monitor
DeepResponse: Emergency Coordination
DeepEyes: Webcam Fusion
DeepCurrent: Power Delivery
DeepSafety: Police/Security
DeepTraffic: Area Traffic Prediction
DeepWater: Water management
DeepBasket: Food Market Prediction
DeepBreath: Air Quality Control
DeepPulse: Political Polling
DeepThunder: Local Weather Prediction
DeepSoil: Farm Prediction
Evidence-Based Decision Support System
Answer: A large country in the Western Hemisphere whose capital has a similar name.
Hypothesis Generated from “Answer”: Guess Questions Q1, Q2 … Qi
Statistical Ensemble of 600 to 800 Scoring Engines (S1, S2, S3 … SN)
~30 Machine Learning Models Weigh Scores, Produce Confidence for Each Question (0<P<1)
Hypothetical Question With Greatest Confidence is Chosen
Question: What is Brazil?
Watson Today: Processes Unstructured Text & 200 Hypotheses/3 seconds
Watson: 3,000 cores; 100 TFlops; 2 TB memory; ~200 kW; Static Data Corpus
Element                   Refresh Time
Data Corpus               2 Weeks
Hypothesis Engines        Weeks to Months
Scoring Engines           Weeks to Months
Decision Support Engine   4 Days
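The slide describes scoring engines whose outputs are weighed into a confidence 0<P<1 per candidate question, with the most confident candidate chosen. The sketch below illustrates that idea only; it is not Watson's actual method. The logistic combination, the weights, the three-score vectors and the second candidate are all assumptions made up for illustration.

```python
import math


def confidence(scores, weights):
    """Combine per-engine scores into a confidence strictly between 0 and 1.

    Assumption: a weighted sum squashed by the logistic function; the slide
    only says that models "weigh scores" and "produce confidence".
    """
    z = sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


def choose_question(candidates, weights):
    """Return the candidate question with the greatest confidence."""
    scored = {q: confidence(s, weights) for q, s in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical scores from three scoring engines (the slide mentions 600 to 800).
candidates = {
    "What is Brazil?": [0.9, 0.8, 0.7],
    "What is Bolivia?": [0.2, 0.1, 0.4],
}
weights = [1.0, 1.0, 1.0]  # hypothetical, untrained weights

question, p = choose_question(candidates, weights)
print(question, round(p, 3))
```

In a real system the weights would be learned from training data rather than fixed by hand.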
Exascale Research and Development Source: Exascale Research and Development – Request for Information, July 2011
Big Data Systems Require a Data-centric Architecture for Performance
Old Compute-centric Model: Data lives on disk and tape; Move data to CPU as needed; Deep Storage Hierarchy
New Data-centric Model: Data lives in persistent memory; Many CPUs surround and use; Shallow/Flat Storage Hierarchy
Diagram labels: Massive Parallelism; Persistent Memory; Flash; Phase Change; Manycore; FPGA; input; output
Largest change in system architecture since the System/360
Huge impact on hardware, systems software, and application design
Scale-in is the New Systems Battlefield
Scale-down: Maximize feature density (Atom Transistor, Atom Storage)
Scale-up: Maximize device capacity (Terabyte HDD, POWER 7)
Scale-out: Maximize system capacity (NAS, Blade Server)
Scale-in: Maximize system density, Minimize end-to-end latency (FLASH SSD, 3D Chips, FPGA, Manycore, BPRAM/SCM, Interconnect, In-mem DB, DAS)
Axes: System Density (1/Latency end-to-end): Low, Med, High, Extreme, Physical Limits; System Capacity (capability): Single Device to Device Clusters (10, 100, 1K, 10K, 100K)
Also shown: Exascale; Peta 2; Cloud Computing
Storage Class Memory - The Tipping Point for Data-centric Systems
HDD cost advantage continues, 1/10 SCM cost, but SCM dominates in performance, 10,000x faster than HDD
Source: Chung Lam, IBM
FLASH (Phase Change) SCM in 2015: $0.05 per GB, $50K per PB
Chart labels: $0.10 / GB; $0.01 / GB
         Relative Cost   Relative Latency
DRAM     100             1
SCM      1               10
FLASH    15              1000
HDD      0.1             100000
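The headline claims on this slide follow directly from its own table. The sketch below recomputes them as a sanity check; it assumes decimal units (1 PB = 1,000,000 GB), and the variable names are illustrative only.

```python
# Relative figures copied from the slide's table.
relative_cost = {"DRAM": 100, "SCM": 1, "FLASH": 15, "HDD": 0.1}
relative_latency = {"DRAM": 1, "SCM": 10, "FLASH": 1000, "HDD": 100000}

# "SCM ... 10,000x faster than HDD": ratio of the two relative latencies.
scm_speedup_over_hdd = relative_latency["HDD"] // relative_latency["SCM"]

# "HDD ... 1/10 SCM cost": ratio of the two relative costs.
hdd_cost_vs_scm = relative_cost["HDD"] / relative_cost["SCM"]

# "$0.05 per GB" expressed per petabyte. Work in cents to avoid float rounding.
GB_PER_PB = 10**6          # assumption: decimal petabyte
cents_per_gb = 5
dollars_per_pb = cents_per_gb * GB_PER_PB // 100

print(scm_speedup_over_hdd)   # latency ratio HDD / SCM
print(hdd_cost_vs_scm)        # cost ratio HDD / SCM
print(dollars_per_pb)         # dollars per PB at $0.05 per GB
```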
Background: 3 Styles of Massively Parallel Systems
Simulation (BlueGene): Generative Modeling, Extreme Physics; Long Running, Small Input, Massive Output; C/C++, Fortran, MPI, OpenMP
Streaming (Streams): Reactive Analytics, Extreme Ingestion; Data in Motion: High Velocity, Mixed Variety, High Volume* (*over time); SPL, C, Java
Hadoop/MapReduce (BigInsights): Deep Analytics, Extreme Scale-out; Data at Rest*: High Volume, Mixed Variety, Low Velocity (*pre-partitioned); JAQL, Java; Input Data (on disk), Mappers, Reducers, Output Data
Legend: = compute node
Fault-tolerant Hadoop Distributed File System (HDFS) Source: Hadoop Overview, http://www.cloudera.com
Map Reduce Logical Data Flow
Source: O’Reilly, Hadoop – The Definitive Guide
Step 1, raw input lines (weather records):
0067011990999991950051507004 ... 9999999N9+00001+99999999999 ...
0043011990999991950051512004 ... 9999999N9+00221+99999999999 ...
0043011990999991950051518004 ... 9999999N9-00111+99999999999 ...
0043012650999991949032412004 ... 0500001N9+01111+99999999999 ...
0043012650999991949032418004 ... 0500001N9+00781+99999999999 ...
Step 2, key-value pairs presented to the map function (key: line offset in the file; value: the line, with year and temperature fields set apart):
(0, 006701199099999 1950 051507004...9999999N9+ 0000 1+99999999999...)
(106, 004301199099999 1950 051512004...9999999N9+ 0022 1+99999999999...)
(212, 004301199099999 1950 051518004...9999999N9- 0011 1+99999999999...)
(318, 004301265099999 1949 032412004...0500001N9+ 0111 1+99999999999...)
(424, 004301265099999 1949 032418004...0500001N9+ 0078 1+99999999999...)
Step 3, map output as (year, temperature) pairs:
(1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)
Step 4, pairs grouped and sorted by key before reaching the reduce function:
(1949, [111, 78]) (1950, [0, 22, −11])
Step 5, reduce output, the maximum temperature per year:
(1949, 111) (1950, 22)
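The last three stages of the data flow above can be reproduced in a few lines of plain Python. This is a single-process sketch rather than Hadoop code: it starts from the (year, temperature) pairs emitted by the map step, because the raw records on the slides are elided and cannot be parsed as shown. The function names `shuffle` and `reduce_max` are hypothetical.

```python
from collections import defaultdict

# (year, temperature) pairs as emitted by the map step on the slides.
map_output = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]


def shuffle(pairs):
    """Group values by key and sort by key, mimicking what the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_max(year, temperatures):
    """Reduce step: keep the maximum temperature recorded for a year."""
    return year, max(temperatures)


grouped = shuffle(map_output)
result = [reduce_max(year, temps) for year, temps in grouped]

print(grouped)  # [(1949, [111, 78]), (1950, [0, 22, -11])]
print(result)   # [(1949, 111), (1950, 22)]
```

In a real cluster the grouping runs across many nodes and each reducer sees only a subset of the keys; the per-key logic is unchanged.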
The Blue Gene/Q ASIC Source:  EDN News, Hot Chips: The Puzzle of Many Cores
The Blue Gene/Q Packaging Hierarchy Source: The Register, IBM’s Blue Gene/Q Super Chip Grows 18th Core
Opportunity: Blue Gene Active Storage
+ 512 BGQ Flash Cards
Blue Gene/Q Active Storage Rack … scale it like BG/Q.
Flash Capacity: 320 GB; I/O Bandwidth: 1.5 GB/s; IOPS: 207,000
Nodes: 512; Storage Cap: 640 TB; I/O Bandwidth: 768 GB/s; Random IOPS: 100 Million; Compute: 104 TF; Bi-Section BW: 512 GB/s
NAND Flash Challenges
Gartner’s Hype Cycle
Thank you very much for your attention.

Más contenido relacionado

La actualidad más candente

High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...Larry Smarr
 
Early Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic ComputingEarly Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic ComputingDESMOND YUEN
 
Nikravesh big datafeb2013bt
Nikravesh big datafeb2013btNikravesh big datafeb2013bt
Nikravesh big datafeb2013btMasoud Nikravesh
 
Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Slide_N
 
Larry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale ComputingLarry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale ComputingDiamond Exchange
 
The OptIPuter as a Prototype for CalREN-XD
The OptIPuter as a Prototype for CalREN-XDThe OptIPuter as a Prototype for CalREN-XD
The OptIPuter as a Prototype for CalREN-XDLarry Smarr
 
04 New opportunities in photon science with high-speed X-ray imaging detecto...
04 New opportunities in photon science with high-speed X-ray imaging  detecto...04 New opportunities in photon science with high-speed X-ray imaging  detecto...
04 New opportunities in photon science with high-speed X-ray imaging detecto...RCCSRENKEI
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensOscar Law
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004xlight
 
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...Larry Smarr
 
Flow-centric Computing - A Datacenter Architecture in the Post Moore Era
Flow-centric Computing - A Datacenter Architecture in the Post Moore EraFlow-centric Computing - A Datacenter Architecture in the Post Moore Era
Flow-centric Computing - A Datacenter Architecture in the Post Moore EraRyousei Takano
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionNVIDIA Taiwan
 
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from IntelEdge AI and Vision Alliance
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Robert Grossman
 
Characterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksCharacterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksJason Riedy
 
Using Photonics to Prototype the Research Campus Infrastructure of the Future...
Using Photonics to Prototype the Research Campus Infrastructure of the Future...Using Photonics to Prototype the Research Campus Infrastructure of the Future...
Using Photonics to Prototype the Research Campus Infrastructure of the Future...Larry Smarr
 
10 Abundant-Data Computing
10 Abundant-Data Computing10 Abundant-Data Computing
10 Abundant-Data ComputingRCCSRENKEI
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 

La actualidad más candente (20)

High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
High Performance Cyberinfrastructure Enabling Data-Driven Science in the Biom...
 
Early Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic ComputingEarly Benchmarking Results for Neuromorphic Computing
Early Benchmarking Results for Neuromorphic Computing
 
Nikravesh big datafeb2013bt
Nikravesh big datafeb2013btNikravesh big datafeb2013bt
Nikravesh big datafeb2013bt
 
Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!
 
Larry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale ComputingLarry Smarr - Making Sense of Information Through Planetary Scale Computing
Larry Smarr - Making Sense of Information Through Planetary Scale Computing
 
The OptIPuter as a Prototype for CalREN-XD
The OptIPuter as a Prototype for CalREN-XDThe OptIPuter as a Prototype for CalREN-XD
The OptIPuter as a Prototype for CalREN-XD
 
04 New opportunities in photon science with high-speed X-ray imaging detecto...
04 New opportunities in photon science with high-speed X-ray imaging  detecto...04 New opportunities in photon science with high-speed X-ray imaging  detecto...
04 New opportunities in photon science with high-speed X-ray imaging detecto...
 
Machine Learning with New Hardware Challegens
Machine Learning with New Hardware ChallegensMachine Learning with New Hardware Challegens
Machine Learning with New Hardware Challegens
 
Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004Gfarm Fs Tatebe Tip2004
Gfarm Fs Tatebe Tip2004
 
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
A Campus-Scale High Performance Cyberinfrastructure is Required for Data-Int...
 
Flow-centric Computing - A Datacenter Architecture in the Post Moore Era
Flow-centric Computing - A Datacenter Architecture in the Post Moore EraFlow-centric Computing - A Datacenter Architecture in the Post Moore Era
Flow-centric Computing - A Datacenter Architecture in the Post Moore Era
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
 
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
"Accelerating Deep Learning Using Altera FPGAs," a Presentation from Intel
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
Characterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with MicrobenchmarksCharacterization of Emu Chick with Microbenchmarks
Characterization of Emu Chick with Microbenchmarks
 
Using Photonics to Prototype the Research Campus Infrastructure of the Future...
Using Photonics to Prototype the Research Campus Infrastructure of the Future...Using Photonics to Prototype the Research Campus Infrastructure of the Future...
Using Photonics to Prototype the Research Campus Infrastructure of the Future...
 
10 Abundant-Data Computing
10 Abundant-Data Computing10 Abundant-Data Computing
10 Abundant-Data Computing
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
Advances in GPU Computing
Advances in GPU ComputingAdvances in GPU Computing
Advances in GPU Computing
 

Similar a Petascale Analytics - The World of Big Data Requires Big Analytics

Cloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovaciónCloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovaciónFundación Ramón Areces
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowskaguest43b4df3
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World LazowskaWCET
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneIan Foster
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowJan Wiegelmann
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World FosterIan Foster
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Masoud Nikravesh
 
Cluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesCluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesGuy Coates
 
Watson christofer j_180208
Watson christofer j_180208Watson christofer j_180208
Watson christofer j_180208IBM Sverige
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure ExecutionDatabricks
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
IBM Power Systems: Designed for Data
IBM Power Systems: Designed for DataIBM Power Systems: Designed for Data
IBM Power Systems: Designed for DataIBM Power Systems
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentreSteve Loughran
 

Similar a Petascale Analytics - The World of Big Data Requires Big Analytics (20)

Cloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovaciónCloud Computing y Big Data, próxima frontera de la innovación
Cloud Computing y Big Data, próxima frontera de la innovación
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
 
E Science As A Lens On The World Lazowska
E Science As A Lens On The World   LazowskaE Science As A Lens On The World   Lazowska
E Science As A Lens On The World Lazowska
 
Big Data and OSS at IBM
Big Data and OSS at IBMBig Data and OSS at IBM
Big Data and OSS at IBM
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlow
 
Presentation_Final
Presentation_FinalPresentation_Final
Presentation_Final
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
Agents In An Exponential World Foster
Agents In An Exponential World FosterAgents In An Exponential World Foster
Agents In An Exponential World Foster
 
Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012Nikravesh australia long_versionkeynote2012
Nikravesh australia long_versionkeynote2012
 
Cluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomesCluster Filesystems and the next 1000 human genomes
Cluster Filesystems and the next 1000 human genomes
 
Watson christofer j_180208
Watson christofer j_180208Watson christofer j_180208
Watson christofer j_180208
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
IBM Power Systems: Designed for Data
IBM Power Systems: Designed for DataIBM Power Systems: Designed for Data
IBM Power Systems: Designed for Data
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
My other computer_is_a_datacentre
My other computer_is_a_datacentreMy other computer_is_a_datacentre
My other computer_is_a_datacentre
 

Más de Heiko Joerg Schick

Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)Heiko Joerg Schick
 
Huawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technologyHuawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technologyHeiko Joerg Schick
 
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...Heiko Joerg Schick
 
The Smarter Car for Autonomous Driving
 The Smarter Car for Autonomous Driving The Smarter Car for Autonomous Driving
The Smarter Car for Autonomous DrivingHeiko Joerg Schick
 
From edge computing to in-car computing
From edge computing to in-car computingFrom edge computing to in-car computing
From edge computing to in-car computingHeiko Joerg Schick
 
Need and value for various levels of autonomous driving
Need and value for various levels of autonomous drivingNeed and value for various levels of autonomous driving
Need and value for various levels of autonomous drivingHeiko Joerg Schick
 
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSRun-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSHeiko Joerg Schick
 
Browser and Management App for Google's Person Finder
Browser and Management App for Google's Person FinderBrowser and Management App for Google's Person Finder
Browser and Management App for Google's Person FinderHeiko Joerg Schick
 
IBM Corporate Service Corps - Helping Create Interactive Flood Maps
IBM Corporate Service Corps - Helping Create Interactive Flood MapsIBM Corporate Service Corps - Helping Create Interactive Flood Maps
IBM Corporate Service Corps - Helping Create Interactive Flood MapsHeiko Joerg Schick
 
Real time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the PhilippinesReal time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the PhilippinesHeiko Joerg Schick
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressHeiko Joerg Schick
 
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)Heiko Joerg Schick
 
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)Heiko Joerg Schick
 

Más de Heiko Joerg Schick (17)

Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)
 
Huawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technologyHuawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technology
 
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
 
The Smarter Car for Autonomous Driving
 The Smarter Car for Autonomous Driving The Smarter Car for Autonomous Driving
The Smarter Car for Autonomous Driving
 
From edge computing to in-car computing
From edge computing to in-car computingFrom edge computing to in-car computing
From edge computing to in-car computing
 
Need and value for various levels of autonomous driving
Need and value for various levels of autonomous drivingNeed and value for various levels of autonomous driving
Need and value for various levels of autonomous driving
 
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSRun-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
 
Blue Gene Active Storage
Blue Gene Active StorageBlue Gene Active Storage
Blue Gene Active Storage
 
Browser and Management App for Google's Person Finder
Browser and Management App for Google's Person FinderBrowser and Management App for Google's Person Finder
Browser and Management App for Google's Person Finder
 
IBM Corporate Service Corps - Helping Create Interactive Flood Maps
IBM Corporate Service Corps - Helping Create Interactive Flood MapsIBM Corporate Service Corps - Helping Create Interactive Flood Maps
IBM Corporate Service Corps - Helping Create Interactive Flood Maps
 
Real time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the PhilippinesReal time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the Philippines
 
Slimline Open Firmware
Slimline Open FirmwareSlimline Open Firmware
Slimline Open Firmware
 
Agnostic Device Drivers
Agnostic Device DriversAgnostic Device Drivers
Agnostic Device Drivers
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
directCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI ExpressdirectCell - Cell/B.E. tightly coupled via PCI Express
directCell - Cell/B.E. tightly coupled via PCI Express
 
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
 
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
 

Petascale Analytics - The World of Big Data Requires Big Analytics

  • 1. Petascale Analytics The World of Big Data Requires Big Analytics October 2011 H. J. Schick IBM Germany Research & Development GmbH
  • 2. Source: The Evolution of Live in 60 Seconds
  • 3.  
  • 4. Source: Realtime Apache Hadoop at Facebook
  • 5.
  • 6.
  • 7. Quiz: What comes after zettabyte? 1 yottabyte = 1,000,000,000,000,000,000,000,000 bytes
  • 9. Source: IDC Digital Universe Study , sponsored by EMC, May 2011
  • 10. Google’s Server Design Source: cnet News, Google Uncloaks Once-Secret Server , April 2009
  • 11.
  • 12. Source: McKinsey Global Institute, Big data: The next frontier for innovation, competition and productivity , May 2011
  • 13. The Evolution of Science Learning Systems (XXI Century) Era of Natural Philosophy Era of Modern Science Industrial Revolution Astronomy (Babylon, 1900 BC) Platonic Academy (387 BC) Mathematics (India, 499 BC) Scientific Revolution (1543 AD) Newton’s Laws (1687 AD) Relativity (1905 AD) Quantum Physics (1925 AD) Computing (1946 AD) DNA (1953 AD) Evolution of Science Time
  • 14. Today’s Systems – The Calculating Paradigm Algorithms and Applications Static programming Archives Structured Data and Text The Calculating Paradigm People Hypothesize, Determine “what it means”, Run other applications…
  • 15. Future Systems – The Learning Paradigm Training and Learning Engines To Build Models and Define Insight Hypothesis Engines To Understand and Plan Actions Policy Engine Business, Legal and Ethical Rules Verification Engines (e.g. Simulations) Active Learning (Natural Interfaces) Outcome Engine Actuation and Validation Society Nature Institutions Archives
  • 16. New “Big Data” Brings New Opportunities, Requires New Analytics Up to 10,000 times larger Up to 10,000 times faster Traditional Data Warehouse and Business Intelligence Data Scale Decision Frequency Data in Motion Data at Rest Telco Promotions 100,000 records/sec, 6B/day 10 ms/decision 270TB for Deep Analytics Smart Traffic 250K GPS probes/sec 630K segments/sec 2 ms/decision, 4K vehicles Homeland Security 600,000 records/sec, 50B/day 1-2 ms/decision 320TB for Deep Analytics yr mo wk day hr min sec … ms µs Exa Peta Tera Giga Mega Kilo Occasional Frequent Real-time DeepQA 100s GB for Deep Analytics 3 sec/decision
  • 17. Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet Peta 2 Analytics Appliance + + Reactive + Deep Analytics Platform Big Analytics Ecosystem Peta 2 Data-centric System Algorithms Big Data Skills DeepFriends Social Network Monitor DeepResponse Emergency Coordination DeepEyes Webcam Fusion DeepCurrent Power Delivery DeepSafety Police/Security DeepTraffic Area Traffic Prediction DeepWater Water management DeepBasket Food Market Prediction DeepBreath Air Quality Control DeepPulse Political Polling DeepThunder Local Weather Prediction DeepSoil Farm Prediction
  • 18. Statistical Ensemble of 600 to 800 Scoring Engines ~30 Machine Learning Models Weigh Scores, Produce Confidence for Each Question 0<P<1 Hypothetical Question With Greatest Confidence is Chosen Evidence-Based Decision Support System S1 S2 S3 SN . . . Answer: A large country in the Western Hemisphere whose capital has a similar name. Hypothesis Generated from “Answer” Guess Questions Q1, Q2 … Qi Question: What is Brazil? Watson Today: Processes Unstructured Text & 200 Hypothesis/3 seconds Watson 3,000 cores;100 TFlops 2 TB memory ~ 200 KW Static Data Corpus Element Refresh Time Data Corpus 2 Weeks Hypothesis Engines Weeks to Months Scoring Engines Weeks to Months Decision Support Engine 4 Days
  • 19.
  • 20. Exascale Research and Development Source: Exascale Research and Development – Request for Information , July 2011
  • 21. Big Data Systems Require a Data-centric Architecture for Performance Data lives on disk and tape Move data to CPU as needed Deep Storage Hierarchy Data lives in persistent memory Many CPUs surround and use Shallow/Flat Storage Hierarchy Old Compute-centric Model New Data-centric Model Massive Parallelism Persistent Memory Largest change in system architecture since the System/360 Huge impact on hardware, systems software, and application design Flash Phase Change Manycore FPGA input output
  • 22. Scale-in is the New Systems Battlefield Scale-down Scale-up Scale-in Exascale Peta 2 Low Med High Extreme System Density (1/Latency end-to-end ) Device Clusters Single Device Low Med High Physical Limits Scale-out NAS Blade Server Scale-out Maximize system capacity FLASH SSD 3D Chips FPGA Manycore BPRAM/SCM Interconnect In-mem DB DAS Scale-in Maximize system density Minimize end-to-end latency System Capacity (capability) Single Device Device Clusters 100K 10K 1K 100 10 High Med Low Terabyte HDD POWER 7 Scale-up Maximize device capacity Atom Transistor Atom Storage Scale-down Maximize feature density Cloud Computing
  • 23. Storage Class Memory - The Tipping Point for Data-centric Systems HDD cost advantage continues, 1/10 SCM cost, but SCM dominates in performance, 10,000x faster than HDD Source: Chung Lam, IBM FLASH (Phase Change) SCM in 2015 $0.05 per GB $50K per PB $0.10 / GB $0.01 / GB Relative Cost / Relative Latency: DRAM 100 / 1; SCM 1 / 10; FLASH 15 / 1000; HDD 0.1 / 100000
  • 24. Background: 3 Styles of Massively Parallel Systems Data in Motion: High Velocity Mixed Variety High Volume* (*over time) SPL, C, Java Data at Rest*: High Volume Mixed Variety Low Velocity Deep Analytics Extreme Scale-out (*pre-partitioned) Simulation (BlueGene) Generative Modeling Extreme Physics C/C++, Fortran, MPI, OpenMP Reactive Analytics Extreme Ingestion Streaming (Streams) Long Running Small Input Massive Output Hadoop/MapReduce (BigInsights) JAQL, Java Reducers Mappers Input Data (on disk) Output Data = compute node
  • 25.
  • 26. Fault-tolerant Hadoop Distributed File System (HDFS) Source: Hadoop Overview , http://www.cloudera.com
  • 27. Map Reduce Logical Data Flow Source: O’Reilly, Hadoop – The Definitive Guide 0067011990999991950051507004 ... 9999999N9+00001+99999999999 ... 0043011990999991950051512004 ... 9999999N9+00221+99999999999 ... 0043011990999991950051518004 ... 9999999N9-00111+99999999999 ... 0043012650999991949032412004 ... 0500001N9+01111+99999999999 ... 0043012650999991949032418004 ... 0500001N9+00781+99999999999 ... 1
  • 28. Map Reduce Logical Data Flow Source: O’Reilly, Hadoop – The Definitive Guide (0, 006701199099999 1950 051507004...9999999N9+ 0000 1+99999999999...) (106, 004301199099999 1950 051512004...9999999N9+ 0022 1+99999999999...) (212, 004301199099999 1950 051518004...9999999N9- 0011 1+99999999999...) (318, 004301265099999 1949 032412004...0500001N9+ 0111 1+99999999999...) (424, 004301265099999 1949 032418004...0500001N9+ 0078 1+99999999999...) 2
  • 29. Map Reduce Logical Data Flow Source: O’Reilly, Hadoop – The Definitive Guide (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78) 3
  • 30. Map Reduce Logical Data Flow Source: O’Reilly, Hadoop – The Definitive Guide (1949, [111, 78]) (1950, [0, 22, −11]) 4
  • 31. Map Reduce Logical Data Flow Source: O’Reilly, Hadoop – The Definitive Guide (1949, 111) (1950, 22) 5
  • 32. The Blue Gene/Q ASIC Source: EDN News, Hot Chips: The Puzzle of Many Cores
  • 33. The Blue Gene/Q Packaging Hierarchy Source: The Register, IBM’s Blue Gene/Q Super Chip Grows 18th Core
  • 34.
  • 35.
  • 37. Thank you very much for your attention.
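
The Map Reduce logical data flow on slides 27 to 31 can be traced in a few lines of plain Python. This is only a minimal in-memory sketch, not Hadoop code: the names `RECORDS`, `map_record`, `shuffle` and `reduce_max` are made up for illustration, and because the slides elide most of each fixed-width NCDC line, the sketch starts from the highlighted fields of slide 28 (offset, year, signed temperature) rather than parsing the raw records.

```python
from collections import defaultdict

# Highlighted fields from slide 28: (line offset, year, signed temperature).
# Assumption: the full fixed-width NCDC lines are elided on the slides,
# so only these already-extracted fields are used here.
RECORDS = [
    (0, "1950", "+0000"),
    (106, "1950", "+0022"),
    (212, "1950", "-0011"),
    (318, "1949", "+0111"),
    (424, "1949", "+0078"),
]


def map_record(offset, year, temperature):
    """Map step: ignore the offset key and emit one (year, temperature) pair."""
    yield int(year), int(temperature)


def shuffle(pairs):
    """Shuffle step: group all values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_max(year, temperatures):
    """Reduce step: keep the highest temperature seen for a year."""
    return year, max(temperatures)


mapped = [pair for record in RECORDS for pair in map_record(*record)]
grouped = shuffle(mapped)
result = [reduce_max(year, temps) for year, temps in grouped]

print(mapped)   # [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]
print(grouped)  # [(1949, [111, 78]), (1950, [0, 22, -11])]
print(result)   # [(1949, 111), (1950, 22)]
```

The three printed lists correspond to the values shown on slides 29, 30 and 31. In a real Hadoop job the map and reduce functions would run on different nodes and the framework would perform the grouping; here everything runs in one process to keep the data flow visible.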

Editor's notes

  1. Picture 1: So let's start with the big numbers. It would not be an exaggeration to say that we've clearly entered the "Zettabyte Era". A zettabyte is a trillion gigabytes, or a billion terabytes, as you prefer. This year (2011) we're forecast to generate and consume 1.8 zettabytes of information as a society. That's up from an estimated 1.3 zettabytes in 2010, with a forecast 35 zettabytes by the end of this decade. Indeed, the most fascinating statement comes from the subhead of the press release: the rate of information growth appears to be exceeding Moore's Law, a powerful argument for scale-out architectures if there ever was one. Picture 2: Go ahead, you need to visualize this. Take the total storage capacity you've got today, and throw a multiplier of 50x against it. Contemplate that, just for a moment. And remember, that's just an average; information-intensive businesses will likely see far more. Picture 3: Or consider that all this wonderful information will be stored in 75x more "containers" (files, objects, etc.) than we are dealing with today. Picture 4: Or that, by the end of the decade, we'll have 10x as many servers to deal with, both physical and virtual. It makes a certain sense: more information and ever-more uses mean vastly more servers of all types sloshing around than before, putting all that information to work.
  4. "The evolution of science" 1. Babylonian astronomy was probably the first real example of science in practice. The next step along this path was the invention of mathematics. 2. The Platonic Academy was the first codification of scientific principles. 3. Not much happened during the Middle Ages, but there was a rebirth after the Renaissance, which triggered the scientific revolution. Copernicus introduced a heliocentric world view and human anatomy was born. 4. Newton's laws marked the pinnacle of the era of natural philosophy. 5. The industrial revolution marked the birth of modern science (where science became broadly useful to humanity) and the transition between the era of natural philosophy and the era of modern science. 6. Modern science progressed at an exponential rate, with major advances following quickly on one another. 7. This entire evolution of science can be viewed as an example of exponential growth.
  5. 3 changes to emphasize (emphasis on the dynamic nature of models, not static): active learning (hard); dynamic engines (Training, Policy, Hypothesis, Outcome, Verification); natural interfaces. How is our Learning System different from past Machine Learning approaches? Our Learning System will automatically identify key features. Key feature selection is the technique of selecting a subset of relevant features for learning models. For example, key features to diagnose an illness may be a person's temperature, white blood cell count, pH level, etc. In the current state of the art, either (A) humans identify the key features for different domains, or (B) machine learning programs extract key features based on expert rules (provided by humans) or statistical methods, which may lead to false conclusions in domains that involve semantic ambiguity. The Learning System we're building will use crowdsourcing techniques to automatically identify key features for a domain and will proactively ask humans for disambiguation, instead of waiting for humans to notice that the model is erroneous (for example, models that rated questionable mortgages as AAA, or a software program that deduces that Internet cookies are edible). In this vein, another key difference in our Learning System is active continuous verification. The current trends that provide increasing amounts of digital data (e.g. IBM Smarter Planet sensors) will enable our Learning System to modify itself to prune key features that are no longer relevant. In summary: (1) automatic extraction of key features, (2) continuous active self-verification, and (3) the ability to select the appropriate Machine Learning technique (statistical, genetic programming, neural networks, etc.) and adapt these techniques to changing conditions. These three features have not been integrated into prior Machine Learning approaches.
A hypothesis is necessarily about a problem that is not formalized (if the problem were formalized, then no hypothesis would be required, only a formal solution). Without a formal problem, the task of formulating hypotheses becomes one of creating alternative problem representations and selecting among them, in part based on possible solutions to each. Known systems that attempt to do this require a defined problem space, where the range of possible hypotheses is calculated from a range of possible system states. “Real world” problems do not emerge from a range of possible states, however, but instead occur when previously defined ranges (or dimensions) are violated. The only known systems capable of formulating hypotheses about arbitrary states and selecting among them are biological cognitive systems. Explanation of this is necessary before a system that "Creates Hypotheses" can be introduced, even as a hypothetical.
  6. Raw NCDC input lines: 0067011990999991950051507004 ... 9999999N9+00001+99999999999 ... 0043011990999991950051512004 ... 9999999N9+00221+99999999999 ... 0043011990999991950051518004 ... 9999999N9-00111+99999999999 ... 0043012650999991949032412004 ... 0500001N9+01111+99999999999 ... 0043012650999991949032418004 ... 0500001N9+00781+99999999999 ... Map input as (offset, line) pairs: (0, 006701199099999 1950 051507004...9999999N9+ 0000 1+99999999999...) (106, 004301199099999 1950 051512004...9999999N9+ 0022 1+99999999999...) (212, 004301199099999 1950 051518004...9999999N9- 0011 1+99999999999...) (318, 004301265099999 1949 032412004...0500001N9+ 0111 1+99999999999...) (424, 004301265099999 1949 032418004...0500001N9+ 0078 1+99999999999...) Map output as (year, temperature) pairs: (1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78) Grouped by year: (1949, [111, 78]) (1950, [0, 22, −11]) Maximum per year: (1949, 111) (1950, 22)
  8. The input to our map phase is the raw NCDC data