9. Source: IDC Digital Universe Study, sponsored by EMC, May 2011
10. Google’s Server Design. Source: CNET News, “Google Uncloaks Once-Secret Server,” April 2009
12. Source: McKinsey Global Institute, Big Data: The Next Frontier for Innovation, Competition and Productivity, May 2011
13. The Evolution of Science
Timeline (evolution of science over time), spanning the era of natural philosophy, the era of modern science, and the Industrial Revolution that marks the transition between them: Astronomy (Babylon, 1900 BC); Mathematics (India, 499 BC); Platonic Academy (387 BC); Scientific Revolution (1543 AD); Newton’s Laws (1687 AD); Relativity (1905 AD); Quantum Physics (1925 AD); Computing (1946 AD); DNA (1953 AD); Learning Systems (XXI century).
14. Today’s Systems – The Calculating Paradigm
Algorithms and applications; static programming; archives of structured data and text. People hypothesize, determine “what it means,” and run other applications.
15. Future Systems – The Learning Paradigm
- Training and learning engines to build models and define insight
- Hypothesis engines to understand and plan actions
- Policy engine: business, legal, and ethical rules
- Verification engines (e.g., simulations)
- Active learning through natural interfaces
- Outcome engine: actuation and validation
Inputs drawn from society, nature, institutions, and archives.
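To make the division of labor concrete, here is a minimal, hypothetical Python sketch of how these engines might compose into one decision loop. Every name, rule, and model below is invented for illustration; nothing here comes from a real IBM system.

```python
# Hypothetical composition of the learning-paradigm engines. The "model" is a
# trivial stand-in (a learned mean); real engines would be far richer.

def train(archive):
    """Training/learning engine: build a model from archived observations."""
    return sum(archive) / len(archive)

def hypothesize(model, observation):
    """Hypothesis engine: propose candidate actions with confidences."""
    return [("raise", abs(observation - model)), ("hold", 1.0)]

def policy_ok(action):
    """Policy engine: business, legal, and ethical rules as predicates."""
    return action != "forbidden"

def verify(action):
    """Verification engine: stand-in for simulating the action first."""
    return True

def decide(model, observation):
    """Outcome engine: actuate the most confident verified candidate."""
    candidates = [(a, c) for a, c in hypothesize(model, observation)
                  if policy_ok(a) and verify(a)]
    return max(candidates, key=lambda ac: ac[1])

model = train([1.0, 2.0, 3.0])
print(decide(model, observation=2.5))  # ('hold', 1.0)
```

In a full system the validated outcome would be fed back into training, closing the active-learning loop the slide emphasizes.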
16. New “Big Data” Brings New Opportunities, Requires New Analytics
Relative to traditional data warehousing and business intelligence: data scale up to 10,000 times larger, decision frequency up to 10,000 times faster, spanning data at rest through data in motion. (Chart axes: data scale from kilo- to exabytes; decision frequency from occasional, through frequent, to real time, i.e., years down to milliseconds.)

Workload | Ingest rate | Decision latency | Deep-analytics store
Telco promotions | 100,000 records/sec (6B/day) | 10 ms/decision | 270 TB
Smart traffic | 250K GPS probes/sec, 630K segments/sec | 2 ms/decision (4K vehicles) | n/a
Homeland security | 600,000 records/sec (50B/day) | 1-2 ms/decision | 320 TB
DeepQA | n/a | 3 sec/decision | 100s of GB
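As a quick sanity check, the per-second ingest rates above can be scaled to daily volumes; the quoted daily totals are consistent with sustained rates somewhat below a continuous 24x7 peak. A throwaway calculation, with rates taken from the table:

```python
# Scale the quoted per-second ingest rates to records per day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

for name, rate in [("Telco promotions", 100_000),
                   ("Homeland security", 600_000)]:
    per_day = rate * SECONDS_PER_DAY
    print(f"{name}: {rate:,} rec/s -> {per_day / 1e9:.1f}B rec/day at 24x7")
# Telco promotions: 100,000 rec/s -> 8.6B rec/day at 24x7 (slide quotes 6B)
# Homeland security: 600,000 rec/s -> 51.8B rec/day at 24x7 (slide quotes 50B)
```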
17. Enabling Multiple Solutions & Appliances to Achieve a Smarter Planet
A Peta2 analytics appliance (reactive plus deep-analytics platform) and a big-analytics ecosystem, built on a Peta2 data-centric system, algorithms, and big-data skills, enabling solutions such as:
- DeepFriends – social network monitoring
- DeepResponse – emergency coordination
- DeepEyes – webcam fusion
- DeepCurrent – power delivery
- DeepSafety – police/security
- DeepTraffic – area traffic prediction
- DeepWater – water management
- DeepBasket – food market prediction
- DeepBreath – air quality control
- DeepPulse – political polling
- DeepThunder – local weather prediction
- DeepSoil – farm prediction
18. Watson Today: Evidence-Based Decision Support
Processes unstructured text at about 200 hypotheses per 3 seconds against a static data corpus, on roughly 3,000 cores, 100 TFLOPS, 2 TB of memory, and ~200 kW.
Pipeline: from a given “answer” clue (e.g., “A large country in the Western Hemisphere whose capital has a similar name”), hypotheses are generated as guess questions Q1, Q2, … Qi; a statistical ensemble of 600 to 800 scoring engines (S1, S2, S3, … SN) scores each one; ~30 machine learning models weigh the scores and produce a confidence 0 < P < 1 for each question; the hypothetical question with the greatest confidence is chosen (here: “What is Brazil?”).

Element | Refresh time
Data corpus | 2 weeks
Hypothesis engines | weeks to months
Scoring engines | weeks to months
Decision support engine | 4 days
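The ensemble step lends itself to a compact illustration. The sketch below, with made-up weights and scores, shows the general shape: each scoring engine rates a candidate, learned weights combine the scores into a confidence strictly between 0 and 1 (here via a logistic function, one plausible choice), and the highest-confidence candidate wins. This illustrates the technique, not Watson’s actual scoring code.

```python
import math

def confidence(scores, weights, bias=0.0):
    """Combine scorer outputs into a confidence with 0 < P < 1."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing

weights = [0.8, 1.2, 0.5]           # stand-ins for weights learned by ~30 models
candidates = {                      # per-engine scores S1..S3, invented
    "Brazil":    [0.9, 0.8, 0.7],
    "Argentina": [0.6, 0.4, 0.5],
}
ranked = {c: confidence(s, weights) for c, s in candidates.items()}
best = max(ranked, key=ranked.get)
print(best, round(ranked[best], 3))  # Brazil 0.884
```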
20. Exascale Research and Development. Source: Exascale Research and Development – Request for Information, July 2011
21. Big Data Systems Require a Data-centric Architecture for Performance
Old compute-centric model: data lives on disk and tape and is moved to the CPU as needed, through a deep storage hierarchy.
New data-centric model: data lives in persistent memory (flash, phase change) and many CPUs (manycore, FPGA) surround and use it, through a shallow, flat storage hierarchy with massive parallelism.
This is the largest change in system architecture since the System/360, with huge impact on hardware, systems software, and application design.
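One rough way to picture the shift, assuming a POSIX-style system: in the compute-centric model a program read()s bytes across to the CPU, while in the data-centric model computation is brought to data that stays resident, for which memory-mapping is a loose analogy. The file name and sizes below are invented for the demo.

```python
import mmap, os

# Create a small demo file standing in for a persistent data store.
with open("corpus.bin", "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of demo bytes

# Compute-centric: read() copies every byte over to the process.
with open("corpus.bin", "rb") as f:
    data = f.read()

# Data-centric analogy: map the bytes in place; many processes could share
# one resident copy instead of each pulling its own.
with open("corpus.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        sample = sum(m[i] for i in range(0, len(m), 4096))  # touch pages lazily

print(len(data), sample)
```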
22. Scale-in is the New Systems Battlefield
(Chart: system capacity/capability versus system density, i.e., 1/end-to-end latency, each running from single devices to device clusters and up against physical limits; exascale and Peta2 systems sit at the extreme.)
- Scale-down – maximize feature density: atom transistors, atom storage
- Scale-up – maximize device capacity: terabyte HDDs, POWER7
- Scale-out – maximize system capacity: scale-out NAS, blade servers, cloud computing
- Scale-in – maximize system density, minimize end-to-end latency: FLASH SSD, 3D chips, FPGAs, manycore, BPRAM/SCM, interconnect, in-memory databases, DAS
23. Storage Class Memory – The Tipping Point for Data-centric Systems
HDD keeps its cost advantage (about 1/10 the cost of SCM), but SCM dominates in performance, roughly 10,000x faster than HDD. Projected for SCM (phase change) in 2015: $0.05 per GB, i.e., $50K per PB (the chart also marks $0.10/GB and $0.01/GB price points).

Technology | Relative cost | Relative latency
DRAM | 100 | 1
SCM | 1 | 10
FLASH | 15 | 1,000
HDD | 0.1 | 100,000

Source: Chung Lam, IBM
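The relative numbers in the table reduce to the two headline claims. A two-line check, with values copied from the table:

```python
parts = {            # technology: (relative cost, relative latency)
    "DRAM":  (100, 1),
    "SCM":   (1, 10),
    "FLASH": (15, 1_000),
    "HDD":   (0.1, 100_000),
}
print(f"HDD is {parts['SCM'][0] / parts['HDD'][0]:.0f}x cheaper per byte than SCM")  # 10x
print(f"SCM is {parts['HDD'][1] / parts['SCM'][1]:,.0f}x faster than HDD")           # 10,000x
```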
24. Background: Three Styles of Massively Parallel Systems
- Streaming (Streams) – reactive analytics, extreme ingestion. Data in motion: high velocity, mixed variety, high volume over time. Languages: SPL, C, Java.
- Hadoop/MapReduce (BigInsights) – deep analytics, extreme scale-out. Data at rest (pre-partitioned): high volume, mixed variety, low velocity. Languages: JAQL, Java. (Diagram: mappers read input data on disk and feed reducers, which write output data; each box is a compute node.)
- Simulation (BlueGene) – generative modeling, extreme physics: long running, small input, massive output. Languages: C/C++, Fortran, MPI, OpenMP.
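For readers unfamiliar with the second style, here is a toy word count in the MapReduce shape the diagram describes: mappers emit key/value pairs, a shuffle groups values by key, and reducers aggregate each group. Real BigInsights jobs would be written in JAQL or Java; plain Python is used here only to show the data flow.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce: aggregate all counts observed for one key."""
    return word, sum(counts)

lines = ["big data big systems", "data at rest"]  # stand-in for data on disk

# Shuffle: group mapper output by key, as the framework would between phases.
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

print(dict(reducer(w, c) for w, c in groups.items()))
# {'big': 2, 'data': 2, 'systems': 1, 'at': 1, 'rest': 1}
```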
Picture 1: So let's start with the big numbers. It would not be an exaggeration to say that we've clearly entered the "Zettabyte Era". A zettabyte is a trillion gigabytes, or a billion terabytes, as you prefer. This year (2011) we're forecast to generate and consume 1.8 zettabytes of information as a society, up from an estimated 1.3 zettabytes in 2010, with 35 zettabytes forecast by the end of this decade. Indeed, the most fascinating statement comes from the subhead of the press release: the rate of information growth appears to be exceeding Moore's Law -- a powerful argument for scale-out architectures if there ever was one.

Picture 2: Go ahead, visualize this: take the total storage capacity you've got today and throw a 50x multiplier against it. Contemplate that for a moment. And remember, that's just an average; information-intensive businesses will likely see far more.

Picture 3: Or consider that all this information will be stored in 75x more "containers" (files, objects, etc.) than we are dealing with today.

Picture 4: Or that, by the end of the decade, we'll have 10x as many servers, both physical and virtual, to deal with. It makes a certain sense: more information, and ever more uses for it, means vastly more servers of all types putting that information to work.
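For what it's worth, the growth rate implied by those forecasts is easy to work out, and it lands in the neighborhood of a doubling roughly every two years:

```python
import math

start, end, years = 1.8, 35.0, 9  # ZB in 2011 -> forecast ZB by 2020
cagr = (end / start) ** (1 / years) - 1
doubling = math.log(2) / math.log(1 + cagr)
print(f"~{cagr:.0%} per year, doubling every ~{doubling:.1f} years")
# ~39% per year, doubling every ~2.1 years
```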
"The evolution of science"
1. Babylonian astronomy was probably the first real example of science in practice; the next step along this path was the invention of mathematics.
2. The Platonic Academy was the first codification of scientific principles.
3. Little happened during the Middle Ages, but the Renaissance brought a rebirth of inquiry that triggered the Scientific Revolution: Copernicus introduced a heliocentric world view, and the modern study of human anatomy was born.
4. Newton's laws marked the pinnacle of the era of natural philosophy.
5. The Industrial Revolution marked the birth of modern science, where science became broadly useful to humanity, and the transition from the era of natural philosophy to the era of modern science.
6. Modern science progressed at an exponential rate, with major advances coming quickly on the heels of one another.
7. The entire evolution of science can be viewed as an example of exponential growth.
3 CHANGES to EMPHASIZE: the dynamic nature of models, not static. (1) Active learning (hard); (2) dynamic engines (training, policy, hypothesis, outcome, verification); (3) natural interfaces.

How is our Learning System different from past machine learning approaches? Our Learning System will automatically identify key features. Feature selection is the technique of selecting a subset of relevant features for learning models; for example, key features to diagnose an illness may be a person's temperature, white blood cell count, pH level, etc. The current state of the art either has (A) humans identifying the key features for different domains, or (B) machine learning programs extracting key features based on expert rules (provided by humans) or statistical methods, which may lead to false conclusions in domains that involve semantic ambiguity. The Learning System we're building will use crowdsourcing techniques to automatically identify key features for a domain and will proactively ask humans for disambiguation, instead of waiting for humans to notice the model is erroneous (for example, models that rated questionable mortgages as AAA, or a program that deduces that Internet cookies are edible). A sketch of both ideas appears after this note.

In the same vein, another key difference in our Learning System is active, continuous verification. Current trends that provide increasing amounts of digital data (e.g., IBM Smarter Planet sensors) will enable our Learning System to modify itself and prune key features that are no longer relevant. In summary: (1) automatic extraction of key features, (2) continuous active self-verification, and (3) the ability to select the appropriate machine learning technique (statistical, genetic programming, neural networks, etc.) and modify it as conditions change. These three features have not been integrated into prior machine learning approaches.

On hypotheses: a hypothesis is necessarily about a problem that is not formalized (if the problem were formalized, no hypothesis would be required, only a formal solution). Without a formal problem, the task of formulating hypotheses becomes one of creating alternative problem representations and selecting among them, based in part on the possible solutions to each. Known systems that attempt this require a defined problem space, where the range of possible hypotheses is calculated from a range of possible system states. "Real world" problems do not emerge from a range of possible states, however; they occur when previously defined ranges (or dimensions) are violated. The only known systems capable of formulating hypotheses about arbitrary states and selecting among them are biological cognitive systems. An explanation of this is necessary before a system that "creates hypotheses" can be introduced, even as a hypothetical.
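As promised above, a minimal sketch of the first two differences, assuming scikit-learn is available: automatic selection of key features, followed by an active-learning step that routes low-confidence cases to a human instead of waiting for the model to be caught out. The data, the threshold, and the ask_human() hook are all invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic diagnosis-like data: 20 candidate features, 3 truly informative
# (think temperature, white cell count, pH among many irrelevant readings).
X, y = make_classification(n_samples=200, n_features=20, n_informative=3)

# (1) Automatic extraction of key features: keep the 3 most informative.
selector = SelectKBest(f_classif, k=3).fit(X, y)
X_key = selector.transform(X)

model = LogisticRegression().fit(X_key, y)

# (2) Active learning: proactively ask a human about cases the model finds
# ambiguous, rather than waiting for someone to notice it is wrong.
def ask_human(example):
    print("Please label this ambiguous case:", example)  # hypothetical hook
    return 0  # a real system would collect the answer and retrain

for example, p in zip(X_key, model.predict_proba(X_key)):
    if abs(p[1] - 0.5) < 0.05:  # nearly undecided -> disambiguate now
        ask_human(example)
```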