Scalability in Hadoop and Similar Systems
©MapR Technologies - Confidential              1
Big is the next big thing

     Big data and Hadoop are exploding


     Companies are being funded


     Books are being written


     Applications sprouting up everywhere




Slow Motion Explosion




Hadoop Explosion




Why Now?

        But Moore’s law has applied for a long time


        Why is Hadoop exploding now?


        Why not 10 years ago?


        Why not 20?




9/18/2012
Size Matters, but …

     If it were just availability of data then existing big companies would
      adopt big data technology first


                       They didn’t




Or Maybe Cost

     If it were just a net positive value then finance companies should
      adopt first because they have higher opportunity value / byte


                       They didn’t




Backwards adoption

     Under almost any threshold argument startups would not adopt
      big data technology first


                       They did




Everywhere at Once?

     Something very strange is happening
       –   Big data is being applied at many different scales
       –   At many value scales
       –   By large companies and small


                                    Why?




The Conventional Answer
More data is being produced more quickly
Data sizes are bigger than even a very large computer can hold
Cost to create and store continues to decrease




Analytics Scaling Laws

     Analytics scaling is all about the 80-20 rule
       –   Big gains for little initial effort
       –   Rapidly diminishing returns
     The key to net value is how costs scale
       –   Old school – exponential scaling
       –   Big data – linear scaling, low constant
     Cost/performance has changed radically
       –   IF you can use many commodity boxes
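The scaling argument above can be sketched numerically. This is only an illustration: the curve shapes and every constant below (`s0`, `c0`, `k`, `slope`, the 0 to 2,000 scale range) are assumptions chosen for the example, not figures from the talk. Value follows a diminishing-returns curve, and net value is value minus cost under either exponential or linear cost scaling.

```python
import math

def value(scale, s0=300.0):
    """Diminishing-returns value curve (assumed form): big early gains, then a plateau."""
    return 1.0 - math.exp(-scale / s0)

def exponential_cost(scale, c0=0.01, k=200.0):
    """'Old school' cost growing exponentially with scale (assumed constants)."""
    return c0 * (math.exp(scale / k) - 1.0)

def linear_cost(scale, slope=0.00005):
    """'Big data' cost growing linearly with a low constant (assumed slope)."""
    return slope * scale

def net_value(scale, cost):
    """Net value is what the analysis is worth minus what it costs."""
    return value(scale) - cost(scale)

def best_scale(cost, max_scale=2000):
    """Grid-search the integer scale in [0, max_scale] with the highest net value."""
    return max(range(max_scale + 1), key=lambda s: net_value(s, cost))

old_opt = best_scale(exponential_cost)
new_opt = best_scale(linear_cost)

print("exponential cost: best scale", old_opt, "net", round(net_value(old_opt, exponential_cost), 3))
print("linear cost:      best scale", new_opt, "net", round(net_value(new_opt, linear_cost), 3))
```

Under these assumed constants the net-value optimum for exponential cost sits well before the maximum scale, while linear cost moves the optimum further out and raises the achievable net value.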




You’re kidding, people do that?
We didn’t know that!
We should have known that
We knew that
[Figure: value (0 to 1) versus scale (0 to 2,000). Labels from lowest to highest value: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]
[Figure: value versus scale]
Net value optimum has a sharp peak well before maximum effort
But scaling laws are changing both slope and shape
[Figure: value versus scale]
More than just a little
[Figure: value versus scale]
They are changing a LOT!
[Two figures: value versus scale]
[Figure: value versus scale]
Initially, linear cost scaling actually makes things worse
A tipping point is reached and things change radically …
Pre-requisites for Tipping

     To reach the tipping point,
     Algorithms must scale out horizontally
       –   On commodity hardware
       –   That can and will fail
     Data practice must change
       –   Denormalized is the new black
       –   Flexible data dictionaries are the rule
       –   Structured data becomes rare




Yeah… but wait




The Standard Sort of Model

     People talk about the law of large numbers as if it were …



     Well, as if it were a law


     It’s not …


     It is a context and assumption dependent theorem




What if …

     These assumptions are:


     Changes have a
       –   stationary,
       –   independent,
       –   finite variance distribution




     What happens if these assumptions are wrong?


     And which of them is really wrong?

For Example
[Figure: "Stuff" versus time]
End point has nice tractable distribution
[Figure: "Stuff" versus time]
What if the Assumptions are Wrong?

     Take the finite variance assumption as a simple example


     Dropping it leads to Lévy stable distributions


     Like the Cauchy distribution
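A small simulation can make the contrast concrete. This is a sketch under stated assumptions: the seed, the sample size, and the helper names `cauchy` and `running_means` are illustrative choices, not part of the talk. It compares running sample means of standard normal draws (finite variance) with standard Cauchy draws (no finite mean or variance).

```python
import math
import random

def cauchy(rng):
    """Draw from a standard Cauchy distribution by inverse-CDF sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))

def running_means(samples):
    """Return the sample mean after 1, 2, ..., n observations."""
    total = 0.0
    means = []
    for i, x in enumerate(samples, start=1):
        total += x
        means.append(total / i)
    return means

rng = random.Random(42)  # arbitrary seed, chosen only for repeatability
n = 100_000
gauss_draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
cauchy_draws = [cauchy(rng) for _ in range(n)]

gauss_means = running_means(gauss_draws)
cauchy_means = running_means(cauchy_draws)

# Print the running mean of each series at a few checkpoints.
for k in (100, 10_000, 100_000):
    print(k, round(gauss_means[k - 1], 4), round(cauchy_means[k - 1], 4))
```

The Gaussian running mean settles near zero as the law of large numbers predicts. The Cauchy running mean is expected to keep jumping whenever an extreme draw arrives, because the mean of n standard Cauchy draws is itself standard Cauchy, so averaging does not narrow the distribution.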




Is it Really Different?




[Figure: "Stuff" versus time]
What About Real Life?




But is it Really Infinite Variance?

     Or are there other kinds of phenomena that show this?


     What about the independence assumption?



     What if the supposedly independent components of the system
      communicate?


     Like we do. Every day. All the time.




Why the Difference?


[Diagram labels: "The space of all things that change", "Infinite variance", "The space of interacting things", "Law of large numbers", "Interacting agents"]

Apologies and credit to Simon DeDeo, SFI
What Happens with Interactions

     Social phenomena defeat the law of large numbers
     Distributions are well modeled by “rich get richer” processes
       –   Pitman-Yor process, Indian buffet process
     Limiting distributions are heavy tailed, power law
     We see these distributions everywhere
       –   price of cotton in the 19th century
       –   word frequencies
       –   popularity of Github projects
       –   equity pricing and volumes
       –   sizes of cities
       –   popularity of web-sites
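A "rich get richer" process of the kind named above can be sketched with the Chinese restaurant construction of a Pitman-Yor process. This is a minimal illustration, not the speaker's code: the function name `pitman_yor_tables` and the parameter values (`discount`, `concentration`, `seed`, 10,000 customers) are assumptions. Each new customer joins an existing table with probability proportional to its size minus the discount, or opens a new table with probability proportional to the concentration plus the discount times the number of tables.

```python
import random

def pitman_yor_tables(n_customers, discount=0.5, concentration=1.0, seed=0):
    """Simulate table sizes under the Chinese restaurant form of a Pitman-Yor process."""
    rng = random.Random(seed)
    tables = []  # tables[k] is the number of customers seated at table k
    for n in range(n_customers):
        # Unnormalized weight for opening a new table.
        new_weight = concentration + discount * len(tables)
        # Total weight is n + concentration (new table plus all existing tables).
        r = rng.random() * (n + concentration)
        if r < new_weight:
            tables.append(1)
            continue
        r -= new_weight
        for k, size in enumerate(tables):
            r -= size - discount  # larger tables attract more customers
            if r < 0:
                tables[k] += 1
                break
        else:
            tables[-1] += 1  # guard against floating-point round-off
    return tables

sizes = pitman_yor_tables(10_000)
ranked = sorted(sizes, reverse=True)
print("tables:", len(sizes))
print("five largest:", ranked[:5])
print("median size:", ranked[len(ranked) // 2])
```

A few tables typically absorb a large share of the customers while most tables stay tiny, which is the heavy-tailed pattern the slide describes.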


What are the Implications?
[Figure: value versus scale]
In a Nutshell

     Scalability is much more important than we thought


     Mashups are more important than we thought


     Network effects are more important than we thought


     Exploration is more important than we thought


     Hadoop-style linear scaling must be mixed with ad hoc analysis



Thank You




whoami?

     Ted Dunning
       –   @ted_dunning
       –   tdunning@maprtech.com (MapR distribution for Hadoop)
       –   tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill)
       –   ted.dunning@gmail.com (me)


     More info:

       http://www.mapr.com/company/events/hadoop-in-finance-2012




©MapR Technologies - Confidential         45

Más contenido relacionado

La actualidad más candente

Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Ted Dunning
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time TogetherMapR Technologies
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionTed Dunning
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...Edge AI and Vision Alliance
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...Edge AI and Vision Alliance
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data Alison B. Lowndes
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Sivadon Chaisiri
 
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...Edge AI and Vision Alliance
 
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization PerspectiveNew Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization PerspectiveFörderverein Technische Fakultät
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...Edge AI and Vision Alliance
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...Edge AI and Vision Alliance
 
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...KTN
 

La actualidad más candente (15)

Dunning strata-2012-27-02
Dunning strata-2012-27-02Dunning strata-2012-27-02
Dunning strata-2012-27-02
 
Bda-dunning-2012-12-06
Bda-dunning-2012-12-06Bda-dunning-2012-12-06
Bda-dunning-2012-12-06
 
Hcj 2013-01-21
Hcj 2013-01-21Hcj 2013-01-21
Hcj 2013-01-21
 
Real-time and Long-time Together
Real-time and Long-time TogetherReal-time and Long-time Together
Real-time and Long-time Together
 
Strata 2014 Anomaly Detection
Strata 2014 Anomaly DetectionStrata 2014 Anomaly Detection
Strata 2014 Anomaly Detection
 
Devoxx Real-Time Learning
Devoxx Real-Time LearningDevoxx Real-Time Learning
Devoxx Real-Time Learning
 
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ..."Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded ...
 
"How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M..."How to Test and Validate an Automated Driving System," a Presentation from M...
"How to Test and Validate an Automated Driving System," a Presentation from M...
 
Talk on commercialising space data
Talk on commercialising space data Talk on commercialising space data
Talk on commercialising space data
 
Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing Optimization of Resource Provisioning Cost in Cloud Computing
Optimization of Resource Provisioning Cost in Cloud Computing
 
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
“Trends in Neural Network Topologies for Vision at the Edge,” a Presentation ...
 
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization PerspectiveNew Media Services from a Mobile Chipset Vendor and Standardization Perspective
New Media Services from a Mobile Chipset Vendor and Standardization Perspective
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
 
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ..."Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
"Collaboratively Benchmarking and Optimizing Deep Learning Implementations," ...
 
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
Implementing AI: High Performance Architectures: A Universal Accelerated Comp...
 

Destacado

Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Ted Dunning
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent SearchTed Dunning
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clusteringTed Dunning
 
Transactional Data Mining
Transactional Data MiningTransactional Data Mining
Transactional Data MiningTed Dunning
 
Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...Mari Tinnemans
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementDr. Volkan OBAN
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data CertificationExperfy
 
Framework for Data Informed Science Policy
Framework for Data Informed Science PolicyFramework for Data Informed Science Policy
Framework for Data Informed Science PolicyBrian Wee
 
InfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management FrameworkInfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management FrameworkInfosys
 
The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...Iwl Pcu
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...ETCenter
 
Business Analyst Training - Gain America
Business Analyst Training - Gain AmericaBusiness Analyst Training - Gain America
Business Analyst Training - Gain AmericaGainAmerica
 
Scientific Data Cataloging Framework
Scientific Data Cataloging FrameworkScientific Data Cataloging Framework
Scientific Data Cataloging FrameworkSupun Nakandala
 
Open Science Framework (OSF)
Open Science Framework (OSF)Open Science Framework (OSF)
Open Science Framework (OSF)Andrew Sallans
 
Big Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science MethodologyBig Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science MethodologySumit Mattey
 

Destacado (20)

Iss
IssIss
Iss
 
Het Iss
Het IssHet Iss
Het Iss
 
Drill at the Chug 9-19-12
Drill at the Chug 9-19-12Drill at the Chug 9-19-12
Drill at the Chug 9-19-12
 
Intelligent Search
Intelligent SearchIntelligent Search
Intelligent Search
 
Graphlab dunning-clustering
Graphlab dunning-clusteringGraphlab dunning-clustering
Graphlab dunning-clustering
 
Transactional Data Mining
Transactional Data MiningTransactional Data Mining
Transactional Data Mining
 
Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...Research Support in an Open Science Framework - Ron Dekker, seconded national...
Research Support in an Open Science Framework - Ron Dekker, seconded national...
 
R Cheat Sheet – Data Management
R Cheat Sheet – Data ManagementR Cheat Sheet – Data Management
R Cheat Sheet – Data Management
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
 
Framework for Data Informed Science Policy
Framework for Data Informed Science PolicyFramework for Data Informed Science Policy
Framework for Data Informed Science Policy
 
InfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management FrameworkInfosysPublicServices - Healthcare SOA | Program Management Framework
InfosysPublicServices - Healthcare SOA | Program Management Framework
 
The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...The Framework Program for the Sustainable Management of La Plata Basin's Wate...
The Framework Program for the Sustainable Management of La Plata Basin's Wate...
 
R Cheat Sheet
R Cheat SheetR Cheat Sheet
R Cheat Sheet
 
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...
 
Big data framework
Big data frameworkBig data framework
Big data framework
 
Business Analyst Training - Gain America
Business Analyst Training - Gain AmericaBusiness Analyst Training - Gain America
Business Analyst Training - Gain America
 
Scientific Data Cataloging Framework
Scientific Data Cataloging FrameworkScientific Data Cataloging Framework
Scientific Data Cataloging Framework
 
Program Mgmt Framework
Program Mgmt FrameworkProgram Mgmt Framework
Program Mgmt Framework
 
Open Science Framework (OSF)
Open Science Framework (OSF)Open Science Framework (OSF)
Open Science Framework (OSF)
 
Big Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science MethodologyBig Data University DS0103EN Certificate _ Data Science Methodology
Big Data University DS0103EN Certificate _ Data Science Methodology
 

Similar a Chicago finance-big-data

Big data, why now?
Big data, why now?Big data, why now?
Big data, why now?Ted Dunning
 
Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningMapR Technologies
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise Steve Jenkins - Business Opportunities for Big Data in the Enterprise
Steve Jenkins - Business Opportunities for Big Data in the Enterprise WeAreEsynergy
 
EMC's IT's Cloud Transformation, Thomas Becker, EMC
EMC's IT's Cloud Transformation, Thomas Becker, EMCEMC's IT's Cloud Transformation, Thomas Becker, EMC
EMC's IT's Cloud Transformation, Thomas Becker, EMCCloudOps Summit
 
How to Plan and Budget for 2013 with Cloud in Mind
How to Plan and Budget for 2013 with Cloud in MindHow to Plan and Budget for 2013 with Cloud in Mind
How to Plan and Budget for 2013 with Cloud in MindBluelock
 
predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21Ted Dunning
 
Data Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowData Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowMapR Technologies
 
2012 Future of Cloud Computing
2012 Future of Cloud Computing 2012 Future of Cloud Computing
2012 Future of Cloud Computing Michael Skok
 
Dell panel cloud computing - small biz summit 2012
Dell panel   cloud computing - small biz summit 2012Dell panel   cloud computing - small biz summit 2012
Dell panel cloud computing - small biz summit 2012Ramon Ray
 
Progress with confidence into next generation IT
Progress with confidence into next generation ITProgress with confidence into next generation IT
Progress with confidence into next generation ITPaul Muller
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRData Science London
 
Lax breakfast forum_developing_your_cloud_strategy_05_10_2012
Lax breakfast forum_developing_your_cloud_strategy_05_10_2012Lax breakfast forum_developing_your_cloud_strategy_05_10_2012
Lax breakfast forum_developing_your_cloud_strategy_05_10_2012Internap
 
Nyc lunch and learn 03 15 2012 final
Nyc lunch and learn   03 15 2012 finalNyc lunch and learn   03 15 2012 final
Nyc lunch and learn 03 15 2012 finalInternap
 
The Smarter Way to Commercialize Algorithms
The Smarter Way to Commercialize AlgorithmsThe Smarter Way to Commercialize Algorithms
The Smarter Way to Commercialize AlgorithmsCloudNSci
 
Managing your Cloud with Confidence
Managing your Cloud with Confidence Managing your Cloud with Confidence
Managing your Cloud with Confidence CA Nimsoft
 
CloudOps with OpsRamp: From Discovery to Resolution
CloudOps with OpsRamp: From Discovery to ResolutionCloudOps with OpsRamp: From Discovery to Resolution
CloudOps with OpsRamp: From Discovery to ResolutionOpsRamp
 
Spark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleSpark and MapR Streams: A Motivating Example
Spark and MapR Streams: A Motivating ExampleIan Downard
 
Dr Markus Pleier - Datadeluge and big data, how IT operation get transformed
Dr Markus Pleier - Datadeluge and big data, how IT operation get transformedDr Markus Pleier - Datadeluge and big data, how IT operation get transformed
Dr Markus Pleier - Datadeluge and big data, how IT operation get transformedGlobal Business Events
 

Similar a Chicago finance-big-data (20)

Big data, why now?
Big data, why now?Big data, why now?
Big data, why now?
 
Chicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted DunningChicago Hadoop in Finance - Ted Dunning
Chicago Hadoop in Finance - Ted Dunning
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
Steve Jenkins - Business Opportunities for Big Data in the Enterprise

Chicago finance-big-data

  • 1. Scalability in Hadoop and Similar Systems. ©MapR Technologies - Confidential
  • 2. Big is the next big thing: Big data and Hadoop are exploding; companies are being funded; books are being written; applications are sprouting up everywhere.
  • 3. Slow Motion Explosion
  • 5. Why Now? (9/18/2012) But Moore’s law has applied for a long time. Why is Hadoop exploding now? Why not 10 years ago? Why not 20?
  • 6. Size Matters, but … If it were just availability of data, then existing big companies would adopt big data technology first.
  • 7. Size Matters, but … If it were just availability of data, then existing big companies would adopt big data technology first. They didn’t.
  • 8. Or Maybe Cost: If it were just a net positive value, then finance companies should adopt first because they have higher opportunity value / byte.
  • 9. Or Maybe Cost: If it were just a net positive value, then finance companies should adopt first because they have higher opportunity value / byte. They didn’t.
  • 10. Backwards adoption: Under almost any threshold argument, startups would not adopt big data technology first.
  • 11. Backwards adoption: Under almost any threshold argument, startups would not adopt big data technology first. They did.
  • 12. Everywhere at Once? Something very strange is happening: big data is being applied at many different scales, at many value scales, by large companies and small.
  • 13. Everywhere at Once? Something very strange is happening: big data is being applied at many different scales, at many value scales, by large companies and small. Why?
  • 14. The Conventional Answer: More data is being produced more quickly. Data sizes are bigger than even a very large computer can hold. Cost to create and store continues to decrease.
  • 15. Analytics Scaling Laws: Analytics scaling is all about the 80-20 rule (big gains for little initial effort; rapidly diminishing returns). The key to net value is how costs scale (old school: exponential scaling; big data: linear scaling, low constant). Cost/performance has changed radically, IF you can use many commodity boxes.
  • 16. You’re kidding, people do that? / We didn’t know that! / We should have known that / We knew that
  • 17. [Chart: Value (0 to 1) vs. Scale (0 to 2,000). Labels from low to high value: Anybody with eyes; Intern with a spreadsheet; In-house analytics; Industry-wide data consortium; NSA, non-proliferation]
  • 18. [Chart: Value vs. Scale] Net value optimum has a sharp peak well before maximum effort
  • 19. But scaling laws are changing both slope and shape
  • 20. [Chart: Value vs. Scale] More than just a little
  • 21. [Chart: Value vs. Scale] They are changing a LOT!
  • 24. [Chart: Value vs. Scale]
  • 25. [Chart: Value vs. Scale]
  • 26. [Chart: Value vs. Scale] A tipping point is reached and things change radically … Initially, linear cost scaling actually makes things worse
  • 27. Pre-requisites for Tipping: To reach the tipping point, algorithms must scale out horizontally (on commodity hardware that can and will fail). Data practice must change (denormalized is the new black; flexible data dictionaries are the rule; structured data becomes rare).
  • 28. Yeah… but wait
  • 29. The Standard Sort of Model: People talk about the law of large numbers as if it were … well, as if it were a law. It’s not … It is a context- and assumption-dependent theorem.
  • 30. What if … These assumptions are: changes have a stationary, independent, finite variance distribution. What happens if these assumptions are wrong? And which of them is really wrong?
  • 31. For Example [Chart: Stuff vs. Time]
  • 32. [Chart: Stuff vs. Time] End point has nice tractable distribution
  • 33. What if the Assumptions are Wrong? Take the finite variance assumption as a simple example. Dropping it leads to Lévy stable distributions, like the Cauchy distribution.
  • 34. Is it Really Different?
  • 35. [Chart: Stuff vs. Time]
  • 36. What About Real Life?
  • 38. But is it Really Infinite Variance? Or are there other kinds of phenomena that show this? What about the independence assumption? What if the supposedly independent components of the system communicate? Like we do. Every day. All the time.
  • 39. Why the Difference? [Diagram labels: The space of all things that change; The space of interacting things; Infinite variance; Law of large numbers; Interacting agents] Apologies and credit to Simon DeDeo, SFI
  • 40. What Happens with Interactions: Social phenomena defeat the law of large numbers. Distributions are well modeled by “rich get richer” processes (Pitman-Yor process, Indian Buffet). Limiting distributions are heavy tailed, power law. We see these distributions everywhere: price of cotton in the 19th century; word frequencies; popularity of Github projects; equity pricing and volumes; sizes of cities; popularity of web-sites.
  • 41. What are the Implications?
  • 42. [Chart: Value vs. Scale]
  • 43. In a Nutshell: Scalability is much more important than we thought. Mashups are more important than we thought. Network effects are more important than we thought. Exploration is more important than we thought. Hadoop style linear scaling must be mixed with ad hoc analysis.
  • 44. Thank You
  • 45. whoami? Ted Dunning: @ted_dunning; tdunning@maprtech.com (MapR distribution for Hadoop); tdunning@apache.com (Mahout, Hadoop, Lucene, Zookeeper, Drill); ted.dunning@gmail.com (me). More info: http://www.mapr.com/company/events/hadoop-in-finance-2012
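
The net-value argument in the scaling-law slides (15 through 26) can be tried out numerically. The sketch below is only an illustration: the slides give qualitative shapes, so the functional forms and every constant here (`knee`, `base`, `doubling`, the slopes) are assumptions, not figures from the talk. It compares the best achievable net value under an exponential cost curve with the best under linear cost curves of decreasing slope.

```python
import math

def value(scale, knee=200.0):
    """Assumed diminishing-returns value curve: quick early gains, plateau near 1."""
    return 1.0 - math.exp(-scale / knee)

def exponential_cost(scale, base=0.01, doubling=150.0):
    """Assumed 'old school' cost: doubles every `doubling` units of scale."""
    return base * 2.0 ** (scale / doubling)

def linear_cost(scale, slope):
    """Assumed scale-out cost: proportional to scale."""
    return slope * scale

def best_scale(cost, scales=range(0, 2001, 10)):
    """Grid search for the (scale, net value) pair maximizing value - cost."""
    return max(((s, value(s) - cost(s)) for s in scales), key=lambda p: p[1])

print("exponential cost:", best_scale(exponential_cost))
for slope in (0.002, 0.001, 0.0005, 0.0001):
    # Bind the slope as a default argument so each lambda keeps its own value.
    print("linear slope", slope, "->", best_scale(lambda s, m=slope: linear_cost(s, m)))
```

With these assumed constants, a steep linear cost gives a lower peak net value than the exponential curve, while a shallow one gives a higher peak at a much larger scale, which is the tipping behavior the charts describe.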

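The contrast between the "nice tractable" case and the Cauchy case (slides 29 through 35) can also be simulated. This is a minimal sketch using only the Python standard library; the sample sizes, checkpoints, and seed are arbitrary choices for illustration. It prints running sample means for Gaussian draws, where the law of large numbers applies, and for standard Cauchy draws, where the mean is undefined and averaging more samples does not settle the estimate.

```python
import math
import random

def gaussian_sample(rng):
    """One draw from a standard normal distribution (finite variance)."""
    return rng.gauss(0.0, 1.0)

def cauchy_sample(rng):
    """One draw from a standard Cauchy distribution via inverse-CDF sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))

def running_means(sampler, n, checkpoints, seed=0):
    """Return {i: mean of the first i samples} for each i in checkpoints."""
    rng = random.Random(seed)
    total, out = 0.0, {}
    for i in range(1, n + 1):
        total += sampler(rng)
        if i in checkpoints:
            out[i] = total / i
    return out

checkpoints = {100, 1_000, 10_000, 100_000}
for name, sampler in (("gaussian", gaussian_sample), ("cauchy", cauchy_sample)):
    means = running_means(sampler, 100_000, checkpoints)
    print(name, {k: round(v, 3) for k, v in sorted(means.items())})
```

The Gaussian running mean should shrink toward zero as the sample count grows. The Cauchy running mean is typically dominated by a few enormous draws and can jump at any checkpoint.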
Editor's notes

  1. Why is big data so fashionable with big and small companies from different industries? What has suddenly changed?
  2. Google searches are up 10x over just four years ago.
  3. Hadoop use is exploding. We chose this example, which shows job trends for Hadoop. Further evidence that you should pay attention during this talk.
  4. But we have seen constant growth for a long time. And simple growth would only explain some kinds of companies starting with big data (probably big ones) and then slow adoption. Databases started with big companies and took 20 years or more to reach everywhere because the need exceeded cost at different times for different companies. The internet, on the other hand, largely happened to everybody at the same time so it changed things in nearly all industries at all scales nearly simultaneously. Why is big data exploding right now and why is it exploding at all?
  5. The different kinds of scaling laws have different shapes, and I think that shape is the key.
  6. The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  7. In classical analytics, the cost of doing analytics increases sharply.
  8. The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
  9. New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
  10. This next sequence shows how the net value changes with different slope linear cost models.
  11. Notice how the best net value has jumped up significantly.
  12. And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.