Big Data



Gaetan Lion
April 5, 2013
Table of Contents

1)   Big Data trends.
2)   How Big is your Data?
3)   Big Data Potential.
4)   Big technologies. New databases.
5)   Big quantitative methods. New stats.
6)   Big Data temperaments.
7)   Is Big always better?



1) Big Data Trends




Cost of Data storage has dropped




Social Media (Facebook & Twitter) has grown exponentially

[Chart: Facebook vs Twitter, number of active users in thousands. Y-axis from 0 to 1,200,000; x-axis in quarters from 2008 to 2013. Series: Facebook, Twitter. Annotation: exponential growth.]

Facebook started in Feb 2004. Has 1 billion active users.
Twitter started in March 2006. Has 500 million users.


Social networks are creating a huge volume of live unstructured data.
Unstructured Data is taking over…




2) How Big is your Data?

•   How Tall is it? How large is your sample (rows)?
•   How Wide is it? How many variables (columns)?
•   What is its Velocity? How frequently is it updated?
•   Does it include unstructured data (documents,
    emails, Social Media)?




3) Big Data Potential




4) Big Technologies.
   New Databases




Database: Structured vs Unstructured

Data type             Database language   Database type    Database structure   Reporting tool
--------------------  ------------------  ---------------  -------------------  ------------------
Structured.           SQL (structured     Relational       Data Warehouse,      Oracle Essbase
Customers,            query language)     database         Data Marts           & IBM Cognos
transactions,
numbers in rows.

Unstructured.         NoSQL               Non-relational   Hadoop               Hadoop Connectors
Social Media,         (not only SQL)      database
Text documents,
Web services

Output of both paths: Reporting, Business Intelligence.

5) Big quantitative methods.
         New Stats




New Stats Map

Predictive Analytics
    Statistics & Regression
        A/B Testing (hypothesis testing)
        Regression
            Spatial Analysis
        Time Series Analysis
            Signal Processing
    Data Mining & Machine Learning (formerly Artificial Intelligence)
        Association Rule Learning
        Cluster Analysis
        Classification
        Neural Networks
            Pattern Recognition
            Optimization
                Genetic Algorithms
        Natural Language Processing
            Sentiment Analysis

Definitions. Part I
Association Rule Learning: method to uncover interesting relationships
by generating and testing possible rules. One application is “market
basket analysis”, where a retailer figures out what products are
frequently bought together. A cited example is that shoppers who buy
diapers often buy beer.
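
As a rough illustration of market basket analysis, the sketch below scores the rule "diapers -> beer" with three common measures: support, confidence and lift. The five baskets are invented for this example and are not data from the presentation; the function names are hypothetical.

```python
# Toy transactions, made up for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent, baskets)
    s_ante = support(antecedent, baskets)
    s_cons = support(consequent, baskets)
    confidence = s_both / s_ante      # P(consequent | antecedent)
    lift = confidence / s_cons        # > 1 means bought together more than chance
    return s_both, confidence, lift

sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, baskets)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if purchases were independent.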
Classification: identifies the categories to which new data belong,
based on an existing data set grouped into predefined categories. It
differs from Cluster Analysis, which starts without predefined categories.
Genetic algorithms: an optimization method inspired by the “survival of
the fittest” process. Potential solutions are encoded as “chromosomes”
that can combine and mutate. The chromosomes are selected for
survival within a modeled “environment.” Example: optimizing the
performance of an investment portfolio.
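
The combine, mutate and select loop described above can be sketched on a toy problem. This is a minimal sketch, not a portfolio optimizer: the "environment" is assumed to reward chromosomes with as many 1-bits as possible, and every name and parameter value here is hypothetical.

```python
import random

def fitness(chromosome):
    """Toy objective: count of 1-bits in the chromosome."""
    return sum(chromosome)

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Selection: rank by fitness and keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: pop_size // 2]
        # Elitism: the current best chromosome survives unchanged.
        children = [population[0][:]]
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)      # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    best_per_generation.append(fitness(best))
    return best, best_per_generation

best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best chromosome is always carried over, the best fitness never decreases from one generation to the next; whether the search reaches the true optimum depends on the settings.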


Definitions. Part II
Natural language processing (NLP): it uses algorithms to analyze text data.
Sentiment Analysis is a common application. It measures customers’
reactions to a product campaign by analyzing social media.
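
One very simple way to approximate sentiment analysis is to count positive and negative words in each post. The sketch below assumes a tiny hand-made word list and three invented posts; production systems rely on much larger lexicons or trained models.

```python
import re

# Tiny hand-made lexicon, assumed for illustration only.
POSITIVE = {"love", "great", "good", "awesome", "happy"}
NEGATIVE = {"hate", "bad", "awful", "terrible", "disappointed"}

def sentiment_score(text):
    """Number of positive words minus number of negative words in one post."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "I love the new phone, great camera!",
    "Terrible battery life. Really disappointed.",
    "It arrived on Tuesday.",
]
scores = [sentiment_score(p) for p in posts]
print(scores)  # [2, -2, 0]
```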
Neural networks: models inspired by the workings of neurons and
synapses within the brain. Used for finding nonlinear patterns. They can
be used for Pattern recognition and Optimization. Examples of neural
network applications include identifying customers that may leave and
identifying fraudulent insurance claims.
Signal processing: an electrical engineering method to analyze signals
(radio, etc…) and distinguish between signal and noise. It is used to extract
the signal from the noise in a set of less precise data [Signal Detection
Theory].
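
To make "extracting the signal from the noise" concrete, here is a minimal sketch using a moving average, one of the simplest smoothing filters. It is offered only as an illustration and is not necessarily the technique the slide has in mind. The data are assumed: a steady level of 10 with alternating +1/-1 noise.

```python
def moving_average(values, window):
    """Smooth a series by averaging each run of `window` consecutive points."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Assumed toy data: a steady level of 10 plus alternating +1/-1 noise.
noisy = [10 + (1 if i % 2 == 0 else -1) for i in range(8)]
smoothed = moving_average(noisy, window=2)
print(noisy)     # [11, 9, 11, 9, 11, 9, 11, 9]
print(smoothed)  # [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
```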




Definitions. Part III
Spatial Analysis: it analyzes geographic location encoded within
the data. The information often comes from GPS. Applications
include spatial regression to estimate a consumer’s willingness to
purchase a product given their location.




6) Big Data Temperaments




Source: Harvard Business Review, April 2012 by Shvetank Shah, Andrew Horne
and Jaime Capella.

7) Is Big always better?




No! says Nate Silver


• “I came to realize that prediction in the era of Big Data was
not going very well.”
• “If the quantity of information is increasing [exponentially]…
Most of it is just noise.”
• He refers to John P. A. Ioannidis’s 2005
paper, “Why Most Published
Research Findings Are False.”
Two-thirds of scientific papers’ results
can’t be replicated!

“… numbers have no way of speaking for
themselves. We speak for them.”
Nate’s targets

• Political pundits. Their “intuitive” election predictions have
  been disastrous. Granted, it was not because of Big Data
  but instead No Data. He showed them how to do it using
  Small Data (polls with samples < 1,000);
• Economic forecasters. They have used Big Data with
  poor results. The majority of them can’t forecast a
  recession already underway. ECRI predicted with certainty
  a double-dip recession in 2011 using tens of variables they
  did not understand. Instead, the economy improved;
• Stock market & financial market forecasters. Their
  performance is similar to that of economic forecasters;
• Earthquake forecasting. The field is not well understood.

  “… Statistical inferences are much stronger when backed
  up by theory… about their root causes.”
No! says Vincent Granville



• Big Data is huge, but information is very sparse;
• Storing and processing the entire data set is very inefficient;
• You can do better by smartly sampling only 5% of the
  data;


You don’t need Big Data, you need Smart Data.
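
The sampling argument can be checked on synthetic data. The sketch below assumes a made-up data set of 100,000 values and draws a plain random 5% sample; "smart" sampling in Granville's sense may be more elaborate than this, so treat it as a baseline illustration only.

```python
import random
import statistics

rng = random.Random(42)
# Synthetic "full" data set, assumed for illustration: 100,000 values.
population = [rng.uniform(0, 100) for _ in range(100_000)]

# Simple random sample of 5% of the rows.
sample = rng.sample(population, k=len(population) // 20)

full_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
print(f"full mean={full_mean:.2f}, 5% sample mean={sample_mean:.2f}")
```

For a summary statistic such as the mean, the 5% sample typically lands very close to the full-data answer at a fraction of the storage and processing cost. Rare events or small subgroups may need a larger or stratified sample.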


Yes! says Chris Anderson


     • He quotes Peter Norvig, Google’s research director: “All models
     are wrong, and increasingly you can succeed without them.”
     • “… with massive data, [the scientific method] is becoming
     obsolete.”
     • “We can throw the numbers into the biggest computing clusters …
     and let statistical algorithms find patterns where science cannot.”
     He mentions examples such as J. Craig Venter’s gene sequencing,
     Google Search, and Google Translate, among other successes.

“With enough data, the numbers speak for themselves.”

“Correlation supersedes causation, and science can advance without
coherent models, unified theories, or … any … explanation at all.”

Big Data Effectiveness Map

Field needing causal understanding, theory not well understood
    Tall data: more data, more noise. Oversampling.
    Wide data: more variables, more false positives. Multicollinearity. Model overfitting.
    Examples: Economics, Financial markets, Earthquake forecasting.

Field needing causal understanding, theory well understood
    Tall data: more data, more signal. Oversampling.
    Wide data: more variables, more explanation. Multicollinearity. Model overfitting.
    Examples: Weather forecasting, Customer behavior.

Rule Based
    Tall and Wide data: more data, better model performance.
    Examples: Games & Sports [Chess, Baseball, etc…], Politics.

Field not needing causal understanding
    Tall and Wide data: more data, better model performance.
    Examples: Google Search, Google Translate, Google Flu Trends, Customer behavior.
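
The "more variables, more false positives" entry can be made concrete with a little arithmetic. The sketch below assumes each candidate variable is tested independently at a 5% significance level and that none has a real effect; both assumptions are simplifications, and the function name is hypothetical.

```python
def chance_of_false_positive(n_variables, alpha=0.05):
    """P(at least one spurious 'significant' result) across independent tests."""
    return 1 - (1 - alpha) ** n_variables

for k in (1, 10, 50, 100):
    print(k, round(chance_of_false_positive(k), 3))
```

Under these assumptions, testing ten unrelated variables already gives roughly a 40% chance of at least one spurious finding, and with 100 variables a false positive is almost certain.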
