Big Data 101
    Bouvet BigOne, 2013-03-14
    Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga

What is big data?

     “Big Data is any thing which is crash Excel.”
     “Small Data is when is fit in RAM. Big Data is when is crash because is
     not fit in RAM.”
         https://twitter.com/devops_borat

     Or, in other words, Big Data is data in volumes too great to process by
     traditional methods.

Data accumulation

    • Today, data is accumulating at tremendous
      rates
       –   click streams from web visitors
       –   supermarket transactions
       –   sensor readings
       –   video camera footage
       –   GPS trails
       –   social media interactions
       –   ...
    • It really is becoming a challenge to store
      and process it all in a meaningful way

From WWW to VVV

    • Volume
      – data volumes are becoming unmanageable
    • Variety
      – data complexity is growing
      – more types of data captured than previously
    • Velocity
      – some data is arriving so rapidly that it must either
        be processed instantly, or lost
      – this is a whole subfield called “stream processing”
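
A minimal sketch of the stream-processing idea, assuming readings arrive one at a time and cannot be stored: each value updates a small running summary and is then dropped. The function name `running_mean` and the sample values are illustrative assumptions, not part of the talk.

```python
from typing import Iterable, Iterator


def running_mean(readings: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all readings seen so far, one value per reading.

    Only a count and a total are kept, so memory use stays constant
    however long the stream runs.
    """
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor stream; in practice this could be a socket or a queue.
stream = iter([2.0, 4.0, 6.0, 8.0])
means = list(running_mean(stream))
print(means)
```

Because the generator never holds more than one reading, the same code works whether the stream has four values or four billion.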




The promise of Big Data

• Data contains information of great
  business value
• If you can extract those insights you can
  make far better decisions
• ...but is data really that valuable?
“quadrupling the average cow's
     milk production since your parents
     were born”



     "When Freddie [as he is known]
     had no daughter records our
     equations predicted from his DNA
     that he would be the best bull,"
     USDA research geneticist Paul
     VanRaden emailed me with a
     detectable hint of pride. "Now he is
     the best progeny tested bull (as
     predicted)."




Ok, ok, but ... does it apply to our customers?
     • Norwegian Food Safety Authority
        – accumulates data on all farm animals
        – birth, death, movements, medication, samples, ...
     • Hafslund
        – time series from hydroelectric dams, power prices,
          meters of individual customers, ...
     • Social Security Administration
        – data on individual cases, actions taken, outcomes...
     • Statoil
        – massive amounts of data from oil exploration,
          operations, logistics, engineering, ...
     • Retailers
        – see Target example above
        – also, connection between what people buy, weather
          forecast, logistics, ...
How to extract insight from data?




     [Figure: Monthly Retail Sales in New South Wales (NSW) Retail Department Stores]
Estimating real estate prices

     • Take parameters
        –   x1    square meters
        –   x2    number of rooms
        –   x3    number of floors
        –   x4    energy cost per year
        –   x5    meters to nearest subway station
        –   x6    years since built
        –   x7    years since last refurbished
        –   ...
     • a x1 + b x2 + c x3 + ... = price
        – strip out the x-es and you have a vector
        – collect N samples of real flats with prices = matrix
        – welcome to the world of linear algebra
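
The linear-algebra step above can be sketched as follows, assuming a plain least-squares fit of the coefficients through the normal equations. The helper names `solve` and `fit`, the two-parameter flats, and the prices are made up for illustration; a real data set would have more parameters, noisy prices, and would normally use a numerical library.

```python
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the square system a * w = b by Gaussian elimination with pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit(samples: List[List[float]], prices: List[float]) -> List[float]:
    """Least-squares coefficients from the normal equations (X^T X) w = X^T y."""
    k = len(samples[0])
    xtx = [[sum(row[i] * row[j] for row in samples) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(samples, prices)) for i in range(k)]
    return solve(xtx, xty)


# Made-up flats: [square meters, number of rooms]. Prices are generated from
# price = 50 * x1 + 100 * x2, so the fit should recover roughly 50 and 100.
flats = [[50.0, 2.0], [80.0, 3.0], [120.0, 5.0], [65.0, 2.0]]
prices = [50.0 * x1 + 100.0 * x2 for x1, x2 in flats]
coefficients = fit(flats, prices)
print([round(c, 6) for c in coefficients])
```

Each row of `flats` is one sample vector, the list of rows is the matrix, and the fitted `coefficients` play the role of a, b, c, ... in the formula. The sketch assumes the parameters are not linearly dependent; otherwise `solve` divides by zero.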
Types of algorithms

     •   Clustering
     •   Association learning
     •   Parameter estimation
     •   Recommendation engines
     •   Support Vector Machines
     •   Similarity matching
     •   Neural networks
     •   Bayesian networks
     •   Genetic algorithms


Basically, it’s all maths...

     •   Linear algebra
     •   Calculus
     •   Probability theory
     •   Graph theory
     •   ...

     “Only 10% in devops are know how of work with Big Data. Only 1% are
     realize they are need 2 Big Data for fault tolerance”
         https://twitter.com/devops_borat

Big data skills gap

     • Hardly anyone knows this stuff
     • It’s a big field, with lots and lots of theory
     • And it’s all maths, so it’s tricky to learn




     http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
     http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap

Two orthogonal aspects

     • Analytics / machine learning
       – learning insights from data
     • Big data
       – handling massive data volumes
     • Can be combined, or used separately




How to process Big Data?

     • If relational databases are not enough,
       what is?

     “Mining of Big Data is problem solve in 2013 with zgrep”
         https://twitter.com/devops_borat

MapReduce

     • A framework for writing massively parallel
       code
     • Simple, straightforward model
     • Based on “map” and “reduce” functions
       from functional programming (LISP)
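
A single-machine sketch of the model, assuming the classic word-count example: a map function emits (key, value) pairs, the framework groups them by key, and a reduce function folds each group into one result. The function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative; in a real framework the grouping step is done for you and the phases run in parallel across many machines.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values: List[int]) -> int:
    """Reduce: fold all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


lines = ["big data is big", "small data is small"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = {word: reduce_phase(values) for word, values in shuffle(pairs).items()}
print(counts)
```

Since `map_phase` looks at one line at a time and `reduce_phase` at one key at a time, neither needs to see the whole data set, which is what makes the model easy to parallelize.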




Things you can do in MapReduce

     • Google’s PageRank algorithm
       – easily expressible in MapReduce
       – one of the first applications of MapReduce
     • SQL
       – relational algebra has straightforward translation
         to the MapReduce model
     • Linear algebra
       – matrix operations are easily MapReducible
       – (PageRank is just a bunch of matrix operations)
     • Recommendation engines
       – also MapReducible (e.g. frequent itemsets via the SON algorithm)
       – ...
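
To illustrate the point that matrix operations map onto this model, here is a sketch of a sparse matrix-vector multiplication written as a map step and a reduce step, repeated as a simplified PageRank power iteration. The three-page link graph, the function names, and the omission of the damping factor are all simplifying assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[int, int, float]  # (row, column, value) of a sparse matrix


def map_entry(entry: Entry, vector: List[float]) -> Tuple[int, float]:
    """Map: one matrix entry contributes value * vector[column] to its row."""
    row, col, value = entry
    return row, value * vector[col]


def reduce_rows(pairs: List[Tuple[int, float]], size: int) -> List[float]:
    """Reduce: sum the contributions emitted for each row."""
    sums: Dict[int, float] = defaultdict(float)
    for row, contribution in pairs:
        sums[row] += contribution
    return [sums[i] for i in range(size)]


# Hypothetical 3-page web: column j holds the link probabilities out of page j.
# Page 0 links to pages 1 and 2, page 1 links to page 2, page 2 links to page 0.
links: List[Entry] = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 1.0), (0, 2, 1.0)]
rank = [1 / 3, 1 / 3, 1 / 3]

for _ in range(50):  # power iteration without damping, for brevity
    rank = reduce_rows([map_entry(e, rank) for e in links], len(rank))

print([round(r, 3) for r in rank])
```

Each matrix entry is processed independently in the map step, so the entries can be spread over as many machines as needed; only the per-row sums have to be brought together.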
NoSQL and Big Data

     • Not really that relevant
     • Traditional databases handle big data sets,
       too
     • NoSQL databases have poor analytics
     • MapReduce often works from text files
        – can obviously work from SQL and NoSQL, too
     • NoSQL is more for high throughput
        – basically, AP from the CAP theorem, instead of CP
     • In practice, really Big Data is likely to be a
       mix
        – text files, NoSQL, and SQL
The 4th V: Veracity

     “The greatest enemy of knowledge is not
     ignorance, it is the illusion of knowledge.”
                        Daniel Boorstin, in The Discoverers (1983)



     “95% of time, when is clean Big Data is get Little Data”
         https://twitter.com/devops_borat

Data quality

     • A huge problem in practice
       – any manually entered data is suspect
       – most data sets are in practice deeply problematic
     • Even automatically gathered data can be a
       problem
       – systematic problems with sensors
       – errors causing data loss
       – incorrect metadata about the sensor
     • Never, never, never trust the data without
       checking it!
       – garbage in, garbage out, etc
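
As a sketch of what "checking it" can mean for automatically gathered data, the function below scans sensor readings for missing values, values outside a plausible range, and timestamps that do not increase. The record layout, the `check_readings` name, and the temperature limits are assumptions chosen for the example; real checks depend on the sensor and the domain.

```python
from typing import Dict, List, Optional

Reading = Dict[str, Optional[float]]


def check_readings(readings: List[Reading], low: float, high: float) -> List[str]:
    """Return a human-readable problem report for a list of sensor readings."""
    problems: List[str] = []
    previous_time: Optional[float] = None
    for index, reading in enumerate(readings):
        value = reading.get("value")
        time = reading.get("time")
        if value is None:
            problems.append(f"row {index}: missing value")
        elif not low <= value <= high:
            problems.append(f"row {index}: value {value} outside [{low}, {high}]")
        if time is None:
            problems.append(f"row {index}: missing timestamp")
        else:
            if previous_time is not None and time <= previous_time:
                problems.append(f"row {index}: timestamp not increasing")
            previous_time = time
    return problems


# Made-up temperature readings with three deliberate faults.
readings: List[Reading] = [
    {"time": 1.0, "value": 20.5},
    {"time": 2.0, "value": None},      # lost reading
    {"time": 2.0, "value": 21.0},      # duplicated timestamp
    {"time": 3.0, "value": 999.0},     # sensor glitch
]
report = check_readings(readings, low=-40.0, high=60.0)
for line in report:
    print(line)
```

The function only reports problems and leaves the decision to drop, repair, or keep a row to the caller, since the right choice differs from data set to data set.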

Conclusion

     • Vast potential
        – in both big data and machine learning
     • Very difficult to realize that potential
        – requires mathematics, which nobody knows
     • We need to wake up!




Where to learn more

     • University of Oslo
       – has courses on linear algebra, probability, graph
         theory, ...
     • Stanford University
       – https://www.coursera.org/course/ml
     • Mining Massive Datasets
       – http://infolab.stanford.edu/~ullman/mmds.html




