The causes and consequences of too many bits
A brief exploration of the causes and consequences of a wealth of data.




Dipesh Lall




dipeshlall@gmail.com
© 2011
Before we start our journey, a bit about a bit, a byte, and lots of bytes.


   •   A bit (b) is short for binary digit, after the binary code (1 or 0) that computers use to store and process data.
   •   Binary means base 2, just as decimal means base 10.
   •   A byte (B) is the basic unit of computing, used to represent an English letter or number in computer code. One byte
       equals 8 bits.


   Unit             Size
   Bit (b)          1 or 0
   Byte (B)         8 bits
   Kilobyte (KB)    1,000 bytes   (~2^10 bytes)
   Megabyte (MB)    1,000 KB      (~2^20 bytes)
   Gigabyte (GB)    1,000 MB      (~2^30 bytes)
   Terabyte (TB)    1,000 GB      (~2^40 bytes)
   Petabyte (PB)    1,000 TB      (~2^50 bytes)
   Exabyte (EB)     1,000 PB      (~2^60 bytes)
   Zettabyte (ZB)   1,000 EB      (~2^70 bytes)
   Yottabyte (YB)   1,000 ZB      (~2^80 bytes)




   •   One page of typed text is roughly 2 KB.
   •   All books catalogued in the US Library of Congress total around 15 TB.
   •   Google processes about 1 PB every hour.
   •   Monthly internet data flows run at around 21 EB.
   •   The total amount of information in existence is around 1.2 ZB.
   •   A YB is currently too big to imagine (as per The Economist).
   •   The International Bureau of Weights and Measures sets the names of the prefixes (a small conversion sketch follows).
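As a minimal sketch of the decimal prefixes in the table above, here is a small Python helper that turns a raw byte count into a human-readable unit. The function name is illustrative, and real tools often use binary (1,024-based) units instead.

```python
# Convert a raw byte count into the decimal (SI) prefixes from the table above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(num_bytes: float) -> str:
    """Return a human-readable string such as '1.20 ZB'."""
    size = float(num_bytes)
    for unit in UNITS:
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.2f} {unit}"
        size /= 1000.0           # step up to the next prefix

print(humanize(2_000))           # ~one page of typed text -> '2.00 KB'
print(humanize(1.2 * 1000**7))   # ~all information in existence -> '1.20 ZB'
```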




A perfect storm of forces is conspiring to generate a lot of data.
        [Six small trend charts, all plotted against time:]
        Data storage costs are falling ($/TB)…               …data-creating devices are growing (# of hosts)…
        …data processing costs are falling ($/GFLOPS)…       …connectivity is growing (degree of connectivity)…
        …data moving costs are falling ($/Mbps)…while…       …performance expectations are rising (speed of response).

        Together these forces yield a large volume of data, of rich variety, arriving at various speeds: “Big Data”.

        Please note that the slopes of the various lines differ, but they are directionally correct.
Almost everything is instrumented, which means data is being generated in
various formats, at various speeds, and in various volumes.


•   Structured data (tables, records)
•   Semi-structured data (XML and similar standards)
•   Complex data (hierarchical or legacy sources)
•   Event data (messages)
•   Unstructured data (human language, audio, video)
•   Social media data (blogs, tweets, social networks)
•   Web logs and click streams
•   Spatial data (long/lat, GPS)
•   Machine-generated data (sensors, RFID, devices, server logs)
•   Scientific data (genomes, proteomics, astronomy)

    [Diagram: the three Vs of this growth - Volume, Velocity, Variety.]
Now all this data is pure cost unless it is transformed into information from
which insights can be drawn and right action taken to create or protect value.


•   The information value chain depicts the various stages in the journey of data from its creation to use:



                          Data → Information → Insights → Decisions → Action → Value


•   At each stage of the value chain the right mix of business processes, human skills and technology capabilities is
    needed.
•   Relational database management systems (RDBMS) date back to the early 70s. RDBMS have worked well for
    transactional and structured data because this type of data can be stored in table format, with relationships
    between and amongst the tables. The query language for RDBMS was developed at IBM (in San Jose) and was
    initially called SEQUEL (Structured English Query Language); it is now called SQL (see the sketch after this list).
•   As more of the data generated shifts from structured to other formats, the traditional methods of managing data
    become impractical.
•   So here is what has happened in the management of data over time.
     – Vertical scaling…bigger RDBMS machines…more disk space, more horsepower, big data centers.
     – New methods, called horizontal scaling, arrived as vertical scaling reached its limit from a data-volume
       standpoint…so came Massively Parallel Processing (MPP) machines.
     – But then came unstructured data (variety) and streaming data (velocity), so what was needed was a whole
       new way to manage data…Big Data (BD).
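To ground the RDBMS point above, here is a minimal sketch using Python's built-in sqlite3 module. The two-table schema is hypothetical, chosen only to show structured data held in related tables and queried with SQL.

```python
import sqlite3

# Hypothetical schema: structured, transactional data with a relationship
# (foreign key) between two tables -- the RDBMS sweet spot.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.50)")

# SQL joins the related tables back together at query time.
row = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchone()
print(row)  # ('Ada', 25.5)
```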




How do RDBMS really work (for the most part)?


•   Multiple interfaces
•   Slow…disk drives need time to read and write
•   Sequential
•   Indexing is a big challenge
•   The schema is not flexible



      Data is generated in multiple channels → Data is stored in databases → Data is aggregated in data warehouses →
      Data is analyzed in analytical applications → Information is reported


•   So the solution is to remove all these boxes (no pun intended) and get analytics as close as possible to the data.
    Hence you hear terms like in-database analytics (analytics moving into the database) or in-memory analytics (the
    database moving into memory), as sketched below.



      Data is generated via multiple channels → Data is stored, aggregated and analyzed on a single platform →
      Information is reported
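A minimal sketch of the "bring analytics to the data" idea, again with Python's built-in sqlite3: the database lives in memory and the aggregation runs inside the database engine rather than in application code. Table and column names are illustrative.

```python
import sqlite3

# An in-memory database: the data and the analytics live on one platform.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (page TEXT, ms INTEGER)")
db.executemany("INSERT INTO clicks VALUES (?, ?)",
               [("home", 120), ("home", 90), ("cart", 300)])

# "In-database analytics": the aggregation is pushed into the engine,
# rather than pulling every row out and summarizing in application code.
for page, avg_ms in db.execute(
        "SELECT page, AVG(ms) FROM clicks GROUP BY page ORDER BY page"):
    print(page, avg_ms)   # cart 300.0 / home 105.0
```
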
RDBMS cannot scale indefinitely because their intrinsic constraints run up against a
humbling rule: you cannot have everything in life and you have to choose.


•   RDBMS rely on the ACID principle (a small atomicity sketch follows this list)
      – Atomicity: All or nothing
      – Consistency: Every transaction takes the database from one valid state to another without impairing referential integrity
      – Isolation: Other operations cannot access data while a transaction is midstream
      – Durability: Ability to recover from system failure
•   Vertically scaled RDBMS do honor the ACID principle, but horizontally scaled RDBMS (MPP machines) struggle to.
    The underlying limit is the CAP Theorem. It says that a distributed database system can have at most two of the
    following three
      – Consistency, which means every read sees the most recent write (all nodes return the same data).
      – Availability, which means a node failure does not prevent surviving nodes from completing the task.
      – Partition tolerance (the distributed part), which means the system continues to operate despite arbitrary
        message loss.
•   The two bullets above mean that as you scale an RDBMS system you run into a wall…actually a cap!
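Here is a minimal sketch of the atomicity ("all or nothing") property using Python's built-in sqlite3 on a single node; the account table is hypothetical. It illustrates ACID only, not the distributed trade-offs the CAP theorem describes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])

try:
    with db:  # one transaction: commits on success, rolls back on error
        db.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'a'")
        raise RuntimeError("crash mid-transfer")  # simulated failure
        db.execute("UPDATE accounts SET balance = balance + 60 WHERE name = 'b'")  # never runs
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back, so nothing was lost.
print(db.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('a', 100), ('b', 0)]
```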




Therefore, RDBMS are not good at performing all types of analysis.


 •       We need scalable database models that are not dependent on a fixed data schema.




        [Diagram: three stages of database scaling, driven by growth in volume, then velocity and variety:]
        •   Vertical scaling: one app against one ever-bigger database.
        •   Horizontal scaling: many apps spread across many database nodes (driven by volume growth).
        •   Schema-agnostic scaling: the need for a new data architecture (driven by velocity and variety growth on top of volume).
The rich variety of data intruded to make data management a painnus posteriorus*.



  •   While the volume and velocity of the data are growing rapidly, it is the growing variety of data
      that is the complexity multiplier in the management of all these bits.
  •   RDBMS and MPP approaches exhausted the ability of current architectures to process the
      torrent of bits flowing.
  •   Hence arrived what I call Big Data Architecture (BDA).
  •   BDA does not replace existing investments in data management; BDA complements them,
      so there is no need to rip-and-replace; it is more insert-and-augment.
  •   BDA started in companies that had BD, essentially internet companies like Yahoo,
      Google, Facebook, Amazon, Twitter and LinkedIn that needed web-scale solutions to their data
      problems. They built this from scratch because there was nothing commercially available.
  •   This revolution was called NOSQL (Not Only SQL)
        •    The “NO” means that it is a technology that works in addition to SQL, not instead of it.
  •   NOSQL databases were organically developed…they are essentially schema agnostic…meaning that some of
      the constraints of SQL databases are negotiated well.

  [Chart: the three growth vectors - Volume (bad), Velocity (badder), Variety (baddest).]

*: painnus posteriorus is a contemporary acute discomfort of the lower thorax induced by unrelenting bit storms
NOSQL solves the complexity, volume and speed constraints of an SQL design
by using four different data models.


•   Key-value stores are a schema-less model for storing data (see the sketch after the table below).
•   Big table clones are compressed, high-performance database systems modeled on Google's Bigtable, which sits on the Google File System.
•   Document databases are a method for storing semi-structured data.
•   Graph databases use graph structures (nodes, edges, etc.) that provide index-free lookups.



                                                NOSQL model

      Key-value stores       Big table clones       Document databases      Graph databases

      Based on               Based on               Based on                Based on
      Amazon Dynamo          Google BigTable        Amazon Dynamo           Graph Theory

      Memcached              HBase                  Lotus Domino            AllegroGraph
      Dynamo                 Cassandra              CouchDB                 VertexDB
      Voldemort              HyperTable             MongoDB                 Neo4J
      Tokyo Cabinet          AzureTS                Riak                    Active RDF
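As a toy illustration of the key-value model's schema agnosticism, here is a dictionary-backed sketch in Python. It imitates the general get/put shape of key-value stores; it is not the actual API of Memcached, Dynamo, Voldemort or Tokyo Cabinet, and the class and key names are made up.

```python
import json
from typing import Optional

class ToyKeyValueStore:
    """Schema-agnostic storage: keys map to opaque blobs; no fixed columns."""
    def __init__(self):
        self._data = {}   # a real store would partition and replicate this

    def put(self, key: str, value: dict) -> None:
        self._data[key] = json.dumps(value)   # the value's shape is up to the caller

    def get(self, key: str) -> Optional[dict]:
        blob = self._data.get(key)
        return json.loads(blob) if blob is not None else None

store = ToyKeyValueStore()
store.put("user:1", {"name": "Ada", "follows": ["user:2"]})
store.put("event:99", {"type": "click", "ms": 120})   # a different shape, same store
print(store.get("user:1"))
```
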
BDA is actually very effective.


•   Yahoo tested BDA by calculating Pi to the 2,000,000,000,000,000th digit.
•   It used 1,000 computers and the calculation took 23 days. That is 23,000 computing days.
•   Using an RDBMS, it would have taken a single PC about 500 years, which is roughly 182,621 computing days. That
    is an ~87% reduction in total computing effort (a very rough back-of-the-envelope calculation, spelled out below).
•   So yes, BDA works.
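Spelling out the back-of-the-envelope arithmetic behind the ~87% figure (taking a year as roughly 365.24 days); note that elapsed time fell even more dramatically, from centuries to weeks.

```latex
% Rough check of the ~87% reduction claim (in machine-days of effort)
\begin{align*}
  \text{Cluster effort}   &= 1{,}000 \text{ machines} \times 23 \text{ days} = 23{,}000 \text{ computing days} \\
  \text{Single-PC effort} &\approx 500 \text{ years} \times 365.24 \tfrac{\text{days}}{\text{year}} \approx 182{,}621 \text{ computing days} \\
  \text{Reduction}        &= 1 - \frac{23{,}000}{182{,}621} \approx 0.874 \approx 87\%
\end{align*}
```
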
BDA works by breaking a problem into pieces, analyzing each piece separately
and then aggregating the results into a single response.

•       HADOOP is an instance of NOSQL that has two main parts: MapReduce and HDFS (a toy sketch follows the diagram below)
         •  MapReduce means mapping a problem to worker nodes and then aggregating (reducing) the results
         •  HDFS (Hadoop Distributed File System) is the file system that makes MapReduce work




        [Diagram: Problem → Master node splits it into Piece 1…Piece n → Worker nodes process the pieces in
        parallel (Map phase) → Master node aggregates the partial results (Reduce phase) → Result.]

        Example applications:
        •     Google searches
        •     Amazon recommendations
        •     Paypal real-time fraud detection
        •     Credit card unauthorized charges
        •     Loopt
        •     Directions from office to bar/pub…nearest vs. cheapest
        •     Genomics searching (needle-in-a-haystack)
        •     Zynga gaming
        •     Facebook Friends
        •     LinkedIn People-you-may-know (PYMK)
        •     GPS directions (as you drive)
        •     …
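To make the map/reduce pattern concrete, here is a toy word-count sketch in plain Python. It is not the Hadoop MapReduce API: worker processes stand in for worker nodes, and the piece and function names are illustrative only.

```python
from collections import Counter
from multiprocessing import Pool

def map_piece(piece: str) -> Counter:
    """Map phase: each worker counts words in its own piece of the problem."""
    return Counter(piece.split())

def reduce_counts(partials) -> Counter:
    """Reduce phase: aggregate the per-piece results into a single answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    problem = ["big data big bits", "bits and bytes", "big bytes"]   # the pieces
    with Pool(processes=3) as workers:           # worker "nodes" (here: processes)
        partials = workers.map(map_piece, problem)
    print(reduce_counts(partials))               # Counter({'big': 3, 'bits': 2, ...})
```

The same shape, a map function applied per piece plus an associative reduce over the partial results, is what lets Hadoop spread the pattern across thousands of machines.
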
What does the BDA landscape look like?


•   It depends on what the need is but here is a simple graphic that shows the various elements. This is only
    illustrative.



        Data presentation:     Visualization / Mobile / R
        Data processing:       Hadoop (batch); S4, Storm (streaming)
        Data query:            Pig, Hive
        Processing scheduler:  Azkaban, Oozie
        Database:              Voldemort, Cassandra, HBase
        Data collection:       Kafka, Flume, Scribe

        Supporting services:   coordination (ZooKeeper); displaying and monitoring logs (Chukwa);
                               the Hadoop job tracker and task tracker sit alongside the processing layers.
BDA does not mean you need to throw away your investments in
traditional data analytics infrastructure.


•   BDA works alongside existing investments made by companies…not rip-and-replace!




                          [Diagram: BDA sits alongside the traditional BI infrastructure, which continues to
                          handle reporting & distribution.]
Even NOSQL is getting challenged, but for now we got-to-dance-with-them-
what-brung-you.


•   Zynga needs an additional 1,000 servers every week for its data needs.
•   Every search string you send to Google is divided and sent to 700-1,000 servers so that you can get your response
    back in a fraction of a second and thus not waste the few seconds in which you could have destroyed civilization.
•   YouTube serves 1 billion videos every day.
•   2.5 billion photos are uploaded each month to Facebook.
•   ~150,000 zombie computers are created every day (used in botnets for sending spam).
•   At the beginning of 2009 there were 187 million web sites. At the end of 2009 there were 234 million web sites: 25%
    growth.
And what is next?




  Big Data + Context + Interactivity =
Smart Data…


…which will make Minority Report scenarios look like…
…Pong.




New skills you should consider in the world of Big Data




 – Cultivate expertise but be a strong generalist
 – Develop and grow relationships and networks
 – Develop communication skills
 – Refine presentation skills
 – Read up, a lot
 – Monitor competition
 – Understand business, I mean really understand it
 – Love the edge
 – Step outside your comfort zone, frequently
 – If you have the appetite, read a book or two on statistics
 – Think laterally, this just means do not be afraid to connect the dots

 Embrace* ambiguity




* At a minimum, learn to accept ambiguity


