Building Big Data Applications Services for Private Clouds

Richard McDougall
Chief Architect, Storage and Application Services
VMware, Inc
@richardmcdougll




                                                    © 2009 VMware Inc. All rights reserved
Infrastructure, Apps and now Data…




                            Build    Run
     Private
               Public


                                Manage



Simplify Infrastructure   Simplify App Platform
                                                  Simplify Data
     With Cloud              Through PaaS




 2
Trend 1/3: New Data Growing at 60% Y/Y

Exabytes of information stored                                           20 Zetta by 2015

                                                                         1 Yotta by 2030

                                                                         Yes, you are part
                                                                         of the yotta
                                                        audio            generation…
                                                  digital tv
                                               digital photos
                                       camera phones, rfid
                                   medical imaging, sensors
                  satellite images, logs, scanners, twitter
       cad/cam, appliances, machine data, digital movies



                                                          Source: The Information Explosion, 2009
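The growth figures on this slide can be related with a little compound-growth arithmetic. The sketch below takes the slide's numbers (60% per year, 20 Zetta in 2015, 1 Yotta in 2030) purely as inputs; the function names are illustrative, and the unit conversion assumes 1 Yotta = 1000 Zetta.

```python
import math


def implied_annual_growth(start, end, years):
    """Constant yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


def doubling_time_years(annual_growth):
    """Years needed to double at a constant yearly growth rate."""
    return math.log(2) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    # 20 Zetta (2015) to 1 Yotta = 1000 Zetta (2030), i.e. over 15 years.
    rate = implied_annual_growth(20.0, 1000.0, 2030 - 2015)
    print(f"implied growth between the two milestones: {rate:.1%} per year")
    print(f"doubling time at 60% Y/Y: {doubling_time_years(0.60):.2f} years")
```

The two results describe different quantities (growth of new data versus total information stored), so they are not expected to match.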


3
Data Growth in the Enterprise




4
Trend 2/3: Big Data – Driven by Real-World Benefit




5
Enterprise: Early Adopter Industries and Use Cases




6
Early Adopters: Enterprise Segmentation

Verticals                      Targets                       Use Cases

•    Financial Services        •    Existing Hadoop Users    •    Business Trend Analytics
•    Retail                    •    Business Analysts        •    Revenue analytics
•    Telco                     •    Data Scientists          •    CDR, call pattern analytics
•    Manufacturing             •    LOB managers             •    Sensor data analytics
•    Government                •    IT/Ops                   •    Log, machine data analytics
                                                             •    Fraud detection
                                                             •    Homeland security
                                                             •    Predictive analytics




     7
Early Adopters: Non-enterprise Segmentation

Verticals                      Targets                       Use Cases

•    Online Advertising        •    End users/Exec users     •    Behavioral Analytics
•    eCommerce                 •    Business Analysts        •    Audience segmentation
•    Mobile                    •    PM, LOB managers         •    Revenue Optimization
•    Social Media              •    Marketing/Sales          •    User activity monetization
•    Gaming                    •    Data Engineers           •    Inventory, price management
                               •    Data Scientists          •    Recommendations
                               •    IT/Operations            •    Predictive analytics




     8
Why now? More transactions (Social/Mobile/Local)



                        Size of data       Communications        Transactions

SoMoLo                  500 TB data/day    30B messages/month    35 check-ins/sec
                                                                 13k API calls/sec

Big “traditional”       1 TB data/day      3.7B calls/month      10k card transactions/sec
companies
    9
Trend 3/3: Value from Data Exceeds Hardware Cost

•  Value from the intelligence of data analytics now outstrips the cost
  of hardware
  •  Hadoop enables the use of 10x lower cost hardware
  •  Hardware cost halving every 18 months
                                                         Value
                 Big Iron:
                 $40k/CPU

                                                                 Commodity
                                                                 Cluster:
                                                                 $1k/CPU
                                       Cost
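The "halving every 18 months" claim is an exponential decay, which a few lines of arithmetic make concrete. This is a minimal sketch; the function names are illustrative, and the $40k and $1k per-CPU figures are taken from the slide purely as inputs.

```python
import math


def cost_after(initial_cost, months, halving_period_months=18.0):
    """Cost after `months`, if cost halves every `halving_period_months`."""
    return initial_cost * 0.5 ** (months / halving_period_months)


def months_to_reach(initial_cost, target_cost, halving_period_months=18.0):
    """Months of halving needed to fall from `initial_cost` to `target_cost`."""
    return halving_period_months * math.log2(initial_cost / target_cost)


if __name__ == "__main__":
    # A $1k/CPU commodity node after three years of halving every 18 months.
    print(f"${cost_after(1000.0, 36):.0f} per CPU after 36 months")
    # Hypothetical: time for halving alone to close a 40x price gap.
    print(f"{months_to_reach(40000.0, 1000.0):.1f} months to fall 40x")
```

On these assumptions, halving alone would take roughly eight years to close a 40x gap, which is why a 10x drop from switching hardware class matters.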




 10
The Old Big Data Stack                                   Business
                                                        Intelligence




                    Extract, Transform,
                                                                Data
                                          Statistics
                           Load           (SAS, SPSS)       Visualization
                       (Informatica)                       (Crystal, Bus O)
          Files

           SQL
        Databases           E
                            T
                            L                Column Oriented
                                            Relational Database
                                          (Oracle, Teradata, DB2)
                       Master Data
                      Management
                      (Oracle, SAP)




11
The Old Big Data Stack

•  Unable to handle large data volumes & diversity of data
•  Iterative, brute-force and slow process
•  Lack of ad-hoc data navigation across events and time
•  Cumbersome ETL to “process” and DBAs to “prepare”
•  Focused on structured data that is warehoused
•  Web analytics solutions force real-time events into rigid schemas in DBs

[Diagram: Files; SQL Databases; Extract, Transform, Load (Informatica); Master Data Management (Oracle, SAP); Column Oriented Relational Database (Oracle, Teradata, DB2); Statistics (SAS, SPSS); Data Visualization (Crystal, Bus O); Business Intelligence]


 12
The Journey To Big Data Analytics




1                              2                                  3
     All Data                      Data Science                       Real Time Decisions
     Faster Answers                Collaboration                      New Applications
     Elastic & Scalable            Self-Service                       Data Monetization


                                                                  Big Data Enabled Apps

                                   Agile Process & Tools


        Analytics Engines      Analytic Productivity Platform


       Cloud Infrastructure

        BI As A Service                Agile Analytics             Predictive Enterprise
          Technology Focus          People & Productivity Focus        Application Focus

          Goal: encourage           Goal: discover meaningful        Goal: operationalize
          experimentation                  insights that                those insights
          with existing data          impact the business           as quickly as possible


13
Customer profiles

1.  Business analysts, LOB managers, execs
     •  Need: out-of-the-box analytics
     •  Designed for: self-service for end-user leveraging app
       developers
2.  Data engineers/analysts
     •  Need: out-of-the-box + some customization
     •  Designed for: admin + operations
3.  Data scientists
     •  Need: power capabilities + heavy customization
     •  Designed for: data scientists
4.  IT, Operations
     •  Need: out-of-the-box + some customization
     •  Designed for: IT/admin, ops


14
What is Data Science and Data Engineering?




        Data Science & Data Engineering sit at the intersection of:

          •  Distributed, Parallelization Algorithm & programming Skills
          •  Math and Statistical Knowledge
          •  Business Domain and Problem Understanding
          •  Vertical or Horizontal Use case and Analytics Experience




15
What is Driving Big Data?




 Structured

                                                                                           Largely
                                                                                           Unstructured

Semi-structured




      Source: IBM and Oxford Survey: Getting Closer to Customers Tops Big Data Agenda, October 17, 2012




 16
Today’s Big Data System:


                Real Time
                 Streams


                     Real-Time
                     Processing
                      (S4, Storm)
                                                   Analytics

     ETL
                                                             Data
                            Real Time                       Parallel
                            Structured      Big SQL          Batch
                            Database                      Processing




                            Unstructured Data (HDFS)



17
The Unified Analytics Cloud Platform



          MADlib
                             Analytics Tools       Karmasphere
              Datameer                                       Tableau

         Hadoop R              Developer           Spring
                                                               PaaS
          Python              Frameworks       Cloud Foundry

        Cassandra                                    HBase
                    HDFS   Database/DataStore
           HAWQ                                        Impala


         Data-Director
                              Data Platform           Data PaaS
            EMC Chorus



             vSphere       Cloud Infrastructure
                                                     Private
                                                                 Public




18
The New Big Data System                                            Business
                                                                   Intelligence




                    Real Time
                     Streams

                           Automated
                            Models


                             Real-Time
                              Stream                     Data Visualization
                            Processing                       (Excel, Tableau)

               E
      Common
       Query   T                    Real Time        Structured         Unstructured
               L                    Structured          Data
                                                                         and Batch
                                                                         Processing
                                    Database           Engine            (Hadoop, Hive)
Federated Query
(SQL aggregation)
                                Structured and Unstructured Data
                                            (HDFS, S3)

                                      Cloud Infrastructure
                      Compute            Storage         Networking

 19
An Example – Automated Performance Management

    10M
Performance
 Stats/min




                           Trigger
                           Models



                                                            Batch
                                                           Baseline
                                                          Calculation




                                     Stats Database

                                 Cloud Infrastructure
                 Compute             Storage          Networking

20
Big (Data) problems: becoming the standardized stack

                                 Google         Facebook   Yahoo          LinkedIn    Cloudera    Twitter

Metadata                         Dremel         Hive       Hive                       Hive
Schedule & pipeline workloads    Evenflow       Databee    Oozie          Azkaban     Oozie
Dataflow/queries                 -/Sawzall      /Hive      Pig/Hive       Pig         Pig/Hive    Cascading
More-structured data store       Bigtable       HBase      HBase          Voldemort   HBase       Cassandra
DB data collection/integration   MySQL gateway                            Sqoop       Sqoop
Event data collection                           Scribe     Data Highway   Kafka?      Flume       Scribe
Streaming data processing        -              -          -              -           -           -
Batch data processing            Map/Reduce     Hadoop     Hadoop         Hadoop      Hadoop      Hadoop
File storage                     GFS            Hadoop     Hadoop         Hadoop      Hadoop      Hadoop
Coordination                     Chubby         Zookeeper  Zookeeper      Zookeeper   Zookeeper   Zookeeper


  21
New Technologies                                                      Business
                                                                      Intelligence



                          Twitter              Machine
                       Real Time
                        Sensor Data            Learning
                                                                                      CETAS
                        Streams
                       Mobile Events
                       Machine Logs
                               Automated
                                Models

                                  S4, Storm
                                 Real-Time
                                  Stream                   Data Visualization
                                                                   …
                                Processing                      (Excel, Tableau)

                E
      Common
       Query    T                        SPARK
                                       Real Time        Aster,             Unstructured
                L                        SHARK
                                       Structured     Greenplum
                                                                            and Batch
                                                                           Map-Reduce
                                        Gemfire                             Processing
                                       Database
                                         hBase?          Etc,               (Hadoop, Hive)
Query Virtualization
        …
(SQL aggregation)

                                   HDFS, Ceph, MapR, Colossus

                                         Cloud Infrastructure
                         Compute            Storage         Networking

 22
Agenda

•  Frameworks
     •    Batch processing: Hadoop, Spark
     •    Graph processing: Pregel, Apache Giraph
     •    Real-time processing: Storm, S4, D-Streams
     •    Interactive processing: Hive, Impala, Shark
•  New requirements
     •  Better network architectures, abstractions and end-to-end resource
        management
     •  Whither disk-locality and the flexibility to move data to compute
        instead
     •  Cluster/Datacenter-wide storage abstractions and services
     •  The silo-less datacenter (multiple frameworks sharing a single
          physical cluster and sharing sticky data)



23
Big Data Processing Patterns (batch, real-time or interactive)


Hadoop, Hive, Impala, Storm, S4, D-Streams, Shark
     •  Funnel (large input, small output, e.g., link/ad click-statistics)
     •  Reverse Funnel (small input, large output, e.g., logfile loading)
     •  Data transform (input and output sizes similar, e.g., data conversion/
        translation)

Spark
     •  Iterative, e.g., Machine learning tasks

Pregel, Giraph
     •  Graph-based analyses to reason about relationships,
        e.g., PageRank, Ravi's social approach to VI management

   24
Batch processing frameworks (1/2)
•  Apache Hadoop MapReduce (Yahoo!)




      •  Parallel data-processing paradigm (made popular by Google). Uses a
        distributed file system (HDFS) for persistence. Uses commodity hardware
      •  Model of operation: Mapper (read from HDFS + compute in parallel) ->
        Reducer (process map outputs in parallel) -> write to HDFS
      •  Key components: Namenode, Datanode, TaskTracker, JobTracker
      •  Apache Zookeeper sometimes used for coordination
      •  Weakness: Not well-suited for iterative (or graph) computations
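The mapper -> reducer model above can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API: the function names are made up for the example, an in-memory list stands in for HDFS, and the shuffle step (grouping map outputs by key) is done with a dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all map outputs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an in-memory 'file'."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["big data", "big iron"]))
```

Each mapper call touches one line and each reducer call touches one key, which is what lets a real cluster run them in parallel. An iterative algorithm has to repeat the whole job, rereading its input each time, which is the weakness noted in the last bullet.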
 25
Batch processing frameworks (2/2)
•  Spark (UC Berkeley)




     •  Support for iterative computations and interactive data-mining by
       caching data in cluster RAM. Uses commodity machines
     •  Core abstraction: Resilient Distributed Datasets (RDDs) used as
       variables in Spark programs. RDDs include lineage data for easy
       recovery/reconstruction
     •  Up to ~20X speedup over Hadoop. Used by Quantifind, Conviva, …

        Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
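The lineage idea behind RDDs can be illustrated with a toy class. This is a hypothetical sketch, not Spark's API: `ToyRDD` and its methods are invented for the example. Each dataset records the transformation that produced it, so a lost in-memory copy can be rebuilt from its parent instead of being restored from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD that remembers how it was derived."""

    def __init__(self, source=None, parent=None, transform=None, description="source"):
        self.source = source            # set only on the root dataset
        self.parent = parent            # lineage pointer to the parent dataset
        self.transform = transform      # function applied to the parent's data
        self.description = description
        self._cached = None             # in-memory copy; may be lost at any time

    def map(self, fn):
        return ToyRDD(parent=self, description="map",
                      transform=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, description="filter",
                      transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        """Return the data, recomputing it from lineage if the cache is empty."""
        if self._cached is None:
            if self.parent is None:
                self._cached = list(self.source)
            else:
                self._cached = self.transform(self.parent.collect())
        return self._cached

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cached = None

    def lineage(self):
        chain, node = [], self
        while node is not None:
            chain.append(node.description)
            node = node.parent
        return list(reversed(chain))


if __name__ == "__main__":
    squares = ToyRDD(source=[1, 2, 3, 4]).map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)
    print(evens.collect(), evens.lineage())
    evens.lose_cache()
    print(evens.collect())  # rebuilt from the cached parent via lineage
```

Repeated `collect()` calls reuse the cached list, which is the property that helps iterative jobs. After `lose_cache()` the same result is rebuilt by replaying the recorded transformations.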

26
Graph processing frameworks

•  Pregel (Google)/Apache Giraph
             Compute               Communicate




                                                                   Barrier
     VM1         VM2




      •  Multiple instances of vertex-programs: user-defined functions running
           at/on each vertex
      •  Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank
      •  Stateful in-memory computations. Fault-tolerance via checkpoints
      •  Runs on commodity hardware (racks with high intra-rack bandwidth)
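The compute / communicate / barrier cycle can be sketched as a single-process PageRank. This is an illustration of the BSP pattern under simplifying assumptions, not Pregel's or Giraph's API: the function names are invented, the graph is a small dictionary in memory, and every vertex is assumed to have at least one outgoing edge.

```python
def pagerank_vertex(incoming_sum, num_vertices, damping=0.85):
    """Vertex program: new rank computed from the messages a vertex received."""
    return (1.0 - damping) / num_vertices + damping * incoming_sum


def run_supersteps(graph, supersteps=20, damping=0.85):
    """Run a fixed number of BSP supersteps over `graph` (vertex -> out-neighbours)."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Communicate: every vertex sends an equal share of its rank along each edge.
        inbox = {v: 0.0 for v in graph}
        for v, neighbours in graph.items():
            for u in neighbours:
                inbox[u] += ranks[v] / len(neighbours)
        # Barrier: all messages are delivered before any vertex computes.
        # Compute: each vertex runs the same user-defined function.
        ranks = {v: pagerank_vertex(inbox[v], n, damping) for v in graph}
    return ranks


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for vertex, rank in sorted(run_supersteps(graph).items()):
        print(vertex, round(rank, 4))
```

Building the new `ranks` dictionary only after the whole inbox is filled plays the role of the barrier: no vertex sees a rank from the current superstep.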

27
Real-time processing frameworks (stream-processing) 1/2

•  S4 (Yahoo!), Storm (Twitter)
       •  Record-at-a-time processing. Checkpointing for fault-tolerance (S4)




Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf


  28
Real-time processing frameworks (stream-processing) 2/2

•  Discretized Streams/D-Streams (UC Berkeley)
       •  Treat a streaming computation as a series of batch computations on
           small time intervals. D-Stream = chain of RDDs
       •  Fault-tolerance without replication or upstream backup (buffering)




                                                                                                                       Time
Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
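The "series of batch computations on small time intervals" idea can be sketched without any streaming library. This is a minimal illustration, not the D-Streams API: the function names and the `(timestamp, value)` event format are assumptions made for the example.

```python
from collections import Counter, defaultdict


def discretize(events, interval):
    """Group (timestamp, value) events into batches, one per time interval."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[index] for index in sorted(batches)]


def count_batch(batch):
    """Ordinary batch computation applied to one interval's data."""
    return Counter(batch)


def running_totals(batches):
    """Fold the per-interval results into state carried across intervals."""
    total, history = Counter(), []
    for batch in batches:
        total = total + count_batch(batch)
        history.append(dict(total))
    return history


if __name__ == "__main__":
    events = [(0.1, "a"), (0.5, "b"), (1.2, "a"), (2.7, "a")]
    batches = discretize(events, interval=1.0)
    print(batches)
    print(running_totals(batches))
```

Because each interval's result is a deterministic function of that interval's input and the previous state, a lost result can be recomputed from its inputs. That is the intuition behind fault-tolerance without replication or upstream backup.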


  29
Interactive processing frameworks 1/4

•  Apache Hive (Facebook)
       •  Open-source data warehouse built on top of Hadoop. HiveQL
           queries compiled into MapReduce jobs. Expensive Where clauses =
           Table scans = high latency




Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/


  30
Interactive processing frameworks 2/4

•  Pivotal HAWQ




31
Interactive processing frameworks 3/4

•  Impala (Cloudera)
       •  Inspired by Dremel (Google). Key concepts: columnar-data storage
           (Trevni), aggregation trees for distributed query evaluation
       •  Takes advantage of Hive tables. Uses memory as a cache for tables
       •  Does not use MapReduce to answer queries (unlike Hive).
       •  3X - 90X faster than Hive




Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
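An aggregation tree over columnar data can be sketched as follows. This illustrates the general Dremel-style idea only and says nothing about how Impala implements it: the function names, the fan-out parameter, and the use of plain lists as column chunks are assumptions. Each leaf scans one chunk of a single column and emits a small partial result; inner nodes merge partials until one remains.

```python
def leaf_partial(column_chunk):
    """Leaf node: scan one chunk of a single column, emit (sum, count)."""
    return sum(column_chunk), len(column_chunk)


def merge(partials):
    """Inner node: combine the partial results of its children."""
    return (sum(s for s, _ in partials), sum(c for _, c in partials))


def tree_aggregate(chunks, fanout=2):
    """Merge leaf partials level by level; assumes at least one chunk."""
    level = [leaf_partial(chunk) for chunk in chunks]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


if __name__ == "__main__":
    # One numeric column, split into chunks as if stored on four nodes.
    latency_ms = [[1, 2], [3], [4, 5, 6], [7]]
    total, count = tree_aggregate(latency_ms)
    print("AVG =", total / count)
```

Only the (sum, count) pairs travel up the tree, never the rows themselves. Because the data is laid out by column, a query on one column does not read the others.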


  32
Interactive processing frameworks 4/4

•  Shark (UC Berkeley)
       •  Key concepts: columnar-data storage (in-memory), Directed Acyclic
           Graphs of Tasks for distributed query optimization and evaluation,
           dynamic mid-query replanning
       •  Uses Spark RDDs to store data and query processing results
       •  SQL-interface (HiveQL compatible)
       •  100X faster than Hadoop, 100X faster than Hive




Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf


  33
Unifying the Big Data Platform using Virtualization

•  Goals
 •  Make it fast and easy to provision new data Clusters on Demand
 •  Allow Mixing of Workloads
 •  Leverage virtual machines to provide isolation (esp. for Multi-tenant)
 •  Optimize data performance based on virtual topologies
 •  Make the system reliable based on virtual topologies
•  Leveraging Virtualization
 •  Elastic scale
 •  Use high-availability to protect key services, e.g., Hadoop’s namenode/job
     tracker
 •  Resource controls and sharing: re-use underutilized memory, cpu
 •  Prioritize Workloads: limit or guarantee resource usage in a mixed
     environment
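The "limit or guarantee resource usage" bullet can be illustrated with a small allocator. This is a hypothetical sketch of one common scheme (a guaranteed reservation, a hard limit, and proportional shares for the remainder); it is not a description of how vSphere schedules resources, and the workload names and numbers are invented. It assumes reservations fit within capacity and all shares are positive.

```python
def allocate(capacity, workloads, eps=1e-9):
    """Give each workload its reservation, then split the rest by shares up to each limit."""
    alloc = {name: w["reservation"] for name, w in workloads.items()}
    remaining = capacity - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed capacity")
    active = {name for name, w in workloads.items() if alloc[name] < w["limit"] - eps}
    while remaining > eps and active:
        total_shares = sum(workloads[name]["shares"] for name in active)
        granted = 0.0
        for name in sorted(active):
            w = workloads[name]
            # Proportional share of what is left, capped by the workload's limit.
            give = min(remaining * w["shares"] / total_shares, w["limit"] - alloc[name])
            alloc[name] += give
            granted += give
            if alloc[name] >= w["limit"] - eps:
                active.discard(name)  # capped workloads leave the next round
        remaining -= granted
    return alloc


if __name__ == "__main__":
    workloads = {
        "hadoop": {"reservation": 20, "limit": 60, "shares": 2},
        "sql": {"reservation": 10, "limit": 100, "shares": 1},
        "nosql": {"reservation": 10, "limit": 20, "shares": 1},
    }
    for name, amount in sorted(allocate(100, workloads).items()):
        print(name, round(amount, 2))
```

Capacity that a capped workload cannot use is redistributed to the others in the next round, which is the "re-use underutilized memory, cpu" behaviour the slide describes.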
                                        Cloud Infrastructure

                                                                Private
                                                                          Public

34
A Unified Analytics Cloud Significantly Simplifies

[Diagram: separate SQL Cluster, NoSQL Cluster, Hadoop Cluster and Decision Support Cluster, shown alongside Big SQL, NoSQL and Hadoop running on a Unified Analytics Infrastructure (Private, Public)]

•  Simplify
  •  Single Hardware Infrastructure
  •  Faster/Easier provisioning
•  Optimize
  •  Shared Resources = higher utilization
  •  Elastic resources = faster on-demand access

 35
Simplify Heterogeneous Data Management via Data PaaS



                                      Large-
                        File-                            In-            Big
                                       Scale
                       system                          Memory           SQL
                                      NoSQL



  Analytics Tools


     Developer

     Databases
                       Data PaaS – Common Data Management Layer

   Data Platform       Provisioning      Multi-tenancy          Import/Export
Cloud Infrastructure       Management             Data Discovery




                                         Cloud Infrastructure



36
Technology: Databases and Data Stores for Big Data

Categories run from Unstructured (left) to Structured (right):

File-system
     Types of Data:  Log files, machine generated data, documents, device data, etc…
     Technologies:   NAS, HDFS, Blob, S3, MapR, etc.
     Values:         Store any data, easy to scale-out, can optimize for cost

Large-Scale NoSQL
     Types of Data:  Loosely typed device data, records, events, statistics,
                     complex relations/graphs
     Technologies:   Cassandra, HBase, Voldemort
     Values:         Easy to scale-out, flexible and dynamic schemas

In-Memory
     Types of Data:  Structured, partitionable data
     Technologies:   Gemfire, Redis, Membase, Spark
     Values:         High throughput, low latency

Big SQL
     Types of Data:  Structured data
     Technologies:   HAWQ, Impala, Aster, …
     Values:         High performance for repetitive queries. Ease of query language.

37
The Unified Analytics Cloud Platform



          MADlib
                            Analytics Tools       Karmasphere
              Datameer                                      Tableau

         Hadoop R             Developer           Spring
                                                              PaaS
          Python             Frameworks       Cloud Foundry

        Cassandra                                   HBase
                  HDFS    Database/DataStore
           Greenplum                                  Voldemort


         Data-Director
                             Data Platform           Data PaaS
            EMC Chorus



             vSphere      Cloud Infrastructure
                                                    Private
                                                                Public




38
Summary

•  Revolution in Big Data is under way
  •  Data centric applications are now critical
•  Hadoop on Virtualization
  •  Proven performance
  •  Cloud/Virtualization values apparent for Hadoop use
•  Simplify through a Unified Analytics Cloud
  •  One Platform for today’s and future big-data systems
  •  Better Utilization
  •  Faster deployment, elastic resources
  •  Secure, Isolated, Multi-tenant capability for Analytics




39
References

•  Twitter
  •  @richardmcdougll
•  My CTO Blog
  •  http://communities.vmware.com/community/vmtn/cto/cloud
•  Hadoop on vSphere
  •  Talk @ Hadoop World
  •  Performance Paper – http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf
•  Spring Hadoop
  •  http://blog.springsource.org/2012/02/29/introducing-spring-hadoop




40

Más contenido relacionado

La actualidad más candente

Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...Dev Lakhani, Data Scientist at Batch Insights  "Real Time Big Data Applicatio...
Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applicatio...Dataconomy Media
 
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc...Cloudera, Inc.
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big dataYukti Kaura
 
20100806 cloudera 10 hadoopable problems webinar
20100806 cloudera 10 hadoopable problems webinar20100806 cloudera 10 hadoopable problems webinar
20100806 cloudera 10 hadoopable problems webinarCloudera, Inc.
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 
Big data, map reduce and beyond
Big data, map reduce and beyondBig data, map reduce and beyond
Big data, map reduce and beyonddatasalt
 
Introduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopIntroduction to Big Data Analytics on Apache Hadoop
Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan
 
Cetas Analytics as a Service for Predictive Analytics
Cetas Analytics as a Service for Predictive AnalyticsCetas Analytics as a Service for Predictive Analytics
Cetas Analytics as a Service for Predictive AnalyticsJ. David Morris
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
Microsoft SQL Azure - Cloud Based Database Datasheet
Microsoft SQL Azure - Cloud Based Database DatasheetMicrosoft SQL Azure - Cloud Based Database Datasheet
Microsoft SQL Azure - Cloud Based Database DatasheetMicrosoft Private Cloud
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Big data processing with apache spark part1
Big data processing with apache spark   part1Big data processing with apache spark   part1
Big data processing with apache spark part1Abbas Maazallahi
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - IntroductionTomy Rhymond
 
Hadoop core concepts
Hadoop core conceptsHadoop core concepts
Hadoop core conceptsMaryan Faryna
 

Building Big Data Applications

  • 1. Building Big Data Applications Services for Private Clouds. Richard McDougall, Chief Architect, Storage and Application Services, VMware, Inc. (@richardmcdougll). © 2009 VMware Inc. All rights reserved
  • 2. Infrastructure, Apps and now Data… Simplify Infrastructure With Cloud; Simplify App Platform Through PaaS; Simplify Data. Diagram labels: Private, Public; Build, Run, Manage
  • 3. Trend 1/3: New Data Growing at 60% Y/Y
      • Exabytes of information stored: 20 Zetta by 2015; 1 Yotta by 2030
      • Yes, you are part of the yotta generation…
      • Data sources: audio; digital TV; digital photos; camera phones, RFID; medical imaging, sensors; satellite images, logs, scanners, Twitter; CAD/CAM, appliances, machine data, digital movies
      • Source: The Information Explosion, 2009
  • 4. Data Growth in the Enterprise
  • 5. Trend 2/3: Big Data – Driven by Real-World Benefit
  • 6. Enterprise: Early Adopter Industries and Use Cases
  • 7. Early Adopters: Enterprise Segmentation
      • Verticals: Financial Services; Retail; Telco; Manufacturing; Government
      • Targets: Existing Hadoop Users; Business Analysts; Data Scientists; LOB managers; IT/Ops
      • Use Cases: Business Trend Analytics; Revenue analytics; CDR, call pattern analytics; Sensor data analytics; Log, machine data analytics; Fraud detection; Homeland security; Predictive analytics
  • 8. Early Adopters: Non-enterprise Segmentation
      • Verticals: Online Advertising; eCommerce; Mobile; Social Media; Gaming
      • Targets: End users/Exec users; Business Analysts; PM, LOB managers; Marketing/Sales; Data Engineers; Data Scientists; IT/Operations
      • Use Cases: Behavioral Analytics; Audience segmentation; Revenue Optimization; User activity monetization; Inventory, price management; Recommendations; Predictive analytics
  • 9. Why now? More transactions
      • SoMoLo (Social/Mobile/Local): 500 TB data/day; 30B messages/month; 35 check-ins/sec; 13k API calls/sec
      • Big “traditional” companies: 1 TB data/day; 10k card transactions/sec; 3.7B calls/month
      • Chart labels: size of data, communications, transactions
  • 10. Trend 3/3: Value from Data Exceeds Hardware Cost
      • Value from the intelligence of data analytics now outstrips the cost of hardware
          • Hadoop enables the use of 10x lower cost hardware
          • Hardware cost halving every 18 months
      • Chart (Value vs. Cost): Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU
  • 11. The Old Big Data Stack
      • Diagram components: Business Intelligence; Data Visualization (Crystal, Bus O); Statistics (SAS, SPSS); Extract, Transform, Load (Informatica); Files; SQL Databases; ETL; Column Oriented Relational Database (Oracle, Teradata, DB2); Master Data Management (Oracle, SAP)
  • 12. The Old Big Data Stack
      • Unable to handle large data volumes & diversity of data
      • Iterative, brute-force and slow process
      • Lack of ad-hoc data navigation across events and time
      • Cumbersome ETL to “process” and DBAs to “prepare”
      • Focused on structured data that is warehoused
      • Web analytics solutions force real-time events into rigid schemas in DBs
      • Diagram components: Business Intelligence; Data Visualization (Crystal, Bus O); Statistics (SAS, SPSS); Extract, Transform, Load (Informatica); Files; SQL Databases; ETL; Column Oriented Relational Database (Oracle, Teradata, DB2); Master Data Management (Oracle, SAP)
  • 13. The Journey To Big Data Analytics
      • 1. All Data: Faster Answers; Elastic & Scalable. Technology Focus. Goal: encourage experimentation with existing data
      • 2. Data Science: Collaboration; Self-Service. People & Productivity Focus. Goal: discover meaningful insights that impact the business
      • 3. Real Time Decisions: New Applications; Data Monetization. Application Focus. Goal: operationalize those insights as quickly as possible
      • Other diagram labels: Big Data Enabled Apps; Agile Process & Tools; Analytics Engines; Analytic Engines; Analytic Productivity Platform; Cloud Infrastructure; BI As A Service; Agile Analytics; Predictive Enterprise
  • 14. Customer profiles
      • 1. Business analysts, LOB managers, execs
          • Need: out-of-the-box analytics
          • Designed for: self-service for end-user leveraging app developers
      • 2. Data engineers/analysts
          • Need: out-of-the-box + some customization
          • Designed for: admin + operations
      • 3. Data scientists
          • Need: power capabilities + heavy customization
          • Designed for: data scientists
      • 4. IT, Operations
          • Need: out-of-the-box + some customization
          • Designed for: IT/admin, ops
  • 15. What is Data Science and Data Engineering?
      • Data Science & Data Engineering sits at the intersection of:
          • Distributed, Parallelization & programming Skills
          • Math and Statistical Algorithm Knowledge
          • Business Domain and Problem Understanding
          • Vertical or Horizontal Use case and Analytics Experience
  • 16. What is Driving Big Data? Chart categories: Structured; Largely Unstructured; Semi-structured. Source: IBM and Oxford Survey: Getting Closer to Customers Tops Big Data Agenda, October 17, 2012
  • 17. Today’s Big Data System
      • Diagram components: Real Time Streams; Real-Time Processing (S4, Storm); Analytics; ETL; Real Time Structured Database; Data Parallel Batch Processing; Big SQL; Unstructured Data (HDFS)
  • 18. The Unified Analytics Cloud Platform
      • Layer labels: Analytics Tools; Developer Frameworks; PaaS; Database/DataStore; Data Platform; Data PaaS; Cloud Infrastructure (Private, Public)
      • Products shown: Madlib; Karmasphere; Data Meer; Tableau; Hadoop; R; Spring; Python; Cloudfoundry; Cassandra; hBase; HDFS; HawQ; Impala; Data-Director; EMC Chorus; vSphere
  • 19. The New Big Data System
      • Diagram components: Business Intelligence; Real Time Streams; Automated Models; Real-Time Stream Processing; Data Visualization (Excel, Tableau); ETL; Common Query; Real Time Structured Database; Structured Data Engine; Unstructured and Batch Processing (Hadoop, Hive); Federated Query (SQL aggregation); Structured and Unstructured Data (HDFS, S3)
      • Cloud Infrastructure: Compute; Storage; Networking
  • 20. An Example – Automated Performance Management
      • Diagram components: 10M Performance Stats/min; Trigger Models; Batch Baseline Calculation; Stats Database
      • Cloud Infrastructure: Compute; Storage; Networking
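The slide names a batch baseline calculation feeding trigger models but does not say how either works. As a rough sketch only, assuming the baseline is a mean and standard deviation over historical samples and the trigger is a simple deviation threshold (the function names and the 3-sigma default are illustrative, not taken from the slides):

```python
from statistics import mean, pstdev


def compute_baseline(samples):
    """Batch step: summarise historical stats as (mean, standard deviation)."""
    return mean(samples), pstdev(samples)


def should_trigger(value, baseline, threshold=3.0):
    """Trigger step: flag a new stat that deviates strongly from the baseline.

    The threshold (in standard deviations) is an assumed tuning knob.
    """
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical latency samples collected during normal operation.
history = [10, 12, 11, 9, 10, 11, 10, 9]
baseline = compute_baseline(history)
```

In a real deployment the baseline would be recomputed periodically in batch over the stats database, while the trigger check runs on each incoming stat.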
  • 21. Big (Data) problems: becoming the standardized stack
      • Columns: Google; Facebook; Yahoo; LinkedIn; Cloudera; Twitter. Rows list their non-empty cells in order
      • Metadata: Dremel; Hive; Hive; Hive
      • Schedule & pipeline workloads: Evenflow; Databee; Oozie; Azkaban; Oozie
      • Data dataflow/queries: -/Sawzall; /Hive; Pig/Hive; Pig; Pig/Hive; Cascading
      • More-structured data store: Bigtable; Hbase; Hbase; Voldemort; Hbase; Cassandra
      • DB data collection/integration: MySQL gateway; Sqoop; Sqoop
      • Event data collection: Scribe; Highway; Kafka?; Flume; Scribe
      • Streaming data processing: -; -; -; -; -; -
      • Batch data processing: Map/Reduce; Hadoop; Hadoop; Hadoop; Hadoop; Hadoop
      • File Storage: GFS; Hadoop; Hadoop; Hadoop; Hadoop; Hadoop
      • Coordination: Chubby; Zookeeper; Zookeeper; Zookeeper; Zookeeper; Zookeeper
  • 22. New Technologies
      • Data sources: Twitter; Sensor Data; Mobile Events; Machine Logs
      • Diagram components: Business Intelligence; Machine Learning; CETAS; Real Time Streams; Automated Models; Real-Time Stream Processing (S4, Storm, …); Data Visualization (Excel, Tableau); ETL; Common Query; SPARK, SHARK; Real Time Structured Database (Gemfire, hBase?); Aster, Greenplum; Unstructured and Batch Processing (Hadoop, Hive; Map-Reduce, etc., …); Query Virtualization (SQL aggregation)
      • Storage: HDFS, Ceph, MAPR, Colossus
      • Cloud Infrastructure: Compute; Storage; Networking
  • 23. Agenda. Frameworks: • Batch processing: Hadoop, Spark • Graph processing: Pregel, Apache Giraph • Real-time processing: Storm, S4, D-Streams • Interactive processing: Hive, Impala, Shark. New requirements: • Better network architectures, abstractions and end-to-end resource management • Whither disk-locality, and the flexibility to move data to compute instead • Cluster/datacenter-wide storage abstractions and services • The silo-less datacenter (multiple frameworks sharing a single physical cluster and sharing sticky data)
  • 24. Big Data Processing Patterns (batch, real-time or interactive). Hadoop, Hive, Impala, Storm, S4, D-Streams, Shark: • Funnel (large input, small output, e.g., link/ad click-statistics) • Reverse funnel (small input, large output, e.g., logfile loading) • Data transform (input and output sizes similar, e.g., data conversion/translation). Spark: • Iterative, e.g., machine learning tasks. Pregel, Giraph: • Graph-based analyses to reason about relationships, e.g., PageRank, Ravi's social approach to VI management
  • 25. Batch processing frameworks (1/2). Apache Hadoop MapReduce (Yahoo!): • Parallel data-processing paradigm (made popular by Google). Uses a distributed file system (HDFS) for persistence. Uses commodity hardware • Model of operation: Mapper (read from HDFS + compute in parallel) -> Reducer (process map outputs in parallel) -> write to HDFS • Key components: NameNode, DataNode, TaskTracker, JobTracker • Apache ZooKeeper sometimes used for coordination • Weakness: not well-suited for iterative (or graph) computations
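The Mapper -> Reducer model of operation on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API: the function names `mapper`, `shuffle`, `reducer` and `run_job` are hypothetical, and an in-memory list stands in for an HDFS input split.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all values that share one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence (a real cluster runs them in parallel)."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Assumption: a list of strings stands in for the lines of an HDFS input split.
input_split = ["big data on private clouds", "big data on Hadoop"]
word_counts = run_job(input_split)
```

Each stage consumes the previous stage's full output, which is one way to see why iterative algorithms, which repeat the whole pipeline many times, fit this model poorly.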
  • 26. Batch processing frameworks (2/2). Spark (UC Berkeley): • Support for iterative computations and interactive data-mining by caching data in cluster RAM. Uses commodity machines • Core abstraction: Resilient Distributed Datasets (RDDs), used as variables in Spark programs. RDDs include lineage data for easy recovery/reconstruction • Up to ~20X speedup over Hadoop. Used by Quantifind, Conviva, … Image courtesy Zaharia et al.: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
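The idea that lineage data makes recovery easy can be illustrated with a toy class. This is a sketch under stated assumptions, not Spark's implementation or API: `ToyRDD` and its `evict` method are hypothetical names, and everything runs in one process on a single partition.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: cached data plus the recipe (lineage) that produced it."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # base data, set only for a root dataset
        self._parent = parent        # lineage: the dataset this one derives from
        self._transform = transform  # lineage: function applied to the parent's data
        self._cache = None           # in-memory copy, which may be lost

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda data: [fn(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        """Return the data, recomputing it from lineage if the cache is empty."""
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def evict(self):
        """Simulate losing the cached data, e.g., when a node fails."""
        self._cache = None


base = ToyRDD(source=range(10))
squares_of_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
first = squares_of_evens.collect()
squares_of_evens.evict()                 # the cached result is gone
recovered = squares_of_evens.collect()   # rebuilt by replaying the lineage
```

No replica of the derived data is kept; only the chain of transformations is remembered, and that chain is enough to rebuild what was lost.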
  • 27. Graph processing frameworks. Pregel (Google)/Apache Giraph: • Multiple instances of vertex-programs: user-defined functions running at/on each vertex • Bulk Synchronous Parallel (BSP) processing, e.g., used for PageRank • Stateful in-memory computations. Fault-tolerance via checkpoints • Runs on commodity hardware (racks with high intra-rack bandwidth). Diagram: supersteps of compute, communicate and barrier across VM1 and VM2
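The compute, communicate, barrier cycle of BSP can be sketched with PageRank, the example the slide names. This is a single-process illustration rather than the Pregel or Giraph API; `pagerank_bsp` is a hypothetical name, the damping factor 0.85 and the three-vertex graph are assumptions, and every vertex is assumed to have at least one out-edge.

```python
def pagerank_bsp(graph, supersteps=20, damping=0.85):
    """graph maps each vertex to the list of vertices it links to."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        # Compute + communicate: every vertex sends rank / out_degree along its out-edges.
        inbox = {v: [] for v in graph}
        for v, neighbours in graph.items():
            for dst in neighbours:
                inbox[dst].append(rank[v] / len(neighbours))
        # Barrier: all messages are delivered before any vertex updates its state.
        rank = {v: (1 - damping) / n + damping * sum(inbox[v]) for v in graph}
    return rank


# Assumption: a tiny example graph in which every vertex has outgoing links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph)
```

Building the new `rank` dictionary only after every inbox is filled plays the role of the barrier: no vertex sees a neighbour's value from the current superstep.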
  • 28. Real-time processing frameworks (stream-processing) 1/2. S4 (Yahoo!), Storm (Twitter): • Record-at-a-time processing. Checkpointing for fault-tolerance (S4). Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
  • 29. Real-time processing frameworks (stream-processing) 2/2. Discretized Streams/D-Streams (UC Berkeley): • Treat a streaming computation as a series of batch computations on small time intervals. D-Stream = chain of RDDs • Fault-tolerance without replication or upstream backup (buffering). Image courtesy Zaharia et al.: https://www.usenix.org/sites/default/files/conference/protected-files/zaharia_hotcloud12_slides.pdf
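Treating a stream as a series of small batch computations can be sketched as follows. This is a plain-Python illustration, not the D-Streams implementation: `discretize` and `process_stream` are hypothetical names, events are assumed to be `(timestamp, value)` pairs, and intervals with no events are simply skipped.

```python
from collections import Counter, defaultdict


def discretize(events, interval):
    """Group (timestamp, value) events into batches, one per time interval."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[i] for i in sorted(batches)]


def process_stream(events, interval):
    """Run one small deterministic batch job per interval, carrying state forward."""
    running = Counter()
    snapshots = []
    for batch in discretize(events, interval):
        # New state is derived only from the previous state and the current batch.
        running = running + Counter(batch)
        snapshots.append(dict(running))
    return snapshots


# Assumption: timestamps are in seconds and the batch interval is one second.
events = [(0.2, "click"), (0.7, "view"), (1.1, "click"), (2.5, "click")]
snapshots = process_stream(events, interval=1.0)
```

Because each interval's state is a deterministic function of the previous state and one batch, a lost state can be recomputed rather than restored from a replica.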
  • 30. Interactive processing frameworks 1/4. Apache Hive (Facebook): • Open-source data warehouse built on top of Hadoop. HiveQL queries are compiled into MapReduce jobs. Expensive WHERE clauses = table scans = high latency. Image courtesy Cubrid: http://www.cubrid.org/blog/dev-platform/platforms-for-big-data/
  • 31. Interactive processing frameworks 2/4. Pivotal HAWQ
  • 32. Interactive processing frameworks 3/4. Impala (Cloudera): • Inspired by Dremel (Google). Key concepts: columnar-data storage (Trevni), aggregation trees for distributed query evaluation • Takes advantage of Hive tables. Uses memory as a cache for tables • Does not use MapReduce to answer queries (unlike Hive) • 3X - 90X faster than Hive. Image courtesy Cloudera: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
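The two key concepts on this slide, columnar storage and aggregation trees, can be illustrated with a toy group-by sum. This is a sketch only and does not reflect Impala's or Trevni's internals: the partition layout, the column names and the functions `leaf_aggregate`, `merge` and `tree_aggregate` are all hypothetical.

```python
# Assumption: each node holds one partition, stored column by column.
node_partitions = [
    {"country": ["US", "DE", "US"], "revenue": [10, 20, 30]},
    {"country": ["DE", "FR"], "revenue": [5, 7]},
    {"country": ["US"], "revenue": [1]},
]


def leaf_aggregate(partition):
    """Leaf step: read only the two needed columns and build a partial sum per key."""
    partial = {}
    for key, value in zip(partition["country"], partition["revenue"]):
        partial[key] = partial.get(key, 0) + value
    return partial


def merge(left, right):
    """Inner step: combine two partial results into one."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def tree_aggregate(partials):
    """Merge partial results pairwise, level by level, until one remains (non-empty input)."""
    while len(partials) > 1:
        partials = [
            merge(partials[i], partials[i + 1]) if i + 1 < len(partials) else partials[i]
            for i in range(0, len(partials), 2)
        ]
    return partials[0]


totals = tree_aggregate([leaf_aggregate(p) for p in node_partitions])
```

Only small partial results move up the tree, so the raw rows never need to be gathered in one place.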
  • 33. Interactive processing frameworks 4/4. Shark (UC Berkeley): • Key concepts: columnar-data storage (in-memory), Directed Acyclic Graphs of tasks for distributed query optimization and evaluation, dynamic mid-query replanning • Uses Spark RDDs to store data and query processing results • SQL interface (HiveQL compatible) • Up to 100X faster than Hadoop, up to 100X faster than Hive. Image courtesy Xin et al.: http://shark.cs.berkeley.edu/presentations/2012-11-26-shark-tech-report.pdf
  • 34. Unifying the Big Data Platform using Virtualization. Goals: • Make it fast and easy to provision new data clusters on demand • Allow mixing of workloads • Leverage virtual machines to provide isolation (esp. for multi-tenant) • Optimize data performance based on virtual topologies • Make the system reliable based on virtual topologies. Leveraging virtualization: • Elastic scale • Use high availability to protect key services, e.g., Hadoop's NameNode/JobTracker • Resource controls and sharing: re-use underutilized memory, CPU • Prioritize workloads: limit or guarantee resource usage in a mixed environment. Cloud Infrastructure: Private, Public
  • 35. A Unified Analytics Cloud Significantly Simplifies. Simplify: • Single hardware infrastructure • Faster/easier provisioning. Optimize: • Shared resources = higher utilization • Elastic resources = faster on-demand access. Diagram: separate SQL, NoSQL, Hadoop and Decision Support clusters consolidated onto a Unified Analytics Infrastructure (Big SQL, NoSQL, Hadoop; Private, Public)
  • 36. Simplify Heterogeneous Data Management via Data PaaS. Consumers: Analytics Tools, Developer. Databases: File-system, Large-Scale NoSQL, In-Memory, Big SQL. Data Platform (Data PaaS, the common data management layer): Provisioning, Multi-tenancy, Import/Export, Management, Data Discovery. Cloud Infrastructure
  • 37. Technology: Databases and Data Stores for Big Data (columns ordered from unstructured to structured). • File-system. Types of data: log files, machine generated data, documents, device data, etc. Technologies: NAS, HDFS, Blob, S3, MAPR, etc. Values: store any data, easy to scale out, can optimize for cost • Large-Scale NoSQL. Types of data: loosely typed device data, records, events, statistics, complex relations/graphs. Technologies: Cassandra, hBase, Voldemort. Values: easy to scale out, flexible and dynamic schemas • In-Memory. Types of data: structured, partitionable data. Technologies: Gemfire, Redis, Membase, SPARK. Values: high throughput, low latency • Big SQL. Types of data: structured data. Technologies: HawQ, Impala, Aster, … Values: high performance for repetitive queries, ease of query language
  • 38. The Unified Analytics Cloud Platform (layered diagram, top to bottom). Analytics Tools: Madlib, Karmasphere, Datameer, Tableau. Developer Frameworks: Hadoop, R, Spring, Python. PaaS: Cloudfoundry. Database/DataStore: Cassandra, hBase, HDFS, Greenplum, Voldemort. Data Platform (Data PaaS): Data-Director, EMC Chorus. Cloud Infrastructure: vSphere (Private, Public)
  • 39. Summary. Revolution in Big Data is under way: • Data-centric applications are now critical. Hadoop on virtualization: • Proven performance • Cloud/virtualization values apparent for Hadoop use. Simplify through a Unified Analytics Cloud: • One platform for today's and future big-data systems • Better utilization • Faster deployment, elastic resources • Secure, isolated, multi-tenant capability for analytics
  • 40. References. Twitter: • @richardmcdougll | My CTO Blog: • http://communities.vmware.com/community/vmtn/cto/cloud | Hadoop on vSphere: • Talk @ Hadoop World • Performance Paper: http://www.vmware.com/files/.../VMW-Hadoop-Performance-vSphere5.pdf | Spring Hadoop: • http://blog.springsource.org/2012/02/29/introducing-spring-hadoop