Building Scalable Data Platforms
Hadoop and Netezza Deployment Models

Krishnan Parasuraman, Netezza
Greg Rokita, Edmunds.com
Talking Points
• Building scalable data platforms
  – Architectural considerations

• Hadoop and Massively Parallel Databases
  – Similarities and differences
  – Usage patterns


• Practitioner’s Viewpoint
  – Edmunds.com data warehouse platform


   2                      Hadoop World 2011
Building scalable data platforms
Typical Digital Media Information Processing Pipeline

Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations

Real Time Decision Engine
  • Display Ads
  • Recommendation
  • Personalized Content

Data Processing
  • Correlate
  • Structure
  • Consolidate

Reporting
  • Aggregate
  • Summarize
  • Ad-hoc analysis

Additional labels on the diagram:
  • Scoring
  • Yield optimization
  • Audience Analytics
Building scalable data platforms
(Diagram: the same pipeline, with Clicks, Visits, Page Views, Likes, Tweets, Impressions, and Locations feeding the Real Time Decision Engine, Data Processing, and Reporting, all sitting on top of a single DATA PLATFORM.)
Building scalable data platforms

Real Time Decision Engine
  Workloads:   Real Time, High Concurrency, Transactional, High Throughput
  Data:        Structured, Un-Structured, Key-Value pairs
  Capability:  Stream Processing, Memory resident, Key based lookups

Data Processing (first column)
  Workloads:   High Velocity, Linearly Scalable, Disk bound
  Data:        Structured, Un-Structured, Machine Generated
  Capability:  Low Disk I/O, Fast Processing, Low Cost/TB

Data Processing (second column)
  Workloads:   Compute intensive, Full table scans, Disk bound
  Data:        Mostly Structured, Some unstructured
  Capability:  In-DB computation, SQL and MR, Analytic Libraries

Reporting
  Workloads:   Cached Queries, Low Latency, High Concurrency
  Data:        Structured, Relational
  Capability:  OLAP, Columnar
Building scalable data platforms

(Diagram: the same Workloads / Data / Capability grid as the previous slide, overlaid with candidate technologies. Left to right: NoSQL Databases, Graph DB, Hadoop, Plain Ole’ DB on steroids, Massively Parallel DB, In-Memory DB.)
Myth: A single technology will meet all the considerations for our scalable data platform needs
               Best Practices


Workloads scale differently – Monolithic architectures don’t work

Minimize components – Data movement is painful

Understand tradeoffs – Performance, Price, Effort

Start with the core architecture and work in the edge cases



Massively parallel data warehouses
SQL And MR

Layers, top to bottom:
  • Hosts (host controllers)
  • Network fabric
  • Massively parallel compute nodes, each with FPGA, CPU, and Memory
  • Distributed Storage
Hadoop
Map Reduce

Layers, top to bottom:
  • Master Node: Job Tracker and Name Node
  • Network fabric
  • Parallel compute nodes, each with a Task Tracker and a Data Node
  • Distributed Storage
There are striking similarities...
(Diagram: the Hadoop Map Reduce layout from the previous slide, with Job Tracker, Name Node, and Task Tracker / Data Node pairs.)
  • Massive parallelism
  • Execute code & algorithms next to data
  • Scalable
  • Highly Available
  • Map Reduce
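The shared idea of running map and reduce steps next to each data partition can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop or Netezza code. The sample partitions and the names `map_partition`, `reduce_counts`, and `run_job` are made up for the example.

```python
from collections import Counter
from functools import reduce

# Hypothetical event data, already split into partitions the way a
# cluster would spread it across its data nodes.
PARTITIONS = [
    ["click", "visit", "click"],
    ["page_view", "click"],
    ["visit", "page_view", "page_view"],
]


def map_partition(events):
    """Map step: runs where the partition lives and returns local counts."""
    return Counter(events)


def reduce_counts(left, right):
    """Reduce step: merges two partial results into one."""
    return left + right


def run_job(partitions):
    """Apply the map step to each partition, then fold the partial results."""
    partials = [map_partition(p) for p in partitions]
    return dict(reduce(reduce_counts, partials, Counter()))


if __name__ == "__main__":
    print(run_job(PARTITIONS))
```

Only the small per-partition counters are combined at the end. The raw events never leave their partition, which is the point of executing code next to the data.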
But also key differences
(Diagram: Hadoop Map Reduce layout with Job Tracker, Name Node, and Task Tracker / Data Node pairs, captioned "Data Loading = File copy" and "Look Ma, No ETL".)

Hadoop
  • Schema on Read – Data loading is fast
  • Batch Mode data access
  • Lower cost of data storage
  • Process unstructured data

Netezza
  • Optimized for Performance
  • Real time access, random reads, query optimizer, co-located joins
  • Hardware Accelerated queries
  • SQL and Map Reduce
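The "schema on read" versus load-time structuring contrast can be illustrated with a small Python sketch. It assumes raw events arrive as JSON lines and uses in-memory lists in place of HDFS files or warehouse tables. The field names and the functions `load_schema_on_write`, `load_schema_on_read`, and `query_schema_on_read` are hypothetical.

```python
import json

# Hypothetical raw events, one JSON object per line, plus one bad line.
RAW_LINES = [
    '{"user": "u1", "event": "click", "ts": 1}',
    '{"user": "u2", "event": "like", "ts": 2, "page": "/celica"}',
    'not json at all',
]

REQUIRED_FIELDS = ("user", "event", "ts")


def load_schema_on_write(lines):
    """Validate and structure every record at load time; reject bad rows."""
    table, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
            table.append(tuple(record[f] for f in REQUIRED_FIELDS))
        except (json.JSONDecodeError, KeyError):
            rejected.append(line)
    return table, rejected


def load_schema_on_read(lines):
    """Loading is just a copy of the raw lines; nothing is parsed yet."""
    return list(lines)


def query_schema_on_read(stored, field):
    """Interpret the raw lines only when a query asks for a field."""
    values = []
    for line in stored:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines are skipped at query time
        if field in record:
            values.append(record[field])
    return values
```

In this sketch the schema-on-read load does no work and keeps the extra `page` field. The schema-on-write load pays the parsing cost up front and keeps only the declared columns, so later reads are simple tuple access.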
These differences lead to opportunities for Hadoop to coexist
in a Netezza environment
1. Scalable ETL engine
  – Complex data
  – Relationships not defined
  – Evolving schema
2. Queryable Archive
  – Moving computation is cheaper than moving data
3. Analytics sandbox
  – Exploratory analysis

Netezza-Hadoop: Deployment Patterns

Data type              Left column of diagram                          Right column of diagram
unstructured data      Create context (classification, text mining)   Analyze
semi-structured data   Parse, aggregate                                Analyze, report
structured data        Analyze, report                                 Active archival, long running queries
Pattern 1: Data Processing Engine (ETL)

(Diagram: Raw Weblogs, a Hadoop Cluster, and the Netezza Environment, left to right. The Hadoop Cluster has one NameNode / JobTracker and three DataNode / TaskTracker nodes.)
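A minimal sketch of the ETL step in this pattern: raw weblog lines are parsed and aggregated into compact, delimited rows that a warehouse bulk loader could ingest. The slide does not specify the log format or the load mechanism, so the Common Log Format pattern, the pipe-delimited output, and all function names here are assumptions.

```python
import re
from collections import Counter

# Assumption: weblogs follow the Common Log Format.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)

# Hypothetical sample input, including one corrupted line.
SAMPLE_LOGS = [
    '10.0.0.1 - - [08/Nov/2011:10:00:01 -0800] "GET /toyota/celica HTTP/1.1" 200 5120',
    '10.0.0.2 - - [08/Nov/2011:10:00:05 -0800] "GET /toyota/celica HTTP/1.1" 200 5120',
    '10.0.0.1 - - [08/Nov/2011:10:01:10 -0800] "GET /honda/civic HTTP/1.1" 404 312',
    'corrupted line',
]


def parse_line(line):
    """Extract (day, path, status) from one log line, or None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.group("day"), match.group("path"), int(match.group("status"))


def aggregate(lines):
    """Count hits per (day, path, status), skipping malformed lines."""
    hits = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            hits[parsed] += 1
    return hits


def to_load_rows(hits):
    """Render aggregated hits as pipe-delimited rows for a bulk load."""
    return [
        "|".join([day, path, str(status), str(count)])
        for (day, path, status), count in sorted(hits.items())
    ]
```

On a cluster the parse and count steps would run as map and reduce tasks over the log files; the sketch keeps them as ordinary functions to show the shape of the transformation.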
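Pattern 2: Low cost storage and dynamic provisioning

(Diagram: an Amazon Cloud containing Amazon S3 and Elastic MapReduce, next to the Netezza Environment. Numbered steps: 1 at Amazon S3, 2 at Elastic MapReduce, 3 at the Netezza Environment.)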
Pattern 3: Queryable Archive

(Diagram: Data Sources and the Netezza Environment, with three numbered steps 1, 2, 3.)
Edmunds.com and Scale
 o       Premier online resource for automotive information
         launched in 1995 as the first automotive information
         Web site
 o       15 million unique visitors
 o       210 million page views
 o       1 million+ new inventory items per day
 o       2 TB of new data every month
 o       40 node Hadoop cluster aggregating logs,
         advertising, vehicle, pricing, inventory and other data
         sets

No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds Inc., and any such
disclosure requires the express approval of Edmunds Inc.


Edmunds Proposition

             We have developed an iterative
               approach to data warehouse
        development that has dropped the time
         it takes for us to deliver reports to our
               users from months to weeks.


How did we do it?


   o           Process
   o           Technology
   o           Understanding of Value



Process: agile approach
   o       Continuous and fast delivery of new features
   o       Collaboration between users and developers
   o       Make new data available quickly and
           inexpensively
   o       Quick problem resolution
   o       No wasting of entire development cycle if data is
           not useful
   o       Encouragement of exploration and creation of
           new applications
Process

Pre-process:
  • Complete
  • Raw
  • Modeled as source data
  • Generically loaded
  • Quick turn-around
  • Low retention
  • Slower performance

Post-process:
  • Filtered
  • Transformed
  • Modeled as star schema
  • Optimized
  • Slow turn-around
  • High retention
  • Fast performance
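To make the pre-process versus post-process distinction concrete, here is a small sketch that takes raw, source-shaped records and produces a filtered star schema with one dimension table and one fact table. The record fields, the filter rule, and the function `build_star_schema` are illustrative assumptions, not the Edmunds data model.

```python
# Hypothetical pre-process records: complete, raw, shaped like the source.
RAW_INVENTORY = [
    {"vin": "V1", "make": "Toyota", "model": "Celica", "dealer": "D10", "price": 8900},
    {"vin": "V2", "make": "Toyota", "model": "Celica", "dealer": "D11", "price": 9100},
    {"vin": "V3", "make": "Honda", "model": "Civic", "dealer": "D10", "price": None},
    {"vin": "V4", "make": "Honda", "model": "Civic", "dealer": "D12", "price": 7600},
]


def build_star_schema(records):
    """Filter raw records and split them into a dimension and a fact table."""
    vehicle_dim = {}  # (make, model) -> surrogate key
    fact_rows = []
    for rec in records:
        # Filtered: drop rows that are not usable for reporting.
        if rec.get("price") is None:
            continue
        # Transformed: replace descriptive attributes with a surrogate key.
        natural_key = (rec["make"], rec["model"])
        vehicle_key = vehicle_dim.setdefault(natural_key, len(vehicle_dim) + 1)
        fact_rows.append(
            {
                "vehicle_key": vehicle_key,
                "vin": rec["vin"],
                "dealer": rec["dealer"],
                "price": rec["price"],
            }
        )
    return vehicle_dim, fact_rows
```

The raw list can be loaded without knowing anything about its fields, which matches the quick turn-around of the pre-process side. The star schema needs design decisions such as the filter and the dimension key, which is where the slower turn-around and faster queries of the post-process side come from.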
Post-Process Sandbox

Flowchart boxes, as read from the diagram:
  • Use pre-process data
  • Load data in ad-hoc manner
  • Prototype
  • Data has value?
      No: Discard
            • prevents shadow production
            • little effort lost
      Yes: Develop Optimized Pipeline
            • data is confirmed to be useful
            • effort is warranted
  • Schema is stable?
  • Enhance
  • Change schema (by users or developers)
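One reading of the sandbox flowchart can be written as a tiny decision function. The ordering of the two checks and the loop back through schema changes are an interpretation of the diagram, and the name `next_step` and its return strings are made up for illustration.

```python
def next_step(data_has_value, schema_is_stable):
    """Return the next action for a sandbox prototype.

    Assumed reading of the flowchart: value is checked first; a valuable
    prototype keeps iterating until its schema stops changing.
    """
    if not data_has_value:
        # Little effort is lost, and no shadow production system appears.
        return "discard"
    if not schema_is_stable:
        # Users or developers change the schema and the prototype is enhanced.
        return "change schema and enhance prototype"
    # The data is confirmed useful, so the engineering effort is warranted.
    return "develop optimized pipeline"
```

For example, `next_step(True, False)` keeps the work in the sandbox, while `next_step(True, True)` promotes it to an optimized pipeline.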
Technology

Publishing System
  • All Data
  • Generic
  • Thrift IDL with Versioning

Hadoop Stack
  • HBase raw data
  • Oozie job coordinator
  • HDFS storage of pre and optimized data replica of RDBMS in files

Netezza
  • All data loaded from Hadoop in batch
  • Analysis and data exploration - use the speed and power
  • Report generation
Edmunds Publishing System




Generic flow for pre-process

Top to bottom:
  • Producers: Inventory, Pricing, Vehicle, Dealer, Leads
  • Broker
  • Consumer
  • HBase
  • Map-Reduce
  • Netezza Action

(Vertical side label on the diagram: "Generic")
What architecture enables generic consumer?

Components: Thrift, Camel, ActiveMQ

   o  Message
        o  Delivery
        o  Routing
        o  Persistence
        o  Durability
   o  Retries
   o  Throttling
   o  Versioning
   o  Monitoring
Flexibility for Producers and Consumers: Support for Topologies

Field         Example Values                 Purpose
Environment   PROD, TEST, DEV                Promotion cycle of deployment units
Index         Blue, Green, Stage             Environment Index
Data Center   LAX1, EC2                      The data center where deployment unit is located
Site          Edmunds, Insideline            Company’s Product
Application   HBase, Digital Asset Manager   Deployment Unit
Producer-Consumer matching

Producer (Virtual Topic Name)
  I am:     Publish Inventory; Prod; Lax; Edmunds; Inventory
  Send To:  Prod, Test; Lax, EC2; Edmunds; Dealer

Consumer (Queue Name)
  I am:          Publish Inventory; Test; EC2; Edmunds; Dealer
  Receive From:  Prod; Lax, EC2; Edmunds; Inventory

Broker Destination Interceptor: Match!
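The matching rule suggested by this slide can be sketched as a two-way check: the consumer's identity must fall inside the producer's "Send To" sets, and the producer's identity must fall inside the consumer's "Receive From" sets. This is an interpretation of the diagram, not the actual broker interceptor. The `Endpoint` class, the field names, and the omission of the Index field are assumptions.

```python
from dataclasses import dataclass

# Topology fields used for matching (the Index field is left out here).
FIELDS = ("environment", "data_center", "site", "application")


@dataclass
class Endpoint:
    identity: dict  # field -> single value ("I am")
    peers: dict     # field -> set of accepted values ("Send To" / "Receive From")


def accepts(peers, identity):
    """True if every identity field is within the accepted values."""
    return all(identity[field] in peers[field] for field in FIELDS)


def matches(producer, consumer):
    """Both sides must accept each other for messages to be routed."""
    return accepts(producer.peers, consumer.identity) and accepts(
        consumer.peers, producer.identity
    )


# Values taken from the slide.
PRODUCER = Endpoint(
    identity={"environment": "Prod", "data_center": "Lax",
              "site": "Edmunds", "application": "Inventory"},
    peers={"environment": {"Prod", "Test"}, "data_center": {"Lax", "EC2"},
           "site": {"Edmunds"}, "application": {"Dealer"}},
)
CONSUMER = Endpoint(
    identity={"environment": "Test", "data_center": "EC2",
              "site": "Edmunds", "application": "Dealer"},
    peers={"environment": {"Prod"}, "data_center": {"Lax", "EC2"},
           "site": {"Edmunds"}, "application": {"Inventory"}},
)
```

With the slide's values, `matches(PRODUCER, CONSUMER)` is true. A consumer whose environment is not in the producer's "Send To" list would not match.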
HBase: how to handle data generically
      Column Family   Columns                                   Role
      Binary          Serialized Thrift Object                  System of record
      Binary          Hashcode of the Thrift Object             Check if updates are necessary (optimization)
      Discrete        Thrift Object Field 1, Field 2, Field 3   Versioning at the most granular level for lookups
      Type 2          Start Date, End Date, List of fields      Versioning for optimized dimension tables
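The layout in the table above can be sketched roughly as follows. This is a minimal illustration, not the Edmunds implementation: the function name `build_cells`, the family and qualifier names (`binary:object`, `binary:hashcode`, `discrete:<field>`), the use of JSON as a stand-in for Thrift serialization, and SHA-1 as the hash are all assumptions made for the example. The Type 2 family is left out because the slide does not say how its start and end dates are maintained.

```python
import hashlib
import json
from typing import Dict, Optional


def build_cells(obj: dict, stored_hash: Optional[str] = None) -> Optional[Dict[str, bytes]]:
    """Return the cells to write for obj, or None when nothing changed.

    JSON stands in for the serialized Thrift object (an assumption).
    """
    # "Binary" family: the serialized object acts as the system of record.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")

    # The hash lets the writer skip the update when the object is unchanged.
    digest = hashlib.sha1(payload).hexdigest()
    if digest == stored_hash:
        return None

    cells = {
        "binary:object": payload,
        "binary:hashcode": digest.encode("ascii"),
    }

    # "Discrete" family: one cell per field, so each field can be
    # versioned and looked up on its own.
    for name, value in obj.items():
        cells[f"discrete:{name}"] = str(value).encode("utf-8")
    return cells
```

Comparing the hash before writing is what makes the hashcode column an optimization: an unchanged object produces no new cell versions in the discrete family.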




Generic Thrift Persistence in HBase
     Column Name                                                                                              Value
     [ModelYear]|F:id|T:long|I:0                                                                              1368
     [ModelYear]|F:midYear|T:boolean|I:1                                                                      false
     [ModelYear]|F:year|T:int|I:2                                                                             1993
     [ModelYear]|F:name|T:java.lang.String|I:4                                                                Celica
     [ModelYear]#[attributss][0]|F:_key|T:java.lang.Long                                                      64
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1          Standard Sport
                                                                                                              V:GT-S 2dr
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1          Hatchback
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2                         441
     [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1          V:GT-S
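The column names on this slide appear to encode a path into the object: a root in brackets, nested collections joined with `#` as `[field][index]`, then `|F:` (field name), `|T:` (type) and `|I:` (an index). A rough sketch of such a flattening is shown here. The function `flatten` is hypothetical, Python type names replace the Java and Thrift type names on the slide, and `I:` is taken to be the field's position within its object, which is only a guess at what the slide's index means.

```python
from typing import Any, Dict, Optional


def flatten(path: str, obj: Dict[str, Any],
            out: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Flatten a nested dict into {column_name: value} pairs.

    Lists are assumed to contain only dicts (an assumption of this sketch).
    """
    if out is None:
        out = {}
    for index, (field, value) in enumerate(obj.items()):
        if isinstance(value, list):
            # Each list element extends the path: parent#[field][i]
            for i, item in enumerate(value):
                flatten(f"{path}#[{field}][{i}]", item, out)
        else:
            # Scalar field: path|F:<name>|T:<type>|I:<position>
            type_name = type(value).__name__
            out[f"{path}|F:{field}|T:{type_name}|I:{index}"] = value
    return out
```

For example, `flatten("[ModelYear]", {"id": 1368, "styles": [{"name": "GT-S"}]})` yields one column for `id` and one for the nested `name`, so any object shape can be stored without a per-type table schema.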



Netezza: Time is Money
          Compared to Oracle                 Business Value

          Up to 12x faster load times        Can reload data more frequently
                                             Failed workflows are no longer a big problem
                                             Helps in the transition to a real-time system:
                                             we can now create intraday reports for Leads!

          Up to 400x faster query times      More productive Business Intelligence
                                             Queries that could 'never' finish in Oracle are
                                             now providing business value




Generic and reusable Oozie actions for Netezza

     Layers, top to bottom:
     Oozie Load and Remove Action
     Apache CLI
     Nzload and Nzsql (provisioned on worker nodes using Chef)
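The layering on this slide (a workflow action that parses options and then calls the Netezza command-line tools) might look roughly like the sketch below. It only builds the command line and does not run it. `argparse` stands in for Apache Commons CLI, the option names (`--action`, `--table`, and so on) are hypothetical, credentials are left out, and the `nzload` and `nzsql` flags shown should be checked against the Netezza documentation for the installed version.

```python
import argparse
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the options a load or remove action would receive."""
    parser = argparse.ArgumentParser(description="Netezza load/remove action (sketch)")
    parser.add_argument("--action", choices=["load", "remove"], required=True)
    parser.add_argument("--host", required=True)
    parser.add_argument("--db", required=True)
    parser.add_argument("--table", required=True)
    parser.add_argument("--file", help="data file to load (load only)")
    parser.add_argument("--where", help="row filter to delete (remove only)")
    return parser.parse_args(argv)


def build_command(args: argparse.Namespace) -> List[str]:
    """Translate parsed options into an nzload or nzsql command line."""
    if args.action == "load":
        if not args.file:
            raise ValueError("--file is required for load")
        return ["nzload", "-host", args.host, "-db", args.db,
                "-t", args.table, "-df", args.file]
    if not args.where:
        raise ValueError("--where is required for remove")
    # Illustration only: a real action should validate the table and filter.
    sql = f"DELETE FROM {args.table} WHERE {args.where}"
    return ["nzsql", "-host", args.host, "-d", args.db, "-c", sql]
```

Keeping the action generic in this way means a new table needs only new workflow parameters, not new code, and the returned list can be handed to `subprocess.run` on a worker node where the tools are installed.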


Value
     o      The data warehouse proves product value both
            internally and to our customers
     o      Failing fast and quick turnaround let us
            know when we are building the right reporting
            and analytical products without a large up-front
            investment
     o      By combining all data in a single system we are
            enabling new products that we previously could
            not build


Krishnan Parasuraman       Greg Rokita
@kparasuraman              Edmunds.com




  Building Scalable Data Platforms
 Hadoop and Netezza Deployment Models

Hadoop World 2011: Building Scalable Data Platforms; Hadoop & Netezza Deployment Models

  • 1. Krishnan Parasuraman (Netezza), Greg Rokita (Edmunds.com). Building Scalable Data Platforms: Hadoop and Netezza Deployment Models
  • 2. Talking Points • Building scalable data platforms – Architectural considerations • Hadoop and Massively Parallel Databases – Similarities and differences – Usage patterns • Practitioner’s View Point – Edmunds.com data warehouse platform 2 Hadoop World 2011
  • 3. Building scalable data platforms. Typical Digital Media Information Processing Pipeline. Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations. Real Time Decision Engine: Display Ads, Recommendation, Personalized Content. Data Processing: Correlate, Structure, Consolidate. Reporting: Aggregate, Summarize, Ad-hoc analysis. Also listed: Scoring, Yield optimization, Audience Analytics.
  • 4. Building scalable data platforms. Inputs: Clicks, Visits, Page Views, Likes, Tweets, Impressions, Locations. Components: Real Time Decision Engine, Data Processing, Reporting. DATA PLATFORM.
  • 5. Building scalable data platforms. Real Time Decision Engine, Data Processing, Reporting. Workloads: • Real Time • High Velocity • Compute intensive • Cached Queries • High Concurrency • Transactional • Linearly Scalable • Full table scans • Low Latency • Disk bound • Disk bound • H. Concurrency • High Throughput. Data: • Structured • Structured • Mostly Structured • Structured • Un-Structured • Un-Structured • Some unstructured • Relational • Key-Value pairs • Machine Gen. Capability: • Stream Processing • Low Disk I/O • In-DB computation • OLAP • Memory resident • Fast Processing • SQL and MR • Columnar • Key based lookups • Low Cost/TB • Analytic Libraries
  • 6. Building scalable data platforms. Real Time Decision Engine, Data Processing, Reporting. Workloads: • Real Time • High Velocity • Compute intensive • Cached Queries • High Concurrency • Transactional • Linearly Scalable • Full table scans • Low Latency • Disk bound • Disk bound • High Throughput • H. Concurrency. Data: • Structured • Structured • Mostly Structured • Structured • Un-Structured • Un-Structured • Some unstructured • Relational • Key-Value pairs • Machine Gen. Capability: • Stream Processing • Low Disk I/O • In-DB computation • OLAP • Memory resident • Fast Processing • Columnar • SQL and MR • Key based lookups • Low Cost/TB • Analytic Libraries. Technologies: Hadoop, Massively Parallel DB, NoSQL Databases, In-Memory DB, Graph DB, Plain Ole’ DB on steroids
  • 7. Myth: A single technology will meet all the considerations for our scalable data platform needs. Best Practices: Workloads scale differently – Monolithic architectures don’t work. Minimize components – Data movement is painful. Understand tradeoffs – Performance / Price / Effort. Start with the core architecture and work in the edge cases.
  • 8. Massively parallel data warehouses. SQL and MR. Hosts (host controllers). Network fabric. Massively parallel compute nodes (each with FPGA, CPU, Memory). Distributed Storage.
  • 9. Hadoop. Map Reduce. Master Node (Job Tracker, Name Node). Network fabric. Parallel compute nodes (each with Task Tracker and Data Node). Distributed Storage.
  • 10. There are striking similarities: Massive parallelism. Execute code & algorithms next to data. Scalable. Highly Available. Map Reduce.
  • 11. But also key differences. Hadoop: Schema on Read – Data loading is fast; Batch Mode data access; Lower cost of data storage; Process unstructured data. Netezza: Optimized for Performance; Real time access, random reads, query optimizer, co-located joins; Hardware Accelerated queries. Callouts: Data Loading = File copy; SQL and Map Reduce; Look Ma, No ETL.
  • 12. These differences lead to opportunities for co-existence for Hadoop in a Netezza environment. 1. Scalable ETL engine – Complex data – Relationships not defined – Evolving schema. 2. Queryable Archive – Moving computation is cheaper than moving data. 3. Analytics sandbox – Exploratory analysis.
  • 13. Netezza-Hadoop: Deployment Patterns. Create context (classification, text mining); analyze unstructured data. Parse, aggregate; analyze, report semi-structured data. Active archival, long running queries; analyze, report structured data.
  • 14. Pattern 1: Data Processing Engine (ETL). Raw Weblogs. Hadoop Cluster (NameNode, JobTracker; three DataNode/TaskTracker nodes). Netezza Environment.
  • 15. Pattern 2: Low cost storage and dynamic provisioning. Amazon Cloud (Amazon S3, Elastic MapReduce). Netezza Environment. Diagram steps numbered 1 to 3.
  • 16. Pattern 3: Queryable Archive. Data Sources. Netezza Environment. Diagram steps numbered 1 to 3.
  • 17. Edmunds.com and Scale o Premier online resource for automotive information launched in 1995 as the first automotive information Web site o 15 million unique visitors o 210 million page views o 1 million+ new inventory items per day o 2 TB of new data every month o 40 node Hadoop cluster aggregating logs, advertising, vehicle, pricing, inventory and other data sets No part of this document or the information it contains may be used, or disclosed to any person or entity, for any purpose other than advancing the best interests of the Edmunds Inc., and any such disclosure requires the express approval of Edmunds Inc.
  • 18. Edmunds Proposition: We have developed an iterative approach to data warehouse development that has dropped the time it takes for us to deliver reports to our users from months to weeks.
  • 19. How did we do it? o Process o Technology o Understanding of Value
  • 20. Process: agile approach o Continuous and fast delivery of new features o Collaboration between users and developers o Make new data available quickly and inexpensively o Quick problem resolution o No wasting of entire development cycle if data is not useful o Encouragement of exploration and creation of new applications
  • 21. Process. Pre-process: • Complete • Raw • Modeled as source data • Generically loaded • Quick turn-around • Low retention • Slower performance. Post-process: • Filtered • Transformed • Modeled as star schema • Optimized • Slow turn-around • High retention • Fast performance
  • 22. Post-Process Sandbox. Steps: Use pre-process data; Load data in ad hoc manner; Prototype; Change schema; Enhance. Decisions (Yes/No): Data has value? Schema is stable? Outcomes: Discard (prevents shadow production by users or developers; little effort lost). Develop Optimized Pipeline (data is confirmed to be useful; effort is warranted).
  • 23. Technology. Publishing System: • All Data • Generic • Thrift IDL with Versioning. Hadoop Stack: • HBase raw data • Oozie job coordinator • HDFS storage of pre and optimized data replica of RDBMS in files. Netezza: • All data loaded from Hadoop in batch • Analysis and data exploration - use the speed and power • Report generation
  • 24. Edmunds Publishing System
  • 25. Generic flow for pre-process. Producers: Inventory, Pricing, Vehicle, Dealer, Leads. Broker. Generic Consumer. HBase. Map-Reduce. Netezza Action.
  • 26. What architecture enables generic consumer? Thrift Camel ActiveMQ o Message o Retries o Delivery o Throttling o Routing o Persistence o Versioning o Durability o Monitoring
  • 27. Flexibility for Producers and Consumers: Support for Topologies
      Field | Example Values | Purpose
      Environment | PROD, TEST, DEV | Promotion cycle of deployment units
      Index | Blue, Green, Stage | Environment Index
      Data Center | LAX1, EC2 | The data center where deployment unit is located
      Site | Edmunds, Insideline | Company’s Product
      Application | HBase, Digital Asset Manager | Deployment Unit
  • 28. Producer-Consumer matching. Producer (Topic Name: Publish Inventory). I am: Prod, Lax, Edmunds, Inventory. Send To: Prod, Test; Lax, EC2; Edmunds; Dealer. Consumer (Virtual Queue Name: Publish Inventory). I am: Test, EC2, Edmunds, Dealer. Receive From: Prod; Lax, EC2; Edmunds; Inventory. Broker: Destination Interceptor. Match!
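The matching rule on slide 28 can be read as a two-way filter: the producer's "Send To" sets must admit the consumer's "I am" values, and the consumer's "Receive From" sets must admit the producer's. The sketch below is one possible reading, not the Edmunds broker code; the field names (`environment`, `data_center`, `site`, `application`) are borrowed from slide 27, and `accepts` and `is_match` are hypothetical helpers.

```python
from typing import Dict, Set

Identity = Dict[str, str]      # one value per topology field ("I am")
Filter = Dict[str, Set[str]]   # allowed values per field ("Send To" / "Receive From")


def accepts(allowed: Filter, identity: Identity) -> bool:
    """True when every filtered field of `identity` holds an allowed value."""
    return all(identity[field] in values for field, values in allowed.items())


def is_match(producer: Identity, send_to: Filter,
             consumer: Identity, receive_from: Filter) -> bool:
    """Both sides must accept each other for a message to be routed."""
    return accepts(send_to, consumer) and accepts(receive_from, producer)


# Values taken from the slide; the field names are assumptions.
producer = {"environment": "Prod", "data_center": "Lax",
            "site": "Edmunds", "application": "Inventory"}
send_to = {"environment": {"Prod", "Test"}, "data_center": {"Lax", "EC2"},
           "site": {"Edmunds"}, "application": {"Dealer"}}

consumer = {"environment": "Test", "data_center": "EC2",
            "site": "Edmunds", "application": "Dealer"}
receive_from = {"environment": {"Prod"}, "data_center": {"Lax", "EC2"},
                "site": {"Edmunds"}, "application": {"Inventory"}}

matched = is_match(producer, send_to, consumer, receive_from)
```

With the slide's values both filters pass, which agrees with the "Match!" label. A consumer in an environment outside the producer's "Send To" set would be rejected.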
  • 29. HBase: how to handle data generically
      Column Family | Columns | Role
      Binary | Serialized Thrift Object | System of record
      Binary | Hashcode of the Thrift Object | Check if updates are necessary (optimization)
      Discrete | Thrift Object Field 1, Field 2, Field 3 | Versioning at the most granular level for lookups
      Type 2 | Start Date, End Date, List of fields | Versioning for optimized dimension tables
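Slide 29 stores a hashcode next to the serialized Thrift object so that unchanged records can be skipped. A minimal sketch of that check follows, under loud assumptions: an in-memory dict stands in for the HBase table, SHA-256 stands in for whatever hash the real system uses, and `RecordStore` and `upsert` are hypothetical names.

```python
import hashlib
from typing import Dict


def content_hash(serialized: bytes) -> str:
    """Hash of the serialized object (SHA-256 is an assumption, not from the slides)."""
    return hashlib.sha256(serialized).hexdigest()


class RecordStore:
    """In-memory stand-in for a table holding the object and its hash."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, object]] = {}

    def upsert(self, row_key: str, serialized: bytes) -> bool:
        """Write the record only if its content changed; return True on write."""
        new_hash = content_hash(serialized)
        existing = self.rows.get(row_key)
        if existing is not None and existing["hash"] == new_hash:
            return False  # identical content: skip the update
        self.rows[row_key] = {"object": serialized, "hash": new_hash}
        return True


store = RecordStore()
first_write = store.upsert("model-year-1368", b"celica-1993")
repeat_write = store.upsert("model-year-1368", b"celica-1993")
changed_write = store.upsert("model-year-1368", b"celica-1994")
```

Comparing a short hash avoids deserializing and diffing the stored object, which is presumably why the slide labels this column an optimization.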
  • 30. Generic Thrift Persistence in HBase
      Column Name | Value
      [ModelYear]|F:id|T:long|I:0 | 1368
      [ModelYear]|F:midYear|T:boolean|I:1 | false
      [ModelYear]|F:year|T:int|I:2 | 1993
      [ModelYear]|F:name|T:java.lang.String|I:4 | Celica
      [ModelYear]#[attributss][0]|F:_key|T:java.lang.Long | 64
      [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][0]|F:value|T:java.lang.String|I:1 | Standard Sport
      [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:value|T:java.lang.String|I:1 | V:GT-S 2dr Hatchback
      [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][1]|F:id|T:long|I:2 | 441
      [ModelYear]#[styles][3]#[attributeGroupsMap][10]#[attributes][3]|F:value|T:java.lang.String|I:1 | V:GT-S
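The column names on slide 30 appear to encode the path to each field (`#[collection][index]`), followed by the field name (`F:`), type (`T:`), and an index (`I:`). The sketch below flattens a nested Python dict into names of that shape. It is an illustration only: it uses Python type names instead of Java ones, uses the field's position as a stand-in for the `I:` value, handles only scalars and lists of dicts, and `flatten` is a hypothetical helper rather than the Edmunds implementation.

```python
from typing import Any, Dict


def flatten(path: str, obj: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a nested dict into {column_name: scalar_value}."""
    columns: Dict[str, Any] = {}
    for index, (field, value) in enumerate(obj.items()):
        if isinstance(value, list):
            # Each list element gets its own path segment: #[field][position]
            for position, element in enumerate(value):
                columns.update(flatten(f"{path}#[{field}][{position}]", element))
        else:
            # Scalar field: path|F:<name>|T:<type>|I:<position in the struct>
            type_name = type(value).__name__
            columns[f"{path}|F:{field}|T:{type_name}|I:{index}"] = value
    return columns


record = {
    "id": 1368,
    "midYear": False,
    "year": 1993,
    "styles": [{"name": "GT-S"}],
}
columns = flatten("[ModelYear]", record)
```

Because every scalar lands in its own column, a single field can be read or versioned without deserializing the whole object, which fits the "Discrete" column family described on slide 29.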
  • 31. Netezza: Time is Money. Compared to Oracle, with business value. Up to 12x faster load times: Can reload data more frequently; Failed workflows are no longer a big problem; Helps in transition to real time system: We can now create intraday reports for Leads! Up to 400x faster query times: More productive Business Intelligence; Queries that could ‘never’ finish in Oracle are now providing business value.
  • 32. Generic and reusable Oozie actions for Netezza. Oozie Load and Remove Action. Apache CLI. Nzload and Nzsql (provisioned on worker nodes using Chef).
  • 33. Value o Data warehouse proves product value both internally and to our customers o Failing fast and quick turn around allow us to know when we are building the right reporting and analytical products without a large up front investment o By combining all data in a single system we are enabling new products to be developed that we previously could not
  • 34. Krishnan Parasuraman (@kparasuraman), Greg Rokita (Edmunds.com). Building Scalable Data Platforms: Hadoop and Netezza Deployment Models