Using Hadoop to Expand Data Warehousing

Mike Peterson
VP of Platforms and Data Architecture, Neustar

Ron Bodkin
CEO and Founder, Think Big Analytics
ron.bodkin@thinkbiganalytics.com




Copyright © Think Big Analytics and Neustar Inc.
June 13, 2012
Agenda

Overview
Technology
Process
Conclusion




Big Data Highlights at Neustar
Timeline: 2010, 2011, 2012
»     Hadoop cluster rollout at Quova
»     Hadoop cluster rollout with UltraDNS




The Business Case
100 TB of INCREMENTAL Data Storage
3 Year Cost, Millions US $

»     Hadoop Big Data: $0.2
»     Netezza: $6.3
»     Teradata: $9.6
»     Oracle: $9.6
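
The chart figures can be turned into a per-terabyte comparison. The following is a minimal sketch that only restates the slide's numbers (three-year cost in millions of US dollars for 100 TB); the function names are illustrative, and the 100 TB denominator is assumed to apply equally to every platform.

```python
# Three-year cost for 100 TB of incremental storage, in millions of US dollars,
# as read from the chart.
THREE_YEAR_COST_MUSD = {
    "Hadoop": 0.2,
    "Netezza": 6.3,
    "Teradata": 9.6,
    "Oracle": 9.6,
}
CAPACITY_TB = 100  # assumed to be the same capacity for every platform


def cost_per_tb_usd(platform: str) -> float:
    """Three-year cost per terabyte, in US dollars."""
    return THREE_YEAR_COST_MUSD[platform] * 1_000_000 / CAPACITY_TB


def cost_ratio_vs_hadoop(platform: str) -> float:
    """How many times more expensive a platform is than Hadoop."""
    return THREE_YEAR_COST_MUSD[platform] / THREE_YEAR_COST_MUSD["Hadoop"]


if __name__ == "__main__":
    for name in THREE_YEAR_COST_MUSD:
        per_tb = cost_per_tb_usd(name)
        ratio = cost_ratio_vs_hadoop(name)
        print(f"{name}: ${per_tb:,.0f} per TB, {ratio:.1f}x Hadoop")
```

Under these figures Hadoop comes to roughly $2,000 per TB over three years, against tens of thousands of dollars per TB for the warehouse appliances.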
Big Data Warehouse




Challenges
•     Cost to store unstructured data
•     Poor response time to changing BI needs
•     Data Warehouse access for departments

Goals
•     Integrate unstructured data with EDW
•     Predictive analytics based on data science
•     Access to cluster for all users




Data Agility
Classic Warehouse
»     ETL
»     Pre-parse all data
»     Normalize up front
»     Feed data marts
»     New ideas = IT projects
»     Aggregate/Summarize

Big Data Warehouse
»     Store raw data
»     Parse only when proven
»     Approximate parse on demand
»     Analysis on demand
»     Provide ideas before projects to optimize
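
The "store raw data, parse on demand" idea can be sketched in a few lines. This is an illustration only: the pipe-delimited record layout, the field names, and the helper functions are hypothetical and are not taken from the Neustar system.

```python
from typing import Dict, Iterable, Iterator, List

# Raw events are kept exactly as captured; nothing is parsed at load time.
# The pipe-delimited layout and the sample values are hypothetical.
RAW_STORE: List[str] = [
    "2012-06-15T10:00:01|198.51.100.7|example.com|A|NOERROR",
    "2012-06-15T10:00:02|203.0.113.9|example.org|AAAA|NXDOMAIN",
    "2012-06-15T10:00:03|198.51.100.7|example.net|A|NOERROR",
]

# Position of each field inside a raw line (also hypothetical).
FIELD_POSITIONS: Dict[str, int] = {
    "ts": 0, "client": 1, "name": 2, "qtype": 3, "rcode": 4,
}


def parse_on_demand(lines: Iterable[str], wanted: List[str]) -> Iterator[Dict[str, str]]:
    """Extract only the requested fields, at query time."""
    for line in lines:
        parts = line.split("|")
        yield {field: parts[FIELD_POSITIONS[field]] for field in wanted}


def count_by(lines: Iterable[str], field: str) -> Dict[str, int]:
    """Ad-hoc analysis: count records per value of one field."""
    counts: Dict[str, int] = {}
    for record in parse_on_demand(lines, [field]):
        counts[record[field]] = counts.get(record[field], 0) + 1
    return counts


if __name__ == "__main__":
    print(count_by(RAW_STORE, "rcode"))
```

A new question (for example, counting by `qtype` instead of `rcode`) needs no reload or schema change, because the raw lines are still available.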




Change to Technology Focus

»     New data platforms unlock innovation
»     Not just package implementation
»     More open source technology
»     Rethink assumptions
»     Increase technology skills
»     Focus data teams




Working Together

                                                       »    Expertise in delivery
                                                       »    Trusted partner
                                                       »    Collaborative development


                                                       »    Open source leader
                                                       »    Invested in client success
                                                       »    Price/performance




Technology


Architecture Overview

[Diagram: data flow and cluster components]

»     Capture Server: Samples & Aggregates via RSync/SCP + Scripts
»     Cronacle Scheduler
»     Hive: Ad-Hoc queries and BI
»     Master Server: Name Node, Job Tracker
»     Backup Master Server: Cronacle Agent, Postgres ETL, Hive + UDFs, Secondary Name Node
»     Cluster of Slave Servers: Postgres (incl. Hive Metastore), HDFS, Task Tracker
»     Data paths: Postgres ETL to HDFS; Postgres + HDFS
»     Management, Monitoring
»     LDAP


Initial Hadoop Cluster

Current Configuration
»     40 servers
»     Hadoop and PostgreSQL

Data Nodes
»     2 x 12 cores
»     64 GB memory
»     24 x 3TB SATA drives
»     Mixed Nodes: RAID 6 storage
»     HDFS only nodes: JBOD
»     10Gbit Ethernet




System Scale
»  Query volume – light but ramping
  »  10,000 MapReduce processes/day
»  Ingesting over 40B rows a day
  »  1.5TB with 7x compression
»  Storage utilization at 45%
»  Core utilization spikes when processing Machine Learning algorithms
»  100% capture of multiple large product data sets
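
The ingestion figures can be put into per-second and per-row terms. This is a back-of-the-envelope sketch, assuming that the 1.5TB figure is the daily size after compression and that terabytes are decimal; neither is stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

ROWS_PER_DAY = 40_000_000_000   # "over 40B rows a day" (a lower bound)
STORED_TB_PER_DAY = 1.5         # assumed to be the size AFTER compression
COMPRESSION_RATIO = 7           # "7x compression"
BYTES_PER_TB = 10**12           # decimal terabytes (assumption)


def rows_per_second() -> float:
    """Average sustained ingest rate."""
    return ROWS_PER_DAY / SECONDS_PER_DAY


def stored_bytes_per_row() -> float:
    """Average compressed footprint of one row."""
    return STORED_TB_PER_DAY * BYTES_PER_TB / ROWS_PER_DAY


def uncompressed_tb_per_day() -> float:
    """Daily volume before compression, under the stated assumption."""
    return STORED_TB_PER_DAY * COMPRESSION_RATIO


if __name__ == "__main__":
    print(f"{rows_per_second():,.0f} rows/s")
    print(f"{stored_bytes_per_row():.1f} compressed bytes/row")
    print(f"{uncompressed_tb_per_day():.1f} TB/day before compression")
```

Under these assumptions the cluster averages about 463,000 rows per second at roughly 37.5 compressed bytes per row.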



Software Choices by Layer
Layers: Platform; Data Processing & Resource Management; Application Integration; Analytics

»     Business Intelligence & Reporting: Hive
»     Data Science & Algorithms
»     Application Support
»     Data Ingestion: Custom Script
»     Data Transformation & Aggregation: Hive
»     Data Publication: FTP
»     Workflow Management: Cronacle
»     Metadata Management
»     Data Governance
»     Low Latency Data Access: GridSQL, not Hadoop ecosystem
»     Resource Management: Fair Scheduler
»     Management & Monitoring: Hortonworks Data Platform, Ganglia
»     Cluster Security: LDAP, Hortonworks Data Platform
»     Platform Software: Hortonworks Data Platform, Oracle JDK 6, RHEL 6
»     Networking: 10 GigE
»     Infrastructure: HP Servers
»     Cluster Provisioning: On Premise Shared Hadoop and Grid SQL

Current configuration:
»     Redwood Software scheduler
»     Hortonworks Distribution
»     Move to Oracle JDK 6
»     Move to Red Hat




Massive Binary Format Data
Query
     SELECT * FROM datafile
     WHERE dt='2012-06-15';

Source: large partitioned binary file, 100's of TBs, made up of compressed binary records

1.   Binary InputFormat: parse into records
2.   Binary SerDe: parse into fields lazily
3.   Bean Object Inspector: fields determined by Java beans methods

»  Parse on the fly: don’t duplicate or lose original
»  Reused open source parser with custom extensions
»  Optimized with
   »  profiling
   »  lazy parsing: minimizes object creation
»  CPU bound due to parsing compact structure
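
The lazy-parsing step can be illustrated outside Hive. The sketch below is not the Hive SerDe API and not the parser used at Neustar; it is a small Python analogue in which the record layout, the `LazyRecord` class, and the `make_record` helper are all hypothetical. It shows a compressed binary record being inflated and decoded only when a field is first requested.

```python
import struct
import zlib

# Hypothetical layout: 4-byte id, 8-byte timestamp, 2-byte status,
# all unsigned and big-endian. Maps field name -> (struct format, byte offset).
_LAYOUT = {
    "record_id": (">I", 0),
    "timestamp": (">Q", 4),
    "status": (">H", 12),
}


class LazyRecord:
    """Holds one compressed record; inflates and decodes fields on demand."""

    def __init__(self, compressed: bytes):
        self._compressed = compressed
        self._raw = None        # inflated bytes, filled on first field access
        self._cache = {}        # fields decoded so far
        self.decode_count = 0   # kept only to make the laziness observable

    def get(self, field: str) -> int:
        if field not in self._cache:
            if self._raw is None:
                self._raw = zlib.decompress(self._compressed)
            fmt, offset = _LAYOUT[field]
            (value,) = struct.unpack_from(fmt, self._raw, offset)
            self._cache[field] = value
            self.decode_count += 1
        return self._cache[field]


def make_record(record_id: int, timestamp: int, status: int) -> bytes:
    """Build one compressed record in the hypothetical layout."""
    return zlib.compress(struct.pack(">IQH", record_id, timestamp, status))


if __name__ == "__main__":
    rec = LazyRecord(make_record(7, 1339718400, 200))
    print(rec.get("status"), rec.decode_count)
```

A query that touches one column pays for one field decode per record, which matters when the workload is CPU bound on parsing.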

Disk Failures and Recovery
»  5 drive failures in 9 months; 3 were DOA
»  Hadoop handled failure perfectly
»  RAID 6 PostgreSQL & GridSQL working fine




Storage Policy
»     Storage still isn’t free!
»     Newer data = 3 replicas
»     Older data = 2 replicas
»     Data retention = 1 year
»     Free space = 20% reserve
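
The policy above can be expressed as a small planning function. This is a sketch under stated assumptions: the slide does not say when data stops being "newer", so the 90-day cutoff is hypothetical, as are the function names. On a real cluster the replica count of existing files would be changed with a tool such as `hadoop fs -setrep`; the code here only computes targets.

```python
RETENTION_DAYS = 365       # "Data retention = 1 year"
RESERVE_FRACTION = 0.20    # "Free space = 20% reserve"
NEW_DATA_REPLICAS = 3      # "Newer data = 3 replicas"
OLD_DATA_REPLICAS = 2      # "Older data = 2 replicas"
OLD_AFTER_DAYS = 90        # hypothetical cutoff; the slide gives none


def target_replicas(age_days: int) -> int:
    """Desired replica count for data of a given age; 0 means expired."""
    if age_days > RETENTION_DAYS:
        return 0
    if age_days > OLD_AFTER_DAYS:
        return OLD_DATA_REPLICAS
    return NEW_DATA_REPLICAS


def raw_tb_needed(logical_tb_by_age: dict) -> float:
    """Raw disk required for {age_days: logical_tb}, including the reserve."""
    replicated = sum(
        tb * target_replicas(age) for age, tb in logical_tb_by_age.items()
    )
    return replicated / (1 - RESERVE_FRACTION)


if __name__ == "__main__":
    # 10 TB each of fresh, aged, and expired data.
    print(raw_tb_needed({10: 10, 200: 10, 400: 10}))
```

Dropping older data from three replicas to two cuts its raw footprint by a third, which is the saving the policy is after.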




Process


The Big Data Journey


Phase 1: Enterprise Cluster Dev & Deployment

Phase 2: Comprehensive Ingestion by Service Offerings

Phase 3: Develop Big Data Capabilities
  »    New Applications
  »    Data Science
  »    Advanced Analytics
  »    Cost Savings

Data Services Strategy: Ingestion, Data Science, Monetization, Cost Savings




Organizational Approach
»     Executive Support
»     Roles and Organization
»     Data Governance
»     Outreach
»     Training
»     Product Definition
»     Data Sharing
»     Data Science / Analytics
»     Data Acquisition External / Data Acquisition Internal
»     Platform Build
Organizational Investment


[Diagram: roles]

»     Database Administrator
»     Data Developers
»     Data Architect
»     Big Data Administrator: New workloads & tools
»     Big Data Engineer: Distributed development
»     Data Architect: Big Data Modeling, varying data structures
»     Data Scientist: Math, programming & analysis

Training to Match Organization

»     Fundamentals Track
»     Tools Track
»     Applications Track

Guiding Principles
•      Making Hadoop Big Data “relevant” at the company and job
•      Cross-train math and engineering skill bases
•      Lab Team exposure to new and emerging technology training


Building Data Science Capability




Use of Capability

[Diagram: decision tree mapping a use case to a platform]

Technology Selection Criteria
»    Structure
»    Compute Scale
»    Data Volume
»    Latency
»    Analysis Type

Use Case
»    Basic Reporting
»    Data Ingestion
»    Batch Processing
»    Fast Analytics
»    Data Enrichment
»    Data Science

Decision points
»    Data Type: Only Structured; Complex Structure
»    # Calculations: At least 10 PetaFLOP; Under 10 PetaFLOP
»    Data Size: 100 TB or more; 10-100 TB; Under 10 TB
»    Latency?: A minute or more; Under a minute
»    Analysis Type?: Parallel, Complex; Other (Simple, Production, Structural); Tightly Integrated with existing data

Outcomes shown: EDW, existing platform, Data

Trends
•       Compute model scores faster
•       Analyze full data sets
•       Incorporate new data
•       Build new services from data
Conclusion


Getting Value from Big Data

»  Expand warehousing capability with Hadoop
»  Enable data science to create new value
»  Organizational change is a journey




Thank You

Mike Peterson
VP of Platforms and Data Architecture, Neustar

Ron Bodkin
CEO and Founder, Think Big Analytics
ron.bodkin@thinkbiganalytics.com




26   Copyright © Think Big Analytics and Neustar Inc.

Más contenido relacionado

La actualidad más candente

Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Imviplav
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data ApplicationsRichard McDougall
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Jonathan Seidman
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - IntroductionTomy Rhymond
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in detailsMahmoud Yassin
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseAge Mooij
 
20100806 cloudera 10 hadoopable problems webinar
20100806 cloudera 10 hadoopable problems webinar20100806 cloudera 10 hadoopable problems webinar
20100806 cloudera 10 hadoopable problems webinarCloudera, Inc.
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopSavvycom Savvycom
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An OverviewC. Scyphers
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInHari Shankar Sreekumar
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?Hortonworks
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHortonworks
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architectureMilos Milovanovic
 

La actualidad más candente (20)

Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2Big data analytics with hadoop volume 2
Big data analytics with hadoop volume 2
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data Applications
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
Big data with Hadoop - Introduction
Big data with Hadoop - IntroductionBig data with Hadoop - Introduction
Big data with Hadoop - Introduction
 
Big data introduction, Hadoop in details
Big data introduction, Hadoop in detailsBig data introduction, Hadoop in details
Big data introduction, Hadoop in details
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBase
 
20100806 cloudera 10 hadoopable problems webinar
20100806 cloudera 10 hadoopable problems webinar20100806 cloudera 10 hadoopable problems webinar
20100806 cloudera 10 hadoopable problems webinar
 
Introduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & HadoopIntroduction of Big data, NoSQL & Hadoop
Introduction of Big data, NoSQL & Hadoop
 
Big Data Platforms: An Overview
Big Data Platforms: An OverviewBig Data Platforms: An Overview
Big Data Platforms: An Overview
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
VMworld 2013: Big Data Extensions: Advanced Features and Customer Case Study
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedIn
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
 
Hadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - JaspersoftHadoop Reporting and Analysis - Jaspersoft
Hadoop Reporting and Analysis - Jaspersoft
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 

Destacado

adage-factpack-neustar-final
adage-factpack-neustar-finaladage-factpack-neustar-final
adage-factpack-neustar-finalangielynncul
 
Algorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAlgorithmic techniques-for-big-data-analysis
Algorithmic techniques-for-big-data-analysisAtner Yegorov
 
Applying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopApplying Testing Techniques for Big Data and Hadoop
Applying Testing Techniques for Big Data and HadoopMark Johnson
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBradford Stephens
 
A Brief History of Big Data
A Brief History of Big DataA Brief History of Big Data
A Brief History of Big DataBernard Marr
 
Comparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsComparison of MPP Data Warehouse Platforms
Comparison of MPP Data Warehouse PlatformsDavid Portnoy
 
Intro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataIntro to Data Science for Enterprise Big Data
Intro to Data Science for Enterprise Big DataPaco Nathan
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsDavid Pittman
 
Hybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsHybrid Data Warehouse Hadoop Implementations
Hybrid Data Warehouse Hadoop ImplementationsDavid Portnoy
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataMarko Rodriguez
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data ScientistDaniel Tunkelang
 
Titan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraTitan: Big Graph Data with Cassandra
Titan: Big Graph Data with CassandraMatthias Broecheler
 

Using Hadoop to expand data warehousing

  • 1. Using Hadoop to Expand Data Warehousing
      Mike Peterson, VP of Platforms and Data Architecture, Neustar
      Ron Bodkin, CEO and Founder, Think Big Analytics (ron.bodkin@thinkbiganalytics.com)
      Copyright © Think Big Analytics and Neustar Inc. June 13, 2012
  • 2. Agenda
      »  Overview
      »  Technology
      »  Process
      »  Conclusion
  • 3. Big Data Highlights at Neustar
      Timeline: 2010, 2011, 2012
      »  Hadoop Cluster Rollout at Quova
      »  Hadoop Cluster Rollout with UltraDNS
  • 4. The Business Case
      100 TB of INCREMENTAL Data Storage, 3 Year Cost (Millions US $)
      »  Hadoop Big Data: $0.2
      »  Netezza: $6.3
      »  Teradata: $9.6
      »  Oracle: $9.6
  • 5. Big Data Warehouse
      Challenges
      •  Cost to store unstructured data
      •  Poor response time to changing BI needs
      •  Data Warehouse access for departments
      Goals
      •  Integrate unstructured data with EDW
      •  Predictive analytics based on data science
      •  Access to cluster for all users
  • 6. Data Agility
      Classic Warehouse
      »  ETL
      »  Pre-parse all data
      »  Normalize up front
      »  Feed data marts
      »  New ideas = IT projects
      »  Aggregate/Summarize to optimize
      Big Data Warehouse
      »  Store raw data
      »  Parse only when proven
      »  Approximate parse on demand
      »  Analysis on demand
      »  Provide ideas before projects
  • 7. Change to Technology Focus
      »  New data platforms unlock innovation
      »  Not just package implementation
      »  More open source technology
      »  Rethink assumptions
      »  Increase technology skills
      »  Focus data teams
  • 8. Working Together
      »  Expertise in delivery
      »  Trusted partner
      »  Collaborative development
      »  Open source leader
      »  Invested in client success
      »  Price/performance
  • 9. Technology
  • 10. Architecture Overview
      [Diagram. Labels: RSync/SCP + Capture Scripts; Cronacle Agent; Cronacle Scheduler; Postgres ETL Server; Hive + UDFs; to HDFS; Server Cluster; Master Server (Name Node, Job Tracker, Hive); Backup Master Server; Secondary Name Node; Postgres (incl. Hive Metastore); Slave nodes (Postgres + HDFS, HDFS, Task Tracker); Samples & Aggregates; Ad-Hoc queries and BI; Management, Monitoring; LDAP]
  • 11. Initial Hadoop Cluster. Current Configuration: »  40 servers »  Hadoop and PostgreSQL Data Nodes »  2 x 12 cores »  64 GB memory »  24 x 3 TB SATA drives »  Mixed Nodes – RAID 6 storage »  HDFS only nodes – JBOD »  10 Gbit Ethernet
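The configuration above implies some cluster-wide totals, sketched here under stated assumptions: every one of the 40 servers has the listed spec, "2 x 12 cores" means 24 cores per server, and a RAID 6 node uses a single array across all of its drives. None of these is confirmed by the slide, and the function names are illustrative.

```python
# Per-server figures taken from the slide.
SERVERS = 40
CORES_PER_SERVER = 2 * 12       # ASSUMPTION: "2 x 12 cores" = 24 cores per server
MEMORY_GB_PER_SERVER = 64
DRIVES_PER_SERVER = 24
DRIVE_TB = 3


def cluster_totals(servers=SERVERS):
    """Aggregate cores, memory, and raw disk, assuming identical servers."""
    return {
        "cores": servers * CORES_PER_SERVER,
        "memory_gb": servers * MEMORY_GB_PER_SERVER,
        "raw_disk_tb": servers * DRIVES_PER_SERVER * DRIVE_TB,
    }


def raid6_usable_tb(drives=DRIVES_PER_SERVER, drive_tb=DRIVE_TB):
    """RAID 6 spends two drives' worth of capacity on parity per array.

    ASSUMPTION: one array spanning every drive in the server.
    """
    return (drives - 2) * drive_tb


print(cluster_totals())
print("RAID 6 usable TB per mixed node:", raid6_usable_tb())
```

The slide does not say how many nodes are mixed (RAID 6) versus HDFS-only (JBOD), so the sketch stops at raw totals and a per-node RAID 6 figure rather than guessing a cluster-wide usable capacity.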
  • 12. System Scale »  Query volume – light but ramping »  10,000 MapReduce processes/day »  Ingesting over 40B rows a day »  1.5 TB with 7x compression »  Storage utilization at 45% »  Core utilization spikes when processing machine learning algorithms »  100% capture of multiple large product data sets
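The ingest figures above can be turned into per-row and per-year estimates. This is a sketch only: it assumes "1.5 TB with 7x compression" means 1.5 TB per day after compression, that 40 billion rows is a daily floor, and decimal terabytes (10^12 bytes). The slide does not spell out any of these.

```python
ROWS_PER_DAY = 40e9          # slide: "over 40B rows a day" (treated as a floor)
COMPRESSED_TB_PER_DAY = 1.5  # ASSUMPTION: 1.5 TB is the daily volume after compression
COMPRESSION_RATIO = 7
BYTES_PER_TB = 1e12          # ASSUMPTION: decimal terabytes


def uncompressed_tb_per_day():
    """Daily volume before compression."""
    return COMPRESSED_TB_PER_DAY * COMPRESSION_RATIO


def compressed_bytes_per_row():
    """Average stored size of one row after compression."""
    return COMPRESSED_TB_PER_DAY * BYTES_PER_TB / ROWS_PER_DAY


def compressed_tb_per_year(days=365):
    """Single-copy storage needed for one year of ingest, before replication."""
    return COMPRESSED_TB_PER_DAY * days


print(f"{uncompressed_tb_per_day():.1f} TB/day uncompressed")
print(f"{compressed_bytes_per_row():.1f} bytes/row compressed")
print(f"{compressed_tb_per_year():.1f} TB/year compressed, single copy")
```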
  • 13. Software Choices by Layer. Current configuration: »  Redwood Software »  Hortonworks Distribution »  Move to Oracle JDK 6 »  Move to Red Hat. Layers and choices: Data Science & Algorithms; Business Intelligence & Reporting: Hive; Data Ingestion: Custom Script; Data Transformation & Aggregation: Hive; Data Publication: FTP scheduler; Management & Monitoring: Hortonworks Data Platform, Ganglia; Workflow Management: Cronacle; Cluster Security: LDAP, Hortonworks Data Platform; Metadata Management; Low Latency Data Access: GridSQL, not Hadoop ecosystem; Data Governance; Resource Management: Fair Scheduler; Platform Software: Hortonworks Data Platform, Oracle JDK 6, RHEL 6; Networking: 10 GigE; Infrastructure: HP Servers; Cluster Provisioning: On Premise Shared Hadoop and GridSQL.
  • 14. Massive Binary Format Data. Query: SELECT * FROM datafile WHERE dt='2012-06-15'; »  Parse on the fly: don’t duplicate or lose the original »  Reused open source parser with custom extensions »  Optimized with profiling and lazy parsing, which minimizes object creation »  CPU bound due to parsing compact structure. Diagram: a large partitioned binary file (100s of TBs) of compressed binary records (record 1, record 2, ...) goes through three stages: 1. Parse into records (Binary InputFormat); 2. Parse into fields lazily (Binary SerDe); 3. Fields determined by Java beans methods (Bean Object Inspector).
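The lazy-parsing idea on this slide can be illustrated with a small sketch. This is not the Hive InputFormat/SerDe/ObjectInspector code the deck describes (that would be Java); it is a Python analogy with a made-up record layout. The class name, field names, and layout (uint32 timestamp, uint16 status, UTF-8 payload) are all hypothetical.

```python
import struct
import zlib


class LazyRecord:
    """Wraps one compressed binary record; decodes fields only when first read."""

    # Hypothetical layout: little-endian uint32 timestamp, uint16 status, then payload.
    _HEADER = struct.Struct("<IH")

    def __init__(self, compressed: bytes):
        self._compressed = compressed
        self._raw = None        # decompressed bytes, filled on first field access
        self._cache = {}        # field name -> decoded value
        self.decode_count = 0   # number of field decodes, kept for illustration

    def _bytes(self) -> bytes:
        if self._raw is None:
            self._raw = zlib.decompress(self._compressed)
        return self._raw

    def _get(self, name, decode):
        if name not in self._cache:
            self._cache[name] = decode(self._bytes())
            self.decode_count += 1
        return self._cache[name]

    @property
    def timestamp(self) -> int:
        return self._get("timestamp", lambda b: self._HEADER.unpack_from(b, 0)[0])

    @property
    def status(self) -> int:
        return self._get("status", lambda b: self._HEADER.unpack_from(b, 0)[1])

    @property
    def payload(self) -> str:
        return self._get("payload", lambda b: b[self._HEADER.size:].decode("utf-8"))


def make_record(timestamp: int, status: int, payload: str) -> bytes:
    """Build one compressed record in the hypothetical layout (test data only)."""
    raw = LazyRecord._HEADER.pack(timestamp, status) + payload.encode("utf-8")
    return zlib.compress(raw)


records = [LazyRecord(make_record(1339718400 + i, 200, f"row-{i}")) for i in range(3)]

# A query that touches a single column decodes only that field of each record;
# the payload string is never materialized.
timestamps = [r.timestamp for r in records]
```

The design point is the same one the slide makes: the original bytes stay untouched, and objects are created only for the fields a query actually reads.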
  • 15. Disk Failures and Recovery »  5 drives in 9 months; 3 were DOA »  Hadoop handled failure perfectly »  RAID 6 PostgreSQL & GridSQL working fine
  • 16. Storage Policy »  Storage still isn’t free! »  Newer data = 3 replicas »  Older data = 2 replicas »  Data retention = 1 year »  Free space = 20% reserve
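The policy above can be expressed as a small rule, sketched here. The slide does not say at what age data moves from "newer" to "older", so the 90-day cutoff is a placeholder assumption, and the function names are illustrative.

```python
from datetime import date

RETENTION_DAYS = 365          # slide: data retention = 1 year
FREE_SPACE_RESERVE = 0.20     # slide: free space = 20% reserve
NEW_DATA_MAX_AGE_DAYS = 90    # ASSUMPTION: the slide gives no cutoff for "older" data


def replicas_for(partition_date: date, today: date) -> int:
    """Return the target replica count, or 0 if the partition is past retention."""
    age_days = (today - partition_date).days
    if age_days > RETENTION_DAYS:
        return 0
    return 3 if age_days <= NEW_DATA_MAX_AGE_DAYS else 2


def usable_capacity_tb(raw_tb: float) -> float:
    """Capacity left for data once the free-space reserve is set aside."""
    return raw_tb * (1 - FREE_SPACE_RESERVE)


today = date(2012, 6, 13)
for partition in (date(2012, 6, 1), date(2012, 1, 1), date(2011, 1, 1)):
    print(partition, "->", replicas_for(partition, today), "replicas")
```

In practice a rule like this could drive a periodic job that lowers the replication factor on aging partitions, for example with `hadoop fs -setrep`, and deletes partitions for which it returns 0.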
  • 17. Process
  • 18. The Big Data Journey. Phase 1: Enterprise Cluster Dev & Deployment. Phase 2: Comprehensive Ingestion by Service Offerings. Phase 3: Develop Big Data Capabilities »  New Applications »  Data Science »  Advanced Analytics »  Cost Savings. Diagram labels: Data Science; Monetization; Cost Savings; Ingestion; Data Services Strategy.
  • 19. Organizational Approach: Executive Support; Roles and Organization; Data Governance; Outreach; Training; Product Definition; Data Sharing; Data Science / Analytics; Data Acquisition External; Data Acquisition Internal; Platform Build
  • 20. Organizational Investment. Existing roles: Developers; Database Administrator; Data Architect. Big data roles: Big Data Administrator (new workloads & tools); Big Data Engineer (distributed development); Data Architect (big data modeling, varying data structures); Data Scientist (math, programming & analysis).
  • 21. Training to Match Organization. Tracks: Fundamentals Track; Tools Track; Applications Track. Guiding Principles: •  Making Hadoop Big Data “relevant” at the company and job •  Cross-train math and engineering skill bases •  Lab Team exposure to new and emerging technology training
  • 22. Building Data Science Capability
  • 23. Use of Capability. Use Case Technology Selection Criteria: »  Structure »  Compute Scale »  Data Volume »  Latency »  Analysis Type. Use cases: Basic Reporting; Data Ingestion; Batch Data Processing; Data Analysis; Fast Analytics; Data Enrichment. Decision diagram: branches on Data Type (Only Structured, Complex Structure), # Calculations (Under 10 PetaFLOP, At least 10 PetaFLOP), Data Size (Under 10 TB, 10-100 TB, 100 TB or more), Latency (Under a minute, A minute or more), Analysis Type (Simple, Parallel, Complex Structural, Other), and whether the work is production or tightly integrated with existing data; outcomes are EDW or existing platform. Data Science Trends: •  Compute model scores faster •  Analyze full data sets •  Incorporate new data •  Build new services from data
  • 24. Conclusion
  • 25. Getting Value from Big Data. Finalize these takeaways: »  Expand warehousing capability with Hadoop »  Enable data science to create new value »  Organizational change is a journey
  • 26. Thank You. Mike Peterson, VP of Platforms and Data Architecture, Neustar; Ron Bodkin, CEO and Founder, Think Big Analytics, ron.bodkin@thinkbiganalytics.com