    From Zero to Hadoop
    March 19, 2013




1
Agenda
    • Hadoop Ecosystem Overview
    • Hadoop Core Technical Overview
       • HDFS
       • MapReduce

    • Hadoop in the Enterprise
       • Cluster Planning
       • Cluster Management with Cloudera Manager


2
Hadoop Ecosystem Overview
    What Are All These Things?




3
Hadoop Ecosystem
    [Diagram: platform components arranged along the data flow
     Ingest → Store → Explore → Process → Analyze → Serve]
    • Cloud: Whirr
    • User interface: Hue
    • Workflow management: Oozie
    • Integration: Sqoop, Flume
    • File: FUSE-DFS
    • REST: WebHDFS / HttpFS
    • SQL: ODBC / JDBC
    • Batch processing: Hive, Pig, Mahout, DataFu
    • Batch compute: MapReduce, MapReduce2
    • Real-time access & compute: Impala
    • Resource management & coordination: YARN, ZooKeeper
    • Storage: HDFS, HBase
    • Metadata: Access, Meta Store
    • Connectors: BI, ETL, RDBMS
    • Management software & technical support (subscription options):
       • Cloudera Navigator: Audit (v1.0), Access (v1.0), Lineage, Lifecycle, Explore
       • Cloudera Manager: Core (required), BDR, RTD, RTQ




4
Sqoop
    Performs bidirectional data
    transfers between
    Hadoop and almost
    any SQL database
    with a JDBC driver



5
FlumeNG
    A streaming data collection and aggregation system
    for massive volumes of data, such as RPC services,
    Log4J, Syslog, etc.

    [Diagram: four Clients sending data to three Agents]

6
HBase
• A low-latency,
  distributed,
  non-SQL database built
  on HDFS.
• A “Columnar
  Database”



7
Hive
• Relational database abstraction using a SQL-like
     dialect called HiveQL
• Statements are executed as one or more MapReduce
     jobs

     SELECT s.word, s.freq, k.freq
     FROM shakespeare
     JOIN ON (s.word = k.word)
     WHERE s.freq >= 5;
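
The query as transcribed on the slide uses the aliases s and k without declaring them, and does not name the table behind k. The following is a minimal sketch of what a complete version of such a join might look like. It uses Python's built-in sqlite3 as a stand-in for Hive. The table name other_corpus and all sample rows are assumptions made for illustration, not part of the original deck.

```python
import sqlite3

# In-memory database standing in for a Hive warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shakespeare (word TEXT, freq INTEGER);
    -- `other_corpus` is a hypothetical name for the table aliased as k.
    CREATE TABLE other_corpus (word TEXT, freq INTEGER);
    -- Invented sample rows, for illustration only.
    INSERT INTO shakespeare VALUES ('the', 100), ('thou', 40), ('zany', 2);
    INSERT INTO other_corpus VALUES ('the', 300), ('thou', 12), ('ark', 7);
""")

# Same shape as the slide's query, with both aliases declared.
query = """
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s
    JOIN other_corpus k ON (s.word = k.word)
    WHERE s.freq >= 5
    ORDER BY s.word
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only words present in both tables with a frequency of at least 5 in shakespeare survive the join. In Hive the same statement would be compiled into one or more MapReduce jobs rather than run in-process.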

8
Pig
    •   High-level scripting language
        for executing one or more
        MapReduce jobs
    •   Created to simplify authoring
        of MapReduce jobs
    •   Can be extended with user-defined
        functions

        emps = LOAD 'people.txt' AS (id, name, salary);
        rich = FILTER emps BY salary > 200000;
        sorted_rich = ORDER rich BY salary DESC;
        STORE sorted_rich INTO 'rich_people.txt';
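
To make the data flow of the Pig script concrete, here is a rough single-machine equivalent in plain Python. It is only a sketch: the contents of people.txt are invented, the file is assumed to be tab-separated, and nothing here runs on a cluster.

```python
# Hypothetical contents of people.txt (tab-separated: id, name, salary).
people_txt = "1\tAda\t250000\n2\tBob\t90000\n3\tCy\t310000\n"

# LOAD 'people.txt' AS (id, name, salary)
emps = []
for line in people_txt.splitlines():
    emp_id, name, salary = line.split("\t")
    emps.append((int(emp_id), name, int(salary)))

# FILTER emps BY salary > 200000
rich = [emp for emp in emps if emp[2] > 200000]

# ORDER rich BY salary DESC
sorted_rich = sorted(rich, key=lambda emp: emp[2], reverse=True)

# STORE sorted_rich INTO ... (printed here instead of written out)
for emp in sorted_rich:
    print(emp)
```

Each Pig statement maps to one step over the whole relation. Pig turns the same sequence into MapReduce jobs so that it scales beyond one machine.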


9
Oozie
     A workflow engine and
     scheduler built specifically
     for large-scale job
     orchestration on a
     Hadoop cluster




10
ZooKeeper
• ZooKeeper is a distributed
     consensus engine
• Provides well-defined concurrent
     access semantics:
      • Leader election
      • Service discovery
      • Distributed locking / mutual
         exclusion
      • Message board / mailboxes
11
Mahout
A machine learning library with
algorithms for:
•    Recommendation based on users'
     behavior.
•    Clustering, which groups related documents.
•    Classification learned from existing
     categorized documents.
•    Frequent item-set mining (e.g., shopping
     cart content).

12
Hadoop Security
• Authentication is secured by MIT Kerberos v5
     and integrated with LDAP
• Provides Identity, Authentication, and
     Authorization
• Useful for multitenancy or secure
     environments




13
Hadoop Core Technical Overview
     Only the Good Parts




14
Components of HDFS

• NameNode – Holds all metadata for HDFS
   • Needs to be a highly reliable machine
      • RAID drives – typically RAID 10
      • Dual power supplies
      • Dual network cards – Bonded
   • The more memory the better – typically 36GB to 64GB
• Secondary NameNode – Provides checkpointing for the
  NameNode. The same hardware as the NameNode should be used


15
Components of HDFS – Contd.

 •    DataNodes – Hardware will depend on the specific needs of the
      cluster
       • No RAID needed, JBOD (just a bunch of disks) is used
       • Typical ratio is:
           • 1 hard drive
           • 2 cores
           • 4GB of RAM
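
As a small worked example of the ratio above, the sketch below scales 1 hard drive : 2 cores : 4GB of RAM to a given number of drives. The class and function names, and the choice of 12 drives, are illustrative assumptions and not a sizing recommendation from the deck.

```python
from dataclasses import dataclass


@dataclass
class DataNodeSpec:
    drives: int
    cores: int
    ram_gb: int


def spec_for_drives(drives: int) -> DataNodeSpec:
    """Scale the 1 drive : 2 cores : 4GB RAM ratio to `drives` disks."""
    return DataNodeSpec(drives=drives, cores=2 * drives, ram_gb=4 * drives)


# 12 drives is a hypothetical chassis size chosen for illustration.
node = spec_for_drives(12)
print(node)
```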




16
HDFS Architecture Overview
        Host 1: NameNode              Host 3: DataNode    Host 5: DataNode

        Host 2: Secondary NameNode    Host 4: DataNode    Host n: DataNode




17
HDFS Block Replication
        Block Size = 64MB
        Replication Factor = 3
        Blocks in HDFS: 1, 2, 3, 4, 5

        Node      Blocks held
        Node 1    2, 4, 5
        Node 2    1, 2, 5
        Node 3    1, 3, 4
        Node 4    2, 3, 4
        Node 5    1, 3, 5
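
The arithmetic behind this slide can be sketched in a few lines of Python. The block size, replication factor, and block-to-node placement come from the slide. The 320MB file size and the function names are assumptions chosen so that the file splits into exactly five blocks.

```python
import math
from collections import Counter

BLOCK_SIZE_MB = 64
REPLICATION_FACTOR = 3


def block_count(file_size_mb: float) -> int:
    """Number of HDFS blocks needed for a file (the last block may be partial)."""
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)


def raw_storage_mb(file_size_mb: float) -> float:
    """Total disk space consumed across the cluster once every block is replicated."""
    return file_size_mb * REPLICATION_FACTOR


# Placement of blocks 1-5 on the five nodes, as drawn on the slide.
placement = {
    "Node 1": {2, 4, 5},
    "Node 2": {1, 2, 5},
    "Node 3": {1, 3, 4},
    "Node 4": {2, 3, 4},
    "Node 5": {1, 3, 5},
}

# Count how many nodes hold each block.
replicas = Counter(block for blocks in placement.values() for block in blocks)

print(block_count(320), "blocks,", raw_storage_mb(320), "MB raw")
print(dict(replicas))
```

Every block appears on exactly three distinct nodes, so the loss of any single node still leaves two copies of each block it held.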

18
MapReduce – Map
     • Records from the data source (lines out of files, rows of a
       database, etc.) are fed into the map function as key/value pairs:
       e.g., (filename, line).
     • map() produces one or more intermediate values along with an
       output key from the input.
     [Diagram: Map Task → (key 1, values), (key 2, values), (key 3, values)
      → Shuffle Phase → (key 1, int. values) × 3 → Reduce Task
      → Final (key, values)]




19
MapReduce – Reduce
     • After the map phase is over, all the intermediate values for a
       given output key are combined together into a list

     • reduce() combines those intermediate values into one or more
       final values for that same output key
     [Diagram: Map Task → (key 1, values), (key 2, values), (key 3, values)
      → Shuffle Phase → (key 1, int. values) × 3 → Reduce Task
      → Final (key, values)]
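
The map, shuffle, and reduce steps described on these two slides can be imitated in a single Python process. This is a toy word count, not Hadoop API code: the function names and the two input records are made up for illustration, and a real job would run many map and reduce tasks in parallel.

```python
from collections import defaultdict


def map_fn(filename, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: gather all intermediate values for the same key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: combine the intermediate values into a final value for the key."""
    yield key, sum(values)


# Hypothetical input records as (filename, line) key/value pairs.
records = [("a.txt", "to be or not to be"), ("b.txt", "to do")]

intermediate = [pair for name, line in records for pair in map_fn(name, line)]
grouped = shuffle(intermediate)
result = dict(pair for key in sorted(grouped) for pair in reduce_fn(key, grouped[key]))
print(result)
```

The reduce function only ever sees one key together with the complete list of its intermediate values, which is the guarantee the shuffle phase provides.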




20
MapReduce – Shuffle and Sort




21
Hadoop In the Enterprise
     How It Works In The Real World




22
Networking
     • One of the most important things to consider when
       setting up a Hadoop cluster
     • Typically, a top-of-rack switch in each rack is used,
       connected to a core switch
     • Be careful not to oversubscribe the backplane of the
       switch!
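
One way to check for oversubscription is to compare the traffic that all ports could offer at once with what the switch backplane can carry. The sketch below does that arithmetic. The 48-port, 1 Gbps, 32 Gbps figures are hypothetical and do not describe any particular switch.

```python
def oversubscription_ratio(port_count: int, port_gbps: float, backplane_gbps: float) -> float:
    """Ratio of the traffic all ports could offer to what the backplane can carry."""
    return (port_count * port_gbps) / backplane_gbps


# Hypothetical switch: 48 ports at 1 Gbps each, 32 Gbps of backplane capacity.
ratio = oversubscription_ratio(port_count=48, port_gbps=1.0, backplane_gbps=32.0)
print(f"oversubscription {ratio:.1f}:1")
```

A ratio above 1 means the ports can collectively offer more traffic than the backplane can move, which matters for Hadoop because shuffle and replication traffic tends to load many ports at the same time.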


24
Hadoop Typical Data Pipeline

     [Diagram: Data Sources → (Sqoop, Flume) → Hadoop → (Sqoop) → Data Warehouse → Marts]
     • Hadoop holds both the original source data and the result or calculated data
     • Inside Hadoop: HDFS, MapReduce, Hive, Pig, with Oozie on top




25
Hadoop Use Cases
Use Case: ADVANCED ANALYTICS                        Use Case: DATA PROCESSING
Application                       Industry          Application

Social Network Analysis           Web               Clickstream Sessionization
Content Optimization              Media             Clickstream Sessionization
Network Analytics                 Telco             Mediation
Loyalty & Promotions Analysis     Retail            Data Factory
Fraud Analysis                    Financial         Trade Reconciliation
Entity Analysis                   Federal           SIGINT
Sequencing Analysis               Bioinformatics    Genome Mapping


26
Hadoop in the Enterprise

     [Diagram: who uses Hadoop in the enterprise, and with what]
     • Operators: Management Tools
     • Engineers: IDEs
     • Analysts: BI / Analytics
     • Business Users: Enterprise Reporting
     • Customers: Web Application
     • Enterprise Data Warehouse
     • Data sources: Logs, Files, Web Data, Relational Databases




27
Cloudera Manager
     End-to-End Administration for CDH


     1. Manage: Easily deploy, configure & optimize clusters
     2. Monitor: Maintain a central view of all activity
     3. Diagnose: Easily identify and resolve issues
     4. Integrate: Use Cloudera Manager with existing tools



28
Install A Cluster In 3 Simple Steps
     Cloudera Manager Key Features


     1. Find Nodes: Enter the names of the hosts which will be
        included in the Hadoop cluster. Click Continue.
     2. Install Components: Cloudera Manager automatically installs
        the CDH components on the hosts you specified.
     3. Assign Roles: Verify the roles of the nodes within your
        cluster. Make changes as necessary.


29
View Service Health & Performance
     Cloudera Manager Key Features




30
Monitor & Diagnose Cluster Workloads
     Cloudera Manager Key Features




31
Visualize Health Status With Heatmaps
     Cloudera Manager Key Features




32
Rolling Upgrades
     Cloudera Manager Key Features




33
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Cloudera Sessions - Clinic 1 - Getting Started With Hadoop

  • 1. From Zero to Hadoop. March 19, 2013
  • 2. Agenda • Hadoop Ecosystem Overview • Hadoop Core Technical Overview (HDFS, MapReduce) • Hadoop in the Enterprise (Cluster Planning, Cluster Management with Cloudera Manager)
  • 3. Hadoop Ecosystem Overview: What Are All These Things?
  • 4. Hadoop Ecosystem (diagram). Pipeline stages: Ingest, Store, Explore, Process, Analyze, Serve. Labeled areas and components: Cloud (Whirr); User Interface (Hue); Workflow Management (Oozie); Metadata (Meta Store); Integration (Sqoop, Flume); Batch Processing (Hive, Pig, Mahout, DataFu); Batch Compute (MapReduce, MapReduce2); Real-Time Access & Compute (Impala); Resource Management & Coordination (YARN, ZooKeeper); File access (Fuse-DFS); REST access (WebHDFS / HttpFS); SQL access (ODBC / JDBC); Storage (HDFS, HBase); Connectors (BI, ETL, RDBMS). Management software and technical support: Cloudera Manager, BDR, and Cloudera Navigator (Audit v1.0, Access v1.0, Lineage, Lifecycle, Explore). Subscription options: Core (required), RTD, RTQ.
  • 5. Sqoop: performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver
  • 6. FlumeNG: a streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc. (Diagram: multiple clients sending data to agents.)
  • 7. HBase • A low-latency, distributed, non-SQL database built on HDFS • A “Columnar Database”
  • 8. Hive • Relational database abstraction using a SQL-like dialect called HiveQL • Statements are executed as one or more MapReduce jobs • Example: SELECT s.word, s.freq, k.freq FROM shakespeare JOIN ON (s.word = k.word) WHERE s.freq >= 5;
  • 9. Pig • High-level scripting language for executing one or more MapReduce jobs • Created to simplify authoring of MapReduce jobs • Can be extended with user-defined functions • Example: emps = LOAD 'people.txt' AS (id, name, salary); rich = FILTER emps BY salary > 200000; sorted_rich = ORDER rich BY salary DESC; STORE sorted_rich INTO 'rich_people.txt';
  • 10. Oozie: a workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
  • 11. ZooKeeper • ZooKeeper is a distributed consensus engine • Provides well-defined concurrent access semantics: • Leader election • Service discovery • Distributed locking / mutual exclusion • Message board / mailboxes
  • 12. Mahout: a machine learning library with algorithms for: • Recommendation based on users' behavior • Clustering: grouping related documents • Classification: learning from existing categorized documents to assign categories to new ones • Frequent item-set mining (e.g., shopping cart contents)
  • 13. Hadoop Security • Authentication is secured by MIT Kerberos v5 and integrated with LDAP • Provides identity, authentication, and authorization • Useful for multitenancy or secure environments
  • 14. Hadoop Core Technical Overview: Only the Good Parts
  • 15. Components of HDFS • NameNode: holds all metadata for HDFS • Needs to be a highly reliable machine • RAID drives, typically RAID 10 • Dual power supplies • Dual network cards, bonded • The more memory the better, typically 36GB to 64GB • Secondary NameNode: provides checkpointing for the NameNode. The same hardware as the NameNode should be used
  • 16. Components of HDFS – Contd. • DataNodes: hardware will depend on the specific needs of the cluster • No RAID needed; JBOD (just a bunch of disks) is used • Typical ratio is 1 hard drive : 2 cores : 4GB of RAM
  • 17. HDFS Architecture Overview (diagram): Host 1 runs the NameNode, Host 2 runs the Secondary NameNode, and Hosts 3 through n run DataNodes
  • 18. HDFS Block Replication (diagram): Block size = 64MB, replication factor = 3. A file in HDFS is split into blocks 1 to 5, and the copies of each block are spread across Nodes 1 to 5
  • 19. MapReduce – Map • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g., (filename, line) • map() produces one or more intermediate values along with an output key from the input (Diagram: map tasks emit (key, values) pairs, the shuffle phase groups the intermediate values by key, and the reduce task produces the final (key, values).)
  • 20. MapReduce – Reduce • After the map phase is over, all the intermediate values for a given output key are combined together into a list • reduce() combines those intermediate values into one or more final values for that same output key (Diagram: same map, shuffle, and reduce flow as on the Map slide.)
  • 21. MapReduce – Shuffle and Sort
  • 22. Hadoop in the Enterprise: How It Works in the Real World
  • 23. Networking • One of the most important things to consider when setting up a Hadoop cluster • Typically a top-of-rack switch is used with Hadoop, together with a core switch • Be careful not to oversubscribe the backplane of the switch!
  • 24. Hadoop Typical Data Pipeline (diagram). Elements shown: data sources, data warehouse, and marts outside Hadoop; original source data and result or calculated data stored in HDFS; Sqoop and Flume for moving data; Pig, Hive, and MapReduce for processing; Oozie for workflow
  • 25. Hadoop Use Cases (by industry: advanced analytics use case / data processing use case)
      • Web: Social Network Analysis / Clickstream Sessionization
      • Media: Content Optimization / Clickstream Sessionization
      • Telco: Network Analytics / Mediation
      • Retail: Loyalty & Promotions Analysis / Data Factory
      • Financial: Fraud Analysis / Trade Reconciliation
      • Federal: Entity Analysis / SIGINT
      • Bioinformatics: Sequencing Analysis / Genome Mapping
  • 26. Hadoop in the Enterprise (diagram). Audiences: operators, engineers, analysts, business users, customers. Tools and systems: management tools, IDEs, BI / analytics, enterprise reporting, enterprise data warehouse, web application. Data sources: logs, files, web data, relational databases
  • 27. Cloudera Manager: End-to-End Administration for CDH
      • 1 Manage: easily deploy, configure & optimize clusters
      • 2 Monitor: maintain a central view of all activity
      • 3 Diagnose: easily identify and resolve issues
      • 4 Integrate: use Cloudera Manager with existing tools
  • 28. Cloudera Manager Key Features: Install a Cluster in 3 Simple Steps
      • 1 Find Nodes: enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.
      • 2 Install Components: Cloudera Manager automatically installs the CDH components on the hosts you specified.
      • 3 Assign Roles: verify the roles of the nodes within your cluster. Make changes as necessary.
  • 29. Cloudera Manager Key Features: View Service Health & Performance
  • 30. Cloudera Manager Key Features: Monitor & Diagnose Cluster Workloads
  • 31. Cloudera Manager Key Features: Visualize Health Status With Heatmaps
  • 32. Cloudera Manager Key Features: Rolling Upgrades
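
The map, shuffle, and reduce steps described on the MapReduce slides can be sketched in a few lines of plain Python. This is only an in-memory illustration of the data flow, not Hadoop's API: the word-count task, the function names (`map_fn`, `shuffle`, `reduce_fn`, `run_job`), and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def map_fn(filename, line):
    """Map: take one (filename, line) record and emit (output key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: gather all intermediate values per key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce: combine the list of intermediate values into one final value."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, shuffle the results, then reduce each key."""
    intermediate = [
        pair
        for filename, line in records
        for pair in map_fn(filename, line)
    ]
    return [reduce_fn(key, values) for key, values in shuffle(intermediate)]


# Hypothetical input: (filename, line) records, as in the Map slide's example.
records = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
result = run_job(records)
print(result)
```

On a real cluster the map and reduce calls run as separate tasks on different hosts and the shuffle moves data over the network; here everything happens in one process so the grouping step is easy to see.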

Editor's Notes

  1. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present on at least 3 separate data nodes. A typical Hadoop node has eight cores, 16GB of RAM, and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
  2. Apache Hadoop is a new solution in your existing infrastructure. It does not replace any major existing investment. Apache Hadoop brings data that you're already generating into context and integrates it with your business. You get access to key information about how your business is operating by pulling together web and application logs, unstructured files, web data, and relational data. Hadoop is used by your team to analyze this data and deliver it to business users directly and via existing data management technologies.
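
The block size and replication figures in note 1 lend themselves to a quick back-of-the-envelope calculation. The sketch below is an assumption-laden illustration: the helper names, the 300MB file, and the node names are hypothetical, and the round-robin placement is a stand-in chosen for simplicity, not HDFS's actual replica placement policy.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def raw_storage_bytes(file_size_bytes, replication=3):
    """Total disk space consumed across the cluster once every block is replicated."""
    return file_size_bytes * replication


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


file_size = 300 * MB                            # hypothetical file
blocks_64 = block_count(file_size)              # with the 64MB default
blocks_128 = block_count(file_size, 128 * MB)   # with the common 128MB setting
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(blocks_64, nodes)

print(blocks_64, blocks_128, raw_storage_bytes(file_size) // MB)
for block, holders in placement.items():
    print(block, holders)
```

Doubling the block size roughly halves the number of blocks the NameNode has to track for the same data, which is one reason larger block sizes are popular, while the raw storage cost depends only on the replication factor.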