APP-CAP2956

Inside the Hadoop Machine




Jeff Buell, VMware, Inc.

Richard McDougall, VMware, Inc.

Sanjay Radia, Hortonworks




                           #vmworldapps
Disclaimer

!  This session may contain product features that are
    currently under development.

!  This session/overview of the new technology represents
    no commitment from VMware to deliver these features in
    any generally available product.

!  Features are subject to change, and must not be included in
    contracts, purchase orders, or sales agreements of any kind.

!  Technical feasibility and market demand will affect final delivery.
!  Pricing and packaging for any new technologies or features
    discussed or presented have not been determined.




2
Broad Application of Hadoop technology

!  Horizontal Use Cases
  •  Log processing / click stream analytics
  •  Machine learning / sophisticated data mining
  •  Web crawling / text processing
  •  Extract Transform Load (ETL) replacement
  •  Image / XML message processing
  •  General archiving / compliance
!  Vertical Use Cases
  •  Financial services
  •  Internet retailer
  •  Pharmaceutical / drug discovery
  •  Mobile / telecom
  •  Scientific research
  •  Social media

Hadoop's ability to handle large unstructured data affordably and efficiently makes
it a valuable toolkit for enterprises across a number of applications and fields.

3
How does Hadoop enable parallel processing?

!  A framework for distributed processing of large data sets across
    clusters of computers using a simple programming model.




                                  Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
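
The "simple programming model" can be illustrated with a word count. The sketch below is not Hadoop code: it imitates the map, shuffle, and reduce phases in a single Python process, and every function name in it is a hypothetical chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(splits):
    # Each split could be mapped on a different host; here they run in turn.
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

Because each call to `map_phase` depends only on its own split, the map work can be spread across as many hosts as there are splits, which is where the parallelism comes from.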


4
Hadoop System Architecture




!  MapReduce: Programming
     framework for highly parallel data
     processing
!  Hadoop Distributed File System
     (HDFS): Distributed data storage




 5
Job Tracker Schedules Tasks Where the Data Resides
[Figure: a job is submitted to the Job Tracker. The input file is divided into three 64 MB splits (Split 1, Split 2, Split 3), which are stored as Block 1, Block 2, and Block 3 on the DataNodes of Host 1, Host 2, and Host 3. Each host runs a Task Tracker, and Task 1, Task 2, and Task 3 run on the host whose DataNode holds the matching block.]
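
A minimal sketch of locality-aware task assignment, assuming a much simpler scheduler than the real Job Tracker. The function name, its arguments, and the fallback rule (pick the host with the most free slots) are hypothetical and serve only to show the idea of sending a task to a host that already stores its block.

```python
def schedule(splits, block_locations, free_slots):
    """Assign each split to a host, preferring hosts that store its block.

    splits: list of split ids
    block_locations: dict mapping split id -> list of hosts holding a replica
    free_slots: dict mapping host -> number of free task slots
    Returns a dict mapping split id -> (host, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for split in splits:
        local = [h for h in block_locations.get(split, []) if slots.get(h, 0) > 0]
        if local:
            host, is_local = local[0], True
        else:
            candidates = [h for h, n in slots.items() if n > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            # Hypothetical fallback: run remotely on the least loaded host.
            host, is_local = max(candidates, key=lambda h: slots[h]), False
        slots[host] -= 1
        assignment[split] = (host, is_local)
    return assignment


if __name__ == "__main__":
    locations = {"split1": ["host1"], "split2": ["host2"], "split3": ["host3"]}
    slots = {"host1": 1, "host2": 1, "host3": 1}
    print(schedule(["split1", "split2", "split3"], locations, slots))
```

With one free slot per host and one block per host, every task runs where its data resides, matching the layout in the figure.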




6
Hadoop Distributed File System




7
Hadoop Data Locality and Replication




8
Hadoop Topology Awareness




9
Why Virtualize Hadoop?



!  Simple to Operate
  •  Rapid deployment
  •  Unified operations across enterprise
  •  Easy clone of cluster
!  Highly Available
  •  No more single point of failure
  •  One click to set up
  •  High availability for MR jobs
!  Elastic Scaling
  •  Shrink and expand cluster on demand
  •  Resource guarantee
  •  Independent scaling of compute and data




10
Enterprise Challenges with Using Hadoop

!  Deployment
  •  Slow to provision
  •  Complex to keep running and tune
!  Single Points of Failure
  •  NameNode and JobTracker are single points of failure
  •  No HA for Hadoop framework components (Hive, HCatalog, etc.)
!  Low Utilization
  •  Dedicated Hadoop clusters run at low CPU utilization
  •  No easy way to share resources between Hadoop and non-Hadoop workloads
  •  Noisy neighbors; lack of resource containment
!  Need for Multi-tenant Isolation, Resource Management, etc.
  •  Noisy neighbors: no performance or security isolation between different tenants/users
  •  Lack of configuration isolation: can't run multiple versions on the same cluster




11
Virtualization enables a Common Infrastructure for Big Data


!  Cluster Sprawl
  •  Single-purpose clusters (Hadoop, HBase, MPP DB) for various business applications lead to cluster sprawl.
!  Cluster Consolidation
  •  MPP DB, HBase, and Hadoop share one virtualization platform.
!  Simplify
  •  Single hardware infrastructure
  •  Unified operations
!  Optimize
  •  Shared resources = higher utilization
  •  Elastic resources = faster on-demand access
 12
Deploy a Hadoop Cluster in under 30 Minutes

Step 1: Deploy the Serengeti virtual appliance on vSphere.
  •  Deploy vHelperOVF to vSphere

Step 2: Run a few simple commands to stand up the Hadoop cluster.
  •  Select compute, memory, storage, and network
  •  Select configuration template
  •  Automate deployment
  •  Done


  13
A Tour Through Serengeti



$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>




14
A Tour Through Serengeti



serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hadoop_NameNode, hadoop_jobtracker] 1            2   7500     LOCAL   50

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hive, hadoop_client, pig]             1          1   3700     LOCAL   50

     NAME                HOST                              IP
     -----------------------------------------------------------------
     myElephant-client0 rmc-elephant-009.eng.vmware.com    10.0.20.184




15
A Tour Through Serengeti



$ ssh rmc@rmc-elephant-009.eng.vmware.com

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data

…




16
Serengeti Spec File
{
    "distro": "apache",
    "nodeGroups": [
        {
            "name": "master",
            "roles": [
                "hadoop_NameNode",
                "hadoop_jobtracker"
            ],
            "instanceNum": 1,
            "instanceType": "MEDIUM",
            "ha": true
        },
        {
            "name": "worker",
            "roles": [
                "hadoop_datanode", "hadoop_tasktracker"
            ],
            "instanceNum": 5,
            "instanceType": "SMALL",
            "storage": {
                "type": "LOCAL",
                "sizeGB": 10
            }
        }
    ]
}

Callouts:
  •  "distro": choice of distro
  •  "ha": HA option
  •  "storage": choice of shared storage or local disk
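
A spec like the one above could also be generated and sanity-checked from a script before it is handed to Serengeti. This is a sketch under stated assumptions: the per-group field names (`name`, `roles`, `instanceNum`, `instanceType`, `ha`, `storage`) follow the slide, while the top-level `nodeGroups` wrapper key, the helper functions, and the two validation rules are assumptions made for illustration.

```python
import json


def node_group(name, roles, instance_num, instance_type, ha=None, storage=None):
    """Build one node-group entry using the field names shown on the slide."""
    group = {
        "name": name,
        "roles": list(roles),
        "instanceNum": instance_num,
        "instanceType": instance_type,
    }
    if ha is not None:
        group["ha"] = ha
    if storage is not None:
        group["storage"] = storage
    return group


def build_spec(distro, groups):
    """Assemble a spec dict after two hypothetical sanity checks."""
    names = [g["name"] for g in groups]
    if len(names) != len(set(names)):
        raise ValueError("node group names must be unique")
    for g in groups:
        if g["instanceNum"] < 1:
            raise ValueError(f"{g['name']}: instanceNum must be at least 1")
    # "nodeGroups" as the wrapper key is an assumption, not taken from the slide.
    return {"distro": distro, "nodeGroups": groups}


spec = build_spec("apache", [
    node_group("master", ["hadoop_NameNode", "hadoop_jobtracker"], 1, "MEDIUM", ha=True),
    node_group("worker", ["hadoop_datanode", "hadoop_tasktracker"], 5, "SMALL",
               storage={"type": "LOCAL", "sizeGB": 10}),
])

if __name__ == "__main__":
    print(json.dumps(spec, indent=2))
```

Emitting the file with `json.dumps` avoids the trailing commas and typographic quotes that make a hand-edited spec fail to parse.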

17
Configuring Distros


{
         "name" : "cdh",
         "version" : "3u3",
         "packages" : [
           {
              "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                         "hadoop_tasktracker", "hadoop_datanode",
                         "hadoop_client"],
              "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
           },
           {
              "roles" : ["hive"],
              "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
           },
           {
              "roles" : ["pig"],
              "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
           }
         ]
    },
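
The distro entry maps roles to the tarballs that provide them. A short sketch of that lookup follows; the data is copied from the entry above, and the two helper functions are hypothetical rather than part of Serengeti.

```python
CDH = {
    "name": "cdh",
    "version": "3u3",
    "packages": [
        {
            "roles": ["hadoop_NameNode", "hadoop_jobtracker",
                      "hadoop_tasktracker", "hadoop_datanode",
                      "hadoop_client"],
            "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz",
        },
        {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
        {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"},
    ],
}


def tarball_for_role(distro, role):
    """Return the tarball of the first package that lists the given role."""
    for package in distro["packages"]:
        if role in package["roles"]:
            return package["tarball"]
    raise KeyError(f"role {role!r} not provided by distro {distro['name']!r}")


def tarballs_for_roles(distro, roles):
    """Return the distinct tarballs a node group needs, in first-seen order."""
    needed = []
    for role in roles:
        tarball = tarball_for_role(distro, role)
        if tarball not in needed:
            needed.append(tarball)
    return needed


if __name__ == "__main__":
    print(tarballs_for_roles(CDH, ["hive", "hadoop_client", "pig"]))
```

A worker group with the `hadoop_datanode` and `hadoop_tasktracker` roles resolves to a single Hadoop tarball, while a client group with `hive`, `hadoop_client`, and `pig` needs all three.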




18
Open Source of Serengeti, Spring Hadoop, Hadoop Extensions


          Commercial Vendors            Community Projects




•  Support major distributions and multiple projects
•  Contribute Hadoop Virtualization Extension (HVE) to the open source
   community



19
Use Local Disk where it’s Needed




                       SAN Storage      NAS Filers      Local Storage

  Cost per gigabyte    $2 - $10         $1 - $5         $0.05

  $1M gets:
    Capacity           0.5 petabytes    1 petabyte      10 petabytes
    IOPS               200,000          200,000         400,000
    Bandwidth          8 GB/s           10 GB/s         250 GB/s

20
Extend Virtual Storage Architecture to Include Local Disk

!  Shared Storage: SAN or NAS
  •  Easy to provision
  •  Automated cluster rebalancing
!  Hybrid Storage
  •  SAN for boot images, VMs, other workloads
  •  Local disk for Hadoop & HDFS
  •  Scalable bandwidth, lower cost/GB

[Figure: six hosts, each running Hadoop VMs alongside other VMs.]




     21
Hadoop has Significant Ephemeral Data


[Figure: data flow of a MapReduce job. Map tasks read input data from DFS (12% of bandwidth) and write spills and logs (spill*.out) and map output (file.out). The shuffle (Map_*.out), combine (Intermediate.out), sort, and reduce spills make up the temporary data, which takes 75% of disk bandwidth. Reduce tasks write output data to DFS (12% of bandwidth). The DFS input and output data reside in HDFS.]




22
Virtualized Hadoop Performance

!  Issues of interest
  •  Native vs various virtual configurations
  •  Local disks vs Fibre Channel SAN
  •  Effect of protecting Hadoop master daemons with Fault Tolerance
  •  Public cloud (renting) vs private cloud (buying)

!  Test bed
  •  24x HP DL380 G7 hosts, each with 2x X5687 processors, 72 GB memory, 16x 146 GB SAS disks, a Broadcom 10 GbE adapter, and a QLogic 8 Gb/s HBA
  •  Arista 7124SX 10 GbE switch
  •  EMC VNX7500 storage array

23
Configuration

!  Software
 •  vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
 •  RHEL 6.1 x86_64
 •  Cloudera CDH3u4
 •  Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
!  Hadoop VMs
 •  Processors (16 logical threads), memory (72 GB), disks (12) partitioned among
     1, 2, or 4 VMs per host
 •  Separate VMs for NameNode and JobTracker for storage and FT tests
!  Hadoop configuration
 •  One map and one reduce task per vCPU (= logical thread)
     •  Machines are highly loaded
 •  256 MB block size
 •  FT tests: 8 – 256 MB block sizes to vary load on NN and JT
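
The block size determines how many blocks the NameNode tracks and, with roughly one map task per block, how many tasks the JobTracker schedules. The arithmetic below is a sketch that assumes 1 TB = 10^12 bytes, 1 MB = 2^20 bytes, and an even division of the 16 logical threads among the VMs on a host; the function names are hypothetical.

```python
TB = 10**12  # assumption: the 1 TB data set is 10^12 bytes
MB = 2**20   # assumption: block sizes are given in units of 2^20 bytes


def block_count(data_bytes, block_size_mb):
    """Number of HDFS blocks needed for the data, rounding up."""
    block_bytes = block_size_mb * MB
    return -(-data_bytes // block_bytes)  # integer ceiling division


def slots_per_vm(logical_threads, vms_per_host):
    """One map slot and one reduce slot per vCPU, as configured in the tests."""
    vcpus = logical_threads // vms_per_host
    return {"vcpus": vcpus, "map_slots": vcpus, "reduce_slots": vcpus}


if __name__ == "__main__":
    for size_mb in (256, 64, 16, 8):
        print(f"{size_mb:>3} MB blocks -> {block_count(TB, size_mb)} blocks")
    for vms in (1, 2, 4):
        print(f"{vms} VM(s) per host -> {slots_per_vm(16, vms)}")
```

Going from 256 MB to 8 MB blocks multiplies the block count by about 32, which is why the FT tests use small blocks to raise the load on the NameNode and JobTracker without adding hosts.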

24
Native versus Virtual Platforms, 24 hosts, 12 disks/host
[Bar chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate. Series: Native, 1 VM, 2 VMs, 4 VMs. Y-axis: 0 to 450.]

25
Local vs Various SAN Storage Configurations
[Bar chart: elapsed time as a ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate. Test bed: 16x HP DL380 G7, EMC VNX 7500, 96 physical disks. Series: Local disks; SAN JBOD; SAN RAID-0, 16 KB page size; SAN RAID-0; SAN RAID-5. Y-axis: 0 to 4.5.]

26
Performance Effect of FT for Master Daemons

!  NameNode and JobTracker placed in separate uniprocessor (UP) VMs
!  Small overhead: enabling FT causes a 2-4% slowdown for TeraSort
!  The 8 MB case places a similar load on the NN and JT as >200 hosts with 256 MB blocks

[Chart: TeraSort elapsed time as a ratio to FT off, for HDFS block sizes of 256, 64, 16, and 8 MB. Y-axis: 1 to 1.04.]


27
Different Clouds for Different Folks

!  Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
!  Google/MapR: SaaS on Google Compute Engine
!  vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host,
  CDH3u4
!  Vastly different cluster sizes
  •  Compare throughput (MB sorted per second) normalized with resources
!  Cost: rental or estimate of running continuously for 3 years

                                Cores   Disks   TeraSort (s)   MB/s/core   MB/s/disk   Cost

 Yahoo!                         11680    5840        62            1.3         2.6      ~$7

 Google/MapR                     5024    1256        80            2.4         9.5      $16

 vSphere 5.1 (8 disks/host)       192     192       442           11.2        11.2      ~$2

 vSphere 5.1 (12 disks/host)      192     288       359           13.8         9.2      ~$2
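
The normalized throughput columns can be recomputed from the core counts, disk counts, and TeraSort times in the table. The sketch assumes the sorted data set is 10^12 bytes and that a megabyte is 2^20 bytes; under that assumption the results match the table to one decimal place. The function and variable names are hypothetical.

```python
# Assumption: 1 TB sorted = 10^12 bytes, expressed in MB of 2^20 bytes.
DATA_MB = 10**12 / 2**20

# (cores, disks, TeraSort elapsed seconds) taken from the table.
RESULTS = {
    "Yahoo!": (11680, 5840, 62),
    "Google/MapR": (5024, 1256, 80),
    "vSphere 5.1 (8 disks/host)": (192, 192, 442),
    "vSphere 5.1 (12 disks/host)": (192, 288, 359),
}


def normalized(cores, disks, seconds, data_mb=DATA_MB):
    """Return (MB/s per core, MB/s per disk), rounded to one decimal place."""
    throughput = data_mb / seconds  # MB sorted per second for the whole cluster
    return round(throughput / cores, 1), round(throughput / disks, 1)


if __name__ == "__main__":
    for name, (cores, disks, seconds) in RESULTS.items():
        per_core, per_disk = normalized(cores, disks, seconds)
        print(f"{name:<28} {per_core:>5} MB/s/core {per_disk:>5} MB/s/disk")
```

Dividing by cores and by disks is what makes a 24-host cluster comparable with one of more than a thousand hosts: the larger clusters finish sooner, but each core and disk sorts less data per second.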

28
VMware-Hortonworks Joint Engineering

!  Hortonworks goal
 •  Expand Hadoop ecosystem
 •  Provide first class support of various platforms
 •  Hadoop should run well on VMs
 •  VMs offer several advantages as presented earlier
 •  Take advantage of vSphere for HA
!  First class support for VMs
 •  Topology plugins (HADOOP-8468)
    •  2 VMs can be on same host
      •  Pick closer data
      •  Schedule tasks closer
      •  Don’t put two replicas on same host
 •  MR-tmp on HDFS using block pools
    •  Elastic Compute-VMs will not need local disk
 •  Fast communications within VMs
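
The topology points above can be sketched in a few lines. This is an illustration of the idea only, not the HADOOP-8468 implementation: the function names, the VM-to-host mapping, and the three locality ranks are all hypothetical.

```python
def place_replicas(nodes, physical_host, replicas=3):
    """Pick nodes for replicas so that no two share a physical host.

    nodes: candidate VM names in preference order
    physical_host: dict mapping VM name -> physical host name
    """
    chosen, used_hosts = [], set()
    for node in nodes:
        host = physical_host[node]
        if host in used_hosts:
            continue  # another replica already lives on this physical host
        chosen.append(node)
        used_hosts.add(host)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough distinct physical hosts for the replicas")


def locality(reader, node, physical_host):
    """Illustrative rank: 0 = same VM, 1 = same physical host, 2 = elsewhere."""
    if reader == node:
        return 0
    if physical_host[reader] == physical_host[node]:
        return 1
    return 2


def closest_replica(reader, replica_nodes, physical_host):
    """Pick the replica nearest to the reading VM."""
    return min(replica_nodes, key=lambda n: locality(reader, n, physical_host))


if __name__ == "__main__":
    hosts = {"vm1": "hostA", "vm2": "hostA", "vm3": "hostB", "vm4": "hostC"}
    print(place_replicas(["vm1", "vm2", "vm3", "vm4"], hosts))
    print(closest_replica("vm2", ["vm1", "vm3"], hosts))
```

Here `vm1` and `vm2` share `hostA`, so placement skips `vm2`, and a reader on `vm2` prefers the replica on `vm1` because it sits on the same physical host.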
30
Hadoop Full-Stack High Availability


[Figure: jobs run on the slave nodes of the Hadoop cluster, and apps run outside the cluster. The master daemons (NN, JT) run on servers in an HA cluster with N+K failover. During a failover, the JT goes into safemode.]

31
HA is in HDP 1.0
     Using Total System Availability Architecture




32
HA in Hadoop 1 with HDP1

!  Full Stack High Availability
  •  Namenode
     •  Clients pause automatically
     •  JobTracker pauses automatically
  •  Other Hadoop master services (JT, …) coming


!  Use an industry-proven HA framework
  •  VMware vSphere HA
     •  Failover, fencing, …
     •  Corner cases are tricky; if not addressed, they lead to corruption
  •  Additional benefits:
     •  N-N & N+K failover
     •  Migration for maintenance




33
Hadoop NN/JT HA with vSphere




34
Namenode Failover Times

!  60 Nodes, 60K files, 6 million blocks, 300 TB raw storage – 1-3.5
 minutes
 •  Failure detection and Failover – 0.5 to 2 minutes
 •  Namenode Startup (exit safemode) – 30 sec
!  180 Nodes, 200K files, 18 million blocks, 900TB raw storage – 2-4.5
 minutes
 •  Failure detection and Failover – 0.5 to 2 minutes
 •  Namenode Startup (exit safemode) – 110 sec


 For vSphere - OS bootup is needed – 10-20 seconds is included above.


 Cold Failover is good enough for small/medium clusters
     Failure Detection and Automatic Failover Dominates




35
Summary

!  Advantages of Hadoop on VMs
 •  Cluster Management
 •  Cluster consolidation
 •  Greater Elasticity in mixed environment
 •  An alternative to the capacity scheduler's multi-tenancy offerings
!  HA for Hadoop Master Daemons
 •  vSphere based HA for NN, JT, … in Hadoop 1
 •  Total System Availability Architecture




36
Elastic Scaling and Multi-tenancy of Hadoop on vSphere



1. Hadoop in VM: current Hadoop, with combined storage and compute in each VM
  •  Single tenant
  •  Fixed resources
2. Separate Compute and Data: a compute VM and a storage VM
  •  Single tenant
  •  Elastic compute
3. Multiple Clusters: tenant compute VMs (T1, T2) and a storage VM
  •  Multiple tenants
  •  Elastic compute

     38
Separated Compute and Data

[Figure: a virtualization host runs other workloads alongside virtual Hadoop nodes. The compute nodes each run a Task Tracker with task slots, and a separate virtual Hadoop node runs the Datanode. The nodes are backed by VMDKs.]




Truly Elastic Hadoop:
Scalable through virtual nodes


  39
References

www.projectserengeti.org
www.hortonworks.com
www.cloudera.com


Fault Tolerance performance whitepaper:
www.vmware.com/resources/techresources/10301


MapR/Google blog: www.mapr.com/blog/google-mapr




40

Más contenido relacionado

La actualidad más candente

Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera, Inc.
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jkEdureka!
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop AdministrationEdureka!
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java DevelopersRichard McDougall
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Kathleen Ting
 
Administer Hadoop Cluster
Administer Hadoop ClusterAdminister Hadoop Cluster
Administer Hadoop ClusterEdureka!
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionBenoit Perroud
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaCloudera, Inc.
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedInDataWorks Summit
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14jijukjoseph
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messagesyarapavan
 
Охота на уязвимости Hadoop
Охота на уязвимости HadoopОхота на уязвимости Hadoop
Охота на уязвимости HadoopPositive Hack Days
 
Hadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at FacebookHadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at FacebookDataWorks Summit
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsTrendProgContest13
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Richard McDougall
 

La actualidad más candente (20)

Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With HadoopCloudera Sessions - Clinic 1 - Getting Started With Hadoop
Cloudera Sessions - Clinic 1 - Getting Started With Hadoop
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
 
Learn Hadoop Administration
Learn Hadoop AdministrationLearn Hadoop Administration
Learn Hadoop Administration
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java Developers
 
Hadoop administration
Hadoop administrationHadoop administration
Hadoop administration
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
Administer Hadoop Cluster
Administer Hadoop ClusterAdminister Hadoop Cluster
Administer Hadoop Cluster
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment Evolution
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14Hadoop single node installation on ubuntu 14
Hadoop single node installation on ubuntu 14
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Storage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook MessagesStorage Infrastructure Behind Facebook Messages
Storage Infrastructure Behind Facebook Messages
 
hadoop_module6
hadoop_module6hadoop_module6
hadoop_module6
 
Охота на уязвимости Hadoop
Охота на уязвимости HadoopОхота на уязвимости Hadoop
Охота на уязвимости Hadoop
 
Hadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at FacebookHadoop Distributed File System Reliability and Durability at Facebook
Hadoop Distributed File System Reliability and Durability at Facebook
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 

App cap2956v2-121001194956-phpapp01 (1)

  • 1. APP-CAP2956 Inside the Hadoop Machine Jeff Buell, VMware, Inc. Richard McDougall, VMware, Inc. Sanjay Radia, Hortonworks #vmworldapps
  • 2. Disclaimer
      • This session may contain product features that are currently under development.
      • This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
      • Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
      • Technical feasibility and market demand will affect final delivery.
      • Pricing and packaging for any new technologies or features discussed or presented have not been determined.
  • 3. Broad Application of Hadoop technology
      Horizontal Use Cases                            Vertical Use Cases
      Log Processing / Click Stream Analytics         Financial Services
      Machine Learning / sophisticated data mining    Internet Retailer
      Web crawling / text processing                  Pharmaceutical / Drug Discovery
      Extract Transform Load (ETL) replacement        Mobile / Telecom
      Image / XML message processing                  Scientific Research
      General archiving / compliance                  Social Media
      Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
  • 4. How does Hadoop enable parallel processing? A framework for distributed processing of large data sets across clusters of computers using a simple programming model. Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
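The "simple programming model" named on slide 4 is MapReduce. As a rough illustration only, the single-process Python sketch below mimics the map, shuffle, and reduce phases for a word count. The function names are made up for this example, and nothing here is Hadoop's own API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster each map call would run on a different host against a different input split; here everything runs in one process so the data flow is easy to follow.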
  • 5. Hadoop System Architecture
      • MapReduce: Programming framework for highly parallel data processing
      • Hadoop Distributed File System (HDFS): Distributed data storage
  • 6. Job Tracker Schedules Tasks Where the Data Resides
      [Diagram: the Job Tracker takes a job whose input file is divided into Split 1, Split 2, and Split 3 (64 MB each) and hands Task 1, Task 2, and Task 3 to the Task Trackers on Host 1, Host 2, and Host 3. The DataNode on each host stores the matching piece of the input file: Block 1, Block 2, and Block 3 (64 MB each).]
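The scheduling idea on slide 6 can be sketched as a small function: prefer a host that already stores the block, and fall back to any free host otherwise. This is a simplified illustration under stated assumptions, not the JobTracker's actual algorithm; `assign_tasks` and its arguments are hypothetical names.

```python
from typing import Dict, List, Optional


def assign_tasks(block_locations: Dict[str, List[str]],
                 free_slots: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Assign each split to a host, preferring hosts that store its block."""
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignment: Dict[str, Optional[str]] = {}
    for split, hosts in block_locations.items():
        # First choice: a host holding the block that still has a free slot.
        local = next((h for h in hosts if slots.get(h, 0) > 0), None)
        # Fallback: any host with a free slot (the data crosses the network).
        chosen = local or next((h for h, n in slots.items() if n > 0), None)
        if chosen is not None:
            slots[chosen] -= 1
        assignment[split] = chosen  # None means no slot was available
    return assignment
```

With one slot per host and one block per host, as in the diagram, every task lands on the host that holds its data.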
  • 8. Hadoop Data Locality and Replication
  • 10. Why Virtualize Hadoop?
      • Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster
      • Highly Available: no more single point of failure; one click to set up; high availability for MR jobs
      • Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data
  • 11. Enterprise Challenges with Using Hadoop
      • Deployment
          • Slow to provision
          • Complex to keep running/tune
      • Single Points of Failure
          • Single point of failure with Name Node and Job Tracker
          • No HA for Hadoop framework components (Hive, HCatalog, etc.)
      • Low Utilization
          • Dedicated clusters to run Hadoop with low CPU utilization
          • No easy way to share resources between Hadoop and non-Hadoop workloads
          • Noisy neighbor, lack of resource containment
      • Need Multi-tenant Isolation, Resource Management, etc.
          • Noisy neighbor: no performance or security isolation between different tenants/users
          • Lack of configuration isolation: can't run multiple versions on the cluster
  • 12. Virtualization enables a Common Infrastructure for Big Data
      [Diagram: separate MPP DB, HBase, and Hadoop clusters consolidated onto a shared virtualization platform.]
      • Cluster Sprawling: single purpose clusters for various business applications lead to cluster sprawl.
      • Cluster Consolidation
          • Simplify: single hardware infrastructure; unified operations
          • Optimize: shared resources = higher utilization; elastic resources = faster on-demand access
  • 13. Deploy a Hadoop Cluster in under 30 Minutes
      • Step 1: Deploy Serengeti virtual appliance on vSphere (deploy vHelperOVF to vSphere).
      • Step 2: A few simple commands to stand up Hadoop Cluster: select compute, memory, storage and network; select configuration template; automate deployment; done.
  • 14. A Tour Through Serengeti
      $ ssh serengeti@serengeti-vm
      $ serengeti
      serengeti>
  • 15. A Tour Through Serengeti
      serengeti> cluster create --name myElephant
      serengeti> cluster list --name myElephant
      name: myElephant, distro: cdh, status:RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hadoop_NameNode, hadoop_jobtracker]  1         2    7500     LOCAL 50
      name: myElephant, distro: cdh, status:RUNNING
      NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
      ---------------------------------------------------------------------------
      master  [hive, hadoop_client, pig]            1         1    3700     LOCAL 50
      NAME                HOST                             IP
      -----------------------------------------------------------------
      myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184
  • 16. A Tour Through Serengeti
      $ ssh rmc@rmc-elephant-009.eng.vmware.com
      $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
      …
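The numeric argument to `teragen` on slide 16 is a row count rather than a byte count. TeraGen writes fixed-size 100-byte rows, so the data volume follows by simple arithmetic, sketched below. The helper names are illustrative only.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_bytes(rows: int) -> int:
    """Bytes of data produced for a given TeraGen row count."""
    return rows * ROW_BYTES


def rows_for_bytes(target_bytes: int) -> int:
    """Row count to pass to TeraGen for a target data size in bytes."""
    return target_bytes // ROW_BYTES


demo_bytes = teragen_bytes(1_000_000_000)   # the command on slide 16
rows_for_1tb = rows_for_bytes(10**12)       # a 1 TB run, decimal terabyte
```

Under these definitions the command on slide 16 generates 100 GB, and a 1 TB data set like the one in the benchmark configuration on slide 24 would need ten times as many rows.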
  • 17. Serengeti Spec File
      Slide callouts: "distro" is the choice of distro, "ha" is the HA option, and "storage" is the choice of shared storage or local disk.
      [
        "distro":"apache",
        {
          "name": "master",
          "roles": [ "hadoop_NameNode", "hadoop_jobtracker" ],
          "instanceNum": 1,
          "instanceType": "MEDIUM",
          "ha": true
        },
        {
          "name": "worker",
          "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
          "instanceNum": 5,
          "instanceType": "SMALL",
          "storage": {
            "type": "LOCAL",
            "sizeGB": 10
          }
        }
      ]
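To make the spec on slide 17 concrete, the sketch below loads just the two node groups as JSON and totals what they ask for. It is an illustration, not Serengeti code: `summarize` is a hypothetical helper, the top-level "distro" entry is left out because the slide shows it in abbreviated form, and treating "sizeGB" as a per-instance value is an assumption.

```python
import json
from typing import Any, Dict, List

# Node groups copied from the slide; only the list of groups is modelled here.
NODE_GROUPS_JSON = """
[
  {"name": "master",
   "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
   "instanceNum": 1, "instanceType": "MEDIUM", "ha": true},
  {"name": "worker",
   "roles": ["hadoop_datanode", "hadoop_tasktracker"],
   "instanceNum": 5, "instanceType": "SMALL",
   "storage": {"type": "LOCAL", "sizeGB": 10}}
]
"""


def summarize(groups: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Total the VMs, local storage, and HA-protected groups in a spec."""
    total_vms = sum(g["instanceNum"] for g in groups)
    # Assumption: sizeGB applies to each instance in the group.
    local_gb = sum(
        g["instanceNum"] * g["storage"]["sizeGB"]
        for g in groups
        if g.get("storage", {}).get("type") == "LOCAL"
    )
    ha_groups = [g["name"] for g in groups if g.get("ha")]
    return {"total_vms": total_vms, "local_storage_gb": local_gb, "ha_groups": ha_groups}


groups = json.loads(NODE_GROUPS_JSON)
summary = summarize(groups)
```

For the slide's values this yields six VMs in total, with only the master group marked for HA.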
  • 18. Configuring Distros
      {
        "name" : "cdh",
        "version" : "3u3",
        "packages" : [
          {
            "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
            "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
          },
          {
            "roles" : ["hive"],
            "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
          },
          {
            "roles" : ["pig"],
            "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
          }
        ]
      },
  • 19. Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
      [Logo groups: Commercial Vendors, Community Projects]
      • Support major distributions and multiple projects
      • Contribute Hadoop Virtualization Extension (HVE) to the open source community
  • 20. Use Local Disk where it’s Needed
                       SAN Storage      NAS Filers       Local Storage
      Cost             $2 - $10/GB      $1 - $5/GB       $0.05/GB
      $1M gets:
        Capacity       0.5 Petabytes    1 Petabyte       10 Petabytes
        IOPS           200,000          200,000          400,000
        Bandwidth      8 Gbyte/sec      10 Gbyte/sec     250 Gbytes/sec
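The capacity figures on slide 20 can be checked with a one-line division. The sketch below divides the $1M budget by the low end of each price range, using decimal units (1 PB = 1,000,000 GB); the names are illustrative and the choice of the low-end price is an assumption.

```python
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 10^6 GB

# Low end of each price range on the slide, in USD per GB.
LOW_PRICE_PER_GB = {"SAN": 2.0, "NAS": 1.0, "Local": 0.05}


def capacity_pb(budget_usd: float, usd_per_gb: float) -> float:
    """Raw capacity in petabytes that a budget buys at a given price per GB."""
    return budget_usd / usd_per_gb / GB_PER_PB


capacities = {tier: capacity_pb(BUDGET_USD, price)
              for tier, price in LOW_PRICE_PER_GB.items()}
```

This gives 0.5 PB for SAN and 1 PB for NAS, matching the slide. For local disk the same division gives about 20 PB, against the slide's 10 PB; the slide does not say what accounts for the difference.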
  • 21. Extend Virtual Storage Architecture to Include Local Disk
      • Shared Storage: SAN or NAS
          • Easy to provision
          • Automated cluster rebalancing
      • Hybrid Storage
          • SAN for boot images, VMs, other workloads
          • Local disk for Hadoop & HDFS
          • Scalable bandwidth, lower cost/GB
      [Diagram: hosts running Hadoop VMs alongside other VMs.]
  • 22. Hadoop has Significant Ephemeral Data
      [Diagram of a job's data flow. Stage labels: Job, Map Task, Map, Sort, Spills, Shuffle, Combine, Reduce, Output. File labels: file.out, Map_*.out, spill*.out, Intermediate.out. Data labels: DFS Input Data, DFS Output Data, Spills & Logs, HDFS. Bandwidth annotations: 75% of disk bandwidth, 12% of bandwidth, 12% of bandwidth.]
  • 23. Virtualized Hadoop Performance
      • Issues of interest
          • Native vs various virtual configurations
          • Local disks vs Fibre Channel SAN
          • Effect of protecting Hadoop master daemons with Fault Tolerance
          • Public cloud (renting) vs private cloud (buying)
      [Test bed diagram: Arista 7124SX 10 GbE switch; 24x HP DL380 G7, each with 2x X5687, 72 GB, 16x SAS 146 GB, Broadcom 10 GbE adapter, and Qlogic 8 Gb/s HBA; EMC VNX7500.]
  • 24. Configuration
      • Software
          • vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
          • RHEL 6.1 x86_64
          • Cloudera CDH3u4
          • Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
      • Hadoop VMs
          • Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host
          • Separate VMs for NameNode and JobTracker for storage and FT tests
      • Hadoop configuration
          • One map and one reduce task per vCPU (= logical thread)
          • Machines are highly loaded
          • 256 MB block size
          • FT tests: 8 to 256 MB block sizes to vary load on NN and JT
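The configuration on slide 24 implies some simple arithmetic, sketched below: how the host's threads and disks divide among 1, 2, or 4 VMs, and roughly how many HDFS blocks a 1 TB data set occupies at each block size. The helper names are hypothetical, memory is left out because the slide does not say how much of the 72 GB each VM receives, and taking 1 TB as 10^12 bytes and 1 MB as 2^20 bytes is an assumption.

```python
HOST_THREADS = 16   # logical threads per host (from the slide)
HOST_DISKS = 12     # disks per host used by Hadoop VMs (from the slide)
TB_BYTES = 10**12   # assumption: decimal terabyte
MIB = 2**20         # assumption: block sizes are in units of 2^20 bytes


def per_vm(vms_per_host: int) -> dict:
    """Split a host evenly; one map and one reduce slot per vCPU."""
    vcpus = HOST_THREADS // vms_per_host
    return {
        "vcpus": vcpus,
        "disks": HOST_DISKS // vms_per_host,
        "map_slots": vcpus,
        "reduce_slots": vcpus,
    }


def block_count(data_bytes: int, block_mb: int) -> int:
    """Number of blocks needed for the data, rounding up (integer ceiling)."""
    block_bytes = block_mb * MIB
    return -(-data_bytes // block_bytes)


layouts = {n: per_vm(n) for n in (1, 2, 4)}
blocks = {mb: block_count(TB_BYTES, mb) for mb in (256, 64, 16, 8)}
```

Going from 256 MB to 8 MB blocks multiplies the block count by about 32, which is how the FT tests increase the load on the NameNode and JobTracker without adding hosts. The counts treat the data as one file, so a real multi-file data set would differ slightly.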
  • 25. Native versus Virtual Platforms, 24 hosts, 12 disks/host
      [Bar chart: elapsed time in seconds (lower is better), axis from 0 to 450, for TeraGen, TeraSort, and TeraValidate. Series: Native, 1 VM, 2 VMs, 4 VMs.]
  • 26. Local vs Various SAN Storage Configurations
      16 x HP DL380G7, EMC VNX 7500, 96 physical disks
      [Bar chart: elapsed time ratio to local disks (lower is better), axis from 0 to 4.5, for TeraGen, TeraSort, and TeraValidate. Series: Local disks; SAN JBOD; SAN RAID-0, 16 KB page size; SAN RAID-0; SAN RAID-5.]
  • 27. Performance Effect of FT for Master Daemons
      • NameNode and JobTracker placed in separate UP VMs
      • Small overhead: enabling FT causes 2-4% slowdown for TeraSort
      • 8 MB case places similar load on NN & JT as >200 hosts with 256 MB
      [Chart: TeraSort elapsed time ratio to FT off, axis from 1 to 1.04, against HDFS block size in MB (256, 64, 16, 8).]
  • 28. Different Clouds for Different Folks
      • Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
      • Google/MapR: SaaS on Google Compute Engine
      • vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4
      • Vastly different cluster sizes
          • Compare throughput (MB sorted per second) normalized with resources
      • Cost: rental or estimate of running continuously for 3 years
                     #cores   #disks   TeraSort, s   MB/s/core   MB/s/disk   cost
      Yahoo!         11680    5840     62            1.3         2.6         ~$7
      Google/MapR    5024     1256     80            2.4         9.5         $16
      vSphere 5.1    192      192      442           11.2        11.2        ~$2
      vSphere 5.1    192      288      359           13.8        9.2         ~$2
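The normalized throughput columns on slide 28 can be reproduced from the core counts, disk counts, and sort times. The sketch below assumes each run sorted 1 TB taken as 10^12 bytes and that "MB" means 2^20 bytes; those two assumptions are not stated on the slide, but they reproduce its figures to one decimal place. The function name is illustrative.

```python
TB_BYTES = 10**12   # assumption: each run sorts a decimal terabyte
MIB = 2**20         # assumption: the slide's "MB" is 2^20 bytes

# (label, cores, disks, TeraSort seconds) taken from the slide's table.
RUNS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1, 8 disks/host", 192, 192, 442),
    ("vSphere 5.1, 12 disks/host", 192, 288, 359),
]


def normalized(cores: int, disks: int, seconds: float) -> tuple:
    """Return (MB/s per core, MB/s per disk), rounded to one decimal."""
    mb_per_s = TB_BYTES / MIB / seconds
    return round(mb_per_s / cores, 1), round(mb_per_s / disks, 1)


results = {label: normalized(c, d, s) for label, c, d, s in RUNS}
```

Normalizing this way is what lets a 24-host cluster be compared with clusters of more than a thousand hosts: the absolute sort time is far longer, but the work done per core and per disk is higher.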
  • 29. Why Virtualize Hadoop?
      • Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster
      • Highly Available: no more single point of failure; one click to set up; high availability for MR jobs
      • Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data
  • 30. VMware-Hortonworks Joint Engineering
      • Hortonworks goal
          • Expand Hadoop ecosystem
          • Provide first class support of various platforms
          • Hadoop should run well on VMs
          • VMs offer several advantages as presented earlier
          • Take advantage of vSphere for HA
      • First class support for VMs
          • Topology plugins (Hadoop-8468)
          • 2 VMs can be on same host
          • Pick closer data
          • Schedule tasks closer
          • Don’t put two replicas on same host
          • MR-tmp on HDFS using block pools
          • Elastic Compute-VMs will not need local disk
          • Fast communications within VMs
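The rule "Don’t put two replicas on same host" on slide 30 matters because two VMs can share one physical host, so placing replicas on distinct VMs is not enough. The sketch below illustrates the rule only; it is not the topology plugin's implementation, and `place_replicas` with its VM-to-host mapping is a hypothetical interface.

```python
from typing import Dict, List


def place_replicas(vm_to_host: Dict[str, str],
                   candidates: List[str],
                   replicas: int = 3) -> List[str]:
    """Pick VMs for block replicas so that no two share a physical host."""
    chosen: List[str] = []
    used_hosts = set()
    for vm in candidates:
        if len(chosen) >= replicas:
            break
        host = vm_to_host[vm]
        # Skip a VM whose physical host already holds a replica: a single
        # host failure would otherwise take out two copies at once.
        if host not in used_hosts:
            chosen.append(vm)
            used_hosts.add(host)
    return chosen  # may be shorter than `replicas` if hosts run out
```

Without the host check, `vm1` and `vm2` on the same host would both be eligible, and the block would effectively have one fewer independent copy than intended.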
  • 31. Hadoop Full-Stack High Availability
      [Diagram: slave nodes of the Hadoop cluster running jobs; apps running outside; on failover the JT goes into safemode; NN and JT run on an HA cluster of servers for the master daemons with N+K failover.]
  • 32. HA is in HDP 1.0 Using Total System Availability Architecture
  • 33. HA in Hadoop 1 with HDP1
      • Full Stack High Availability
          • Namenode
          • Clients pause automatically
          • JobTracker pauses automatically
          • Other Hadoop master services (JT, …) coming
      • Use industry proven HA framework
          • VMware vSphere-HA
          • Failover, fencing, …
          • Corner cases are tricky; if not addressed, they lead to corruption
          • Additional benefits:
              • N-N & N+K failover
              • Migration for maintenance
  • 34. Hadoop NN/JT HA with vSphere
  • 35. Namenode Failover Times
      • 60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
          • Failure detection and failover: 0.5 to 2 minutes
          • Namenode startup (exit safemode): 30 sec
      • 180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
          • Failure detection and failover: 0.5 to 2 minutes
          • Namenode startup (exit safemode): 110 sec
      For vSphere, OS bootup is needed; its 10-20 seconds are included above.
      Cold failover is good enough for small/medium clusters.
      Failure detection and automatic failover dominate.
  • 36. Summary
      • Advantages of Hadoop on VMs
          • Cluster management
          • Cluster consolidation
          • Greater elasticity in mixed environment
          • Alternate multi-tenancy to capacity scheduler’s offerings
      • HA for Hadoop Master Daemons
          • vSphere based HA for NN, JT, … in Hadoop 1
          • Total System Availability Architecture
  • 37. Why Virtualize Hadoop?
      • Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster
      • Highly Available: no more single point of failure; one click to set up; high availability for MR jobs
      • Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data
  • 38. Elastic Scaling and Multi-tenancy of Hadoop on vSphere
      [Diagram: current Hadoop as a combined storage/compute VM, then compute VMs (tenants T1 and T2) separated from storage VMs.]
      • 1. Hadoop in VM: single tenant; fixed resources
      • 2. Separate Compute and Data: single tenant; elastic compute
      • 3. Multiple Clusters: multiple tenants; elastic compute
  • 39. Separated Compute and Data
      [Diagram: a virtualization host running several virtual Hadoop nodes, each a Task Tracker with task slots, alongside another workload and a virtual Hadoop node running the Datanode on VMDKs.]
      Truly Elastic Hadoop: scalable through virtual nodes
  • 40. References
      • www.projectserengeti.org
      • www.hortonworks.com
      • www.cloudera.com
      • Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
      • MapR/Google blog: www.mapr.com/blog/google-mapr