Hadoop as a Service
Hadoop on Virtualization
Hadoop World, December 2011

Jun Ping Du
Richard McDougall
VMware, Inc.




                              © 2009 VMware Inc. All rights reserved
Cloud: Big Shifts in Simplification and Optimization


1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business




 2
Infrastructure, Apps and now Data…




[Diagram: Build, Run, and Manage across Private and Public clouds]

Simplify Infrastructure With Cloud
Simplify App Platform Through PaaS
Next Trend: Simplify Data




 3
Trend 1/3: New Data Growing at 60% Y/Y

Exabytes of information stored: 20 Zetta by 2015, 1 Yotta by 2030

Yes, you are part of the yotta generation…

[Chart: data sources driving growth: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies]

Source: The Information Explosion, 2009


4
Trend 2/3: Big Data – Driven by Real-World Benefit




5
Trend 3/3: Value from Data Exceeds Hardware Cost

!  Value from the intelligence of data analytics now outstrips the cost
    of hardware
    •  Hadoop enables the use of lower cost hardware
    •  Hardware cost halving every 18 months

[Chart: value vs. cost; Big Iron: $40k/CPU, Commodity Cluster: $1k/CPU]




6
Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware

!  Trend is "not just Hadoop" for big data
    •  Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
    •  Unify the infrastructure platform for all
!  Common Hardware Base
    •  Eliminate the hardware/driver/testing phase
    •  Use existing team for ordering, diagnosis, capacity management of hardware farm

[Diagram: separate SQL, NoSQL, Hadoop, and DSS clusters consolidated into a Unified Big Data Infrastructure (Big SQL, NoSQL, Hadoop) spanning Private and Public clouds]

 7
Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning

I WANT MY HADOOP CLUSTER NOW!

!  Instant Cluster Provisioning
    •  Provision Hadoop clusters instantly
    •  Automatable using provisioning engines/scripts, e.g. Whirr




  8
Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities

!  Increase Utilization
    •  Hadoop cluster only uses resources it needs
    •  Extra resources can be used by other applications when not in use
!  Eliminate single points of failure
    •  Use vSphere HA for NameNode and JobTracker
!  Use VM Isolation
    •  Create separate clusters with defensible security
    •  Enables multiple-versions of Hadoop on the same infrastructure
    •  Extends to Hadoop and Linux Environments
!  Leverage Resource Management
    •  Control/assign resources through resource pools
    •  E.g. Use spare cycles for Hadoop Processing through priority control




9
What? Hadoop in a VM? Really?




        Actually, Hadoop performs well in a virtual machine




10
Performance Test: Cluster Configuration



                Mellanox 10 GbE switch



     AMAX ClusterMax
     2X X5650, 96 GB
     12X SATA 500 GB
     Mellanox 10 GbE adapter




11
Cluster Configuration
!  Hardware
 •  AMAX ClusterMax, 7 nodes
 •  2X X5650 2.67 GHz hex-core, 96 GB memory
 •  12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
 •  Mellanox ConnectX VPI (MT26418), 10 GbE
 •  Mellanox Vantage 6048, 10 GbE
!  OS/Hypervisor
 •  RHEL 6.1 x86_64 (native and guest)
 •  ESX 5.0 RTM with devel Mellanox driver
!  VMs (HT off/on)
 •  1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
 •  2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
 •  4 VMs (HT on only):
    •  2 small: 18400 MB, 5 vCPUs, 2 disks
    •  2 large: 27600 MB, 7 vCPUs, 3 disks
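
As a quick consistency check, the three VM layouts above can be read as different ways of dividing the same per-host resources. The sketch below sums them for the HT-on case. The dictionary layout and the `totals` helper are illustrative names; only the numbers come from the slide.

```python
# Per-host VM layouts from the slide (HT on): (memory MB, vCPUs, data disks) per VM.
layouts = {
    "1 VM": [(92000, 24, 10)],
    "2 VMs": [(46000, 12, 5)] * 2,
    "4 VMs": [(18400, 5, 2)] * 2 + [(27600, 7, 3)] * 2,
}

def totals(vms):
    """Sum memory, vCPUs, and disks over all VMs on one host."""
    return tuple(sum(column) for column in zip(*vms))

for name, vms in layouts.items():
    print(name, totals(vms))
```

Each layout should add up to the same memory, vCPU, and disk totals, so the comparison between 1, 2, and 4 VMs holds the host resources constant.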
12
Hadoop Configuration
Distribution
 •  Cloudera CDH3u0
 •  Based on Apache open-source 0.20.2
Parameters
 •  dfs.datanode.max.xcievers=4096
 •  dfs.replication=2
 •  dfs.block.size=134217728
 •  io.file.buffer.size=131072
 •  mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
 •  mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
!  Network topology
 •  Hadoop uses info for reliability and performance
 •  Multiple VMs/host: Each host is a “rack”
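
The parameters listed above are normally set as name/value properties in Hadoop's `*-site.xml` files. The sketch below renders them in that `<configuration>/<property>` layout. It is a simplification under stated assumptions: all properties are lumped into one document rather than split across the HDFS, core, and MapReduce site files, the JVM options shown are the virtual-machine variant, and `to_site_xml` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Parameter values as listed on the slide (virtual-machine variant of the JVM options).
params = {
    "dfs.datanode.max.xcievers": "4096",
    "dfs.replication": "2",
    "dfs.block.size": str(128 * 1024 * 1024),   # 134217728 bytes = 128 MiB
    "io.file.buffer.size": str(128 * 1024),     # 131072 bytes = 128 KiB
    "mapred.child.java.opts": "-Xmx1900m -Xmn512m",
}

def to_site_xml(properties):
    """Render name/value pairs in the <configuration><property> layout of Hadoop site files."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

print(to_site_xml(params))
```

Writing the sizes as products makes the units visible: the block size is 128 MiB and the I/O buffer is 128 KiB.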



13
Benchmarks
!  Derived from test apps included in distro
!  Pi
  •  Direct-exec Monte-Carlo estimation of pi
  •  # map tasks = # logical processors
  •  1.68 T samples
  •  π ~ 4*R/(R+G) = 22/7
!  TestDFSIO
  •  Streaming write and read
  •  1 TB
  •  More tasks than processors
!  Terasort
  •  3 phases: teragen, terasort, teravalidate
  •  10B or 35B records, each 100 Bytes (1 TB, 3.5 TB)
  •  More tasks than processors
  •  CPU, networking, and storage I/O
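
The Pi benchmark's estimate can be illustrated with a single-process sketch. It assumes that R and G in the slide's formula denote the counts of sample points falling inside and outside the quarter circle. This is a plain Monte Carlo illustration, not the MapReduce job shipped with the distribution, and `estimate_pi` is an illustrative name.

```python
import random

def estimate_pi(samples, seed=0):
    """Estimate pi from the fraction of random points in the unit square inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0                      # R: points inside the circle (assumed meaning)
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside      # G: points outside the circle (assumed meaning)
    return 4.0 * inside / (inside + outside)

print(estimate_pi(200_000))
```

Because each sample is independent, the work splits cleanly across map tasks, which is why the benchmark is a pure CPU test with almost no storage or network I/O.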

14
Performance of Hadoop for Several Workloads

[Chart: ratio of time taken relative to native, lower is better; y-axis "Ratio to Native" from 0 to 1.2; series: 1 VM, 2 VMs]




15
Architecting Hadoop as a Service using Virtualization

!  Goals
 •  Make it fast and easy to provision new Hadoop Clusters on Demand
 •  Leverage virtual machines to provide isolation (esp. for Multi-tenant)
 •  Optimize Hadoop’s performance based on virtual topologies
 •  Make the system reliable based on virtual topologies
!  Leveraging Virtualization
 •  Elastic scale in/out
 •  Use high-availability to protect NameNode/JobTracker
 •  Resource controls and sharing: re-use underutilized memory, cpu
 •  Prioritize Workloads: limit or guarantee resource usage in a mixed environment




16
Provisioning

!  Leverage the vSphere APIs to auto-deploy a cluster
 •  Whirr, HOD, or custom tooling using Ruby, Chef, etc.
!  Use linked-clones to rapidly fork many nodes




17
Fast Provisioning

!  From a "seed" node to a cluster
 •  Thin Provisioning: 60 GB => 3.5 GB
 •  Linked Clone: ~6 seconds

18
SAN, NAS or Local Disk?

!  Shared Storage: SAN or NAS
 •  Easy to provision
 •  Automated cluster rebalancing
!  Hybrid Storage
 •  SAN for boot images, VMs, other workloads
 •  Local disk for HDFS
 •  Scalable Bandwidth, Lower Cost/GB

[Diagram: six hosts running a mix of Hadoop VMs and other VMs]




     19
Enable Automatic Rack awareness through vSphere


!  Important for a robust Hadoop cluster


!  Automatic network topology detection, an important
 vSphere feature


!  Rack script is generated
 automatically
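
A rack script of the kind mentioned above could look like the following sketch. Hadoop invokes the configured topology script with one or more node addresses as arguments and reads one rack path per argument from standard output. The address-to-host table here is hypothetical and hard-coded for illustration; in the approach described in the talk it would be generated from vSphere. Treating each physical host as its own "rack" follows the configuration slide.

```python
#!/usr/bin/env python
import sys

# Hypothetical VM-address -> physical-host table, hard-coded for illustration only.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
    "10.0.0.22": "esx-host-2",
}
DEFAULT_RACK = "/default-rack"

def resolve(address):
    """Return a rack path in which each physical host is treated as its own rack."""
    host = VM_TO_HOST.get(address)
    return "/" + host if host else DEFAULT_RACK

if __name__ == "__main__":
    # Hadoop passes one or more addresses and expects one rack path per argument.
    print(" ".join(resolve(arg) for arg in sys.argv[1:]))
```

With this mapping, two VMs on the same physical host report the same rack, so HDFS will not count them as independent failure domains.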




20
Multi-tenant: share cluster or not

!  Shared big cluster
 •  High performance
 •  Large scale
 •  Pre-job provisioning
!  Isolated small clusters
 •  Secure
 •  Flexible
 •  Post-job provisioning
!  Combination, as customers' requirements differ

 21
Elastic Hadoop Cluster

!  Traditional Hadoop cluster
     •  Easy to scale out
       •  Fast-provision new Hadoop nodes and join them to the existing cluster
     •  Hard to scale in
 while (ClusterIsTooLarge) {
      choose node k;
      kill(node k);
      wait(k's data blocks are recovered);
      if necessary, hadoop.rebalance();
 }
!  Elastic Hadoop cluster

[Diagram: NameNode (NN) and JobTracker (JT) with worker nodes; legend: normal node, elastic node, TaskTracker, DataNode]
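
The scale-in loop for a traditional cluster can be sketched as a toy simulation. Everything here is an assumption for illustration: the cluster is modeled as a dictionary from node name to the set of block IDs it holds, the victim is the node with the fewest blocks, and copying each lost block to the least-loaded remaining node stands in for both recovery and rebalancing. It is not how HDFS itself is implemented.

```python
def place_blocks(nodes, num_blocks, replication):
    """Toy placement: put each block on `replication` consecutive nodes."""
    cluster = {node: set() for node in nodes}
    for block in range(num_blocks):
        for r in range(replication):
            cluster[nodes[(block + r) % len(nodes)]].add(block)
    return cluster

def scale_in(cluster, target_size, replication):
    """Remove nodes one at a time, restoring the replication factor after each removal."""
    while len(cluster) > target_size:                           # while (ClusterIsTooLarge)
        victim = min(cluster, key=lambda n: len(cluster[n]))   # choose node k (fewest blocks)
        lost_blocks = cluster.pop(victim)                       # kill(node k)
        for block in sorted(lost_blocks):                       # recover k's data blocks
            holders = sum(block in blocks for blocks in cluster.values())
            candidates = [n for n in cluster if block not in cluster[n]]
            if holders < replication and candidates:
                # Copy to the least-loaded node, which also stands in for rebalancing.
                target = min(candidates, key=lambda n: len(cluster[n]))
                cluster[target].add(block)
    return cluster

cluster = place_blocks(["n1", "n2", "n3", "n4", "n5"], num_blocks=20, replication=2)
scale_in(cluster, target_size=3, replication=2)
print({node: len(blocks) for node, blocks in cluster.items()})
```

The simulation shows why scaling in is slow: every removed node forces its blocks to be copied elsewhere before the next node can go. A node that runs only a TaskTracker holds no blocks, so removing it skips that step.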

22
Replica Placement

!  Second Replica
  •  Different rack
  •  Rack-awareness required


!  Third Replica
  •  Same rack, different physical host
  •  Nodes share host (in virtualized
     environment)
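
The placement rules above can be sketched as a small selection function. It assumes the first replica stays on the writer's node and that "same rack" for the third replica means the rack of the second replica, as in the default HDFS policy. The `Node` type and `place_replicas` are illustrative names, not Hadoop APIs.

```python
from typing import List, NamedTuple

class Node(NamedTuple):
    name: str   # virtual machine running a DataNode
    host: str   # physical host the VM runs on
    rack: str   # rack containing that host

def place_replicas(writer: Node, nodes: List[Node]) -> List[Node]:
    """Pick three replica targets; raises StopIteration if the topology cannot satisfy the rules."""
    first = writer
    # Second replica: any node on a different rack (needs rack awareness).
    second = next(n for n in nodes if n.rack != first.rack)
    # Third replica: same rack as the second, but on a different physical host,
    # because several VMs may share one host in a virtualized environment.
    third = next(n for n in nodes if n.rack == second.rack and n.host != second.host)
    return [first, second, third]

nodes = [
    Node("vm1", "hostA", "rack1"),
    Node("vm2", "hostA", "rack1"),
    Node("vm3", "hostB", "rack2"),
    Node("vm4", "hostB", "rack2"),
    Node("vm5", "hostC", "rack2"),
]
print([n.name for n in place_replicas(nodes[0], nodes)])
```

Without the host check, the third replica could land on `vm4`, which shares `hostB` with the second replica, and a single host failure would then take out two copies.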




23
Demo




24
Performance

!  Create more, smaller VMs
  •  Makes Hadoop scale better
  •  Allows for easier/faster adjustment of packing of VMs across hosts by vSphere
     (including through DRS)

!  Sizing/Configuration of storage is critical
  •  Plan on ~50 MB/s of bandwidth per core
  •  SANs are typically configured by default for IOPS, not bandwidth
  •  Ensure SAN ports/switch topology allows required aggregate bandwidth
  •  Performance of the backend storage should be tested/sized
  •  Local disks will give ~100-140 MB/s per disk: pick the correct controller
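
The storage rule of thumb above turns into simple arithmetic. The sketch below applies it to a host with 12 cores, matching the two hex-core X5650 sockets of the test cluster; `disks_needed` is an illustrative helper, and the per-core and per-disk figures are the slide's approximations.

```python
import math

def disks_needed(cores, mb_per_sec_per_core=50, mb_per_sec_per_disk=100):
    """Number of local disks needed to supply the planned bandwidth for all cores."""
    required = cores * mb_per_sec_per_core
    return math.ceil(required / mb_per_sec_per_disk)

cores = 12  # two hex-core sockets per host, as in the test cluster
print(cores * 50, "MB/s required")
print(disks_needed(cores, mb_per_sec_per_disk=100), "disks at 100 MB/s")
print(disks_needed(cores, mb_per_sec_per_disk=140), "disks at 140 MB/s")
```

Under these assumptions a 12-core host needs about 600 MB/s, or five to six local disks, which fits within the 10 data disks per host used in the test cluster.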




25
Summary

!  Hadoop does work well in a virtual environment
!  Plan a virtual cluster, enable other big-data solutions on the same
 infrastructure
!  Leverage the recipes to automate your configuration and
 deployment




26

Más contenido relacionado

La actualidad más candente

Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastuctureDataWorks Summit
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesDataWorks Summit
 
Cloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentCloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentGlusterFS
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGlusterFS
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011GlusterFS
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageGlusterFS
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Kathleen Ting
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaCloudera, Inc.
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisScaleOut Software
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedInDataWorks Summit
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHortonworks
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computingJoey Echeverria
 
Gluster Blog 11.15.2010
Gluster Blog 11.15.2010Gluster Blog 11.15.2010
Gluster Blog 11.15.2010GlusterFS
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
What's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemWhat's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemCloudera, Inc.
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Open Stack
 
Postgres Plus Cloud Database
Postgres Plus Cloud DatabasePostgres Plus Cloud Database
Postgres Plus Cloud DatabaseGary Carter
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 

La actualidad más candente (20)

Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastucture
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual Machines
 
Cloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentCloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and Deployment
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFS
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
Google Compute and MapR
Google Compute and MapRGoogle Compute and MapR
Google Compute and MapR
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Gluster Blog 11.15.2010
Gluster Blog 11.15.2010Gluster Blog 11.15.2010
Gluster Blog 11.15.2010
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
What's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemWhat's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File System
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011
 
Postgres Plus Cloud Database
Postgres Plus Cloud DatabasePostgres Plus Cloud Database
Postgres Plus Cloud Database
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 

Destacado

Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopHortonworks
 
VM Farms Thrive with Dedicated IP Storage Networks
VM Farms Thrive with Dedicated IP Storage NetworksVM Farms Thrive with Dedicated IP Storage Networks
VM Farms Thrive with Dedicated IP Storage NetworksBrocade
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009Richard McDougall
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data ApplicationsRichard McDougall
 
Virtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMwareVirtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMwareRichard McDougall
 
VMware Performance Troubleshooting
VMware Performance TroubleshootingVMware Performance Troubleshooting
VMware Performance Troubleshootingglbsolutions
 
Denver VMUG nov 2011
Denver VMUG nov 2011Denver VMUG nov 2011
Denver VMUG nov 2011Dan Brinkmann
 
Citrix Remote Access Solution Soup
Citrix Remote Access Solution SoupCitrix Remote Access Solution Soup
Citrix Remote Access Solution SoupDan Brinkmann
 
VMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingVMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingDan Brinkmann
 
VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6Vepsun Technologies
 

Destacado (17)

Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
VM Farms Thrive with Dedicated IP Storage Networks
VM Farms Thrive with Dedicated IP Storage NetworksVM Farms Thrive with Dedicated IP Storage Networks
VM Farms Thrive with Dedicated IP Storage Networks
 
Making of the Burner Board
Making of the Burner BoardMaking of the Burner Board
Making of the Burner Board
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data Applications
 
Virtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMwareVirtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMware
 
Hadoop I/O Analysis
Hadoop I/O AnalysisHadoop I/O Analysis
Hadoop I/O Analysis
 
VMware Performance Troubleshooting
VMware Performance TroubleshootingVMware Performance Troubleshooting
VMware Performance Troubleshooting
 
Denver VMUG nov 2011
Denver VMUG nov 2011Denver VMUG nov 2011
Denver VMUG nov 2011
 
Citrix Remote Access Solution Soup
Citrix Remote Access Solution SoupCitrix Remote Access Solution Soup
Citrix Remote Access Solution Soup
 
VMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingVMware vSphere Performance Troubleshooting
VMware vSphere Performance Troubleshooting
 
VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5
 
VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2
 
VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4
 
VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3
 
VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6
 

Similar a Hadoop on VMware

Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudCloudera, Inc.
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopDataWorks Summit
 
Lego Cloud SAP Virtualization Week 2012
Lego Cloud SAP Virtualization Week 2012Lego Cloud SAP Virtualization Week 2012
Lego Cloud SAP Virtualization Week 2012Benoit Hudzia
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Steve Staso
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesJason TC HOU (侯宗成)
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big datasolarisyourep
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big dataxKinAnx
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationCeph Community
 
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamComputeIn-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamComputePatrick McGarry
 
Build Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and HaduzillaBuild Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and HaduzillaJazz Yao-Tsung Wang
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationVlad Ponomarev
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01eimhee
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Red Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFSRed Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFSGlusterFS
 
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonInfinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonHentsū
 

Hadoop on VMware

  • 1. Hadoop as a Service: Hadoop on Virtualization. Hadoop World, December 2011. Jun Ping Du and Richard McDougall, VMware, Inc. © 2009 VMware Inc. All rights reserved
  • 2. Cloud: Big Shifts in Simplification and Optimization
      • 1. Reduce the complexity, to simplify operations and maintenance
      • 2. Dramatically lower costs, to redirect investment into value-add opportunities
      • 3. Enable flexible, agile IT service delivery, to meet and anticipate the needs of the business
  • 3. Infrastructure, Apps and now Data…
      • Simplify infrastructure with cloud
      • Simplify app platform through PaaS
      • Next trend: simplify data
      • Diagram labels: Private, Public; Build, Run, Manage
  • 4. Trend 1/3: New Data Growing at 60% Y/Y
      • Exabytes of information stored: 20 zettabytes by 2015, 1 yottabyte by 2030
      • Yes, you are part of the yotta generation…
      • Data sources shown: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies
      • Source: The Information Explosion, 2009
  • 5. Trend 2/3: Big Data – Driven by Real-World Benefit
  • 6. Trend 3/3: Value from Data Exceeds Hardware Cost
      • Value from the intelligence of data analytics now outstrips the cost of hardware
          • Hadoop enables the use of lower-cost hardware
          • Hardware cost halving every 18 months
      • Chart labels: Value, Cost; Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU
  • 7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
      • Trend is "not just Hadoop" for big data
          • Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
          • Unify the infrastructure platform for all
      • Common hardware base
          • Eliminate the hardware/driver/testing phase
          • Use existing team for ordering, diagnosis, and capacity management of the hardware farm
      • Diagram labels: SQL Cluster, NoSQL Cluster, Hadoop Cluster, DSS Cluster; Unified Big Data Infrastructure (Private, Public) with Big SQL, NoSQL, Hadoop
  • 8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
      • "I WANT MY HADOOP CLUSTER NOW!"
      • Instant cluster provisioning
          • Provision Hadoop clusters instantly
          • Automatable using provisioning engines/scripts, e.g. Whirr
  • 9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
      • Increase utilization
          • Hadoop cluster only uses the resources it needs
          • Extra resources can be used by other applications when not in use
      • Eliminate single points of failure
          • Use vSphere HA for NameNode and JobTracker
      • Use VM isolation
          • Create separate clusters with defensible security
          • Enables multiple versions of Hadoop on the same infrastructure
          • Extends to Hadoop and Linux environments
      • Leverage resource management
          • Control/assign resources through resource pools
          • E.g. use spare cycles for Hadoop processing through priority control
  • 10. What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine
  • 11. Performance Test: Cluster Configuration
      • Diagram: Mellanox 10 GbE switch; AMAX ClusterMax nodes (2X X5650, 96 GB, 12X SATA 500 GB, Mellanox 10 GbE adapter)
  • 12. Cluster Configuration
      • Hardware
          • AMAX ClusterMax, 7 nodes
          • 2X X5650 2.67 GHz hex-core, 96 GB memory
          • 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
          • Mellanox ConnectX VPI (MT26418), 10 GbE
          • Mellanox Vantage 6048, 10 GbE
      • OS/Hypervisor
          • RHEL 6.1 x86_64 (native and guest)
          • ESX 5.0 RTM with devel Mellanox driver
      • VMs (HT off/on)
          • 1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
          • 2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
          • 4 VMs (HT on only):
              • 2 small: 18400 MB, 5 vCPUs, 2 disks
              • 2 large: 27600 MB, 7 vCPUs, 3 disks
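The VM layouts on this slide all carve up the same per-host budget. A minimal Python sketch, using only the figures listed on the slide for the hyper-threading-on case, that totals memory, vCPUs, and disks per host. The table layout and the names `LAYOUTS` and `host_totals` are illustrative, not from the deck.

```python
# Per-host VM layouts from the slide (hyper-threading on).
# Each entry is (vm_count, memory_mb, vcpus, disks). Names are illustrative.
LAYOUTS = {
    "1 VM": [(1, 92000, 24, 10)],
    "2 VMs": [(2, 46000, 12, 5)],
    "4 VMs": [(2, 18400, 5, 2), (2, 27600, 7, 3)],
}


def host_totals(layout):
    """Sum memory (MB), vCPUs, and data disks over all VMs on one host."""
    memory = sum(n * mem for n, mem, _, _ in layout)
    vcpus = sum(n * cpu for n, _, cpu, _ in layout)
    disks = sum(n * d for n, _, _, d in layout)
    return memory, vcpus, disks


if __name__ == "__main__":
    for name, layout in LAYOUTS.items():
        print(name, host_totals(layout))
```

Every layout comes to 92000 MB, 24 vCPUs, and 10 disks, so the configurations differ only in how the host is partitioned.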
  • 13. Hadoop Configuration
      • Distribution
          • Cloudera CDH3u0
          • Based on Apache open-source 0.20.2
      • Parameters
          • dfs.datanode.max.xcievers=4096
          • dfs.replication=2
          • dfs.block.size=134217728
          • io.file.buffer.size=131072
          • mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
          • mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
      • Network topology
          • Hadoop uses topology info for reliability and performance
          • Multiple VMs/host: each host is a "rack"
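Hadoop reads parameters like these from XML files made of `<property>` elements, each with a `<name>` and a `<value>`. A minimal sketch, assuming the values from the slide and the "virtual" heap setting, that renders them with Python's standard library. The slide does not say which configuration file each property lives in, so the sketch emits a single `<configuration>` block. The names `PARAMS` and `render_configuration` are illustrative.

```python
import xml.etree.ElementTree as ET

# Values copied from the slide; the "virtual" JVM options are used here.
PARAMS = {
    "dfs.datanode.max.xcievers": "4096",
    "dfs.replication": "2",
    "dfs.block.size": "134217728",    # 128 * 1024 * 1024 bytes
    "io.file.buffer.size": "131072",  # 128 * 1024 bytes
    "mapred.child.java.opts": "-Xmx1900m -Xmn512m",
}


def render_configuration(params):
    """Build a Hadoop-style <configuration> document from name/value pairs."""
    root = ET.Element("configuration")
    for name, value in params.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_configuration(PARAMS))
```

The block size works out to 128 MB, and the I/O buffer to 128 KB.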
  • 14. Benchmarks
      • Derived from test apps included in distro
      • Pi
          • Direct-exec Monte-Carlo estimation of pi
          • # map tasks = # logical processors
          • 1.68 T samples
          • Figure annotation: π ~ 4*R/(R+G) = 22/7
      • TestDFSIO
          • Streaming write and read
          • 1 TB
          • More tasks than processors
      • Terasort
          • 3 phases: teragen, terasort, teravalidate
          • 10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
          • More tasks than processors
          • CPU, networking, and storage I/O
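The Pi benchmark's estimate π ~ 4*R/(R+G) can be illustrated in a few lines of Python. This is a sketch under two assumptions: R and G are read as the counts of sample points falling inside and outside a quarter circle, and plain pseudo-random sampling is used, which may differ from the sampling scheme in the bundled Hadoop example. The function name `estimate_pi` is illustrative.

```python
import random


def estimate_pi(samples, seed=0):
    """Estimate pi as 4*R/(R+G) from points in the unit square.

    R counts points inside the quarter circle of radius 1, G counts the rest.
    """
    rng = random.Random(seed)
    inside = 0  # R
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside  # G
    return 4.0 * inside / (inside + outside)


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The benchmark itself uses 1.68 T samples spread over one map task per logical processor. The sketch runs a single small batch to show the arithmetic.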
  • 15. Performance of Hadoop for Several Workloads
      • Chart: ratio of time taken relative to native (lower is better); series: 1 VM, 2 VMs
  • 16. Architecting Hadoop as a Service using Virtualization
      • Goals
          • Make it fast and easy to provision new Hadoop clusters on demand
          • Leverage virtual machines to provide isolation (esp. for multi-tenant)
          • Optimize Hadoop's performance based on virtual topologies
          • Make the system reliable based on virtual topologies
      • Leveraging virtualization
          • Elastic scale in/out
          • Use high availability to protect NameNode/JobTracker
          • Resource controls and sharing: re-use underutilized memory, CPU
          • Prioritize workloads: limit or guarantee resource usage in a mixed environment
  • 17. Provisioning
      • Leverage the vSphere APIs to auto-deploy a cluster
          • Whirr, HOD, or custom using Ruby, Chef, etc.
      • Use linked clones to rapidly fork many nodes
  • 18. Fast Provisioning
      • From a "seed" node to a cluster
      • Thin provisioning: 60 GB => 3.5 GB
      • Linked clone: ~6 seconds
  • 19. SAN, NAS or Local Disk?
      • Shared storage: SAN or NAS
          • Easy to provision
          • Automated cluster rebalancing
      • Hybrid storage
          • SAN for boot images, VMs, other workloads
          • Local disk for HDFS
          • Scalable bandwidth, lower cost/GB
      • Diagram: hosts running Hadoop VMs alongside other VMs
  • 20. Enable Automatic Rack Awareness through vSphere
      • Important for a robust Hadoop cluster
      • Automatic network topology detection: an important vSphere feature
      • Rack script is generated automatically
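A Hadoop rack script receives one or more node addresses as command-line arguments and prints one rack path per argument. In Hadoop 0.20 it is registered through the `topology.script.file.name` property. A minimal sketch of what a generated script could look like, treating each physical host as a "rack" as the configuration slide describes. The addresses, host names, and the `VM_TO_HOST` table are hypothetical; the deck does not show how the mapping is obtained from vSphere.

```python
import sys

# Hypothetical mapping from Hadoop node (VM) address to the physical host
# it runs on. A generated script would fill this in from inventory data.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
    "10.0.0.22": "esx-host-2",
}
DEFAULT_RACK = "/default-rack"


def resolve(node):
    """Return a rack path for a node; each physical host is treated as a rack."""
    host = VM_TO_HOST.get(node)
    return "/" + host if host else DEFAULT_RACK


if __name__ == "__main__":
    # One rack path per argument, separated by whitespace, on standard output.
    print(" ".join(resolve(arg) for arg in sys.argv[1:]))
```

With this mapping, two VMs on the same host resolve to the same rack, so rack-aware placement avoids keeping every copy of a block on one physical machine.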
  • 21. Multi-tenant: Share Cluster or Not
      • Shared big cluster vs. isolated small clusters
          • Shared big cluster: high performance, large scale, pre-job provisioning
          • Isolated small clusters: secure, flexible, post-job provisioning
      • Combination, as customers' requirements differ
  • 22. Elastic Hadoop Cluster
      • Traditional Hadoop cluster
          • Easy to scale out
          • Fast-provision new Hadoop nodes and join them to the existing cluster
          • Hard to scale in:
              while (ClusterIsTooLarge) { choose node k; kill(node k); wait(k's data blocks are recovered); if necessary, hadoop.rebalance(); }
      • Elastic Hadoop cluster
          • Diagram labels: NN, JT; normal node, elastic node; TaskTracker, DataNode
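The scale-in loop on this slide can be made concrete with a toy in-memory model. The sketch below is not a Hadoop API. It represents the cluster as a map from block ID to the set of nodes holding a replica, removes one node at a time, and restores the replication factor before removing the next. Placing recovered replicas on the least-loaded node stands in for the slide's optional rebalance step. All names are illustrative.

```python
from collections import Counter


def recover_blocks(blocks, nodes, replication):
    """Re-replicate under-replicated blocks onto the least-loaded live nodes."""
    load = Counter({n: 0 for n in nodes})
    for holders in blocks.values():
        for n in holders:
            load[n] += 1
    for holders in blocks.values():
        while len(holders) < min(replication, len(nodes)):
            candidates = [n for n in nodes if n not in holders]
            target = min(candidates, key=lambda n: (load[n], n))
            holders.add(target)
            load[target] += 1


def scale_in(blocks, nodes, target_size, replication=2):
    """Shrink the cluster one node at a time, recovering blocks after each step."""
    nodes = list(nodes)
    while len(nodes) > target_size:              # while (ClusterIsTooLarge)
        victim = nodes.pop()                     # choose node k; kill(node k)
        for block_id, holders in blocks.items():
            holders.discard(victim)
            if not holders:
                raise RuntimeError("block %s lost all replicas" % block_id)
        recover_blocks(blocks, nodes, replication)  # wait for recovery
    return nodes


if __name__ == "__main__":
    cluster = ["n1", "n2", "n3", "n4"]
    placement = {"b1": {"n1", "n4"}, "b2": {"n2", "n3"}, "b3": {"n3", "n4"}}
    print(scale_in(placement, cluster, target_size=2), placement)
```

The model shows why scaling in is slow: each removal has to wait for re-replication, because removing a second node too early could drop the last copy of a block.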
  • 23. Replica Placement
      • Second replica
          • Different rack
          • Rack-awareness required
      • Third replica
          • Same rack, different physical host
          • Nodes share host (in virtualized environment)
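The placement rules on this slide can be sketched as a small selection function. Two assumptions are made here: "same rack" for the third replica is read as the rack of the second replica, and each node carries both a rack label and the physical host its VM runs on. The `Node` type and `place_replicas` are illustrative and pick the first matching node rather than a random one.

```python
from typing import List, NamedTuple


class Node(NamedTuple):
    name: str
    rack: str
    host: str  # physical host the VM runs on


def place_replicas(writer: Node, nodes: List[Node]) -> List[Node]:
    """Pick up to three replica locations following the rules on the slide."""
    first = writer
    # Second replica: a node on a different rack from the first.
    second = next((n for n in nodes if n.rack != first.rack), None)
    if second is None:
        return [first]
    # Third replica: same rack as the second, but a different physical host,
    # so two copies never sit in VMs that share one host.
    third = next(
        (n for n in nodes if n.rack == second.rack and n.host != second.host),
        None,
    )
    return [first, second] if third is None else [first, second, third]


if __name__ == "__main__":
    cluster = [
        Node("vm1", "r1", "h1"),
        Node("vm2", "r1", "h1"),
        Node("vm3", "r2", "h2"),
        Node("vm4", "r2", "h2"),
        Node("vm5", "r2", "h3"),
    ]
    print([n.name for n in place_replicas(cluster[0], cluster)])
```

In the example, `vm4` is skipped for the third replica because it shares physical host `h2` with `vm3`. The third copy goes to `vm5` on `h3`.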
  • 25. Performance
      • Create more, smaller VMs
          • Makes Hadoop scale better
          • Allows easier and faster adjustment of how VMs are packed across hosts by vSphere (including through DRS)
      • Sizing/configuration of storage is critical
          • Plan on ~50 MB/s of bandwidth per core
          • SANs are typically configured by default for IOPS, not bandwidth
          • Ensure the SAN ports/switch topology allows the required aggregate bandwidth
          • Performance of the backend storage should be tested/sized
          • Local disks will give ~100-140 MB/s per disk: pick the correct controller
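The storage rules of thumb on this slide translate into simple arithmetic. A minimal sketch, assuming the slide's figures of ~50 MB/s per core and ~100-140 MB/s per local disk, applied to a node like those in the test cluster (two hex-core CPUs, 10 data disks). The function names are illustrative.

```python
def required_bandwidth_mb_s(cores, per_core_mb_s=50):
    """Bandwidth to plan for, using the ~50 MB/s per core rule of thumb."""
    return cores * per_core_mb_s


def local_disk_bandwidth_mb_s(disks, per_disk_mb_s=(100, 140)):
    """Aggregate (low, high) bandwidth from local disks at ~100-140 MB/s each."""
    low, high = per_disk_mb_s
    return disks * low, disks * high


if __name__ == "__main__":
    # Example node: 2 hex-core CPUs (12 cores), 10 disks used for Hadoop data.
    need = required_bandwidth_mb_s(12)
    low, high = local_disk_bandwidth_mb_s(10)
    print(need, low, high)
```

For this node the plan calls for 600 MB/s, against roughly 1000 to 1400 MB/s from the local disks, provided the disk controller can sustain that aggregate rate.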
  • 26. Summary
      • Hadoop does work well in a virtual environment
      • Plan a virtual cluster; enable other big-data solutions on the same infrastructure
      • Leverage the recipes to automate your configuration and deployment