Hadoop as a Service
Hadoop on Virtualization
Hadoop World, December 2011

Jun Ping Du
Richard McDougall
VMware, Inc.




                              © 2009 VMware Inc. All rights reserved
Cloud: Big Shifts in Simplification and Optimization


1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business




 2
Infrastructure, Apps and now Data…




[Diagram: Build, Run, Manage across Private and Public clouds]

•  Simplify Infrastructure With Cloud
•  Simplify App Platform Through PaaS
•  Next Trend: Simplify Data




 3
Trend 1/3: New Data Growing at 60% Y/Y

Exabytes of information stored
    •  20 Zetta by 2015
    •  1 Yotta by 2030
    •  Yes, you are part of the yotta generation…

[Chart labels: audio, digital TV, digital photos, camera phones, RFID, medical imaging, sensors, satellite images, games, scanners, Twitter, CAD/CAM, appliances, videoconferencing, digital movies]

Source: The Information Explosion, 2009


4
Trend 2/3: Big Data – Driven by Real-World Benefit




5
Trend 3/3: Value from Data Exceeds Hardware Cost

•  Value from the intelligence of data analytics now outstrips the cost of hardware
    •  Hadoop enables the use of lower-cost hardware
    •  Hardware cost halving every 18 months

[Chart: Value vs. Cost. Big Iron: $40k/CPU. Commodity Cluster: $1k/CPU]




6
Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware

•  Trend is “not just Hadoop” for big data
    •  Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
    •  Unify the infrastructure platform for all
•  Common Hardware Base
    •  Eliminate the hardware/driver/testing phase
    •  Use existing team for ordering, diagnosis, capacity management of hardware farm

[Diagram: SQL Cluster, NoSQL Cluster, Hadoop Cluster, DSS Cluster; Unified Big Data Infrastructure (Big SQL, NoSQL, Hadoop) on Private and Public cloud]

 7
Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning

I WANT MY HADOOP CLUSTER NOW!

•  Instant Cluster Provisioning
    •  Provision Hadoop clusters instantly
    •  Automatable using provisioning engines/scripts, e.g. Whirr




  8
Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities

•  Increase Utilization
    •  Hadoop cluster only uses the resources it needs
    •  Extra resources can be used by other applications when not in use
•  Eliminate single points of failure
    •  Use vSphere HA for NameNode and JobTracker
•  Use VM Isolation
    •  Create separate clusters with defensible security
    •  Enables multiple versions of Hadoop on the same infrastructure
    •  Extends to Hadoop and Linux environments
•  Leverage Resource Management
    •  Control/assign resources through resource pools
    •  E.g. use spare cycles for Hadoop processing through priority control




9
What? Hadoop in a VM? Really?




        Actually, Hadoop performs well in a virtual machine




10
Performance Test: Cluster Configuration



                Mellanox 10 GbE switch



     AMAX ClusterMax
     2X X5650, 96 GB
     12X SATA 500 GB
     Mellanox 10 GbE adapter




11
Cluster Configuration
•  Hardware
    •  AMAX ClusterMax, 7 nodes
    •  2X X5650 2.67 GHz hex-core, 96 GB memory
    •  12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
    •  Mellanox ConnectX VPI (MT26418), 10 GbE
    •  Mellanox Vantage 6048, 10 GbE
•  OS/Hypervisor
    •  RHEL 6.1 x86_64 (native and guest)
    •  ESX 5.0 RTM with devel Mellanox driver
•  VMs (HT off/on)
    •  1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
    •  2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
    •  4 VMs (HT on only):
        •  2 small: 18400 MB, 5 vCPUs, 2 disks
        •  2 large: 27600 MB, 7 vCPUs, 3 disks
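
The three VM layouts above appear to split the same per-host resources in different ways. The short check below is only a sketch: it takes the hyper-threading-on figures from the slide and confirms that each layout adds up to the same memory, vCPU, and data-disk totals. The names `LAYOUTS` and `totals` are illustrative and not part of the original test harness.

```python
# Per-VM resources for each layout, taken from the slide (HT on):
# (memory in MB, vCPUs, data disks)
LAYOUTS = {
    "1 VM": [(92000, 24, 10)],
    "2 VMs": [(46000, 12, 5)] * 2,
    "4 VMs": [(18400, 5, 2)] * 2 + [(27600, 7, 3)] * 2,
}


def totals(vms):
    """Sum memory, vCPUs, and disks across all VMs on one host."""
    return tuple(sum(column) for column in zip(*vms))


for name, vms in LAYOUTS.items():
    memory_mb, vcpus, disks = totals(vms)
    print(f"{name}: {memory_mb} MB, {vcpus} vCPUs, {disks} disks")
```

Each layout totals 92000 MB, 24 vCPUs (2 sockets x 6 cores x 2 threads), and the 10 disks reserved for Hadoop data, so the comparison between layouts holds the per-host resources constant.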
12
Hadoop Configuration
•  Distribution
    •  Cloudera CDH3u0
    •  Based on Apache open-source 0.20.2
•  Parameters
    •  dfs.datanode.max.xcievers=4096
    •  dfs.replication=2
    •  dfs.block.size=134217728
    •  io.file.buffer.size=131072
    •  mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
    •  mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
•  Network topology
    •  Hadoop uses topology info for reliability and performance
    •  Multiple VMs/host: each host is a “rack”
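
As a rough illustration of how the parameters above could be written out, the sketch below renders them as Hadoop-style `<configuration>` XML. The property names and values come from the slide (using the virtual-machine heap setting). The split across `hdfs-site.xml`, `core-site.xml`, and `mapred-site.xml` is an assumption based on the usual convention for this Hadoop generation, and `SETTINGS` and `render` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Values from the slide; the file each property lives in is an assumption.
SETTINGS = {
    "hdfs-site.xml": {
        "dfs.datanode.max.xcievers": "4096",
        "dfs.replication": "2",
        "dfs.block.size": "134217728",   # 128 MiB
    },
    "core-site.xml": {
        "io.file.buffer.size": "131072",  # 128 KiB
    },
    "mapred-site.xml": {
        "mapred.child.java.opts": "-Xmx1900m -Xmn512m",  # virtual case
    },
}


def render(properties):
    """Render a dict of name -> value as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, properties in SETTINGS.items():
    print(filename)
    print(render(properties))
```

The block size and buffer size are exact powers of two: 134217728 bytes is 128 MiB and 131072 bytes is 128 KiB.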



13
Benchmarks
•  Derived from test apps included in distro
•  Pi
    •  Direct-exec Monte-Carlo estimation of pi
    •  # map tasks = # logical processors
    •  1.68 T samples
    •  π ~ 4*R/(R+G) = 22/7
•  TestDFSIO
    •  Streaming write and read
    •  1 TB
    •  More tasks than processors
•  Terasort
    •  3 phases: teragen, terasort, teravalidate
    •  10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
    •  More tasks than processors
    •  CPU, networking, and storage I/O
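
The Pi benchmark's estimator can be sketched in a few lines. This is a plain single-process Monte Carlo illustration, not the Hadoop example job itself. It reads R as the number of samples that fall inside the quarter circle and G as the number outside, which is an interpretation of the slide's formula. The counts R=11 and G=3 used to reproduce 22/7 are made up for illustration.

```python
import random
from fractions import Fraction


def estimate_pi(samples, seed=0):
    """Estimate pi as 4*R/(R+G) from uniform random points in the unit square."""
    rng = random.Random(seed)
    inside = 0  # R: points inside the quarter circle of radius 1
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    outside = samples - inside  # G: points outside the quarter circle
    return 4 * inside / (inside + outside)


# Illustrative counts only: 11 inside and 3 outside give exactly 22/7.
example = Fraction(4 * 11, 11 + 3)

print(estimate_pi(200_000))
print(example)
```

In the benchmark the sampling is spread across map tasks, one per logical processor, so the job is almost entirely CPU-bound.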

14
Performance of Hadoop for Several Workloads

[Chart: Ratio of time taken, lower is better. Y axis: Ratio to Native, 0 to 1.2. Series: 1 VM, 2 VMs]

15
Architecting Hadoop as a Service using Virtualization

•  Goals
    •  Make it fast and easy to provision new Hadoop clusters on demand
    •  Leverage virtual machines to provide isolation (esp. for multi-tenant)
    •  Optimize Hadoop’s performance based on virtual topologies
    •  Make the system reliable based on virtual topologies
•  Leveraging Virtualization
    •  Elastic scale in/out
    •  Use high availability to protect NameNode/JobTracker
    •  Resource controls and sharing: re-use underutilized memory, CPU
    •  Prioritize workloads: limit or guarantee resource usage in a mixed environment




16
Provisioning

•  Leverage the vSphere APIs to auto-deploy a cluster
    •  Whirr, HOD, or custom tooling using Ruby, Chef, etc.
•  Use linked clones to rapidly fork many nodes




17
Fast Provisioning

•  From a “seed” node to a cluster
    •  Thin Provisioning: 60 GB => 3.5 GB
    •  Linked Clone: ~6 seconds

18
SAN, NAS or Local Disk?

•  Shared Storage: SAN or NAS
    •  Easy to provision
    •  Automated cluster rebalancing
•  Hybrid Storage
    •  SAN for boot images, VMs, other workloads
    •  Local disk for HDFS
    •  Scalable bandwidth, lower cost/GB

[Diagram: six hosts, each running a mix of Hadoop VMs and other VMs]




     19
Enable Automatic Rack awareness through vSphere


•  Rack awareness is important to a robust Hadoop cluster
•  Automatic network topology detection is an important vSphere feature
•  Rack script is generated automatically
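
For context, a Hadoop rack script is an executable that receives node addresses as arguments and prints one rack path per address. In Hadoop 0.20 it is registered through the `topology.script.file.name` property. The sketch below shows what a generated script might look like when each physical host is treated as a rack. The addresses, host names, and the `VM_TO_HOST` table are hypothetical, and in the setup described here that table would be derived from vSphere inventory data rather than written by hand.

```python
import sys

# Hypothetical inventory: Hadoop VM address -> physical host running it.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
}

# Rack path Hadoop uses for nodes with no known location.
DEFAULT_RACK = "/default-rack"


def resolve_racks(addresses, vm_to_host=VM_TO_HOST):
    """Return one rack path per address, treating each physical host as a rack."""
    racks = []
    for address in addresses:
        host = vm_to_host.get(address)
        racks.append(f"/{host}" if host else DEFAULT_RACK)
    return racks


if __name__ == "__main__":
    # Hadoop passes one or more addresses on the command line.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

With this mapping, two VMs on the same physical host report the same rack, so HDFS avoids placing all replicas of a block on one piece of hardware.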




20
Multi-tenant: share cluster or not

•  Shared big cluster vs. isolated small clusters
    •  Shared big cluster: high performance, large scale, pre-job provisioning
    •  Isolated small clusters: secure, flexible, post-job provisioning
•  Use a combination, as customers’ requirements differ

 21
Elastic Hadoop Cluster

•  Traditional Hadoop cluster
    •  Easy to scale out
        •  Fast-provision new Hadoop nodes and join them to the existing cluster
    •  Hard to scale in
 while (ClusterIsTooLarge) {
      choose node k;
      kill (node k);
      wait (k's data blocks are recovered);
      if necessary, hadoop.rebalance();
 }
•  Elastic Hadoop cluster

[Diagram: NN and JT with normal nodes and elastic nodes. Legend: Normal node, Elastic node, TaskTracker, DataNode]
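
To make the cost of the traditional scale-in loop concrete, the toy simulation below removes one node at a time and re-replicates the blocks it held before moving on to the next. It is only a sketch of the pseudocode above: the cluster is a plain dictionary, recovery is instantaneous, and names such as `build_cluster`, `decommission`, and `scale_in` are invented for illustration. In a real cluster the "wait" step is where the time goes, since every lost replica has to be copied over the network.

```python
def build_cluster(num_nodes, num_blocks, replication):
    """Toy cluster: node name -> set of block ids, replicas on consecutive nodes."""
    cluster = {f"node{i}": set() for i in range(num_nodes)}
    for block in range(num_blocks):
        for r in range(replication):
            cluster[f"node{(block + r) % num_nodes}"].add(block)
    return cluster


def decommission(cluster, victim, replication):
    """Kill one node, then restore the replication factor of its blocks."""
    lost = cluster.pop(victim)
    for block in sorted(lost):
        holders = [n for n in cluster if block in cluster[n]]
        spare = [n for n in cluster if block not in cluster[n]]
        if len(holders) < replication and spare:
            # Re-replicate onto the least-loaded node that lacks the block.
            target = min(spare, key=lambda n: len(cluster[n]))
            cluster[target].add(block)


def scale_in(cluster, target_size, replication):
    """Shrink the cluster one node at a time, recovering data after each kill."""
    while len(cluster) > target_size:
        victim = min(cluster, key=lambda n: len(cluster[n]))  # choose node k
        decommission(cluster, victim, replication)
    return cluster


cluster = scale_in(build_cluster(6, 12, 2), target_size=4, replication=2)
replica_counts = {
    block: sum(block in blocks for blocks in cluster.values())
    for block in range(12)
}
print(len(cluster), replica_counts)
```

Nodes have to be removed one at a time because, with a replication factor of 2, killing two nodes at once could destroy both copies of a block.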

22
Replica Placement

•  Second Replica
    •  Different rack
    •  Rack-awareness required
•  Third Replica
    •  Same rack, different physical host
    •  Nodes share host (in virtualized environment)
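
The placement rule above can be sketched as a small selection function. This is an illustration under stated assumptions, not HDFS's actual placement code: it assumes the first replica stays on the writing node, and that "same rack" for the third replica means the rack of the second replica. The `Node` type, the sample topology, and `place_replicas` are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    rack: str
    host: str  # physical host that runs this VM


def place_replicas(writer, nodes):
    """Pick three replica targets that never share a physical host."""
    first = writer
    # Second replica: any node on a different rack.
    second = next(n for n in nodes if n.rack != first.rack)
    # Third replica: same rack as the second, but a different physical host,
    # because two VMs on one host would fail together.
    third = next(
        n for n in nodes if n.rack == second.rack and n.host != second.host
    )
    return [first, second, third]


NODES = [
    Node("vm1", "rackA", "hostA1"),
    Node("vm2", "rackA", "hostA1"),
    Node("vm3", "rackB", "hostB1"),
    Node("vm4", "rackB", "hostB1"),
    Node("vm5", "rackB", "hostB2"),
]

placement = place_replicas(NODES[0], NODES)
print([n.name for n in placement])
```

Here `vm4` is skipped for the third replica even though it is a different Hadoop node from `vm3`, because both run on `hostB1`.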




23
Demo




24
Performance

•  Create more, smaller VMs
    •  Makes Hadoop scale better
    •  Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (including through DRS)
•  Sizing/configuration of storage is critical
    •  Plan on ~50 MB/s of bandwidth per core
    •  SANs are typically configured by default for IOPS, not bandwidth
    •  Ensure SAN ports/switch topology allows the required aggregate bandwidth
    •  Performance of the backend storage should be tested/sized
    •  Local disks will give ~100-140 MB/s per disk: pick the correct controller
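
The storage rules of thumb above translate into simple arithmetic. The sketch below applies them to a 12-core host like the ones in the test cluster. The constants come from the slide, while the function names are illustrative and the result is a planning estimate rather than a measured figure.

```python
import math

MB_PER_SEC_PER_CORE = 50            # planning figure from the slide
LOCAL_DISK_MB_PER_SEC = (100, 140)  # per-disk streaming range from the slide


def required_bandwidth(cores):
    """Aggregate storage bandwidth (MB/s) to plan for on one host."""
    return cores * MB_PER_SEC_PER_CORE


def disks_needed(bandwidth_mb_s, per_disk_mb_s):
    """Number of local disks needed to supply the given bandwidth."""
    return math.ceil(bandwidth_mb_s / per_disk_mb_s)


cores = 12  # 2 sockets x 6 cores, as in the test cluster
bandwidth = required_bandwidth(cores)
slow, fast = LOCAL_DISK_MB_PER_SEC
print(f"{cores} cores need about {bandwidth} MB/s")
print(f"local disks needed: {disks_needed(bandwidth, fast)} to {disks_needed(bandwidth, slow)}")
```

By this estimate a 12-core host wants roughly 600 MB/s, which 5 to 6 local disks can supply, and which a SAN path would have to sustain per host if shared storage were used for HDFS.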




25
Summary

•  Hadoop does work well in a virtual environment
•  Plan a virtual cluster, enable other big-data solutions on the same infrastructure
•  Leverage the recipes to automate your configuration and deployment




26

Más contenido relacionado

La actualidad más candente

Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual Machines
DataWorks Summit
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
Brock Noland
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
Joey Echeverria
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011
Open Stack
 

La actualidad más candente (20)

Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastucture
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual Machines
 
Cloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentCloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and Deployment
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFS
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, ClouderaHadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
Hadoop World 2011: Hadoop and Performance - Todd Lipcon & Yanpei Chen, Cloudera
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
Google Compute and MapR
Google Compute and MapRGoogle Compute and MapR
Google Compute and MapR
 
The power of hadoop in cloud computing
The power of hadoop in cloud computingThe power of hadoop in cloud computing
The power of hadoop in cloud computing
 
Gluster Blog 11.15.2010
Gluster Blog 11.15.2010Gluster Blog 11.15.2010
Gluster Blog 11.15.2010
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
What's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File SystemWhat's New and Upcoming in HDFS - the Hadoop Distributed File System
What's New and Upcoming in HDFS - the Hadoop Distributed File System
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011
 
Postgres Plus Cloud Database
Postgres Plus Cloud DatabasePostgres Plus Cloud Database
Postgres Plus Cloud Database
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 

Destacado

VMware Performance Troubleshooting
VMware Performance TroubleshootingVMware Performance Troubleshooting
VMware Performance Troubleshooting
glbsolutions
 

Destacado (17)

Best Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache HadoopBest Practices for Virtualizing Apache Hadoop
Best Practices for Virtualizing Apache Hadoop
 
VM Farms Thrive with Dedicated IP Storage Networks
VM Farms Thrive with Dedicated IP Storage NetworksVM Farms Thrive with Dedicated IP Storage Networks
VM Farms Thrive with Dedicated IP Storage Networks
 
Making of the Burner Board
Making of the Burner BoardMaking of the Burner Board
Making of the Burner Board
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data Applications
 
Virtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMwareVirtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMware
 
Hadoop I/O Analysis
Hadoop I/O AnalysisHadoop I/O Analysis
Hadoop I/O Analysis
 
VMware Performance Troubleshooting
VMware Performance TroubleshootingVMware Performance Troubleshooting
VMware Performance Troubleshooting
 
Denver VMUG nov 2011
Denver VMUG nov 2011Denver VMUG nov 2011
Denver VMUG nov 2011
 
Citrix Remote Access Solution Soup
Citrix Remote Access Solution SoupCitrix Remote Access Solution Soup
Citrix Remote Access Solution Soup
 
VMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingVMware vSphere Performance Troubleshooting
VMware vSphere Performance Troubleshooting
 
VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5
 
VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2
 
VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4
 
VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3
 
VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6
 

Similar a Hadoop on VMware

Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
solarisyourep
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
Vlad Ponomarev
 
Red Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFSRed Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFS
GlusterFS
 

Similar a Hadoop on VMware (20)

Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in Cloud
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing Hadoop
 
Lego Cloud SAP Virtualization Week 2012
Lego Cloud SAP Virtualization Week 2012Lego Cloud SAP Virtualization Week 2012
Lego Cloud SAP Virtualization Week 2012
 
Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09Scaling With Sun Systems For MySQL Jan09
Scaling With Sun Systems For MySQL Jan09
 
Introduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network IssuesIntroduction to Cloud Data Center and Network Issues
Introduction to Cloud Data Center and Network Issues
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
End of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph ReplicationEnd of RAID as we know it with Ceph Replication
End of RAID as we know it with Ceph Replication
 
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamComputeIn-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute
 
Build Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and HaduzillaBuild Your Private Cloud with Ezilla and Haduzilla
Build Your Private Cloud with Ezilla and Haduzilla
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
 
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
VMworld 2013: Big Data Platform Building Blocks: Serengeti, Resource Manageme...
 
Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01Commonanduniqueusecases 110831113310-phpapp01
Commonanduniqueusecases 110831113310-phpapp01
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Red Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFSRed Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFS
 
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonInfinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

Hadoop on VMware

  • 1. Hadoop as a Service Hadoop on Virtualization Hadoop World, December 2011 Jun Ping Du Richard McDougall VMware, Inc. © 2009 VMware Inc. All rights reserved
  • 2. Cloud: Big Shifts in Simplification and Optimization 1. Reduce the Complexity 2. Dramatically Lower 3. Enable Flexible, Agile Costs IT Service Delivery to simplify operations to redirect investment into to meet and anticipate the and maintenance value-add opportunities needs of the business 2
  • 3. Infrastructure, Apps and now Data… Build Run Private Public Manage Simplify Infrastructure Simplify App Platform Next Trend: With Cloud Through PaaS Simplify Data 3
  • 4. Trend 1/3: New Data Growing at 60% Y/Y Exabytes of information stored 20 Zetta by 2015 1 Yotta by 2030 Yes, you are part of the yotta audio( generation… digital(tv( digital(photos( camera(phones,(rfid( medical(imaging,( sensors( satellite(images,(games,(scanners,( twi8er( cad/cam,(appliances,(videoconfercing,(digital(movies( Source: The Information Explosion , 2009 4
  • 5. Trend 2/3: Big Data – Driven by Real-World Benefit
  • 6. Trend 3/3: Value from Data Exceeds Hardware Cost
      • Value from the intelligence of data analytics now outstrips the cost of hardware
          • Hadoop enables the use of lower-cost hardware
          • Hardware cost halving every 18 months
      • [Chart: value vs. cost] Big Iron: $40k/CPU; Commodity Cluster: $1k/CPU
  • 7. Three Big Reasons to Virtualize Hadoop: 1. Simplify Hardware
      • Trend is "not just Hadoop" for big data
          • Hadoop is often combined with other technologies: Big SQL, NoSQL, etc.
          • Unify the infrastructure platform for all
      • Common Hardware Base
          • Eliminate the hardware/driver/testing phase
          • Use existing team for ordering, diagnosis, capacity management of hardware farm
      • [Diagram: separate SQL, NoSQL, Hadoop, and DSS clusters consolidated into a Unified Big Data Infrastructure (Big SQL, NoSQL, Hadoop; Private, Public)]
  • 8. Three Big Reasons to Virtualize Hadoop: 2. Rapid Provisioning
      • "I WANT MY HADOOP CLUSTER NOW!"
      • Instant Cluster Provisioning
          • Provision Hadoop clusters instantly
          • Automatable using provisioning engines/scripts, e.g. Whirr
  • 9. Three Big Reasons to Virtualize Hadoop: 3. Leverage Capabilities
      • Increase Utilization
          • Hadoop cluster only uses resources it needs
          • Extra resources can be used by other applications when not in use
      • Eliminate single points of failure
          • Use vSphere HA for NameNode and JobTracker
      • Use VM Isolation
          • Create separate clusters with defensible security
          • Enables multiple versions of Hadoop on the same infrastructure
          • Extends to Hadoop and Linux environments
      • Leverage Resource Management
          • Control/assign resources through resource pools
          • E.g. use spare cycles for Hadoop processing through priority control
  • 10. What? Hadoop in a VM? Really? Actually, Hadoop performs well in a virtual machine.
  • 11. Performance Test: Cluster Configuration
      • [Diagram] Mellanox 10 GbE switch connecting AMAX ClusterMax nodes
      • Per node: 2X X5650, 96 GB, 12X SATA 500 GB, Mellanox 10 GbE adapter
  • 12. Cluster Configuration
      • Hardware
          • AMAX ClusterMax, 7 nodes
          • 2X X5650 2.67 GHz hex-core, 96 GB memory
          • 12X SATA 500 GB 7200 RPM (10 for Hadoop data), EXT4
          • Mellanox ConnectX VPI (MT26418), 10 GbE
          • Mellanox Vantage 6048, 10 GbE
      • OS/Hypervisor
          • RHEL 6.1 x86_64 (native and guest)
          • ESX 5.0 RTM with devel Mellanox driver
      • VMs (HT off/on)
          • 1 VM: 92000 MB, (12/24) vCPUs, 10 PRDM disks
          • 2 VMs: 46000 MB, (6/12) vCPUs, 5 PRDM disks
          • 4 VMs (HT on only):
              • 2 small: 18400 MB, 5 vCPUs, 2 disks
              • 2 large: 27600 MB, 7 vCPUs, 3 disks
  • 13. Hadoop Configuration
      • Distribution
          • Cloudera CDH3u0
          • Based on Apache open-source 0.20.2
      • Parameters
          • dfs.datanode.max.xcievers=4096
          • dfs.replication=2
          • dfs.block.size=134217728
          • io.file.buffer.size=131072
          • mapred.child.java.opts="-Xmx2048m -Xmn512m" (native)
          • mapred.child.java.opts="-Xmx1900m -Xmn512m" (virtual)
      • Network topology
          • Hadoop uses topology info for reliability and performance
          • Multiple VMs/host: each host is a "rack"
  • 14. Benchmarks
      • Derived from test apps included in distro
      • Pi
          • Direct-exec Monte-Carlo estimation of pi
          • # map tasks = # logical processors
          • 1.68 T samples
          • [Figure annotation: π ~ 4*R/(R+G) = 22/7]
      • TestDFSIO
          • Streaming write and read
          • 1 TB
          • More tasks than processors
      • Terasort
          • 3 phases: teragen, terasort, teravalidate
          • 10B or 35B records, each 100 bytes (1 TB, 3.5 TB)
          • More tasks than processors
          • CPU, networking, and storage I/O
  • 15. Performance of Hadoop for Several Workloads
      • [Chart: ratio of time taken relative to native, lower is better; series: 1 VM, 2 VMs]
  • 16. Architecting Hadoop as a Service using Virtualization
      • Goals
          • Make it fast and easy to provision new Hadoop clusters on demand
          • Leverage virtual machines to provide isolation (esp. for multi-tenant)
          • Optimize Hadoop's performance based on virtual topologies
          • Make the system reliable based on virtual topologies
      • Leveraging Virtualization
          • Elastic scale in/out
          • Use high availability to protect NameNode/JobTracker
          • Resource controls and sharing: re-use underutilized memory, CPU
          • Prioritize workloads: limit or guarantee resource usage in a mixed environment
  • 17. Provisioning
      • Leverage the vSphere APIs to auto-deploy a cluster
          • Whirr, HOD, or custom using Ruby, Chef, etc.
      • Use linked clones to rapidly fork many nodes
  • 18. Fast Provisioning
      • From a "seed" node to a cluster
      • Thin Provisioning: 60 GB => 3.5 GB
      • Linked Clone: ~6 seconds
  • 19. SAN, NAS or Local Disk?
      • Shared Storage: SAN or NAS
          • Easy to provision
          • Automated cluster rebalancing
      • Hybrid Storage
          • SAN for boot images, VMs, other workloads
          • Local disk for HDFS
          • Scalable bandwidth, lower cost/GB
      • [Diagram: hosts each running Hadoop VMs alongside other VMs]
  • 20. Enable Automatic Rack Awareness through vSphere
      • Rack awareness is important for a robust Hadoop cluster
      • Automatic network topology detection, an important vSphere feature
      • Rack script is generated automatically
  • 21. Multi-tenant: share cluster or not
      • Shared big cluster: high performance, large scale, pre-job provisioning
      • vs. isolated small clusters: secure, flexible, post-job provisioning
      • Combination, as customers' requirements differ
  • 22. Elastic Hadoop Cluster
      • Traditional Hadoop cluster
          • Easy to scale out
              • Fast-provision new Hadoop nodes and join them into the existing cluster
          • Hard to scale in
              • while (ClusterIsTooLarge) { choose node k; kill(node k); wait(k's data blocks are recovered); if necessary, hadoop.rebalance(); }
      • Elastic Hadoop cluster
          • [Diagram: NN, JT, normal nodes, and elastic nodes; labels: TaskTracker, DataNode]
  • 23. Replica Placement
      • Second Replica
          • Different rack
          • Rack awareness required
      • Third Replica
          • Same rack, different physical host
          • Nodes share host (in virtualized environment)
  • 25. Performance
      • Create more, smaller VMs
          • Makes Hadoop scale better
          • Allows for easier/faster adjustment of packing of VMs across hosts by vSphere (including through DRS)
      • Sizing/configuration of storage is critical
          • Plan on ~50 MB/s of bandwidth per core
          • SANs are typically configured by default for IOPS, not bandwidth
          • Ensure SAN ports/switch topology allows required aggregate bandwidth
          • Performance of the backend storage should be tested/sized
          • Local disks will give ~100-140 MB/s per disk: pick correct controller
  • 26. Summary
      • Hadoop does work well in a virtual environment
      • Plan a virtual cluster, enable other big-data solutions on the same infrastructure
      • Leverage the recipes to automate your configuration and deployment
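
The Pi benchmark on slide 14 is described as a Monte-Carlo estimation with the annotation π ~ 4*R/(R+G). A minimal single-process sketch of that estimator follows. It assumes R counts random points that land inside the unit quarter circle and G counts points that land outside it; the slide does not define either symbol. The function name `estimate_pi` and the seed parameter are illustrative, and this is not the Hadoop job itself.

```python
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi as 4*R/(R+G) from uniformly random points in the unit square.

    Assumption: R = points inside the quarter circle, G = points outside.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0   # R
    outside = 0  # G
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
        else:
            outside += 1
    return 4 * inside / (inside + outside)


print(estimate_pi(100_000))
```

In the benchmark the samples are split across map tasks (one per logical processor, per the slide) and the counts are summed afterwards, so the work is almost entirely CPU-bound.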
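
Slides 13 and 20 say that with multiple VMs per host each physical host is treated as a "rack", and that the rack script is generated automatically. Hadoop's topology script interface is simple: the script receives one or more node addresses as arguments and prints one rack path per address. The sketch below shows what such a generated script might look like. The `VM_TO_HOST` table, its addresses, and the host names are hypothetical placeholders; how the deck's tooling actually derives the mapping from vSphere is not shown in the slides.

```python
import sys

# Hypothetical mapping from VM address to the physical host it runs on.
# In a generated script this table would be filled in by the provisioning tool.
VM_TO_HOST = {
    "10.0.0.11": "esx-host-1",
    "10.0.0.12": "esx-host-1",
    "10.0.0.21": "esx-host-2",
    "10.0.0.22": "esx-host-2",
}

DEFAULT_RACK = "/default-rack"  # Hadoop's conventional fallback rack name


def rack_for(address: str) -> str:
    """Return the rack path for one node, using the physical host as the rack."""
    host = VM_TO_HOST.get(address)
    return f"/{host}" if host else DEFAULT_RACK


def resolve(addresses):
    """Resolve a list of node addresses to rack paths, preserving order."""
    return [rack_for(a) for a in addresses]


if __name__ == "__main__":
    # Hadoop passes node addresses as arguments and reads rack paths from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

With this mapping, two VMs on the same physical host report the same rack, so HDFS will not count copies on both of them as rack-diverse replicas.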
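
The storage sizing figures on slide 25 (about 50 MB/s per core, about 100-140 MB/s per local disk) lend themselves to a quick back-of-the-envelope check. The sketch below applies them to the benchmark node from slide 12 (2 x hex-core, so 12 cores with hyper-threading off). The function names are illustrative, and the result is a rough planning number rather than a measured one.

```python
import math

MB_PER_SEC_PER_CORE = 50        # planning figure from slide 25
DISK_MB_PER_SEC_LOW = 100       # lower end of the per-disk range on slide 25
DISK_MB_PER_SEC_HIGH = 140      # upper end of the per-disk range on slide 25


def required_bandwidth(cores: int) -> int:
    """Aggregate storage bandwidth (MB/s) to plan for on one host."""
    return cores * MB_PER_SEC_PER_CORE


def disks_needed(cores: int, per_disk_mb_per_sec: int) -> int:
    """Local disks needed to supply that bandwidth, rounded up."""
    return math.ceil(required_bandwidth(cores) / per_disk_mb_per_sec)


cores = 12  # 2 x hex-core X5650, hyper-threading off
print(required_bandwidth(cores))                  # 600
print(disks_needed(cores, DISK_MB_PER_SEC_LOW))   # 6
print(disks_needed(cores, DISK_MB_PER_SEC_HIGH))  # 5
```

By this estimate the 10 data disks per node in the test cluster leave headroom, whereas a SAN sized for IOPS would need its ports and switch topology checked against the same 600 MB/s per host.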