Enabling Highly Available, Elastic, Multi-tenancy
Hadoop on Demand

Richard McDougall,
VMware, Inc
@richardmcdougll




                                             © 2009 VMware Inc. All rights reserved
Cloud: Big Shifts in Simplification and Optimization


1. Reduce the Complexity to simplify operations and maintenance
2. Dramatically Lower Costs to redirect investment into value-add opportunities
3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business




A Holistic View of a Big Data System:

[Diagram: components of a big data system]
§  Real Time Streams
§  Real-Time Processing (s4, storm)
§  ETL
§  Real Time Structured Database (HBase, Gemfire, Cassandra)
§  Big SQL (Greenplum, AsterData, etc.)
§  Batch Processing
§  Analytics
§  Unstructured Data (HDFS)

Common Infrastructure for Big Data

Cluster Sprawling
Single purpose clusters for various business applications lead to cluster sprawl.
[Diagram: separate Hadoop, HBase, and MPP DB clusters]

Cluster Consolidation
[Diagram: MPP DB, HBase, and Hadoop sharing one Virtualization Platform]
§  Simplify
     •  Single Hardware Infrastructure
     •  Unified operations
§  Optimize
     •  Shared Resources = higher utilization
     •  Elastic resources = faster on-demand access

Enterprise Challenges with Using Hadoop

§  Deployment
     •  Slow to provision
     •  Complex to keep running/tune
§  Single Points of Failure
     •  NameNode and JobTracker are single points of failure
     •  No HA for Hadoop Framework Components (Hive, HCatalog, etc.)
§  Low Utilization
     •  Dedicated clusters to run Hadoop with low CPU utilization
     •  No easy way to share resources between Hadoop and non-Hadoop workloads
     •  Noisy neighbors, lack of resource containment
§  Need Multi-tenant Isolation, Resource Management, etc.
     •  Noisy Neighbor - no performance or security isolation between different tenants/users
     •  Lack of configuration isolation - Can’t run multiple versions on the cluster




I.     Market Overview & Insights
II.    Virtualization + Hadoop
III.  Distribution & OSS Contribution




Hadoop Runs Well on Virtualization

Comparable performance to physical
[Chart: ratio to native performance, y-axis from 0 to 1.2, series for 1 VM and 2 VMs]
Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf

Use Local Disk where it’s Needed




                   SAN Storage          NAS Filers          Local Storage

    Cost           $2 - $10/Gigabyte    $1 - $5/Gigabyte    $0.05/Gigabyte

    $1M gets:
      Capacity     0.5 Petabytes        1 Petabyte          20 Petabytes
      IOPS         200,000              400,000             10,000,000
      Bandwidth    1 Gbyte/sec          2 Gbytes/sec        800 Gbytes/sec

Extend Virtual Storage Architecture to Include Local Disk

 §  Shared Storage: SAN or NAS
         •  Easy to provision
         •  Automated cluster rebalancing
 §  Hybrid Storage
         •  SAN for boot images, VMs, other workloads
         •  Local disk for Hadoop & HDFS
         •  Scalable Bandwidth, Lower Cost/GB

[Diagram: six hosts, each running Hadoop VMs alongside other VMs]




Why Virtualize Hadoop?

Simple to Operate
§  Rapid deployment
§  Unified operations across enterprise
§  Easy Clone of Cluster

Highly Available
§  No more single point of failure
§  One click to setup
§  High availability for MR Jobs

Elastic Scaling
§  Shrink and expand cluster on demand
§  Resource Guarantee
§  Independent scaling of Compute and data



Deploy a Hadoop Cluster in under 30 Minutes

Step 1: Deploy Serengeti virtual appliance on vSphere.
     •  Deploy vHelper OVF to vSphere

Step 2: A few simple commands to stand up Hadoop Cluster.
     •  Select compute, memory, storage and network
     •  Select configuration template
     •  Automate deployment
     •  Done


A Tour Through Serengeti


$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>




A Tour Through Serengeti


serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hadoop_NameNode, hadoop_jobtracker] 1            2   7500     LOCAL   50

name: myElephant, distro: cdh, status:RUNNING
  NAME    ROLES                                 INSTANCE   CPU MEM(MB) TYPE
  ---------------------------------------------------------------------------
  master [hive, hadoop_client, pig]             1          1   3700     LOCAL   50

     NAME                HOST                              IP
     -----------------------------------------------------------------
     myElephant-client0 rmc-elephant-009.eng.vmware.com    10.0.20.184




A Tour Through Serengeti


$ ssh rmc@rmc-elephant-009.eng.vmware.com

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data

…
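For a sense of scale: TeraGen's first argument is a row count, and each generated row is 100 bytes. A rough sketch of the resulting data volume follows; the replication factor of 3 is an assumption (the HDFS default) and may differ on a given cluster.

```python
# TeraGen's first argument is a row count; each generated row is 100 bytes.
ROW_BYTES = 100


def teragen_bytes(rows: int) -> int:
    """Total bytes TeraGen writes for the given row count, before HDFS replication."""
    return rows * ROW_BYTES


rows = 1_000_000_000  # the value passed on the slide
total = teragen_bytes(rows)
print(f"{total / 10**9:.0f} GB of generated data")

replication = 3  # assumption: HDFS default replication factor
raw = total * replication
print(f"{raw / 10**9:.0f} GB of raw disk at replication {replication}")
```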




Serengeti Spec File
[
        "distro": "apache",              ← Choice of Distro
          {
             "name": "master",
             "roles": [
                "hadoop_NameNode",
                "hadoop_jobtracker"
             ],
             "instanceNum": 1,
             "instanceType": "MEDIUM",
             "ha": true                  ← HA Option
          },
          {
             "name": "worker",
             "roles": [
                "hadoop_datanode", "hadoop_tasktracker"
             ],
             "instanceNum": 5,
             "instanceType": "SMALL",
             "storage": {                ← Choice of Shared Storage or Local Disk
                "type": "LOCAL",
                "sizeGB": 10
             }
          }
    ]
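The slide excerpt is abbreviated and is not a complete JSON document. A minimal sketch of assembling a similar spec programmatically and emitting valid JSON is shown here. The field names inside each group come from the slide; the `nodeGroups` wrapper key, the helper functions, and the validation rules are assumptions for illustration and are not taken from Serengeti itself.

```python
import json


def node_group(name, roles, instance_num, instance_type, ha=False, storage=None):
    """Build one node-group entry using the field names shown on the slide."""
    group = {
        "name": name,
        "roles": list(roles),
        "instanceNum": instance_num,
        "instanceType": instance_type,
    }
    if ha:
        group["ha"] = True
    if storage is not None:
        group["storage"] = storage
    return group


def validate(spec):
    """Basic sanity checks (hypothetical rules, not Serengeti's own validation)."""
    names = [g["name"] for g in spec["nodeGroups"]]
    if len(names) != len(set(names)):
        raise ValueError("node group names must be unique")
    for g in spec["nodeGroups"]:
        if g["instanceNum"] < 1:
            raise ValueError(f"{g['name']}: instanceNum must be >= 1")
        if not g["roles"]:
            raise ValueError(f"{g['name']}: at least one role is required")


spec = {
    "distro": "apache",
    # Assumption: "nodeGroups" is a wrapper key; the slide shows only the entries.
    "nodeGroups": [
        node_group("master", ["hadoop_NameNode", "hadoop_jobtracker"],
                   1, "MEDIUM", ha=True),
        node_group("worker", ["hadoop_datanode", "hadoop_tasktracker"],
                   5, "SMALL", storage={"type": "LOCAL", "sizeGB": 10}),
    ],
}

validate(spec)
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Generating the file from code keeps quoting and commas correct, which is where hand-edited specs tend to break.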

Configuring Distros


{
         "name" : "cdh",
         "version" : "3u3",
         "packages" : [
           {
              "roles" : ["hadoop_NameNode", "hadoop_jobtracker",
                         "hadoop_tasktracker", "hadoop_datanode",
                         "hadoop_client"],
              "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
           },
           {
              "roles" : ["hive"],
              "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
           },
           {
              "roles" : ["pig"],
              "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
           }
         ]
    },
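A distro entry of this shape maps roles to the tarball that provides them. As a rough sketch of how a tool might resolve a role to its package, assuming the entry is available as standalone JSON (the trailing comma on the slide suggests it is one element of a longer list); `tarballs_by_role` is a hypothetical helper, not part of Serengeti:

```python
import json

# One manifest entry, copied from the slide without the trailing comma.
MANIFEST_ENTRY = """
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {"roles": ["hadoop_NameNode", "hadoop_jobtracker",
               "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
     "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
"""


def tarballs_by_role(distro):
    """Map each role name to the tarball that provides it."""
    index = {}
    for package in distro["packages"]:
        for role in package["roles"]:
            if role in index:
                raise ValueError(f"role {role!r} is provided by more than one package")
            index[role] = package["tarball"]
    return index


distro = json.loads(MANIFEST_ENTRY)
index = tarballs_by_role(distro)
print(index["hive"])
```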




Serengeti Demo

§  Deploy Serengeti vApp on vSphere
§  Deploy a Hadoop cluster in 10 Minutes
§  Run MapReduce
§  Scale out the Hadoop cluster
§  Create a Customized Hadoop cluster
§  Use Your Favorite Hadoop Distribution

High Availability for the Hadoop Stack



[Diagram: components of the Hadoop stack that need high availability]
§  ETL Tools, BI Reporting, RDBMS
§  Pig (Data Flow), Hive (SQL), HCatalog
§  Hive MetaDB, HCatalog MDB
§  MapReduce (Job Scheduling/Execution System)
§  HBase (Key-Value store)
§  Jobtracker
§  HDFS (Hadoop Distributed File System)
§  Namenode
§  ZooKeeper (Coordination)
§  Management Server




Live Machine Migration Reduces Planned Downtime


Description:
Enables the live migration of virtual
machines from one host to another
with continuous service availability.

Benefits:
•     Revolutionary technology that is the
      basis for automated virtual machine
      movement
•     Meets service level and performance
      goals




vSphere High Availability (HA) - protection against unplanned downtime




     Overview
      •  Protection against host and VM failures
      •  Automatic failure detection (host, guest OS)
      •  Automatic virtual machine restart in minutes, on any available host in cluster
      •  OS and application-independent, does not require complex configuration
       changes

vSphere Fault Tolerance provides continuous protection

Overview
 •  Single identical VMs running in lockstep on separate hosts
 •  Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
 •  Integrated with VMware HA/DRS
 •  No complex clustering or specialized hardware required
 •  Single common mechanism for all applications and operating systems

[Diagram: App/OS virtual machines protected by HA and FT across two VMware ESX hosts]

Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters

One click to HA

§  Easy to set up, one click is all you need




Example HA Failover for Hadoop

[Diagram: Serengeti Server; Namenode restarted by vSphere HA; four worker nodes, each running TaskTracker, HDFS Datanode, Hive, and HBase]




vSphere HA and Optionally FT

§  vSphere HA
 •  Is application-aware: will auto-restart NN if heartbeat goes away
 •  Is easy to configure
 •  Has no performance overhead
§  vSphere FT
 •  Has the added bonus of no pause-time when there is hardware failure
 •  Has a one-vCPU maximum
 •  Perf. measurements: has a 2% perf overhead on the NN. Current extrapolated measurement shows this is good for a ~300 host cluster.
§  HDFS 2 HA
 •  Only covers Namenode – what about the other 5+ master services?
 •  Not available in Apache Hadoop 0.20
 •  Not as battle-tested as vSphere HA
 •  Is more complex to install and manage

Elastic Scaling and Multi-tenancy of Hadoop on vSphere



[Diagram: three deployment models]
1. Hadoop in VM (current Hadoop: combined storage/compute in one VM)
   - Single Tenant
   - Fixed Resources
2. Separate Compute and Data (a compute VM and a storage VM)
   - Single Tenant
   - Elastic Compute
3. Multiple Clusters (compute VMs for tenants T1 and T2, with a storage VM)
   - Multiple Tenants
   - Elastic Compute

“Time Share”

[Diagram: three hosts running VMware vSphere with HDFS on each; Hadoop VMs and vHelper run alongside other VMs]

While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.

Virtualization delivers VM level Multi-tenancy

[Diagram: tenants Coke and Pepsi; a Runtime Layer of virtual Hadoop instances and a virtual Hadoop queue; a Data Layer of data containers on HDFS; six hosts]

§  Performance isolation
     •  No more noisy neighbors – Resource container to achieve guaranteed SLA for different tenants/users/jobs
§  Configuration isolation
     •  Support multiple Hadoop environments on the same physical clusters
          •  Multiple Linux versions
          •  Multiple Hadoop versions
§  Security isolation
     •  Higher level of security
          •  Compute VM can only access data VM through Access Control List

I.     Market Overview & Insights
II.    Virtualization + Hadoop
III.  Distribution & OSS Contribution




Open Source of Serengeti, Spring Hadoop, Hadoop Extensions


         Commercial Vendors             Community Projects




•  Support major distributions and multiple projects
•  Contribute Hadoop Virtualization Extensions (HVE) to the Open Source Community



Hadoop Virtualization Extensions: Topology Awareness




Virtual Topologies




Proposed Topology Changes

                            HADOOP-8468 (Umbrella JIRA)
                            HADOOP-8469
                            HDFS-3495
                            MAPREDUCE-4310
                            HDFS-3498
                            MAPREDUCE-4309
                            HADOOP-8470
                            HADOOP-8472
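
The slide lists only the JIRA numbers. As a rough illustration of what topology awareness means on virtualized hosts, the sketch below assumes one extra layer between rack and node that groups VMs sharing a physical host, and measures distance as the number of hops from each node up to their closest common ancestor. The path layout, names, and helper functions are hypothetical and are not taken from the patches themselves.

```python
def distance(path_a: str, path_b: str) -> int:
    """Hops from each node up to their closest common ancestor, summed."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def same_physical_host(path_a: str, path_b: str) -> bool:
    """True when both VMs sit under the same host-level group."""
    return path_a.rsplit("/", 1)[0] == path_b.rsplit("/", 1)[0]


# Hypothetical paths of the form /rack/physical-host/vm
vm1 = "/rack1/host1/vm1"
vm2 = "/rack1/host1/vm2"  # same physical host as vm1
vm3 = "/rack1/host2/vm3"  # same rack, different host
vm4 = "/rack2/host3/vm4"  # different rack

d_same_host = distance(vm1, vm2)
d_same_rack = distance(vm1, vm3)
d_off_rack = distance(vm1, vm4)
print(d_same_host, d_same_rack, d_off_rack)
```

With the extra layer, a scheduler can tell that two VMs share a physical host, so it can prefer them for locality and avoid placing two replicas of the same block where one hardware failure would lose both.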




Spring for Apache Hadoop

§  Announced initial formation of Spring Data OSS project in 2010
 •  Enables Spring-powered applications to use new data access technologies
 •  Data project technologies around MongoDB, Neo4j, Riak, Redis, JDBC Extensions, JPA, REST, and Blob

§  Announcing additional contributions on GitHub:
 •  Integration with Cascading library
 •  HBase support
 •  Hadoop security support
 •  More examples
 •  Administrative application, RESTful API to upload Hadoop jobs to schedule for batch execution, query status, etc.
 •  WebHDFS support

Big Data on Virtualized Infrastructure
Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand


More Related Content

What's hot

Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataWANdisco Plc
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedInDataWorks Summit
 
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...SpringPeople
 
Hadoop Performance at LinkedIn
Hadoop Performance at LinkedInHadoop Performance at LinkedIn
Hadoop Performance at LinkedInAllen Wittenauer
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGlusterFS
 
Future of cloud storage
Future of cloud storageFuture of cloud storage
Future of cloud storageGlusterFS
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudKellyn Pot'Vin-Gorman
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaselarsgeorge
 
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageWebinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS Storage
Webinar Sept 22: Gluster Partners with Redapt to Deliver Scale-Out NAS StorageGlusterFS
 
How to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterHow to Increase Performance of Your Hadoop Cluster
How to Increase Performance of Your Hadoop ClusterAltoros
 
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionUpgrade Without the Headache: Best Practices for Upgrading Hadoop in Production
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in ProductionCloudera, Inc.
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011GlusterFS
 
Gluster Blog 11.15.2010
Gluster Blog 11.15.2010Gluster Blog 11.15.2010
Gluster Blog 11.15.2010GlusterFS
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, InformaticaHadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, InformaticaCloudera, Inc.
 
Optimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopOptimizing your Infrastrucure and Operating System for Hadoop
Optimizing your Infrastrucure and Operating System for HadoopDataWorks Summit
 

What's hot (20)

Hadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big DataHadoop and WANdisco: The Future of Big Data
Hadoop and WANdisco: The Future of Big Data
 
Hadoop Operations at LinkedIn
Hadoop Operations at LinkedInHadoop Operations at LinkedIn
Hadoop Operations at LinkedIn
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
 









Big data on virtualized infrastructure

  • 1. Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand Richard McDougall, VMware, Inc @richardmcdougll © 2009 VMware Inc. All rights reserved
  • 2. Cloud: Big Shifts in Simplification and Optimization 1. Reduce the Complexity to simplify operations and maintenance 2. Dramatically Lower Costs to redirect investment into value-add opportunities 3. Enable Flexible, Agile IT Service Delivery to meet and anticipate the needs of the business
  • 3. A Holistic View of a Big Data System: [Diagram: Real Time Streams; Real-Time Processing (S4, Storm); ETL; Real Time Structured Database (HBase, GemFire, Cassandra); Big SQL (Greenplum, AsterData, etc.); Batch Processing; Analytics; Unstructured Data (HDFS)]
  • 4. Common Infrastructure for Big Data. Cluster Sprawling: Single purpose clusters for various business applications (Hadoop, HBase, MPP DB) lead to cluster sprawl. Cluster Consolidation: MPP DB, HBase, and Hadoop on one Virtualization Platform §  Simplify •  Single Hardware Infrastructure •  Unified operations §  Optimize •  Shared Resources = higher utilization •  Elastic resources = faster on-demand access
  • 5. Enterprise Challenges with Using Hadoop §  Deployment •  Slow to provision •  Complex to keep running/tune §  Single Points of Failure •  Single point of failure with Name Node and Job tracker •  No HA for Hadoop Framework Components (Hive, HCatalog, etc.) §  Low Utilization •  Dedicated clusters to run Hadoop with low CPU utilization •  No easy way to share resources between Hadoop and non-Hadoop workloads •  Noisy neighbors, lack of resource containment §  Need Multi-tenant Isolation, Resource Management, etc. •  Noisy Neighbor: no performance or security isolation between different tenants/users •  Lack of configuration isolation: can't run multiple versions on the cluster
  • 6. I.  Market Overview & Insights II.  Virtualization + Hadoop III.  Distribution & OSS Contribution 6
  • 7. Hadoop Runs Well on Virtualization: Comparable performance to physical [Chart: Ratio to Native, 1 VM vs. 2 VMs] Source: http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf
  • 8. Use Local Disk where it’s Needed. SAN Storage: $2 - $10/Gigabyte; $1M gets 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec. NAS Filers: $1 - $5/Gigabyte; $1M gets 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec. Local Storage: $0.05/Gigabyte; $1M gets 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec.
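The capacity figures on slide 8 follow from dividing the $1M budget by the price per gigabyte. The sketch below reproduces that arithmetic, assuming the low end of each quoted price range and decimal units (1 PB = 1,000,000 GB); the IOPS and bandwidth figures are not derived here.

```python
# Capacity that a $1M budget buys at each storage tier's price per gigabyte.
# Prices are in cents to keep the arithmetic in integers; each is the low end
# of the range quoted on the slide (an assumption about how the slide was built).
BUDGET_CENTS = 1_000_000 * 100
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 10**6 GB

price_cents_per_gb = {
    "SAN Storage": 200,   # $2.00/GB
    "NAS Filers": 100,    # $1.00/GB
    "Local Storage": 5,   # $0.05/GB
}


def petabytes_for_budget(budget_cents, cents_per_gb):
    """Return how many petabytes the budget buys at the given price."""
    gigabytes = budget_cents // cents_per_gb
    return gigabytes / GB_PER_PB


capacity_pb = {
    tier: petabytes_for_budget(BUDGET_CENTS, price)
    for tier, price in price_cents_per_gb.items()
}

for tier, pb in capacity_pb.items():
    print(f"{tier}: {pb:g} PB for $1M")
```

At these prices local disk buys 40 times the capacity of SAN, which is the slide's case for keeping HDFS on local storage.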
  • 9. Extend Virtual Storage Architecture to Include Local Disk §  Shared Storage: SAN or NAS •  Easy to provision •  Automated cluster rebalancing §  Hybrid Storage •  SAN for boot images, VMs, other workloads •  Local disk for Hadoop & HDFS •  Scalable Bandwidth, Lower Cost/GB [Diagram: hosts running Hadoop VMs alongside other VMs]
  • 10. Why Virtualize Hadoop? Simple to Operate §  Rapid deployment §  Unified operations across enterprise §  Easy Clone of Cluster. Highly Available §  No more single point of failure §  One click to setup §  High availability for MR Jobs. Elastic Scaling §  Shrink and expand cluster on demand §  Resource Guarantee §  Independent scaling of Compute and data
  • 11. Deploy a Hadoop Cluster in under 30 Minutes. Step 1: Deploy Serengeti virtual appliance on vSphere (Deploy vHelperOVF to vSphere). Step 2: A few simple commands to stand up Hadoop Cluster: Select Compute, memory, storage and network; Select configuration template; Automate deployment; Done
  • 12. A Tour Through Serengeti $ ssh serengeti@serengeti-vm $ serengeti serengeti> 12
  • 13. A Tour Through Serengeti serengeti> cluster create --name myElephant serengeti> cluster list --name myElephant | name: myElephant, distro: cdh, status: RUNNING | NAME ROLES INSTANCE CPU MEM(MB) TYPE | master [hadoop_NameNode, hadoop_jobtracker] 1 2 7500 LOCAL 50 | name: myElephant, distro: cdh, status: RUNNING | NAME ROLES INSTANCE CPU MEM(MB) TYPE | master [hive, hadoop_client, pig] 1 1 3700 LOCAL 50 | NAME HOST IP | myElephant-client0 rmc-elephant-009.eng.vmware.com 10.0.20.184
  • 14. A Tour Through Serengeti $ ssh rmc@rmc-elephant-009.eng.vmware.com $ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data … 14
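To give a sense of scale for the teragen command on slide 14: TeraGen writes fixed 100-byte rows, so the row count translates directly into a data size. The sketch below works that out; the replication factor of 3 is the HDFS default and is only an assumption about how this cluster was configured.

```python
# Size of the data set produced by `teragen 1000000000 tera-data`.
ROW_BYTES = 100          # TeraGen rows are 100 bytes each
HDFS_REPLICATION = 3     # HDFS default; assumed, not stated on the slide

rows = 1_000_000_000
logical_bytes = rows * ROW_BYTES
logical_gb = logical_bytes // 10**9          # decimal gigabytes
raw_gb = logical_gb * HDFS_REPLICATION       # disk consumed across datanodes

print(f"logical data: {logical_gb} GB, raw on disk: {raw_gb} GB")
```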
  • 15. Serengeti Spec File (callouts: Choice of Distro; HA Option; Choice of Shared Storage or Local Disk) [ "distro":"apache", { "name": "master", "roles": [ "hadoop_NameNode", "hadoop_jobtracker" ], "instanceNum": 1, "instanceType": "MEDIUM", "ha":true }, { "name": "worker", "roles": [ "hadoop_datanode", "hadoop_tasktracker" ], "instanceNum": 5, "instanceType": "SMALL", "storage": { "type": "LOCAL", "sizeGB": 10 } } ]
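The spec on slide 15 is abbreviated and is not valid JSON as printed (a key/value pair sits directly inside a list). The sketch below builds a well-formed document with the same fields. The `nodeGroups` wrapper key and the `SHARED` storage type are assumptions made for illustration, and `build_spec` is a hypothetical helper; the schema Serengeti actually accepts may differ.

```python
import json

# Storage choices named on the slide; only "LOCAL" appears there literally,
# "SHARED" is an assumed spelling for the shared-storage option.
STORAGE_TYPES = ("LOCAL", "SHARED")


def build_spec(distro, worker_count, storage_type="LOCAL", size_gb=10, ha=True):
    """Return a cluster spec dict mirroring the fields shown on the slide."""
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"unknown storage type: {storage_type}")
    if worker_count < 1:
        raise ValueError("a cluster needs at least one worker")
    return {
        "distro": distro,
        "nodeGroups": [  # assumed wrapper key, not shown on the slide
            {
                "name": "master",
                "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
                "instanceNum": 1,
                "instanceType": "MEDIUM",
                "ha": ha,
            },
            {
                "name": "worker",
                "roles": ["hadoop_datanode", "hadoop_tasktracker"],
                "instanceNum": worker_count,
                "instanceType": "SMALL",
                "storage": {"type": storage_type, "sizeGB": size_gb},
            },
        ],
    }


spec = build_spec("apache", worker_count=5)
spec_json = json.dumps(spec, indent=2)
total_vms = sum(group["instanceNum"] for group in spec["nodeGroups"])
print(spec_json)
print(f"total VMs: {total_vms}")
```

Generating the file from a function rather than editing it by hand makes it easy to stamp out variants, for example a larger worker group or shared storage, while keeping the JSON syntactically valid.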
  • 16. Configuring Distros { "name" : "cdh", "version" : "3u3", "packages" : [ { "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"], "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz" }, { "roles" : ["hive"], "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz" }, { "roles" : ["pig"], "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz" } ] },
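The distro entry on slide 16 maps sets of roles to the tarball that provides them. As a rough illustration of how such an entry can be read, the sketch below parses the slide's JSON (with the trailing comma dropped so it stands alone) and inverts it into a role-to-tarball lookup. `tarball_by_role` is a hypothetical helper, not part of Serengeti.

```python
import json

# The distro entry from the slide, as a standalone JSON object.
MANIFEST = """
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {
      "roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
                "hadoop_datanode", "hadoop_client"],
      "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
"""


def tarball_by_role(distro):
    """Invert the packages list into a {role: tarball} lookup."""
    mapping = {}
    for package in distro["packages"]:
        for role in package["roles"]:
            if role in mapping:
                raise ValueError(f"role {role} is provided by two packages")
            mapping[role] = package["tarball"]
    return mapping


distro = json.loads(MANIFEST)
role_to_tarball = tarball_by_role(distro)

for role, tarball in sorted(role_to_tarball.items()):
    print(f"{role:20s} {tarball}")
```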
  • 17. Serengeti Demo: Deploy Serengeti vApp on vSphere; Deploy a Hadoop cluster in 10 Minutes; Run MapReduce; Scale out the Hadoop cluster; Create a Customized Hadoop cluster; Use Your Favorite Hadoop Distribution
  • 18. Why Virtualize Hadoop? Simple to Operate §  Rapid deployment §  Unified operations across enterprise §  Easy Clone of Cluster. Highly Available §  No more single point of failure §  One click to setup §  High availability for MR Jobs. Elastic Scaling §  Shrink and expand cluster on demand §  Resource Guarantee §  Independent scaling of Compute and data
  • 19. High Availability for the Hadoop Stack [Diagram: ETL Tools, BI Reporting, RDBMS; Pig (Data Flow), Hive (SQL), HCatalog; ZooKeeper (Coordination); Hive MetaDB, HCatalog MDB, Management Server; MapReduce (Job Scheduling/Execution System), HBase (Key-Value store); Jobtracker, Namenode; HDFS (Hadoop Distributed File System); Server]
  • 20. Live Machine Migration Reduces Planned Downtime Description: Enables the live migration of virtual machines from one host to another with continuous service availability. Benefits: •  Revolutionary technology that is the basis for automated virtual machine movement •  Meets service level and performance goals 20
  • 21. vSphere High Availability (HA) - protection against unplanned downtime Overview •  Protection against host and VM failures •  Automatic failure detection (host, guest OS) •  Automatic virtual machine restart in minutes, on any available host in cluster •  OS and application-independent, does not require complex configuration changes 21
  • 22. vSphere Fault Tolerance provides continuous protection Overview •  Single identical VMs running in lockstep on separate hosts •  Zero downtime, zero data loss failover for all virtual machines in case of hardware failures •  Integrated with VMware HA/DRS •  No complex clustering or specialized hardware required •  Single common mechanism for all applications and operating systems. Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters [Diagram: App/OS virtual machines protected by HA and FT across two VMware ESX hosts]
  • 23. One click to HA §  Easy to setup, one click is all you need 23
  • 24. Example HA Failover for Hadoop [Diagram: Serengeti Server, vSphere HA, Namenode (shown twice), and four nodes each running TaskTracker, HDFS Datanode, Hive, and hBase]
  • 25. vSphere HA and Optionally FT §  vSphere HA •  Is application-aware: will auto-restart NN if heartbeat goes away •  Is easy to configure •  Has no performance overhead §  vSphere FT •  Has the added bonus of no pause-time when there is hardware failure •  Has a one vcpu max •  Perf. measurements: Has a 2% perf overhead to NN. Current extrapolated measurement shows this is good for ~300 host cluster. §  HDFS 2 HA •  Only covers Namenode – what about the other 5+ master services? •  Not available in Apache Hadoop 0.20 •  Not as battle-tested as vSphere HA •  Is more complex to install, manage 25
  • 26. High Availability for the Hadoop Stack [Diagram: ETL Tools, BI Reporting, RDBMS; Pig (Data Flow), Hive (SQL), HCatalog; ZooKeeper (Coordination); Hive MetaDB, HCatalog MDB, Management Server; MapReduce (Job Scheduling/Execution System), HBase (Key-Value store); Jobtracker, Namenode; HDFS (Hadoop Distributed File System); Server]
  • 27. Why Virtualize Hadoop? Simple to Operate §  Rapid deployment §  Unified operations across enterprise §  Easy Clone of Cluster. Highly Available §  No more single point of failure §  One click to setup §  High availability for MR Jobs. Elastic Scaling §  Shrink and expand cluster on demand §  Resource Guarantee §  Independent scaling of Compute and data
  • 28. Elastic Scaling and Multi-tenancy of Hadoop on vSphere. Current Hadoop: Combined Storage/Compute. 1. Hadoop in VM - Single Tenant - Fixed Resources. 2. Separate Compute and Data - Single Tenant - Elastic Compute. 3. Multi. Clusters - Multiple Tenants - Elastic Compute [Diagram: compute VMs, including tenants T1 and T2, over storage VMs]
  • 29. “Time Share” [Diagram: Hadoop VMs and other VMs sharing three hosts under VMware vSphere, with vHelper and HDFS on each host] While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.
  • 30. Virtualization delivers VM level Multi-tenancy §  Performance isolation •  No more noisy neighbors: Resource container to achieve guaranteed SLA for different tenants/users/jobs §  Configuration isolation •  Support multiple Hadoop environments on the same physical clusters •  Multiple Linux versions •  Multiple Hadoop versions §  Security isolation •  Higher level of security •  Compute VM can only access data VM through Access Control List [Diagram: tenants Coke and Pepsi; Runtime Layer of Virtual Hadoop clusters and a Hadoop Queue; Data Layer of Data Containers on HDFS; six hosts]
  • 31. I.  Market Overview & Insights II.  Virtualization + Hadoop III.  Distribution & OSS Contribution 31
  • 32. Open Source of Serengeti, Spring Hadoop, Hadoop Extensions [Diagram: Commercial Vendors; Community Projects] •  Support major distributions and multiple projects •  Contribute Hadoop Virtualization Extension (HVE) to Open Source Community
  • 33. Hadoop Virtualization Extensions: Topology Awareness 33
  • 35. Proposed Topology Changes HADOOP-8468 (Umbrella JIRA) HADOOP-8469 HDFS-3495 MAPREDUCE-4310 HDFS-3498 MAPREDUCE-4309 HADOOP-8470 HADOOP-8472 35
  • 36. Spring for Apache Hadoop §  Announced initial formation of Spring Data OSS project in 2010 •  Enables Spring-powered applications to use new data access technologies •  Data project technologies around MongoDB, Neo4J, Riak, Redis, JDBC Extensions, JPA, REST, and Blob §  Announcing additional contributions on GitHub: •  Integration with Cascading library •  Hbase support •  Hadoop security support •  More examples •  Administrative application, RESTful API to upload Hadoop jobs to schedule for batch execution, query status, etc. •  Web HDFS support 36
  • 37. Big Data on Virtualized Infrastructure Enabling Highly Available, Elastic, Multi-tenancy Hadoop on Demand Richard McDougall, VMware, Inc @richardmcdougll © 2009 VMware Inc. All rights reserved