Fei Dong
Duke University
 April 6, 2012
•   Introduction
  •   Optimizing Multi-Job Workflows
  •   Optimizing Iterative Workflows
  •   Optimizing Key-Value Stores
  •   Alidade: A Real-Life Application
  •   Summary
  •   Questions and Answers


10/4/2012                  2             Starfish-E
Typical Hadoop Stack, with New Software and Model

• High Level: MapReduce Programs (Java), Hadoop Streaming (Python, Ruby), Pig, Hive, Cascading, Oozie, Sqoop, Jaql
  • Hadoop Core: MapReduce Execution Engine, Iterative Model, EMR, MRv2, Distributed File System, HBase, ElephantDB
  • Physical Level: Physical Machine, Virtual Machine, CPU, SATA Disk, SSD, EC2 Unit
• Optimizers: search through the space of tuning choices (Job, Workflow, Workload, Data layout, Cluster)
  • Profiler: collects concise summaries of execution
  • What-if Engine: estimates the impact of hypothetical changes on execution

 Starfish limitation: focus on individual MapReduce jobs on Hadoop
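The interplay of profiler, what-if engine, and optimizer can be sketched in a few lines. This is a toy illustration, not Starfish's actual cost model: the `JobProfile` fields, the `what_if` formula, and the two tuning knobs are all hypothetical names and numbers chosen only to show the search loop.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JobProfile:
    """Concise summary of one observed run (all fields are illustrative)."""
    input_mb: float            # data read by map tasks
    map_cost_per_mb: float     # seconds of map work per MB
    shuffle_mb: float          # intermediate data sent to reducers
    reduce_cost_per_mb: float  # seconds of reduce work per MB


def what_if(profile, num_reducers, compress_map_output):
    """Toy what-if engine: predict runtime (s) for a hypothetical setting."""
    map_time = profile.input_mb * profile.map_cost_per_mb
    # Assumed effect of compression: half the shuffle data, 10% extra map CPU.
    shuffle_mb = profile.shuffle_mb * (0.5 if compress_map_output else 1.0)
    cpu_overhead = 0.1 * map_time if compress_map_output else 0.0
    reduce_time = shuffle_mb * profile.reduce_cost_per_mb / num_reducers
    startup = 2.0 * num_reducers  # assumed per-reducer scheduling overhead
    return map_time + cpu_overhead + reduce_time + startup


def optimize(profile, reducer_choices, compress_choices):
    """Toy optimizer: enumerate the tuning space, keep the cheapest setting."""
    return min(product(reducer_choices, compress_choices),
               key=lambda cfg: what_if(profile, *cfg))


profile = JobProfile(input_mb=1000, map_cost_per_mb=0.01,
                     shuffle_mb=800, reduce_cost_per_mb=0.5)
best = optimize(profile, reducer_choices=[1, 4, 16],
                compress_choices=[False, True])
```

The point of the split is that the profile is collected once, while the optimizer can ask the what-if function about many settings without running the job again.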
Starfish-Extended


High-level layers such as Pig, Hive, and Cascading have evolved
  on top of Hadoop to support comprehensive workflows.

       Can we optimize such workflows with Starfish?
• Data is processed iteratively.
  • The MapReduce framework does not directly support
    iterations.

[Figure: workflow of jobs J1, J2, J3, and J4 with a loop of n iterations. Labels: Input, Output1/Input2, Output2/Input3, Output3/Input2, Output.]

  Can we support iterative execution in a workflow?
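Because MapReduce has no loop construct, iteration is typically driven from outside the jobs. The sketch below assumes one plausible reading of the job diagram (J1 runs once, J2 and J3 repeat n times with J3's output fed back to J2, J4 runs last); that reading and the function name `run_iterative_workflow` are assumptions, and the lambdas are toy stand-ins for real jobs.

```python
def run_iterative_workflow(data, j1, j2, j3, j4, n):
    """Chain jobs from a driver, since MapReduce itself cannot loop.

    Assumed shape (hypothetical): J1 once, then (J2 -> J3) n times, then J4.
    Returns the final output and the order in which the jobs ran.
    """
    trace = ["J1"]
    out = j1(data)
    for _ in range(n):
        out = j2(out)  # input is Output1 on the first pass, Output3 afterwards
        out = j3(out)  # Output2 becomes Input3
        trace += ["J2", "J3"]
    trace.append("J4")
    return j4(out), trace


# Toy stand-ins for MapReduce jobs, operating on a list of numbers.
result, trace = run_iterative_workflow(
    [1, 2, 3],
    j1=lambda xs: [x * 10 for x in xs],
    j2=lambda xs: [x + 1 for x in xs],
    j3=lambda xs: [x * 2 for x in xs],
    j4=sum,
    n=2,
)
```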
• HDFS: Replication, Fault tolerance, Scalability
  • HBase: Hosts very large tables – billions of rows
    × millions of columns.

     Can we optimize a storage system like HBase?
• Rule Based Optimization (RBO)
     – Use a set of rules to determine how to execute a
       plan.
  • Cost Based Optimization (CBO)
     – The cheapest plan uses the least amount of resources.
  • Starfish employs a CBO approach to MapReduce
    programs.
          Can we put RBO + CBO together?
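One way to picture RBO and CBO working together is to let rules fix some settings outright and let a cost estimate choose among what remains. The sketch below is an assumption about how such a combination could look, not the deck's implementation; the rules, setting names, and cost function are hypothetical.

```python
def rule_based(workload):
    """RBO step: fixed rules choose settings without measuring anything."""
    settings = {}
    if workload.get("read_heavy"):
        settings["bloom_filter"] = True       # illustrative rule
    if workload.get("cpu_bound"):
        settings["compress_output"] = False   # illustrative rule
    return settings


def cost_based(candidates, estimate_cost):
    """CBO step: pick the candidate plan with the lowest estimated cost."""
    return min(candidates, key=estimate_cost)


def combined(workload, reducer_choices, estimate_cost):
    """RBO fixes some knobs; CBO searches only the remaining one."""
    fixed = rule_based(workload)
    candidates = [{**fixed, "num_reducers": r} for r in reducer_choices]
    return cost_based(candidates, estimate_cost)


def toy_cost(plan):
    """Hypothetical cost: parallel work plus per-reducer overhead."""
    return 400 / plan["num_reducers"] + 2 * plan["num_reducers"]


plan = combined({"read_heavy": True}, [1, 4, 16, 64], toy_cost)
```

Rules shrink the search space cheaply, while the cost-based step handles the knobs whose best value depends on the data.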
1. MapReduce Workflow Optimizer in Cascading

  2. Iterative Workflow Optimizer

  3. Key-Value Store Optimizer using rule-based
     techniques
• Cascading
     – Data processing API on Hadoop
     – Flows of operations, not jobs
• Replace the old Hadoop API with the new API
  • Cascading Profiler
     – Job Graph + Conf Graph to represent a workflow
  • Cascading What-if Engine
  • Cascading Optimizer




• The jobs have the same execution behavior
    across iterations → we can use a single
    “iterative” profile.
  • Combine MapReduce jobs into a logical unit of
    work (inspired by Oozie)




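If every iteration behaves alike, a profile measured on one iteration can stand in for all of them. The following is a minimal sketch under that assumption; the class name `IterativeProfile` and the per-job timings are hypothetical.

```python
class IterativeProfile:
    """Profile collected from one iteration and reused for all of them."""

    def __init__(self, job_seconds):
        # Measured runtime per job within a single iteration,
        # e.g. {"J2": 40.0, "J3": 25.0} (illustrative numbers).
        self.job_seconds = dict(job_seconds)

    def iteration_seconds(self):
        """Runtime of one pass through the loop body."""
        return sum(self.job_seconds.values())

    def estimate(self, iterations):
        """Predicted runtime of the loop, treated as one logical unit."""
        return iterations * self.iteration_seconds()


profile = IterativeProfile({"J2": 40.0, "J3": 25.0})
total = profile.estimate(10)
```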
• PageRank: 10G page graphs

[Figure: running time (s), on a scale of 0 to 3500, versus total iterations (1, 2, 4, 6, 10) for the Original and Optimization runs.]
High
  5. HBase Process          e.g. splits, compactions
  4. HBase Schema           e.g. compression, bloom filter
  3. HBase Configuration    e.g. garbage collection, heap
  2. Hadoop Configuration   e.g. xciever, handlers
  1. Operating System       e.g. ulimit, nproc
Low
• JVM Settings:
     – "-server -XX:+UseParallelGC -XX:ParallelGCThreads=8
       -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError".
     – The parallel GC leverages multiple CPUs.
• Recommend c1.xlarge or m2.xlarge instances to run HBase.
  • Isolate the HBase cluster to avoid memory competition with other
    services.
  • Factors affecting writes: HLog > Split > Compact.
  • If applications do not require strict data durability, disabling the
    HLog can give a 2x speedup.
  • Compression can save storage space. Snappy provides high
    speed and reasonable compression.
  • In a read-busy system, using bloom filters that match the
    update or read patterns can save a huge amount of I/O.
  • …
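The I/O saving from bloom filters comes from answering "definitely not here" without touching the store. The sketch below is a generic bloom filter, not HBase's implementation; the class, the sizes, and the `read_row` helper are hypothetical, and a plain dict stands in for an on-disk store file.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


def read_row(key, bloom, store_file):
    """Skip the expensive store lookup when the filter rules the key out."""
    if not bloom.might_contain(key):
        return None              # no disk read needed
    return store_file.get(key)   # may still miss on a false positive


bloom = BloomFilter()
bloom.add("row-1")
value = read_row("row-1", bloom, {"row-1": "v1"})
```

A lookup for a key that was never added usually returns before the store is consulted, which is where a read-heavy workload saves its I/O.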


• Alidade is a constraint-based geolocation
    system.
  • Alidade has two phases.
     – Preprocessing data
     – Iterative geolocation




1. Iterative Model
  2. Heavy Computation
     – Represent polygons on a spherical surface.
     – Calculate intersections of polygons.
  3. Large Scale Data
  4. Limited Resource Allocation
     – Depends on many services such as
       HDFS, JobTracker, TaskTracker, HBase, etc.
•   Hadoop CDH3U3 based on 0.20.2
  •   HBase CDH3U3 based on 0.90.4
  •   11 m1.large nodes, 11 m2.xlarge nodes
  •   30 map slots and 20 reduce slots
  •   Workflow:
      – YCSB generates 10M records and some read/write
        workloads.
      – Alidade generates 70M records after translating.
                      11 m1.large   21 m1.large   11 m2.xlarge
Write Capacity        43593/s       87012/s       58273/s
CPU                   44            88            72
Storage Capacity      8.5T          17T           17T
Nodes cost per hour   $3.5          $7.1          $5.0
Traffic Cost          $4            $8            $4
Setup Duration        2hr           2hr           2hr
AWS Billed Duration   19hr          11hr          12hr
Total Cost            $68.6         $82.8         $68.8
• Alidade is a CPU-intensive job.
    “IntersectionWritable.solve” accounts for most
    of the execution time (> 70%).
  • Currently, the Starfish Optimizer is better suited to
    I/O-intensive jobs.
  • Alidade helped Starfish improve profiling
    (reduce overhead for “sequencefiles”).
  • Memory issue for HBase
• Extended Starfish to support the evolving
    Hadoop system
     – Automatic tuning of Cascading workflows. We can
       boost performance by 20% to 200%.
     – Support iterative workflows using simple syntax.
     – Optimize key-value stores in Hadoop.
     – Leveraged the cost-based optimizer and rule-based
       optimizer to get good performance in a complex
       real-life workflow.
Thanks
              




Extend Starfish to Support the Growing Hadoop Ecosystem

  • 1. Fei Dong, Duke University, April 6, 2012
  • 2. Introduction • Optimizing Multi-Job Workflows • Optimizing Iterative Workflows • Optimizing Key-Value Stores • Alidade: A Real-Life Application • Summary • Questions and Answers 10/4/2012 2 Starfish-E
  • 4. Typical Hadoop Stack vs. New Software and Model. High Level: MapReduce Programs (Java), Hadoop Streaming (Python, Ruby); new: Oozie, Sqoop, Jaql, Pig, Hive, Cascading. Hadoop Core: MapReduce Execution Engine, Distributed File System; new: Iterative Model, EMR, MRv2, HBase, ElephantDB. Physical Level: Physical Machine (CPU, SATA Disk); new: Virtual Machine (EC2 Unit, SSD).
  • 5. Optimizers: search through the space of tuning choices (Job, Workflow, Workload, Data layout, Cluster). Profiler: collects concise summaries of execution. What-if Engine: estimates the impact of hypothetical changes on execution. Starfish limitation: focus on individual MapReduce jobs on Hadoop.
  • 7. High-level layers such as Pig, Hive, and Cascading have evolved on top of Hadoop to support comprehensive workflows. Can we optimize such workflows with Starfish?
  • 8. • Data is processed iteratively. • The MapReduce framework does not directly support iterations. [Figure: a loop of n iterations over jobs J1, J2, J3, each output serving as the next input, followed by job J4.] Can we support iterative execution in a workflow?
  • 9. • HDFS: replication, fault tolerance, scalability. • HBase: hosts very large tables (billions of rows × millions of columns). Can we optimize a storage system like HBase?
  • 10. • Rule-Based Optimization (RBO) – Uses a set of rules to determine how to execute a plan. • Cost-Based Optimization (CBO) – The cheapest plan uses the least amount of resources. • Starfish applies the CBO approach to MapReduce programs. Can we put RBO + CBO together?
  • 11. 1. MapReduce Workflow Optimizer in Cascading 2. Iterative Workflow Optimizer 3. Key-Value Stores Optimizer using rule-based techniques
  • 12. • Cascading – Data processing API on Hadoop – Flows of operations, not jobs
  • 13. • Replace the old Hadoop API with the new API • Cascading Profiler – Job Graph + Conf Graph to represent a workflow • Cascading What-if Engine • Cascading Optimizer
  • 16. • The jobs have the same execution behavior across iterations → we can use a single “iterative” profile. • Combine MapReduce jobs into a logical unit of work (inspired by Oozie).
  • 17. • PageRank: 10G page graphs. [Chart: Running Time (s) vs. Total Iterations (1, 2, 4, 6, 10); series: Original, Optimization.]
  • 18. From high level to low level: 5. HBase Process (e.g. splits, compactions); 4. HBase Schema (e.g. compression, bloom filter); 3. HBase Configuration (e.g. garbage collection, heap); 2. Hadoop Configuration (e.g. xciever, handlers); 1. Operating System (e.g. ulimit, nproc).
  • 19. • JVM Settings: – "-server -XX:+UseParallelGC -XX:ParallelGCThreads=8 -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError". – The parallel GC leverages multiple CPUs.
  • 23. • We recommend c1.xlarge or m2.xlarge instances to run HBase. • Isolate the HBase cluster to avoid memory competition with other services. • Factors that affect writes: HLog > Split > Compact. • If applications do not require strict data durability, disabling the HLog can give a 2X speedup. • Compression saves storage space. Snappy provides high speed and reasonable compression. • In a read-busy system, using bloom filters that match the update or read patterns can save a huge amount of I/O. • …
  • 24. • Alidade is a constraint-based geolocation system. • Alidade has two phases. – Preprocessing data – Iterative geolocation
  • 26. 1. Iterative Model 2. Heavy Computation – Represent polygons on a spherical surface. – Calculate the intersection of polygons. 3. Large-Scale Data 4. Limited Resource Allocation – Depends on many services such as HDFS, JobTracker, TaskTracker, HBase, etc.
  • 27. • Hadoop CDH3U3 based on 0.20.2 • HBase CDH3U3 based on 0.90.4 • 11 m1.large nodes, 11 m2.xlarge nodes • 30 map slots and 20 reduce slots • Workflow: – YCSB generates 10M records and some read/write workloads. – Alidade generates 70M records after translating.
  • 32.
                              11 m1.large   21 m1.large   11 m2.xlarge
       Write Capacity         43593/s       87012/s       58273/s
       CPU                    44            88            72
       Storage Capacity       8.5T          17T           17T
       Nodes cost per hour    $3.5          $7.1          $5.0
       Traffic Cost           $4            $8            $4
       Setup Duration         2hr           2hr           2hr
       AWS Billed Duration    19hr          11hr          12hr
       Total Cost             $68.6         $82.8         $68.8
  • 33. • Alidade is a CPU-intensive job. “IntersectionWritable.solve” accounts for most of the execution time (> 70%). • Currently, the Starfish Optimizer is better suited to I/O-intensive jobs. • Alidade helped Starfish improve profiling (reduced overhead for “sequencefiles”). • Memory issue for HBase.
  • 34. • Extended Starfish to support the evolving Hadoop system – Automatic tuning of Cascading workflows, boosting performance by 20% to 200%. – Support for iterative workflows using a simple syntax. – Optimization of key-value stores in Hadoop. – Leveraged the cost-based optimizer and the rule-based optimizer to get good performance in a complex real-life workflow.
  • 36. Thanks
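
The idea on slide 16, that jobs behave the same way across iterations so a single “iterative” profile is enough, can be illustrated with a minimal Python sketch. Everything here is hypothetical: `JobProfile`, `profile_once`, and `estimate_total` are illustrative names, not part of Starfish, and a real profile holds far more than one running time.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical per-job summary (assumption: one running time per job)."""
    name: str
    seconds: float


def profile_once(jobs, run_job):
    """Profile each job of the loop body a single time."""
    return [JobProfile(name, run_job(name)) for name in jobs]


def estimate_total(loop_profiles, iterations, setup_seconds=0.0):
    """Estimate the whole workflow by reusing the single iterative profile."""
    per_iteration = sum(p.seconds for p in loop_profiles)
    return setup_seconds + iterations * per_iteration


# Made-up timings standing in for a real profiling run of J1, J2, J3.
measured = {"J1": 40.0, "J2": 25.0, "J3": 35.0}
profiles = profile_once(measured, measured.get)
print(estimate_total(profiles, iterations=10))
```

The loop body is profiled once and the estimate scales with the iteration count, so no job is profiled again in later iterations.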

Editor's notes

  1. Welcome to my defense. I am Fei Dong from the computer science department. My project title is "Extend Starfish to Support the Growing Hadoop Ecosystem". In the background, you see an elephant and a starfish. They seem to have little connection in biology; we will figure out some magic connection in my lecture.
  2. Data is growing fast, from KB to PB. Lots of companies face big data problems and challenges: scalability, reliability, performance. How do we store the data and retrieve it efficiently?
  3. Hadoop history: it started from Nutch, by Doug Cutting, and grew at Yahoo; Cloudera and Hortonworks focus on it. It is a large-scale batch data processing platform. Some ideas come from papers published by Google (GFS, MapReduce). It is an open-source top-level project of Apache.
  4. How to use Hadoop? You can follow some tutorials, but we also care about performance. However, there are more than 190 parameters, which are hard to tune manually to get good performance. Starfish is a research project led by Prof. Shivnath Babu. It has had some impact in academia.
  5. The main difference of Cascading is its Java API, so users can pick it up easily without any burden.
  6. E.g., the PageRank and k-means algorithms.
  7. BigTable. Think about a database: it is not scalable, while HBase is a NoSQL store that scales out with auto-sharding.
  8. E.g., Cloudera suggests setting the number of reduces close to the number of reduce slots the cluster owns. Cost model.
  9. Encapsulates abstract operations on data: Each, GroupBy, … Connected by a Pipe.
  10. DAG: directed acyclic graph.
  11. HBase often has more concurrent clients: increase dfs.datanode.max.xcievers from 256 to 1024. nproc: maximum number of processes. Handlers: number of I/O threads. Xciever: an upper bound on the number of files that a DataNode will serve at any one time.
  12. A lot of misses during reads. Speed up reads by cutting down internal lookups.
  13. Slots: capacity, the number of concurrently running processes.
  14. It is not always good to increase reduce slots, due to the server bottleneck.
  15. Synchronized time.
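
Note 8 pairs a rule (keep the number of reduces close to the cluster's reduce slots) with a cost model. One way to picture RBO + CBO together is the Python sketch below: the rule narrows the candidates, then a cost model picks among them. This is only an illustration under stated assumptions. The function names, the candidate percentages, and the toy cost formula are all hypothetical and are not the Starfish cost model; the 20 reduce slots come from the experimental setup on slide 27.

```python
import math


def rule_candidates(reduce_slots):
    """Rule-based step: only consider reducer counts near the reduce slots.

    The percentages are an arbitrary illustrative choice.
    """
    return sorted({max(1, reduce_slots * pct // 100) for pct in (50, 75, 95, 100)})


def toy_cost(num_reducers, data_gb, reduce_slots, startup_s=5.0, secs_per_gb=30.0):
    """Cost-based step: a toy estimate in seconds (assumption, not Starfish).

    Reducers run in waves of at most `reduce_slots`; each wave pays a
    startup cost plus the time to process one reducer's share of the data.
    """
    waves = math.ceil(num_reducers / reduce_slots)
    per_reducer_gb = data_gb / num_reducers
    return waves * (startup_s + per_reducer_gb * secs_per_gb)


def choose_reducers(data_gb, reduce_slots):
    """Pick the cheapest rule-approved candidate under the toy cost model."""
    return min(rule_candidates(reduce_slots),
               key=lambda n: toy_cost(n, data_gb, reduce_slots))


print(rule_candidates(20))
print(choose_reducers(data_gb=10.0, reduce_slots=20))
```

The rule keeps the search space small, and the cost model only has to rank the few candidates that survive it.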