Cheetah: A High Performance, Custom DWH on Top of MapReduce


             Tilani Gunawardena
Content
•   Introduction
•   Background
•   Schema Design, Query Language
•   Architecture
•   Data Storage and Compression
•   Query Plan and Execution
•   Query Optimization
•   Integration
•   Performance Evaluation
•   Conclusions & Future Work
Introduction
• Hadoop: impressive scalability & flexibility to
  handle structured as well as unstructured data
• A shared-nothing MPP architecture built upon
  cheap, commodity hardware
  – MPP relational data warehouses
     • Aster-Data, DATAllegro, Infobright, Greenplum,
       ParAccel, Vertica
  – MapReduce system
• Relational data warehouses
   – Highly optimized for storing and querying relational data
   – Hard to scale to 100s or 1000s of nodes

• MapReduce
   – Handles failures & scales to 1000s of nodes
   – Lacks a declarative query interface
       • Users have to write code to access the data
       • May result in redundant code
       • Requires a lot of effort & technical skill

• Recently we can see a convergence of these systems
   – AsterData, Greenplum are extended with some MapReduce capabilities
   – A number of data analytics tools have been built on top of Hadoop
       • Pig - translates a high-level data flow language into MapReduce jobs
       • HBase - similar to BigTable; provides random read and write access to a huge distributed
         (key, value) store
       • Hive & HadoopDB - support SQL-like query languages
Background and Motivation
• www.turn.com: advertisers use the platform to more
  effectively scale, optimize performance & centrally
  manage campaigns
• Data management challenges
  – Data: schema changes, non-relational data, data size
  – Simple yet Powerful Query Language
  – Data Mining Applications: simple yet efficient data
    access method
  – Performance
Cheetah
• A custom data warehouse system, built on top of
  Hadoop.
  – Succinct Query Language
     • Virtual views
      • Clients with no prior SQL knowledge are able to quickly
        grasp it
  – High Performance
  – Seamless integration of MapReduce and Data
    Warehouse
     • Advantage of the power of both MapReduce (massive
       parallelism & scalability) & data warehouse (easy &
       efficient data access) technologies
Similar Work
• HadoopDB -PostgreSQL as the underlying
  storage and query processor for Hadoop.
• Cheetah-new data warehouse on top of
  Hadoop
  – Special purpose
  – Most data are not relational & the schema evolves
    fast

• Hive is the most similar work to Cheetah
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Virtual View over Warehouse Schema
   Virtual view on top of the star or snowflake schema


• Virtual view
   – contains attributes from fact
     tables & dimension tables
   – is exposed to the users for
     querying
• At runtime, only the tables
  that the query refers to are
  accessed & joined
• Users no longer have to
  understand the underlying
  schema design
• There may be multiple fact
  tables and subsequently
  multiple virtual views
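
To make the idea concrete, here is a minimal sketch (not Cheetah's actual code) of how a virtual view can map exposed column names to the underlying fact/dimension tables, so that only the tables backing the referenced columns need to be joined. All table and column names below are hypothetical:

import java.util.*;

// Minimal sketch of a virtual view: each exposed column is mapped to the
// physical table that supplies it. All names here are hypothetical.
public class VirtualView {
    private final Map<String, String> columnToTable = new HashMap<>();

    public void expose(String column, String table) {
        columnToTable.put(column, table);
    }

    // Only the tables backing the referenced columns need to be joined.
    public Set<String> tablesToJoin(Collection<String> referencedColumns) {
        Set<String> tables = new HashSet<>();
        for (String col : referencedColumns) {
            String table = columnToTable.get(col);
            if (table == null) throw new IllegalArgumentException("Unknown column: " + col);
            tables.add(table);
        }
        return tables;
    }

    public static void main(String[] args) {
        VirtualView impressions = new VirtualView();
        impressions.expose("impression_count", "ImpressionsFact");
        impressions.expose("advertiser_name", "AdvertiserDim");
        impressions.expose("site_name", "SiteDim");
        // A query touching only advertiser_name and impression_count
        // joins just the fact table and the advertiser dimension.
        System.out.println(impressions.tablesToJoin(
            Arrays.asList("advertiser_name", "impression_count")));
    }
}
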
Handle Big Dimension Tables
• Filtering, grouping and aggregation operators are easily & efficiently
  supported by Hadoop
• Implementation of the JOIN operator on top of MapReduce is not as
  straightforward
• 2 ways
   – Reduce phase: more general & applicable to all scenarios
       • Partition the input tables in the Map phase & do the actual join in the Reduce phase
   – Map phase:
       • One of the join tables is small (joining m small tables with one big table)
           – Load the small table into memory & perform a hash lookup while scanning the
             other, big table (see the sketch below)
       • Input tables are already partitioned on the join column
           – Read the same partitions from both tables and perform the join
               » Problem: HDFS has no facility to force two data blocks with the same
                 partition key to be stored on the same node
               » One of the join tables still has to be transferred over the network, which
                 is costly
• Denormalize big dimension tables.
   – Assume big dimension tables are either insertion-only or slowly
     changing dimensions
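
A minimal sketch of the map-phase (broadcast) hash join described above, with made-up tables; the small dimension table is loaded into memory once and each fact row is joined via a hash lookup:

import java.util.*;

// Sketch of a map-side (broadcast) hash join: the small dimension table is
// loaded into memory once, then each fact row is joined via a hash lookup.
// Table contents and key names are hypothetical.
public class MapSideJoinSketch {
    public static void main(String[] args) {
        // Small dimension table: advertiser_id -> advertiser_name
        Map<Integer, String> advertiserDim = new HashMap<>();
        advertiserDim.put(1, "acme");
        advertiserDim.put(2, "globex");

        // Big fact table, scanned row by row (here just an array).
        int[][] factRows = {{1, 100}, {2, 250}, {1, 75}}; // {advertiser_id, impressions}

        for (int[] row : factRows) {
            String name = advertiserDim.get(row[0]); // hash lookup instead of a shuffle
            System.out.println(name + " -> " + row[1]);
        }
    }
}
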
• Handle Schema Changes
  – Schema-versioned table: to efficiently support schema
    changes on the fact table


• Handle Nested Tables
  – A single user may have different types of events

  – Cheetah supports fact tables with a nested relational
    data model
  – Defines a nested relational virtual view
  – Provides a query language that is much simpler than the
    standard nested relational query languages
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Query Language
• Cheetah supports single-block, SQL-like queries




   – Fact tables / virtual views: Impressions, Clicks and Actions

• Cheetah supports Multi-Block Query

• Query language
   – Fairly simple (see the example below)
       • Users do not have to understand the underlying schema design
       • They do not have to repeatedly write join predicates
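
To give a flavor, a hypothetical single-block query over the Impressions virtual view is sketched below as a Java string. The surface syntax (column names, the DATES clause, implicit grouping) is an assumption based on these slides, not verified Cheetah grammar; note the absence of any join predicates:

public class ExampleQuery {
    public static void main(String[] args) {
        // Hypothetical Cheetah-style query: non-aggregate SELECT columns
        // are assumed to act as implicit group-by columns.
        String query =
            "SELECT advertiser_name, SUM(impression) "
          + "FROM Impressions "
          + "DATES [2010-01-01, 2010-01-31] "
          + "WHERE site_name = 'cnn.com'";
        System.out.println(query);
    }
}
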
Schema Design, Query Language
• Virtual View over Warehouse Schema
• Query Language
• Security and Anonymity
Security and Anonymity
• Supports row level security based on virtual
  views.
• Provides ways to anonymize data
Architecture
• Simple yet efficient
• Open: also provides a simple, non-SQL interface
Query MR Job
• Issue a query through the Web UI, the CLI, or
  Java code via JDBC
• The query is sent to the node that runs the Query Driver
• The Query Driver translates the query into a
  MapReduce job (Query → MapReduce job)
• Each node in the Hadoop cluster provides a data
  access primitive (DAP) interface

Ad-hoc MR Job
• Can issue a similar API call for fine-grained data
  access
Data Storage & Compression
• Storage Format
• Columnar Compression
Storage Format
• Text (in CSV format)
  – Simplest storage format & commonly used in web access
    logs.
• Serialized Java object
• Row-based binary array
  – Commonly used in row-oriented database systems
• Columnar binary array

 Storage format has a huge impact on both
  compression ratio and query performance
 In Cheetah, we store data in columnar format
  whenever possible
Data Storage & Compression
• Storage Format
• Columnar Compression
Columnar Compression
• Compression type for each column set is
  dynamically determined based on data in each cell
• In the ETL phase, the best compression method is chosen
• After one cell is created, it is further compressed
  using GZIP.
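
A toy sketch of per-column compression selection is shown below; the candidate schemes and thresholds are invented for illustration, not Cheetah's actual rules:

import java.util.*;

// Toy heuristic for picking a per-column compression scheme from the data
// in one cell (a column chunk). Schemes and thresholds are invented.
public class CompressionChooser {
    enum Scheme { DICTIONARY, RUN_LENGTH, NONE }

    static Scheme choose(List<String> columnValues) {
        Set<String> distinct = new HashSet<>(columnValues);
        int runs = 1;
        for (int i = 1; i < columnValues.size(); i++)
            if (!columnValues.get(i).equals(columnValues.get(i - 1))) runs++;
        if (distinct.size() * 2 < columnValues.size()) return Scheme.DICTIONARY;
        if (runs * 2 < columnValues.size()) return Scheme.RUN_LENGTH;
        return Scheme.NONE; // fall back to plain storage (GZIP still applies)
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList("US", "US", "US", "UK", "US", "US")));
    }
}
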
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Input Files
• Input files to the MapReduce job are always
  fact tables.
• Fact tables are partitioned by date
• Fact tables are further partitioned by a
  dimension key attribute, referred to as DID
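
As a purely hypothetical picture, the on-disk layout could resemble the directory scheme below, so a query with a date range and a DID predicate reads only the matching partitions (paths are invented for illustration):

  /warehouse/impressions/date=2010-01-01/did=0042/part-00000
  /warehouse/impressions/date=2010-01-01/did=0043/part-00000
  /warehouse/impressions/date=2010-01-02/did=0042/part-00000
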
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Map Phase Plan
• Each node in the cluster stores some portion of the fact table
  data blocks and (small) dimension files
• The query plan contains two operators: scanner and aggregation
• The scanner operator has an interface which resembles a SELECT
  followed by a PROJECT operator over the virtual view
• The scanner operator translates the request to an equivalent SPJ
  query to pick up the attributes on the dimension tables
• As an optimization, dimensions are loaded into in-memory hash
  tables only once if different map tasks share the same JVM
• A hash-based implementation of the group-by operator is the
  default (see the sketch below)
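
A generic sketch of map-side hash aggregation (group-by key → running partial aggregates, emitted when the map task finishes); names and values are illustrative:

import java.util.*;

// Sketch of map-side hash aggregation: accumulate partial SUMs per
// group-by key, then emit one partial row per key at the end of the task.
public class MapSideAggSketch {
    public static void main(String[] args) {
        // (group-by key, value) pairs produced by the scanner; hypothetical.
        String[][] scanned = {{"acme", "100"}, {"globex", "250"}, {"acme", "75"}};

        Map<String, Long> partialSums = new HashMap<>();
        for (String[] row : scanned)
            partialSums.merge(row[0], Long.parseLong(row[1]), Long::sum);

        // In a real job these would be emitted to the shuffle, keyed by group.
        partialSums.forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
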
Query Plan & Execution
• Input Files
• Map Phase Plan
• Reduce Phase Plan
Reduce Phase Plan
• First performs global aggregation over the results from the map phase
• Then it evaluates any residual expressions over the aggregate values and/or
  the HAVING clause
• If the ORDER BY columns are group-by columns
    – They are already sorted by the Hadoop framework during the reduce phase
• If the ORDER BY columns are aggregation columns
    – Then we sort the results within each reduce task & merge the final results
      after the MapReduce job completes
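
A simplified sketch of the reduce side: merge partial aggregates per key, then apply a residual HAVING predicate. The threshold and names are invented:

import java.util.*;

// Sketch of the reduce-phase plan: merge partial aggregates per key, then
// apply a residual HAVING predicate. Threshold and names are invented.
public class ReducePhaseSketch {
    public static void main(String[] args) {
        // Partial sums arriving from several map tasks for the same keys.
        String[][] partials = {{"acme", "175"}, {"acme", "30"}, {"globex", "250"}};

        Map<String, Long> global = new TreeMap<>(); // sorted keys, as after the shuffle
        for (String[] p : partials)
            global.merge(p[0], Long.parseLong(p[1]), Long::sum);

        long havingThreshold = 100; // e.g. HAVING SUM(impression) > 100
        global.forEach((k, v) -> { if (v > havingThreshold) System.out.println(k + "\t" + v); });
    }
}
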
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low Latency Query Optimization
MapReduce Job Configuration
• # of map tasks: based on the # of input files & the number of
  blocks per file
• # of reduce tasks: supplied by the job itself & has a big
  impact on performance

• Query output
   – Small: the map phase dominates the total cost
   – Large: it is mandatory to have a sufficient number of reducers to
     partition the work

• Heuristics (see the sketch below)
   – # of reducers is proportional to the number of group-by
     columns in the query
   – If the group-by columns include some column with very large
     cardinality, we increase the # of reducers as well
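
A hedged sketch of such a heuristic; the constants below are placeholders, not Cheetah's tuned values:

// Toy reducer-count heuristic in the spirit of the slide: scale with the
// number of group-by columns, and bump the count for high-cardinality
// columns. All constants are invented placeholders.
public class ReducerHeuristic {
    static int numReducers(int numGroupByColumns, long maxColumnCardinality) {
        int base = Math.max(1, numGroupByColumns * 4);
        if (maxColumnCardinality > 1_000_000) base *= 4;
        return base;
    }

    public static void main(String[] args) {
        System.out.println(numReducers(3, 5_000_000)); // -> 48
    }
}
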
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low Latency Query Optimization
MultiQuery Optimization
• Cheetah allows users to simultaneously
  submit multiple queries & execute them in a
  single batch, as long as these queries have the
  same FROM and DATES clauses
Map Phase
• Shared scanner: shares the scan of the fact tables
  & the joins to the dimension tables
• Scanner will attach a query ID to each output row
• Output from different aggregation operators will
  be merged into a single output stream.
Reduce Phase
• Split the input rows based on their query IDs
• Send them to the corresponding query operators.
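
Putting the two phases together, a minimal sketch of multi-query execution: one shared scan feeds several queries, each output row carries a query ID, and the reduce side routes rows to the matching query's operators. Everything below is illustrative:

import java.util.*;

// Sketch of multi-query execution: one shared scan feeds several queries;
// each output row carries a query ID, and the reduce side routes rows to
// the matching query's operators.
public class MultiQuerySketch {
    public static void main(String[] args) {
        String[] scannedRows = {"acme,100", "globex,250"};

        // Map phase: one scan, each row emitted once per interested query.
        List<String> tagged = new ArrayList<>();
        for (String row : scannedRows)
            for (int queryId : new int[]{1, 2})
                tagged.add(queryId + "|" + row);

        // Reduce phase: split rows by query ID and dispatch.
        Map<Integer, List<String>> byQuery = new HashMap<>();
        for (String t : tagged) {
            String[] parts = t.split("\\|", 2);
            byQuery.computeIfAbsent(Integer.parseInt(parts[0]), k -> new ArrayList<>()).add(parts[1]);
        }
        byQuery.forEach((q, rows) -> System.out.println("query " + q + " gets " + rows));
    }
}
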
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low Latency Query Optimization
Exploiting Materialized Views(1)
• Definition of Materialized Views
   – Each materialized view only includes the columns in
     the fact table, i.e., it excludes those on the dimension
     tables
   – It is partitioned by date


• Both columns referred to in the query reside on the fact
  table, Impressions
• The resulting view has two types of columns: group-by
  columns & aggregate columns
Exploiting Materialized Views(2)
• View Matching and Query Rewriting
  – To make use of a materialized view


• The query must refer to the virtual view that corresponds
  to the same fact table that the materialized view is defined upon
• Non-aggregate columns referred to in the SELECT and
  WHERE clauses of the query must be a subset of the
  materialized view’s group-by columns (see the sketch below)
• Aggregate columns must be computable from the
  materialized view’s aggregate columns
• Replace the virtual view in the query with the
  matching materialized view
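
The core matching test can be sketched as a simple subset check; column names are hypothetical:

import java.util.*;

// Sketch of the view-matching test described above: a materialized view is
// usable if every non-aggregate column the query needs is among the view's
// group-by columns.
public class ViewMatchSketch {
    static boolean matches(Set<String> queryColumns, Set<String> viewGroupByColumns) {
        return viewGroupByColumns.containsAll(queryColumns);
    }

    public static void main(String[] args) {
        Set<String> view = new HashSet<>(Arrays.asList("advertiser_id", "site_id", "date"));
        System.out.println(matches(new HashSet<>(Arrays.asList("advertiser_id", "date")), view)); // true
        System.out.println(matches(new HashSet<>(Arrays.asList("user_id")), view));               // false
    }
}
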
Query Optimization
•   MapReduce Job Configuration
•   MultiQuery Optimization
•   Exploiting Materialized Views
•   Low Latency Query Optimization
Low Latency Query Optimization
• The current Hadoop implementation has some non-
  trivial overhead itself
  – e.g., job start time, JVM start time


Problem: for small queries, this becomes a
significant extra overhead.
  – In the query translation phase: if the size of the input is
    small, Cheetah may choose to directly read the file from
    HDFS and process the query locally (see the sketch below)
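
A minimal sketch of this decision, assuming a size threshold (the constant and method names are invented):

// Sketch of the low-latency path: if the (post-pruning) input is small,
// skip the MapReduce job and evaluate the query in-process against HDFS.
// The threshold is an invented placeholder.
public class LowLatencySketch {
    static final long LOCAL_THRESHOLD_BYTES = 128L * 1024 * 1024; // placeholder

    static String choosePlan(long inputSizeBytes) {
        return inputSizeBytes < LOCAL_THRESHOLD_BYTES
            ? "read file(s) directly from HDFS and run locally"
            : "launch a MapReduce job";
    }

    public static void main(String[] args) {
        System.out.println(choosePlan(10L * 1024 * 1024));        // small -> local
        System.out.println(choosePlan(10L * 1024 * 1024 * 1024)); // big -> MR
    }
}
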
Integration
• Cheetah provides a JDBC interface: a user program
  can submit a query & iterate through the output
  results
• If the query results are too big for a single program to
  consume, the user can write a MapReduce job to
  analyze the query output files, which are stored on
  HDFS
• Cheetah provides a non-SQL interface that can be
  easily integrated into any ad-hoc MapReduce jobs
• To write an ad-hoc MapReduce job (see the sketch below)
  – Specify the input files, which can be one or more fact
    table partitions
  – In the Map program, include a scanner to access the
    individual raw records
  – After that, the user has complete freedom to decide what
    to do with the data
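
The shape of such a Map program might look like the sketch below. Only the Hadoop Mapper API here is real; the Cheetah scanner calls are hypothetical stand-ins, shown as comments:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of an ad-hoc MapReduce job using a Cheetah-style scanner inside
// the mapper. CheetahScanner and its methods are hypothetical.
public class AdHocJobSketch extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical: decode one raw fact-table record through the
        // virtual-view scanner instead of parsing bytes by hand:
        // CheetahRecord record = CheetahScanner.decode(value);
        // From here on, the user is free to do anything with the record,
        // e.g. emit a custom key for downstream analysis:
        // context.write(new Text(record.get("advertiser_name")), new LongWritable(1));
    }
}
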
• Advantages of having a low-level, non-SQL interface open
  to external applications
  – It provides ad-hoc MapReduce programs efficient and, more
    importantly, local data access
  – The virtual view-based scanner hides most of the schema
    complexity from MapReduce developers
  – The data is well compressed and the access method is fairly
    optimized inside the scanner operator

    Summary:
• Ad-hoc MapReduce programs can now
  automatically take full advantage of both
  MapReduce (massive parallelism and scalability)
  and data warehouse (easy and efficient data
  access) technologies.
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Implementation
Important :
• The query engine must have low CPU overhead.
  – choosing the right data format
  – efficient implementation of various components on
    the data processing path.
     • Efficient hashing method
• All the experiments are performed on a cluster
  with 10 nodes
• Each node has two quad-core CPUs, 8 GB of memory
  and 4 x 1 TB 7200 RPM hard disks
• Cloudera’s Hadoop distribution, version 0.20.1
• The size of the data blocks for fact tables is
  512 MB
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Storage Format
• We store one fact table partition in four file
  formats:
  – Text (in CSV format)
  – Java object (equivalent to row-based binary array),
  – Column-based binary array,
  – Column-based binary array with compressions


• Each file is further compressed by Hadoop’s
  GZIP library at block level
• We run a simple aggregate query over three different data
  formats, namely, Text, Java Object and Columnar (with
  compression).
• We use iostat to monitor the CPU utilization and IO throughput at
  each node
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
Small vs. Big Queries
• Impact of query complexity on query performance.
• We create a test query with
   – two joins
   – one predicate
   – 3 group-by columns
   – 7 aggregate functions
Performance Evaluation
•   Implementation
•   Storage Format
•   Small vs. Big Queries
•   MultiQuery Optimization
MultiQuery Optimization
• We randomly pick 40 queries from our query
  workload
• The number of output rows for these queries
  ranges from a few hundred to 1M
• We compare the time for executing these queries
  in a single batch to the time for executing them
  separately.
Conclusions
• Cheetah: a data warehouse system built on top
  of the MapReduce technology
• The virtual view abstraction plays a central
  role in designing the Cheetah system.
• Multi-query optimization.
Future Work
• The current IO throughput of 130 MB/s (Figure 12)
  has not reached the maximum possible speed
  of the hard disks
• Current multi-query optimization only exploits
  shared data scan and shared joins.
  – further explore predicate sharing and aggregation
    sharing
• Reuse previous query results for answering similar
  queries later
Thank You!


Editor's notes

  1. Users have to write code in order to access the data. MapReduce is just an execution model; the underlying data storage and access methods are completely left to users to implement.