Building A Hybrid Warehouse:
Efficient Joins between Data
Stored in HDFS and Enterprise
Warehouse
YUANYUAN TIAN (YTIAN@US.IBM.COM)
IBM RESEARCH -- ALMADEN
Publications: Tian et al., EDBT 2015; Tian et al., TODS 2016 (invited as Best of EDBT 2015)
Big Data in The Enterprise
[Diagram: a Hadoop + Spark platform (ETL/ELT, Graph, ML, Analytics, Streams, SQL) over HDFS holding social and other data, alongside an EDW with its own SQL engine; both answer SQL queries]
Example Scenario
SELECT L.url_prefix, COUNT(*)
FROM Transaction T, Logs L
WHERE T.category = ‘Canon Camera’
AND region(L.ip)= ‘East Coast’
AND T.uid=L.uid
AND T.date >= L.date AND T.date <= L.date+1
GROUP BY L.url_prefix
Find the number of views of the URLs visited by customers with IP addresses from the East Coast who bought a Canon Camera within one day of their online visits
[Diagram: the Logs table L on HDFS, queried through a SQL-on-Hadoop engine, and the Transactions table T in the EDW]
Correlate customers’ online behaviors with sales
Hybrid Warehouse
 What is a Hybrid Warehouse?
 A special federation between Hadoop-like big data platforms and EDWs
 Two asymmetric, heterogeneous, and independent distributed systems.
 Existing federation solutions are inadequate
 Client-server model to access remote databases and move data
 Single connection for data transmission
EDW vs. SQL-on-Hadoop:
 Data Ownership: EDW owns its data and controls data organization and partitioning; SQL-on-Hadoop works with existing files on HDFS and cannot dictate data layout
 Index Support: EDW builds and exploits indexes; SQL-on-Hadoop is scan-based only, with no index support
 Update: EDW supports update-in-place; SQL-on-Hadoop is append-only
 Capacity: EDW uses high-end servers in smaller clusters; SQL-on-Hadoop uses commodity machines in larger clusters (up to 10,000s of nodes)
Joins in Hybrid Warehouse
 Focus on an equi-join between two big tables in the hybrid warehouse
 Table T in an EDW (a shared-nothing full-fledged parallel database)
 Table L on HDFS, with a scan-based distributed data processing engine (HQP)
 Both tables are large, but generally |L|>>|T|
 Data not distributed/partitioned by join key at either side
 Queries are issued and results are returned at EDW side
 Final result is relatively small due to aggregation
SELECT L.url_prefix, COUNT(*)
FROM Transaction T, Logs L
WHERE T.category = ‘Canon Camera’
AND region(L.ip)= ‘East Coast’
AND T.uid=L.uid
AND T.date >= L.date AND T.date <= L.date+1
GROUP BY L.url_prefix
Existing Hybrid Solutions
 Data of one system is entirely loaded into the other.
 DB → HDFS: DB data gets updated frequently, and HDFS doesn't support updates properly
 HDFS → DB: HDFS data is often too big to be moved into the DB
 Dynamically ingest needed data from HDFS into DB
 e.g. Microsoft Polybase, Pivotal HAWQ, Teradata SQL-H, Oracle Big Data SQL
 Selection and projection pushdown to the HDFS side
 Joins executed on the DB side only
 Heavy burden on the DB side
 Assume that SQL-on-Hadoop systems are not efficient at join processing
 NOT TRUE ANYMORE! (IBM Big SQL, Impala, Presto, etc.)
 Split query processing between DB and HDFS
 Microsoft Polybase
 Joins executed in Hadoop, only when both tables are on HDFS
Goals and Contributions
 Goals:
 Fully utilize the processing power and massive parallelism of both systems
 Minimize data movement across the network
 Exploit the use of Bloom filters
 Consider performing joins both at the DB side and the HDFS side
 Contributions:
 Adapt and extend well-known distributed join algorithms to work in the hybrid warehouse
 Propose a new zigzag join algorithm that is shown to work well in most cases
 Implement the algorithms in a prototype of the hybrid warehouse architecture with DB2 DPF and our join engine on HDFS
 Empirically compare all join algorithms in different selectivity settings
 Develop a sophisticated cost model for all the join algorithms
DB-Side Join
Move the HDFS data after selection & projection to DB
 Used in most existing hybrid systems: Polybase, Pivotal HAWQ, etc.
 HDFS table after selection & projection can still be big
Bloom filters to exploit join selectivity
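A minimal sketch of the DB-side join flow with a Bloom filter, written in illustrative Python rather than the actual DB2 DPF / JEN code; the row format, the `uid` join key, and all function names are assumptions for exposition. The DB builds a filter over the join keys of the filtered T, the HDFS side applies it while scanning L, and only the surviving L tuples are shipped back to the database, which runs the final join and aggregation.

```python
# Hypothetical sketch of a DB-side join with a Bloom filter; not the DB2 UDF code.
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# DB side: apply the local predicate on T and build a filter on the join key (uid).
def db_build_filter(T, pred_T, m_bits=1 << 20, k=2):
    bf = BloomFilter(m_bits, k)
    for row in T:
        if pred_T(row):
            bf.add(row["uid"])
    return bf

# HDFS side: scan L, apply its local predicate and the filter, and ship only the
# surviving tuples to the database, where the join and aggregation are executed.
def hdfs_filter_and_ship(L, pred_L, bf):
    return [row for row in L if pred_L(row) and bf.may_contain(row["uid"])]
```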
HDFS-Side Broadcast Join
If the DB table after selection & projection is very small
 Broadcast the DB table to the HDFS side to avoid shuffling the HDFS table.
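A sketch of the per-worker logic of the HDFS-side broadcast join, again in illustrative Python with an assumed row format: every worker receives a full copy of the filtered DB table T', builds a hash table on it, and probes it with its local portion of L, so the HDFS table is never shuffled.

```python
# Hypothetical per-worker broadcast join; row format and names are illustrative.
from collections import defaultdict

def broadcast_join(local_L_partition, broadcast_T_prime, pred_L):
    # Build a hash table over the (small) broadcast copy of T' once per worker.
    by_uid = defaultdict(list)
    for t in broadcast_T_prime:
        by_uid[t["uid"]].append(t)
    # Probe it with the locally scanned and filtered rows of L.
    for l in local_L_partition:
        if pred_L(l):
            for t in by_uid.get(l["uid"], []):
                yield (t, l)
```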
HDFS-Side Repartition Join
When the DB table after selection & projection is still large
 Both sides agree on a hash function for data shuffling
 Bloom filter to exploit join selectivity
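A sketch of the repartition mechanics in illustrative Python: both sides compute the same deterministic hash of the join key, so each DB agent can ship its partition of T' directly to the worker that also receives the matching partition of L'. The Bloom filter built from T' and applied while scanning L works exactly as in the DB-side join sketch and is omitted here; `zlib.crc32` is just a stand-in for whatever hash function the two systems agree on.

```python
# Hypothetical repartition-join helpers; names and row format are illustrative.
import zlib

def partition_of(uid, n_workers):
    # Deterministic hash that DB agents and JEN workers can compute identically,
    # so matching join keys land on the same worker.
    return zlib.crc32(str(uid).encode()) % n_workers

def route(rows, n_workers, key="uid"):
    buckets = [[] for _ in range(n_workers)]
    for row in rows:
        buckets[partition_of(row[key], n_workers)].append(row)
    return buckets  # bucket i is sent to worker i

def local_hash_join(T_bucket, L_bucket):
    # Per-worker hash join between its received T' partition and its L' partition.
    by_uid = {}
    for t in T_bucket:
        by_uid.setdefault(t["uid"], []).append(t)
    for l in L_bucket:
        for t in by_uid.get(l["uid"], []):
            yield (t, l)
```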
HDFS-Side Zigzag Join
A 2-way Bloom filter can further reduce the DB data transferred to the HDFS side.
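An end-to-end control-flow sketch of the zigzag join in illustrative Python. Plain sets stand in for the two Bloom filters so the snippet stays self-contained (a real Bloom filter is far more compact but admits false positives), and the distributed shuffle is collapsed into a single local hash join; the point is the two-way filter exchange before any large table is moved.

```python
# Hypothetical zigzag-join control flow; sets stand in for the Bloom filters.
def zigzag_join(T, L, pred_T, pred_L):
    # 1. DB side: filter T locally and summarize its join keys (bf_T in the real system).
    T_prime = [t for t in T if pred_T(t)]
    keys_T = {t["uid"] for t in T_prime}

    # 2. HDFS side: scan L, apply the local predicate plus bf_T, and summarize
    #    the surviving join keys (bf_L), which is sent back to the DB.
    L_prime = [l for l in L if pred_L(l) and l["uid"] in keys_T]
    keys_L = {l["uid"] for l in L_prime}

    # 3. DB side: apply bf_L so only T' tuples that can possibly join are
    #    shipped over the interconnect.
    T_shipped = [t for t in T_prime if t["uid"] in keys_L]

    # 4. HDFS side: shuffle L' and the received T tuples by the agreed hash
    #    function and join locally on each worker; collapsed here into one hash join.
    by_uid = {}
    for t in T_shipped:
        by_uid.setdefault(t["uid"], []).append(t)
    return [(t, l) for l in L_prime for t in by_uid.get(l["uid"], [])]
```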
Implementation
 EDW: DB2 DPF extended with unfenced C UDFs
 Computing & applying Bloom filters
 Different ways of transferring data between DB2 and JEN
 HQP: Our own C++ join execution engine, called JEN
 Sophisticated HDFS-side join engine using multi-threading, pipelining, hash-based aggregations, etc
 Coordination between DB2 and the HDFS-side engine
 Parallel communication layer between DB2 and the HDFS-side engine
 HCatalog: for storing the metadata of HDFS tables
 Each join algorithm is invoked by issuing a single query to DB2
JEN Overview
 Built with a prototype of the IO layer and the scheduler from an early version of IBM Big SQL 3.0
 A JEN cluster consists of one coordinator and n workers
 Each JEN worker:
 Multi-threaded, run on each HDFS DataNode
 Read parts of HDFS tables (leveraging IO layer of IBM Big SQL 3.0)
 Execute local query plans
 Communicate in parallel with other JEN workers (MPI-based)
 Communicate in parallel with DB2 agents through TCP/IP sockets
 JEN coordinator:
 Manage JEN workers
 Orchestrate connection and communication between JEN workers and DB2 agents
 Retrieve metadata for HDFS tables
 Assign HDFS blocks to JEN workers (leveraging the scheduler of IBM Big SQL 3.0)
Experimental Setup
 HDFS cluster:
 30 DataNodes, each runs 1 JEN worker
 Each server: 8 cores, 32 GB RAM, 1 Gbit Ethernet, 4 disks for HDFS
 DB2 DPF:
 5 servers, each runs 6 DB2 agents
 Each server: 12 cores, 96 GB RAM, 10 Gbit Ethernet, 11 disks for DB2 data storage
 Interconnection: 20 Gbit switch
 Dataset:
 Log table L on HDFS (15 billion records)
 1TB in text format
 421GB in Parquet format (default)
 Transaction table T in DB2 (1.6 billion records, 97GB)
 Bloom filter: 128 million bits with 2 hash functions
 # join keys: 16 million
 false positive rate: 5%
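The quoted 5% false-positive rate is consistent with the textbook Bloom filter estimate p ≈ (1 − e^(−kn/m))^k for these parameters; a quick check in Python:

```python
# Sanity check of the Bloom filter false-positive rate using the standard
# approximation p ≈ (1 - e^(-k*n/m))^k.
import math

m = 128_000_000   # bits in the Bloom filter
k = 2             # hash functions
n = 16_000_000    # distinct join keys inserted

fpr = (1 - math.exp(-k * n / m)) ** k
print(f"{fpr:.3f}")  # ~0.049, i.e. roughly the 5% quoted above
```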
DB-Side Joins vs HDFS-Side Joins
 DB-side joins work well only when selectivity on L is small (σL <= 0.01).
 HDFS-side joins show very steady performance with increasing L'.
 HDFS-side join (especially zigzag join) is a very reliable choice for joins in the hybrid warehouse!
[Chart: join times at transaction table selectivity = 0.1; DB-side joins deteriorate fast!]
Broadcast Join vs Repartition Join
Broadcast join only works in very limited cases, e.g. when σT <= 0.001 (T' <= 25MB).
Tradeoff: broadcasting T' (30*T') over the interconnect vs. sending T' over the interconnect once plus shuffling L' within the HDFS cluster (a rough byte comparison follows the chart below).
[Charts: broadcast vs. repartition join time (sec) vs. log table selectivity (0.001 to 0.2), at transaction table selectivity = 0.001 (left) and 0.01 (right)]
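A back-of-the-envelope view of the tradeoff above, in Python: broadcast moves roughly 30·|T'| over the interconnect, while repartition moves |T'| once plus an internal shuffle of L'. The worker count and the 25 MB figure come from the slides; the L' shuffle volume and the 250 MB case are illustrative assumptions.

```python
# Rough byte counts for the two strategies; only n_workers and the 25 MB value
# of T' (sigma_T = 0.001) come from the slides, the rest is illustrative.
n_workers = 30
L_prime_shuffle_mb = 2_000            # assumed volume of L' shuffled inside the HDFS cluster

for T_prime_mb in (25, 250):          # ~sigma_T = 0.001, and an assumed 10x larger T'
    broadcast_mb = n_workers * T_prime_mb             # T' copied to every worker
    repartition_mb = T_prime_mb + L_prime_shuffle_mb  # T' sent once + L' shuffled
    print(T_prime_mb, broadcast_mb, repartition_mb)
# 25 MB:  750 vs 2025  -> broadcast cheaper
# 250 MB: 7500 vs 2250 -> repartition cheaper
```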
Zigzag Join vs Repartition Joins
Transaction Table Selectivity = 0.1
Algorithm          HDFS tuples shuffled   DB tuples sent
Repartition        5,854 million          165 million
Repartition (BF)   591 million            165 million
Zigzag             591 million            30 million
Zigzag join is most efficient
 Up to 2.1x faster than repartition join, up to 1.8x faster than repartition join with BF
Zigzag join significantly reduces data movement
 9.9x less HDFS data shuffled, 5.5x less DB data sent
Zigzag join is the best HDFS-side join algorithm!
Cost Model of Join Algorithms
 Goal:
 Capture the relative performance of the join algorithms
 Enable a query optimizer in the hybrid warehouse to choose the right join strategy
 Estimate total resource time (disk IO, network IO, CPU) in milliseconds
 Parameters used in cost formulas:
 System parameters: only related to the system environment (common to all queries)
 # DB nodes, # HDFS nodes, DB buffer pool size, disk IO speeds (DB, HDFS), network IO speeds (DB, HDFS, in-between), etc
 Estimated through a learning suite which runs a number of test programs
 Query parameters: query-specific parameters
 Table cardinalities, table sizes, local predicate selectivity, join selectivity, Bloom filter size, Bloom filter false-positive rate, etc
 DB table: leverage DB stats
 HDFS table: estimate through sampling or Hive Analyze Table command if possible
 Join selectivity: estimate through sampling
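A deliberately simplified illustration of how such resource-time estimates can be composed from the parameters above, here only for the data-transfer portion of a DB-side join; this is not the paper's cost model, and every name and term in it is an assumption for exposition.

```python
# Illustrative only: NOT the cost formulas from the paper.
def transfer_ms(bytes_moved, mb_per_sec):
    # time in milliseconds to move bytes_moved at the given throughput
    return bytes_moved / (mb_per_sec * 1024 * 1024) * 1000.0

def db_side_join_transfer_cost(L_bytes, sel_L, bf_pass_rate,
                               hdfs_scan_mbps, interconnect_mbps):
    scan = transfer_ms(L_bytes, hdfs_scan_mbps)                            # scan L on HDFS
    ship = transfer_ms(L_bytes * sel_L * bf_pass_rate, interconnect_mbps)  # ship survivors to the DB
    return scan + ship  # a full model would also charge CPU and DB-side disk IO
```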
Validation of Cost Models
Columns: selectivity on T | selectivity on L | join selectivity on T | join selectivity on L | best from cost model | best from experiment | intersection metric
0.05 0.001 0.0005 0.05 db(BF) db(BF) 0
0.05 0.01 0.005 0.05 db(BF) db(BF) 0.18
0.05 0.1 0.05 0.05 zigzag zigzag 0.08
0.05 0.2 0.1 0.05 zigzag zigzag 0
0.1 0.001 0.0005 0.1 db(BF) db(BF) 0
0.1 0.01 0.005 0.1 db(BF) db(BF) 0.18
0.1 0.1 0.05 0.1 zigzag zigzag 0.14
0.1 0.2 0.1 0.1 zigzag zigzag 0.06
Cost model correctly finds the best algorithm in every case!
Even the ranking of the algorithms is similar or identical to that of empirical observation!
Concluding Remarks
 Emerging need for hybrid warehouse: enterprise warehouses will co-exist with big data systems
 Bloom filters are an effective way to filter data, and they can be used in both directions
 Powerful SQL processing capability on the HDFS side
 IBM Big SQL, Impala, Hive 14, …
 Existing SQL-on-Hadoop systems can be augmented with the capabilities of JEN
 More capacity and investment on the big data side
 Exploit capacity without moving data
 It is better to do the joins on the Hadoop side
 More complex usage patterns are emerging
 EDW on premise, Hadoop on cloud
Editor's Notes
1. With the advent of big data, the enterprise analytics landscape has dramatically changed. HDFS has become an important data repository for all business analytics. Enterprises are using various big data technologies to process data and drive actionable insights. HDFS serves as the storage where other distributed processing frameworks, such as Hadoop and Spark, access and operate on large volumes of data. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. EDWs are usually shared-nothing parallel databases that support complex SQL processing, updates, and transactions. As a result, they manage up-to-date data and support various business analytics tools, such as reporting and dashboards. A new generation of applications has emerged, requiring access to and correlation of data stored in HDFS and EDWs.
  2. For example, a company running an ad campaign may want to evaluate the effectiveness of its campaign by correlating click stream data stored in HDFS with actual sales data stored in the database. This requires joining the transaction table T in the parallel database with the log table L on HDFS. Such analysis can be expressed as the following SQL query.
3. These applications, together with the coexistence of HDFS and EDWs, have created the need for a new generation of special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. It is very important to highlight the unique challenges of the hybrid warehouse. First of all, we are dealing with two asymmetric, heterogeneous, and independent distributed systems. A full-fledged database and a SQL-on-Hadoop processor have very different characteristics.
4. In this work, we envision an architecture of the hybrid warehouse by studying the important problem of efficiently executing joins between HDFS and EDW data.
5. Many database/HDFS hybrid systems fetch the HDFS table and execute the join in the database. We first explore this approach, which we call the DB-side join. Note that the HDFS table L is usually much larger than the database table T. Even if the local predicates predL are highly selective, the filtered HDFS table L can still be quite big. In order to further reduce the amount of data transferred from HDFS to the parallel database, we introduce a Bloom filter bfT on the join key of T', which is the database table after applying local predicates and projection, and send the Bloom filter to the HDFS side.
6. We now consider executing the join at the HDFS side. If the predicates predT on the database table T are highly selective, the filtered database data T' is small enough to be sent to every HQP node, so that only local joins are needed without any shuffling of the HDFS data.
7. If the local predicates predT over the database table T are not highly selective, then broadcasting the filtered data T' to all HQP nodes is not a good strategy. In this case, we need a robust join algorithm.
8. When local predicates on neither the HDFS table nor the database table are selective, we need to fully exploit the join selectivity to perform the join efficiently. We can even further reduce the amount of data movement between the two systems by exploiting Bloom filters in both directions.