JackHare
a framework for SQL to NoSQL translation using MapReduce
Wu-Chun Chung·Hung-Pin Lin·
Shih-Chang Chen·Mon-Fong Jiang·
Yeh-Ching Chung
Received: 15 December 2012 / Accepted: 6 September 2013
© Springer Science+Business Media New York 2013

Presented by 康志強
2013.10.22
1
Outline
• Introduction
• Related work
• The JackHare framework architecture
• Unstructured data processing in HBase
• Experimental results
• Conclusions

2
Introduction
• Big data challenges (massive data)
– Data access speed
– Data merging
Timeliness and correctness of data during parallel processing.

• Hadoop MapReduce
– processes massive data in parallel

• Hadoop Distributed File System (HDFS)
– makes frequent data updates difficult

3
Introduction
• HBase
– places data on a scale-out storage system
– manipulates changeable data in a transparent way
– but the HBase interface is not user-friendly

• JackHare
– an API compliant with the ANSI-SQL and JDBC 4.0 specifications, used to operate Apache HBase
– uses the MapReduce framework to process unstructured data in HBase
4
Introduction
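• Data access speed
– 1990: a hard disk held 1,370 MB, with a transfer speed of 4.4 MB/s
– Today: 1 TB, with a transfer speed of 100 MB/s
– Reading and writing data in parallel speeds things up

• Hadoop Distributed File System
– frequent data updates are difficult in such a file system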
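To make the access-speed figures on this slide concrete, the sketch below estimates how long a full read of each disk takes, and how much parallel reads could help. It assumes 1 TB = 1,000,000 MB, ideal (overhead-free) parallelism, and a hypothetical count of 100 drives; none of these assumptions come from the slides.

```python
def full_read_seconds(capacity_mb: float, rate_mb_per_s: float, drives: int = 1) -> float:
    """Seconds to read capacity_mb when it is split evenly over `drives` disks read in parallel."""
    return capacity_mb / (rate_mb_per_s * drives)


# 1990 disk from the slide: 1,370 MB at 4.4 MB/s.
disk_1990 = full_read_seconds(1_370, 4.4)

# Present-day disk from the slide: 1 TB (assumed 1,000,000 MB) at 100 MB/s.
disk_now = full_read_seconds(1_000_000, 100)

# Same 1 TB spread over 100 drives (hypothetical number) and read in parallel.
disk_parallel = full_read_seconds(1_000_000, 100, drives=100)

print(f"1990 disk:      {disk_1990 / 60:.1f} minutes")
print(f"1 TB, 1 drive:  {disk_now / 3600:.1f} hours")
print(f"1 TB, parallel: {disk_parallel / 60:.1f} minutes")
```

Capacity has grown far faster than transfer speed, so reading a whole disk now takes hours rather than minutes unless the data is spread across many drives.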

5
Introduction
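• Data merging
– Correctness

• MapReduce
– A distributed programming framework
– Map splits one job across multiple nodes
– Reduce combines the results from each node into the final result
– Data locality
– Can be used through high-level query languages (Pig, Hive)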
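The Map and Reduce roles described above can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's API; the function names and the sample input are made up for the example.

```python
from collections import defaultdict


def map_phase(splits, mapper):
    """Apply the mapper to every input split; on a cluster each split could run on a different node."""
    for split in splits:
        yield from mapper(split)


def shuffle(pairs):
    """Group intermediate (key, value) pairs by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values collected for each key into one final result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    return sum(values)


splits = ["HBase stores data", "MapReduce processes data", "data data"]
counts = reduce_phase(shuffle(map_phase(splits, word_mapper)), sum_reducer)
print(counts)
```

The mapper never sees more than one split and the reducer never sees more than one key, which is what lets a real framework distribute both phases across nodes.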
6
Introduction
• MapReduce

7
Introduction
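• HBase
– A distributed database built on top of HDFS
– Data values are accessed using the row and column as the index
– Every value carries a timestamp, so the same cell can hold multiple values written at different times (versions)
– An HBase table consists of many rows and a few column families
– Can serve as a data source or a storage sink for MapReduce programs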
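The data model above (row, column family, column, timestamped versions) can be pictured as a nested map. The toy class below is an in-memory illustration under that reading; `ToyTable` and its methods are hypothetical names and are not the HBase client API.

```python
class ToyTable:
    """In-memory model: row key -> "family:qualifier" -> list of (timestamp, value)."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed when the table is created
        self.rows = {}

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        column = f"{family}:{qualifier}"
        self.rows.setdefault(row, {}).setdefault(column, []).append((ts, value))

    def get(self, row, family, qualifier, versions=1):
        """Return up to `versions` cells for the column, newest timestamp first."""
        cells = self.rows.get(row, {}).get(f"{family}:{qualifier}", [])
        return sorted(cells, key=lambda cell: cell[0], reverse=True)[:versions]


table = ToyTable(["info"])
table.put("wafer-001", "info", "status", "queued", ts=1)
table.put("wafer-001", "info", "status", "done", ts=2)

latest = table.get("wafer-001", "info", "status")
history = table.get("wafer-001", "info", "status", versions=2)
print(latest, history)
```

Writing to the same row and column twice does not overwrite the earlier value; it adds a newer version, and a read returns the newest one unless more versions are requested.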
8
Introduction
• HBase

9
Introduction
• NoSQL databases
• http://www.ithome.com.tw/itadm/article.php?c=63360&s=5

10
Introduction
• JackHare
– allows users to manipulate large-scale data with ANSI-SQL queries
– an API compliant with the ANSI-SQL and JDBC 4.0 specifications, used to operate Apache HBase
– uses the MapReduce framework to process unstructured data in HBase

11
Related work
• Pig
– Runs in an HDFS and MapReduce cluster environment
– Pig Latin - a simpler procedural language
– http://pig.apache.org/docs/r0.12.0/basic.html#nestedblock

• Hive
– Provides an SQL-like query language (HiveQL) for querying data
– Can manage data stored in HDFS
– https://cwiki.apache.org/confluence/display/Hive/Tutorial
12
Related work
• YSmart
– An SQL-to-MapReduce Translator
– http://ysmart.cse.ohio-state.edu/

• S2MART
– Smart SQL to Map-Reduce Translators

13
Related work
• HadoopDB
– An Architectural Hybrid of MapReduce and DBMS
Technologies for Analytical
– HadoopDB provides SQL query via a translation
called SQL-MR-SQL (SMS), based on Hive.
– http://db.cs.yale.edu/hadoopdb/hadoopdb.html

• Clydesdale
– structured data processing on MapReduce
– focuses on processing the data fitting a star schema
14
Related work
• Translating SQL queries into MapReduce
• HBase
– Supports frequent data updates
– Maintains the scalability and reliability of a NoSQL database

15
The JackHare framework architecture

16
The JackHare framework architecture
• The user submits an ANSI-SQL query through an SQL client application.
• The compiler scans and parses the ANSI-SQL query.
• Look up the related table names, column families, and column qualifiers in HBase.
• Generate MapReduce code according to the query commands and the metadata.
17
The JackHare framework architecture
• Access HBase and execute the MapReduce job.
• The results are wrapped and returned from the back end.
• The returned results are shown in the SQL client application according to the RDB schema.
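The steps listed on the architecture slides (parse, metadata lookup, job generation, execution, results in relational form) can be sketched end to end with a toy pipeline. This is an illustration under stated assumptions, not JackHare's implementation: the regular-expression parser handles only one simple query shape, and `METADATA`, `STORE`, the `cf` column family, and the column names are all hypothetical stand-ins.

```python
import re

# Hypothetical metadata: relational table -> HBase table, column -> "family:qualifier".
METADATA = {
    "wafer": {
        "hbase_table": "WAFER",
        "columns": {"wafer_id": "cf:wafer_id", "lot_id": "cf:lot_id"},
    },
}

# Toy in-memory stand-in for HBase: table -> row key -> {"family:qualifier": value}.
STORE = {
    "WAFER": {
        "w1": {"cf:wafer_id": "w1", "cf:lot_id": "lot-a"},
        "w2": {"cf:wafer_id": "w2", "cf:lot_id": "lot-b"},
        "w3": {"cf:wafer_id": "w3", "cf:lot_id": "lot-a"},
    }
}

QUERY_RE = re.compile(
    r"SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<col>\w+)\s*=\s*'(?P<val>[^']*)')?\s*;?\s*$",
    re.IGNORECASE,
)


def parse(sql):
    """Scan and parse: supports only SELECT cols FROM table [WHERE col = 'value']."""
    match = QUERY_RE.match(sql.strip())
    if match is None:
        raise ValueError(f"unsupported query: {sql!r}")
    columns = [c.strip().lower() for c in match["cols"].split(",")]
    where = (match["col"].lower(), match["val"]) if match["col"] else None
    return {"table": match["table"].lower(), "columns": columns, "where": where}


def lookup(query):
    """Resolve relational names to the HBase table and family:qualifier names."""
    meta = METADATA[query["table"]]
    where = query["where"]
    return {
        "hbase_table": meta["hbase_table"],
        "select": [(c, meta["columns"][c]) for c in query["columns"]],
        "where": (meta["columns"][where[0]], where[1]) if where else None,
    }


def generate_mapper(plan):
    """Build a map function that filters rows (WHERE) and projects columns (SELECT)."""
    def mapper(row_key, cells):
        if plan["where"] is None or cells.get(plan["where"][0]) == plan["where"][1]:
            yield row_key, {name: cells.get(cell) for name, cell in plan["select"]}
    return mapper


def execute(sql):
    """Run the map-only job over the toy store and return rows in relational form."""
    plan = lookup(parse(sql))
    mapper = generate_mapper(plan)
    rows = STORE[plan["hbase_table"]]
    return [record for key in sorted(rows) for _, record in mapper(key, rows[key])]


result = execute("SELECT wafer_id FROM wafer WHERE lot_id = 'lot-a'")
print(result)
```

A plain SELECT with a WHERE filter needs no reduce step, so the sketch generates a map-only job; clauses that combine rows, such as GROUP BY, would add a reducer.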

18
The JackHare framework architecture
SQuirreL

19
Unstructured data processing in
HBase
• remap the data in relational database to HBase

20
Unstructured data processing in
HBase
• remap the data in relational database to HBase
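One plausible way to picture this remapping is sketched below. It assumes the relational primary key becomes the HBase row key and every other column becomes a qualifier in a single column family; the slides do not state the actual mapping rules, so the function name, the `cf` family, and the sample row are hypothetical.

```python
def row_to_cells(row, primary_key, family="cf"):
    """Turn one relational row (a dict) into HBase-style (row key, column, value) cells.

    Assumptions: the primary key becomes the row key, all other columns share one
    column family, and NULL (None) values are simply not stored.
    """
    row_key = str(row[primary_key])
    return [
        (row_key, f"{family}:{column}", str(value))
        for column, value in row.items()
        if column != primary_key and value is not None
    ]


# Hypothetical DIE row; the column names are illustrative only.
cells = row_to_cells({"die_id": 7, "wafer_id": "w1", "defect": None, "bin": 3}, "die_id")
print(cells)
```

Skipping NULL values reflects that a sparse store only needs to keep the cells that actually exist.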

21
Unstructured data processing in
HBase
• Analysis of SQL clauses
– SELECT, FROM and WHERE clauses
– Extended clauses
• GROUP BY
• HAVING
• ORDER BY
• JOIN
• AGGREGATE FUNCTIONs
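As a rough illustration of how some of the extended clauses could map onto MapReduce, the sketch below evaluates a GROUP BY query with aggregate functions, HAVING, and ORDER BY. The mapper emits the grouping key, the reducer computes the aggregates, and HAVING and ORDER BY are applied to the reduced output. The DIE rows and column names are hypothetical, and this is not the code JackHare generates.

```python
from collections import defaultdict

# Hypothetical DIE rows; the column names are illustrative only.
DIE = [
    {"die_id": 1, "wafer_id": "w1", "passed": 1},
    {"die_id": 2, "wafer_id": "w1", "passed": 0},
    {"die_id": 3, "wafer_id": "w2", "passed": 1},
    {"die_id": 4, "wafer_id": "w1", "passed": 1},
    {"die_id": 5, "wafer_id": "w3", "passed": 0},
    {"die_id": 6, "wafer_id": "w2", "passed": 1},
]

# Query being evaluated:
#   SELECT wafer_id, COUNT(*) AS n, SUM(passed) AS good
#   FROM die GROUP BY wafer_id HAVING COUNT(*) >= 2 ORDER BY wafer_id


def mapper(row):
    # GROUP BY column becomes the intermediate key.
    yield row["wafer_id"], row["passed"]


def reducer(key, values):
    # Aggregate functions are computed once per key.
    return {"wafer_id": key, "n": len(values), "good": sum(values)}


# Shuffle: collect mapper output by key.
groups = defaultdict(list)
for row in DIE:
    for key, value in mapper(row):
        groups[key].append(value)

aggregated = [reducer(key, values) for key, values in groups.items()]
having = [r for r in aggregated if r["n"] >= 2]          # HAVING
result = sorted(having, key=lambda r: r["wafer_id"])     # ORDER BY
print(result)
```

HAVING has to run after the reducer because it filters on an aggregate, whereas a WHERE filter can already be applied inside the mapper.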

22
Experimental results
• Experimental environment
– Two Intel Xeon L5640 CPUs, 24 GB RAM, and a 3 TB hard disk
– 16-node virtual machine cluster on four physical machines
– Hadoop 0.20.203 (15 October 2013: release 2.2.0 available)
– HBase 0.92.0 (2013-09-20 | Version: 0.97.0-SNAPSHOT)
– Hive 0.9.0
– Java 1.6.0 with a maximum heap size of 512 MB
23
Experimental results
• Experimental environment
– Node: two cores at 2 GHz with 4 GB RAM and 400 GB of storage space
– MySQL: two cores at 2 GHz, 4 GB RAM, and an 800 GB hard disk
– Three tables: LOT, WAFER and DIE

24
Experimental results
• Results

25
Experimental results

26
Experimental results

27
Experimental results

28
Conclusions

29
• End of presentation.

30
