12 SQL-ON-HADOOP TOOLS
Saggi Neumann - CTO and co-founder, Xplenty
BRINGING SQL TO HADOOP
In our recent post, 8 SQL-on-Hadoop challenges, we quickly listed several
tools that help to bridge the gap between the two technologies without
going into details. This time we’ll dive in and learn about 12 tools that
bring SQL to Hadoop in various ways.
OPEN SOURCE SQL-ON-HADOOP TOOLS
APACHE HIVE
Initially developed by Facebook, Apache Hive is a data warehouse
infrastructure that is built on top of Hadoop. It allows querying data
stored on HDFS for analysis via HQL, an SQL-like language that is
translated to MapReduce jobs. Although it seems to provide SQL
functionality, Hive performs batch processing on Hadoop and does not
provide interactive querying. It stores metadata in a relational database
and requires maintaining a schema for the data. Hive’s built-in file
formats include text, SequenceFile, ORC and RCFile. Hive also supports
processing compressed data on Hadoop as well as user-defined functions.
▪ Bottom line - batch processing on Hadoop with an SQL-like language
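
To make "translated to MapReduce jobs" more concrete, here is a minimal Python sketch of how a query such as `SELECT page, COUNT(*) FROM visits GROUP BY page` could be split into map, shuffle and reduce phases. The `visits` table, its rows and all function names are hypothetical, and this is only an illustration of the idea, not Hive's actual query planner.

```python
from collections import defaultdict

# Hypothetical rows of a "visits" table: (user, page)
VISITS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/pricing"),
    ("carol", "/home"),
]


def map_phase(rows):
    """Emit a (page, 1) pair for every input row."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, which is what COUNT(*) needs."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(rows):
    """Chain the three phases for the GROUP BY query described above."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(VISITS))
```

Every phase has to finish over the whole input before the next one produces a result, which is one way to see why this style of execution behaves as batch processing rather than interactive querying.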
APACHE SQOOP
Apache Sqoop imports and exports data between relational databases and
Hadoop via JDBC, the standard API for connecting to databases with Java.
It can also work without JDBC as long as the relevant tools allow bulk
import/export of data. Sqoop works by running a query on the relational
database and exporting the resulting rows into files in one of these
formats: text, binary, Avro, or SequenceFile. These
files can then be saved on Hadoop’s HDFS. They can also be exported from
Hadoop back into a relational database. Finally, Sqoop integrates with
HCatalog, a table and storage management service for Hadoop that allows
querying Sqoop’s imported files via Hive or Pig. See our Sqoop blog
post for more info.
▪ Bottom line - import/export data from SQL databases to/from Apache
Hadoop
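
The import flow described above (run a query, write the resulting rows to delimited text files) can be sketched in a few lines of Python. This is an illustration under loud assumptions: `sqlite3` stands in for the relational database, a local temporary directory stands in for HDFS, and the function names, table and output file name are all hypothetical. It does not use or reproduce Sqoop itself.

```python
import csv
import os
import sqlite3
import tempfile


def import_query_to_text(conn, query, out_dir, filename="part-00000"):
    """Run `query` and write the resulting rows to a comma-delimited text file."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    cursor = conn.execute(query)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in cursor:
            writer.writerow(row)
    return path


def make_demo_db():
    """Build a small in-memory database to import from (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    return conn


if __name__ == "__main__":
    target = tempfile.mkdtemp()  # local stand-in for an HDFS directory
    query = "SELECT id, name FROM customers ORDER BY id"
    print(import_query_to_text(make_demo_db(), query, target))
```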
BIGSQL
BigSQL is a pre-made package of PostgreSQL and Hadoop that you can
easily download and install to try out on your local machine. Aside from
Apache Hadoop and PostgreSQL, it also includes Cassandra, Tez, Hive,
ZooKeeper, and HadoopFDW. Extra components such as Pig, Sqoop, and
HBase can be downloaded separately.
▪ Bottom line - pre-made package for trying out Hadoop with PostgreSQL
on your machine
LINGUAL
While other tools provide SQL-like syntax, Cascading’s Lingual claims to
provide a full ANSI SQL interface for Hadoop, thus allowing for easier
integration with existing BI tools and helping SQL-skilled personnel to use
Hadoop immediately. Lingual supports JDBC and also includes an SQL
shell. Despite the SQL interface, it still executes queries on Hadoop as
batch jobs.
▪ Bottom line - ANSI SQL interface for Hadoop
APACHE PHOENIX
Apache Phoenix is an SQL skin for interactive queries over HBase. It
compiles SQL queries into a series of HBase scans and produces JDBC
result sets. Note that it requires maintaining a schema, which can be
built from scratch or mapped from an existing HBase table. Furthermore,
Phoenix lacks several features: full transaction support, derived tables,
relational operators, and miscellaneous built-in functions (although
these can be added manually). The project is mainly maintained
by Salesforce, Intel, and Hortonworks.
▪ Bottom line - interactive SQL over HBase
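
As a rough illustration of what "compiles SQL queries into a series of HBase scans" can mean, the Python sketch below turns a row-key range predicate into a bounded scan over a sorted key-value store. The class, function names and data are hypothetical, and the toy store only imitates the sorted-by-row-key property of an HBase table. It is not Phoenix's compiler or the HBase client API.

```python
import bisect


class SortedKeyValueStore:
    """Toy stand-in for an HBase table: rows are kept sorted by row key."""

    def __init__(self, rows):
        self._rows = sorted(rows.items())
        self._keys = [key for key, _cells in self._rows]

    def scan(self, start_row, stop_row):
        """Return rows with start_row <= key < stop_row, like a bounded scan."""
        lo = bisect.bisect_left(self._keys, start_row)
        hi = bisect.bisect_left(self._keys, stop_row)
        return self._rows[lo:hi]


def select_where_key_between(store, low, high, column):
    """Sketch of: SELECT key, <column> FROM t WHERE key >= low AND key < high."""
    return [(key, cells[column]) for key, cells in store.scan(low, high)]


# Hypothetical table contents keyed by row key
USERS = SortedKeyValueStore({
    "user#010": {"name": "Linus"},
    "user#001": {"name": "Ada"},
    "user#002": {"name": "Grace"},
})

if __name__ == "__main__":
    print(select_where_key_between(USERS, "user#002", "user#011", "name"))
```

Because the predicate maps to a start and stop key, only the matching slice of the table is read, which is the kind of access pattern that keeps such queries interactive.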
IMPALA
Cloudera’s Impala is a query engine that runs on top of Hadoop and
executes interactive SQL queries on HDFS and HBase. While Hive runs
queries as batch jobs, Impala runs them in real time, which lets
SQL-based business intelligence tools work with Hadoop. Although Cloudera is
the main developer behind this tool, it is fully open source and supports
the following file formats: text, LZO, SequenceFile, Avro and RCFile.
Impala can also run in the cloud via Amazon’s Elastic MapReduce.
▪ Bottom line - Cloudera’s solution for interactive SQL queries over HDFS
and HBase
PRESTO
Presto is also an interactive SQL query engine. It runs on top of Hive,
HBase, and even relational databases and proprietary data stores, thus
combining data from multiple sources across the organization. Facebook is
the main developer behind Presto and the company uses it to query
internal data stores, including a 300PB data warehouse. Airbnb and
Dropbox also use Presto, so it seems tried and tested for the enterprise.
▪ Bottom line - Facebook’s solution for interactive SQL queries over Hive
and HBase
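
The idea of one query "combining data from multiple sources" can be sketched as a join across two independent connections. In the Python illustration below, two separate in-memory `sqlite3` databases stand in for two different data stores, and a simple hash join combines them. All table names, data and function names are hypothetical, and nothing here reflects how Presto itself is implemented.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an independent in-memory database acting as one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


def federated_join(orders_conn, users_conn):
    """Join orders from one source with user names from another (hash join)."""
    # Build side: load the smaller table into a dictionary keyed by user id.
    names = dict(users_conn.execute("SELECT id, name FROM users"))
    result = []
    # Probe side: stream orders and look up the matching user name.
    query = "SELECT user_id, amount FROM orders ORDER BY amount"
    for user_id, amount in orders_conn.execute(query):
        if user_id in names:
            result.append((names[user_id], amount))
    return result


def make_demo_sources():
    """Hypothetical data living in two separate stores."""
    users = make_source(
        "CREATE TABLE users (id INTEGER, name TEXT)",
        "INSERT INTO users VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    orders = make_source(
        "CREATE TABLE orders (user_id INTEGER, amount REAL)",
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 5.0), (2, 7.5), (3, 9.0)],
    )
    return orders, users


if __name__ == "__main__":
    print(federated_join(*make_demo_sources()))
```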
CITUSDB
CitusDB (not to be confused with CitrusDB) is another interactive querying
engine with SQL-like functionality that works over Hadoop. It’s based on
Dremel, Google’s real-time analytics system for processing big data,
and unlike Impala and Presto it uses PostgreSQL as the SQL engine
that works behind the scenes. CitusDB can run on-premises or in the cloud
and supports features such as full-text search and geo search as well as
ODBC/JDBC compatibility. However, being an analytical database, it only
supports loading the data in batches.
▪ Bottom line - SQL on Hadoop interactive querying with PostgreSQL
INFINIDB
InfiniDB is a columnar database that integrates with HDFS to perform
real-time analytics on Hadoop with MySQL compatibility. The data is stored
on disk in InfiniDB’s own columnar format, with support for MySQL’s major
data types. Other formats and non-relational data structures aren’t
supported, although Parquet is on the long-term road map. The InfiniDB
team recently ran benchmarks against other open source SQL-on-Hadoop
engines and claims much better performance than Hive and Presto. InfiniDB
also supports windowing functions for analytics.
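
To show why a columnar layout suits analytics, the Python sketch below pivots row tuples into one list per column, so that an aggregate over a single column reads only that column's values. The column names and data are hypothetical, and this is a generic illustration of column-oriented storage, not InfiniDB's on-disk format.

```python
def to_columns(rows, column_names):
    """Pivot row tuples into one list per column."""
    return {
        name: [row[i] for row in rows]
        for i, name in enumerate(column_names)
    }


def average(columns, name):
    """Aggregate a single column without touching the others."""
    values = columns[name]
    return sum(values) / len(values)


# Hypothetical sales rows: (order_id, category, amount)
ROWS = [
    (1, "books", 12.0),
    (2, "games", 30.0),
    (3, "books", 18.0),
]
COLUMNS = to_columns(ROWS, ["order_id", "category", "amount"])

if __name__ == "__main__":
    print(average(COLUMNS, "amount"))
```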
COMMERCIAL SQL-ON-HADOOP TOOLS
HADAPT
Hadapt is a commercial product that brings a native SQL implementation
to Hadoop. Because it combines Hadoop with a relational database storage
layer, it allows querying Hadoop via SQL interactively rather
than as a batch process. Hadapt can handle structured and unstructured
data without a predefined schema.
▪ Bottom line - interactive SQL querying on Hadoop
JETHRO DATA
Jethro claims the title of "fastest SQL on Hadoop" by providing an SQL
engine for Hadoop that automatically indexes the data as soon as it is
written to Hadoop. According to Jethro, it executes queries 100 times
faster than Hive and 10 times faster than Impala. Jethro can be added to
an existing Hadoop cluster and is designed to be non-intrusive: it isn’t
installed on any of the Hadoop storage nodes.
▪ Bottom line - fast non-intrusive SQL-on-Hadoop via auto-indexing
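
The general idea of indexing data as it is written can be sketched in Python as a table that updates a per-column value index on every append, so that equality filters avoid a full scan. The class and data below are hypothetical. The slides do not describe Jethro's actual index structures, so this is only a generic illustration of index-on-write.

```python
from collections import defaultdict


class AutoIndexedTable:
    """Toy table that indexes every column value as rows are appended."""

    def __init__(self, columns):
        self.columns = columns
        self.rows = []
        # One index per column: value -> list of row positions
        self.index = {column: defaultdict(list) for column in columns}

    def append(self, row):
        """Store the row and update every column index in the same step."""
        position = len(self.rows)
        self.rows.append(row)
        for column, value in zip(self.columns, row):
            self.index[column][value].append(position)

    def where_equals(self, column, value):
        """Answer an equality filter from the index instead of scanning."""
        positions = self.index[column].get(value, [])
        return [self.rows[position] for position in positions]


def make_demo_table():
    """Hypothetical sales data: (country, amount)."""
    table = AutoIndexedTable(["country", "amount"])
    for row in [("IL", 10), ("US", 20), ("IL", 30)]:
        table.append(row)
    return table


if __name__ == "__main__":
    print(make_demo_table().where_equals("country", "IL"))
```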
HAWQ
HAWQ (HAdoop With Query) is a commercial SQL-on-Hadoop platform by
Pivotal, a subsidiary of EMC. It provides a parallel SQL query engine using
Pivotal’s Greenplum Analytic Database and Hadoop’s HDFS for data
storage. This engine is intended for analytics with full
transaction support, and it can create external tables on HDFS that
read text, Hive, HBase, and soon Parquet. About a year ago, Pivotal was
criticized on the grounds that this is not a true Hadoop product: the
company claims to have over 300 engineers working on Hadoop, yet
none of them contribute to any of the Hadoop-related projects. As these
lines are written, that’s still true.
▪ Bottom line - Pivotal’s SQL-on-Hadoop
XPLENTY
WWW.XPLENTY.COM

12 SQL On-Hadoop Tools

  • 1. 12 SQL-ON-HADOOP TOOLS Saggi Neumann - CTO and co-founder, Xplenty
  • 2. BRINGING SQL TO HADOOP In our recent post, 8 SQL-on-Hadoop challenges, we quickly listed several tools that help to bridge the gap between the two technologies without going into details. This time we’ll dive in and learn about 12 tools that bring SQL to Hadoop in various ways.
  • 4. APACHE HIVE Initially developed by Facebook, Apache Hive is a data warehouse infrastructure that is built on top of Hadoop. It allows querying data stored on HDFS for analysis via HQL, an SQL-like language that is translated to MapReduce jobs. Although it seems to provide SQL functionality, Hive performs batch processing on Hadoop and does not provide interactive querying. It stores metadata in a relational database and requires maintaining a schema for the data. Only four file formats are supported by Hive: text, SequenceFile, ORC and RCFile. Hive supports processing compressed data on Hadoop and also user defined functions. ▪ Bottom line - batch processing on Hadoop with an SQL like language
  • 5. APACHE SQOOP Apache Sqoop allows importing and exporting data from relational databases to Hadoop via JDBC, the standard API for connecting to databases with Java. It can also work without JDBC as long as the relevant tools allow bulk import/export of data. Sqoop works by running a query on the relational database and exporting the resulting rows into files in either one of these formats: text, binary, Avro, or Sequence Files. These files can then be saved on Hadoop’s HDFS. They can also be exported from Hadoop back into a relational database. Finally, Sqoop integrates with HCatalog, a table and storage management service for Hadoop that allows querying Sqoop’s imported files via Hive or Pig. See our Sqoop blog post for more info. ▪ Bottom line - import/export data from SQL databases to/from Apache Hadoop
  • 6. BIGSQL BigSQL is a pre-made package of PostgreSQL and Hadoop that you can easily download and install to try out on your local machine. Aside from Apache Hadoop and PostgreSQL, it also includes Cassandra, Tez, Hive, Zookeeper, and HadoopFDW. Extra components such as Pig, Sqoop, and HBase can be downloaded additionally. ▪ Bottom line - pre-made package for trying out Hadoop with PostgreSQL on your machine
  • 7. LINGUAL While other tools provide SQL-like syntax, Cascading’s Lingual claims to provide a full ANSI SQL interface for Hadoop, thus allowing for easier integration with existing BI tools and helping SQL skilled personnel to use Hadoop immediately. Lingual supports JDBC and also includes an SQL shell. Despite the SQL interface, it still executes queries on Hadoop in batch processing. ▪ Bottom line - ANSI SQL interface for Hadoop
  • 8. APACHE PHOENIX Apache Phoenix is an SQL skin for interactive queries over HBase. It compiles SQL queries into a series of HBase scans and produces JDBC result sets. Note that it requires maintaining a schema, which can be built from scratch or mapped from an existing HBase table. Furthermore, there are several features Phoenix doesn’t support: full transaction support, derived tables, relational operators, and miscellaneous built-in functions (although they can be added manually). The project is mainly maintained by Salesforce, Intel, and Hortonworks. ▪ Bottom line - interactive SQL over HBase
  • 9. IMPALA Cloudera’s Impala is a query engine that runs on top of Hadoop and executes interactive SQL queries on HDFS and HBase. While Hive runs queries as batch jobs, Impala runs them in real time, thus integrating SQL-based business intelligence tools with Hadoop. Although Cloudera is the main developer behind this tool, it is fully open source. It supports the following file formats: text, LZO, SequenceFile, Avro, and RCFile. Impala can also run in the cloud via Amazon’s Elastic MapReduce. ▪ Bottom line - Cloudera’s solution for interactive SQL queries over HDFS and HBase
  • 10. PRESTO Presto is also an interactive SQL query engine. It runs on top of Hive, HBase, and even relational databases and proprietary data stores, thus combining data from multiple sources across the organization. Facebook is the main developer behind Presto and the company uses it to query internal data stores, including a 300PB data warehouse. Airbnb and Dropbox also use Presto, so it seems tried and tested for the enterprise. ▪ Bottom line - Facebook’s solution for interactive SQL queries over Hive and HBase
  • 11. CITUSDB CitusDB (not to be confused with CitrusDB) is another interactive querying engine with SQL-like functionality that works over Hadoop. It’s based on Dremel, Google’s version of a real-time analytics database to process Big Data, and unlike Impala and Presto it uses PostgreSQL as the SQL engine that works behind the scenes. CitusDB can run on-premises or in the cloud and supports features such as full-text search and geo search as well as ODBC/JDBC compatibility. However, being an analytical database, it only supports loading the data in batches. ▪ Bottom line - SQL on Hadoop interactive querying with PostgreSQL
  • 12. INFINIDB InfiniDB is a columnar database that integrates with HDFS to perform real-time analytics on Hadoop with MySQL compatibility. The data is stored on disk in InfiniDB’s own columnar format, with support for MySQL’s major data types. Other formats and non-relational data structures aren’t supported, although Parquet is on the long-term roadmap. Its developers recently ran benchmarks against other open source SQL-on-Hadoop engines and claim to have much better performance than Hive and Presto. InfiniDB also supports windowing functions for analytics.
  • 14. HADAPT Hadapt is a commercial product that brings a native SQL implementation to Hadoop. Because it combines Hadoop with a storage layer of a relational database, it allows querying Hadoop via SQL interactively rather than as a batch process. It can handle structured and unstructured data without a predefined schema. ▪ Bottom line - interactive SQL querying on Hadoop
  • 15. JETHRO DATA Jethro claims the title of "fastest SQL on Hadoop" by providing an SQL engine for Hadoop that automatically indexes the data as soon as it is written to Hadoop. According to the company, it executes queries 100 times faster than Hive and 10 times faster than Impala. Jethro can be added to an existing Hadoop cluster and is supposed to be non-intrusive; it isn’t installed on any of the Hadoop storage nodes. ▪ Bottom line - fast non-intrusive SQL-on-Hadoop via auto-indexing
  • 16. HAWQ HAWQ (HAdoop With Query) is a commercial SQL-on-Hadoop platform by Pivotal, a subsidiary of EMC. It provides a parallel SQL query engine using Pivotal’s Greenplum Analytic Database and Hadoop’s HDFS for data storage. This engine is supposed to be useful for analytics with full transaction support, and it supports creating external tables on HDFS that read text, Hive, HBase, and soon Parquet. About a year ago, Pivotal was criticized for not offering a true Hadoop product: the company claims to have over 300 engineers working on Hadoop, yet none of them contribute to any of the Hadoop-related projects. At the time of writing, that’s still true. ▪ Bottom line - Pivotal’s SQL-on-Hadoop