With the introduction of Foreign Data Wrappers in Postgres 9.1, access to distributed systems such as HDFS, HBase, and Hive, with their multiple data formats, is feasible.
However, the existing FDW implementations for Big Data systems, such as HDFS or Hive, lack a few key features and do not share a common framework.
This talk introduces PXF, an open source project that provides a unified, extensible framework for accessing data in any distributed system. PXF is currently used by Apache HAWQ's external tables via a REST API and is in the process of being integrated with other SQL engines. Existing plugins support loading and querying data stored in HDFS, HBase, and Hive, covering a wide range of data formats such as Text, Avro, SequenceFile, Hive RCFile, ORC, and Parquet. The pluggable framework makes it very convenient to add new custom plugins, and it also supports advanced statistics and filter pushdown.
With the integration of PXF into Postgres FDW, we can achieve a single unified pluggable framework to read and write data in any distributed system.
1. Shivram Mani (Pivotal)
Unified Framework for
Big Data Foreign Data Wrappers
@ FOSDEM PGDay 2016
2. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
3. Agenda
➢ Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
4. What is Hadoop/Big Data
Apache Hadoop is an open source framework for distributed processing of large data sets across clusters
of computers.
● Commodity Hardware
● Scale out
● Fault tolerance
● Support multiple file formats
[Diagram: Hadoop stack]
● Top-level interfaces: ETL Tools, BI Tools, RDBMS
● Top-level abstractions: Hive, Pig
● Distributed data processing: MapReduce, HBase
● Clustered file system: Hadoop Distributed File System (HDFS)
5. Agenda
● Introduction to Hadoop Ecosystem
➢ Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
6. Motivations: SQL on Hadoop
[Diagram: RDBMS ←?→ the various formats and storage systems supported on HDFS. The bridge: Foreign Tables!]
The RDBMS side brings:
● ANSI SQL
● Cost-based optimizer
● Transactions
● Indexes
7. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
➢ Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
8. Foreign Data Wrappers (FDW)
Foreign tables and foreign data wrappers are Postgres's way of reading external data.
1. Create the FDW (compiled C functions in the handler)
2. Declare the extension (FDW)
3. Create a server that uses the wrapper
4. Create a foreign table that uses the server
CREATE FOREIGN DATA WRAPPER hadoop_fdw
  HANDLER hadoop_fdw_handler
  NO VALIDATOR;

CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server
  FOREIGN DATA WRAPPER hadoop_fdw
  OPTIONS (address '127.0.0.1', port '10000');

CREATE FOREIGN TABLE retail_history (
  name text,
  price double precision )
  SERVER hadoop_server
  OPTIONS (table 'example.retail_history');
9. Foreign Data Wrappers - Implementation
Creating a new foreign data wrapper simply consists of implementing the FDW API as C-language functions.
Scanning a foreign table requires implementation of the following:
● GetForeignRelSize - Estimate of the relation size
● GetForeignPaths - Get access paths for the foreign data
● GetForeignPlan - Plan the foreign paths of this table
● BeginForeignScan - Start scan. Open connections, etc
● IterateForeignScan - Perform scan and return tuples
● EndForeignScan - End scan. Close connection, etc
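The scan lifecycle above can be sketched as follows. This is an illustrative Python model of the callback sequence only; the real API consists of C functions registered in an FdwRoutine struct, and the names and signatures here are simplified, not the actual Postgres API:

```python
# Illustrative model of the FDW scan lifecycle. The real callbacks are
# C functions registered in an FdwRoutine struct; names are simplified.

class DemoWrapper:
    """Minimal foreign-table scan over an in-memory 'remote' source."""

    def __init__(self, remote_rows):
        self.remote_rows = remote_rows
        self._cursor = None

    def get_foreign_rel_size(self):
        # GetForeignRelSize: estimate the relation size for the planner.
        return len(self.remote_rows)

    def begin_foreign_scan(self):
        # BeginForeignScan: open connections, initialize scan state.
        self._cursor = iter(self.remote_rows)

    def iterate_foreign_scan(self):
        # IterateForeignScan: return the next tuple, or None when done.
        return next(self._cursor, None)

    def end_foreign_scan(self):
        # EndForeignScan: close connections, release resources.
        self._cursor = None


# The executor drives the scan: begin, iterate until exhausted, end.
wrapper = DemoWrapper([("apple", 1.5), ("pear", 2.0)])
wrapper.begin_foreign_scan()
rows = []
while (row := wrapper.iterate_foreign_scan()) is not None:
    rows.append(row)
wrapper.end_foreign_scan()
print(rows)
```

The planner-facing callbacks (GetForeignRelSize, GetForeignPaths, GetForeignPlan) run once per query, while IterateForeignScan is called once per returned tuple.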
10. Big Data Wrappers (Multicorn, BigSQL, EnterpriseDB)
Typical flow:
1. Create a Hive table corresponding to the HDFS file / HBase table
2. Create the extension, server & foreign table with the schema and necessary options
3. Query the foreign table
4. The query connects to HiveServer via a Thrift client
5. The Hive server executes MapReduce jobs
6. Results are mapped to a Postgres table
12. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
➢ PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
13. HAWQ Extension Framework - PXF
● HAWQ is an MPP SQL engine on HDFS (evolved from Greenplum Database)
● PXF is an extensible framework that allows HAWQ to query external data.
● PXF includes built-in connectors for accessing data in HDFS files and Hive & HBase tables.
● Users can create custom connectors to other parallel data stores or processing engines.
14. PXF - Communication
[Diagram: HAWQ segments scan external tables over HTTP (port 51200) against the REST API of the PXF webapp, hosted in Apache Tomcat; the webapp reaches the underlying stores through their Java APIs. Native-table HDFS access on the segments goes through libhdfs3, written in C.]
15. Architecture - Deployment
[Diagram: the HAWQ master runs on the NameNode host (NN, pxf, HBase Master); DataNodes DN1-DN4 each run pxf and a HAWQ segment (seg1-seg4), with HBase Region Servers 1-3 on DN1-DN3.]
* PXF needs to be installed on all DataNodes
* PXF is recommended to be installed on the NameNode
16. Design - Components (PXF)
● Fragmenter - Gets the locations of the fragments of an external table; implicitly provides stats to the query optimizer
● Accessor - Understands and reads/writes a fragment, returning records
● Resolver - Converts records to a HAWQ-consumable format (data types)
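The three components form a pipeline: the Fragmenter locates the data, the Accessor reads each fragment, and the Resolver types the records. Real PXF plugins are Java classes; the following is a hedged Python sketch of that contract, with illustrative class and method names rather than the actual PXF API:

```python
# Illustrative sketch of the PXF plugin pipeline: Fragmenter -> Accessor
# -> Resolver. Real plugins are Java; names here are simplified.

class DemoFragmenter:
    def get_fragments(self, data):
        # Return fragment descriptors: where each piece of data lives.
        return [{"index": i, "host": f"dn{i + 1}"} for i in range(len(data))]

class DemoAccessor:
    def read_fragment(self, fragment, data):
        # Read and return the raw records of one fragment.
        return data[fragment["index"]]

class DemoResolver:
    def resolve(self, record):
        # Convert a raw record into typed fields the engine can consume.
        name, price = record.split(",")
        return (name, float(price))

# Two fragments of CSV-like records standing in for blocks on HDFS.
data = [["apple,1.5", "pear,2.0"], ["plum,0.75"]]
fragmenter, accessor, resolver = DemoFragmenter(), DemoAccessor(), DemoResolver()

rows = []
for frag in fragmenter.get_fragments(data):
    for record in accessor.read_fragment(frag, data):
        rows.append(resolver.resolve(record))
print(rows)
```

In the real system the loop is distributed: each HAWQ segment runs the Accessor/Resolver stage only for the fragments assigned to it.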
17. DDL Comparison

FDW:
CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server
  FOREIGN DATA WRAPPER hadoop_fdw
  OPTIONS (address '127.0.0.1', port '10000');

CREATE FOREIGN TABLE retail_history (
  name text,
  price double precision )
  SERVER hadoop_server
  OPTIONS (table 'example.retail_history');

PXF:
CREATE PROTOCOL pxf;

CREATE EXTERNAL TABLE retail_history (
  name text,
  price double precision )
LOCATION ('pxf://127.0.0.1:51200/example.retail_history?PROFILE=Hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

* Corresponding items on the two sides perform a similar action
18. Architecture - Data Flow: Query (HDFS)
select * from ext_table0  (LOCATION pxf://<namenode>:<port>/path/to/data)
1. The HAWQ master calls getFragments() on PXF via REST (served by the Fragmenter)
2. PXF returns the fragments as JSON
3. The master computes the split mapping (fragment -> segment)
4. The query is dispatched to segments 1, 2, 3… (Interconnect)
5. Each segment issues Read() to its local PXF via REST
6. PXF reads and converts the records (Accessor, Resolver)
7. Records are streamed back to the segment
8. The query result is returned to the master
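The split mapping in step 3 can be sketched as a simple round-robin assignment of fragments to segments. This is illustrative only; HAWQ's actual planner also considers the fragment host locality reported by the Fragmenter:

```python
# Illustrative fragment-to-segment assignment (round-robin). The real
# HAWQ planner also uses locality so segments prefer local DataNode blocks.

def assign_fragments(fragments, segments):
    # Round-robin mapping so each segment reads a disjoint subset.
    mapping = {seg: [] for seg in segments}
    for i, frag in enumerate(fragments):
        mapping[segments[i % len(segments)]].append(frag)
    return mapping

fragments = [f"fragment-{i}" for i in range(7)]
segments = ["seg1", "seg2", "seg3"]
mapping = assign_fragments(fragments, segments)
print(mapping)
```

Each segment then drives steps 5-7 independently for its own fragment list, which is what makes the scan parallel.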
20. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
➢ Demo
● Benefits of using PXF with FDW
● Q&A
22. PXF as Big Data Wrapper Abstraction
● Implement FDW callback functions that will interact with PXF.
● Use the enhanced libcurl library - libchurl
[Diagram: the FDW talks over HTTP (port 51200) to the REST API of the PXF webapp in Apache Tomcat, which reaches the underlying stores through their Java APIs.]
23. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
● Demo
➢ Benefits of using PXF with FDW
● Q&A
24. Benefits of using PXF with FDW
● FDW isolated from the underlying Hadoop ecosystem APIs
● Direct access to HDFS data
● Access Hive data without the overhead of the underlying execution framework
● Access HBase data without a mapped Hive table
● Supports single-node & parallel execution
● Extensibility / ease of building extensions
● Support for multiple versions of underlying distributions
● Built-in filter pushdown and support for stats
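Filter pushdown from the last bullet means the engine serializes the query's predicates and lets the accessor apply them while reading at the source, so only matching records cross the wire. A minimal sketch, not PXF's actual filter-string encoding:

```python
# Illustrative filter pushdown: the predicate is evaluated at the data
# source during the scan, rather than after transferring every record.
# This is not PXF's real filter encoding, just the idea.

def scan_with_pushdown(records, predicate):
    # The accessor applies the pushed-down predicate while reading.
    return [r for r in records if predicate(r)]

records = [("apple", 1.5), ("pear", 2.0), ("plum", 0.75)]
cheap = scan_with_pushdown(records, lambda r: r[1] < 1.8)
print(cheap)
```

For a selective predicate over a large HDFS file, this cuts both network transfer and the number of tuples the engine must materialize.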