With the introduction of Foreign Data Wrappers in Postgres 9.1, access to distributed systems such as HDFS, HBase, and Hive, with their multiple data formats, is feasible.
However, the existing FDW implementations for Big Data systems, such as HDFS or Hive, lack a few key features and do not share a common framework.
This talk introduces PXF, an open source project that provides a unified, extensible framework for accessing data in any distributed system. PXF is currently used by Apache HAWQ's external tables via a REST API and is in the process of being integrated with other SQL engines. Existing plugins support loading and querying data stored in HDFS, HBase, and Hive, covering a wide range of data formats such as Text, Avro, SequenceFile, Hive RCFile, ORC, and Parquet. The pluggable framework makes it very convenient to add new custom plugins, and it also supports advanced statistics and filter pushdown.
With the integration of PXF into Postgres FDW, we can achieve a single unified pluggable framework to read and write data in any distributed system.
1. Shivram Mani (Pivotal)
Unified Framework for
Big Data Foreign Data Wrappers
@ FOSDEM PGDay 2016
2. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
3. Agenda
➢ Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
4. What is Hadoop/Big Data
Apache Hadoop is an open source framework for distributed processing of large data sets across clusters
of computers.
● Commodity Hardware
● Scale out
● Fault tolerance
● Support multiple file formats
[Diagram: Hadoop stack]
● Top-level interfaces: ETL Tools, BI Tools, RDBMS
● Top-level abstractions: Hive, Pig
● Distributed data processing: MapReduce, HBase
● Clustered file system: Hadoop Distributed File System (HDFS)
5. Agenda
● Introduction to Hadoop Ecosystem
➢ Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop (FDW/Big data wrappers)
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
6. Motivations: SQL on Hadoop
[Diagram: RDBMS ←?→ the various formats and storage systems supported on HDFS. The bridge: Foreign Tables!]
The RDBMS side brings:
● ANSI SQL
● Cost-based optimizer
● Transactions
● Indexes
7. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
➢ Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
8. Foreign Data Wrappers (FDW)
Foreign tables and foreign data wrappers are Postgres's way of reading external data.
1. Create the FDW (compiled C functions in the handler)
2. Declare the extension (FDW)
3. Create a server that uses the wrapper
4. Create a foreign table that uses the server
CREATE FOREIGN DATA WRAPPER hadoop_fdw
  HANDLER hadoop_fdw_handler
  NO VALIDATOR;

CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server
  FOREIGN DATA WRAPPER hadoop_fdw
  OPTIONS (address '127.0.0.1', port '10000');

CREATE FOREIGN TABLE retail_history (
  name text,
  price double precision )
  SERVER hadoop_server
  OPTIONS (table 'example.retail_history');
9. Foreign Data Wrappers - Implementation
Creating a new foreign data wrapper simply consists of implementing the FDW API as C-language functions.
Scanning a foreign table requires implementation of the following:
● GetForeignRelSize - Estimate of the relation size
● GetForeignPaths - Get access paths for the foreign data
● GetForeignPlan - Plan the foreign paths of this table
● BeginForeignScan - Start scan. Open connections, etc
● IterateForeignScan - Perform scan and return tuples
● EndForeignScan - End scan. Close connection, etc
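The scan lifecycle above can be sketched as follows. This is an illustrative Python model of the callback sequence only; the real API consists of C functions registered in an FdwRoutine struct, and the names and signatures here are simplified, not the actual Postgres API:

```python
# Illustrative model of the FDW scan lifecycle. The real callbacks are
# C functions registered in an FdwRoutine struct; names are simplified.

class DemoWrapper:
    """Minimal foreign-table scan over an in-memory 'remote' source."""

    def __init__(self, remote_rows):
        self.remote_rows = remote_rows
        self._cursor = None

    def get_foreign_rel_size(self):
        # GetForeignRelSize: estimate the relation size for the planner.
        return len(self.remote_rows)

    def begin_foreign_scan(self):
        # BeginForeignScan: open connections, initialize scan state.
        self._cursor = iter(self.remote_rows)

    def iterate_foreign_scan(self):
        # IterateForeignScan: return the next tuple, or None when done.
        return next(self._cursor, None)

    def end_foreign_scan(self):
        # EndForeignScan: close connections, release resources.
        self._cursor = None


# The executor drives the scan: begin, iterate until exhausted, end.
wrapper = DemoWrapper([("apple", 1.5), ("pear", 2.0)])
wrapper.begin_foreign_scan()
rows = []
while (row := wrapper.iterate_foreign_scan()) is not None:
    rows.append(row)
wrapper.end_foreign_scan()
print(rows)
```

The planner-facing callbacks (GetForeignRelSize, GetForeignPaths, GetForeignPlan) run once per query, while IterateForeignScan is called once per returned tuple.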
10. Big Data Wrappers (Multicorn, BigSQL, EnterpriseDB)
Typical flow:
1. Create a Hive table corresponding to the HDFS file / HBase table
2. Create the extension, server & foreign table with the schema and necessary options
3. Query the foreign table
4. The query connects to HiveServer via a Thrift client
5. The Hive server executes MapReduce jobs
6. Results are mapped to a Postgres table
12. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
➢ PXF - Design & Architecture
● Demo
● Benefits of using PXF with FDW
● Q&A
13. HAWQ Extension Framework - PXF
● HAWQ is an MPP SQL engine on HDFS (evolved from Greenplum Database)
● PXF is an extensible framework that allows HAWQ to query external data.
● PXF includes built-in connectors for accessing data in HDFS files and Hive & HBase tables.
● Users can create custom connectors to other parallel data stores or processing engines.
14. PXF - Communication
[Diagram: HAWQ segments scan external tables over HTTP (port 51200) against the REST API of the PXF webapp, hosted in Apache Tomcat; the webapp reaches the underlying stores through their Java APIs. Native-table HDFS access on the segments goes through libhdfs3, written in C.]
15. Architecture - Deployment
[Diagram: the HAWQ master runs on the NameNode host (NN, pxf, HBase Master); DataNodes DN1-DN4 each run pxf and a HAWQ segment (seg1-seg4), with HBase Region Servers 1-3 on DN1-DN3.]
* PXF needs to be installed on all DataNodes
* PXF is recommended to be installed on the NameNode
16. Design - Components (PXF)
● Fragmenter - Gets the locations of the fragments of an external table; implicitly provides stats to the query optimizer
● Accessor - Understands and reads/writes a fragment, returning records
● Resolver - Converts records to a HAWQ-consumable format (data types)
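The three components form a pipeline: the Fragmenter locates the data, the Accessor reads each fragment, and the Resolver types the records. Real PXF plugins are Java classes; the following is a hedged Python sketch of that contract, with illustrative class and method names rather than the actual PXF API:

```python
# Illustrative sketch of the PXF plugin pipeline: Fragmenter -> Accessor
# -> Resolver. Real plugins are Java; names here are simplified.

class DemoFragmenter:
    def get_fragments(self, data):
        # Return fragment descriptors: where each piece of data lives.
        return [{"index": i, "host": f"dn{i + 1}"} for i in range(len(data))]

class DemoAccessor:
    def read_fragment(self, fragment, data):
        # Read and return the raw records of one fragment.
        return data[fragment["index"]]

class DemoResolver:
    def resolve(self, record):
        # Convert a raw record into typed fields the engine can consume.
        name, price = record.split(",")
        return (name, float(price))

# Two fragments of CSV-like records standing in for blocks on HDFS.
data = [["apple,1.5", "pear,2.0"], ["plum,0.75"]]
fragmenter, accessor, resolver = DemoFragmenter(), DemoAccessor(), DemoResolver()

rows = []
for frag in fragmenter.get_fragments(data):
    for record in accessor.read_fragment(frag, data):
        rows.append(resolver.resolve(record))
print(rows)
```

In the real system the loop is distributed: each HAWQ segment runs the Accessor/Resolver stage only for the fragments assigned to it.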
17. DDL Comparison

FDW:
CREATE EXTENSION hadoop_fdw;

CREATE SERVER hadoop_server
  FOREIGN DATA WRAPPER hadoop_fdw
  OPTIONS (address '127.0.0.1', port '10000');

CREATE FOREIGN TABLE retail_history (
  name text,
  price double precision )
  SERVER hadoop_server
  OPTIONS (table 'example.retail_history');

PXF:
CREATE PROTOCOL pxf;

CREATE EXTERNAL TABLE retail_history (
  name text,
  price double precision )
LOCATION ('pxf://127.0.0.1:51200/example.retail_history?PROFILE=Hive')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');

* Corresponding items on the two sides perform a similar action
18. Architecture - Data Flow: Query (HDFS)
select * from ext_table0  (LOCATION pxf://<namenode>:<port>/path/to/data)
1. The HAWQ master calls getFragments() on PXF via REST (served by the Fragmenter)
2. PXF returns the fragments as JSON
3. The master computes the split mapping (fragment -> segment)
4. The query is dispatched to segments 1, 2, 3… (Interconnect)
5. Each segment issues Read() to its local PXF via REST
6. PXF reads and converts the records (Accessor, Resolver)
7. Records are streamed back to the segment
8. The query result is returned to the master
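The split mapping in step 3 can be sketched as a simple round-robin assignment of fragments to segments. This is illustrative only; HAWQ's actual planner also considers the fragment host locality reported by the Fragmenter:

```python
# Illustrative fragment-to-segment assignment (round-robin). The real
# HAWQ planner also uses locality so segments prefer local DataNode blocks.

def assign_fragments(fragments, segments):
    # Round-robin mapping so each segment reads a disjoint subset.
    mapping = {seg: [] for seg in segments}
    for i, frag in enumerate(fragments):
        mapping[segments[i % len(segments)]].append(frag)
    return mapping

fragments = [f"fragment-{i}" for i in range(7)]
segments = ["seg1", "seg2", "seg3"]
mapping = assign_fragments(fragments, segments)
print(mapping)
```

Each segment then drives steps 5-7 independently for its own fragment list, which is what makes the scan parallel.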
20. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
➢ Demo
● Benefits of using PXF with FDW
● Q&A
22. PXF as Big Data Wrapper Abstraction
● Implement FDW callback functions that will interact with PXF.
● Use the enhanced libcurl library - libchurl
[Diagram: the FDW talks over HTTP (port 51200) to the REST API of the PXF webapp in Apache Tomcat, which reaches the underlying stores through their Java APIs.]
23. Agenda
● Introduction to Hadoop Ecosystem
● Why Postgres SQL on Hadoop
● Current state of SQL on Hadoop - FDW/Big data wrappers
● PXF - Design & Architecture
● Demo
➢ Benefits of using PXF with FDW
● Q&A
24. Benefits of using PXF with FDW
● FDW isolated from the underlying Hadoop ecosystem APIs
● Direct access to HDFS data
● Access Hive data without the overhead of the underlying execution framework
● Access HBase data without a mapped Hive table
● Supports single-node & parallel execution
● Extensibility / ease of building extensions
● Support for multiple versions of underlying distributions
● Built-in filter pushdown and support for stats
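Filter pushdown from the last bullet means the engine serializes the query's predicates and lets the accessor apply them while reading at the source, so only matching records cross the wire. A minimal sketch, not PXF's actual filter-string encoding:

```python
# Illustrative filter pushdown: the predicate is evaluated at the data
# source during the scan, rather than after transferring every record.
# This is not PXF's real filter encoding, just the idea.

def scan_with_pushdown(records, predicate):
    # The accessor applies the pushed-down predicate while reading.
    return [r for r in records if predicate(r)]

records = [("apple", 1.5), ("pear", 2.0), ("plum", 0.75)]
cheap = scan_with_pushdown(records, lambda r: r[1] < 1.8)
print(cheap)
```

For a selective predicate over a large HDFS file, this cuts both network transfer and the number of tuples the engine must materialize.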