One of the key differences between Presto and Hive, and a crucial functional requirement Facebook set when launching this new SQL engine project, is the ability to query different kinds of data sources through a uniform ANSI SQL interface.
Presto, an open source distributed analytical SQL engine, implements this with its connector architecture, which creates an abstraction layer for anything that can be expressed in a row-like format: MySQL tables, HDFS, Amazon S3, NoSQL stores, Kafka streams, and proprietary data sources. The Presto connector SPI allows anyone to implement a Presto connector and benefit from the capabilities of the Presto SQL engine, including joining data from various sources within a single SQL query.
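The idea of a row-like abstraction layer can be illustrated with a small Python sketch. This is not the Presto SPI (which is a Java API); `RowSource`, `ListSource`, `CsvTextSource` and `hash_join` are hypothetical names invented for this example. It only shows that once two very different sources both yield rows, a single join can combine them.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Protocol

Row = Dict[str, object]


class RowSource(Protocol):
    """Hypothetical stand-in for a connector: anything that can yield rows."""

    def scan(self) -> Iterator[Row]: ...


class ListSource:
    """Rows held in memory, standing in for e.g. a relational table."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = list(rows)

    def scan(self) -> Iterator[Row]:
        return iter(self._rows)


class CsvTextSource:
    """Rows parsed from CSV text, standing in for e.g. files in object storage."""

    def __init__(self, text: str) -> None:
        self._text = text

    def scan(self) -> Iterator[Row]:
        for record in csv.DictReader(io.StringIO(self._text)):
            yield dict(record)


def hash_join(left: RowSource, right: RowSource, key: str) -> Iterator[Row]:
    """Inner join on `key`: build a hash index on `right`, then probe with `left`."""
    index = defaultdict(list)
    for row in right.scan():
        index[row[key]].append(row)
    for row in left.scan():
        for match in index.get(row[key], []):
            yield {**row, **match}


users = ListSource([{"user_id": "1", "name": "ann"}, {"user_id": "2", "name": "bob"}])
events = CsvTextSource("user_id,event\n1,login\n1,query\n3,login\n")
joined = list(hash_join(users, events, "user_id"))
```

Here `joined` holds the two events of user `1` enriched with the user name; neither source needs to know the other exists.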
1. Presto - SQL on anything
January 2017
Grzegorz Kokosiński
Karol Sobczak
Teradata Center for Hadoop
2. Agenda
- Who are we?
- What is Presto?
- What is data federation?
- Different federation strategies in other databases (Hive)
- What is supported and what the problems are
- Presto Connector
- Show time
3. Let's make some noise
• Let's tweet about this presentation!
– #whug
– #prestodb
– #teradata
• Later on we will query that data!
5. What is Presto?
• 100% open source distributed SQL query engine
– Originally developed by Facebook
• Key differentiators:
– Performance & scale
– Cross-platform query capability, not only SQL on Hadoop
• Apache licensed, hosted on GitHub
– Certified distro & support from Teradata
7. Presto in Production
• Facebook
– Multiple production clusters (100s of nodes total)
– 300 PB in HDFS, sharded MySQL, SSD-based Raptor
– 1000s of internal daily active users
– 10s-100s of concurrent queries
• Netflix
– 250+ nodes on EC2, 40+ PB in S3 (Parquet format)
– Over 650 active users and 6K+ queries daily
• Twitter
– 200+ nodes on-premises over Parquet nested data
• Uber
– 200+ nodes (2 dedicated clusters) with 25K+ & 3K+ queries daily
• FINRA
– 120+ nodes in AWS, 2 PB in S3, 200+ users (supported by Teradata)
8. Presto – Query Execution Performance
• In-memory processing
• Pipelined execution across nodes (MPP-style)
– Vectorized columnar processing
– Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient memory management (reduced GC overhead)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
– Enables the use of BI tools
9. Supported data sources & file formats
• Hadoop/Hive connector & file formats (HDFS/S3):
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Raptor
– Columnar store on flash, driven by Facebook
• Open source data stores (driven by the community)
– MySQL & PostgreSQL (non-parallel)
– Cassandra (by Teradata)
– Kafka
– Redis
– MongoDB
– Elasticsearch
– Accumulo (by Bloomberg)
10. ANSI SQL Support
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
In addition:
• Windowing functions
• UNNEST, TABLESAMPLE
• ROLLUP, CUBE, GROUPING SETS
• UNION, EXCEPT, INTERSECT
• Subqueries (EXISTS, IN)
11. Presto is not a database!
• Presto is a query execution engine (storage independent)
• Pluggable, user-provided functionality
– Connectors
– Functions
– Types
– System access controllers
– Resource group configuration managers
– Event listeners
– …
• Built-in core functionality:
– parser, execution, types, SQL functions, monitoring
12. Data federation
• Query data from several data sources (databases)
• Streaming
– One to One
- there is a single connection between database access points
- e.g. PSQL via PSQL
- using storage handlers to access RDBMS data from Hive
– Many to One
- many connections from the nodes of one database to a single access point of the other database
- e.g. accessing REST from a UDF in (possibly each) Hive map/reduce task
– Many to Many
- workers talk to each other directly
• Through storage
– Needs (intermittent) data materialization
• Presto supports them all!
13. Data federation common problems
• model incompatibilities
• multinode streaming is not always possible
• transactions
• cost based optimizations (statistics)
• SQL pushdown (predicates, projections, aggregations?, joins?)
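To make the pushdown item concrete, here is a minimal Python sketch, under the assumption of a toy in-memory source rather than any real Presto connector. `ToySource`, `rows_sent`, `query_without_pushdown` and `query_with_pushdown` are hypothetical names. The sketch compares how many rows leave the source when the filter and column list are applied in the engine versus inside the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Sequence

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


@dataclass
class ToySource:
    """Hypothetical remote source that counts the rows it ships to the engine."""

    rows: List[Row]
    rows_sent: int = 0

    def scan(
        self,
        columns: Optional[Sequence[str]] = None,
        predicate: Optional[Predicate] = None,
    ) -> Iterator[Row]:
        for row in self.rows:
            # Predicate pushdown: rows rejected here never leave the source.
            if predicate is not None and not predicate(row):
                continue
            self.rows_sent += 1
            # Projection pushdown: only the requested columns are shipped.
            if columns is None:
                yield dict(row)
            else:
                yield {c: row[c] for c in columns}


def query_without_pushdown(source: ToySource, columns: Sequence[str], predicate: Predicate) -> List[Row]:
    """The engine pulls every row and every column, then filters and projects locally."""
    return [{c: r[c] for c in columns} for r in source.scan() if predicate(r)]


def query_with_pushdown(source: ToySource, columns: Sequence[str], predicate: Predicate) -> List[Row]:
    """The source applies the filter and the column list before sending anything."""
    return list(source.scan(columns=columns, predicate=predicate))


data = [{"id": i, "country": "PL" if i % 4 == 0 else "US", "amount": i * 10} for i in range(8)]
is_pl = lambda r: r["country"] == "PL"

plain, pushed = ToySource(data), ToySource(data)
result_plain = query_without_pushdown(plain, ["id", "amount"], is_pl)
result_pushed = query_with_pushdown(pushed, ["id", "amount"], is_pl)
```

Both queries return the same two rows, but `plain.rows_sent` is 8 while `pushed.rows_sent` is 2. The question marks on aggregations and joins in the slide reflect that not every source can evaluate those operations itself.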
14. Connector
• Presto's interface for accessing an arbitrary data source (Hive, MySQL, JMX)
• Provides:
– metadata
– distributed, parallel, and streamed reads/writes
– transaction boundary
– physical data layouts
– statistics
– (SQL) predicate pushdown
– indexes (index join)
– session or table properties
– access control
– procedures (CALL …)
– …
• Most (if not all) of the above points are optional
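The shape of such an interface can be sketched in Python. This is an illustration only and does not mirror the real Presto SPI classes; `ToyConnector`, `RangeConnector`, `get_splits`, `read_split`, `row_count_estimate` and `parallel_scan` are hypothetical names. It shows metadata, splitting a table into independently readable pieces, reading those pieces in parallel, and one optional capability (statistics) that a connector may leave unimplemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List, Optional, Tuple

Row = Dict[str, object]
Split = Tuple[int, int]  # half-open range [start, end) of row positions


class ToyConnector:
    """Hypothetical connector contract: metadata, splits, and split reads."""

    def list_columns(self) -> List[str]:
        raise NotImplementedError

    def get_splits(self, max_splits: int) -> List[Split]:
        raise NotImplementedError

    def read_split(self, split: Split) -> Iterator[Row]:
        raise NotImplementedError

    def row_count_estimate(self) -> Optional[int]:
        # Optional capability: returning None means "no statistics available".
        return None


class RangeConnector(ToyConnector):
    """Generates rows 0..n-1 on the fly; stands in for a real data source."""

    def __init__(self, n: int) -> None:
        self._n = n

    def list_columns(self) -> List[str]:
        return ["n", "square"]

    def get_splits(self, max_splits: int) -> List[Split]:
        size = max(1, -(-self._n // max_splits))  # ceiling division
        return [(s, min(s + size, self._n)) for s in range(0, self._n, size)]

    def read_split(self, split: Split) -> Iterator[Row]:
        start, end = split
        for i in range(start, end):
            yield {"n": i, "square": i * i}  # rows are streamed, not materialized


def parallel_scan(connector: ToyConnector, workers: int = 4) -> List[Row]:
    """Read every split on its own thread, then concatenate in split order."""
    splits = connector.get_splits(workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda s: list(connector.read_split(s)), splits)
        return [row for chunk in chunks for row in chunk]


connector = RangeConnector(10)
rows = parallel_scan(connector, workers=4)
```

In this sketch the threads play the role that workers play in a cluster: each one handles a split without coordinating with the others. `RangeConnector` inherits the default `row_count_estimate`, which is one way to model an optional capability.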
15. Presto Architecture
[Diagram: Client → Coordinator (Parser/analyzer, Planner, Scheduler) → Workers. Pluggable parts: Metadata API, Data location API, and the Data stream API on the workers.]
16. Data federation with Presto
• Through the storage
• Demo
– Hive
[Diagram: the Presto coordinator exchanges metadata with the Hive Metastore and the HDFS NameNode; the Presto workers transfer data to and from the HDFS DataNodes.]
17. Data federation with Presto
• One to One
• Demo
– psql
– REST
– and the above with Hive
[Diagram: Presto coordinator and Presto workers connected to a SQL database; labels: JDBC metadata, JDBC data.]
18. Many to many - data federation with Presto
[Diagram: Teradata (PE, AMPs) and Presto (Coordinator, Worker Threads), each side with a QG Exchange component. Labels: init & metadata exchange; bi-directional, fully parallel data exchange.]
• Key features:
– Low latency
– High performance
– Concurrency
– SQL pushdown
– Data conversion
– Compression
– Efficient CPU usage
19. Conclusion
• The Presto connector interface is expressive
• A 3rd-party data source is a 1st-class citizen
• Single ANSI SQL to rule them all
– use BI tools on data which is not BI friendly
• Rapid data integration
20. More information
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto Users Group: www.groups.google.com/group/presto-users
GitHub:
www.github.com/prestodb/presto
www.github.com/Teradata/presto