Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Presto: Fast SQL-on-Anything Across Data Lakes, DBMS, and NoSQL Data Stores
Kamil Bajda-Pawlikowski, CTO, Starburst Data
About Alluxio: alluxio.io
Engage with the open source community on Slack: alluxio.io/slack
1. Presto: Fast SQL-on-Anything across Data Lakes, DBMS, and NoSQL data stores
Kamil Bajda-Pawlikowski
Co-founder and CTO
Data Orchestration Summit 2020
2. What is Presto?
Community-driven open source project
High performance MPP SQL engine
• Interactive ANSI SQL queries
• Proven scalability
• High concurrency
Deploy Anywhere
• Kubernetes
• Cloud (AWS, Azure, GCP)
• On premises
Separation of compute & storage
• Scale storage & compute independently
• SQL-on-anything
• Federated queries
3. About Starburst
Our Platform
• Enterprise Grade Security
• On-Prem or Cloud
• Rapid Time to Insights
• Low Cost of Ownership
• 24x7 Expert Support
• ANSI SQL MPP Query Engine
• High Concurrency
• Massive Scale
Company
• Named Open Source Startup to Watch 2020
• 600% Growth YoY
• 100+ Enterprise Customers
• NPS Score 80+
6. Why Delta Lake?
▪ ACID properties over data lake
▪ Open source table format
▪ Stored as Parquet files
▪ Object storage support
▪ Schema evolution
▪ Time travel feature
▪ Metadata & statistics
▪ Data skipping & z-ordering
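The ACID and time travel bullets both rest on Delta Lake's transaction log: each commit is a file of JSON actions, and the set of live Parquet files at a given version is obtained by replaying those actions in order. The sketch below is a simplified illustration in Python, not the reader's actual code. It handles only `add` and `remove` actions and ignores metadata, protocol, and checkpoint handling; the `active_files` helper and the three-commit history are hypothetical.

```python
import json


def active_files(commits, as_of_version=None):
    """Replay commits in order and return the set of live data file paths.

    `commits` is a list where index i holds the newline-delimited JSON text
    of commit version i (a stand-in for the files under _delta_log/).
    Passing `as_of_version` stops the replay early, which is the essence
    of time travel.
    """
    live = set()
    for version, text in enumerate(commits):
        if as_of_version is not None and version > as_of_version:
            break
        for line in text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # Other action types are ignored in this sketch.
    return live


# Hypothetical history: two inserts, then a rewrite that replaces part-0.
commits = [
    '{"add": {"path": "part-0.parquet"}}\n{"add": {"path": "part-1.parquet"}}',
    '{"add": {"path": "part-2.parquet"}}',
    '{"remove": {"path": "part-0.parquet"}}\n{"add": {"path": "part-3.parquet"}}',
]

latest = active_files(commits)
version_1 = active_files(commits, as_of_version=1)
```

Here `latest` no longer contains `part-0.parquet`, while `version_1` still does, because the replay stops before the commit that removed it.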
7. Native Presto Delta Lake Reader
Supports data skipping & dynamic filtering
Optimizes query using file statistics
Supports reading the Delta transaction log
Native connector written from scratch
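Data skipping with file statistics can be illustrated with a small sketch: if each file records the minimum and maximum value of a column, a range predicate can rule out whole files before any data is read. This is a minimal Python illustration of the idea, not the connector's implementation; the `FileStats` class, the `files_to_scan` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: str  # smallest event_date in the file (ISO date string)
    max_value: str  # largest event_date in the file (ISO date string)


def files_to_scan(files, lower_bound):
    """Return paths of files that may hold rows with value >= lower_bound.

    A file is skipped when its maximum is below the bound, because no row
    in it can satisfy the predicate. ISO dates compare correctly as strings.
    """
    return [f.path for f in files if f.max_value >= lower_bound]


# Hypothetical per-file statistics for an event_date column.
files = [
    FileStats("part-0.parquet", "2020-01-01", "2020-03-31"),
    FileStats("part-1.parquet", "2020-04-01", "2020-06-30"),
    FileStats("part-2.parquet", "2020-07-01", "2020-09-30"),
]

# Predicate: event_date >= '2020-06-15'
selected = files_to_scan(files, "2020-06-15")
```

Only `part-1.parquet` and `part-2.parquet` remain in `selected`; `part-0.parquet` is never opened. Z-ordering improves this effect by clustering related values into the same files, which tightens the min/max ranges.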
8. Query-time Data Federation
● Single point of access to numerous data sources
● Query Delta Lake and federate with legacy databases as well as many NoSQL data stores
● Enforce table, column and row level policies to ensure maximum data security
● Mask column data for different groups and users
9. Data Consumption & Analytics
BI Reporting Tools
SQL Query Tools
• Connect using a variety of BI and SQL tools including Looker, Tableau, Power BI and DBeaver
• JDBC, ODBC and many libraries including Python, R and Java
SELECT c.id, COUNT(*), SUM(active_seconds)
FROM delta.iot.events e
JOIN snowflake.sales.customer c ON (e.customer_id = c.id)
WHERE e.event_date >= current_date
  AND c.region = 'US'
  AND c.id IN
    (SELECT l.customer_id
     FROM elastic.web.logs l
     WHERE l.visit_date >= date '2020-01-01')
GROUP BY c.id;
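From Python, a query like the one above would typically be submitted through a DB-API 2.0 driver. The sketch below shows only the generic cursor pattern, which is common to such drivers. To stay self-contained it uses the standard library's `sqlite3` as a stand-in for a Presto connection; the `fetch_dicts` helper and the sample table are hypothetical, and connection setup for a real Presto driver is not shown.

```python
import sqlite3


def fetch_dicts(conn, sql, params=()):
    """Run a query on a DB-API 2.0 connection and return rows as dicts."""
    cur = conn.cursor()
    cur.execute(sql, params)
    # cursor.description holds one 7-item sequence per result column;
    # the first item is the column name.
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Stand-in connection: an in-memory SQLite database, not Presto.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, active_seconds INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 30), (1, 45), (2, 10)],
)

rows = fetch_dicts(
    conn,
    "SELECT customer_id AS id, COUNT(*) AS n, SUM(active_seconds) AS total "
    "FROM events GROUP BY customer_id ORDER BY customer_id",
)
```

With a real Presto driver only the connection object would change; the three-part `catalog.schema.table` names in the federated query are resolved by the engine, not by the client. Parameter placeholder style varies between drivers, so the `?` form used here is specific to `sqlite3`.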