SlideShare una empresa de Scribd logo
1 de 32
Descargar para leer sin conexión
SQL on everything, in memory 
Page ‹#› © Hortonworks Inc. 2014 
Julian Hyde 
Strata, NYC 
October 16th, 2014 
Apache 
Calcite
About me 
Julian Hyde 
Architect at Hortonworks 
Open source: 
• Founder & lead, Apache Calcite (query optimization framework) 
• Founder & lead, Pentaho Mondrian (analysis engine) 
• Committer, Apache Drill 
• Contributor, Apache Hive 
• Contributor, Cascading Lingual (SQL interface to Cascading) 
Past: 
• SQLstream (streaming SQL) 
• Broadbase (data warehouse) 
• Oracle (SQL kernel development) 
Page ‹#› © Hortonworks Inc. 2014
SQL: in and out of fashion 
1969 — CODASYL (network database) 
1979 — First commercial SQL RDBMSs 
1990 — Acceptance — transaction processing on SQL 
1993 — Multi-dimensional databases 
1996 — SQL EDWs 
2006 — Hadoop and other “big data” technologies 
2008 — NoSQL 
2011 — SQL on Hadoop 
2014 — Interactive analytics on {Hadoop, NoSQL, DBMS}, using SQL 
! 
SQL remains popular. 
But why? 
Page ‹#› © Hortonworks Inc. 2014
“SQL inside” 
Implementing SQL well is hard 
• System cannot just “run the query” as written 
• Require relational algebra, query planner (optimizer) & metadata 
…but it’s worth the effort 
! 
Algebra-based systems are more flexible 
• Add new algorithms (e.g. a better join) 
• Re-organize data 
• Choose access path based on statistics 
• Dumb queries (e.g. machine-generated) 
• Relational, schema-less, late-schema, non-relational (e.g. key-value, document) 
Page ‹#› © Hortonworks Inc. 2014
Apache Calcite 
Page ‹#› © Hortonworks Inc. 2014 
Apache 
Calcite
Apache Calcite 
Apache incubator project since May, 2014 
• Originally named Optiq 
Query planning framework 
• Relational algebra, rewrite rules, cost model 
• Extensible 
Packaging 
• Library (JDBC server optional) 
• Open source 
• Community-authored rules, adapters 
Adoption 
• Embedded: Lingual (SQL interface to Cascading), Apache Drill, Apache Hive, Kylin OLAP 
• Adapters: Splunk, Spark, MongoDB, JDBC, CSV, JSON, Web tables, In-memory data 
Page ‹#› © Hortonworks Inc. 2014
Conventional DB architecture 
Page ‹#› © Hortonworks Inc. 2014
Calcite architecture 
Page ‹#› © Hortonworks Inc. 2014
Demo 
{sqlline, apache-calcite-0.9.1, .csv} 
Page ‹#› © Hortonworks Inc. 2014
Expression tree 
Splunk 
Table: splunk 
MySQL 
Page ‹#› © Hortonworks Inc. 2014 
SELECT p.“product_name”, COUNT(*) AS c 
FROM “splunk”.”splunk” AS s 
JOIN “mysql”.”products” AS p 
ON s.”product_id” = p.”product_id” 
WHERE s.“action” = 'purchase' 
GROUP BY p.”product_name” 
ORDER BY c DESC 
Key: product_id 
join 
Key: product_name 
Agg: count 
group 
Condition: 
action = 
'purchase' 
filter 
Key: c DESC 
sort 
scan 
scan 
Table: products
Expression tree 
(optimized) 
Splunk 
Table: splunk 
Page ‹#› © Hortonworks Inc. 2014 
Key: product_id 
join 
Key: product_name 
Agg: count 
group 
Condition: 
action = 
'purchase' 
filter 
Key: c DESC 
sort 
scan 
MySQL 
scan 
Table: products 
SELECT p.“product_name”, COUNT(*) AS c 
FROM “splunk”.”splunk” AS s 
JOIN “mysql”.”products” AS p 
ON s.”product_id” = p.”product_id” 
WHERE s.“action” = 'purchase' 
GROUP BY p.”product_name” 
ORDER BY c DESC
Calcite – APIs and SPIs 
Page ‹#› © Hortonworks Inc. 2014 
Cost, statistics 
RelOptCost 
RelOptCostFactory 
RelMetadataProvider 
• RelMdColumnUniquensss 
• RelMdDistinctRowCount 
• RelMdSelectivity 
SQL parser 
SqlNode 
SqlParser 
SqlValidator 
Transformation rules 
RelOptRule 
• MergeFilterRule 
• PushAggregateThroughUnionRule 
• 100+ more 
Global transformations 
• Unification (materialized view) 
• Column trimming 
• De-correlation 
Relational algebra 
RelNode (operator) 
• TableScan 
• Filter 
• Project 
• Union 
• Aggregate 
• … 
RelDataType (type) 
RexNode (expression) 
RelTrait (physical property) 
• RelConvention (calling-convention) 
• RelCollation (sortedness) 
• TBD (bucketedness/distribution) JDBC driver 
Metadata 
Schema 
Table 
Function 
• TableFunction 
• TableMacro 
Lattice
Calcite Planning Process 
SqlNode ! 
RelNode + RexNode 
Page ‹#› © Hortonworks Inc. 2014 
SQL 
parse 
tree 
Planner 
RelNode 
Graph 
Sql-to-Rel Converter 
1. Plan Graph 
• Node for each node in 
Input Plan 
• Each node is a Set of 
alternate Sub Plans 
• Set further divided into 
Subsets: based on traits 
like sortedness 
2. Rules 
• Rule: specifies an Operator 
sub-graph to match and 
logic to generate equivalent 
‘better’ sub-graph 
• New and original sub-graph 
both remain in contention 
3. Cost Model 
• RelNodes have Cost & 
Cumulative Cost 
4. Metadata Providers 
- Used to plug in Schema, 
cost formulas 
- Filter selectivity 
- Join selectivity 
- NDV calculations 
Rule Match Queue 
- Add Rule matches to Queue 
- Apply Rule match 
transformations to plan graph 
- Iterate for fixed iterations or until 
cost doesn’t change 
- Match importance based on 
cost of RelNode and height 
Best RelNode Graph 
Translate to 
runtime 
Logical Plan 
Based on “Volcano” & “Cascades” papers [G. Graefe]
Demo 
{sqlline, apache-calcite-0.9.1, .csv, CsvPushProjectOntoTableRule} 
Page ‹#› © Hortonworks Inc. 2014
Analytics 
Page ‹#› © Hortonworks Inc. 2014
Mondrian OLAP (Saiku user interface)
Interactive queries on NoSQL 
Typical requirements 
NoSQL operational database (e.g. HBase, MongoDB, Cassandra) 
Analytic queries aggregate over full scan 
Speed-of-thought response (< 5 seconds first query, < 1 second second query) 
Data freshness (< 10 minutes) 
! 
Other requirements 
Hybrid system (e.g. Hive + HBase) 
Star schema 
Page ‹#› © Hortonworks Inc. 2014
Star schema 
Page ‹#› © Hortonworks Inc. 2014 
Sales Time Inventory 
Customer 
Warehouse 
Key 
Fact table 
Dimension table 
Many-to-one relationship 
Product 
Product 
class 
Promotion
Simple analytics problem? 
System 
100M US census records 
1KB each record, 100GB total 
4 SATA 3 disks, total read throughput 1.2GB/s 
! 
Requirement 
Count all records in < 5s 
Solution #1 
It’s not possible! It takes 80s just to read the data 
Solution #2 
Cheat! 
Page ‹#› © Hortonworks Inc. 2014
How to cheat 
Multiple tricks 
Compress data 
Column-oriented storage 
Store data in sorted order 
Put data in memory 
Cache previous results 
Pre-compute (materialize) aggregates 
! 
Common factors 
Make a copy of the data 
Organize it in a different way 
Optimizer chooses the most suitable data organization 
SQL query is unchanged 
Page ‹#› © Hortonworks Inc. 2014
Filter-join-aggregate query 
SELECT product.id, sum(sales.units), sum(sales.price), count(*) 
FROM sales … 
JOIN customer ON … 
JOIN time ON … 
JOIN product ON … 
JOIN product_class ON … 
WHERE time.year = 2014 
AND time.quarter = ‘Q1’ 
AND product.color = ‘Red’ 
GROUP BY … 
Page ‹#› © Hortonworks Inc. 2014 
Time 
Sales Inventory 
Product 
Customer 
Warehouse 
ProductClass
Materialized view, lattice, tile 
Materialized view 
A table whose contents are guaranteed to be the same as 
executing a given query. 
Lattice 
Recommends, builds, and recognizes summary 
materialized views (tiles) based on a star schema. 
A query defines the tables and many:1 relationships in the 
star schema. 
Tile 
A summary materialized view that belongs to a lattice. 
A tile may or may not be materialized. 
Materialization methods: 
• Declare in lattice 
• Generate via recommender algorithm 
• Created in response to query 
Page ‹#› © Hortonworks Inc. 2014 
(FAKE SYNTAX) 
CREATE MATERIALIZED VIEW t AS 
SELECT * FROM emps 
WHERE deptno = 10; 
CREATE LATTICE star AS 
SELECT * 
FROM sales_fact_1997 AS s 
JOIN product AS p ON … 
JOIN product_class AS pc ON … 
JOIN customer AS c ON … 
JOIN time_by_day AS t ON …; 
CREATE MATERIALIZED VIEW zg IN star 
SELECT gender, zipcode, 
COUNT(*), SUM(unit_sales) 
FROM star 
GROUP BY gender, zipcode;
Lattice () 1 
Page ‹#› © Hortonworks Inc. 2014 
> select count(*) as c, sum(unit_sales) as s 
> from star; 
+-----------+-----------+ 
| C | S | 
+-----------+-----------+ 
| 1,000,000 | 266,773.0 | 
+-----------+-----------+ 
1 row selected 
! 
> select * from star; 
1,000,000 rows selected 
raw 1m
Lattice - top tiles () 1 
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12 
Page ‹#› © Hortonworks Inc. 2014 
raw 1m 
Key 
! 
z zipcode (43k) 
s state (50) 
g gender (2) 
y year (5) 
m month (12) 
> select zipcode, count(*) as c, 
> sum(unit_sales) as s 
> from star 
> group by zipcode; 
+---------+-----------+-----------+ 
| ZIPCODE | C | S | 
+---------+-----------+-----------+ 
| 10000 | 23 | 31.5 | 
… 
+---------+-----------+-----------+ 
43,000 rows selected 
> select state, count(*) as c, 
> sum(unit_sales) as s 
> from star 
> group by state; 
+-------+-----------+-----------+ 
| STATE | C | S | 
+-------+-----------+-----------+ 
| AL | 201,693 | 5,520.0 | 
… 
+-------+-----------+-----------+ 
50 rows selected
Lattice - more tiles () 1 
(z, s) 43.4k (g, y) 10 (y, m) 60 
Key 
! 
z zipcode (43k) 
s state (50) 
g gender (2) 
y year (5) 
m month (12) 
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12 
Page ‹#› © Hortonworks Inc. 2014 
(z, s, g, y, 
m) 912k 
(s, g, y, m) 6k 
raw 1m 
(g, y, m) 120 
Fewer 
than you would 
expect, because 5m 
combinations cannot 
occur in 1m row 
table 
Fewer than you 
would expect, because 
state depends on 
zipcode
Lattice - complete () 1 
(z, s) 43.4k (y, m) 60 
Key 
! 
z zipcode (43k) 
s state (50) 
g gender (2) 
y year (5) 
m month (12) 
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12 
Page ‹#› © Hortonworks Inc. 2014 
(z, s, g, y, 
m) 910k 
(s, g, y, m) 6k 
(z, g, y, m) 
909k 
(z, s, y, m) 
830k 
raw 1m 
(z, s, g, m) 
643k 
(z, s, g, y) 
391k 
(z, s, g) 87k 
(g, y) 10 
(g, y, m) 120 
(g, m) 24
Lattice - optimized () 1 
(z, s) 43.4k (y, m) 60 
Key 
! 
z zipcode (43k) 
s state (50) 
g gender (2) 
y year (5) 
m month (12) 
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12 
Page ‹#› © Hortonworks Inc. 2014 
(z, s, g, y, 
m) 912k 
(s, g, y, m) 6k 
(z, g, y, m) 
909k 
(z, s, y, m) 
831k 
raw 1m 
(z, s, g, m) 
644k 
(z, s, g, y) 
392k 
(z, s, g) 87k 
(g, y) 10 
(g, y, m) 120 
(g, m) 24
Lattice - optimized () 1 
(z, s) 43.4k (y, m) 60 
Key 
! 
z zipcode (43k) 
s state (50) 
g gender (2) 
y year (5) 
m month (12) 
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12 
Page ‹#› © Hortonworks Inc. 2014 
(z, s, g, y, 
m) 912k 
(s, g, y, m) 6k 
(z, g, y, m) 
909k 
(z, s, y, m) 
831k 
raw 1m 
(z, s, g, m) 
644k 
(z, s, g, y) 
392k 
(z, s, g) 87k 
(g, y) 10 
(g, y, m) 120 
(g, m) 24 
Aggregate Cost 
(rows) 
Benefit (query 
rows saved) 
% queries 
s, g, y, m 6k 497k 50% 
z, s, g 87k 304k 33% 
g, y 10 1.5k 25% 
g, m 24 1.5k 25% 
s, g 100 1.5k 25% 
y, m 60 1.5k 25%
Demo 
{mysql-foodmart-lattice-model.json} 
Page ‹#› © Hortonworks Inc. 2014
Tiled, in-memory materializations 
Page ‹#› © Hortonworks Inc. 2014 
Query: SELECT x, SUM(y) FROM t GROUP BY x 
In-memory 
materialized 
queries 
Tables 
on disk 
Where we’re going… smart, distributed memory cache & compute framework 
http://hortonworks.com/blog/dmmq/
Kylin OLAP engine 
Calcite 
used here 
Page ‹#› © Hortonworks Inc. 2014
Thank you! 
Apache 
Calcite 
@julianhyde 
http://calcite.incubator.apache.org 
http://www.kylin.io 
Page ‹#› © Hortonworks Inc. 2014

Más contenido relacionado

La actualidad más candente

Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopDataWorks Summit
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...Altinity Ltd
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitFlink Forward
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOAltinity Ltd
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
 

La actualidad más candente (20)

Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
All About JSON and ClickHouse - Tips, Tricks and New Features-2022-07-26-FINA...
 
Using Queryable State for Fun and Profit
Using Queryable State for Fun and ProfitUsing Queryable State for Fun and Profit
Using Queryable State for Fun and Profit
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEOClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
ClickHouse Query Performance Tips and Tricks, by Robert Hodges, Altinity CEO
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 

Destacado

Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too muchJulian Hyde
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketDremio Corporation
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeDremio Corporation
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 

Destacado (11)

Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 

Similar a SQL on everything, in memory

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql databasePARIKSHIT SAVJANI
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteSalesforce Engineering
 
ASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataJohn Beresniewicz
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Lucidworks
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Julian Hyde
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...DataStax Academy
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 

Similar a SQL on everything, in memory (20)

Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql database
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
 
ASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH data
 
Webinar: What's New in Solr 6
Webinar: What's New in Solr 6Webinar: What's New in Solr 6
Webinar: What's New in Solr 6
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)Why you care about
 relational algebra (even though you didn’t know it)
Why you care about
 relational algebra (even though you didn’t know it)
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
C* Summit 2013: Real-time Analytics using Cassandra, Spark and Shark by Evan ...
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 

Más de Julian Hyde

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteJulian Hyde
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Julian Hyde
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQLJulian Hyde
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're IncubatingJulian Hyde
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesJulian Hyde
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineeringJulian Hyde
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Julian Hyde
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databasesJulian Hyde
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and FastJulian Hyde
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache CalciteJulian Hyde
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Julian Hyde
 

Más de Julian Hyde (20)

Building a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
 
Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!
 
Adding measures to Calcite SQL
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
What to expect when you're Incubating
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Efficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
Tactical data engineering
Tactical data engineeringTactical data engineering
Tactical data engineering
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Data Profiling in Apache Calcite
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 

Último

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Último (20)

Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

SQL on everything, in memory

  • 1. SQL on everything, in memory Page ‹#› © Hortonworks Inc. 2014 Julian Hyde Strata, NYC October 16th, 2014 Apache Calcite
  • 2. About me Julian Hyde Architect at Hortonworks Open source: • Founder & lead, Apache Calcite (query optimization framework) • Founder & lead, Pentaho Mondrian (analysis engine) • Committer, Apache Drill • Contributor, Apache Hive • Contributor, Cascading Lingual (SQL interface to Cascading) Past: • SQLstream (streaming SQL) • Broadbase (data warehouse) • Oracle (SQL kernel development) Page ‹#› © Hortonworks Inc. 2014
  • 3. SQL: in and out of fashion 1969 — CODASYL (network database) 1979 — First commercial SQL RDBMSs 1990 — Acceptance — transaction processing on SQL 1993 — Multi-dimensional databases 1996 — SQL EDWs 2006 — Hadoop and other “big data” technologies 2008 — NoSQL 2011 — SQL on Hadoop 2014 — Interactive analytics on {Hadoop, NoSQL, DBMS}, using SQL ! SQL remains popular. But why? Page ‹#› © Hortonworks Inc. 2014
  • 4. “SQL inside” Implementing SQL well is hard • System cannot just “run the query” as written • Require relational algebra, query planner (optimizer) & metadata …but it’s worth the effort ! Algebra-based systems are more flexible • Add new algorithms (e.g. a better join) • Re-organize data • Choose access path based on statistics • Dumb queries (e.g. machine-generated) • Relational, schema-less, late-schema, non-relational (e.g. key-value, document) Page ‹#› © Hortonworks Inc. 2014
  • 5. Apache Calcite Page ‹#› © Hortonworks Inc. 2014 Apache Calcite
  • 6. Apache Calcite Apache incubator project since May, 2014 • Originally named Optiq Query planning framework • Relational algebra, rewrite rules, cost model • Extensible Packaging • Library (JDBC server optional) • Open source • Community-authored rules, adapters Adoption • Embedded: Lingual (SQL interface to Cascading), Apache Drill, Apache Hive, Kylin OLAP • Adapters: Splunk, Spark, MongoDB, JDBC, CSV, JSON, Web tables, In-memory data Page ‹#› © Hortonworks Inc. 2014
  • 7. Conventional DB architecture Page ‹#› © Hortonworks Inc. 2014
  • 8. Calcite architecture Page ‹#› © Hortonworks Inc. 2014
  • 9. Demo {sqlline, apache-calcite-0.9.1, .csv} Page ‹#› © Hortonworks Inc. 2014
  • 10. Expression tree Splunk Table: splunk MySQL Page ‹#› © Hortonworks Inc. 2014 SELECT p.“product_name”, COUNT(*) AS c FROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id” WHERE s.“action” = 'purchase' GROUP BY p.”product_name” ORDER BY c DESC Key: product_id join Key: product_name Agg: count group Condition: action = 'purchase' filter Key: c DESC sort scan scan Table: products
  • 11. Expression tree (optimized) Splunk Table: splunk Page ‹#› © Hortonworks Inc. 2014 Key: product_id join Key: product_name Agg: count group Condition: action = 'purchase' filter Key: c DESC sort scan MySQL scan Table: products SELECT p.“product_name”, COUNT(*) AS c FROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id” WHERE s.“action” = 'purchase' GROUP BY p.”product_name” ORDER BY c DESC
  • 12. Calcite – APIs and SPIs Page ‹#› © Hortonworks Inc. 2014 Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniquensss • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • MergeFilterRule • PushAggregateThroughUnionRule • 100+ more Global transformations • Unification (materialized view) • Column trimming • De-correlation Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • TBD (bucketedness/distribution) JDBC driver Metadata Schema Table Function • TableFunction • TableMacro Lattice
  • 13. Calcite Planning Process SqlNode ! RelNode + RexNode Page ‹#› © Hortonworks Inc. 2014 SQL parse tree Planner RelNode Graph Sql-to-Rel Converter 1. Plan Graph • Node for each node in Input Plan • Each node is a Set of alternate Sub Plans • Set further divided into Subsets: based on traits like sortedness 2. Rules • Rule: specifies an Operator sub-graph to match and logic to generate equivalent ‘better’ sub-graph • New and original sub-graph both remain in contention 3. Cost Model • RelNodes have Cost & Cumulative Cost 4. Metadata Providers - Used to plug in Schema, cost formulas - Filter selectivity - Join selectivity - NDV calculations Rule Match Queue - Add Rule matches to Queue - Apply Rule match transformations to plan graph - Iterate for fixed iterations or until cost doesn’t change - Match importance based on cost of RelNode and height Best RelNode Graph Translate to runtime Logical Plan Based on “Volcano” & “Cascades” papers [G. Graefe]
  • 14. Demo {sqlline, apache-calcite-0.9.1, .csv, CsvPushProjectOntoTableRule} Page ‹#› © Hortonworks Inc. 2014
  • 15. Analytics Page ‹#› © Hortonworks Inc. 2014
  • 16. Mondrian OLAP (Saiku user interface)
  • 17. Interactive queries on NoSQL Typical requirements NoSQL operational database (e.g. HBase, MongoDB, Cassandra) Analytic queries aggregate over full scan Speed-of-thought response (< 5 seconds first query, < 1 second second query) Data freshness (< 10 minutes) ! Other requirements Hybrid system (e.g. Hive + HBase) Star schema Page ‹#› © Hortonworks Inc. 2014
  • 18. Star schema Page ‹#› © Hortonworks Inc. 2014 Sales Time Inventory Customer Warehouse Key Fact table Dimension table Many-to-one relationship Product Product class Promotion
  • 19. Simple analytics problem? System 100M US census records 1KB each record, 100GB total 4 SATA 3 disks, total read throughput 1.2GB/s ! Requirement Count all records in < 5s Solution #1 It’s not possible! It takes 80s just to read the data Solution #2 Cheat! Page ‹#› © Hortonworks Inc. 2014
  • 20. How to cheat Multiple tricks Compress data Column-oriented storage Store data in sorted order Put data in memory Cache previous results Pre-compute (materialize) aggregates ! Common factors Make a copy of the data Organize it in a different way Optimizer chooses the most suitable data organization SQL query is unchanged Page ‹#› © Hortonworks Inc. 2014
  • 21. Filter-join-aggregate query SELECT product.id, sum(sales.units), sum(sales.price), count(*) FROM sales … JOIN customer ON … JOIN time ON … JOIN product ON … JOIN product_class ON … WHERE time.year = 2014 AND time.quarter = ‘Q1’ AND product.color = ‘Red’ GROUP BY … Page ‹#› © Hortonworks Inc. 2014 Time Sales Inventory Product Customer Warehouse ProductClass
  • 22. Materialized view, lattice, tile Materialized view A table whose contents are guaranteed to be the same as executing a given query. Lattice Recommends, builds, and recognizes summary materialized views (tiles) based on a star schema. A query defines the tables and many:1 relationships in the star schema. Tile A summary materialized view that belongs to a lattice. A tile may or may not be materialized. Materialization methods: • Declare in lattice • Generate via recommender algorithm • Created in response to query Page ‹#› © Hortonworks Inc. 2014 (FAKE SYNTAX) CREATE MATERIALIZED VIEW t AS SELECT * FROM emps WHERE deptno = 10; CREATE LATTICE star AS SELECT * FROM sales_fact_1997 AS s JOIN product AS p ON … JOIN product_class AS pc ON … JOIN customer AS c ON … JOIN time_by_day AS t ON …; CREATE MATERIALIZED VIEW zg IN star SELECT gender, zipcode, COUNT(*), SUM(unit_sales) FROM star GROUP BY gender, zipcode;
  • 23. Lattice () 1 Page ‹#› © Hortonworks Inc. 2014 > select count(*) as c, sum(unit_sales) as s > from star; +-----------+-----------+ | C | S | +-----------+-----------+ | 1,000,000 | 266,773.0 | +-----------+-----------+ 1 row selected ! > select * from star; 1,000,000 rows selected raw 1m
  • 24. Lattice - top tiles () 1 (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 Page ‹#› © Hortonworks Inc. 2014 raw 1m Key ! z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) > select zipcode, count(*) as c, > sum(unit_sales) as s > from star > group by zipcode; +---------+-----------+-----------+ | ZIPCODE | C | S | +---------+-----------+-----------+ | 10000 | 23 | 31.5 | … +---------+-----------+-----------+ 43,000 rows selected > select state, count(*) as c, > sum(unit_sales) as s > from star > group by state; +-------+-----------+-----------+ | STATE | C | S | +-------+-----------+-----------+ | AL | 201,693 | 5,520.0 | … +-------+-----------+-----------+ 50 rows selected
  • 25. Lattice - more tiles () 1 (z, s) 43.4k (g, y) 10 (y, m) 60 Key ! z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 Page ‹#› © Hortonworks Inc. 2014 (z, s, g, y, m) 912k (s, g, y, m) 6k raw 1m (g, y, m) 120 Fewer than you would expect, because 5m combinations cannot occur in 1m row table Fewer than you would expect, because state depends on zipcode
  • 26. Lattice - complete () 1 (z, s) 43.4k (y, m) 60 Key ! z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 Page ‹#› © Hortonworks Inc. 2014 (z, s, g, y, m) 910k (s, g, y, m) 6k (z, g, y, m) 909k (z, s, y, m) 830k raw 1m (z, s, g, m) 643k (z, s, g, y) 391k (z, s, g) 87k (g, y) 10 (g, y, m) 120 (g, m) 24
  • 27. Lattice - optimized () 1 (z, s) 43.4k (y, m) 60 Key ! z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 Page ‹#› © Hortonworks Inc. 2014 (z, s, g, y, m) 912k (s, g, y, m) 6k (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (z, s, g) 87k (g, y) 10 (g, y, m) 120 (g, m) 24
  • 28. Lattice - optimized () 1 (z, s) 43.4k (y, m) 60 Key ! z zipcode (43k) s state (50) g gender (2) y year (5) m month (12) (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 Page ‹#› © Hortonworks Inc. 2014 (z, s, g, y, m) 912k (s, g, y, m) 6k (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (z, s, g) 87k (g, y) 10 (g, y, m) 120 (g, m) 24 Aggregate Cost (rows) Benefit (query rows saved) % queries s, g, y, m 6k 497k 50% z, s, g 87k 304k 33% g, y 10 1.5k 25% g, m 24 1.5k 25% s, g 100 1.5k 25% y, m 60 1.5k 25%
  • 29. Demo {mysql-foodmart-lattice-model.json} Page ‹#› © Hortonworks Inc. 2014
  • 30. Tiled, in-memory materializations Page ‹#› © Hortonworks Inc. 2014 Query: SELECT x, SUM(y) FROM t GROUP BY x In-memory materialized queries Tables on disk Where we’re going… smart, distributed memory cache & compute framework http://hortonworks.com/blog/dmmq/
  • 31. Kylin OLAP engine Calcite used here Page ‹#› © Hortonworks Inc. 2014
  • 32. Thank you! Apache Calcite @julianhyde http://calcite.incubator.apache.org http://www.kylin.io Page ‹#› © Hortonworks Inc. 2014