Introduction to Spark SQL & Catalyst
Takuya UESHIN
Spark Meetup 2014/09/08 (Mon)
Who am I? 
Takuya UESHIN 
@ueshin 
github.com/ueshin 
Nautilus Technologies, Inc. 
A Spark contributor 
Agenda 
What is Spark SQL? 
Catalyst in depth 
SQL core in depth 
Interesting issues 
How to contribute 
What is Spark SQL?
What is Spark SQL? 
Spark SQL is one of Spark's components.
Executes SQL on Spark
Builds SchemaRDDs through a LINQ-like API
Optimizes the execution plan.
What is Spark SQL? 
Catalyst provides an execution planning
framework for relational operations.
Including:
SQL parser & analyzer
Logical operators & general expressions
Logical optimizer
A framework for transforming operator trees.
What is Spark SQL?
Executing a query takes several steps.
Parse
Analyze
Logical Plan
Optimize
Physical Plan
Execute
Catalyst covers the steps from parsing through logical optimization.
SQL core covers physical planning and execution.
Catalyst in depth
Catalyst in depth 
Provides an execution planning framework for
relational operations.
Row & DataTypes
Trees & Rules
Logical Operators
Expressions
Optimizations
Row & DataTypes
o.a.s.sql.catalyst.types.DataType 
Long, Int, Short, Byte, Float, Double, Decimal 
String, Binary, Boolean, Timestamp 
Array, Map, Struct 
o.a.s.sql.catalyst.expressions.Row 
Represents a single row. 
Can contain complex types. 
Trees & Rules 
o.a.s.sql.catalyst.trees.TreeNode 
Provides tree transformations.
foreach, map, flatMap, collect
transform, transformUp,
transformDown
Used for both operator trees and expression trees.
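The difference between the bottom-up and top-down traversals can be sketched with a toy tree. This is a minimal Python model, not Catalyst's Scala TreeNode: the `Node` class, the snake_case method names and the `rename_filter` rule are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Node:
    """Toy immutable tree node (hypothetical, not Catalyst's TreeNode)."""
    name: str
    children: Tuple["Node", ...] = ()

    def transform_down(self, rule: Callable[["Node"], "Node"]) -> "Node":
        # Pre-order: rewrite this node first, then the children of the result.
        node = rule(self)
        return replace(node, children=tuple(c.transform_down(rule) for c in node.children))

    def transform_up(self, rule: Callable[["Node"], "Node"]) -> "Node":
        # Post-order: rewrite the children first, then this node.
        node = replace(self, children=tuple(c.transform_up(rule) for c in self.children))
        return rule(node)

    def collect(self, pred: Callable[["Node"], bool]) -> list:
        # Gather every node (pre-order) that satisfies the predicate.
        found = [self] if pred(self) else []
        for child in self.children:
            found.extend(child.collect(pred))
        return found


def rename_filter(node: Node) -> Node:
    # A rule returns a rewritten node, or the same node when it does not apply.
    return replace(node, name="Where") if node.name == "Filter" else node


plan = Node("Project", (Node("Filter", (Node("Table"),)),))
rewritten = plan.transform_up(rename_filter)
names = [n.name for n in rewritten.collect(lambda n: True)]
```

Because the nodes are immutable, every transformation returns a new tree and leaves `plan` untouched, so the same tree can be fed to several rules safely.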
Trees & Rules 
o.a.s.sql.catalyst.rules.Rule 
Represents a tree transform rule. 
o.a.s.sql.catalyst.rules.RuleExecutor 
A framework to transform trees 
based on rules. 
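One way to picture a rule executor is a loop that keeps applying rules until the tree stops changing. The sketch below is a toy in Python under that assumption. The function `run_to_fixed_point`, the tuple-based plan encoding and the `drop_true_filter` rule are hypothetical names, not part of Catalyst.

```python
def run_to_fixed_point(plan, rules, max_iterations=100):
    """Apply every rule in order, repeating until the plan stops changing."""
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:
            break  # fixed point reached: no rule changed anything
        plan = new_plan
    return plan


# Toy plans are nested tuples: ("Filter", condition, child) or ("Table", name).
def drop_true_filter(plan):
    # Only inspects the root, so stacked filters need several passes.
    if plan[0] == "Filter" and plan[1] is True:
        return plan[2]
    return plan


stacked = ("Filter", True, ("Filter", True, ("Table", "t")))
result = run_to_fixed_point(stacked, [drop_true_filter])
one_pass = run_to_fixed_point(stacked, [drop_true_filter], max_iterations=1)
```

The iteration cap is a safety net: a pair of rules that undo each other would otherwise loop forever.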
Logical Operators 
Basic Operators 
Project, Filter, … 
Binary Operators 
Join, Except, Intersect, Union, … 
Aggregate 
Generate, Distinct 
Sort, Limit 
InsertInto, WriteToFile 
Figure: example operator tree with nodes Project, Filter, Join, Table, Table.
Expressions 
Literal 
Arithmetic
UnaryMinus, Sqrt, MaxOf 
Add, Subtract, Multiply, … 
Predicates 
EqualTo, LessThan, LessThanOrEqual, GreaterThan, 
GreaterThanOrEqual 
Not, And, Or, In, If, CaseWhen 
Figure: example expression tree for 1 + 2 (nodes +, 1, 2).
Expressions 
Cast 
GetItem, GetField 
Coalesce, IsNull, IsNotNull 
StringOperations 
Like, Upper, Lower, Contains, 
StartsWith, EndsWith, Substring, … 
Optimizations
ConstantFolding
  NullPropagation
  ConstantFolding
  BooleanSimplification
  SimplifyFilters
FilterPushdown
  CombineFilters
  PushPredicateThroughProject
  PushPredicateThroughJoin
  ColumnPruning
Optimizations 
NullPropagation, ConstantFolding 
Replaces expressions that can be evaluated
to a literal with that literal value.
ex) 
1 + null => null 
1 + 2 => 3 
Count(null) => 0 
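The three rewrites above can be reproduced on a toy expression tree. This is a minimal Python sketch, not Catalyst's implementation: the classes `Literal`, `Attr`, `Add`, `Count` and the function `fold` are hypothetical, and Python's `None` stands in for SQL NULL.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Literal:
    value: Any  # None stands in for SQL NULL


@dataclass(frozen=True)
class Attr:
    name: str  # a column reference whose value is unknown at planning time


@dataclass(frozen=True)
class Add:
    left: Any
    right: Any


@dataclass(frozen=True)
class Count:
    child: Any


def fold(expr):
    """Rewrite bottom-up, replacing what can be decided without reading data."""
    if isinstance(expr, Add):
        left, right = fold(expr.left), fold(expr.right)
        # Null propagation: NULL on either side makes the whole sum NULL.
        if Literal(None) in (left, right):
            return Literal(None)
        # Constant folding: both sides are known, so evaluate now.
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    if isinstance(expr, Count):
        child = fold(expr.child)
        # Counting a constant NULL always yields 0.
        return Literal(0) if child == Literal(None) else Count(child)
    return expr


folded = fold(Add(Attr("x"), Add(Literal(1), Literal(2))))
```

Here `folded` keeps the column reference but carries the precomputed constant, so the addition `1 + 2` is not repeated for every row.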
Optimizations 
BooleanSimplification 
Simplifies boolean expressions whose result can be determined statically.
ex) 
false AND $right => false 
true AND $right => $right 
true OR $right => true 
false OR $right => $right 
If(true, $then, $else) => $then 
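The five cases above translate directly into a small recursive function. The following Python sketch implements exactly those cases and nothing more. The classes `Lit`, `Attr`, `And`, `Or`, `If` and the function `simplify` are hypothetical stand-ins, not Catalyst's classes.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Lit:
    value: Any


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class And:
    left: Any
    right: Any


@dataclass(frozen=True)
class Or:
    left: Any
    right: Any


@dataclass(frozen=True)
class If:
    cond: Any
    then: Any
    otherwise: Any


TRUE, FALSE = Lit(True), Lit(False)


def simplify(e):
    """Simplify children first, then apply the cases listed on the slide."""
    if isinstance(e, And):
        left, right = simplify(e.left), simplify(e.right)
        if left == FALSE:
            return FALSE
        if left == TRUE:
            return right
        return And(left, right)
    if isinstance(e, Or):
        left, right = simplify(e.left), simplify(e.right)
        if left == TRUE:
            return TRUE
        if left == FALSE:
            return right
        return Or(left, right)
    if isinstance(e, If):
        cond = simplify(e.cond)
        if cond == TRUE:
            return simplify(e.then)
        return If(cond, simplify(e.then), simplify(e.otherwise))
    return e


x = Attr("x")
nested = simplify(Or(And(TRUE, FALSE), x))
```

Simplifying the children before the parent lets one pass collapse nested expressions such as `(true AND false) OR x`.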
Optimizations 
SimplifyFilters 
Removes filters that can be evaluated 
trivially. 
ex) 
Filter(true, child) => child 
Filter(false, child) => empty 
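A toy version of this rule, written in Python for illustration only. The classes `Table`, `Empty`, `Filter` and the function `simplify_filters` are hypothetical. Conditions are plain Python values, with `True` and `False` playing the role of boolean literals.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Empty:
    """Stand-in for a relation with no rows."""


@dataclass(frozen=True)
class Filter:
    condition: Any
    child: Any


def simplify_filters(plan):
    if isinstance(plan, Filter):
        child = simplify_filters(plan.child)
        if plan.condition is True:
            return child      # the filter keeps every row: drop it
        if plan.condition is False:
            return Empty()    # the filter keeps no row: skip the child entirely
        return Filter(plan.condition, child)
    return plan


kept = simplify_filters(Filter("i = 1", Filter(True, Table("t"))))
```

The rule becomes useful after boolean simplification has already reduced a condition to a literal.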
Optimizations 
CombineFilters 
Merges two filters. 
ex) 
Filter($fc, Filter($nc, child)) => 
Filter(AND($fc, $nc), child) 
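The merge can be sketched on a toy plan as follows. This is illustrative Python, not Catalyst code. The classes `Table`, `And`, `Filter` and the function `combine_filters` are hypothetical, and conditions are opaque strings.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class And:
    left: Any
    right: Any


@dataclass(frozen=True)
class Filter:
    condition: Any
    child: Any


def combine_filters(plan):
    if isinstance(plan, Filter):
        child = combine_filters(plan.child)
        if isinstance(child, Filter):
            # Two adjacent filters become one filter on the conjunction.
            return Filter(And(plan.condition, child.condition), child.child)
        return Filter(plan.condition, child)
    return plan


stacked = Filter("a", Filter("b", Filter("c", Table("t"))))
merged = combine_filters(stacked)
```

A single filter means a single pass over each row, and it gives later pushdown rules one predicate to move instead of several.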
Optimizations 
PushPredicateThroughProject 
Pushes a Filter operator down through a
Project operator.
ex)
Filter('i === 1, Project('i, 'j, child))
=>
Project('i, 'j, Filter('i === 1, child))
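A toy version of the swap, in Python for illustration. The classes `Table`, `Project`, `Filter` and the function `push_through_project` are hypothetical. The sketch assumes the projection only selects existing columns. A projection that renames or computes columns would need the predicate rewritten before it could move.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Table:
    name: str


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]  # plain column names only (assumption of this toy)
    child: Any


@dataclass(frozen=True)
class Filter:
    condition: str
    child: Any


def push_through_project(plan):
    # Filter(cond, Project(cols, child)) => Project(cols, Filter(cond, child))
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project = plan.child
        return Project(project.columns, Filter(plan.condition, project.child))
    return plan


before = Filter("i = 1", Project(("i", "j"), Table("t")))
after = push_through_project(before)
```

Filtering first means the projection only has to process the rows that survive.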
Optimizations 
PushPredicateThroughJoin 
Pushes a Filter operator down through a Join
operator.
ex)
Filter("left.i".attr === 1, Join(left,
right)) =>
Join(Filter('i === 1, left), right)
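The decision of which side receives the filter depends on which columns the predicate references. The Python sketch below models that check for an inner join only. The classes `Table`, `Predicate`, `Filter`, `Join` and the function `push_through_join` are hypothetical, not Catalyst's API.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet


@dataclass(frozen=True)
class Table:
    name: str
    columns: FrozenSet[str]


@dataclass(frozen=True)
class Predicate:
    text: str
    references: FrozenSet[str]  # columns the predicate reads


@dataclass(frozen=True)
class Filter:
    condition: Predicate
    child: Any


@dataclass(frozen=True)
class Join:
    left: Any   # inner join assumed in this toy
    right: Any


def push_through_join(plan):
    if isinstance(plan, Filter) and isinstance(plan.child, Join):
        join, cond = plan.child, plan.condition
        # Push only when one side alone can evaluate the predicate.
        if isinstance(join.left, Table) and cond.references <= join.left.columns:
            return Join(Filter(cond, join.left), join.right)
        if isinstance(join.right, Table) and cond.references <= join.right.columns:
            return Join(join.left, Filter(cond, join.right))
    return plan


left = Table("left", frozenset({"i", "j"}))
right = Table("right", frozenset({"k"}))
on_left = Predicate("i = 1", frozenset({"i"}))
on_both = Predicate("i = k", frozenset({"i", "k"}))
pushed = push_through_join(Filter(on_left, Join(left, right)))
unchanged = push_through_join(Filter(on_both, Join(left, right)))
```

A predicate that touches both sides stays above the join, which is why `unchanged` is returned as-is.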
Optimizations 
ColumnPruning 
Eliminates the reading of unused 
columns. 
ex) 
Join(left, right, LeftSemi,
"left.id".attr === "right.id".attr) =>
Join(left, Project('id, right), LeftSemi)
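For a left semi join, the right side only contributes its join keys, so every other right-side column can be dropped before the join. The Python sketch below models just that case. The classes `Table`, `Project`, `SemiJoin` and the function `prune_right_side` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Table:
    name: str
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: Any


@dataclass(frozen=True)
class SemiJoin:
    left: Any
    right: Any
    right_keys: Tuple[str, ...]  # right-side columns used by the join condition


def prune_right_side(join):
    right = join.right
    # Insert a projection only when the table has columns the join never reads.
    if isinstance(right, Table) and set(right.columns) != set(join.right_keys):
        return SemiJoin(join.left, Project(join.right_keys, right), join.right_keys)
    return join


left = Table("left", ("id", "name"))
right = Table("right", ("id", "payload"))
pruned = prune_right_side(SemiJoin(left, right, ("id",)))
```

Applying the function a second time changes nothing, because the right side is then already a projection.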
SQL core in depth
SQL core in depth 
Provides: 
Physical operators to build RDD 
Conversion from an existing RDD of Product to
SchemaRDD
Parquet file read/write support 
JSON file read support 
Columnar in-memory table support 
SQL core in depth 
o.a.s.sql.SchemaRDD 
Extends RDD[Row]. 
Has a logical plan tree.
Provides LINQ-like interfaces for constructing
a logical plan.
select, where, join, orderBy, … 
Executes the plan. 
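The idea of chaining calls to build a plan and executing it only at the end can be sketched with a toy class. This is plain Python, not the SchemaRDD API. The `Query` class, the tuple-based plan encoding and the `_execute` helper are hypothetical.

```python
class Query:
    """Toy stand-in for a SchemaRDD: holds a logical plan, runs it on collect()."""

    def __init__(self, plan):
        self.plan = plan

    def where(self, predicate):
        # Builds a new plan node; no data is touched here.
        return Query(("Filter", predicate, self.plan))

    def select(self, *columns):
        return Query(("Project", columns, self.plan))

    def collect(self):
        return list(_execute(self.plan))


def _execute(plan):
    kind = plan[0]
    if kind == "Scan":
        return iter(plan[1])
    if kind == "Filter":
        return (row for row in _execute(plan[2]) if plan[1](row))
    if kind == "Project":
        return ({c: row[c] for c in plan[1]} for row in _execute(plan[2]))
    raise ValueError(f"unknown operator: {kind}")


people = Query(("Scan", [{"name": "a", "age": 30}, {"name": "b", "age": 12}]))
adults = people.where(lambda r: r["age"] >= 18).select("name")
rows = adults.collect()
```

Because `where` and `select` only build plan nodes, an optimizer would have the whole tree available to rewrite before `collect()` runs anything.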
SQL core in depth 
o.a.s.sql.execution.SparkStrategies 
Converts a logical plan to a physical plan.
Some rules rely on statistics
of the operators.
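As an illustration of a statistics-based choice, the Python sketch below picks a join strategy from estimated input sizes. Everything in it is an assumption made for the example: the `Relation` class, the `choose_join` function, the strategy labels and the threshold value do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str
    size_in_bytes: int  # estimated size, the only statistic this toy uses


# Hypothetical cutoff chosen for the example, not a value from the slides.
BROADCAST_THRESHOLD = 10 * 1024 * 1024


def choose_join(left, right, threshold=BROADCAST_THRESHOLD):
    """Return (strategy, relation to broadcast or None)."""
    smaller = min(left, right, key=lambda r: r.size_in_bytes)
    if smaller.size_in_bytes <= threshold:
        # Small enough to copy to every worker: avoids shuffling the big side.
        return ("broadcast", smaller.name)
    return ("shuffle", None)


facts = Relation("facts", 5 * 1024 ** 3)
dims = Relation("dims", 1024)
small_case = choose_join(facts, dims)
large_case = choose_join(facts, Relation("events", 2 * 1024 ** 3))
```

The same logical join therefore maps to different physical operators depending on what is known about its inputs.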
SQL core in depth 
Parquet read/write support 
Columnar storage format for Hadoop 
Reads existing Parquet files. 
Converts Parquet schema to row schema. 
Writes new Parquet files. 
Currently DecimalType and 
TimestampType are not supported. 
SQL core in depth 
JSON read support 
Loads a JSON file (one object per line) 
Infers row schema from the entire 
dataset. 
Supplying the schema explicitly is experimental.
Inferring the schema by sampling is
also experimental.
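Schema inference over line-delimited JSON can be sketched with the standard `json` module. This is a toy, not Spark's algorithm. The function `infer_schema` is hypothetical, it reports Python type names rather than Catalyst data types, and the fallback to string on a type conflict is an assumption of the example.

```python
import json


def infer_schema(lines):
    """Scan every JSON object and record one type name per field."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            seen = type(value).__name__
            previous = schema.get(field)
            if previous is None or previous == "NoneType":
                schema[field] = seen
            elif seen not in (previous, "NoneType"):
                # Conflicting types: fall back to string (assumption of this toy).
                schema[field] = "str"
    return schema


lines = ['{"id": 1, "name": "a"}', '{"id": 2, "tags": ["x"]}', '{"id": "3"}']
full_scan = infer_schema(lines)
sampled = infer_schema(lines[:1])
```

Comparing `full_scan` with `sampled` shows the trade-off behind sampling: reading fewer lines is cheaper, but fields or type conflicts that appear only in later lines are missed.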
SQL core in depth 
Columnar in-memory table support 
Caches a table like RDD.cache, but in a
columnar layout.
Can prune unnecessary columns
when reading data.
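Why a columnar layout makes pruning cheap can be seen in a toy cache that stores one list per column. This is illustrative Python, not Spark's in-memory columnar format. The `ColumnarCache` class and its `columns_read` bookkeeping are hypothetical, and every row is assumed to have the same fields.

```python
class ColumnarCache:
    """Toy column-oriented cache: one Python list per column."""

    def __init__(self, rows):
        self.columns = {}
        for row in rows:
            for name, value in row.items():
                self.columns.setdefault(name, []).append(value)
        self.columns_read = []  # records which columns scans actually touched

    def scan(self, *names):
        # Only the requested columns are read; the others are never visited.
        self.columns_read.extend(names)
        selected = [self.columns[n] for n in names]
        return [dict(zip(names, values)) for values in zip(*selected)]


cache = ColumnarCache([
    {"id": 1, "name": "a", "blob": "x" * 10},
    {"id": 2, "name": "b", "blob": "y" * 10},
])
ids = cache.scan("id")
```

In a row-oriented cache the same query would have to walk past `name` and `blob` in every row just to reach `id`.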
Interesting issues
Interesting issues 
Support the GroupingSet/ROLLUP/CUBE 
https://issues.apache.org/jira/browse/SPARK-2663 
Use statistics to skip partitions when 
reading from in-memory columnar 
data 
https://issues.apache.org/jira/browse/SPARK-2961 
Interesting issues 
Pluggable interface for shuffles 
https://issues.apache.org/jira/browse/SPARK-2044 
Sort-merge join 
https://issues.apache.org/jira/browse/SPARK-2213 
Cost-based join reordering 
https://issues.apache.org/jira/browse/SPARK-2216 
How to contribute
How to contribute 
See: Contributing to Spark 
Open an issue on JIRA 
Send a pull request on GitHub
Communicate with committers and 
reviewers 
Congratulations! 
Conclusion 
Introduced Spark SQL & Catalyst 
Now you know them very well! 
And you know how to contribute. 
Let’s contribute to Spark & Spark SQL!! 
Addendum
What are we doing?
What are we doing?
Executing a query takes several steps.
Parse
Analyze
Logical Plan
Optimize
Physical Plan
Execute
Figure labels on the final build of this slide: DSL, ASG, business logic.
Thanks!
