SlideShare una empresa de Scribd logo
1 de 72
Descargar para leer sin conexión
Hive	Bucketing	in
Apache	Spark
Tejas	Patil
Facebook
Agenda
• Why	bucketing	?
• Why	is	shuffle	bad	?
• How	to	avoid	shuffle	?
• When	to	use	bucketing	?
• Spark’s	bucketing	support
• Bucketing	semantics	of	Spark	vs	Hive
• Hive	bucketing	support	in	Spark
• SQL	Planner	improvements
Why	bucketing	?
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	JOIN
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	JOIN
Broadcast	hash	join
- Ship	smaller	table	to	all	nodes
- Stream	the	other	table
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	JOIN
Broadcast	hash	join Shuffle	hash	join
- Ship	smaller	table	to	all	nodes
- Stream	the	other	table
- Shuffle	both	tables,
- Hash	smaller	one,	stream	the	
bigger	one
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	JOIN
Broadcast	hash	join Shuffle	hash	join
- Ship	smaller	table	to	all	nodes
- Stream	the	other	table
- Shuffle	both	tables,
- Hash	smaller	one,	stream	the	
bigger	one
Sort	merge	join
- Shuffle	both	tables,
- Sort	both	tables,	buffer	one,	
stream	the	bigger	one
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	JOIN
Broadcast	hash	join Shuffle	hash	join Sort	merge	join
- Ship	smaller	table	to	all	nodes
- Stream	the	other	table
- Shuffle	both	tables,
- Hash	smaller	one,	stream	the	
bigger	one
- Shuffle	both	tables,
- Sort	both	tables,	buffer	one,	
stream	the	bigger	one
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	
Sort	Merge	Join
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	
Shuffle ShuffleShuffleShuffle
Sort	Merge	Join
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	
Sort Sort Sort Sort
Shuffle ShuffleShuffleShuffle
Sort	Merge	Join
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	
Sort
Join
Sort
Join
Sort
Join
Sort
Join
Shuffle ShuffleShuffleShuffle
Sort	Merge	Join
Sort	Merge	Join
Why	is	shuffle	bad	?
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	
Disk Disk
Disk	IO	for	shuffle	outputs
Why	is	shuffle	bad	?
Table	A Table	B
.	.	.	.	.	.	 .	.	.	.	.	.	
Shuffle Shuffle Shuffle Shuffle
M	*	R	network	connections
Why	is	shuffle	bad	?
How	to	avoid	shuffle	?
student	s JOIN	attendance	a	
ON	s.id =	a.student_id
student	s	JOIN	results	r	
ON	s.id =	r.student_id
student	s	JOIN	course_registrationc
ON	s.id =	c.student_id
……..
……..
student	s	JOIN	attendance	a	
ON	s.id =	a.student_id
student	s	JOIN	results	r	
ON	s.id =	r.student_id
student	s	JOIN	attendance	a	
ON	s.id =	a.student_id
student	s	JOIN	results	r	
ON	s.id =	r.student_id
Pre-compute	at	
table	creation
Bucketing	=
pre-(shuffle	+	sort)	inputs
on	join	keys
Without	
bucketing
Job(s)	populating	input	table
Without	
bucketing
With
bucketing
Job(s)	populating	input	table
Performance	comparison
0
100
200
300
400
500
600
700
800
900
Before After	Bucketing
Reserved	CPU	time	(days)
0
1
2
3
4
5
6
7
8
Before After	Bucketing
Latency	(hours)
When	to	use	bucketing	?
• Tables	used	frequently	in	JOINs	with	same	key
• Student	->	student_id
• Employee	->	employee_id
• Users	->	user_id
When	to	use	bucketing	?
• Tables	used	frequently	in	JOINs	with	same	key
• Student	->	student_id
• Employee	->	employee_id
• Users	->	user_id
….	=>	Dimension	tables
When	to	use	bucketing	?
• Tables	used	frequently	in	JOINs	with	same	key
• Student	->	student_id
• Employee	->	employee_id
• Users	->	user_id
• Loading	daily	cumulative	tables
• Both	base	and	delta	tables	could	be	bucketed	on	a	common	column
When	to	use	bucketing	?
• Tables	used	frequently	in	JOINs	with	same	key
• Student	->	student_id
• Employee	->	employee_id
• Users	->	user_id
• Loading	daily	cumulative	tables
• Both	base	and	delta	tables	could	be	bucketed	on	a	common	column
• Indexing	capability
Spark’s	bucketing	support
Creation	of	bucketed	tables
df.write
.bucketBy(numBuckets,	 "col1",	...)
.sortBy("col1",	...)
.saveAsTable("bucketed_table")
via	Dataframe API
Creation	of	bucketed	tables
CREATE	TABLE	bucketed_table(
column1	INT,
...
)	USING	parquet
CLUSTERED	BY	(column1,	…)
SORTED	BY	(column1,	…)
INTO	`n`	BUCKETS
via	SQL	statement
Check	bucketing	spec
scala>	sparkContext.sql("DESC	FORMATTED	student").collect.foreach(println)
[#	col_name,data_type,comment]
[student_id,int,null]
[name,int,null]
[#	Detailed	Table	Information,,]
[Database,default,]
[Table,table1,]
[Owner,tejas,]
[Created,Fri May	12	08:06:33	PDT	2017,]
[Type,MANAGED,]
[Num Buckets,64,]
[Bucket	Columns,[`student_id`],]
[Sort	Columns,[`student_id`],]
[Properties,[serialization.format=1],]
[Serde Library,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat,org.apache.hadoop.mapred.SequenceFileInputFormat,]
Bucketing	config
SET	spark.sql.sources.bucketing.enabled=true
[SPARK-15453]	Extract	bucketing	info	in	
FileSourceScanExec
SELECT	*	FROM	tableA JOIN	tableB ON	tableA.i=	tableB.i AND	tableA.j=	tableB.j
[SPARK-15453]	Extract	bucketing	info	in	
FileSourceScanExec
SELECT	*	FROM	tableA JOIN	tableB ON	tableA.i=	tableB.i AND	tableA.j=	tableB.j
Sort	Merge	Join
Sort
Exchange
TableScan
Sort
Exchange
TableScan Already	bucketed	on	columns	(i,	j)
Shuffle	on	columns	(i,	j)
Sort	on	columns	(i,	j)
Join	on	[i,j],	[i,j]	respectively	of	relations
[SPARK-15453]	Extract	bucketing	info	in	
FileSourceScanExec
SELECT	*	FROM	tableA JOIN	tableB ON	tableA.i=	tableB.i AND	tableA.j=	tableB.j
Sort	Merge	Join
Sort
Exchange
TableScan
Sort
Exchange
TableScan Already	bucketed	on	columns	(i,	j)
Shuffle	on	columns	(i,	j)
Sort	on	columns	(i,	j)
Join	on	[i,j],	[i,j]	respectively	of	relations
[SPARK-15453]	Extract	bucketing	info	in	
FileSourceScanExec
SELECT	*	FROM	tableA JOIN	tableB ON	tableA.i=	tableB.i AND	tableA.j=	tableB.j
Sort	Merge	Join
TableScan TableScan
Sort	Merge	Join
TableScan TableScanSort
Exchange
Sort
Exchange
Bucketing	semantics	of
Spark	vs	Hive
Bucketing	semantics	of	Spark	vs	Hive
bucket	0 bucket	1
bucket
(n-1)
…..
Hive
my_bucketed_table
Bucketing	semantics	of	Spark	vs	Hive
bucket	0 bucket	1
bucket
(n-1)
…..
Hive Spark
bucket	0 bucket	1
bucket
(n-1)
…..
bucket	0
bucket	0
bucket	1
bucket
(n-1)bucket
(n-1)bucket
(n-1)
my_bucketed_tablemy_bucketed_table
Bucketing	semantics	of	Spark	vs	Hive
bucket	0 bucket	1
bucket
(n-1)
…..
Hive
M M M M M
R R R
…..
Bucketing	semantics	of	Spark	vs	Hive
bucket	0 bucket	1
bucket
(n-1)
…..
Hive Spark
M M M M M
R R R
…..
bucket	0 bucket	1
bucket
(n-1)
…..
M M M M M…..
bucket	0
bucket	0
bucket	1
bucket
(n-1)bucket
(n-1)bucket
(n-1)
Bucketing	semantics	of	Spark	vs	Hive
Hive Spark
Model Optimizes	 reads,	writes	are	costly Writes	are	cheaper,	 reads	are	
costlier
Bucketing	semantics	of	Spark	vs	Hive
Hive Spark
Model Optimizes	 reads,	writes	are	costly Writes	are	cheaper,	 reads	are	
costlier
Unit	of	bucketing A	single	file	represents	 a	 single	
bucket
A	collection	 of	files	together	
comprise	 a	single	bucket
Bucketing	semantics	of	Spark	vs	Hive
Hive Spark
Model Optimizes	 reads,	writes	are	costly Writes	are	cheaper,	 reads	are	
costlier
Unit	of	bucketing A	single	file	represents	 a	 single	
bucket
A	collection	 of	files	together	
comprise	 a	single	bucket
Sort	ordering Each	bucket	is	sorted	globally.	 This	
makes	it	easy	to	optimize	 queries	
reading	this	data
Each	file	is	individually	 sorted	 but	
the	“bucket”	as	a	whole	is	not	
globally	 sorted
Bucketing	semantics	of	Spark	vs	Hive
Hive Spark
Model Optimizes	 reads,	writes	are	costly Writes	are	cheaper,	 reads	are	
costlier
Unit	of	bucketing A	single	file	represents	 a	 single	
bucket
A	collection	 of	files	together	
comprise	 a	single	bucket
Sort	ordering Each	bucket	is	sorted	globally.	 This	
makes	it	easy	to	optimize	 queries	
reading	this	data
Each	file	is	individually	 sorted	 but	
the	“bucket”	as	a	whole	is	not	
globally	 sorted
Hashing	function Hive’s	 inbuilt	 hash Murmur3Hash
[SPARK-19256] Hive	bucketing	support
• Introduce	Hive’s	hashing	function	[SPARK-17495]
• Enable	creating	hive	bucketed	tables	[SPARK-17729]
• Support	Hive’s	`Bucketed	file	format`
• Propagate	Hive	bucketing	information	to	planner	[SPARK-17654]
• Expressing	outputPartitioning and	requiredChildDistribution
• Creating	empty	bucket	files	when	no	data	for	a	bucket
• Allow	Sort	Merge	join	over	tables	with	number	of	buckets	
multiple	of	each	other
• Support	N-way	Sort	Merge	Join
[SPARK-19256]	Hive	bucketing	support
• Introduce	Hive’s	hashing	function	[SPARK-17495]
• Enable	creating	hive	bucketed	tables	[SPARK-17729]
• Support	Hive’s	`Bucketed	file	format`
• Propagate	Hive	bucketing	information	to	planner	[SPARK-17654]
• Expressing	outputPartitioning and	requiredChildDistribution
• Creating	empty	bucket	files	when	no	data	for	a	bucket
• Allow	Sort	Merge	join	over	tables	with	number	of	buckets	
multiple	of	each	other
• Support	N-way	Sort	Merge	Join
Merged	in
upstream
[SPARK-19256]	Hive	bucketing	support
• Introduce	Hive’s	hashing	function	[SPARK-17495]
• Enable	creating	hive	bucketed	tables	[SPARK-17729]
• Support	Hive’s	`Bucketed	file	format`
• Propagate	Hive	bucketing	information	to	planner	[SPARK-17654]
• Expressing	outputPartitioning and	requiredChildDistribution
• Creating	empty	bucket	files	when	no	data	for	a	bucket
• Allow	Sort	Merge	join	over	tables	with	number	of	buckets	
multiple	of	each	other
• Support	N-way	Sort	Merge	Join
FB-prod	(6	months)
SQL	Planner	improvements
[SPARK-19122]	Unnecessary	shuffle+sort added	if	
join	predicates	ordering	differ	from	bucketing	and	
sorting	order
SELECT	*	FROM	a	JOIN	b	ON	a.x=b.xAND	a.y=b.y
[SPARK-19122]	Unnecessary	shuffle+sort added	if	
join	predicates	ordering	differ	from	bucketing	and	
sorting	order
Sort	Merge	Join
TableScan TableScan
SELECT	*	FROM	a	JOIN	b	ON	a.x=b.xAND	a.y=b.y
[SPARK-19122]	Unnecessary	shuffle+sort added	if	
join	predicates	ordering	differ	from	bucketing	and	
sorting	order
SELECT	*	FROM	a	JOIN	b	ON	a.x=b.xAND	a.y=b.y
Sort	Merge	Join
TableScan TableScan
SELECT	*	FROM	a	JOIN	b	ON	a.y=b.y AND	a.x=b.x
[SPARK-19122]	Unnecessary	shuffle+sort added	if	
join	predicates	ordering	differ	from	bucketing	and	
sorting	order
Sort	Merge	Join
Exchange
TableScan
Sort
Exchange
TableScan
Sort
Sort	Merge	Join
TableScan TableScan
SELECT	*	FROM	a	JOIN	b	ON	a.x=b.xAND	a.y=b.y SELECT	*	FROM	a	JOIN	b	ON	a.y=b.y AND	a.x=b.x
[SPARK-19122]	Unnecessary	shuffle+sort added	if	
join	predicates	ordering	differ	from	bucketing	and	
sorting	order
Sort	Merge	Join
Exchange
TableScan
Sort
Exchange
TableScan
Sort
Sort	Merge	Join
TableScan TableScan
SELECT	*	FROM	a	JOIN	b	ON	a.x=b.xAND	a.y=b.y SELECT	*	FROM	a	JOIN	b	ON	a.y=b.y AND	a.x=b.x
[SPARK-19122]	Unnecessary	shuffle+sort added	if	
join	predicates	ordering	differ	from	bucketing	and	
sorting	order
Sort	Merge	Join
TableScan TableScan
SELECT	*	FROM	a	JOIN	b	ON	a.x=b.xAND	a.y=b.y SELECT	*	FROM	a	JOIN	b	ON	a.y=b.y AND	a.x=b.x
Sort	Merge	Join
TableScan TableScan
[SPARK-17271]	Planner	adds	un-necessary	Sort	
even	if	child	ordering	is	semantically	same	as	
required	ordering
SELECT	*	FROM	table1	a	JOIN	table2	b	ON	a.col1=b.col1
[SPARK-17271]	Planner	adds	un-necessary	Sort	
even	if	child	ordering	is	semantically	same	as	
required	ordering
SELECT	*	FROM	table1	a	JOIN	table2	b	ON	a.col1=b.col1
Sort	Merge	Join
Sort
TableScan
Sort
TableScan Already	bucketed	on	column	`col1`
Sort	on	columns	(a.col1	and	b.col1)
Join	on	[col1],	[col1]	respectively	of	relations
[SPARK-17271]	Planner	adds	un-necessary	Sort	
even	if	child	ordering	is	semantically	same	as	
required	ordering
SELECT	*	FROM	table1	a	JOIN	table2	b	ON	a.col1=b.col1
Sort	Merge	Join
TableScan TableScan Already	bucketed	on	column	`col1`
Sort	on	columns	(a.col1	and	b.col1)Sort Sort
Join	on	[col1],	[col1]	respectively	of	relations
[SPARK-17271]	Planner	adds	un-necessary	Sort	
even	if	child	ordering	is	semantically	same	as	
required	ordering
SELECT	*	FROM	table1	a	JOIN	table2	b	ON	a.col1=b.col1
Sort	Merge	Join
TableScan TableScan
Sort Sort
Sort	Merge	Join
TableScan TableScan
[SPARK-17698]	Join	predicates	should	not	
contain	filter	clauses
SELECT	a.id,	b.id FROM	table1	a	FULL	OUTER	JOIN	table2	b	ON	a.id=	b.id AND	a.id='1'	AND	b.id='1'
[SPARK-17698]	Join	predicates	should	not	
contain	filter	clauses
SELECT	a.id,	b.id FROM	table1	a	FULL	OUTER	JOIN	table2	b	ON	a.id=	b.id AND	a.id='1'	AND	b.id='1'
Sort	Merge	Join
TableScan TableScan
Sort
Exchange
Sort
Exchange
Already	bucketed	on	column	[id]
Shuffle	on	columns	[id,	1,	id]	and	[id,	id,	1]
Sort	on	columns	[id,	id,	1]	and	[id,	1,	id]	
Join	on	[id,	id,	1],	[id,	1,	id]	respectively	of	relations
[SPARK-17698]	Join	predicates	should	not	
contain	filter	clauses
SELECT	a.id,	b.id FROM	table1	a	FULL	OUTER	JOIN	table2	b	ON	a.id =	b.id AND a.id='1'	AND b.id='1'
Sort	Merge	Join
TableScan TableScan
Sort
Exchange
Sort
Exchange
Already	bucketed	on	column	[id]
Shuffle	on	columns	[id,	1,	id]	and	[id,	id,	1]
Sort	on	columns	[id,	id,	1]	and	[id,	1,	id]	
Join	on	[id,	id,	1],	[id,	1,	id]	respectively	of	relations
[SPARK-17698]	Join	predicates	should	not	
contain	filter	clauses
SELECT	a.id,	b.id FROM	table1	a	FULL	OUTER	JOIN	table2	b	ON	a.id =	b.id AND a.id='1'	AND b.id='1'
Sort	Merge	Join
TableScan TableScan
Sort
Exchange
Sort
Exchange
Already	bucketed	on	column	[id]
Shuffle	on	columns	[id,	1,	id]	and	[id,	id,	1]
Sort	on	columns	[id,	id,	1]	and	[id,	1,	id]	
Join	on	[id,	id,	1],	[id,	1,	id]	respectively	of	relations
[SPARK-17698]	Join	predicates	should	not	
contain	filter	clauses
SELECT	a.id,	b.id FROM	table1	a	FULL	OUTER	JOIN	table2	b	ON	a.id =	b.id AND a.id='1'	AND b.id='1'
Sort	Merge	Join
TableScan TableScan
Sort	Merge	Join
TableScan TableScanSort
Exchange
Sort
Exchange
[SPARK-13450]	Introduce	
ExternalAppendOnlyUnsafeRowArray
[SPARK-13450]	Introduce	
ExternalAppendOnlyUnsafeRowArray
Buffered	
relation
Streamed	
relation
If	there	are	a	lot	of	rows	to	be	
buffered	(eg.	Skew),	
the	query	would	OOM
row	with	
key=`foo`
rows	with	
key=`foo`
[SPARK-13450]	Introduce	
ExternalAppendOnlyUnsafeRowArray
Buffered	
relation
Streamed	
relation
row	with	
key=`foo`
rows	with	
key=`foo`
Disk
buffer	would	spill	to	disk	
after	reaching	a	threshold
spill
Summary
• Shuffle	and	sort	is	costly	due	to	disk	and	network	IO
• Bucketing	will	pre-(shuffle	and	sort)	the	inputs
• Expect	at	least	2-5x	gains	after	bucketing	input	tables	for	joins
• Candidates	for	bucketing:
• Tables	used	frequently	in	JOINs	with	same	key
• Loading	of	data	in	cumulative	tables
• Spark’s	bucketing	is	not	compatible	with	Hive’s	bucketing
Questions	?

Más contenido relacionado

La actualidad más candente

The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 

La actualidad más candente (20)

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 

Más de Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
shivangimorya083
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Último (20)

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 

Hive Bucketing in Apache Spark with Tejas Patil