1© Cloudera, Inc. All rights reserved.
Tips for Writing ETL Pipelines
with Spark
Imran Rashid|Cloudera, Apache Spark PMC
2© Cloudera, Inc. All rights reserved.
Outline
• Quick Refresher
• Tips for Pipelines
• Spark Performance
• Using the UI
• Understanding Stage Boundaries
• Baby photos
3© Cloudera, Inc. All rights reserved.
About Me
• Member of the Spark PMC
• User of Spark from v0.5 at Quantifind
• Built ETL pipelines, prototype to production
• Supported Data Scientists
• Now work on Spark full time at Cloudera
4© Cloudera, Inc. All rights reserved.
RDDs: Resilient Distributed Dataset
• Data is distributed into partitions spread across a cluster
• Each partition is processed independently and in parallel
• Logical view of the data – not materialized
Image from Dean Wampler, Typesafe
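The lazy, logical view described above can be seen in a short sketch (an existing SparkContext `sc` and the input path are assumptions for illustration):

```scala
// Nothing is computed yet: each call only records a transformation
val lines  = sc.textFile("hdfs://.../input")    // logical view, not materialized
val errors = lines.filter(_.contains("ERROR"))  // still nothing computed
// An action finally triggers execution; each partition is
// processed independently and in parallel
val numErrors = errors.count()
```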
5© Cloudera, Inc. All rights reserved.
Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
6© Cloudera, Inc. All rights reserved.
Cheap!
• No serialization
• No IO
• Pipelined

Expensive!
• Serialize Data
• Write to disk
• Transfer over network
• Deserialize Data
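A small sketch of the contrast (paths and field choices are illustrative, assuming an existing SparkContext `sc`):

```scala
// Cheap: map and filter are pipelined into a single stage; each record
// flows through both functions with no serialization or IO in between
val cleaned = sc.textFile("hdfs://.../input")
  .map(_.trim)
  .filter(_.nonEmpty)

// Expensive: reduceByKey introduces a shuffle; data is serialized,
// written to disk, sent over the network, and deserialized on the other side
val counts = cleaned.map(word => (word, 1)).reduceByKey(_ + _)
```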
7© Cloudera, Inc. All rights reserved.
Compare to MapReduce Word Count
Hadoop MapReduce:
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Spark:
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
8© Cloudera, Inc. All rights reserved.
Useful Patterns
9© Cloudera, Inc. All rights reserved.
Pipelines get complicated
• Pipelines get messy
• Input data is messy
• Things go wrong
• Never fast enough
• Need stability for months to years
• Need forecasting / capacity planning
(Diagram: contributors over time: Alice one year ago, Bob 6 months ago, Connie 3 months ago, Derrick last month, Alice last week)
10© Cloudera, Inc. All rights reserved.
Design Goals
• Modularity
• Error Handling
• Understand where and how things fail
11© Cloudera, Inc. All rights reserved.
Catching Errors (1)
sc.textFile(…).map{ line =>
//blows up with parse exception
parse(line)
}
sc.textFile(…).flatMap { line =>
//now we’re safe, right?
Try(parse(line)).toOption
}
How many errors?
1 record? 100 records?
90% of our data?
12© Cloudera, Inc. All rights reserved.
Catching Errors (2)
val parseErrors = sc.accumulator(0L)
val parsed = sc.textFile(…).flatMap { line =>
  Try(parse(line)) match {
    case Success(s) => Some(s)
    case Failure(f) =>
      parseErrors += 1
      None
  }
}
// parse errors is always 0
if (parseErrors > 500) fail(…)
// and what if we want to see those errors?
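One way out of the trap above is to force an action (and usually cache) before reading the accumulator. A sketch, assuming the `parse` function from the slides and the Spark 1.x accumulator API; `path` is illustrative:

```scala
val parseErrors = sc.accumulator(0L)
val parsed = sc.textFile(path).flatMap { line =>
  Try(parse(line)) match {
    case Success(s) => Some(s)
    case Failure(_) =>
      parseErrors += 1
      None
  }
}
parsed.cache()   // keep the parsed data around for the rest of the pipeline
parsed.count()   // force the computation so the accumulator is populated
if (parseErrors.value > 500) sys.error("too many parse errors")
```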
13© Cloudera, Inc. All rights reserved.
Catching Errors (3)
• Accumulators break the RDD abstraction
• You care about when an action has taken place
• Force action, or pass error handling on
• SparkListener to deal w/ failures
• https://gist.github.com/squito/2f7cc02c313e4c9e7df4#file-accumulatorlistener-scala

case class ParsedWithErrorCounts(parsed: RDD[LogLine], errors: Accumulator[Long])

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrorCounter = sc.accumulator(0L).setName("parseErrors")
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrorCounter += 1
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrorCounter)
}
14© Cloudera, Inc. All rights reserved.
Catching Errors (4)
• Accumulators can give you "multiple output"
• Create sample of error records
• You can look at them for debugging
• WARNING: accumulators are not scalable

class ReservoirSample[T] {...}
class ReservoirSampleAccumulableParam[T] extends
  AccumulableParam[ReservoirSample[T], T] {...}

def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = {
  val parseErrors = sc.accumulable(new ReservoirSample[String](100))(…)
  val parsed = sc.textFile(path).flatMap { line =>
    line match {
      case LogPattern(date, thread, level, source, msg) =>
        Some(LogLine(date, thread, level, source, msg))
      case _ =>
        parseErrors += line
        None
    }
  }
  ParsedWithErrorCounts(parsed, parseErrors)
}
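The body of `ReservoirSample` is elided on the slide; one way it might look, as a plain-Scala sketch of standard reservoir sampling (Algorithm R), is:

```scala
import scala.util.Random

// Keeps a uniform random sample of at most `maxSize` of the items seen so far.
class ReservoirSample[T](val maxSize: Int) extends Serializable {
  val sample = scala.collection.mutable.ArrayBuffer.empty[T]
  private var seen = 0L

  def add(item: T): Unit = {
    seen += 1
    if (sample.size < maxSize) {
      sample += item            // reservoir not full yet: always keep the item
    } else {
      // Keep the new item with probability maxSize / seen, replacing a
      // uniformly chosen existing slot
      val idx = (Random.nextDouble() * seen).toLong
      if (idx < maxSize) sample(idx.toInt) = item
    }
  }
}
```

A matching `AccumulableParam` would then merge two reservoirs by adding one sample's elements into the other.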
15© Cloudera, Inc. All rights reserved.
Catching Errors (5)
• What if instead, we just filter out each condition?
• Beware deep pipelines
• Eg. RDD.randomSplit
(Diagram: Huge Raw Data → Filter → FlatMap → …parsed, with separate filtered branches for Error 1 and Error 2)
16© Cloudera, Inc. All rights reserved.
Modularity with RDDs
• Who is caching what?
• What resources should each component use?
• What assumptions are made on inputs?
17© Cloudera, Inc. All rights reserved.
Win By Cheating
• Fastest way to shuffle a lot of data:
• Don’t shuffle
• Second fastest way to shuffle a lot of data:
• Shuffle a small amount of data
• ReduceByKey
• Approximate Algorithms
• Same as MapReduce
• BloomFilters, HyperLogLog, Tdigest
• Joins with Narrow Dependencies
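The approximate-algorithm point can be made concrete with Spark's built-in HyperLogLog-based distinct count (the `.source` field from the earlier `LogLine` example is used for illustration):

```scala
// Exact distinct count shuffles every key; the approximate version
// keeps only a small HyperLogLog sketch per partition and merges those
val sources = parsed.map(_.source)
val exact   = sources.distinct().count()                      // full shuffle
val approx  = sources.countApproxDistinct(relativeSD = 0.01)  // tiny shuffle
```

BloomFilters and t-digests follow the same pattern: build a small sketch per partition, then merge sketches instead of shuffling raw records.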
18© Cloudera, Inc. All rights reserved.
ReduceByKey when Possible
• ReduceByKey allows a map-side-combine
• Data is merged together before it's serialized & sent over network
• GroupByKey transfers all the data
• Higher serialization and network transfer costs

parsed
  .map{ line => (line.level, 1) }
  .reduceByKey{ (a, b) => a + b }
  .collect()

parsed
  .map{ line => (line.level, 1) }
  .groupByKey.map{ case (word, counts) => (word, counts.sum) }
  .collect()
19© Cloudera, Inc. All rights reserved.
But I need groupBy
• Eg., incoming transaction logs from user
• 10 TB of historical data
• 50 GB of new data each day
(Diagram: Historical Logs merged with Day 1, Day 2, and Day 3 logs into Grouped Logs)
20© Cloudera, Inc. All rights reserved.
Using Partitioners for Narrow Joins
• Sort the Historical Logs once
• Each day, sort the small new data
• Join – narrow dependency
• Write data to hdfs
• Day 2 – now what?
• SPARK-1061
• Read from hdfs
• "Remember" data was written with a partitioner
(Diagram: wide join vs. narrow join)
21© Cloudera, Inc. All rights reserved.
Assume Partitioned
• Day 2 – now what?
• SPARK-1061
• Read from hdfs
• "Remember" data was written with a partitioner

// Day 1
val myPartitioner = …
val historical =
  sc.hadoopFile("…/mergedLogs/2015/05/19", …)
    .partitionBy(myPartitioner)
val newData =
  sc.hadoopFile("…/newData/2015/05/20", …)
    .partitionBy(myPartitioner)
val grouped = historical.cogroup(newData)
grouped.saveAsHadoopFile("…/mergedLogs/2015/05/20")

// Day 2 – new spark context
val historical =
  sc.hadoopFile("…/mergedLogs/2015/05/20", …)
    .assumePartitionedBy(myPartitioner)
22© Cloudera, Inc. All rights reserved.
Recovering from Errors
• I write bugs
• You write bugs
• Spark has bugs
• The bugs might appear after 17 hours in stage 78 of your application
• Spark’s failure recovery might not help you
23© Cloudera, Inc. All rights reserved.
HDFS: It's not so bad
• DiskCachedRDD
• Before doing any work, check if it exists on disk
• If so, just load it
• If not, create it and write it to disk
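`DiskCachedRDD` itself is not shown; a minimal sketch of the check-or-create pattern (not a built-in Spark API; text data and an available SparkContext are assumed) might look like:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Materialize an RDD to HDFS the first time through; on later runs
// (or after a crash) just reload the saved copy instead of recomputing
def diskCached(sc: SparkContext, path: String)
              (compute: => RDD[String]): RDD[String] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (!fs.exists(new Path(path))) {
    compute.saveAsTextFile(path)  // do the expensive work exactly once
  }
  sc.textFile(path)               // always read back the materialized copy
}
```

Reading back from HDFS also truncates the lineage, so a failure in stage 78 after 17 hours restarts from the last saved step rather than from scratch.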
24© Cloudera, Inc. All rights reserved.
Partitions, Partitions, Partitions …
• Partitions should be small
• Max partition size is 2GB*
• Small partitions help deal w/ stragglers
• Small partitions avoid overhead – take a closer look at internals …
• Partitions should be big
• "For ML applications, the best setting to set the number of partitions to match the number of cores to reduce shuffle size." Xiangrui Meng on user@
• Why? Take a closer look at internals …
25© Cloudera, Inc. All rights reserved.
Parameterize Partition Numbers
• Many transformations take a second parameter
• reduceByKey(…, nPartitions)
• sc.textFile(…, nPartitions)
• Both sides of shuffle matter!
• Shuffle read (aka “reduce”)
• Shuffle write (aka “map”) – controlled by previous stage
• As datasets change, you might need to change the numbers
• Make this a parameter to your application
• Yes, you may need to expose a LOT of parameters
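A sketch of what parameterizing partition counts can look like (the config class and paths are hypothetical):

```scala
// Hypothetical config: expose partition counts instead of hard-coding them,
// so they can be retuned as the dataset grows
case class PipelineConf(inputPartitions: Int, shufflePartitions: Int)

def wordCount(sc: SparkContext, conf: PipelineConf): Unit = {
  sc.textFile("hdfs://.../input", conf.inputPartitions)  // shuffle-write side
    .flatMap(_.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _, conf.shufflePartitions)          // shuffle-read side
    .saveAsTextFile("hdfs://.../output")
}
```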
26© Cloudera, Inc. All rights reserved.
Using the UI
27© Cloudera, Inc. All rights reserved.
Some Demos
• Collect a lot of data
• Slow tasks
• DAG visualization
• RDD names
28© Cloudera, Inc. All rights reserved.
Understanding Performance
29© Cloudera, Inc. All rights reserved.
What data and where is it going?
• Narrow Dependencies (aka “OneToOneDependency”)
• cheap
• Wide Dependencies (aka shuffles)
• how much is shuffled
• Is it skewed
• Driver bottleneck
30© Cloudera, Inc. All rights reserved.
Driver can be a bottleneck
Credit: Sandy Ryza, Cloudera
31© Cloudera, Inc. All rights reserved.
Driver can be a bottleneck
rdd.collect()
  GOOD: Exploratory data analysis; merging a small set of results.
  BAD: Sequentially scanning the entire data set on the driver; no parallelism, OOM on driver.
rdd.reduce()
  GOOD: Summarize the results from a small dataset.
  BAD: Big data structures, from lots of partitions.
sc.accumulator()
  GOOD: Small data types, eg., counters.
  BAD: Big data structures, from lots of partitions; eg., a set of a million "most interesting" user ids from each partition.
32© Cloudera, Inc. All rights reserved.
Stage Boundaries
33© Cloudera, Inc. All rights reserved.
Stages are not MapReduce Steps!
(Diagram: a MapReduce pipeline alternates Map → Shuffle → Reduce for every step, so four steps mean four shuffles; the equivalent Spark job pipelines ReduceByKey (with map-side combine), Filter, FlatMap, GroupByKey, and Collect into far fewer stages, shuffling only at stage boundaries)
34© Cloudera, Inc. All rights reserved.
I still get confused
(discussion in a code review, testing a large sortByKey)
WP: … then we wait for completion of stage 3 …
ME: hang on, stage 3? Why are there 3 stages? SortByKey does one extra pass to find the range of the keys, but that's two stages
WP: The other stage is data generation
ME: That can't be right. Data generation is pipelined, it's just part of the first stage
…
ME: duh – the final sort is two stages – shuffle write then shuffle read
(Diagram: InputRDD feeds Stage 1, sampling the data to find the range of keys; Stage 2, the shuffle map for the sort; and Stage 3, the shuffle read for the sort. NB: the InputRDD is computed twice!)
35© Cloudera, Inc. All rights reserved.
Tip grab bag
• Minimize data volume
• Compact formats: avro, parquet
• Kryo Serialization
• require registration in development, but not in production
• Look at data skew, key cardinality
• Tune your cluster
• Use the UI to tune your job
• Set names on all cached RDDs
36© Cloudera, Inc. All rights reserved.
More Resources
• Very active and friendly community
• http://spark.apache.org/community.html
• Dean Wampler’s self-paced spark workshop
• https://github.com/deanwampler/spark-workshop
• Tips for Better Spark Jobs
• http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
• Tuning & Debugging Spark (with another explanation of internals)
• http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark
• Tuning Spark On Yarn
• http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
37© Cloudera, Inc. All rights reserved.
Thank you
38© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Try 1)
39© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Try 2)
40© Cloudera, Inc. All rights reserved.
Cleaning Up Resources (Success)

Más contenido relacionado

La actualidad más candente

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 

La actualidad más candente (20)

Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Kafka connect 101
Kafka connect 101Kafka connect 101
Kafka connect 101
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Enterprise Integration Patterns
Enterprise Integration PatternsEnterprise Integration Patterns
Enterprise Integration Patterns
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
Apache Calcite: A Foundational Framework for Optimized Query Processing Over ...
 

Destacado

Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
DataWorks Summit
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
Alfredo Abate
 

Destacado (20)

ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Oracle's BigData solutions
Oracle's BigData solutionsOracle's BigData solutions
Oracle's BigData solutions
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Spark meetup v2.0.5
Spark meetup v2.0.5Spark meetup v2.0.5
Spark meetup v2.0.5
 
Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 

Similar a Spark etl

Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 

Similar a Spark etl (20)

Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco Vasquez
 
ASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH dataASHviz - Dats visualization research experiments using ASH data
ASHviz - Dats visualization research experiments using ASH data
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 

Último

CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Último (20)

Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 

Spark etl

  • 1. 1© Cloudera, Inc. All rights reserved. Tips for Writing ETL Pipelines with Spark Imran Rashid|Cloudera, Apache Spark PMC
  • 2. 2© Cloudera, Inc. All rights reserved. Outline • Quick Refresher • Tips for Pipelines • Spark Performance • Using the UI • Understanding Stage Boundaries • Baby photos
  • 3. 3© Cloudera, Inc. All rights reserved. About Me • Member of the Spark PMC • User of Spark from v0.5 at Quantifind • Built ETL pipelines, prototype to production • Supported Data Scientists • Now work on Spark full time at Cloudera
  • 4. 4© Cloudera, Inc. All rights reserved. RDDs: Resilient Distributed Dataset • Data is distributed into partitions spread across a cluster • Each partition is processed independently and in parallel • Logical view of the data – not materialized Image from Dean Wampler, Typesafe
  • 5. 5© Cloudera, Inc. All rights reserved. Expressive API • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save • ...
  • 6. 6© Cloudera, Inc. All rights reserved. Cheap! • No serialization • No IO • Pipelined Expensive! • Serialize Data • Write to disk • Transfer over network • Deserialize Data
  • 7. 7© Cloudera, Inc. All rights reserved. Compare to MapReduce Word Count Hadoop MapReduce: public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } Spark: val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • 8. 8© Cloudera, Inc. All rights reserved. Useful Patterns
  • 9. 9© Cloudera, Inc. All rights reserved. Pipelines get complicated • Pipelines get messy • Input data is messy • Things go wrong • Never fast enough • Need stability for months to years • Need Forecasting / Capacity Planning [annotations: Alice, one year ago; Bob, 6 months ago; Connie, 3 months ago; Derrick, last month; Alice, last week]
  • 10. 10© Cloudera, Inc. All rights reserved. Design Goals • Modularity • Error Handling • Understand where and how
  • 11. 11© Cloudera, Inc. All rights reserved. Catching Errors (1) sc.textFile(…).map{ line => //blows up with parse exception parse(line) } sc.textFile(…).flatMap { line => //now we’re safe, right? Try(parse(line)).toOption } How many errors? 1 record? 100 records? 90% of our data?
  • 12. 12© Cloudera, Inc. All rights reserved. Catching Errors (2) val parseErrors = sc.accumulator(0L) val parsed = sc.textFile(…).flatMap { line => Try(parse(line)) match { case Success(s) => Some(s) case Failure(f) => parseErrors += 1 None } // parse errors is always 0 if (parseErrors > 500) fail(…) // and what if we want to see those errors?
  • 13. 13© Cloudera, Inc. All rights reserved. Catching Errors (3) • Accumulators break the RDD abstraction • You care about when an action has taken place • Force action, or pass error handling on • SparkListener to deal w/ failures • https://gist.github.com/squito/2f7cc02c313e4c9e7df4#file-accumulatorlistener-scala case class ParsedWithErrorCounts(val parsed: RDD[LogLine], errors: Accumulator[Long]) def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = { val parseErrorCounter = sc.accumulator(0L).setName(“parseErrors”) val parsed = sc.textFile(path).flatMap { line => line match { case LogPattern(date, thread, level, source, msg) => Some(LogLine(date, thread, level, source, msg)) case _ => parseErrorCounter += 1 None } } ParsedWithErrorCounts(parsed, parseErrorCounter) }
  • 14. 14© Cloudera, Inc. All rights reserved. Catching Errors (4) • Accumulators can give you “multiple output” • Create sample of error records • You can look at them for debugging • WARNING: accumulators are not scalable class ReservoirSample[T] {...} class ReservoirSampleAccumulableParam[T] extends AccumulableParam[ReservoirSample[T], T]{...} def parseCountErrors(path: String, sc: SparkContext): ParsedWithErrorCounts = { val parseErrors = sc.accumulable( new ReservoirSample[String](100))(…) val parsed = sc.textFile(path).flatMap { line => line match { case LogPattern(date, thread, level, source, msg) => Some(LogLine(date, thread, level, source, msg)) case _ => parseErrors += line None } } ParsedWithErrorCounts(parsed, parseErrors) }
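The ReservoirSample[T] body is elided on the slide. As a hedged, pure-Scala sketch of what such a class might look like (this is standard reservoir sampling, Algorithm R; the names maxSize, add, and sample are assumptions, not the slide's actual API):

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of the elided ReservoirSample[T]: keeps a uniform
// random sample of at most `maxSize` of all the items it has seen,
// using constant memory regardless of how many items flow through.
class ReservoirSample[T](val maxSize: Int, seed: Long = 42L) {
  private val rng = new Random(seed)
  private val buf = ArrayBuffer.empty[T]
  private var seen = 0L

  def add(item: T): Unit = {
    seen += 1
    if (buf.size < maxSize) {
      buf += item // still filling the reservoir
    } else {
      // Replace a random slot with probability maxSize / seen
      val j = (rng.nextDouble() * seen).toLong
      if (j < maxSize) buf(j.toInt) = item
    }
  }

  def sample: Seq[T] = buf.toList
}
```

Merging two reservoirs (needed for AccumulableParam.addInPlace) takes more care than shown here; this only illustrates the per-record path.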
  • 15. 15© Cloudera, Inc. All rights reserved. Catching Errors (5) • What if instead, we just filter out each condition? • Beware deep pipelines • E.g. RDD.randomSplit [diagram: huge raw data fed through filter / flatMap into parsed records plus separate Error 1 and Error 2 outputs]
  • 16. 16© Cloudera, Inc. All rights reserved. Modularity with RDDs • Who is caching what? • What resources should each component use? • What assumptions are made on inputs?
  • 17. 17© Cloudera, Inc. All rights reserved. Win By Cheating • Fastest way to shuffle a lot of data: • Don’t shuffle • Second fastest way to shuffle a lot of data: • Shuffle a small amount of data • ReduceByKey • Approximate Algorithms • Same as MapReduce • BloomFilters, HyperLogLog, Tdigest • Joins with Narrow Dependencies
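One way to "shuffle a small amount of data" is to replace exact distributed structures with compact approximate ones, as the slide's BloomFilter / HyperLogLog / Tdigest bullet suggests. A toy Bloom filter in plain Scala (not Spark's or any library's implementation; the sizes and the double-hashing scheme are illustrative only):

```scala
import java.util.BitSet

// Toy Bloom filter: constant-size set membership with a small false-positive
// rate and no false negatives. A few KB of bits can stand in for shuffling
// millions of raw keys across the network.
class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new BitSet(numBits)

  // Derive k indexes from two base hashes (cheap second hash for the sketch)
  private def indexes(item: String): Seq[Int] = {
    val h1 = item.hashCode
    val h2 = item.reverse.hashCode
    (0 until numHashes).map(i => math.abs((h1 + i * h2) % numBits))
  }

  def add(item: String): Unit = indexes(item).foreach(bits.set)

  // true may be a false positive; false is always correct
  def mightContain(item: String): Boolean = indexes(item).forall(bits.get)
}
```

In a pipeline you might build one filter per partition and OR the bit sets together on the driver, then broadcast the result to prune a join input.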
  • 18. 18© Cloudera, Inc. All rights reserved. ReduceByKey when Possible • ReduceByKey allows a map-side combine • Data is merged together before it’s serialized & sent over the network • GroupByKey transfers all the data • Higher serialization and network transfer costs parsed .map{line =>(line.level, 1)} .reduceByKey{(a, b) => a + b} .collect() parsed .map{line =>(line.level, 1)} .groupByKey.map{case(word,counts) => (word,counts.sum)} .collect()
  • 19. 19© Cloudera, Inc. All rights reserved. But I need groupBy • E.g., incoming transaction logs from users • 10 TB of historical data • 50 GB of new data each day [diagram: Historical Logs combined with Day 1, Day 2, and Day 3 logs into Grouped Logs]
  • 20. 20© Cloudera, Inc. All rights reserved. Using Partitioners for Narrow Joins • Sort the Historical Logs once • Each day, sort the small new data • Join – narrow dependency • Write data to hdfs • Day 2 – now what? • SPARK-1061 • Read from hdfs • “Remember” data was written with a partitioner Wide Join Narrow Join
  • 21. 21© Cloudera, Inc. All rights reserved. Assume Partitioned • Day 2 – now what? • SPARK-1061 • Read from hdfs • “Remember” data was written with a partitioner // Day 1 val myPartitioner = … val historical = sc.hadoopFile(“…/mergedLogs/2015/05/19”, …) .partitionBy(myPartitioner) val newData = sc.hadoopFile(“…/newData/2015/05/20”, …) .partitionBy(myPartitioner) val grouped = historical.cogroup(newData) grouped.saveAsHadoopFile( “…/mergedLogs/2015/05/20”) //Day 2 – new spark context val historical = sc.hadoopFile(“…/mergedLogs/2015/05/20”, …) .assumePartitionedBy(myPartitioner)
  • 22. 22© Cloudera, Inc. All rights reserved. Recovering from Errors • I write bugs • You write bugs • Spark has bugs • The bugs might appear after 17 hours in stage 78 of your application • Spark’s failure recovery might not help you
  • 23. 23© Cloudera, Inc. All rights reserved. HDFS: Its not so bad • DiskCachedRDD • Before doing any work, check if it exists on disk • If so, just load it • If not, create it and write it to disk
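The DiskCachedRDD idea above (check for an existing copy, load it if present, otherwise compute and persist) can be sketched with local files in plain Scala. This is a hypothetical helper, not a Spark API; in a real pipeline you would check HDFS and use saveAsObjectFile / hadoopFile instead of java.nio:

```scala
import java.nio.file.{Files, Path}
import java.nio.charset.StandardCharsets
import scala.jdk.CollectionConverters._

// Local-file analogy of the DiskCachedRDD pattern: if the checkpoint
// already exists, reuse it (cheap restart after a failure in a later
// stage); otherwise run the expensive computation and persist the result.
def cachedOnDisk(path: Path)(compute: => Seq[String]): Seq[String] = {
  if (Files.exists(path)) {
    // Prior run succeeded up to here: just reload
    Files.readAllLines(path, StandardCharsets.UTF_8).asScala.toSeq
  } else {
    val result = compute
    Files.write(path, result.asJava, StandardCharsets.UTF_8)
    result
  }
}
```

After a 17-hour failure, rerunning the job skips every step whose output already landed on disk and resumes at the first missing checkpoint.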
  • 24. 24© Cloudera, Inc. All rights reserved. Partitions, Partitions, Partitions … • Partitions should be small • Max partition size is 2GB* • Small partitions help deal w/ stragglers • Small partitions avoid overhead – take a closer look at internals … • Partitions should be big • “For ML applications, the best setting is to set the number of partitions to match the number of cores, to reduce shuffle size.” Xiangrui Meng on user@ • Why? Take a closer look at internals …
  • 25. 25© Cloudera, Inc. All rights reserved. Parameterize Partition Numbers • Many transformations take a second parameter • reduceByKey(…, nPartitions) • sc.textFile(…, nPartitions) • Both sides of shuffle matter! • Shuffle read (aka “reduce”) • Shuffle write (aka “map”) – controlled by previous stage • As datasets change, you might need to change the numbers • Make this a parameter to your application • Yes, you may need to expose a LOT of parameters
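The "make partition counts a parameter to your application" advice can be as simple as a small config object populated from the command line. A minimal sketch (the flag names --read-partitions / --reduce-partitions and the defaults are made up for this example):

```scala
// Surface partition counts as application parameters so they can be
// retuned as datasets grow, without recompiling the job.
case class ShuffleConf(readPartitions: Int, reducePartitions: Int)

def parseShuffleConf(args: Array[String]): ShuffleConf = {
  // Interpret args as alternating "--flag value" pairs
  val kv = args.sliding(2, 2).collect { case Array(k, v) => k -> v }.toMap
  ShuffleConf(
    readPartitions   = kv.getOrElse("--read-partitions", "256").toInt,
    reducePartitions = kv.getOrElse("--reduce-partitions", "64").toInt)
}
```

The values then feed both sides of the shuffle, e.g. sc.textFile(path, conf.readPartitions) and reduceByKey(_ + _, conf.reducePartitions).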
  • 26. 26© Cloudera, Inc. All rights reserved. Using the UI
  • 27. 27© Cloudera, Inc. All rights reserved. Some Demos • Collect a lot of data • Slow tasks • DAG visualization • RDD names
  • 28. 28© Cloudera, Inc. All rights reserved. Understanding Performance
  • 29. 29© Cloudera, Inc. All rights reserved. What data and where is it going? • Narrow Dependencies (aka “OneToOneDependency”) • cheap • Wide Dependencies (aka shuffles) • how much is shuffled • Is it skewed • Driver bottleneck
  • 30. 30© Cloudera, Inc. All rights reserved. Driver can be a bottleneck Credit: Sandy Ryza, Cloudera
  • 31. 31© Cloudera, Inc. All rights reserved. Driver can be a bottleneck • rdd.collect() • GOOD: exploratory data analysis; merging a small set of results. BAD: sequentially scanning the entire data set on the driver (no parallelism, OOM on the driver). • rdd.reduce() • GOOD: summarizing the results from a small dataset. BAD: big data structures, from lots of partitions. • sc.accumulator() • GOOD: small data types, e.g. counters. BAD: big data structures, from lots of partitions (e.g. a set of a million “most interesting” user ids from each partition).
  • 32. 32© Cloudera, Inc. All rights reserved. Stage Boundaries
  • 33. 33© Cloudera, Inc. All rights reserved. Stages are not MapReduce Steps! [diagram: a chain of MapReduce steps, each forcing a full map / shuffle / reduce cycle, contrasted with a single Spark job in which reduceByKey (with map-side combine), filter, flatMap, groupByKey, and collect are pipelined into stages that split only at shuffles]
  • 34. 34© Cloudera, Inc. All rights reserved. I still get confused (discussion in a code review, testing a large sortByKey) WP: … then we wait for completion of stage 3 … ME: hang on, stage 3? Why are there 3 stages? SortByKey does one extra pass to find the range of the keys, but that’s two stages WP: The other stage is data generation ME: That can’t be right. Data generation is pipelined; it’s just part of the first stage … ME: duh – the final sort is two stages – shuffle write then shuffle read [diagram: Stage 1: InputRDD, sampling the data to find the range of keys; Stage 2: shuffle map (write) for the sort; Stage 3: shuffle read for the sort. NB: the input is computed twice!]
  • 35. 35© Cloudera, Inc. All rights reserved. Tip grab bag • Minimize data volume • Compact formats: avro, parquet • Kryo Serialization • require registration in development, but not in production • Look at data skew, key cardinality • Tune your cluster • Use the UI to tune your job • Set names on all cached RDDs
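The Kryo tip above can be sketched as a SparkConf fragment (the app name is a placeholder, and LogLine is the class from the earlier parsing slides; spark.kryo.registrationRequired is the toggle you would enable in development and relax in production):

```scala
import org.apache.spark.SparkConf

// Class shape borrowed from the parsing examples earlier in the deck
case class LogLine(date: String, thread: String, level: String,
                   source: String, msg: String)

val conf = new SparkConf()
  .setAppName("etl-pipeline") // placeholder name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Dev only: fail fast when a class slips through unregistered, so you
  // notice the fallback to generic (slower, fatter) serialization
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[LogLine]))
```

Registered classes serialize as a small numeric id instead of a full class name, which compounds with the compact-format advice (Avro / Parquet) to shrink shuffle and cache sizes.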
  • 36. 36© Cloudera, Inc. All rights reserved. More Resources • Very active and friendly community • http://spark.apache.org/community.html • Dean Wampler’s self-paced spark workshop • https://github.com/deanwampler/spark-workshop • Tips for Better Spark Jobs • http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing- better-spark-programs • Tuning & Debugging Spark (with another explanation of internals) • http://www.slideshare.net/pwendell/tuning-and-debugging-in-apache-spark • Tuning Spark On Yarn • http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
  • 37. 37© Cloudera, Inc. All rights reserved. Thank you
  • 38. 38© Cloudera, Inc. All rights reserved. Cleaning Up Resources (Try 1)
  • 39. 39© Cloudera, Inc. All rights reserved. Cleaning Up Resources (Try 2)
  • 40. 40© Cloudera, Inc. All rights reserved. Cleaning Up Resources (Success)