Apache Spark 
Buenos Aires High Scalability 
Buenos Aires, Argentina, Dec 2014 
Fernando Rodriguez Olivera 
@frodriguez
Fernando Rodriguez Olivera 
Professor at Universidad Austral (Distributed Systems, Compiler 
Design, Operating Systems, …) 
Creator of mvnrepository.com 
Organizer at Buenos Aires High Scalability Group, Professor at 
nosqlessentials.com 
Twitter: @frodriguez
Apache Spark 
Apache Spark is a Fast and General Engine 
for Large-Scale data processing 
In-Memory computing primitives 
Support for Batch, Interactive, Iterative and 
Stream processing with a Unified API
Apache Spark 
Unified API for multiple kinds of processing: 
Batch (high throughput) 
Interactive (low latency) 
Stream (continuous processing) 
Iterative (results used immediately)
Daytona Gray Sort 100TB Benchmark 

                      Data Size   Time     Nodes   Cores
Hadoop MR (2013)      102.5 TB    72 min   2,100   50,400 (physical)
Apache Spark (2014)   100 TB      23 min   206     6,592 (virtualized)

3X faster using 10X fewer machines 
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Hadoop vs Spark for Iterative Processing 
Logistic regression in Hadoop and Spark 
source: https://spark.apache.org/
Hadoop MR Limits 
Job -> HDFS -> Job -> HDFS -> Job 
MapReduce was designed for Batch Processing: 
- Communication between jobs through the FS 
- Fault-Tolerance (between jobs) by Persistence to the FS 
- Memory not managed (relies on OS caches) 
Compensated with: Storm, Samza, Giraph, Impala, Presto, etc.
Apache Spark 
Apache Spark (Core) 
Spark SQL | Spark Streaming | MLlib | GraphX 
Powered by Scala and Akka 
APIs for Java, Scala, Python
Resilient Distributed Datasets (RDD) 
RDD of Strings: 
Hello World 
A New Line 
hello 
The End 
... 
- Immutable Collection of Objects 
- Partitioned and Distributed 
- Stored in Memory 
- Partitions Recomputed on Failure
RDD Transformations and Actions 
RDD of Strings (Hello World / A New Line / hello / The End / ...) 
  --> Compute Function (transformation), e.g. apply a function to count chars per line 
RDD of Ints (11 / 10 / 5 / 7 / ...), which depends on the RDD of Strings 
  --> Action 
Int N 

An RDD implementation defines: 
- Partitions 
- Compute Function 
- Dependencies 
- Preferred Compute Location (for each partition) 
- Partitioner
Spark API 

Scala 
val spark = new SparkContext() 
val lines = spark.textFile("hdfs://docs/") // RDD[String] 
val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String] 
val count = nonEmpty.count 

Java 8 
JavaSparkContext spark = new JavaSparkContext(); 
JavaRDD<String> lines = spark.textFile("hdfs://docs/"); 
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0); 
long count = nonEmpty.count(); 

Python 
spark = SparkContext() 
lines = spark.textFile("hdfs://docs/") 
nonEmpty = lines.filter(lambda line: len(line) > 0) 
count = nonEmpty.count()
RDD Operations 
Transformations: map(func), flatMap(func), filter(func), groupByKey(), reduceByKey(func), mapValues(func), ... 
Actions: take(N), count(), collect(), reduce(func), takeOrdered(N), top(N), ...
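The key distinction above is that transformations are lazy (they only describe a computation) while actions force evaluation. As a rough, pure-Python sketch of that idea (not Spark's actual implementation), transformations can be modeled as generator pipelines and actions as the calls that consume them:

```python
# Hypothetical sketch: transformations build a lazy pipeline,
# actions consume it. Illustrative only, not Spark's implementation.
class LazyPipeline:
    def __init__(self, data):
        self._iter = lambda: iter(data)

    @staticmethod
    def _from_iter(make_iter):
        pipe = LazyPipeline([])
        pipe._iter = make_iter
        return pipe

    # --- transformations: return a new pipeline, nothing runs yet ---
    def map(self, f):
        src = self._iter
        return LazyPipeline._from_iter(lambda: (f(x) for x in src()))

    def filter(self, p):
        src = self._iter
        return LazyPipeline._from_iter(lambda: (x for x in src() if p(x)))

    # --- actions: actually evaluate the pipeline ---
    def count(self):
        return sum(1 for _ in self._iter())

    def collect(self):
        return list(self._iter())

lines = LazyPipeline(["Hello World", "", "A New Line", "", "hello"])
non_empty = lines.filter(lambda l: len(l) > 0)  # nothing computed yet
print(non_empty.count())                        # action: evaluates now -> 3
```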
Text Processing Example 
Top Words by Frequency 
(Step by step)
Create RDD from External Data 
Apache Spark <-> Hadoop FileSystem, I/O Formats, Codecs <-> HDFS, S3, HBase, MongoDB, Cassandra, ElasticSearch, ... 
Spark can read/write from any data source supported by Hadoop. 
I/O via Hadoop is optional (e.g. the Cassandra connector bypasses Hadoop). 

// Step 1 - Create RDD from Hadoop Text File 
val docs = spark.textFile("/docs/")
Function map 
RDD[String] (Hello World / A New Line / hello / ... / The end) 
.map(line => line.toLowerCase)   =   .map(_.toLowerCase) 
RDD[String] (hello world / a new line / hello / ... / the end) 

// Step 2 - Convert lines to lower case 
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap 
RDD[String] (hello world / a new line / hello / ... / the end) 
.map(_.split("\\s+")) 
RDD[Array[String]] ([hello, world] / [a, new, line] / [hello] / ... / [the, end]) 
.flatten 
RDD[String] (hello / world / a / new / line / ...) 

.flatMap(line => line.split("\\s+")) combines both steps in one transformation 

// Step 3 - Split lines into words 
val words = lower.flatMap(line => line.split("\\s+")) 

Note: flatten() is not available in Spark, only flatMap
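The map/flatten/flatMap relationship above can be sketched in plain Python (illustrative only; list comprehensions stand in for the RDD operations):

```python
# Plain-Python sketch of the map vs. flatMap distinction shown above.
import re

lines = ["hello world", "a new line", "hello", "the end"]

# map(split) keeps one element per input line -> a list of lists
mapped = [re.split(r"\s+", line) for line in lines]

# flatMap = map + flatten -> one flat list of words
flat_mapped = [word for line in lines for word in re.split(r"\s+", line)]

print(flat_mapped)
# ['hello', 'world', 'a', 'new', 'line', 'hello', 'the', 'end']
```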
Key-Value Pairs (Pair RDD) 
RDD[String] (hello / world / a / new / line / hello / ...) 
.map(word => (word, 1))   =   .map(word => Tuple2(word, 1)) 
RDD[(String, Int)], i.e. RDD[Tuple2[String, Int]] ((hello, 1) / (world, 1) / (a, 1) / (new, 1) / (line, 1) / (hello, 1) / ...) 

// Step 4 - Map each word to a (word, 1) pair 
val counts = words.map(word => (word, 1))
Shuffling 
RDD[(String, Int)] ((hello, 1) / (world, 1) / (a, 1) / (new, 1) / (line, 1) / (hello, 1)) 
.groupByKey 
RDD[(String, Iterable[Int])] ((world, [1]) / (a, [1]) / (new, [1]) / (line, [1]) / (hello, [1, 1])) 
.mapValues(_.reduce((a, b) => a + b)) 
RDD[(String, Int)] ((world, 1) / (a, 1) / (new, 1) / (line, 1) / (hello, 2)) 

.reduceByKey((a, b) => a + b) performs both steps, combining values locally before the shuffle 

// Step 5 - Count all words 
val freq = counts.reduceByKey(_ + _)
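Both paths compute the same counts; the difference is that reduceByKey folds values into a running total per key instead of materializing full groups. A plain-dictionary sketch of the two strategies (illustrative only, not Spark's distributed implementation):

```python
# What groupByKey + mapValues(reduce) and reduceByKey both compute,
# sketched with plain dictionaries (not Spark's distributed code).
pairs = [("hello", 1), ("world", 1), ("a", 1),
         ("new", 1), ("line", 1), ("hello", 1)]

# groupByKey path: gather all values per key, then reduce each group
groups = {}
for key, value in pairs:
    groups.setdefault(key, []).append(value)
via_group = {key: sum(values) for key, values in groups.items()}

# reduceByKey path: keep only a running total per key (no full groups)
via_reduce = {}
for key, value in pairs:
    via_reduce[key] = via_reduce.get(key, 0) + value

assert via_group == via_reduce  # same result; less data shuffled
```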
Top N (Prepare data) 
RDD[(String, Int)] ((world, 1) / (a, 1) / (new, 1) / (line, 1) / (hello, 2)) 
.map(_.swap) 
RDD[(Int, String)] ((1, world) / (1, a) / (1, new) / (1, line) / (2, hello)) 

// Step 6 - Swap tuples (partial code) 
freq.map(_.swap)
Top N (First Attempt) 
RDD[(Int, String)] ((1, world) / (1, a) / (1, new) / (1, line) / (2, hello)) 
.sortByKey (sortByKey(false) for descending) 
RDD[(Int, String)] ((2, hello) / (1, world) / (1, a) / (1, new) / (1, line)) 
.take(N) 
Array[(Int, String)] ((2, hello) / (1, world))
Top N 
RDD[(Int, String)] ((1, world) / (1, a) / (1, new) / (1, line) / (2, hello)) 
.top(N) 
Each partition computes a local top N *, then the local results are reduced to the global top N 
Array[(Int, String)] ((2, hello) / (1, world)) 

* local top N implemented by bounded priority queues 

// Step 6 - Swap tuples and take top results (complete code) 
val top = freq.map(_.swap).top(N)
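The per-partition strategy above can be sketched with Python's heapq standing in for the bounded priority queues (illustrative only; Spark's actual code uses a Scala BoundedPriorityQueue):

```python
# Sketch of top-N: a bounded local top per partition, then a merge,
# mirroring the strategy described above (not Spark's actual code).
import heapq

partitions = [
    [(1, "world"), (1, "a")],
    [(2, "hello"), (1, "line"), (1, "new")],
]

N = 2
# each partition keeps only its local top N
local_tops = [heapq.nlargest(N, part) for part in partitions]
# the small local results are merged into the global top N
top = heapq.nlargest(N, [item for local in local_tops for item in local])
print(top)  # [(2, 'hello'), (1, 'world')]
```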
Top Words by Frequency (Full Code) 

val spark = new SparkContext() 

// RDD creation from external data source 
val docs = spark.textFile("hdfs://docs/") 

// Normalize case and split lines into words 
val lower = docs.map(line => line.toLowerCase) 
val words = lower.flatMap(line => line.split("\\s+")) 
val counts = words.map(word => (word, 1)) 

// Count all words (automatic local combining) 
val freq = counts.reduceByKey(_ + _) 

// Swap tuples and get top results 
val top = freq.map(_.swap).top(N) 

top.foreach(println)
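To see what each stage of the pipeline produces, the same steps can be mirrored in plain Python on a single machine (illustrative only; Counter stands in for the map + reduceByKey stages):

```python
# The word-count pipeline above, mirrored step by step in plain Python
# (single machine, no Spark) to show what each stage produces.
import re
from collections import Counter

docs = ["Hello World", "A New Line", "hello", "The End"]

lower = [line.lower() for line in docs]                        # map
words = [w for line in lower for w in re.split(r"\s+", line)]  # flatMap
freq = Counter(words)                                          # map + reduceByKey
# swap + top(N): sort (count, word) pairs descending, keep the first 2
top = sorted(((n, w) for w, n in freq.items()), reverse=True)[:2]
print(top)  # [(2, 'hello'), (1, 'world')]
```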
RDD Persistence (in-memory) 
.cache() (memory only) 
.persist() (memory only) 
.persist(storageLevel) (explicit storage level) 

StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, ... 

(persistence & caching are lazy)
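Why persistence matters: without it, every action re-runs the whole lineage. A hypothetical sketch that counts recomputations (the function and counter are illustrative stand-ins, not a Spark API):

```python
# Sketch of why persistence matters: without a cache every action
# re-runs the lineage; with one, the data is computed only once.
evaluations = {"count": 0}

def expensive_transform(data):
    evaluations["count"] += 1          # stands in for a costly recomputation
    return [x * 2 for x in data]

source = [1, 2, 3]

# No persistence: two "actions" trigger two full recomputations
len(expensive_transform(source))
sum(expensive_transform(source))
assert evaluations["count"] == 2

# With persistence: materialize once, reuse for later actions
cached = expensive_transform(source)   # like rdd.persist() + first action
len(cached); sum(cached)               # further actions reuse the result
assert evaluations["count"] == 3       # only one more evaluation
```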
RDD Lineage 

words = sc.textFile("hdfs://large/file/")  // HadoopRDD 
          .map(_.toLowerCase)              // MappedRDD 
          .flatMap(_.split(" "))           // FlatMappedRDD 

nums  = words.filter(_.matches("[0-9]+"))  // FilteredRDD 
alpha = words.filter(_.matches("[a-z]+"))  // FilteredRDD 

alpha.count()  // Action (runs the job on the cluster) 

The lineage is built on the driver by the transformations; an action triggers execution.
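Lineage can be pictured as a chain of (parent, transformation) records that lets any dataset be recomputed from the source. A toy model of that idea (illustrative only, not Spark's DAG machinery):

```python
# Toy model of lineage: each node records its parent and transformation,
# so any node can be recomputed from the source (illustrative only).
class Node:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent, self.transform, self.source = parent, transform, source

    def map(self, f):
        return Node(parent=self, transform=lambda data: [f(x) for x in data])

    def filter(self, p):
        return Node(parent=self, transform=lambda data: [x for x in data if p(x)])

    def compute(self):
        # replay the recorded chain from the source (recomputation on demand)
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.compute())

words = Node(source=["Hello", "42", "spark", "7"]).map(str.lower)
nums = words.filter(lambda w: w.isdigit())
alpha = words.filter(lambda w: w.isalpha())
print(alpha.compute())  # ['hello', 'spark'] -- rebuilt from the source
```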
SchemaRDD & SQL 
SchemaRDD 
Row 
... 
... 
Row 
... 
... 
Row 
Row 
... 
RDD of Row + Column Metadata 
Queries with SQL 
Support for Reflection, JSON, 
Parquet, …
SchemaRDD & SQL 
topWords 
Row 
... 
... 
Row 
... 
... 
Row 
Row 
... 
case class Word(text: String, n: Int) 

val wordsFreq = freq.map { 
  case (text, count) => Word(text, count) 
} // RDD[Word] 

wordsFreq.registerTempTable("wordsFreq") 

val topWords = sql("""select text, n 
                      from wordsFreq 
                      order by n desc 
                      limit 20""") // RDD[Row] 

topWords.collect().foreach(println)
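What the query computes can be shown with sqlite3 standing in for Spark SQL (an illustrative substitution; Spark runs the same SQL over an RDD of Rows on the cluster):

```python
# What the "top words" query computes, with sqlite3 standing in for
# Spark SQL (illustrative only; Spark runs this over an RDD of Rows).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table wordsFreq (text text, n int)")
conn.executemany("insert into wordsFreq values (?, ?)",
                 [("hello", 2), ("world", 1), ("line", 1)])

top_words = conn.execute("""select text, n
                            from wordsFreq
                            order by n desc
                            limit 20""").fetchall()
print(top_words)  # ('hello', 2) comes first
```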
Spark Streaming 

DStream: RDD RDD RDD RDD RDD RDD ... 

Data is Collected, Buffered and Replicated by a Receiver (one per DStream, e.g. Kafka, Kinesis, Flume, Sockets, Akka, etc), then Pushed into the stream as small RDDs. 

Configurable Batch Intervals, e.g. 1 second, 5 seconds, 5 minutes.
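The micro-batching idea can be sketched in plain Python: buffered records are cut into small batches that stand in for the RDDs of a DStream (illustrative; real Spark cuts batches by time interval, not record count):

```python
# Sketch of micro-batching: a receiver's buffered records are cut into
# small batches (stand-ins for RDDs). Illustrative: Spark batches by
# time interval, this toy version batches by record count.
def micro_batches(records, batch_size):
    """Yield consecutive slices of the buffered stream, one per interval."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

received = ["e1", "e2", "e3", "e4", "e5"]
batches = list(micro_batches(received, batch_size=2))
print(batches)  # [['e1', 'e2'], ['e3', 'e4'], ['e5']]
```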
DStream Transformations 
DStream 
RDD RDD RDD RDD RDD RDD 
DStream 
transform 
RDD RDD RDD RDD RDD RDD 
Receiver 
// Example 
val entries = stream.transform { rdd => rdd.map(Log.parse) } 
// Alternative 
val entries = stream.map(Log.parse)
Parallelism with Multiple Receivers 
DStream 1 
Receiver 1 RDD RDD RDD RDD RDD RDD 
DStream 2 
Receiver 2 RDD RDD RDD RDD RDD RDD 
union of (stream1, stream2, …) 
Union can be used to manage multiple DStreams as 
a single logical stream
Sliding Windows 

DStream: RDD RDD RDD RDD RDD RDD 
Windows W1, W2, W3, ... each cover the last 3 RDDs and advance by 1 

Window Length: 3, Sliding Interval: 1
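The windowing on the slide (length 3, slide 1) can be sketched over a list of batches (illustrative only; each window is the union of the batches it covers):

```python
# Sketch of sliding windows over a sequence of batches:
# window length 3, sliding interval 1, matching the slide's example.
def sliding_windows(batches, length, interval):
    for end in range(length, len(batches) + 1, interval):
        # each window is the union of the last `length` batches
        yield [record for batch in batches[end - length:end] for record in batch]

batches = [["a"], ["b"], ["c"], ["d"]]
windows = list(sliding_windows(batches, length=3, interval=1))
print(windows)  # [['a', 'b', 'c'], ['b', 'c', 'd']]
```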
Deployment with Hadoop 

- The Client submits the Application (mode=cluster) to the Spark Master. 
- The Spark Master allocates resources (cores and memory) and launches the Driver and Executors on Spark Workers. 
- Spark Workers run co-located with HDFS Data Nodes (DN + Spark), so Executors can read local blocks (A, B, C, D) of /large/file (replication factor 3). 
- The Name Node tracks where the blocks of /large/file are stored.
Fernando Rodriguez Olivera 
twitter: @frodriguez

Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxNiranjanYadav41
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectErbil Polytechnic University
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 

Último (20)

Industrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.pptIndustrial Safety Unit-IV workplace health and safety.ppt
Industrial Safety Unit-IV workplace health and safety.ppt
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Katarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School CourseKatarzyna Lipka-Sidor - BIM School Course
Katarzyna Lipka-Sidor - BIM School Course
 
Crushers to screens in aggregate production
Crushers to screens in aggregate productionCrushers to screens in aggregate production
Crushers to screens in aggregate production
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
home automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasadhome automation using Arduino by Aditya Prasad
home automation using Arduino by Aditya Prasad
 
Research Methodology for Engineering pdf
Research Methodology for Engineering pdfResearch Methodology for Engineering pdf
Research Methodology for Engineering pdf
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.ppt
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
BSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptxBSNL Internship Training presentation.pptx
BSNL Internship Training presentation.pptx
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction Project
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 

Apache Spark & Streaming

  • 1. Apache Spark Buenos Aires High Scalability Buenos Aires, Argentina, Dec 2014 Fernando Rodriguez Olivera @frodriguez
  • 2. Fernando Rodriguez Olivera Professor at Universidad Austral (Distributed Systems, Compiler Design, Operating Systems, …) Creator of mvnrepository.com Organizer at Buenos Aires High Scalability Group, Professor at nosqlessentials.com Twitter: @frodriguez
  • 3. Apache Spark Apache Spark is a Fast and General Engine for Large-Scale data processing In-Memory computing primitives Support for Batch, Interactive, Iterative and Stream processing with a Unified API
  • 4. Apache Spark Unified API for multiple kinds of processing Batch (high throughput) Interactive (low latency) Stream (continuous processing) Iterative (results used immediately)
  • 5. Daytona Gray Sort 100TB Benchmark Data Size Time Nodes Cores Hadoop MR (2013) 102.5 TB 72 min 2,100 50,400 physical Apache Spark (2014) 100 TB 23 min 206 6,592 virtualized source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  • 6. Daytona Gray Sort 100TB Benchmark Data Size Time Nodes Cores Hadoop MR (2013) 102.5 TB 72 min 2,100 50,400 physical Apache Spark (2014) 100 TB 23 min 206 6,592 virtualized 3X faster using 10X fewer machines source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  • 7. Hadoop vs Spark for Iterative Processing Logistic regression in Hadoop and Spark source: https://spark.apache.org/
  • 8. Hadoop MR Limits Job Job Job Hadoop HDFS MapReduce designed for Batch Processing: - Communication between jobs through FS - Fault-Tolerance (between jobs) by Persistence to FS - Memory not managed (relies on OS caches) Compensated with: Storm, Samza, Giraph, Impala, Presto, etc
  • 9. Apache Spark Apache Spark (Core) Spark SQL Spark Streaming ML lib GraphX Powered by Scala and Akka APIs for Java, Scala, Python
  • 10. Resilient Distributed Datasets (RDD) RDD of Strings Hello World ... ... A New Line ... ... hello The End ... Immutable Collection of Objects
  • 11. Resilient Distributed Datasets (RDD) RDD of Strings Hello World ... ... A New Line ... ... hello The End ... Immutable Collection of Objects Partitioned and Distributed
  • 12. Resilient Distributed Datasets (RDD) RDD of Strings Hello World ... ... A New Line ... ... hello The End ... Immutable Collection of Objects Partitioned and Distributed Stored in Memory
  • 13. Resilient Distributed Datasets (RDD) RDD of Strings Hello World ... ... A New Line ... ... hello The End ... Immutable Collection of Objects Partitioned and Distributed Stored in Memory Partitions Recomputed on Failure
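Slides 10-13 list the four defining RDD properties: an immutable collection, partitioned across nodes, held in memory, with lost partitions recomputed from their source. A toy model in plain Python can illustrate the idea; this is a hedged sketch of the concept only, not Spark's actual implementation (the `ToyRDD` class and its names are invented for illustration):

```python
# Toy model of an RDD: an immutable, partitioned collection whose
# partitions can always be recomputed from a source + compute function.
class ToyRDD:
    def __init__(self, partitions, compute):
        # compute maps one source element to one derived element
        self.source = [tuple(p) for p in partitions]  # immutable source data
        self.compute = compute
        self.cache = {}  # "in-memory" partitions, filled lazily

    def partition(self, i):
        if i not in self.cache:  # lost or never computed: recompute
            self.cache[i] = [self.compute(x) for x in self.source[i]]
        return self.cache[i]

    def collect(self):
        return [x for i in range(len(self.source)) for x in self.partition(i)]

# Two partitions of strings, mapped to their character counts
rdd = ToyRDD([["Hello World", "A New Line"], ["hello", "The End"]],
             compute=len)
print(rdd.collect())     # [11, 10, 5, 7]
del rdd.cache[0]         # simulate losing a partition on failure
print(rdd.partition(0))  # recomputed from source: [11, 10]
```

The values match the char-count example on the next slides: losing a partition is harmless because the compute function plus the source data are enough to rebuild it.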
  • 14. RDD Transformations and Actions RDD of Strings Hello World ... ... A New Line ... ... hello The End ...
  • 15. RDD Transformations and Actions RDD of Strings Hello World ... ... A New Line ... ... hello The End ... Compute Function (transformation) e.g: apply function to count chars
  • 16. RDD Transformations and Actions RDD of Strings Hello World ... ... A New Line ... ... hello The End ... RDD of Ints 11 ... ... 10 ... 5 ... 7 ... Compute Function (transformation) e.g: apply function to count chars
  • 17. RDD Transformations and Actions RDD of Strings Hello World ... ... A New Line ... ... hello The End ... RDD of Ints 11 ... ... 10 ... 5 ... 7 ... depends on Compute Function (transformation) e.g: apply function to count chars
  • 18. RDD Transformations and Actions RDD of Strings Hello World ... ... A New Line ... ... hello The End ... RDD of Ints 11 ... ... 10 ... 5 ... 7 ... depends on Compute Function (transformation) e.g: apply function to count chars Int N Action
  • 19. RDD Transformations and Actions RDD of Strings Hello World ... ... A New Line ... ... hello The End ... RDD of Ints 11 ... ... 10 ... 5 ... 7 ... Compute Function (transformation) e.g: apply function to count chars RDD Implementation Partitions Compute Function Dependencies Preferred Compute Location (for each partition) Partitioner depends on Int N Action
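Slides 14-19 separate transformations (which only describe a new RDD that depends on its parent) from actions (which actually run the computation). A minimal sketch of that lazy dependency chain in plain Python, using closures as stand-ins for RDDs (the helper names are invented for illustration):

```python
# Transformations build a description (a thunk); only the action runs it.
def text_rdd(lines):
    return lambda: iter(lines)                # base "RDD": yields the lines

def map_rdd(parent, f):                       # transformation: no work yet,
    return lambda: (f(x) for x in parent())   # just a dependency on parent

def count(rdd):                               # action: forces evaluation
    return sum(1 for _ in rdd())

lines = text_rdd(["Hello World", "", "The End"])
lengths = map_rdd(lines, len)   # nothing computed at this point
print(count(lengths))           # 3 -- computed only when the action runs
```

This mirrors the slide's structure: each derived RDD records its compute function and what it depends on; the action walks that chain.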
  • 20. Spark API val spark = new SparkContext() val lines = spark.textFile("hdfs://docs/") // RDD[String] val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String] val count = nonEmpty.count Scala SparkContext spark = new SparkContext(); JavaRDD<String> lines = spark.textFile("hdfs://docs/"); JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0); long count = nonEmpty.count(); Java 8 Python spark = SparkContext() lines = spark.textFile("hdfs://docs/") nonEmpty = lines.filter(lambda line: len(line) > 0) count = nonEmpty.count()
  • 21. RDD Operations Transformations Actions map(func) flatMap(func) filter(func) take(N) count() collect() groupByKey() reduceByKey(func) reduce(func) mapValues(func) takeOrdered(N) top(N) … …
  • 22. Text Processing Example Top Words by Frequency (Step by step)
  • 23. Create RDD from External Data Apache Spark Hadoop FileSystem, I/O Formats, Codecs HDFS S3 HBase MongoDB Cassandra … Spark can read/write from any data source supported by Hadoop I/O via Hadoop is optional (e.g: the Cassandra connector bypasses Hadoop) // Step 1 - Create RDD from Hadoop Text File val docs = spark.textFile("/docs/") ElasticSearch
  • 24. Function map RDD[String] RDD[String] Hello World A New Line hello ... The end .map(line => line.toLowerCase) hello world a new line hello ... the end = .map(_.toLowerCase) // Step 2 - Convert lines to lower case val lower = docs.map(line => line.toLowerCase)
  • 25. Functions map and flatMap RDD[String] hello world a new line hello ... the end
  • 26. Functions map and flatMap RDD[String] hello world a new line hello ... the end .map( … ) RDD[Array[String]] _.split("\\s+") hello a hello ... the world new line end
  • 27. Functions map and flatMap RDD[String] hello world a new line hello ... the end .map( … ) RDD[Array[String]] _.split("\\s+") hello a hello ... the world new line end .flatten RDD[String] hello world a new line ... *
  • 28. Functions map and flatMap hello world a new line hello ... the end RDD[Array[String]] hello .flatMap(line => line.split("\\s+")) RDD[String] .map( … ) _.split("\\s+") a hello ... the world new line end .flatten RDD[String] hello world a new line ... *
  • 29. Functions map and flatMap RDD[String] hello world a new line hello ... the end .map( … ) RDD[Array[String]] _.split("\\s+") hello a world new line hello ... the end .flatten .flatMap(line => line.split("\\s+")) RDD[String] world // Step 3 - Split lines into words val words = lower.flatMap(line => line.split("\\s+")) Note: flatten is not available in Spark, only flatMap hello a new line ... *
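Slides 25-29 show that flatMap is a map followed by a flatten. The same semantics can be reproduced locally in plain Python (itertools.chain plays the role of flatten; the regex split on whitespace mirrors the Scala `split("\\s+")`):

```python
import re
from itertools import chain

lines = ["hello world", "a new line", "hello", "the end"]

# map: each line becomes an array of words
arrays = [re.split(r"\s+", line) for line in lines]

# flatMap = map + flatten: one flat stream of words
words = list(chain.from_iterable(arrays))
print(words)  # ['hello', 'world', 'a', 'new', 'line', 'hello', 'the', 'end']
```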
  • 30. Key-Value Pairs RDD[Tuple2[String, Int]] RDD[String] RDD[(String, Int)] hello world a new line hello ... hello world a new line hello ... .map(word => Tuple2(word, 1)) 1 1 1 1 1 1 = .map(word => (word, 1)) // Step 4 - Map words to (word, 1) pairs val counts = words.map(word => (word, 1)) Pair RDD
  • 31. Shuffling RDD[(String, Int)] hello world a new line hello 1 1 1 1 1 1
  • 32. Shuffling hello world a new line hello 1 1 1 1 1 1 RDD[(String, Iterator[Int])] world a 1 1 new 1 line hello 1 1 .groupByKey 1 RDD[(String, Int)]
  • 33. Shuffling hello world a new line hello 1 1 1 1 1 1 RDD[(String, Iterator[Int])] world a 1 1 new 1 line hello 1 1 .groupByKey 1 RDD[(String, Int)] RDD[(String, Int)] world a 1 1 new 1 line hello 1 2 .mapValues _.reduce(…) (a,b) => a+b
  • 34. Shuffling hello world a new line hello 1 1 1 1 1 1 RDD[(String, Iterator[Int])] world a 1 1 new 1 line hello 1 1 .groupByKey 1 .reduceByKey((a, b) => a + b) RDD[(String, Int)] RDD[(String, Int)] world a 1 1 new 1 line hello 1 2 .mapValues _.reduce(…) (a,b) => a+b
  • 35. Shuffling RDD[(String, Int)] hello world a new line hello 1 1 1 1 1 1 RDD[(String, Iterator[Int])] world a 1 1 new 1 line hello 1 1 .groupByKey 1 RDD[(String, Int)] .reduceByKey((a, b) => a + b) // Step 5 - Count all words val freq = counts.reduceByKey(_ + _) world a 1 1 new 1 line hello 1 2 .mapValues _.reduce(…) (a,b) => a+b
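Slides 31-35 contrast groupByKey (shuffle every value, then reduce each group) with reduceByKey (fold values into a running total, combining map-side before the shuffle). Their shared result can be checked locally with dicts; this sketches only the semantics, not Spark's partitioned shuffle:

```python
from collections import defaultdict

pairs = [("hello", 1), ("world", 1), ("a", 1),
         ("new", 1), ("line", 1), ("hello", 1)]

# groupByKey path: collect all values per key, then reduce each group
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)
grouped_counts = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey path: combine values as they arrive (in Spark this
# combining also happens on the map side, shrinking the shuffle)
counts = defaultdict(int)
for k, v in pairs:
    counts[k] += v

assert grouped_counts == dict(counts)
print(dict(counts))  # {'hello': 2, 'world': 1, 'a': 1, 'new': 1, 'line': 1}
```

Both paths produce the same counts; reduceByKey is preferred because far less data crosses the network.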
  • 36. Top N (Prepare data) RDD[(String, Int)] RDD[(Int, String)] world a 1 1 new 1 line hello 1 2 .map(_.swap) 1 1 1 new world a line hello 1 2 // Step 6 - Swap tuples (partial code) freq.map(_.swap)
  • 37. Top N (First Attempt) RDD[(Int, String)] 1 1 1 new world a line hello 1 2
  • 38. Top N (First Attempt) RDD[(Int, String)] 1 1 1 new world a line hello 1 2 .sortByKey RDD[(Int, String)] 2 1 1 a hello world new line 1 1 (sortByKey(false) for descending)
  • 39. Top N (First Attempt) RDD[(Int, String)] Array[(Int, String)] 1 1 1 new world a line hello 1 2 hello world 2 1 RDD[(Int, String)] 2 1 1 a hello world .sortByKey .take(N) new line 1 1 (sortByKey(false) for descending)
  • 40. Top N Array[(Int, String)] RDD[(Int, String)] 1 1 1 new world a line hello 1 2 world a 1 1 .top(N) hello line 2 1 hello line 2 1 local top N * local top N * reduction * local top N implemented by bounded priority queues // Step 6 - Swap tuples (complete code) val top = freq.map(_.swap).top(N)
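Slide 40 notes that top(N) computes a local top-N per partition with a bounded priority queue and then merges those small lists, avoiding the full sort of the first attempt. heapq gives the same behavior in Python (here a single "partition", so both steps collapse into one call):

```python
import heapq

freq = {"world": 1, "a": 1, "new": 1, "line": 1, "hello": 2}

# Bounded-heap top-N over (count, word) pairs, like freq.map(_.swap).top(2)
top2 = heapq.nlargest(2, ((n, w) for w, n in freq.items()))
print(top2)  # [(2, 'hello'), (1, 'world')]
```

With many partitions, each would produce its own nlargest list and a final nlargest over the concatenation would give the global result, which is exactly the reduction drawn on the slide.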
  • 41. Top Words by Frequency (Full Code) val spark = new SparkContext() // RDD creation from external data source val docs = spark.textFile("hdfs://docs/") // Split lines into words val lower = docs.map(line => line.toLowerCase) val words = lower.flatMap(line => line.split("\\s+")) val counts = words.map(word => (word, 1)) // Count all words (automatic combination) val freq = counts.reduceByKey(_ + _) // Swap tuples and get top results val top = freq.map(_.swap).top(N) top.foreach(println)
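The full Scala pipeline on slide 41 can be mirrored step by step in plain Python to check the logic without a cluster (an in-memory list stands in for the HDFS input; Counter stands in for map + reduceByKey):

```python
import re
from collections import Counter

docs = ["Hello World", "a new line", "hello", "The End"]

lower = [line.lower() for line in docs]                        # map
words = [w for line in lower for w in re.split(r"\s+", line)]  # flatMap
freq = Counter(words)                               # map + reduceByKey
top = freq.most_common(3)                           # map(_.swap).top(N)
print(top)  # [('hello', 2), ...] -- 'hello' appears twice after lowercasing
```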
  • 42. RDD Persistence (in-memory) RDD … ... ... … ... … ... … ... .cache() .persist() .persist(storageLevel) StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, … (memory only) (memory only) (lazy persistence & caching)
  • 43. RDD Lineage RDD Transformations words = sc.textFile("hdfs://large/file/") HadoopRDD .map(_.toLowerCase) .flatMap(_.split(" ")) FlatMappedRDD nums = words.filter(_.matches("[0-9]+")) alpha.count() MappedRDD alpha = words.filter(_.matches("[a-z]+")) FilteredRDD FilteredRDD Lineage (built on the driver by the transformations) Action (run job on the cluster)
  • 44. SchemaRDD & SQL SchemaRDD Row ... ... Row ... ... Row Row ... RRD of Row + Column Metadata Queries with SQL Support for Reflection, JSON, Parquet, …
  • 45. SchemaRDD & SQL topWords Row ... ... Row ... ... Row Row ... case class Word(text: String, n: Int) val wordsFreq = freq.map { case (text, count) => Word(text, count) } // RDD[Word] wordsFreq.registerTempTable("wordsFreq") val topWords = sql("select text, n from wordsFreq order by n desc limit 20") // RDD[Row] topWords.collect().foreach(println)
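The SQL query on slide 45 can be prototyped against sqlite3 to validate the query shape itself; sqlite is only a stand-in here for Spark SQL, with the table and column names taken from the slide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordsFreq (text TEXT, n INTEGER)")
conn.executemany("INSERT INTO wordsFreq VALUES (?, ?)",
                 [("hello", 2), ("world", 1), ("line", 1)])

# Same shape as the Spark SQL query on the slide
rows = conn.execute(
    "SELECT text, n FROM wordsFreq ORDER BY n DESC LIMIT 20").fetchall()
print(rows[0])  # ('hello', 2)
```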
  • 46. Spark Streaming DStream RDD RDD RDD RDD RDD RDD Data Collected, Buffered and Replicated by a Receiver (one per DStream) then Pushed to a stream as small RDDs Configurable Batch Intervals. e.g: 1 second, 5 seconds, 5 minutes Receiver e.g: Kafka, Kinesis, Flume, Sockets, Akka etc
  • 47. DStream Transformations DStream RDD RDD RDD RDD RDD RDD DStream transform RDD RDD RDD RDD RDD RDD Receiver // Example val entries = stream.transform { rdd => rdd.map(Log.parse) } // Alternative val entries = stream.map(Log.parse)
  • 48. Parallelism with Multiple Receivers DStream 1 Receiver 1 RDD RDD RDD RDD RDD RDD DStream 2 Receiver 2 RDD RDD RDD RDD RDD RDD union of (stream1, stream2, …) Union can be used to manage multiple DStreams as a single logical stream
  • 49. Sliding Windows DStream RDD RDD RDD RDD RDD RDD DStream … … … W3 W2 W1 Window Length: 3, Sliding Interval: 1 Receiver
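Slide 49's sliding window (window length 3, sliding interval 1) can be sketched with a deque over micro-batches; each emitted window is the union of the last three batch "RDDs" (batch values here are invented for illustration):

```python
from collections import deque

batches = [[1], [2, 2], [3], [4], [5]]  # one small "RDD" per batch interval
window = deque(maxlen=3)                # window length = 3 batches

windows = []
for rdd in batches:                     # sliding interval = 1 batch
    window.append(rdd)                  # oldest batch falls out automatically
    windows.append([x for b in window for x in b])  # union of the window

print(windows[-1])  # [3, 4, 5] -- the last three batches merged
```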
  • 50. Deployment with Hadoop A B C D /large/file allocates resources (cores and memory) Spark Worker Data Node 1 Application Spark Worker Data Node 3 Spark Worker Data Node 4 Spark Worker Data Node 2 A C B C A B A B Spark Master Name Node RF 3 D D D C Client Submit App (mode=cluster) Driver Executors Executors Executors DN + Spark HDFS Spark
  • 51. Fernando Rodriguez Olivera twitter: @frodriguez