This document provides an overview of Apache Spark, a framework for large-scale data processing. It discusses what big data is, the history and advantages of Spark, and Spark's execution model. Key concepts explained include Resilient Distributed Datasets (RDDs), transformations, actions, and MapReduce algorithms like word count. Examples are provided to illustrate Spark's use of RDDs and how it can improve on Hadoop MapReduce.
2. 2
05.12.2015
Apache Spark
• What is Big Data?
• Internet of Things
• What is Apache Spark?
• History of Apache Spark
• Why Spark?
• Spark Execution Flow
• Spark Context
• Resilient Distributed Dataset (RDD)
• RDD Examples
• MapReduce Algorithm
• MapReduce Example: Word Count
• Let’s try some examples
3. 3
05.12.2015
What is Big Data?
Apache Spark
• “Big data” is similar to “small
data”, but bigger in size – is not
incorrect
• But having data bigger it
requires different approaches
• Big Data is a set of technologies and methods for handling
large volumes of data at rapid speeds and of various
formats.
4. 4
05.12.2015
Big Data and the Internet of Things
Apache Spark
• Connected Intelligence
• Every day, we create 2.5 quintillion bytes of data — so much that
90% of the data in the world today has been created in the last two
years alone. IBM, “Bringing big data to the enterprise”
6. 6
05.12.2015
Big Data – Trends and Opportunities
Apache Spark
”Welcome to the Internet of Customers. Behind every app, every device,
and every connection, is a customer. Billions of them. And each and every
one is speeding toward the future.” Salesforce.com
7. 7
05.12.2015
What is Apache Spark?
Apache Spark
• Emerging big data framework
• Open source framework for fast distributed in-memory
data processing and data analytics
• Extension/alternative to the MapReduce model
• Currently an Apache high-priority “top-level” project
8. 8
05.12.2015
What is Apache Spark?
Apache Spark
• Written in Scala
Functional programming language that runs in a JVM
• Key Concepts
Avoid the data bottleneck by distributing data when it is stored
Bring the processing to the data
Data is stored in memory
9. 9
05.12.2015
History of Apache Spark
Apache Spark
• Started in UC Berkeley AMPLab as a research
project by Matei Zaharia, 2009
AMP = Algorithms Machines People
AMPLab is integrating Algorithms, Machines, and People to
make sense of Big Data
• Spark become open source, March 2010
• Spark donated to Apache Software Foundation,
June 2013
• Spark becomes a top-level Apache project,
February 2014
10. 10
05.12.2015
Why Spark?
Apache Spark
Speed
• Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Last year, Spark took over Hadoop by completing the 100 TB
Daytona GraySort contest 3x faster on one tenth the number of
machines and it also became the fastest open source engine for
sorting a petabyte.
11. 11
05.12.2015
Why Spark?
Apache Spark
Ease of Use
• Write applications quickly in Java, Scala, Python, R.
• Spark offers over 80 high-level operators that make it
easy to build parallel apps. And you can use it
interactively from the Scala, Python and R shells.
12. 12
05.12.2015
Why Spark?
Apache Spark
Generality
• Combine SQL, streaming, and complex analytics
• Spark powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming. You can combine these libraries seamlessly in the
same application
13. 13
05.12.2015
Why Spark?
Apache Spark
Runs Everywhere
• Spark runs on Hadoop, Mesos, standalone, or in the cloud. It
can access diverse data sources including HDFS, Cassandra,
HBase, and S3.
• You can run Spark using its standalone cluster mode, on EC2,
on Hadoop YARN, or on Apache Mesos. Access data in
HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop
data source.
14. 14
05.12.2015
Execution Flow
Apache Spark
• Cluster Manager An external service to manage resources on the cluster
(standalone manager, YARN, Apache Mesos)
• Worker Node : Node that run the application program in cluster
• Executor
1. Process launched on a worker node, that runs the Tasks
2. Keep data in memory or disk storage
• Task : A unit of work that will be sent to executor
• Job
1. Consists multiple tasks
2. Created based on a Action
• Stage : Each Job is divided into smaller set of tasks called Stages that is sequential
and depend on each other
• SparkContext : represents the connection to a Spark cluster, and can be used to
create RDDs, accumulators and broadcast variables on that cluster.
• Driver Program
The process to start the execution (main() function)
15. 15
05.12.2015
Spark Context
Apache Spark
• Every Spark application requires a Spark Context
The main entry point to the Spark API
• Spark Shell Provides a preconfigured Spark Context call sc
16. 16
05.12.2015
Resilient Distributed Dataset (RDD)
Apache Spark
• RDD is a basic Abstraction in Spark
• Distributed collection of objects
• RDD(Resilient Distributed Dataset)
1. Resilient – if data in memory is lost, it can be recreated
2. Distributed – stored in memory across the cluster
3. Dataset – initial data can come from a file or created
programmatically
19. 19
05.12.2015
RDD Operations
Apache Spark
• Two types of RDD operations
Actions – return values
count
take(n)
Transformations – define new RDDs based
on the current one
filter
map
reduce
21. 21
05.12.2015
RDDs
Apache Spark
• RDDs can hold any type of element
Primitive types: integers, chars, Boolean, string, etc
Sequence type: lists, arrays, tuples, dicts, etc.
Scala/Java Object (if serializable)
• Some types of RDDs have additional
functionality
Double RDDs – RDDs consisting of numeric data
Pair RDDs – RDDs consisting of Key-Value pairs
22. 22
05.12.2015
Pair RDDs
Apache Spark
• Pair RDDs are a special form of RDD
Each element must be a key-value pair
Keys and values can be any type
• Why?
Use with Map-Reduce algorithms
Many additional functions are available for common data
processing needs – e.g. sorting, joining, grouping,
counting, etc
23. 23
05.12.2015
MapReduce
Apache Spark
• MapReduce is a common programming model
1. Two phases
Map – process each element in a data set
Reduce – aggregate or consolidate the data
2. Easily applicable to distributed processing of large data sets
• Hadoop MapReduce is the major implementation
1. Limited
Each job has one Map phase, one Reduce phase in each
Job output saved to files
• Spark implements MapReduce with much greater flexibility
1. Map and Reduce functions can be interspersed
2. Results stored in memory
Operations can be
chained easily
Spark execution flow
Hadoop execution flow