When learning Apache Spark, where should a person begin? What are the key fundamentals when learning Apache Spark? Resilient Distributed Datasets, Spark Drivers and Context, Transformations, Actions.
3. Assumptions
• You want to learn Apache Spark, but need to know where
to begin
• You need to know the fundamentals of Spark in order to
progress in your learning of Spark
• You need to evaluate if Spark could be an appropriate fit
for your use cases or career growth
One or more of the following
Friday, January 22, 16
4. In a nutshell, why spark?
• Engine for efficient large-scale processing. It’s faster than
Hadoop MapReduce
• Spark can complement your existing Hadoop investments
such as HDFS and Hive
• Rich ecosystem including support for SQL, Machine
Learning, Steaming and multiple language APIs such as
Scala, Python and Java
Friday, January 22, 16
6. Spark Essentials
• Resilient Distributed Datasets (RDD)
• Transformers
• Actions
• Spark Driver Programs and SparkContext
To begin, you need to know:
Friday, January 22, 16
7. Resilient Distributed Datasets (RDDs)
• RDDs are Spark’s primary abstraction for data
interaction (lazy, in memory)
• RDDs are an immutable, distributed collection of
elements separated into partitions
• There are multiple types of RDDs
• RDDs can be created from an external data sets such as
Hadoop InputFormats, text files on a variety of file
systems or existing RDDs via a Spark Transformations
Friday, January 22, 16
8. Transformations
• RDD functions which return pointers to new RDDs
(remember: lazy)
• map, flatMap, filter, etc.
Friday, January 22, 16
9. Actions
• RDD functions which return values to the driver
• reduce, collect, count, etc.
Friday, January 22, 16
11. Spark Driver Programs and Context
• Spark driver is a program that declares transformations and
actions on RDDs of data
• A driver submits the serialized RDD graph to the master
where the master creates tasks. These tasks are delegated to
the workers for execution.
• Workers are where the tasks are actually executed.
Friday, January 22, 16
12. Driver Program and SparkContext
Image borrowed from http://spark.apache.org/docs/latest/cluster-overview.html
Friday, January 22, 16
13. References
• For course information and discount coupons, visit http://
www.supergloo.com/
• Learning Spark Book Summary http://www.amazon.com/
Learning-Spark-Summary-Lightning-Fast-Deconstructed-
ebook/dp/B019HS7USA/
Friday, January 22, 16