This document provides an outline and overview of a Spark and Flink tutorial presented by Alexander Panchenko, Gerold Hintz, and Steffen Remus on 11.07.2016. The tutorial covers Scala basics, Spark fundamentals including RDDs and transformations/actions, running Spark applications, and key differences between Spark and Flink. It aims to help users get started with Apache Spark and Flink for distributed data processing and machine learning.
Getting started in Apache Spark and Flink (with Scala) - Part II
Alexander Panchenko, Gerold Hintz, Steffen Remus
Outline
Scala
basics of the Scala programming language
Spark
motivation / what do you get on top of MapReduce
basics of Spark: RDDs, transformations, actions, shuffling
“tricks” useful in Spark context
Spark Hands-on session
run a Spark notebook and solve simple tasks
set up a Spark project and submit a job to the cluster
Flink
theory
differences from Spark
Three main benefits of using Spark
1. Spark is easy to use—you can develop applications on your laptop,
using a high-level API
2. Spark is fast, enabling interactive use and complex algorithms
3. Spark is a general engine, letting you combine multiple types of
computations (e.g., SQL queries, text processing, and machine
learning) that might previously have required different engines.
This tutorial is based on the book by the creators of Spark:
Karau H., Konwinski A., Wendell P., Zaharia M. “Learning
Spark: Lightning-Fast Big Data Analysis.” O’Reilly, 2015
Data Science Tasks
Experimentation: development of the model
Python, MATLAB, R
IPython notebooks
Interactive computing
Easy to use
Production: using the model
Java, Scala, C++/C
Unit tests
Fault tolerance
No interactive computing
Scalability
Scala + Spark can be used for both!
A Brief History of Spark
Spark is an open source project
Spark started in 2009 as a research project in the UC Berkeley RAD Lab
Research papers about Spark were published at academic conferences
soon after its creation
In 2011, the AMPLab started to develop higher-level components on
Spark, such as Shark (Hive on Spark) and Spark Streaming
Currently it is one of the most active open source projects written in Scala
What Is Apache Spark?
Spark Core: resilient distributed dataset (RDD)
Spark SQL: Hive tables, Parquet, JSON, Datasets
Spark Runtime Architecture
The components of a distributed Spark application
The master/slave architecture with one central coordinator and many
distributed workers
The central coordinator is called the driver
The driver communicates with distributed workers called executors
The driver is the process where the main() method of your program runs
The driver:
converts a user program into tasks
schedules tasks on executors
Downloading Spark and Getting Started
Download a version “Pre-built for Hadoop 2.X and later”:
http://spark.apache.org/downloads.html
Important files and directories that come with Spark:
README.md
Contains short instructions for getting started with Spark.
bin
Contains executable files that can be used to interact with Spark in various
ways (e.g., the Spark shell, which we cover later in this tutorial).
core, streaming, python, …
Contains the source code of major components of the Spark project.
examples
Contains some helpful Spark standalone jobs that you can look at and run to
learn about the Spark API.
Introduction to Spark’s Scala Shell
Run: bin/spark-shell
Type the Scala line-count example into the shell:
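A minimal sketch of that session (in spark-shell, sc is the SparkContext the shell creates for you; README.md stands for any text file):

  val lines = sc.textFile("README.md") // create an RDD from a text file
  lines.count()                        // count the number of lines
  lines.first()                        // retrieve the first line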
We can run parallel operations on the RDD, such as counting the lines of
text in the file or printing the first one
Filtering: lambda functions
The same filter can be written as a Scala lambda, a Java 7 anonymous inner class, or a Java 8 lambda; the original slides showed all three side by side.
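A minimal sketch of the Scala version, assuming a README.md file as input:

  val lines = sc.textFile("README.md")
  val pythonLines = lines.filter(line => line.contains("Python")) // keep matching lines only
  pythonLines.first()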
Standalone Spark Applications
Link to Spark (Maven or SBT), e.g.:
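With SBT this is a single dependency line; the version shown here is an assumption, use whatever release matches your cluster:

  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"

The "provided" scope is used because spark-submit supplies Spark itself at runtime.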
Write a sample class, e.g. word count:
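A minimal sketch of such a class (the object name and argument handling are assumptions):

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("wordCount")
      val sc = new SparkContext(conf)
      val input = sc.textFile(args(0))        // input file
      val counts = input
        .flatMap(line => line.split(" "))     // split lines into words
        .map(word => (word, 1))               // pair each word with a count of 1
        .reduceByKey((x, y) => x + y)         // sum the counts per word
      counts.saveAsTextFile(args(1))          // write the result
      sc.stop()
    }
  }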
SBT build file
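A minimal build.sbt sketch (project name and versions are assumptions):

  name := "wordcount-example"
  version := "0.0.1"
  scalaVersion := "2.10.6"
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"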
Build JAR and run it:
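A sketch of the build-and-submit steps; the JAR path depends on your project name and Scala version:

  sbt clean package
  # submit to a local master; swap --master for your cluster URL
  spark-submit --class WordCount --master local target/scala-2.10/wordcount-example_2.10-0.0.1.jar input.txt output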
Programming with RDDs
RDD: Resilient Distributed Dataset
Immutable distributed collection of objects
Each RDD is split into multiple partitions
Partitions may be computed on different nodes
Creating an RDD
Loading an external dataset
Distributing a collection of objects
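Both ways in a short sketch:

  val lines = sc.textFile("README.md")                        // load an external dataset
  val words = sc.parallelize(List("pandas", "i like pandas")) // distribute a local collection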
Once created, RDDs offer two types of operations:
Transformations
Actions
Transformations construct a new RDD from a previous one
Actions compute a result based on an RDD and
either return it to the driver program
or save it to an external storage system, e.g. HDFS
RDDs are recomputed each time you run an action
To reuse an RDD you need to persist it in memory:
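A short sketch; persist() with no arguments caches in memory:

  val input  = sc.parallelize(List(1, 2, 3, 4))
  val result = input.map(x => x * x)
  result.persist()                        // keep the computed RDD around
  println(result.count())                 // first action computes and caches
  println(result.collect().mkString(",")) // second action reuses the cache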
Spark Execution Steps (Shell & Standalone)
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Persist any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel
computation, which is then optimized and executed by Spark.
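The four steps put together as a sketch (log.txt is a hypothetical input file):

  val inputRDD  = sc.textFile("log.txt")                          // 1. create
  val errorsRDD = inputRDD.filter(line => line.contains("ERROR")) // 2. transform
  errorsRDD.persist()                                             // 3. persist for reuse
  println(errorsRDD.count())                                      // 4. actions trigger computation
  println(errorsRDD.first())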
RDD Operations: Transformations
filter() operation does not mutate the existing inputRDD
It returns a pointer to an entirely new RDD
inputRDD can still be reused later in the program, e.g.:
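A sketch following the log-file example above:

  val errorsRDD   = inputRDD.filter(line => line.contains("error"))
  val warningsRDD = inputRDD.filter(line => line.contains("warning"))
  val badLinesRDD = errorsRDD.union(warningsRDD) // inputRDD itself is unchanged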
RDD Operations: Actions
Actions return a result and launch the actual computation:
take() to retrieve a small number of elements
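For example, fetching a few elements to the driver and printing them locally:

  badLinesRDD.take(10).foreach(println) // retrieve 10 elements, print on the driver
  badLinesRDD.count()                   // total number of elements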
Common Transformations and Actions
Element-wise transformations
Mapped and filtered RDD from an input RDD:
Squaring the values in an RDD:
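A minimal sketch:

  val input  = sc.parallelize(List(1, 2, 3, 4))
  val result = input.map(x => x * x)
  println(result.collect().mkString(",")) // 1,4,9,16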
Splitting lines into multiple words:
Difference between flatMap() and map() on an RDD:
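A sketch showing both on the same data:

  val lines = sc.parallelize(List("hello world", "hi"))
  lines.map(line => line.split(" ")).first()     // Array(hello, world): map keeps one array per line
  lines.flatMap(line => line.split(" ")).first() // "hello": flatMap flattens the arrays into words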
Some simple set operations:
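A sketch of the set-like operations on two small RDDs:

  val rdd1 = sc.parallelize(List("coffee", "coffee", "panda"))
  val rdd2 = sc.parallelize(List("coffee", "monkey"))
  rdd1.distinct().collect()         // coffee, panda
  rdd1.union(rdd2).collect()        // keeps duplicates
  rdd1.intersection(rdd2).collect() // coffee
  rdd1.subtract(rdd2).collect()     // panda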
Basic RDD transformations on an RDD containing {1, 2, 3, 3}:
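The original slide showed these as a table; a sketch of typical entries:

  val rdd = sc.parallelize(List(1, 2, 3, 3))
  rdd.map(x => x + 1).collect()       // 2, 3, 4, 4
  rdd.flatMap(x => x.to(3)).collect() // 1, 2, 3, 2, 3, 3, 3
  rdd.filter(x => x != 1).collect()   // 2, 3, 3
  rdd.distinct().collect()            // 1, 2, 3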
Two-RDD transformations on RDDs containing {1, 2, 3} and {3, 4, 5}:
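A sketch of these operations:

  val rdd   = sc.parallelize(List(1, 2, 3))
  val other = sc.parallelize(List(3, 4, 5))
  rdd.union(other).collect()        // 1, 2, 3, 3, 4, 5
  rdd.intersection(other).collect() // 3
  rdd.subtract(other).collect()     // 1, 2
  rdd.cartesian(other).collect()    // (1,3), (1,4), ..., (3,5)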
Basic actions on an RDD containing {1, 2, 3, 3}:
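The original slides listed these as a table spanning two slides; a sketch:

  val rdd = sc.parallelize(List(1, 2, 3, 3))
  rdd.collect()                // Array(1, 2, 3, 3)
  rdd.count()                  // 4
  rdd.countByValue()           // Map(1 -> 1, 2 -> 1, 3 -> 2)
  rdd.take(2)                  // Array(1, 2)
  rdd.top(2)                   // Array(3, 3)
  rdd.reduce((x, y) => x + y)  // 9
  rdd.fold(0)((x, y) => x + y) // 9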
Persistence (Caching)
Without persistence an RDD is recomputed for every action; persisting it lets later actions reuse the result.
Persistence levels:
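A sketch with an explicit storage level; MEMORY_ONLY is the default for persist(), and the other StorageLevel constants trade memory for disk and serialization:

  import org.apache.spark.storage.StorageLevel

  val input  = sc.parallelize(1 to 1000000)
  val result = input.map(x => x * x)
  result.persist(StorageLevel.MEMORY_AND_DISK) // others: MEMORY_ONLY, MEMORY_ONLY_SER, DISK_ONLY, ...
  result.count() // first action materializes and caches
  result.count() // second action reads from the cache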
Working with Key/Value Pairs
Pair RDDs are a useful building block in many programs
Allow you to act on each key in parallel or regroup data
For instance:
reduceByKey() method that can aggregate data for each key
join() method that can merge two RDDs by grouping elements with the same
key
Creating a pair RDD in Scala just means creating an RDD of tuples:
Creating a pair RDD using the first word as the key
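A one-line sketch, assuming lines is an RDD of strings (e.g., from sc.textFile):

  val pairs = lines.map(line => (line.split(" ")(0), line)) // key = first word, value = whole line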
Transformations on Pair RDDs
Transformations on one pair RDD (example: {(1, 2), (3, 4), (3, 6)})
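The original slides listed these as a table; a sketch of the most common ones:

  val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
  rdd.reduceByKey((x, y) => x + y).collect() // (1,2), (3,10)
  rdd.groupByKey().collect()                 // (1,[2]), (3,[4, 6])
  rdd.mapValues(x => x + 1).collect()        // (1,3), (3,5), (3,7)
  rdd.keys.collect()                         // 1, 3, 3
  rdd.sortByKey().collect()                  // sorted by key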
Transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)}):
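A sketch of these operations:

  val rdd   = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
  val other = sc.parallelize(List((3, 9)))
  rdd.subtractByKey(other).collect() // (1,2)
  rdd.join(other).collect()          // (3,(4,9)), (3,(6,9))
  rdd.leftOuterJoin(other).collect() // (1,(2,None)), (3,(4,Some(9))), (3,(6,Some(9)))
  rdd.cogroup(other).collect()       // (1,([2],[])), (3,([4, 6],[9]))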
Using partial function syntax for pair RDDs in Scala
Simple filter on second element:
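A one-line sketch using a case pattern instead of tuple accessors, assuming pairs is a pair RDD of (String, String):

  pairs.filter { case (key, value) => value.length < 20 } // keep pairs with short values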
Word and document counts:
Per-key average with reduceByKey() and mapValues():
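A sketch of the per-key average (the toy data is an assumption):

  val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
  val avg = rdd.mapValues(x => (x, 1))                       // (key, (value, 1))
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))       // per key: (sum, count)
    .mapValues { case (sum, count) => sum.toDouble / count } // (key, average)
  avg.collect() // (panda,0.5), (pink,3.5), (pirate,3.0)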
Word count example revisited:
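A sketch using flatMap plus reduceByKey:

  val input  = sc.textFile("README.md")
  val words  = input.flatMap(line => line.split(" "))
  val counts = words.map(word => (word, 1)).reduceByKey((x, y) => x + y)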
Example of a join (inner join is the default):
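A sketch with hypothetical store data, joining two pair RDDs on the store name:

  val storeAddress = sc.parallelize(List(("Ritual", "1026 Valencia St"), ("Philz", "748 Van Ness Ave")))
  val storeRating  = sc.parallelize(List(("Ritual", 4.9), ("Philz", 4.8)))
  storeAddress.join(storeRating).collect() // (Ritual,(1026 Valencia St,4.9)), (Philz,(748 Van Ness Ave,4.8))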
Actions Available on Pair RDDs
Actions on pair RDDs (example: {(1, 2), (3, 4), (3, 6)}):
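A sketch of these actions:

  val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
  rdd.countByKey()   // Map(1 -> 1, 3 -> 2)
  rdd.collectAsMap() // Map(1 -> 2, 3 -> 6): duplicate keys keep only one value
  rdd.lookup(3)      // Seq(4, 6)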
Example: PageRank
links: (pageID, linkList), the list of neighbors of each page
ranks: (pageID, rank), the current rank of each page
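A sketch of the iterative algorithm, following the version in Learning Spark; the toy link data, the partition count, and the 10 iterations are assumptions:

  import org.apache.spark.HashPartitioner

  val links = sc.parallelize(List(
      ("a", Seq("b", "c")), ("b", Seq("a")), ("c", Seq("a", "b"))))
    .partitionBy(new HashPartitioner(8)) // co-locate links and ranks by key
    .persist()                           // links are reused in every iteration
  var ranks = links.mapValues(_ => 1.0)  // initial rank 1.0 per page
  for (_ <- 0 until 10) {
    val contribs = links.join(ranks).values.flatMap {
      case (neighbors, rank) => neighbors.map(dest => (dest, rank / neighbors.size))
    }
    ranks = contribs.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
  }
  ranks.collect().foreach(println)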
Important topics not covered in this intro
MLlib
Machine Learning in the distributed way
Basic Linear Algebra in the distributed way: sparse and dense vectors
and matrices
Partitioning
No free lunch: there is no automagic scaling of arbitrary algorithms
Making an algorithm efficient largely means minimizing shuffling of data across the cluster
Spark SQL, Spark 2.0, Datasets, DataFrames
Similar to Python’s pandas or R’s data frames
Great for interactive data mining and for working with CSV files