Apache Spark has grown into one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform for solving a range of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is to get started using Spark to build your own high-performance big data applications today.
2. Founded by the creators of Apache Spark out of UC Berkeley's AMPLab
Fully committed to 100% open source Apache Spark
Support and grow the Spark community and ecosystem
Building Databricks Cloud
3. Databricks & Datastax
Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5
Databricks & Datastax have partnered for Apache Spark engineering and support
4. Big Data Analytics: Where We've Been
• 2003 & 2004 - Google GFS & MapReduce papers are precursors to Hadoop
• 2006 & 2007 - Google BigTable and Amazon Dynamo papers are precursors to Cassandra, HBase, and others
9. What's Working?
Many Excellent Innovations Have Come From Big Data Analytics:
• Distributed & Data Parallel is disruptive ... because we needed it
• We Now Have Massive throughput… Solved the ETL Problem
• The Data Hub/Lake Is Possible
10. What Needs to Improve? Go Beyond MapReduce
MapReduce is a very powerful and flexible engine, with processing throughput previously unobtainable on commodity equipment.
But MapReduce isn't enough:
• Essentially batch-only
• Inefficient with respect to memory use and latency
• Too hard to program
11. What Needs to Improve? Go Beyond (S)QL
SQL support has been a welcome interface on many platforms, and in many cases a faster alternative.
But SQL is often not enough:
• Sometimes you want to write real programs (loops, variables, functions, existing libraries) but don't want to build UDFs
• Machine learning (see above, plus iterative algorithms)
• Multi-step pipelines
• Often an additional system
12. What Needs to Improve? Ease of Use
Big data distributions provide a number of useful tools and systems, and choices are good to have.
But this is often unsatisfactory:
• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging
• A typical solution requires stringing together disparate systems - we need unification
• Developers want the full power of their programming language
13. What Needs to Improve? Latency
Big data systems are throughput-oriented; some new SQL systems provide interactivity.
But we need more:
• Interactivity beyond SQL interfaces
• Repeated access of the same datasets (i.e. caching)
15. Apache Spark
Originally developed in 2009 in UC Berkeley's AMPLab
Fully open sourced in 2010 - now at the Apache Software Foundation
http://spark.apache.org
16. Project Activity
                        June 2013    June 2014
total contributors           68          255
companies contributing       17           50
total lines of code      63,000      175,000
18. Compared to Other Projects
[Bar charts: commits and lines of code changed, activity in the past 6 months, Spark vs. other projects]
19. Compared to Other Projects
Spark is now the most active project in the Hadoop ecosystem
20. Spark on GitHub
So active on GitHub, sometimes we break it
Over 1200 forks (can't display network graphs)
~80 commits to master each week
So many PRs that we built our own PR UI
21. Apache Spark - Easy to Use and Very Fast
Fast and general cluster computing system interoperable with big data systems like Hadoop and Cassandra
Improved Efficiency:
• In-memory computing primitives
• General computation graphs
Improved Usability:
• Rich APIs
• Interactive shell
22. Apache Spark - Easy to Use and Very Fast
Fast and general cluster computing system interoperable with big data systems like Hadoop and Cassandra
Improved Efficiency:
• In-memory computing primitives - up to 100× faster (2-10× on disk)
• General computation graphs
Improved Usability:
• Rich APIs - 2-5× less code
• Interactive shell
23. Apache Spark - A Robust SDK for Big Data Applications
[Diagram: the Spark stack - SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]
Unified system with libraries to build a complete solution
Full-featured programming environment in Scala, Java, Python…
Very developer-friendly, functional API for working with data
Runtimes available on several platforms
24. Spark Is a Part of Most Big Data Platforms
• All major Hadoop distributions include Spark
• Spark is also integrated with non-Hadoop big data platforms like DSE
• Spark applications can be written once and deployed anywhere
[Diagram: the Spark stack (SQL, Machine Learning, Streaming, Graph on Core) - deploy Spark apps anywhere]
25. Easy: Get Started Immediately
Interactive shell, multi-language support

Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
26. Easy: Clean API
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save) - see the short sketch below
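To make the distinction concrete, here is a minimal Scala sketch (the HDFS path is hypothetical): transformations only describe the dataset, and nothing runs until an action is called.

val lines  = sc.textFile("hdfs://.../input.txt")   // build an RDD from HDFS
val errors = lines.filter(_.contains("ERROR"))     // transformation: lazy, nothing runs yet
errors.cache()                                     // keep the partitions in RAM once computed
val n = errors.count()                             // action: triggers the distributed job
errors.take(5).foreach(println)                    // action: bring a small sample to the driver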
28. Easy: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
(A small example combining a few of these follows below.)
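As a rough illustration (the datasets here are made up), a few of these operations compose naturally on key-value RDDs:

import org.apache.spark.SparkContext._   // pair-RDD functions (needed explicitly before Spark 1.3)

// Sum order amounts per user, then attach each user's city.
val orders = sc.parallelize(Seq(("alice", 20.0), ("bob", 12.5), ("alice", 7.5)))
val cities = sc.parallelize(Seq(("alice", "SFO"), ("bob", "NYC")))

val totals = orders.reduceByKey(_ + _)          // (user, total spend)
val joined = totals.join(cities)                // (user, (total, city))
joined.sortByKey().collect().foreach(println)   // action: bring results to the driver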
29. Easy: Example – Word Count

Hadoop MapReduce
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
31. Easy: Works Well With Hadoop
Data Compatibility
• Access your existing Hadoop data
• Use the same data formats
• Adheres to data locality for efficient processing
Deployment Models
• "Standalone" deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on an existing Hadoop cluster or side-by-side
(A small sketch of switching deployment targets follows below.)
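For illustration, a minimal Scala sketch (host names and paths are hypothetical) of pointing the same application at different cluster managers while reading existing HDFS data; the master URL selects the deployment model, the rest of the program is unchanged.

import org.apache.spark.{SparkConf, SparkContext}

// Possible master URLs (Spark 1.x syntax):
//   local[*]              - run locally, e.g. for development
//   spark://master:7077   - standalone Spark cluster
//   mesos://master:5050   - Mesos cluster
//   yarn-client           - existing Hadoop/YARN cluster
val conf = new SparkConf().setAppName("hdfs-example").setMaster("yarn-client")
val sc = new SparkContext(conf)

// Same data, same formats: read existing Hadoop data in place.
val events = sc.textFile("hdfs://namenode:8020/data/events")
println(events.count())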
32. Example: Logistic Regression
data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w
33. Fast: Using RAM, Operator Graphs
In-memory Caching
• Data partitions read from RAM instead of disk (see the caching sketch below)
Operator Graphs
• Scheduling optimizations
• Fault tolerance
[Diagram: an RDD operator graph - map, groupBy, join and filter operations grouped into Stages 1-3, with cached partitions marked]
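A small caching sketch in Scala (path hypothetical): the first action materializes the filtered data and fills the cache; later actions read the cached partitions from RAM instead of going back to disk.

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://.../app.log")
val errs = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)
errs.count()   // first action: reads from HDFS and caches the partitions
errs.count()   // second action: served from cached partitions in RAM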
34. Fast: Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1-30) - Hadoop takes roughly 110 s per iteration; Spark takes about 80 s for the first iteration and about 1 s for each further iteration]
35. Fast: Scales Down Seamlessly
[Chart: execution time (s) vs. % of working set in cache - from about 69 s with the cache disabled, improving steadily to about 11.5 s when fully cached]
36. Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data (see the sketch below).
msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])
[Diagram: HDFS file -> filter (func = startswith(...)) -> filtered RDD -> map (func = split(...)) -> mapped RDD]
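For reference, the same pipeline in the Scala shell (path hypothetical); toDebugString prints the lineage Spark would replay to rebuild any lost partitions.

val msgs = sc.textFile("hdfs://.../app.log")
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))
println(msgs.toDebugString)   // shows the chain: textFile -> filter -> map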
40. Working With RDDs
[Diagram: RDDs flowing through transformations]
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
41. Working With RDDs
[Diagram: RDDs flowing through transformations; an action returns a value]
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()   # action -> value
74
linesWithSpark.first()   # action -> value
# Apache Spark
42. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
43. Example: Log Mining
[Diagram: a Driver program coordinating three Workers]
44. Example: Log Mining
lines = spark.textFile("hdfs://...")
[Diagram: Driver and three Workers]
45. Example: Log Mining
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
[Diagram: Driver and three Workers]
47. Example: Log Mining
lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
[Diagram: Driver and three Workers]
48. Example: Log Mining
messages.filter(lambda s: "mysql" in s).count()   # count() is an action
[Diagram: Driver and three Workers]
49. Example: Log Mining
[Diagram: each Worker holds one HDFS block of the log (Block 1, Block 2, Block 3)]
50. Example: Log Mining
[Diagram: the Driver sends tasks to each Worker]
51. Example: Log Mining
[Diagram: each Worker reads its HDFS block]
52. Example: Log Mining
[Diagram: each Worker processes its block and caches the data (Cache 1, Cache 2, Cache 3)]
53. Example: Log Mining
[Diagram: Workers return results for the first query to the Driver]
54. Example: Log Mining
A second query runs over the cached messages:
messages.filter(lambda s: "php" in s).count()
55. Example: Log Mining
[Diagram: the Driver sends tasks for the second query to the Workers]
56. Example: Log Mining
[Diagram: Workers process the second query directly from the cache]
57. Example: Log Mining
[Diagram: results of the second query return to the Driver]
58. Example: Log Mining
Cache your data ➔ Faster Results
Full-text search of Wikipedia:
• 60 GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on-disk
59. Cassandra + Spark: A Great Combination
Both are easy to use
Spark can help you bridge your Hadoop and Cassandra systems
Use Spark libraries and caching on top of Cassandra-stored data
Combine Spark Streaming with Cassandra storage (a short connector sketch follows below)
DataStax spark-cassandra-connector:
https://github.com/datastax/spark-cassandra-connector
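A rough sketch of the connector API in Scala; the test keyspace, kv table, and contact-point address are hypothetical, and the connector jar must be on the application classpath.

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-example")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // Cassandra contact point
val sc = new SparkContext(conf)

val kv = sc.cassandraTable("test", "kv")                 // read a Cassandra table as an RDD of rows
println(kv.count())

sc.parallelize(Seq(("key3", 3), ("key4", 4)))
  .saveToCassandra("test", "kv", SomeColumns("key", "value"))   // write an RDD back to Cassandra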
60. Schema RDDs (Spark SQL)
• Built-in mechanism for recognizing structured data in Spark
• Allows systems to apply several data access and relational optimizations (e.g. predicate push-down, partition pruning, broadcast joins)
• Columnar in-memory representation when cached
• Native support for structured formats like Parquet and JSON
• Great compatibility with the rest of the stack (Python, libraries, etc.)
(A minimal Schema RDD sketch follows below.)
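A minimal Schema RDD sketch against the Spark 1.x SQL API; the Person case class and people.txt file are made up for illustration.

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD          // implicit RDD -> SchemaRDD conversion

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")           // expose the RDD to SQL queries
sqlContext.cacheTable("people")            // columnar in-memory caching

val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teens.collect().foreach(println)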
61. Thank You!
Visit http://databricks.com: blogs, tutorials and more
Questions?