3. WHAT IS SPARK
Spark Definition
• Big Data Analytic Engine
• Cluster Computing Framework
4. WHAT IS SPARK
Spark History
[Timeline figure: 2009, 2011, 2013, now]
• Inspired by a Microsoft paper and created at the UC Berkeley AMP Lab in 2009
• Open source
• Donated to the Apache Software Foundation in 2013
• Current stable version: 1.5.2
5. WHY SPARK
Spark Benefits
• Speed
– It is claimed to be up to 100x faster than Hadoop. The reason is that instead of writing all the data to hard disk, Spark temporarily stores it in RAM.
– (Ben: in practice, however, it is not that much faster. The speedup is nowhere near 100x, at least on some applications.)
• Easy to use
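The disk-versus-RAM argument above can be sketched with a plain-Python toy. This is not Spark code, and every function name here is hypothetical. It only counts how often the input is read from disk when a multi-pass job re-reads its input on every pass, compared with loading it once and keeping it in memory. It also hints at why the gain depends on the application: a job that makes a single pass over its data saves nothing.

```python
import os
import tempfile


def write_dataset(path, n):
    """Write the integers 0..n-1 to a text file, one per line."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def load(path, counter):
    """Read the whole dataset from disk and record that a disk read happened."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return [int(line) for line in f]


def iterate_from_disk(path, passes, counter):
    """Disk-oriented style: every pass re-reads the input from disk."""
    total = 0
    for _ in range(passes):
        total += sum(load(path, counter))
    return total


def iterate_from_memory(path, passes, counter):
    """In-memory style: read once, keep the data in RAM, reuse it on each pass."""
    cached = load(path, counter)
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


def run_demo(n=1000, passes=5):
    """Return ((result, disk_reads), (result, disk_reads)) for the two styles."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        write_dataset(path, n)
        disk_counter = {"disk_reads": 0}
        mem_counter = {"disk_reads": 0}
        disk_result = iterate_from_disk(path, passes, disk_counter)
        mem_result = iterate_from_memory(path, passes, mem_counter)
    return (disk_result, disk_counter["disk_reads"]), (mem_result, mem_counter["disk_reads"])
```

Both styles return the same result. The only difference is the number of disk reads: one per pass in the first style, a single read in the second.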
6. WHY SPARK
Comparison of three main distributed computing frameworks
                      Hadoop                  Spark                   Storm
Source                Google                  UC Berkeley AMP Lab     Twitter
Open source date      2007                    2011                    2011
Supported languages   Java, and many others   Scala, Java, Python     Java, Clojure
Latency               High                    Seconds                 Real-time
Scenario              Low real-time;          Medium-size data        Small data chunks;
                      large volume of big     blocks; more            high real-time
                      data; one batch         real-time
Used by               Facebook, Google        Google, Taobao          Twitter, Sina Weibo
Source: compiled from what I found on Wikipedia and Google over time.
7. SPARK STRUCTURE
API
• Spark SQL: allows users to interact with data through SQL-like queries.
• Spark Streaming: allows users to process data in real time and in batches.
• MLlib: allows users to process data using machine learning.
• GraphX: allows users to build, transform, and reason about graph data.
Source: https://databricks.com/spark/about
8. RUN SPARK
Spark Start Mode
• Local (Mainly for testing)
• Standalone
• Mesos (popular)
• YARN (popular)
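Each start mode above corresponds to a different master URL passed to Spark, for example through the --master option of spark-submit. The small helper below is a hypothetical illustration, not part of Spark. It only builds the URL string for each mode, assuming the default ports 7077 (standalone) and 5050 (Mesos) and the Spark 1.x YARN value "yarn-client". Host names are placeholders.

```python
def master_url(mode, host=None, port=None, cores="*"):
    """Return a Spark master URL string for a start mode (hypothetical helper)."""
    mode = mode.lower()
    if mode == "local":
        # Run in a single JVM; cores="*" means use all available cores.
        return f"local[{cores}]"
    if mode in ("standalone", "mesos"):
        if host is None:
            raise ValueError(f"{mode} mode needs the host of the cluster master")
        if mode == "standalone":
            # Spark's built-in cluster manager; assumed default port 7077.
            return f"spark://{host}:{port or 7077}"
        # Mesos master; assumed default port 5050.
        return f"mesos://{host}:{port or 5050}"
    if mode == "yarn":
        # Spark 1.x accepts "yarn-client" or "yarn-cluster"; the cluster
        # location is taken from the Hadoop configuration, not from the URL.
        return "yarn-client"
    raise ValueError(f"unknown start mode: {mode}")
```

For example, master_url("standalone", host="master-host") gives spark://master-host:7077, and master_url("local", cores=2) gives local[2], which is convenient for testing on a laptop.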