This presentation details the capabilities of in-memory analytics using Apache Spark: an overview of Apache Spark, its programming model, cluster mode with Mesos, supported operations, and a comparison with Hadoop MapReduce, followed by the Spark stack extensions Shark, Spark Streaming, MLlib, and GraphX.
2. Agenda
Overview of Spark
Spark with Hadoop MapReduce
Spark Elements and Operations
Spark Cluster Overview
Spark Examples
Spark Stack Extensions:
Shark
Streaming
MLlib
GraphX
3. In-Memory Analytics
• In-memory analytics is an approach to querying data while it resides in a computer's random access memory (RAM), as opposed to querying data that is stored on physical disks.
• This results in vastly shortened query response times, allowing business intelligence (BI) and analytic applications to support faster business decisions.
• As the cost of RAM declines, in-memory analytics is becoming feasible for many businesses.
• BI and analytic applications have long supported caching data in RAM, but older 32-bit operating systems provided only 4 GB of addressable memory.
• Newer 64-bit operating systems, with up to 1 terabyte (TB) of addressable memory (and perhaps more in the future), have made it possible to cache large volumes of data -- potentially an entire data warehouse or data mart -- in a computer's RAM.
4. Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
In-memory data storage for very fast iterative queries
General execution graphs and powerful optimizations
Up to 40x faster than Hadoop
Spark beats Hadoop by providing primitives for in-memory cluster computing, thereby avoiding the I/O bottleneck between the individual jobs of an iterative MapReduce workflow that repeatedly performs computations on the same working set. A minimal sketch of such an iterative job follows below.
Compatible with Hadoop's storage APIs
Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
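As a sketch only (the file name, parsing, and update rule are illustrative assumptions, not from the slides), an iterative Spark job can cache its working set in RAM once and then run many passes over it without touching disk:

// assumes the Spark shell's SparkContext, sc, as in the later slides
val points = sc.textFile("points.txt").map(_.toDouble).cache()  // working set pinned in RAM

var w = 0.0
for (i <- 1 to 10) {
  // every pass re-reads the cached RDD from memory instead of
  // rematerializing intermediate results on disk between jobs
  w += 0.1 * points.map(p => p - w).mean()
}

A chain of MapReduce jobs would instead write and re-read the working set through HDFS on every iteration.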
What is Spark?
- Lightning-fast cluster computing
8. Spark Programming Model
Key idea: Resilient Distributed Datasets (RDDs)
Distributed collections of objects that can be cached in memory across cluster nodes
Manipulated through various parallel operations
Automatically rebuilt on failures
Types of RDD:
Parallelized collections: take an existing Scala collection and run functions on it in parallel
scala> val data = Array(1, 2, 3, 4, 5)
scala> val distData = sc.parallelize(data)
distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system or any other storage system supported by Hadoop
scala> val distFile = sc.textFile("data.txt")
distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
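Once created, both kinds of RDD support the same parallel operations. A quick shell illustration (the sum and line-length computations are illustrative additions, not from the slides):

scala> distData.reduce(_ + _)       // sum the parallelized collection across the cluster
res0: Int = 15

scala> distFile.map(_.length).reduce(_ + _)   // total characters in the Hadoop-backed file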
11. For example, consider the following job (splitLines and extractKey stand in for user-defined helper functions):
val errors  = rdd1.map(splitLines).filter(_.contains("ERROR"))
val grouped = rdd2.map(splitLines).groupBy(extractKey)
grouped.join(errors.keyBy(extractKey)).take(10)
Automatic Parallelization of Complex Flows
When constructing a complex pipeline of MapReduce jobs, the task of correctly parallelizing the sequence of jobs is left to you. Thus, a scheduler tool such as Apache Oozie is often required to carefully construct this sequence.
With Spark, a whole series of individual tasks is expressed as a single program flow that is lazily evaluated, so that the system has a complete picture of the execution graph.
This approach allows the core scheduler to correctly map the dependencies across the different stages in the application and automatically parallelize the flow of operators without user intervention, as the sketch below illustrates.
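A minimal sketch of lazy evaluation (the file name and filters are illustrative assumptions): nothing below executes until the final action, so Spark sees the whole graph before scheduling any work:

// transformations only build the DAG; they run nothing yet
val lines  = sc.textFile("events.log")
val errors = lines.filter(_.contains("ERROR"))
val byHost = errors.map(line => (line.split(" ")(0), 1))
val counts = byHost.reduceByKey(_ + _)

// the action triggers one job; Spark now plans stages for the
// whole lineage at once and pipelines the map/filter steps
counts.take(10).foreach(println)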
12. Spark vs Hadoop
Spark is a high-speed cluster computing system, compatible with Hadoop, that can outperform Hadoop MapReduce by up to 100x thanks to its ability to perform computations in memory.
13. Transformations (e.g. map, filter, groupBy):
Create a new dataset from an existing one
Actions (e.g. count, collect, save):
Return a value to the driver program after running a computation on the dataset (see the sketch below)
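A small illustration of the distinction, assuming the data.txt file from the earlier slide; transformations are lazy and return a new RDD, while the action at the end actually runs the computation:

val lines   = sc.textFile("data.txt")   // base RDD
val lengths = lines.map(_.length)       // transformation: lazy, returns a new RDD
val longest = lengths.filter(_ > 80)    // transformation: still nothing has run
val n       = longest.count()           // action: triggers the job, returns a value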
14. Spark Elements
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
15. Spark Cluster Overview
Cluster Manager Types
• Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
• Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
• Hadoop YARN – the resource manager in Hadoop 2.
The sketch below shows how an application selects one.
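As a sketch of selecting a cluster manager (the host names and application name are placeholder assumptions), the master URL passed to Spark picks among the three types:

import org.apache.spark.{SparkConf, SparkContext}

// "spark://host:7077"  -> standalone cluster manager
// "mesos://host:5050"  -> Apache Mesos
// "yarn-client"        -> Hadoop YARN (resource manager in Hadoop 2)
val conf = new SparkConf()
  .setAppName("ClusterDemo")
  .setMaster("spark://host:7077")
val sc = new SparkContext(conf)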
16. Mesos (Dynamic Resource Sharing for Clusters) Run Modes
Spark can run over Mesos in two modes: "fine-grained" and "coarse-grained".
In fine-grained mode, which is the default, each Spark task runs as a separate Mesos task.
This allows multiple instances of Spark (and other frameworks) to share machines at a very fine granularity, where each application gets more or fewer machines as it ramps up, but it comes with additional overhead in launching each task, which may be inappropriate for low-latency applications (e.g. interactive queries or serving web requests).
Coarse-grained mode instead launches only one long-running Spark task on each Mesos machine, and dynamically schedules its own "mini-tasks" within it.
The benefit is much lower startup overhead, but at the cost of reserving the Mesos resources for the complete duration of the application. The sketch below shows how the mode is selected.
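A minimal configuration sketch (the Mesos master URL and app name are placeholders); in Spark of this era the mode is switched with the spark.mesos.coarse property:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MesosDemo")
  .setMaster("mesos://host:5050")
  .set("spark.mesos.coarse", "true")  // omit (or set "false") for the default fine-grained mode
val sc = new SparkContext(conf)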
18. Task Scheduler
• Runs general DAGs
• Pipelines functions within a stage
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles
A small sketch of cache-aware reuse follows.
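A sketch of the cache-aware reuse the scheduler exploits (the file name and filters are illustrative assumptions): once an RDD is cached, later jobs are scheduled on the nodes that already hold its partitions:

// cache the log once; the scheduler places later tasks
// on the nodes where these partitions already live
val logs = sc.textFile("app.log").cache()

val errorCount   = logs.filter(_.contains("ERROR")).count()  // first job materializes the cache
val warningCount = logs.filter(_.contains("WARN")).count()   // second job reuses in-memory partitions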
19. Spark Stack Extensions
Spark powers a stack of high-level tools, including:
Shark for SQL
MLlib for machine learning
GraphX for graph processing
Spark Streaming for stream processing
You can combine these frameworks seamlessly in the same application.
21. Shark
Shark makes Hive faster and more powerful.
Shark is a data analysis system that marries query processing with complex analytics on large clusters.
Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.
Speed: runs Hive queries up to 100x faster in memory, or 10x on disk. A brief usage sketch follows.
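A usage sketch, with the caveat that the table name and query are invented for illustration, and the sql2rdd call (from Shark-era documentation) is an assumption about that project's interface:

// run a HiveQL query through Shark and get the result back as an RDD
// that can feed further Spark transformations
val topPages = sharkContext.sql2rdd(
  "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")
topPages.take(10).foreach(println)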
23. Streaming
Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. Since Spark Streaming is built on top of Spark, users can apply Spark's built-in machine learning (MLlib) and graph processing (GraphX) algorithms on data streams.
Counting tweets on a sliding window:
TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(5))
Finding words with higher frequency than in historic data:
stream.join(historicCounts).filter {
  case (word, (curCount, oldCount)) => curCount > oldCount
}
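To make the API concrete, here is a minimal, self-contained word-count sketch (the socket source on localhost:9999 and the one-second batch interval are illustrative choices, not from the slides):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// one-second micro-batches over a text stream from a socket
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc  = new StreamingContext(conf, Seconds(1))

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)

counts.print()          // emit each batch's counts to the console
ssc.start()             // start receiving and processing
ssc.awaitTermination()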
24. MLlib
MLlib is Apache Spark's scalable machine learning library.
MLlib fits into Spark's APIs and interoperates with NumPy in Python (starting in Spark 0.9). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
Calling MLlib in Scala (parsePoint is a user-defined parser; the values of k and maxIterations are illustrative):
val points = sc.textFile("hdfs://...")
               .map(parsePoint)
val model = KMeans.train(points, k = 10, maxIterations = 20)
25. GraphX
Unifying graphs and tables
GraphX extends Spark's distributed, fault-tolerant collections API and interactive console with a new graph API. It leverages recent advances in graph systems (e.g. GraphLab) to let users easily and interactively build, transform, and reason about graph-structured data at scale.
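A minimal sketch of the graph API (the vertices and edges are made-up illustration data):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// a tiny property graph: users as vertices, "follows" relations as edges
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(users, follows)

// a table-style view back out of the graph: who is followed the most?
graph.inDegrees.join(users).collect().foreach {
  case (_, (inDeg, name)) => println(s"$name is followed by $inDeg users")
}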
26. BDAS, the Berkeley Data Analytics Stack
https://amplab.cs.berkeley.edu/software/
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.
27. Software and Research Projects
Shark - Hive and SQL on top of Spark
MLbase - machine learning project on top of Spark
BlinkDB - a massively parallel, approximate query engine built on top of Shark and Spark
GraphX - a graph processing & analytics framework on top of Spark (GraphX has been merged into Spark 0.9)
Apache Mesos - a cluster management system that supports running Spark
Tachyon - an in-memory storage system that supports running Spark
Apache MRQL - a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark
OpenDL - a deep learning algorithm library based on the Spark framework, recently kicked off
SparkR - an R frontend for Spark
Spark Job Server - a REST interface for managing and submitting Spark jobs on the same cluster
28. Conclusion
"Big Data" is moving beyond one-pass batch jobs, to low-latency applications that need data sharing.
RDDs offer fault-tolerant sharing at memory speed.
Spark uses them to combine streaming, batch & interactive analytics in one system.