Apache Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages.
Hazelcast, an open-source in-memory data grid capable of amazing feats of scale, provides a wide range of distributed computing primitives, including ExecutorService, MapReduce (M/R), and Aggregations frameworks.
Data exploration and analysis require that data scientists be able to ask questions nobody planned for, and get answers fast!
In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!
5. @gamussa @hazelcast #oraclecode
When to use Spark?
Data Science Tasks
when questions are unknown
Data Processing Tasks
when you have too much data
You’re tired of Hadoop
Resilient Distributed Datasets (RDD)
are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel
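The RDD idea above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's API: `ToyRDD`, `parallelize`, and `_compute_partition` are hypothetical names chosen for illustration. It assumes fault tolerance works by keeping the immutable source partitions together with the list of transformations (the lineage), so any lost partition can be recomputed, and it uses a thread pool to stand in for parallel execution across a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus the lineage to rebuild it."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions  # immutable input, one tuple per partition
        self._lineage = lineage           # functions applied per element, in order

    def map(self, fn):
        # Transformations are lazy: record the step, compute nothing yet.
        return ToyRDD(self._source, self._lineage + (fn,))

    def _compute_partition(self, index):
        # A partition is rebuilt from the source and the lineage alone,
        # which is the idea behind recovering a lost partition.
        result = []
        for item in self._source[index]:
            for fn in self._lineage:
                item = fn(item)
            result.append(item)
        return result

    def collect(self):
        # Action: compute all partitions in parallel, then concatenate in order.
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(self._compute_partition, range(len(self._source))))
        return [item for part in parts for item in part]


def parallelize(data, num_partitions=2):
    """Split a list into roughly equal partitions (hypothetical helper)."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return ToyRDD(tuple(tuple(data[i:i + size]) for i in range(0, len(data), size)))


squares = parallelize([1, 2, 3, 4, 5], num_partitions=2).map(lambda x: x * x)
print(squares.collect())
```

Nothing is computed until `collect()` is called, and recomputing a single partition needs only its source data and the recorded functions. Real Spark adds scheduling, shuffles, and caching on top of this basic idea.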
Hadoop datasets
run functions on each record of a file in the Hadoop Distributed File System (HDFS) or any other storage system supported by Hadoop
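"Run a function on each record of a file" can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark code: `map_records` is a hypothetical helper, each line of a local text file is treated as one record, and a temporary file stands in for a file in HDFS. In PySpark the same idea is usually written along the lines of `sc.textFile(path).map(fn)`.

```python
import os
import tempfile


def map_records(path, fn):
    """Lazily apply fn to each line (record) of a text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield fn(line.rstrip("\n"))


# Create a small sample file so the sketch runs on its own.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("spark\nhazelcast\nhadoop\n")
    sample_path = tmp.name

# Apply a function (here: len) to every record and gather the results.
lengths = list(map_records(sample_path, len))
os.remove(sample_path)
print(lengths)
```

The generator reads one record at a time, so the function is applied without loading the whole file into memory. A distributed engine does the same per record, but over file blocks spread across many machines.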
Hazelcast IMDG
is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs parallel execution for breakthrough application speed and scale