2. Agenda for the Day
• How do we mine useful information from massive datasets?
• Overview of problems in the Big Data space.
• Apache Hadoop and its limitations.
• Introducing Apache Spark.
• Switching gears: exploring Spark internals.
• Discuss an active research area, "BlinkDB": queries with bounded response times on very large data.
• Conclusion
7. Not so simple with large datasets… How to solve this problem?
8. Back in 2000, Google had a problem…
• Processing massive amounts of raw data (crawled documents, web request logs).
• Specialized hardware (vertical scaling) was super expensive.
• The computation thus needed to be distributed.
Solution: Google's MapReduce paradigm, by Jeffrey Dean and Sanjay Ghemawat. Essentially a distributed cluster computing framework.
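The paradigm can be illustrated with a minimal single-machine sketch (plain Python standing in for the distributed framework; the function names are illustrative, not Google's API). The framework runs many mappers in parallel, shuffles their intermediate pairs by key, and runs reducers on each group:

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit one intermediate (key, value) pair per word."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all values for one key."""
    return (key, sum(values))

documents = ["big data big clusters", "big ideas"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
# counts == {'big': 3, 'data': 1, 'clusters': 1, 'ideas': 1}
```

Because mappers and reducers are pure functions over independent chunks of data, the framework is free to spread them across thousands of commodity machines and re-run any that fail.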
9. What are the core ideas behind MapReduce, and why does it scale to massive datasets?
10. Why is MapReduce so important?
• Uses commodity hardware for computation.
• Overcomes commodity hardware limitations: provides a solution for an environment where failures are very frequent.
• Pushes computation to the data (rather than the other way round).
• Provides abstraction to focus on domain logic: a programming / cluster computing model for distributed computing using the functional primitives map and reduce.
16. Three Major Big Data Scenarios
• Interactive queries: enable faster decisions. Example: query website logs to diagnose why the website is slow (Apache Pig, Apache Hive...).
• Sophisticated batch data processing: enable better decisions. Example: trend analysis, analytics.
• Streaming data processing: real-time decision making. Example: fraud detection, detecting DDoS attacks.
17. After Google's initial work… the open source community soon caught up with the Hadoop ecosystem!
20. There are many inherent limitations with the MapReduce ecosystem…
21. MapReduce Limitations
• Iterative jobs: common algorithms apply a function repeatedly to the same dataset. While each iteration can be expressed as a MapReduce job, each job must store and then reload data from disk.
• Interactive analysis: there is a need to run ad-hoc queries on datasets. We want to be able to load a dataset into memory across machines and query it repeatedly.
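The iterative-job penalty can be simulated in a few lines of plain Python (a toy stand-in, not Hadoop code): when every iteration is a separate job, disk reads grow with the number of iterations, whereas loading once and iterating on the in-memory copy pays the I/O cost only once.

```python
# Toy simulation of the iterative-job limitation (not real Hadoop/Spark code).
disk_reads = 0

def load_from_disk(data):
    """Each MapReduce-style job re-reads its input from stable storage."""
    global disk_reads
    disk_reads += 1
    return list(data)

dataset = range(5)

# MapReduce style: every iteration is a separate job that reloads the data.
for _ in range(10):
    working_set = load_from_disk(dataset)
    result = [x + 1 for x in working_set]   # the per-iteration function
assert disk_reads == 10

# In-memory style (what Spark enables): load once, iterate on the cached copy.
disk_reads = 0
cached = load_from_disk(dataset)
for _ in range(10):
    result = [x + 1 for x in cached]
assert disk_reads == 1
```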
22. Example 1: Iterative Breadth-First Search in Hadoop.
23. Example 2: Ad-hoc SQL-like queries, anyone? Using Hive and Pig?
24. Spark to the Rescue! Spark easily outperforms Hadoop in the discussed scenarios by 10x–100x.
29. Introducing Spark
• Open source data analytics cluster computing framework.
• Provides primitives for in-memory cluster computing that allow users to load data into the cluster's memory and query it repeatedly (spills to HDFS only when needed).
• Allows interactive, ad-hoc data exploration (supports pipelining & lazy initialization).
• Unifies batch, streaming, and interactive computation.
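Pipelining and lazy initialization can be sketched in plain Python (generators standing in for Spark's lazy transformations; this is a conceptual model, not the Spark API): building the pipeline does no work, and each record flows through every stage only when a result is actually demanded.

```python
# Minimal sketch of lazy, pipelined evaluation (plain Python, not Spark code).
log = []

def read_lines():
    """Pretend data source; logs every line it actually reads."""
    for line in ["ERROR disk full", "INFO ok", "ERROR timeout"]:
        log.append("read")
        yield line

# Build the pipeline: nothing executes yet (lazy initialization).
errors = (line for line in read_lines() if line.startswith("ERROR"))
upper = (line.upper() for line in errors)

assert log == []                 # no work done until something forces it
first = next(upper)              # pipelining: one line flows through all stages
assert first == "ERROR DISK FULL"
assert log == ["read"]           # only one line was read to produce one result
```

Spark's transformations compose the same way, which is what lets it fuse stages and avoid materializing intermediate datasets.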
34. Exploring Spark Internals…
• Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.
36. Core of Spark: RDDs
• RDD: Resilient Distributed Datasets are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
• An RDD is a read-only collection of objects partitioned across a set of machines.
• RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items. This allows fault tolerance by logging transformations and building a dataset lineage that can be used to rebuild a partition if it is lost.
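Lineage-based recovery can be sketched in a few lines of plain Python (a conceptual model, not Spark's implementation): instead of replicating the data, we log the coarse-grained transformations and replay them over the stable source to rebuild any lost partition.

```python
# Conceptual sketch of RDD lineage (plain Python, not Spark internals).
class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions          # list of lists (one per machine)
        self.lineage = lineage or []          # transformations applied so far

    def map(self, fn):
        """Coarse-grained transformation: apply fn to every item; log it."""
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, self.lineage + [fn])

    def rebuild_partition(self, i, source_partition):
        """Recover a lost partition by replaying the lineage on source data."""
        data = list(source_partition)
        for fn in self.lineage:
            data = [fn(x) for x in data]
        self.partitions[i] = data

source = [[1, 2], [3, 4]]                     # original stable input
rdd = ToyRDD([list(p) for p in source]).map(lambda x: x * 10)

rdd.partitions[0] = None                      # simulate losing a partition
rdd.rebuild_partition(0, source[0])           # replay lineage, no replication
assert rdd.partitions == [[10, 20], [30, 40]]
```

Because the log records one operation per transformation rather than one entry per record, the lineage stays tiny even for huge datasets.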
42. Shared Variables in Spark:
- Broadcast variables
- Accumulators
Programmers can create two restricted types of shared variables to support two common, simple usage patterns…
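The two patterns can be sketched conceptually in plain Python (toy classes, not the Spark API): a broadcast variable ships one read-only copy of shared data to every worker, and an accumulator supports only adding, so workers can count events without ever reading each other's updates.

```python
# Toy sketches of the two shared-variable patterns (not the Spark API).
class Broadcast:
    """Read-only value shipped once to every worker."""
    def __init__(self, value):
        self._value = value
    @property
    def value(self):
        return self._value

class Accumulator:
    """Workers may only add; only the driver reads the total."""
    def __init__(self):
        self._total = 0
    def add(self, amount):
        self._total += amount
    @property
    def value(self):
        return self._total

lookup = Broadcast({"a": 1, "b": 2})   # e.g. a large lookup table
misses = Accumulator()                 # e.g. count of records not found

def worker(records):
    out = []
    for r in records:
        if r in lookup.value:
            out.append(lookup.value[r])
        else:
            misses.add(1)
    return out

results = worker(["a", "x", "b", "y"])
assert results == [1, 2]
assert misses.value == 2
```

Restricting the variables this way is what keeps them cheap to implement on a cluster: broadcasts never change, and additions commute, so no coordination between workers is needed.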
43. Initial Experiment Results
Logistic regression: 29 GB dataset.
Interactive data mining on Wikipedia: 1 TB from disk took 170s.
44. Back to our Question… Why is Spark so Powerful?
45. RDDs: Expressivity and Generality
• RDDs are able to express a diverse set of programming models, as their restrictions have little impact on parallel applications.
• A lot of these programs naturally apply the same operation to many records, making them a good fit.
• Previous systems explored specific problems with MapReduce. However, at the core of the problem is the need for a common data-sharing abstraction.
• RDDs capture all major optimizations: keeping specific data in memory, custom partitioning to minimize communication, and recovering from failures effectively.
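Custom partitioning, the communication optimization mentioned above, can be sketched in plain Python (hash partitioning by key; a toy model, not Spark's Partitioner API): records that share a key land on the same partition, so a later group-by or join over that key needs no cross-machine shuffle.

```python
# Toy hash partitioner (conceptual sketch, not Spark's Partitioner API).
def partition_by_key(records, num_partitions):
    """Place each (key, value) pair so that equal keys share a partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

pages = [("url1", "pageA"), ("url2", "pageB"), ("url1", "pageC")]
parts = partition_by_key(pages, 2)

# All records for "url1" are co-located: grouping them needs no communication.
for part in parts:
    keys = [k for k, _ in part]
    assert keys.count("url1") in (0, 2)
```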
51. Explore More of BlinkDB
• Visit: http://blinkdb.org/
• Blink and It's Done: Interactive Queries on Very Large Data. Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. In PVLDB 5(12): 1902-1905, 2012, Istanbul, Turkey.
• BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. In ACM EuroSys 2013, Prague, Czech Republic (Best Paper Award).