2. Agenda for the Day
• How do we mine useful information from massive datasets?
• Overview of problems in the Big Data space.
• Apache Hadoop and its limitations.
• Introducing Apache Spark.
• Switching gears: exploring Spark internals.
• Discuss an active research area, "BlinkDB": queries with bounded response times on very large data.
• Conclusion
7. Not so simple with large datasets… How to solve this problem?
8. Back in 2000, Google had a problem…
• Processing massive amounts of raw data (crawled documents, web request logs).
• Specialized hardware (vertical scaling) was super expensive.
• The computation thus needed to be distributed.
Solution: Google's MapReduce paradigm, by Jeffrey Dean and Sanjay Ghemawat. Essentially a distributed cluster computing framework.
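The paradigm can be illustrated with a minimal single-machine sketch (plain Python standing in for the distributed framework; the function names are illustrative, not Google's API). The framework runs many mappers in parallel, shuffles their intermediate pairs by key, and runs reducers on each group:

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit one intermediate (key, value) pair per word."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all values for one key."""
    return (key, sum(values))

documents = ["big data big clusters", "big ideas"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(intermediate).items())
# counts == {'big': 3, 'data': 1, 'clusters': 1, 'ideas': 1}
```

Because mappers and reducers are pure functions over independent chunks of data, the framework is free to spread them across thousands of commodity machines and re-run any that fail.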
9. What are the core ideas behind MapReduce, and why does it scale to massive datasets?
10. Why is MapReduce so important?
• Uses commodity hardware for computation.
• Overcomes commodity hardware limitations: provides a solution for an environment where failures are very frequent.
• Pushes computation to the data (rather than the other way round).
• Provides abstraction to focus on domain logic: a programming / cluster computing model for distributed computing using the functional primitives map and reduce.
16. Three Major Big Data Scenarios
• Interactive queries: enable faster decisions. Example: query website logs to diagnose why the website is slow (Apache Pig, Apache Hive...).
• Sophisticated batch data processing: enable better decisions. Example: trend analysis, analytics.
• Streaming data processing: real-time decision making. Example: fraud detection, detecting DDoS attacks.
17. After Google's initial work… the open source community soon caught up with the Hadoop ecosystem!
20. There are many inherent limitations with the MapReduce ecosystem…
21. MapReduce Limitations
• Iterative jobs: common algorithms apply a function repeatedly to the same dataset. While each iteration can be expressed as a MapReduce job, each job must store and then reload data from disk.
• Interactive analysis: there is a need to run ad-hoc queries on datasets. We want to be able to load a dataset into memory across machines and query it repeatedly.
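The iterative-job penalty can be simulated in a few lines of plain Python (a toy stand-in, not Hadoop code): when every iteration is a separate job, disk reads grow with the number of iterations, whereas loading once and iterating on the in-memory copy pays the I/O cost only once.

```python
# Toy simulation of the iterative-job limitation (not real Hadoop/Spark code).
disk_reads = 0

def load_from_disk(data):
    """Each MapReduce-style job re-reads its input from stable storage."""
    global disk_reads
    disk_reads += 1
    return list(data)

dataset = range(5)

# MapReduce style: every iteration is a separate job that reloads the data.
for _ in range(10):
    working_set = load_from_disk(dataset)
    result = [x + 1 for x in working_set]   # the per-iteration function
assert disk_reads == 10

# In-memory style (what Spark enables): load once, iterate on the cached copy.
disk_reads = 0
cached = load_from_disk(dataset)
for _ in range(10):
    result = [x + 1 for x in cached]
assert disk_reads == 1
```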
22. Example 1: Iterative Breadth-First Search in Hadoop.
23. Example 2: Ad-hoc SQL-like queries, anyone? Using Hive and Pig?
24. Spark to the Rescue! Spark easily outperforms Hadoop in the discussed scenarios by 10x–100x.
29. Introducing Spark
• Open source data analytics cluster computing framework.
• Provides primitives for in-memory cluster computing that allow users to load data into the cluster's memory and query it repeatedly (spills to HDFS only when needed).
• Allows interactive, ad-hoc data exploration (supports pipelining & lazy initialization).
• Unifies batch, streaming, and interactive computation.
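Pipelining and lazy initialization can be sketched in plain Python (generators standing in for Spark's lazy transformations; this is a conceptual model, not the Spark API): building the pipeline does no work, and each record flows through every stage only when a result is actually demanded.

```python
# Minimal sketch of lazy, pipelined evaluation (plain Python, not Spark code).
log = []

def read_lines():
    """Pretend data source; logs every line it actually reads."""
    for line in ["ERROR disk full", "INFO ok", "ERROR timeout"]:
        log.append("read")
        yield line

# Build the pipeline: nothing executes yet (lazy initialization).
errors = (line for line in read_lines() if line.startswith("ERROR"))
upper = (line.upper() for line in errors)

assert log == []                 # no work done until something forces it
first = next(upper)              # pipelining: one line flows through all stages
assert first == "ERROR DISK FULL"
assert log == ["read"]           # only one line was read to produce one result
```

Spark's transformations compose the same way, which is what lets it fuse stages and avoid materializing intermediate datasets.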
34. Exploring Spark Internals…
• Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.
36. Core of Spark: RDDs
• RDD: Resilient Distributed Datasets are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
• An RDD is a read-only collection of objects partitioned across a set of machines.
• RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items. This allows fault tolerance by logging transformations and building a dataset lineage that can be used to rebuild a partition if it is lost.
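Lineage-based recovery can be sketched in a few lines of plain Python (a conceptual model, not Spark's implementation): instead of replicating the data, we log the coarse-grained transformations and replay them over the stable source to rebuild any lost partition.

```python
# Conceptual sketch of RDD lineage (plain Python, not Spark internals).
class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = partitions          # list of lists (one per machine)
        self.lineage = lineage or []          # transformations applied so far

    def map(self, fn):
        """Coarse-grained transformation: apply fn to every item; log it."""
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, self.lineage + [fn])

    def rebuild_partition(self, i, source_partition):
        """Recover a lost partition by replaying the lineage on source data."""
        data = list(source_partition)
        for fn in self.lineage:
            data = [fn(x) for x in data]
        self.partitions[i] = data

source = [[1, 2], [3, 4]]                     # original stable input
rdd = ToyRDD([list(p) for p in source]).map(lambda x: x * 10)

rdd.partitions[0] = None                      # simulate losing a partition
rdd.rebuild_partition(0, source[0])           # replay lineage, no replication
assert rdd.partitions == [[10, 20], [30, 40]]
```

Because the log records one operation per transformation rather than one entry per record, the lineage stays tiny even for huge datasets.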
42. Shared Variables in Spark:
- Broadcast variables
- Accumulators
Programmers can create two restricted types of shared variables to support two common, simple usage patterns…
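The two patterns can be sketched conceptually in plain Python (toy classes, not the Spark API): a broadcast variable ships one read-only copy of shared data to every worker, and an accumulator supports only adding, so workers can count events without ever reading each other's updates.

```python
# Toy sketches of the two shared-variable patterns (not the Spark API).
class Broadcast:
    """Read-only value shipped once to every worker."""
    def __init__(self, value):
        self._value = value
    @property
    def value(self):
        return self._value

class Accumulator:
    """Workers may only add; only the driver reads the total."""
    def __init__(self):
        self._total = 0
    def add(self, amount):
        self._total += amount
    @property
    def value(self):
        return self._total

lookup = Broadcast({"a": 1, "b": 2})   # e.g. a large lookup table
misses = Accumulator()                 # e.g. count of records not found

def worker(records):
    out = []
    for r in records:
        if r in lookup.value:
            out.append(lookup.value[r])
        else:
            misses.add(1)
    return out

results = worker(["a", "x", "b", "y"])
assert results == [1, 2]
assert misses.value == 2
```

Restricting the variables this way is what keeps them cheap to implement on a cluster: broadcasts never change, and additions commute, so no coordination between workers is needed.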
43. Initial Experiment Results
Logistic regression: 29 GB dataset.
Interactive data mining on Wikipedia: 1 TB from disk took 170s.
44. Back to our Question… Why is Spark so Powerful?
45. RDDs: Expressivity and Generality
• RDDs are able to express a diverse set of programming models, as their restrictions have little impact on parallel applications.
• A lot of these programs naturally apply the same operation to many records, making them a good fit.
• Previous systems explored specific problems with MapReduce. However, at the core of the problem is the need for a common data-sharing abstraction.
• RDDs capture all major optimizations: keeping specific data in memory, custom partitioning to minimize communication, and recovering from failures effectively.
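Custom partitioning, the communication optimization mentioned above, can be sketched in plain Python (hash partitioning by key; a toy model, not Spark's Partitioner API): records that share a key land on the same partition, so a later group-by or join over that key needs no cross-machine shuffle.

```python
# Toy hash partitioner (conceptual sketch, not Spark's Partitioner API).
def partition_by_key(records, num_partitions):
    """Place each (key, value) pair so that equal keys share a partition."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

pages = [("url1", "pageA"), ("url2", "pageB"), ("url1", "pageC")]
parts = partition_by_key(pages, 2)

# All records for "url1" are co-located: grouping them needs no communication.
for part in parts:
    keys = [k for k, _ in part]
    assert keys.count("url1") in (0, 2)
```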
51. Explore More of BlinkDB
• Visit: http://blinkdb.org/
• Blink and It's Done: Interactive Queries on Very Large Data. Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Anand P. Iyer, Samuel Madden, Ion Stoica. In PVLDB 5(12): 1902-1905, 2012, Istanbul, Turkey.
• BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. In ACM EuroSys 2013, Prague, Czech Republic (Best Paper Award).