3. What is Big Data
• Volume: large amounts of data at rest
• Velocity: milliseconds to seconds to respond
• Variety: data in many forms (structured, unstructured, multimedia, text, etc.)
• Veracity: data in doubt
4. • 30 billion pieces of content a month
• 1 petabyte of content every day
• 2 billion videos watched every day
• 3 billion people will be online
• Sharing 8 zettabytes of data
9. Background
• Underlying technology invented by Google
• Google BigTable & Google File System
• Doug Cutting created Nutch, and Hadoop was spun off at Yahoo
• Yahoo played a key role in developing Hadoop for enterprise applications
10. Hadoop
• Is a framework
• Built on commodity hardware
• Implements a computational paradigm called MapReduce
• Provides a distributed file system, HDFS, to store data
• Node failures are handled automatically
11. Data Becomes the Bottleneck
• Getting data to the processors is expensive
• Typical disk data transfer rate: 75 MB/sec
• Transferring 100 GB of data: approx. 22 minutes
• A new approach is needed
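The 22-minute figure follows directly from the slide's numbers, as a quick sanity check shows (decimal units assumed, i.e. 100 GB = 100,000 MB):

```python
# Verify the slide's transfer-time estimate:
# 100 GB read sequentially at 75 MB/sec.
size_mb = 100 * 1000          # 100 GB in MB (decimal units assumed)
rate_mb_per_sec = 75          # typical disk transfer rate from the slide
seconds = size_mb / rate_mb_per_sec
minutes = seconds / 60
print(f"{minutes:.1f} minutes")  # ≈ 22.2 minutes
```

This is also why Hadoop reads many disks in parallel: 100 machines each scanning 1 GB locally finish in seconds, not tens of minutes.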
12. Hadoop Solves
• Problems where you have a lot of data
• A mixture of complex and structured data
• Speeds up computation by distributing it
• Mantra: take the computation to the data, don't bring the data to the computation
14. Hadoop Architecture
• Master/slave philosophy
• Designed to run on a large number of machines
• Machines don't share memory or disk
• Rack them up and run Hadoop on each machine
15. Hadoop Architecture
• Data is divided and spread across servers
• Hadoop keeps track of where the data is
• Hadoop replicates data into multiple copies to avoid a single point of failure
• MapReduce is a programming model for processing large data sets in parallel
• Map the operation out to all servers
• Shuffle the results
• Reduce the results back into one result set
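The map → shuffle → reduce flow above can be sketched as a toy in-memory word count (plain Python for illustration, not the real Hadoop API, which is Java-based and distributed):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    # Shuffle: group all emitted values by key, so each key's values
    # land together (in Hadoop, this moves data between nodes).
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into one result.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "data at rest"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'rest': 1}
```

In a real cluster the map calls run on the machines that already hold each input split, which is exactly the "take the computation to the data" mantra from slide 12.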
18. HDFS
• Distributed file system
• Highly fault tolerant
• An HDFS instance can span many servers
• Supports large datasets, from terabytes to petabytes
• Moving computation is cheaper than moving data
• Large block sizes (128 MB, for example)
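The large-block point can be made concrete with a bit of arithmetic (illustrative only, not the HDFS client; the 128 MB figure is the example size from the slide):

```python
import math

BLOCK_SIZE_MB = 128   # example HDFS block size from the slide

def block_count(file_size_mb):
    # A file is stored as ceil(size / block_size) blocks;
    # only the last block may be smaller than the block size.
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(block_count(1024))  # a 1 GB (1024 MB) file -> 8 full blocks
print(block_count(300))   # a 300 MB file -> 2 full blocks + one 44 MB block = 3
```

Large blocks keep the number of blocks per file small, so the metadata stays manageable and each map task gets a long sequential read instead of many small seeks.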