The document is a presentation on big data and Hadoop. It introduces the speaker, Adam Muise, and discusses the challenges of dealing with large and diverse datasets. Traditional approaches of separating data into silos are no longer sufficient. The presentation argues that a distributed system like Hadoop is needed to bring all data together and enable it to be analyzed as a whole.
16. The solution? Another EDW… an Analytical DB… OLTP… Yet Another EDW… (diagram: data piling up in each new silo)
17. Ummm… you dropped something. (diagram: the same silos — Another EDW, Analytical DB, OLTP, EDW, Yet Another EDW — overflowing with data)
20. Wait, you've seen this before. (diagram: streams of data funneled into a "Sausage Factory" and out the other side)
21. Your data silos are lonely places. (diagram: EDW, Accounts, Customers, and Web Properties silos, each holding its own data)
22. …Data likes to be together. (diagram: data from Accounts, Customers, and Web Properties converging on the EDW)
23. New types of data don't quite fit your pristine view of the world. (diagram: Logs and CDR/SIP data arriving at "My Little Data Empire", met with question marks)
24. To resolve this, some people take hints from Lord Of The Rings..
26. ETL everywhere: Data → ETL → EDW, with a Schema imposed on the way in… but that has its problems too. (diagram: the same Data → ETL → EDW/Schema pipeline, shown twice)
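The brittleness the slides hint at can be sketched in a few lines. This is a toy schema-on-write ETL loader (all names hypothetical, not any real tool): records that don't exactly match the warehouse schema are silently dropped — the "you dropped something" problem.

```python
# Toy sketch of schema-on-write ETL (hypothetical names, not a real tool).
# Records that do not fit the fixed EDW schema are silently dropped.

EDW_SCHEMA = {"account_id", "customer", "amount"}

def etl_load(records):
    loaded, dropped = [], []
    for rec in records:
        if set(rec) == EDW_SCHEMA:
            loaded.append(rec)      # fits the pristine schema
        else:
            dropped.append(rec)     # new data type? clickstream? gone.
    return loaded, dropped

records = [
    {"account_id": 1, "customer": "a", "amount": 10.0},
    {"account_id": 2, "customer": "b", "amount": 5.0, "clickstream": "..."},
]
loaded, dropped = etl_load(records)
print(len(loaded), len(dropped))  # 1 1
```

The second record carries an extra field the schema never anticipated, so it never reaches the warehouse — exactly the kind of data loss the next slides motivate Hadoop with.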
33. If you could design a system that would handle this, what would it look like?
34. It would probably need a highly resilient, self-healing, cost-efficient, distributed file system… (diagram: a grid of Storage nodes)
35. It would probably need a completely parallel processing framework that took tasks to the data… (diagram: Processing co-located with Storage on every node)
36. It would probably run on commodity hardware, virtualized machines, and common OS platforms. (diagram: the same Processing/Storage grid)
37. It would probably be open source so innovation could happen as quickly as possible.
40. HDFS stores data in blocks and replicates those blocks. (diagram: block1, block2, and block3 each stored on three different Storage nodes)
41. If a block fails then HDFS always has the other copies and heals itself. (diagram: one node's replica marked with an X; the surviving copies are re-replicated)
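The replicate-then-heal behavior on these two slides can be modeled in miniature. The sketch below is an illustration only (round-robin placement, invented class names), not HDFS's actual placement policy: each block lives on three distinct nodes, and when a node fails, under-replicated blocks are copied to survivors.

```python
# Toy model of HDFS-style block replication (illustration, not HDFS code).
from itertools import cycle

class MiniCluster:
    def __init__(self, nodes, replication=3):
        self.replication = replication
        self.blocks = {}                  # block -> set of nodes holding it
        self.nodes = list(nodes)
        self._rr = cycle(self.nodes)      # naive round-robin placement

    def put(self, block):
        placed = set()
        while len(placed) < self.replication:
            placed.add(next(self._rr))    # pick distinct nodes
        self.blocks[block] = placed

    def fail_node(self, dead):
        self.nodes.remove(dead)
        self._rr = cycle(self.nodes)
        for holders in self.blocks.values():
            holders.discard(dead)
            while len(holders) < self.replication:  # self-heal
                holders.add(next(self._rr))

cluster = MiniCluster(["n1", "n2", "n3", "n4"], replication=3)
for b in ["block1", "block2", "block3"]:
    cluster.put(b)
cluster.fail_node("n1")
assert all(len(h) == 3 for h in cluster.blocks.values())
print("all blocks back at full replication")
```

The key property, as on the slide: losing a node never loses a block, because the remaining copies are used to restore the replication factor.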
42. MapReduce is a programming paradigm that is completely parallel. (diagram: Data split across Mappers, results gathered by Reducers)
43. MapReduce has three phases: Map, Sort/Shuffle, Reduce. (diagram: (Key, Value) pairs emitted by Mappers, grouped by key in the shuffle, and aggregated by Reducers)
44. MapReduce applies to a lot of data processing problems. (diagram: the same Mapper/Reducer flow)
47. YARN abstracts resource management so you can run more than just MapReduce: MapReduce V2, MapReduce V?, STORM, Giraph, Tez, MPI, HBase… and more, all running on YARN over HDFS2.
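The abstraction the slide describes — one resource pool, many frameworks — can be sketched as a toy model (invented names; this is not YARN's actual API): a ResourceManager grants containers from a shared pool, and MapReduce, Storm, Tez, and the rest are just applications asking for containers.

```python
# Toy model of YARN-style resource management (illustration, not YARN's API).
class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers
        self.allocations = {}            # app name -> containers held

    def request(self, app, n):
        granted = min(n, self.free)      # grant only what the pool allows
        self.free -= granted
        self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app):
        self.free += self.allocations.pop(app, 0)

rm = ResourceManager(total_containers=10)
print(rm.request("MapReduce V2", 6))  # 6
print(rm.request("Storm", 6))         # 4 -- only 4 containers left
rm.release("MapReduce V2")
print(rm.free)                        # 6
```

Because the scheduler sees only container requests, it neither knows nor cares which framework is asking — which is what lets Hadoop 2.x run more than just MapReduce.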
48. YARN turns Hadoop into a smart phone: An App Ecosystem. hortonworks.com/yarn/
49. Check out the book too… Preview at: hortonworks.com/yarn/
50. YARN is an essential part of a balanced breakfast in Hadoop 2.0. Oct 15 2013: Apache Community releases Hadoop 2.2.0. Halloween 2013: Hortonworks releases HDP 2.0 GA.