Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
5. Big Data – A Misnomer
• Misleading, invites quick assumptions
• Current challenges are driven by many things, not just the size of data
• ANY company can use the Big Data principles to improve specific business metrics
  • Increased data retention
  • Access to all the data
  • Machine learning for pattern detection, recommendations
• But what has happened to cause all this?
6. Explosive Data Growth
[Chart: gigabytes of data created (in billions), 2005–2015, split into structured and unstructured data]
1.8 trillion gigabytes of data was created in 2011…
§ More than 90% is unstructured data
§ Approx. 500 quadrillion files
§ Quantity doubles every 2 years
Source: IDC 2011
7. The ‘Big Data’ Phenomenon
Big Data Drivers:
§ The proliferation of data capture and creation technologies
§ Increased “interconnectedness” drives consumption (creating more data)
§ Inexpensive storage makes it possible to keep more, longer
§ Innovative software and analysis tools turn data into information
Big Data encompasses not only the content itself, but how it’s consumed:
More Devices → More Consumption → More Content → New & Better Information
§ Every gigabyte of stored content can generate a petabyte or more of transient data*
§ The information about you is much greater than the information you create
*Source: IDC 2011
8. The Current Solutions
[Chart: gigabytes of data created (in billions), 2005–2015; structured data is roughly 10% of the total]
Current database solutions are designed for structured data.
§ Optimized to answer known questions quickly
§ Schemas dictate form/context
§ Difficult to adapt to new data types and new questions
§ Expensive at petabyte scale
9. Data Management Strategies Have Stayed the Same
• Raw data on SAN, NAS and tape
• Data moved from storage to compute
• Relational models with predesigned schemas
10–13. Too Much Data, Too Many Sources
• Can’t ingest fast enough
• Costs too much to store
• Exists in different places
• Archived data is lost
14–17. Can’t Use It the Way You Want To
• Analysis and processing takes too long
• Data exists in silos
• Can’t ask new questions
• Can’t analyze unstructured data
18. The Big Data Challenge
Big Data contains limitless insights… BUT its VOLUME, VARIETY and VELOCITY DEMAND A NEW APPROACH
Sources: WEB LOGS, SOCIAL MEDIA, TRANSACTIONAL DATA, SMART GRIDS, OPERATIONAL DATA, DIGITAL CONTENT, R&D DATA, AD IMPRESSIONS, FILES
19. Big Data Challenges
• Cost-effectively managing the volume, velocity and variety of data
• Deriving value across structured and unstructured data
• Adapting to context changes and integrating new data sources and types
20. Big Data Solution Requirements
• Cost-effectively manage the volume, variety and velocity of data
• Process and analyze large, complex data sets… quickly
• Flexibly adapt to context changes and new data types
23. Google File System (Storage)
• Foundation of scalable, fail-safe, self-healing storage
• One central place of truth
• Cost-effective hardware finally available
  • 19” rack servers with a decent amount of disk space
• Handling of failures built in
  • Components or entire servers
  • At scale there are always hardware faults
• Simple file system interface
• Finally no need for expensive, proprietary systems
24. MapReduce (Processing)
• First take on a distributed data processing framework
• Same concepts as Google File System, i.e.
  • Fail-safe and scalable
• Handles a wide range of data processing problems
  • BUT not all of them (more later)
• Simple API reading and writing key/value pairs
• Framework handles the heavy task of data movement
• Core concept is data locality, heavy I/O
  • Brings code to data, not the opposite (i.e. no HPC)
• Accessible in many programming languages
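The key/value API above can be sketched as a single-process simulation (function names are illustrative, not Google's actual API; a real MapReduce run distributes the shuffle across machines):

```python
from collections import defaultdict

# Minimal in-memory sketch of the MapReduce programming model:
# the user supplies a map and a reduce function over key/value pairs,
# the framework handles grouping (the "shuffle") between the phases.

def map_fn(_, line):
    # Emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Sum all partial counts for one word.
    yield word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    shuffled = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            shuffled[k].append(v)   # shuffle: group values by key
    # Reduce phase: apply reduce_fn once per distinct key.
    results = {}
    for k, values in shuffled.items():
        for rk, rv in reduce_fn(k, values):
            results[rk] = rv
    return results

counts = map_reduce(
    [(0, "big data is big"), (1, "data locality")],
    map_fn, reduce_fn)
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The framework owns everything between the two user functions, which is exactly the "heavy task of data movement" the slide refers to.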
25. BigTable (Random Access)
• Adds database-like random access to data
• Effectively a key/value store with table semantics
• Used for small data points
  • Usually less than a megabyte per key/value
• Forfeits advanced concepts for ease of scalability
  • No transactions, no query language
• Powers many applications at Google
• Uses Google File System as the storage layer
• Tight integration with MapReduce for batch processing
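"A key/value store with table semantics" can be made concrete with a toy in-memory model (class and data are hypothetical; real BigTable persists sorted, versioned cells on GFS):

```python
import bisect

# Toy sketch of the BigTable data model: a sorted map from
# (row, column, timestamp) to an uninterpreted byte string.
# No transactions and no query language -- just puts, point gets,
# and range scans over contiguous rows.

class ToyTable:
    def __init__(self):
        self._rows = {}       # row key -> {column -> [(ts, value), ...]}
        self._row_keys = []   # row keys kept sorted, for range scans

    def put(self, row, column, value, ts):
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, []).append((ts, value))

    def get(self, row, column):
        # Return the most recent version of one cell, or None.
        versions = self._rows.get(row, {}).get(column)
        return max(versions)[1] if versions else None

    def scan(self, start, end):
        # Range scan over sorted rows -- the access pattern the
        # on-disk layout is optimized for.
        lo = bisect.bisect_left(self._row_keys, start)
        hi = bisect.bisect_left(self._row_keys, end)
        return [(r, self._rows[r]) for r in self._row_keys[lo:hi]]

t = ToyTable()
t.put("com.example/index", "anchor:home", b"Home", ts=1)
t.put("com.example/index", "anchor:home", b"Homepage", ts=2)
print(t.get("com.example/index", "anchor:home"))  # b'Homepage'
```

Note how "forfeiting advanced concepts" keeps the model trivially shardable: rows can be split into contiguous ranges and served by different machines.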
26. Dremel, Tenzing, Pregel (Query API)
• Dremel adds a specific file format and query language
  • Used for highly selective queries, data exploration
  • File layout is optimized for very effective scanning
  • Runs alongside MapReduce and the file system
• Tenzing adds SQL over various data sources
  • Can query raw files, Dremel files, or BigTable data, etc.
  • Brings a “known” paradigm to stored data
• Pregel adds a graph processing API
27. Percolator, Megastore (Transactions)
• Additions to BigTable to add “missing” features
• Percolator uses BigTable to update the search index incrementally, which needs transactions
  • Distributes updates with multi-phase commits
• Megastore drives Google App Engine, also adding transactions for the user API
  • Uses ranges of rows as entity groups
  • Reduces locking to small subsets
  • Optimistic, roll-forward only transactions
  • Java layer over the BigTable API
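The optimistic, multi-phase commit idea can be sketched with a versioned store (names and structure are illustrative, not Google's actual protocol): writers record the versions they read, then commit only if none of those cells changed in the meantime.

```python
# Hedged sketch of optimistic concurrency over a versioned key/value
# store. Phase 1 validates the read set; phase 2 rolls the writes
# forward. There is no locking and no rollback of committed state.

class VersionedStore:
    def __init__(self):
        self.data = {}      # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def commit(self, reads, writes):
        # Phase 1 (validate): abort if any cell we read has changed.
        for key, seen_version in reads.items():
            if self.data.get(key, (0, None))[0] != seen_version:
                return False
        # Phase 2 (roll forward): apply all writes, bumping versions.
        for key, value in writes.items():
            version = self.data.get(key, (0, None))[0]
            self.data[key] = (version + 1, value)
        return True

store = VersionedStore()
store.data["balance"] = (1, 100)
v, value = store.read("balance")
ok = store.commit(reads={"balance": v}, writes={"balance": value - 30})
print(ok)                        # True
print(store.read("balance")[1])  # 70
# A second commit based on the now-stale version fails:
print(store.commit(reads={"balance": v}, writes={"balance": 0}))  # False
```

Conflicts surface as failed commits that the client simply retries, which is why the approach works well when entity groups keep contention small.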
28. Spanner, F1 (World-Wide Data)
• Future of Google’s distributed storage and processing system
• Spanner is a scalable, multi-version, globally-distributed, and synchronously-replicated database
  • Replicates across datacenters
  • Uses TrueTime (atomic clocks) for synchronization
  • Uses Colossus for storage (a GFS successor)
• F1 replaced MySQL for the AdWords service
  • SQL over data stored in Spanner
  • Colocated with Spanner processes
30. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
ü Scalable
ü Fault tolerant
ü Distributed
Has the Flexibility to Store and Mine Any Type of Data
§ Ask questions across structured and unstructured data that were previously impossible to ask or solve
§ Not bound by a single schema
Excels at Processing Complex Data
§ Scale-out architecture divides workloads across multiple nodes
§ Flexible file system eliminates ETL bottlenecks
Scales Economically
§ Can be deployed on commodity hardware
§ Open source platform guards against vendor lock-in
CORE HADOOP SYSTEM COMPONENTS
§ Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
§ MapReduce/YARN: distributed computing framework
31. Core Hadoop: HDFS
Self-healing, high-bandwidth clustered storage.
[Diagram: a file split into blocks 1–5, each block stored on three different nodes]
HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
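The block-and-replica scheme can be sketched in a few lines (placement policy and sizes are simplified; real HDFS defaults to 128 MB blocks, 3 replicas, and rack-aware placement):

```python
# Sketch of how HDFS splits a file into fixed-size blocks and places
# each block on several nodes for redundancy.

BLOCK_SIZE = 4      # bytes here; ~128 MB in a real cluster
REPLICATION = 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Cut the byte stream into fixed-size chunks; the last may be short.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Round-robin placement across nodes; real HDFS is rack-aware.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_blocks(blocks, nodes)
print(len(blocks))   # 4
print(placement[0])  # ['node1', 'node2', 'node3']
```

Losing any single node leaves at least two replicas of every block, which is the basis of the "self-healing" claim: the NameNode re-replicates under-replicated blocks in the background.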
32. Core Hadoop: MapReduce
Distributed computing framework.
[Diagram: the same blocks 1–5, processed in place on the nodes that store them]
Processes large jobs in parallel across many nodes and combines the results.
33. Why Hadoop Was Created
Exploding Data Volumes & Types Driving the Need for a Flexible, Scalable Solution
BIG DATA: web logs, social media, transactional data, smart grids, operational data, digital content, R&D data, ad impressions, files
• Any Kind
• From Any Source
• Structured & Unstructured
• At Scale
HARD PROBLEMS
• Deep Analysis
• Exhaustive & Detailed
• Sophisticated Algorithms
• Generate Results Quickly
NEW OPPORTUNITIES
• Extract More Value
• From More Data
• More Cost Effectively
• With Greater Flexibility
It’s difficult to handle data this diverse, at this scale. Traditional platforms can’t keep pace.
New opportunities to derive value from all your data.
34. The Core Values of Hadoop
A platform for…
1.
§ Designed to store and process data at petabyte scale
§ Scale-out architecture increases capacity and processing power linearly
§ Perform operations in parallel across the entire cluster
2.
§ Store data in any format – free from rigid schemas
§ Define context at the time you ask the question
§ Process and analyze data using virtually any programming language
3.
§ Build out your cluster on your hardware of choice
§ Open source software guards against vendor lock-in
§ Wide integration ensures investment protection
41. Motivations to Change MR1
• Scaling
  • >4000 nodes
  • Fewer, larger clusters
  • No single source of truth, data in “silos” again
• HA of Job Tracker difficult
  • Large, complex state
• Poor resource utilization
  • Slots in MR1 are for either map or reduce
43. Split of Responsibilities
The Job Tracker is split into a Resource Manager and an Application Master:
• Resource Manager
  • One per cluster
  • Long-lived
  • App-level
• Application Master
  • One per app instance
  • Short-lived
  • Task-level scheduling and monitoring
44. Fine-grained Resource Control
• Node Manager is a generalized Task Tracker
• Task Tracker
  • Fixed number of map and reduce slots
• Node Manager
  • Containers with variable resource limits
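The difference between typed slots and variable containers can be shown with a toy scheduler (class names and numbers are made up for illustration):

```python
# Contrast sketch: MR1's fixed map/reduce slots vs. YARN's containers
# with variable resource limits.

class Mr1TaskTracker:
    def __init__(self, map_slots, reduce_slots):
        self.free = {"map": map_slots, "reduce": reduce_slots}

    def launch(self, kind):
        # Slots are typed: idle map slots cannot run reduce tasks.
        if self.free[kind] == 0:
            return False
        self.free[kind] -= 1
        return True

class YarnNodeManager:
    def __init__(self, memory_mb):
        self.free_mb = memory_mb

    def launch(self, container_mb):
        # Any task type fits, as long as its resource request does.
        if container_mb > self.free_mb:
            return False
        self.free_mb -= container_mb
        return True

tt = Mr1TaskTracker(map_slots=2, reduce_slots=0)
print(tt.launch("reduce"))  # False: map capacity sits idle, wasted

nm = YarnNodeManager(memory_mb=4096)
print(nm.launch(3072))      # True: one large container
print(nm.launch(2048))      # False: only 1024 MB left
```

This is the utilization argument from slide 41 in miniature: the MR1 node wastes its map capacity whenever the workload is reduce-heavy, while the YARN node packs whatever fits into its memory budget.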
49. Enterprise Data Evolution
[Chart: amount of data vs. business impact, from RDBMS/EDW (1980s) via Hadoop-optimized infrastructure (2000s) to a next-gen data computing platform and the data-driven organization (2010s); moving right improves operational efficiency, then creates competitive advantage]
• Data collection & reporting
• Process data faster
• Store data more cost-effectively
• Simplify infrastructure
• Combine data from across the business
• Ask new questions immediately
• Enable new real-time applications
50. Playing Catchup
• Improve overall performance
  • Google’s code is kernel modules and C++, as low-level as possible
  • Hadoop is Java, for ease of development in open source
  • Maybe rewrite parts of the stack?
  • Overall goal: saturate machine specs (I/O, CPU, RAM)
• Add missing features
  • Everything is based on “hearsay”, aka research papers and presentations
  • Add what is necessary, or for the sake of it?
51. Further Extend or Invent?
• YARN is a good example of what can be done
• Look at every component and evaluate
• Work with research, universities and companies to drive new development
• What else can be done with all that data?
53. From Framework to Platform to Commodity
• Hadoop distributions are already a commodity
• Move up the stack to reach the commercial space
• Simplify data processing
  • Continuuity
  • WibiData (Kiji)
  • Cloudera CDK
• Pure Hadoop solutions
  • Datameer
  • Platfora