May 29, 2014 Toronto Hadoop User Group - Micro ETL

MICRO-‐ETL:

AN
EMERGING
PATTERN
OF
USE
WITH
HADOOP

AND
NEAR-‐REALTIME
FRAMEWORKS

Adam
Muise
–
Principle
Architect,
Hortonworks

We
do
Hadoop

The
leaders
of
Hadoop’s

development

Community
driven,

Enterprise
Focused

Drive
InnovaEon
in

the
plaForm
–
We

lead
the
roadmap

100%
Open
Source
–

DemocraEzed
Access
to

Data

What
is
Micro-‐ETL?

It
turns
out
several
other
people

have
made
it
up
before
so
I
don’t

feel
like
a
megalomaniac.

Another
terminology
calls
it
Near-‐
RealEme
ETL.

hTp://www.researchgate.net/publicaEon/226219087_Near_Real_Time_ETL/ﬁle/79e4150b23b3aca5aa.pdf

Micro-‐ETL/Near-‐RealEme
ETL

involves:

1.
An
intra-‐batch
and/or
near-‐real3me

ingest
of
analy3cal
data

2.
Processing
on
a
small
batches
of
data

or
event
streams

3.
A
scalable
“ETL”
processing

framework

Why
Use
Micro-‐ETL:

1.
Your
analy3cs
require
more
up
to
date

informa3on
than
provided
by
a
regular
batch

process

2.
There
is
opera3onal
risk
(ie
falling
behind
the
data

ﬁrehose,
data
loss,
data
ﬁdelity)
in
leaving
the

processing
of
events
un3l
batch
can
process
them

3.
Your
data
ingest
rate
is
inconsistent
and
there
is

value
in
keeping
up-‐to-‐date
with
current
events.

When
not
to
use
Micro-‐ETL:

1.
Your
data
ingest
rates
are
predictable
and

analyzing
them
intra-‐batch
provides
liMle

value.

2.
Processing
in
a
large
batch
yields
a

complete
popula3on
of
the
data
required
to

make
a
decision.

3.
You
have
exis3ng
investments
in
tradi3onal

ETL
tools
that
outweigh
any
beneﬁts
in
a
new

tool/framework
(as
tools
evolve,
this
will
be
a

moot
point
however)

Micro-‐ETL
involves
regular

processing
tasks
only
run
on
data

frameworks
that
can
handle

near-‐realEme
event
streams

Let’s
refresh
on
some
core

Hadoop
concepts…

YARN
=
Yet
Another
Resource

NegoEator

Resource
Manager

+

Node
Managers

=
YARN

Resource
Manager

AppMaster

Node
Manager

Scheduler

AppMaster

AppMaster

Node
Manager

Node
Manager

Node
Manager

Container

Container

MapReduce

Container

Storm

Container

Container

Container

Pig

Container

Container

Container

YARN
abstracts
resource

management
so
you
can
run
all
sorts

of
distributed
applicaEons

HDFS

MapReduce
V2

YARN

MapReduce
V?
STORM

MPI
Giraph

HBase
Tez

…
and

more
Spark

The
following
secEon
outlines

frameworks
that
are
emerging
as

data
processing
opEons
to
batch-‐
driven
MapReduce

Three
important
facts
about
Tez:

1.
Tez
is
a
YARN
applicaEon.

2.
Tez
will
eventually
replace

MapRecue.

3.
Tez
scales
as
well
as
the
rest
of

Hadoop
scales
(thousands
of
nodes).

Tez
provides
a
layer
for
abstract

tasks,
these
could
be
mappers,

reducers,
customized
stream

processes,
in
memory
structures,

etc

Tez
chains
tasks
together
into
one
job
to

get
jobs
like
Map
–
Reduce
–
Reduce.

This
is
ideal
for
apps
like
Hive.

TezMap

TezMap

TezMap

TezMap

TezMap

TezReduce

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

TezReduce

TezReduce

TezReduce

TezReduce

TezReduce

Group
By

ProjecEons

Order
By

Tez
provides
the
opEon
for
more

complicated
workﬂows

Tez
allows
for
more
complicated

workﬂow
primiEves
than
just
Map

and
Reduce.
A
Tez
task
is
composed

of
a
programmable
Input,
Output,

and
Processor

YARN
can
provide
long-‐running

containers*
for
applicaEons
like

Storm,
Hbase,
JBoss,
etc

*
-‐
With
the
help
of
Apache
Slider:

hTp://wiki.apache.org/incubator/SliderProposal

Yahoo!,
the
original
author
of

Hadoop
and
the
largest
Hadoop

user,
is
bepng
on
Apache
Hive/Tez/
YARN
as
their
core
data

architecture.

hTp://yahoodevelopers.tumblr.com/post/85930551108/yahoo-‐
bepng-‐on-‐apache-‐hive-‐tez-‐and-‐yarn

Storm
is
a
distributed
execuEon

engine
that
handles
streaming
data

Storm
processes
streaming
event

data
as
tuples.
Each
event
is

generated/ingested
through
a
spout

and
processed
in
series
of
bolts.
The

spouts
and
bolts
form
a
topology.

Refresh
on
Summingbird

Summingbird
is
a
processing

framework
that
runs
over
Storm.
It

uses
Scala
and
has
MapReduce-‐like

features.
Technically
it’s
Scalding

(Cascading
with
Scala).

Spark
is
a
framework
designed
to

handle
in-‐memory
compuEng.
If
you

are
using
Hadoop,
you
are
typically

running
Spark
on
YARN.

Spark
uses
RDDs
(Resilient

Distributed
Datasets)
as
a
primiEve

to
enable
in-‐memory
processing.

RDDs
can
be
created
from
all
sorts

of
data,
like
HDFS
ﬁles:

scala>
val
distFile
=
sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")

distFile:
spark.RDD[String]
=
spark.HadoopRDD@1d4cee08

Spark
Streaming
constructs
DStream

primiEves
from
a
streaming
data

source.
DStreams
are
actually
made

up
of
many
RDDs
and
use
the

common
Spark
Engine.

Spark
Streaming
has
an
expressive

API
to
allow
typical
transformaEons

on
RDDs
or
DStreams.
This
is

comparable
to
MapReduce.

Since
Micro-‐ETL
involves
porEng

tradiEonal
data
processing
tasks
to

diﬀerent
execuEon
environments,
it

makes
sense
that
data
processing

tools
would
facilitate
execuEon
on

mulEple
plaForms.

Some
opEons
in
the
ﬁeld…

You
can
write
generic
processing

libraries
(Java)
for
your
data
and

port
them
from
MapReduce/Tez
to

Storm.
Storm
can
also
run

Summingbird
(Scala).

Spark
already
provides
a

streaming
and
processing
layer

for
Micro-‐ETL.
These
can
be

Scala,
Java,
or
Python.

Cascading
has
recently
announced

that
it
would
support
Tez
and
Storm

immediately
in
version
3.0.
They

also
have
plans
for
Spark

hTp://www.concurrenEnc.com/2014/05/cascading-‐3-‐0-‐adds-‐mulEple-‐framework-‐support-‐concurrent-‐driven-‐manages-‐big-‐data-‐apps/

Pentaho
KeTle
currently

supports
porEng
their
data

workﬂows
to
Storm
in
a
beta

version.

hTp://wiki.pentaho.com/display/BAD/KeTle+ExecuEon+on+Storm

Talend
recently
asked
for
$40

million
in
funding
to
help
push

further
into
Big
Data.
That
includes

tooling
to
port
Talend
ETL
workﬂows

to
Storm
and
Tez.

hTp://techcrunch.com/2013/12/11/talend-‐raises-‐40m-‐to-‐more-‐aggressively-‐extend-‐into-‐big-‐data-‐market-‐sets-‐sights-‐on-‐ipo/

May 29, 2014 Toronto Hadoop User Group - Micro ETL

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (13)

Similar a May 29, 2014 Toronto Hadoop User Group - Micro ETL

Similar a May 29, 2014 Toronto Hadoop User Group - Micro ETL (20)

Más de Adam Muise

Más de Adam Muise (20)

Último

Último (20)

May 29, 2014 Toronto Hadoop User Group - Micro ETL