3. We
do
Hadoop
The
leaders
of
Hadoop’s
development
Community
driven,
Enterprise
Focused
Drive
InnovaEon
in
the
plaForm
–
We
lead
the
roadmap
100%
Open
Source
–
DemocraEzed
Access
to
Data
6. It
turns
out
several
other
people
have
made
it
up
before
so
I
don’t
feel
like
a
megalomaniac.
Another
terminology
calls
it
Near-‐
RealEme
ETL.
hTp://www.researchgate.net/publicaEon/226219087_Near_Real_Time_ETL/file/79e4150b23b3aca5aa.pdf
7. Micro-‐ETL/Near-‐RealEme
ETL
involves:
1.
An
intra-‐batch
and/or
near-‐real3me
ingest
of
analy3cal
data
2.
Processing
on
a
small
batches
of
data
or
event
streams
3.
A
scalable
“ETL”
processing
framework
8. Why
Use
Micro-‐ETL:
1.
Your
analy3cs
require
more
up
to
date
informa3on
than
provided
by
a
regular
batch
process
2.
There
is
opera3onal
risk
(ie
falling
behind
the
data
firehose,
data
loss,
data
fidelity)
in
leaving
the
processing
of
events
un3l
batch
can
process
them
3.
Your
data
ingest
rate
is
inconsistent
and
there
is
value
in
keeping
up-‐to-‐date
with
current
events.
9. When
not
to
use
Micro-‐ETL:
1.
Your
data
ingest
rates
are
predictable
and
analyzing
them
intra-‐batch
provides
liMle
value.
2.
Processing
in
a
large
batch
yields
a
complete
popula3on
of
the
data
required
to
make
a
decision.
3.
You
have
exis3ng
investments
in
tradi3onal
ETL
tools
that
outweigh
any
benefits
in
a
new
tool/framework
(as
tools
evolve,
this
will
be
a
moot
point
however)
10. Micro-‐ETL
involves
regular
processing
tasks
only
run
on
data
frameworks
that
can
handle
near-‐realEme
event
streams
15. YARN
abstracts
resource
management
so
you
can
run
all
sorts
of
distributed
applicaEons
HDFS
MapReduce
V2
YARN
MapReduce
V?
STORM
MPI
Giraph
HBase
Tez
…
and
more
Spark
16. The
following
secEon
outlines
frameworks
that
are
emerging
as
data
processing
opEons
to
batch-‐
driven
MapReduce
18. Three
important
facts
about
Tez:
1.
Tez
is
a
YARN
applicaEon.
2.
Tez
will
eventually
replace
MapRecue.
3.
Tez
scales
as
well
as
the
rest
of
Hadoop
scales
(thousands
of
nodes).
19. Tez
provides
a
layer
for
abstract
tasks,
these
could
be
mappers,
reducers,
customized
stream
processes,
in
memory
structures,
etc
20. Tez
chains
tasks
together
into
one
job
to
get
jobs
like
Map
–
Reduce
–
Reduce.
This
is
ideal
for
apps
like
Hive.
TezMap
TezMap
TezMap
TezMap
TezMap
TezReduce
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
Data
TezReduce
TezReduce
TezReduce
TezReduce
TezReduce
Group
By
ProjecEons
Order
By
22. Tez
allows
for
more
complicated
workflow
primiEves
than
just
Map
and
Reduce.
A
Tez
task
is
composed
of
a
programmable
Input,
Output,
and
Processor
23. YARN
can
provide
long-‐running
containers*
for
applicaEons
like
Storm,
Hbase,
JBoss,
etc
*
-‐
With
the
help
of
Apache
Slider:
hTp://wiki.apache.org/incubator/SliderProposal
24. Yahoo!,
the
original
author
of
Hadoop
and
the
largest
Hadoop
user,
is
bepng
on
Apache
Hive/Tez/
YARN
as
their
core
data
architecture.
hTp://yahoodevelopers.tumblr.com/post/85930551108/yahoo-‐
bepng-‐on-‐apache-‐hive-‐tez-‐and-‐yarn
26. Storm
is
a
distributed
execuEon
engine
that
handles
streaming
data
27. Storm
processes
streaming
event
data
as
tuples.
Each
event
is
generated/ingested
through
a
spout
and
processed
in
series
of
bolts.
The
spouts
and
bolts
form
a
topology.
29. Summingbird
is
a
processing
framework
that
runs
over
Storm.
It
uses
Scala
and
has
MapReduce-‐like
features.
Technically
it’s
Scalding
(Cascading
with
Scala).
31. Spark
is
a
framework
designed
to
handle
in-‐memory
compuEng.
If
you
are
using
Hadoop,
you
are
typically
running
Spark
on
YARN.
32. Spark
uses
RDDs
(Resilient
Distributed
Datasets)
as
a
primiEve
to
enable
in-‐memory
processing.
RDDs
can
be
created
from
all
sorts
of
data,
like
HDFS
files:
scala>
val
distFile
=
sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")
distFile:
spark.RDD[String]
=
spark.HadoopRDD@1d4cee08
33. Spark
Streaming
constructs
DStream
primiEves
from
a
streaming
data
source.
DStreams
are
actually
made
up
of
many
RDDs
and
use
the
common
Spark
Engine.
34. Spark
Streaming
has
an
expressive
API
to
allow
typical
transformaEons
on
RDDs
or
DStreams.
This
is
comparable
to
MapReduce.
35. Since
Micro-‐ETL
involves
porEng
tradiEonal
data
processing
tasks
to
different
execuEon
environments,
it
makes
sense
that
data
processing
tools
would
facilitate
execuEon
on
mulEple
plaForms.
37. You
can
write
generic
processing
libraries
(Java)
for
your
data
and
port
them
from
MapReduce/Tez
to
Storm.
Storm
can
also
run
Summingbird
(Scala).
38. Spark
already
provides
a
streaming
and
processing
layer
for
Micro-‐ETL.
These
can
be
Scala,
Java,
or
Python.
39. Cascading
has
recently
announced
that
it
would
support
Tez
and
Storm
immediately
in
version
3.0.
They
also
have
plans
for
Spark
hTp://www.concurrenEnc.com/2014/05/cascading-‐3-‐0-‐adds-‐mulEple-‐framework-‐support-‐concurrent-‐driven-‐manages-‐big-‐data-‐apps/
40. Pentaho
KeTle
currently
supports
porEng
their
data
workflows
to
Storm
in
a
beta
version.
hTp://wiki.pentaho.com/display/BAD/KeTle+ExecuEon+on+Storm
41. Talend
recently
asked
for
$40
million
in
funding
to
help
push
further
into
Big
Data.
That
includes
tooling
to
port
Talend
ETL
workflows
to
Storm
and
Tez.
hTp://techcrunch.com/2013/12/11/talend-‐raises-‐40m-‐to-‐more-‐aggressively-‐extend-‐into-‐big-‐data-‐market-‐sets-‐sights-‐on-‐ipo/