Application architectures with Hadoop and Sessionization in MR

1
Headline
Goes
Here

Speaker
Name
or
Subhead
Goes
Here

DO
NOT
USE
PUBLICLY

PRIOR
TO
10/23/12

ApplicaAon
Architectures
with

Apache
Hadoop

Mark
Grover
|
@mark_grover

East
Bay
JUG

hadooparchitecturebook.com

June
25th,
2014

©2014 Cloudera, Inc. All Rights
Reserved.

About
Me

•  CommiTer
on
Apache
Bigtop,
commiTer
and
PPMC
member

on
Apache
Sentry
(incubaAng).

•  Contributor
to
Hadoop,
Hive,
Spark,
Sqoop,
Flume.

•  SoYware
developer
at
Cloudera

•  @mark_grover

2
Reserved.

Co-‐authoring
O’Reilly
book

•  @hadooparchbook

•  hadooparchitecturebook.com

Reserved.
3

What
is
Apache
Hadoop?

4
Has
the
Flexibility
to
Store
and

Mine
Any
Type
of
Data

§  Ask
quesAons
across
structured
and

unstructured
data
that
were
previously

impossible
to
ask
or
solve

§  Not
bound
by
a
single
schema

Excels
at

Processing
Complex
Data

§  Scale-‐out
architecture
divides
workloads

across
mulAple
nodes

§  Flexible
ﬁle
system
eliminates
ETL

boTlenecks

Scales

Economically

§  Can
be
deployed
on
commodity

hardware

§  Open
source
plaaorm
guards
against

vendor
lock

Hadoop
Distributed

File
System
(HDFS)

Self-‐Healing,
High

Bandwidth
Clustered

Storage

MapReduce

Distributed
CompuAng

Framework

Apache Hadoop
is
an
open
source

plaaorm
for
data
storage
and
processing

that
is…

ü  Scalable

ü  Fault
tolerant

ü  Distributed

CORE
HADOOP
SYSTEM
COMPONENTS

Reserved.

5
Click
Stream
Analysis

Case
Study

Reserved.

AnalyAcs

Reserved.
6

Web
Logs

Reserved.
7

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile
Safari/533.1”

Clickstream
AnalyAcs

Reserved.
8

244.157.45.12 - - [17/Oct/
2014:21:08:30 ] "GET /seatposts
HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/
top_online_shops" "Mozilla/5.0
(Macintosh; Intel Mac OS X
10_9_2) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/
36.0.1944.0 Safari/537.36”

Click
Stream
Analysis
(Before
Hadoop)

9

Reserved.
Web
logs

ODS

Data

Warehouse

Query
Extract

Transform

Load

Business

Intelligence

Transform

The
Problems
(Before
Hadoop)

10

Reserved.
Web
logs

ODS

Data

Warehouse

Query
Extract

Transform

Load

Business

Intelligence

Transform

1

1

1

Slow
Data
TransformaAons
=
Missed
ETL
SLAs.

2

2

Slow
Queries
=
Frustrated
Business
Users.

Click
Stream
Analysis
(AYer
Hadoop)

11

Reserved.
Web
logs

ODS

Data

Warehouse

Query
Extract

Transform

Load

Business

Intelligence

Transform
X
X

Web
logs

ODS

Business

Intelligence

Query

Click
Stream
Analysis
(AYer
Hadoop)

12

Reserved.
Transform

Web
logs

ODS

Business

Intelligence

Query

Click
Stream
Analysis
(AYer
Hadoop)

13

Reserved.
Transform

Flume
or
Sqoop?

Flume
or
Sqoop?

Hive/Impala/MR?

Hive/Impala/Spark?

Challenges
of
Hadoop
ImplementaAon

14

Reserved.

Challenges
of
Hadoop
ImplementaAon

15

Reserved.

Other
challenges
-‐
Architectural
ConsideraAons

•  Storage
managers?

•  HDFS?
HBase?

•  Data
storage
and
modeling:

•  File
formats?
Compression?
Schema
design?

•  Data
movement

•  How
do
we
actually
get
the
data
into
Hadoop?
How
do
we
get
it
out?

•  Metadata

•  How
do
we
manage
data
about
the
data?

•  Data
access
and
processing

•  How
will
the
data
be
accessed
once
in
Hadoop?
How
can
we
transform
it?
How
do

we
query
it?

•  OrchestraAon

•  How
do
we
manage
the
workﬂow
for
all
of
this?

16
Reserved.

17
High
level
Design

Reserved.

Clickstream
AnalyAcs

Reserved.
18

244.157.45.12 - - [17/Oct/
2014:21:08:30 ] "GET /seatposts
HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/
top_online_shops" "Mozilla/5.0
(Macintosh; Intel Mac OS X
10_9_2) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/
36.0.1944.0 Safari/537.36”

Reserved.
19

Hadoop

Cluster

BI/VisualizaAon

tool
(e.g.

microstrategy)

BI

Analysts

Spark
For
machine
learning

and
graph
processing

R/Python
StaAsAcal
Analysis

Custom

Apps

3.
Accessing

2.
Processing

4.
OrchestraAon

via
Oozie
1.
IngesAon

OperaAonal

Data
Store

CRM
System

Via
Sqoop

Web
servers

Website
users

20
Since
that’s
of
most
interest
to
this
audience

2.
Processing

Reserved.

Processing

•  De-‐duplicaAon

•  Filtering

•  SessionizaAon

21
Reserved.

DeduplicaAon
–
remove
duplicate
records

Reserved.
22


Filtering
–
ﬁlter
out
invalid
records

Reserved.
23

"Mozilla/5.0 (Linux; U…

SessionizaAon

Reserved.
24

Website
visit

Visitor
1

Session
1

Visitor
1

Session
2

Visitor
2

Session
1

>
30
minutes

Why
sessionize?

Helps
answers
quesAons
like:

•  What
is
my
website’s
bounce
rate?

•  i.e.
how
many
%
of
visitors
don’t
go
past
the
landing
page?

•  Which
markeAng
channels
(e.g.
organic
search,
display
ad,
etc.)

are
leading
to
most
sessions?

•  Which
ones
of
those
lead
to
most
conversions
(e.g.
people

buying
things,
signing
up,
etc.)

•  Do
aTribuAon
analysis
–
which
channels
are
responsible
for

most
conversions?

25
Reserved.

SessionizaAon

Reserved.
26

like Gecko) Chrome/36.0.1944.0 Safari/537.36” 165
Safari/533.1” 166

How
to
Sessionize?

1.  Given
a
list
of
clicks,
determine
which
clicks
came
from
the

same
user

2.  Given
a
parAcular
user's
clicks,
determine
if
a
given
click
is
a

part
of
a
new
session
or
a
conAnuaAon
of
the
previous

session

27
Reserved.

#1
–
Which
clicks
are
from
same
user?

•  We
can
use:

•  IP
address
(244.157.45.12)

•  Cookies
(A9A3BECE0563982D)

•  IP
address
(244.157.45.12)and
user
agent
string
((KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/
537.36")

28
Reserved.

#1
–
Which
clicks
are
from
same
user?

•  We
can
use:

•  IP
address
(244.157.45.12)

•  Cookies
(A9A3BECE0563982D)

•  IP
address
(244.157.45.12)and
user
agent
string
((KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/
537.36")

29
Reserved.

#1
–
Which
clicks
are
from
same
user?

Reserved.
30

Safari/533.1”

#2
–
Which
clicks

part
of
the
same
session?

Reserved.
31

Safari/533.1”
>
30
mins
apart
=
diﬀerent

sessions

32
Intro
to
MapReduce

Reserved.

MapReduce

•  Map
–
Apply
a
funcAon
to
each
input
record

•  Shuﬄe
&
Sort
–
ParAAon
the
map
output
and
sort
each

parAAon

•  Reduce
–
Apply
aggregaAon
funcAon
to
all
values
in
each

parAAon

33
Reserved.

34
SessionizaAon
in
MapReduce

Reserved.

35
github.com/hadooparchitecturebook/hadoop-‐arch-‐book/tree/master/ch06/MRSessionize

SessionizaAon
in
MapReduce

Reserved.

SessionizaAon
in
MapReduce

36
Reserved.
Map

Reduce

Reduce

Log
line

IP1,
log
lines

IP1,
log
lines

Log
line,
session
ID

Map

Map

Log
line

Log
line
IP2,
log
lines

IP2,
log
lines
Log
line,
session
ID

Mapper
for
SessionizaAon

37
Reserved.

public
static
class
SessionizeMapper

extends
Mapper<Object,
Text,
IpTimestampKey,
Text>
{

private
Matcher
logRecordMatcher;

public
void
map(Object
key,
Text
value,
Context
context

)
throws
IOException,
InterruptedException
{

logRecordMatcher
=
logRecordPattern.matcher(value.toString());

//
We
only
emit
something
out
if
the
record
matches
with
our
regex.
Otherwise,
we

assume
the
record
is
busted
and
simply
ignore
it

if
(logRecordMatcher.matches())
{

String
ip
=
logRecordMatcher.group(1);

DateTime
timestamp
=
DateTime.parse(logRecordMatcher.group(2),

TIMESTAMP_FORMATTER);

Long
unixTimestamp
=
timestamp.getMillis();

IpTimestampKey
outputKey
=
new
IpTimestampKey(ip,
unixTimestamp);

context.write(outputKey,
value);

}

}

}

Reducer
for
SessionizaAon

38
Reserved.

public
static
class
SessionizeReducer

extends
Reducer<IpTimestampKey,
Text,
IpTimestampKey,
Text>
{

private
Text
result
=
new
Text();

private
static
int
sessionId
=
0;

private
Long
lastTimeStamp
=
null;

public
void
reduce(IpTimestampKey
key,
Iterable<Text>
values,
Context
context

)
throws
IOException,
InterruptedException
{

for
(Text
value
:
values)
{

String
logRecord
=
value.toString();

//
If
this
is
the
first
record
for
this
user
or
it's
been
more
than
the
timeout

//
since
the
last
click
from
this
user,
let's
increment
the
session
ID.

if
(lastTimeStamp
==
null
||
(key.getUnixTimestamp()
-‐
lastTimeStamp
>

SESSION_TIMEOUT_IN_MS))
{

sessionId++;

}

lastTimeStamp
=
key.getUnixTimestamp();

result.set(logRecord
+
"
"
+
sessionId);

//
Since
we
only
care
about
printing
out
the
entire
record
in
the
result,
with

//
session
ID
appended
at
the
end,
we
just
emit
out
"null"
for
the
key

context.write(null,
result);

}

}

}

Secondary
sorAng
–
by
Amestamp

•  Need
records
to
reducer
to
be
grouped
by
IP
address
and

sorted
by
Amestamp
–
a
concept
called
secondary
sor/ng

•  Instead
of
using
just
IP
address
as
map
output
key
and
reduce

input
key

•  We
use
a
composite
key
(IP,
Amestamp)
as
map
output
key
and

reduce
input
key

39
Reserved.

Secondary
sorAng
–
vocabulary

•  Composite
key
–
IP
address,
Amestamp

•  Natural
key
–
IP
address

•  Secondary
sort
key
-‐
Amestamp

40
Reserved.

Secondary
sorAng

•  Custom
Grouping
Comparator
–
on
Natural
Key
(IP)

•  Custom
Sort
Comparator
–
on
Composite
Key
(IP,
address)

•  Custom
ParAAoner
–
on
Natural
Key
(IP)

job.setGroupingComparatorClass(NaturalKeyComparator.class
);

job.setSortComparatorClass(CompositeKeyComparator.class);

job.setPartitionerClass(NaturalKeyPartitioner.class);

41
Reserved.

42
Data
Storage
and
Modeling

Reserved.

Data
Storage
–
Storage
Manager
consideraAons

•  Popular
storage
managers
for
Hadoop

•  Hadoop
Distributed
File
System
(HDFS)

•  HBase

43
Reserved.

Data
Storage
–
HDFS
vs
HBase

HDFS

•  Stores
data
directly
as
ﬁles

•  Fast
scans

•  Poor
random
reads/writes

HBase

•  Stores
data
as
Hﬁles
on
HDFS

•  Slow
scans

•  Fast
random
reads/writes

44

Reserved.

Data
Storage
–
Storage
Manager
consideraAons

•  We
choose
HDFS

•  AnalyAcal
needs
in
this
case
served
beTer
by
fast
scans.

45
Reserved.

46
Data
Storage
Format

Reserved.

Data
Storage
–
Format
ConsideraAons

•  Store
as
plain
text?

•  Sure,
well
supported
by
Hadoop.

•  Text
can
easily
be
processed
by
MapReduce,
loaded
into
Hive
for

analysis,
and
so
on.

•  But…

•  Will
begin
to
consume
lots
of
space
in
HDFS.

•  May
not
be
opAmal
for
processing
by
tools
in
the
Hadoop

ecosystem.

47
Reserved.

Data
Storage
–
Format
ConsideraAons

•  But,
we
can
compress
the
text
files…

•  Gzip
–
supported
by
Hadoop,
but
not
spliTable.

•  Bzip2
–
hey,
spliTable!
Great
compression!
But
decompression
is

slooowww.

•  LZO
–
spliTable
(with
some
work),
good
compress/de-‐compress

performance.
Good
choice
for
storing
text
files
on
Hadoop.

•  Snappy
–
provides
a
good
tradeoff
between
size
and
speed.

48
Reserved.

Data
Storage
–
More
About
Snappy

•  Designed
at
Google
to
provide
high
compression
speeds
with

reasonable
compression.

•  Not
the
highest
compression,
but
provides
very
good
performance

for
processing
on
Hadoop.

•  Snappy
is
not
spliTable
though,
which
brings
us
to…

49
Reserved.

SequenceFile

• Stores
records
as
binary

key/value
pairs.

• SequenceFile
“blocks”

can
be
compressed.

• This
enables
spliTability

with
non-‐spliTable

compression.

50

Reserved.

Avro

•  Kinda
SequenceFile
on

Steroids.

•  Self-‐documenAng
–
stores

schema
in
header.

•  Provides
very
eﬃcient

storage.

•  Supports
spliTable

compression.

51

Reserved.

Our
Format
Choices…

•  Avro
with
Snappy

•  Snappy
provides
opAmized
compression.

•  Avro
provides
compact
storage,
self-‐documenAng
ﬁles,
and

supports
schema
evoluAon.

•  Avro
also
provides
beTer
failure
handling
than
other
choices.

•  SequenceFiles
would
also
be
a
good
choice,
and
are
directly

supported
by
ingesAon
tools
in
the
ecosystem.

•  But
only
supports
Java.

52
Reserved.

53
HDFS
Schema
Design

Reserved.

Recommended
HDFS
Schema
Design

•  How
to
lay
out
data
on
HDFS?

54
Reserved.

Recommended
HDFS
Schema
Design

/user/<username>
-‐
User
specific
data,
jars,
conf
files

/etl
–
Data
in
various
stages
of
ETL
workflow

/tmp
–
temp
data
from
tools
or
shared
between
users

/data
–
shared
data
for
the
enAre
organizaAon

/app
–
Everything
but
data:
UDF
jars,
HQL
files,
Oozie
workflows

55
Reserved.

56
Advanced
HDFS
Schema
Design

Reserved.

What
is
ParAAoning?

57
dataset

col=val1/file.txt

col=val2/file.txt

.

.

.

col=valn/file.txt

dataset

file1.txt

file2.txt

.

.

.

filen.txt

Un-‐parAAoned
HDFS

directory
structure

ParAAoned
HDFS
directory

structure

Reserved.

What
is
ParAAoning?

58
clicks

dt=2014-‐01-‐01/clicks.txt


.

.

.


clicks

clicks-‐2014-‐01-‐01.txt

clicks-‐2014-‐01-‐02.txt

.

.

.

clicks-‐2014-‐03-‐31.txt

Un-‐parAAoned
HDFS

directory
structure

ParAAoned
HDFS
directory

structure

Reserved.

ParAAoning

•  Split
the
dataset
into
smaller
consumable
chunks

•  Rudimentary
form
of
“indexing”

•  <data
set
name>/
<parAAon_column_name=parAAon_column_value>/{ﬁles}

59
Reserved.

ParAAoning
consideraAons

•  What
column
to
bucket
by?

•  HDFS
is
append
only.

•  Don’t
have
too
many
parAAons
(<10,000)

•  Don’t
have
too
many
small
ﬁles
in
the
parAAons
(more
than

block
size
generally)

•  We
decided
to
parAAon
by
/mestamp

60
Reserved.

What
is
buckeAng?

61
clicks



Un-‐bucketed
HDFS

directory
structure

clicks

dt=2014-‐01-‐01/file0.txt

dt=2014-‐01-‐01/file1.txt

dt=2014-‐01-‐01/file2.txt

dt=2014-‐01-‐01/file3.txt

dt=2014-‐01-‐02/file0.txt

dt=2014-‐01-‐02/file1.txt

dt=2014-‐01-‐02/file2.txt

dt=2014-‐01-‐02/file3.txt

Bucketed
HDFS
directory

structure

Reserved.

BuckeAng

•  Hash-‐bucketed
ﬁles
within
each
parAAon
based
on
a
parAcular

column

•  Useful
when
sampling

•  In
some
joins,
pre-‐reqs:

•  Datasets
bucketed
on
the
same
key
as
the
join
key

•  Number
of
buckets
are
the
same
or
one
is
a
mulAple
of
the
other

62
Reserved.

BuckeAng
consideraAons?

•  Which
column
to
bucket
on?

•  How
many
buckets?

•  We
decided
to
bucket
based
on
cookie

63
Reserved.

De-‐normalizing
consideraAons

•  In
general,
big
data
joins
are
expensive

•  When
to
de-‐normalize?

•  Decided
to
join
the
smaller
dimension
tables

•  Big
fact
tables
are
sAll
joined

64
Reserved.

65
Data
IngesAon

Reserved.

File
Transfers

• “hadoop
fs
–put
<ﬁle>”

• Reliable,
but
not
resilient

to
failure.

• Other
opAons
are

mountable
HDFS,
for

example
NFSv3.

66

Reserved.

Streaming
IngesAon

•  Flume

•  Reliable,
distributed,
and
available
system
for
eﬃcient
collecAon,

aggregaAon
and
movement
of
streaming
data,
e.g.
logs.

•  Ka•a

•  Reliable
and
distributed
publish-‐subscribe
messaging
system.

67
Reserved.

Flume
vs.
Ka•a

• Purpose
built
for
Hadoop

data
ingest.

• Pre-‐built
sinks
for
HDFS,

HBase,
etc.

• Supports
transformaAon

of
data
in-‐ﬂight.

• General
pub-‐sub

messaging
framework.

• Hadoop
not
supported,

requires
3rd-‐party

component
(Camus).

• Just
a
message
transport

(a
very
fast
one).

68

Reserved.

Flume
vs.
Ka•a

•  BoTom
line:

•  Flume
very
well
integrated
with
Hadoop
ecosystem,
well
suited

to
ingesAon
of
sources
such
as
log
ﬁles.

•  Ka•a
is
a
highly
reliable
and
scalable
enterprise
messaging

system,
and
great
for
scaling
out
to
mulAple
consumers.

69
Reserved.

A
Quick
IntroducAon
to
Flume

70

Flume
Agent

Source
Channel
Sink
DesAnaAon
External

Source

Web
Server

TwiTer

JMS

System
logs

…

Consumes
events

and
forwards
to

channels

Stores
events

unAl
consumed

by
sinks
–
ﬁle,

memory,
JDBC

Removes
event
from

channel
and
puts

into
external

desAnaAon

JVM

process
hosAng
components

Reserved.

A
Quick
IntroducAon
to
Flume

•  Reliable
–
events
are
stored
in
channel
unAl
delivered
to
next
stage.

•  Recoverable
–
events
can
be
persisted
to
disk
and
recovered
in
the

event
of
failure.

71
Flume
Agent

Source
Channel
Sink
DesAnaAon

Reserved.

A
Quick
IntroducAon
to
Flume

• DeclaraAve

•  No
coding
required.

•  ConﬁguraAon
speciﬁes

how
components
are

wired
together.

72

Reserved.

A
Brief
Discussion
of
Flume
PaTerns
–
Fan-‐in

• Flume
agent
runs
on

each
of
our
servers.

• These
agents
send
data

to
mulAple
agents
to

provide
reliability.

• Flume
provides
support

for
load
balancing.

73

Reserved.

A
Brief
Discussion
of
Flume
PaTerns
–
Spli‚ng

•  Common
need
is
to
split

data
on
ingest.

•  For
example:

•  Sending
data
to
mulAple

clusters
for
DR.

•  To
mulAple
desAnaAons.

•  Flume
also
supports

parAAoning,
which
is
key

to
our
implementaAon.

74

Reserved.

Sqoop
Overview

•  Apache
project
designed
to
ease
import
and
export
of
data

between
Hadoop
and
external
data
stores
such
as
relaAonal

databases.

•  Great
for
doing
bulk
imports
and
exports
of
data
between

HDFS,
Hive
and
HBase
and
an
external
data
store.
Not
suited

for
ingesAng
event
based
data.

Reserved.
75

IngesAon
Decisions

•  Historical
Data

•  Smaller
files:
file
transfer

•  Larger
files:
Flume
with
spooling
directory
source.

•  Incoming
Data

•  Flume
with
the
spooling
directory
source.

76
Reserved.

77
Data
Processing
and
Access

Reserved.

Data
ﬂow

78

Raw
data

ParAAoned

clickstream

data

Other
data

(Financial,

CRM,
etc.)

Aggregated

dataset
#2

Aggregated

dataset
#1

Reserved.

Data
processing
tools

79

•  Hive

•  Impala

•  Pig,
etc.

Reserved.

Hive

80

•  Open
source
data
warehouse
system
for
Hadoop

•  Converts
SQL-‐like
queries
to
MapReduce
jobs

•  Work
is
being
done
to
move
this
away
from
MR

•  Stores
metadata
in
Hive
metastore

•  Can
create
tables
over
HDFS
or
HBase
data

•  Access
available
via
JDBC/ODBC

Reserved.

Impala

81

•  Real-‐Ame
open
source
SQL
query
engine
for
Hadoop

•  Doesn’t
build
on
MapReduce

•  WriTen
in
C++,
uses
LLVM
for
run-‐Ame
code
generaAon

•  Can
create
tables
over
HDFS
or
HBase
data

•  Accesses
Hive
metastore
for
metadata

•  Access
available
via
JDBC/ODBC

Reserved.

Pig

82

•  Higher
level
abstracAon
over
MapReduce
(like
Hive)

•  Write
transformaAons
in
scripAng
language
–
Pig
LaAn

•  Can
access
Hive
metastore
via
HCatalog
for
metadata

Reserved.

Data
Processing
consideraAons

83

•  We
chose
Hive
for
ETL

and
Impala
for
interac/ve
BI.

Reserved.

84
Metadata
Management

Reserved.

What
is
Metadata?

85

•  Metadata
is
data
about
the
data

•  Format
in
which
data
is
stored

•  Compression
codec

•  LocaAon
of
the
data

•  Is
the
data
parAAoned/bucketed/sorted?

Reserved.

Metadata
in
Hive

86
Hive

Metastore

Reserved.

Metadata

87

•  Hive
metastore
has
become
the
de-‐facto
metadata
repository

•  HCatalog
makes
Hive
metastore
accessible
to
other

applicaAons
(Pig,
MapReduce,
custom
apps,
etc.)

Reserved.

Hive
+
HCatalog

88

Reserved.

89
OrchestraAon

Reserved.

OrchestraAon

•  Once
the
data
is
in
Hadoop,
we
need
a
way
to
manage

workﬂows
in
our
architecture.

•  Scheduling
and
tracking
MapReduce
jobs,
Hive
jobs,
etc.

•  Several
opAons
here:

•  Cron

•  Oozie,
Azkaban

•  3rd-‐party
tools,
Talend,
Pentaho,
InformaAca,
enterprise

schedulers.

90
Reserved.

Oozie

• Supports
deﬁning
and

execuAng
a
sequence
of

jobs.

• Can
trigger
jobs
based
on

external
dependencies
or

schedules.

91

Reserved.

92
Final
Architecture

Reserved.

Final
Architecture
–
High
Level
Overview

93

Data

Sources

IngesAon

Data

Storage/
Processing

Data

ReporAng/
Analysis

Reserved.

Final
Architecture
–
High
Level
Overview

94

Data

Sources

IngesAon

Data

Storage/
Processing

Data

ReporAng/
Analysis

Reserved.

Final
Architecture
–
IngesAon

95

Web
App
Avro
Agent

Web
App
Avro
Agent

Web
App
Avro
Agent

Web
App
Avro
Agent

Web
App
Avro
Agent

Web
App
Avro
Agent

Web
App
Avro
Agent

Web
App
Avro
Agent

Flume
Agent

Flume
Agent

Flume
Agent

Flume
Agent

Fan-‐in

PaTern

MulA
Agents
for

Failover
and
rolling
restarts

HDFS

Reserved.

Final
Architecture
–
High
Level
Overview

96

Data

Sources

IngesAon

Data

Storage/
Processing

Data

ReporAng/
Analysis

Reserved.

Final
Architecture
–
Storage
and
Processing

97

/etl/weblogs/20140331/

/etl/weblogs/20140401/

…

Data
Processing

/data/markeAng/clickstream/bouncerate/

/data/markeAng/clickstream/aTribuAon/

…

Reserved.

Final
Architecture
–
High
Level
Overview

98

Data

Sources

IngesAon

Data

Storage/
Processing

Data

ReporAng/
Analysis

Reserved.

Final
Architecture
–
Data
Access

99

Hive/
Impala

BI/
AnalyAcs

Tools

DWH

Sqoop

Local

Disk

R,
etc.

DB
import
tool

JDBC/ODBC

Reserved.

Contact
info

•  Mark
Grover

•  @mark_grover

•  www.linkedin.com/in/grovermark

•  Slides
at
slideshare.net/markgrover

100
Reserved.

101
Reserved.

Application architectures with Hadoop and Sessionization in MR

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (14)

Similar a Application architectures with Hadoop and Sessionization in MR

Similar a Application architectures with Hadoop and Sessionization in MR (20)

Más de markgrover

Más de markgrover (20)

Último

Último (20)

Application architectures with Hadoop and Sessionization in MR