Application Architectures with Hadoop – Big Data TechCon 2014
1. Application Architectures with Hadoop
Mark Grover | Software Engineer
Jonathan Seidman | Solutions Architect, Partner Engineering
April 1, 2014
©2014 Cloudera, Inc. All Rights Reserved.
2. About Us
• Mark
  • Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
  • Contributor to Hadoop, Hive, Spark, Sqoop, Flume.
  • @mark_grover
• Jonathan
  • Solutions Architect, Partner Engineering Team.
  • Co-founder of Chicago Hadoop User Group and Chicago Big Data.
  • jseidman@cloudera.com
  • @jseidman
3. Co-authoring O’Reilly book
• Titled ‘Hadoop Application Architectures’
• How to build end-to-end solutions using Apache Hadoop and related tools
• Updates on Twitter: @hadooparchbook
• http://www.hadooparchitecturebook.com
8. Web Log Example
[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" "age=38&gender=1&incomeCategory=5&session=983040389&user=627735038&region=8&userType=1"
[2012/09/23 14:12:52.294 -0500] "GET /wish/remove/275 HTTP/1.1" 200 701 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16" "age=63&gender=1&incomeCategory=1&session=1561203915&user=1364334488&region=4&userType=1"
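To make the structure of these log lines concrete, here is a minimal Python sketch that splits one entry into fields. The field names (`timestamp`, `referrer`, `attributes`, and so on) are assumptions chosen for illustration; the slides do not name the fields, and the trailing key/value string is treated here as a generic set of visitor attributes.

```python
import re
from urllib.parse import parse_qs

# Assumed layout: [timestamp] "request" status bytes "referrer" "user agent" "key=value&..."
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" '
    r'"(?P<user_agent>[^"]*)" '
    r'"(?P<attributes>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    # parse_qs returns a list per key; keep the first value of each.
    record["attributes"] = {
        key: values[0] for key, values in parse_qs(record["attributes"]).items()
    }
    return record


sample = (
    '[2012/09/22 20:56:04.294 -0500] "GET /info/ HTTP/1.1" 200 701 "-" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en)" '
    '"age=38&gender=1&incomeCategory=5&session=983040389&user=627735038'
    '&region=8&userType=1"'
)

record = parse_log_line(sample)
print(record["path"], record["status"], record["attributes"]["region"])
```

Lines that do not match the assumed layout come back as `None`, so a caller can count or quarantine them instead of failing the whole batch.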
9. Hadoop Architectural Considerations
• Storage managers?
  • HDFS? HBase?
• Data storage and modeling:
  • File formats? Compression? Schema design?
• Data movement
  • How do we actually get the data into Hadoop? How do we get it out?
• Metadata
  • How do we manage data about the data?
• Data access and processing
  • How will the data be accessed once in Hadoop? How can we transform it? How do we query it?
• Orchestration
  • How do we manage the workflow for all of this?
11. Data Storage – Storage Manager considerations
• Popular storage managers for Hadoop
  • Hadoop Distributed File System (HDFS)
  • HBase
12. Data Storage – HDFS vs HBase
HDFS
• Stores data directly as files
• Fast scans
• Poor random reads/writes
HBase
• Stores data as HFiles on HDFS
• Slow scans
• Fast random reads/writes
13. Data Storage – Storage Manager considerations
• We choose HDFS
  • Analytical needs in this case are served better by fast scans.
15. Data Storage – Format Considerations
• Store as plain text?
  • Sure, well supported by Hadoop.
  • Text can easily be processed by MapReduce, loaded into Hive for analysis, and so on.
• But…
  • Will begin to consume lots of space in HDFS.
  • May not be optimal for processing by tools in the Hadoop ecosystem.
16. Data Storage – Format Considerations
• But, we can compress the text files…
  • Gzip – supported by Hadoop, but not splittable.
  • Bzip2 – hey, splittable! Great compression! But decompression is slooowww.
  • LZO – splittable (with some work), good compress/decompress performance. Good choice for storing text files on Hadoop.
  • Snappy – provides a good tradeoff between size and speed.
17. Data Storage – More About Snappy
• Designed at Google to provide high compression speeds with reasonable compression.
• Not the highest compression, but provides very good performance for processing on Hadoop.
• Snappy is not splittable though, which brings us to…
18. SequenceFile
• Stores records as binary key/value pairs.
• SequenceFile “blocks” can be compressed.
• This enables splittability with non-splittable compression.
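The idea that block-level compression restores splittability can be sketched in a few lines of Python. This is a toy container, not the SequenceFile or Avro on-disk format: the sync marker, the length prefix, and the use of `zlib` as a stand-in for a non-splittable codec such as Snappy are all assumptions made for illustration.

```python
import struct
import zlib

# Hypothetical marker written before every block. Real container formats use
# a marker that is very unlikely to occur inside the data.
SYNC = b"\x00SYNCMARK\x00"


def write_blocks(records, records_per_block=2):
    """Compress records in independent blocks, each preceded by a sync marker."""
    out = bytearray()
    for start in range(0, len(records), records_per_block):
        raw = b"\n".join(records[start:start + records_per_block])
        compressed = zlib.compress(raw)
        out += SYNC + struct.pack(">I", len(compressed)) + compressed
    return bytes(out)


def read_from_offset(data, offset):
    """Start at an arbitrary byte offset, skip to the next sync marker,
    and decode every complete block from there to the end."""
    pos = data.find(SYNC, offset)
    records = []
    while pos != -1 and pos < len(data):
        pos += len(SYNC)
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        pos += 4
        records.extend(zlib.decompress(data[pos:pos + length]).split(b"\n"))
        pos += length
    return records


records = [b"r0", b"r1", b"r2", b"r3", b"r4"]
data = write_blocks(records)
print(read_from_offset(data, 0))  # every record
print(read_from_offset(data, 1))  # starts at the second block
```

Because each block is compressed on its own, a reader handed a split that begins mid-file only needs to find the next marker. With a single compressed stream over the whole file, it would have to decompress from byte zero.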
19. Avro
• Kinda SequenceFile on steroids.
• Self-documenting – stores schema in header.
• Provides very efficient storage.
• Supports splittable compression.
20. Our Format Choices…
• Avro with Snappy
  • Snappy provides optimized compression.
  • Avro provides compact storage, self-documenting files, and supports schema evolution.
  • Avro also provides better failure handling than other choices.
• SequenceFiles would also be a good choice, and are directly supported by ingestion tools in the ecosystem.
  • But they only support Java.
23. Recommended HDFS Schema Design
/user/<username> – User-specific data, jars, conf files
/etl – Data in various stages of ETL workflow
/tmp – Temp data from tools or shared between users
/data – Shared data for the entire organization
/app – Everything but data: UDF jars, HQL files, Oozie workflows
25. What is Partitioning?
Un-partitioned HDFS directory structure:
dataset
  file1.txt
  file2.txt
  ...
  filen.txt
Partitioned HDFS directory structure:
dataset
  col=val1/file.txt
  col=val2/file.txt
  ...
  col=valn/file.txt
26. What is Partitioning?
Un-partitioned HDFS directory structure:
clicks
  clicks-2014-01-01.txt
  clicks-2014-01-02.txt
  ...
  clicks-2014-03-31.txt
Partitioned HDFS directory structure:
clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt
  ...
  dt=2014-03-31/clicks.txt
27. Partitioning
• Split the dataset into smaller consumable chunks
• Rudimentary form of “indexing”
• <data set name>/<partition_column_name=partition_column_value>/{files}
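The directory convention above can be sketched in Python as a path builder plus a simple form of partition pruning. The function names are hypothetical, and the timestamp format is assumed to match the one in the web log example; the `clicks` dataset and `dt` column come from the preceding slides.

```python
from datetime import datetime


def partition_path(dataset, column, value):
    """Build <data set name>/<partition_column_name=partition_column_value>."""
    return f"{dataset}/{column}={value}"


def click_partition(log_timestamp):
    """Map a web log timestamp to its daily partition of the clicks dataset."""
    ts = datetime.strptime(log_timestamp, "%Y/%m/%d %H:%M:%S.%f %z")
    return partition_path("clicks", "dt", ts.strftime("%Y-%m-%d"))


def prune(partitions, start, end):
    """Keep only partitions whose value falls in [start, end].

    This is the rudimentary indexing: a query for a date range reads just
    the matching directories instead of scanning the whole dataset.
    ISO dates compare correctly as plain strings.
    """
    return [p for p in partitions if start <= p.split("=", 1)[1] <= end]


print(click_partition("2012/09/22 20:56:04.294 -0500"))  # clicks/dt=2012-09-22

all_partitions = [
    partition_path("clicks", "dt", day)
    for day in ("2014-01-01", "2014-01-02", "2014-03-31")
]
print(prune(all_partitions, "2014-01-01", "2014-01-31"))
```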
28. Partitioning considerations
• What column to partition by?
• HDFS is append only.
• Don’t have too many partitions (keep the count under 10,000)
• Don’t have too many small files in the partitions (files should generally be larger than the block size)
• We decided to partition by timestamp
29. What is bucketing?
Un-bucketed HDFS directory structure:
clicks
  dt=2014-01-01/clicks.txt
  dt=2014-01-02/clicks.txt
Bucketed HDFS directory structure:
clicks
  dt=2014-01-01/file0.txt
  dt=2014-01-01/file1.txt
  dt=2014-01-01/file2.txt
  dt=2014-01-01/file3.txt
  dt=2014-01-02/file0.txt
  dt=2014-01-02/file1.txt
  dt=2014-01-02/file2.txt
  dt=2014-01-02/file3.txt
30. Bucketing
• Hash-bucketed files within each partition based on a particular column
• Useful when sampling
• Useful in some joins; prerequisites:
  • Datasets bucketed on the same key as the join key
  • Number of buckets is the same, or one is a multiple of the other
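A short Python sketch shows how hash bucketing assigns a record to a file, and why the bucket counts of two joined datasets need to be equal or multiples of each other. CRC32 is used here only as a stable stand-in hash; Hive applies its own hash function, and the cookie values are made up for the example.

```python
import zlib


def bucket_for(key, num_buckets):
    """Assign a key to a bucket with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_file(partition_dir, key, num_buckets):
    """File within a partition that holds records for this key."""
    return f"{partition_dir}/file{bucket_for(key, num_buckets)}.txt"


cookies = ["cookie-a", "cookie-b", "cookie-c", "cookie-d"]  # hypothetical keys

for cookie in cookies:
    print(cookie, bucket_file("clicks/dt=2014-01-01", cookie, 4))

# If one dataset has 4 buckets and the other 8, a key's 8-way bucket number
# modulo 4 equals its 4-way bucket number. Each bucket on the 4-bucket side
# therefore only needs to be matched against two known buckets on the other.
for cookie in cookies:
    assert bucket_for(cookie, 8) % 4 == bucket_for(cookie, 4)
```

With unrelated bucket counts (say 4 and 6) no such correspondence holds, so matching keys could sit in any pair of buckets and the join falls back to comparing everything.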
31. Bucketing considerations?
• Which column to bucket on?
• How many buckets?
• We decided to bucket based on cookie
32. De-normalizing considerations
• In general, big data joins are expensive
• When to de-normalize?
• Decided to join the smaller dimension tables
• Big fact tables are still joined
34. File Transfers
• “hadoop fs -put <file>”
• Reliable, but not resilient to failure.
• Other options are mountable HDFS, for example NFSv3.
35. Streaming Ingestion
• Flume
  • Reliable, distributed, and available system for efficient collection, aggregation and movement of streaming data, e.g. logs.
• Kafka
  • Reliable and distributed publish-subscribe messaging system.
36. Flume vs. Kafka
Flume
• Purpose built for Hadoop data ingest.
• Pre-built sinks for HDFS, HBase, etc.
• Supports transformation of data in-flight.
Kafka
• General pub-sub messaging framework.
• Hadoop not supported, requires 3rd-party component (Camus).
• Just a message transport (a very fast one).
37. Flume vs. Kafka
• Bottom line:
  • Flume is very well integrated with the Hadoop ecosystem, well suited to ingestion of sources such as log files.
  • Kafka is a highly reliable and scalable enterprise messaging system, and great for scaling out to multiple consumers.
38. A Quick Introduction to Flume
[Diagram: External Source (Web Server, Twitter, JMS, System logs, …) → Flume Agent (Source → Channel → Sink) → Destination]
• Flume Agent: JVM process hosting components
• Source: consumes events and forwards to channels
• Channel: stores events until consumed by sinks – file, memory, JDBC
• Sink: removes event from channel and puts it into external destination
39. A Quick Introduction to Flume
• Reliable – events are stored in channel until delivered to next stage.
• Recoverable – events can be persisted to disk and recovered in the event of failure.
[Diagram: Flume Agent (Source → Channel → Sink) → Destination]
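The reliability property described here (an event stays in the channel until the next stage has accepted it) can be modeled with a toy Python sketch. This is not the Flume API, which is Java and built around channel transactions; the class and function names below are hypothetical and only mirror the source, channel, sink vocabulary of the slides.

```python
from collections import deque


class MemoryChannel:
    """Toy in-memory channel: holds events until a sink confirms delivery."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def peek(self):
        return self._events[0] if self._events else None

    def commit(self):
        # Called only after the destination has accepted the event.
        self._events.popleft()

    def __len__(self):
        return len(self._events)


def drain(channel, deliver):
    """Sink loop: deliver events in order, removing each one only on success.

    Returns the number of events delivered before the channel emptied
    or the destination failed.
    """
    delivered = 0
    while len(channel):
        event = channel.peek()
        try:
            deliver(event)
        except IOError:
            break  # the event stays in the channel for a later retry
        channel.commit()
        delivered += 1
    return delivered


channel = MemoryChannel()
for event in ["e1", "e2", "e3"]:
    channel.put(event)

received = []
outage = {"active": True}


def flaky_destination(event):
    """Fails once on e2 to simulate a temporary outage."""
    if event == "e2" and outage["active"]:
        outage["active"] = False
        raise IOError("destination unavailable")
    received.append(event)


first_pass = drain(channel, flaky_destination)   # delivers e1, then stops
second_pass = drain(channel, flaky_destination)  # retries e2, then e3
print(first_pass, second_pass, received)
```

A memory-backed channel like this one loses its contents if the process dies. The recoverable behavior on the slide corresponds to a channel that persists events to disk instead.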
40. A Quick Introduction to Flume
• Declarative
  • No coding required.
  • Configuration specifies how components are wired together.
41. A Brief Discussion of Flume Patterns – Fan-in
• Flume agent runs on each of our servers.
• These agents send data to multiple agents to provide reliability.
• Flume provides support for load balancing.
42. A Brief Discussion of Flume Patterns – Splitting
• Common need is to split data on ingest.
• For example:
  • Sending data to multiple clusters for DR.
  • To multiple destinations.
• Flume also supports partitioning, which is key to our implementation.
43. Sqoop Overview
• Apache project designed to ease import and export of data between Hadoop and external data stores such as relational databases.
• Great for doing bulk imports and exports of data between HDFS, Hive and HBase and an external data store. Not suited for ingesting event-based data.
44. Ingestion Decisions
• Historical Data
  • Smaller files: file transfer
  • Larger files: Flume with spooling directory source.
• Incoming Data
  • Flume with the spooling directory source.
46. Data flow
[Diagram labels: Raw data; Partitioned clickstream data; Other data (Financial, CRM, etc.); Aggregated dataset #1; Aggregated dataset #2]
48. Hive
• Open source data warehouse system for Hadoop
• Converts SQL-like queries to MapReduce jobs
  • Work is being done to move this away from MR
• Stores metadata in Hive metastore
• Can create tables over HDFS or HBase data
• Access available via JDBC/ODBC
49. Impala
• Real-time open source SQL query engine for Hadoop
• Doesn’t build on MapReduce
• Written in C++, uses LLVM for run-time code generation
• Can create tables over HDFS or HBase data
• Accesses Hive metastore for metadata
• Access available via JDBC/ODBC
50. Pig
• Higher level abstraction over MapReduce (like Hive)
• Write transformations in scripting language – Pig Latin
• Can access Hive metastore via HCatalog for metadata
53. What is Metadata?
• Metadata is data about the data
• Format in which data is stored
• Compression codec
• Location of the data
• Is the data partitioned/bucketed/sorted?
54. Metadata in Hive
[Diagram: Hive Metastore]
55. Metadata
• Hive metastore has become the de-facto metadata repository
• HCatalog makes Hive metastore accessible to other applications (Pig, MapReduce, custom apps, etc.)
58. Orchestration
• Once the data is in Hadoop, we need a way to manage workflows in our architecture.
  • Scheduling and tracking MapReduce jobs, Hive jobs, etc.
• Several options here:
  • Cron
  • Oozie, Azkaban
  • 3rd-party tools: Talend, Pentaho, Informatica, enterprise schedulers.
59. Oozie
• Supports defining and executing a sequence of jobs.
• Can trigger jobs based on external dependencies or schedules.
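What "defining and executing a sequence of jobs" means can be illustrated with a small dependency graph in Python. This is not Oozie, which defines workflows in XML and runs them on the cluster; the job names are hypothetical, loosely based on the clickstream example in this deck, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
workflow = {
    "ingest_weblogs": set(),
    "partition_clickstream": {"ingest_weblogs"},
    "bounce_rate": {"partition_clickstream"},
    "attribution": {"partition_clickstream"},
}


def execution_order(jobs):
    """Return one valid order in which the jobs can run."""
    return list(TopologicalSorter(jobs).static_order())


order = execution_order(workflow)
print(order)
```

The two aggregation jobs depend only on the partitioning step, so a scheduler is free to run them in either order or in parallel. A workflow engine adds what this sketch leaves out: triggering on a schedule or on data arrival, retries, and tracking of job status.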
61. Final Architecture – High Level Overview
[Diagram stages: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
62. Final Architecture – High Level Overview
[Diagram stages: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
63. Final Architecture – Ingestion
[Diagram: eight Web App / Avro Agent nodes feed four Flume Agents, which write to HDFS. Labels: Fan-in Pattern; Multi Agents for Failover and rolling restarts]
64. Final Architecture – High Level Overview
[Diagram stages: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
65. Final Architecture – Storage and Processing
[Diagram: input directories → Data Processing → output directories]
Input:
/etl/weblogs/20140331/
/etl/weblogs/20140401/
…
Output:
/data/marketing/clickstream/bouncerate/
/data/marketing/clickstream/attribution/
…
66. Final Architecture – High Level Overview
[Diagram stages: Data Sources; Ingestion; Data Storage/Processing; Data Reporting/Analysis]
67. Final Architecture – Data Access
[Diagram labels: Hive/Impala; BI/Analytics Tools; DWH; Sqoop; Local Disk; R, etc.; DB import tool; JDBC/ODBC]
68. Contact info
• Mark Grover
  • @mark_grover
  • www.linkedin.com/in/grovermark
• Jonathan Seidman
  • jseidman@cloudera.com
  • @jseidman
  • https://www.linkedin.com/pub/jonathan-seidman/1/26a/959
  • http://www.slideshare.net/jseidman
• Slides at slideshare.net/hadooparchbook