Building Hadoop Data Applications with Kite
1. Building Hadoop Data Applications with Kite
Tom White
@tom_e_white
Hadoop Users Group UK, London
17 June 2014
2. About me
• Engineer at Cloudera working on Core Hadoop and Kite
• Apache Hadoop Committer, PMC Member, Apache Member
• Author of “Hadoop: The Definitive Guide”
7. Glossary
• Apache Avro – cross-language data serialization library
• Apache Parquet (incubating) – column-oriented storage format for nested data
• Apache Hive – data warehouse (SQL and metastore)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Impala – interactive SQL on Hadoop
8. Outline
• A Typical Application
• Kite SDK
• An Example
• Advanced Kite
14. Kite
• A client-side library for writing Hadoop Data Applications
• First release was in April 2013 as CDK
• 0.14.1 last month
• Open source, Apache 2 license, kitesdk.org
• Modular
  • Data module (HDFS, Flume, Crunch, Hive, HBase)
  • Morphlines transformation module
  • Maven plugin
16. Kite Data Module
• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset (usage sketch below)
• http://kitesdk.org/docs/current/apidocs/index.html
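Pulled together, these classes are used roughly as follows. This is a minimal sketch assuming the 0.14-era API (where readers and writers are opened explicitly) and the Event entity defined on the next slide; the source value is illustrative.

DatasetRepository repo = DatasetRepositories.open("repo:hive");
Dataset<Event> events = repo.load("events");

// append entities to the dataset as a stream
DatasetWriter<Event> writer = events.newWriter();
try {
  writer.open();
  Event e = new Event();
  e.setId(1L);
  e.setTimestamp(System.currentTimeMillis());
  e.setSource("web-01");   // illustrative value
  writer.write(e);
} finally {
  writer.close();
}

// read the entities back
DatasetReader<Event> reader = events.newReader();
try {
  reader.open();
  while (reader.hasNext()) {
    Event event = reader.next();
    // process event
  }
} finally {
  reader.close();
}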
17. 1. Define the Event Entity
public class Event {
  private long id;
  private long timestamp;
  private String source;
  // getters and setters
}
18. 2. Create the Events Dataset
DatasetRepository repo =
    DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor =
    new DatasetDescriptor.Builder()
        .schema(Event.class).build();
repo.create("events", descriptor);
19. (2. or with the Maven plugin)
$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
29. Unified Storage Interface
• Dataset – streaming access, HDFS storage
• RandomAccessDataset – random access, HBase storage
• PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
30. Filesystem Partitions
PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp").build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796
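A minimal sketch of wiring the strategy into the dataset at creation time, assuming the 0.14-era DatasetDescriptor.Builder accepts a partition strategy; repo is the repository from step 2.

DatasetDescriptor partitioned = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .partitionStrategy(p)   // assumed builder method; writes then land in the year=/month=/day= directories above
    .build();
repo.create("events", partitioned);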
35. Parallel Processing
• Goal is for Hadoop processing frameworks to “just work”
• Support Formats, Partitions, Views
• Native Kite components, e.g. DatasetOutputFormat for MR; see the Crunch sketch below the table

             HDFS Dataset    HBase Dataset
Crunch       Yes             Yes
MapReduce    Yes             Yes
Hive         Yes             Planned
Impala       Yes             Planned
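A minimal Crunch sketch for the table above, assuming the 0.14-era CrunchDatasets helpers from the Kite data-crunch module; the CopyEvents driver class and the "copies" dataset are illustrative, and the target dataset would need to exist with a compatible schema.

// copy the events dataset into another dataset with a Crunch pipeline
Pipeline pipeline = new MRPipeline(CopyEvents.class);
PCollection<Event> events = pipeline.read(
    CrunchDatasets.asSource(repo.load("events"), Event.class));
pipeline.write(events, CrunchDatasets.asTarget(repo.load("copies")));
pipeline.done();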
36. Schema Evolution
public class Event {
  private long id;
  private long timestamp;
  private String source;
  @Nullable private String ipAddress;
}

$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
37. Searchable Datasets
• Use Flume Solr Sink (in addition to HDFS Sink); see the config sketch below
• Morphlines library to define fields to index
• SolrCloud runs on the cluster, serving indexes stored in HDFS
• Future support in Kite to index selected fields automatically
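A hypothetical flume.conf fragment for the first bullet: events are fanned out to an HDFS sink for storage and a MorphlineSolrSink for indexing. Agent, channel, and file names are illustrative, and morphline1 refers to the morphline shown later.

agent.sinks = hdfsSink solrSink
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = solrChannel
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphlines.conf
agent.sinks.solrSink.morphlineId = morphline1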
39. Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
40. Getting Started With Kite
• Examples at github.com/kite-sdk/kite-examples
  • Working with streaming and random-access datasets
  • Logging events to datasets from a webapp
  • Running a periodic job
  • Migrating data from CSV to a Kite dataset
  • Converting an Avro dataset to a Parquet dataset
  • Writing and configuring Morphlines
  • Using Morphlines to write JSON records to a dataset
43. Applications
• [Batch] Analyze an archive of songs [1]
• [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
• [Search] Searching email traffic in near-real-time [3]
• [ML] Detecting fraudulent transactions using clustering [4]

[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
44. … or use JDBC
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");
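Continuing the snippet: iterate the result set and release the resources (standard JDBC; the first column is printed since the summaries schema is not shown here).

while (resultSet.next()) {
  System.out.println(resultSet.getString(1));   // first column of each summary row
}
resultSet.close();
statement.close();
connection.close();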
45. Apps
• App – a packaged Java program that runs on a Hadoop cluster
• cdk:package-app – create a package on the local filesystem
  • like an exploded WAR
  • Oozie format
• cdk:deploy-app – copy packaged app to HDFS
• cdk:run-app – execute the app (command sketch below)
• Workflow app – runs once
• Coordinator app – runs other apps (like cron)
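Invoked as Maven goals, in the same style as kite:create-dataset earlier; the goal names are taken from this slide and any plugin configuration is omitted.

$ mvn cdk:package-app
$ mvn cdk:deploy-app
$ mvn cdk:run-app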
46. Morphlines Example
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22