Building Hadoop Data Applications with Kite
1. Building Hadoop Data Applications with Kite
Tom White
@tom_e_white
Hadoop Users Group UK, London
17 June 2014
2. About me
• Engineer at Cloudera working on Core Hadoop and Kite
• Apache Hadoop Committer, PMC Member, Apache Member
• Author of “Hadoop: The Definitive Guide”
7. Glossary
• Apache Avro – cross-language data serialization library
• Apache Parquet (incubating) – column-oriented storage format for nested data
• Apache Hive – data warehouse (SQL and metastore)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Impala – interactive SQL on Hadoop
8. Outline
• A Typical Application
• Kite SDK
• An Example
• Advanced Kite
14. Kite
• A client-side library for writing Hadoop Data Applications
• First release was in April 2013 as CDK
• 0.14.1 last month
• Open source, Apache 2 license, kitesdk.org
• Modular
  • Data module (HDFS, Flume, Crunch, Hive, HBase)
  • Morphlines transformation module
  • Maven plugin
16. Kite Data Module
• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset (usage sketch below)
• http://kitesdk.org/docs/current/apidocs/index.html
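Pulled together, these classes are used roughly as follows. This is a minimal sketch assuming the 0.14-era API (where readers and writers are opened explicitly) and the Event entity defined on the next slide; the source value is illustrative.

DatasetRepository repo = DatasetRepositories.open("repo:hive");
Dataset<Event> events = repo.load("events");

// append entities to the dataset as a stream
DatasetWriter<Event> writer = events.newWriter();
try {
  writer.open();
  Event e = new Event();
  e.setId(1L);
  e.setTimestamp(System.currentTimeMillis());
  e.setSource("web-01");   // illustrative value
  writer.write(e);
} finally {
  writer.close();
}

// read the entities back
DatasetReader<Event> reader = events.newReader();
try {
  reader.open();
  while (reader.hasNext()) {
    Event event = reader.next();
    // process event
  }
} finally {
  reader.close();
}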
17. 1. Define the Event Entity
public class Event {
  private long id;
  private long timestamp;
  private String source;
  // getters and setters
}
18. 2. Create the Events Dataset
DatasetRepository repo =
    DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor =
    new DatasetDescriptor.Builder()
        .schema(Event.class).build();
repo.create("events", descriptor);
19. (2. or with the Maven plugin)
$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
29. Unified Storage Interface
• Dataset – streaming access, HDFS storage
• RandomAccessDataset – random access, HBase storage
• PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
30. Filesystem Partitions
PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp").build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796
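A minimal sketch of wiring the strategy into the dataset at creation time, assuming the 0.14-era DatasetDescriptor.Builder accepts a partition strategy; repo is the repository from step 2.

DatasetDescriptor partitioned = new DatasetDescriptor.Builder()
    .schema(Event.class)
    .partitionStrategy(p)   // assumed builder method; writes then land in the year=/month=/day= directories above
    .build();
repo.create("events", partitioned);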
35. Parallel Processing
• Goal is for Hadoop processing frameworks to “just work”
• Support Formats, Partitions, Views
• Native Kite components, e.g. DatasetOutputFormat for MR; see the Crunch sketch below the table

             HDFS Dataset    HBase Dataset
Crunch       Yes             Yes
MapReduce    Yes             Yes
Hive         Yes             Planned
Impala       Yes             Planned
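A minimal Crunch sketch for the table above, assuming the 0.14-era CrunchDatasets helpers from the Kite data-crunch module; the CopyEvents driver class and the "copies" dataset are illustrative, and the target dataset would need to exist with a compatible schema.

// copy the events dataset into another dataset with a Crunch pipeline
Pipeline pipeline = new MRPipeline(CopyEvents.class);
PCollection<Event> events = pipeline.read(
    CrunchDatasets.asSource(repo.load("events"), Event.class));
pipeline.write(events, CrunchDatasets.asTarget(repo.load("copies")));
pipeline.done();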
36. Schema Evolution
public class Event {
  private long id;
  private long timestamp;
  private String source;
  @Nullable private String ipAddress;
}

$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
37. Searchable Datasets
• Use Flume Solr Sink (in addition to HDFS Sink); see the config sketch below
• Morphlines library to define fields to index
• SolrCloud runs on the cluster, serving indexes stored in HDFS
• Future support in Kite to index selected fields automatically
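A hypothetical flume.conf fragment for the first bullet: events are fanned out to an HDFS sink for storage and a MorphlineSolrSink for indexing. Agent, channel, and file names are illustrative, and morphline1 refers to the morphline shown later.

agent.sinks = hdfsSink solrSink
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.channel = solrChannel
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphlines.conf
agent.sinks.solrSink.morphlineId = morphline1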
39. Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
40. Getting Started With Kite
• Examples at github.com/kite-sdk/kite-examples
  • Working with streaming and random-access datasets
  • Logging events to datasets from a webapp
  • Running a periodic job
  • Migrating data from CSV to a Kite dataset
  • Converting an Avro dataset to a Parquet dataset
  • Writing and configuring Morphlines
  • Using Morphlines to write JSON records to a dataset
43. Applications
• [Batch] Analyze an archive of songs [1]
• [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
• [Search] Searching email traffic in near-real-time [3]
• [ML] Detecting fraudulent transactions using clustering [4]

[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
44. … or use JDBC
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");
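Continuing the snippet: iterate the result set and release the resources (standard JDBC; the first column is printed since the summaries schema is not shown here).

while (resultSet.next()) {
  System.out.println(resultSet.getString(1));   // first column of each summary row
}
resultSet.close();
statement.close();
connection.close();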
45. Apps
• App – a packaged Java program that runs on a Hadoop cluster
• cdk:package-app – create a package on the local filesystem
  • like an exploded WAR
  • Oozie format
• cdk:deploy-app – copy packaged app to HDFS
• cdk:run-app – execute the app (command sketch below)
• Workflow app – runs once
• Coordinator app – runs other apps (like cron)
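Invoked as Maven goals, in the same style as kite:create-dataset earlier; the goal names are taken from this slide and any plugin configuration is omitted.

$ mvn cdk:package-app
$ mvn cdk:deploy-app
$ mvn cdk:run-app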
46. Morphlines Example
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22