Kite SDK introduction for Portland Big Data
2. Resources
©2014 Cloudera, Inc. All rights reserved.
• Kite guide
• http://tiny.cloudera.com/KiteGuide
• Dataset overview and intro
• http://tiny.cloudera.com/Datasets
• Command-line tutorial
• http://tiny.cloudera.com/KiteCLI
• Kite repository and examples
• https://github.com/kite-sdk/kite
• https://github.com/kite-sdk/kite-examples
4. What problem does Kite solve?
• Accessibility for getting started
• Easy to get started, without being an expert
• Use before understanding
• Save time for experienced developers
• Off-the-shelf tools for common tasks
• Quickly iterate and test configurations
5. Kite Datasets: Motivation
• Focus on using data, not managing files
• Developers shouldn’t have to maintain data files
• Use through configuration, not code
• Need consistency across the platform
8. Kite Datasets: Motivation
[Diagram labels: Application, Database, Data files, HBase, Kite Data; the data files and HBase storage under Kite Data are marked as maintained by Kite.]
9. Kite Datasets: Goals
• Think in terms of data: datasets, views, records
• Describe your data and layout, and Kite does the right thing
• Should work consistently across the platform
• Reliable
10. Kite Datasets: Compatibility
Project      HDFS (avro)   HDFS (parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *
* depends on common HBase encoding format
11. Current compatibility (0.15.0)
Project      HDFS (avro)   HDFS (parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *
* depends on common HBase encoding format
12. Agenda
• Kite background
• Kite data
[Diagram: Application on top of Kite Data, which sits on data files and HBase (maintained by Kite).]
13. Datasets
• A collection of records or entities
• Like a Hive or relational table
• Generic, reflected, or generated objects
• Identified by URI
• dataset:hdfs:/data/ratings
• dataset:hive:/data/ratings
• dataset:hbase:zk1/ratings
ratings = Datasets.load("dataset:hive:/data/ratings")
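The URIs above follow a `dataset:<storage>:<location>` pattern. As a rough illustration of that structure only (this is not Kite's own URI handling, and the class name `DatasetUriSketch` is hypothetical), the two parts can be pulled apart with the JDK's `java.net.URI`:

```java
import java.net.URI;

/** Illustrative only: splits dataset:<storage>:<location> with the JDK URI class. */
final class DatasetUriSketch {

  /** Returns {storage scheme, location}, e.g. {"hive", "/data/ratings"}. */
  static String[] split(String datasetUri) {
    URI outer = URI.create(datasetUri);
    if (!"dataset".equals(outer.getScheme())) {
      throw new IllegalArgumentException("Not a dataset URI: " + datasetUri);
    }
    // Everything after "dataset:" is itself a URI that names the storage backend.
    URI inner = URI.create(outer.getSchemeSpecificPart());
    return new String[] { inner.getScheme(), inner.getSchemeSpecificPart() };
  }

  public static void main(String[] args) {
    String[] uris = {
      "dataset:hdfs:/data/ratings",
      "dataset:hive:/data/ratings",
      "dataset:hbase:zk1/ratings"
    };
    for (String uri : uris) {
      String[] parts = split(uri);
      System.out.println(parts[0] + " -> " + parts[1]);
    }
  }
}
```

The storage scheme selects the backend, while the remainder locates the dataset within it, so the same application code can switch storage by changing only the URI.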
15. Dataset configuration, JSON
• Schema (Avro)
• Record fields, like a table definition
• Partition strategy
• Layout or key definition from record fields
16. Configuring partitioning
• Partition strategy
[ {
  "source" : "timestamp",
  "type" : "year"
}, {
  "source" : "timestamp",
  "type" : "month"
}, {
  "source" : "timestamp",
  "type" : "day"
} ]
datasets/
└── ratings/
    ├── year=1997/
    │   ├── month=09/
    │   │   ├── day=20/
    │   │   ├── ...
    │   │   └── day=30/
    │   ├── month=10/
    │   │   ├── day=01/
    │   │   ├── ...
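To make the relationship between the partition strategy and the directory layout concrete, here is a minimal sketch of how a record's timestamp could map to a `year=/month=/day=` path. It assumes the timestamp is epoch milliseconds interpreted in UTC, with two-digit month and day values as in the listing above; the class `PartitionPathSketch` is hypothetical and is not Kite code.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustrative only: derives a year/month/day partition path from a timestamp. */
final class PartitionPathSketch {

  /** Assumes epoch milliseconds and UTC; both are assumptions, not Kite guarantees. */
  static String partitionPath(long timestampMillis) {
    ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
    // One path segment per partition field, in the order the strategy lists them.
    return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
        t.getYear(), t.getMonthValue(), t.getDayOfMonth());
  }

  public static void main(String[] args) {
    long ts = Instant.parse("1997-09-20T12:00:00Z").toEpochMilli();
    System.out.println("datasets/ratings/" + partitionPath(ts));
  }
}
```

Because the path is derived purely from record fields, the writer never has to choose a directory by hand, and readers can skip whole directories when filtering by time.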
17. Configuring key building
• Partition strategy for HBase
[ {
  "source" : "email",
  "type" : "hash",
  "buckets": 32
}, {
  "source" : "email",
  "type" : "identity"
} ]
(22, "buzz@pixar.com")
\x80\x00\x00\x16buzz@pixar.com\x00\x00
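The key bytes above can be read as a 4-byte bucket number followed by the email and a two-byte terminator. The sketch below reproduces that byte layout for the pair (22, "buzz@pixar.com"). Two things are assumptions here rather than facts from the slides: the bucket function (a non-negative `hashCode()` modulo the bucket count, which may not yield 22 for this email) and the general encoding rules, which in Kite may include escaping not shown here. The class `RowKeySketch` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: builds a composite row key shaped like the one on the slide. */
final class RowKeySketch {

  /** Assumed bucket function; Kite's actual hash may differ. */
  static int bucket(Object value, int buckets) {
    return (value.hashCode() & Integer.MAX_VALUE) % buckets;
  }

  /** Encodes (bucket, email) as sign-flipped big-endian int + UTF-8 text + 0x00 0x00. */
  static byte[] encode(int bucket, String email) {
    byte[] text = email.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(4 + text.length + 2); // big-endian by default
    // Flipping the sign bit makes unsigned byte-wise ordering match numeric ordering.
    buf.putInt(bucket ^ 0x80000000);
    buf.put(text);
    buf.put((byte) 0).put((byte) 0); // terminator, as in the slide's example
    return buf.array();
  }

  public static void main(String[] args) {
    StringBuilder hex = new StringBuilder();
    for (byte b : encode(22, "buzz@pixar.com")) {
      hex.append(String.format("%02x ", b));
    }
    System.out.println(hex.toString().trim());
  }
}
```

Putting the hash bucket first spreads otherwise adjacent emails across the key space, while keeping the identity field after it lets a lookup by email still reconstruct the full key.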
18. Dataset configuration, JSON
• Schema (Avro)
• Record fields, like a table definition
• Partition strategy
• Layout or key definition from record fields
• Column mapping (HBase)
• Where to store record fields
19. Mapping example
{
  "type" : "record",
  "name" : "User",
  "fields" : [ {
    "name" : "email",
    "type" : "string"
  }, ... ]
}
family           name                counts   prefs
row key          last        first   visits   flash
buzz@pixar.com   Lightyear   Buzz    315      true
[
  { "source": "email",
    "type": "key" },
  ...
]
20. Mapping example
{
  "type" : "record",
  "name" : "User",
  "fields" : [ {
    "name" : "lastName",
    "type" : "string"
  }, ... ]
}
family           name                counts   prefs
row key          last        first   visits   flash
buzz@pixar.com   Lightyear   Buzz    315      true
[
  { "source": "lastName",
    "type": "column",
    "family": "name",
    "qualifier": "last" },
  ...
]
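As a rough sketch of what a column mapping does, the code below applies the two mapping entries shown on the slides (`email` as the key, `lastName` stored in `name:last`) to a record held in a plain map. The classes `ColumnMappingSketch`, `FieldMapping`, and `Row` are hypothetical stand-ins for illustration; Kite's real implementation works with Avro records and HBase cells and is not shown in the deck.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: applies key and column mappings to a record. */
final class ColumnMappingSketch {

  /** One mapping entry: source field, mapping type, and (for columns) family and qualifier. */
  static final class FieldMapping {
    final String source, type, family, qualifier;
    FieldMapping(String source, String type, String family, String qualifier) {
      this.source = source; this.type = type; this.family = family; this.qualifier = qualifier;
    }
  }

  /** A simplified HBase row: a key plus "family:qualifier" -> value cells. */
  static final class Row {
    String key;
    final Map<String, String> cells = new LinkedHashMap<>();
  }

  static Row apply(List<FieldMapping> mappings, Map<String, Object> record) {
    Row row = new Row();
    for (FieldMapping m : mappings) {
      String value = String.valueOf(record.get(m.source));
      if ("key".equals(m.type)) {
        row.key = value;                                    // field becomes the row key
      } else if ("column".equals(m.type)) {
        row.cells.put(m.family + ":" + m.qualifier, value); // field becomes one cell
      }
    }
    return row;
  }

  public static void main(String[] args) {
    Map<String, Object> user = new LinkedHashMap<>();
    user.put("email", "buzz@pixar.com");
    user.put("lastName", "Lightyear");
    Row row = apply(Arrays.asList(
        new FieldMapping("email", "key", null, null),
        new FieldMapping("lastName", "column", "name", "last")), user);
    System.out.println(row.key + " " + row.cells);
  }
}
```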
21. Command-line demo?
1. Describe your data
dataset obj-schema org.movielens.Rating --jar app.jar \
    --output rating.avsc
2. Describe your layout
dataset partition-config ts:year ts:month ts:day \
    --schema rating.avsc --output ymd.json
3. Create a dataset
dataset create ratings --schema rating.avsc \
    --partition-by ymd.json
22. Command-line tool
• Executable jar download
• Inspects the environment
• Must be used on-cluster
• Classpath for HBase, Hive, etc.
• Debugging:
debug=true ./dataset -v <command>
• Requires MAPRED_HOME variable on CDH5
25. Maven parent POM
• Automatic Kite and Hadoop dependencies
• Inherit from kite-app-parent-cdh4
• CDH4 only, CDH5 support in 0.16.0
<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh4</artifactId>
  <version>0.15.0</version>
</parent>
26. Maven Plugin
• Maven plugin manages datasets for an application
• Configured by app-parent POM
• Handles create, update, etc. in maven goals
27. MapReduce
• DatasetKeyInputFormat
• DatasetKeyOutputFormat
• Values are always null
View eventsBeforeToday = Datasets
    .load("dataset:hive:/data/events")
    .toBefore("timestamp", startOfToday());
DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);
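The `startOfToday()` helper is not defined on the slide. One possible implementation is sketched below, assuming the `timestamp` field holds epoch milliseconds and that "today" is measured in UTC; the class name `TimeBounds` and the `startOfDay` method are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;

/** Illustrative only: a possible startOfToday() for use as a view boundary. */
final class TimeBounds {

  /** Midnight at the start of the day containing 'now', in epoch milliseconds. */
  static long startOfDay(Instant now, ZoneId zone) {
    return now.atZone(zone)
        .toLocalDate()
        .atStartOfDay(zone)
        .toInstant()
        .toEpochMilli();
  }

  /** Assumes UTC; pass a different zone to startOfDay if the data uses local time. */
  static long startOfToday() {
    return startOfDay(Instant.now(), ZoneOffset.UTC);
  }

  public static void main(String[] args) {
    System.out.println(startOfToday());
  }
}
```

Taking the clock reading as a parameter in `startOfDay` keeps the boundary calculation testable, and a job that computes it once at submission time reads a stable set of records even if it runs past midnight.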
28. Crunch
• CrunchDatasets.asSource
• CrunchDatasets.asTarget
PCollection<Event> events = getPipeline().read(
    CrunchDatasets.asSource(eventsBeforeToday));
• Handle-existing support in 0.16.0
• Configure dependencies with Kite parent POM
29. DatasetSink
• Write to HDFS Avro and HBase
• http://tiny.cloudera.com/DatasetSink
• Proxy user support
• Automatic partitioning
agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
agent.sinks.name.kite.dataset.name = events
agent.sinks.name.auth.proxyUser = cloudera