Learn who is best suited to attend the full training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Developer Jesse Anderson will discuss the skills you will attain during the course, how they will help you make the most of your HBase deployment in development or production, and how the course prepares you for the Cloudera Certified Specialist in Apache HBase (CCSHB) exam.
2. Agenda
• Why Cloudera Training?
• Target Audience and Prerequisites
• Course Outline
• Short Presentation Based on Actual Course Material
- Using Scans to Access Data
• Q&A
3. Rising demand for Big Data
and analytics experts but a
DEFICIENCY OF TALENT
will result in a shortfall of
32,000 trained professionals by 2015
Source: Accenture, "Analytics in Action," March 2013.
4. Employees from
55% of the Fortune 100
have attended live Cloudera training
Cloudera has trained Big Data professionals from
100% of the top 20 global technology firms to use Hadoop
Source: Fortune, "Fortune 500" and "Global 500," May 2012.
Cloudera Trains the Top Companies
5. Developer Training
Learn to code and write MapReduce programs for production
Master advanced API topics required for real-world data analysis
Data Analyst Training
Run full analyses natively on Big Data without BI software
Eliminate complexity to perform ad hoc queries in real time
Intro to Data Science
Implement recommenders and data experiments
Draw actionable insights from analysis of disparate data
HBase Training
Design schemas to minimize latency on massive data sets
Scale hundreds of thousands of operations per second
Learning Path: Developers
6. Administrator Training
Configure, install, and monitor clusters for optimal performance
Implement security measures and multi-user functionality
Enterprise Training
Use Cloudera Manager to speed deployment and scale the cluster
Learn which tools and techniques improve cluster performance
HBase Training
Implement massively distributed, columnar storage at scale
Enable random, real-time read/write access to all data
Data Analyst Training
Vertically integrate basic analytics into data management
Transform and manipulate data to drive high-value utilization
Learning Path: Administrators
7. 1 Broadest Range of Courses
Developer, Admin, Analyst, HBase, Data Science
2 Most Experienced Instructors
Over 15,000 students trained since 2009
3 Leader in Certification
Over 5,000 accredited Cloudera professionals
4 State of the Art Curriculum
Classes updated regularly as Hadoop evolves
5 Widest Geographic Coverage
Most classes offered: 50 cities worldwide plus online
6 Most Relevant Platform & Community
CDH deployed more than all other distributions combined
7 Depth of Training Material
Hands-on labs and VMs support live instruction
8 Ongoing Learning
Video tutorials and e-learning complement training
Why Cloudera Training?
8. Cloudera is the best vendor evangelizing
the Big Data movement and is doing a
great service promoting Hadoop in the
industry. Developer training was a great
way to get started on my journey.
10. This course was created for people in developer and operations roles,
including
–Developers
–DevOps
–Database Administrators
–Data Warehouse Engineers
–Administrators
It is also useful for others who want to access HBase, such as
–Business Intelligence Developers
–ETL Developers
–Quality Assurance Engineers
Intended Audience
11. Developers who want to learn details of MapReduce programming
–Recommend Cloudera Developer Training for Apache Hadoop
System administrators who want to learn how to install/configure tools
–Recommend Cloudera Administrator Training for Apache Hadoop
Who Should Not Take this Course
12. No prior knowledge of Hadoop is required
What is required is an understanding of
–Basic end-user UNIX commands
Also helpful, but optional, is an understanding of
–Basic relational database concepts
–Basic knowledge of SQL
Course Prerequisites
SELECT id, first_name, last_name
FROM customers
ORDER BY last_name;
$ mkdir /data
$ cd /data
$ rm /home/tomwheeler/salesreport.txt
13. During this course, you will learn:
The core technologies of Apache HBase
How HBase and HDFS work together
How to work with the HBase shell, Java API, and Thrift API
The HBase storage and cluster architecture
The fundamentals of HBase administration
Best practices for installing and configuring HBase
Advanced features of the HBase API
The importance of schema design in HBase
How to work with HBase ecosystem projects
Course Objectives
14. Hadoop Introduction
–Hands-On Exercise - Using HDFS
Introduction to HBase
HBase Concepts
–Hands-On Exercise - HBase Data Import
The HBase Administration API
–Hands-On Exercise - Using the HBase Shell
Accessing Data with the HBase API Part 1
–Hands-On Exercise - Data Access in the HBase Shell
Accessing Data with the HBase API Part 2
–Hands-On Exercise - Using the Developer API
Course Outline
15. Accessing Data with the HBase API Part 3
–Hands-On Exercise - Filters
HBase Architecture Part 1
–Hands-On Exercise - Exploring HBase
HBase Architecture Part 2
–Hands-On Exercise - Flushes and Compactions
Installation and Configuration Part 1
Installation and Configuration Part 2
–Hands-On Exercise - Administration
Row Key Design in HBase
Course Outline (cont’d)
16. Schema Design in HBase
–Hands-On Exercise - Detecting Hot Spots
The HBase Ecosystem
–Hands-On Exercise - Hive and HBase
Course Outline (cont’d)
17. A Scan can be used when:
–The exact row key is not known
–A group of rows needs to be accessed
Scans can be bounded by a start and stop row key
–The start row key is included in the results
–The stop row is not included in the results; the Scan will exhaust its data upon hitting the stop row key
Scans can be limited to certain column families or column descriptors
Scans
18. A scan without a start and stop row will scan the entire table
With a start row of "jordena" and a stop row of "turnerb"
–The scan will return all rows starting at "jordena" and will not include "turnerb"
Scanning
Row key      Users Table
aaronsona    fname: Aaron    lname: Aaronson
harrise      fname: Ernest   lname: Harris
jordena      fname: Adam     lname: Jorden
laytonb      fname: Bennie   lname: Layton
millerb      fname: Billie   lname: Miller
nununezw     fname: Willam   lname: Nunez
rossw        fname: William  lname: Ross
sperberp     fname: Phyllis  lname: Sperber
turnerb      fname: Brian    lname: Turner
walkerm      fname: Martin   lname: Walker
zykowskiz    fname: Zeph     lname: Zykowski
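The start-inclusive, stop-exclusive scan behavior above can be sketched in plain Python, with no cluster required. This is only an illustration of the semantics (a real scan runs server-side over sorted row keys); the dictionary below mirrors a subset of the sample Users table:

```python
# Illustrative sketch of HBase scan bounds: start row included, stop row excluded.
users = {
    "aaronsona": "Aaron Aaronson",
    "harrise": "Ernest Harris",
    "jordena": "Adam Jorden",
    "laytonb": "Bennie Layton",
    "millerb": "Billie Miller",
    "rossw": "William Ross",
    "sperberp": "Phyllis Sperber",
    "turnerb": "Brian Turner",
    "walkerm": "Martin Walker",
    "zykowskiz": "Zeph Zykowski",
}

def scan(table, startrow=None, stoprow=None):
    """Yield (row_key, value) pairs in sorted key order, honoring scan bounds."""
    for key in sorted(table):
        if startrow is not None and key < startrow:
            continue  # before the start row: skip
        if stoprow is not None and key >= stoprow:
            break     # the stop row itself is excluded
        yield key, table[key]

keys = [k for k, _ in scan(users, "jordena", "turnerb")]
print(keys)  # ['jordena', 'laytonb', 'millerb', 'rossw', 'sperberp']
```

With no bounds, the same function walks the whole table, matching the "scan without a start and stop row" case on the slide.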
19. Retrieve a group of rows with scan
General form:
hbase> scan 'tablename' [, options]
Examples:
hbase> scan 'table1'
hbase> scan 'table1', {LIMIT => 10}
hbase> scan 'table1', {STARTROW => 'start', STOPROW => 'stop'}
hbase> scan 'table1', {COLUMNS => ['fam1:col1', 'fam2:col2']}
Scanning Rows With scan in HBase Shell
20. Scan Java API: Complete Code
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
String rowKey = Bytes.toString(r.getRow());
byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
String user = Bytes.toString(b);
}
rs.close();
21. Scan Java API: Scan and ResultScanner
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
String rowKey = Bytes.toString(r.getRow());
byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
String user = Bytes.toString(b);
}
rs.close();
The Scan object is created and will scan all rows. The scan
is executed on the table and a ResultScanner object is
returned.
22. Scan Java API: Iterating
Scan s = new Scan();
ResultScanner rs = table.getScanner(s);
for (Result r : rs) {
String rowKey = Bytes.toString(r.getRow());
byte[] b = r.getValue(FAMILY_BYTES, COLUMN_BYTES);
String user = Bytes.toString(b);
}
rs.close();
Using a for loop, you iterate through all Result objects
in the ResultScanner. Each Result can be used to get
the values.
24. Python Scan Code: Open Scanner
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row[0].columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
Call scannerOpen to create a scan object on the Thrift
server. This returns a scanner id that uniquely identifies the
scanner on the server.
25. Python Scan Code: Get the List
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row[0].columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The scannerGet method needs to be called with the
unique id. It returns a list containing the next row of results.
26. Python Scan Code: Iterating Through
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row[0].columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The while loop continues as long as the scanner returns a
new row. Columns must be addressed with the column family,
":", and the column descriptor. row is populated by
another call to scannerGet and the loop repeats.
27. Python Scan Code: Closing the Scanner
scannerId = client.scannerOpen("tablename")
row = client.scannerGet(scannerId)
while row:
    columnvalue = row[0].columns.get(columnwithcf).value
    row = client.scannerGet(scannerId)
client.scannerClose(scannerId)
The scannerClose method call is very important. It
closes the Scan object on the Thrift server. Not calling this
method can leak Scan objects on the server.
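Because a forgotten scannerClose leaks scanner state on the Thrift server, one common pattern is to wrap the open/get/close sequence in a try/finally block so the scanner is released even if iteration raises. The sketch below is runnable without a server: FakeClient is a hypothetical stand-in for the Thrift-generated client, not part of any real library.

```python
class FakeClient:
    """Hypothetical stand-in for the Thrift-generated HBase client (assumption)."""
    def __init__(self, rows):
        self._rows = list(rows)
        self.open_scanners = set()
        self._next_id = 0

    def scannerOpen(self, tablename):
        self._next_id += 1
        self.open_scanners.add(self._next_id)
        return self._next_id

    def scannerGet(self, scanner_id):
        # Returns a one-row list, or an empty list when the scan is exhausted.
        return [self._rows.pop(0)] if self._rows else []

    def scannerClose(self, scanner_id):
        self.open_scanners.discard(scanner_id)

def scan_all(client, tablename):
    """Collect every row, guaranteeing the scanner is closed even on error."""
    scanner_id = client.scannerOpen(tablename)
    results = []
    try:
        row = client.scannerGet(scanner_id)
        while row:
            results.append(row[0])
            row = client.scannerGet(scanner_id)
    finally:
        client.scannerClose(scanner_id)  # never leak the server-side scanner
    return results

client = FakeClient(["r1", "r2", "r3"])
rows = scan_all(client, "tablename")
print(rows)                  # ['r1', 'r2', 'r3']
print(client.open_scanners)  # set() -- the scanner was closed
```

The same try/finally shape applies directly to the real Thrift client, since scan_all only relies on the three scanner calls shown in the slides.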
28. Scan results can be retrieved in batches to improve performance
–Performance will improve but memory usage will increase
Java API:
Scan s = new Scan();
s.setCaching(20);
Python with Thrift:
rowsArray = client.scannerGetList(scannerId, 10)
Scanner Caching
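The trade-off can be illustrated without a cluster: with a caching (batch) size of n, the client needs roughly ceil(rows / n) round trips to the server instead of one per row, at the cost of holding up to n rows in memory at once. A rough, HBase-free sketch of that arithmetic:

```python
import math

def round_trips(total_rows, caching):
    """Approximate client-server round trips needed to stream a scan."""
    return math.ceil(total_rows / caching)

# A 10,000-row scan: caching=1 means one round trip per row;
# caching=20 cuts that to 500 calls (but buffers 20 rows at a time).
print(round_trips(10_000, 1))   # 10000
print(round_trips(10_000, 20))  # 500
```

This is why the slides note that performance improves while memory usage increases: larger batches amortize network latency across more rows per call.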
Editor's notes
scan 'table1': Scans the entire table
scan 'table1', {LIMIT => 10}: Scans the first 10 rows in the table
scan 'table1', {STARTROW => 'start', STOPROW => 'stop'}: Scans between the start and stop rows
scan 'table1', {COLUMNS => ['fam1:col1', 'fam2:col2']}: Scans the entire table for just those two columns
The full code listing. A virtual line-by-line discussion follows.
Note that for the Python code, the row comes back as an array.