Crunch Big Data in the Cloud
with IBM BigInsights
and Hadoop
IBD-3475
Leons Petrazickis, IBM Canada

@leonsp

© 2013 IBM Corporation
Please note
IBM’s statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be
incorporated into any contract. The development, release, and timing of any
future features or functionality described for our products remains at our sole
discretion.

Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job
stream, the I/O configuration, the storage configuration, and the workload
processed. Therefore, no assurance can be given that an individual user will
achieve results similar to those stated here.
First step
• Request a lab environment: http://bit.ly/requestLab
• BigDataUniversity.com
Hadoop Architecture
Agenda
• Terminology review
• Hadoop architecture
– HDFS
– Blocks
– MapReduce
– Types of nodes
– Topology awareness
– Writing a file to HDFS

6
Terminology review
• A Hadoop cluster is made up of racks: Rack 1, Rack 2, …, Rack n
• Each rack holds nodes: Node 1, Node 2, …, Node n

7
Hadoop architecture
• Two main components:
– Hadoop Distributed File System (HDFS)
– MapReduce Engine

8
Hadoop distributed file system (HDFS)
• A distributed file system that runs on top of each node's existing file system
• Designed to handle very large files with streaming data access patterns
• Uses blocks to store a file or parts of a file

9
HDFS - Blocks
• File blocks
  – 64MB (default), 128MB (recommended); compare to 4KB in UNIX
  – Behind the scenes, one HDFS block is supported by multiple operating system (OS) blocks
• Advantages of blocks:
  – Fixed size: easy to calculate how many fit on a disk
  – A file can be larger than any single disk in the network
  – If a file or a chunk of the file is smaller than the block size, only the needed space is used. E.g. a 420MB file is split as 128MB + 128MB + 128MB + 36MB
  – Fits well with replication to provide fault tolerance and availability

10
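The 420MB split above can be sketched in a few lines of Python (an illustration of the arithmetic only, not HDFS code; the helper name is invented):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the recommended HDFS block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies.

    Only the final block may be smaller than block_size: HDFS does not
    pad a file out to a full block, so only the needed space is used.
    """
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

# A 420 MB file is split as 128 MB + 128 MB + 128 MB + 36 MB:
mb = 1024 * 1024
print([b // mb for b in split_into_blocks(420 * mb)])  # [128, 128, 128, 36]
```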
HDFS - Replication
• Blocks with data are replicated to multiple nodes (e.g. Node 1, Node 2, Node 3)
• Allows for node failure without data loss

11
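A toy Python sketch of the idea (the round-robin placement policy here is invented for illustration; real HDFS placement is rack-aware):

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes len(nodes) >= replication. This sketch only shows why losing
    one node loses no data: every block also lives on other nodes.
    """
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

p = place_replicas(["b1", "b2", "b3"], ["node1", "node2", "node3", "node4"])
# Every block is held by three distinct nodes, so any single node can fail:
assert all(len(set(holders)) == 3 for holders in p.values())
```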
MapReduce engine
• Technology from Google
• A MapReduce program consists of map and reduce
functions
• A MapReduce job is broken into tasks that run in
parallel

12
Types of nodes - Overview
• HDFS nodes
– NameNode
– DataNode
• MapReduce nodes
– JobTracker
– TaskTracker
• There are other nodes not discussed in this course

13
Types of nodes - Overview

14
Types of nodes - NameNode
• NameNode
  – Only one per Hadoop cluster
  – Manages the filesystem namespace and metadata
  – A single point of failure, mitigated by writing its state to multiple filesystems
  – Because it is a single point of failure, do not use inexpensive commodity hardware for this node; it also has large memory requirements

15
Types of nodes - DataNode
• DataNode
  – Many per Hadoop cluster
  – Manages blocks with data and serves them to clients
  – Periodically reports the list of blocks it stores to the NameNode
  – Use inexpensive commodity hardware for this node

16
Types of nodes - JobTracker
• JobTracker node
– One per Hadoop cluster
– Receives job requests submitted by client
– Schedules and monitors MapReduce jobs on task
trackers

17
Types of nodes - TaskTracker
• TaskTracker node
– Many per Hadoop cluster
– Executes MapReduce operations
– Reads blocks from DataNodes

18
…lesson continued in the next video

19
Topology awareness
Bandwidth becomes progressively smaller in the following scenarios:
1. Process on the same node
2. Different nodes on the same rack
3. Nodes on different racks in the same data center
4. Nodes in different data centers
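Hadoop expresses this as a network distance between nodes. A simplified Python sketch of the idea, assuming /datacenter/rack/node path strings (the two-per-hop distance rule mirrors Hadoop's convention, but this is an illustration, not Hadoop code):

```python
def distance(path_a, path_b):
    """Network distance between two nodes given /datacenter/rack/node paths.

    Each hop up to and down from the closest shared ancestor adds 1, so
    the four scenarios above yield distances 0, 2, 4, and 6 respectively.
    """
    a, b = path_a.strip("/").split("/"), path_b.strip("/").split("/")
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)

print(distance("/d1/r1/n1", "/d1/r1/n1"))  # 0: same node
print(distance("/d1/r1/n1", "/d1/r1/n2"))  # 2: same rack
print(distance("/d1/r1/n1", "/d1/r2/n3"))  # 4: same data center
print(distance("/d1/r1/n1", "/d2/r3/n4"))  # 6: different data centers
```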
Writing a file to HDFS
(this topic was presented as a sequence of diagrams stepping through the write pipeline; the illustrations are not preserved in this text version)
Thank You
What is Hadoop?
Agenda
• What is Hadoop?
• What is Big Data?
• Hadoop-related open source projects
• Examples of Hadoop in action
• Big Data solutions and the Cloud

38
What is Hadoop?
(a sequence of diagrams: data volumes grow from 1GB through 10GB, 100GB, 1TB, and 10TB to 100TB and beyond; relational databases serve the low end, while sources such as Facebook, Twitter, sensors, and RFIDs generate data at the high end)
What is Hadoop?
• Open source project, written in Java
• Optimized to handle:
  – Massive amounts of data, through parallelism
  – A variety of data (structured, unstructured, semi-structured)
• Uses inexpensive commodity hardware
• Great performance
• Reliability provided through replication
• Not for OLTP, not for OLAP/DSS; good for Big Data
• Current version: 0.20.2

47
What is Big Data?
RFID Readers

48
What is Big Data?
2 Billion internet users

49
What is Big Data?
4.6 Billion mobile phones

50
What is Big Data?
7TB of data processed by Twitter every day

51
What is Big Data?
10TB of data processed by Facebook every day


52
What is Big Data?
About 80% of this data is unstructured

53
Hadoop-related open source projects
• Jaql
• Pig
• ZooKeeper
54
Examples of Hadoop in action – IBM Watson

55
Examples of Hadoop in action
• In the telecommunication industry
• In the media
• In the technology industry

56
Hadoop is not for all types of work
• Not to process transactions (random access)
• Not good when work cannot be parallelized
• Not good for low-latency data access
• Not good for processing lots of small files
• Not good for intensive calculations with little data

57
Big Data solutions and the Cloud
• Big Data solutions are more than just Hadoop
  – Add business intelligence/analytics functionality
  – Derive information from data in motion
• Big Data solutions and the Cloud are a perfect fit
  – The Cloud allows you to set up a cluster of systems in minutes, and it is relatively inexpensive
58
Thank You
HDFS – Command Line
Agenda
• HDFS Command Line Interface
• Examples

61
HDFS Command line interface
• File System Shell (fs)

• Invoked as follows:

hadoop fs <args>
• Example:
Listing the current directory in HDFS:

hadoop fs -ls .
62
HDFS Command line interface
• FS shell commands take path URIs as arguments
• URI format:

scheme://authority/path

• Scheme:
  – For the local filesystem, the scheme is file
  – For HDFS, the scheme is hdfs

hadoop fs -copyFromLocal file://myfile.txt hdfs://localhost/user/keith/myfile.txt

• Scheme and authority are optional; defaults are taken from the configuration file core-site.xml

63
HDFS Command line interface
• Many POSIX-like commands
• cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail
• Some HDFS-specific commands
• copyFromLocal, copyToLocal, get, getmerge, put, setrep

64
HDFS – Specific commands
• copyFromLocal / put
• Copy files from the local file system into fs

hadoop fs -copyFromLocal <localsrc> .. <dst>
Or

hadoop fs -put <localsrc> .. <dst>

65
HDFS – Specific commands
• copyToLocal / get
• Copy files from fs into the local file system

hadoop fs -copyToLocal [-ignorecrc] [-crc]
<src> <localdst>
Or

hadoop fs -get [-ignorecrc] [-crc]
<src> <localdst>

66
HDFS – Specific commands
• getmerge
  – Gets all the files in the directories that match the source file pattern
  – Merges and sorts them into a single file on the local file system
  – <src> is kept

hadoop fs -getmerge <src> <localdst>

67
HDFS – Specific commands
• setrep
  – Sets the replication level of a file
  – The -R flag requests a recursive change of replication level for an entire tree
  – If -w is specified, waits until the new replication level is achieved

hadoop fs -setrep [-R] [-w] <rep> <path/file>

68
Thank You
Hadoop MapReduce
Agenda
• Map operations
• Reduce operations
• Submitting a MapReduce job
• Distributed mergesort engine
• Two fundamental data types
• Fault tolerance
• Scheduling
• Task execution

71
What is a Map operation?
• Doing something to every element in an array is a common operation:

var a = [1,2,3];
for (i = 0; i < a.length; i++)
a[i] = a[i] * 2;

72
What is a Map operation?
• Doing something to every element in an array is a common operation:

var a = [1,2,3];
for (i = 0; i < a.length; i++)
a[i] = a[i] * 2;
• New value for variable a would be:

var a = [2,4,6];

73
What is a Map operation?
• Doing something to every element in an array is a common operation:

var a = [1,2,3];
for (i = 0; i < a.length; i++)
  a[i] = a[i] * 2;

• New value for variable a would be:

var a = [2,4,6];

• This can be written as a function:

What is a Map operation?
• Doing something to every element in an array is a common operation:

var a = [1,2,3];
for (i = 0; i < a.length; i++)
  a[i] = fn(a[i]);

…like this, where fn is a function defined as:

function fn(x) { return x*2; }

• New value for variable a would be:

var a = [2,4,6];
What is a Map operation?
• Doing something to every element in an array is a common operation:

var a = [1,2,3];
for (i = 0; i < a.length; i++)
a[i] = fn(a[i]);

Now, all of this can also be
converted into a “map” function

76
What is a Map operation?
• …like this, where fn is a function passed as an argument:

function map(fn, a) {
for (i = 0; i < a.length; i++)
a[i] = fn(a[i]);
}

77
What is a Map operation?
• …like this, where fn is a function passed as an argument:

function map(fn, a) {
for (i = 0; i < a.length; i++)
a[i] = fn(a[i]);
}
• You can invoke this map function like this:

map(function(x){return x*2;}, a);

78
What is a Map operation?
• …like this, where fn is a function passed as an argument:

function map(fn, a) {
for (i = 0; i < a.length; i++)
a[i] = fn(a[i]);
}
• You can invoke this map function like this:

map(function(x){return x*2;}, a);
This is function fn whose definition is included in the call

79
What is a Map operation?
• In summary, now you can rewrite:

for (i = 0; i < a.length; i++)
  a[i] = a[i] * 2;

as a map operation:

map(function(x){return x*2;}, a);

80
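For comparison, the same map operation in Python (Python's built-in map returns a lazy iterator, so list() forces the result):

```python
def double(x):
    return x * 2

a = [1, 2, 3]

# Python's built-in map applies a function to every element:
assert list(map(double, a)) == [2, 4, 6]

# Equivalently, with an anonymous function, as in the JavaScript example:
assert list(map(lambda x: x * 2, a)) == [2, 4, 6]
```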
What is a Reduce operation?
• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;
for (i = 0; i < a.length; i++)

s += a[i];
return s;
}

81
What is a Reduce operation?
• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;
for (i = 0; i < a.length; i++)

s += a[i];
return s;
}

82

What is a Reduce operation?
• Another common operation on arrays is to combine all their values. This can be written as a function:

function sum(a) {
  var s = 0;
  for (i = 0; i < a.length; i++)
    s = fn(s, a[i]);
  return s;
}

…like this, where function fn is defined so it adds its arguments:

function fn(a,b){ return a+b; }
What is a Reduce operation?
• Another common operation on arrays is to combine all their values:

function sum(a) {

var s = 0;
for (i = 0; i < a.length; i++)

s = fn(s, a[i]);
return s;
}
The whole function sum can also be rewritten so that fn is passed as an
argument
84
What is a Reduce operation?
• Another common operation on arrays is to combine all their values:

function reduce(fn, a, init) {

var s = init;
for (i = 0; i < a.length; i++)

s = fn(s, a[i]);
return s;
}
Like this… The function name was changed to reduce, and now it takes
three arguments, a function, an array, and an initial value
85
What is a Reduce operation?
• In summary, you can rewrite:

function sum(a) {
  var s = 0;
  for (i = 0; i < a.length; i++)
    s += a[i];
  return s;
}

as a reduce operation:

reduce(function(a,b){return a+b;}, a, 0);

86
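The same reduce operation in Python; functools.reduce plays the role of the generic reduce built above:

```python
from functools import reduce

a = [1, 2, 3]

# reduce folds the array into a single value, starting from an initial value:
total = reduce(lambda s, x: s + x, a, 0)
assert total == 6

# The generic reduce from the slides, written out explicitly:
def my_reduce(fn, arr, init):
    s = init
    for x in arr:
        s = fn(s, x)
    return s

assert my_reduce(lambda s, x: s + x, a, 0) == 6
```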
…lesson continued in the next video

87
Submitting a MapReduce job
(this topic was presented as a sequence of diagrams stepping through job submission; the illustrations are not preserved in this text version)
…lesson continued in the next video

98
MapReduce – Distributed Mergesort Engine
(this topic was presented as a sequence of diagrams; the illustrations are not preserved in this text version)
…lesson continued in the next video

110
Two fundamental data types
• Key/value pairs
• Lists

Data flow: the map function takes input <k1, v1> and produces list(<k2, v2>); the reduce function takes <k2, list(v2)> and produces the output list(<k3, v3>)
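A minimal word count in plain Python illustrates how these types line up (no Hadoop involved; the shuffle that turns list(<k2, v2>) into <k2, list(v2)> is simulated with a dictionary):

```python
from collections import defaultdict

def map_fn(offset, line):
    # <k1, v1> = (offset, line) -> list(<k2, v2>) of (word, 1) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # <k2, list(v2)> -> list(<k3, v3>): one (word, total) pair
    return [(word, sum(counts))]

lines = ["big data big cloud", "big insights"]

# Map phase, then a simulated shuffle grouping values by key:
grouped = defaultdict(list)
for offset, line in enumerate(lines):
    for word, one in map_fn(offset, line):
        grouped[word].append(one)

# Reduce phase:
output = sorted(pair for word, counts in grouped.items()
                for pair in reduce_fn(word, counts))
print(output)  # [('big', 3), ('cloud', 1), ('data', 1), ('insights', 1)]
```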
Simple data flow example
(this topic was presented as a sequence of diagrams; the illustrations are not preserved in this text version)
…lesson continued in the next video

121
Fault tolerance
• Task failure
  – If a child task fails, the child JVM reports to the TaskTracker before it exits. The attempt is marked failed, freeing up a slot for another task.
  – If the child task hangs, it is killed. The JobTracker reschedules the task on another machine.
  – If the task continues to fail, the job is failed.
• TaskTracker failure
  – The JobTracker receives no heartbeat and removes the TaskTracker from the pool of TaskTrackers to schedule tasks on.
• JobTracker failure
  – Single point of failure: the job fails.
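The retry behaviour described above can be sketched as follows (an illustration only; the function names are invented, and max_attempts stands in for Hadoop's configurable attempt limit):

```python
def run_with_retries(task, max_attempts=4):
    """Re-run a failing task, conceptually on another machine each time.

    Mirrors the behaviour above: each failed attempt frees its slot and
    is rescheduled; once max_attempts is exceeded, the job fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            continue  # attempt marked failed; reschedule elsewhere
    raise RuntimeError("job failed: task exceeded %d attempts" % max_attempts)

# A stand-in task that only succeeds on its third attempt:
def flaky(attempt):
    if attempt < 3:
        raise RuntimeError("task failure")
    return "done"

assert run_with_retries(flaky) == "done"
```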
…lesson continued in the next video

132
Scheduling
• FIFO scheduler (with priorities)
  – Each job uses the whole cluster, so jobs wait their turn.
• Fair scheduler
  – Jobs are placed in pools. A user who submits more jobs than another user will not, on average, get more cluster resources than the other user. Custom pools can be defined with a guaranteed minimum capacity.
• Capacity scheduler
  – Allows Hadoop to simulate, for each user, a separate MapReduce cluster with FIFO scheduling.
Task execution
• Speculative execution
  – Job execution time is sensitive to slow-running tasks. Hadoop detects slow-running tasks and launches another, equivalent task as a backup. The output from the first of these tasks to finish is used.
• Task JVM reuse
  – Tasks run in their own JVMs for isolation. Jobs that have a large number of short-lived tasks, or tasks with lengthy initialization, can benefit from sequential JVM reuse through configuration.
Thank You
Pig, Hive, and JAQL
Agenda
• Overview
• Pig
• Hive
• Jaql
Similarities of Pig, Hive and Jaql
• All translate their respective high-level languages to MapReduce jobs
• All offer significant reductions in program size over Java
• All provide points of extension to cover gaps in functionality
• All provide interoperability with other languages
• None support random reads/writes or low-latency queries

149
Comparing Pig, Hive, and Jaql

                      Pig                  Hive                       Jaql
Developed by          Yahoo!               Facebook                   IBM
Language name         Pig Latin            HiveQL                     Jaql
Type of language      Data flow            Declarative (SQL dialect)  Data flow
Data structures it    Complex              Geared towards             Loosely structured
operates on                                structured data            data, JSON
Schema optional?      Yes                  No, but data can           Yes
                                           have many schemas
Turing complete?      Yes, when extended   Yes, when extended         Yes
                      with Java UDFs       with Java UDFs

150
Agenda
• Overview
• Pig
• Hive
• Jaql

151
Pig components
• Two components:
  – Language (called Pig Latin)
  – Compiler
• Two execution environments:
  – Local (single JVM): pig -x local
  – Distributed (Hadoop cluster): pig -x mapreduce, or simply pig

152
Running Pig
• Script: pig scriptfile.pig
• Grunt (command line): pig (to launch the command-line tool)
• Embedded: call in to Pig from Java

153
Pig Latin sample code
#pig
grunt> records = LOAD 'econ_assist.csv'
         USING PigStorage(',')
         AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
         GENERATE group, SUM(records.sum);
grunt> DUMP thesum;

154
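For comparison, the group-and-sum that this Pig script performs can be written in plain Python over in-memory rows (the sample rows here are invented):

```python
from collections import defaultdict

# (country, sum) tuples, as loaded from econ_assist.csv:
rows = [("Canada", 100), ("Brazil", 50), ("Canada", 25)]

# Equivalent of GROUP records BY country, then summing per group:
totals = defaultdict(int)
for country, amount in rows:
    totals[country] += amount

print(sorted(totals.items()))  # [('Brazil', 50), ('Canada', 125)]
```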
Pig Latin – Statements, operations & commands
A Pig Latin program is a series of statements. An operation as a statement (e.g. … LOAD 'input.txt';) or a command as a statement (e.g. ls *.txt) each add to a logical plan; when a statement such as DUMP requires output, the logical plan is compiled into a physical plan and executed.

155
Pig Latin statements
• UDF statements: REGISTER, DEFINE
• Commands:
  – Hadoop filesystem (cat, ls, etc.)
  – Hadoop MapReduce (kill)
  – Utility (exec, help, quit, run, set)
• Operators:
  – Diagnostic: DESCRIBE, EXPLAIN, ILLUSTRATE
  – Relational: LOAD, STORE, DUMP, FILTER, etc.

156
Pig Latin – Relational operators
• Loading and storing: e.g. LOAD (into a program), STORE (to disk), DUMP (to the screen)
• Filtering: e.g. FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE
• Grouping and joining: e.g. JOIN, COGROUP, GROUP, CROSS
• Sorting: e.g. ORDER, LIMIT
• Combining and splitting: e.g. UNION, SPLIT

157
Pig Latin – Relations and schema
• The result of a relational operator is a relation
• A relation is a set of tuples
• Relations can be named using an alias (e.g. x):

x = LOAD 'sample.txt' AS (id:int, year:int);
DUMP x;

• Output is a tuple, e.g.: (1,1987)

158
Pig Latin – Relations and schema
• The structure of a relation is a schema
• Use the DESCRIBE operator to see the schema, e.g.:

DESCRIBE x;

• The output is the schema:

x: {id: int, year: int}

159
Pig Latin expressions
• Statements that contain relational operators may also contain expressions.
• Kinds of expressions: constant, field, projection, map lookup, cast, arithmetic, conditional, boolean, comparison, functional, flatten

160
Pig Latin – Data types
• Simple types: int, long, float, double, bytearray, chararray
• Complex types:
  – Tuple: a sequence of fields of any type
  – Bag: an unordered collection of tuples
  – Map: a set of key-value pairs; keys must be chararray

161
Pig Latin – Function types
• Eval
  – Input: one or more expressions; output: an expression
  – Example: MAX
• Filter
  – Input: a bag or map; output: a boolean
  – Example: IsEmpty

162
Pig Latin – Function types
• Load
  – Input: data from external storage; output: a relation
  – Example: PigStorage
• Store
  – Input: a relation; output: data to external storage
  – Example: PigStorage

163
Pig Latin – User-Defined Functions
• Written in Java
• Packaged in a JAR file
• Register the JAR file using the REGISTER statement
• Optionally, alias it with the DEFINE statement

164
Agenda
• Overview
• Pig
• Hive
• Jaql

165
Hive architecture
DDL statements and queries arrive through JDBC/ODBC clients, the CLI, or the web interface. The parser, planner, and optimizer compile them using the metastore (a relational database for metadata) and execute them on Hadoop.

166
Running Hive
• Hive shell
  – Interactive: hive
  – Script: hive -f myscript
  – Inline: hive -e 'SELECT * FROM mytable'

167
Hive services
hive --service servicename
where servicename can be:
• hiveserver: server for Thrift, JDBC, ODBC clients
• hwi: web interface
• jar: hadoop jar with Hive JARs in the classpath
• metastore: out-of-process metastore

168
Hive - Metastore
• Stores Hive metadata
• Configurations:
  – Embedded: in-process metastore, in-process database
  – Local: in-process metastore, out-of-process database
  – Remote: out-of-process metastore, out-of-process database

169
Hive – Schema-on-read
• Faster loads into the database (simply copy or move files)
• Slower queries
• Flexibility: multiple schemas for the same data

170
Hive - Configuration
• Three ways to configure Hive:
  – hive-site.xml (fs.default.name, mapred.job.tracker, metastore configuration settings)
  – The -hiveconf command-line option
  – The set command in the Hive shell

171
Hive Query Language (HiveQL)
• SQL dialect
• Does not support the full SQL-92 specification
• No support for:
  – HAVING clause in SELECT
  – Correlated subqueries
  – Subqueries outside FROM clauses
  – Updatable or materialized views
  – Stored procedures

172
Sample code
#hive
hive> CREATE TABLE foreign_aid
        (country STRING, sum BIGINT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
      OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
      GROUP BY country;

173
Hive Query Language (HiveQL)
• Extensions
  – MySQL-like extensions
  – MapReduce extensions: multi-table insert; MAP, REDUCE, TRANSFORM clauses
• Data types
  – Simple: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING
  – Complex: ARRAY, MAP, STRUCT

174
Hive Query Language (HiveQL)
• Built-in functions
  – List them with SHOW FUNCTIONS
  – Get help on one with DESCRIBE FUNCTION

175
Hive – User-Defined Functions
• Written in Java
• Three UDF types:
  – UDF: input a single row, output a single row
  – UDAF: input multiple rows, output a single row
  – UDTF: input a single row, output multiple rows
• Register a UDF using ADD JAR
• Create an alias using CREATE TEMPORARY FUNCTION

176
Agenda
• Overview
• Pig
• Hive
• Jaql

177
Jaql architecture
The interactive shell and applications submit scripts to the compiler (parser, rewriter), which sits on an I/O layer over the storage layer: file systems (HDFS, GPFS, local), databases (DBMS, HBase), and streams (Web, pipes).

178
Jaql data model: JSON
• JSON = JavaScript Object Notation
• Flexible (schema is optional)
• Powerful modeling for semi-structured data
• Popular exchange format

179
JSON example
[
  {ACCT_NUM:18, AUTH_DATE:"2011-01-29",
   AUTH_AMT:"111.11", ZIP:98765, MERCH_NAME:"Acme"},
  {ACCT_NUM:19, AUTH_DATE:"2011-01-29",
   AUTH_AMT:"222.22", ZIP:98765, MERCH_NAME:"Exxme",
   NICKNAME:"Xyz"},
  {ACCT_NUM:20, AUTH_DATE:"2011-01-30",
   AUTH_AMT:"3.33", ZIP:12345, MERCH_NAME:"Acme",
   ROUTE:["68.86.85.188","64.215.26.111"]},
  …
]

180
Running Jaql
• Jaql shell
  – Interactive: jaqlshell
  – Batch: jaqlshell -b myscript.jaql
  – Inline: jaqlshell -e jaqlstatement
• Modes
  – Cluster: jaqlshell -c
  – Minicluster: jaqlshell

181
Jaql query language
source -> operator -> … -> operator -> sink

• Sources and sinks
  E.g. copy data from a local file (source) to a new file on HDFS (sink):

read(file("input.json")) -> write(hdfs("output"))

• Core operators: Filter, Transform, Expand, Group, Join, Union, Tee, Sort, Top

182
Jaql query language
• Variables
  – The equals operator (=) binds source output to a variable
    e.g. $tweets = read(hdfs("twitterfeed"))
• Pipes, streams, and consumers
  – The pipe operator (->) streams data to a consumer; a pipe expects an array as input
    e.g. $tweets -> filter $.from_src == 'tweetdeck';
  – $ is an implicit variable referencing the current array value

183
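In Python terms, the pipe streams an array through a consumer much like a comprehension filters a list of records (the sample records here are invented):

```python
tweets = [
    {"from_src": "tweetdeck", "text": "hello"},
    {"from_src": "web", "text": "hi"},
    {"from_src": "tweetdeck", "text": "hadoop"},
]

# Rough analogue of: $tweets -> filter $.from_src == 'tweetdeck';
# where t plays the role of Jaql's implicit $ variable:
filtered = [t for t in tweets if t["from_src"] == "tweetdeck"]
assert [t["text"] for t in filtered] == ["hello", "hadoop"]
```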
Jaql query language
• Categories of built-in functions: system, core, hadoop, io, array, index, schema, xml, regex, binary, date, nil, agg, number, string, function, random, record

184
Jaql – Data storage
• Data store examples: HDFS, HBase, DB2, JDBC, Amazon S3, HTTP, local FS
• Data format examples: JSON, AVRO, CSV, XML

185
Jaql sample code
#jaqlshell -c
jaql> $foreignaid =
        read(del("econ_assist.csv",
          {schema: schema {country: string, sum: long}}));
jaql> $foreignaid
      -> group by $country = ($.country)
         into {$country.country, sum($[*].sum)};

186
Hadoop core lab – Part 3
BigDataUniversity.com
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in
which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for
informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant.
While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without
warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this
presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or
representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use
of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have
achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended
to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other
results.

© Copyright IBM Corporation 2013. All rights reserved.

•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM
Corp.

IBM, the IBM logo, ibm.com, InfoSphere and BigInsights, Streams, and DB2 are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on
their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law
trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.
Communities
• On-line communities, user groups, technical forums, blogs, social networks, and more. Find the community that interests you:
  – Information Management: bit.ly/InfoMgmtCommunity
  – Business Analytics: bit.ly/AnalyticsCommunity
  – Enterprise Content Management: bit.ly/ECMCommunity
• IBM Champions: recognizing individuals who have made the most outstanding contributions to the Information Management, Business Analytics, and Enterprise Content Management communities
  – ibm.com/champion
Thank You
Your feedback is important!
• Access the Conference Agenda Builder to complete your session surveys
  – Any web or mobile browser at http://iod13surveys.com/surveys.html
  – Any Agenda Builder kiosk onsite
  • 1. Crunch Big Data in the Cloud with IBM BigInsights and Hadoop IBD-3475 Leons Petrazickis, IBM Canada @leonsp © 2013 IBM Corporation
  • 2. Please note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  • 3. First step  Request a lab environment  http://bit.ly/requestLab
  • 6. Agenda • Terminology review • Hadoop architecture – HDFS – Blocks – MapReduce – Type of nodes – Topology awareness – Writing a file to HDFS 6
  • 7. Terminology review Hadoop cluster Rack 1 Rack n Rack 2 Node 1 Node 1 Node 1 Node 2 Node 2 Node 2 … … Node n 7 Node n … … Node n
  • 8. Hadoop architecture • Two main components: – Hadoop Distributed File System (HDFS) – MapReduce Engine 8
  • 9. Hadoop distributed file system (HDFS) • Hadoop file system that runs on top of existing file system • Designed to handle very large files with streaming data access patterns • Uses blocks to store a file or parts of a file 9
  • 10. HDFS - Blocks • File Blocks – 64MB (default), 128MB (recommended) – compare to 4KB in UNIX – Behind the scenes, 1 HDFS block is supported by multiple operating system (OS) blocks HDFS Block 128 MB OS Blocks • Advantages of blocks: – Fixed size – easy to calculate how many fit on a disk – A file can be larger than any single disk in the network – If a file or a chunk of the file is smaller than the block size, only needed space is used. Eg: 420MB file is split as: 128MB + 128MB + 128MB + 36MB – Fits well with replication to provide fault tolerance and availability 10
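The 420MB example above can be sketched as a small calculation. This is an illustrative helper (the function name `splitIntoBlocks` is hypothetical, not a Hadoop API); it only models how a file divides into fixed-size HDFS blocks, with the final block using just the space it needs.

```javascript
// Hypothetical sketch: how a file of a given size splits into HDFS blocks.
// The last block is only as large as the remaining data.
function splitIntoBlocks(fileSizeMB, blockSizeMB) {
  const blocks = [];
  let remaining = fileSizeMB;
  while (remaining > 0) {
    blocks.push(Math.min(blockSizeMB, remaining));
    remaining -= blockSizeMB;
  }
  return blocks;
}

console.log(splitIntoBlocks(420, 128)); // [ 128, 128, 128, 36 ]
```

A 420MB file with 128MB blocks yields three full blocks plus one 36MB block, matching the slide.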
  • 11. HDFS - Replication • Blocks with data are replicated to multiple nodes • Allows for node failure without data loss Node 3 Node 1 Node 2 11
  • 12. MapReduce engine • Technology from Google • A MapReduce program consists of map and reduce functions • A MapReduce job is broken into tasks that run in parallel 12
  • 13. Types of nodes - Overview • HDFS nodes – NameNode – DataNode • MapReduce nodes – JobTracker – TaskTracker • There are other nodes not discussed in this course 13
  • 14. Types of nodes - Overview 14
  • 15. Types of nodes - NameNode • NameNode – Only one per Hadoop cluster – Manages the filesystem namespace and metadata – Single point of failure, but mitigated by writing state to multiple filesystems – Single point of failure: Don’t use inexpensive commodity hardware for this node, large memory requirements 15
  • 16. Types of nodes - DataNode • DataNode – Many per Hadoop cluster – Manages blocks with data and serves them to clients – Periodically reports to name node the list of blocks it stores – Use inexpensive commodity hardware for this node 16
  • 17. Types of nodes - JobTracker • JobTracker node – One per Hadoop cluster – Receives job requests submitted by client – Schedules and monitors MapReduce jobs on task trackers 17
  • 18. Types of nodes - TaskTracker • TaskTracker node – Many per Hadoop cluster – Executes MapReduce operations – Reads blocks from DataNodes 18
  • 19. …lesson continued in the next video… 19
  • 20. Topology awareness Bandwidth becomes progressively smaller in the following scenarios: 20
  • 21. Topology awareness Bandwidth becomes progressively smaller in the following scenarios: 1. Process on the same node. 21
  • 22. Topology awareness Bandwidth becomes progressively smaller in the following scenarios: 1. Process on the same node 2. Different nodes on the same rack 22
  • 23. Topology awareness Bandwidth becomes progressively smaller in the following scenarios: 1. Process on the same node 2. Different nodes on the same rack 3. Nodes on different racks in the same data center 23
  • 24. Topology awareness Bandwidth becomes progressively smaller in the following scenarios: 1. Process on the same node 2. Different nodes on the same rack 3. Nodes on different racks in the same data center 4. Nodes in different data centers 24
  • 25. Writing a file to HDFS 25
  • 26. Writing a file to HDFS 26
  • 27. Writing a file to HDFS 27
  • 28. Writing a file to HDFS 28
  • 29. Writing a file to HDFS 29
  • 30. Writing a file to HDFS 30
  • 31. Writing a file to HDFS 31
  • 32. Writing a file to HDFS 32
  • 33. Writing a file to HDFS 33
  • 34. Writing a file to HDFS 34
  • 35. Writing a file to HDFS 35
  • 38. Agenda • What is Hadoop? • What is Big Data? • Hadoop-related open source projects • Examples of Hadoop in action • Big Data solutions and the Cloud 38
  • 44. What is Hadoop? 10TB 100TB 1TB Relational Database 44
  • 45. What is Hadoop? 10TB 100TB 1TB Relational Database 45
  • 46. What is Hadoop? Facebook 10TB 100TB RFIDs 1TB Relational Database Sensors Twitter 46
  • 47. What is Hadoop? • Open source project • Written in Java • Optimized to handle • Massive amounts of data through parallelism • A variety of data (structured, unstructured, semi-structured) • Using inexpensive commodity hardware • Great performance • Reliability provided through replication • Not for OLTP, not for OLAP/DSS, good for Big Data • Current version: 0.20.2 47
  • 48. What is Big Data? RFID Readers 48
  • 49. What is Big Data? 2 Billion internet users 49
  • 50. What is Big Data? 4.6 Billion mobile phones 50
  • 51. What is Big Data? 7TB of data processed by Twitter every day 7TB a day 51
  • 52. What is Big Data? 10TB of data processed by Facebook every day 10TB a day 52
  • 53. What is Big Data? About 80% of this data is unstructured 53
  • 54. Hadoop-related open source projects jaql PIG ZooKeeper 54
  • 55. Examples of Hadoop in action – IBM Watson 55
  • 56. Examples of Hadoop in action • In the telecommunication industry • In the media • In the technology industry 56
  • 57. Hadoop is not for all types of work • Not to process transactions (random access) • Not good when work cannot be parallelized • Not good for low latency data access • Not good for processing lots of small files • Not good for intensive calculations with little data 57
  • 58. Big Data solutions and the Cloud • Big Data solutions are more than just Hadoop – Add business intelligence/analytics functionality – Derive information of data in motion • Big Data solutions and the Cloud are a perfect fit. – The Cloud allows you to set up a cluster of systems in minutes and it’s relatively inexpensive. 58
  • 61. Agenda • HDFS Command Line Interface • Examples 61
  • 62. HDFS Command line interface • File System Shell (fs) • Invoked as follows: hadoop fs <args> • Example: Listing the current directory in hdfs hadoop fs -ls . 62
  • 63. HDFS Command line interface • FS shell commands take path URIs as arguments • URI format: scheme://authority/path • Scheme: • For the local filesystem, the scheme is file • For HDFS, the scheme is hdfs hadoop fs -copyFromLocal file://myfile.txt hdfs://localhost/user/keith/myfile.txt • Scheme and authority are optional • Defaults are taken from configuration file core-site.xml 63
  • 64. HDFS Command line interface • Many POSIX-like commands • cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail • Some HDFS-specific commands • copyFromLocal, copyToLocal, get, getmerge, put, setrep 64
  • 65. HDFS – Specific commands • copyFromLocal / put • Copy files from the local file system into fs hadoop fs -copyFromLocal <localsrc> .. <dst> Or hadoop fs -put <localsrc> .. <dst> 65
  • 66. HDFS – Specific commands • copyToLocal / get • Copy files from fs into the local file system hadoop fs -copyToLocal [-ignorecrc] [-crc] <src> <localdst> Or hadoop fs -get [-ignorecrc] [-crc] <src> <localdst> 66
  • 67. HDFS – Specific commands • getMerge • Get all the files in the directories that match the source file pattern • Merge and sort them to only one file on local fs • <src> is kept hadoop fs -getmerge <src> <localdst> 67
  • 68. HDFS – Specific commands • setRep • Set the replication level of a file. • The -R flag requests a recursive change of replication level for an entire tree. • If -w is specified, waits until new replication level is achieved. hadoop fs -setrep [-R] [-w] <rep> <path/file> 68
  • 71. Agenda • Map operations • Reduce operations • Submitting a MapReduce job • Distributed Mergesort Engine • Two fundamental data types • Fault tolerance • Scheduling • Task execution 71
  • 72. What is a Map operation? • Doing something to every element in an array is a common operation: var a = [1,2,3]; for (i = 0; i < a.length; i++) a[i] = a[i] * 2; 72
  • 73. What is a Map operation? • Doing something to every element in an array is a common operation: var a = [1,2,3]; for (i = 0; i < a.length; i++) a[i] = a[i] * 2; • New value for variable a would be: var a = [2,4,6]; 73
  • 74. What is a Map operation? • Doing something to every element in an array is a common operation: var a = [1,2,3]; for (i = 0; i < a.length; i++) a[i] = a[i] * 2; • New value for variable a would be: var a = [2,4,6]; 74 This can be written as a function
  • 75. What is a Map operation? • Doing something to every element in an array is a common operation: var a = [1,2,3]; for (i = 0; i < a.length; i++) a[i] = fn(a[i]); • New value for variable a would be: var a = [2,4,6]; 75 Like this, where fn is a function defined as: function fn(x) {return x*2;}
  • 76. What is a Map operation? • Doing something to every element in an array is a common operation: var a = [1,2,3]; for (i = 0; i < a.length; i++) a[i] = fn(a[i]); Now, all of this can also be converted into a “map” function 76
  • 77. What is a Map operation? • …like this, where fn is a function passed as an argument: function map(fn, a) { for (i = 0; i < a.length; i++) a[i] = fn(a[i]); } 77
  • 78. What is a Map operation? • …like this, where fn is a function passed as an argument: function map(fn, a) { for (i = 0; i < a.length; i++) a[i] = fn(a[i]); } • You can invoke this map function like this: map(function(x){return x*2;}, a); 78
  • 79. What is a Map operation? • …like this, where fn is a function passed as an argument: function map(fn, a) { for (i = 0; i < a.length; i++) a[i] = fn(a[i]); } • You can invoke this map function like this: map(function(x){return x*2;}, a); This is function fn whose definition is included in the call 79
  • 80. What is a Map operation? • In summary, now you can rewrite: for (i = 0; i < a.length; i++) a[i] = a[i] * 2; as a map operation: map(function(x){return x*2;}, a); 80
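The map function built up across slides 72–80 can be assembled into one runnable sketch; this is the slides' own code, joined together, with a return value added so the result can be inspected.

```javascript
// The generic map operation from the slides: apply fn to every element of a.
function map(fn, a) {
  for (let i = 0; i < a.length; i++) a[i] = fn(a[i]);
  return a; // returned for convenience; the array is modified in place
}

const a = [1, 2, 3];
map(function (x) { return x * 2; }, a);
console.log(a); // [ 2, 4, 6 ]
```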
  • 81. What is a Reduce operation? • Another common operation on arrays is to combine all their values: function sum(a) { var s = 0; for (i = 0; i < a.length; i++) s += a[i]; return s; } 81
  • 82. What is a Reduce operation? • Another common operation on arrays is to combine all their values: function sum(a) { var s = 0; for (i = 0; i < a.length; i++) s += a[i]; return s; } 82 This can be written as a function
  • 83. What is a Reduce operation? • Another common operation on arrays is to combine all their values: function sum(a) { var s = 0; for (i = 0; i < a.length; i++) s = fn(s,a[i]); return s; } 83 Like this, where function fn is defined so it adds its arguments: function fn(a,b){ return a+b; }
  • 84. What is a Reduce operation? • Another common operation on arrays is to combine all their values: function sum(a) { var s = 0; for (i = 0; i < a.length; i++) s = fn(s, a[i]); return s; } The whole function sum can also be rewritten so that fn is passed as an argument 84
  • 85. What is a Reduce operation? • Another common operation on arrays is to combine all their values: function reduce(fn, a, init) { var s = init; for (i = 0; i < a.length; i++) s = fn(s, a[i]); return s; } Like this… The function name was changed to reduce, and now it takes three arguments, a function, an array, and an initial value 85
  • 86. What is a Reduce operation? • Another common operation on arrays is to combine all their values: function sum(a) { var s = 0; for (i = 0; i < a.length; i++) s += a[i]; return s; } as a reduce operation: reduce(function(a,b){return a+b;},a,0); 86
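Likewise, the reduce function from slides 81-86 is runnable JavaScript; a minimal version with the summing callback from the slides:

```javascript
// Combine all values of array a using fn, starting from init (slides 81-86).
function reduce(fn, a, init) {
  var s = init;
  for (var i = 0; i < a.length; i++) {
    s = fn(s, a[i]);
  }
  return s;
}

// Summing [1,2,3] with an initial value of 0:
var total = reduce(function (a, b) { return a + b; }, [1, 2, 3], 0);
// total is 6
```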
  • 99–109. MapReduce – Distributed Mergesort Engine (diagram-only slides; the figures were not captured in this text export)
  • 115. Two Fundamental data types • Key/value pairs • Lists • map: <k1, v1> → list(<k2, v2>) • reduce: <k2, list(v2)> → list(<k3, v3>) 115
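The key/value flow on this slide can be illustrated with a toy word count in plain JavaScript. This is not the Hadoop API; the names mapFn, shuffle, and reduceFn are illustrative, and the shuffle step is done here by hand although Hadoop's framework performs it for you:

```javascript
// map: <k1, v1> -> list(<k2, v2>)  (k1 = line number, v1 = line of text)
function mapFn(lineNo, line) {
  return line.split(/\s+/).filter(Boolean).map(function (w) {
    return [w, 1]; // each word becomes a <k2, v2> pair
  });
}

// shuffle: list(<k2, v2>) -> <k2, list(v2)>  (the framework does this in Hadoop)
function shuffle(pairs) {
  var groups = {};
  pairs.forEach(function (p) {
    (groups[p[0]] = groups[p[0]] || []).push(p[1]);
  });
  return groups;
}

// reduce: <k2, list(v2)> -> <k3, v3>  (word and its total count)
function reduceFn(word, counts) {
  return [word, counts.reduce(function (a, b) { return a + b; }, 0)];
}

var input = { 0: "the cat", 1: "the hat" };
var mapped = [];
Object.keys(input).forEach(function (k) {
  mapped = mapped.concat(mapFn(k, input[k]));
});
var grouped = shuffle(mapped);
var output = Object.keys(grouped).map(function (w) {
  return reduceFn(w, grouped[w]);
});
// output: [["the", 2], ["cat", 1], ["hat", 1]]
```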
  • 116–120. Simple data flow example (diagram-only slides; the figures were not captured in this text export)
  • 126. Fault tolerance • Task Failure • If a child task fails, the child JVM reports to the TaskTracker before it exits. Attempt is marked failed, freeing up slot for another task. • If the child task hangs, it is killed. JobTracker reschedules the task on another machine. • If task continues to fail, job is failed. 126
  • 131. Fault tolerance • TaskTracker Failure • JobTracker receives no heartbeat • Removes TaskTracker from pool of TaskTrackers to schedule tasks on. • JobTracker Failure • Single point of failure. Job fails. 131
  • 139. Scheduling • FIFO scheduler (with priorities) • Each job uses the whole cluster, so jobs wait their turn. • Fair scheduler • Jobs are placed in pools. A user who submits more jobs than another gets, on average, no more cluster resources than the other user. Custom pools with a guaranteed minimum capacity can be defined. • Capacity scheduler • Allows Hadoop to simulate, for each user, a separate MapReduce cluster with FIFO scheduling. 139
  • 144. Task execution • Speculative Execution • Job execution is time sensitive to slow-running tasks. Hadoop detects slow-running tasks and launches another, equivalent task as a backup. The output from the first of these tasks to finish is used. • Task JVM Reuse • Tasks run in their own JVMs for isolation. Jobs that have a large number of short-lived tasks or tasks with lengthy initialization can benefit from sequential JVM reuse through configuration. 144
  • 146. Pig, Hive, and JAQL
  • 149. Similarities of Pig, Hive, and Jaql • All translate their respective high-level languages to MapReduce jobs • All offer significant reductions in program size over Java • All provide points of extension to cover gaps in functionality • All provide interoperability with other languages • None support random reads/writes or low-latency queries 149
  • 150. Comparing Pig, Hive, and Jaql • Developed by: Yahoo! (Pig), Facebook (Hive), IBM (Jaql) • Language name: Pig Latin, HiveQL, Jaql • Type of language: data flow (Pig), declarative SQL dialect (Hive), data flow (Jaql) • Data structures it operates on: complex (Pig), geared towards structured data (Hive), loosely structured data / JSON (Jaql) • Schema optional? Yes (Pig), No but data can have many schemas (Hive), Yes (Jaql) • Turing complete? Yes when extended with Java UDFs (Pig, Hive), Yes (Jaql) 150
  • 152. Pig components • Two components: the language (Pig Latin) and the compiler • Two execution environments: local (single JVM, pig -x local) and distributed (Hadoop cluster, pig -x mapreduce, or simply pig) 152
  • 153. Running Pig • Script: pig scriptfile.pig • Grunt (command line): pig launches the command-line tool • Embedded: call into Pig from Java 153
  • 154. Pig Latin sample code #pig grunt> records = LOAD 'econ_assist.csv' USING PigStorage(',') AS (country:chararray, sum:long); grunt> grouped = GROUP records BY country; grunt> thesum = FOREACH grouped GENERATE group, SUM(records.sum); grunt> DUMP thesum; 154
  • 155. Pig Latin – Statements, operations & commands (diagram): a Pig Latin program mixes operations as statements (e.g. LOAD 'input.txt'; … DUMP …) and commands as statements (e.g. ls *.txt); operations are compiled into a logical plan, then a physical plan, which is executed 155
  • 156. Pig Latin statements • UDF statements: REGISTER, DEFINE • Commands: Hadoop filesystem (cat, ls, etc.), Hadoop MapReduce (kill), utility (exec, help, quit, run, set) • Operators: diagnostic (DESCRIBE, EXPLAIN, ILLUSTRATE), relational (LOAD, STORE, DUMP, FILTER, etc.) 156
  • 157. Pig Latin – Relational operators • Loading and storing, e.g. LOAD (into a program), STORE (to disk), DUMP (to the screen) • Filtering, e.g. FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE • Grouping and joining, e.g. JOIN, COGROUP, GROUP, CROSS • Sorting, e.g. ORDER, LIMIT • Combining and splitting, e.g. UNION, SPLIT 157
  • 158. Pig Latin – Relations and schema • The result of a relational operator is a relation • A relation is a set of tuples • Relations can be named using an alias (e.g. x): x = LOAD 'sample.txt' AS (id:int, year:int); DUMP x • Output is a tuple, e.g. (1,1987) 158
  • 159. Pig Latin – Relations and schema • The structure of a relation is a schema • Use the DESCRIBE operator to see the schema, e.g. DESCRIBE x • The output is the schema: x: {id: int, year: int} 159
  • 160. Pig Latin expressions • Statements that contain relational operators may also contain expressions • Kinds of expressions: constant, field, projection, map lookup, cast, arithmetic, conditional, boolean, comparison, functional, flatten 160
  • 161. Pig Latin – Data types • Simple types: int, long, float, double, chararray, bytearray • Complex types: tuple (sequence of fields of any type), bag (unordered collection of tuples), map (set of key-value pairs; keys must be chararray) 161
  • 162. Pig Latin – Function types • Eval: input is one or more expressions, output is an expression (example: MAX) • Filter: input is a bag or map, output is a boolean (example: IsEmpty) 162
  • 163. Pig Latin – Function types • Load: input is data from external storage, output is a relation (example: PigStorage) • Store: input is a relation, output is data to external storage (example: PigStorage) 163
  • 164. Pig Latin – User-Defined Functions • Written in Java • Packaged in a JAR file • Register the JAR file using the REGISTER statement • Optionally, alias it with the DEFINE statement 164
  • 167. Running Hive • Hive shell • Interactive: hive • Script: hive -f myscript • Inline: hive -e 'SELECT * FROM mytable' 167
  • 168. Hive services hive --service servicename, where servicename can be: • hiveserver: server for Thrift, JDBC, ODBC clients • hwi: web interface • jar: hadoop jar with Hive jars in the classpath • metastore: out-of-process metastore 168
  • 169. Hive – Metastore • Stores Hive metadata • Configurations: embedded (in-process metastore, in-process database), local (in-process metastore, out-of-process database), remote (out-of-process metastore, out-of-process database) 169
  • 170. Hive – Schema-On-Read • Faster loads into the database (simply copy or move) • Slower queries • Flexibility: multiple schemas for the same data 170
  • 171. Hive – Configuration • Three ways to configure Hive: • hive-site.xml (fs.default.name, mapred.job.tracker, metastore configuration settings) • hive --hiveconf on the command line • the SET command in the Hive shell 171
  • 172. Hive Query Language (HiveQL) • SQL dialect • Does not support the full SQL-92 specification • No support for: HAVING clause in SELECT, correlated subqueries, subqueries outside FROM clauses, updateable or materialized views, stored procedures 172
  • 173. Sample code #hive hive> CREATE TABLE foreign_aid (country STRING, sum BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; hive> SHOW TABLES; hive> DESCRIBE foreign_aid; hive> LOAD DATA INPATH 'econ_assist.csv' OVERWRITE INTO TABLE foreign_aid; hive> SELECT * FROM foreign_aid LIMIT 10; hive> SELECT country, SUM(sum) FROM foreign_aid GROUP BY country; 173
  • 174. Hive Query Language (HiveQL) • Extensions: MySQL-like extensions; MapReduce extensions (multi-table insert, MAP, REDUCE, TRANSFORM clauses) • Data types: simple (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING) and complex (ARRAY, MAP, STRUCT) 174
  • 175. Hive Query Language (HiveQL) • Built-in functions: list them with SHOW FUNCTIONS, inspect one with DESCRIBE FUNCTION 175
  • 176. Hive – User-Defined Functions • Written in Java • Three UDF types: UDF (input: single row, output: single row), UDAF (input: multiple rows, output: single row), UDTF (input: single row, output: multiple rows) • Register a UDF using ADD JAR • Create an alias using CREATE TEMPORARY FUNCTION 176
  • 178. Jaql architecture (diagram): interactive shell / applications → script → compiler / parser / rewriter → I/O layer → storage layer: file systems (HDFS, GPFS, local), databases (DBMS, HBase), streams (Web, pipes) 178
  • 179. Jaql data model: JSON • JSON = JavaScript Object Notation • Flexible (schema is optional) • Powerful modeling for semi-structured data • Popular exchange format 179
  • 181. Running Jaql • Jaql shell: interactive (jaqlshell), batch (jaqlshell -b myscript.jaql), inline (jaqlshell -e jaqlstatement) • Modes: cluster (jaqlshell -c), minicluster (jaqlshell) 181
  • 182. Jaql query language • A query is a pipeline: source -> operator -> … -> operator -> sink • Sources and sinks, e.g. copy data from a local file to a new file on HDFS: read(file("input.json")) -> write(hdfs("output")) • Core operators: filter, transform, expand, group, join, union, tee, sort, top 182
  • 183. Jaql query language • Variables: the equal operator (=) binds source output to a variable, e.g. $tweets = read(hdfs("twitterfeed")) • Pipes, streams, and consumers: the pipe operator (->) streams data to a consumer and expects an array as input, e.g. $tweets -> filter $.from_src == 'tweetdeck'; • $ is an implicit variable referencing the current array value 183
  • 184. Jaql query language • Categories of built-in functions: system, core, hadoop, io, array, index, schema, xml, regex, binary, date, nil, agg, number, string, function, random, record 184
  • 185. Jaql – Data Storage • Data store examples: HDFS, HBase, local FS, Amazon S3, HTTP, DB2 (JDBC) • Data format examples: JSON, Avro, CSV/delimited, XML 185
  • 186. Jaql sample code #jaqlshell -c jaql> $foreignaid = read(del("econ_assist.csv", {schema: schema {country: string, sum: long}})); jaql> $foreignaid -> group by $country = ($.country) into {$country.country, sum($[*].sum)}; 186
  • 187. Hadoop core lab – Part 3
  • 189. Acknowledgements and Disclaimers Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. © Copyright IBM Corporation 2013. All rights reserved. • U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. IBM, the IBM logo, ibm.com, InfoSphere and BigInsights, Streams, and DB2 are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.
If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml Other company, product, or service names may be trademarks or service marks of others.
  • 190. Communities • On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more o Find the community that interests you … • Information Management bit.ly/InfoMgmtCommunity • Business Analytics bit.ly/AnalyticsCommunity • Enterprise Content Management bit.ly/ECMCommunity • IBM Champions o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities • ibm.com/champion
  • 191. Thank You Your feedback is important! • Access the Conference Agenda Builder to complete your session surveys oAny web or mobile browser at http://iod13surveys.com/surveys.html oAny Agenda Builder kiosk onsite