The map operation in MapReduce refers to applying a function to every element in a data set and emitting key-value pairs. Some key properties of a map operation:
• It processes the data in parallel on different machines.
• The input and output are in the form of key-value pairs.
• It does not modify the existing data; it only processes it to emit new outputs.
In summary, a map operation takes a data set with some structure, applies a function independently to different parts of it in parallel, and produces a new set of outputs.
2. Please note
IBM’s statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be
incorporated into any contract. The development, release, and timing of any
future features or functionality described for our products remains at our sole
discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job
stream, the I/O configuration, the storage configuration, and the workload
processed. Therefore, no assurance can be given that an individual user will
achieve results similar to those stated here.
8. Hadoop architecture
• Two main components:
– Hadoop Distributed File System (HDFS)
– MapReduce Engine
9. Hadoop distributed file system (HDFS)
• Hadoop file system that runs on top of
existing file system
• Designed to handle very large files with
streaming data access patterns
• Uses blocks to store a file or parts of a file
10. HDFS - Blocks
• File Blocks
– 64MB (default), 128MB (recommended) – compare to 4KB in UNIX
– Behind the scenes, 1 HDFS block is supported by multiple operating system (OS) blocks
[Figure: one 128 MB HDFS block backed by multiple OS blocks]
• Advantages of blocks:
– Fixed size – easy to calculate how many fit on a disk
– A file can be larger than any single disk in the network
– If a file or a chunk of the file is smaller than the block size, only the needed space is used. Eg: a 420MB file is split as: 128MB + 128MB + 128MB + 36MB
– Fits well with replication to provide fault tolerance and availability
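The 420MB split above is simple arithmetic; a minimal sketch in JavaScript (the language this deck uses for its later map/reduce examples). `blockSplit` is a hypothetical helper name for illustration, not part of Hadoop:

```javascript
// Split a file of `sizeMB` megabytes into HDFS-style blocks of
// `blockMB` megabytes; the last block only uses the space it needs.
function blockSplit(sizeMB, blockMB) {
  blockMB = blockMB || 128;
  var blocks = [];
  var remaining = sizeMB;
  while (remaining > 0) {
    blocks.push(Math.min(blockMB, remaining));
    remaining -= blockMB;
  }
  return blocks;
}

// A 420MB file becomes three full 128MB blocks plus one 36MB block.
console.log(blockSplit(420)); // [128, 128, 128, 36]
```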
11. HDFS - Replication
• Blocks with data are replicated to multiple nodes
• Allows for node failure without data loss
[Figure: a block replicated across Node 1, Node 2, and Node 3]
12. MapReduce engine
• Technology from Google
• A MapReduce program consists of map and reduce
functions
• A MapReduce job is broken into tasks that run in
parallel
13. Types of nodes - Overview
• HDFS nodes
– NameNode
– DataNode
• MapReduce nodes
– JobTracker
– TaskTracker
• There are other nodes not discussed in this course
15. Types of nodes - NameNode
• NameNode
– Only one per Hadoop cluster
– Manages the filesystem namespace and metadata
– Single point of failure, mitigated by writing state to multiple filesystems
– Because it is a single point of failure, don’t use inexpensive commodity hardware for this node; it has large memory requirements
16. Types of nodes - DataNode
• DataNode
– Many per Hadoop cluster
– Manages blocks with data and serves them to clients
– Periodically reports to the name node the list of blocks it stores
– Use inexpensive commodity hardware for this node
17. Types of nodes - JobTracker
• JobTracker node
– One per Hadoop cluster
– Receives job requests submitted by the client
– Schedules and monitors MapReduce jobs on TaskTrackers
18. Types of nodes - TaskTracker
• TaskTracker node
– Many per Hadoop cluster
– Executes MapReduce operations
– Reads blocks from DataNodes
24. Topology awareness
Bandwidth becomes progressively smaller in the following scenarios:
1. Process on the same node
2. Different nodes on the same rack
3. Nodes on different racks in the same data center
4. Nodes in different data centers
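These four bandwidth tiers are what Hadoop's rack awareness encodes as a network distance. A minimal sketch, assuming a simple /datacenter/rack/node naming scheme; the distance-per-tier idea follows Hadoop's topology model, but this function is illustrative, not Hadoop's actual API:

```javascript
// Network distance between two nodes named "/datacenter/rack/node":
// 0 = same node, 2 = same rack, 4 = same data center, 6 = different DCs.
// Distance is the number of tree edges up to the closest common ancestor
// and back down.
function distance(a, b) {
  var pa = a.split('/').filter(Boolean);
  var pb = b.split('/').filter(Boolean);
  var common = 0;
  while (common < pa.length && common < pb.length && pa[common] === pb[common]) {
    common++;
  }
  return (pa.length - common) + (pb.length - common);
}

console.log(distance('/d1/r1/n1', '/d1/r1/n1')); // 0 (same node)
console.log(distance('/d1/r1/n1', '/d1/r1/n2')); // 2 (same rack)
console.log(distance('/d1/r1/n1', '/d1/r2/n3')); // 4 (same data center)
console.log(distance('/d1/r1/n1', '/d2/r3/n4')); // 6 (different data centers)
```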
47. What is Hadoop?
• Open source project
• Written in Java
• Optimized to handle:
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
• Great performance
• Reliability provided through replication
• Not for OLTP, not for OLAP/DSS; good for Big Data
• Current version: 0.20.2
56. Examples of Hadoop in action
• In the telecommunication industry
• In the media
• In the technology industry
57. Hadoop is not for all types of work
• Not to process transactions (random access)
• Not good when work cannot be parallelized
• Not good for low latency data access
• Not good for processing lots of small files
• Not good for intensive calculations with little data
58. Big Data solutions and the Cloud
• Big Data solutions are more than just Hadoop
– Add business intelligence/analytics functionality
– Derive information from data in motion
• Big Data solutions and the Cloud are a perfect fit
– The Cloud allows you to set up a cluster of systems in minutes, and it’s relatively inexpensive
62. HDFS Command line interface
• File System Shell (fs)
• Invoked as follows:
hadoop fs <args>
• Example: listing the current directory in HDFS
hadoop fs -ls .
63. HDFS Command line interface
• FS shell commands take path URIs as arguments
• URI format:
scheme://authority/path
• Scheme:
– For the local filesystem, the scheme is file
– For HDFS, the scheme is hdfs
hadoop fs -copyFromLocal file://myfile.txt hdfs://localhost/user/keith/myfile.txt
• Scheme and authority are optional
• Defaults are taken from the configuration file core-site.xml
64. HDFS Command line interface
• Many POSIX-like commands
• cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail
• Some HDFS-specific commands
• copyFromLocal, copyToLocal, get, getmerge, put, setrep
65. HDFS – Specific commands
• copyFromLocal / put
• Copy files from the local file system into fs
hadoop fs -copyFromLocal <localsrc> .. <dst>
Or
hadoop fs -put <localsrc> .. <dst>
66. HDFS – Specific commands
• copyToLocal / get
• Copy files from fs into the local file system
hadoop fs -copyToLocal [-ignorecrc] [-crc]
<src> <localdst>
Or
hadoop fs -get [-ignorecrc] [-crc]
<src> <localdst>
67. HDFS – Specific commands
• getMerge
• Get all the files in the directories that match the source file pattern
• Merge and sort them to only one file on local fs
• <src> is kept
hadoop fs -getmerge <src> <localdst>
68. HDFS – Specific commands
• setRep
• Set the replication level of a file.
• The -R flag requests a recursive change of replication level for an
entire tree.
• If -w is specified, waits until new replication level is achieved.
hadoop fs -setrep [-R] [-w] <rep> <path/file>
74. What is a Map operation?
• Doing something to every element in an array is a common operation:
var a = [1,2,3];
for (i = 0; i < a.length; i++)
a[i] = a[i] * 2;
• New value for variable a would be:
var a = [2,4,6];
This can be written as a function.
75. What is a Map operation?
• Doing something to every element in an array is a common operation:
var a = [1,2,3];
for (i = 0; i < a.length; i++)
a[i] = fn(a[i]);
• New value for variable a would be:
var a = [2,4,6];
Like this, where fn is a function defined as:
function fn(x) {return x*2;}
76. What is a Map operation?
• Doing something to every element in an array is a common operation:
var a = [1,2,3];
for (i = 0; i < a.length; i++)
a[i] = fn(a[i]);
Now, all of this can also be converted into a “map” function
79. What is a Map operation?
• …like this, where fn is a function passed as an argument:
function map(fn, a) {
for (i = 0; i < a.length; i++)
a[i] = fn(a[i]);
}
• You can invoke this map function like this:
map(function(x){return x*2;}, a);
This is function fn whose definition is included in the call
80. What is a Map operation?
• In summary, now you can rewrite:
for (i = 0; i < a.length; i++)
a[i] = a[i] * 2;
as a map operation:
map(function(x){return x*2;}, a);
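For reference, modern JavaScript ships this exact pattern as the built-in Array.prototype.map, with one difference worth noting against the hand-rolled version above:

```javascript
// The hand-rolled map above, expressed with the built-in Array.prototype.map.
// The built-in returns a NEW array instead of overwriting `a` in place.
var a = [1, 2, 3];
var doubled = a.map(function (x) { return x * 2; });
console.log(doubled); // [2, 4, 6]
console.log(a);       // [1, 2, 3] – the original array is untouched
```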
82. What is a Reduce operation?
• Another common operation on arrays is to combine all their values:
function sum(a) {
var s = 0;
for (i = 0; i < a.length; i++)
s += a[i];
return s;
}
This can be written as a function.
83. What is a Reduce operation?
• Another common operation on arrays is to combine all their values:
function sum(a) {
var s = 0;
for (i = 0; i < a.length; i++)
s = fn(s,a[i]);
return s;
}
Like this, where function fn is defined so it adds its arguments:
function fn(a,b){ return a+b; }
84. What is a Reduce operation?
• Another common operation on arrays is to combine all their values:
function sum(a) {
var s = 0;
for (i = 0; i < a.length; i++)
s = fn(s, a[i]);
return s;
}
The whole function sum can also be rewritten so that fn is passed as an argument.
85. What is a Reduce operation?
• Another common operation on arrays is to combine all their values:
function reduce(fn, a, init) {
var s = init;
for (i = 0; i < a.length; i++)
s = fn(s, a[i]);
return s;
}
Like this… The function name was changed to reduce, and now it takes three arguments: a function, an array, and an initial value.
86. What is a Reduce operation?
• In summary, now you can rewrite:
function sum(a) {
var s = 0;
for (i = 0; i < a.length; i++)
s += a[i];
return s;
}
as a reduce operation:
reduce(function(a,b){return a+b;},a,0);
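Likewise, the reduce operation is built into JavaScript as Array.prototype.reduce, which takes the combining function and the initial value directly; and map and reduce compose, which is the core idea behind MapReduce:

```javascript
// sum written with the built-in Array.prototype.reduce:
// arr.reduce(fn, initialValue) folds the array into a single value,
// where fn(accumulator, element) combines one element at a time.
var a = [1, 2, 3];
var total = a.reduce(function (s, x) { return s + x; }, 0);
console.log(total); // 6

// map then reduce, composed – a tiny picture of a MapReduce job:
var sumOfDoubles = a.map(function (x) { return x * 2; })
                    .reduce(function (s, x) { return s + x; }, 0);
console.log(sumOfDoubles); // 12
```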
126. Fault tolerance
• Task Failure
• If a child task fails, the child JVM reports to the TaskTracker before it exits. The attempt is marked failed, freeing up a slot for another task.
• If the child task hangs, it is killed. The JobTracker reschedules the task on another machine.
• If the task continues to fail, the job is failed.
131. Fault tolerance
• TaskTracker Failure
• JobTracker receives no heartbeat
• Removes the TaskTracker from the pool of TaskTrackers to schedule tasks on
• JobTracker Failure
• Single point of failure. The job fails.
139. Scheduling
• FIFO scheduler (with priorities)
• Each job uses the whole cluster, so jobs wait their turn.
• Fair scheduler
• Jobs are placed in pools. A user who submits more jobs than another user will not get more cluster resources than the other user, on average. Custom pools with guaranteed minimum capacity can be defined.
• Capacity scheduler
• Allows Hadoop to simulate, for each user, a separate MapReduce cluster with FIFO scheduling.
144. Task execution
• Speculative Execution
• Job execution time is sensitive to slow-running tasks. Hadoop detects slow-running tasks and launches another, equivalent task as a backup. The output from the first of these tasks to finish is used.
• Task JVM Reuse
• Tasks run in their own JVMs for isolation. Jobs that have a large number of short-lived tasks, or tasks with lengthy initialization, can benefit from sequential JVM reuse through configuration.
149. Similarities of Pig, Hive and Jaql
• All translate their respective high-level languages to MapReduce jobs
• All offer significant reductions in program size over Java
• All provide points of extension to cover gaps in functionality
• All provide interoperability with other languages
• None support random reads/writes or low-latency queries
150. Comparing Pig, Hive, and Jaql
• Developed by – Pig: Yahoo!; Hive: Facebook; Jaql: IBM
• Language name – Pig: Pig Latin; Hive: HiveQL; Jaql: Jaql
• Type of language – Pig: Data flow; Hive: Declarative (SQL dialect); Jaql: Data flow
• Data structures it operates on – Pig: Complex; Hive: Geared towards structured data; Jaql: Loosely structured data, JSON
• Schema optional? – Pig: Yes; Hive: No, but data can have many schemas; Jaql: Yes
• Turing complete? – Pig: Yes, when extended with Java UDFs; Hive: Yes, when extended with Java UDFs; Jaql: Yes
154. Pig Latin sample code
#pig
grunt> records = LOAD 'econ_assist.csv'
USING PigStorage(',')
AS (country:chararray, sum:long);
grunt> grouped = GROUP records BY country;
grunt> thesum = FOREACH grouped
GENERATE group, SUM(records.sum);
grunt> DUMP thesum;
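To see what the Pig script above computes, here is the same group-by-country sum sketched in JavaScript, with hypothetical in-memory rows standing in for econ_assist.csv:

```javascript
// Hypothetical rows standing in for econ_assist.csv: [country, sum] pairs.
var records = [
  ['Canada', 100],
  ['Mexico', 40],
  ['Canada', 25]
];

// GROUP records BY country, then SUM each group's `sum` field:
var totals = records.reduce(function (acc, row) {
  var country = row[0], amount = row[1];
  acc[country] = (acc[country] || 0) + amount;
  return acc;
}, {});

console.log(totals); // { Canada: 125, Mexico: 40 }
```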
155. Pig Latin – Statements, operations & commands
• A Pig Latin program is a series of statements
• An operation as a statement – Eg: … LOAD ‘input.txt’; … DUMP …
• A command as a statement – Eg: … ls *.txt
[Figure: Pig Latin program → Logical Plan → Compile → Physical Plan → Execute]
157. Pig Latin – Relational operators
• Loading and storing – Eg: LOAD (into a program), STORE (to disk), DUMP (to the screen)
• Filtering – Eg: FILTER, DISTINCT, FOREACH...GENERATE, STREAM, SAMPLE
• Grouping and joining – Eg: JOIN, COGROUP, GROUP, CROSS
• Sorting – Eg: ORDER, LIMIT
• Combining and splitting – Eg: UNION, SPLIT
158. Pig Latin – Relations and schema
• The result of a relational operator is a relation
• A relation is a set of tuples
• Relations can be named using an alias (Eg: “x”)
x = LOAD 'sample.txt' AS (id: int, year:int);
DUMP x
• Output is a tuple. Eg:
(1,1987)
159. Pig Latin – Relations and schema
Structure of a relation is a schema
Use the DESCRIBE operator to see the schema. Eg:
DESCRIBE x
The output is the schema:
x: {id: int, year: int}
160. Pig Latin expressions
• Statements that contain relational operators may also contain expressions
• Kinds of expressions: Constant, Field, Projection, Map lookup, Cast, Arithmetic, Conditional, Boolean, Comparison, Functional, Flatten
161. Pig Latin – Data types
• Simple types: int, long, float, double, bytearray, chararray
• Complex types:
– Tuple – Sequence of fields of any type
– Bag – Unordered collection of tuples
– Map – Set of key-value pairs. Keys must be chararray.
162. Pig Latin – Function types
• Eval
– Input: One or more expressions
– Output: An expression
– Example: MAX
• Filter
– Input: Bag or map
– Output: Boolean
– Example: IsEmpty
163. Pig Latin – Function types
• Load
– Input: Data from external storage
– Output: A relation
– Example: PigStorage
• Store
– Input: A relation
– Output: Data to external storage
– Example: PigStorage
164. Pig Latin – User-Defined Functions
• Written in Java
• Packaged in a JAR file
• Register the JAR file using the REGISTER statement
• Optionally, alias it with the DEFINE statement
168. Hive services
hive --service servicename
where servicename can be:
• hiveserver – server for Thrift, JDBC, ODBC clients
• hwi – web interface
• jar – hadoop jar with Hive jars in classpath
• metastore – out-of-process metastore
171. Hive - Configuration
• Three ways to configure Hive:
– hive-site.xml
- fs.default.name
- mapred.job.tracker
- Metastore configuration settings
– hive -hiveconf
– SET command in the Hive Shell
172. Hive Query Language (HiveQL)
• SQL dialect
• Does not support the full SQL-92 specification
• No support for:
– HAVING clause in SELECT
– Correlated subqueries
– Subqueries outside FROM clauses
– Updateable or materialized views
– Stored procedures
173. Sample code
#hive
hive> CREATE TABLE foreign_aid
(country STRING, sum BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
hive> SHOW TABLES;
hive> DESCRIBE foreign_aid;
hive> LOAD DATA INPATH 'econ_assist.csv'
OVERWRITE INTO TABLE foreign_aid;
hive> SELECT * FROM foreign_aid LIMIT 10;
hive> SELECT country, SUM(sum) FROM foreign_aid
GROUP BY country;
175. Hive Query Language (HiveQL)
• Built-in Functions
– SHOW FUNCTIONS
– DESCRIBE FUNCTION
176. Hive – User-Defined Functions
• Written in Java
• Three UDF types:
– UDF – Input: single row; output: single row
– UDAF – Input: multiple rows; output: single row
– UDTF – Input: single row; output: multiple rows
• Register a UDF using ADD JAR
• Create an alias using CREATE TEMPORARY FUNCTION
179. Jaql data model: JSON
• JSON = JavaScript Object Notation
• Flexible (schema is optional)
• Powerful modeling for semi-structured data
• Popular exchange format
182. Jaql query language
[Figure: source → operator → … → operator → sink]
• Sources and sinks
Eg: Copy data from a local file to a new file on HDFS:
read(file("input.json")) -> write(hdfs("output"))
• Core Operators: Filter, Transform, Expand, Group, Join, Union, Tee, Sort, Top
183. Jaql query language
• Variables, pipes, streams, and consumers
– The equal operator (=) binds source output to a variable
e.g. $tweets = read(hdfs("twitterfeed"))
– The pipe operator (->) streams data to a consumer
– A pipe expects an array as input
e.g. $tweets -> filter $.from_src == 'tweetdeck';
– $ is an implicit variable referencing the current array value
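Since Jaql operates on JSON, the filter pipe above maps directly onto plain JavaScript. A sketch with hypothetical tweet records; naming the callback parameter `$` mimics the role of Jaql's implicit variable:

```javascript
// Hypothetical tweet records, as Jaql would read them from HDFS as JSON.
var tweets = [
  { from_src: 'tweetdeck', text: 'hello' },
  { from_src: 'web',       text: 'hi' },
  { from_src: 'tweetdeck', text: 'hadoop' }
];

// Jaql:  $tweets -> filter $.from_src == 'tweetdeck';
// JS:    the callback parameter plays the role of Jaql's $.
var fromTweetdeck = tweets.filter(function ($) {
  return $.from_src === 'tweetdeck';
});
console.log(fromTweetdeck.length); // 2
```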
184. Jaql query language
• Categories of Built-in Functions: system, core, hadoop, io, array, index, schema, xml, regex, binary, date, nil, agg, number, string, function, random, record
185. Jaql – Data Storage
• Data store examples: Amazon S3, HTTP, HBase, Local FS, DB2, JDBC, HDFS
• Data format examples: JSON, AVRO, CSV, XML
190. Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social
networks, and more
o Find the community that interests you …
• Information Management bit.ly/InfoMgmtCommunity
• Business Analytics bit.ly/AnalyticsCommunity
• Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
o Recognizing individuals who have made the most outstanding contributions to
Information Management, Business Analytics, and Enterprise Content Management
communities
o ibm.com/champion
191. Thank You
Your feedback is important!
• Access the Conference Agenda Builder to
complete your session surveys
o Any web or mobile browser at http://iod13surveys.com/surveys.html
o Any Agenda Builder kiosk onsite