2. CONTENTS
WHAT IS HADOOP
HADOOP APPROACH
SUB PROJECTS OF HADOOP
MAPREDUCE
HDFS
CASE STUDY USING HADOOP
APPLICATION AREA OF HADOOP
3. WHAT IS HADOOP
It is an open-source software framework that supports data-intensive distributed applications.
It enables us to explore complex data, using custom analysis tailored to our information and questions.
It is helpful for analyzing unstructured and semi-structured data.
4. HADOOP APPROACH
Hadoop works in two phases.
The first phase is Data Distribution.
The second phase is MapReduce: Isolated Processes.
Data Distribution: in the data distribution phase, data is distributed to all the nodes
of the cluster as it is being loaded in.
HDFS splits large data files into chunks that are managed by different nodes in the cluster.
In addition, data is replicated across many nodes, so that a single failure does
not make any data unavailable.
5. DATA DISTRIBUTION PROCESS
Data replicated across different nodes forms a common namespace, so it is
universally accessible.
6. DATA DISTRIBUTION PHASE CONTD…
Individual input files are broken into lines or into other formats specific to the
application logic.
Each process running on a node in the cluster then processes a subset of these
records.
The Hadoop framework then schedules these processes in proximity to the
location of the data/records: most data is read from the local disk straight into the
CPU, preventing unnecessary network transfers. This strategy of moving the
computation to the data, instead of moving the data to the computation, allows
Hadoop to achieve high performance.
7. MAP REDUCE: ISOLATED PROCESSES
Hadoop limits the amount of communication: each individual record is
processed by a task in isolation from the others.
Records are processed in isolation by tasks called Mappers.
The output of the mappers is then brought together into a second set of tasks called Reducers.
The advantage of this isolated task processing is that no user-level message
exchange is needed, nor do nodes need to roll back to pre-arranged checkpoints.
9. SUB PROJECTS OF HADOOP
MAPREDUCE
HDFS
HIVE
CHUKWA
CORE
HBASE
AVRO
10. WHAT IS MAPREDUCE?
MapReduce is a programming model used for processing large data sets.
Programs written in this functional style are automatically parallelized and
executed on a large cluster of commodity machines.
MapReduce is an associated implementation for processing and
generating large data sets.
11. THE PROGRAMMING MODEL OF MAPREDUCE
Map, written by the user, produces a set of intermediate key/value pairs. The
MapReduce library groups together all intermediate values associated with the
same intermediate key I and passes them to the Reduce function.
12. The Programming Model Of MapReduce Contd…
The Reduce function, also written by the user, accepts an intermediate key I
and a set of values for that key. It merges together these values to form a
possibly smaller set of values.
13. Example of programming model:-
Let map(k, v) = emit(k.toUpper(), v.toUpper())
  map(“foo”, “bar”) -> (“FOO”, “BAR”)
  map(“ask”, “cat”) -> (“ASK”, “CAT”)
Let reduce(k, vals):
  sum = 0
  foreach int v in vals:
    sum += v
  emit(k.toUpper(), sum)
  (“a”, [42, 100, 312]) -> (“A”, 454)
  (“b”, [12, 6, -2]) -> (“B”, 16)
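To make the model concrete, the following is a hedged sketch of the classic word
count written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce);
the class names are illustrative and not part of the original slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for every word in a line of input, emit the intermediate pair (word, 1).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);          // intermediate (key, value) pair
    }
  }
}

// Reduce: sum all the counts that were grouped under the same word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(word, new IntWritable(sum));   // final (word, total) pair
  }
}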
14. BENEFITS OF PROGRAMING MODEL
Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines.
The runtime system (the Hadoop framework) takes care of the
details of partitioning the input data, scheduling the
program's execution, and handling machine failures.
This allows programmers without any experience in parallel
and distributed systems to easily utilize the resources of a
large distributed system.
15. MAPREDUCE WORKING
A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program and
configuration information. Hadoop runs the job by dividing it into
tasks, of which there are two types: map tasks and reduce tasks .
There are two types of nodes that control the job execution process:
tasktrackers and jobtrackers .
The jobtracker coordinates all the jobs run on the system by
scheduling tasks to run on tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker,
which keeps a record of the overall progress of each job.
If a task fails, the jobtracker can reschedule it on a different
tasktracker.
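As an illustration (not taken from the slides), a minimal driver that submits such
a job might look as follows, assuming the org.apache.hadoop.mapreduce API and the
word-count classes sketched earlier; the jobtracker then schedules the resulting
map and reduce tasks onto tasktrackers.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");              // the unit of work the client submits
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);           // map tasks
    job.setReducerClass(IntSumReducer.class);            // reduce tasks
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // where results are written
    System.exit(job.waitForCompletion(true) ? 0 : 1);    // block until the job finishes
  }
}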
16. MapReduce Working Contd…
Input splits: Hadoop divides the input to a MapReduce job into fixed-
size pieces called input splits or just splits. Hadoop creates one map
task for each split, which runs the user-defined map function for each
record in the split.
The quality of the load balancing increases as the splits become
more fine-grained.
But if splits are too small, then the overhead of managing the splits
and of map task creation begins to dominate the total job execution
time. For most jobs, a good split size tends to be the size of an HDFS
block, 64 MB by default.
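Continuing the hedged driver sketch above, the split size can be constrained per
job through FileInputFormat (the values below are only examples, given in bytes):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// In the job driver: keep input splits between 64 MB and 128 MB.
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);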
18. INPUT TO REDUCE TASKS
Reduce tasks don’t have the advantage of data locality—the
input to a single reduce task is normally the output from all
mappers.
A MapReduce job may have a single reduce task, multiple reduce tasks, or no
reduce tasks at all.
For some applications there is no need for a reduce function; in that case the
output of the map function is stored directly in HDFS.
MapReduce data flow with a single reduce task
19. MapReduce data flow with multiple reduce tasks
MapReduce data flow with no reduce tasks
20. COMBINER FUNCTION
•Many MapReduce jobs are limited by the bandwidth available on the
cluster.
•In order to minimize the data transferred between the map and reduce
tasks, combiner functions are introduced.
•Hadoop allows the user to specify a combiner function to be run on the
map output—the combiner function’s output forms the input to the reduce
function.
•Combiner functions can help cut down the amount of data shuffled between
the maps and the reduces.
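For an additive reduce function such as the word count sketched earlier, the
reducer class can usually double as the combiner; in the job driver this is a
single (hedged) line:

// Run the reduce logic on each map's local output before it is shuffled across the network.
job.setCombinerClass(IntSumReducer.class);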
21. HADOOP MAPREDUCE UTILITIES
HADOOP STREAMING
Hadoop provides an API to MapReduce that allows users to write their map and
reduce functions in languages other than Java.
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and
your program, so you can use any language that can read standard input and write
to standard output to write your MapReduce program.
HADOOP PIPES
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
Unlike Streaming, which uses standard input and output to communicate with the
map and reduce code, Pipes uses sockets as the channel over which the tasktracker
communicates with the process running the C++ map or reduce function. JNI is not
used.
22. HADOOP DISTRIBUTED
FILESYSTEM (HDFS)
Filesystems that manage the storage across a network of machines
are called distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which
stands for Hadoop Distributed Filesystem.
HDFS, the Hadoop Distributed File System, is a distributed file
system designed to hold very large amounts of data (terabytes or
even petabytes) and provide high-throughput access to the
information.
HDFS makes our filesystem tolerant of node failure, without suffering
data loss.
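For orientation (the paths are only examples), HDFS is normally used through the
hadoop fs shell commands:

hadoop fs -mkdir /user/demo              # create a directory in HDFS
hadoop fs -put bigfile.dat /user/demo    # copy a local file into HDFS
hadoop fs -ls /user/demo                 # list the directory
hadoop fs -cat /user/demo/bigfile.dat    # stream the file back out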
23. GOALS OF HDFS
Building a distributed filesystem is more complex than building a regular disk
filesystem, because the data is spread over multiple nodes and all the
complications of network programming kick in.
•Hardware Failure
• An HDFS instance may consist of hundreds or thousands of server machines,
each storing part of the file system’s data.
•The fact that there are a huge number of components and that each component
has a non-trivial probability of failure means that some component of HDFS is
always non-functional.
•Therefore, detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
•Large Data Sets
• Applications that run on HDFS have large data sets. A typical file in HDFS is
gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.
• It should provide high aggregate data bandwidth and scale to hundreds of nodes
in a single cluster.
24. GOALS OF HDFS
Streaming Data Access
• Applications that run on HDFS need streaming access to their data sets.
• They are not general purpose applications that typically run on general purpose
file systems.
• HDFS is designed more for batch processing rather than interactive use by users.
The emphasis is on high throughput of data access rather than low latency of data
access.
Simple Coherency Model
• HDFS applications need a write-once-read-many access model for files.
• A file once created, written, and closed need not be changed.
• This assumption simplifies data coherency issues and enables high throughput
data access.
• A Map/Reduce application or a web crawler application fits perfectly with this
model.
• There is a plan to support appending-writes to files in the future.
25. Goals of HDFS
Portability Across Heterogeneous Hardware and Software
Platforms
HDFS has been designed to be easily portable from one platform to
another.
This facilitates widespread adoption of HDFS as a platform of choice
for a large set of applications.
Moving Computation is cheaper than Moving Data.
A computation request is much more efficient if it is executed near the
data it operates on.
This minimizes network congestion and increases the overall
throughput of the system.
HDFS provides interfaces for applications to move themselves closer
to where data is located.
26. HDFS CONCEPT
Blocks:
• A block is the minimum amount of data that can be read or written;
in HDFS it is 64 MB by default.
• Files in HDFS are broken into block-sized chunks, which are stored
as independent units.
• HDFS blocks are large compared to disk blocks and the reason is to
minimize the cost of seeks. By making a block large enough, the time
to transfer the data from the disk can be made to be significantly
larger than the time to seek to the start of the block. Thus the time to
transfer a large file made of multiple blocks operates at the disk
transfer rate.
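The block size is a cluster-level setting. A sketch of the corresponding entry in
hdfs-site.xml, assuming the classic property name dfs.block.size used in Hadoop
releases of this era (128 MB is only an example value):

<!-- hdfs-site.xml: raise the default HDFS block size from 64 MB to 128 MB -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>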
27. BENEFITS OF BLOCK
ABSTRACTION
A file can be larger than any single disk in the network. There’s
nothing that requires the blocks from a file to be stored on the same
disk, so they can take advantage of any of the disks in the cluster.
Making the unit of abstraction a block rather than a file simplifies the
storage subsystem.
Blocks provide fault tolerance and availability. To insure against
corrupted blocks and disk and machine failure, each block is
replicated to a small number of physically separate machines
(typically three). If a block becomes unavailable, a copy can be read
from another location in a way that is transparent to the client.
28. NAMENODES AND DATANODES
An HDFS cluster has two types of node operating in a master-worker
pattern: a namenode (the master) and a number of datanodes (workers).
The namenode manages the filesystem namespace. It maintains the
filesystem tree and the metadata for all the files and directories in the
tree.
Datanodes are the work horses of the filesystem. They store and retrieve
blocks when they are told to (by clients or the namenode) and they report
back to the namenode periodically with lists of blocks that they are
storing.
Without the namenode, the filesystem cannot be used. In fact, if the
machine running the namenode were obliterated, all the files on the
filesystem would be lost since there would be no way of knowing how to
reconstruct the files from the blocks on the datanodes.
30. SECONDARY NAMENODE CONCEPT
To make the namenode resilient to failure, Hadoop provides two
mechanisms:
1.To back up the files that make up the persistent state of the
filesystem metadata. Hadoop can be configured so that the
namenode writes its persistent state to multiple filesystems.
2. Another solution is to run a secondary namenode. The secondary
namenode usually runs on a separate physical machine, since it
requires plenty of CPU and as much memory as the namenode to
perform the merge. It keeps a copy of the merged namespace image,
which can be used in the event of the namenode failing.
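The first mechanism is usually configured by listing more than one metadata
directory, typically including a remote NFS mount. A hedged sketch, assuming the
classic property name dfs.name.dir (the paths are examples):

<!-- hdfs-site.xml: write the namenode's persistent state to two filesystems -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/name,/mnt/nfs/dfs/name</value>
</property>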
35. FILE SYSTEM NAMESPACE
HDFS supports a traditional hierarchical file organization. A user
or an application can create and remove files, move a file from
one directory to another, rename a file, create directories and
store files inside these directories.
HDFS does not yet implement user quotas or access permissions.
HDFS does not support hard links or soft links. However, the
HDFS architecture does not preclude implementing these
features.
The Namenode maintains the file system namespace. Any
change to the file system namespace or its properties is recorded
by the Namenode. An application can specify the number of
replicas of a file that should be maintained by HDFS. The number
of copies of a file is called the replication factor of that file. This
information is stored by the Namenode.
36. DATA REPLICATION
The blocks of a file are replicated for fault tolerance.
The NameNode makes all decisions regarding replication of blocks. It
periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat implies that the
DataNode is functioning properly.
A Blockreport contains a list of all blocks on a DataNode.
When the replication factor is three, HDFS’s placement policy is to
put one replica on one node in the local rack, another on a different
node in the local rack, and the last on a different node in a different
rack.
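The default replication factor is given by the dfs.replication configuration
property (3 unless changed), and it can also be adjusted per file; a hedged
sketch (the path is an example):

hadoop fs -setrep -w 2 /user/demo/bigfile.dat   # change one file's replication factor to 2 and wait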
39. MAIN FEATURES OF HDFS
Cluster Rebalancing
• HDFS automatically moves data from one datanode to another if the free space
on a datanode falls below a certain threshold.
• In the case of a sudden spike in demand, it can dynamically create additional
replicas and rebalance the other data in the cluster.
Data Integrity
• It is possible that a block of data fetched from a datanode arrives corrupted,
due to faults in the storage device, network faults, or buggy software.
• When a client creates an HDFS file, it computes a checksum of each block of the
file and stores the checksums in a separate hidden file in the same HDFS namespace.
• When a client retrieves file contents, it verifies the data it received by
recomputing the checksums.
Robustness against namenode failure, datanode failure and network partitions
40. HADOOP ARCHIVES
HDFS stores small files inefficiently, since each file is stored in a
block, and block metadata is held in memory by the namenode.
Thus, a large number of small files can eat up a lot of memory on
the namenode.
Hadoop Archives or HAR files are a file archiving facility that
packs files into HDFS blocks more efficiently, thereby reducing
namenode memory usage while still allowing transparent access
to files.
Hadoop Archives can be used as input to MapReduce.
Archives are immutable once they have been created. To add or
remove files, you must recreate the archive.
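A hedged usage sketch (the names and paths are placeholders): an archive is
created with the archive tool, and its contents remain readable through the
har:// scheme.

hadoop archive -archiveName logs.har -p /user/demo logs /user/archives   # pack /user/demo/logs into one HAR
hadoop fs -ls har:///user/archives/logs.har                              # list files inside the archive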
41. LIMITATIONS OF HDFS
Low-latency data access
Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS. Remember, HDFS is
optimized for delivering a high throughput of data, and this may be
at the expense of latency. HBase is currently a better choice for low-
latency access.
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always
made at the end of the file. There is no support for multiple writers, or
for modifications at arbitrary offsets in the file.
42. FILESYSTEMS SUPPORTED BY HADOOP
NAME (Java implementation under org.apache.hadoop): DESCRIPTION
HFTP (hdfs.HftpFileSystem): a filesystem providing read-only access to HDFS over HTTP.
HDFS (hdfs.DistributedFileSystem): Hadoop's distributed filesystem, designed to work efficiently in conjunction with MapReduce.
Local (fs.LocalFileSystem): a filesystem for a locally connected disk with client-side checksums.
KFS / CloudStore (fs.kfs.KosmosFileSystem): CloudStore is a distributed filesystem like HDFS or GFS.
43. FILESYSTEMS SUPPORTED BY HADOOP CONTD…
HSFTP (hdfs.HsftpFileSystem): a filesystem providing read-only access to HDFS over HTTPS.
HAR (fs.HarFileSystem): a filesystem layered on another filesystem for archiving files.
FTP (fs.ftp.FTPFileSystem): a filesystem backed by an FTP server.
S3, block-based (fs.s3.S3FileSystem): a filesystem backed by Amazon S3, which stores files in blocks to overcome S3's file size limitation.
44. COMPARISON OF HADOOP WITH RDBMS
Hadoop uses a brute-force scan of the data, whereas an RDBMS has optimized
access methods, such as indexes.
            TRADITIONAL RDBMS            MAPREDUCE
Data size   Gigabytes                    Petabytes
Access      Interactive and batch        Batch
Updates     Read and write many times    Write once, read many times
Structure   Static schema                Dynamic schema
Integrity   High                         Low
Scaling     Nonlinear                    Linear
45. CASE STUDY OF SCIENTIFIC DATA
PROCESSING ON A CLOUD USING
HADOOP
DESCRIPTION
• Our goal is to study the complex molecular interactions
that regulate biological systems.
• We have developed an imaging platform to acquire
and analyze live cell data.
• The platform can record data at high throughput and
analyze the data efficiently.
46. DESCRIPTION CONTD...
The acquisition system has a data rate of 1.5 MBps, and a
typical 48-hour experiment can generate more than 260
GB of images.
The data analysis task for this platform is
daunting: thousands of cells in the video need to be tracked
and characterized individually.
Image analysis is the current bottleneck in our data
processing pipeline; to remove it, we use parallelization.
To store such a large amount of data across different nodes
and analyze it, we use the Hadoop framework.
We use a local eight-core server for data processing.
47. SYSTEM DESIGN
The Hadoop components used for building such a
platform are:
1. The MapReduce programming and
execution environment.
2. The reliable distributed file system
called DFS.
3. A BigTable-like storage system for
sparsely structured data, called HBase.
48. PROGRAMMING MAP-
REDUCE
This concerns how input data is split into parts for MapReduce
processing, how input data formats are handled, and how atomic
data records are extracted from each split.
This approach is implemented by writing new classes
that implement the Hadoop interfaces for handling
input and input splits. We have implemented the
following classes:
StringArrayInputFormat.java (implements Hadoop's
InputFormat interface).
CommaSeparatedStringInputSplitRecordReader.java
(implements Hadoop's RecordReader interface).
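A rough skeleton of what such classes look like against the
org.apache.hadoop.mapreduce API; the names and empty method bodies below are
illustrative only, not the case study's actual code.

import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;

// Sketch of a custom input format whose splits carry lists of image-file names.
public class StringArrayInputFormatSketch extends InputFormat<Text, Text> {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
    // Real code would cut the job's input (e.g. a comma-separated list of file
    // names) into InputSplit objects, one per map task.
    return Collections.emptyList();
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    // The record reader walks one split and hands the mapper one record at a time.
    return new RecordReader<Text, Text>() {
      @Override public void initialize(InputSplit s, TaskAttemptContext c) { }
      @Override public boolean nextKeyValue() { return false; }  // real code: advance to the next record
      @Override public Text getCurrentKey()   { return null; }   // real code: e.g. the image name
      @Override public Text getCurrentValue() { return null; }   // real code: the record contents
      @Override public float getProgress()    { return 0f; }
      @Override public void close()           { }
    };
  }
}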
49. HADOOP DFS
Hadoop's DFS is a flat-structure distributed file
system.
Its master node is called namenode and slave nodes
are called datanodes.
Namenode is visible to all cloud nodes and provides a
uniform global view for file paths in a traditional
hierarchical structure.
File contents are not stored hierarchically, but are
divided into low level data chunks and stored in
datanodes with replication.
Data chunk pointers for files are linked to their
corresponding locations by namenode.
50. HBASE TABLE
HBase is a BigTable-like data store. It also employs a
master-slave topology, where its master maintains a
table-like view for users.
The data stored in HBase are sorted key-value pairs,
logically organized as sparse tables indexed by row keys
with corresponding column family values.
Each column family represents one or more nested
key-value pairs that are grouped together with the
same column family as key prefix.
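A hedged sketch, in the HBase shell, of how such a sparse table might be laid out
(the table, column family, and row names are invented for illustration):

create 'analysis', 'meta', 'results'                    # table with two column families
put 'analysis', 'acq001', 'meta:status', 'submitted'    # one cell in the 'meta' family
put 'analysis', 'acq001', 'results:cellcount', '1289'   # one cell in the 'results' family
get 'analysis', 'acq001'                                # read back all columns of one row
scan 'analysis'                                         # walk the whole (sparse) table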
51. TASK PERFORMED BY
SYSTEM
The client can issue three types of simple
request to the cloud application: a request to
transfer experiment data, a request to
perform an analysis job on a certain
acquisition, and a request to query/view
analysis results.
For a submission or query request, it inserts a
record into the analysis table in HBase.
54. ADVANTAGES OF USING HBASE
DFS provides reliable storage, and current database
systems cannot make use of DFS directly. Therefore,
HBase may have a better degree of fault tolerance for
large-scale data management in some sense.
We find it natural and convenient to model our data
using sparse tables, because we have varying numbers
of fields with different metadata, and the table
organization can change significantly as the usage of
the system is extended to new analysis programs and
new types of input or output data.
55. COMPARISON OF HADOOP AND NON-HADOOP APPROACH
Hadoop's MapReduce corresponds to Google's MapReduce.
Hadoop's Distributed File System (DFS) corresponds to the Google File System (GFS).
The HBase storage system for sparse structured data corresponds to BigTable, a
scalable and reliable distributed storage system for sparse structured data.
56. APPLICATION AREA OF
HADOOP
Some of the key areas of distributed computing
where Hadoop runs efficiently are:
Mobile data
E-commerce
Energy discovery
Energy saving
Infrastructure management
Image processing
Online travel booking
57. REFERENCES
J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large
Clusters”, Communications of the ACM, 51(1), pages 107-113, 2008.
K. Kim, K. Jeon, H. Han, G. Kim, “MRBench: A Benchmark for MapReduce Framework”,
in Proceedings of the 14th IEEE International Conference on Parallel and
Distributed Systems, pages 11-18, 2010.
http://developer.yahoo.com/Hadoop/tutorial/module5.html
G. Attebury, A. Baranovski, K. Bloom, A. Rana, et al., “Hadoop distributed file
system for the Grid”, IEEE, pages 1056-1061, 2009.
K. Shvachko, H. Kuang, S. Radia, et al., “The Hadoop Distributed File System”,
pages 1-10, 2010.
http://hadoop.apache.org/architecture-hdfs
Chen Zhang, Hans De Sterck, Ashraf Aboulnaga and Rob Sladek, “Case Study of
Scientific Data Processing on a Cloud Using Hadoop”.
Tao Fei, Zhang Lin, Guo Hua, Luo Yongliang, Ren Lei, “Typical characteristics of
cloud manufacturing and several key issues of cloud service composition”, IEEE,
vol. 17, pp. 477-486, Mar 2011.