HADOOP



Presented by

Ankit Gupta

3125523
CONTENTS
WHAT IS HADOOP

HADOOP APPROACH

SUB PROJECTS OF HADOOP

 MAPREDUCE

 HDFS

CASE STUDY USING HADOOP

APPLICATION AREAS OF HADOOP
WHAT IS HADOOP

 It is an open-source software framework that
  supports data-intensive distributed applications.
 It enables us to explore complex data, using
  custom analysis tailored to our information and
  questions.
 It is helpful in analyzing unstructured and
  semi-structured data.
HADOOP APPROACH
  Hadoop works in two phases.

 The first phase is Data Distribution.

 The second phase is MapReduce: Isolated Processes.

  Data Distribution:- In the data distribution phase, data is distributed to all the
  nodes of the cluster as it is loaded into the system.

HDFS splits large data files into chunks that are managed by different
 nodes in the cluster.

In addition, data is replicated across many nodes so that a single failure will
 not result in any data being unavailable.
DATA DISTRIBUTION PROCESS




Data replicated across different nodes forms a common namespace, so it is
 universally accessible.
DATA DISTRIBUTION PHASE CONTD…
Individual input files are broken into lines or into another format specific to the
 application logic.

Each process running on a node in the cluster then processes a subset of these
 records.

The Hadoop framework then schedules these processes in proximity to the
 location of the data/records: most data is read from the local disk straight into the
 CPU, preventing unnecessary network transfers. This strategy of moving the
 computation to the data, instead of moving the data to the computation, allows
 Hadoop to achieve high performance.
MAP REDUCE: ISOLATED PROCESSES

Hadoop limits the amount of communication: each individual record is
 processed by a task in isolation from the others.

Records are processed in isolation by tasks called Mappers.

The output from the different mappers is then brought together into a second set
 of tasks called Reducers.

The advantage of isolated task processing is that no user-level message exchange
 is needed, nor do nodes need to roll back to pre-arranged checkpoints.
MAP-REDUCE WORKING
SUB PROJECTS OF HADOOP

 MAPREDUCE
 HDFS

 HIVE

 CHUKWA

 CORE

 HBASE

 AVRO
WHAT IS MAPREDUCE?
   MapReduce is a programming model used for processing large data sets.
   Programs written in this functional style are automatically parallelized and
    executed on a large cluster of commodity machines.
   MapReduce is an associated implementation for processing and
    generating large data sets.
THE PROGRAMMING MODEL OF MAPREDUCE
Map, written by the user, produces a set of intermediate key/value pairs. The
MapReduce library groups together all intermediate values associated with the
same intermediate key I and passes them to the Reduce function.
The Programming Model Of MapReduce Contd…

The Reduce function, also written by the user, accepts an intermediate key I
and a set of values for that key. It merges together these values to form a
possibly smaller set of values.
Example of the programming model:-
Let map(k, v) = emit(k.toUpper(), v.toUpper())

map(“foo”, “bar”) -> (“FOO”, “BAR”), map(“ask”, “cat”) -> (“ASK”, “CAT”)

let reduce(k, vals) =
sum = 0
foreach int v in vals:
sum += v
emit(k, sum)
(“A”, [42, 100, 312]) --> (“A”, 454)
(“B”, [12, 6, -2]) --> (“B”, 16)
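
As a rough illustration, the same two functions might look as follows in
Hadoop's Java API (the classic org.apache.hadoop.mapred interfaces). The class
names are our own, and the two classes are independent sketches of the map and
reduce pseudocode above, not a single complete job:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mirrors map(k, v) = emit(k.toUpper(), v.toUpper())
class UpperCaseMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  public void map(Text key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(key.toString().toUpperCase()),
                   new Text(value.toString().toUpperCase()));
  }
}

// Mirrors reduce(k, vals) = emit(k, sum(vals))
class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // e.g. ("A", [42, 100, 312]) --> ("A", 454)
    }
    output.collect(key, new IntWritable(sum));
  }
}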
BENEFITS OF THE PROGRAMMING MODEL

Programs written in this functional style are automatically
 parallelized and executed on a large cluster of commodity
 machines.

The run-time system (the Hadoop framework) takes care of the
 details of partitioning the input data, scheduling the
 program execution, and handling machine failures.

This allows programmers without any experience in parallel
 and distributed systems to easily utilize the resources of a
 large distributed system.
MAPREDUCE WORKING
    A MapReduce job is a unit of work that the client wants to be
    performed: it consists of the input data, the MapReduce program and
    configuration information. Hadoop runs the job by dividing it into
    tasks, of which there are two types: map tasks and reduce tasks.

   There are two types of nodes that control the job execution process:
    jobtrackers and tasktrackers.

   The jobtracker coordinates all the jobs run on the system by
    scheduling tasks to run on tasktrackers.

   Tasktrackers run tasks and send progress reports to the jobtracker,
    which keeps a record of the overall progress of each job.

   If a task fails, the jobtracker can reschedule it on a different
    tasktracker.
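
A minimal driver sketch for submitting such a job with the classic JobConf API.
TokenMapper is a hypothetical mapper assumed to emit (Text, IntWritable) pairs;
SumReducer is the reducer sketched earlier:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SumJob {
  public static void main(String[] args) throws Exception {
    // A job = input data + MapReduce program + configuration information.
    JobConf conf = new JobConf(SumJob.class);
    conf.setJobName("sum");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(TokenMapper.class);   // hypothetical mapper
    conf.setReducerClass(SumReducer.class);   // the reducer sketched earlier
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // Submitting the job hands it to the jobtracker, which splits it into
    // map and reduce tasks and schedules them on tasktrackers.
    JobClient.runJob(conf);
  }
}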
MapReduce Working Contd…
    Input splits: Hadoop divides the input to a MapReduce job into fixed-
    size pieces called input splits or just splits. Hadoop creates one map
    task for each split, which runs the user-defined map function for each
    record in the split.

   The quality of the load balancing increases as the splits become
    more fine-grained.

   But if splits are too small, then the overhead of managing the splits
    and of map task creation begins to dominate the total job execution
    time. For most jobs, a good split size tends to be the size of an HDFS
    block, 64 MB by default.
MAP-REDUCE WORKING
CONTD…
INPUT TO REDUCE TASKS
   Reduce tasks don’t have the advantage of data locality—the
    input to a single reduce task is normally the output from all
    mappers.

   A job may have a single reduce task, multiple reduce tasks, or none at
    all. Some applications need no reduce function; in that case the output
    from the map tasks is stored directly in HDFS (see the configuration
    sketch after the figures below).
MapReduce data flow with a single reduce task
MapReduce data flow with multiple reduce tasks




MapReduce data flow with no reduce tasks
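
The number of reduce tasks is not governed by the input size; it is set
explicitly for the job. A small sketch of the relevant calls, reusing the
JobConf named conf from the driver sketch earlier:

// Map-only job: with zero reduce tasks, map output is written directly to HDFS.
conf.setNumReduceTasks(0);

// With several reduce tasks, a partitioner decides which reducer receives each
// intermediate key; hash partitioning is the default behaviour.
conf.setNumReduceTasks(4);
conf.setPartitionerClass(org.apache.hadoop.mapred.lib.HashPartitioner.class);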
COMBINER FUNCTION

•Many MapReduce jobs are limited by the bandwidth available on the
 cluster.

•In order to minimize the data transferred between the map and reduce
  tasks, combiner functions are introduced.

•Hadoop allows the user to specify a combiner function to be run on the
  map output—the combiner function’s output forms the input to the reduce
 function.

•Combiner functions can help cut down the amount of data shuffled between
the maps and the reduces.
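
Since the SumReducer sketched earlier just adds values, and addition is
commutative and associative, the same class can safely serve as the combiner.
A one-line sketch against the same JobConf:

// Run SumReducer on each map's output before the shuffle; the reduce function
// then sees pre-aggregated partial sums, cutting the data moved over the network.
conf.setCombinerClass(SumReducer.class);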
HADOOP MAPREDUCE UTILITIES

HADOOP STREAMING

Hadoop provides an API to MapReduce that allows the user to write their
map and reduce functions in languages other than Java.

Hadoop Streaming uses Unix standard streams as the interface between
Hadoop and your program, so you can use any language that can read
standard input and write to standard output to write your MapReduce
program.

HADOOP PIPES

Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.

Unlike Streaming, which uses standard input and output to communicate
with the map and reduce code, Pipes uses sockets as the channel over
which the tasktracker communicates with the process running the C++ map
or reduce function. JNI is not used.
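
As a usage illustration (shell rather than Java, since Streaming exists
precisely to escape Java), a Streaming job can be driven by any executables
that read stdin and write stdout. The jar location varies by installation and
the HDFS paths here are placeholders:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /user/demo/input \
    -output /user/demo/output \
    -mapper  /bin/cat \
    -reducer /usr/bin/wc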
HADOOP DISTRIBUTED
FILESYSTEM (HDFS)
   Filesystems that manage the storage across a network of machines
    are called distributed filesystems.

   Hadoop comes with a distributed filesystem called HDFS, which
    stands for Hadoop Distributed Filesystem.

   HDFS, the Hadoop Distributed File System, is a distributed file
    system designed to hold very large amounts of data (terabytes or
    even petabytes) and provide high-throughput access to the
    information.

   HDFS makes our filesystem tolerant of node failure without suffering
    data loss.
GOALS OF HDFS
Building a distributed filesystem is more complex than building a regular disk
filesystem, because the data spans multiple nodes and all the complications of
network programming kick in.

         •Hardware Failure
• An HDFS instance may consist of hundreds or thousands of server machines,
  each storing part of the file system’s data.
•The fact that there are a huge number of components and that each component
  has a non-trivial probability of failure means that some component of HDFS is
  always non-functional.
•Therefore, detection of faults and quick, automatic recovery from them is a core
  architectural goal of HDFS.

         •Large Data Sets
• Applications that run on HDFS have large data sets. A typical file in HDFS is
   gigabytes to terabytes in size. Thus, HDFS is tuned to support large files.
• It should provide high aggregate data bandwidth and scale to hundreds of nodes
  in a single cluster.
GOALS OF HDFS
          Streaming Data Access
• Applications that run on HDFS need streaming access to their data sets.
• They are not general purpose applications that typically run on general purpose
  file systems.
• HDFS is designed more for batch processing rather than interactive use by users.
  The emphasis is on high throughput of data access rather than low latency of data
  access.

          Simple Coherency Model
• HDFS applications need a write-once-read-many access model for files.
• A file once created, written, and closed need not be changed.
• This assumption simplifies data coherency issues and enables high throughput
  data access.
• A Map/Reduce application or a web crawler application fits perfectly with this
  model.
• There is a plan to support appending-writes to files in the future.
Goals of HDFS
    Portability Across Heterogeneous Hardware and Software
    Platforms
    HDFS has been designed to be easily portable from one platform to
    another.
    This facilitates widespread adoption of HDFS as a platform of choice
    for a large set of applications.


  Moving Computation is Cheaper than Moving Data
 A computation request is much more efficient if it is executed near the
  data it operates on.
 This minimizes network congestion and increases the overall
  throughput of the system.
 HDFS provides interfaces for applications to move themselves closer
  to where the data is located.
HDFS CONCEPT
 Blocks:

•   A block is the minimum amount of data that can be read or written;
    in HDFS it is 64 MB by default.
•   Files in HDFS are broken into block-sized chunks, which are stored
    as independent units.
•   HDFS blocks are large compared to disk blocks and the reason is to
    minimize the cost of seeks. By making a block large enough, the time
    to transfer the data from the disk can be made to be significantly
    larger than the time to seek to the start of the block. Thus the time to
    transfer a large file made of multiple blocks operates at the disk
    transfer rate.
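
To make the trade-off concrete with illustrative numbers (assumed here, not
from the slides): if the seek time is about 10 ms and the transfer rate is
100 MB/s, reading one 100 MB block costs roughly 1 s of transfer against 10 ms
of seeking, so seeks add about 1% overhead. Reading the same 100 MB as a
hundred 1 MB blocks would require a hundred seeks, about 1 s of seeking against
1 s of transfer, roughly 50% overhead.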
BENEFITS OF BLOCK
ABSTRACTION
   A file can be larger than any single disk in the network. There’s
    nothing that requires the blocks from a file to be stored on the same
    disk, so they can take advantage of any of the disks in the cluster.

   Making the unit of abstraction a block rather than a file simplifies the
    storage subsystem.

   Blocks provide fault tolerance and availability. To insure against
    corrupted blocks and disk and machine failure, each block is
    replicated to a small number of physically separate machines
    (typically three). If a block becomes unavailable, a copy can be read
    from another location in a way that is transparent to the client.
NAMENODES AND DATANODES
 An HDFS cluster has two types of node operating in a master-worker
    pattern: a namenode (the master) and a number of datanodes (workers).

   The namenode manages the filesystem namespace. It maintains the
    filesystem tree and the metadata for all the files and directories in the
    tree.

   Datanodes are the workhorses of the filesystem. They store and retrieve
    blocks when they are told to (by clients or the namenode) and they report
    back to the namenode periodically with lists of blocks that they are
    storing.


   Without the namenode, the filesystem cannot be used. In fact, if the
    machine running the namenode were obliterated, all the files on the
    filesystem would be lost since there would be no way of knowing how to
    reconstruct the files from the blocks on the datanodes.
NAMENODE & DATANODES
SECONDARY NAMENODE CONCEPT

  To make the namenode resilient to failure, Hadoop provides two
  mechanisms:


 1. Back up the files that make up the persistent state of the
   filesystem metadata. Hadoop can be configured so that the
   namenode writes its persistent state to multiple filesystems.


2. Another solution is to run a secondary namenode. The secondary
  namenode usually runs on a separate physical machine, since it
  requires plenty of CPU and as much memory as the namenode to
  perform the merge. It keeps a copy of the merged namespace image,
  which can be used in the event of the namenode failing.
SECONDARY NAMENODE
HDFS OPERATION
HDFS READ OPERATION
HDFS WRITE OPERATION
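
The read and write paths shown in the figures are driven through Hadoop's
FileSystem API. A minimal hedged sketch; the path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);      // resolves to the configured namenode
    Path p = new Path("/user/demo/data.txt");  // placeholder path

    // Write: the client streams data while HDFS pipelines it to the replica datanodes.
    FSDataOutputStream out = fs.create(p);
    out.writeUTF("hello hdfs");
    out.close();

    // Read: the namenode supplies block locations; the bytes come from datanodes.
    FSDataInputStream in = fs.open(p);
    System.out.println(in.readUTF());
    in.close();
  }
}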
FILE SYSTEM NAMESPACE
    HDFS supports a traditional hierarchical file organization. A user
    or an application can create and remove files, move a file from
    one directory to another, rename a file, create directories and
    store files inside these directories.

   HDFS does not yet implement user quotas or access permissions.
    HDFS does not support hard links or soft links. However, the
    HDFS architecture does not preclude implementing these
    features.

   The Namenode maintains the file system namespace. Any
    change to the file system namespace or its properties is recorded
    by the Namenode. An application can specify the number of
    replicas of a file that should be maintained by HDFS. The number
    of copies of a file is called the replication factor of that file. This
    information is stored by the Namenode.

DATA REPLICATION
   The blocks of a file are replicated for fault tolerance.

   The NameNode makes all decisions regarding replication of blocks. It
    periodically receives a Heartbeat and a Blockreport from each of the
    DataNodes in the cluster. Receipt of a Heartbeat implies that the
    DataNode is functioning properly.


   A Blockreport contains a list of all blocks on a DataNode.


   When the replication factor is three, HDFS’s placement policy is to
    put one replica on one node in the local rack, another on a different
    node in the local rack, and the last on a different node in a different
    rack.
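
The replication factor can be set cluster-wide through configuration or per
file through the API. A short fragment, reusing the fs handle and path from the
read/write sketch earlier:

// Cluster-wide default, usually set in the HDFS configuration:
// conf.setInt("dfs.replication", 3);
// Per-file override, here to the typical factor of three:
fs.setReplication(new Path("/user/demo/data.txt"), (short) 3);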
DATA REPLICATION
OPTIMAL REPLICATION
MAIN FEATURES OF HDFS
   Cluster Rebalancing
•   HDFS automatically moves data from one datanode to another if the free space
    on a datanode falls below a certain threshold.
•   In case of a sudden surge in demand, it can dynamically create additional
    replicas and rebalance other data in the cluster.
   Data Integrity
•   It is possible that a block of data fetched from a datanode arrives corrupted,
    due to faults in the storage device, network faults or buggy software.
•   When a client creates an HDFS file, it computes a checksum of each block of
    the file and stores it in a separate hidden file in the same HDFS namespace.
•   When a client retrieves file contents, it verifies the data it received by
    recomputing the checksum.
   Robustness against namenode failure, datanode
    failure and network partitions
HADOOP ARCHIVES
   HDFS stores small files inefficiently, since each file is stored in a
    block, and block metadata is held in memory by the namenode.
    Thus, a large number of small files can eat up a lot of memory on
    the namenode.

   Hadoop Archives or HAR files are a file archiving facility that
    packs files into HDFS blocks more efficiently, thereby reducing
    namenode memory usage while still allowing transparent access
    to files.

   Hadoop Archives can be used as input to MapReduce.

   Archives are immutable once they have been created. To add or
    remove files, you must recreate the archive.
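
For illustration, an archive is built with the hadoop archive tool and then
addressed through the har:// scheme; the paths below are placeholders:

# Pack /user/demo/input into files.har, stored under /user/demo/archives
hadoop archive -archiveName files.har -p /user/demo input /user/demo/archives

# The archive can then be used, e.g., as MapReduce input:
#   har:///user/demo/archives/files.har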
LIMITATIONS OF HDFS
   Low-latency data access
    Applications that require low-latency access to data, in the tens of
    milliseconds range, will not work well with HDFS. Remember, HDFS is
    optimized for delivering a high throughput of data, and this may be
    at the expense of latency. HBase is currently a better choice for low-
    latency access.


   Multiple writers, arbitrary file modifications
    Files in HDFS may be written to by a single writer. Writes are always
    made at the end of the file. There is no support for multiple writers, or
    for modifications at arbitrary offsets in the file.
FILESYSTEMS SUPPORTED BY HADOOP

NAME               JAVA IMPLEMENTATION         DESCRIPTION

HFTP               hdfs.HftpFileSystem         A filesystem providing read-only
                                               access to HDFS over HTTP.

HDFS               hdfs.DistributedFileSystem  HDFS is designed to work
                                               efficiently in conjunction with
                                               MapReduce.

Local              fs.LocalFileSystem          A filesystem for a locally
                                               connected disk with client-side
                                               checksums.

KFS (CloudStore)   fs.kfs.KosmosFileSystem     CloudStore is a distributed
                                               filesystem like HDFS or GFS.

HSFTP              hdfs.HsftpFileSystem        A filesystem providing read-only
                                               access to HDFS over HTTPS.

HAR                fs.HarFileSystem            A filesystem layered on another
                                               filesystem for archiving files.

FTP                fs.ftp.FTPFileSystem        A filesystem backed by an FTP
                                               server.

S3 (block-based)   fs.s3.S3FileSystem          A filesystem backed by Amazon S3,
                                               which stores files in blocks to
                                               overcome S3's file-size limitation.
COMPARISON OF HADOOP WITH
  RDBMS
     Hadoop uses a brute-force method, whereas an RDBMS has
      optimization methods for accessing data, such as indexes.

                        TRADITIONAL RDBMS      MAPREDUCE

Data size               Gigabytes              Petabytes

Access                  Interactive and batch  Batch

Updates                 Read and write many    Write once, read many
                        times                  times

Structure               Static schema          Dynamic schema

Integrity               High                   Low

Scaling                 Nonlinear              Linear
CASE STUDY OF SCIENTIFIC DATA
PROCESSING ON A CLOUD USING
HADOOP
DESCRIPTION
• Our goal is to study the complex molecular interactions
  that regulate biological systems.
• We have developed an imaging platform to acquire
  and analyze live cell data.
• The platform has the capability to record data at
  high throughput and to analyze the data efficiently.
DESCRIPTION CONTD...

   The acquisition system has a data rate of 1.5 MBps, and a
    typical 48-hour experiment can generate more than 260
    GB of images.
   The data analysis task for this platform is
    daunting: thousands of cells in the video need to be tracked
    and characterized individually.
   Image analysis is the current bottleneck in our data
    processing pipeline; to remove it, we use parallelization.
   To gather such a large amount of data, store it across different
    nodes and perform analysis on it, we use the Hadoop framework.
   We use a local eight-core server for data processing.
SYSTEM DESIGN
The Hadoop components used for building such a
platform are:
 1. The MapReduce programming and
     execution environment.
 2. The reliable distributed file system
     called DFS.
 3. A BigTable-like storage system for
     sparsely structured data called HBase.
PROGRAMMING MAP-
REDUCE

 This layer controls the way input data is split into parts
  for MapReduce processing, the way input data formats are
  handled, and the extraction of atomic data records from
  each split file.
 This approach is implemented by writing new classes
  that implement the Hadoop interfaces for handling
  input and input splits. We have implemented the
  following classes:
 StringArrayInputFormat.java (implements Hadoop's
  InputFormat interface).
  CommaSeparatedStringInputSplitRecordReader.java
  (implements Hadoop's RecordReader interface).
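
The implementations themselves are not reproduced in the slides; the fragment
below only sketches how such classes would be wired into a job through the
classic API (AnalysisJob is a hypothetical driver class):

// Plug the custom input handling into the job: Hadoop will call
// StringArrayInputFormat to compute splits and to hand each map task a
// CommaSeparatedStringInputSplitRecordReader for pulling records out of its split.
JobConf conf = new JobConf(AnalysisJob.class);
conf.setInputFormat(StringArrayInputFormat.class);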
HADOOP DFS
 Hadoop's DFS is a flat-structured distributed file
  system.
 Its master node is called the namenode, and its slave nodes
  are called datanodes.
 The namenode is visible to all cloud nodes and provides a
  uniform global view of file paths in a traditional
  hierarchical structure.
 File contents are not stored hierarchically, but are
  divided into low-level data chunks and stored in
  datanodes with replication.
 Data chunk pointers for files are linked to their
  corresponding locations by the namenode.
HBASE TABLE

 HBase is a BigTable-like data store. It also employs a
  master-slave topology, where its master maintains a
  table-like view for users.
 The data stored in HBase are sorted key-value pairs,
  logically organized as sparse tables indexed by row keys
  with corresponding column-family values.
 Each column family represents one or more nested
  key-value pairs that are grouped together with the
  same column family as key prefix.
TASKS PERFORMED BY THE
SYSTEM
 The client can issue three types of simple
  requests to the cloud application: a request for
  transferring experiment data, a request for
  performing an analysis job on a certain
  acquisition, and a request for querying/viewing
  analysis results.
 For a submission or query request, the system inserts a
  record into the analysis table in HBase.
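
A hedged sketch of such an insert against the HBase client API of that era; the
table name "analysis", the row key, and the column names are assumptions based
on the description above, not taken from the case study's code:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SubmitAnalysisRequest {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "analysis");
    Put put = new Put(Bytes.toBytes("acquisition-0001"));  // row key (assumed)
    // Column family "request", qualifier "type": one nested key-value pair.
    put.add(Bytes.toBytes("request"), Bytes.toBytes("type"),
            Bytes.toBytes("analysis"));
    table.put(put);
  }
}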
SYSTEM DESIGN
HBASE TABLE
ADVANTAGES OF USING HBASE
 DFS provides reliable storage, and current database
  systems cannot make use of DFS directly. Therefore,
  HBase may have a better degree of fault tolerance for
  large-scale data management in some sense.
 We find it natural and convenient to model our data
  using sparse tables because we have varying numbers
  of fields with different metadata, and the table
  organization can change significantly as the usage of
  the system is extended to new analysis programs and
  new types of input or output data.
COMPARISON OF THE HADOOP AND
 NON-HADOOP APPROACHES
HADOOP                               NON-HADOOP

Hadoop's MapReduce.                  MapReduce.

Hadoop's Distributed File System     Google File System (GFS).
(DFS).

The HBase storage system for sparse  BigTable, a scalable and reliable
structured data.                     distributed storage system for sparse
                                     structured data.
APPLICATION AREAS OF
HADOOP
Some of the key areas of distributed computing
where Hadoop runs efficiently are:
 Mobile data

 E-commerce

 Energy discovery

 Energy saving

 Infrastructure management

 Image processing

 Online travel booking
REFERENCES

   J. Dean and S. Ghemawat, ‘‘MapReduce: Simplified Data Processing on Large
    Clusters”, Communications of the ACM, 51(1), pages 107-113, 2008.

   K. Kim, K. Jeon, H. Han, G. Kim, ‘‘MRBench: A Benchmark for MapReduce
    Framework”, in Proceedings of the 14th IEEE International Conference on
    Parallel and Distributed Systems, pages 11-18, 2008.

   http://developer.yahoo.com/Hadoop/tutorial/module5.html

   G. Attebury, A. Baranovski, K. Bloom, et al., ‘‘Hadoop distributed file
    system for the Grid”, IEEE, pages 1056-1061, 2009.

   K. Shvachko, H. Kuang, S. Radia, et al., ‘‘The Hadoop Distributed File
    System”, pages 1-10, 2010.

   http://hadoop.apache.org/architecture-hdfs

   Chen Zhang, Hans De Sterck, Ashraf Aboulnaga and Rob Sladek, ‘‘Case Study of
    Scientific Data Processing on a Cloud Using Hadoop”.

   Tao Fei, Zhang Lin, Guo Hua, Luo Yongliang, Ren Lei, ‘‘Typical characteristics
    of cloud manufacturing and several key issues of cloud service composition”,
    IEEE, vol. 17, pp. 477-486, Mar 2011.
THANK
 YOU!!!

Más contenido relacionado

La actualidad más candente

Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationateeq ateeq
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Simplilearn
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Rohit Agrawal
 

La actualidad más candente (20)

Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 

Destacado

MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at myliferesponseteam
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduceHassan A-j
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReducehuguk
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitionerSubhas Kumar Ghosh
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduceFrane Bandov
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsAnju Singh
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduceRyan Tabora
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHanborq Inc.
 
RecSysTEL lecture at advanced SIKS course, NL
RecSysTEL lecture at advanced SIKS course, NLRecSysTEL lecture at advanced SIKS course, NL
RecSysTEL lecture at advanced SIKS course, NLHendrik Drachsler
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
How to think like a startup
How to think like a startupHow to think like a startup
How to think like a startupLoic Le Meur
 

Destacado (17)

MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop HDFS Concepts
Hadoop HDFS ConceptsHadoop HDFS Concepts
Hadoop HDFS Concepts
 
Hadoop
HadoopHadoop
Hadoop
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Next Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduceNext Generation of Hadoop MapReduce
Next Generation of Hadoop MapReduce
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Intro to HDFS and MapReduce
Intro to HDFS and MapReduceIntro to HDFS and MapReduce
Intro to HDFS and MapReduce
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
RecSysTEL lecture at advanced SIKS course, NL
RecSysTEL lecture at advanced SIKS course, NLRecSysTEL lecture at advanced SIKS course, NL
RecSysTEL lecture at advanced SIKS course, NL
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
How to think like a startup
How to think like a startupHow to think like a startup
How to think like a startup
 

Similar a Hadoop ppt2

Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxUttara University
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoopveeracynixit
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combinersijcsit
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415SANTOSH WAYAL
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 

Similar a Hadoop ppt2 (20)

Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop map reduce
Hadoop map reduceHadoop map reduce
Hadoop map reduce
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Distributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptxDistributed Systems Hadoop.pptx
Distributed Systems Hadoop.pptx
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Big data overview of apache hadoop
Big data overview of apache hadoopBig data overview of apache hadoop
Big data overview of apache hadoop
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
Hadoop
HadoopHadoop
Hadoop
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
Hadoop overview.pdf
Hadoop overview.pdfHadoop overview.pdf
Hadoop overview.pdf
 

Último

Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 

Último (20)

Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 

Hadoop ppt2

  • 2. CONTENTS WHAT IS HADOOP HADOOP APPROACH SUB PROJECTS OF HADOOP  MAPREDUCE  HDFS CASE STUDY USING HADOOP APPLICATION AREA OF HADOOP
  • 3. WHAT IS HADOOP  It is a open source software framework which support data intensive distributed application.  It enables us to explore complex data,using custom analysis tailored to our information and question.  It is helpful in unstructured and semistructured data analysis.
  • 4. HADOOP APPROACH The hadoop works in two phase.  First phase is Data Distribution.  Second phase is Map reduce:Isolated Process. Data Distribution:- in Data Distribution phase data is loaded to all the nodes of the clusters as it is being loaded. The HDFS will split large data files into chunks managed by different clusters. In addition,data is replicated to across many sites so that single failure will Not result in data unavailable.
  • 5. DATA DISTRIBUTION PROCESS Data replicated across different sites, form a common namespace so they are Universally accessible.
  • 6. DATA DISTRIBUTION PHASE CONTD… Individual input files are broken into lines or other format specific to the Application Logic. Each process running on a node in the cluster then processes a subset of these Records. The hadoop framework then schedules these processes in proximity to the location of data/records,most data is read from the local disk straight into the cpu,preventing Unnecessary network transfers.this strategy of moving computation to data,instead of moving the data to computation allows Hadoop to achieve high performance.
  • 7. MAP REDUCE: ISOLATION PROCESS Hadoop limits the amount of communication as each individual record is processed by a task in isolation from one another. Records are processed in isolation by tasks called Mappers. Output from different mappers is brought into second list called as Reducers. The advantage of having isolated task processing is that no user level message exchange nor do nodes need rollback to pre-arranged checkpoints.
  • 9. SUB PROJECTS OF HADOOP  MAPREDUCE  HDFS  HIVE  CHUKWA  CORE  HBASE  AVRO
  • 10. WHAT IS MAPREDUCE?  MapReduce is a programming model used for processing large data sets.  Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.  MapReduce is an associated implementation for processing and generating large data sets.
  • 11. THE PROGRAMMING MODEL OF MAPREDUCE Map, written by the user, produces a set of intermediatekey/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function .
  • 12. The Programming Model Of MapReduce Contd… The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.
  • 13. Example of programming model:- Let Map(k,v)=emit(k.toupper(),v.toupper()); Map(foo,bar).FOO,BAR ,Map(ask,cat)-> ASK,CAT let reduce(k, vals) sum = 0 foreach int v in vals: sum += v emit(k, sum) (“a”, [42, 100, 312]) --> (“A”, 454) (“b”, [12, 6, -2]) --> (“B”, 16)
  • 14. BENEFITS OF PROGRAMING MODEL Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run time system (Hadoop framework) takes care of the details of partitioning the input data,scheduling the programme execution,handling machine failures. This allows programmers without any experience inparallel and distributed systems to easily utilize the resources of a large distributed system.
  • 15. MAPREDUCE WORKING A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks .  There are two types of nodes that control the job execution process: tasktrackers and jobtrackers .  The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.  Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.  If a tasks fails, the jobtracker can reschedule it on a different tasktracker.
  • 16. MapReduce Working Contd… Input splits: Hadoop divides the input to a MapReduce job into fixed- size pieces called input splits or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.  The quality of the load balancing increases as the splits become more fine-grained.  But if splits are too small, then the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of a HDFS block, 64 MB by default.
  • 18. INPUT TO REDUCE TASKS  Reduce tasks don’t have the advantage of data locality—the input to a single reduce task is normally the output from all mappers.  Input for reduce task can be of single input,multiple input. For some application there is no need of reduce function,in that case output from map function is directly stored in hdfs. MapReduce data flow with a single reduce task
  • 19. MapReduce data flow with multiple reduce tasks MapReduce data flow with no reduce tasks
  • 20. COMBINER FUNCTION •Many MapReduce jobs are limited by the bandwidth available on the cluster. •In order to minimize the data transferred between the map and reduce tasks, combiner functions are introduced. •Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. •Combiner finctions can help cut down the amount of data shuffled between the maps and the reduces.
  • 21. HADOOP MAPREDUCE UTILITY HADOOP STREAMING HADOOP PIPES Hadoop provides an API to Hadoop Pipes is the name of The MapReduce that allows the C++ interface to Hadoop user to write their map and MapReduce. reduce functions in languages other than Java. Hadoop Streaming uses Unlike Streaming, which uses Unix standard streams as standard input and output to the interface between communicate with the map and Hadoop and your program, reduce code, Pipes uses sockets so you can use any language as the channel over which the that can read standard input tasktracker communicates with and write to standard output the process running the C++ map to write your MapReduce or reduce function. JNI is not program. used.
  • 22. HADOOP DISTRIBUTED FILESYSTEM (HDFS)  Filesystems that manage the storage across a network of machines are called distributed filesystems.  Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.  HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to the information.  HDFS make our filesystem tolerate to node failure without suffering data loss.
  • 23. GOALS OF HDFS Making distributed filesystems is more complex than regular disk filesystems. This is because the data is spanned over multiple nodes, so all the complications of network programming kick in. •Hardware Failure • An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. •The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. •Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. •Large Data Sets • Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. • It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
  • 24. GOALS OF HDFS Streaming Data Access • Applications that run on HDFS need streaming access to their data sets. • They are not general purpose applications that typically run on general purpose file systems. • HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. Simple Coherency Model • HDFS applications need a write-once-read-many access model for files. • A file once created, written, and closed need not be changed. • This assumption simplifies data coherency issues and enables high throughput data access. • A Map/Reduce application or a web crawler application fits perfectly with this model. • There is a plan to support appending-writes to files in the future.
• 25. GOALS OF HDFS Portability Across Heterogeneous Hardware and Software Platforms  HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. Moving Computation Is Cheaper than Moving Data  A computation request is much more efficient if it is executed near the data it operates on.  This minimizes network congestion and increases the overall throughput of the system.  HDFS provides interfaces for applications to move themselves closer to where the data is located.
• 26. HDFS CONCEPTS  Blocks: • A block is the minimum amount of data that can be read or written; in HDFS it is 64 MB by default. • Files in HDFS are broken into block-sized chunks, which are stored as independent units. • HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk becomes significantly larger than the time to seek to the start of the block, so a large file made of multiple blocks is transferred at close to the disk transfer rate.
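A sketch of raising the block size on the client side. The property name is version-dependent (it was spelled "dfs.block.size" in older Hadoop releases and "dfs.blocksize" in later ones), so treat the key below as an assumption to check against your release:

import org.apache.hadoop.conf.Configuration;

public class BlockSizeConfig {
  public static Configuration largeBlockConf() {
    Configuration conf = new Configuration();
    // Request 128 MB blocks instead of the 64 MB default for files
    // written with this configuration.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    return conf;
  }
}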
• 27. BENEFITS OF BLOCK ABSTRACTION  A file can be larger than any single disk in the network. Nothing requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.  Making the unit of abstraction a block rather than a file simplifies the storage subsystem.  Blocks provide fault tolerance and availability. To guard against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client.
• 28. NAMENODES AND DATANODES  An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers).  The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.  Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks they are storing.  Without the namenode the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost, since there would be no way to reconstruct the files from the blocks on the datanodes.
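A short sketch of this division of labor through the standard FileSystem client API (the path is a placeholder): the block-location query below is answered entirely from the namenode's metadata, while the hosts it returns are the datanodes actually holding the replicas.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/ankit/data.txt"));
    // Ask the namenode which datanodes hold each block of the file.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + " stored on " + Arrays.toString(block.getHosts()));
    }
  }
}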
• 30. SECONDARY NAMENODE CONCEPT Hadoop provides two mechanisms to make the namenode resilient to failure: 1. Back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. 2. Run a secondary namenode. The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.
• 35. FILE SYSTEM NAMESPACE  HDFS supports a traditional hierarchical file organization. A user or an application can create and remove files, move a file from one directory to another, rename a file, create directories, and store files inside these directories, as sketched below.  HDFS does not yet implement user quotas or access permissions, and it does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.  The namenode maintains the filesystem namespace. Any change to the filesystem namespace or its properties is recorded by the namenode. An application can specify the number of replicas of a file that HDFS should maintain. The number of copies of a file is called the replication factor of that file, and this information is stored by the namenode.
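These namespace operations map directly onto the FileSystem API; a minimal sketch with placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/ankit/experiments");
    fs.mkdirs(dir);                                              // create a directory
    fs.rename(new Path(dir, "run1"), new Path(dir, "run1.old")); // move or rename a file
    fs.delete(new Path(dir, "scratch"), true);                   // remove recursively
  }
}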
  • 36. DATA REPLICATION  The blocks of a file are replicated for fault tolerance.  The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.  A Blockreport contains a list of all blocks on a DataNode.  When the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack.
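Since the replication factor is per-file metadata held by the NameNode, a client can also change it after a file has been written; a sketch with a placeholder path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask for five replicas instead of the default three; the NameNode
    // schedules the extra copies asynchronously.
    fs.setReplication(new Path("/user/ankit/important.dat"), (short) 5);
  }
}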
• 39. MAIN FEATURES OF HDFS  Cluster Rebalancing • HDFS automatically moves data from one datanode to another if the free space on a datanode falls below a certain threshold. • In case of a sudden spike in demand, it can dynamically create additional replicas and rebalance other data in the cluster.  Data Integrity • A block of data fetched from a datanode may arrive corrupted because of faults in the storage device, network faults, or buggy software. • When a client creates an HDFS file, it computes a checksum of each block of the file and stores the checksums in a separate hidden file in the same HDFS namespace. • When a client retrieves file contents, it verifies the data it received by recomputing the checksum.  Robustness against namenode failure, datanode failure, and network partition
• 40. HADOOP ARCHIVES  HDFS stores small files inefficiently, since each file is stored in a block and block metadata is held in memory by the namenode. Thus, a large number of small files can eat up a lot of memory on the namenode.  Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, reducing namenode memory usage while still allowing transparent access to the files (see the sketch below).  Hadoop Archives can be used as input to MapReduce.  Archives are immutable once they have been created; to add or remove files, you must recreate the archive.
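Files inside an archive are addressed through the har URI scheme, layered over the underlying filesystem; a sketch in which the archive name and the inner path are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromArchive {
  public static void main(String[] args) throws Exception {
    // A HAR is a filesystem layered over HDFS, so files inside it are
    // read through the ordinary FileSystem API.
    Path inArchive = new Path("har:///user/ankit/files.har/logs/part-00000");
    FileSystem fs = inArchive.getFileSystem(new Configuration());
    FileStatus status = fs.getFileStatus(inArchive);
    System.out.println(status.getPath() + " is " + status.getLen() + " bytes");
  }
}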
• 41. LIMITATIONS OF HDFS  Low-latency data access Applications that require low-latency access to data, in the tens-of-milliseconds range, will not work well with HDFS. Remember that HDFS is optimized for delivering a high throughput of data, and this may come at the expense of latency. HBase is currently a better choice for low-latency access.  Multiple writers, arbitrary file modifications Files in HDFS may be written to by a single writer, and writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
• 42. FILESYSTEMS SUPPORTED BY HADOOP (name, Java implementation, description)  HFTP (hdfs.HftpFileSystem): a filesystem providing read-only access to HDFS over HTTP.  HDFS (hdfs.DistributedFileSystem): Hadoop's distributed filesystem, designed to work efficiently in conjunction with MapReduce.  Local (fs.LocalFileSystem): a filesystem for a locally connected disk with client-side checksums.  KFS / CloudStore (fs.kfs.KosmosFileSystem): CloudStore is a distributed filesystem like HDFS or GFS.
• 43.  HSFTP (hdfs.HsftpFileSystem): a filesystem providing read-only access to HDFS over HTTPS.  HAR (fs.HarFileSystem): a filesystem layered on another filesystem for archiving files.  FTP (fs.ftp.FTPFileSystem): a filesystem backed by an FTP server.  S3, block-based (fs.s3.S3FileSystem): a filesystem backed by Amazon S3, which stores files in blocks to overcome S3's file-size limitation.
• 44. COMPARISON OF HADOOP WITH RDBMS  Hadoop uses a brute-force scan, whereas an RDBMS has optimization methods for accessing data, such as indexes. Traditional RDBMS versus MapReduce:  Data size: gigabytes versus petabytes.  Access: interactive and batch versus batch only.  Updates: read and write many times versus write once, read many times.  Structure: static schema versus dynamic schema.  Integrity: high versus low.  Scaling: nonlinear versus linear.
• 45. CASE STUDY OF SCIENTIFIC DATA PROCESSING ON A CLOUD USING HADOOP DESCRIPTION • Our goal is to study the complex molecular interactions that regulate biological systems. • We have to develop an imaging platform to acquire and analyze live cell data. • The platform has the capability to record data at high throughput and analyze the data efficiently.
• 46. DESCRIPTION CONTD...  The acquisition system has a data rate of 1.5 MBps, and a typical 48-hour experiment can generate more than 260 GB of images.  The data analysis task for this platform is daunting: thousands of cells in the video need to be tracked and characterized individually.  Image analysis is the current bottleneck in our data processing pipeline; to remove it, we use parallelization.  To gather, store, and analyze such a large amount of information across different nodes, we use the Hadoop framework.  We use a local eight-core server for data processing.
• 47. SYSTEM DESIGN The Hadoop components used for building such a platform are: 1. The MapReduce programming and execution environment. 2. The reliable distributed filesystem called DFS. 3. A BigTable-like storage system for sparsely structured data called HBase.
• 48. PROGRAMMING MAP- REDUCE  We control the way input data is split into parts for MapReduce processing, how input data formats are handled, and how atomic data records are extracted from each split.  This approach is implemented by writing new classes that implement the Hadoop interfaces for handling input and input splits. We have implemented the following classes:  StringArrayInputFormat.java (implements Hadoop's InputFormat interface).  CommaSeparatedStringInputSplitRecordReader.java (implements Hadoop's RecordReader interface).
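The paper's source code is not reproduced here, but the pattern it describes can be sketched against the org.apache.hadoop.mapreduce API. Every class name and the "stringarray.input" job parameter below are hypothetical: each split carries one comma-separated string, and the record reader hands out its elements one at a time.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// One split = one comma-separated string; one record = one element of it.
public class StringArrayInput extends InputFormat<Text, Text> {

  // A split carrying its data inline; Writable so the framework can ship it.
  public static class StringSplit extends InputSplit implements Writable {
    private String data = "";
    public StringSplit() {}
    public StringSplit(String data) { this.data = data; }
    public String getData() { return data; }
    @Override public long getLength() { return data.length(); }
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) throws IOException { out.writeUTF(data); }
    @Override public void readFields(DataInput in) throws IOException { data = in.readUTF(); }
  }

  @Override
  public List<InputSplit> getSplits(JobContext ctx) {
    // "stringarray.input" is a made-up job parameter: groups separated by ';'
    // become splits; items within a group, separated by ',', become records.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String group : ctx.getConfiguration().get("stringarray.input", "").split(";")) {
      if (!group.isEmpty()) splits.add(new StringSplit(group));
    }
    return splits;
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new RecordReader<Text, Text>() {
      private String[] records;
      private int pos = -1;
      @Override public void initialize(InputSplit s, TaskAttemptContext c) {
        records = ((StringSplit) s).getData().split(",");
      }
      @Override public boolean nextKeyValue() { return ++pos < records.length; }
      @Override public Text getCurrentKey() { return new Text(Integer.toString(pos)); }
      @Override public Text getCurrentValue() { return new Text(records[pos]); }
      @Override public float getProgress() {
        return records == null ? 0f : (float) (pos + 1) / records.length;
      }
      @Override public void close() {}
    };
  }
}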
• 49. HADOOP DFS  Hadoop's DFS is a flat-structured distributed filesystem.  Its master node is called the namenode, and its slave nodes are called datanodes.  The namenode is visible to all cloud nodes and provides a uniform global view of file paths in a traditional hierarchical structure.  File contents are not stored hierarchically; they are divided into low-level data chunks and stored in datanodes with replication.  Data chunk pointers for files are linked to their corresponding locations by the namenode.
• 50. HBASE TABLE  HBase is a BigTable-like data store. It also employs a master-slave topology, where its master maintains a table-like view for users.  The data stored in HBase are sorted key-value pairs, logically organized as sparse tables indexed by row keys with corresponding column family values.  Each column family represents one or more nested key-value pairs that are grouped together under the same column family as key prefix.
• 51. TASKS PERFORMED BY THE SYSTEM  The client can issue three types of simple requests to the cloud application: a request to transfer experiment data, a request to perform an analysis job on a certain acquisition, and a request to query or view analysis results.  For a submission or query request, the application inserts a record into the analysis table in HBase, as sketched below.
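A sketch of such an insert using the HBase client API of that era; the table name, row key, and column names are placeholders, not the paper's actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SubmitAnalysisRequest {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "analysis");          // placeholder table name
    Put put = new Put(Bytes.toBytes("acquisition-0042")); // placeholder row key
    // One column family ("request") holding two nested key-value pairs.
    put.add(Bytes.toBytes("request"), Bytes.toBytes("type"), Bytes.toBytes("analyze"));
    put.add(Bytes.toBytes("request"), Bytes.toBytes("status"), Bytes.toBytes("pending"));
    table.put(put);
    table.close();
  }
}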
• 54. ADVANTAGES OF USING HBASE  DFS provides reliable storage, and current database systems cannot make use of DFS directly. Therefore, HBase may have a better degree of fault tolerance for large-scale data management in some sense.  We find it natural and convenient to model our data using sparse tables, because we have varying numbers of fields with different metadata, and the table organization can change significantly as the usage of the system is extended to new analysis programs and new types of input or output data.
• 55. COMPARISON OF THE HADOOP AND NON-HADOOP APPROACHES  Hadoop's MapReduce corresponds to Google's MapReduce.  Hadoop's Distributed File System (DFS) corresponds to the Google File System (GFS).  The HBase storage system for sparse structured data corresponds to BigTable, a scalable and reliable distributed storage system for sparse structured data.
• 56. APPLICATION AREAS OF HADOOP Some of the key areas of distributed computing where Hadoop runs efficiently are:  Mobile data  E-commerce  Energy discovery  Energy saving  Infrastructure management  Image processing  Online travel booking
• 57. REFERENCES  J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Communications of the ACM, 51(1), pages 107-113, 2008.  K. Kim, K. Jeon, H. Han, G. Kim, "MRBench: A Benchmark for MapReduce Framework", in Proceedings of the 14th IEEE International Conference on Parallel and Distributed Systems, pages 11-18, 2008.  http://developer.yahoo.com/Hadoop/tutorial/module5.html  G. Attebury, A. Baranovski, K. Bloom, et al., "Hadoop Distributed File System for the Grid", IEEE, pages 1056-1061, 2009.  K. Shvachko, H. Kuang, S. Radia, "The Hadoop Distributed File System", in Proceedings of the IEEE Symposium on Mass Storage Systems and Technologies (MSST), pages 1-10, 2010.  http://hadoop.apache.org/architecture-hdfs  Chen Zhang, Hans De Sterck, Ashraf Aboulnaga, and Rob Sladek, "Case Study of Scientific Data Processing on a Cloud Using Hadoop".  Tao Fei, Zhang Lin, Guo Hua, Luo Yongliang, Ren Lei, "Typical Characteristics of Cloud Manufacturing and Several Key Issues of Cloud Service Composition", IEEE, vol. 17, pp. 477-486, March 2011.