3. What is Hadoop?
• An Apache top-level project: an open-source
implementation of frameworks for reliable,
scalable, distributed computing and data storage.
• A flexible, highly available architecture for
large-scale computation and data processing on a
network of commodity hardware.
4. What is Hadoop?
• Hadoop is a software framework for distributed processing of
large datasets across large clusters of computers
• Large datasets: terabytes or petabytes of data
• Large clusters: hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's
MapReduce
• Hadoop is based on a simple programming model called
MapReduce
• Hadoop is based on a simple data model: any data will fit
5. Brief History of Hadoop
• Google introduced the MapReduce algorithm.
• In 2005, Doug Cutting and his team took the solution
Google had published and started an open-source
project called Hadoop; Doug named it after
his son's toy elephant.
6. Hadoop's Developers
• 2005: Doug Cutting and Michael J. Cafarella developed
Hadoop to support distributed processing for the Nutch
search-engine project. The project was funded by Yahoo.
• 2006: Yahoo gave the project to the Apache
Software Foundation.
7. Large-Scale Data Analytics
• MapReduce computing paradigm (e.g., Hadoop) vs. traditional
database systems
• Many enterprises are turning to Hadoop, especially for
applications generating big data: web applications, social
networks, scientific applications
8. Why Hadoop is able to compete?
Hadoop (MapReduce):
• Scalability (petabytes of data,
thousands of machines)
• Flexibility in accepting all data
formats (no schema)
• Commodity, inexpensive hardware
• Efficient and simple fault-
tolerant mechanism
Traditional databases:
• Performance (tons of indexing,
tuning, and data-organization techniques)
• Structured data
9. Key Components
• The Hadoop framework consists of two main layers
• Distributed file system (HDFS)
• Execution engine (MapReduce)
12. Hadoop Distributed File System (HDFS)
• Centralized namenode
- Maintains metadata about the files
• Many datanodes (1000s)
- Store the actual data
- Files are divided into blocks (64 MB by default)
- Each block is replicated N times (default N = 3)
13. Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server
machines, each storing part of the file system's data
• Replication: each data block is replicated many times
(default is 3)
• Failure: across thousands of machines, component failure is
the norm rather than the exception
• Fault Tolerance: detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS
• The namenode constantly checks on the datanodes (via heartbeats)
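As a small illustration, here is a minimal sketch using the HDFS Java client API to inspect (and change) a file's block size and replication. The path /data/colors.txt is a hypothetical example, not from the slides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsInspect {
      public static void main(String[] args) throws Exception {
        // Connect to the filesystem named in the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/colors.txt");  // hypothetical example path
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size:  " + status.getBlockSize());    // e.g., 64 MB
        System.out.println("replication: " + status.getReplication());  // default 3
        // Replication can be changed per file; the namenode then
        // re-replicates or deletes block copies as needed.
        fs.setReplication(file, (short) 2);
      }
    }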
14. What is MapReduce?
• MapReduce is a programming model
• Programs written in this functional style are automatically
parallelized and executed on a large cluster of commodity
machines
• MapReduce is an associated implementation for processing
and generating large data sets
• MAP: a map function that processes a key/value pair to
generate a set of intermediate key/value pairs
• REDUCE: a reduce function that merges all intermediate
values associated with the same intermediate key
15. Properties of MapReduce Engine
• Job Tracker is the master node (runs with the
namenode)
• Receives the user's job
• Decides how many tasks will run (number of mappers)
• Decides where to run each mapper (concept of locality)
• Example: a file with 5 blocks needs 5 map tasks
• Where to run the task reading block "1"? Try to run it
on Node 1 or Node 3, the nodes holding replicas of that block
16. Properties of MapReduce Engine (Cont'd)
• Task Tracker is the slave node (runs on each datanode)
• Receives tasks from the Job Tracker
• Runs each task to completion (either a map or a reduce task)
• Always in communication with the Job Tracker, reporting progress
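To make the submission flow concrete, here is a minimal, hypothetical driver sketch using Hadoop's org.apache.hadoop.mapreduce API. Class names such as ColorCountDriver, ColorMapper, and ColorReducer are placeholders; the mapper and reducer themselves are sketched under Example 1 below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ColorCountDriver {
      public static void main(String[] args) throws Exception {
        // The job description the master uses to plan tasks:
        // one map task per input block, reduce tasks as configured.
        Job job = Job.getInstance(new Configuration(), "color count");
        job.setJarByClass(ColorCountDriver.class);
        job.setMapperClass(ColorMapper.class);    // hypothetical; see Example 1
        job.setReducerClass(ColorReducer.class);  // hypothetical; see Example 1
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submit the job and wait; workers report progress back to the master.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }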
17. The Programming Model of MapReduce
Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values
associated with the same intermediate key.
The Reduce function, also written by the user, accepts an intermediate key I and a set of
values for that key. It merges these values together to form a possibly smaller set of values.
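In the notation of Google's MapReduce paper, the two user-written functions have the types:

    map    (k1, v1)       -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v2)

The intermediate keys k2 and values v2 may have different types from the input pair (k1, v1).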
21. Example 1: Color Count
Job: count the number of occurrences of each color in a data set.
• Map tasks read input blocks from HDFS; each map task parses its
records and produces (k, v) pairs of the form (color, 1)
• Each pair is hashed on its key to pick a reduce partition, then
the pairs are shuffled and sorted by key k
• Each Reduce task consumes (k, [v]) pairs such as
(color, [1,1,1,1,1,1,...]) and produces (k', v') pairs such as (color, 100)
• The result is one output file with 3 parts (Part0001, Part0002,
Part0003), probably on 3 different machines
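A minimal sketch of what the map and reduce functions for this job could look like in Java, assuming the input is plain text with one color name per line (class and variable names are illustrative, not from the slides):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: parse each record and emit (color, 1).
    public class ColorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        ctx.write(new Text(line.toString().trim()), ONE);  // (color, 1)
      }
    }

    // Reduce: sum all the 1s for the same color, emit (color, count).
    class ColorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text color, Iterable<IntWritable> ones, Context ctx)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable one : ones) sum += one.get();
        ctx.write(color, new IntWritable(sum));  // e.g., (blue, 100)
      }
    }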
22. Example 2: Color Filter
Job: select only the blue and the green colors.
• Map tasks read input blocks from HDFS and produce (k, v) pairs
of the form (color, 1)
• Each map task selects only the blue or green colors and writes
its output directly to HDFS
• No need for a reduce phase (see the sketch after this list)
• The result is one output file with 4 parts (Part0001 through
Part0004), probably on 4 different machines
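A map-only version of the job is a short sketch: keep only a mapper and set the number of reduce tasks to zero, so each map task writes its part file straight to HDFS (same hypothetical input and naming assumptions as in Example 1):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only filter: pass through blue/green records, drop everything else.
    public class ColorFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String color = line.toString().trim();
        if (color.equals("blue") || color.equals("green")) {
          ctx.write(new Text(color), NullWritable.get());  // written directly to HDFS
        }
      }
    }
    // In the driver: job.setNumReduceTasks(0);  // skip shuffle and reduce entirely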
23. Why use Hadoop?
• Need to process multi-petabyte datasets
• Data may not have a strict schema
• Expensive to build reliability into each application
• Need for common infrastructure
• Very large distributed file system
• Assumes commodity hardware
• Optimized for batch processing
24. Who Uses MapReduce/Hadoop?
• Google: inventors of the MapReduce computing paradigm
• Yahoo: developed Hadoop, the open-source implementation of MapReduce
• IBM, Microsoft, Oracle
• Facebook, Amazon, AOL, Netflix
• Many others, plus universities and research labs