The document starts with an introduction to Hadoop and covers the Hadoop 1.x / 2.x services (HDFS / MapReduce / YARN).
It also explains the architecture of Hadoop and how the Hadoop Distributed File System (HDFS) and the MapReduce programming model work.
Hadoop Introduction
2. Apache Hadoop is a Java-based software framework for the distributed processing
of large data sets across clusters of computers using a simple programming model.
4. Hadoop Distributed File System
• Distributed, scalable and reliable
• Fault-tolerant storage system
Map-Reduce Programming Model
• High-performance parallel data processing
• Employs the divide-and-conquer principle
5. A class teacher of class 5 needs to find the name of the student with the highest marks
in each subject.
Our Goal
To minimize the total time spent
Input
Total students: 50
Total subjects: 5
Time to process each subject per student: 1 min
Total time spent (one person, one paper at a time): 50 x 5 x 1 min = 250 mins
Output (student and marks)
Subject 1: S1-98
Subject 2: S13-95
Subject 3: S1-97
Subject 4: S23-100
Subject 5: S8-99
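The 250-minute baseline can be sketched as a single sequential scan over every paper. This is only an illustration: the marks are made up by a deterministic formula, and the names make_marks and sequential_toppers are hypothetical, not part of the slides.

```python
# Baseline: one person checks every paper, one at a time.
# All marks below are made-up illustration data, not the slide's values.
NUM_STUDENTS = 50
NUM_SUBJECTS = 5
MINUTES_PER_PAPER = 1


def make_marks():
    """Return {student: {subject: marks}} with deterministic made-up marks."""
    return {
        f"S{s}": {
            f"Subject {j}": (s * 37 + j * 11) % 101
            for j in range(1, NUM_SUBJECTS + 1)
        }
        for s in range(1, NUM_STUDENTS + 1)
    }


def sequential_toppers(marks):
    """Scan all papers once; return ({subject: (student, marks)}, papers checked)."""
    best = {}
    papers_checked = 0
    for student, by_subject in marks.items():
        for subject, score in by_subject.items():
            papers_checked += 1
            if subject not in best or score > best[subject][1]:
                best[subject] = (student, score)
    return best, papers_checked


best, papers = sequential_toppers(make_marks())
total_minutes = papers * MINUTES_PER_PAPER
print("Papers checked:", papers)
print("Total minutes:", total_minutes)
```

The cost grows with students x subjects because nothing runs in parallel.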
6. a) HDFS: Distribute the data into blocks across multiple nodes
Distribute the papers across 5 peons. Each peon gets the papers of 10 students for
all 5 subjects (50 papers each).
b) Map Phase: Apply business logic on the distributed data in parallel
Each peon reports, for each subject, the name and marks of the top student
among his 10 students.
Total time spent: 50 mins (in parallel)
c) Reduce Phase: Iterate over the map phase output and get the final result
Total records left: 5 candidate students for each of the 5 subjects (25 records).
Time to pick the student with the highest marks in each subject: 25 mins
Total time spent: 50 + 25 = 75 mins
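The three steps above can be sketched as a small single-process simulation. It is a toy model under stated assumptions, not Hadoop code: the marks are made up, the function names (split_into_blocks, map_phase, reduce_phase) are hypothetical, and the "parallel" map time is modelled as the size of the largest block rather than measured.

```python
from collections import defaultdict

NUM_STUDENTS, NUM_SUBJECTS, NUM_PEONS = 50, 5, 5


def make_marks():
    """Made-up (student, subject, marks) records, ordered by student."""
    return [
        (f"S{s}", f"Subject {j}", (s * 37 + j * 11) % 101)
        for s in range(1, NUM_STUDENTS + 1)
        for j in range(1, NUM_SUBJECTS + 1)
    ]


def split_into_blocks(records, num_blocks):
    """'HDFS' step: give each peon an equal, contiguous share of the papers."""
    per_block = len(records) // num_blocks
    return [records[i * per_block:(i + 1) * per_block] for i in range(num_blocks)]


def map_phase(block):
    """Each peon emits (subject, (student, marks)) for his local top student."""
    local_best = {}
    for student, subject, score in block:
        if subject not in local_best or score > local_best[subject][1]:
            local_best[subject] = (student, score)
    return list(local_best.items())


def reduce_phase(mapped):
    """Group the map output by subject and keep the overall top student."""
    grouped = defaultdict(list)
    for subject, pair in mapped:
        grouped[subject].append(pair)
    return {subj: max(pairs, key=lambda p: p[1]) for subj, pairs in grouped.items()}


records = make_marks()
blocks = split_into_blocks(records, NUM_PEONS)
mapped = [kv for block in blocks for kv in map_phase(block)]
result = reduce_phase(mapped)

map_minutes = max(len(block) for block in blocks)  # peons work in parallel
reduce_minutes = len(mapped)                       # one minute per record left
print("Total minutes:", map_minutes + reduce_minutes)
```

The saving comes from the map phase shrinking 250 records to 25 before a single reducer has to look at them.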
8. HDFS Layer
Stores files across storage nodes in a Hadoop cluster
Consists of:
• NameNode & DataNodes
Map-Reduce Engine
Processes vast amounts of data in parallel on large clusters in a
reliable & fault-tolerant manner
Consists of:
• JobTracker & TaskTrackers
10. [Diagram: a client submits a Map-Reduce job to the Job Tracker, which assigns tasks to Task Tracker 1, Task Tracker 2 and Task Tracker 3. HDFS Blocks 1 to 4 are spread across the Task Tracker machines. Each slave machine runs a Task Tracker and a Data Node.]
Each Task Tracker executes the individual Map-Reduce tasks assigned by the Job Tracker.
Task Trackers retrieve data from HDFS that is stored on the local Data Node, i.e. on the
same system where the Task Tracker is running.
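The idea of running a task on the machine that already stores its block can be sketched as below. This is a simplified illustration, not the actual Job Tracker scheduler: the function assign_tasks, the host names (slave-1 and so on) and the block layout are all assumptions made for the example.

```python
def assign_tasks(block_locations, tracker_nodes):
    """Assign each block's task to a tracker, preferring a node-local one.

    block_locations: {block: [hosts storing the block]}   (hypothetical layout)
    tracker_nodes:   {tracker name: host it runs on}      (hypothetical names)
    Returns {block: (tracker, is_local)}.
    """
    node_to_tracker = {node: tracker for tracker, node in tracker_nodes.items()}
    load = {tracker: 0 for tracker in tracker_nodes}
    assignments = {}
    for block, nodes in block_locations.items():
        # Trackers running on a machine that already stores this block.
        local = [node_to_tracker[n] for n in nodes if n in node_to_tracker]
        # Fall back to any tracker when no local one exists.
        candidates = local or list(tracker_nodes)
        chosen = min(candidates, key=lambda t: load[t])  # least-loaded candidate
        load[chosen] += 1
        assignments[block] = (chosen, bool(local))
    return assignments


tracker_nodes = {
    "Task Tracker 1": "slave-1",
    "Task Tracker 2": "slave-2",
    "Task Tracker 3": "slave-3",
}
block_locations = {
    "Block 1": ["slave-1"],
    "Block 2": ["slave-2"],
    "Block 3": ["slave-3"],
    "Block 4": ["slave-3"],
    "Block 5": ["slave-9"],  # stored on a host with no Task Tracker
}

assignments = assign_tasks(block_locations, tracker_nodes)
for block, (tracker, is_local) in assignments.items():
    print(block, "->", tracker, "(local)" if is_local else "(remote read)")
```

Only the last block needs a remote read; every other task reads its data from the Data Node on its own machine.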
11. Hadoop Daemons: NameNode, DataNode, JobTracker, TaskTracker
NameNode
• Maps a block to the DataNodes
• Controls read/write access to files
• Manages the replication engine for blocks
DataNode
• Responsible for serving read and write requests (block creation, deletion, and
replication)
JobTracker
• Accepts Map-Reduce jobs from the clients
• Assigns tasks to the TaskTrackers & monitors their status
TaskTracker
• Worker daemon, runs Map-Reduce tasks
• Sends heartbeats to the JobTracker
• Retrieves job resources from HDFS
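The NameNode's bookkeeping role (mapping blocks to DataNodes and keeping replicas alive) can be illustrated with a toy model. Everything here is an assumption for illustration: the class ToyNameNode, its methods and the round-robin placement are hypothetical, and real HDFS placement is rack-aware and considerably more involved.

```python
class ToyNameNode:
    """Toy block map: block id -> list of DataNodes holding a replica."""

    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        # Never ask for more replicas than there are DataNodes.
        self.replication = min(replication, len(self.datanodes))
        self.block_map = {}

    def add_file(self, name, size, block_size):
        """Split a file into blocks and place each block's replicas round-robin."""
        num_blocks = -(-size // block_size)  # ceiling division
        block_ids = []
        for i in range(num_blocks):
            block_id = f"{name}#blk{i}"
            start = len(self.block_map) % len(self.datanodes)
            self.block_map[block_id] = [
                self.datanodes[(start + k) % len(self.datanodes)]
                for k in range(self.replication)
            ]
            block_ids.append(block_id)
        return block_ids

    def handle_datanode_failure(self, dead):
        """Drop a dead DataNode and re-replicate its blocks where possible."""
        self.datanodes.remove(dead)
        for replicas in self.block_map.values():
            if dead in replicas:
                replicas.remove(dead)
                spare = [d for d in self.datanodes if d not in replicas]
                if spare:
                    replicas.append(spare[0])


namenode = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
blocks = namenode.add_file("marks.csv", size=250, block_size=100)
namenode.handle_datanode_failure("dn1")
for block_id in blocks:
    print(block_id, "->", namenode.block_map[block_id])
```

After the simulated failure each block is back to three replicas on the surviving DataNodes, which is the job the slide describes as managing the replication engine.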
13. Hadoop Services
• HDFS
• MapReduce
• YARN
YARN stands for “Yet Another Resource Negotiator”, a framework that provides
a generic resource management solution for Hadoop clusters.
17.
• Query Language
• Pig Scripting
• Coordination Service
• Columnar Database
• Log Management
• Data Exchange
• Designing Workflow
• Machine Learning
• Messaging System
18. a) Apache website: http://hadoop.apache.org/
b) Learning YARN: https://www.packtpub.com/big-data-and-business-intelligence/learning-yarn
c) Hadoop: The Definitive Guide: http://shop.oreilly.com/product/0636920033448.do