Hadoop Introduction


The document starts with an introduction to Hadoop and covers the Hadoop 1.x / 2.x services (HDFS / MapReduce / YARN). It also explains the architecture of Hadoop and the workings of the Hadoop Distributed File System and the MapReduce programming model.


Hadoop Distributed File System
•  Distributed, scalable and reliable
•  Fault-tolerant storage system
Map-Reduce Programming Model
•  High-performance parallel data processing
•  Employs the divide-and-conquer principle

A class teacher of class 5 needs to find out the name of the student with the highest marks for each subject.
Our Goal: To minimize the total time spent
Input
Total students : 50
Total subjects : 5
Time to process each subject per student : 1 min
Total time spent : 250 mins (50 students x 5 subjects x 1 min, one paper at a time)
Output
Subject 1 : S1-98
Subject 2 : S13-95
Subject 3 : S1-97
Subject 4 : S23-100
Subject 5 : S8-99

a)  HDFS: Distribute the data into blocks across multiple nodes
Distribute the papers across 5 peons. Each peon will have the papers of 10 students for each subject (50 papers each).
b)  Map Phase: Apply business logic on distributed data in parallel
Each peon will provide, from his own 10 students, a list of subjects with the name and marks of the top student in each.
Total time spent: 50 mins (in parallel)
c)  Reduce Phase: Iterate over the map phase output and get the final result
Total records left: 5 candidate students for each of the 5 subjects (25 records). Time to get the student with the highest marks for each subject: 25 mins
Total time spent: 50 + 25 = 75 mins
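
The class-teacher example above can be sketched in plain Python (no Hadoop involved). This is only an illustration under stated assumptions: the marks are random, and the function names (make_papers, split_among_peons, map_phase, reduce_phase, minutes_spent) are hypothetical and not part of any Hadoop API. It reproduces the 250 / 50 / 25 / 75 minute figures from the slides.

```python
import random

NUM_STUDENTS, NUM_SUBJECTS, NUM_PEONS = 50, 5, 5
MINUTES_PER_PAPER = 1  # from the slides: 1 min per subject per student


def make_papers(seed=0):
    """Build 250 (student, subject, marks) records with random marks (assumed data)."""
    rng = random.Random(seed)
    return [
        (f"S{s}", f"Subject {j}", rng.randint(0, 100))
        for s in range(1, NUM_STUDENTS + 1)
        for j in range(1, NUM_SUBJECTS + 1)
    ]


def split_among_peons(papers, peons):
    """The 'HDFS' step: give each peon an equal, contiguous share of the papers."""
    size = len(papers) // peons
    return [papers[i * size:(i + 1) * size] for i in range(peons)]


def map_phase(chunk):
    """One peon finds the top (student, marks) per subject within his own share."""
    best = {}
    for student, subject, marks in chunk:
        if subject not in best or marks > best[subject][1]:
            best[subject] = (student, marks)
    return best


def reduce_phase(partials):
    """Combine the per-peon results into one top (student, marks) per subject."""
    final = {}
    for partial in partials:
        for subject, (student, marks) in partial.items():
            if subject not in final or marks > final[subject][1]:
                final[subject] = (student, marks)
    return final


def minutes_spent(papers, chunks, partials):
    """Return (sequential, map, reduce, total parallel) time in minutes."""
    sequential = len(papers) * MINUTES_PER_PAPER
    map_time = max(len(c) for c in chunks) * MINUTES_PER_PAPER  # peons work in parallel
    reduce_time = sum(len(p) for p in partials) * MINUTES_PER_PAPER
    return sequential, map_time, reduce_time, map_time + reduce_time


if __name__ == "__main__":
    papers = make_papers()
    chunks = split_among_peons(papers, NUM_PEONS)
    partials = [map_phase(c) for c in chunks]
    print(reduce_phase(partials))
    print(minutes_spent(papers, chunks, partials))
```

The reduce step only sees 25 records (5 peons x 5 subjects) instead of all 250 papers, which is where the time saving comes from.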

Social Media Data
Analyzing Web Clickstream Data
Server Log Data
Machine and Sensor Data

HDFS Layer
Stores files across storage nodes in a Hadoop cluster
Consists of :
•  Namenode & Datanodes
Map-Reduce Engine
Processes vast amounts of data in parallel on large clusters in a reliable & fault-tolerant manner
Consists of :
•  Job Tracker & Task Trackers

Storage & Replication of Blocks in HDFS
[Figure: An HDFS Client sends a file write request. The file is divided into blocks (Block 1 to Block 4), which are stored across Datanode_1, Datanode_2 and Datanode_3 under the Namenode.]
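
A rough sketch of the block storage idea in the figure above, written in plain Python. It assumes a 500 MB file, a 128 MB block size, a replication factor of 2 and a simple round-robin placement. All of these are illustrative choices: real HDFS uses a rack-aware placement policy and a default replication factor of 3, and the function names here are hypothetical.

```python
def split_into_blocks(file_size_mb, block_size_mb):
    """Return a list of (block_id, length_mb); the last block may be shorter."""
    blocks = []
    offset, block_id = 0, 1
    while offset < file_size_mb:
        length = min(block_size_mb, file_size_mb - offset)
        blocks.append((block_id, length))
        offset += length
        block_id += 1
    return blocks


def place_replicas(blocks, datanodes, replication):
    """Assign each block to `replication` distinct datanodes, round-robin (assumption)."""
    if replication > len(datanodes):
        raise ValueError("replication factor cannot exceed the number of datanodes")
    placement = {}
    for i, (block_id, _length) in enumerate(blocks):
        placement[block_id] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(file_size_mb=500, block_size_mb=128)
    nodes = ["Datanode_1", "Datanode_2", "Datanode_3"]
    for block_id, where in place_replicas(blocks, nodes, replication=2).items():
        print(f"Block {block_id} -> {where}")
```

With these numbers the file yields four blocks, matching Block 1 to Block 4 in the figure. The mapping returned by place_replicas is the kind of block-to-datanode metadata that the Namenode keeps.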

[Figure: A Map-Reduce job from the client goes to the Job Tracker. Task Tracker_1, Task Tracker_2 and Task Tracker_3 execute the individual Map-Reduce tasks assigned by the Job Tracker, each next to its HDFS blocks (Block 1 to Block 4). Each slave machine runs both a Task Tracker and a Data Node.]
Task Trackers retrieve data from HDFS that is stored on the Data Node of the same system where the Task Tracker is running.
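
The data-locality idea above (run the task where the block already lives) might be sketched as follows. This is a simplified, hypothetical scheduler in plain Python, not the Job Tracker's actual algorithm: it assumes the block locations and the number of free task slots per node are known up front, and it falls back to a non-local node only when no local slot is free.

```python
def assign_tasks(block_locations, free_slots):
    """Map each block to (node, is_local), preferring nodes that store the block.

    block_locations: {block_name: [nodes holding a replica]}
    free_slots:      {node: number of free task slots}  (hypothetical input)
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, holders in block_locations.items():
        candidates = [n for n in holders if slots.get(n, 0) > 0]
        if not candidates:
            # No local slot is free: fall back to any node with capacity.
            candidates = [n for n, s in slots.items() if s > 0]
        if not candidates:
            raise RuntimeError(f"no free task slot for {block}")
        chosen = max(candidates, key=lambda n: slots[n])
        slots[chosen] -= 1
        assignment[block] = (chosen, chosen in holders)
    return assignment


if __name__ == "__main__":
    locations = {
        "Block 1": ["Datanode_1"],
        "Block 2": ["Datanode_2"],
        "Block 3": ["Datanode_3"],
        "Block 4": ["Datanode_3"],
    }
    slots = {"Datanode_1": 2, "Datanode_2": 1, "Datanode_3": 1}
    for block, (node, local) in assign_tasks(locations, slots).items():
        print(block, "->", node, "(local)" if local else "(remote read)")
```

In this run Block 4 ends up on Datanode_1 as a remote read, because Datanode_3 has only one free slot and Block 3 has already taken it.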

Hadoop Daemons
NameNode
•  Maps a block to the Datanodes
•  Controls read/write access to files
•  Manages Replication Engine for Blocks
DataNode
•  Responsible for serving read and write requests (block creation, deletion, and replication)
JobTracker
•  Accepts Map-Reduce tasks from the clients
•  Assigns tasks to the Task Trackers & monitors their status
TaskTracker
•  Worker daemon, runs Map-Reduce tasks
•  Sends heart-beat to Job Tracker
•  Retrieves Job resources from HDFS
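
The heart-beat relationship between Task Trackers and the Job Tracker listed above can be illustrated with a small sketch. The class name JobTrackerSketch, the 30-second timeout and the explicit `now` timestamps are assumptions made for this example; they do not reflect Hadoop's real classes or configuration values.

```python
class JobTrackerSketch:
    """Tracks the last heart-beat per Task Tracker and reports the silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # hypothetical expiry interval
        self.last_seen = {}

    def heartbeat(self, tracker, now):
        """Record that `tracker` reported in at time `now` (seconds)."""
        self.last_seen[tracker] = now

    def lost_trackers(self, now):
        """Return trackers whose last heart-beat is older than the timeout."""
        return sorted(
            tracker
            for tracker, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    jt = JobTrackerSketch(timeout_seconds=30)
    jt.heartbeat("Task Tracker 1", now=0)
    jt.heartbeat("Task Tracker 2", now=0)
    jt.heartbeat("Task Tracker 1", now=25)  # only tracker 1 keeps reporting
    print(jt.lost_trackers(now=40))
```

A tracker reported as lost would have its tasks reassigned to other Task Trackers, which is one way the Job Tracker "monitors their status".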

Hadoop Services
HDFS, MapReduce, YARN
YARN stands for “Yet Another Resource Negotiator”, a framework that provides a generic resource management solution for Hadoop clusters.

Allows easy integration of multiple data processing algorithms with the data stored in HDFS

Query Language
Pig Scripting
Coordination Service
Columnar Database
Log Management
Data Exchange
Designing Workflow
Machine Learning
Messaging System

a)  Apache Website: http://hadoop.apache.org/
b)  Learning YARN: https://www.packtpub.com/big-data-and-business-intelligence/learning-yarn
c)  Hadoop: The Definitive Guide: http://shop.oreilly.com/product/0636920033448.do

Apache Hadoop is a Java software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.
