Hadoop architecture by ajay
http://www.beinghadoop.com

hadoopframework@gmail.com
MASTER NODE, SLAVE NODES.
Master nodes:
NameNode
Secondary NameNode
JobTracker
Slave nodes:
DataNodes, TaskTrackers.
HDFS (Hadoop Distributed File System) distributes data across the
available DataNodes, so the actual data resides in the DataNodes.
Data in the DataNodes is represented as small chunks;
we call these chunks blocks.
The default block size in HDFS is 64 MB.
For example, to store 1 GB (1000 MB) of data in HDFS
across 3 DataNodes: 1000 MB / 64 MB = 15.625,
so 16 blocks are created and distributed across the 3 DataNodes.
The first 15 blocks are 64 MB each; the 16th holds the remaining 40 MB.
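To see how a file was actually split into blocks and where the replicas live, you can run fsck against the file; a quick sketch (the path /user/data/file1.txt is hypothetical):
hadoop fsck /user/data/file1.txt -files -blocks -locations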
These blocks are replicated based on the replication policy.
The data replication policy improves data reliability,
availability, and network bandwidth utilization.

With block replication enabled, the default replication
factor is 3. In the above example, 3*16=48 block replicas are created.
These 48 replicas are distributed across the DataNodes
according to the HDFS replica placement policy:
HDFS stores one replica on a node in the local rack,
the second replica on a node in a different (remote) rack,
and the third on a different node in that same remote rack.
HDFS stores each file as a sequence of blocks.
All blocks of a file except the last block are the same size.
These blocks are replicated for fault tolerance.
A user or an application can specify the replication factor for a file.
The default replication factor for HDFS is specified
in hdfs-site.xml under the hadoop-home/conf directory
(property dfs.replication).

We can manually set the replication factor for a file using
the hadoop fs -setrep command.
Syntax: hadoop fs -setrep [-R] <replication> <path>
-R indicates recursive.
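For example (paths and values illustrative), to set the cluster-wide default and then lower one directory's replication to 2:

In hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

On the command line:
hadoop fs -setrep -R 2 /user/archive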

In the diagram above, data1 consists of 3 blocks (1, 2, 3) and
data2 consists of 2 blocks (4, 5). There are 4 DataNodes.
These 5 blocks * 3 replicas = 15 block replicas, distributed
across the 4 DataNodes.
While blocks are being replicated to DataNodes, general issues arise, such as:
---over-replicated blocks
---under-replicated blocks
---corrupted blocks
---mis-replicated blocks
To balance the blocks across these nodes equally,
we run the start-balancer.sh script under the hadoop-home/bin directory.
We use
hadoop fsck /
hadoop dfsadmin -report
hadoop dfsadmin -metasave metaloginfo.txt
and the DataNode block scanner report
to check block information, storage information,
and the health of the HDFS file system.
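A minimal sketch of running the balancer with an explicit threshold (10 means the balancer tries to bring every DataNode's utilization within 10 percentage points of the cluster average):
start-balancer.sh -threshold 10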

We cannot access the data from a DataNode directly,
because a DataNode contains distributed blocks,
which are like fragmented data or part files.
The DataNodes are responsible for
---serving read and write requests from the clients
---performing block creation, deletion, and replication of blocks based on the replication factor.

On startup, each DataNode in the cluster performs a
handshake procedure with the NameNode.
Each DataNode sends its block report to the NameNode every hour,
so that the NameNode has an up-to-date view of where block replicas are
located in the cluster.

DataNodes also send heartbeats to the NameNode every 3 seconds (by default),
so the NameNode can identify which DataNodes are working properly.
If the NameNode receives no heartbeat from a DataNode for about 10 minutes,
it assumes that particular DataNode has failed,
and it then re-creates that node's replicas on the available DataNodes.
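The heartbeat interval is configurable in hdfs-site.xml; a sketch with the default made explicit (value in seconds):
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>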

The NameNode manages the file system namespace.
It stores metadata for the files whose data is stored in the DataNodes.
The NameNode maintains this metadata in the form of a file system tree.
When a user writes data to DataNodes, the metadata is recorded in the NameNode.
So when a user, or an HDFS client, requests access to the same data,
it contacts the NameNode, which returns a reference telling the
HDFS client in which DataNodes the data is located.
When the NameNode starts up, it reads the
FsImage and EditLog from its local file system,
applies the EditLog transactions to the FsImage, and then
stores a copy of the updated FsImage on the file system as a checkpoint.
When a user initiates a write operation, the change is recorded in the EditLog;
when a checkpoint occurs, the accumulated changes are merged into the FsImage file.
The NameNode stores the namespace as a hierarchical file system:
the NameNode maintains the file system tree, and
any change to the file system's meta information
is recorded by the NameNode.
An application can specify the
number of replicas a file needs,
i.e. the replication factor of the file.
This information is stored in the NameNode.

The NameNode uses a transaction log called the EditLog to record
every change that occurs to the file system metadata,
for example, creating a new file or
changing the replication factor of a file.
The EditLog is stored in the NameNode's local file system.
The entire file system namespace, including the mapping of blocks to
files and file system properties, is stored in a file called FsImage,
also kept in the NameNode's local file system.
When you start the NameNode, it enters Safemode.
During Safemode, replication of data blocks does not occur.
Each DataNode checks in with a Heartbeat and a BlockReport.

The NameNode verifies that each block has an
acceptable number of replicas.
After a configurable percentage of safely replicated blocks check in
with the NameNode, the NameNode exits Safemode.
It then makes a list of blocks that still need to be replicated
and proceeds to replicate those blocks to other
DataNodes.
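You can inspect and control Safemode from the command line; a quick sketch:
hadoop dfsadmin -safemode get     (report whether Safemode is on)
hadoop dfsadmin -safemode enter   (enter Safemode manually)
hadoop dfsadmin -safemode leave   (force the NameNode out of Safemode)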

Client applications submit jobs to the JobTracker.
The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots
at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored: if they do not submit heartbeat
signals often enough, they are treated as if they have failed and the work is
scheduled to a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails.
The JobTracker decides what to do then:
it may resubmit the job elsewhere,
it may mark that specific record as something to avoid,
it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
A TaskTracker is a slave node in the cluster that accepts tasks
from the JobTracker: Map, Reduce, or shuffle operations. The TaskTracker also runs in
its own JVM process.
Every TaskTracker is configured with a set of slots; these indicate the number of
tasks that it can accept. The TaskTracker starts a separate JVM process to do
the actual work (called a Task Instance); this is to ensure that a process failure
does not take down the TaskTracker.
The TaskTracker monitors these task instances, capturing the output and exit
codes. When the task instances finish, successfully or not, the TaskTracker
notifies the JobTracker.
The TaskTrackers also send out heartbeat messages to the JobTracker, usually
every few seconds, to reassure the JobTracker that they are still alive. These
messages also inform the JobTracker of the number of available slots, so the
JobTracker can stay up to date with where in the cluster work can be delegated.
The maximum number of map tasks that can run on a TaskTracker
is controlled by the property
mapred.tasktracker.map.tasks.maximum (default 2).
The maximum number of reduce tasks that can run on a TaskTracker
is controlled by the property
mapred.tasktracker.reduce.tasks.maximum (default 2).
If eight processors are available, we can assign at most seven of them to map
and reduce tasks, leaving one for the DataNode and TaskTracker daemons themselves.
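A sketch of how those slots might be set in mapred-site.xml on an 8-core node (the 4/3 split is illustrative, not a value from the slides):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>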

1. A block is the minimum amount of data that HDFS can read or write.
The default HDFS block size is 64 MB.
We can raise it to 128 MB in hdfs-site.xml with the
dfs.block.size property (the value is given in bytes).
2. The default HDFS I/O buffer size is 4 KB.
Depending on the system configuration, we can increase
this using io.file.buffer.size in core-site.xml.
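A sketch of both settings (values illustrative; 134217728 bytes = 128 MB, 65536 bytes = 64 KB):

In hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>

In core-site.xml:
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
</property>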

Adding a node to an existing cluster
is called commissioning a node;
removing a node is called decommissioning a node.

3. To commission a node:
Add the network addresses of the nodes to be commissioned to the include files
named by dfs.hosts (hdfs-site.xml)
and mapred.hosts (mapred-site.xml).
Update the NameNode using the command
hadoop dfsadmin -refreshNodes
Update the JobTracker using the command
hadoop mradmin -refreshNodes
To decommission a node:
Add the network addresses of the nodes to be decommissioned to the exclude files
named by dfs.hosts.exclude (hdfs-site.xml)
and mapred.hosts.exclude (mapred-site.xml).
Update the NameNode using the command
hadoop dfsadmin -refreshNodes
Update the JobTracker using the command
hadoop mradmin -refreshNodes
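A sketch of the decommission workflow (file path and hostname hypothetical):

In hdfs-site.xml:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/excludes</value>
</property>

In /etc/hadoop/conf/excludes, one hostname per line:
datanode3.example.com

Then refresh:
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes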

4. Non-HDFS storage on a DataNode:
On a 1 TB DataNode disk, if we want to reserve
250 GB for non-HDFS storage,
set dfs.datanode.du.reserved to the reserved space in bytes.
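A sketch in hdfs-site.xml (268435456000 bytes = 250 GB, i.e. 250 * 1024^3):
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>268435456000</value>
</property>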

5. Recycle bin for HDFS:
If trash is enabled, a hidden directory named .Trash is created in each user's home directory.
Command to clear the trash for HDFS:
hadoop fs -expunge
The minimum amount of time that
a deleted file will remain in trash is set with
fs.trash.interval, in minutes (e.g. 600 minutes = 10 hours).
To disable the trash, set
fs.trash.interval=0 in core-site.xml.
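A sketch in core-site.xml (600 minutes is the slide's example value):
<property>
  <name>fs.trash.interval</name>
  <value>600</value>
</property>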

6. DataNode block scanner:
Periodically verifies all the blocks on the DataNode.
The default interval is every 504 hours (three weeks).
In hdfs-site.xml:
dfs.datanode.scan.period.hours=504
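The scanner's report can be read from the DataNode's web server; a sketch (hostname hypothetical):
http://datanode1.example.com:50075/blockScannerReport
(append ?listblocks for per-block detail)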

7. Log files location: by default the log directory is located
under hadoop_install/logs. We can assign a new location for the
log directory in hadoop-env.sh by adding the line
export HADOOP_LOG_DIR=/var/log/hadoop

hadoop-env.sh:
Environment variables that are used to run Hadoop.

core-site.xml: contains configuration settings common to Hadoop, such as I/O settings and
other properties shared by HDFS and MapReduce.
hdfs-site.xml:
Configuration for the HDFS daemons:
NameNode, Secondary NameNode, DataNodes.
mapred-site.xml:
Configuration settings for the MapReduce daemons:
JobTracker, TaskTrackers.

To designate a particular node as the NameNode, specify its URI
in core-site.xml with the property fs.default.name.
The NameNode's hostname or IP address is specified
in core-site.xml.

In pseudo-distributed mode, we configure the NameNode on localhost,
for example:
fs.default.name=hdfs://localhost:9000
(Port 50070 is the NameNode's HTTP web UI, not the file system URI.)
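A sketch of the property (hostname and port illustrative):
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>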
The dfs.name.dir property in hdfs-site.xml gives the locations
where the NameNode stores its file system metadata. We can specify multiple
disk locations, including remote disks, so that the metadata can be recovered
if the NameNode's disk fails. Specify the directory names separated by commas (,).
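A sketch with a local disk plus an NFS mount (paths hypothetical):
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/name,/mnt/nfs/dfs/name</value>
</property>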

The dfs.data.dir property in hdfs-site.xml specifies the locations
where a DataNode stores its blocks. This property contains a list of directories,
typically one per physical disk, to spread I/O across the disks;
the DataNode writes blocks to the directories in round-robin fashion.

The mapred.job.tracker property in mapred-site.xml specifies
the host and port where the JobTracker runs.
In a fully distributed mode we can configure the JobTracker on a separate node too.
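A sketch of both properties (hostnames and paths illustrative):

In hdfs-site.xml:
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/data,/data/2/dfs/data</value>
</property>

In mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:8021</value>
</property>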

mapred.local.dir contains a list of directory names, separated by commas,
where MapReduce stores intermediate data for jobs. The data is cleared when
the job completes.

mapred.system.dir specifies a location where shared files are stored
while a job is running.
mapred.tasktracker.map.tasks.maximum is a property that specifies
the number of map tasks that can run on a TaskTracker at one time.
mapred.tasktracker.reduce.tasks.maximum is a property that specifies
the number of reduce tasks that can run on a TaskTracker at one time.

NameNode's HTTP server address port: 50070.
Open a web browser and type the URI for the NameNode as
http://localhost:50070
http://0.0.0.0:50070
http://127.0.0.1:50070

50030: JobTracker HTTP server address port
50060: TaskTracker HTTP server address port
50075: DataNode HTTP server address port
50090: Secondary NameNode HTTP server address port

These scripts are located under the directory
HADOOP_HOME/bin or
/usr/lib/bin.
HADOOP CONTROL SCRIPTS:
Masters file: a plain-text file that lists the machine(s) that should run the
Secondary NameNode.
Slaves file: a plain-text file that lists the addresses of the DataNode machines.
start-dfs.sh:
Starts a NameNode on the local machine.
Starts a DataNode on each machine listed in the slaves file.
Starts a Secondary NameNode on each machine listed in the masters file.
start-mapred.sh:
Starts a JobTracker on the local machine.
Starts a TaskTracker on each machine listed in the slaves file.
To stop these we have the scripts
stop-dfs.sh
stop-mapred.sh
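A sketch of the two files and a full start sequence (hostnames hypothetical):

conf/masters:
snn.example.com

conf/slaves:
datanode1.example.com
datanode2.example.com
datanode3.example.com

Then, on the NameNode machine:
start-dfs.sh
start-mapred.sh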
DataNode web UI (port 50075):
http://localhost:50075
http://0.0.0.0:50075
http://127.0.0.1:50075

JobTracker web UI (port 50030):
http://localhost:50030
http://0.0.0.0:50030
http://127.0.0.1:50030

Secondary NameNode web UI (port 50090):
http://localhost:50090
http://0.0.0.0:50090
http://127.0.0.1:50090

TaskTracker web UI (port 50060):
http://localhost:50060
http://0.0.0.0:50060
http://127.0.0.1:50060

