Hadoop Operations - Basic
Hafizur Rahman
April 4, 2013
Agenda
● Why Hadoop
● Hadoop Architecture
● Hadoop Installation
● Hadoop Configuration
● Hadoop DFS Command
● What's next
Challenges at Large Scale
● A single node can't handle the load due to limited
resources
○ Processor time, Memory, Hard drive space, Network
bandwidth
○ Individual hard drives can only sustain read speeds of
roughly 60-100 MB/second, so adding cores does not help that much
● Multiple nodes needed, but probability of
failure increases
○ Network failure, Data transfer failure, Node failure
○ Desynchronized clock, Lock
○ Partial failure in distributed atomic transaction
Hadoop Approach (1/4)
● Data Distribution
○ Distributed to all the nodes in the cluster
○ Replicated to several nodes
Hadoop Approach (2/4)
● Move computation to the data
○ Whenever possible, rather than moving data for
processing, computation is moved to the node that
contains the data
○ Most data is read from local disk straight into the
CPU, alleviating strain on network bandwidth and
preventing unnecessary network transfers
○ This data locality results in high performance
Hadoop Approach (3/4)
● MapReduce programming model
○ Tasks run as isolated processes
Hadoop Approach (4/4)
● Isolated execution
○ Communication between nodes is limited and done
implicitly
○ Individual node failures can be worked around by
restarting tasks on other nodes
■ No message exchange needed by user task
■ No roll back to pre-arranged checkpoints to
partially restart the computation
■ Other workers continue to operate as though
nothing went wrong
Hadoop Environment
High-level Hadoop architecture
HDFS (1/2)
● Storage component of Hadoop
● Distributed file system modeled after GFS
● Optimized for high throughput
● Works best when reading and writing large files
(gigabytes and larger)
● To support this throughput, HDFS uses unusually
large (for a filesystem) block sizes and data locality
optimizations to reduce network input/output (I/O)
HDFS (2/2)
● Scalability and availability are also key
traits of HDFS, achieved in part due to data
replication and fault tolerance
● HDFS replicates each file a configurable
number of times, is tolerant of both
software and hardware failure, and
automatically re-replicates data blocks
from failed nodes onto healthy ones
HDFS Architecture
MapReduce (1/2)
● MapReduce is a batch-based, distributed
computing framework modeled after Google's MapReduce paper
● Simplifies parallel processing by abstracting
away the complexities involved in working
with distributed systems
○ computational parallelization
○ work distribution
○ dealing with unreliable hardware and software
MapReduce (2/2)
MapReduce Logical Architecture
● Name Node
● Secondary Name Node
● Data Node
● Job Tracker
● Task Tracker
Hadoop Installation
● Local mode
○ No need to communicate with other nodes, so it
does not use HDFS, nor will it launch any of the
Hadoop daemons
○ Used for developing and debugging the application
logic of a MapReduce program
● Pseudo Distributed Mode
○ All daemons running on a single machine
○ Helps to examine memory usage, HDFS
input/output issues, and other daemon interactions
● Fully Distributed Mode
Hadoop Configuration
File name Description
hadoop-env.sh ● Environment-specific settings go here.
● If a current JDK is not in the system path, you’ll want to come here to configure your
JAVA_HOME
core-site.xml ● Contains system-level Hadoop configuration items
○ HDFS URL
○ Hadoop temporary directory
○ script locations for rack-aware Hadoop clusters
● Override settings in core-default.xml: http://hadoop.apache.org/common/docs/r1.0.0/core-default.html
hdfs-site.xml ● Contains HDFS settings
○ default file replication count
○ block size
○ whether permissions are enforced
● Override settings in hdfs-default.xml: http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html
mapred-site.xml ● Contains MapReduce settings
○ default number of reduce tasks
○ default min/max task memory sizes
○ speculative execution
● Override settings in mapred-default.xml: http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html
Installation
Pseudo Distributed Mode
● Setup public key based login
○ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
○ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
● Update the following configuration
○ hadoop.tmp.dir and fs.default.name at core-site.xml
○ dfs.replication at hdfs-site.xml
○ mapred.job.tracker at mapred-site.xml
● Format NameNode
○ bin/hadoop namenode -format
● Start all daemons
○ bin/start-all.sh
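As a concrete sketch, a minimal pseudo-distributed configuration could look like the following (the host/port values and the temporary directory are illustrative assumptions, not required values):
core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
After formatting and starting, the jps command should list all five daemons: NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker.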
Hands On
● HDFS Commands
○ http://hadoop.apache.org/docs/r0.18.1/hdfs_shell.html
● Execute example
○ Wordcount
● Web Interface
○ NameNode daemon: http://localhost:50070/
○ JobTracker daemon: http://localhost:50030/
○ TaskTracker daemon: http://localhost:50060/
● Hadoop Job Command
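A hands-on sketch of the wordcount run (the examples jar name varies by release, so hadoop-examples-1.0.0.jar is an assumption, and output file names like part-r-00000 depend on the job):
> bin/hadoop fs -mkdir /user/foo/input
> bin/hadoop fs -put *.txt /user/foo/input
> bin/hadoop jar hadoop-examples-1.0.0.jar wordcount /user/foo/input /user/foo/output
> bin/hadoop fs -cat /user/foo/output/part-r-00000
> bin/hadoop job -list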
Hadoop FileSystem
File System | URI Scheme | Java Impl. (all under org.apache.hadoop) | Description
Local | file | fs.LocalFileSystem | Filesystem for a locally connected disk with client-side checksums
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop’s distributed filesystem
WebHDFS | webhdfs | hdfs.web.WebHdfsFileSystem | Filesystem providing secure read-write access to HDFS over HTTP
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | Filesystem backed by Amazon S3
S3 (block based) | s3 | fs.s3.S3FileSystem | Filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3’s 5 GB file size limit
GlusterFS | glusterfs | fs.glusterfs.GlusterFileSystem | Still in beta: https://github.com/gluster/glusterfs/tree/master/glusterfs-hadoop
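Because the URI scheme selects the filesystem implementation, the same DFS commands work against any of these stores; for example (bucket and host names below are placeholders, and the s3n example assumes AWS credentials have been configured):
> bin/hadoop fs -ls file:///tmp
> bin/hadoop fs -ls hdfs://localhost:9000/user/foo
> bin/hadoop fs -ls s3n://BUCKET/logs/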
Installation
Fully Distributed Mode
Three different kinds of hosts:
● master
○ master node of the cluster
○ hosts NameNode and JobTracker daemons
● backup
○ hosts Secondary NameNode daemon
● slave1, slave2, ...
○ slave boxes running both DataNode and TaskTracker
daemons
Hadoop Configuration
File Name Description
masters ● Name is misleading and should have been called secondary-masters
● When you start Hadoop it will launch NameNode and JobTracker on the local
host from which you issued the start command and then SSH to all the nodes
in this file to launch the SecondaryNameNode.
slaves ● Contains a list of hosts that are Hadoop slaves
● When you start Hadoop it will SSH to each host in this file and launch the
DataNode and TaskTracker daemons
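For the host roles above, the two files might contain (host names are illustrative):
conf/masters:
backup
conf/slaves:
slave1
slave2
start-all.sh then launches the NameNode and JobTracker on the local master host, SSHes to backup to start the SecondaryNameNode, and SSHes to each slave to start the DataNode and TaskTracker daemons.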
Recipes
● S3 Configuration
● Using multiple disks/volumes and limiting
HDFS disk usage
● Setting HDFS block size
● Setting the file replication factor
Recipes:
S3 Configuration
● Config file: conf/hadoop-site.xml
● To access S3 data using DFS commands
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>SECRET</value>
</property>
● To use S3 as a replacement for HDFS
<property>
<name>fs.default.name</name>
<value>s3://BUCKET</value>
</property>
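With the credentials configured, S3 paths can be used directly from the DFS shell; for example (bucket and paths are placeholders):
> bin/hadoop fs -ls s3://BUCKET/
> bin/hadoop fs -cp s3://BUCKET/input/data.in /user/foo/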
Recipes:
Disk Configuration
● Config file: $HADOOP_HOME/conf/hdfs-site.xml
● For multiple locations:
<property>
<name>dfs.data.dir</name>
<value>/u1/hadoop/data,/u2/hadoop/data</value>
</property>
● To limit HDFS disk usage, specify the space reserved
for non-DFS use (in bytes per volume)
<property>
<name>dfs.datanode.du.reserved</name>
<value>6000000000</value>
</property>
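After restarting the DataNode, the effect can be verified with the dfsadmin report, which shows configured capacity, DFS used, and non-DFS used per node:
> bin/hadoop dfsadmin -report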
Recipes:
HDFS Block Size (1/3)
● HDFS stores files across the cluster by
breaking them into coarse-grained,
fixed-size blocks
● Default HDFS block size is 64 MB
● Affects performance of
○ filesystem operations where larger block sizes
would be more effective, if you are storing and
processing very large files
○ MapReduce computations, as the default behavior
of Hadoop is to create one map task for each data
block of the input files
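For example, a 1 GB (1024 MB) input file stored with the default 64 MB block size occupies 16 blocks, so a MapReduce job over it launches 16 map tasks by default; stored with 128 MB blocks, the same file needs only 8.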
Recipes:
HDFS Block Size (2/3)
● Option 1: NameNode configuration
○ Add/modify dfs.block.size parameter at conf/hdfs-site.xml
○ Block size is specified in bytes
○ Only the files copied after the change will have the
new block size
○ Existing files in HDFS will not be affected
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
Recipes:
HDFS Block Size (3/3)
● Option 2: During file upload
○ Applies only to the files being uploaded
> bin/hadoop fs -D dfs.block.size=134217728 -put data.in /user/foo
● Use fsck command
> bin/hadoop fsck /user/foo/data.in -blocks -files -locations
/user/foo/data.in 215227246 bytes, 2 block(s): ....
0. blk_6981535920477261584_1059 len=134217728 repl=1 [hostname:50010]
1. blk_-8238102374790373371_1059 len=81009518 repl=1 [hostname:50010]
Recipes:
File Replication Factor (1/3)
● Replication done for fault tolerance
○ Pros: Improves data locality and data access
bandwidth
○ Cons: Needs more storage
● HDFS replication factor is a file-level
property that can be set on a per-file basis
Recipes:
File Replication Factor (2/3)
● Set default replication factor
○ Add/Modify dfs.replication property in conf/hdfs-site.xml
○ Old files will be unaffected
○ Only the files copied after the change will have the
new replication factor
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
Recipes:
File Replication Factor (3/3)
● Set replication factor during file upload
> bin/hadoop fs -D dfs.replication=1 -copyFromLocal non-critical-file.txt /user/foo
● Change the replication factor of files or file
paths that are already in HDFS
○ Use the setrep command
○ Syntax: hadoop fs -setrep [-R] <rep> <path>
> bin/hadoop fs -setrep 2 non-critical-file.txt
Replication 2 set: hdfs://myhost:9000/user/foo/non-critical-file.txt
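The current replication factor of a file shows up as the second column of a listing (the output below is illustrative):
> bin/hadoop fs -ls /user/foo
-rw-r--r-- 2 foo supergroup 1048576 2013-04-04 10:00 /user/foo/non-critical-file.txt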
Recipes:
Merging files in HDFS
● Use HDFS getmerge command
● Syntax:
hadoop fs -getmerge <src> <localdst> [addnl]
● Copies files in a given path in HDFS to a
single concatenated file in the local
filesystem
> bin/hadoop fs -getmerge /user/foo/demofiles merged.txt
Hadoop Operations - Advanced
Example:
Advanced Operations
● HDFS
○ Adding new data node
○ Decommissioning data node
○ Checking FileSystem Integrity with fsck
○ Balancing HDFS Block Data
○ Dealing with a Failed Disk
● MapReduce
○ Adding a Tasktracker
○ Decommissioning a Tasktracker
○ Killing a MapReduce Job
○ Killing a MapReduce Task
○ Dealing with a Blacklisted Tasktracker
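Several of these map to single Hadoop 1.x commands; the following is a sketch rather than a complete procedure (decommissioning, for instance, also requires listing hosts in an excludes file referenced by dfs.hosts.exclude):
> bin/hadoop fsck / -blocks -files
> bin/hadoop balancer
> bin/hadoop dfsadmin -refreshNodes
> bin/hadoop job -kill <job-id>
> bin/hadoop job -kill-task <task-attempt-id>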
Links
● http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
● http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
● http://developer.yahoo.com/hadoop/tutorial/
● http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Q/A