lug.getFamiliarWithHadoop();
PRESENTED BY A.R.MOHAMMADI
AMIRMHD.IR
INTRODUCTION
• Big data is a term used to describe the voluminous amount of unstructured and semi-structured data a
company creates.
• Data that would take too much time and cost too much money to load into a relational database for
analysis.
• Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and
exabytes of data.
DATA GENERATION
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per
month.
• The Large Hadron Collider near Geneva, Switzerland, produces about 15 petabytes of data per year.
WHAT CAUSED THE PROBLEM
Year | Standard hard drive size (MB) | Data transfer rate (MB/s)
1990 | 1,370 | 4.4
2010 | 1,000,000 | 100
Drive capacity grew roughly a thousand-fold over those two decades, while the rate at which data can be read off a single disk improved only about twenty-fold.
SO WHAT IS THE PROBLEM?
• The transfer speed is around 100 MB/s
• A standard disk is 1 Terabyte
• Time to read an entire disk = 10,000 seconds, or almost 3 hours!
• Increasing the processing rate alone may not help, because:
• Network bandwidth is now more of a limiting factor
• Physical limits of processor chips have been reached
SO WHAT DO WE DO?
• The obvious solution is to use multiple
processors to solve the same problem by
fragmenting it into pieces.
• Imagine if we had 100 drives, each holding one
hundredth of the data. Working in parallel, we could
read the data in under two minutes.
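The arithmetic: 1 TB split across 100 drives is 10 GB per drive; at roughly 100 MB/s each, the drives can be read in parallel in about 100 seconds.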
DISTRIBUTED COMPUTING VS PARALLELIZATION
• Parallelization: multiple processors or CPUs in a single machine
• Distributed computing: multiple computers connected via a network
DISTRIBUTED COMPUTING
The key issues involved in this Solution:
• Hardware failure
• Combine the data after analysis
• Network Associated Problems
WHAT CAN WE DO WITH A DISTRIBUTED COMPUTER
SYSTEM?
• IBM Deep Blue
• Index the Web (Google)
• Simulating an internet size network for network experiments
• Analysing Complex Networks
• ...
PROBLEMS IN DISTRIBUTED COMPUTING
• Hardware Failure:
As soon as we start using many pieces of hardware, the chance
that one will fail is fairly high.
• Combine the data after analysis:
Most analysis tasks need to be able to combine the data in some
way; data read from one disk may need to be combined with the
data from any of the other 99 disks.
HADOOP
• Apache Hadoop is an open-source software framework that supports data-intensive distributed
applications, licensed under the Apache v2 license.
• A common way of avoiding data loss is through replication: redundant copies of the data are kept by the
system so that in the event of failure, there is another copy available. The Hadoop Distributed Filesystem
(HDFS), takes care of this problem.
• The second problem is solved by a simple programming model, MapReduce. Hadoop is the popular open
source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of
very large data sets.
DEVELOPER
Doug Cutting
2005: Doug Cutting and Michael J. Cafarella developed
Hadoop to support distribution for the Nutch search
engine project.
The project was funded by Yahoo.
2006: Yahoo gave the project to Apache
Software Foundation.
WHAT ELSE IS HADOOP?
A reliable shared storage and analysis system.
There are other subprojects of Hadoop that provide complementary services, or build on the core to add
higher-level abstractions. The various subprojects of Hadoop include:
1. Core
2. Avro
3. Pig
4. HBase
5. Zookeeper
6. Hive
7. Chukwa
HADOOP APPROACH TO DISTRIBUTED COMPUTING
The theoretical 1,000-CPU machine would cost a very large amount of money, far more than 1,000 single-CPU machines. Hadoop ties these smaller and more reasonably priced machines together into a single cost-effective compute cluster.
2008 - Hadoop Wins Terabyte Sort Benchmark
(sorted 1 terabyte of data in 209 seconds, compared
to previous record of 297 seconds)
MAP-REDUCE
• Hadoop limits the amount of communication that can be performed between processes, as each
individual record is processed by a task in isolation from the others
• By restricting the communication between nodes, Hadoop makes the distributed system much more
reliable. Individual node failures can be worked around by restarting tasks on other machines.
• The other workers continue to operate as though nothing went wrong, leaving the challenging aspects of
partially restarting the program to the underlying Hadoop layer.
Map: (in_key, in_value) -> list(out_key, intermediate_value)
Reduce: (out_key, list of intermediate_value) -> list(out_value)
WHAT IS MAPREDUCE?
• MapReduce is a programming model
• Programs written in this functional style are automatically parallelized and executed on a large
cluster of commodity machines
• MapReduce is an associated implementation for processing and generating large data sets.
MapReduce consists of two phases:
MAP
A map function that processes an input key/value pair to generate a set of intermediate key/value pairs:
map(in_key, in_value) -> list(out_key, intermediate_value)
REDUCE
A reduce function that merges all intermediate values associated with the same intermediate key:
reduce(out_key, intermediate_value list) -> out_value list
EXAMPLE
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
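To see the same logic end to end, here is a minimal Python sketch (not from the original slides) that simulates the word count outside Hadoop: a map phase emits (word, 1) pairs, an in-memory shuffle groups them by key, and a reduce phase sums the counts. The function names and sample documents are illustrative only.

from collections import defaultdict

def map_phase(doc_name, contents):
    # map(key = document name, value = document contents): emit (word, 1) per word
    for word in contents.split():
        yield (word, 1)

def shuffle(pairs):
    # group intermediate values by key, as the MapReduce framework would
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    # reduce(key = a word, values = a list of counts): sum the counts
    return word, sum(counts)

if __name__ == "__main__":
    docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
    intermediate = [p for name, text in docs.items() for p in map_phase(name, text)]
    print(sorted(reduce_phase(w, c) for w, c in shuffle(intermediate).items()))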
DATA LOCALITY OPTIMIZATION
• The compute nodes and the storage nodes are the same: the MapReduce framework and the
Distributed File System run on the same set of nodes. This configuration allows the framework to
effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
• If this is not possible, the computation is scheduled on another node in the same rack.
MAPREDUCE DATA FLOW WITH A SINGLE REDUCE TASK
MAPREDUCE DATA FLOW WITH MULTIPLE REDUCE TASKS
HADOOP STREAMING:
• Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java.
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and
your program, so you can use any language that can read standard input and write to
standard output to write your MapReduce program.
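As an illustration (not from the original slides), a Streaming word count can be written as two small Python scripts; the file names mapper.py and reducer.py are just conventions, and they would be passed to the hadoop-streaming jar with the -mapper, -reducer, -input and -output options.

#!/usr/bin/env python
# mapper.py: read lines from standard input and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: read "word<TAB>count" lines (already sorted by word by the framework) and sum per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))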
HADOOP PIPES:
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
• Unlike Streaming, which uses standard input and output to communicate with the map and
reduce code, Pipes uses sockets as the channel over which the tasktracker communicates
with the process running the C++ map or reduce function.
HADOOP DISTRIBUTED FILESYSTEM (HDFS)
• Filesystems that manage the storage across a network of machines are called distributed
filesystems.
• HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very
large amounts of data (terabytes or even petabytes), and provide high-throughput access to this
information.
PROBLEMS IN DISTRIBUTED FILE SYSTEMS
Building a distributed filesystem is more complex than building a regular disk filesystem, because the data
is spread over multiple nodes and all the complications of network programming kick in.
 Hardware Failure
An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact
that there are a huge number of components and that each component has a non-trivial probability of failure means that some
component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
 Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to
support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should
support tens of millions of files in a single instance.
DESIGN OF HDFS
• Very large files
Files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running
today that store petabytes of data.
• Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-
times pattern.
A dataset is typically generated or copied from source, then various analyses are performed on that
dataset over time. Each analysis will involve a large proportion of the dataset, so the time to read the
whole dataset is more important than the latency in reading the first record.
NAMENODES AND DATANODES
• An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the master)
and a number of datanodes (workers).
• The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata
for all the files and directories in the tree.
• Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to
(by clients or the namenode), and they report back to the namenode periodically with lists of blocks
that they are storing.
Without the namenode, the filesystem cannot be used. In fact, if the
machine running the namenode were obliterated, all the files on the
filesystem would be lost since there would be no way of knowing
how to reconstruct the files from the blocks on the datanodes.
DATA REPLICATION
• The blocks of a file are replicated for fault tolerance.
• The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat
and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the
DataNode is functioning properly.
• A Blockreport contains a list of all blocks on a DataNode.
• When the replication factor is three, HDFS’s placement policy is to put one replica on one node in the
local rack, another on a different node in the local rack, and the last on a different node in a different
rack.
BIBLIOGRAPHY
1. Hadoop- The Definitive Guide, O’Reilly 2009, Yahoo! Press
2. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
3. Ranking and Semi-supervised Classification on Large Scale Graphs Using Map-Reduce, Delip Rao, David
Yarowsky, Dept. of Computer Science, Johns Hopkins University
4. Improving MapReduce Performance in Heterogeneous Environments, Matei Zaharia, Andy Konwinski,
Anthony D. Joseph, Randy Katz, Ion Stoica, University of California, Berkeley
5. MapReduce in a Week By Hannah Tang, Albert Wong, Aaron Kimball, Winter 2007
INSTALLING JAVA
 Update the source list:
 sudo apt-get update
 The OpenJDK project provides the default version of Java in the supported Ubuntu repositories:
 sudo apt-get install default-jdk
 Or install the OpenJDK 8 JDK explicitly:
 sudo apt-get install openjdk-8-jdk
 After installation, make a quick check whether the JDK is correctly set up:
 java -version
ADDING A DEDICATED HADOOP SYSTEM USER
 sudo addgroup hadoop
 sudo adduser --ingroup hadoop hduser
CONFIGURING SSH
 user@ubuntu:~$ su - hduser
 hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
[...snipp...]
 user@ubuntu:~$ sudo usermod -aG sudo hduser (run from an account that already has sudo rights)
ENABLE SSH ACCESS TO YOUR LOCAL MACHINE
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
TEST
 hduser@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
 hduser@ubuntu:~$
DISABLING IPV6
To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of
your choice and add the following lines to the end of the file:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
 You can check whether IPv6 is enabled on your machine with the following command:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6
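A return value of 0 means IPv6 is still enabled; after adding the lines above and rebooting (or running sudo sysctl -p), the command should return 1.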
INSTALLING HADOOP
cd /usr/local
sudo tar xzf hadoop-*.tar.gz
sudo mv hadoop-* hadoop
sudo chown -R hduser:hadoop hadoop
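(This assumes a Hadoop release tarball, e.g. hadoop-2.x.y.tar.gz, has already been downloaded into /usr/local.)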
SETUP CONFIGURATION FILES
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
UPDATE $HOME/.BASHRC
 Append the following to the end of ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # path to your Java installation
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
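After saving, reload the profile (for example with source ~/.bashrc, or by opening a new terminal) so the new variables take effect.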
SET JAVA_HOME BY MODIFYING HADOOP-ENV.SH FILE
 vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
 export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
CORE-SITE.XML
 sudo mkdir -p /app/hadoop/tmp
 sudo chown hduser:hadoop /app/hadoop/tmp
 vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
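Note: in Hadoop 2.x and later the preferred name for this property is fs.defaultFS; fs.default.name still works but is reported as deprecated.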
MAPRED-SITE.XML
 cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml
 vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
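Note: mapred.job.tracker configures the classic MRv1 JobTracker; on a YARN-based (Hadoop 2.x) cluster you would instead set mapreduce.framework.name to yarn in this file.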
CONF/HDFS-SITE.XML
 vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
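A replication factor of 1 is used here because this is a single-node cluster; on a multi-node cluster the default of 3 is normally kept.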
FORMATTING THE HDFS FILESYSTEM VIA THE NAMENODE
 /usr/local/hadoop/bin/hadoop namenode -format
10/05/08 16:namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri
Feb 19 08:07:34 UTC 2010
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
STARTING YOUR SINGLE-NODE CLUSTER
 /usr/local/hadoop/bin/start-all.sh
 jps
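If everything started correctly, jps should list the Hadoop daemons alongside Jps itself: NameNode, DataNode and SecondaryNameNode, plus (depending on the Hadoop version) either JobTracker and TaskTracker or ResourceManager and NodeManager.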
COPY LOCAL EXAMPLE DATA TO HDFS
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg
/user/hduser/gutenberg
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
Found 1 items
drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
Found 3 items
-rw-r--r-- 3 hduser supergroup 674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
-rw-r--r-- 3 hduser supergroup 1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
-rw-r--r-- 3 hduser supergroup 1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
• hduser@ubuntu:/usr/local/hadoop$
RUN THE MAPREDUCE JOB
• hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount
/user/hduser/gutenberg /user/hduser/gutenberg-output
CHECK IF THE RESULT IS SUCCESSFULLY STORED IN
HDFS DIRECTORY
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
• Found 2 items
• drwxr-xr-x - hduser supergroup 0 2010-05-08 17:40 /user/hduser/gutenberg
• drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
• Found 2 items
• drwxr-xr-x - hduser supergroup 0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
• -rw-r--r-- 1 hduser supergroup 880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
• hduser@ubuntu:/usr/local/hadoop$
RETRIEVE THE JOB RESULT FROM HDFS
• hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
COPY THE RESULTS TO THE LOCAL FILE SYSTEM
 hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
 hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
 hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
• hduser@ubuntu:/usr/local/hadoop$
HADOOP WEB INTERFACES
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at
these locations:
 http://localhost:50070/ – web UI of the NameNode daemon
 http://localhost:50030/ – web UI of the JobTracker daemon
 http://localhost:50060/ – web UI of the TaskTracker daemon
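(These are the default ports of the classic MRv1 daemons; YARN-based releases replace the JobTracker and TaskTracker UIs with the ResourceManager web UI, by default on port 8088.)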