Hadoop


    Rahul Jain
    Software Engineer
    http://www.linkedin.com/in/rahuldausa



Agenda
• Hadoop
   –   Introduction
   –   Hadoop (Why)
   –   Hadoop History
   –   Uses of Hadoop
   –   High Level Architecture
   –   Map-Reduce
• HDFS
   – GFS (Google File System)
   – HDFS Architecture
• Installation/Configuration
• Examples

Introduction
 An open-source software framework.
 Supports data-intensive distributed applications.
 Enables applications to work with thousands of independent computers
  and petabytes of data.
 Derived from Google’s Map-Reduce and Google File System papers.
 Written in the Java programming language.
 Started by Doug Cutting, who named it after his son’s toy elephant, to
  support distribution for Nutch (a sub-project of Lucene).




Hadoop (Why)
• Need to process huge datasets on a large number of computers.
• It is expensive to build reliability into each application.
• Nodes fail every day
        - Failure is expected, rather than exceptional.
• Need common infrastructure
   - Efficient, reliable, easy to use.
   - Open source, Apache License.




Hadoop History
•   Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)
•   July 2005 – Nutch uses Map-Reduce
•   Jan 2006 – Doug Cutting joins Yahoo!
•   Feb 2006 – Becomes a Lucene subproject
•   Apr 2007 – Yahoo! runs it on a 1000-node cluster
•   Jan 2008 – Becomes an Apache top-level project
•   Feb 2008 – Yahoo! production search index runs on Hadoop

What is Hadoop Used For?
•   Searching (Yahoo)
•   Log Processing
•   Recommendation Systems (Facebook, LinkedIn, eBay, Amazon)
•   Analytics (Facebook, LinkedIn)
•   Video and Image Analysis (NASA)
•   Data Retention



Hadoop High Level Architecture




Map-Reduce
A framework for processing parallelizable problems across huge datasets
using a large number of computers (nodes), collectively referred to as a:

Cluster – if all nodes are on the same local network and use similar hardware,
or a
Grid – if the nodes are shared across geographically distributed sites and
use more heterogeneous hardware.

It consists of two steps:
1. Map step – The master node takes the input, divides it into smaller
sub-problems, and distributes them to worker nodes. A worker node may do
this again in turn, leading to a multi-level tree structure. Each worker
node processes its smaller problem and passes the answer back to its
master node.
2. Reduce step – The master node then collects the answers to all the
sub-problems and combines them in some way to form the output – the answer
to the problem it was originally trying to solve.

(Figure: multiple Map-Reduce phases)
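The two steps above can be sketched in plain Java, with no Hadoop dependency. This is a conceptual word-count sketch: the class and method names are illustrative, not Hadoop's actual Mapper/Reducer API, and the "distribution" to workers is simulated by looping over input splits.

```java
import java.util.*;
import java.util.stream.*;

// Conceptual sketch of the Map and Reduce steps (word count).
public class WordCountSketch {

    // Map step: each "worker" turns its input split into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        return Arrays.stream(split.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce step: the "master" groups the pairs by key and sums the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : pairs)
            out.merge(e.getKey(), e.getValue(), Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        // Two input "splits", as if distributed to two worker nodes.
        List<String> splits = List.of("the quick brown fox", "the lazy dog");
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String s : splits) mapped.addAll(map(s)); // collect workers' answers
        Map<String, Integer> counts = reduce(mapped);
        System.out.println(counts.get("the")); // 2
    }
}
```

In real Hadoop the framework, not your code, handles the split distribution, shuffling of pairs to reducers, and fault tolerance; only the map and reduce logic is yours.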
Map-Reduce Life-Cycle




Credit : http://code.google.com/edu/parallel/mapreduce-tutorial.html
HDFS
Hadoop Distributed File System




Let’s understand GFS first …
               Google File System




GFS Architecture




Goals of HDFS
1. Very large distributed file system
      - 10K nodes, 100 million files, 10 PB
2. Assumes commodity hardware
      - Files are replicated to handle hardware failure
      - Detects failures and recovers from them
3. Optimized for batch processing
      - Data locations are exposed so that computation can move to
      where the data resides.
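Replication, for instance, is set per cluster in conf/hdfs-site.xml. An illustrative fragment (3 is HDFS's default replication factor; the single-node setup later in this deck uses 1):

```xml
<!-- hdfs-site.xml: keep 3 copies of each block so the loss of a
     commodity node does not lose data -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```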


Installation / Configuration

[rjain@10.1.110.12 hadoop-1.0.3]$ pwd
/home/rjain/hadoop-1.0.3

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/rjain/rahul/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/rjain/rahul/hdfs/name</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ vi conf/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-all.sh    (runs both of the following)
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-dfs.sh
[rjain@10.1.110.12 hadoop-1.0.3]$ bin/start-mapred.sh

[rjain@10.1.110.12 hadoop-1.0.3]$ jps
29756 SecondaryNameNode
29483 NameNode
29619 DataNode
19711 JobTracker
19847 TaskTracker
18756 Jps

[rjain@10.1.110.12 hadoop-1.0.3]$ bin/hadoop fs
Usage: java FsShell
      [-ls <path>]
      [-lsr <path>] : Recursive version of ls. Similar to Unix ls -R.
      [-du <path>] : Displays aggregate length of files contained in the directory, or the length of a file.
      [-dus <path>] : Displays a summary of file lengths.
      [-count [-q] <path>]
      [-mv <src> <dst>]
      [-cp <src> <dst>]
      [-rm [-skipTrash] <path>]
      [-rmr [-skipTrash] <path>] : Recursive version of delete (rm).
      [-expunge] : Empty the Trash.
      [-put <localsrc> ... <dst>] : Copy a single src, or multiple srcs, from the local file system to the destination filesystem.
      [-copyFromLocal <localsrc> ... <dst>]
      [-moveFromLocal <localsrc> ... <dst>]
      [-get [-ignoreCrc] [-crc] <src> <localdst>]
      [-getmerge <src> <localdst> [addnl]]
      [-cat <src>]
      [-text <src>] : Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
      [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
      [-moveToLocal [-crc] <src> <localdst>]
      [-mkdir <path>]
      [-setrep [-R] [-w] <rep> <path/file>] : Changes the replication factor of a file.
      [-touchz <path>] : Create a file of zero length.
      [-test -[ezd] <path>] : -e checks whether the file exists (returns 0 if true); -z checks whether the file is zero length; -d checks whether the path is a directory.
      [-stat [format] <path>] : Returns stat information on the path, such as the creation time of a directory.
      [-tail [-f] <file>] : Displays the last kilobyte of the file to stdout.
      [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
      [-chown [-R] [OWNER][:[GROUP]] PATH...]
      [-chgrp [-R] GROUP PATH...]
      [-help [cmd]]
HDFS- Read/Write Example
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

Given input/output file names as strings, we construct inFile/outFile Path objects.
Most of the FileSystem APIs accept Path objects.

Path inFile = new Path(argv[0]);
Path outFile = new Path(argv[1]);

Validate the input/output paths before reading/writing.

if (!fs.exists(inFile))
  printAndExit("Input file not found");
if (!fs.isFile(inFile))
  printAndExit("Input should be a file");
if (fs.exists(outFile))
  printAndExit("Output already exists");

Open inFile for reading.

FSDataInputStream in = fs.open(inFile);

Open outFile for writing.

FSDataOutputStream out = fs.create(outFile);

Read from the input stream and write to the output stream until EOF.

byte[] buffer = new byte[4096];
int bytesRead;
while ((bytesRead = in.read(buffer)) > 0) {
  out.write(buffer, 0, bytesRead);
}

Close the streams when done.

in.close();
out.close();
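The copy loop above is plain Java stream I/O: FSDataInputStream and FSDataOutputStream extend the standard InputStream/OutputStream classes, so the same pattern can be tried locally without a cluster. A minimal, self-contained sketch (class and method names here are illustrative):

```java
import java.io.*;

// The HDFS example's copy loop, demonstrated with plain java.io streams.
public class StreamCopy {

    // Copies everything from in to out; returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int bytesRead;
        // read() returns -1 at EOF, which ends the loop.
        while ((bytesRead = in.read(buffer)) > 0) {
            out.write(buffer, 0, bytesRead);
            total += bytesRead;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello hdfs".getBytes();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(data), out);
        System.out.println(out.toString()); // hello hdfs
    }
}
```

On a real cluster you would obtain the streams from fs.open()/fs.create() as shown above; the loop itself is unchanged.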
Hadoop Sub-Projects
•   Hadoop Common: The common utilities that support the other Hadoop subprojects.
•   Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput
    access to application data.
•   Hadoop MapReduce: A software framework for distributed processing of large data sets on
    compute clusters.

Other Hadoop-related projects at Apache include:

•   Avro™: A data serialization system.
•   Cassandra™: A scalable multi-master database with no single points of failure.
•   Chukwa™: A data collection system for managing large distributed systems.
•   HBase™: A scalable, distributed database that supports structured data storage for large tables.
•   Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•   Mahout™: A scalable machine learning and data mining library.
•   Pig™: A high-level data-flow language and execution framework for parallel computation.
•   ZooKeeper™: A high-performance coordination service for distributed applications.




Questions ?




