EWT Portal Practice Team 2013

Hadoop Cluster Configuration

Table of contents
1. Introduction
2. Prerequisite Software
3. Cluster Configuration on Server1 and Server2
4. Hadoop
5. Flume
6. Hive
7. HBase
8. Example Applications and Organizations using Hadoop
9. References

1. Introduction: Hadoop - A Software Framework for Data Intensive Computing Applications.
a. Hadoop?
A software platform that lets one easily write and run applications that process vast amounts of data. It includes:
– HDFS – Hadoop Distributed File System
– HBase (pre-alpha) – online data access
– MapReduce – offline computing engine
• Yahoo! is the biggest contributor.
Here's what makes it especially useful:
• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing across clusters of commonly available computers (in the thousands).
• Efficient: By distributing the data, it can process it in parallel on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

b. What does it do?
• Hadoop implements Google's MapReduce, using HDFS.
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop's target is to run on clusters on the order of 10,000 nodes.
• Written in Java, but it works with other languages as well.
• Runs on Linux, Windows and more.

c. HDFS?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run
on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS:
– is highly fault-tolerant and is designed to be deployed on low-cost hardware.
– provides high-throughput access to application data and is suitable for applications that have large data sets.
– relaxes a few POSIX requirements to enable streaming access to file system data.

d. MapReduce?
• Programming model developed at Google.
• Sort/merge based distributed computing.
• Initially intended for Google's internal search/indexing application, but now used extensively by many other organizations (e.g., Yahoo, Amazon.com, IBM, etc.).
• It is a functional-style programming model (e.g., LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
• The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)

e. How does MapReduce work?
• The runtime partitions the input and provides it to different Map instances.
• Map (key, value) → (key', value')
• The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key'.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions; a minimal word-count sketch using Hadoop Streaming follows this list.
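A minimal word-count sketch using Hadoop Streaming, which lets Map and Reduce be ordinary shell commands. The input and output paths and the exact location of the streaming jar are assumptions and may differ on your installation:

# put some sample text into HDFS (paths are examples)
/home/hduser/hadoop/bin/hadoop fs -mkdir /user/hduser/wc-in
/home/hduser/hadoop/bin/hadoop fs -put /var/log/anaconda.log /user/hduser/wc-in
# mapper: emit one word per line (each word becomes a key); the framework sorts by key
# reducer: count runs of identical, already-sorted keys
/home/hduser/hadoop/bin/hadoop jar /home/hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.2.jar \
  -input /user/hduser/wc-in \
  -output /user/hduser/wc-out \
  -mapper 'tr -s "[:space:]" "\n"' \
  -reducer 'uniq -c'
# inspect the result
/home/hduser/hadoop/bin/hadoop fs -cat /user/hduser/wc-out/part-*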

f. Flume?
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows. It is robust and fault tolerant with tunable
reliability mechanisms and many failover and recovery mechanisms. The system is
centrally managed and allows for intelligent dynamic management. It uses a simple
extensible data model that allows for online analytic applications.
Flume Architecture (diagram)

g. Hive?

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
Hive provides a mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL. At the same time this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
• Hive – SQL on top of Hadoop.
• Rich data types (structs, lists and maps).
• Efficient implementations of SQL filters, joins and group-bys on top of MapReduce.
• Allows users to access Hive data without using Hive.
• Hive optimizations: efficient execution of SQL on top of MapReduce. A short HiveQL example follows.
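As a quick illustration of HiveQL once Hive is configured (see section 6), a query can be run straight from the shell. The table and column names below are made up for the example:

/home/hduser/hive/bin/hive -e "CREATE TABLE IF NOT EXISTS pageviews (url STRING, hits INT);"
/home/hduser/hive/bin/hive -e "SELECT url, SUM(hits) FROM pageviews GROUP BY url;"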

h. Hbase?

HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the
database isn't an RDBMS which supports SQL as its primary access language, but there
are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL
database, whereas HBase is very much a distributed database. Technically speaking,
HBase is really more a "Data Store" than "Data Base" because it lacks many of the
features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and
advanced query languages, etc.

i. When Should I Use HBase?
• HBase isn't suitable for every problem.
• First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
• Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase a complete redesign rather than a port.
• Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication, which has a default of 3), plus a NameNode.
• HBase can run quite well stand-alone on a laptop, but this should be considered a development configuration only.

j. What Is The Difference Between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited for the storage of large files. Its
documentation states that it is not, however, a general purpose file system, and does not
provide fast individual record lookups in files. HBase, on the other hand, is built on top of
HDFS and provides fast record lookups (and updates) for large tables. This can
sometimes be a point of conceptual confusion. HBase internally puts your data in indexed
"StoreFiles" that exist on HDFS for high-speed lookups.

2. Prerequisite Software
To get the Hadoop, Flume, HBase and Hive distributions, download the recent stable tar files from one of the
Apache Download Mirrors.
Note: The configuration is set up on two Linux servers (Server1 and Server2).

2.1 Download the prerequisite software from the URLs below (Server1 machine)
a. Hadoop : http://download.nextag.com/apache/hadoop/common/stable/
b. Flume : http://archive.apache.org/dist/flume/stable/
c. Hive : http://download.nextag.com/apache/hive/stable/
d. Hbase : http://www.eng.lsu.edu/mirrors/apache/hbase/stable/
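For example, the tarballs can be fetched from a shell on Server1 with wget; the exact archive names under each mirror may vary over time, and the ones below are assumed to match the versions listed in section 2.3:

cd /home/hduser
wget http://download.nextag.com/apache/hadoop/common/stable/hadoop-1.1.2.tar.gz
wget http://archive.apache.org/dist/flume/stable/apache-flume-1.4.0-bin.tar.gz
wget http://download.nextag.com/apache/hive/stable/hive-0.10.0.tar.gz
wget http://www.eng.lsu.edu/mirrors/apache/hbase/stable/hbase-0.94.9.tar.gz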

2.2 Download Java 1.6/1.7:
http://www.oracle.com/technetwork/java/javase/downloads/index.html

2.3 Stable versions of the Hadoop components as of August 2013:
• Hadoop 1.1.2
• Flume 1.4.0
• HBase 0.94.9
• Hive 0.10.0

3. Cluster configuration on Server1 and Server2

Create a user and password with admin permission (on both Server1 and Server2).
3.1 Task: Add a user with group hadoop to the system
groupadd hadoop
useradd -G hadoop hduser

3.2 Task: Add a password to hduser
passwd hduser

3.3 Open /etc/hosts on the Server1 system and add entries for both machines, for example:

# Server1 (master)
102.54.94.97    rhino.acme.com    master
# Server2 (slave)
102.54.94.98    rhino.acme.com    slaves

Add the same entries on the Server2 system.

3.4 Create Authentication SSH-Keygen Keys on Server1 and Server2
3.4.1 First log in to server1 as user hduser and generate a key pair using the following command (note: same steps on server2):
ssh-keygen -t rsa -P ""

3.4.2 Upload the generated public key from server1 to server2
Use SSH from server1 and upload the newly generated public key (id_rsa.pub) to server2 under hduser's .ssh directory as a file named authorized_keys (note: same steps from server2 to server1). A consolidated sketch of these steps follows 3.4.3.

3.4.3 Log in from SERVER1 to SERVER2 without a password
ssh server1
ssh server2
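Putting 3.4.1–3.4.3 together, the key exchange from server1 to server2 might look like the sketch below (repeat in the opposite direction; host names as defined in /etc/hosts; ssh-copy-id may not be installed on every distribution, so a manual alternative is shown):

# on server1, as hduser
ssh-keygen -t rsa -P ""
# copy the public key into server2's authorized_keys (prompts for hduser's password once)
ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@slaves
# manual alternative if ssh-copy-id is unavailable
cat ~/.ssh/id_rsa.pub | ssh hduser@slaves "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
# verify password-less login
ssh slaves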

4. Hadoop

Create a directory called hadoop under /home/hduser:
mkdir hadoop
chmod -R 777 hadoop

a. Extract hadoop-1.1.2.tar.gz into the hadoop directory:
tar -xzvf hadoop-1.1.2.tar.gz
Check that the files were extracted into /home/hduser/hadoop, then:
sudo chown -R hduser:hadoop hadoop

b. Create the Hadoop temporary directory and set the required ownership and permissions:
$ sudo mkdir -p /hduser/hadoop/tmp
$ sudo chown hduser:hadoop /hduser/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /hduser/hadoop/tmp
Add the following snippets between the <configuration> ... </configuration> tags in the
respective configuration XML file.

core-site.xml

In file hadoop/conf/core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/hduser/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system. A URI whose scheme and authority determine
the FileSystem implementation. The uri's scheme determines the config property
(fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>


mapred-site.xml

In file conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are
run in-process as a single map and reduce task. </description>
</property>

hdfs-site.xml

In file conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when
the file is created. The default is used if replication is not specified in create time. </description>
</property>

In file conf/hadoop-env.sh, set JAVA_HOME; in conf/masters and conf/slaves, list the master and slave host names (one per line) as defined in /etc/hosts. A sketch of all three files follows.
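A minimal sketch of the three files, assuming Java is installed under /usr/lib/jvm/java-6-oracle (adjust to your actual JDK location) and the host names from /etc/hosts:

# conf/hadoop-env.sh (add or uncomment)
export JAVA_HOME=/usr/lib/jvm/java-6-oracle

# conf/masters
master

# conf/slaves
master
slaves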

c. Formatting the HDFS filesystem via the NameNode
hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop namenode -format

d. Starting your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh

e. Stopping your single-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/stop-all.sh

For clustering, open the Server2 (slave) system:

1. Log in as hduser
2. Make the directory /home/hduser (if it does not already exist)
3. Copy the hadoop directory to the Server2 (slave) system:
4. scp -r hduser@master:/home/hduser/hadoop /home/hduser/hadoop
5. Start your multi-node cluster
Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/start-all.sh
6. Check that the processes have started on both machines (master and slave); a verification sketch follows this list
7. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
8. ps -e | grep java
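One way to confirm the daemons on each machine is jps, which ships with the JDK (the expected process list below assumes the two-node layout configured above, with the master also listed in conf/slaves):

# on the master
jps          # expect NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker
# on the slave
jps          # expect DataNode and TaskTracker
# or, without jps
ps -e | grep java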

5. Flume
Apache Flume Configuration

1. Extract apache-flume-1.4.0-bin.tar into the flume directory:
tar -xzvf apache-flume-1.4.0-bin.tar
Check that the files were extracted into /home/hduser/flume, then:
sudo chown -R hduser:hadoop flume
2. In the flume directory, run the commands below:
a. sudo cp conf/flume-conf.properties.template conf/flume.conf
b. sudo cp conf/flume-env.sh.template conf/flume-env.sh
c. Open the conf directory and check that the five files are available

flume.conf

3. In file flume/conf/flume.conf overwrite the flume.conf file
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory

# Here exec1 is source name.
agent1.sources.exec1.channels = ch1
agent1.sources.exec1.type = exec
agent1.sources.exec1.command = tail -F /var/log/anaconda.log
# (point the command at whichever log file you want to collect)

# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
# Here HDFS is sink name.
agent1.sinks.HDFS.channel = ch1
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://master:9000/user/root/flumeout.log
agent1.sinks.HDFS.hdfs.fileType = DataStream

# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
# the source name can be anything (here exec1 is used)
agent1.sources = exec1
# the sink name can be anything (here HDFS is used)
agent1.sinks = HDFS

4. Run the command
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

5. Check that the file has been written to HDFS; a short verification sketch follows this list
6. Run the command: hduser@ubuntu:~$ /home/hduser/hadoop/bin/hadoop fs -ls
7. hadoop fs -cat /user/root/*
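If /var/log/anaconda.log is not receiving new lines, you can generate a few events by hand and then re-check the sink output (paths as configured in flume.conf above):

echo "flume test event $(date)" | sudo tee -a /var/log/anaconda.log
# give the agent a moment to flush, then:
/home/hduser/hadoop/bin/hadoop fs -ls /user/root/
/home/hduser/hadoop/bin/hadoop fs -cat /user/root/*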

6. Hive
Apache Hive Configuration

1. Extract hive-0.10.0.tar into the hive directory:
tar -xzvf hive-0.10.0.tar
Check that the files were extracted into /home/hduser/hive, then:
sudo chown -R hduser:hadoop hive

2. Overwrite hive/conf/hive-site.xml with the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive_metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
</configuration>
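The ConnectionURL above assumes a MySQL database named hive_metastore reachable as root/root, and the MySQL JDBC driver on Hive's classpath; a hedged setup sketch (the driver jar name is only an example):

# create the metastore database (assumes a MySQL server is already installed)
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS hive_metastore;"
# put the MySQL JDBC driver where Hive can find it
cp mysql-connector-java-5.1.25-bin.jar /home/hduser/hive/lib/
# create the warehouse directory declared in hive.metastore.warehouse.dir
/home/hduser/hadoop/bin/hadoop fs -mkdir /hive/warehouse
/home/hduser/hadoop/bin/hadoop fs -chmod g+w /hive/warehouse
# sanity check
/home/hduser/hive/bin/hive -e "SHOW TABLES;"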

7. HBase
Apache HBase Configuration

1. Extract hbase-0.94.9.tar into the hbase directory:
tar -xzvf hbase-0.94.9.tar
Check that the files were extracted into /home/hduser/hbase, then:
sudo chown -R hduser:hadoop hbase

2. Overwrite hbase/conf/hbase-site.xml with the following:

hbase-site.xml

<property>
<name>hbase.rootdir</name>
<value>hdfs://master:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hduser/hadoop/data/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

regionservers

3. In file hbase/conf/regionservers, overwrite the regionservers file with the region server host names (one per line):
master
slaves

4. Open the hbase directory and run the command below; a quick smoke test follows.
hbase/bin/start-hbase.sh
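A quick smoke test once HBase is up:

/home/hduser/hbase/bin/hbase shell
status      # should report the region servers listed in conf/regionservers
list        # no user tables yet on a fresh install
exit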

8. Example Applications and Organizations using Hadoop
• A9.com – Amazon: used to build Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
• Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2*4-CPU boxes with 4 TB of disk each); used to support research for Ad Systems and Web Search.
• AOL: used for a variety of things ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16 GB RAM and 800 GB of disk, giving a total of 37 TB of HDFS capacity.
• Facebook: used to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB raw storage.

9. References:

1. http://download.nextag.com/apache/hadoop/common/stable/
2. http://archive.apache.org/dist/flume/stable/
3. http://www.eng.lsu.edu/mirrors/apache/hbase/stable/
4. http://download.nextag.com/apache/hive/stable/
5. http://www.oracle.com/technetwork/java/javase/downloads/index.html
6. http://myadventuresincoding.wordpress.com/2011/12/22/linux-how-to-sshbetween-two-linux-computers-without-needing-a-password/
7. http://hadoop.apache.org/docs/stable/single_node_setup.html
8. http://hadoop.apache.org/docs/stable/cluster_setup.html
