RHive Tutorial Supplement 1 – Installing Hadoop
This tutorial is for beginning users without much prior knowledge of Hadoop. It
gives a simple explanation of how to install Hadoop before installing Hive and RHive.
RHive depends on Hive, and Hive in turn depends on Hadoop.
Thus both Hadoop and Hive must already be installed in order to install RHive.
The installation method introduced in this tutorial sets up a small Hadoop
environment for RHive.
This basic installation is useful for quickly building a small-scale
distributed environment using VMs or just a few servers.
For large, well-structured environments it may not be appropriate.
Installing Hadoop
Work Environment
The environment used in this tutorial is set up as follows:
• Server cluster environment: cloud service
• Number of servers: 4 virtual machines in total
• Server specs: virtual machine, 1 core, 1 GB main memory, 25 GB hard disk for
the OS, 2 TB additional hard disk
• OS: CentOS 5
• Network: 10.x.x.x IP addresses
Pre-installation Checklist
Checking root account, firewall, SElinux
You must be able to connect to the servers prepared for the Hadoop installation
with the root account, or have sudoer permission granting equivalent
system-level access.
Each server should also be free of special firewall or security settings.
If you are using a Linux system with such settings, you must have the clearance
to control them or already know how to work with them.
If SELinux or a firewall is running with strict rules in place for security
purposes, then you must manually configure the Hadoop-related ports or ACLs
(Access Control Lists), or simply disable SELinux and the firewall altogether.
This tutorial installs Hadoop on isolated VMs with no external access; since
they cannot be reached from outside, SELinux and the firewall on them are
disabled entirely.
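If you choose to disable them, the usual CentOS 5 commands are the following. They must be run as root, and are shown here for reference; this is only advisable on isolated machines like the VMs in this tutorial.

```shell
# Stop the firewall now and keep it off after reboots
service iptables stop
chkconfig iptables off

# Put SELinux into permissive mode immediately...
setenforce 0
# ...and disable it persistently across reboots
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
```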
Check Server IP Address
You must know the IP addresses of the servers you will be using.
The servers used in this tutorial each have the following IP addresses:
10.1.1.1
10.1.1.2
10.1.1.3
10.1.1.4
This tutorial will make 10.1.1.1 the Hadoop namenode.
And 10.1.1.2, 10.1.1.3, and 10.1.1.4 will become Hadoop's job nodes.
Preliminary preparations before installing Hadoop
Setting hosts file
There is a need to edit each server’s /etc/hosts
You might already know—these files are those that manually map hostnames
and IP addresses.
Doing this is to make setting Hadoop convenient.
Use the following settings to connect to all (four) servers and add the following
lines to /etc/hosts files.
10.1.1.1    node0
10.1.1.2    node1
10.1.1.3    node2
10.1.1.4    node3
node0 through node3 are arbitrary hostnames; any memorable names will do.
Keep in mind, however, that changing them after Hadoop has been installed and
run is quite risky.
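Since the same four lines must go into /etc/hosts on every server, it can be convenient to generate them once and append the file everywhere. A small sketch (the IPs and hostnames are simply this tutorial's choices):

```shell
# Write the tutorial's four host mappings to a temporary file
for i in 0 1 2 3; do
    printf '10.1.1.%s\tnode%s\n' "$((i + 1))" "$i"
done > /tmp/hosts.add

cat /tmp/hosts.add
# Then, as root on every server:  cat /tmp/hosts.add >> /etc/hosts
```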
Installing Java
As Hadoop is written in Java, a JVM is naturally required.
Java is often installed along with Linux, and even if it isn't, it can easily
be installed.
If your servers do not have Java installed, use the following command to
install it on all of them.
yum install java
Assigning the JAVA_HOME environment variable
The JAVA_HOME environment variable must be set.
The directory where the Java SDK or JRE is installed must be assigned to
JAVA_HOME; if your OS is CentOS, you can find it out with the following
command.
update-alternatives --display java
In the work environment used in this tutorial, JAVA_HOME is
/usr/lib/jvm/jre-1.6.0-openjdk.x86_64.
JAVA_HOME's path can vary with the user's environment and installed Java
version, so you must find out your own server's exact JAVA_HOME.
For that, refer to your Linux distribution's documentation or another document
on installing Java.
Once you have found JAVA_HOME, register the environment variable in
/etc/profile, ~/.bashrc, or a similar file.
export JAVA_HOME=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64
Naturally, installing Java and assigning JAVA_HOME should be done in the same
way on all servers.
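When registering the variable, it is worth making the edit idempotent so that rerunning your setup does not add duplicate lines. A minimal sketch, writing to a temporary file here for illustration; on a real server you would target /etc/profile or ~/.bashrc, with your own JAVA_HOME path:

```shell
# Hypothetical target file standing in for /etc/profile
PROFILE=/tmp/profile.example
JAVA_DIR=/usr/lib/jvm/jre-1.6.0-openjdk.x86_64

touch "$PROFILE"
# Append the export only if no JAVA_HOME export is present yet
grep -q '^export JAVA_HOME=' "$PROFILE" || \
    echo "export JAVA_HOME=$JAVA_DIR" >> "$PROFILE"

grep '^export JAVA_HOME=' "$PROFILE"
```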
Downloading Hadoop
Now we’ll start installing Hadoop.
As Hadoop is written in Java, simply decompressing the downloaded file
completes the installation.
Hadoop-1.0.0 is also packaged as rpm and deb, so you can install that version
with rpm or dpkg.
However, since Hive does not yet support Hadoop-1.0.0, it is not wise to use
that version with Hive.
Before installing, you must decide on and create a suitable directory to
decompress Hadoop into, in a location with sufficient disk space.
Hadoop uses a lot of space once it starts up, writing log files and storing
HDFS data.
So check that the disk where Hadoop will be installed has enough free space,
and if a large additional hard disk is installed somewhere, check where it is
mounted before installing.
In this tutorial, a hard disk of at least 2 TB is mounted at each server's
/mnt, and a /mnt/srv directory was created below it to hold the Hadoop
installation.
It is good to use the same directory structure on all the other servers as
well.
Make an arbitrary directory called srv, like below.
mkdir /mnt/srv
cd /mnt/srv
We will install Hadoop under the base directory chosen above.
Now we are going to download our Hadoop from Hadoop’s official website.
This tutorial recommends using version 0.20.203.
You can download every Hadoop version from the following site.
http://www.apache.org/dyn/closer.cgi/hadoop/common/
The same version must be installed to all the servers. One way to do this is to
copy the downloaded file to all servers.
Download Hadoop on the server as shown below.
wget http://apache.tt.co.kr//hadoop/common/hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz
You can also use whichever mirror site is closest to you.
Decompress the downloaded file.
tar xvfz hadoop-0.20.203.0rc1.tar.gz
Once you have downloaded it onto one server, you can use shell commands like
the following to create the same directory on the other servers and copy the
file to them in one go.
If you are not accustomed to shell programming, just do the same work manually
on every other server.
$ for I in `seq 3`; do ssh node$I 'mkdir /mnt/srv'; done
$ for I in `seq 3`; do scp hadoop*.gz node$I:/mnt/srv/; done
$ for I in `seq 3`; do ssh node$I 'cd /mnt/srv/; tar xvfz hadoop*.gz'; done
Making SSH Key
To enable the Hadoop namenode to control each node, you must create and
install a key with a null passphrase.
Hadoop connects from the namenode to each server to run the tasktracker or
datanode, and to do this it must be able to connect to each node without a
password.
This tutorial creates a key that allows connecting to all servers with the
root account.
With the command below, create a private/public key pair that does not ask for
a password.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Now register public key to authorized_keys.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now see if you can use the command below to connect to localhost via ssh
without entering a password.
ssh localhost
If you can log in without being asked for a password, it worked.
Now exit the localhost session.
exit
If you fail to connect or see a password prompt despite having created the
keys properly as described above, you may need to check and change the sshd
settings.
The sshd settings file is usually at /etc/ssh/sshd_config and can be edited
with a text editor.
Edit the sshd_config file using any familiar editor.
vi /etc/ssh/sshd_config
There are many configuration values in the file, but the items to focus on are
listed below.
If any of the lines below is disabled (a # at the start of the line, or the
line is missing), edit or insert it as follows, then quit the editor.
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
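Note that sshd only reads its configuration at start-up, so after editing sshd_config the service must reload it before the changes take effect. On CentOS 5 this is typically done as root with:

```shell
# Reload sshd's configuration; existing sessions stay open
service sshd reload
```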
If you still cannot connect to localhost via ssh without being asked for a
password despite having modified the settings file, then consult the system
administrator or refer to relevant documents on configuring sshd.
Now you must add your public key to the other servers'
~/.ssh/authorized_keys files.
Normally you would copy ~/.ssh/id_rsa.pub to each server and append it to
authorized_keys there, but for convenience this tutorial simply copies the
local authorized_keys file (which now contains the public key) to the other
servers.
Copy it as below.
$ for I in `seq 3`; do scp ~/.ssh/authorized_keys node$I:~/.ssh/; done
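Before moving on, it is worth confirming that key-based login now works from the namenode to every node. A quick check (BatchMode makes ssh fail immediately instead of prompting for a password):

```shell
for I in `seq 3`; do
    ssh -o BatchMode=yes node$I hostname \
        || echo "node$I: password-less login is NOT working"
done
```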
Fixing Hadoop Configurations
Once Hadoop is installed, its settings need configuring.
Head over to the Hadoop conf directory.
This tutorial modifies four files: hadoop-env.sh, core-site.xml,
mapred-site.xml, and hdfs-site.xml.
Move to Hadoop conf Directory
First, move to the conf directory of the Hadoop installation.
cd /mnt/srv/hadoop-0.20.203.0/conf
Modify hadoop-env.sh
Open a text editor and modify hadoop-env.sh.
vi hadoop-env.sh
Look for the lines shown below and edit the lines to your liking.
export JAVA_HOME=/usr/java/default
export HADOOP_LOG_DIR=/mnt/srv/hadoopdata/data/logs
JAVA_HOME can be set to the JAVA_HOME found earlier in this tutorial.
As HADOOP_LOG_DIR is where Hadoop's logs will be saved, choose a location with
sufficient space.
We will use a directory called /mnt/srv/hadoopdata/data/logs.
Editing core-site.xml
Open core-site.xml with a text editor.
vi core-site.xml
In here, adjust hadoop.tmp.dir and fs.default.name to appropriate values.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node0:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/srv/hadoopdata/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
Editing hdfs-site.xml
There is no need to edit hdfs-site.xml.
But should you need to change anything, you can open it and adjust its values
with a text editor, just like core-site.xml.
Open hdfs-site.xml with a text editor.
vi hdfs-site.xml
Should you want to increase the number of files Hadoop will simultaneously
open, adjust the values like below:
<configuration>
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
</configuration>
This setting is optional.
Editing mapred-site.xml
Open mapred-site.xml with a text editor like vi.
vi mapred-site.xml
When you open the file and look through the contents, you may find something
like the following. Edit the value of mapred.job.tracker to suit your
environment; use the defaults for the rest, or customize them to your liking.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node0:9001</value>
  </property>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>6</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048M</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>16</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>3600000</value>
  </property>
</configuration>
Activating Hadoop
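The configuration above tells Hadoop where everything is, but the daemons still have to be started from the namenode. The tutorial does not list the start-up commands themselves; under the layout used here a typical sequence is the following (the masters/slaves file contents and paths are this tutorial's choices; adjust them to your environment):

```shell
cd /mnt/srv/hadoop-0.20.203.0

# masters holds the secondary namenode host, slaves the worker hosts
echo node0 > conf/masters
printf 'node1\nnode2\nnode3\n' > conf/slaves

# Copy the finished configuration to every node
for I in `seq 3`; do scp conf/* node$I:/mnt/srv/hadoop-0.20.203.0/conf/; done

# Format HDFS once, then start all daemons over ssh
bin/hadoop namenode -format
bin/start-all.sh
```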
Checking whether Hadoop is Running
After installing and starting Hadoop, you can use a web browser to open a page
that shows Hadoop's status.
It is normally served on port 50030.
http://node0:50030/
If you see Hadoop’s state as “RUNNING” like below, then Hadoop is running
as normal.
node0 Hadoop Map/Reduce Administration
State: RUNNING
Started: Thu Jan 05 17:24:18 EST 2012
Version: 0.20.203.0, r1099333
Compiled: Wed May 4 07:57:50 PDT 2011 by oom
Identifier: 201201051724
Naturally, you cannot connect to the page above if the Hadoop namenode is
behind a firewall and port 50030 is not open.
Trying to Run MRbench
Hadoop provides several useful utilities by default.
Among them, the hadoop-test-* jar gives you an easy way to run a test
map/reduce job.
As the Hadoop version used in this tutorial is 0.20.203.0, the Hadoop home
directory should contain the hadoop-test-0.20.203.0.jar file.
You can check whether Hadoop's Map/Reduce is working with the following
command:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-test-0.20.203.0.jar mrbench
The results of executing the above command are as follows.
MRBenchmark.0.0.2
11/12/07 13:15:36 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
11/12/07 13:15:36 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-1026698718.txt
11/12/07 13:15:36 INFO mapred.MRBench: Running job 0: input=hdfs://node0:9000/benchmarks/MRBench/mr_input output=hdfs://node0:9000/benchmarks/MRBench/mr_output/output_1220591687
11/12/07 13:15:36 INFO mapred.FileInputFormat: Total input paths to process : 1
11/12/07 13:15:37 INFO mapred.JobClient: Running job: job_201112071314_0001
11/12/07 13:15:38 INFO mapred.JobClient:  map 0% reduce 0%
11/12/07 13:15:55 INFO mapred.JobClient:  map 50% reduce 0%
11/12/07 13:15:58 INFO mapred.JobClient:  map 100% reduce 0%
11/12/07 13:16:10 INFO mapred.JobClient:  map 100% reduce 100%
11/12/07 13:16:15 INFO mapred.JobClient: Job complete: job_201112071314_0001
11/12/07 13:16:15 INFO mapred.JobClient: Counters: 26
11/12/07 13:16:15 INFO mapred.JobClient: Job Counters
11/12/07 13:16:15 INFO mapred.JobClient: Launched reduce tasks=1
11/12/07 13:16:15 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22701
11/12/07 13:16:15 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/07 13:16:15 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/07 13:16:15 INFO mapred.JobClient: Launched map tasks=2
11/12/07 13:16:15 INFO mapred.JobClient: Data-local map tasks=2
11/12/07 13:16:15 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15000
11/12/07 13:16:15 INFO mapred.JobClient: File Input Format Counters
11/12/07 13:16:15 INFO mapred.JobClient: Bytes Read=4
11/12/07 13:16:15 INFO mapred.JobClient: File Output Format Counters
11/12/07 13:16:15 INFO mapred.JobClient: Bytes Written=3
11/12/07 13:16:15 INFO mapred.JobClient: FileSystemCounters
11/12/07 13:16:15 INFO mapred.JobClient: FILE_BYTES_READ=13
11/12/07 13:16:15 INFO mapred.JobClient: HDFS_BYTES_READ=244
11/12/07 13:16:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=63949
11/12/07 13:16:15 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3
11/12/07 13:16:15 INFO mapred.JobClient: Map-Reduce Framework
11/12/07 13:16:15 INFO mapred.JobClient: Map output materialized bytes=19
11/12/07 13:16:15 INFO mapred.JobClient: Map input records=1
11/12/07 13:16:15 INFO mapred.JobClient: Reduce shuffle bytes=19
11/12/07 13:16:15 INFO mapred.JobClient: Spilled Records=2
11/12/07 13:16:15 INFO mapred.JobClient: Map output bytes=5
11/12/07 13:16:15 INFO mapred.JobClient: Map input bytes=2
11/12/07 13:16:15 INFO mapred.JobClient: Combine input records=0
11/12/07 13:16:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=240
11/12/07 13:16:15 INFO mapred.JobClient: Reduce input records=1
11/12/07 13:16:15 INFO mapred.JobClient: Reduce input groups=1
11/12/07 13:16:15 INFO mapred.JobClient: Combine output records=0
11/12/07 13:16:15 INFO mapred.JobClient: Reduce output records=1
11/12/07 13:16:15 INFO mapred.JobClient: Map output records=1
DataLines    Maps    Reduces    AvgTime (milliseconds)
1            2       1          39487
If it ran without errors, Hadoop is working properly.
Now you can write your own Map/Reduce implementations and use Hadoop for
distributed processing.
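As a first end-to-end exercise before writing map/reduce code of your own, you can run the wordcount example shipped with Hadoop. A sketch, assuming the examples jar name of version 0.20.203.0 and using /tmp/in and /tmp/out as arbitrary HDFS paths chosen here:

```shell
# Put a small input file into HDFS and count its words
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp/in
$HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/conf/core-site.xml /tmp/in
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-0.20.203.0.jar \
    wordcount /tmp/in /tmp/out
$HADOOP_HOME/bin/hadoop fs -cat '/tmp/out/part-*' | head
```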