1. Cloudera Hadoop (CDH 4)
Installation on Ubuntu 12.04 LTS
Sumitra Pundlik
Assistant Professor
Department of Computer Engineering
MIT College of Engineering
Kothrud, Pune 411038
asavari.deshpande@mitcoe.edu.in
2. Agenda
● Introduction to Hadoop
● Various components of Hadoop
● Installation steps for Cloudera Hadoop
3. Introduction to Hadoop
● The Apache Hadoop software library is a
framework that allows for the distributed
processing of large data sets across clusters
of computers using simple programming
models.
● It is designed to scale up from single servers
to thousands of machines, each offering local
computation and storage.
● The library itself is designed to detect and
handle failures at the application layer.
5. The project includes these modules:
Hadoop Common: The common utilities that
support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A
distributed file system that provides high-throughput
access to application data.
Hadoop YARN: A framework for job scheduling and
cluster resource management.
Hadoop MapReduce: A YARN-based system for
parallel processing of large data sets.
6. Ambari™: A web-based tool for provisioning, managing, and
monitoring Apache Hadoop clusters which includes support for
Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase,
ZooKeeper, Oozie, Pig and Sqoop.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single
points of failure.
Chukwa™: A data collection system for managing large
distributed systems.
HBase™: A scalable, distributed database that supports
structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data
summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining
library.
7. ● Pig™: A high-level data-flow language and
execution framework for parallel computation.
● Spark™: A fast and general compute engine for
Hadoop data. Spark provides a simple and
expressive programming model that supports a
wide range of applications, including ETL,
machine learning, stream processing, and graph
computation.
● Tez™: A generalized data-flow programming
framework, built on Hadoop YARN.
● ZooKeeper™: A high-performance coordination
service for distributed applications.
8. Cloudera Hadoop Installation
● What is Cloudera Hadoop?
● What is Cloudera Manager?
● Prerequisite for installation
● Installation steps with screenshots
9. What is Cloudera Hadoop
● Cloudera describes CDH as the most complete, tested, and
popular distribution of Apache Hadoop.
● CDH is 100% Apache-licensed open source.
● CDH bundles the Hadoop-related projects in one
place.
11. What is Cloudera Manager
● Cloudera Manager automates the installation
and configuration of CDH on an entire cluster.
● Prerequisites
Update Ubuntu
Passwordless SSH
Passwordless sudo
Edit the hosts file
Install a database (MySQL, PostgreSQL, or Oracle)
Install the JDBC connector for the chosen database
12. Update Your Ubuntu Machine
● Run sudo apt-get update
● If the update fails, reset the apt package lists:
sudo -i
apt-get clean
cd /var/lib/apt
mv lists lists.old
mkdir -p lists/partial
apt-get clean
apt-get update
● If the problem persists, contact your
technical assistant
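The list-reset steps above can be wrapped in a small helper. This is a minimal sketch, not part of the original slides: the function name reset_apt_lists and its optional directory argument are assumptions, added so the helper can be tried on a scratch directory before it touches /var/lib/apt. Unlike the manual steps, it also removes an older lists.old backup so the move does not nest directories.

```shell
#!/bin/sh
# Move aside the cached apt package lists under an apt state directory
# and recreate the empty layout that apt expects.
# The directory defaults to /var/lib/apt (hypothetical helper, see lead-in).
reset_apt_lists() {
    apt_dir="${1:-/var/lib/apt}"
    # Drop any backup left by an earlier run so mv does not nest directories.
    rm -rf "$apt_dir/lists.old"
    if [ -d "$apt_dir/lists" ]; then
        mv "$apt_dir/lists" "$apt_dir/lists.old" || return 1
    fi
    mkdir -p "$apt_dir/lists/partial"
}

# On the real machine, as root:
#   apt-get clean && reset_apt_lists && apt-get update
```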
13. Passwordless SSH
● Secure Shell (SSH) is a cryptographic network protocol
for secure data communication, remote command-line
login, remote command execution, and other secure
network services between two networked computers.
● Install OpenSSH
sudo apt-get install openssh-server openssh-client
● Edit the sshd_config file in /etc/ssh/
sudo gedit /etc/ssh/sshd_config
and set PubkeyAuthentication to yes
● Reload the SSH service
sudo /etc/init.d/ssh reload
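Before reloading the service, the setting can be checked from the command line. The sketch below is illustrative only; pubkey_auth_enabled is a hypothetical helper name. It is deliberately strict: it reports success only when an uncommented PubkeyAuthentication yes line is present, even though OpenSSH enables public-key authentication by default when the option is absent.

```shell
#!/bin/sh
# Succeeds (exit status 0) when the given sshd_config file explicitly
# enables public-key authentication on an uncommented line.
# The file defaults to /etc/ssh/sshd_config (hypothetical helper).
pubkey_auth_enabled() {
    config="${1:-/etc/ssh/sshd_config}"
    # Case-insensitive match on "PubkeyAuthentication yes"; lines that
    # start with '#' are comments and do not match the anchored pattern.
    grep -Eiq '^[[:space:]]*PubkeyAuthentication[[:space:]]+yes([[:space:]]|$)' "$config"
}

# Example:
#   pubkey_auth_enabled && echo "public-key login is enabled"
```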
14. Passwordless SSH
● Run the following commands to set up passwordless SSH
1 ssh-keygen
2 ssh-add
3 ssh-copy-id -i exam@172.20.55.67
4 ssh exam@172.20.55.67
● For a cluster, run commands 3 and 4 from the master machine
once per client, using each client's hostname or
user_name@ip_address, so that the master can connect to
every client without a password.
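For more than a handful of clients, the per-client commands can be generated rather than typed. The following is a sketch under stated assumptions: print_key_setup is a hypothetical helper, and the second address in the example is a placeholder, not a machine from the slides. It only prints commands, so the output can be reviewed before anything runs.

```shell
#!/bin/sh
# Print the ssh-copy-id and ssh commands for each client, one per line.
# Nothing is executed here; pipe the output to sh once it looks right.
print_key_setup() {
    user="$1"
    shift
    for host in "$@"; do
        printf 'ssh-copy-id %s@%s\n' "$user" "$host"
        # BatchMode makes ssh fail instead of prompting for a password,
        # so a successful run shows that key-based login works.
        printf 'ssh -o BatchMode=yes %s@%s true\n' "$user" "$host"
    done
}

# Example (172.20.55.68 is a placeholder address):
#   print_key_setup exam 172.20.55.67 172.20.55.68
```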
15. Passwordless sudo
● Make sudo passwordless by changing the sudoers file
sudo visudo (opens /etc/sudoers and checks the syntax on save)
%sudo ALL=(ALL:ALL) NOPASSWD:ALL
save the file
● For a cluster, make the same change in the
sudoers file on every client machine
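Because the same rule has to be applied on every client, it may be easier to place it in a separate file than to edit /etc/sudoers by hand each time. This is a sketch, not the method shown on the slides: write_nopasswd_rule is a hypothetical helper, and the target path in the example (a drop-in under /etc/sudoers.d) is an assumption about how the machine is set up.

```shell
#!/bin/sh
# Write a passwordless-sudo rule for the sudo group to the given file,
# then validate it with visudo when that tool is installed.
write_nopasswd_rule() {
    target="$1"
    # '%%' prints a literal '%', giving the group name %sudo.
    printf '%%sudo ALL=(ALL:ALL) NOPASSWD: ALL\n' > "$target" || return 1
    # sudoers files are expected to be read-only.
    chmod 0440 "$target" || return 1
    if command -v visudo >/dev/null 2>&1; then
        # -c checks syntax only; -f names the file to check.
        visudo -cf "$target" >/dev/null
    fi
}

# Example, as root (the file name is a placeholder):
#   write_nopasswd_rule /etc/sudoers.d/cluster-nopasswd
```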
16. Edit hosts file
● In /etc/hosts, list the IP address and host name
of the machine, for example
172.20.55.62 ccompl0910
● For a cluster, add every client's IP address and
host name to the master's hosts file, and the
master's IP address and host name to each
client's hosts file
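Adding the same entries on many machines is easy to get wrong by hand. A minimal sketch of an idempotent helper follows; add_host_entry is a hypothetical name, and the optional third argument exists only so the function can be tried on a copy of the hosts file first.

```shell
#!/bin/sh
# Append "IP hostname" to a hosts file unless the hostname is already mapped.
# The file defaults to /etc/hosts (hypothetical helper).
add_host_entry() {
    ip="$1"
    name="$2"
    hosts_file="${3:-/etc/hosts}"
    # Skip comment lines, then look for the name as a whole field
    # after the address column.
    if awk -v n="$name" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example, as root:
#   add_host_entry 172.20.55.62 ccompl0910
```

Running it twice with the same arguments leaves a single entry, so the same script can be reused on the master and on each client.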
18. Install the JDBC connector and
secure the MySQL installation
sudo apt-get install libmysql-java
sudo /usr/bin/mysql_secure_installation
Enter current password for root (enter for none): password
Change the root password? [Y/n] n
Remove anonymous users? [Y/n] y
Disallow root login remotely? [Y/n] n
Remove test database and access to it? [Y/n] y
Reload privilege tables now? [Y/n] y
Restart the MySQL server
sudo service mysql restart
19. Create Database
mysql -u root -p (enter the root password when prompted)
create database sttpdatabase;
create database hive;
A separate database is needed for each of the following:
Activity Monitor
Service Monitor
Report Manager
Host Monitor
Cloudera Navigator
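Creating one database per service by hand is repetitive, so the statements can be generated. This is a sketch only: print_create_statements is a hypothetical helper, and the database names in the example (amon, smon, rman, hmon, nav) are placeholders chosen here for the five services listed above, not names given in the slides. The utf8 character set is likewise an assumption.

```shell
#!/bin/sh
# Print one CREATE DATABASE statement per name given on the command line.
# Nothing is sent to MySQL here; pipe the output to the mysql client.
print_create_statements() {
    for db in "$@"; do
        printf 'CREATE DATABASE IF NOT EXISTS %s DEFAULT CHARACTER SET utf8;\n' "$db"
    done
}

# Example (placeholder database names):
#   print_create_statements amon smon rman hmon nav | mysql -u root -p
```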
20. Supported OS
● Ubuntu 10.04 (Lucid Lynx), 64-bit
● Ubuntu 12.04 (Precise Pangolin), 64-bit
● Supported Browsers
Firefox 11 or later
Google Chrome
Internet Explorer 9
Safari 5 or later
22. Resources
● Cloudera Manager Server:
5 GB on the partition hosting /var.
500 MB on the partition hosting /usr
RAM - 4 GB is appropriate for most cases, and is
required when using Oracle databases
Python - Cloudera Manager uses Python.
● Installation Path
Path A: Automated Path
Path B: Your Own Method
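The disk figures listed under Resources can be checked before starting the installation. The sketch below covers disk space only; the helper names free_kb, at_least, and check_cm_disk are hypothetical. RAM is left out because the total reported by the kernel is usually slightly below the installed amount, which would make a strict 4 GB comparison misleading.

```shell
#!/bin/sh
# Free space, in kilobytes, on the partition that holds the given path.
# df -P guarantees one line per file system; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when the first number is at least the second.
at_least() {
    [ "$1" -ge "$2" ]
}

# Compare /var and /usr against the figures from the slide.
check_cm_disk() {
    status=0
    # 5 GB and 500 MB, expressed in kilobytes.
    at_least "$(free_kb /var)" $((5 * 1024 * 1024)) ||
        { echo "/var: less than 5 GB free"; status=1; }
    at_least "$(free_kb /usr)" $((500 * 1024)) ||
        { echo "/usr: less than 500 MB free"; status=1; }
    return $status
}

# Example:
#   check_cm_disk && echo "disk space looks sufficient"
```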
23. PATH A Installation
● Step 1: Download and Run the Cloudera Manager Installer
● Download cloudera-manager-installer.bin
● Install Cloudera Manager on a single host.
● Make the installer executable
chmod u+x cloudera-manager-installer.bin
● Run the installer
sudo ./cloudera-manager-installer.bin
● After the installer finishes, open a browser at
http://localhost:7180
● Login : admin
● Password : admin
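The web console may take a while to come up after the installer exits. One way to avoid reloading the browser by hand is to poll the URL, as in the sketch below. It assumes curl is installed; wait_for_url is a hypothetical helper, and the retry count and delay are arbitrary defaults rather than values from Cloudera.

```shell
#!/bin/sh
# Poll a URL until it answers without an HTTP error, or give up.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 10).
wait_for_url() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: fail on HTTP errors, -sS: quiet but still show real errors.
        if curl -fsS -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example:
#   wait_for_url http://localhost:7180 && echo "Cloudera Manager is up"
```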