Why is everyone interested in Big Data and Hadoop?
Why should you use Hadoop?
Read this and you, too, can quickly and easily become the proud owner of a Hadoop kit of your own, using Cloudera Free Edition.
************************NOTE**********************
This presentation is still being edited and new slides added every day. Stay tuned...
****************************************************
Instant Hadoop of Your Own
1. Instant Hadoop of your Own
Created by Jack Bezalel
Senior IT Architect
As part of the CTE Mentorship Program
CA Technologies
2. What’s Hadoop all about?
• OPPORTUNITY: We have access to amazingly
valuable data (Social Media, Mobile, …)
• PROBLEM: This data is seldom structured
• Relational databases and data warehouses MUST have
structured data, so they are off the list
• Hadoop = fast, reliable analysis of both
structured data and complex data
3. What’s in Hadoop?
• Reliable data storage using the Hadoop
Distributed File System (HDFS)
• High-Performance parallel data processing
using a technique called MapReduce.
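As a rough single-machine analogy (this is not Hadoop itself), the MapReduce idea can be sketched as a shell pipeline that counts words. The helper name wordcount is made up for illustration; on a real cluster the map and reduce steps would run in parallel on many servers over data stored in HDFS.

```shell
# Single-machine analogy of MapReduce word counting (illustration only; this
# does not use Hadoop). "wordcount" is a hypothetical helper name.
#   map:     tr splits the input into one lowercase word per line
#   shuffle: sort brings identical words together
#   reduce:  uniq -c collapses each group of identical words into a count
wordcount() {
    tr -cs '[:alnum:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sort |
        uniq -c |
        awk 'NF == 2 { print $2, $1 }'
}
```

Usage: pipe any text into it, for example: printf 'The cat the dog\n' | wordcount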
4. How does it scale so well?
• Hadoop runs on a collection of commodity,
shared-nothing servers
• You can add or remove servers in a Hadoop
cluster at will
• The system detects and compensates for
hardware or system problems on any server.
(self-healing)
5. Who uses Hadoop?
• Much of its early development and use was at Yahoo,
with Facebook as another early adopter
• Hadoop is now widely used in
– Finance
– Technology
– Telecom
– Media and entertainment
– Government
– Research institutions and other markets with
significant data
6. Why did we use Cloudera’s Hadoop
kit?
• Cloudera is an active contributor to the
Hadoop project
• Provides an enterprise-ready, commercial
Distribution for Hadoop
• The Cloudera distribution saves time by bundling
and testing the most popular projects related
to Hadoop into a single, easier-to-use package
7. The solution we tested is provided by
Cloudera Free Edition
• Automates the installation and configuration
of CDH3
• Covers an entire cluster (up to 50 nodes)
• Requires only root SSH access to your cluster's
machines
• Download Here:
https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Download
8. Cloudera Manager Free Edition
consists of:
• A small self-executing Cloudera Manager
installation program
• Server and other packages in preparation for
cluster host installation
• Cloudera Manager wizard for automating
CDH3 installation and configuration on the
cluster
• Cloudera Manager itself, for monitoring and configuring
the cluster after installation is complete
9. What does Cloudera Include - Flume
• Flume — Reliable Data Mover
• The primary use case
– a logging system
– gathers a set of log files on every machine
– aggregates them to a centralized persistent store
(such as HDFS)
10. What does Cloudera Include - Sqoop
• Sqoop — A tool that imports / exports data
between relational databases and Hadoop
clusters.
• Uses JDBC to import data into HDFS
• Generates Java classes that enable users to
interpret the table's schema
11. What does Cloudera Include - Hue
• Hue — GUI to work with CDH
• Web application
12. What does Cloudera Include - Pig
• Pig — Analyzes large amounts of data
• Using Pig's query language called Pig Latin
• Queries run distributed on a Hadoop cluster
13. What does Cloudera Include - Hive
• Hive — A powerful data warehousing APP
• Lets you access your data using HiveQL
• HiveQL = a language that is similar to SQL
14. What does Cloudera Include - HBase
• HBase — Large-scale tabular storage
• Using HDFS
• Cloudera recommends installing HBase in a
standalone mode before you try to run it on a
whole cluster.
15. What does Cloudera Include -
ZooKeeper
• ZooKeeper — Service that provides
coordination between distributed processes.
16. What does Cloudera Include - Oozie
• Oozie — A server-based workflow engine
• Runs workflow jobs with actions that execute
Hadoop jobs
• A command line client is also available for
Remote Management
17. What does Cloudera Include – 3 last
strangely named tools…
• Whirr — Provides a fast way to run cloud
services
• Snappy — A compression/decompression
library
• Mahout — A machine-learning tool. By
enabling you to build machine-learning
libraries that are scalable to "reasonably
large" datasets, it aims to make building
intelligent applications easier and faster
18. Setup Walkthrough
• Use Red Hat Enterprise Linux 5.5 or later (CentOS and others
are supported as well; we used RHEL 5.7)
• 64bit only
• 3 VMs used:
– Cloudera Manager
– 2 Nodes to deploy Hadoop on
19. About the Cloudera Manager Free
Edition Installation Program
• Automatically installs the package repositories
for Cloudera Manager and the Oracle JDK
• Installs the Cloudera Manager Server
• Installs and configures an embedded
PostgreSQL database
20. Download the CDH3 (Cloudera)
Manager
• http://archive.cloudera.com/cloudera-manager/installer/latest/cloudera-manager-installer.bin
21. Set yum.conf with your proxy, if you have one
• Add these lines to /etc/yum.conf on your first
Red Hat Hadoop node (example values shown)
proxy=http://proxy.corp.com:80
proxy_username=username
proxy_password=password
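One possible way to add the proxy line without duplicating it on a second run is sketched below. The helper name add_yum_proxy is hypothetical, and the URL is the example value from this slide; replace it with your own proxy. The username and password lines can be added the same way.

```shell
# add_yum_proxy FILE URL
# Hypothetical helper: append "proxy=URL" to a yum config file, but only if
# the file does not already contain a proxy= line.
add_yum_proxy() {
    grep -q '^proxy=' "$1" || printf 'proxy=%s\n' "$2" >> "$1"
}
```

Usage (as root): add_yum_proxy /etc/yum.conf http://proxy.corp.com:80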
22. Let the show begin!
• Make sure SELinux is disabled, or this won’t work!
– View the file /etc/sysconfig/selinux
– Make sure it contains this line:
SELINUX=disabled
– You will need to reboot if you changed the SELINUX
setting
• Launch the Cloudera Manager Installation:
sudo chmod u+x ./cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
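Before launching the installer, the SELinux setting described above can be checked from the command line. This is a minimal sketch; the helper name selinux_mode is made up for illustration, and it only reads the config file, so it shows what will apply after the next reboot.

```shell
# selinux_mode [FILE]
# Hypothetical helper: print the value of the SELINUX= line in the SELinux
# config file (defaults to /etc/sysconfig/selinux). Commented-out lines and
# the separate SELINUXTYPE= line are not matched.
selinux_mode() {
    sed -n 's/^SELINUX=//p' "${1:-/etc/sysconfig/selinux}"
}
```

Usage: selinux_mode should print "disabled". The standard getenforce command reports the mode that is active right now.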
55. How to start Hadooping – using its GUI
option (HUE)
• Download the HUE user guide right here:
https://ccp.cloudera.com/display/CDH4B2/Hue+2.0+User+Guide
58. Wait a Minute…
• Expect undocumented issues unless you do this:
• HUE requires a special user (let’s say “admin”)
• Tell HUE about it, the first time you use it
• Add the user to the Unix system as well
• Add the user to groups “hive” and “hadoop”
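The Unix side of these steps might look like the sketch below. It only prints the commands so they can be reviewed first. The helper name hue_user_commands is hypothetical, "admin" is the example user from this slide, and it is assumed that the hive and hadoop groups already exist on the machine.

```shell
# hue_user_commands [USER]
# Hypothetical helper: print (do not run) the commands that create the Unix
# account and add it to the hive and hadoop groups.
# Assumption: both groups already exist on this machine.
hue_user_commands() {
    printf 'useradd %s\n' "${1:-admin}"
    printf 'usermod -a -G hive,hadoop %s\n' "${1:-admin}"
}
```

Usage: run hue_user_commands admin, check the output, then run the printed commands as root.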