2. www.edureka.in/hadoop-admin
How It Works…
LIVE classes
Class recordings
Module-wise Quizzes and Practical Assignments
24x7 on-demand technical support
Deployment of different clusters
Online certification exam
Lifetime access to the Learning Management System
Course Topics
Week 1
– Understanding Big Data
– Hadoop Components
– Introduction to Hadoop 2.0
Week 2
– Hadoop 2.0
– Hadoop Configuration
– Hadoop Cluster Architecture
Week 3
– Different Hadoop Server Roles
– Data processing flow
– Cluster Network Configuration
Week 4
– Job Scheduling
– Fair Scheduler
– Monitoring a Hadoop Cluster
Week 5
– Securing your Hadoop Cluster
– Kerberos and HDFS Federation
– Backup and Recovery
Week 6
– Oozie and Hive Administration
– HBase Architecture
– HBase Administration
Topics for Today
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
MapReduce software framework
Hadoop Cluster Administrator: Roles and Responsibilities
Introduction to Hadoop 2.0
What Is Big Data?
Lots of data (Terabytes or Petabytes).
Systems and enterprises generate huge amounts of data, from Terabytes to Petabytes of information.
– An airline jet collects 10 Terabytes of sensor data for every 30 minutes of flying time.
– The NYSE generates about one Terabyte of new trade data per day, used for stock-trading analytics to determine trends for optimal trades.
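To put the jet-sensor figure in perspective, a back-of-envelope calculation (using decimal Terabytes, an assumption, since the slide does not say which unit it means) gives the sustained data rate:

```python
# Back-of-envelope: sustained data rate implied by the jet-sensor figure above.
TB = 10**12                   # decimal terabyte, in bytes (assumed)

jet_bytes = 10 * TB           # 10 TB of sensor data
window_s = 30 * 60            # per 30 minutes of flying time

rate_gb_per_s = jet_bytes / window_s / 10**9
print(f"{rate_gb_per_s:.2f} GB/s")  # ~5.56 GB/s sustained
```

That is several Gigabytes per second from a single aircraft, which is exactly the scale traditional single-server systems struggle with.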
Characteristics Of Big Data – IBM’s Definition
IBM defines Big Data by three characteristics (http://www-01.ibm.com/software/data/bigdata/):
– Volume: 12 Terabytes of tweets created each day
– Velocity: scrutinizing 5 million trade events created each day to identify potential fraud
– Variety: sensor data, audio, video, click streams, log files and more
Data Volume Is Growing Exponentially
Estimated global data volume:
– 2011: 1.8 ZB
– 2015: 7.9 ZB
The world’s information doubles every two years. Over the next 10 years:
– The number of servers worldwide will grow by 10x
– The amount of information managed by enterprise data centers will grow by 50x
– The number of “files” enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study
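The two figures above are roughly consistent with the doubling claim; a quick check under the simplifying assumption of smooth exponential growth:

```python
# Sanity check (assumption: smooth doubling every two years):
# 1.8 ZB in 2011 projected forward to 2015.
start_zb = 1.8
years = 2015 - 2011
projected = start_zb * 2 ** (years / 2)
print(f"{projected:.1f} ZB")  # 7.2 ZB -- close to the 7.9 ZB estimate above
```

The small gap simply reflects that the IDC study's growth rate is slightly faster than a clean doubling every two years.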
What Big Companies Have To Say…
McKinsey:
– “Analyzing Big Data sets will become a key basis for competition.”
– “Leaders in every sector will have to grapple with the implications of Big Data.”
Gartner:
– “Big Data analytics are rapidly emerging as the preferred solution to business and technology trends that are disrupting.”
– “Enterprises should not delay implementation of Big Data analytics.”
Forrester Research:
– “Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
– “Prioritize Big Data projects that might benefit from Hadoop.”
What Is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is open-source data management with scale-out storage and distributed processing.
Hadoop History
2002 – Doug Cutting & Mike Cafarella start working on Nutch
2003 – Google publishes the GFS paper
2004 – Google publishes the MapReduce paper
2005 – Doug Cutting adds DFS & MapReduce support to Nutch
2006 – Yahoo! hires Cutting; Hadoop spins out of Nutch
2007 – NY Times converts 4 TB of image archives over 100 EC2 instances
2008 – Cloudera founded; fastest sort of a TB: 3.5 minutes over 910 nodes; Facebook launches Hive: SQL support for Hadoop
2009 – Fastest sort of a TB: 62 seconds over 1,460 nodes; a PB sorted in 16.25 hours over 3,658 nodes; Doug Cutting joins Cloudera; Hadoop Summit 2009, 750 attendees
Hadoop 1.x Eco-System (layered, top to bottom):
– Apache Oozie (Workflow)
– Pig Latin (Data Analysis), Mahout (Machine Learning), Hive (DW System)
– MapReduce Framework
– HBase
– HDFS (Hadoop Distributed File System)
– Ingest: Flume (unstructured or semi-structured data), Sqoop (import or export of structured data)
Hadoop 1.x Core Components
Hadoop is a system for large-scale data processing. It has two main components:
HDFS – Hadoop Distributed File System (Storage)
– Distributed across “nodes”
– Natively redundant: self-healing, high-bandwidth clustered storage
– NameNode tracks block locations
MapReduce (Processing)
– Splits a task across processors “near” the data and assembles the results
– JobTracker manages the TaskTrackers
Additional Administration Tools:
– Filesystem utilities
– Job scheduling and monitoring
– Web UI
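The HDFS behaviour above (files split into blocks, each block replicated on several DataNodes, with the NameNode tracking locations) can be sketched as a toy model. The node names and the round-robin placement policy are illustrative assumptions, not real HDFS code:

```python
# Toy model of HDFS block placement: split a file into fixed-size blocks and
# replicate each block on several "DataNodes" while a "NameNode" map records
# the locations. Placement policy here is simplified (real HDFS is rack-aware).
BLOCK_SIZE = 64 * 1024 * 1024  # Hadoop 1.x default block size: 64 MB
REPLICATION = 3                # default replication factor

datanodes = ["dn1", "dn2", "dn3", "dn4"]

def place_blocks(file_size, nodes, replication=REPLICATION):
    """Return a NameNode-style map: block id -> list of DataNodes holding it."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    namenode_map = {}
    for b in range(n_blocks):
        # Naive round-robin placement across the available nodes.
        namenode_map[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return namenode_map

layout = place_blocks(200 * 1024 * 1024, datanodes)  # a 200 MB file -> 4 blocks
print(layout)
```

Because every block lives on three different nodes, losing any single DataNode leaves every block still readable, which is what the slide means by "natively redundant".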
Name Node and Data Nodes
NameNode:
– master of the system
– maintains and manages the blocks which are present on the DataNodes
DataNodes:
– slaves which are deployed on each machine and provide the actual storage
– responsible for serving read and write requests for the clients
Secondary Name Node
Secondary NameNode:
– Not a hot standby for the NameNode
– Connects to the NameNode every hour*
– Performs housekeeping and keeps a backup of NameNode metadata
– The saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode (a single point of failure) hands its metadata to the Secondary NameNode every hour, and the Secondary NameNode keeps it safe.]
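The housekeeping step above is essentially a checkpoint: the Secondary NameNode merges the NameNode's edit log into the saved filesystem image. A minimal sketch of that merge, with made-up metadata and edit records (not real HDFS on-disk formats):

```python
# Toy checkpoint: merge an edit log into a saved copy of the namespace image,
# the way the Secondary NameNode periodically compacts NameNode metadata.
# Paths, block ids, and record formats are illustrative only.
fsimage = {"/logs/a.txt": {"blocks": [0, 1]}}          # last saved namespace
edit_log = [
    ("create", "/logs/b.txt", {"blocks": [2]}),
    ("delete", "/logs/a.txt", None),
]

def checkpoint(image, edits):
    """Apply the edit log to a copy of the image, producing the new fsimage."""
    new_image = dict(image)
    for op, path, meta in edits:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/logs/b.txt': {'blocks': [2]}}
```

After the merge, the edit log can be truncated; if the NameNode later fails, the latest merged image is the starting point for rebuilding it.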
What Is MapReduce?
MapReduce is a programming model
It is neither platform- nor language-specific
Record-oriented data processing (key and value)
Task distributed across multiple nodes
Where possible, each node processes data
stored on that node
Consists of two phases
Map
Reduce
What Is MapReduce? (Contd.)
The process can be considered similar to a Unix pipeline:
cat /my/log | grep '.html' | sort | uniq -c > /my/outfile
(grep = MAP, sort = SORT/shuffle, uniq -c = REDUCE)
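The pipeline analogy can be made concrete in pure Python; the log lines below are made up, and the three phases mirror grep, sort, and uniq -c:

```python
# A pure-Python sketch of the pipeline above as Map -> Sort/Shuffle -> Reduce.
from itertools import groupby

log_lines = [
    "GET /index.html 200",
    "GET /style.css 200",
    "GET /index.html 304",
    "GET /about.html 200",
]

# MAP: emit a (key, 1) record per matching line, like `grep '.html'`
mapped = [(line.split()[1], 1) for line in log_lines if ".html" in line]

# SORT/SHUFFLE: bring all values for the same key together, like `sort`
mapped.sort(key=lambda kv: kv[0])

# REDUCE: sum the values per key, like `uniq -c`
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)  # {'/about.html': 1, '/index.html': 2}
```

In real Hadoop the map and reduce steps run on many nodes in parallel, and the sort/shuffle phase moves each key's records to the node running its reducer.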
Hadoop Cluster Administrator: Roles and Responsibilities
– Deploying the cluster
– Performance and availability of the cluster
– Job scheduling and management
– Upgrades
– Backup and recovery
– Monitoring the cluster
– Troubleshooting
Hadoop 1.0 Vs. Hadoop 2.0

Property            | Hadoop 1.x              | Hadoop 2.x
NameNodes           | 1                       | Many
High Availability   | Not present             | Highly Available
Processing Control  | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master
MRv1 Vs. MRv2
Hadoop 1.0:
– MapReduce (data processing) sits directly on HDFS (data storage); the Job Tracker handles both cluster resources and job scheduling
– Problems with resource utilization: slots only for Map and Reduce
Hadoop 2.0:
– MapReduce and other data-processing frameworks run on YARN (cluster resource management), which sits on HDFS (data storage)
– Provides a cluster-level Resource Manager (Scheduler + Applications Manager (AsM)), with application-level resource management handled by per-node Node Managers and a per-application App Master
– Provides slots for jobs other than Map and Reduce
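YARN's key change, replacing fixed Map/Reduce slots with general containers granted by a cluster-level scheduler, can be sketched with a toy allocator. Node names, capacities, and the first-fit policy are illustrative assumptions, not YARN's actual scheduler:

```python
# Toy YARN-style allocation: a cluster-level scheduler grants containers out of
# per-NodeManager memory capacity, instead of fixed Map/Reduce slots.
nodes = {"nm1": 8192, "nm2": 8192}  # free memory (MB) per NodeManager

def allocate(requests, free):
    """First-fit: place each (app, mb) container request on a node with room."""
    placements = []
    for app, mb in requests:
        for node, avail in free.items():
            if avail >= mb:
                free[node] -= mb
                placements.append((app, node, mb))
                break
    return placements

# Any framework can request containers, not just MapReduce.
reqs = [("mapreduce-job", 4096), ("spark-app", 6144), ("mapreduce-job", 4096)]
print(allocate(reqs, dict(nodes)))
```

Because capacity is generic memory rather than map-only or reduce-only slots, a non-MapReduce application can fill capacity that MRv1 would have left idle.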
Hadoop 2.0 - Architecture
[Diagram:]
– The Client talks to HDFS and to the YARN Resource Manager.
– The Active NameNode logs all namespace edits to shared NFS storage (shared edit logs); there is a single writer, enforced by fencing.
– The Standby NameNode reads the shared edit logs and applies them to its own namespace.
– A Secondary NameNode is also shown.
– Each slave machine runs a Data Node plus a Node Manager hosting Containers and an App Master.
Assignments
Attempt the following assignments using the documents present in the LMS:
– Single Node Apache Hadoop 1.0 Installation on Ubuntu
– Execute Linux Basic Commands
– Execute HDFS Hands On
– Cloudera CDH3 and CDH4 Quick VM installation on your local machine