2. www.edureka.in/hadoop-admin
How It Works…
LIVE classes
Class recordings
Module-wise Quizzes and Practical Assignments
24x7 on-demand technical support
Deployment of different clusters
Online certification exam
Lifetime access to the Learning Management System
Course Topics
Week 1
– Understanding Big Data
– Hadoop Components
– Introduction to Hadoop 2.0
Week 2
– Hadoop 2.0
– Hadoop Configuration
– Hadoop Cluster Architecture
Week 3
– Different Hadoop Server Roles
– Data processing flow
– Cluster Network Configuration
Week 4
– Job Scheduling
– Fair Scheduler
– Monitoring a Hadoop Cluster
Week 5
– Securing your Hadoop Cluster
– Kerberos and HDFS Federation
– Backup and Recovery
Week 6
– Oozie and Hive Administration
– HBase Architecture
– HBase Administration
Topics for Today
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
MapReduce software framework
Hadoop Cluster Administrator: Roles and Responsibilities
Introduction to Hadoop 2.0
What Is Big Data?
Lots of data (Terabytes or Petabytes).
Systems and enterprises generate huge amounts of data, from Terabytes to Petabytes of information.
– An airline jet collects 10 Terabytes of sensor data for every 30 minutes of flying time.
– The NYSE generates about one Terabyte of new trade data per day, used for stock-trading analytics to determine trends for optimal trades.
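To put the jet-sensor figure in perspective, a back-of-envelope calculation (using decimal Terabytes, an assumption, since the slide does not say which unit it means) gives the sustained data rate:

```python
# Back-of-envelope: sustained data rate implied by the jet-sensor figure above.
TB = 10**12                   # decimal terabyte, in bytes (assumed)

jet_bytes = 10 * TB           # 10 TB of sensor data
window_s = 30 * 60            # per 30 minutes of flying time

rate_gb_per_s = jet_bytes / window_s / 10**9
print(f"{rate_gb_per_s:.2f} GB/s")  # ~5.56 GB/s sustained
```

That is several Gigabytes per second from a single aircraft, which is exactly the scale traditional single-server systems struggle with.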
Characteristics Of Big Data – IBM’s Definition
IBM defines Big Data by three characteristics (http://www-01.ibm.com/software/data/bigdata/):
– Volume: 12 Terabytes of tweets created each day
– Velocity: scrutinizing 5 million trade events created each day to identify potential fraud
– Variety: sensor data, audio, video, click streams, log files and more
Data Volume Is Growing Exponentially
Estimated global data volume:
– 2011: 1.8 ZB
– 2015: 7.9 ZB
The world’s information doubles every two years. Over the next 10 years:
– The number of servers worldwide will grow by 10x
– The amount of information managed by enterprise data centers will grow by 50x
– The number of “files” enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digital-universe.htm, based on the 2011 IDC Digital Universe Study
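The two figures above are roughly consistent with the doubling claim; a quick check under the simplifying assumption of smooth exponential growth:

```python
# Sanity check (assumption: smooth doubling every two years):
# 1.8 ZB in 2011 projected forward to 2015.
start_zb = 1.8
years = 2015 - 2011
projected = start_zb * 2 ** (years / 2)
print(f"{projected:.1f} ZB")  # 7.2 ZB -- close to the 7.9 ZB estimate above
```

The small gap simply reflects that the IDC study's growth rate is slightly faster than a clean doubling every two years.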
What Big Companies Have To Say…
McKinsey:
– “Analyzing Big Data sets will become a key basis for competition.”
– “Leaders in every sector will have to grapple with the implications of Big Data.”
Gartner:
– “Big Data analytics are rapidly emerging as the preferred solution to business and technology trends that are disrupting.”
– “Enterprises should not delay implementation of Big Data analytics.”
Forrester Research:
– “Use Hadoop to gain a competitive advantage over more risk-averse enterprises.”
– “Prioritize Big Data projects that might benefit from Hadoop.”
What Is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is open-source data management with scale-out storage and distributed processing.
Hadoop History
2002 – Doug Cutting & Mike Cafarella start working on Nutch
2003 – Google publishes the GFS paper
2004 – Google publishes the MapReduce paper
2005 – Doug Cutting adds DFS & MapReduce support to Nutch
2006 – Yahoo! hires Cutting; Hadoop spins out of Nutch
2007 – NY Times converts 4 TB of image archives over 100 EC2 instances
2008 – Cloudera founded; fastest sort of a TB: 3.5 minutes over 910 nodes; Facebook launches Hive: SQL support for Hadoop
2009 – Fastest sort of a TB: 62 seconds over 1,460 nodes; a PB sorted in 16.25 hours over 3,658 nodes; Doug Cutting joins Cloudera; Hadoop Summit 2009, 750 attendees
Hadoop 1.x Eco-System (layered, top to bottom):
– Apache Oozie (Workflow)
– Pig Latin (Data Analysis), Mahout (Machine Learning), Hive (DW System)
– MapReduce Framework
– HBase
– HDFS (Hadoop Distributed File System)
– Ingest: Flume (unstructured or semi-structured data), Sqoop (import or export of structured data)
Hadoop 1.x Core Components
Hadoop is a system for large-scale data processing. It has two main components:
HDFS – Hadoop Distributed File System (Storage)
– Distributed across “nodes”
– Natively redundant: self-healing, high-bandwidth clustered storage
– NameNode tracks block locations
MapReduce (Processing)
– Splits a task across processors “near” the data and assembles the results
– JobTracker manages the TaskTrackers
Additional Administration Tools:
– Filesystem utilities
– Job scheduling and monitoring
– Web UI
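The HDFS behaviour above (files split into blocks, each block replicated on several DataNodes, with the NameNode tracking locations) can be sketched as a toy model. The node names and the round-robin placement policy are illustrative assumptions, not real HDFS code:

```python
# Toy model of HDFS block placement: split a file into fixed-size blocks and
# replicate each block on several "DataNodes" while a "NameNode" map records
# the locations. Placement policy here is simplified (real HDFS is rack-aware).
BLOCK_SIZE = 64 * 1024 * 1024  # Hadoop 1.x default block size: 64 MB
REPLICATION = 3                # default replication factor

datanodes = ["dn1", "dn2", "dn3", "dn4"]

def place_blocks(file_size, nodes, replication=REPLICATION):
    """Return a NameNode-style map: block id -> list of DataNodes holding it."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    namenode_map = {}
    for b in range(n_blocks):
        # Naive round-robin placement across the available nodes.
        namenode_map[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return namenode_map

layout = place_blocks(200 * 1024 * 1024, datanodes)  # a 200 MB file -> 4 blocks
print(layout)
```

Because every block lives on three different nodes, losing any single DataNode leaves every block still readable, which is what the slide means by "natively redundant".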
Name Node and Data Nodes
NameNode:
– master of the system
– maintains and manages the blocks which are present on the DataNodes
DataNodes:
– slaves which are deployed on each machine and provide the actual storage
– responsible for serving read and write requests for the clients
Secondary Name Node
Secondary NameNode:
– Not a hot standby for the NameNode
– Connects to the NameNode every hour*
– Performs housekeeping and keeps a backup of NameNode metadata
– The saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode (a single point of failure) hands its metadata to the Secondary NameNode every hour, and the Secondary NameNode keeps it safe.]
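The housekeeping step above is essentially a checkpoint: the Secondary NameNode merges the NameNode's edit log into the saved filesystem image. A minimal sketch of that merge, with made-up metadata and edit records (not real HDFS on-disk formats):

```python
# Toy checkpoint: merge an edit log into a saved copy of the namespace image,
# the way the Secondary NameNode periodically compacts NameNode metadata.
# Paths, block ids, and record formats are illustrative only.
fsimage = {"/logs/a.txt": {"blocks": [0, 1]}}          # last saved namespace
edit_log = [
    ("create", "/logs/b.txt", {"blocks": [2]}),
    ("delete", "/logs/a.txt", None),
]

def checkpoint(image, edits):
    """Apply the edit log to a copy of the image, producing the new fsimage."""
    new_image = dict(image)
    for op, path, meta in edits:
        if op == "create":
            new_image[path] = meta
        elif op == "delete":
            new_image.pop(path, None)
    return new_image

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/logs/b.txt': {'blocks': [2]}}
```

After the merge, the edit log can be truncated; if the NameNode later fails, the latest merged image is the starting point for rebuilding it.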
What Is MapReduce?
MapReduce is a programming model
It is neither platform- nor language-specific
Record-oriented data processing (key and value)
Task distributed across multiple nodes
Where possible, each node processes data
stored on that node
Consists of two phases
Map
Reduce
What Is MapReduce? (Contd.)
The process can be considered similar to a Unix pipeline:
cat /my/log | grep '.html' | sort | uniq -c > /my/outfile
(grep = MAP, sort = SORT/shuffle, uniq -c = REDUCE)
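The pipeline analogy can be made concrete in pure Python; the log lines below are made up, and the three phases mirror grep, sort, and uniq -c:

```python
# A pure-Python sketch of the pipeline above as Map -> Sort/Shuffle -> Reduce.
from itertools import groupby

log_lines = [
    "GET /index.html 200",
    "GET /style.css 200",
    "GET /index.html 304",
    "GET /about.html 200",
]

# MAP: emit a (key, 1) record per matching line, like `grep '.html'`
mapped = [(line.split()[1], 1) for line in log_lines if ".html" in line]

# SORT/SHUFFLE: bring all values for the same key together, like `sort`
mapped.sort(key=lambda kv: kv[0])

# REDUCE: sum the values per key, like `uniq -c`
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)  # {'/about.html': 1, '/index.html': 2}
```

In real Hadoop the map and reduce steps run on many nodes in parallel, and the sort/shuffle phase moves each key's records to the node running its reducer.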
Hadoop Cluster Administrator: Roles and Responsibilities
– Deploying the cluster
– Performance and availability of the cluster
– Job scheduling and management
– Upgrades
– Backup and recovery
– Monitoring the cluster
– Troubleshooting
Hadoop 1.0 Vs. Hadoop 2.0

Property            | Hadoop 1.x              | Hadoop 2.x
NameNodes           | 1                       | Many
High Availability   | Not present             | Highly Available
Processing Control  | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master
MRv1 Vs. MRv2
Hadoop 1.0:
– MapReduce (data processing) sits directly on HDFS (data storage); the Job Tracker handles both cluster resources and job scheduling
– Problems with resource utilization: slots only for Map and Reduce
Hadoop 2.0:
– MapReduce and other data-processing frameworks run on YARN (cluster resource management), which sits on HDFS (data storage)
– Provides a cluster-level Resource Manager (Scheduler + Applications Manager (AsM)), with application-level resource management handled by per-node Node Managers and a per-application App Master
– Provides slots for jobs other than Map and Reduce
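YARN's key change, replacing fixed Map/Reduce slots with general containers granted by a cluster-level scheduler, can be sketched with a toy allocator. Node names, capacities, and the first-fit policy are illustrative assumptions, not YARN's actual scheduler:

```python
# Toy YARN-style allocation: a cluster-level scheduler grants containers out of
# per-NodeManager memory capacity, instead of fixed Map/Reduce slots.
nodes = {"nm1": 8192, "nm2": 8192}  # free memory (MB) per NodeManager

def allocate(requests, free):
    """First-fit: place each (app, mb) container request on a node with room."""
    placements = []
    for app, mb in requests:
        for node, avail in free.items():
            if avail >= mb:
                free[node] -= mb
                placements.append((app, node, mb))
                break
    return placements

# Any framework can request containers, not just MapReduce.
reqs = [("mapreduce-job", 4096), ("spark-app", 6144), ("mapreduce-job", 4096)]
print(allocate(reqs, dict(nodes)))
```

Because capacity is generic memory rather than map-only or reduce-only slots, a non-MapReduce application can fill capacity that MRv1 would have left idle.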
Hadoop 2.0 - Architecture
[Diagram:]
– The Client talks to HDFS and to the YARN Resource Manager.
– The Active NameNode logs all namespace edits to shared NFS storage (shared edit logs); there is a single writer, enforced by fencing.
– The Standby NameNode reads the shared edit logs and applies them to its own namespace.
– A Secondary NameNode is also shown.
– Each slave machine runs a Data Node plus a Node Manager hosting Containers and an App Master.
Assignments
Attempt the following assignments using the documents present in the LMS:
– Single Node Apache Hadoop 1.0 Installation on Ubuntu
– Execute Linux Basic Commands
– Execute HDFS Hands On
– Cloudera CDH3 and CDH4 Quick VM installation on your local machine