This document provides an overview of Big Data and Hadoop. It discusses what Big Data is, why existing data analytics approaches have limitations, and how Hadoop addresses these issues. Hadoop uses a master-slave architecture with the NameNode as master and DataNodes as slaves. It stores data in HDFS as blocks across DataNodes and allows distributed processing via MapReduce. The document covers Hadoop 1.0 and 2.0 components as well as challenges of Hadoop 1.x like single point of failure and lack of high availability of the NameNode.
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
1. Naveen P.N
Trainer
NPN Training – Training is the essence of success and we are committed to it.
www.npntraining.com
Module 01 - Understanding Big Data and Hadoop
Includes (Hadoop 1.x & 2.x Architecture)
2. Topics for the Module
After completing the module, you will be able to understand:
What is Big Data
OLTP vs OLAP
Limitations of existing Data Analytics
Moving Data into Code
Moving Code into Data
Hadoop 1.0 / 2.0 Core Components
Hadoop 2.0 Core Components
Hadoop Master/Slave Architecture
File Blocks
Rack Awareness
Anatomy of File Read and Write
Hadoop 1.x Challenges
Scala REPL
Scala Variable Types
3. Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications.
What is Big Data
4. Where Is This "Big Data" Coming From?
12+ TB of tweet data every day
25+ TB of log data every day
? TB of data every day
2+ billion people on the Web by end of 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
5. About RDBMS
Why do I need an RDBMS?
For quick responses.
It enables relationships between data elements to be defined and managed.
It enables one database to be utilized for all applications.
If the data is presently stored in an RDBMS, what is the problem? Why does the Big Data problem arise?
6. OLTP VS OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide the source data for data warehouses, whereas OLAP systems help to analyze it.
7. "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new
forms of processing to enable enhanced decision making, insight discovery and process
optimization”
Big Data spans three dimensions (3Vs)
8. Limitation of Existing Data Analytics Architecture
A storage-only grid (SAN) holds the raw data; an ETL compute grid loads aggregated data into an RDBMS.
1. Can't explore the original high-fidelity raw data.
3. Premature data death: 90% of the data is archived, and a meagre 10% of the data is available for BI.
9. Solution: A Combined Storage + Compute Layer
Hadoop acts as a storage + compute grid, feeding BI reports + interactive apps alongside the RDBMS (aggregated data):
Scalable throughput for ETL & aggregation
Data exploration & advanced analytics
No data archiving: keep data alive forever
Both storage and compute grid together
The entire data set is available for processing
10. Processing Data in Enterprise – Traditional Approach
Processing 1 TB of data on 1 machine with 4 I/O channels, each channel at 100 MB/s, takes about 45 minutes.
Limitation: a single machine's aggregate I/O bandwidth caps throughput.
11. Processing Data in DFS – Hadoop Approach
Processing 1 TB of data on 10 machines, each with 4 I/O channels at 100 MB/s per channel, takes about 4.3 minutes.
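A quick check of these numbers: 1 TB ≈ 1,000,000 MB. One machine reads through 4 channels × 100 MB/s = 400 MB/s, and 1,000,000 MB ÷ 400 MB/s = 2,500 s ≈ 42 minutes (the slide rounds to 45 minutes). Ten machines reading in parallel give 4,000 MB/s, and 1,000,000 MB ÷ 4,000 MB/s = 250 s ≈ 4.2 minutes, matching the 4.3 minutes quoted.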
12. What is Apache Hadoop
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
To solve the Big Data problem, a new framework evolved: Hadoop. Hadoop provides:
Commodity hardware
Big clusters
MapReduce
Failover
Data distribution
Moving code to data
Heterogeneous hardware
Scalability
Hadoop is based on work done by Google in the early 2000s:
Google File System (GFS) paper published in 2003
MapReduce paper published in 2004
It is an architecture that can scale with the huge volume, variety and speed requirements of Big Data by distributing the work across dozens, hundreds, or even thousands of commodity servers that process the data in parallel.
13. Moving data into code (contd.)
A user with terabytes of data wants to analyze it.
In the traditional data processing architecture, nodes are broken up into separate processing and storage nodes connected by a high-capacity link.
Many data-intensive applications are CPU-demanding, causing bottlenecks in the network.
There is latency in transferring the data.
14. Moving code to data
Hadoop takes a radically new approach to the problem of distributed computing (a sketch of the model follows below):
Distribute the data to multiple nodes.
Distribute the program for computation to these multiple nodes (the client writes MapReduce jobs).
Individual nodes then work on the data that stays on their own node.
No data transfer over the network is required for the initial processing.
Additional nodes can be added for scalability.
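To make "moving code to data" concrete, here is a minimal sketch of the classic word-count job against the Hadoop 2.x MapReduce Java API. It is not part of the original slides; the input/output paths are whatever the user supplies:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map tasks run on the nodes holding the data blocks ("code moves to data")
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    // Reduce tasks sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on map nodes
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note how the job itself never copies the input across the network: the framework schedules each map task on (or near) a node that already stores the corresponding block.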
15. Distribution Vendors
Cloudera Distribution for Hadoop (CDH)
MapR Distribution
Hortonworks Data Platform
Apache BigTop Distribution
16. Hadoop 1.0 Core Components
Hadoop has two main components:
1. HDFS – Hadoop Distributed File System (Storage): responsible for storing the data in chunks, by splitting files into blocks of 64 MB each.
2. MapReduce (Processing): processes the data in a massively parallel manner.
Daemons:
HDFS (Storage): NameNode (Master), DataNode (Slave), Secondary NameNode
MapReduce (Processing): JobTracker (Master), TaskTracker (Slave)
17. Hadoop 2.0 Core Components
Hadoop 2.0 has two main components:
1. HDFS – Hadoop Distributed File System (Storage): responsible for storing the data in chunks, by splitting files into blocks of 128 MB each.
2. YARN/MRv2 (Processing): processes the data in a massively parallel manner.
Daemons:
HDFS (Storage): NameNode (Master), DataNode (Slave), Secondary NameNode
YARN/MRv2 (Processing): ResourceManager (Master), NodeManager (Slave)
18. HDFS – Hadoop Distributed File System
HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
The HDFS architecture follows a master/slave design: a cluster comprises a single NameNode (master node) and a number of DataNodes (slave nodes).
For contrast, in a conventional file system each file is divided into very small blocks (on the order of the 512-byte disk sector); HDFS uses far larger blocks.
19. File Blocks
By default, the block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Why is the block size so large?
The main reason for having large HDFS blocks is the cost of seek time: with large blocks, the time spent transferring data dominates the time spent seeking.
The large block size also makes proper use of storage space while respecting the limit on the NameNode's memory, since fewer blocks means less metadata to hold in RAM.
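As a small, hedged illustration (not part of the original slides; the NameNode URI is a placeholder), a client can ask the cluster for its configured default block size through the Hadoop Java API:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Reports the dfs.blocksize value, e.g. 134217728 bytes (128 MB) on Hadoop 2.x
        System.out.println("Default block size: " + fs.getDefaultBlockSize() + " bytes");
    }
}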
20. Hadoop 1.0 Master/Slave Architecture – Simple cluster setup with Hadoop daemons
Master: the NameNode, JobTracker and Secondary NameNode each run in a single box (optionally spread across two boxes).
Slaves (in separate boxes, many): Slave1, Slave2 and Slave3 each run a TaskTracker and a DataNode.
21. Hadoop 2.0 Master/Slave Architecture – Simple cluster setup with Hadoop daemons
Master: the NameNode (Active), NameNode (Standby), ResourceManager and Secondary NameNode each run in a single box (optionally spread across two boxes).
Slaves (in separate boxes, many): Slave1, Slave2 and Slave3 each run a NodeManager and a DataNode.
23. Hadoop Cluster: A Typical Use Case
24. File Blocks in HDFS
A client that wants to save 400 MB of data into the cluster/HDFS communicates with the Master Node (NameNode), which decides which nodes to write the data to.
The first copy is always stored on nodes in close proximity to the client.
[Diagram: the 400 MB file is split into blocks of 128 MB + 128 MB + 128 MB + 16 MB, and each block appears three times across the DataNodes.]
In HDFS the data is broken into blocks (128 MB by default in Hadoop 2.x).
Hadoop creates 3 replicas of each block by default (this is configurable) and thereby achieves fault tolerance.
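A hedged aside (not on the slide): the replication factor is a per-file setting and can also be changed after the fact through the Hadoop Java API, e.g. fs.setReplication(new Path("/2016-apache-logs.txt"), (short) 2) asks the NameNode to bring each block of that (placeholder) file down to two replicas.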
27. NameNode
The NameNode does not store the files themselves, only the files' metadata.
The NameNode keeps track of all file-system-related information (metadata), such as:
Block locations
Information about file permissions and ownership
Last access time for each file
User permissions, i.e. which users have access to a file
The NameNode oversees the health of the DataNodes and coordinates access to the data stored in the DataNodes.
The entire metadata is held in main memory.
28. NameNode Metadata
The entire metadata is in main memory; there is no demand paging of file-system metadata.
The NameNode maintains two files:
1. fsimage
2. edit log
The fsimage is a file that represents a point-in-time snapshot of the filesystem's metadata. However, while the fsimage file format is very efficient to read, it is unsuitable for making small incremental updates such as renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability.
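For example (paths purely illustrative): if the fsimage was written at 10:00 and the edit log since then records mkdir /logs followed by rename /logs /archive, a restarting NameNode loads the fsimage and replays those two edits to reach the current state of the namespace.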
29. Secondary NameNode (CheckPoint Node)
The Secondary NameNode is not a hot standby for the NameNode.
It connects to the NameNode every hour and pulls the metadata (fsimage and edit logs).
It performs housekeeping and keeps a backup of the NameNode metadata.
The saved metadata can be used to rebuild a failed NameNode.
30. Hadoop Components (contd.)
The master node runs the NameNode (and ResourceManager); each slave node runs a DataNode (and NodeManager).
NameNode:
Maintains and manages the blocks present on the slave nodes.
Periodically receives a Heartbeat and a Block Report from each of the DataNodes in the cluster; a Heartbeat, sent once every 3 seconds, implies that the DataNode is functioning properly.
The HDFS architecture is built in such a way that user data is never stored on the NameNode; it stores only metadata.
It records the metadata of all the files stored in the cluster, e.g. the location and size of the files, permissions, hierarchy, etc.
DataNode:
Performs the low-level read and write requests from the file system's clients.
Responsible for creating blocks, deleting blocks and replicating them, based on the decisions taken by the NameNode.
31. Anatomy of File Write – High Level
A user wants to write data to Hadoop:
hdfs dfs -put 2016-apache-logs.txt /
The client "cuts" the input file into chunks of "block size".
The client then contacts the NameNode to request the write operation, sending the number of blocks and the replication factor.
The NameNode responds with a pipeline of DataNodes to replicate each block to.*
The client reaches out to the first DataNode in each pipeline and performs the write.
The client writes the blocks in parallel: all the blocks are written at a time, not one by one.
* No actual data transfer takes place through the NameNode.
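A minimal sketch of this write path using the Hadoop 2.x Java FileSystem API (not from the slides; the NameNode URI and file path are placeholders). The splitting into blocks and the DataNode pipeline are handled transparently by the client library:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client library asks the NameNode for a pipeline of DataNodes behind the scenes
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/2016-apache-logs.txt"))) {
            out.writeBytes("127.0.0.1 - - [01/Jan/2016] \"GET / HTTP/1.1\" 200\n");
        } // close() flushes the last packet and completes the file at the NameNode
    }
}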
32. Anatomy of File Write – Full Example
2016-apache-logs.txt: a 300 MB file with a block size of 128 MB and a replication factor of 3, so the Hadoop client cuts it into blocks of 128 MB + 128 MB + 44 MB.
The client sends a -put request to the NameNode, which answers with a write pipeline:
blk_000 to DN1, DN5, DN6
blk_001 to DN4, DN8, DN9
blk_002 to DN7, DN2, DN3
[Diagram: DataNode1-3 on Rack01, DataNode4-6 on Rack02, DataNode7-9 on Rack03.]
33. Anatomy of File Read – Full Example
2016-apache-logs.txt: the same 300 MB file with a block size of 128 MB and a replication factor of 3 (blocks of 128 MB + 128 MB + 44 MB).
The client sends a -get request to the NameNode, which answers with a read plan naming one replica per block:
blk_000 from DN1
blk_001 from DN4
blk_002 from DN7
[Diagram: DataNode1-3 on Rack01, DataNode4-6 on Rack02, DataNode7-9 on Rack03.]
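The read path, sketched with the same Java API (again a hedged example; the URI and path are placeholders). getFileBlockLocations() shows the per-block DataNodes the NameNode hands back, mirroring the blk_000/DN1 mapping above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path path = new Path("/2016-apache-logs.txt");

        // Ask the NameNode where each block of the file lives
        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + loc.getOffset() + " -> " + String.join(",", loc.getHosts()));
        }

        // Read the first bytes; the client streams each block from a nearby DataNode
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[128];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n, "UTF-8"));
        }
    }
}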
34. In HDFS, blocks of a file are written in parallel; however, replication of each block is done sequentially.
a) True
b) False
Hadoop is a framework that allows for the distributed processing of:
a) Small data sets
b) Large data sets
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a) It can read up to the last block that was successfully written.
b) It can read up to the last bit successfully written.
c) It will throw an exception.
d) It cannot see the file until copying is finished.
36. What could be the limitations of Hadoop 1 / Gen 1?
Can a Hadoop 1.x cluster have multiple HDFS namespaces?
Which of the following are significant disadvantages in Hadoop 1.0?
a) Single Point of Failure of the NameNode
b) Too much burden on the JobTracker
Can you use anything other than MapReduce for processing in Hadoop 1.x?
37. Hadoop 1.x - Challenges
NameNode - no horizontal scalability: a single NameNode and a single namespace, limited by the NameNode's RAM.
NameNode - no high availability (HA): the NameNode is a single point of failure, and it needs manual recovery using the Secondary NameNode in case of failure.
JobTracker - overburdened: spends a significant portion of its time and effort managing the life cycle of applications.
MRv1 - only Map and Reduce tasks: the humongous data stored in HDFS remains underutilized and cannot be used for other workloads such as graph processing etc.
38. Limitation 1 – No Horizontal Scalability
A single NameNode runs and manages a single namespace, maintaining the metadata in RAM. Whether the cluster has 100 slaves or 1,000 slaves, they are managed by that single NameNode; the maximum tested is around 4,000 servers with a single NameNode and a single namespace.
Let's assume we have a /VOICE directory with too many files and folders; we can configure a separate NameNode for this directory:
/VOICE/... NameNode01
/SMS/... NameNode02
/Data/... NameNode03
So NameNodes can be configured per directory structure. In Hadoop 2 something like 10,000 servers can be configured, because each NameNode separately manages its part of the directory structure. That is why we call it Federation.
40. How does HDFS Federation help HDFS scale horizontally?
It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
You have configured two NameNodes to manage /voice and /sms respectively. What will happen if you try to put a file in the /lte directory?
The put will fail. Neither namespace manages that file, and you will get an IOException with a "no such file or directory" error.
41. Limitation 2 – No High Availability
If you lose the NameNode, you lose the cluster's details. Manual intervention is needed to start a new NameNode and copy the backup from the Secondary NameNode.
Problem:
10:00 am -> backup to the Secondary NameNode
10:45 am -> NameNode breaks down -> you can only recover data as of 10:00 am from the Secondary NameNode (the problem in Gen 1)
Solution:
High Availability: Active and Standby NameNodes manage the same data at any given point in time.
If the Active NameNode fails, the Standby NameNode becomes Active and serves requests.
42. Hadoop 2.x Architecture - HA
https://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
43. HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a) Single Point of Failure of the NameNode
b) Only one version can be run in classic MapReduce
c) Too much burden on the JobTracker
45. YARN – Yet Another Resource Negotiator
YARN is the core component of Hadoop 2 and was added to improve performance in Hadoop.
Hadoop 1.x: MapReduce handles both cluster resource management and data processing, on top of HDFS (file storage).
Hadoop 2.x: YARN handles cluster resource management, while MapReduce and other frameworks handle the data processing, on top of HDFS (file storage).
It is the next-generation computing platform, which offers various advantages compared to classic MapReduce.
It is a layer that separates the resource management layer from the processing components layer.
MapReduce 2 moves resource management (the infrastructure to monitor nodes, allocate resources and schedule jobs) into YARN.
48. YARN Components
YARN consists of 3 components:
1. ResourceManager
   i. Scheduler
   ii. ApplicationsManager
2. NodeManager
3. ApplicationMaster
49. YARN Architecture
The cluster has one ResourceManager and a NodeManager on each DataNode; a client submits work as follows:
1. The client submits a job to the ResourceManager.
2. The RM contacts any one of the NodeManagers.
3. That NodeManager creates a daemon named ApplicationMaster on the same node; there is one per job.
4. The AM communicates with the ResourceManager to find where the data is, and containers are launched on the appropriate nodes to run the tasks.
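As a small, hedged illustration of talking to the ResourceManager programmatically (not part of the slides; the configuration on the classpath is assumed to point at a running cluster), the YARN Java client API can list the applications the RM currently knows about:

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarnClient.start();
        // Each report comes from the ResourceManager, which tracks every ApplicationMaster
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}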
50. Limitation 3 – JobTracker Overburdened
Problem: a single JobTracker has to manage thousands of jobs, so the JobTracker was overburdened.
Solution: YARN, with multiple daemons: ResourceManager, NodeManager, and ApplicationMaster (one per application).
A container is a variable bundle of resources allocated per task on a slave machine: CPU, memory, disk, network.
1. ResourceManager -> entire cluster level
2. NodeManager -> per node/slave/machine/server
3. ApplicationMaster -> life cycle of a job (one ApplicationMaster per job)
51. YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0 – introduction to the new YARN layer.
Hadoop 1.x: MapReduce handles both cluster resource management and data processing, on top of HDFS (file storage).
Hadoop 2.x: YARN handles cluster resource management, while MapReduce and other frameworks handle the data processing, on top of HDFS (file storage).
Facebook, Google+, LinkedIn and Twitter generate huge volumes of data every day.
Facebook recently unveiled some statistics on the amount of data its systems process and store. According to Facebook, its data systems process 2.5 billion pieces of content each day, amounting to 500+ terabytes of data daily. Facebook generates 2.7 billion Like actions per day, and 300 million new photos are uploaded daily.
Presently the data is stored in RDBMSs, so why does the Big Data problem arise?
What is the limitation of an RDBMS? Why do I need an RDBMS?
We go online and get a response immediately; that is the concept of a DBMS, or an OLTP application.
IBM's Definition – Big Data Characteristics
Velocity: CDRs (Call Detail Records), used to understand customer churn, i.e. customers leaving a service provider. Velocity is the rate at which data is generated.
Variety: images, MRI scans.
A shared-nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage. The advantage of a shared-nothing architecture is that it can scale easily – simply by adding another node.
Latency in transferring data.
Processing coupled with data: in Hadoop we send the jobs towards the data.
There are programs which manage the Hadoop components; these programs are known as daemons. Daemons take care of the components in Hadoop.
HDFS is a block-structured file system designed to store very large files, where each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several commodity machines.
See: http://wiki.apache.org/hadoop/PoweredBy
Fault tolerance: Hadoop will not fail even if one or more slaves fail.
Rack: a group of servers placed in a single location.
Hadoop writes one replica in one rack and another replica in a different rack; an administrator can change this policy. In this way Hadoop also provides rack-level fault tolerance.
This way, if the NameNode crashes, it can restore its state by first loading the fsimage, then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
The network between the client and the cluster will be slower compared to the network within the cluster.
In HDFS, blocks of a file are written in parallel; however, replication of the blocks is done sequentially. Answer: True. A file is divided into blocks, and these blocks are written in parallel, but the replication of each block happens in sequence.
A file of 400 MB is being copied to HDFS, and the system has finished copying 250 MB. What happens if a client tries to access that file? Answer: (a) it can read up to the last block that was successfully written.
There are lots of self-standing software packages built on top of the Hadoop framework, each addressing a particular class of problems. Software built on top of the Hadoop framework is called the Hadoop ecosystem.
Flume is used to stream data from non-HDFS sources into HDFS, e.g. Twitter.
Each NameNode need not coordinate with the others, which is why it is called Federation.
HDFS HA was developed to overcome the following disadvantage in Hadoop 1.0. Answer: (a) Single Point of Failure of the NameNode.
Let's say a client submits a program. The program communicates with the JobTracker; in Hadoop terminology the program is considered a Job. In the job we will have specified which data to process, and the JobTracker communicates with the NameNode to find the DataNodes which hold that data.
In a nutshell, the responsibilities of the JobTracker are:
The JT accepts the job.
It figures out where the data is.
It invokes all the TaskTrackers and assigns them the job.
It monitors all the tasks (including TaskTracker crashes); it monitors the whole life cycle.
The JT becomes overburdened because in production thousands of jobs will be running, and after a certain point the JT becomes slow.
In Hadoop 1.x, MapReduce is the only programming model to process the data stored in HDFS.
In MapReduce, work is divided into 2 phases:
Map phase
Reduce phase
Each Map task takes 1 GB of resources for processing.
In Hadoop 2.x the processing is taken care of by YARN; the minimum memory allocation for a Map task is 1 GB.
http://sivansasidharan.me/blog/Hadoop_YARN/
Whenever a job is submitted, it communicates with the ResourceManager.
The ResourceManager will then contact any of the NodeManagers (not necessarily a NodeManager which holds the data) and tell it there is a job.
That NodeManager launches a daemon called the ApplicationMaster on the same node.
There is one ApplicationMaster per job, and it is the ApplicationMaster's responsibility to run the job.
The ApplicationMaster can contact the NodeManagers as well as the ResourceManager. By contacting the ResourceManager, the ApplicationMaster comes to know where the data is; it then contacts that node and launches something called a container.
Containers are nothing but simple Java processes (JVMs), and inside the container the actual program gets executed.
The advantage of such an architecture is that if a node requires more resources for processing, the ApplicationMaster can contact the ResourceManager to allocate more; the RM is a global entity which manages resources.
On the other hand, the entire life cycle of the application (creation, monitoring, etc.) is managed by the ApplicationMaster.
The NodeManager keeps track of the resources present on its DataNode and reports them to the ResourceManager. In this architecture resources are not managed by the DataNode itself: if any machine has spare resources, the RM can communicate with its NodeManager, and the NM will create a container where the data will be copied and executed.