2. Big Data
• Lots of data
• The challenges include capture, storage, search, transfer,
analysis, and visualization.
• Systems and enterprises generate huge amounts of data, from
terabytes to petabytes of information.
4. What is Hadoop?
• Apache Hadoop is a framework that allows for the
distributed processing of large datasets across clusters
of commodity computers using a simple programming
model.
• It is open-source data management software.
5. Hadoop System-
Principles
• Scale out rather than scale up
• Bring code to data rather than data to code
• Deal with failures – they are common
• Abstract complexity of distributed and concurrent
applications
7. Files and Blocks
• Files are split into blocks (the single unit of storage).
• Blocks are replicated across machines at load time.
• The default replication factor is 3.
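The splitting and replication above can be sketched in plain Python. This is an illustrative toy, not HDFS code: the tiny block size, node names, and round-robin placement are assumptions made for the example (real HDFS uses a 128 MB default block size and rack-aware placement).

```python
# Toy sketch of HDFS-style block splitting and replication.
# BLOCK_SIZE and NODES are made-up values for illustration.
BLOCK_SIZE = 8          # bytes here; HDFS's real default is 128 MB
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the unit of storage)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

blocks = split_into_blocks(b"hello hadoop distributed file system")
placement = place_replicas(len(blocks), NODES)
```

Each block ends up on three distinct nodes, so the loss of any single machine leaves two copies available.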
8. Hadoop - MapReduce
• Model for processing large amounts of data in parallel.
• Derived from functional programming.
• Can be implemented in multiple languages.
9. MapReduce Model
• Imposes key-value input/output
• Defines map and reduce functions:
map : (k1, v1) -> list(k2, v2)
reduce : (k2, list(v2)) -> list(k3, v3)
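The two signatures above can be made concrete with the classic word-count example, here sketched in plain Python with no Hadoop involved: map emits (word, 1) pairs, an in-memory shuffle groups values by key, and reduce sums each group. The function and variable names are my own, chosen to mirror the (k1, v1)/(k2, v2) notation.

```python
from collections import defaultdict

def map_fn(k1, v1):
    """map: (k1, v1) -> list(k2, v2); here k1 is a line number, v1 a line."""
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    """reduce: (k2, list(v2)) -> list(k3, v3); here sums counts per word."""
    return [(k2, sum(values))]

def run_mapreduce(records):
    # Map phase followed by a shuffle that groups values by key.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce phase, one call per key.
    out = []
    for k2, values in sorted(groups.items()):
        out.extend(reduce_fn(k2, values))
    return out

counts = run_mapreduce([(0, "to be or not"), (1, "to be")])
# counts == [("be", 2), ("not", 1), ("or", 1), ("to", 2)]
```

In real Hadoop the shuffle happens across machines between the map and reduce phases; here it is a single dictionary, but the map and reduce functions follow the same contract.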
10. MapReduce Framework
• Takes care of distributed processing and coordination
• Scheduling
• Task localization with data (running tasks where the data resides)
• Error Handling
• Data Synchronization
11. YARN Daemons
- Node Manager
• Manages the resources of a single node
• There is one instance per node in the cluster
- Resource Manager
• Manages resources for the whole cluster
• Instructs Node Managers to allocate resources
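The division of labor between the two daemons can be sketched as a toy in Python. This is not the YARN API: the class shapes, memory numbers, and first-fit placement are assumptions for illustration only, showing that each NodeManager tracks one node's resources while the ResourceManager makes cluster-wide decisions and instructs a NodeManager to allocate.

```python
class NodeManager:
    """Tracks and allocates the resources of a single node."""
    def __init__(self, node_id: str, memory_mb: int):
        self.node_id = node_id
        self.free_mb = memory_mb

    def allocate(self, mb: int) -> bool:
        # Grant the request only if this node has enough free memory.
        if mb > self.free_mb:
            return False
        self.free_mb -= mb
        return True

class ResourceManager:
    """Holds a cluster-wide view: one NodeManager per node."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def request_container(self, mb: int):
        # First-fit: instruct the first NodeManager that can satisfy it.
        for nm in self.node_managers:
            if nm.allocate(mb):
                return nm.node_id
        return None  # cluster cannot satisfy the request

rm = ResourceManager([NodeManager("node1", 1024), NodeManager("node2", 2048)])
```

A request goes to the ResourceManager, which never touches node memory itself; it only instructs a NodeManager, mirroring the split described above.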