MANAGING BIG DATA WITH 
HADOOP 
Presented by: 
Nalini Mehta 
Student (MLVTEC Bhilwara) 
Email: nalinimehta52@gmail.com
Introduction 
Big Data: 
• Big data is a term used to describe voluminous amounts of unstructured and semi-structured data. 
• It is data that would take too much time and cost too much money to load into a relational database for analysis. 
• Big data doesn't refer to any specific quantity; the term is often used when speaking about petabytes and exabytes of data.
General framework of Big Data 
 The driving force behind the implementation of Big Data is both infrastructure and analytics, which together constitute the software. 
 Hadoop is the Big Data management software used to distribute, catalogue, manage, and query data across multiple, horizontally scaled server nodes.
Managing Big Data
Overview of Hadoop 
• Hadoop is a platform for processing large amounts of data in a distributed fashion. 
• It provides a scheduling and resource management framework to execute the map and reduce phases in a cluster environment. 
• The Hadoop Distributed File System (HDFS) is Hadoop's data storage layer, designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.
Hadoop Cluster 
• DataNode - The DataNodes are the repositories for the data and consist of multiple smaller database infrastructures. 
• Client - The client represents the user interface to the big data implementation and query engine. The client could be a server or PC with a traditional user interface. 
• NameNode - The NameNode is equivalent to the address router: it records the location of the data on every DataNode. 
• Job Tracker - The JobTracker represents the software tracking mechanism that distributes and aggregates search queries across multiple nodes for ultimate client analysis. A minimal client configuration pointing at these daemons is sketched below.
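The sketch below shows, under stated assumptions, how a client might point at these daemons through Hadoop's Configuration object. The hostnames namenode.example.com and jobtracker.example.com, the ports, and the Hadoop 1.x-era property names (fs.default.name, mapred.job.tracker; newer releases use fs.defaultFS and YARN instead of a JobTracker) are illustrative assumptions, not values from this presentation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClusterClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical addresses: the NameNode tracks where every block lives,
        // and the JobTracker (Hadoop 1.x) distributes and tracks MapReduce jobs.
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
        conf.set("mapred.job.tracker", "jobtracker.example.com:8021");

        // The client contacts the NameNode for metadata; bulk data moves directly
        // between the client and the DataNodes.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}
```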
Apache Hadoop 
• Apache Hadoop is an open source distributed software platform for storing and processing data. 
• It is a framework for running applications on large clusters built of commodity hardware. 
• A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that, in the event of failure, another copy is available. The Hadoop Distributed File System (HDFS) takes care of this problem. 
• MapReduce is a simple programming model for processing and generating large data sets.
What is MapReduce? 
 MapReduce is a programming model. 
 Programs written in this model are automatically parallelized and executed on a large cluster of commodity machines. 
 Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. 
MapReduce 
MAP: a map function processes a key/value pair to generate a set of intermediate key/value pairs. 
REDUCE: a reduce function merges all intermediate values associated with the same intermediate key.
The Programming Model Of MapReduce 
 Map, written by the user, takes an input pair and produces a set of 
intermediate key/value pairs. The MapReduce library groups 
together all intermediate values associated with the same 
intermediate key and passes them to the Reduce function.
 The Reduce function, also written by the user, accepts 
an intermediate key and a set of values for that key. 
It merges together these values to form a possibly 
smaller set of values.
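As a concrete illustration of this model, the sketch below is the classic word-count job written against Hadoop's newer MapReduce API (org.apache.hadoop.mapreduce, Hadoop 2.x); treat it as a hedged example rather than code from this presentation. The map function emits (word, 1) for every token, and the reduce function sums the counts the framework has grouped under each word. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit an intermediate (word, 1) pair per token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: the framework groups values by intermediate key; sum the counts per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would typically be packaged into a jar and launched with something like `hadoop jar wordcount.jar WordCount /input /output` (paths hypothetical).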
HADOOP DISTRIBUTED FILE 
SYSTEM (HDFS) 
 Apache Hadoop comes with a distributed file system called HDFS, 
which stands for Hadoop Distributed File System. 
 HDFS is designed to hold very large amounts of data (terabytes or 
even petabytes), and provide high-throughput access to this 
information. 
 HDFS is designed for scalability and fault tolerance and provides APIs for MapReduce applications to read and write data in parallel. 
 The capacity and performance of HDFS can be scaled by adding DataNodes, while a single NameNode manages data placement and monitors server availability. A short read/write sketch against this API follows.
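A minimal sketch of writing and then reading a file through the org.apache.hadoop.fs.FileSystem API, assuming the client's classpath carries a core-site.xml whose fs.defaultFS points at the cluster's NameNode; the path /user/demo/sample.txt is hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // loads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path

        // Write: the client asks the NameNode for target DataNodes, then streams blocks to them.
        try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block locations come from the NameNode; the bytes come from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```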
Assumptions and Goals 
1. Hardware Failure 
• An HDFS instance may consist of hundreds or thousands of server machines, 
each storing part of the file system’s data. 
• There are a huge number of components, and each component has a non-trivial probability of failure. 
• Detection of faults and quick, automatic recovery from them is a core 
architectural goal of HDFS. 
2. Streaming Data Access 
• Applications that run on HDFS need streaming access to their data sets. 
• HDFS is designed for batch processing rather than interactive use by users. 
• The emphasis is on high throughput of data access rather than low latency of 
data access. 
3. Large Data Sets 
• A typical file in HDFS is gigabytes to terabytes in size. 
• Thus, HDFS is tuned to support large files. 
• It should provide high aggregate data bandwidth and scale to hundreds of 
nodes in a single cluster.
4. Simple Coherency Model 
• HDFS applications need a write-once-read-many access model for files. 
• A file once created, written, and closed need not be changed. 
• This assumption simplifies data coherency issues and enables high 
throughput data access. 
5. “Moving Computation is Cheaper than Moving 
Data” 
• A computation requested by an application is much more efficient if it is 
executed near the data it operates on when the size of the data set is huge. 
• This minimizes network congestion and increases the overall throughput of 
the system. 
6. Portability across Heterogeneous Hardware and 
Software Platforms 
• HDFS has been designed to be easily portable from one platform to 
another. This facilitates widespread adoption of HDFS as a platform of 
choice for a large set of applications.
Concepts of HDFS:
NameNode and DataNodes 
 An HDFS cluster has two types of nodes operating in a master-slave pattern: a NameNode (the master) and a number of DataNodes (the slaves). 
 The NameNode manages 
the file system 
namespace. It maintains 
the file system tree and 
the metadata for all the 
files and directories in the 
tree. 
 Internally a file is split into 
one or more blocks and 
these blocks are stored in 
a set of DataNodes.
 The NameNode executes file system namespace 
operations like opening, closing, and renaming 
files and directories. 
 DataNodes store and retrieve blocks when they 
are told to (by clients or the NameNode), and they 
report back to the NameNode periodically with lists 
of blocks that they are storing. 
 The DataNodes also perform block creation, 
deletion, and replication upon instruction from the 
NameNode. 
 Without the NameNode, the file system cannot be 
used. In fact, if the machine running the 
NameNode were destroyed, all the files on the file 
system would be lost since there would be no way 
of knowing how to reconstruct the files from the 
blocks on the DataNodes.
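The following sketch queries the block metadata that the NameNode keeps for a file, which is exactly the information that would be lost if the NameNode disappeared. The path /user/demo/big.log is a hypothetical, pre-existing HDFS file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log")); // hypothetical file

        // The NameNode answers this query from its in-memory metadata:
        // which blocks make up the file and which DataNodes hold each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```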
File System Namespace 
 HDFS supports a traditional hierarchical file 
organization. A user or an application can create 
and remove files, move a file from one directory to 
another, rename a file, create directories and store 
files inside these directories. 
 The NameNode maintains the file system 
namespace. Any change to the file system 
namespace or its properties is recorded by the 
NameNode. 
 An application can specify the number of replicas of 
a file that should be maintained by HDFS. The 
number of copies of a file is called the replication 
factor of that file. This information is stored by the 
NameNode.
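These namespace operations map directly onto the FileSystem API. The sketch below, using hypothetical paths, creates a directory, renames a file into it, lists it, and deletes an old directory; every call is recorded as a namespace change by the NameNode.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/reports"));                    // create a directory
        fs.rename(new Path("/user/demo/tmp.csv"),
                  new Path("/user/demo/reports/january.csv"));        // move/rename a file
        for (FileStatus f : fs.listStatus(new Path("/user/demo/reports"))) {
            // The replication factor is per-file metadata recorded by the NameNode.
            System.out.println(f.getPath() + " replication=" + f.getReplication());
        }
        fs.delete(new Path("/user/demo/old"), true /* recursive */);  // remove a directory tree
        fs.close();
    }
}
```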
Data Replication 
 The blocks of a file are replicated for fault 
tolerance. 
 The block size and replication factor are configurable per file. 
 The NameNode makes all decisions regarding 
replication of blocks. 
 A Block report contains a list of all blocks on a 
DataNode.
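As a hedged illustration of how the replication factor is controlled in practice: a client can change the default for files it creates through the dfs.replication property and adjust an existing file with FileSystem.setReplication(), after which the NameNode schedules the additional copies. The path and the factors 2 and 3 below are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);          // default replication for files this client creates
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/reports/january.csv"); // hypothetical existing file
        // Ask the NameNode to raise this file's replication factor to 3;
        // it then directs DataNodes to create the missing replicas.
        boolean accepted = fs.setReplication(path, (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}
```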
Hadoop as a Service in the Cloud 
(HaaS): 
 Hadoop is economical for large-scale, data-driven companies like Yahoo or Facebook. 
 The ecosystem around Hadoop now offers various tools, such as Hive and Pig, that make Big Data processing accessible by focusing on what to do with the data while avoiding the complexity of low-level programming. 
 Consequently, a minimal Hadoop as a Service offering provides a managed Hadoop cluster, ready to use, without the need to configure or install any Hadoop services (JobTracker, TaskTracker, NameNode, or DataNode) on any cluster node. 
 Depending on the level of service, abstraction, and tools provided, Hadoop as a Service (HaaS) can be placed in the cloud stack as a Platform-as-a-Service or Software-as-a-Service solution, between infrastructure services and cloud clients.
Limitations: 
It places several requirements on the network: 
 Data locality 
 Distributed Hadoop nodes running jobs in parallel generate east-west network traffic that can be adversely affected by suboptimal network connectivity. 
 The network should provide high bandwidth, low latency, and any-to-any connectivity between the nodes for optimal Hadoop performance. 
 Scale out 
 Deployments might start with a small cluster and then scale out over time as the customer realizes initial success and needs grow. 
 The underlying network architecture should also scale 
seamlessly with Hadoop clusters and should provide 
predictable performance.
Conclusion 
 The growth of communication and 
connectivity has led to the emergence of 
Big Data. Apache Hadoop is an open 
source framework that has become a de facto standard for big data platforms 
deployed today. 
 To sum up, we conclude that promising 
progress has been made in the area of 
Big Data but much remains to be done. 
Almost all proposed approaches have been evaluated only at a limited scale, and further research is required for large-scale evaluations.
References: 
 White paper: Introduction to Big Data: Infrastructure and Network Considerations. 
 MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html 
 White paper: Big Data Analytics [http://hadoop.intel.com] 
 The Hadoop Distributed File System: Architecture and Design, by Dhruba Borthakur. 
 Big Data in the Enterprise, Cisco White Paper. 
 Cloudera capacity planning recommendations: http://www.cloudera.com/blog/2010/08/hadoop-hbase-capacity-planning/ 
 Apache Hadoop Wiki website: http://en.wikipedia.org/wiki/Apache_Hadoop 
 Towards a Big Data Reference Architecture, www.win.tue.nl/~gfletche/Maier_MSc_thesis.pdf