A Presentation on
BIG DATA AND HADOOP
Presented by Mohit Tare
UNDERSTANDING BIG DATA –
What ?
How ?
Why ?
Source-http://www.intel.com/content/www/us/en/communications/internet-minuteinfographic.html
Big Data Is Everywhere
•The Large Hadron Collider (LHC), a particle
accelerator that will revolutionize our
understanding of the workings of the Universe,
will generate 60 terabytes of data per day –
15 petabytes (15 million gigabytes)
annually.[1]
•Decoding the human genome originally took
10 years to process; now it can be achieved
in one week.
•12 terabytes of tweets are created each day.[2]
•100 terabytes of data are uploaded
daily to Facebook.[3]
•Walmart handles more than 1 million customer
transactions every hour, which is imported into
databases estimated to contain more than
2.5 petabytes of data.[3]
•Convert 350 billion annual meter readings to
better predict power consumption[2].
What Is Big Data?
It's LARGE. It's COMPLEX.
It's UNSTRUCTURED.
According to David Kellog, “Big data refers to datasets whose size is beyond the ability of
typical database software tools to capture, store, manage, and analyze.”[4]
O’Reilly defines big data the following way: “Big data is data that exceeds the
processing capacity of conventional database systems. The data is too big, moves too
fast, or doesn't fit the strictures of your database architectures.” [5]
An Obvious Question – How BIG Is BIG
DATA?
A common misconception is that big data is
solely about VOLUME.
While volume, or size, is part of the
equation...
What about the SPEED (velocity) at which data
is generated?
And what about the VARIETY of data
that a variety of sources are
generating?
You guessed it right! The 3 Vs of big data: volume, velocity, and variety
[6]
Why The Sudden Explosion Of Big Data ?
•An increased number and variety of data sources
that generate large quantities of data
oSensors (location, GPS, ...)
oScientific computing (CERN, biological research, ...)
oWeb 2.0 (Twitter, wikis, ...)
•Realization that data is too valuable to delete
oData analytics and data warehousing
oBusiness intelligence
•Dramatic decline in the cost of hardware,
especially storage
oDecline in the price of SSDs
BIG DATA Is Fuelled by CLOUD
•The properties of the cloud help us deal with big data.
•And the challenges of big data drive the future
design, enhancement, and expansion of the cloud.
•The two feed each other in a never-ending cycle.
The Value Of Big Data – Why Is It So Important?
[6]
MANAGING BIG DATA
Traditional Enterprise
Architecture VS Cluster
Architecture
Hadoop – Managing Big data
TRADITIONAL ENTERPRISE ARCHITECTURE
Consists of
•Servers
•SAN (Storage Area Network)
•Storage arrays
•Servers - A server is a physical computer dedicated to running one or more
services that serve the needs of users of other computers on the network.
•Storage arrays - A disk array is a disk storage system that contains multiple
disk drives (SATA, SSD).
•Storage Area Network - A storage area network (SAN) is a dedicated
network that provides access to consolidated data storage. SANs are primarily
used to make storage devices, such as disk arrays, accessible to servers so that
the devices appear to the operating system like locally attached devices.
SOME ADVANTAGES AND DISADVANTAGES OF
ENTERPRISE ARCHITECTURE
ADVANTAGES
•Separation of servers and storage /
disk arrays, which can be expanded,
upgraded, or retired independently of each
other.
•The SAN lets services on any server
access any storage array, as
long as they have access permission.
•ROBUST, with a MINIMAL FAILURE rate.
•Mainly designed for compute-intensive
applications that operate on a
subset of the data.
DISADVANTAGES
•Becomes costlier as it expands.
•But what about BIG DATA?
It is poorly suited to data-intensive
operations such as sorting, where the whole dataset has to move between storage and servers.
What we want is an Architecture that will give -
CLUSTER ARCHITECTURE
Consists of
•Nodes – each having its
own cores, memory, and disks.
•Interconnection via a
high-speed network (LAN)
•A cluster consists of a set of loosely connected computers that work together so that in
many respects they can be viewed as a single system.
•The nodes are usually connected to each other through fast local area networks,
with each node (a computer used as a server) running its own instance of
an operating system.
•The activities of the computing nodes are orchestrated by "clustering
middleware", a software layer that sits atop the nodes and allows users to
treat the cluster as, by and large, one cohesive computing unit.
Benefits of Using a Cluster Architecture
•Modular and scalable - easier to expand the system without bringing down
the application that runs on top of the cluster.
•Data locality – data can be processed by the cores located on the
same node or rack, minimizing transfer over the network.
•Parallelization - a higher degree of parallelism via the simultaneous
execution of separate portions of a program on different processors.
•All this at lower cost.
But Every Coin has two Sides!
•Complexity - the cost of administering a cluster of N machines.
•More storage – data is replicated to protect against failure.
•Data distribution – how to distribute data evenly across the cluster?
•Requires careful management and a massively parallel processing design.
Riding the Elephant - Hadoop
SOLUTION
•Open-source Apache project initiated and led by
Yahoo.
•Apache Hadoop is an open-source Java framework
for processing and querying vast amounts of data
on large clusters of commodity hardware.[8][9]
•Runs on
oLinux, Mac OS X, Windows, and Solaris
oCommodity hardware
•Targets clusters of commodity PCs
oCost-effective bulk computing
•Created by Doug Cutting, funded by Yahoo from
2006, and reached “web scale capacity” in
2008.[7]
Doug Cutting
Where Does it All come from ?
•The underlying technology was invented by Google in its early
days so that it could usefully index all the rich textual and structural
information it was collecting, and then present meaningful and
actionable results to users.
•Based on Google’s MapReduce and Google File System.
What Is Hadoop?
Hadoop consists of two core components [9]:
1.Hadoop Distributed File System (HDFS)
2.Hadoop distributed processing
framework
– using the Map/Reduce metaphor
Hadoop Distributed File System(HDFS)
Based on Simple design principles –
•To Split
•To Scatter
•To Replicate
•To Manage data across cluster
•Files are broken into large file blocks,
whose size is usually a multiple of the storage
block size.
Typically 64 MB or larger
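The splitting step can be sketched in a few lines of Python. This is only an illustration of the idea, not HDFS code: the function name `split_into_blocks` is hypothetical, and the 64 MB block size is simply the typical figure quoted above.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical size quoted above (assumption)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

MB = 1024 * 1024
blocks = split_into_blocks(200 * MB)   # a 200 MB file
print(len(blocks))                     # 4
print(blocks[-1][1] // MB)             # 8 (the last block is only 8 MB)
```

Note that only the final block can be shorter than the block size; all the others are full-sized.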
Hadoop Distributed File System(HDFS) contd..
•File blocks are replicated to several
datanodes for reliability.
•The default is 3 replicas, but this is configurable.
•Blocks are placed (writes are
pipelined):
•On the same node
•On the same rack
•On another rack
•Clients read from the closest replica.
•If the replication for a block drops
below the target, the block is automatically
re-replicated.
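The placement order listed above (same node, same rack, another rack) can be illustrated with a toy function. This is a sketch under stated assumptions, not the HDFS placement code: `place_replicas` and the `topology` dictionary are hypothetical, the choice is deterministic for readability, and the real policy differs between Hadoop versions.

```python
def place_replicas(writer, topology):
    """Pick three datanodes for one block, following the order on the slide:
    the writer's own node, another node in the same rack, a node in another rack.
    topology maps rack name -> list of node names (hypothetical structure)."""
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    same_rack_node = next(n for n in topology[local_rack] if n != writer)
    other_rack = next(rack for rack in topology if rack != local_rack)
    return [writer, same_rack_node, topology[other_rack][0]]

topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5"],
}
print(place_replicas("node2", topology))  # ['node2', 'node1', 'node4']
```

Keeping one copy on a different rack is what lets the block survive the loss of a whole rack.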
Hadoop Distributed File System(HDFS) contd..
•Single namespace for the entire cluster,
managed by a single namenode[7]
•Namenode: a master server that
manages the file system namespace and
regulates access to files by clients.
•Datanodes: serve read and write requests and
perform block creation, deletion, and
replication upon instruction from the
namenode.
•When a datanode fails, the namenode
•identifies the file blocks that have
been affected
•retrieves copies from other healthy
nodes
•finds new nodes to store another
copy of them
•updates the information in its tables.
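The recovery steps above can be mimicked with a small in-memory table. This is a simplified sketch, not how the namenode is implemented: `handle_datanode_failure`, the `block_map` structure, and the rule of picking the first candidate alphabetically are all assumptions made for the illustration.

```python
def handle_datanode_failure(block_map, failed, live_nodes, target=3):
    """block_map maps block id -> set of datanodes holding a replica.
    Returns a plan {block: (source_node, destination_node)} and updates block_map."""
    plan = {}
    for block, holders in block_map.items():
        if failed not in holders:
            continue                      # this block is not affected
        holders.discard(failed)           # update the table
        if not holders:
            plan[block] = None            # every replica is gone; nothing to copy from
            continue
        if len(holders) < target:
            source = sorted(holders)[0]   # a healthy node that still has a copy
            candidates = sorted(
                n for n in live_nodes if n != failed and n not in holders
            )
            if candidates:
                destination = candidates[0]
                holders.add(destination)  # record the new replica location
                plan[block] = (source, destination)
    return plan

block_map = {"b1": {"n1", "n2", "n3"}, "b2": {"n2", "n3", "n4"}}
print(handle_datanode_failure(block_map, "n1", ["n2", "n3", "n4"]))
# {'b1': ('n2', 'n4')}
```

Only `b1` had a replica on the failed node, so only `b1` is re-replicated; `b2` is left untouched.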
Hadoop Distributed File System(HDFS) contd..
•The client talks to both the namenode and
the datanodes.
•Data is not sent through the
namenode.
•The client first contacts the namenode and
then connects directly to the
datanodes.
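The read path described above, with metadata coming from the namenode and data coming from the datanodes, might be modelled like this. The classes and method names (`NameNode.get_block_locations`, `DataNode.read_block`, `read_file`) are hypothetical stand-ins for illustration and do not mirror the real Hadoop client API; taking the first listed replica stands in for "closest replica".

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""
    def __init__(self):
        self.files = {}  # path -> list of (block_id, [datanode names])

    def get_block_locations(self, path):
        return self.files[path]

class DataNode:
    """Holds the actual block contents."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(path, namenode, datanodes):
    data = b""
    # Step 1: ask the namenode where the blocks are (no file data flows here).
    for block_id, holders in namenode.get_block_locations(path):
        # Step 2: fetch each block directly from a datanode that holds it.
        data += datanodes[holders[0]].read_block(block_id)
    return data

namenode = NameNode()
namenode.files["/a.txt"] = [("b1", ["n1", "n2"]), ("b2", ["n2"])]
datanodes = {"n1": DataNode(), "n2": DataNode()}
datanodes["n1"].blocks = {"b1": b"hello "}
datanodes["n2"].blocks = {"b1": b"hello ", "b2": b"world"}
print(read_file("/a.txt", namenode, datanodes))  # b'hello world'
```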
HDFS
Architecture[10]
•ADVANTAGES
•Highly fault-tolerant
•High throughput
•Suitable for applications with large data
sets
•Streaming access to file system data
•Can be built out of commodity hardware
Hadoop Distributed File System(HDFS) contd..
•2 POINTS OF FAILURE
•The namenode can become a single point of
failure
•Cluster rebalancing
•SOLUTIONS
•Enterprise editions maintain a backup of the
namenode.
•The architecture is compatible with data rebalancing
schemes, but it's still an area of research.
Hadoop Map/Reduce
•Map/Reduce is a programming
model
for efficient distributed computing
•The user submits a MapReduce job
•The system:
•Partitions the job into lots
of tasks
•Schedules tasks on
nodes close to the data
•Monitors tasks
•Kills and restarts tasks if they
fail/hang/disappear[11]
Consists of two phases
1.Mapper Phase
2.Reduce Phase
Hadoop Map/Reduce contd …
1.Mapper Phase
•The data is fed into the map function as
key/value pairs to produce intermediate
key/value pairs.
•Input: key1, value1 pair
•Output: key2, value2 pairs
•All nodes do the same computation.
•Uses data locality to increase
performance.
•As all data blocks stored in HDFS
are of equal size (apart from a possibly shorter last block), the mapper computation can
be divided equally.
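A map function of the kind described above can be written as a plain Python generator. This is a sketch of the concept rather than the Hadoop Java API; the name `map_word_count` and the choice of the line's byte offset as the input key are assumptions for the example.

```python
def map_word_count(key, value):
    """key: byte offset of the line (unused here); value: the text of the line.
    Yields intermediate (key2, value2) pairs, one (word, 1) per word."""
    for word in value.lower().split():
        yield (word, 1)

print(list(map_word_count(0, "Hadoop stores big data")))
# [('hadoop', 1), ('stores', 1), ('big', 1), ('data', 1)]
```

Every node runs this same function, each on its own block of the input.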
Hadoop Map/Reduce contd …
Reduce Phase
•Once the mapping is done, all the intermediate results from various nodes are
reduced to create the final output.
•Has 3 Phases
• shuffle,
•sort and
•reduce.[12]
•Shuffle - Input to the Reducer is the sorted output of the mappers. In this
phase the framework fetches the relevant partition of the output of all the
mappers.
•Sort - The framework groups Reducer inputs by keys (since different
mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are
being fetched they are merged.
•Reduce - In this phase the reduce method is called for each <key, (list of
values)> pair in the grouped inputs and will produce final outputs.
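The three reduce-side steps can be imitated on a single machine as follows. This is a minimal sketch, assuming the mapper outputs are already available as in-memory lists; `shuffle_and_sort` and `reduce_sum` are hypothetical names, and a real framework fetches and merges the map outputs over the network instead.

```python
from collections import defaultdict

def shuffle_and_sort(mapper_outputs):
    """Collect the pairs from every mapper, group values by key, sort by key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())

def reduce_sum(key, values):
    """Called once per <key, (list of values)> pair."""
    return (key, sum(values))

mapper_outputs = [
    [("hadoop", 1), ("big data", 1)],   # from mapper 1
    [("hadoop", 1)],                    # from mapper 2
    [("big data", 1), ("hadoop", 1)],   # from mapper 3
]
result = [reduce_sum(k, vs) for k, vs in shuffle_and_sort(mapper_outputs)]
print(result)  # [('big data', 2), ('hadoop', 3)]
```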
Understood or not? Let's understand it by an
Example
•Suppose you want to analyze blog entries stored in BigData.txt and
count the number of times the terms Hadoop, Big Data, and Greenplum appear in it.
•Suppose 3 nodes participate in the task. In the Mapper Phase, each node receives
the address of a file block and a pointer to the mapper function.
•The mapper function calculates the word count.
[13]
Let's understand it by an Example
•The output of the mapper function will be a set of
<key, value> pairs.
FINAL OUTPUT
OF MAPPER PHASE
Let's understand it by an Example
•The Reduce Phase sums and reduces the
output.
•A node is selected to perform the
reduce function, and the other nodes send
their output to that node.
•After the shuffle phase of the Reduce Phase
Let's understand it by an Example
•After the sort phase of the Reduce Phase
And FINALLY
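The whole example can be replayed in Python on one machine. This is a sketch only: the three block texts are invented for the illustration (the actual contents of BigData.txt are not given), and `mapper` and `reducer` are hypothetical names, with each list entry playing the role of the block handled by one node.

```python
from collections import Counter

TERMS = ["hadoop", "big data", "greenplum"]  # the three terms being counted

def mapper(block_text):
    """Runs on one node: count each term within that node's block."""
    text = block_text.lower()
    return [(term, text.count(term)) for term in TERMS]

def reducer(all_pairs):
    """Runs on the selected node: sum the counts sent by every mapper."""
    totals = Counter()
    for term, count in all_pairs:
        totals[term] += count
    return dict(totals)

# Invented sample data, one string per file block / node.
blocks = [
    "Hadoop handles Big Data. Greenplum too.",
    "Big Data needs Hadoop. Hadoop scales.",
    "Greenplum and Hadoop work on Big Data.",
]
pairs = [pair for block in blocks for pair in mapper(block)]
print(reducer(pairs))  # {'hadoop': 4, 'big data': 3, 'greenplum': 2}
```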
•The JobTracker keeps track of all the
MapReduce jobs that are running on
the various nodes.
•It schedules the jobs and keeps track of all
the map and reduce tasks running across
the nodes.
•If any one of those tasks fails, it reallocates
the task to another node.
•The TaskTracker performs the map and
reduce tasks that are assigned by the
JobTracker.
•The TaskTracker also constantly sends a
heartbeat message to the JobTracker, which
helps the JobTracker decide whether to
delegate a new task to that particular node
or not.
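The heartbeat idea can be illustrated with a toy scheduler check. This is a sketch under stated assumptions, not JobTracker code: `reassign_stale_tasks` is a hypothetical function, the 30-second timeout is an arbitrary value chosen for the example, and tasks are moved to live nodes in simple round-robin order.

```python
def reassign_stale_tasks(assignments, last_heartbeat, now, timeout=30.0):
    """assignments: task -> node; last_heartbeat: node -> timestamp in seconds.
    Moves tasks away from nodes whose heartbeat is older than timeout.
    Returns the list of tasks that were moved."""
    alive = sorted(n for n, t in last_heartbeat.items() if now - t <= timeout)
    if not alive:
        return []                         # nowhere to move anything
    moved = []
    for task, node in sorted(assignments.items()):
        if node not in alive:
            # Round-robin over the nodes that are still reporting in.
            assignments[task] = alive[len(moved) % len(alive)]
            moved.append(task)
    return moved

assignments = {"map-0": "n1", "map-1": "n2", "reduce-0": "n3"}
last_heartbeat = {"n1": 100.0, "n2": 60.0, "n3": 98.0}
print(reassign_stale_tasks(assignments, last_heartbeat, now=100.0))  # ['map-1']
print(assignments["map-1"])                                           # n1
```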
A bit more on Map/Reduce
Accessibility and Implementation
•HDFS
•HDFS provides a Java API for applications to use.
•Python access is also used in many applications.
•It provides a command-line interface called the FS shell that lets the
user interact with data in HDFS.
•The syntax of the commands is similar to that of familiar shells such as bash.
Example: to create a directory
Usage: hadoop dfs -mkdir <paths>
hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
•Map/Reduce
•A Java API with prebuilt classes and interfaces.
•Python and C++ can also be used.
C++ example on Word Count[14]
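For comparison, a Python word count in the style used with Hadoop Streaming might look like the sketch below. With Hadoop Streaming, the mapper and the reducer are separate programs that read lines from standard input and write tab-separated key/value lines to standard output; here both are written as functions over lists of lines so the flow can be tried locally, with `sorted` standing in for the framework's sort between the two steps. The function names are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Expects lines sorted by key; sums the counts for each word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

text = ["big data hadoop", "hadoop big"]
shuffled = sorted(mapper(text))      # stands in for the framework's sort step
print(list(reducer(shuffled)))       # ['big\t2', 'data\t1', 'hadoop\t2']
```

In a real streaming job each function would be wrapped in its own script that iterates over `sys.stdin` and prints its results.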
And there is more and more …
PIG
Who uses Hadoop ?
References
[1] Randal E. Bryant , Randy H. Katz , Edward D. Lazowska, “Big-Data Computing:
Creating revolutionary breakthroughs in commerce, science, and society”
,Version 8: December 22, 2008. Available:
http://www.cra.org/ccc/docs/init/Big_Data.pdf [Accessed Sept.9,2012]
[2]What is Big Data ?[Online]. Available :
http://www-01.ibm.com/software/data/bigdata/ [Accessed Sept.9,2012]
[3] A Comprehensive List of Big Data Statistics [Online].
Available :http://wikibon.org/blog/big-data-statistics/ [Accessed Sept.9,2012]
[4] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles
Roxburgh, Angela Hung Byers, “Big Data: The next frontier for innovation,
competition, and productivity”, McKinsey Global Institute, May 2011. Available:
http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
[Accessed Sept.10,2012]
[5] What Is Big Data?, O’Reilly Radar, January 11, 2012, [Online]. Available:
http://radar.oreilly.com/2012/01/what-is-big-data.html [Accessed Sept.10,2012]
[6]-Big Data, Wipro,[Online].Available:
http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data[Accessed
Sept.11,2012]
[7] Owen O'Malley, “Introduction to Hadoop” [Online].
Available : http://wiki.apache.org/hadoop/HadoopPresentations
[Accessed Sept .17,2012 ]
[8]Hadoop at Yahoo!, Yahoo developer Network[Online].Available:
http://developer.yahoo.com/hadoop/ [Accessed Sept .17,2012 ]
[9] Elif Dede, Madhusudhan Govindaraju, Dan Gunter, Lavanya
Ramakrishnan, “Riding the elephant: managing ensembles with hadoop”,
in MTAGS '11 Proceedings of the 2011 ACM international workshop on Many task
computing on grids and supercomputers, Pages 49-58[Online].
Available : ACM Digital Library,
http://dl.acm.org/citation.cfm?id=2132876.2132888 [Accessed Sept .17,2012 ]
[10] HDFS Architecture, Hadoop 0.20 Documentation[Online].
Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html[Accessed
Sep.20,2012]
[11]Doug Cutting ,”Hadoop Overview” ,[Online] Available:
http://wiki.apache.org/hadoop/HadoopPresentations
[Accessed Sept .17,2012 ]
[12] Map/Reduce Tutorial, Hadoop 0.20 Documentation,[Online].
Available :
http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Reducer
[Accessed Sept .17,2012 ]
[13] Patricia Florissi, Big Ideas : Demystifying Hadoop, [Video].
Available : http://www.youtube.com/watch?v=XtLXPLb6EXs&feature=relmfu
[14] C/C++ MapReduce Code & build, Hadoop Wiki , C++ word Count, [Online].
Available :
http://wiki.apache.org/hadoop/C%2B%2BWordCount
[Accessed October .1,2012]
Thank You !
And …
Stay Udacious ?

Más contenido relacionado

La actualidad más candente

WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 Chris Almond
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsFadi Yousuf
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopDataWorks Summit
 
The hadoop ecosystem table
The hadoop ecosystem tableThe hadoop ecosystem table
The hadoop ecosystem tableMohamed Magdy
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentalsits_skm
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureRoman Nikitchenko
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop TechnologyOpenDev
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
NoSQL overview implementation free
NoSQL overview implementation freeNoSQL overview implementation free
NoSQL overview implementation freeBenoit Perroud
 

La actualidad más candente (20)

WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014 WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
WANdisco Non-Stop Hadoop: PHXDataConference Presentation Oct 2014
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Introduction to Hadoop - The Essentials
Introduction to Hadoop - The EssentialsIntroduction to Hadoop - The Essentials
Introduction to Hadoop - The Essentials
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
Selective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed HadoopSelective Data Replication with Geographically Distributed Hadoop
Selective Data Replication with Geographically Distributed Hadoop
 
The hadoop ecosystem table
The hadoop ecosystem tableThe hadoop ecosystem table
The hadoop ecosystem table
 
Hadoop Fundamentals
Hadoop FundamentalsHadoop Fundamentals
Hadoop Fundamentals
 
Big data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructureBig data technologies and Hadoop infrastructure
Big data technologies and Hadoop infrastructure
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Big data
Big dataBig data
Big data
 
Presentation on Hadoop Technology
Presentation on Hadoop TechnologyPresentation on Hadoop Technology
Presentation on Hadoop Technology
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Hadoop and big data
Hadoop and big dataHadoop and big data
Hadoop and big data
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
NoSQL overview implementation free
NoSQL overview implementation freeNoSQL overview implementation free
NoSQL overview implementation free
 

Destacado

Presentation on Big Data Hadoop (Summer Training Demo)
Presentation on Big Data Hadoop (Summer Training Demo)Presentation on Big Data Hadoop (Summer Training Demo)
Presentation on Big Data Hadoop (Summer Training Demo)Ashok Royal
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and HadoopFebiyan Rachman
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Arohi Khandelwal
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersAmal G Jose
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaCloudera, Inc.
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Mahantesh Angadi
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101MongoDB
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareMapR Technologies
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesDataWorks Summit
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...DataStax
 
Infinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container EnvironmentsInfinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container EnvironmentsDocker, Inc.
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Destacado (20)

Presentation on Big Data Hadoop (Summer Training Demo)
Presentation on Big Data Hadoop (Summer Training Demo)Presentation on Big Data Hadoop (Summer Training Demo)
Presentation on Big Data Hadoop (Summer Training Demo)
 
Introduction to Big Data and Hadoop
Introduction to Big Data and HadoopIntroduction to Big Data and Hadoop
Introduction to Big Data and Hadoop
 
Introduction of Big data and Hadoop
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
 
Deployment and Management of Hadoop Clusters
Deployment and Management of Hadoop ClustersDeployment and Management of Hadoop Clusters
Deployment and Management of Hadoop Clusters
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - ClouderaHadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
Hadoop World 2011: Hadoop Troubleshooting 101 - Kate Ting - Cloudera
 
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData...
 
MongoDB Administration 101
MongoDB Administration 101MongoDB Administration 101
MongoDB Administration 101
 
Key Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShareKey Considerations for Putting Hadoop in Production SlideShare
Key Considerations for Putting Hadoop in Production SlideShare
 
Proof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-seriesProof of Concept for Hadoop: storage and analytics of electrical time-series
Proof of Concept for Hadoop: storage and analytics of electrical time-series
 
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...
 
Infinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container EnvironmentsInfinit: Modern Storage Platform for Container Environments
Infinit: Modern Storage Platform for Container Environments
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similar a Big data and hadoop

MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsrishavkumar1402
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservicesBigstep
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
CouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataCouchBase The Complete NoSql Solution for Big Data
CouchBase The Complete NoSql Solution for Big DataDebajani Mohanty
 
Infinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonInfinitely Scalable Clusters - Grid Computing on Public Cloud - London
Infinitely Scalable Clusters - Grid Computing on Public Cloud - LondonHentsū
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersZohar Elkayam
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Ankus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration frameworkAnkus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration frameworkAshrith Mekala
 
VTU 6th Sem Elective CSE - Module 4 cloud computing
VTU 6th Sem Elective CSE - Module 4  cloud computingVTU 6th Sem Elective CSE - Module 4  cloud computing
VTU 6th Sem Elective CSE - Module 4 cloud computingSachin Gowda
 
module4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdfmodule4-cloudcomputing-180131071200.pdf
module4-cloudcomputing-180131071200.pdfSumanthReddy540432
 
Lecture 3.31 3.32.pptx
Lecture 3.31  3.32.pptxLecture 3.31  3.32.pptx
Lecture 3.31 3.32.pptxRATISHKUMAR32
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreducehansen3032
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptxShreyasKv13
 

Similar a Big data and hadoop (20)

getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Big Data
Big DataBig Data
Big Data
 
Intro to Big Data
Big data and hadoop

  • 1. A Presentation on BIG DATA AND HADOOP. Presented by Mohit Tare
  • 2. UNDERSTANDING BIG DATA: What? How? Why?
  • 4. Big Data Is Everywhere •The Large Hadron Collider (LHC), a particle accelerator that will revolutionize our understanding of the workings of the Universe, will generate 60 terabytes of data per day, or 15 petabytes (15 million gigabytes) annually.[1] •Decoding the human genome originally took 10 years; now it can be achieved in one week. •12 terabytes of Tweets are created each day.[2] •100 terabytes of data are uploaded daily to Facebook.[3] •Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.[3] •350 billion annual meter readings can be converted to better predict power consumption.[2]
  • 5. What Is Big Data? It's LARGE. It's COMPLEX. It's UNSTRUCTURED. According to David Kellog, “Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.”[4] O’Reilly defines big data the following way: “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures.”[5]
  • 6. An Obvious Question: How BIG is BIG DATA? A common misconception is that Big Data is solely related to VOLUME. While volume or size is a part of the equation, what about the SPEED at which data is generated? And what about the VARIETY of data that a variety of sources are generating?
  • 7. You guessed it Right! The 3 Vs of Big data [6]
  • 8. Why The Sudden Explosion Of Big Data ? •An Increased number and variety of data sources that generate large quantities of data •Sensors(location, GPS..) •Scientific Computing(CERN, biological research..) •Web 2.0(Twitter, wikis ..) •Realization that data is too valuable to delete •Data analytics and Data Warehousing •Business Intelligence •Dramatic Decline in the cost of hardware, especially storage •decline in price of SSDs
  • 9. BIG DATA is fuelled by CLOUD •The properties of the cloud help us in dealing with Big Data. •The challenges of Big Data drive the future design, enhancement and expansion of the cloud. •Both are in a never-ending cycle.
  • 10. The Value Of Big Data: Why It's So Important? [6]
  • 11. MANAGING BIG DATA Traditional Enterprise Architecture VS Cluster Architecture Hadoop – Managing Big data
  • 12. TRADITIONAL ENTERPRISE ARCHITECTURE Consists of •Servers •SAN (Storage Area Network) •Storage arrays •Servers: a server is a physical computer dedicated to running one or more services to serve the needs of the users of other computers on the network. •Storage arrays: a disk array is a disk storage system which contains multiple disk drives (SATA, SSD). •Storage Area Network: a storage area network (SAN) is a dedicated network that provides access to consolidated data storage. SANs are primarily used to make storage devices, such as disk arrays, accessible to servers so that the devices appear like locally attached devices to the operating system.
  • 13. SOME ADVANTAGES AND DISADVANTAGES OF ENTERPRISE ARCHITECTURE ADVANTAGES •Loose coupling between servers and storage/disk arrays, which can be expanded, upgraded or retired independently of each other. •The SAN enables services on any server to access any storage array, as long as they have access permission. •ROBUST, with a MINIMUM FAILURE rate. •Mainly designed for compute-intensive applications which operate on a subset of the data. DISADVANTAGES •Becomes costlier as it expands. •But what about BIG DATA? It cannot handle data-intensive operations such as sorting.
  • 14. What we want is an Architecture that will give -
  • 15. CLUSTER ARCHITECTURE Consists of •Nodes, each having its own cores, memory and disks. •Interconnection via a high-speed network (LAN). •A cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. •The computers are usually connected to each other through fast local area networks, each node (computer used as a server) running its own instance of an operating system. •The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as by and large one cohesive computing unit.
  • 16. Benefits of Using a Cluster Architecture •Modular and Scalable - easier to expand the system without bringing down the application that runs on top of the cluster. •Data Locality – where data can be processed by the cores collocated in same node or Rack minimizing any transfer over network. •Parallelization - higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors. •All this with less cost .
  • 17. But Every Coin Has Two Sides! •Complexity: the cost of administering a cluster of N machines. •More storage: data is replicated to protect against failure. •Data distribution: how to distribute data evenly across the cluster? •Careful management and the need for a massively parallel processing design.
  • 18. Riding the Elephant - Hadoop SOLUTION •Open source Apache project initiated and led by Yahoo. •Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.[8][9] •Runs on: Linux, Mac OS/X, Windows, and Solaris; commodity hardware. •Targets clusters of commodity PCs: cost-effective bulk computing. •Invented by Doug Cutting and funded by Yahoo in 2006; it reached its “web scale capacity” in 2008.[7] [Photo: Doug Cutting]
  • 19. Where Does It All Come From? •The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. •Based on Google’s MapReduce and Google File System.
  • 20. What Is Hadoop? Hadoop consists of two core components [9]: 1. Hadoop Distributed File System (HDFS) 2. Hadoop Distributed Processing Framework, using the Map/Reduce metaphor
  • 21. Hadoop Distributed File System (HDFS) Based on simple design principles: •To split •To scatter •To replicate •To manage data across the cluster •Files are broken into large file blocks, each usually a multiple of the storage block size; typically 64 MB or larger.
  • 22. Hadoop Distributed File System (HDFS) contd. •File blocks are replicated to several datanodes for reliability. •The default is 3 replicas, but this is configurable. •Blocks are placed (writes are pipelined): •on the same node •on the same rack •on another rack •Clients read from the closest replica. •If the replication for a block drops below the target, it is automatically re-replicated.
  • 23. Hadoop Distributed File System (HDFS) contd. •Single namespace for the entire cluster, managed by a single Namenode.[7] •Namenode: a master server that manages the file system namespace and regulates access to files by clients. •DataNodes: serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode. •When a datanode fails, the Namenode •identifies the file blocks that have been affected •retrieves copies from other healthy nodes •finds a new node to store another copy of them •updates the information in its tables.
  • 24. Hadoop Distributed File System (HDFS) contd. •The client talks to both the namenode and the datanodes. •Data is not sent through the namenode. •The client first contacts the namenode and can then connect directly to the datanodes. [Figure: HDFS Architecture [10]]
  • 25. Hadoop Distributed File System (HDFS) contd. •ADVANTAGES •Highly fault-tolerant •High throughput •Suitable for applications with large data sets •Streaming access to file system data •Can be built out of commodity hardware •2 POINTS OF FAILURE •The Namenode can become a single point of failure •Cluster rebalancing •SOLUTIONS •Enterprise editions maintain a backup of the namenode. •The architecture is compatible with data rebalancing schemes, but it's still an area of research.
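The "split" principle on slide 21 can be pictured with a small sketch. This is not HDFS code; it only mimics how a file of a given size would be cut into fixed-size blocks. The function name `split_into_blocks` and the 200 MB file are assumptions made for illustration, while the 64 MB block size comes from the slide.

```python
# Illustrative only: mimics how a file is cut into fixed-size blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical size named on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A hypothetical 200 MB file: three full blocks plus one 8 MB remainder.
blocks = split_into_blocks(200 * 1024 * 1024)
```

Each resulting block would then be scattered and replicated independently, which is why the block, not the file, is the unit of distribution.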
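The placement order listed on slide 22 (same node, same rack, another rack) might be sketched as follows. The `topology` dictionary, the node names and the function `place_replicas` are hypothetical; real HDFS applies further checks (free space, load) that are not modelled here.

```python
def place_replicas(writer, topology):
    """Pick three datanodes for a block, following the slide's placement order.

    topology maps a rack name to the list of node names on that rack
    (a made-up structure for this sketch).
    """
    # Rack that contains the node doing the write.
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    # A different node on the same rack.
    same_rack_node = next(n for n in topology[local_rack] if n != writer)
    # Any node on a different rack.
    other_rack = next(r for r in topology if r != local_rack)
    return [writer, same_rack_node, topology[other_rack][0]]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", topology)
```

Keeping one copy off-rack is what lets the data survive the loss of a whole rack, at the price of one cross-rack transfer per block written.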
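The four steps the Namenode takes when a datanode fails can be traced in a toy model. Everything here is an assumption for illustration: `block_map` stands in for the Namenode's tables, and `handle_datanode_failure` returns copy instructions instead of moving any data.

```python
def handle_datanode_failure(block_map, failed, live_nodes, target=3):
    """Re-replicate blocks that lost a copy on the failed node.

    block_map: block id -> set of nodes holding a replica (updated in place).
    Returns a list of (block, source_node, destination_node) instructions.
    """
    copies = []
    for block, holders in block_map.items():
        # Step 1: identify the blocks that were affected.
        if failed not in holders:
            continue
        holders.discard(failed)
        # Step 3: candidate nodes that do not yet hold this block.
        candidates = sorted(n for n in live_nodes
                            if n != failed and n not in holders)
        while len(holders) < target and holders and candidates:
            # Step 2: read the copy from a healthy holder.
            source = sorted(holders)[0]
            dest = candidates.pop(0)
            copies.append((block, source, dest))
            # Step 4: update the table.
            holders.add(dest)
    return copies


block_map = {"b1": {"n1", "n2", "n3"}, "b2": {"n2", "n3", "n4"}}
copies = handle_datanode_failure(block_map, "n1", ["n2", "n3", "n4", "n5"])
```

Only `b1` had a replica on the failed node, so only `b1` is copied; `b2` stays at its target of three replicas and is left alone.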
  • 26. Hadoop Map/Reduce •Map/Reduce is a programming model for efficient distributed computing •User submits MapReduce job •System: • Partitions job into lots of tasks •Schedules tasks on nodes close to data • Monitors tasks • Kills and restarts if they fail/hang/disappear[11] Consists of two phases 1.Mapper Phase 2.Reduce Phase
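"Schedules tasks on nodes close to data" can be illustrated with a simplified scheduler. This is a sketch under stated assumptions, not Hadoop's scheduler: `schedule_tasks` and the block and node names are invented, and the only policy is "run each task on the least-loaded node that already stores the block".

```python
def schedule_tasks(block_locations):
    """Assign one task per block to a node that already holds the block.

    block_locations: block id -> list of nodes holding a replica.
    Returns a dict mapping block id -> chosen node.
    """
    load = {}  # node -> number of tasks assigned so far
    plan = {}
    for block, nodes in block_locations.items():
        # Prefer the least-loaded replica holder; break ties by name.
        node = min(nodes, key=lambda n: (load.get(n, 0), n))
        plan[block] = node
        load[node] = load.get(node, 0) + 1
    return plan


plan = schedule_tasks({
    "b1": ["n1", "n2"],
    "b2": ["n1", "n3"],
    "b3": ["n2", "n3"],
})
```

Because every task lands on a node that holds its block, no block has to cross the network before the map work starts.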
  • 27. Hadoop Map/Reduce contd. 1. Mapper Phase •The data are fed into the map function as key/value pairs to produce intermediate key/value pairs. •Input: key1, value1 pair •Output: key2, value2 pairs •All nodes do the same computation. •Uses data locality to increase performance. •As data blocks stored in HDFS are of equal size (except possibly the last block of a file), the mapper computation can be divided evenly.
  • 28. Hadoop Map/Reduce contd. 2. Reduce Phase •Once the mapping is done, all the intermediate results from the various nodes are reduced to create the final output. •Has 3 phases: shuffle, sort and reduce.[12] •Shuffle: input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers. •Sort: the framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged. •Reduce: in this phase the reduce method is called for each <key, (list of values)> pair in the grouped inputs and produces the final output.
  • 29. Understood or not? Let's understand it by an example •Suppose you want to analyze blog entries stored in BigData.txt and count the number of times the terms Hadoop, Big Data and Green Plum appear in it. •Suppose 3 nodes participate in the task. In the Mapper Phase, each node receives the address of a file block and a pointer to the mapper function. •The mapper function calculates the word count.[13]
  • 30. Let's understand it by an example •The output of the mapper function is a set of <key, value> pairs. [Figure: final output of the Mapper Phase]
  • 31. Let's understand it by an example •The Reduce Phase sums and reduces the output. •A node is selected to perform the reduce function and the other nodes send their output to that node. [Figure: after the shuffle step of the Reduce Phase]
  • 32. Let's understand it by an example [Figure: after the sort step of the Reduce Phase, and the final result]
  • 33. A bit more on Map/Reduce •The JobTracker keeps track of all the MapReduce jobs that are running on the various nodes. •It schedules the jobs and keeps track of all the map and reduce tasks running across the nodes. •If any one of those tasks fails, it reallocates the task to another node. •The TaskTracker performs the map and reduce tasks that are assigned by the JobTracker. •The TaskTracker also constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether to delegate a new task to this particular node or not.
  • 34. Accessibility and Implementation •HDFS •HDFS provides a Java API for applications to use. •Python access is also used in many applications. •It provides a command line interface called the FS shell that lets the user interact with data in HDFS. •The syntax of the commands is similar to bash. Example: to create a directory. Usage: hadoop dfs -mkdir <paths>, for instance: hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2 •Map/Reduce •Java API which has prebuilt classes and interfaces. •Python and C++ can also be used.
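The map, shuffle/sort and reduce steps described on slides 27 and 28 can be run end to end in a single process. This is a minimal in-memory sketch, not the Hadoop API: `run_mapreduce`, `map_fn`, `reduce_fn` and the city/temperature records are all made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle/sort and reduce over (key1, value1) records."""
    # Map: each (key1, value1) yields zero or more (key2, value2) pairs.
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]
    # Shuffle and sort: bring pairs with the same key next to each other.
    intermediate.sort(key=itemgetter(0))
    # Reduce: call reduce_fn once per key with the list of its values.
    return {
        key: reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    }


def map_fn(_line_no, line):
    # key1 is a line number, value1 a "city,temperature" line.
    city, temp = line.split(",")
    yield city, int(temp)


def reduce_fn(_city, temps):
    return max(temps)


records = [(0, "pune,31"), (1, "delhi,40"), (2, "pune,35")]
result = run_mapreduce(records, map_fn, reduce_fn)
```

In a real cluster the three steps run on different machines and the sort is a distributed merge, but the data flow between the user's two functions is the same.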
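The example on slides 29 to 32 can be approximated in a few lines. The three strings below are invented stand-ins for the three file blocks of BigData.txt, and `mapper` and `reducer` are hypothetical helper names; the slides' actual figures are not reproduced here.

```python
from collections import defaultdict

TERMS = ["Hadoop", "Big Data", "Green Plum"]


def mapper(block_text):
    """Emit (term, count) pairs for one file block."""
    return [(term, block_text.count(term)) for term in TERMS]


def reducer(mapper_outputs):
    """Sum the counts per term over the outputs of all mappers."""
    totals = defaultdict(int)
    for pairs in mapper_outputs:
        for term, count in pairs:
            totals[term] += count
    # Return the totals sorted by key, as after the sort step.
    return dict(sorted(totals.items()))


# Made-up contents of the three blocks, one per participating node.
blocks = [
    "Hadoop stores Big Data. Hadoop scales.",
    "Green Plum and Hadoop both target Big Data.",
    "Big Data needs Green Plum or Hadoop.",
]
totals = reducer(mapper(block) for block in blocks)
```

Each call to `mapper` corresponds to the work done by one node on its own block; `reducer` plays the role of the single node that receives every mapper's output.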
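The heartbeat relationship between the JobTracker and the TaskTrackers might look like this in miniature. The class below is a toy model, not Hadoop's JobTracker: the method names, the 10-second timeout and the rule "move the task to the first live tracker" are assumptions chosen to keep the sketch short.

```python
class JobTracker:
    """Toy model: track heartbeats and reassign tasks from silent trackers."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_heartbeat = {}  # tracker name -> time of last heartbeat
        self.assignments = {}     # task id -> tracker name

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def assign(self, task, tracker):
        self.assignments[task] = tracker

    def reassign_failed(self, now):
        """Move tasks away from trackers whose heartbeat is too old."""
        alive = [t for t, seen in self.last_heartbeat.items()
                 if now - seen <= self.timeout]
        moved = []
        for task, tracker in list(self.assignments.items()):
            if tracker not in alive and alive:
                self.assignments[task] = alive[0]
                moved.append(task)
        return moved


jt = JobTracker(timeout=10.0)
jt.heartbeat("tt1", now=0.0)
jt.heartbeat("tt2", now=0.0)
jt.assign("map-0", "tt1")
jt.assign("map-1", "tt2")
jt.heartbeat("tt2", now=12.0)       # tt1 stays silent
moved = jt.reassign_failed(now=15.0)
```

Time is passed in explicitly rather than read from a clock so that the behaviour is easy to follow and to test.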
  • 35. C++ example on Word Count[14]
  • 36. And there is more and more … PIG
  • 38. References
      [1] Randal E. Bryant, Randy H. Katz, Edward D. Lazowska, “Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society”, Version 8: December 22, 2008. Available: http://www.cra.org/ccc/docs/init/Big_Data.pdf [Accessed Sept. 9, 2012]
      [2] What is Big Data? [Online]. Available: http://www-01.ibm.com/software/data/bigdata/ [Accessed Sept. 9, 2012]
      [3] A Comprehensive List of Big Data Statistics [Online]. Available: http://wikibon.org/blog/big-data-statistics/ [Accessed Sept. 9, 2012]
      [4] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers, Big Data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, May 2011. Available: http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation [Accessed Sept. 10, 2012]
      [5] What Is Big Data?, O’Reilly Radar, January 11, 2012 [Online]. Available: http://radar.oreilly.com/2012/01/what-is-big-data.html [Accessed Sept. 10, 2012]
      [6] Big Data, Wipro [Online]. Available: http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data [Accessed Sept. 11, 2012]
  • 39. References
      [7] Owen O'Malley, “Introduction to Hadoop” [Online]. Available: http://wiki.apache.org/hadoop/HadoopPresentations [Accessed Sept. 17, 2012]
      [8] Hadoop at Yahoo!, Yahoo Developer Network [Online]. Available: http://developer.yahoo.com/hadoop/ [Accessed Sept. 17, 2012]
      [9] Elif Dede, Madhusudhan Govindaraju, Dan Gunter, Lavanya Ramakrishnan, “Riding the elephant: managing ensembles with hadoop”, in MTAGS '11 Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers, pages 49-58 [Online]. Available: ACM Digital Library, http://dl.acm.org/citation.cfm?id=2132876.2132888 [Accessed Sept. 17, 2012]
      [10] HDFS Architecture, Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html [Accessed Sept. 20, 2012]
  • 40. References
      [11] Doug Cutting, “Hadoop Overview” [Online]. Available: http://wiki.apache.org/hadoop/HadoopPresentations [Accessed Sept. 17, 2012]
      [12] Map/Reduce Tutorial, Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Reducer [Accessed Sept. 17, 2012]
      [13] Patricia Florissi, Big Ideas: Demystifying Hadoop [Video]. Available: http://www.youtube.com/watch?v=XtLXPLb6EXs&feature=relmfu
      [14] C/C++ MapReduce Code & build, Hadoop Wiki, C++ Word Count [Online]. Available: http://wiki.apache.org/hadoop/C%2B%2BWordCount [Accessed Oct. 1, 2012]
  • 41. Thank You ! And … Stay Udacious ?