4. Big Data Is Everywhere
•The Large Hadron Collider (LHC), a particle
accelerator that will revolutionize our
understanding of the workings of the Universe,
will generate 60 terabytes of data per day –
15 petabytes (15 million gigabytes)
annually.[1]
•Decoding the human genome originally took
10 years to process; now it can be achieved
in one week.
•12 terabytes of Tweets are created each day.[2]
•100 terabytes of data are uploaded daily to Facebook.[3]
•Walmart handles more than 1 million customer
transactions every hour, which are imported into
databases estimated to contain more than
2.5 petabytes of data.[3]
•Utilities convert 350 billion annual meter readings to
better predict power consumption.[2]
5. What Is Big Data?
It's LARGE. It's COMPLEX. It's UNSTRUCTURED.
The McKinsey Global Institute defines it this way: “Big data refers to datasets whose
size is beyond the ability of typical database software tools to capture, store, manage,
and analyze.”[4]
O’Reilly defines big data the following way: “Big data is data that exceeds the
processing capacity of conventional database systems. The data is too big, moves too
fast, or doesn't fit the strictures of your database architectures.” [5]
6. An Obvious Question – How BIG Is BIG DATA?
A common misconception is that big data is
solely about VOLUME.
While volume, or size, is part of the
equation...
What about the SPEED at which data
is generated?
And what about the VARIETY of data
that a variety of sources are
generating?
8. Why the Sudden Explosion of Big Data?
•An increased number and variety of data sources
that generate large quantities of data
 •Sensors (location, GPS, ...)
 •Scientific computing (CERN, biological research, ...)
 •Web 2.0 (Twitter, wikis, ...)
•The realization that data is too valuable to delete
 •Data analytics and data warehousing
 •Business intelligence
•A dramatic decline in the cost of hardware,
especially storage
 •Decline in the price of SSDs
9. BIG DATA Is Fuelled by the CLOUD
•The properties of the cloud help us deal with big data.
•In turn, the challenges of big data drive the future
design, enhancement, and expansion of the cloud.
•The two are in a never-ending cycle.
10. The Value of Big Data – Why Is It So Important?
(Infographic: the value of big data[6])
12. TRADITIONAL ENTERPRISE ARCHITECTURE
Consists of
•Servers
•SAN (Storage Area Network)
•Storage arrays
•Servers – a server is a physical computer dedicated to running one or more
services to serve the needs of the users of other computers on the network.
•Storage arrays – a disk array is a disk storage system that contains multiple
disk drives (SATA, SSD).
•Storage Area Network – a storage area network (SAN) is a dedicated
network that provides access to consolidated data storage. SANs are primarily
used to make storage devices, such as disk arrays, accessible to servers so that
the devices appear like locally attached devices to the operating system.
13. SOME ADVANTAGES AND DISADVANTAGES OF
ENTERPRISE ARCHITECTURE
ADVANTAGES
•Loose coupling between servers and storage
arrays – each can be expanded, upgraded,
or retired independently of the other.
•The SAN lets services on any server
access any of the storage arrays, as
long as they have access permission.
•ROBUST, with a MINIMAL FAILURE rate.
•Mainly designed for compute-intensive
applications that operate on a
subset of the data.
DISADVANTAGES
•Becomes more costly as it expands.
•But what about BIG DATA?
It cannot handle data-intensive
operations like sorting.
14. What we want is an architecture that will give us –
15. CLUSTER ARCHITECTURE
Consists of
•Nodes – each having its
own cores, memory, and disks.
•Interconnection via a high-speed
network (LAN).
•A cluster consists of a set of loosely connected computers that work together so that in
many respects they can be viewed as a single system.
•The computers are usually connected to each other through fast local area networks,
with each node (a computer used as a server) running its own instance of
an operating system.
•The activities of the computing nodes are orchestrated by "clustering
middleware", a software layer that sits atop the nodes and allows the users to
treat the cluster as, by and large, one cohesive computing unit.
16. Benefits of Using a Cluster Architecture
•Modular and scalable – easier to expand the system without bringing down
the application that runs on top of the cluster.
•Data locality – data can be processed by the cores collocated in the
same node or rack, minimizing transfer over the network.
•Parallelization – a higher degree of parallelism via the simultaneous
execution of separate portions of a program on different processors.
•All this at a lower cost.
17. But Every Coin Has Two Sides!
•Complexity – the cost of administering a cluster of N machines.
•More storage – data is replicated to protect against failures.
•Data distribution – how to distribute data evenly across the cluster?
•Requires careful management and a massively parallel processing design.
18. Riding the Elephant – Hadoop
SOLUTION
•Open-source Apache project initiated and led by
Yahoo!.
•Apache Hadoop is an open-source Java framework
for processing and querying vast amounts of data
on large clusters of commodity hardware.[8][9]
•Runs on
 oLinux, Mac OS/X, Windows, and Solaris
 oCommodity hardware
•Targets clusters of commodity PCs
 oCost-effective bulk computing
•Created by Doug Cutting, funded by Yahoo! starting in
2006, and reached “web scale” capacity in
2008.[7]
19. Where Does It All Come From?
•The underlying technology was invented by Google back in their earlier
days so they could usefully index all the rich textual and structural
information they were collecting, and then present meaningful and
actionable results to users.
•Based on Google's MapReduce and the Google File System (GFS).
20. What Is Hadoop?
Hadoop consists of two core components[9]:
1. Hadoop Distributed File System (HDFS)
2. Hadoop distributed processing
framework
– using the Map/Reduce metaphor
21. Hadoop Distributed File System (HDFS)
Based on simple design principles:
•Split
•Scatter
•Replicate
•Manage data across the cluster
•Files are broken into large blocks,
each usually a multiple of the underlying
storage block size –
typically 64 MB or higher (see the configuration sketch below).
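As a minimal sketch of how the block size is set (the property name is from the Hadoop 0.20-era configuration referenced in [10]; the 64 MB value is shown purely for illustration), a cluster-wide hdfs-site.xml entry would look like:

  <property>
    <name>dfs.block.size</name>
    <value>67108864</value> <!-- 64 MB, in bytes -->
  </property>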
22. Hadoop Distributed File System (HDFS) contd.
•File blocks are replicated to several
DataNodes for reliability.
•The default is 3 replicas, but this is
settable (see the shell example below).
•Blocks are placed (writes are pipelined):
 •the first replica on the writer's node,
 •the second on a node in a different rack,
 •the third on another node in that same remote rack.
•Clients read from the closest replica.
•If the replication for a block drops
below target, it is automatically
re-replicated.
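The replication factor can also be changed per file from the FS shell (the path below is illustrative):

Usage: hadoop dfs -setrep [-w] <rep> <path>
  hadoop dfs -setrep -w 3 /user/hadoop/BigData.txt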
23. Hadoop Distributed File System (HDFS) contd.
•A single namespace for the entire cluster,
managed by a single NameNode.[7]
•The NameNode is a master server that
manages the file system namespace and
regulates access to files by clients.
•DataNodes serve read and write requests
and perform block creation, deletion, and
replication upon instruction from the
NameNode.
•When a DataNode fails, the NameNode:
 •identifies the file blocks that have
been affected,
 •retrieves copies from other healthy
nodes,
 •finds new nodes to store another
copy of them,
 •updates the information in its tables
(see the fsck example below).
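The NameNode's view of blocks and their replicas can be inspected from the command line with the fsck utility (the path is illustrative):

  hadoop fsck /user/hadoop -files -blocks -locations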
24. Hadoop Distributed File System (HDFS) contd.
•The client talks to both the NameNode and
the DataNodes.
•Data is not sent through the
NameNode.
•The client first contacts the NameNode,
and can then connect directly to the
DataNodes.
(Figure: HDFS architecture[10])
25. Hadoop Distributed File System (HDFS) contd.
ADVANTAGES
•Highly fault-tolerant
•High throughput
•Suitable for applications with large data
sets
•Streaming access to file system data
•Can be built out of commodity hardware
TWO POINTS OF FAILURE
•The NameNode can become a single point of
failure.
•Cluster rebalancing.
SOLUTIONS
•Enterprise editions maintain a backup of the
NameNode.
•The architecture is compatible with data-rebalancing
schemes, but this is still an area of research.
26. Hadoop Map/Reduce
•Map/Reduce is a programming model
for efficient distributed computing.
•The user submits a MapReduce job.
•The system:
 •partitions the job into lots
of tasks,
 •schedules tasks on
nodes close to the data,
 •monitors tasks,
 •kills and restarts them if they
fail/hang/disappear.[11]
Consists of two phases:
1. Mapper phase
2. Reduce phase
27. Hadoop Map/Reduce contd.
1. Mapper Phase
•The data are fed into the map function as
key/value pairs to produce intermediate
key/value pairs.
 •Input: (key1, value1) pairs
 •Output: (key2, value2) pairs
•All nodes perform the same computation
(see the code sketch after this list).
•Uses data locality to increase
performance.
•Because all data blocks stored in HDFS
are of equal size, the mapper computation can
be divided evenly.
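As a minimal sketch of what a mapper looks like in code (a hypothetical word-count mapper using the Hadoop 0.20 org.apache.hadoop.mapreduce Java API; the class name is illustrative):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Input: (byte offset, line of text); Output: (word, 1)
  public class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split the line on whitespace and emit (token, 1) for each token
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

Note that counting multi-word phrases such as “Big Data” would require a phrase-matching map function rather than the simple whitespace split shown here.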
28. Hadoop Map/Reduce contd.
Reduce Phase
•Once the mapping is done, all the intermediate results from the various nodes are
reduced to create the final output (see the code sketch after this list).
•Has three phases:
 •shuffle,
 •sort, and
 •reduce.[12]
•Shuffle – the input to the Reducer is the sorted output of the mappers. In this
phase the framework fetches the relevant partition of the output of all the
mappers.
•Sort – the framework groups Reducer inputs by key (since different
mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map outputs are
being fetched they are merged.
•Reduce – in this phase the reduce method is called for each <key, (list of
values)> pair in the grouped inputs and produces the final outputs.
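Continuing the sketch above (same assumptions: Hadoop 0.20 mapreduce API, illustrative class name), the corresponding reducer sums the 1s grouped under each key:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Input: (word, [1, 1, ...]); Output: (word, total count)
  public class WordCountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get(); // add up the 1s emitted by the mappers
      }
      context.write(key, new IntWritable(sum));
    }
  }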
29. Understood or Not? Let's Understand It with an
Example
•Suppose you want to analyze blog entries stored in BigData.txt and
count the number of times the words Hadoop, Big Data, and Green Plum appear in it.
•Suppose three nodes participate in the task. In the mapper phase, each node receives
the address of a file block and a pointer to the mapper function.
•The mapper function calculates the word counts.
[13]
30. Let's Understand It with an Example
•The output of the mapper function will be a set of
<key, value> pairs.
(Figure: final output of the mapper phase)
31. Let's Understand It with an Example
•The reduce phase sums and reduces the mapper
output.
•A node is selected to perform the
reduce function, and the other nodes send
their output to that node.
(Figure: after the shuffle of the reduce phase)
32. Let's Understand It with an Example
(Figure: after the sort phase of the reduce phase)
(Figure: and FINALLY, the final output; a full job sketch follows below)
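To tie the example together, a driver along these lines would submit the job (a sketch assuming the illustrative WordCountMapper and WordCountReducer classes from the earlier slides; input and output paths are passed on the command line):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(WordCountMapper.class);
      job.setCombinerClass(WordCountReducer.class); // optional local pre-aggregation
      job.setReducerClass(WordCountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. BigData.txt in HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }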
33. A Bit More on Map/Reduce
•The JobTracker keeps track of all the
MapReduce jobs that are running on the
various nodes.
•It schedules the jobs and keeps track of all
the map and reduce tasks running across
the nodes.
•If any one of those tasks fails, it reallocates
the task to another node.
•The TaskTracker performs the map and
reduce tasks that are assigned by the
JobTracker.
•The TaskTracker also constantly sends a
heartbeat message to the JobTracker, which
helps the JobTracker decide whether to
delegate a new task to that particular node
or not.
34. Accessibility and Implementation
•HDFS
 •HDFS provides a Java API for applications to use (sketched below).
 •Python access is also used in many applications.
 •It provides a command-line interface called the FS shell that lets the
user interact with data in HDFS.
 •The syntax of the commands is similar to bash.
 Example: to create a directory
 Usage: hadoop dfs -mkdir <paths>
  hadoop dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
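As a minimal sketch of the Java API (the file path is illustrative; Configuration, FileSystem, and Path are the standard org.apache.hadoop classes), writing a file to HDFS looks roughly like:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration(); // picks up the cluster settings
      FileSystem fs = FileSystem.get(conf);     // handle to HDFS

      Path path = new Path("/user/hadoop/hello.txt"); // illustrative path
      FSDataOutputStream out = fs.create(path);       // create and open for writing
      out.writeUTF("Hello, HDFS!");
      out.close();
    }
  }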
•Map/Reduce
 •A Java API with prebuilt classes and interfaces.
 •Python and C++ can also be used.[14]
38. References
[1] Randal E. Bryant, Randy H. Katz, Edward D. Lazowska, “Big-Data Computing:
Creating revolutionary breakthroughs in commerce, science, and society”,
Version 8, December 22, 2008. Available:
http://www.cra.org/ccc/docs/init/Big_Data.pdf [Accessed Sept. 9, 2012]
[2] What is Big Data? [Online]. Available:
http://www-01.ibm.com/software/data/bigdata/ [Accessed Sept. 9, 2012]
[3] A Comprehensive List of Big Data Statistics [Online].
Available: http://wikibon.org/blog/big-data-statistics/ [Accessed Sept. 9, 2012]
[4] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles
Roxburgh, Angela Hung Byers, Big Data: The next frontier for innovation,
competition, and productivity, McKinsey Global Institute, May 2011. Available:
http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data
_the_next_frontier_for_innovation [Accessed Sept. 10, 2012]
[5] What Is Big Data?, O’Reilly Radar, January 11, 2012 [Online]. Available:
http://radar.oreilly.com/2012/01/what-is-big-data.html [Accessed Sept. 10, 2012]
[6] Big Data, Wipro [Online]. Available:
http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data [Accessed
Sept. 11, 2012]
39. References
[7] Owen O'Malley, “Introduction to Hadoop” [Online].
Available: http://wiki.apache.org/hadoop/HadoopPresentations
[Accessed Sept. 17, 2012]
[8] Hadoop at Yahoo!, Yahoo! Developer Network [Online]. Available:
http://developer.yahoo.com/hadoop/ [Accessed Sept. 17, 2012]
[9] Elif Dede, Madhusudhan Govindaraju, Dan Gunter, Lavanya
Ramakrishnan, “Riding the elephant: managing ensembles with Hadoop”,
in MTAGS '11: Proceedings of the 2011 ACM international workshop on Many task
computing on grids and supercomputers, pages 49-58 [Online].
Available: ACM Digital Library,
http://dl.acm.org/citation.cfm?id=2132876.2132888 [Accessed Sept. 17, 2012]
[10] HDFS Architecture, Hadoop 0.20 Documentation [Online].
Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html [Accessed
Sept. 20, 2012]
40. References
[11] Doug Cutting, “Hadoop Overview” [Online]. Available:
http://wiki.apache.org/hadoop/HadoopPresentations
[Accessed Sept. 17, 2012]
[12] Map/Reduce Tutorial, Hadoop 0.20 Documentation [Online].
Available:
http://hadoop.apache.org/docs/r0.20.2/mapred_tutorial.html#Reducer
[Accessed Sept. 17, 2012]
[13] Patricia Florissi, Big Ideas: Demystifying Hadoop [Video].
Available: http://www.youtube.com/watch?v=XtLXPLb6EXs&feature=relmfu
[14] C/C++ MapReduce Code & build, Hadoop Wiki, C++ Word Count [Online].
Available:
http://wiki.apache.org/hadoop/C%2B%2BWordCount
[Accessed October 1, 2012]