2. Introduction:
Big data
Why study big data?
The Three V’s
Data analysis and storage
Big Data Technology:
Hadoop
• HDFS
• MapReduce
Conclusion
3. Big Data refers to datasets that grow so large that they are difficult to capture, store, manage, share, analyze, and visualize with typical database software tools.
4. Big data comes from sources such as:
• Social media and networks
• Scientific instruments
• Mobile devices
• Sensor technology and networks
Analyzing these large amounts of different data types, or big data, can uncover hidden patterns, unknown correlations, and other useful information. In that sense, big data is a goldmine.
5. Big data Examples
• 5+ billion people on the Web by end 2014
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009, 200M by 2014
• 12+ TBs of tweet data every day
• 10 billion people (1 PB)
• ? TBs of data every day
6. Volume: The amount of data is big.
Velocity:
• How fast data becomes available for analysis
• How fast you can use the data
Variety:
• Structured
• Semi-structured
• Unstructured
Other V’s: Veracity, Variability, Visualization, Value.
7. Data Volume
◦ 44x increase from 2009 to 2020
◦ From 0.8 zettabytes (ZB) to 35 ZB
Data volume is increasing exponentially.
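The figures above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes steady compound growth between 2009 and 2020 (the slide gives just the two endpoints), and all variable names are hypothetical.

```python
import math

start_zb = 0.8   # data volume in 2009, in zettabytes (from the slide)
end_zb = 35.0    # projected data volume in 2020, in zettabytes (from the slide)
years = 2020 - 2009

# Overall increase between the two endpoints.
growth_factor = end_zb / start_zb

# Assumption: the same percentage growth every year (compound growth).
annual_rate = growth_factor ** (1 / years) - 1

# Time needed for the volume to double at that assumed rate.
doubling_years = math.log(2) / math.log(1 + annual_rate)

print(f"overall growth: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_rate:.0%}")
print(f"implied doubling time: {doubling_years:.1f} years")
```

Under that assumption, 0.8 ZB to 35 ZB is a factor of about 44, which corresponds to roughly 41% growth per year, or a doubling about every two years.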
8. Structured data
• Pre-defined schema imposed on the data
• Highly structured and patterned
• Usually stored in a relational database system
Examples:
• Numbers: 20, 3.14
• Strings: "Hello World"
• Dates: 08/04/2014
Roughly 20% of all data is structured.
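To make "pre-defined schema" concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a relational database system. The table name and column names are hypothetical; the values are the examples from the slide.

```python
import sqlite3

# In-memory database, used here only as a small stand-in for a relational system.
conn = sqlite3.connect(":memory:")

# The schema is declared before any data is stored.
conn.execute(
    "CREATE TABLE sample ("
    "quantity INTEGER, ratio REAL, greeting TEXT, recorded_on TEXT)"
)

# Every row must fit the declared columns.
conn.execute(
    "INSERT INTO sample VALUES (?, ?, ?, ?)",
    (20, 3.14, "Hello World", "08/04/2014"),
)

# Read the schema back (column names) and the stored row.
columns = [info[1] for info in conn.execute("PRAGMA table_info(sample)")]
row = conn.execute("SELECT * FROM sample").fetchone()
conn.close()

print(columns)
print(row)
```

The point of the sketch is the order of operations: the structure exists first, and the data is then poured into it.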
9. Semi-structured data
• Inconsistent structure
• Typically cannot be stored in tables or a relational database
• Information is often stored as self-describing label/value pairs
• No fixed data model
Examples:
• XML: Extensible Markup Language
• SGML: Standard Generalized Markup Language
• Logs: catalogs, weblogs, graph logs
• Tweets
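The idea of self-describing label/value pairs can be sketched with a small XML fragment. The records below are made up for illustration; note that the two records do not share the same fields, which is what "inconsistent structure" means here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: each <person> carries its own labels (tags).
xml_text = """
<people>
  <person><name>Asha</name><email>asha@example.com</email></person>
  <person><name>Ravi</name><city>Pune</city><phone>555-0100</phone></person>
</people>
"""

root = ET.fromstring(xml_text)

# Turn every record into a dict of label -> value, whatever labels it has.
records = [{field.tag: field.text for field in person} for person in root]

for record in records:
    print(record)
```

No schema is declared anywhere: the labels travel with the values, so each record describes itself.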
10. Unstructured data
• Data does not reside in any particular form, i.e. row-column
• Opposite of structured data
Examples:
• Multimedia: videos, photos, audio files
• Email, messages
• Presentations and reports
• Free-form text
• Word processing documents
According to experts, 80-90% of the data in any organization is unstructured.
11. Storage capacity of hard drives has increased massively over the years, but access speeds have not kept up.
Year   Storage capacity   Transfer speed   Time to read full drive
1990   1,370 MB           4.4 MB/s         about 5 min
2010   1 TB               100 MB/s         more than 2.5 hrs
Problem and its solution: Big Data technology.
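The read times in the table follow from dividing capacity by transfer speed. The sketch below reproduces them and adds one hypothetical scenario that is not on the slide: the same 1 TB spread evenly over 100 drives read in parallel. The function name and the 100-drive figure are assumptions for illustration, and 1 TB is taken as 1,000,000 MB.

```python
def read_time_seconds(capacity_mb, speed_mb_per_s, drives=1):
    """Seconds to read all the data when it is split evenly
    across `drives` drives that are read in parallel."""
    return capacity_mb / (speed_mb_per_s * drives)

t_1990 = read_time_seconds(1370, 4.4)            # 1990 drive
t_2010 = read_time_seconds(1_000_000, 100)       # 2010 drive, 1 TB
t_parallel = read_time_seconds(1_000_000, 100, drives=100)  # hypothetical

print(f"1990: {t_1990 / 60:.1f} min")
print(f"2010: {t_2010 / 3600:.2f} h")
print(f"2010, 100 drives in parallel: {t_parallel:.0f} s")
```

Reading from many drives at once is the motivation for distributing data across a cluster, which is where the technology on the following slides comes in.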
13. To The Rescue! “Hadoop”
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware.
A common way of avoiding data loss is replication; the Hadoop Distributed Filesystem (HDFS) takes care of this first problem.
The second problem, processing data that is spread across many machines, is solved by a simple programming model: MapReduce. Hadoop is a popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
14. HDFS
“Moving Computation is Cheaper than Moving Data”
HDFS is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes).
• Redundant copies of the data are kept by the system so that, in the event of a failure, another copy is available.
• Portability across heterogeneous hardware and software platforms
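The redundancy idea can be illustrated with a toy model. This is not HDFS code and does not reproduce its real placement policy: the round-robin placement, the function names, and the node names are all assumptions made for the sketch. The replication factor of 3 matches the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Toy placement: put each block on `replication` distinct nodes,
    chosen round-robin (a simplification, not the real HDFS policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a healthy node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }

nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
placement = place_blocks(8, nodes)

after_two_failures = readable_blocks(placement, {"node1", "node2"})
after_three_failures = readable_blocks(placement, {"node1", "node2", "node3"})

print(len(after_two_failures), "of 8 blocks readable after two failures")
print(len(after_three_failures), "of 8 blocks readable after three failures")
```

With three copies of every block, any two nodes can fail without losing data; in this toy layout a block is lost only when all three nodes holding it fail together.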
15. MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
MAP: a map function processes a key/value pair to generate a set of intermediate key/value pairs.
REDUCE: a reduce function merges all intermediate values associated with the same intermediate key.
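The map and reduce roles described above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop's API: the helper run_mapreduce and the sample documents are hypothetical, and the grouping step stands in for the shuffle that a real framework performs across machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: one (document id, text) pair -> many (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: merge all values that share the same intermediate key."""
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver: apply map, group values by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

docs = [("d1", "big data is big"), ("d2", "data is gold")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

Because each call to map_fn looks at one input pair and each call to reduce_fn looks at one key, a framework is free to run those calls on different machines, which is what makes automatic parallelization possible.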
17. Conclusion
Big data is a key to innovation and has high potential for value creation. There are huge opportunities, for example in healthcare, location-related data, retail, manufacturing, and social data. There are also challenges, for example concerning data volume, data quality, data capture, and data management, including privacy, security, and governance.