Introduction
Evolution of big data
Characteristics
Examples of big data generation
Big data vs. RDBMS
Hadoop
HDFS
MapReduce
References
Big data is a term for datasets that are so
large or complex that traditional data
processing applications are inadequate.
Big data is the capability to manage a huge
volume of disparate data, at the right speed
and within the right time frame, to allow
real-time analysis.
Black box data
Social media data
Stock exchange data
Power grid data
Transport data
Search engine data
Huge competition in the market:
Retail: customer analytics and predictive analytics
Travel: travel patterns of customers
Websites: understanding users' navigation patterns,
interests, conversions, etc.
Sensors, satellite and geospatial data
Military and intelligence
Big data includes huge volume, high
velocity, and an extensible variety of data.
The data can be of three types:
• Structured data: relational data
• Semi-structured data: XML data
• Unstructured data: Word, PDF, text files, media logs
Relational database vs. big data:
• Platform: a single-computer platform that scales with better CPUs and centralized processing, versus cluster platforms that scale to thousands of nodes with distributed processing.
• Storage: relational databases (SQL) with centralized storage, versus non-relational databases (NoSQL) that manage varied data types and formats, with distributed storage.
• Analytics: batched, descriptive, and centralized, versus real-time, predictive and prescriptive, and distributed.
An open-source framework from the Apache
Software Foundation.
It allows distributed processing of large
datasets across clusters of computers using
simple programming models.
Hadoop runs applications using the
MapReduce algorithm, where the data is
processed in parallel across the nodes of the cluster.
It uses the concept of data locality: the
computation is moved to the node where the data resides.
The Hadoop framework allows the user to
quickly write and test distributed systems. It
is efficient, and it automatically distributes the
data and work across the machines, in
turn utilizing the underlying parallelism of
the CPU cores.
Hadoop does not rely on hardware to
provide fault tolerance and high availability
(FTHA); rather, the Hadoop library itself has been
designed to detect and handle failures at the
application layer.
Hadoop is designed to be self-healing.
HDFS is a file system designed for storing
very large files with streaming data access
patterns, running on clusters of commodity
hardware.
It can be defined as, "A reliable, high
bandwidth, low-cost, data-storage cluster
that facilitates the management of related
files across machines."
Basic architecture of HDFS
source: J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2.
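As a rough illustration of the storage layer just described, the sketch below uses the Hadoop FileSystem Java API to write a file to HDFS and read it back (the write-once, read-many pattern discussed later). The NameNode address hdfs://localhost:9000, the path /user/demo/sample.txt, and the class name HdfsReadWrite are assumptions made for this example only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");

        // Write once: create (or overwrite) the file and stream data into it.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read many times: open the file and stream its contents back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}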
Hadoop MapReduce is an implementation of
the MapReduce algorithm.
MapReduce is a batch query processor, and
the ability to run an ad hoc query against the
whole dataset and get the results in a
reasonable time is transformative.
source: J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2.
Example: air temperature analysis (a MapReduce sketch of this example follows the source note below).
Problems:
Dividing the work into equal-size pieces is not
easy.
Combining the results from independent processes
may require further processing.
The processing capacity of a single machine is
limited.
source: Hadoop: The Definitive Guide, by Tom White, 2015, ISBN: 978-1-491-90163-2
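As a hedged sketch of how the air temperature example above could be expressed as a Hadoop MapReduce job (simplified from the MaxTemperature program in the Tom White book cited), the code below assumes each input line has the form "year<TAB>temperature"; the class names and the simplified input format are illustrative assumptions, not the exact program from the book.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Map: for each "year<TAB>temperature" line, emit (year, temperature).
    public static class TempMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) { // assumes well-formed input lines
                context.write(new Text(parts[0]),
                              new IntWritable(Integer.parseInt(parts[1].trim())));
            }
        }
    }

    // Reduce: keep the maximum temperature seen for each year.
    public static class MaxReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The framework splits the input, runs mappers in parallel near the data (data locality), shuffles the (year, temperature) pairs by key, and runs the reducers, which addresses the three problems listed above.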
i. J. Hurwitz, et al., "Big Data for Dummies," Wiley, 2013, ISBN: 978-1-118-50422-2.
ii. http://www.cse.wustl.edu/~jain/cse570-13/
iii. T. White, "Hadoop: The Definitive Guide," O'Reilly, 2015, ISBN: 978-1-491-90163-2.
iv. Hadoop tutorials on www.tutorialspoint.com
Editor's notes
There is a data explosion: according to a figure by IBM, 2.5 quintillion bytes of data are created each day, which is a huge amount of data.
And according to Oracle, in 2012 the data growth rate was a 40% compound annual rate.
Data is growing exponentially.
In the late 1960s, data was stored in flat files that imposed no structure.
Later in the 1970s, things changed with the invention of the relational data model and the relational database management system (RDBMS) that imposed structure and a method for improving performance
Enterprise Content Management systems evolved in the 1980s to provide businesses with the capability to better manage unstructured data, mostly documents.
In the 1990s with the rise of the web, organizations wanted to move beyond documents and store and manage web content, images, audio, and video.
As with other waves in data management, big data is built on top of the evolution of data management practices over the past five decades.
With big data, it is now possible to virtualize data so that it can be stored efficiently and, utilizing cloud-based storage, more cost-effectively as well.
Volume: How much data
Velocity: How fast that data is processed
Variety: The various types of data
Even more important is the fourth V: veracity. How accurate is that data in predicting business value?
Variability : Inconsistency of the data set can hamper processes to handle and manage it.
Data processing
Data management
Analytics
Hadoop is designed to parallelize data processing across computing nodes to speed computations and hide latency.
Hadoop has two primary layers, plus supporting modules:
MapReduce (distributed processing)
HDFS (distributed storage)
YARN (resource management)
Common utilities
Servers can be added to or removed from the cluster dynamically, and Hadoop continues to operate without interruption.
In other words, Hadoop is able to detect changes, including failures, adjust to them, and continue to operate without interruption.
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
Areas where HDFS does not work well:
Low-latency data access
Lots of small files
Multiple writers, arbitrary file modifications
Placing replicas in different data centers may maximize
redundancy, but at the cost of bandwidth. Even in the same data center (which is what
all Hadoop clusters to date have run in), there are a variety of possible placement
strategies.
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random.
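To make the default placement rule above concrete, here is a toy sketch in Java; the Node record, the placeReplicas method, and the six-node cluster are hypothetical and exist only to illustrate the rule (first replica on the writer's node, second on a random off-rack node, third on the second replica's rack but a different node). It is not the actual HDFS block-placement code.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical minimal cluster model: a node is identified by (rack, host).
record Node(String rack, String host) {}

public class ReplicaPlacementSketch {
    private static final Random RAND = new Random();

    // Pick three replica locations following the default rule described above.
    static List<Node> placeReplicas(Node writer, List<Node> cluster) {
        List<Node> replicas = new ArrayList<>();

        // First replica: same node as the client/writer.
        replicas.add(writer);

        // Second replica: a random node on a different rack (off-rack).
        List<Node> offRack = cluster.stream()
                .filter(n -> !n.rack().equals(writer.rack()))
                .toList();
        Node second = offRack.get(RAND.nextInt(offRack.size()));
        replicas.add(second);

        // Third replica: same rack as the second, but a different node.
        List<Node> sameRack = cluster.stream()
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .toList();
        replicas.add(sameRack.get(RAND.nextInt(sameRack.size())));

        return replicas;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("rack1", "h1"), new Node("rack1", "h2"),
                new Node("rack2", "h3"), new Node("rack2", "h4"),
                new Node("rack3", "h5"), new Node("rack3", "h6"));
        System.out.println(placeReplicas(new Node("rack1", "h1"), cluster));
    }
}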