Summer 2016 Report by: Shreya Chakrabarti
Self-Learning Hadoop
What is Big Data?
(Image Reference: http://www.webopedia.com/TERM/B/big_data.html)
Recent research estimates that every day we create around 2.5 quintillion bytes of data and, surprisingly, that the majority of this data has been produced within just the last 10 years. A major contributor is the rise of social media platforms such as Facebook, Twitter, and Instagram. Other sources include cell phone GPS signals, the shopper profiles stored by retail giants like Amazon and eBay, and numerous other systems.
Data so large that storing, analyzing, visualizing, and performing analytics on it becomes difficult because of its sheer volume is called Big Data.
Big Data has become a very popular term in recent times as the world realizes the importance of using existing data to its advantage to maximize business profits. The main benefit of storing this data and adopting newer Big Data technologies is analytics.
Four types of analytic techniques can be used by companies to better engage with their customers and, in turn, maximize their own capital:
1) Descriptive Analytics: "What happened?" A simple measure like page views can give us an idea of the success of a particular campaign.
2) Diagnostic Analytics: "Why did it happen?" Business Intelligence tools applied to the data currently available in the company give the specific reasons why a particular campaign was successful or unsuccessful, based on which the decision to continue or discontinue the campaign can easily be made.
3) Predictive Analytics: "What will happen?" Predictive analytics is a branch of advanced analytics used to make predictions about unknown future events. It uses many techniques, such as data mining, statistical modeling, machine learning, and artificial intelligence, to analyze current data and make predictions about the future.
4) Prescriptive Analytics: "Prevention is better than cure." Once predictive analytics has predicted what needs to be done to maximize profits, care must be taken that nothing is done in the opposite direction that would hamper those profits.
Why Hadoop?
As discussed earlier, technology needs to advance at a drastic speed for the world to take advantage of existing, as well as ever-growing, data.
Apache Hadoop is an open source software framework for distributed storage and distributed
processing of very large datasets on computer clusters built from commodity hardware.
In simple terms, "Hadoop" can be described as a data platform used to store large datasets and perform data analysis on them.
Hadoop's design was based on the Google File System paper published in 2003. Doug Cutting, the creator of Hadoop, named it after his son's toy elephant. Hadoop 0.1.0 was released in April 2006, and the framework continues to evolve through the many contributors to the Apache Hadoop project. Hadoop's processing model is based on the MapReduce algorithm.
Hadoop has two core components: the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing.
HDFS Architecture
(https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an open-source data management framework with scale-out storage and distributed processing capabilities. It distributes data across multiple machines: files are logically divided into equal-sized blocks, and the blocks are spread across multiple machines, which hold replicas of them. Three replicas of each block are maintained to ensure availability, and data integrity is maintained by computing block checksums. The name-node maintains the addresses of the blocks on the respective data-nodes; whenever data is requested, the name-node provides the address of the copy physically closest to the client. The secondary name-node serves as a checkpoint server; it is not a replacement for the primary name-node when it fails.
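To make the storage model concrete, here is a small back-of-the-envelope sketch in Python (not part of Hadoop itself) of how a file splits into fixed-size blocks and how three-way replication multiplies the raw storage required; the 128 MB block size is the Hadoop 2.x default and is an assumption about the cluster configuration.

import math

BLOCK_SIZE_MB = 128   # default HDFS block size in Hadoop 2.x (assumed here)
REPLICATION = 3       # default HDFS replication factor

def hdfs_footprint(file_size_mb):
    # files are logically divided into equal-sized blocks...
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # ...and each block is replicated to ensure availability
    return blocks, blocks * REPLICATION

blocks, replicas = hdfs_footprint(1000)   # a 1 GB file
print(blocks, "blocks,", replicas, "block replicas across the cluster")
# prints: 8 blocks, 24 block replicas across the cluster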
MapReduce
MapReduce, which originated at Google, is a popular algorithm for processing and generating large data sets. The name MapReduce originally referred to the proprietary Google technology but has since been genericized; Google itself, however, has moved on to newer technologies since 2014.
The diagram below is from Google's original MapReduce paper and describes the working of the MapReduce algorithm.
The MapReduce algorithm breaks down into three important steps: Map, Group & Sort, and Reduce.
The Map part of the algorithm divides the data into key:value pairs. The key is the most important part of the map function, as the same key is used again by the reduce function.
Group and Sort gathers the values with the same key together to simplify the next stage, the Reducer.
In the final stage, the Reducer receives the grouped and sorted data from the previous stage and derives the desired output from the processed dataset.
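As a minimal illustration of the three steps, here is a word-count simulation in plain Python (not from the original report; it only mimics on a single machine what the framework does across a cluster):

from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (key, value) pair for every word
    for line in lines:
        for word in line.split():
            yield word, 1

def group_and_sort(pairs):
    # Group & Sort: bring all values with the same key together
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: collapse each key's values into the final answer
    for key, values in grouped:
        yield key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
print(dict(reduce_phase(group_and_sort(map_phase(lines)))))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}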
Some examples that give an in-depth understanding of MapReduce are explained in the projects below.
Mini-Project 1: Max and Min Temperatures in the Year 1800
The dataset in this mini-project contains temperatures from the year 1800 that were recorded at various weather stations. The dataset can be explained as below; it also contains some other fields that are not relevant to our mini-project.
We will find the "minimum temperature at a particular weather station throughout the year 1800" and the "maximum temperature at that particular weather station throughout the year 1800". (There are only two weather stations included in this particular dataset.)
Understanding the data plays a very important role in determining the "Map" and "Reduce" parts when writing a MapReduce program.
Each record contains four relevant fields:
1) the weather station code,
2) the date in the year 1800 when the temperature was recorded,
3) the type of reading (maximum or minimum temperature), and
4) the temperature itself (stored in tenths of a degree Celsius).
How a MapReduce program works:
Data → Mapper (key-value pairs) → Group and Sort → Reducer
The working of the MapReduce algorithm for this problem follows the flow above. The data is fed to the mapper, which selects the data relevant to the result, separating it into key-value pairs. This data is then grouped and sorted according to the keys. Finally, the reducer is the function that ultimately gives us the result.
Sample input records:
ITE00100554 18000101 TMAX -75
GM000010962 18000101 PRCP 0
EZE00100082 18000101 TMAX -86
EZE00100082 18000101 TMIN -135
ITE00100554 18000102 TMAX -60
ITE00100554 18000102 TMIN -125
GM000010962 18000102 PRCP 0
EZE00100082 18000102 TMAX -44
For the maximum-temperature job, the stages look like this:
Mapper output (TMAX records only): (ITE00100554, -75), (EZE00100082, -86), (ITE00100554, -60)
After Group and Sort: (ITE00100554, [-75, -60]), (EZE00100082, [-86])
Reducer output (maximum per station): (ITE00100554, -60), (EZE00100082, -86)
The above logic can be written in Python as below, once for the minimum temperature and once for the maximum temperature. In each job the mapper establishes the key-value pair and the reducer produces the final result.
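The original listing did not survive as text, so the following is a minimal sketch of the minimum-temperature job, assuming the mrjob library and comma-separated records of the form stationID,date,readingType,value; the maximum-temperature job only swaps TMIN for TMAX and min() for max().

from mrjob.job import MRJob

class MRMinTemperature(MRJob):
    def mapper(self, _, line):
        # assumed input format: stationID,date,readingType,value
        station, _date, reading_type, value = line.split(',')[:4]
        if reading_type == 'TMIN':
            # key: weather station, value: one minimum-temperature reading
            yield station, float(value)

    def reducer(self, station, temps):
        # the lowest of all TMIN readings seen for this station
        yield station, min(temps)

if __name__ == '__main__':
    MRMinTemperature.run()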
Mini-Project 2: Total Amount Ordered by Each Customer
The dataset contains a list of customers with the amounts they spent on each order they placed at a restaurant. The dataset contains 3 attributes, namely CustomerID, OrderNumber, and AmountSpent.
To write the code for this data analysis problem, let us design an approach for the problem:
Data → Mapper → Group and Sort → Reducer
Mapper: establishes the key-value pair; in this case the key is the customer and the value is the amount they spent on one order.
Group and Sort: groups the records on the basis of the customer, so that after grouping and sorting the data contains each customer number together with all the amounts that customer spent.
Reducer: in turn produces the output, i.e., which customer ID spent how much money in orders.
The code for the same is thus written as below in Python:
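(The original listing was an image; the following is a minimal sketch assuming the mrjob library and comma-separated input of the form CustomerID,OrderNumber,AmountSpent.)

from mrjob.job import MRJob

class MRTotalSpentByCustomer(MRJob):
    def mapper(self, _, line):
        # key: customer ID, value: the amount of one order
        customer, _order_number, amount = line.split(',')
        yield customer, float(amount)

    def reducer(self, customer, amounts):
        # total spent across all of this customer's orders
        yield customer, sum(amounts)

if __name__ == '__main__':
    MRTotalSpentByCustomer.run()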
Output: the total amount spent, listed per customer ID.
The output of this project can also be improved by feeding the output of the first reducer into another mapper, to obtain a result sorted by amount. This sort of MapReduce job is called a "chained MapReduce job".
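A sketch of such a chained job, again assuming mrjob (its MRStep class lets one job run several map-reduce passes); zero-padding the total is a common trick so that the framework's string sorting matches numeric order:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRTotalSpentSorted(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_orders,
                   reducer=self.reducer_total_by_customer),
            MRStep(mapper=self.mapper_make_amount_key,
                   reducer=self.reducer_output_sorted),
        ]

    def mapper_get_orders(self, _, line):
        customer, _order_number, amount = line.split(',')
        yield customer, float(amount)

    def reducer_total_by_customer(self, customer, amounts):
        yield customer, sum(amounts)

    def mapper_make_amount_key(self, customer, total):
        # zero-pad the total so string sorting matches numeric sorting
        yield '%08.2f' % total, customer

    def reducer_output_sorted(self, total, customers):
        for customer in customers:
            yield customer, float(total)

if __name__ == '__main__':
    MRTotalSpentSorted.run()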
Project: Social Graph of Superheroes
This dataset consists of superhero data from Marvel that records the appearances of superheroes with each other in various comic books; it basically traces which superheroes appear together in the comics that feature them.
In the snippet from the data (image omitted here), numbers are assigned to the various characters: the first number on each line (highlighted in the original) is the superhero in question, and the following numbers belong to the other characters that this main character is friends with.
Step 1: Find the Total Number of Friends per Superhero
To find the most popular superhero, we first need to map each character to the number of friends that superhero has. To do this we count the friends per character, emit them as key-value pairs, and feed them to the reducer. The reducer then adds up the number of friends per character, as in the sketch below.
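A minimal sketch of this first stage, assuming mrjob and space-separated lines whose first field is the hero ID (a hero may span several input lines, so the reducer sums the per-line counts):

from mrjob.job import MRJob

class MRFriendCount(MRJob):
    def mapper(self, _, line):
        fields = line.split()
        hero_id = fields[0]
        # everything after the hero ID is one friend, counted per line
        yield hero_id, len(fields) - 1

    def reducer(self, hero_id, counts):
        # total number of friends for this hero across all lines
        yield hero_id, sum(counts)

if __name__ == '__main__':
    MRFriendCount.run()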
Step 2: Find the Superhero with the Maximum Friend Count
Mapper 1: counts the number of friends per character, per line, establishing a key-value pair of Superhero: NumberOfFriends.
Reducer 1: adds up the number of friends per superhero, producing the total number of friends per superhero.
Mapper 2: substitutes a common (empty) key for every record, for example None: 59 5933, where None is the key and "59 5933" is the value.
Reducer 2: with every record under the one common key, finds the superhero with the maximum number of friends. A sketch of the full chained job follows.
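A sketch of the complete two-stage job using mrjob's MRStep; the first stage repeats the friend count from Step 1, and the second funnels every total under a single key so that one reducer can pick the maximum:

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMostPopularSuperhero(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_count_friends,
                   reducer=self.reducer_total_friends),
            MRStep(mapper=self.mapper_common_key,
                   reducer=self.reducer_find_max),
        ]

    def mapper_count_friends(self, _, line):
        fields = line.split()
        yield fields[0], len(fields) - 1

    def reducer_total_friends(self, hero_id, counts):
        yield hero_id, sum(counts)

    def mapper_common_key(self, hero_id, total_friends):
        # empty key so all heroes reach a single reducer
        yield None, (total_friends, hero_id)

    def reducer_find_max(self, _, totals):
        # (friend_count, hero_id) tuples compare on the count first
        best_count, best_hero = max(totals)
        yield best_hero, best_count

if __name__ == '__main__':
    MRMostPopularSuperhero.run()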
The load_name_dictionary function displays the name of the superhero, looked up in the superhero names file, instead of the numeric code of the superhero alongside his number of friends.
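A sketch of what such a helper might look like; the file name Marvel-Names.txt and the line format (a numeric ID followed by a quoted name) are assumptions about the names file, not confirmed by the report:

def load_name_dictionary(path='Marvel-Names.txt'):   # hypothetical file name
    # assumed line format: 5933 "SOME HERO NAME"
    names = {}
    with open(path, encoding='ISO-8859-1') as f:
        for line in f:
            fields = line.split('"')
            if len(fields) > 1:
                names[int(fields[0])] = fields[1]
    return names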
Output: the name of the most popular superhero together with his total friend count.
Other Important Technologies in Hadoop
YARN
YARN can simply be called the operating system of Hadoop, because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing the high-availability features of Hadoop.
(https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html)
Resource Manager: Master that arbitrates all the available cluster resources and thus helps
manage the distributed applications running on the YARN system.
Node Manager: takes instructions from the Resource Manager and manages resources on a single node.
Application Master: the negotiator; application masters are responsible for negotiating resources from the Resource Manager.
HIVE
Hive is an open source project run by volunteers at the Apache Software Foundation. Hive is
basically a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query and analysis.
HIVE provides a SQL-like language, HiveQL, with schema-on-read, and transparently converts queries to MapReduce.
SQOOP
Sqoop is a command-line interface application for transferring data between relational
databases and Hadoop. Sqoop got its name from SQL+Hadoop.
SPARK
Spark was developed in response to limitations in the MapReduce cluster computing paradigm.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive
development APIs to allow data workers to efficiently execute streaming, machine learning or
SQL workloads that require fast iterative access to datasets. With Spark running on Apache
Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power,
derive insights, and enrich their data science workloads within a single, shared dataset in
Hadoop.
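As a quick taste of the Spark programming model, here is a minimal PySpark word-count sketch (not from the original report; the input path is a hypothetical HDFS location, and pyspark is assumed to be installed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# hypothetical input path on HDFS
lines = spark.sparkContext.textFile("hdfs:///user/demo/input.txt")

counts = (lines.flatMap(lambda line: line.split())   # one record per word
               .map(lambda word: (word, 1))          # key-value pairs, as in MapReduce
               .reduceByKey(lambda a, b: a + b))     # in-memory aggregation

print(counts.take(5))
spark.stop()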