Summer 2016 Report by: Shreya Chakrabarti
Self-Learning Hadoop
What is Big Data?
(Image Reference: http://www.webopedia.com/TERM/B/big_data.html)
According to recent research, every day we create around 2.5 quintillion bytes of data, and surprisingly, the majority of it has been generated within just the last 10 years. Major contributors are the social media ventures of recent years, namely Facebook, Twitter, Instagram, etc. Other sources include cell-phone GPS signals, shopper profiles stored by retail giants such as Amazon and eBay, and numerous other resources.
Data that is so huge that storing, analyzing, visualizing, and performing analytics on it becomes increasingly difficult because of its sheer volume is called Big Data.
Big Data has become a very popular term in recent times as the world realizes the importance of using existing data to its advantage and maximizing business profits. The main advantage of storing this data and utilizing newer Big Data technologies is analytics.
Four types of analytic techniques can be used by companies to better engage with their customers and in turn maximize their own capital:
1) Descriptive Analytics: “What happened?” A simple metric like page views can give us an idea about the success of a particular campaign.
2) Diagnostic Analytics: “Why did it happen?” Business Intelligence tools that analyze the data currently available in the company give us the specific reasons why a particular campaign was successful or unsuccessful, based on which the decision to continue or discontinue the campaign can easily be taken.
3) Predictive Analytics: “What will happen?” Predictive analytics is a branch of advanced analytics used to make predictions about unknown future events. It uses techniques such as data mining, statistical modeling, machine learning, and artificial intelligence to analyze current data and make predictions about the future.
4) Prescriptive Analytics: “Prevention is better than cure.” Once predictive analytics has forecast what needs to be done to maximize profits, prescriptive analytics recommends actions and ensures that nothing is done in the opposite direction that would hamper those profits.
Why Hadoop?
As discussed earlier, technology needs to advance at a drastic speed for the world to take advantage of existing as well as ever-growing data.
Apache Hadoop is an open source software framework for distributed storage and distributed
processing of very large datasets on computer clusters built from commodity hardware.
In simple terms, Hadoop can be thought of as a platform for storing large datasets and performing data analysis on them (strictly speaking it is a distributed file system plus a processing framework, not a traditional database).
Hadoop was designed on the basis of the Google File System paper published in 2003. Doug Cutting, the creator of Hadoop, named it after his son’s toy elephant. Hadoop 0.1.0 was released in April 2006, and the framework continues to evolve through the many contributors to the Apache Hadoop project.
Hadoop’s processing model is based on MapReduce.
Hadoop has two core components:
1) Hadoop Distributed File System (HDFS) for storage
2) MapReduce for processing
HDFS Architecture
(https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an open-source data management framework with scale-out storage and distributed processing capabilities. It distributes data across multiple machines: files are logically divided into equal-sized blocks, and the blocks are spread across multiple machines, which hold replicas of them. Three replicas are maintained by default to ensure availability. Data integrity is maintained by computing a checksum for each block. The name-node maintains the addresses of the blocks on the respective data-nodes; whenever data is requested, the name-node provides the address of the copy physically closest to the client. The secondary name-node serves as a checkpoint server and is not a replacement for the primary name-node when it fails.
Map Reduce
MapReduce, which originated at Google, is a popular algorithm for processing and generating large data sets. The name MapReduce originally referred to the proprietary Google technology but has since been genericized; Google itself has moved on to newer technologies since 2014.
The diagram below is from Google’s original MapReduce paper and describes the working of the MapReduce algorithm.
The MapReduce algorithm breaks down into three important steps: Map, Group & Sort, and Reduce.
The Map step divides the data into key-value pairs. The key is the most important part of the Map function, because this key is also used by the Reduce function.
Group and Sort groups the values with the same key together, to make things simpler for the next stage, the Reducer.
The final stage, the Reducer, receives the grouped and sorted data from the previous stage and produces the desired output from the processing of the dataset.
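These three steps can be sketched in plain Python, using a word count as the classic example (the data and function names here are illustrative, not from the report's own listings):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # MAP: emit a (key, value) pair for every word in the line
    for word in line.split():
        yield word.lower(), 1

def group_and_sort(pairs):
    # GROUP & SORT: bring together all values that share a key
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reducer(key, values):
    # REDUCE: combine the grouped values into the final count
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(key, values) for key, values in group_and_sort(pairs))
print(result)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

In a real Hadoop job the grouping and sorting are done by the framework between the Map and Reduce phases; here it is simulated in-process to show the data flow.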
The mini-projects below give an in-depth understanding of MapReduce.
Mini-Project 1: Max and Min Temperatures in the Year 1800
The dataset in this mini project contains temperatures from the year 1800 which were recorded
at various weather stations.
The dataset can be explained as below:
The data also contains some other fields which are not relevant to our mini project.
We will find the “minimum temperature at a particular weather station throughout the year 1800” and the “maximum temperature at that particular weather station throughout the year 1800”. (There are only two weather stations included in this particular dataset.)
Understanding the data plays a very important role in determining the “Map” and “Reduce”
part for writing a Map-Reduce Program.
Each record contains four fields:
1) Weather station code
2) Date in the year 1800 when the temperature was recorded
3) Type of reading (maximum or minimum temperature)
4) Temperature in Celsius
How a MapReduce program works:
Data → Mapper (key-value pairs) → Group and Sort → Reducer
The data is fed to the mapper, which selects the fields relevant to the result; in other words, it separates the data into key-value pairs. This data is then grouped and sorted according to the keys. The reducer is the function that ultimately gives us the result.
Sample input records:
ITE00100554 18000101 TMAX -75
GM000010962 18000101 PRCP 0
EZE00100082 18000101 TMAX -86
EZE00100082 18000101 TMIN -135
ITE00100554 18000102 TMAX -60
ITE00100554 18000102 TMIN -125
GM000010962 18000102 PRCP 0
EZE00100082 18000102 TMAX -44
After the mapper (keeping only the TMAX records, for example): ITE00100554,-75; EZE00100082,-86; ITE00100554,-60; EZE00100082,-44
After group and sort: ITE00100554: -75,-60; EZE00100082: -86,-44
After the reducer (taking the maximum): ITE00100554,-60; EZE00100082,-44
The above logic can be written in Python as shown in the code below:
Minimum Temperature
Maximum Temperature
Mapper (to establish the key-value pair)
Reducer (for final results)
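As a rough sketch of that logic in plain Python (the record values below are illustrative, not the report's actual listing):

```python
from collections import defaultdict

def mapper(row):
    # MAP: emit (station, temperature) only for the reading type we want
    station, date, reading_type, value = row
    if reading_type == "TMIN":        # switch to "TMAX" for maximum temperatures
        yield station, int(value)

def reducer(grouped):
    # REDUCE: keep the lowest temperature seen at each station
    return {station: min(values) for station, values in grouped.items()}

# A few records in the format described earlier (station, date, type, value)
records = [
    ["ITE00100554", "18000101", "TMAX", "-75"],
    ["EZE00100082", "18000101", "TMIN", "-135"],
    ["ITE00100554", "18000102", "TMIN", "-125"],
    ["EZE00100082", "18000102", "TMAX", "-44"],
]

# GROUP & SORT: collect the mapper's values under their keys
grouped = defaultdict(list)
for row in records:
    for station, value in mapper(row):
        grouped[station].append(value)

print(reducer(grouped))  # {'EZE00100082': -135, 'ITE00100554': -125}
```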
Running the Minimum Temperatures Code:
Output for Minimum Temperatures:
Running the Maximum Temperatures Code:
Output for Maximum Temperatures:
Mini-Project 2: Total Amount Ordered by Each Customer
The dataset contains a list of customers with the amount they spent in each order they placed at a restaurant.
The dataset contains 3 attributes, namely CustomerID, OrderNumber, and AmountSpent.
To write the code for this data analysis problem, let us design an approach for the problem:
Data
Mapper: establishes the key-value pair; in this case the key-value pair is the customer and the amount they spent.
Group and Sort: groups the records on the basis of the customer, so the data after grouping and sorting contains the customer number and the amounts they spent.
Reducer: produces the output of which customer (by ID) spent how much money in orders.
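The approach above can be sketched in plain Python (the order records below are illustrative, not the report's actual data):

```python
from collections import defaultdict

def mapper(line):
    # MAP: key-value pair of customer ID -> amount spent in one order
    customer_id, _order_number, amount = line.split(",")
    yield customer_id, float(amount)

def reducer(grouped):
    # REDUCE: total amount spent per customer
    return {customer: round(sum(amounts), 2) for customer, amounts in grouped.items()}

# Illustrative CustomerID,OrderNumber,AmountSpent records
orders = ["44,8602,37.19", "35,5368,65.89", "44,3391,40.64"]

# GROUP & SORT: collect each customer's order amounts together
grouped = defaultdict(list)
for line in orders:
    for customer, amount in mapper(line):
        grouped[customer].append(amount)

print(reducer(grouped))  # {'44': 77.83, '35': 65.89}
```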
The code for the same is thus written as below in Python:
Output:
The output of this project can also be improved by feeding the output of the first reducer into another mapper to obtain a sorted output. This sort of MapReduce job is called a “chained MapReduce job”.
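A minimal sketch of that second, chained stage in plain Python (with made-up totals) might be:

```python
# Output of the first reducer: customer -> order total (illustrative values)
totals = {"44": 77.83, "35": 65.89, "68": 102.40}

def mapper2(customer, total):
    # Second mapper: flip the pair so sorting happens on the total, not the ID
    yield total, customer

def reducer2(sorted_pairs):
    # Second reducer: emit customers in ascending order of total spend
    return [(customer, total) for total, customer in sorted_pairs]

pairs = [p for customer, total in totals.items() for p in mapper2(customer, total)]
print(reducer2(sorted(pairs)))  # [('35', 65.89), ('44', 77.83), ('68', 102.4)]
```

Swapping the key and value is the trick: the framework sorts by key between the stages, so making the total the key yields output ordered by amount spent.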
Revised Code:
Revised Output:
The first reducer’s output of order totals is sent to another mapper-reducer pair to get the results sorted.
Project: Social Graph of Superheroes
This dataset contains Marvel superhero data recording the appearances of superheroes with each other in various comic books.
The above image is a snippet from the data, where numbers are assigned to the various characters; the first (highlighted) number is the superhero, and the following numbers belong to the other characters that the main character is friends with.
Step 1: Find the Total Number of Friends per Superhero
To find the most popular superhero, we first need to map each character to the number of friends that superhero has. To do this we add up the friends per character, map them as a key-value pair, and feed them to the reducer. The reducer then adds up the number of friends per character.
Step 2: Find the Superhero with the Maximum Friend Count
Mapper 1: counts the number of friends per character, per line, establishing a key-value pair of Superhero: NumberOfFriends.
Reducer 1: adds up the number of friends per superhero, producing the total number of friends per superhero.
Mapper 2: substitutes a common (empty) key, for example None: 59 5933, where None is the key and 59 5933 is the value.
Reducer 2: finds the superhero with the maximum number of friends.
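The two stages above can be condensed into a plain-Python sketch (the hero IDs below are made up for illustration):

```python
from collections import defaultdict

# Each line: a hero ID followed by the IDs of heroes appearing with them
lines = ["5983 1165 3836 4361", "5983 212 471", "2048 4488 1165"]

# Stage 1 (Mapper 1 + Reducer 1): count friends per line, then sum per hero
friend_counts = defaultdict(int)
for line in lines:
    ids = line.split()
    friend_counts[ids[0]] += len(ids) - 1

# Stage 2 (Mapper 2 + Reducer 2): a single common key funnels every
# (hero, count) pair to one reducer, which picks the maximum count
most_popular = max(friend_counts.items(), key=lambda kv: kv[1])
print(most_popular)  # ('5983', 5)
```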
These two steps give us the most popular superhero.
The load_name_dictionary function displays the name of the superhero from the superhero-names file, as opposed to the numeric code of the superhero, alongside the number of friends he has.
Output:
Other Important Technologies in Hadoop
YARN
YARN can simply be called the operating system of Hadoop, because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing the high-availability features of Hadoop.
(https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html)
Resource Manager: the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.
Node Manager: takes instructions from the Resource Manager and manages resources on a single node.
Application Master: responsible for negotiating resources for an application from the Resource Manager.
HIVE
Hive is an open-source project run by volunteers at the Apache Software Foundation. It is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis.
Hive provides a SQL-like language, HiveQL, with schema-on-read, and transparently converts queries to MapReduce.
SQOOP
Sqoop is a command-line interface application for transferring data between relational
databases and Hadoop. Sqoop got its name from SQL+Hadoop.
SPARK
Spark was developed in response to limitations in the MapReduce cluster computing paradigm.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive
development APIs to allow data workers to efficiently execute streaming, machine learning or
SQL workloads that require fast iterative access to datasets. With Spark running on Apache
Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power,
derive insights, and enrich their data science workloads within a single, shared dataset in
Hadoop.

Más contenido relacionado

La actualidad más candente

Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.Computer Science Journals
 
Extracting intelligence from online news sources
Extracting intelligence from online news sourcesExtracting intelligence from online news sources
Extracting intelligence from online news sourceseSAT Publishing House
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLJongwook Woo
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753pradip patel
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopPranab Ghosh
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Data Works MD
 
Democratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDemocratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDatabricks
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Mansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analyticsMansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analyticsMansiChowkkar
 
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...Big Data Spain
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Miningsnoreen
 
Enhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete DatabasesEnhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete Databasespaperpublications3
 
Data Science : Make Smarter Business Decisions
Data Science : Make Smarter Business DecisionsData Science : Make Smarter Business Decisions
Data Science : Make Smarter Business DecisionsEdureka!
 
Data analysis using hive ql & tableau
Data analysis using hive ql & tableauData analysis using hive ql & tableau
Data analysis using hive ql & tableaupkale1708
 

La actualidad más candente (17)

Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.Effect of countries in performance of hadoop.
Effect of countries in performance of hadoop.
 
Extracting intelligence from online news sources
Extracting intelligence from online news sourcesExtracting intelligence from online news sources
Extracting intelligence from online news sources
 
Spark1
Spark1Spark1
Spark1
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure ML
 
The GDELT project
The GDELT project The GDELT project
The GDELT project
 
Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753Twitter word frequency count using hadoop components 150331221753
Twitter word frequency count using hadoop components 150331221753
 
Outlier and fraud detection using Hadoop
Outlier and fraud detection using HadoopOutlier and fraud detection using Hadoop
Outlier and fraud detection using Hadoop
 
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock...
 
Democratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn CreatorDemocratizing Machine Learning: Perspective from a scikit-learn Creator
Democratizing Machine Learning: Perspective from a scikit-learn Creator
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Mansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analyticsMansi chowkkar programming_in_data_analytics
Mansi chowkkar programming_in_data_analytics
 
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
Monitoring world geopolitics through Big Data by Tomasa Rodrigo and Álvaro Or...
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Enhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete DatabasesEnhance The Technique For Searching Dimension Incomplete Databases
Enhance The Technique For Searching Dimension Incomplete Databases
 
Data Science : Make Smarter Business Decisions
Data Science : Make Smarter Business DecisionsData Science : Make Smarter Business Decisions
Data Science : Make Smarter Business Decisions
 
Data analysis using hive ql & tableau
Data analysis using hive ql & tableauData analysis using hive ql & tableau
Data analysis using hive ql & tableau
 

Destacado

Survey Paper on Google Project Loon- Ballon for Everyone
Survey Paper on Google Project Loon- Ballon for EveryoneSurvey Paper on Google Project Loon- Ballon for Everyone
Survey Paper on Google Project Loon- Ballon for EveryoneShreya Chakrabarti
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Geoffrey Fox
 
Programming Languages - Functional Programming Paper
Programming Languages - Functional Programming PaperProgramming Languages - Functional Programming Paper
Programming Languages - Functional Programming PaperShreya Chakrabarti
 
Project loon report in ieee format
Project loon report in ieee formatProject loon report in ieee format
Project loon report in ieee formatsahithi reddy
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC ConvergenceGeoffrey Fox
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 

Destacado (9)

Survey Paper on Google Project Loon- Ballon for Everyone
Survey Paper on Google Project Loon- Ballon for EveryoneSurvey Paper on Google Project Loon- Ballon for Everyone
Survey Paper on Google Project Loon- Ballon for Everyone
 
Power+point
Power+pointPower+point
Power+point
 
BICFinal06
BICFinal06BICFinal06
BICFinal06
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel Visualizing and Clustering Life Science Applications in Parallel 
Visualizing and Clustering Life Science Applications in Parallel 
 
Programming Languages - Functional Programming Paper
Programming Languages - Functional Programming PaperProgramming Languages - Functional Programming Paper
Programming Languages - Functional Programming Paper
 
Project loon report in ieee format
Project loon report in ieee formatProject loon report in ieee format
Project loon report in ieee format
 
Big Data HPC Convergence
Big Data HPC ConvergenceBig Data HPC Convergence
Big Data HPC Convergence
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 

Similar a Summer Independent Study Report

IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET Journal
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Rohit Srivastava
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sectorAnil Rana
 
DSBDA Miniproject Assignment - TE A (1).pdf
DSBDA Miniproject Assignment - TE A (1).pdfDSBDA Miniproject Assignment - TE A (1).pdf
DSBDA Miniproject Assignment - TE A (1).pdfAbhiThorat6
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1gauravsc36
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaNithin Kakkireni
 
A REVIEW PAPER ON BIG DATA ANALYTICS
A REVIEW PAPER ON BIG DATA ANALYTICSA REVIEW PAPER ON BIG DATA ANALYTICS
A REVIEW PAPER ON BIG DATA ANALYTICSSarah Adams
 
Twitter_Sentiment_analysis.pptx
Twitter_Sentiment_analysis.pptxTwitter_Sentiment_analysis.pptx
Twitter_Sentiment_analysis.pptxJOELFRANKLIN13
 
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...IRJET Journal
 
Programming for Data Analytics Project
Programming for Data Analytics ProjectProgramming for Data Analytics Project
Programming for Data Analytics ProjectAkshay Kumar Bhushan
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsFredReynolds2
 
Analysis of parking citations mapreduce techniques
Analysis of parking citations   mapreduce techniquesAnalysis of parking citations   mapreduce techniques
Analysis of parking citations mapreduce techniquesSindhujanDhayalan
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoopAnusha sweety
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docxlorainedeserre
 

Similar a Summer Independent Study Report (20)

IRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOPIRJET - Survey Paper on Map Reduce Processing using HADOOP
IRJET - Survey Paper on Map Reduce Processing using HADOOP
 
Big Data
Big DataBig Data
Big Data
 
Big Data Hadoop (Overview)
Big Data Hadoop (Overview)Big Data Hadoop (Overview)
Big Data Hadoop (Overview)
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sector
 
DSBDA Miniproject Assignment - TE A (1).pdf
DSBDA Miniproject Assignment - TE A (1).pdfDSBDA Miniproject Assignment - TE A (1).pdf
DSBDA Miniproject Assignment - TE A (1).pdf
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_Sharmila
 
A REVIEW PAPER ON BIG DATA ANALYTICS
A REVIEW PAPER ON BIG DATA ANALYTICSA REVIEW PAPER ON BIG DATA ANALYTICS
A REVIEW PAPER ON BIG DATA ANALYTICS
 
Twitter_Sentiment_analysis.pptx
Twitter_Sentiment_analysis.pptxTwitter_Sentiment_analysis.pptx
Twitter_Sentiment_analysis.pptx
 
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
Social Media Market Trender with Dache Manager Using Hadoop and Visualization...
 
Programming for Data Analytics Project
Programming for Data Analytics ProjectProgramming for Data Analytics Project
Programming for Data Analytics Project
 
Big Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential ToolsBig Data Tools: A Deep Dive into Essential Tools
Big Data Tools: A Deep Dive into Essential Tools
 
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
Study on Positive and Negative Rule Based Mining Techniques for E-Commerce Ap...
 
Major ppt
Major pptMajor ppt
Major ppt
 
Analysis of parking citations mapreduce techniques
Analysis of parking citations   mapreduce techniquesAnalysis of parking citations   mapreduce techniques
Analysis of parking citations mapreduce techniques
 
Big data with hadoop
Big data with hadoopBig data with hadoop
Big data with hadoop
 
Big data
Big dataBig data
Big data
 
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
2Running Head BIG DATA PROCESSING OF SOFTWARE AND TOOLS2BIG.docx
 

Más de Shreya Chakrabarti

Certificate in google analytics beginners
Certificate in google analytics   beginnersCertificate in google analytics   beginners
Certificate in google analytics beginnersShreya Chakrabarti
 
Citizen Data Scientist : Marketing perspective Certification
Citizen Data Scientist : Marketing perspective CertificationCitizen Data Scientist : Marketing perspective Certification
Citizen Data Scientist : Marketing perspective CertificationShreya Chakrabarti
 
Microsoft Virtual Academy Certificate of Completion Python
Microsoft Virtual Academy Certificate of Completion PythonMicrosoft Virtual Academy Certificate of Completion Python
Microsoft Virtual Academy Certificate of Completion PythonShreya Chakrabarti
 
Intelligent Systems - Predictive Analytics Project
Intelligent Systems - Predictive Analytics ProjectIntelligent Systems - Predictive Analytics Project
Intelligent Systems - Predictive Analytics ProjectShreya Chakrabarti
 

Más de Shreya Chakrabarti (8)

Certificate in google analytics beginners
Certificate in google analytics   beginnersCertificate in google analytics   beginners
Certificate in google analytics beginners
 
Citizen Data Scientist : Marketing perspective Certification
Citizen Data Scientist : Marketing perspective CertificationCitizen Data Scientist : Marketing perspective Certification
Citizen Data Scientist : Marketing perspective Certification
 
Machine Learning with MATLAB
Machine Learning with MATLABMachine Learning with MATLAB
Machine Learning with MATLAB
 
Microsoft Virtual Academy Certificate of Completion Python
Microsoft Virtual Academy Certificate of Completion PythonMicrosoft Virtual Academy Certificate of Completion Python
Microsoft Virtual Academy Certificate of Completion Python
 
Intelligent Systems - Predictive Analytics Project
Intelligent Systems - Predictive Analytics ProjectIntelligent Systems - Predictive Analytics Project
Intelligent Systems - Predictive Analytics Project
 
PROJECT PPT
PROJECT PPTPROJECT PPT
PROJECT PPT
 
BE Project
BE ProjectBE Project
BE Project
 
Project Loon - Final PPT
Project Loon - Final PPTProject Loon - Final PPT
Project Loon - Final PPT
 

Summer Independent Study Report

  • 1. Summer 2016 Reportby:ShreyaChakrabarti Self-Learning Hadoop What is Big Data? (Image Reference: http://www.webopedia.com/TERM/B/big_data.html) According to recent research and findings it has been found that every day we create around 2.5 quintillion bytes of data. Surprisingly, majority of this data has been acquired in a short span of last 10 years. A major contribution to this data is the various social media ventures in the recent years namely Facebook, Twitter, Instagram etc. Other sources of data also include the cell phone GPS signals, Shopper’s profile storage stored by shopping giants like Amazon, eBay etc. and other numerous resources. All of this data which is so huge that storing, analyzing, visualizing and performing analytics on the same is increasingly difficult because of the sheer volume of the data, such data is called Big Data. Big Data is becoming a very popular term in recent times as the world realizes the importance of using the existing data to their advantage and maximizing business profits. The main advantage of storing this data and utilizing newer Big Data technologies is analytics. The four Types of Analytic techniques can be used to achieve greater heights in today’s world for companies to better engage with their customers and in turn maximize their own capital. The four type of analytic techniques include: 1) Descriptive Analytics: “What Happened?” Simple tool like page views can give us an idea about the success of a particular campaign 2) Diagnostic Analytics:” Why it happened?” Business Intelligence tools used to analyze the data most presently available in the company give us the specific reasons for why a particular campaign was successful or unsuccessful based on which the decision to continue the campaign or discontinue it can be easily taken. 3) Predictive Analytics: “Future Prediction” Predictive analytics is a branch of advanced analytics
  • 2. which is used to make predictions about unknown future events. Predictive analytics uses many techniques, such as data mining, statistical modeling, machine learning, and artificial intelligence, to analyze current data and make predictions about the future.

4) Prescriptive Analytics: "Prevention is better than cure." Once predictive analytics indicates what needs to be done to maximize profits, care must be taken that nothing is done in the opposite direction that would hamper those profits.

Why Hadoop?

As discussed earlier, technology needs to advance at a drastic speed for the world to take advantage of existing as well as ever-growing data. Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware. In simple terms, Hadoop can be described as a data store used to hold large datasets and perform data analysis on them.

Hadoop was designed on the basis of the Google File System paper published in 2003. Doug Cutting, the creator of Hadoop, named it after his son's toy elephant. Hadoop 0.1.0 was released in April 2006, and it continues to evolve through the many contributors to the Apache Hadoop project. Hadoop's processing model is based on MapReduce.

Hadoop Components: the Hadoop Distributed File System (HDFS) and MapReduce processing.
  • 3. HDFS Architecture (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)

Apache Hadoop is a framework that allows for the distributed processing of large datasets across clusters of commodity computers using a simple programming model. It is an open-source data management framework with scale-out storage and distributed processing capabilities.

HDFS distributes data across multiple machines. Files are logically divided into equal-sized blocks, and the blocks are spread across multiple machines, which hold replicas of them. Three replicas are maintained to ensure availability, and data integrity is maintained by computing a checksum for each block. The NameNode maintains the addresses of the blocks on the respective DataNodes; whenever data is requested, the NameNode provides the address of the copy physically closest to the client. The Secondary NameNode serves as a checkpoint server and is not a replacement for the primary NameNode when it fails.
  • 4. MapReduce

Originally developed at Google, MapReduce is a popular algorithm for processing and generating large datasets. The name MapReduce originally referred to the proprietary Google technology but has since been genericized; Google itself has moved on to newer technologies since 2014.

The diagram below is from Google's original MapReduce paper and describes the working of the algorithm. MapReduce breaks down into three important steps: Map, Group & Sort, and Reduce. The Map part of the algorithm divides the data into key-value pairs; the key is the most important part of the Map function, as this key is also used by the Reduce function. Group & Sort groups the values with the same key together to make things simpler for the next stage. Finally, the Reducer receives the grouped and sorted data from the previous stage and produces the desired output from the processing of the dataset.
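The three stages above can be simulated in plain Python. This is only an illustrative sketch of the data flow (the helper names and the word-count example are mine, not from the report); a real Hadoop job distributes the same stages across a cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def group_and_sort(pairs):
    """Sort pairs by key, then gather the values that share a key."""
    pairs.sort(key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

def reduce_phase(grouped, reducer):
    """Apply the reducer to each (key, [values]) group."""
    return {key: reducer(key, values) for key, values in grouped}

# Example: counting word occurrences, the classic MapReduce demonstration
lines = ["big data", "big hadoop"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda key, values: sum(values)

result = reduce_phase(group_and_sort(map_phase(lines, mapper)), reducer)
# result == {"big": 2, "data": 1, "hadoop": 1}
```

The mapper emits a pair per word, Group & Sort merges pairs that share a word, and the reducer sums each group, mirroring the three stages described above.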
  • 5. Some examples that give an in-depth understanding of MapReduce are explained in the projects below.

Mini-Project 1: Max and Min Temperatures in the Year 1800

The dataset in this mini-project contains temperatures from the year 1800 recorded at various weather stations. Each record contains: the weather station code; the date in the year 1800 when the temperature was recorded; the type of reading (maximum or minimum temperature); and the temperature in Celsius. The data also contains some other fields that are not relevant to this mini-project.

We will find the minimum temperatures recorded at a particular weather station throughout the year 1800, and the maximum temperatures at that station over the same period. (There are only two weather stations included in this particular dataset.) Understanding the data plays a very important role in determining the Map and Reduce parts when writing a MapReduce program.
  • 6. How a MapReduce program works: Data → Mapper (key-value pairs) → Group and Sort → Reducer.

The data is fed to the mapper, which selects the fields relevant to the result, separating the data into key-value pairs. These pairs are then grouped and sorted according to their keys, and the reducer is the function that ultimately gives us the result.

Sample input:

ITE00100554 18000101 TMAX -75
GM000010962 18000101 PRCP 0
EZE00100082 18000101 TMAX -86
EZE00100082 18000101 TMIN -135
ITE00100554 18000102 TMAX -60
ITE00100554 18000102 TMIN -125
GM000010962 18000102 PRCP 0
EZE00100082 18000102 TMAX -44

After the mapper (keeping only the TMAX records shown): ITE00100554,-75; EZE00100082,-86; ITE00100554,-60. After Group and Sort: ITE00100554: -75,-60; EZE00100082: -86. After the reducer: ITE00100554,-60; EZE00100082,-86.
  • 7. The above logic can be written as below in Python.

Code: Minimum Temperature and Maximum Temperature. Mapper (to establish the key-value pair); Reducer (for the final results).
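The original Python code appears in the slides as screenshots that did not survive the text export. As a stand-in, here is a minimal pure-Python sketch of the minimum-temperature logic described above; the sample rows follow the format shown earlier, the grouping loop stands in for Hadoop's shuffle, and swapping TMIN for TMAX and min for max gives the maximum-temperature version.

```python
from collections import defaultdict

def mapper(line):
    """Emit (station, temperature) only for TMIN records."""
    fields = line.split()
    station, entry_type, temp = fields[0], fields[2], int(fields[3])
    if entry_type == "TMIN":
        yield (station, temp)

def reducer(station, temps):
    """Keep the lowest temperature seen for the station."""
    return min(temps)

# Sample records in the format shown in the earlier snippet
data = [
    "ITE00100554 18000101 TMAX -75",
    "EZE00100082 18000101 TMIN -135",
    "ITE00100554 18000102 TMIN -125",
]

# Stand-in for Hadoop's group-and-sort stage
groups = defaultdict(list)
for line in data:
    for station, temp in mapper(line):
        groups[station].append(temp)

minimums = {station: reducer(station, temps) for station, temps in groups.items()}
# minimums == {"EZE00100082": -135, "ITE00100554": -125}
```

Note how the TMAX and PRCP rows are filtered out inside the mapper, so the reducer only ever sees minimum readings.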
  • 8. Running the minimum temperatures code, and its output. Running the maximum temperatures code, and its output.
  • 9. Mini-Project 2: Total Amount Ordered by Each Customer

The dataset contains a list of customers with the amount they spent on each order they placed in a restaurant. It has three attributes: CustomerID, OrderNumber, and AmountSpent.

To write the code for this data-analysis problem, let us design an approach:

Mapper: establishes the key-value pair; in this case, the customer and the amount they spent.

Group and Sort: groups the pairs by customer, so the data after this stage contains each customer number together with all the amounts that customer spent.

Reducer: produces the output, stating which customer spent how much money in total across their orders.
  • 10. The code for the same is thus written as below in Python.

Output:

The output of this project can also be improved by feeding the output of the first reducer into another mapper to obtain a sorted output. This sort of MapReduce job is called a "chained MapReduce job".
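The Python code for this project was also embedded as a screenshot. A hedged sketch of the mapper/reducer design laid out above, with hypothetical sample rows in the CustomerID,OrderNumber,AmountSpent format, might look like this:

```python
from collections import defaultdict

def mapper(line):
    """Key-value pair: (customer ID, amount spent on that order)."""
    customer_id, _order_number, amount = line.split(",")
    yield (customer_id, float(amount))

def reducer(customer_id, amounts):
    """Total amount spent by one customer across all orders."""
    return round(sum(amounts), 2)

# Hypothetical sample rows: CustomerID,OrderNumber,AmountSpent
orders = [
    "44,8602,37.19",
    "35,5368,65.89",
    "44,3391,40.64",
]

# Stand-in for Hadoop's group-and-sort stage
groups = defaultdict(list)
for line in orders:
    for customer, amount in mapper(line):
        groups[customer].append(amount)

totals = {customer: reducer(customer, amounts) for customer, amounts in groups.items()}
# totals == {"44": 77.83, "35": 65.89}
```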
  • 11. Revised code and revised output: the first reducer's output of order totals is sent to another mapper-reducer pair to get the results sorted.
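One common way to chain a sorting stage, sketched below under my own assumptions rather than taken from the original screenshots, is to have the second mapper re-key each record by its total, zero-padded so that the string sort Hadoop performs during the shuffle matches numeric order.

```python
# Stage 2 of a chained MapReduce job: take the (customer, total) pairs
# produced by the first reducer and re-key them by amount so the
# shuffle's sort orders customers by how much they spent.

def sort_mapper(customer, total):
    """Emit the total as a zero-padded string key so string sort == numeric sort."""
    yield ("%08.2f" % total, customer)

def sort_reducer(padded_total, customers):
    """Convert the padded key back to a number for the final output."""
    for customer in customers:
        yield (float(padded_total), customer)

# Hypothetical output of the first stage
stage_one = {"44": 77.83, "35": 65.89, "79": 180.10}

pairs = [kv for customer, total in stage_one.items()
         for kv in sort_mapper(customer, total)]
pairs.sort()  # stands in for Hadoop's shuffle-and-sort on the key
ordered = [out for key, customer in pairs
           for out in sort_reducer(key, [customer])]
# ordered == [(65.89, "35"), (77.83, "44"), (180.1, "79")]
```

The zero-padding matters: without it, the string "180.10" would sort before "65.89" because "1" < "6".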
  • 12. Project: Social Graph of Superheroes

This dataset contains Marvel superhero data that records the appearances of superheroes with each other in various comic books; it essentially traces which superheroes appear together in the comic books that feature them.

The image above is a snippet from the data, where numbers are assigned to the various characters: the first number on each line (highlighted) is the superhero, and the numbers that follow belong to the other characters that this main character is "friends" with.

Step 1: Find the total number of friends per superhero. To find the most popular superhero, we first map each character to the number of friends that superhero has. To do this, we count the friends per character, emit them as key-value pairs, and feed them to the reducer, which adds up the number of friends per character.

Step 2: Find the superhero with the maximum friend count.

Mapper 1: counts the number of friends per character, per line, establishing a key-value pair of Superhero: NumberOfFriends.
Reducer 1: adds up the number of friends per superhero, producing the total per superhero.
Mapper 2: substitutes a common (empty) key, for example None: 59 5933, where None is the key and 59 5933 is the value.
Reducer 2: finds the superhero with the maximum number of friends.
  • 13. These two steps give us the most popular superhero.
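The two-step design above can be sketched in plain Python. The sample lines and helper names below are illustrative (the original code was a screenshot); the key idea from Mapper 2, funneling every hero under one common key so a single reducer can pick the maximum, is preserved.

```python
from collections import defaultdict

def friends_mapper(line):
    """Stage 1 mapper: each line is 'hero friend friend ...'; emit (hero, count)."""
    ids = line.split()
    yield (ids[0], len(ids) - 1)

def max_mapper(hero, total):
    """Stage 2 mapper: substitute a common key so one reducer sees every hero."""
    yield (None, (total, hero))

# Hypothetical sample lines in the Marvel social-graph format
lines = ["59 5933 1096", "4399 59", "5933 59 4399 1096"]

# Stage 1: total friends per hero (a hero may appear on several lines)
totals = defaultdict(int)
for line in lines:
    for hero, count in friends_mapper(line):
        totals[hero] += count

# Stage 2: everything lands under the common key; the reducer picks the max
candidates = [v for hero, total in totals.items()
              for _, v in max_mapper(hero, total)]
most_popular = max(candidates)
# most_popular == (3, "5933")
```

Emitting (total, hero) tuples means Python's tuple comparison does the work: max() compares the friend counts first, exactly what Reducer 2 needs.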
  • 14. The load_name_dictionary function displays the name of the superhero, looked up from the superhero names file, instead of the superhero's numeric code alongside the number of friends he has.

Output:

Other Important Technologies in Hadoop

YARN

YARN can simply be called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing the high-availability features of Hadoop. (https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html)
  • 15. Resource Manager: the master that arbitrates all available cluster resources and thus helps manage the distributed applications running on the YARN system.

Node Manager: takes instructions from the Resource Manager and manages the resources on a single node.

Application Master: the negotiator; application masters are responsible for negotiating resources from the Resource Manager.

HIVE

Hive is an open-source project run by volunteers at the Apache Software Foundation. It is a data-warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis. Hive provides a SQL-like language, HiveQL, with schema-on-read, and transparently converts queries to MapReduce.

SQOOP

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. Sqoop got its name from SQL + Hadoop.

SPARK

Spark was developed in response to limitations in the MapReduce cluster-computing paradigm. Apache Spark is a fast, in-memory data-processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark's power, derive insights, and enrich their data-science workloads within a single, shared dataset in Hadoop.