Hadoop Tech Talk
By: Purna Chander
Agenda
● Big Data
● Traditional System Workflow
● Hadoop
● Hadoop Tools
● Hadoop Testing
Big Data
What is Big Data?
Data that exceeds the storage capacity and the processing power of traditional systems.
VOLUME VELOCITY VARIETY
Assume we live in a world of 100% data: roughly 90% of it was generated in the last 3-4 years, and only 10% was generated in all the years since computing systems were introduced.
Volume
1. Transaction-based data stored in relational databases over the years.
2. Unstructured data generated on social media.
3. Sensor and machine-to-machine generated data.
4. Facebook alone generates about 600 terabytes of data every day.
Velocity
1. Reacting fast enough to incoming data is one of the challenges.
2. Computation is process bound.
3. Unstructured data must be processed as it arrives.
Variety
1. Data arrives in many different formats.
2. Structured data residing in traditional RDBMSs or flat files.
3. Unstructured data such as text documents, videos, email, audio and log files.
4. Managing, merging and governing different varieties of data is the biggest challenge.
5. Data must be connected and correlated to extract useful information from it.
Challenges reading from a single disk
Reading data from the warehouse: picture five regional Facebook datasets (India, US, Japan, UK, China) of 200 GB each, adding up to 1 TB and more in a single data warehouse.
At a link speed of 100 Mbps, reading just 100 GB takes
(100 * 1000 * 8) / 100 = 8000 secs
that is, about 2.2 hours, and the full terabyte takes roughly ten times as long.
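The arithmetic above can be checked directly; a minimal sketch, taking the 100 GB figure and the 100 Mbps link speed from the slide:

```python
# Transfer time for reading data over a link of a given speed.
def transfer_secs(gigabytes, mbps):
    # GB -> MB (x1000) -> megabits (x8), divided by link speed in Mbps.
    return gigabytes * 1000 * 8 / mbps

secs = transfer_secs(100, 100)   # the slide's example: 100 GB at 100 Mbps
print(secs)                      # 8000.0 seconds
print(round(secs / 3600, 1))     # about 2.2 hours
```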
Traditional workflow
Data sources such as OLTP systems (RDBMS), social networking feeds, logs and XML/.txt files (e.g. Apache logs) are loaded into a data warehouse, which is expensive storage, spread across systems, not easily accessible and limited in capacity. Reports are then generated from the warehouse.
Hadoop workflow
In the Hadoop workflow, the same data sources (OLTP/RDBMS, social networking, logs, XML/.txt files such as Apache logs) feed into Hadoop alongside the data warehouse, and reports are generated from there.
Vertical Scaling
Increasing the resources (RAM, processor, hard disk) on a single machine is vertical scaling.
1990s - 512 MB RAM, 2-core processor.
2000s - 4/8 GB RAM, 8-core processor.
Production systems - 64 GB RAM, 16-core processor (with rising cost and maintenance).
Horizontal Scaling (Distributed Computing)
Instead of one large machine, the data warehouse workload is spread across many commodity machines (e.g. 8 GB nodes), each processing its own share of the data in parallel.
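The gain from horizontal scaling follows from the same transfer arithmetic; a sketch assuming the 1 TB warehouse is split evenly over nodes that each read at 100 Mbps (the node count of five is illustrative):

```python
def read_secs(gigabytes, mbps):
    # GB -> megabits, divided by the per-node link speed in Mbps.
    return gigabytes * 1000 * 8 / mbps

total_gb = 1000                          # the 1 TB warehouse
single = read_secs(total_gb, 100)        # one machine reads everything
parallel = read_secs(total_gb / 5, 100)  # five nodes, 200 GB each, in parallel
print(single, parallel)                  # 80000.0 vs 16000.0 seconds
```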
HADOOP
Hadoop
Hadoop is an Apache software library framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Hadoop Core Features
HDFS - used for storing data on a cluster of machines.
MapReduce - a programming model for processing the data stored in HDFS.
Data Replication
Each file is split into blocks (Block 1, Block 2, Block 3), and every block is replicated across machines with a default replication factor of 3. Copies are placed on different computers across racks (Rack 1, Rack 2), so the loss of a machine or even a whole rack does not lose data.
Map Reduce
What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Example: we take a small file and a big file and count the occurrences of the words in them.
MapReduce programs can be written natively in Java, or in Perl, Python or Ruby via Hadoop Streaming: the hadoop-streaming.jar runs the mapper and reducer scripts as external processes, executes them in parallel across the cluster, and aggregates the counts in the reducer.
Mapper code in Python
#!/usr/bin/env python
import sys

# Read lines from standard input, split them into words,
# and emit "<word><TAB>1" for every word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
words_count.txt
pramati yahoo facebook aol facebook IBM
kony google pramati
Reducer code in Python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by word, so all counts for a word are adjacent.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the final word.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
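Before submitting to the cluster, the same logic can be verified locally by simulating Hadoop's shuffle with a plain sort. The sketch below inlines the mapper and reducer logic as functions (the function names are illustrative) and runs them over the words_count.txt sample:

```python
def map_words(lines):
    # Mapper logic: emit a (word, 1) pair for every word on every line.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_pairs(pairs):
    # Reducer logic: pairs must arrive sorted by key, as after shuffle/sort.
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield (current_word, current_count)

lines = ["pramati yahoo facebook aol facebook IBM",
         "kony google pramati"]
pairs = sorted(map_words(lines))   # sorted() stands in for "| sort"
result = dict(reduce_pairs(pairs))
print(result)  # facebook and pramati each appear twice
```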
Execution of MapReduce
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
  -Dmapred.reduce.tasks=4 \
  -file /home/hduser/mapper.py -file /home/hduser/reducer.py \
  -mapper "python mapper.py" -reducer "python reducer.py" \
  -input /test/words_count.txt -output /test_output
Output of mapreduce job
Hadoop Tools
Pig Latin
● Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing can be easily achieved.
● It can handle the variety problem: structured, semi-structured and unstructured data.
● Pig was first built at Yahoo! and later became a top-level Apache project. In this series we will walk through different features of Pig using a sample dataset.
Pig Access
● Interactive mode
● Batch mode
● $ pig -x local - runs the Grunt shell against the local file system
● $ pig - runs the Grunt shell against HDFS
Execution of Pig
Can Perform Joins
● Self Join
● Equi Join
● Left Outer Join
● Right Outer Join
Customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
Orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Output of Pig
Hive
● Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
● Hive is a database technology that can define databases and tables to analyze structured data. The theme for structured data analysis is to store the data in a tabular manner and pass queries to analyze it.
● Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
Hive Features
Hive is not:
1. A relational database.
2. Designed for online transaction processing (OLTP).
3. A language for real-time queries and row-level updates.
Hive is:
1. A system that stores schema in a database and processed data in HDFS.
2. Designed for online analytical processing (OLAP).
3. A provider of an SQL-like query language called HiveQL or HQL.
Loading structured data into a table using Hive
hive> CREATE DATABASE EMP;
hive> USE EMP;
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name
String, salary String, deptno int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp_data.txt' into
table employee;
hive> select * from employee;
EMP_DATA.txt
1201,chandu,10000.00,20
1202,shekar,2000.00,10
1203,ravi,1000.00,10
1204,kiran,2000.00,20
1205,sharma,30000.00,30
1206,sri,4000.00,40
Difference between Hive and Pig
● Hive is mainly used by data analysts, whereas Pig is used by researchers and programmers.
● Hive is mainly used for structured data, whereas Pig is used for semi-structured and unstructured data.
● Hive is mainly used for creating reports, whereas Pig is used for building programmatic data pipelines.
● Hive provides partitions, so you can process a subset of the data by date or in alphabetical order, whereas Pig has no notion of partitions, though one can achieve a similar effect through filters.
SQOOP
● Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into HDFS, and to export data from the Hadoop file system back to relational databases.
● The traditional application management system, that is, the interaction of applications with a relational database through an RDBMS, is one of the sources that generate Big Data. Such data is stored in relational database servers in the relational database structure.
● SQOOP = "SQL to Hadoop and Hadoop to SQL"
Export data from HDFS to MySQL
1. Create a database test and create a table (employee) as below.
CREATE TABLE employee(id INT, name VARCHAR(20), deg VARCHAR(20), salary INT, dept VARCHAR(10));
2. Create a txt file with the data given below and put it into the Hadoop file system.
Emp.txt
========
1201, gopal, manager,50000, TP
1202, manisha, preader,50000, TP
1203, kalil, php dev,30000, AC
1204, prasanth, php dev,30000, AC
1205, kranthi, admin,20000, TP
1206, satish p, grp des,20000, GR
3. hadoop fs -mkdir /emp
4. hadoop fs -put emp.txt /emp
5. Execute the sqoop command below to export the data from the txt file to MySQL.
sqoop export --verbose \
  --connect jdbc:mysql://localhost/hive_db \
  --username ***** --password ****** \
  -m 4 \
  --table employee \
  --export-dir /emp/emp.txt
Import data from MySQL to a flat file
sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username ******* --password ******* \
  --table employee \
  -m 1 \
  --target-dir /chandu
Hadoop Testing
● Unix-style file system commands such as mkdir, ls, cat, etc.
● Testing the MapReduce scripts.
● Test the mapper and reducer scripts separately with different input files.
● Example: parse the apache.log file for the Gmail users and count the number of times they have logged in on a particular day.
● Here we pass a 12 MB file to the Hadoop file system, extract only the Gmail users, and count the number of times they logged in on that particular day.
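The mapper side of the log example can be sketched in the same style as the word-count mapper. The log line format below is made up for illustration (the real apache.log fields may differ), as is the helper name:

```python
import re

# Hypothetical log line format:
#   2016-03-01 10:22:13 login user=someone@gmail.com status=OK
GMAIL_RE = re.compile(r'\b([\w.+-]+@gmail\.com)\b')

def map_logins(lines):
    # Mapper: emit (gmail_address, 1) for every log line mentioning one.
    for line in lines:
        m = GMAIL_RE.search(line)
        if m:
            yield (m.group(1), 1)

sample = [
    "2016-03-01 10:22:13 login user=chandu@gmail.com status=OK",
    "2016-03-01 10:25:40 login user=ravi@yahoo.com status=OK",
    "2016-03-01 11:02:05 login user=chandu@gmail.com status=OK",
]
counts = {}
for user, one in sorted(map_logins(sample)):
    counts[user] = counts.get(user, 0) + one
print(counts)  # only the Gmail user survives, counted twice
```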
Test Scenarios
1. Add special characters to the pattern object.
2. Provide extra spaces in the patterns.
3. Test the boundary conditions of the pattern.
4. Add special characters in between the pattern.
5. Count the number of patterns using the reducer.
And so on.
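Scenarios like these can be exercised locally against the word-count mapper's split-on-whitespace logic; a small sketch with made-up test inputs:

```python
def map_words(lines):
    # Same logic as the Python mapper: strip, split on whitespace, emit (word, 1).
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

# Special characters inside a token stay attached to that token.
assert list(map_words(["foo@bar baz"])) == [("foo@bar", 1), ("baz", 1)]
# Extra spaces between patterns do not produce empty words.
assert list(map_words(["  a   b  "])) == [("a", 1), ("b", 1)]
# Boundary condition: an empty line yields nothing.
assert list(map_words([""])) == []
print("all scenarios pass")
```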
Hive Test
hive> create database test;
hive> create table emp(id int, name string, salary float, deptno int,
designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n';
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
Time taken: 0.1 seconds, Fetched: 6 row(s)
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
Hive Test 2
hive> create database test;
hive> create table emp(id int, name string, salary float, deptno int,
designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
1202 pavan 4000.0 30 Dev
1202 ravi 3000.0 NULL QA
1203 kalil 30000.0 10 phpdev
1204 prasanth 30000.0 20 QA
1205 kranthi 20000.0 30 QA
1206 satishp 20000.0 40 Admin
Time taken: 0.096 seconds, Fetched: 6 row(s)
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
Note: in the second row of emp.txt, the space before "10" causes a NULL value to be displayed for deptno.
Questions ?

Más contenido relacionado

La actualidad más candente

The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemCloudera, Inc.
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSBouquet
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialFarzad Nozarian
 
YARN(yet an another resource locator)
YARN(yet an another resource locator)YARN(yet an another resource locator)
YARN(yet an another resource locator)Rupak Roy
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in DepthSyed Hadoop
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hiveReza Ameri
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveMike Frampton
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And HdfsCloudera, Inc.
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 

La actualidad más candente (19)

MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
The Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop EcosystemThe Evolution of the Hadoop Ecosystem
The Evolution of the Hadoop Ecosystem
 
Migrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMSMigrating structured data between Hadoop and RDBMS
Migrating structured data between Hadoop and RDBMS
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
YARN(yet an another resource locator)
YARN(yet an another resource locator)YARN(yet an another resource locator)
YARN(yet an another resource locator)
 
Sqoop tutorial
Sqoop tutorialSqoop tutorial
Sqoop tutorial
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
Apache Hive - Introduction
Apache Hive - IntroductionApache Hive - Introduction
Apache Hive - Introduction
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in Depth
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 

Similar a Hadoop workshop

Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoopAmbuj Kumar
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data Amar kumar
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pigSudar Muthu
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLAdam Muise
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopMr. Ankit
 

Similar a Hadoop workshop (20)

Big data
Big dataBig data
Big data
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
hadoop
hadoophadoop
hadoop
 
hadoop
hadoophadoop
hadoop
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
 
Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
May 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETLMay 29, 2014 Toronto Hadoop User Group - Micro ETL
May 29, 2014 Toronto Hadoop User Group - Micro ETL
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 

Último

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 

Último (20)

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Hadoop workshop

  • 1. Hadoop Tech Talk By : Purna Chander
  • 2. Agenda ● Big Data ● Traditional System Workflow ● Hadoop ● Hadoop Tools ● Hadoop Testing
  • 4. What is BIG DATA ? Data which is beyond to the storage capacity and which is beyond to the processing power VOLUME VELOCITY VARIETY Assume we live in a world of 100% data, 90% of data was generated in the last 3-4 years and 10% of data was generated when the systems was introduced.
  • 5. Volume 1. Transaction based data stored in relational database since years. 1. Unstructured data stored as the part of social media . 1. Sensor and machine to machine generated data. 4. The volume on Facebook is 600 Terabytes of data every day.
  • 6. Velocity 1. Reacting Fast enough is one of the challenges 1. Computation is process bound. 1. Processing the unstructured data
  • 7. Variety 1. Different formats. 1. Structured data residing in traditional RDBMS or flat files. 1. Unstructured data related to text documents, videos, email, audio, log files and etc.. 1. Managing , merging and governing different varieties of data is the biggest challenge 1. Connected and correlated to extract useful information from it.
  • 8. Challenges reading from single disk Reading data from warehouse 200GB FB (india) 200GB FB (US) 200GB FB (Japan) 200GB FB (UK) 200GB FB (CHINA) 1 TB and more Data warehouse 100 * 1000 *8 __________ = 8000 secs 100 100mbps 2.2 Hours
  • 9. OLTP ( RDBMS ) Social Networking Logs Xml / .txt Eg : Apache logs Data Warehouse (Expensive storage) Storage spread across not easily accessible, limited storage capacity Reports Reports Traditional work flow data
  • 10. OLTP ( RDBMS ) Social Networking Logs Xml / .txt Eg : Apache logs Data Warehouse Reports Reports Hadoop Hadoop Workflow
  • 11. Vertical Scaling Increasing the resources ( RAM, Processor, Hard Disk) on the machine is vertical scaling. 1990 - 512 MB RAM , 2 core processor. 2000 - 4 / 8 GB RAM, 8 core processor. Production Systems - 64GB RAM / 16 core processor. ( Cost and Maintenance).
  • 12. Horizontal Scaling ( Distributed Computing) Data Warehouse 8GB 8GB 8GB 8GB 10 % 20 % 30 %
  • 14. Hadoop Hadoop is a apache software library framework that allows for the distributed processing of large data sets across cluster of computers using simple programming model It is designed to scale up from single server to thousand of machines each offering local computation and storage. Rather than rely on hardware to deliver high availability the library is itself designed to detect and handle failures at the application layer , so delivering high availability of service on top of cluster of computers, each of which may be prone to failures.
  • 15. Hadoop Core Features HDFS - Used for Storing data on cluster of machines. Mapreduce - it is a technique to process the data that is stored in HDFS.
  • 16. Rack 1 Block1 Block 2 Block 3 Default Replication : 3 Rack 2 Computer 1 Computer 2 Computer 3 Computer 4 Computer 5 Computer 6 Computer 7 Computer 8 Computer 9 Computer 10 Computer 11 Computer Data Replication
  • 17. MapReduce
What is MapReduce? MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Example: take a text file, small or large, and count the occurrences of each word in it.
MapReduce programs can be written in Perl, Python, Java or Ruby. For non-Java languages, Hadoop Streaming (hadoop-streaming.jar) runs the Python/Perl mapper and reducer scripts as external processes, passing records through stdin/stdout, executes them in parallel, and aggregates the result counts in the reducer.
  • 18. Mapper code in Python
#!/usr/bin/env python
import sys

# emit "word<TAB>1" for every word read from stdin
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)

words_count.txt
pramati yahoo facebook aol facebook IBM kony google pramati
  • 19. Reducer code in Python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input arrives sorted by word, so equal words are adjacent
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '%s\t%s' % (current_word, current_count)
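Before submitting anything to a cluster, the two scripts can be sanity-checked locally. The snippet below simulates the streaming pipeline (map, sort/shuffle, reduce) in plain Python on the sample words_count.txt contents:

```python
sample = "pramati yahoo facebook aol facebook IBM kony google pramati"

# map phase: emit (word, 1) pairs, as mapper.py does
mapped = [(word, 1) for word in sample.split()]

# shuffle/sort phase: Hadoop sorts by key so equal words become adjacent
mapped.sort()

# reduce phase: sum counts per word, as reducer.py does
counts = {}
for word, one in mapped:
    counts[word] = counts.get(word, 0) + one

print(counts)  # pramati and facebook appear twice, the rest once
```

The same check can be done with the actual scripts by piping the file through them on the shell, since streaming mappers and reducers are just stdin/stdout programs.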
  • 20. Execution of MapReduce
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
    -Dmapred.reduce.tasks=4 \
    -file /home/hduser/mapper.py -file /home/hduser/reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /test/words_count.txt \
    -output /test_output
  • 23. Pig Latin
● Apache Pig is a tool used to analyze large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load), ad hoc data analysis and iterative processing can be easily achieved.
● Can solve the variety problem for structured, unstructured and semi-structured data.
● Pig was first built at Yahoo! and later became a top-level Apache project. In this series we will walk through different features of Pig using a sample dataset.
  • 24. Pig Access
● Interactive mode
● Batch mode
● $ pig -x local - runs the Grunt shell against the local file system
● $ pig - runs the Grunt shell against HDFS
  • 25. Execution of Pig
Pig can perform joins:
● Self join
● Equi join
● Left outer join
● Right outer join
Customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
Orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
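What an equi join over Customers.txt and Orders.txt computes can be sketched in plain Python; this shows the join semantics only, not how Pig actually executes it on a cluster:

```python
customers_txt = """1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00"""

orders_txt = """102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060"""

# index customers by id (first field), as in: JOIN orders BY cust_id, customers BY id
customers = {row.split(",")[0]: row.split(",") for row in customers_txt.splitlines()}

joined = []
for row in orders_txt.splitlines():
    order = row.split(",")
    cust_id = order[2]            # third field of an order is the customer id
    if cust_id in customers:      # equi join keeps only matching pairs
        joined.append(order + customers[cust_id])

print(len(joined))
```

A left outer join would differ only in also emitting orders with no matching customer, padded with nulls.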
  • 27. Hive
● Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
● Hive is a database technology that can define databases and tables to analyze structured data. The theme for structured data analysis is to store the data in a tabular manner and pass queries to analyze it.
● Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
  • 28. Hive Features
Hive is not:
1. A relational database.
2. A design for online transaction processing (OLTP).
3. A language for real-time queries and row-level updates.
Hive is:
1. A tool that stores schema in a database and processed data in HDFS.
2. Designed for online analytical processing (OLAP).
3. A provider of an SQL-like query language called HiveQL or HQL.
  • 29. Loading structured data into a table using Hive
hive> CREATE DATABASE EMP;
hive> USE EMP;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, deptno int)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp_data.txt' into table employee;
hive> select * from employee;
EMP_DATA.txt
1201,chandu,10000.00,20
1202,shekar,2000.00,10
1203,ravi,1000.00,10
1204,kiran,2000.00,20
1205,sharma,30000.00,30
1206,sri,4000.00,40
  • 30. Difference between Hive and Pig
● Hive is mainly used by data analysts, whereas Pig is used by researchers and programmers.
● Hive is mainly used for structured data, whereas Pig is used for semi-structured / unstructured data.
● Hive is mainly used for creating reports, whereas Pig is used for programming data pipelines.
● Hive has partitions, so you can process a subset of the data by date or in alphabetical order, whereas Pig has no notion of partitions, though one can achieve something similar through filters.
  • 31. SQOOP
● Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export from the Hadoop file system back to relational databases. This is a brief tutorial that explains how to make use of Sqoop in the Hadoop ecosystem.
● The traditional application management system, that is, the interaction of applications with a relational database through an RDBMS, is one of the sources that generates Big Data. Such Big Data is stored in relational database servers in the relational structure.
● SQOOP = "SQL to Hadoop and Hadoop to SQL"
  • 32. Export data from HDFS to MySQL
1. Create a database test and create a table (employee) as below:
CREATE TABLE employee(id INT, name VARCHAR(20), deg VARCHAR(20), salary INT, dept VARCHAR(10));
2. Create a txt file with the data given below and put it into the Hadoop file system:
Emp.txt
========
1201, gopal, manager,50000, TP
1202, manisha, preader,50000, TP
1203, kalil, php dev,30000, AC
1204, prasanth, php dev,30000, AC
1205, kranthi, admin,20000, TP
1206, satish p, grp des,20000, GR
3. hadoop fs -mkdir /emp
4. hadoop fs -put emp.txt /emp
5. Execute the sqoop command below to export the data from the txt file to MySQL:
sqoop export --verbose --connect jdbc:mysql://localhost/hive_db --username ***** --password ****** -m 4 --table employee --export-dir /emp/emp.txt
  • 33. Import data from MySQL to a flat file
sqoop import --connect jdbc:mysql://localhost/test --username ******* --password ******* --table employee -m 1 --target-dir /chandu
  • 35. Hadoop Testing
● Unix-style commands like mkdir, ls, cat, etc.
● Testing the MapReduce scripts.
● Test the mapper and reducer scripts separately with different input files.
● Example: parse the apache.log file for Gmail users and count the number of times they have logged in on a particular day.
● Here we pass a 12 MB file to the Hadoop file system, extract only the Gmail users, and count the number of times they have logged in on that particular day.
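A mapper-style sketch for the apache.log scenario might look like the code below; the log-line format, the sample lines and the function name are assumptions made for illustration, since the slide does not show the actual file:

```python
import re
from collections import Counter

# assumed log-line shape: '<ip> - <user@domain> [date] "GET /..." 200'
GMAIL = re.compile(r'\b([\w.+-]+@gmail\.com)\b')

def count_gmail_logins(lines):
    """Extract Gmail addresses from log lines and count logins per user."""
    counts = Counter()
    for line in lines:
        match = GMAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample_log = [
    '10.0.0.1 - ravi@gmail.com [08/Oct/2016] "GET /login" 200',
    '10.0.0.2 - chandu@yahoo.com [08/Oct/2016] "GET /login" 200',
    '10.0.0.1 - ravi@gmail.com [08/Oct/2016] "GET /login" 200',
]
counts = count_gmail_logins(sample_log)
print(counts)
```

Testing this logic locally on small crafted files, before running it over the 12 MB file on the cluster, is exactly the per-script testing the slide recommends.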
  • 36. Test Scenarios
1. Add special characters to the pattern object.
2. Provide extra spaces around the patterns.
3. Test the boundary conditions of the pattern.
4. Add special characters in between the pattern.
5. Count the number of pattern matches using the reducer.
And so on.
  • 37. Hive Test
hive> create database test;
hive> create table emp(id int, name string, salary float, deptno int, designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ","
    > LINES TERMINATED BY '\n';
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
Time taken: 0.1 seconds, Fetched: 6 row(s)
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
  • 38. Hive Test 2
hive> create database test;
hive> create table emp(id int, name string, salary float, deptno int, designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ","
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
1201  pavan     4000.0   30    Dev
1202  ravi      3000.0   NULL  QA      <- space before 10: NULL value displayed
1203  kalil     30000.0  10    phpdev
1204  prasanth  30000.0  20    QA
1205  kranthi   20000.0  30    QA
1206  satishp   20000.0  40    Admin
Time taken: 0.096 seconds, Fetched: 6 row(s)
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
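Both test results come from Hive returning NULL whenever a text field does not cast cleanly to the declared column type. A rough Python mimic is below; the strict leading-space rule is an assumption modelled on the slide's observed output (" 10" shown as NULL), not on Hive's documented parser:

```python
def hive_int(field):
    """Mimic of Hive casting a delimited text field to int:
    return None (NULL) instead of raising when the cast fails.
    Assumption from the slide: a leading space makes the cast fail."""
    if field != field.strip():
        return None              # ' 10' -> NULL, as seen in Hive Test 2
    try:
        return int(field)
    except ValueError:
        return None              # non-numeric text -> NULL

row = "1202, ravi,3000, 10,QA".split(",")
deptno = hive_int(row[3])        # row[3] is ' 10', so this is None
good = hive_int("10")            # clean value casts to 10
print(deptno, good)
```

This is why a single stray space in emp.txt shows up as a NULL in the deptno column while the string columns, which accept any text, still load.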