1
Hadoop Eco System
• Why Big Data?
• Ingredients of Big Data Eco System
• Working with Map Reduce
• Phases of MR
• HDFS
• Hive
• Use case
• Conclusion
Agenda
2
• Big Data is NOT JUST ABOUT SIZE; it is ABOUT
HOW IMPORTANT THE DATA IS within a large
chunk
• Data is CHANGING and getting MESSY
• Previously structured, now unstructured
• Non-uniform
• Many distributed contributors to the data
• Mobile, PDA, Tablet, sensors.
• Domains: Financial, Healthcare, Social Media
Why Big Data!!
3
Glimpse
4
• MapReduce - A technique for processing big data on
clusters using map and reduce steps.
• HDFS - The distributed file system used by Hadoop.
• Hive - A SQL-based query engine for non-Java programmers.
• Pig - A data flow language and execution environment for
exploring very large datasets.
Ingredients of Eco System
5
• HBase - A distributed, column-oriented database.
• ZooKeeper - A distributed, highly available coordination
service.
• Sqoop - A tool for efficiently moving data between
relational databases and HDFS.
Ingredients cont.
6
• Protocols used - RPC/HTTP for communication
between commodity-hardware nodes.
• Runs in pseudo-distributed (single-node) mode or on clusters
• Components - Daemons
• NameNode
• DataNode
• JobTracker
• TaskTracker
Hadoop Internals
7
• Map → Function applied to each record of
the available data
• Reduce → Function used for
aggregation or reduction
Working with Map Reduce
8
• f = Σ_{n=0}^{10} n(n-1)/2
• map = applied to every n from 0 to 10
• map(n) = n(n-1)/2
• Reduce = Σ([values]) is the
aggregation/reduction function
Each map call is independent of the others, hence parallelism can be achieved
MR as a function
9
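A minimal Python sketch of the sum on this slide, assuming the upper limit of 10 shown there. The names `term`, `mapped` and `total` are illustrative and not from the deck. The map step evaluates n(n-1)/2 for each n on its own, and the reduce step adds up the mapped values.

```python
from functools import reduce
from operator import add


def term(n):
    """Map function: evaluate n(n-1)/2 for a single n."""
    # n(n-1) is always even, so integer division is exact here.
    return n * (n - 1) // 2


# Map step: each n is handled independently, so the calls could run in parallel.
mapped = list(map(term, range(0, 11)))

# Reduce step: aggregate the mapped values into one sum.
total = reduce(add, mapped, 0)

print(mapped)
print(total)
```

Because no map call depends on another, the range could be cut into pieces and mapped on different machines before a final reduce.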
MR as representation
10
• Map: <K1, V1> → <K2, V2>
• At the reducer, V2 is the list of values for key K2
• Reduce: <K2, V2> → ~ <K3, V3>
• ~ denotes the reduction operation
• Reduced output with specific keys and
values
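The key/value signatures can be illustrated with a small word count in plain Python. This is a hypothetical example, not code from the deck: `map_fn`, `reduce_fn` and `run` are made-up names, and the in-memory grouping only stands in for what Hadoop does between map and reduce.

```python
from collections import defaultdict


def map_fn(offset, line):
    """Map: <K1, V1> = (line offset, line text) -> list of <K2, V2> = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: <K2, list of V2> -> <K3, V3> = (word, total count)."""
    return word, sum(counts)


def run(lines):
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)  # collect all values that share a key
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))


result = run(["big data is big", "data is messy"])
print(result)
```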
• Data on HDFS
• Input partition - FileSplit, InputSplit
• Map
• Partition
• Shuffle
• Sort
• Reduce
• Aggregated Data on HDFS
Phases of MR
11
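The phases can be walked through with a single-process Python sketch. Everything here is an assumption made for illustration: the `location,score` record layout, the two reducers, the CRC-based partitioner, and the max-per-key reduce. Real Hadoop jobs run these steps across many machines and read from and write to HDFS.

```python
import zlib
from itertools import groupby

NUM_REDUCERS = 2  # assumed value, for illustration only


def split_input(records, split_size):
    """Input partition: cut the input into fixed-size splits, one per map task."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_task(split):
    """Map: emit (location, score) for each 'location,score' record."""
    out = []
    for record in split:
        location, score = record.split(",")
        out.append((location, int(score)))
    return out


def partition(key):
    """Partition: pick a reducer from a deterministic hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def run_job(records, split_size=2):
    # Shuffle: route every map output pair to the reducer chosen by partition().
    reducer_inputs = [[] for _ in range(NUM_REDUCERS)]
    for split in split_input(records, split_size):
        for key, value in map_task(split):
            reducer_inputs[partition(key)].append((key, value))

    results = {}
    for pairs in reducer_inputs:
        pairs.sort(key=lambda kv: kv[0])  # Sort: order each reducer's input by key
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            results[key] = max(value for _, value in group)  # Reduce: aggregate
    return results


records = ["blr,70", "mys,55", "blr,90", "mys,60", "hyd,40"]
print(run_job(records))
```

All pairs with the same key land on the same reducer, which is what lets each reducer aggregate its keys without talking to the others.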
Phases of MR depicted
12
Data flow in MR
13
MapReduce data flow with multiple reduce tasks
Shuffle and Sort phase
14
• Architecture
HDFS - Hadoop Distributed File System
15
HDFS- Client Read
16
HDFS- Client Write
17
• List all the files and directories in HDFS recursively
(-lsr is deprecated in Hadoop 2 and later in favor of -ls -R)
• $ hadoop fs -lsr
• Put a file into HDFS
• $ hadoop fs -put <from path> <to path>
• Get files from HDFS
• $ hadoop fs -get <from path>
• Run a jar file
• $ hadoop jar <jarfile> <className> <input path> <output path>
HDFS - cli
18
• Job Configuration
• Key files: core-site.xml, mapred-site.xml
• Job-specific configuration can be
set in the code
Map Reduce cont.
19
MR job in action
20
• Job Scheduling
• Fair scheduler
• Capacity scheduler
Job Scheduling
21
• Each job is placed in a job pool
• Supports preemption
• If no pools are configured and only one job
is running, that job gets the whole cluster
Fair Scheduler
22
• Supports multi-user scheduling
• The cluster is divided into a number of
queues, which may be hierarchical
• A queue may be the child of another
queue
• Within each queue, jobs are scheduled FIFO
(with priorities), unlike the fair sharing inside a Fair Scheduler pool
Capacity scheduler
23
Map reduce Input Formats
24
• Map-side join
• A map-side join between large inputs works by performing the join
before the data reaches the map function; the inputs must be
partitioned and sorted in the same way
• Reduce-side join
• Input datasets don't have to be structured in
any particular way, but it is less efficient because
both datasets have to go through the MapReduce
shuffle.
MR Joins
25
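A reduce-side join can be sketched in plain Python as follows. The two datasets, their field names and all function names are hypothetical, and the dictionary grouping only stands in for the shuffle. Each mapper tags its records with their source so the reducer can tell the two inputs apart.

```python
from collections import defaultdict


def map_customers(record):
    """Tag each customer record with its source before it enters the shuffle."""
    custid, name = record
    return custid, ("customer", name)


def map_scores(record):
    """Tag each score record with its source before it enters the shuffle."""
    custid, score = record
    return custid, ("score", score)


def reduce_join(custid, tagged_values):
    """Join: pair every customer name with every score that shares the key."""
    names = [v for tag, v in tagged_values if tag == "customer"]
    scores = [v for tag, v in tagged_values if tag == "score"]
    return [(custid, name, score) for name in names for score in scores]


def reduce_side_join(customers, scores):
    grouped = defaultdict(list)  # stands in for the shuffle: group by join key
    for key, value in map(map_customers, customers):
        grouped[key].append(value)
    for key, value in map(map_scores, scores):
        grouped[key].append(value)
    joined = []
    for custid in sorted(grouped):
        joined.extend(reduce_join(custid, grouped[custid]))
    return joined


customers = [("c2", "Ravi"), ("c1", "Asha")]
scores = [("c1", 80), ("c2", 65), ("c1", 92)]
joined = reduce_side_join(customers, scores)
print(joined)
```

Neither input needs to be pre-sorted here, which matches the flexibility described above; the cost is that every record from both datasets passes through the grouping step.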
• Hive was created to make it possible for
analysts with strong SQL skills (but meager
Java programming skills) to run queries on
large volumes of data stored in HDFS
• Developed at Facebook and later made
part of the Apache open source
projects.
• Hive runs on your workstation and converts
your SQL query into a series of
MapReduce jobs for execution on a Hadoop
cluster
HIVE
26
• Unpack the tarball
• % tar xzf hive-x.y.z-dev.tar.gz
• Put Hive on your PATH for convenience
• % export HIVE_INSTALL=/home/tom/hive-x.y.z-dev
• % export PATH=$PATH:$HIVE_INSTALL/bin
• Launch the Hive shell and list the tables
• % hive
• hive> SHOW TABLES;
Hive Infrastructure
27
Hive Modules
28
Hive Data Types
29
• Create a table
• CREATE TABLE rank_customer(custid STRING,
score STRING, location STRING) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',';
• Load data
• LOAD DATA LOCAL INPATH
'input/dir/customerrank.dat' OVERWRITE INTO
TABLE rank_customer;
• Check the data in the warehouse
• $ ls /user/hive/warehouse/rank_customer/
Commands
30
• SELECT QUERY
• SELECT c.custid, c.score, c.location FROM
rank_customer c ORDER BY c.custid ASC,
c.location ASC, c.score DESC;
Commands cont.
31
• hive> CREATE DATABASE financials WITH
DBPROPERTIES ('creator' = 'MGP', 'date' =
'2014-10-03');
• hive> DROP DATABASE IF EXISTS financials;
• hive> ALTER DATABASE financials SET
DBPROPERTIES ('edited-by' = 'Joe Dba');
• hive> DROP TABLE IF EXISTS employees;
• hive> ALTER TABLE log_messages RENAME TO
logmsgs;
Hive - DDL Commands
32
• Determine each customer's rank
based on the customer id and the locality the
customer belongs to. The highest scorer gets the
highest rank.
• Input Output
Use case
33
• Custom Writable
Using Map Reduce
34
• CustomWritable methods overridden
CustomWritable cont.
35
Driver code
36
Mapper Code
37
Partitioner Code
38
Sort Comparator class
39
Reducer Code
40
• ## FOR OBTAINING THE RANKING ON THE BASIS OF
LOCATION AND CUSTOMER ID AS PER THE
REQUIREMENT
• hive> SELECT custid, score, location, rank()
over (PARTITION BY custid, location ORDER BY
score DESC)
AS myrank
FROM rank_customer;
Hive Query
41
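What the windowed query computes can be mimicked in a few lines of Python. This is only a sketch of the rank() semantics under stated assumptions: the sample rows are invented, `rank_rows` is a hypothetical helper, and scores are treated as integers here, whereas the table on the earlier slide declares score as STRING, which would compare text rather than numbers.

```python
from collections import defaultdict


def rank_rows(rows):
    """Emulate rank() OVER (PARTITION BY custid, location ORDER BY score DESC)."""
    partitions = defaultdict(list)
    for custid, score, location in rows:
        partitions[(custid, location)].append(score)

    ranked = []
    for (custid, location), part_scores in sorted(partitions.items()):
        ordered = sorted(part_scores, reverse=True)
        for score in ordered:
            # rank() gives tied scores the same rank and leaves a gap after them;
            # the first position of a score in the sorted list yields exactly that.
            rank = ordered.index(score) + 1
            ranked.append((custid, score, location, rank))
    return ranked


rows = [
    ("c1", 80, "blr"),
    ("c1", 92, "blr"),
    ("c1", 80, "blr"),
    ("c2", 65, "mys"),
]
ranked = rank_rows(rows)
for row in ranked:
    print(row)
```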
Hive results
42
• The Hadoop ecosystem is designed mainly
for large volumes of data held in
large files
• It is less suitable for a large
number of small files.
• It achieves parallelism over huge
data sets
• Mapping and Reducing are the key and
core functions to achieve parallelism.
Conclusion
43
• The Hadoop ecosystem works efficiently with
commodity hardware.
• Distributed hardware can be efficiently
utilized.
• Hadoop MapReduce code is typically written
in Java.
• Hive makes Hadoop accessible to SQL
programmers, though internally it runs Java
MapReduce jobs.
Conclusion cont.
44
• Hadoop: The Definitive Guide, Third
Edition by Tom White
• Programming Hive by Edward Capriolo,
Dean Wampler, and Jason Rutherglen
• http://hadoop.apache.org/
• http://hive.apache.org/
References
45
46
THANK YOU
Q&A
PRADEEP M G
