Hadoop – Large scale data analysis Abhijit Sharma Page 1    |    9/8/2011

Big Data Trends
  • Unprecedented growth in data set size: Facebook 21+ PB data warehouse, 12+ TB/day
  • Unstructured and semi-structured data: logs, documents, graphs
  • Connected data: web, tags, graphs
  • Relevant to enterprises: logs, social media, machine-generated data, breaking of silos

Putting Big Data to work
  • Data-driven organization: decision support, new offerings
  • Analytics on large data sets (FB Insights: Page, App, etc. stats)
  • Data mining: clustering, e.g. Google News articles
  • Search: Google

Problem characteristics and examples
  • Embarrassingly data-parallel problems
  • Data chunked and distributed across the cluster
  • Parallel processing with data locality: the task is dispatched to where the data is
  • Horizontal/linear scaling approach using commodity hardware
  • Write once, read many
  • Examples
      - Distributed logs: grep, number of accesses per URL
      - Search: term vector generation, reverse links

What is Hadoop?
  • Open source system for large scale, batch, distributed computing on big data
  • Map Reduce programming paradigm & framework
  • Map Reduce infrastructure
  • Distributed File System (HDFS)
  • Used extensively by web giants such as Facebook and Yahoo!; modeled on Google's published MapReduce and GFS designs

Map Reduce - Definition
  • MapReduce is a programming model and an implementation for parallel processing of large data sets
  • Map processes each logical record per input split to generate a set of intermediate key/value pairs
  • Reduce merges all intermediate values associated with the same intermediate key

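The definition above can be sketched as a tiny in-memory pipeline. This is only an illustration of the model, not Hadoop's API: `run_mapreduce`, `url_mapper`, and `sum_reducer` are hypothetical names, and the log lines and their "<client> <url>" format are made-up sample data.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce over an in-memory list of records."""
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle phase: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: merge the values of each key into one result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def url_mapper(log_line):
    # Assumed log format: "<client> <url>"; emit (url, 1) per access.
    _client, url = log_line.split()
    return [(url, 1)]


def sum_reducer(_key, values):
    return sum(values)


logs = ["10.0.0.1 /home", "10.0.0.2 /about", "10.0.0.3 /home"]
print(run_mapreduce(logs, url_mapper, sum_reducer))
```

With the sample input this prints `{'/about': 1, '/home': 2}`. In a real cluster the three phases run on different machines; here they are plain loops so the data flow is easy to follow.
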
Map Reduce - Functional Programming Origins
  • Map: apply a function to each list member (parallelizable)
      [1, 2, 3].collect { it * it }
      Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9]
  • Reduce: apply a function and an accumulator to each list member
      [1, 2, 3].inject(0) { sum, item -> sum + item }
      Output: [1, 2, 3] -> Reduce (Sum) -> 6
  • Map & Reduce
      [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
      Output: [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) -> 14

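For readers who do not use Groovy, the same three expressions can be written in Python with the built-in `map` and `functools.reduce`. This is a direct translation for comparison only; the variable names are illustrative.

```python
from functools import reduce

numbers = [1, 2, 3]

# Map: apply a function to every element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the list into one value, starting from the accumulator 0.
total = reduce(lambda acc, item: acc + item, numbers, 0)

# Map then reduce: sum of squares.
sum_of_squares = reduce(lambda acc, item: acc + item, squares, 0)

print(squares, total, sum_of_squares)
```

This prints `[1, 4, 9] 6 14`, matching the outputs on the slide.
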
Word Count - Shell
  cat * | grep | sort           | uniq -c
  input | map  | shuffle & sort | reduce

Word Count - Map Reduce

Word Count - Pseudo code
  mapper (filename, file-contents):
    for each word in file-contents:
      emit (word, 1)   // single count for a word, e.g. ("the", 1) for each occurrence of "the"

  reducer (word, values):   // values iterates over the list of counts for a word, e.g. ("the", [1, 1, ...])
    sum = 0
    for each value in values:
      sum = sum + value
    emit (word, sum)

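A runnable version of the pseudo code might look like the following. It is a single-process sketch under a few assumptions that are not in the slides: input arrives as a hypothetical dict of filename to contents, words are split on whitespace, and words are lower-cased before counting.

```python
from collections import defaultdict


def mapper(filename, file_contents):
    """Emit (word, 1) for every occurrence of a word."""
    for word in file_contents.split():
        # Lower-casing is an assumption made here, not part of the slide.
        yield (word.lower(), 1)


def reducer(word, values):
    """Sum the list of counts collected for one word."""
    total = 0
    for value in values:
        total += value
    return (word, total)


def word_count(files):
    """files maps a filename to its contents (hypothetical in-memory input)."""
    # Shuffle: collect every count emitted for the same word.
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for word, count in mapper(filename, contents):
            grouped[word].append(count)
    # Reduce each word's list of counts to a single total.
    return dict(reducer(word, counts) for word, counts in grouped.items())


sample = {"a.txt": "the cat saw the dog", "b.txt": "The dog barked"}
print(word_count(sample))
```

For the sample input, "the" is counted 3 times and "dog" twice; every other word appears once.
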
Examples - Map Reduce Definitions
  • Word Count / distributed log search for number of accesses to various URLs
      - Map: emits (word/URL, 1) for each doc/log split
      - Reduce: sums up the counts for a specific word/URL
  • Term Vector generation: term -> [doc-id]
      - Map: emits (term, doc-id) for each doc split
      - Reduce: Identity Reducer, accumulates (term, [doc-id, doc-id, ...])
  • Reverse Links: source -> target becomes target -> source
      - Map: emits (target, source) for each doc split
      - Reduce: Identity Reducer, accumulates (target, [source, source, ...])

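The term vector and reverse link examples share one shape: the mapper swaps or re-keys the data and the shuffle does the accumulation, so the reducer can be the identity. A minimal sketch, assuming in-memory dicts as input and hypothetical helper names (`shuffle`, `term_mapper`, `link_mapper`); emitting each term once per document is also an assumption made here.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key, which is all an identity reducer needs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def term_mapper(doc_id, text):
    # Emit (term, doc_id) once per distinct term in the document.
    for term in sorted(set(text.split())):
        yield (term, doc_id)


def link_mapper(source, targets):
    # Emit (target, source) for every outgoing link of a page.
    for target in targets:
        yield (target, source)


docs = {"d1": "hadoop hdfs", "d2": "hadoop mapreduce"}
links = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}

term_pairs = [p for doc_id, text in docs.items() for p in term_mapper(doc_id, text)]
link_pairs = [p for src, tgts in links.items() for p in link_mapper(src, tgts)]

term_vectors = shuffle(term_pairs)
reverse_links = shuffle(link_pairs)
print(term_vectors)
print(reverse_links)
```

Here "hadoop" maps to both documents, and "c.html" is reported as linked from both "a.html" and "b.html".
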
Map Reduce - Hadoop Implementation
  • Hides the complexity of distributed computing
  • Automatic parallelization of jobs
  • Automatic data chunking & distribution (via HDFS)
  • Data locality: MR task dispatched where data is
  • Fault tolerant to server, storage, and network failures
  • Network and disk transfer optimization
  • Load balancing

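The data locality point can be illustrated with a toy scheduling rule. This is a simplified sketch, not Hadoop's scheduler: `pick_node` and the node names are hypothetical, and real schedulers also weigh rack locality and load.

```python
def pick_node(block_replicas, free_nodes):
    """Prefer a free node that already stores the block; otherwise use any free node."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "data-local"
    if free_nodes:
        # No free node holds the data, so the block must travel over the network.
        return free_nodes[0], "remote"
    return None, "wait"


replicas = {"node2", "node5"}
print(pick_node(replicas, ["node1", "node2"]))  # ('node2', 'data-local')
print(pick_node(replicas, ["node1", "node3"]))  # ('node1', 'remote')
print(pick_node(replicas, []))                  # (None, 'wait')
```
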
Hadoop Map Reduce Architecture

HDFS Characteristics
  • Very large files: block size 64 MB/128 MB
  • Data access pattern: write once, read many
  • Writes are large, create & append only
  • Reads are large & streaming
  • Commodity hardware
  • Tolerant to failure: server, storage, network
  • Highly available through transparent replication

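The block sizes above translate into simple arithmetic. The sketch below uses hypothetical helper names, and the replication factor of 3 is an assumption (it is the usual HDFS default, but the slide does not state a value).

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to hold a file (ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def raw_storage(file_size_bytes, replication=3):
    """Bytes consumed across the cluster when every block is replicated."""
    return file_size_bytes * replication


one_gb = 1024 * MB
print(block_count(one_gb))            # 16 blocks of 64 MB
print(block_count(one_gb, 128 * MB))  # 8 blocks of 128 MB
print(raw_storage(one_gb) // MB)      # 3072 MB with 3 replicas
```

Larger blocks mean fewer entries for the Name Node to track per file, which is one reason the block size is far bigger than in a local file system.
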
HDFS Architecture

Thanks

Backup Slides

Map & Reduce Functions

Job Configuration

Hadoop Map Reduce Components
  • Job Tracker: tracks MR jobs, runs on the master node
  • Task Tracker
      - Runs on data nodes and tracks Mapper and Reducer tasks assigned to the node
      - Sends heartbeats to the Job Tracker
      - Maintains and picks up tasks from a queue

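The heartbeat relationship can be sketched as follows. This is a toy model rather than Hadoop code: the `JobTracker` class, its methods, and the 30-second timeout are all assumptions made for illustration, and time is passed in explicitly so the example is deterministic.

```python
class JobTracker:
    """Toy master that remembers when each Task Tracker last reported in."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # tracker id -> time of last heartbeat

    def heartbeat(self, tracker_id, now):
        self.last_seen[tracker_id] = now

    def live_trackers(self, now):
        # A tracker is considered alive if it reported within the timeout.
        return sorted(
            tracker
            for tracker, seen in self.last_seen.items()
            if now - seen <= self.timeout_seconds
        )


jt = JobTracker(timeout_seconds=30)
jt.heartbeat("tracker-a", now=0)
jt.heartbeat("tracker-b", now=0)
jt.heartbeat("tracker-a", now=40)
print(jt.live_trackers(now=45))  # ['tracker-a']
```

A master that notices a silent tracker can reschedule that tracker's tasks elsewhere, which is how the fault tolerance described earlier is typically achieved.
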
HDFS
  • Name Node
      - Manages the file system namespace and regulates access to files by clients; stores metadata
      - Maintains the mapping of blocks to Data Nodes and replicas
      - Manages replication
      - Executes file system namespace operations like opening, closing, and renaming files and directories
  • Data Node
      - One per node; manages local storage attached to the node
      - Internally, a file is split into one or more blocks, and these blocks are stored in a set of Data Nodes
      - Responsible for serving read and write requests from the file system's clients
      - Also performs block creation, deletion, and replication upon instruction from the Name Node

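The division of labor between Name Node and Data Nodes can be sketched as a small metadata table. Everything here is hypothetical: the `NameNode` class, the block id format, the replication factor of 2, and especially the round-robin placement, which stands in for HDFS's real placement policy.

```python
from itertools import cycle


class NameNode:
    """Toy metadata store: file -> block ids -> Data Nodes holding replicas."""

    def __init__(self, data_nodes, replication=2):
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.file_blocks = {}      # filename -> [block_id, ...]
        self.block_locations = {}  # block_id -> [data_node, ...]
        self._next_node = cycle(self.data_nodes)

    def add_file(self, filename, num_blocks):
        block_ids = [f"{filename}#blk{i}" for i in range(num_blocks)]
        self.file_blocks[filename] = block_ids
        for block_id in block_ids:
            # Round-robin placement: a simplification, not HDFS's real policy.
            self.block_locations[block_id] = [
                next(self._next_node) for _ in range(self.replication)
            ]

    def locate(self, filename):
        """Return each block of the file with the Data Nodes that store it."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[filename]]


nn = NameNode(["dn1", "dn2", "dn3"])
nn.add_file("logs.txt", 2)
print(nn.locate("logs.txt"))
```

The Name Node in this sketch holds only metadata; a client would use the result of `locate` to read block contents directly from the listed Data Nodes.
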
