MapReduce with
Hadoop at MyLife
June 6, 2013
Speaker: Jeff Meister
Topics of Talk
• What are MapReduce and Hadoop?
• When would you want to use them?
• How do they work?
• What does Hadoop do for you?
• How do you write MapReduce programs
to take advantage of that?
• What do we use them for at MyLife?
What are MapReduce
and Hadoop?
• MapReduce is a programming model for
parallel processing of large datasets
• An idea for how to write programs under
certain constraints
• Hadoop is an open-source implementation
of MapReduce
• Designed for clusters of commodity
machines
Motivation:
Why would you use
MapReduce?
Background:
Disk vs. Memory
• Memory
• Where the computer
keeps data it’s
currently working on
• Fast response time,
random access
supported
• Expensive: typical size
in tens of GB
• Hard disk
• More permanent
storage of data for
future tasks
• Slow response time,
efficient only for sequential access
• Cheap: typical size in
hundreds or
thousands of GB
Example Task on
Small Datasets
ID     Public record
R1     Steve Jones, 36, 12 Main St, 10001
R2     John Brown, 72, 625 8th Ave, 90210
R3     James Davis, 23, 10 Broadway, 20202
R4     Tom Lewis, 45, 95 Park Pl, 90024
R5     Tim Harris, 33, PO Box 256, 33514
...    ...
R2000  Adam Parker, 59, 82 F St, 45454
Size: 8 MB

ID     Phone number
P1     Robert White, 45121, (654) 321-4702
P2     David Johnson, 07470, (973) 602-2519
P3     Scott Lee, 23910, (602) 412-2255
P4     Steve Jones, 10001, (212) 347-3380
P5     John Wayne, 13284, (312) 446-8878
...    ...
P1000  Tom Lewis, 90024, (650) 945-2319
Size: 3.5 MB
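At this size the whole task fits in memory. A minimal sketch of one way to do it, assuming (this is not stated on the slide) that two records match when name and ZIP code agree; the function name `match_in_memory` and the trimmed-down rows are illustrative only:

```python
from collections import defaultdict

# Sample rows taken from the two tables above, reduced to (id, name, zip).
public_records = [
    ("R1", "Steve Jones", "10001"),
    ("R2", "John Brown", "90210"),
    ("R3", "James Davis", "20202"),
    ("R4", "Tom Lewis", "90024"),
    ("R5", "Tim Harris", "33514"),
]
phone_records = [
    ("P1", "Robert White", "45121"),
    ("P2", "David Johnson", "07470"),
    ("P3", "Scott Lee", "23910"),
    ("P4", "Steve Jones", "10001"),
    ("P5", "John Wayne", "13284"),
    ("P1000", "Tom Lewis", "90024"),  # last row shown in the phone table
]

def match_in_memory(public, phones):
    """Hash join: index one dataset by (name, zip), then probe with the other."""
    index = defaultdict(list)
    for record_id, name, zip_code in public:
        index[(name, zip_code)].append(record_id)

    matches = []
    for phone_id, name, zip_code in phones:
        # .get avoids creating empty entries for keys that never match
        for record_id in index.get((name, zip_code), []):
            matches.append((record_id, phone_id))
    return matches

print(match_in_memory(public_records, phone_records))
# [('R1', 'P4'), ('R4', 'P1000')]
```

The index holds one entire dataset in memory, which is exactly what stops working once the inputs grow to hundreds of GB.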
Real World:
Large Datasets
• 290 million public records = 380 GB
• 228 million phone records = 252 GB
• We could improve the previous algorithm, but...
• The machine doesn’t have enough memory
• Would spend lots of time moving pieces of data
between disk and memory
• Disk is so slow, the task is now impractical
• What to do? Use Hadoop MapReduce!
• Divide into smaller tasks, run them in parallel
Hadoop:
What does it do?
How do you work with it?
Components of the
Hadoop System
• Hadoop Distributed File System
(HDFS)
• Splits up files into blocks, stores
them on multiple computers
• Knows which blocks are on
each machine
• Transfers blocks between
machines over the network
• Replicates blocks, designed to
tolerate frequent machine
failures
• MapReduce engine
• Supports distributed
computation
• Programmer writes Map and
Reduce functions
• Engine takes care of
parallelization, so you can focus
on your work
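To make the HDFS bullets concrete, here is a toy model of splitting a file into blocks and replicating each block on several machines. It is only a sketch: the round-robin placement, the function names, and the tiny sizes are assumptions for illustration, and real HDFS uses a more careful placement policy.

```python
def place_blocks(file_size, block_size, nodes, replication):
    """Return {block_index: [nodes holding a replica]} using round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(placement, failed_nodes):
    """A file stays readable if every block has at least one surviving replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )

placement = place_blocks(file_size=250, block_size=100,
                         nodes=["n1", "n2", "n3", "n4"], replication=3)
print(placement)
# {0: ['n1', 'n2', 'n3'], 1: ['n2', 'n3', 'n4'], 2: ['n3', 'n4', 'n1']}
print(readable_after_failure(placement, {"n1", "n2"}))  # True
```

With three replicas per block, any two machines can fail in this toy layout and every block is still available somewhere.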
The Map and
Reduce Functions
• map : (K1, V1) → List(K2, V2)
• Take an input record and produce (emit) a list of
intermediate (key, value) pairs
• reduce : (K2, List(V2)) → List(K3, V3)
• Examine the values for each intermediate key,
produce a list of output records
• Critical observation: output type of map ≠ input type
of reduce!
• What’s going on in between?
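As an illustration of the two signatures, here is a word-count pair written as plain Python functions. This is a sketch of the programming model only, not Hadoop's API; the names `map_fn` and `reduce_fn` are made up for the example.

```python
from typing import Iterator, List, Tuple

def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """(K1, V1) = (line number, line text) -> list of (K2, V2) = (word, 1)."""
    for word in value.lower().split():
        yield (word, 1)

def reduce_fn(key: str, values: List[int]) -> Iterator[Tuple[str, int]]:
    """(K2, List(V2)) = (word, [1, 1, ...]) -> list of (K3, V3) = (word, total)."""
    yield (key, sum(values))

print(list(map_fn(0, "to be or not to be")))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
print(list(reduce_fn("to", [1, 1])))
# [('to', 2)]
```

Note that `map_fn` emits individual (word, 1) pairs while `reduce_fn` expects one word with a list of counts. Something has to turn the first shape into the second.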
The “Magic”:
A Fast Parallel Sort
• The core of Hadoop MapReduce is a
distributed parallel sorting algorithm
• Hadoop guarantees that the input to each
reducer is sorted by key (K2)
• All the (K2, V2) pairs from the mappers
are grouped by key
• The reducer gets a list of values
corresponding to each key
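On a single machine, the step between map and reduce can be sketched as "sort by key, then group". The function name `shuffle` is an assumption for this example; Hadoop performs the equivalent work across the cluster.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Turn a list of (K2, V2) pairs into a key-sorted list of (K2, List(V2))."""
    ordered = sorted(pairs, key=itemgetter(0))  # sort by key only (stable)
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

mapper_output = [("b", 1), ("a", 2), ("b", 3)]
print(shuffle(mapper_output))
# [('a', [2]), ('b', [1, 3])]
```

Each element of the result is exactly the input one call to reduce expects.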
Why Is It Fast?
• Imagine how you might sort a deck of cards
• The most intuitive procedure for humans is
very inefficient for computers
• Turns out one of the best algorithms, merge sort, is
less straightforward
• Split the data up into smaller pieces, sort
the pieces individually, then merge them
• Hadoop is, in effect, doing a giant parallel
merge sort over its cluster
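The "split, sort the pieces, merge" idea fits in a few lines of Python. This is a single-machine sketch with a hypothetical function name; in a cluster, the pieces would be sorted on different machines before being merged.

```python
import heapq

def chunked_merge_sort(items, chunk_size):
    """Split items into pieces, sort each piece, then merge the sorted pieces."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Each piece is small enough to sort on its own (here: with the built-in sort).
    pieces = [
        sorted(items[start:start + chunk_size])
        for start in range(0, len(items), chunk_size)
    ]
    # heapq.merge walks all sorted pieces in parallel, always taking the smallest head.
    return list(heapq.merge(*pieces))

print(chunked_merge_sort([5, 2, 9, 1, 7, 3], chunk_size=2))
# [1, 2, 3, 5, 7, 9]
```

The merge step only ever looks at the front of each piece, so it reads every piece sequentially, which is the access pattern disks handle well.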
Example Task
with MapReduce
• map : (source_id, record) → List(match_key, source_id)
• For each input record, select the fields to match by, make a
key out of them
• Use the record’s unique identifier as the value
• reduce : (match_key, List(source_id)) →
List(public_record_id, phone_id)
• For each match key, look through the list of unique IDs
• If we find both a public record ID and a phone ID in the
same list, match!
• The profiles with these IDs share all fields in the key
• Generate the output pair of matched IDs
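The matching job described above can be sketched as follows. Several things here are assumptions made for the example rather than details from the talk: the match key is (name, ZIP code), IDs starting with "R" are public records and IDs starting with "P" are phone records, and `run_local` is a stand-in that simulates the grouping Hadoop would do between the two phases.

```python
from collections import defaultdict

def map_fn(source_id, record):
    """(source_id, record) -> [(match_key, source_id)]"""
    name, zip_code = record
    yield ((name, zip_code), source_id)

def reduce_fn(match_key, source_ids):
    """(match_key, [source_id, ...]) -> [(public_record_id, phone_id), ...]"""
    public_ids = [s for s in source_ids if s.startswith("R")]
    phone_ids = [s for s in source_ids if s.startswith("P")]
    for public_id in public_ids:
        for phone_id in phone_ids:
            yield (public_id, phone_id)

def run_local(records):
    """Stand-in for the engine: map everything, group by key, reduce in key order."""
    groups = defaultdict(list)
    for source_id, record in records:
        for key, value in map_fn(source_id, record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

records = [
    ("R1", ("Steve Jones", "10001")),
    ("R4", ("Tom Lewis", "90024")),
    ("P4", ("Steve Jones", "10001")),
    ("P5", ("John Wayne", "13284")),
    ("P1000", ("Tom Lewis", "90024")),
]
print(run_local(records))
# [('R1', 'P4'), ('R4', 'P1000')]
```

Neither function needs to see more than one record or one key's ID list at a time, which is what lets the work be spread over many machines.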
Example Task on
Small Datasets
ID     Public record
R1     Steve Jones, 36, 12 Main St, 10001
R2     John Brown, 72, 625 8th Ave, 90210
R3     James Davis, 23, 10 Broadway, 20202
R4     Tom Lewis, 45, 95 Park Pl, 90024
R5     Tim Harris, 33, PO Box 256, 33514
...    ...
R2000  Adam Parker, 59, 82 F St, 45454
Size: 8 MB

ID     Phone number
P1     Robert White, 45121, (654) 321-4702
P2     David Johnson, 07470, (973) 602-2519
P3     Scott Lee, 23910, (602) 412-2255
P4     Steve Jones, 10001, (212) 347-3380
P5     John Wayne, 13284, (312) 446-8878
...    ...
P1000  Tom Lewis, 90024, (650) 945-2319
Size: 3.5 MB
When is MapReduce
Appropriate?
• To benefit from using Hadoop:
• The data must be decomposable into many
(key, value) pairs
• Each mapper runs the same operation,
independently of other mappers
• Map output keys should sort values into groups
of similar size
• Sequential algorithms that are more straightforward
may need redesign for the MapReduce model
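One quick way to check the "groups of similar size" condition is to count how many values each map output key would receive. The `skew_ratio` helper below is a hypothetical diagnostic, not part of Hadoop: it compares the largest group to the average group.

```python
from collections import Counter

def skew_ratio(keys):
    """Largest group size divided by mean group size (1.0 means perfectly even)."""
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys to measure")
    mean_size = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean_size

print(skew_ratio(["a", "b", "c", "d"]))  # 1.0
print(skew_ratio(["a", "a", "a", "b"]))  # 1.5
```

A high ratio means one reducer would receive far more values than the others and finish long after them, so the job runs only as fast as its largest group.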
Common Applications
of MapReduce
• Many common distributed tasks are easily
expressible with MapReduce. A few examples:
• Term frequency counting
• Pattern searching
• Of course, sorting
• Graph algorithms, such as reversal (Web links)
• Inverted index generation
• Data mining (clustering, statistics)
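Taking inverted index generation as one example from this list, a minimal sketch might look like the following. The function names are illustrative, and `build_index` simulates the grouping step on one machine.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """(doc_id, text) -> [(word, doc_id)], one pair per distinct word."""
    for word in set(text.lower().split()):
        yield (word, doc_id)

def reduce_fn(word, doc_ids):
    """(word, [doc_id, ...]) -> [(word, sorted list of documents containing it)]"""
    yield (word, sorted(doc_ids))

def build_index(docs):
    """Local stand-in for the engine: map, group by word, reduce."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_fn(doc_id, text):
            groups[word].append(emitted_id)
    index = {}
    for word in sorted(groups):
        for key, postings in reduce_fn(word, groups[word]):
            index[key] = postings
    return index

docs = {"d1": "hadoop map reduce", "d2": "map side join"}
print(build_index(docs)["map"])
# ['d1', 'd2']
```

Graph reversal has the same shape: emit (target, source) for each link in map, and collect the sources for each target in reduce.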
MapReduce at MyLife
Applications of
MapReduce at MyLife
• We regularly run computations over large sets of
people data
• Who’s Searching For You
• Content-based aggregation pipeline (1.5 TB)
• Deltas of licensed data updates (300 GB)
• Generating search indexes for old platform
• Various ad hoc jobs involving matching, searching,
extraction, counting, de-duplication, and more
Hadoop Cluster
Specifications
• Currently 63 machines, each configured to run 4 or 6 map or
reduce tasks at once (total capacity 296)
• CPU:
• Each machine: 2x quad-core Opteron @ 2.2 GHz
• Memory:
• Each machine: 32 GB
• Cluster total: 2 TB
• Hard disk:
• Each machine: between 3 and 9 TB
• Total HDFS capacity: 345 TB
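The figures on this slide can be cross-checked with a little arithmetic. The split between 4-task and 6-task machines is not given in the talk; the calculation below only shows one split that is consistent with the stated totals, assuming every machine runs either exactly 4 or exactly 6 tasks.

```python
MACHINES = 63
MEMORY_PER_MACHINE_GB = 32
TOTAL_TASK_SLOTS = 296
HDFS_CAPACITY_TB = 345

# Cluster memory: 63 machines x 32 GB each.
total_memory_gb = MACHINES * MEMORY_PER_MACHINE_GB
print(total_memory_gb)  # 2016, i.e. about 2 TB

def slot_split(machines, total_slots, small=4, large=6):
    """Solve small*x + large*y = total_slots with x + y = machines, or return None."""
    extra = total_slots - small * machines
    if extra < 0 or extra % (large - small) != 0:
        return None
    large_count = extra // (large - small)
    if large_count > machines:
        return None
    return machines - large_count, large_count

print(slot_split(MACHINES, TOTAL_TASK_SLOTS))  # (41, 22)

# Average disk per machine implied by the HDFS total.
print(round(HDFS_CAPACITY_TB / MACHINES, 2))  # 5.48
```

So 41 machines at 4 tasks plus 22 machines at 6 tasks gives 296, and the average of about 5.5 TB of disk per machine falls inside the stated 3 to 9 TB range.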
Other Companies
Using Hadoop
• Yahoo! - Index calculations for Web search
• Facebook - Analytics and machine learning
• World’s largest Hadoop cluster!
• Amazon - Supports Hadoop on EC2/S3 cloud services
• LinkedIn
• People You May Know
• Viewers of This Profile Also Viewed
• Apple - Used in iAds platform
• Twitter - Data warehousing and analytics
• Lots more... http://wiki.apache.org/hadoop/PoweredBy
Further Reading
• Google research papers
• Google File System, SOSP 2003
• MapReduce, OSDI 2004
• BigTable, OSDI 2006
• Hadoop manual: http://hadoop.apache.org/
• Other Hadoop-related projects from
Apache: Cassandra, HBase, Hive, Pig
