MAPREDUCE SUCCINCTLY
Data everywhere
Problem - We are drowning in data
Hadoop’s place
Effective storage and processing of large chunks of data
Google GFS and MapReduce
• Google was dealing with a large amount of data over 10 years ago
• Documented experience in a series of papers
• The MapReduce programming model
• Google File System
• Scalable model that was implemented in Hadoop
Disk speeds
• Processing a 10 TB file on one machine
• Time – ~430 minutes
• Stored as 1 TB on each of 10 machines and read in parallel
• Time – ~43 minutes
To store data at scale you need to use multiple disks/machines
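The arithmetic behind these two timings can be sketched as follows. The throughput of roughly 390 MB/s is an assumption back-calculated from the ~430 minute figure; the slides do not state a disk speed, and the function name is purely illustrative.

```python
# Back-of-the-envelope scan time: one machine versus ten machines in parallel.
# ASSUMPTION: ~390 MB/s sustained throughput, back-calculated from the
# slide's ~430 minutes. The source does not state a throughput.
TB = 10**12                 # bytes (decimal terabyte)
THROUGHPUT = 390 * 10**6    # bytes per second (assumed)


def scan_minutes(total_bytes, machines=1, bytes_per_second=THROUGHPUT):
    """Minutes to read total_bytes when it is split evenly across machines."""
    per_machine = total_bytes / machines
    return per_machine / bytes_per_second / 60


single = scan_minutes(10 * TB)
parallel = scan_minutes(10 * TB, machines=10)
print(f"1 machine: ~{single:.0f} min, 10 machines: ~{parallel:.0f} min")
```

The tenfold speedup holds only if the ten reads really happen independently and at the same time, which is the point of spreading the data over several disks.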
Processor trends
• CPU speeds are not growing exponentially
• Processors take less power
• Processors are able to do more in one cycle
Product Name | Intel® Core™ i7-920 Processor (8M Cache, 2.66 GHz, 4.80 GT/s Intel® QPI) | Intel® Core™ i7-6700K Processor (8M Cache, up to 4.20 GHz)
Code Name | Bloomfield | Skylake
Launch Date | Q4'08 | Q3'15
Lithography | 45 nm | 14 nm
Recommended Customer Price | BOX : $305.00 | BOX : $350.00
# of Cores | 4 | 4
# of Threads | 8 | 8
Processor Base Frequency | 2.66 GHz | 4 GHz
Max Turbo Frequency | 2.93 GHz | 4.2 GHz
TDP | 130 W | 91 W
Source - http://ark.intel.com/compare/88195,37147
To scale you need to use multiple CPUs/machines
Network speeds
• Gigabit - Speed: 1000 Mbps
• Size: 1 TB
• Transfer time: ~2 hours
Don’t move data unless you have to
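A quick check of the transfer-time estimate above, as a rough sketch. It assumes an ideal link running at its full rated speed with no protocol overhead; the function name is illustrative only.

```python
def transfer_hours(size_bytes, link_bits_per_second):
    """Hours to move size_bytes over a link, ignoring all overhead."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 3600


# 1 TB (decimal) over a 1000 Mbps link
hours = transfer_hours(10**12, 1000 * 10**6)
print(f"~{hours:.1f} hours")
```

Real transfers are usually slower than this ideal figure, which strengthens the case for running the computation where the data already lives.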
Example scenario
• Example that we will use to understand the problem
• Data on favorite beverage
• Calculate average cups consumed per day for each beverage
Input (name, beverage, cups per day):
Brianna, coffee, 3
Cameron, milk, 5
Thomas, milk, 4
Wyatt, coffee, 5
Output (beverage, average cups per day):
coffee, 4
milk, 4.5
Example – Single Threaded
Average cups consumed by tea drinkers is 3.33
Transform
Group by beverage
Summarize and display results
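The three steps (transform, group by beverage, summarize) might look like this in a single thread. This is a minimal sketch using the four sample records from the example scenario, not the code shown in the original slides; all function names are illustrative.

```python
from collections import defaultdict

# Sample records from the example scenario: name, beverage, cups per day
RECORDS = [
    "Brianna, coffee, 3",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, coffee, 5",
]


def transform(line):
    """Parse one record and keep only what the calculation needs."""
    _name, beverage, cups = [field.strip() for field in line.split(",")]
    return beverage, int(cups)


def average_cups(lines):
    """Group cup counts by beverage, then average each group."""
    groups = defaultdict(list)
    for line in lines:
        beverage, cups = transform(line)
        groups[beverage].append(cups)
    return {beverage: sum(c) / len(c) for beverage, c in groups.items()}


averages = average_cups(RECORDS)
for beverage, average in sorted(averages.items()):
    print(f"{beverage}, {average:g}")
```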
The problem of shared state
Can we avoid shared state?
Key idea – cooperating units
• Organize program into independent but cooperating units
• Programs need to be broken into a structure that will minimize the need for any shared state
• Cooperating units can work in parallel without sharing resources and cooperate as needed
Key idea – avoid shared state
Sum large list
Add list 1
Add list 2
Add list 3
Add and display sum
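The "sum large list" idea can be sketched as below: each worker adds up its own sublist and returns a partial sum, so no worker ever touches a shared running total. The helper names and the choice of three parts are illustrative. In CPython, threads will not speed up this CPU-bound work; the sketch shows the structure, not a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, parts):
    """Cut values into at most `parts` contiguous sublists."""
    if not values:
        return []
    size = -(-len(values) // parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, parts=3):
    """Sum each sublist independently, then add the partial sums."""
    chunks = split(values, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial_sums = list(pool.map(sum, chunks))  # no shared accumulator
    return sum(partial_sums)


total = parallel_sum(list(range(1, 101)))
print(total)
```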
How can we apply to our problem?
• Data can be split into blocks
• Each block of data can be processed by a thread
Stage 1 - input:
Brianna, coffee, 1
Cameron, milk, 5
Thomas, milk, 4
Wyatt, tea, 1
Victoria, coffee, 3
Grace, coffee, 4
David, tea, 4
Stage 1 - output:
coffee, 1
milk, 5
milk, 4
tea, 1
coffee, 3
coffee, 4
tea, 4
Stage 2 - output:
coffee, {1, 3, 4}
milk, {5, 4}
tea, {1, 4}
Stage 3 - output:
coffee, 2.67
milk, 4.5
tea, 2.5
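The staged flow above can be sketched with one thread per block of input. This is a minimal illustration under stated assumptions: the split into two blocks and the function names are hypothetical, and the grouping stage runs on a single thread after all blocks have been processed.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

RECORDS = [
    "Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4",
    "Wyatt, tea, 1", "Victoria, coffee, 3", "Grace, coffee, 4",
    "David, tea, 4",
]


def stage1(block):
    """Turn each record in one block into a (beverage, cups) pair."""
    pairs = []
    for line in block:
        _name, beverage, cups = [field.strip() for field in line.split(",")]
        pairs.append((beverage, int(cups)))
    return pairs


def stage2(pair_lists):
    """Collect the cup counts for each beverage across all blocks."""
    groups = defaultdict(list)
    for pairs in pair_lists:
        for beverage, cups in pairs:
            groups[beverage].append(cups)
    return dict(groups)


def stage3(groups):
    """Average each group, rounded to two decimals for display."""
    return {b: round(sum(c) / len(c), 2) for b, c in groups.items()}


blocks = [RECORDS[:4], RECORDS[4:]]  # assumed split into two blocks
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    mapped = list(pool.map(stage1, blocks))  # each block handled separately

grouped = stage2(mapped)
averages = stage3(grouped)
print(averages)
```

Each call to stage1 reads only its own block and returns a new list, so the threads have nothing to lock.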
The Akka Actor model
• Units can send and receive messages
• Mailbox
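To illustrate the mailbox idea, here is a toy actor built from the Python standard library. It is not Akka and does not use Akka's API; the class name, the message format and the `None` stop signal are all assumptions made for this sketch. The actor's state is private, and the only way to influence it is to put a message in its mailbox.

```python
import queue
import threading


class AveragingActor(threading.Thread):
    """Toy actor: owns its state and processes one message at a time."""

    def __init__(self):
        super().__init__()
        self.mailbox = queue.Queue()  # thread-safe message queue
        self._totals = {}             # private state: beverage -> (sum, count)
        self.result = None

    def send(self, message):
        """Deliver a message; the sender never touches the actor's state."""
        self.mailbox.put(message)

    def run(self):
        while True:
            message = self.mailbox.get()
            if message is None:       # assumed stop signal for this sketch
                break
            beverage, cups = message
            total, count = self._totals.get(beverage, (0, 0))
            self._totals[beverage] = (total + cups, count + 1)
        self.result = {b: t / n for b, (t, n) in self._totals.items()}


actor = AveragingActor()
actor.start()
for message in [("coffee", 1), ("milk", 5), ("milk", 4), ("tea", 1)]:
    actor.send(message)
actor.send(None)
actor.join()
print(actor.result)
```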
Implementation structured to avoid shared state
Implementation – Take 2
Implementation – Take 3
MapReduce Framework
Sorts, groups and sends data by key [Sort/Shuffle step]
The MapReduce framework
Preparation:
• Break files into blocks that can be processed independently
• Locate and use code to read each record
Map - input:
Brianna, coffee, 1
Cameron, milk, 5
Thomas, milk, 4
Wyatt, tea, 1
Victoria, coffee, 3
Grace, coffee, 4
David, tea, 4
Map - output:
coffee, 1
milk, 5
milk, 4
tea, 1
coffee, 3
coffee, 4
tea, 4
Sort/shuffle - output:
coffee, {1, 3, 4}
milk, {5, 4}
tea, {1, 4}
Reduce output:
coffee, 2.67
milk, 4.5
tea, 2.5
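The division of labor in this table can be sketched as a tiny in-memory driver: the "framework" part handles the sort/shuffle step, and the user supplies only a map function and a reduce function. This is a single-process illustration, not Hadoop's API; `run_mapreduce`, `cup_mapper` and `cup_reducer` are hypothetical names.

```python
from itertools import groupby
from operator import itemgetter

RECORDS = [
    "Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4",
    "Wyatt, tea, 1", "Victoria, coffee, 3", "Grace, coffee, 4",
    "David, tea, 4",
]


def run_mapreduce(records, mapper, reducer):
    """Framework side: map every record, sort and group by key, then reduce."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # sort/shuffle: bring equal keys together
    return [
        reducer(key, [value for _key, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


def cup_mapper(line):
    """User side: emit (beverage, cups) for one input record."""
    _name, beverage, cups = [field.strip() for field in line.split(",")]
    yield beverage, int(cups)


def cup_reducer(beverage, cups):
    """User side: average all cup counts collected for one beverage."""
    return beverage, round(sum(cups) / len(cups), 2)


results = run_mapreduce(RECORDS, cup_mapper, cup_reducer)
for beverage, average in results:
    print(f"{beverage}, {average}")
```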
Hadoop Distributed File System
• Files are split into large blocks
• Each block is stored on multiple nodes
• Namenode tracks block location
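A toy model of these three points: a file is cut into blocks, each block is assigned to several nodes, and a dictionary stands in for the Namenode's block-location table. The round-robin placement is an assumption chosen for simplicity and is not HDFS's actual placement policy. The 128 MB block size and replication factor of 3 match commonly used Hadoop defaults but are not taken from the slides, and the node names are made up.

```python
def place_blocks(file_size, block_size, nodes, replication):
    """Return {block_id: [nodes holding a replica]} using round-robin placement."""
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    block_count = -(-file_size // block_size)  # ceiling division
    block_map = {}
    for block_id in range(block_count):
        block_map[block_id] = [
            nodes[(block_id + offset) % len(nodes)]
            for offset in range(replication)
        ]
    return block_map


MB = 1024 * 1024
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

# A 300 MB file with 128 MB blocks needs three blocks, each stored three times.
block_map = place_blocks(300 * MB, 128 * MB, NODES, replication=3)
for block_id, holders in block_map.items():
    print(block_id, holders)
```

Because every block lives on more than one node, losing a single machine still leaves at least one readable copy of each block.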
Other aspects
• Framework does a lot of the heavy lifting
• Machines can fail
• Tasks can fail
• Stragglers
• Users just write the Map and Reduce functions
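As one illustration of the heavy lifting a framework takes on, the sketch below re-runs a task that fails. It covers only the "tasks can fail" bullet, not machine failure or stragglers, and it is not how Hadoop is implemented; the function, the `FlakyTask` class and the limit of four attempts are assumptions for this example.

```python
def run_with_retries(task, max_attempts=4):
    """Call task() until it succeeds; return (result, attempts used)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as error:  # a real framework would be more selective
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


class FlakyTask:
    """Simulated task that fails a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.remaining = failures

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise IOError("simulated task failure")
        return "done"


result, attempts = run_with_retries(FlakyTask(failures=2))
print(result, attempts)
```

Retrying is safe here because a map or reduce task that avoids shared state can be run again without side effects.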
Cup count demo – Apache Hadoop
• Demo
• Program is almost identical to what we wrote
Next steps
• Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr
• Read Google’s papers on MapReduce and GFS (the model for HDFS)
• http://research.google.com/archive/mapreduce.html
• http://research.google.com/archive/gfs.html
• Get familiar with Hadoop and Apache Spark
• Become familiar with functional programming
• Scala, F#, Clojure
• Check out Syncfusion’s free e-books on related topics
• If working with Windows, check out Syncfusion’s easy-to-use Big Data Platform - http://www.syncfusion.com/products/big-data
Related links
• http://www.syncfusion.com/products/big-data
• http://www.syncfusion.com/resources/techportal/ebooks
Thank you
Daniel Jebaraj
www.syncfusion.com
