MapReduce provides an effective framework for processing large datasets in a distributed environment. It addresses the challenges of storing and processing big data by breaking jobs into independent map and reduce tasks that can run in parallel across multiple machines without requiring shared memory or state. Each map task reads one block of the input and emits key-value pairs, which the framework sorts and groups by key before passing them to reduce tasks that generate the final output. This lets the work scale out across machines and tolerate machine and task failures.
4. Google GFS and MapReduce
• Google was dealing with a large amount of data over 10 years ago
• Documented experience in a series of papers
• The MapReduce programming model
• Google File System
• Scalable model that was implemented in Hadoop
5. Disk speeds
• Processing a 10 TB file on one machine
• Time – ~430 minutes
• Stored as 1 TB on each of 10 machines
• Time – ~43 minutes
To store data at scale you need to use multiple disks/machines
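The arithmetic behind these figures can be sketched as follows. The slide does not state a disk read speed; the roughly 390 MB/s used here is an assumption back-calculated from the ~430 minute figure, and `scan_minutes` is a hypothetical helper name.

```python
# Back-of-the-envelope scan time for a large file.
# ASSUMPTION: ~390 MB/s sequential read speed per machine. This value is
# inferred from the slide's ~430 minute figure, not stated in the source.
TB = 10**12  # bytes in a decimal terabyte
ASSUMED_READ_BYTES_PER_SEC = 390 * 10**6


def scan_minutes(total_bytes, machines=1, bytes_per_sec=ASSUMED_READ_BYTES_PER_SEC):
    """Minutes to read total_bytes when the data is spread evenly over
    `machines` that each read their own share in parallel."""
    per_machine_bytes = total_bytes / machines
    return per_machine_bytes / bytes_per_sec / 60


single = scan_minutes(10 * TB)
spread = scan_minutes(10 * TB, machines=10)
print(f"1 machine: ~{single:.0f} min, 10 machines: ~{spread:.0f} min")
```

Under this model the time shrinks linearly with the number of machines, because each machine reads only its own share and nothing is shared between them.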
6. Processor trends
• CPU speeds are not growing exponentially
• Processors take less power
• Processors are able to do more in one cycle
Product Name                 Intel® Core™ i7-920 Processor     Intel® Core™ i7-6700K Processor
                             (8M Cache, 2.66 GHz,              (8M Cache, up to 4.20 GHz)
                             4.80 GT/s Intel® QPI)
Code Name                    Bloomfield                        Skylake
Launch Date                  Q4'08                             Q3'15
Lithography                  45 nm                             14 nm
Recommended Customer Price   BOX: $305.00                      BOX: $350.00
# of Cores                   4                                 4
# of Threads                 8                                 8
Processor Base Frequency     2.66 GHz                          4 GHz
Max Turbo Frequency          2.93 GHz                          4.2 GHz
TDP                          130 W                             91 W
Source - http://ark.intel.com/compare/88195,37147
To scale you need to use multiple CPUs/machines
7. Network speeds
• Gigabit - Speed: 1000 Mbps
• Size: 1 TB
• Time: ~2 hours
Don’t move data unless you have to
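The ~2 hour figure follows from dividing the data size in bits by the link speed. A minimal sketch, assuming a decimal terabyte and a link that runs at its full rated speed with no protocol overhead (`transfer_hours` is a hypothetical helper name):

```python
def transfer_hours(size_bytes, link_bits_per_sec):
    """Hours to move size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_sec / 3600


ONE_TB = 10**12            # bytes, decimal terabyte (assumption)
GIGABIT = 1000 * 10**6     # 1000 Mbps expressed in bits per second

hours = transfer_hours(ONE_TB, GIGABIT)
print(f"Moving 1 TB over gigabit Ethernet takes about {hours:.1f} hours")
```

Real transfers are slower than this ideal figure, which strengthens the case for running the computation where the data already lives.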
8. Example scenario
• Example that we will use to understand the problem
• Data on favorite beverage
• Calculate average cups consumed per day for each beverage
Input:
Brianna, coffee, 3
Cameron, milk, 5
Thomas, milk, 4
Wyatt, coffee, 5
Output:
coffee, 4
milk, 4.5
9. Example – Single Threaded
Average cups consumed by tea drinkers is 3.33
Transform
Group by beverage
Summarize and display results
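The three steps can be sketched in a single thread as below. This is an illustrative Python version using the sample records from the example scenario, not the demo program from the talk; the function names `transform`, `group_by_beverage` and `summarize` are hypothetical.

```python
from collections import defaultdict

RECORDS = [
    "Brianna, coffee, 3",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, coffee, 5",
]


def transform(line):
    """Parse 'name, beverage, cups' into a (beverage, cups) pair."""
    _name, beverage, cups = [part.strip() for part in line.split(",")]
    return beverage, int(cups)


def group_by_beverage(pairs):
    """Collect the cup counts reported for each beverage."""
    groups = defaultdict(list)
    for beverage, cups in pairs:
        groups[beverage].append(cups)
    return groups


def summarize(groups):
    """Average the cup counts of each beverage."""
    return {beverage: sum(cups) / len(cups) for beverage, cups in groups.items()}


averages = summarize(group_by_beverage(transform(line) for line in RECORDS))
for beverage, average in sorted(averages.items()):
    print(f"{beverage}, {average:g}")
```

Everything runs in one loop on one machine, so the running time grows with the size of the input; the later slides restructure the same steps so they can run in parallel.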
11. Key idea – cooperating units
• Organize program into independent but cooperating units
• Programs need to be broken into a structure that will minimize the need for any shared state
• Cooperating units can work in parallel without sharing resources and cooperate as needed
12. Key idea – avoid shared state
Sum large list
Add list 1
Add list 2
Add list 3
Add and display sum
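A minimal sketch of this idea: the list is cut into three sub-lists, each sub-list is summed by its own worker, and only the three partial sums are combined at the end. The helper names are hypothetical. Threads are used here only to show the structure; in CPython they do not speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, parts):
    """Split values into up to `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(values) // parts))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values, parts=3):
    chunks = split(values, parts)
    # Each worker sums only its own chunk, so no locks or shared counters
    # are needed. The partial results are combined once at the end.
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


numbers = list(range(1, 1_000_001))
total = parallel_sum(numbers)
print(total)
```

The alternative, having every worker add into one shared total, would need a lock around each update and would serialize the workers on that lock.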
13. How can we apply to our problem?
• Data can be split into blocks
• Each block of data can be processed by a thread
Stage 1 - input        Stage 1 - output   Stage 2 - output    Stage 3 - output
Brianna, coffee, 1     coffee, 1          coffee, {1, 3, 4}   coffee, 2.67
Cameron, milk, 5       milk, 5            milk, {5, 4}        milk, 4.5
Thomas, milk, 4        milk, 4            tea, {1, 4}         tea, 2.5
Wyatt, tea, 1          tea, 1
Victoria, coffee, 3    coffee, 3
Grace, coffee, 4       coffee, 4
David, tea, 4          tea, 4
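The staged version can be sketched with one thread per block. The split of the seven records into two blocks is an assumption made for illustration, as are the function names `stage1`, `stage2` and `stage3`.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# ASSUMPTION: the block boundaries below are chosen for illustration only.
BLOCKS = [
    ["Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4", "Wyatt, tea, 1"],
    ["Victoria, coffee, 3", "Grace, coffee, 4", "David, tea, 4"],
]


def stage1(block):
    """Turn each 'name, beverage, cups' line into a (beverage, cups) pair."""
    pairs = []
    for line in block:
        _name, beverage, cups = [part.strip() for part in line.split(",")]
        pairs.append((beverage, int(cups)))
    return pairs


def stage2(blocks_of_pairs):
    """Group the cup counts from all blocks by beverage."""
    grouped = defaultdict(list)
    for pairs in blocks_of_pairs:
        for beverage, cups in pairs:
            grouped[beverage].append(cups)
    return dict(grouped)


def stage3(grouped):
    """Average each group, rounded to two decimals as on the slide."""
    return {b: round(sum(c) / len(c), 2) for b, c in grouped.items()}


# Stage 1 runs on every block independently, one thread per block.
with ThreadPoolExecutor(max_workers=len(BLOCKS)) as pool:
    stage1_output = list(pool.map(stage1, BLOCKS))

grouped = stage2(stage1_output)
averages = stage3(grouped)
print(averages)
```

Stage 1 needs no coordination because each thread touches only its own block; the grouping in stage 2 is the one point where results from different blocks meet.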
14. The Akka Actor model
• Units can send and receive messages
• Mailbox
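The mailbox idea can be sketched in plain Python. This is not the Akka API: the `Actor` class below is a hypothetical stand-in built on `queue.Queue` and a thread, meant only to show units that own their state and communicate through messages.

```python
import queue
import threading


class Actor:
    """A unit with a private mailbox, handling one message at a time."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self, handler):
        self._mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self._mailbox.put(message)

    def stop(self):
        """Let the actor finish its queued messages, then wait for it."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._handler(message)


totals = {}  # beverage -> (sum, count); touched only by the averager's thread


def accumulate(message):
    beverage, cups = message
    total, count = totals.get(beverage, (0, 0))
    totals[beverage] = (total + cups, count + 1)


averager = Actor(accumulate)


def parse_and_forward(line):
    _name, beverage, cups = [part.strip() for part in line.split(",")]
    averager.send((beverage, int(cups)))  # cooperate by message, not shared state


parser = Actor(parse_and_forward)
for line in ["Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4",
             "Wyatt, tea, 1", "Victoria, coffee, 3", "Grace, coffee, 4",
             "David, tea, 4"]:
    parser.send(line)

parser.stop()    # every line has been parsed and forwarded
averager.stop()  # every forwarded pair has been accumulated
averages = {b: round(t / n, 2) for b, (t, n) in totals.items()}
print(averages)
```

Because each actor processes its mailbox sequentially, the running totals need no lock even though two threads are active.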
17. Implementation – Take 3
MapReduce framework: sorts, groups and sends data by key [Sort/Shuffle step]
18. The MapReduce framework
Preparation
• Break files into blocks that can be processed independently
• Locate and use code to read each record

Map - input            Map - output   Sort/shuffle - output   Reduce - output
Brianna, coffee, 1     coffee, 1      coffee, {1, 3, 4}       coffee, 2.67
Cameron, milk, 5       milk, 5        milk, {5, 4}            milk, 4.5
Thomas, milk, 4        milk, 4        tea, {1, 4}             tea, 2.5
Wyatt, tea, 1          tea, 1
Victoria, coffee, 3    coffee, 3
Grace, coffee, 4       coffee, 4
David, tea, 4          tea, 4
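The division of labour can be sketched with a toy, single-process stand-in for the framework. `run_mapreduce` is a hypothetical function, not a Hadoop API; it only mirrors the map, sort/shuffle and reduce phases, while the user supplies the two small functions at the bottom.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, sort/shuffle, then reduce."""
    # Map: each record may emit zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Sort/shuffle: sort by key (stable), then group values sharing a key.
    mapped.sort(key=itemgetter(0))
    shuffled = [(key, [value for _key, value in group])
                for key, group in groupby(mapped, key=itemgetter(0))]
    # Reduce: one call per key, with all of that key's values.
    return [reducer(key, values) for key, values in shuffled]


# User code: only these two functions are specific to the problem.
def cup_mapper(record):
    _name, beverage, cups = [part.strip() for part in record.split(",")]
    yield beverage, int(cups)


def cup_reducer(beverage, cups):
    return beverage, round(sum(cups) / len(cups), 2)


RECORDS = [
    "Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4",
    "Wyatt, tea, 1", "Victoria, coffee, 3", "Grace, coffee, 4",
    "David, tea, 4",
]

result = run_mapreduce(RECORDS, cup_mapper, cup_reducer)
for beverage, average in result:
    print(f"{beverage}, {average}")
```

A real framework runs the map and reduce calls on many machines and moves the grouped data between them; the user-facing contract of a mapper and a reducer stays the same.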
19. Hadoop Distributed File System
• Files are split into large blocks
• Each block is stored on multiple nodes
• Namenode tracks block location
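The bookkeeping described above can be sketched as a toy model. The 128 MB block size and replication factor of 3 are assumptions for illustration (they match common Hadoop defaults, but the slides do not state them), `place_blocks` is a hypothetical function, and the round-robin placement is a simplification of the rack-aware placement that HDFS uses.

```python
import math

# ASSUMPTIONS: values chosen for illustration, not taken from the slides.
BLOCK_SIZE = 128 * 1024 * 1024  # bytes per block
REPLICATION = 3                 # copies kept of each block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [node, ...]}, the kind of map a namenode keeps.

    Replicas are assigned round-robin so that the copies of one block
    always land on different nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    block_count = math.ceil(file_size / block_size)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(block_count)
    }


nodes = ["node1", "node2", "node3", "node4"]
locations = place_blocks(1024 * 1024 * 1024, nodes)  # a 1 GiB file
for block, holders in locations.items():
    print(f"block {block}: {', '.join(holders)}")
```

With every block held on several nodes, a map task can be scheduled on a machine that already stores its input, and the loss of one machine does not lose data.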
20. Other aspects
• Framework does a lot of the heavy lifting
• Machines can fail
• Tasks can fail
• Stragglers
• Users just write the Map and Reduce functions
21. Cup count demo – Apache Hadoop
• Demo
• Program is almost identical to what we wrote
22. Next steps
• Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr
• Read Google’s papers on MapReduce and GFS (the basis for HDFS)
• http://research.google.com/archive/mapreduce.html
• http://research.google.com/archive/gfs.html
• Get familiar with Hadoop and Apache Spark
• Become familiar with functional programming
• Scala, F#, Clojure
• Check out Syncfusion’s free e-Books on related topics
• If working with Windows, check out Syncfusion’s easy-to-use Big Data Platform - http://www.syncfusion.com/products/big-data