An Introduction to Apache Hadoop MapReduce: what is it, and how does
it work? What is the MapReduce cycle, and how are jobs managed?
Why should it be used, and who are its big users and providers?
2. MapReduce – What is it?
● Processing engine of Hadoop
● Developers create Map and Reduce jobs
● Used for big data batch processing
● Parallel processing of huge data volumes
● Fault tolerant
● Scalable
3. MapReduce – Why use it?
● Your data is in the Terabyte/Petabyte range
● You have huge I/O volumes
● Hadoop framework takes care of
– Job and task management
– Failures
– Storage
– Replication
● You just write Map and Reduce jobs
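To make "you just write Map and Reduce jobs" concrete, here is a minimal sketch of the two functions a developer supplies for word counting. This is plain Python for illustration, not real Hadoop code; in practice these would be a Java Mapper/Reducer class pair, or scripts run under Hadoop Streaming, and the function names here are made up.

```python
def word_count_map(line):
    # Map step: one input line in, a list of (key, value) pairs out.
    # Every word is emitted with a count of 1.
    return [(word.lower(), 1) for word in line.split()]

def word_count_reduce(word, counts):
    # Reduce step: all values collected for one key in,
    # a single (key, total) pair out.
    return (word, sum(counts))
```

Everything else in the bullet list above (splitting, shuffling, scheduling, retries, storage) is the framework's job, not the developer's.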
4. MapReduce – How does it work?
Take word counting as an example, something that Google does
all the time.
5. MapReduce – How does it work?
● Input data is split into shards
● Each shard is mapped to (key, value) pairs, e.g. (Bear, 1)
● Mapped data is shuffled/sorted by key, e.g. Bear
● Sorted data is reduced per key, e.g. (Bear, 2)
● Final output is stored on HDFS
● An extra map (combiner) step may run before the shuffle
● The JobTracker coordinates all tasks in a job
● TaskTrackers run the individual map and reduce tasks
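The cycle above can be sketched end to end in plain Python. This is a simulation of the phases for illustration only, not Hadoop itself; the shard count and function name are made up, and a real cluster runs the map and reduce phases in parallel across machines.

```python
from collections import defaultdict

def run_word_count(text, num_shards=2):
    lines = text.splitlines()

    # 1. Split: divide the input lines into shards.
    shards = [lines[i::num_shards] for i in range(num_shards)]

    # 2. Map: each shard independently emits (word, 1) pairs.
    mapped = []
    for shard in shards:
        for line in shard:
            mapped.extend((word.lower(), 1) for word in line.split())

    # 3. Shuffle/sort: sort the pairs and group values by key.
    groups = defaultdict(list)
    for word, count in sorted(mapped):
        groups[word].append(count)

    # 4. Reduce: sum the values for each key.
    # (5. In Hadoop, this output would then be written to HDFS.)
    return {word: sum(counts) for word, counts in groups.items()}
```

For example, `run_word_count("Bear Car River\nCar Bear Bear")` yields `{"bear": 3, "car": 2, "river": 1}`.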
6. MapReduce – Some examples
A visual example, using colours, to show the cycle:
Split -> Map -> Shuffle -> Reduce
7. MapReduce – Some examples
A visual example of MapReduce with the JobTracker and TaskTrackers
attached to the individual map and reduce tasks.
9. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours you need to solve your problems