2. • Map-Reduce introduction
• Hadoop introduction
• Hadoop Application Architecture
• Developing a typical Hadoop Application
• Practice on Hadoop
Agenda
3. • A programming model specification from Google.
• Intended for processing terabyte (1,024 GB) and petabyte (1,024 TB) scale data.
• Breaks large or complex processing into smaller, independent pieces, modeled
as key-value pairs.
• Runs on a cluster of commodity machines.
• Scales by adding more workers, not bigger workers.
• Consists of two phases:
– Map: written by the user, takes an input pair and produces a set of
intermediate key/value pairs.
– Reduce: aggregates and collates the intermediate results.
– (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3>
(output), illustrated by the word-count sketch after this slide.
Map-Reduce concept
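A minimal, framework-free sketch of the two phases for the classic word-count problem (the class and helper signatures are illustrative only; Hadoop's actual API appears on a later slide):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class WordCountConcept {

    // map: <k1, v1> = <docId, line>  ->  list of <k2, v2> = <word, 1>
    static List<Map.Entry<String, Integer>> map(String docId, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));   // emit one intermediate pair per word
            }
        }
        return out;
    }

    // reduce: <k2, [v2, v2, ...]>  ->  <k3, v3> = <word, totalCount>
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;                          // aggregate the intermediate values
        }
        return Map.entry(word, sum);
    }
}
```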
6. • User program splits the input file into M pieces.
• One of the copies of the program is the master, the rest are the slaves.
• Master selects idle slaves and assigns a map or reduce task to each one
of them.
• Map slaves parse the input into key-value pairs and pass each pair to the
user-defined map function.
• Map slaves buffer the intermediate key-value pairs in memory and periodically
write them to local disk; these locations are reported back to the master.
• The master notifies the reduce slaves of the locations of the intermediate pairs.
• The reduce slaves fetch the intermediate pairs and sort them by key.
• Each intermediate key and its set of values is passed to the user-defined
reduce function.
• The reduce slaves run the reduce function and write the output for the user.
• When all tasks complete, the master returns the result and control to the user
program.
Map-reduce overall flow
7. • An open-source Apache project implementing the Map-Reduce model in Java.
• Distributed processing for large or computationally complex problems.
• Core tenets:
– Scale out, not up
– Move the processing to the data
– Expect and embrace failure
• Normally used for batch processing of massive data sets.
• Consists of two main parts:
– A distributed data store used for processing (HDFS).
– A parallel processing engine (the MapReduce APIs).
• Current main players: Amazon Elastic Map Reduce, Cloudera, MapR,
Hortonworks
Hadoop framework
9. • Used for staging data for Map-Reduce processing.
• A typical file in HDFS is gigabytes to terabytes in size.
• Divides large files into smaller blocks; the default block size is 64 MB.
• Structured like a conventional file system: files, directories, permissions.
• Supports Linux-like shell commands for interaction: ls, rm, put…
• Communicates via the TCP/IP protocol.
• Provides a Java-based API for access (a short sketch follows this slide).
Hadoop Distributed File System
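A minimal sketch of the Java API mentioned above, assuming an HDFS instance reachable through the standard core-site.xml/hdfs-site.xml configuration; the /tmp/demo path and file name are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to the configured file system

        Path dir = new Path("/tmp/demo");           // placeholder directory
        fs.mkdirs(dir);

        // Roughly what "hadoop fs -put" does: write a small file into HDFS
        try (FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"))) {
            out.writeUTF("hello hdfs");
        }

        // Roughly what "hadoop fs -ls /tmp/demo" prints
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
```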
12. • The client submits a job to Hadoop:
– The job consists of a Mapper, a Reducer, and a list of inputs.
– It is a collection of Java classes packaged into a JAR file (a driver sketch
follows this slide).
• The job is sent to the JobTracker process on the master node.
• Each slave node runs a process called the TaskTracker.
• The JobTracker instructs the TaskTrackers and monitors their progress.
• A map or reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.
Hadoop working model
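A minimal driver sketch of the job-submission flow described above, using the org.apache.hadoop.mapreduce API for the word-count example; WordCountMapper and WordCountReducer are hypothetical classes sketched after the programming-model slide:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);      // which JAR to ship to the cluster
        job.setMapperClass(WordCountMapper.class);     // hypothetical classes, sketched later
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir (must not exist)

        // Submits the packaged job to the cluster and waits for completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```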
14. • The Map-Reduce framework relies on the InputFormat of the job to:
– Validate the input-specification of the job.
– Split-up the input file(s) into logical InputSplits, each of which is then assigned to
an individual Mapper.
– Provide the RecordReader implementation to be used to glean input records
from the logical InputSplit for processing by the Mapper.
• Mapper tasks process the input records and emit intermediate key-value pairs
to the framework via the Mapper.Context.write(k, v) method (see the sketch
after this slide).
• Reduce reduces a set of intermediate values which share a key to a
smaller set of values and has 3 primary phases:
– Shuffle: copies the sorted output from each Mapper across the network
– Sort: sorts inputs by keys (since different Mappers may output the same key)
– Reduce: calls the user-defined reduce method for each key.
• Hadoop defines “box” (Writable) classes, such as Text for strings and
IntWritable for integers, to optimize serialization over the network.
Hadoop Programming model
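A sketch of the hypothetical WordCountMapper and WordCountReducer referenced by the driver above, showing intermediate pairs emitted through Context.write and the Text/IntWritable box classes:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits <word, 1> for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);           // intermediate <k2, v2>
            }
        }
    }
}

// Sums the values for each word after the shuffle and sort phases.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // final <k3, v3>
    }
}
```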
16. • Using Sqoop or Flume to import/export data between various external
data sources and HDFS for processing:
– The process is executed as map tasks in Hadoop.
– Works with both RDBMS and NoSQL sources.
– Sample: sqoop import --connect jdbc:mysql://localhost:3306/sqoop
--username root --password pass --table employees
• Using Apache Hive, a data warehouse software that facilitates querying
and managing large datasets:
– Organizes the data model as tables, rows, columns, and partitions.
– Supports data types such as integer, float, double, string, list, and struct.
– Supports join, group, and filter operations via built-in operators and
functions (a query sketch follows this slide).
• Using Spring Data to simplify developing Apache Hadoop applications:
– Create and configure applications that use MapReduce, Streaming, Hive,
Pig, or HBase.
– Integration with Spring Boot, using Dependency Injection…
Typical Hadoop Application Architecture
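A hedged sketch of querying Hive from Java over JDBC, assuming a HiveServer2 endpoint on localhost:10000 and an employees table such as the one imported by the Sqoop command above; the URL, credentials, and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; requires the Hive JDBC driver on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
            while (rs.next()) {
                // Print each department with its row count
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```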
18. • Choose appropriate frameworks for each application:
– Hive or Pig for logged/relational data
– Sqoop for working with databases; Flume for collecting log data from web
servers, because it is event driven.
– HDFS or HBase for storing temporary data for processing.
– The Crunch APIs for joins/aggregations rather than the raw Hadoop APIs.
• Apply best practices:
– Choose the number of mappers and reducers wisely: total mappers or reducers
= number of nodes × maximum number of tasks per node.
– Set the number of reducers to zero if you are not using them.
– Ensure each mapper processes an optimal amount of data.
– Always use a combiner where possible for local aggregation (see the sketch
after this slide).
– Minimize your mapper output.
– Always write unit tests and run them against a small data set.
Developing a typical Hadoop Application
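A small sketch of two of the best practices above (a combiner for local aggregation, zero reducers for map-only jobs), applied to the hypothetical word-count Job from the earlier driver sketch:

```java
import org.apache.hadoop.mapreduce.Job;

// Illustrative only: WordCountReducer is the hypothetical reducer sketched earlier.
public final class JobTuning {
    static void tune(Job job, boolean mapOnly) {
        if (mapOnly) {
            // Pure filtering/transformation jobs: skip shuffle, sort and reduce entirely.
            job.setNumReduceTasks(0);
        } else {
            // Aggregate each mapper's output locally before the shuffle,
            // minimizing what is sent over the network to the reducers.
            job.setCombinerClass(WordCountReducer.class);
        }
    }
}
```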
19. • Tune Hadoop using configuration parameters:
– Hadoop provides many parameters for tuning (a sketch of a few follows this
slide).
• What to do when a task fails:
– Failures happen regularly.
– Try again (retries are possible because tasks are idempotent).
– Report the failure.
• Slow tasks:
– Run another copy of the same task in parallel (speculative execution).
• Apply Java coding best practices.
Developing Typical Hadoop Application
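A hedged sketch of the failure-handling knobs above; the property names follow the Hadoop 2.x naming (older releases use mapred.* equivalents), so verify them against your cluster's version:

```java
import org.apache.hadoop.conf.Configuration;

public final class FailureTuning {
    static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 4);         // retries before a map task is declared failed
        conf.setInt("mapreduce.reduce.maxattempts", 4);      // same for reduce tasks
        conf.setBoolean("mapreduce.map.speculative", true);  // launch a backup copy of slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", true);
        return conf;
    }
}
```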
20. • Supports standalone, pseudo-distributed, and fully distributed modes.
• Implement a word count problem
• Debug a Hadoop program:
– Using log files
– Using remote debugging (a configuration sketch follows this slide)
Setup environment and practice
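A hedged sketch of the remote-debug option: have each child task JVM suspend and wait for a debugger. The property name is the classic mapred.child.java.opts (newer releases split it into mapreduce.map.java.opts and mapreduce.reduce.java.opts), and the port is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;

public final class RemoteDebugConfig {
    static Configuration configure() {
        Configuration conf = new Configuration();
        // Each task JVM suspends on start-up and waits for a debugger on port 8000.
        conf.set("mapred.child.java.opts",
                 "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000");
        return conf;
    }
}
```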