Hadoop, HDFS and MapReduce

Hadoop and
MapReduce
Friso van Vollenhoven
fvanvollenhoven@xebia.com

The workings of the elephant

Data everywhere

‣ Global data volume grows exponentially
‣ Information retrieval is BIG business these days
‣ Need means of economically storing and processing large data sets

Opportunity

‣ Commodity hardware is ultra cheap
‣ CPU and storage even cheaper

Traditional solution

‣ Store data in a (relational) database
‣ Run batch jobs for processing

Problems with existing solutions

‣ Databases are seek heavy; B-tree gives log(n) random accesses per update
‣ Seeks are wasted time, nothing of value happens during seeks
‣ Databases do not play well with commoditized hardware (SANs and 16 CPU
machines are not in the price sweet spot of performance / $)
‣ Databases were not built with horizontal scaling in mind

Solution: sort/merge vs. updating the B-tree

‣ Eliminate the seeks, only sequential reading / writing
‣ Work with batches for efficiency
‣ Parallelize work load
‣ Distribute processing and storage

History

‣ 2000: Apache Lucene: batch index updates and sort/merge with on disk index
‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge
optimization applies
‣ 2004: Google publishes GFS and MapReduce papers
‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve
Nutch’ problem; later becomes standalone project
‣ 2011: We’re here learning about it!

Hadoop foundations

‣ Commodity hardware (3K - 7K $ machines)
‣ Only sequential reads / writes
‣ Distribution of data and processing across cluster
‣ Built in reliability / fault tolerance / redundancy
‣ Disk based, does not require data or indexes to fit in RAM
‣ Apache licensed, Open Source Software

The US government
builds their ﬁnger print
search index using
Hadoop.

The contents for the People You May Know feature is
created by a chain of many MapReduce jobs that
run daily. The jobs are reportedly a combination of
graph traversal, clustering and assisted machine
learning.

Amazon’s Frequently Bought Together and Customers Who Bought This Item Also
Bought features are brought to you by MapReduce jobs. Recommendation
based on large sales transaction datasets is a much seen use case.

Top Charts
generated daily
based on millions
of users’ listening
behavior.

Top searches used for auto-completion are re-generated daily by a
MapReduce job using all searches for the past couple of days.
Popularity for search terms can be based on counts, but also trending
and correlation with other datasets (e.g. trending on social media,
news, charts in case of music and movies, best seller lists, etc.)

Hadoop
Filesystem

HDFS

HDFS overview

‣ Distributed filesystem
‣ Consists of a single master node and multiple (many) data nodes
‣ Files are split up blocks (typically 64MB)
‣ Blocks are spread across data nodes in the cluster
‣ Each block is replicated multiple times to different data nodes in the cluster
(typically 3 times)
‣ Master node keeps track of which blocks belong to a file

HDFS interaction

‣ Accessible through Java API
‣ FUSE (filesystem in user space) driver available to mount as regular FS
‣ C API available
‣ Basic command line tools in Hadoop distribution
‣ Web interface

HDFS interaction

‣ File creation, directory listing and other meta data actions go through the master
node (e.g. ls, du, fsck, create file)
‣ Data goes directly to and from data nodes (read, write, append)
‣ Local read path optimization: clients located on same machine as data node will
always access local replica when possible

Hadoop FileSystem (HDFS)
Name Node

/some/file /foo/bar
HDFS client
create ﬁle

read data
Date Node Date Node Date Node
write data
DISK DISK DISK

Node local
HDFS client
DISK DISK DISK

replicate
DISK DISK DISK

read data

HDFS daemons: NameNode

‣ Filesystem master node
‣ Keeps track of directories, files and block locations
‣ Assigns blocks to data nodes
‣ Keeps track of live nodes (through heartbeats)
‣ Initiates re-replication in case of data node loss

‣ Block meta data is held in memory
• Will run out of memory when too many files exist
‣ Is a SINGLE POINT OF FAILURE in the system
• Some solutions exist

HDFS daemons: DataNode

‣ Filesystem worker node / “Block server”
‣ Uses underlying regular FS for storage (e.g. ext3)
• Takes care of distribution of blocks across disks
• Don’t use RAID
• More disks means more IO throughput
‣ Sends heartbeats to NameNode
‣ Reports blocks to NameNode (on startup)
‣ Does not know about the rest of the cluster (shared nothing)

Things to know about HDFS

‣ HDFS is write once, read many
• But has append support in newer versions
‣ Has built in compression at the block level
‣ Does end-to-end checksumming on all data
‣ Has tools for parallelized copying of large amounts of data to other HDFS
clusters (distcp)
‣ Provides a convenient file format to gather lots of small files into a single large
one
• Remember the NameNode running out of memory with too many files?
‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used
for batch operations
• Optimized for sequential reads, not random access

Hadoop Sequence Files

‣ Special type of file to store Key-Value pairs
‣ Stores keys and values as byte arrays
‣ Uses length encoded bytes as format
‣ Often used as input or output format for MapReduce jobs
‣ Has built in compression on values

Example: command directory listing

friso@fvv:~/java$ hadoop fs -ls /
Found 3 items
drwxr-xr-x - friso supergroup 0 2011-03-31 17:06 /Users
drwxr-xr-x - friso supergroup 0 2011-03-16 14:16 /hbase
drwxr-xr-x - friso supergroup 0 2011-04-18 11:33 /user
friso@fvv:~/java$

Example: NameNode web interface

Example: copy local file to HDFS

friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json

MapReduce


Massively parallelizable
computing

MapReduce, the algorithm
Input data: Required output:

Map: extract something useful from each record
KEYS VALUES

map
void map(recordNumber, record) {
key = record.findColorfulShape();
map value = record.findGrayShapes();
emit(key, value);
map }
map

map

map

map

map

Framework sorts all KeyValue pairs by Key
KEYS VALUES KEYS VALUES

Reduce: process values for each key
KEYS VALUES KEYS VALUES

reduce

reduce

void reduce(key, values) {
reduce allGrayShapes = [];
foreach (value in values) {
allGrayShapes.push(value);
}
emit(key, allGrayShapes);
}

MapReduce, the algorithm

KEYS VALUES KEYS VALUES KEYS VALUES

map
reduce

map

map

reduce
map

map
reduce

map

map

map

Hadoop MapReduce: parallelized on top of HDFS

‣ Job input comes from files on HDFS
• Typically sequence files
• Other formats are possible; requires specialized InputFormat implementation
• Built in support for text files (convenient for logs, csv, etc.)
• Files must be splittable for parallelization to work
- Not all compression formats have this property (e.g. gzip)

MapReduce daemons: JobTracker

‣ MapReduce master node
‣ Takes care of scheduling and job submission
‣ Splits jobs into tasks (Mappers and Reducers)
‣ Assigns tasks to worker nodes
‣ Reassigns tasks in case of failure
‣ Keeps track of job progress
‣ Keeps track of worker nodes through heartbeats

MapReduce daemons: TaskTracker

‣ MapReduce worker process
‣ Starts Mappers en Reducers assigned by JobTracker
‣ Sends heart beats to the JobTracker
‣ Sends task progress to the JobTracker
‣ Does not know about the rest of the cluster (shared nothing)

Hadoop MapReduce: parallelized on top of HDFS

Hadoop MapReduce: Mapper side

‣ Each mapper processes a piece of the total input
• Typically blocks that reside on the same machine as the mapper (local
datanode)
‣ Mappers sort output by key and store it on the local disk
• If the mapper output does not fit in RAM, on disk merge sort happens

Hadoop MapReduce: Reducer side

‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
• Reducer performs (on disk) merge on inputs from different mappers
‣ Reducer calls the reduce method for each unique key
• List of values for each key is read from local disk (the result of the merge)
• Values do not need to fit in RAM
- Reduce methods that need a global view, need enough RAM to fit all values
for a key

‣ Reducer writes output KeyValue pairs to HDFS
• Typically blocks go to local data node

<PLUG>
Summer Classes
Big data crunching using Hadoop and other NoSQL tools
• Write Hadoop MapReduce jobs in Java
• Run on a actual cluster pre-loaded with several datasets
• Create a simple application or visualization with the result
• Learn about Hadoop without the hassle of building a production cluster ﬁrst
• Have lots of fun!

Dates: July 12, August 10
Only € 295,= for a full day course
http://www.xebia.com/summerclasses/bigdata

</PLUG>

Hadoop, HDFS and MapReduce

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (16)

Destacado

Destacado (20)

Similar a Hadoop, HDFS and MapReduce

Similar a Hadoop, HDFS and MapReduce (20)

Más de fvanvollenhoven

Más de fvanvollenhoven (9)

Último

Último (20)

Hadoop, HDFS and MapReduce