2. Topics of Talk
• What are MapReduce and Hadoop?
• When would you want to use them?
• How do they work?
• What does Hadoop do for you?
• How do you write MapReduce programs
to take advantage of that?
• What do we use them for at MyLife?
3. What are MapReduce
and Hadoop?
• MapReduce is a programming model for
parallel processing of large datasets
• An idea for how to write programs under
certain constraints
• Hadoop is an open-source implementation
of MapReduce
• Designed for clusters of commodity
machines
5. Background:
Disk vs. Memory
• Memory
• Where the computer
keeps data it’s
currently working on
• Fast response time,
random access
supported
• Expensive: typical size
in tens of GB
• Hard disk
• More permanent
storage of data for
future tasks
• Slow response time; practical
only for sequential access
• Cheap: typical size in
hundreds or
thousands of GB
6. Example Task on
Small Datasets
Public records (size: 8 MB)
ID      Public record
R1      Steve Jones, 36, 12 Main St, 10001
R2      John Brown, 72, 625 8th Ave, 90210
R3      James Davis, 23, 10 Broadway, 20202
R4      Tom Lewis, 45, 95 Park Pl, 90024
R5      Tim Harris, 33, PO Box 256, 33514
...     ...
R2000   Adam Parker, 59, 82 F St, 45454

Phone records (size: 3.5 MB)
ID      Phone number
P1      Robert White, 45121, (654) 321-4702
P2      David Johnson, 07470, (973) 602-2519
P3      Scott Lee, 23910, (602) 412-2255
P4      Steve Jones, 10001, (212) 347-3380
P5      John Wayne, 13284, (312) 446-8878
...     ...
P1000   Tom Lewis, 90024, (650) 945-2319
7. Real World:
Large Datasets
• 290 million public records = 380 GB
• 228 million phone records = 252 GB
• We could improve previous algorithm, but...
• The machine doesn’t have enough memory
• Would spend lots of time moving pieces of data
between disk and memory
• Disk is so slow, the task is now impractical
• What to do? Use Hadoop MapReduce!
• Divide into smaller tasks, run them in parallel
9. Components of the
Hadoop System
• Hadoop Distributed File System
(HDFS)
• Splits up files into blocks, stores
them on multiple computers
• Knows which blocks are on
each machine
• Transfers blocks between
machines over the network
• Replicates blocks, designed to
tolerate frequent machine
failures
• MapReduce engine
• Supports distributed
computation
• Programmer writes Map and
Reduce functions
• Engine takes care of
parallelization, so you can focus
on your work
10. The Map and
Reduce Functions
• map : (K1, V1) → List(K2, V2)
• Take an input record and produce (emit) a list of
intermediate (key, value) pairs
• reduce : (K2, List(V2)) → List(K3, V3)
• Examine the values for each intermediate key,
produce a list of output records
• Critical observation: output type of map ≠ input type
of reduce!
• What’s going on in between?
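The two signatures above can be sketched in plain Python. This is a minimal word-count illustration of the model, not code from the talk; the function names map_fn and reduce_fn are made up for the example:

```python
# Word count in the MapReduce model (plain Python, no Hadoop).
# map : (K1, V1) -> List(K2, V2)  -- here (doc_id, text) -> [(word, 1), ...]
# reduce : (K2, List(V2)) -> List(K3, V3)  -- here (word, [1, 1, ...]) -> [(word, count)]

def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word in the input record.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Sum the values grouped under each intermediate key.
    return [(word, sum(counts))]
```

Note that reduce_fn receives a *list* of values per key, which is exactly the type mismatch the slide points out: something between map and reduce must do the grouping.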
11. The “Magic”:
A Fast Parallel Sort
• The core of Hadoop MapReduce is a
distributed parallel sorting algorithm
• Hadoop guarantees that the input to each
reducer is sorted by key (K2)
• All the (K2, V2) pairs from the mappers
are grouped by key
• The reducer gets a list of values
corresponding to each key
12. Why Is It Fast?
• Imagine how you might sort a deck of cards
• The most intuitive procedure for humans is
very inefficient for computers
• Turns out the best algorithm, merge sort, is
less straightforward
• Split the data up into smaller pieces, sort
the pieces individually, then merge them
• Hadoop is using HDFS to do a giant parallel
merge sort over its cluster
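The split-sort-merge procedure described above is ordinary merge sort; a compact sketch, to make the divide-and-merge pattern concrete:

```python
def merge_sort(items):
    # Split the data in half, sort each half recursively, then merge the two
    # sorted halves -- the same pattern Hadoop applies across a whole cluster.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```

The merge step only ever reads each half sequentially from front to back, which is why the algorithm suits disks (and HDFS blocks) so well.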
13. Example Task
with MapReduce
• map : (source_id, record) → List(match_key, source_id)
• For each input record, select the fields to match by, make a
key out of them
• Use the record’s unique identifier as the value
• reduce : (match_key, List(source_id)) → List(public_record_id, phone_id)
• For each match key, look through the list of unique IDs
• If we find both a public record ID and a phone ID in the
same list, match!
• The profiles with these IDs share all fields in the key
• Generate the output pair of matched IDs
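The matching job above can be sketched as a pair of Python functions. This is a hypothetical rendering: the choice of (name, zip) as the match key, the dict record format, and the R/P prefixes on IDs are assumptions taken from the example tables, not MyLife's actual implementation:

```python
def match_map(source_id, record):
    # Build the match key from the fields we want records to agree on
    # (assumed here to be name and zip code), and emit the record's ID as the value.
    match_key = (record["name"], record["zip"])
    return [(match_key, source_id)]

def match_reduce(match_key, source_ids):
    # The IDs that reach one reduce call all share the same match key.
    # Pair every public-record ID (R...) with every phone ID (P...) in the group.
    public_ids = [s for s in source_ids if s.startswith("R")]
    phone_ids = [s for s in source_ids if s.startswith("P")]
    return [(r, p) for r in public_ids for p in phone_ids]
```

On the small example data, "Steve Jones" with zip 10001 appears as both R1 and P4, so those two IDs land in the same reduce group and come out as a match.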
14. Example Task on
Small Datasets
Public records (size: 8 MB)
ID      Public record
R1      Steve Jones, 36, 12 Main St, 10001
R2      John Brown, 72, 625 8th Ave, 90210
R3      James Davis, 23, 10 Broadway, 20202
R4      Tom Lewis, 45, 95 Park Pl, 90024
R5      Tim Harris, 33, PO Box 256, 33514
...     ...
R2000   Adam Parker, 59, 82 F St, 45454

Phone records (size: 3.5 MB)
ID      Phone number
P1      Robert White, 45121, (654) 321-4702
P2      David Johnson, 07470, (973) 602-2519
P3      Scott Lee, 23910, (602) 412-2255
P4      Steve Jones, 10001, (212) 347-3380
P5      John Wayne, 13284, (312) 446-8878
...     ...
P1000   Tom Lewis, 90024, (650) 945-2319
15. When is MapReduce
Appropriate?
• To benefit from using Hadoop:
• The data must be decomposable into many
(key, value) pairs
• Each mapper runs the same operation,
independently of other mappers
• Map output keys should split the values into groups
of roughly similar size, so no single reducer is overloaded
• Sequential algorithms that are more straightforward
may need redesign for the MapReduce model
16. Common Applications
of MapReduce
• Many common distributed tasks are easily
expressible with MapReduce. A few examples:
• Term frequency counting
• Pattern searching
• Of course, sorting
• Graph algorithms, such as reversal (Web links)
• Inverted index generation
• Data mining (clustering, statistics)
18. Applications of
MapReduce at MyLife
• We regularly run computations over large sets of
people data
• Who’s Searching For You
• Content-based aggregation pipeline (1.5 TB)
• Deltas of licensed data updates (300 GB)
• Generating search indexes for old platform
• Various ad hoc jobs involving matching, searching,
extraction, counting, de-duplication, and more
19. Hadoop Cluster
Specifications
• Currently 63 machines, each configured to run 4 or 6 map or
reduce tasks at once (total capacity 296)
• CPU:
• Each machine: 2x quad-core Opteron @ 2.2 GHz
• Memory:
• Each machine: 32 GB
• Cluster total: 2 TB
• Hard disk:
• Each machine: between 3 and 9 TB
• Total HDFS capacity: 345 TB
20. Other Companies
Using Hadoop
• Yahoo! - Index calculations for Web search
• Facebook - Analytics and machine learning
• World’s largest Hadoop cluster!
• Amazon - Supports Hadoop on EC2/S3 cloud services
• LinkedIn
• People You May Know
• Viewers of This Profile Also Viewed
• Apple - Used in iAds platform
• Twitter - Data warehousing and analytics
• Lots more... http://wiki.apache.org/hadoop/PoweredBy
21. Further Reading
• Google research papers
• Google File System, SOSP 2003
• MapReduce, OSDI 2004
• BigTable, OSDI 2006
• Hadoop manual: http://hadoop.apache.org/
• Other Hadoop-related projects from
Apache: Cassandra, HBase, Hive, Pig