This document summarizes Yandex's MapReduce implementation. It describes the Yandex Search Quality team's MapReduce deployment: 5 clusters with 2,000 hosts storing 10 PB of data. It also describes the structure of MapReduce tables, which consist of records and are stored as replicated chunks. The document outlines some of Yandex's MapReduce utilities, such as mr_cat, mr_cp and mr_grep, which work like the corresponding UNIX tools. It provides an overview of the Map and Reduce operations in Yandex MapReduce.
3. Yandex MapReduce
Search Quality Team:
• 5 clusters
• 2,000 hosts
• 10 PB data
• 3 TB new data a day (only user logs)
• 100 users
• 2,000,000 tables
4. Structure of MapReduce cluster
Host specification:
• 6 x 8 GB RAM
• 2 x 6-core Xeon CPU
• 4 x 2 GB HDD
• 1 Gb Ethernet
5. Yandex MapReduce Tables
• Table consists of a number of records
• Record is a key, subkey and value tuple
• Table consists of a number of chunks
• Size of chunk is 126 MB
• Each chunk has several replicas (usually 3)
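The table layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Record`, `chunk_count` and `raw_storage_bytes` are hypothetical, "MB" is read as 2^20 bytes, and the slides do not say how records are encoded or how the last, partially filled chunk is handled.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One table record: a (key, subkey, value) tuple, as on the slide."""
    key: bytes
    subkey: bytes
    value: bytes


# Figures taken from the slide; treating MB as 2**20 bytes is an assumption.
CHUNK_SIZE_BYTES = 126 * 1024 * 1024
USUAL_REPLICAS = 3


def chunk_count(table_size_bytes: int) -> int:
    """Number of chunks needed for a table, rounding the last chunk up."""
    return -(-table_size_bytes // CHUNK_SIZE_BYTES)  # integer ceiling division


def raw_storage_bytes(table_size_bytes: int, replicas: int = USUAL_REPLICAS) -> int:
    """Disk space consumed once every chunk is stored `replicas` times."""
    return table_size_bytes * replicas


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print(chunk_count(one_gib), "chunks,", raw_storage_bytes(one_gib), "bytes on disk")
```

With these assumptions, a 1 GiB table occupies 9 chunks and about 3 GiB of raw disk space.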
6. Sklad
• File system for MapReduce with minimal overhead
• Great name: Storehouse
7. netliba
• Tolerant congestion control algorithm for network traffic lets us increase the available network bandwidth
• UDP-based
• Reliable transmission
• Supports IPv6
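To illustrate what "reliable transmission" on top of an unreliable datagram service involves, here is a minimal stop-and-wait sketch over a simulated lossy channel. It is not netliba's protocol: the slides give no protocol details, the sketch does no congestion control, and `LossyChannel` and `send_reliably` are hypothetical names used only for this example.

```python
import random
from typing import List, Optional, Tuple

Datagram = Tuple[int, bytes]  # (sequence number, payload)


class LossyChannel:
    """Simulated unreliable path: drops each message with a fixed probability."""

    def __init__(self, loss_rate: float, seed: int = 0) -> None:
        self.loss_rate = loss_rate
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def deliver(self, message):
        """Return the message, or None if it was lost in transit."""
        if self._rng.random() < self.loss_rate:
            return None
        return message


def send_reliably(payloads: List[bytes], channel: LossyChannel,
                  max_attempts: int = 50) -> List[bytes]:
    """Stop-and-wait: resend each datagram until its acknowledgement arrives."""
    received: List[bytes] = []
    expected_seq = 0  # receiver state: next sequence number it will accept

    for seq, payload in enumerate(payloads):
        for _ in range(max_attempts):
            arrived: Optional[Datagram] = channel.deliver((seq, payload))
            if arrived is None:
                continue  # datagram lost: sender times out and retransmits

            # Receiver side: accept new data, silently drop duplicates.
            got_seq, got_payload = arrived
            if got_seq == expected_seq:
                received.append(got_payload)
                expected_seq += 1

            # The receiver acknowledges every datagram, duplicates included.
            ack = channel.deliver(got_seq)
            if ack is not None:
                break  # sender saw the ack and moves on to the next datagram
        else:
            raise TimeoutError(f"datagram {seq} not acknowledged")

    return received


if __name__ == "__main__":
    data = [b"chunk-0", b"chunk-1", b"chunk-2"]
    print(send_reliably(data, LossyChannel(loss_rate=0.3, seed=1)))
```

Sequence numbers let the receiver discard duplicates caused by lost acknowledgements, so the payloads arrive exactly once and in order despite the losses.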
8. mr_apps
UNIX-like toolset
MapReduce util   Description
mr_cat           cat - merge tables
mr_cp            cp - copy tables
mr_diff          diff - compare tables
mr_du            du - display disk usage statistics
mr_grep          grep - display records matching a pattern
mr_head          head - print top records
mr_ls            ls - print list of tables
mr_mv            mv - move tables
mr_wc            wc - print number of keys or records
mr_hist          print keys distribution
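The behaviour of a few of these utilities can be approximated on small in-memory tables. The sketch below is an assumption-laden illustration, not the real tools: the function names are hypothetical, the slides show no command-line syntax, and matching the pattern against all three fields in `grep` is a guess.

```python
import re
from collections import Counter
from typing import List, NamedTuple, Tuple


class Record(NamedTuple):
    key: str
    subkey: str
    value: str


Table = List[Record]


def cat(*tables: Table) -> Table:
    """Like mr_cat: merge several tables into one."""
    merged: Table = []
    for table in tables:
        merged.extend(table)
    return merged


def grep(table: Table, pattern: str) -> Table:
    """Like mr_grep: keep records where any field matches (an assumption)."""
    rx = re.compile(pattern)
    return [r for r in table if any(rx.search(field) for field in r)]


def head(table: Table, n: int = 10) -> Table:
    """Like mr_head: return the first n records."""
    return table[:n]


def wc(table: Table) -> Tuple[int, int]:
    """Like mr_wc: return (number of distinct keys, number of records)."""
    return len({r.key for r in table}), len(table)


def hist(table: Table) -> Counter:
    """Like mr_hist: count how many records each key has."""
    return Counter(r.key for r in table)


if __name__ == "__main__":
    day1 = [Record("yandex.ru", "1", "hit"), Record("ya.ru", "2", "miss")]
    day2 = [Record("yandex.ru", "3", "hit")]
    logs = cat(day1, day2)
    print(wc(logs), hist(logs).most_common(1), grep(logs, r"miss"))
```

On a real cluster each of these would run as a distributed job over table chunks; the in-memory lists here only show the intended semantics.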
14. Yandex.Tables (YT)
New Generation of MapReduce
• Tables have flexible structure:
– Custom columns
– Composite keys
– Column selection when reading tables
• Triple masters: no single point of failure
• New tools for monitoring
• New API
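The flexible table structure listed above (custom columns, composite keys, column selection on read) can be pictured with plain Python dictionaries. This is a conceptual sketch only and does not use or reproduce the YT API; `composite_key`, `sort_by_key` and `read_columns` are hypothetical helper names.

```python
from typing import Any, Dict, Iterable, Iterator, List, Sequence, Tuple

Row = Dict[str, Any]  # custom columns: each row maps column name to value


def composite_key(row: Row, key_columns: Sequence[str]) -> Tuple[Any, ...]:
    """Build a composite key from several columns of one row."""
    return tuple(row[column] for column in key_columns)


def sort_by_key(rows: Iterable[Row], key_columns: Sequence[str]) -> List[Row]:
    """Order rows by their composite key, column by column."""
    return sorted(rows, key=lambda row: composite_key(row, key_columns))


def read_columns(rows: Iterable[Row], columns: Sequence[str]) -> Iterator[Row]:
    """Column selection: yield only the requested columns of each row."""
    for row in rows:
        yield {column: row[column] for column in columns if column in row}


if __name__ == "__main__":
    table = [
        {"host": "b", "ts": 2, "clicks": 5, "query": "maps"},
        {"host": "a", "ts": 3, "clicks": 1},
        {"host": "a", "ts": 1, "clicks": 7},
    ]
    ordered = sort_by_key(table, ["host", "ts"])
    for row in read_columns(ordered, ["host", "clicks"]):
        print(row)
```

Compared with the fixed key, subkey and value layout of the older tables, a reader that needs only two columns does not have to parse the rest of each row.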