MapReduce: A Brief Introduction and Exercises

MAP REDUCE
https://goo.gl/dSsBqp
May 21, 2015

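Big Data
■ Google processes the web
■ 20+ billion web pages x 20 KB = 400+ TB
■ A single computer reads from disk at 30-35 MB/sec, so reading the whole web would take about 4 months
■ Storing it takes about 200 disks of 2 TB each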
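The figures above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 KB = 1000 bytes) and a perfectly sustained read rate, neither of which the slide states.

```python
# Rough check of the slide's numbers (decimal units are an assumption).
PAGES = 20e9             # 20 billion web pages
PAGE_SIZE = 20e3         # 20 KB per page, in bytes
DISK_SIZE = 2e12         # 2 TB per disk, in bytes
SECONDS_PER_DAY = 86400

total_bytes = PAGES * PAGE_SIZE


def days_to_read(total, mb_per_sec):
    """Days needed to read `total` bytes at a sustained rate in MB/sec."""
    return total / (mb_per_sec * 1e6) / SECONDS_PER_DAY


print("total size (TB):", total_bytes / 1e12)
print("2 TB disks needed:", total_bytes / DISK_SIZE)
print("days at 35 MB/sec:", days_to_read(total_bytes, 35))
print("days at 30 MB/sec:", days_to_read(total_bytes, 30))
```

Under these assumptions the read time comes out between roughly four and five months, which is the order of magnitude the slide quotes.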

Big Data
■ Distributed computation, distributed storage
■ As of 2011, Google had 1M machines

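MapReduce
■ Challenges
■ How do we distribute the computation and the data?
■ Writing distributed/parallel programs is hard
■ MapReduce addresses these problems
■ A computation and data-processing model designed by Google
■ Applies simply and elegantly to large-scale data processing, so writing parallel programs is no longer hard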

Example: Word Count
$ cat data | tr -sc 'A-Za-z' '\n' | sort | uniq -c
...
645 and
2 animation
3 annotations
3 answers
1 anticipated
19 any
2 anymore
2 anything
39 apos
...

Example: Word Count
$ cat data
...
On January 2, 1985, Zaman Akil sent the Academy of Scie
At his request, the Perpetual Secretary of the Academy,
I was the only one who agreed to discuss it with the au
...

Example: Word Count
$ cat data | tr -sc 'A-Za-z' '\n'
...
Germain
sent
the
letter
to
several
members
of
...

Example: Word Count
$ cat data | tr -sc 'A-Za-z' '\n' | sort
...
Also
Also
Also
Although
Although
America
American
American
Among
An
An

Example: Word Count
$ cat data | tr -sc 'A-Za-z' '\n' | sort | uniq -c
...
645 and
2 animation
3 annotations
3 answers
1 anticipated
19 any
2 anymore
2 anything
39 apos
...

WordCount Functions
# emit() is supplied by the MapReduce framework
def map(key, value):
    # key: unused; value: a line of input text
    for word in value.split():
        emit(word, 1)

def reduce(key, values):
    # key: word; values: an iterator over counts
    result = 0
    for count in values:
        result += count
    emit(key, result)
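To see how the two functions fit together, here is a minimal single-process sketch of the map, group-by-key, and reduce steps. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and generators stand in for the framework's `emit`; a real framework would spread this work across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: unused; value: one line of input text
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    # key: a word; values: all counts emitted for that word
    yield key, sum(values)


def run_mapreduce(lines):
    """Simulate map -> group by key -> reduce in one process."""
    groups = defaultdict(list)
    for lineno, line in enumerate(lines):
        for k, v in map_fn(lineno, line):
            groups[k].append(v)
    results = {}
    for k in sorted(groups):
        for out_key, out_value in reduce_fn(k, groups[k]):
            results[out_key] = out_value
    return results


counts = run_mapreduce(["the cat sat", "the dog sat", "the end"])
print(counts)
```

The grouping dictionary plays the role of the shuffle step: it guarantees that each call to `reduce_fn` sees every value emitted for one key.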

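MapReduce Implementations
■ Hadoop
■ 1999: Doug Cutting develops the open-source search engine software Apache Lucene and Nutch
■ 2004: Google discloses how its search engine works: MapReduce + the Google File System
■ 2004: Doug Cutting develops Hadoop to work with Lucene and Nutch (sponsored by Yahoo)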

MapReduce Implementations
■ Spark
■ Designed for performance
■ APIs for Scala, Java, and Python
■ Disco: MapReduce framework for Python
■ MapReduce-MPI: for distributed-memory parallel
machines on top of standard MPI message passing
■ Meguro: a simple Javascript Map/Reduce framework
■ bashreduce : mapreduce in a bash script
■ localmapreduce

Hadoop Distributions & Services
Apache page of Distributions
■ Amazon Web Services
■ Apache Bigtop
■ Cascading
■ Cloudera
■ Datameer
■ Hortonworks
■ IBM InfoSphere BigInsights
■ MapR Technologies
■ Syncsort
■ Tresata
■ ...

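Hadoop Streaming
■ Passes data through stdin and stdout, much like a shell pipeline.
■ Any command that runs in a shell can serve as the mapper or reducer, so MapReduce jobs can be written in any language.
Mapper
■ Input: each whole line of the file is the value (default)
■ Output: everything before the first tab is the key; the rest is the value
Reducer
■ Input: everything before the first tab is the key; the rest is the value. Lines with the same key arrive consecutively.
■ Output: each line is printed as-is (default)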
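The line format described above can be sketched in a few lines of Python. This is an illustrative word-count example, not Hadoop's own code: `mapper`, `reducer`, and `run_locally` are hypothetical names, and a key-based `sorted()` stands in for the framework's sort step so that the whole thing runs in one process.

```python
from itertools import groupby


def mapper(lines):
    """Turn raw text lines into 'key<TAB>value' lines."""
    for line in lines:
        for word in line.split():
            yield "{}\t{}".format(word, 1)


def reducer(lines):
    """Sum the values of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "{}\t{}".format(key, sum(int(v) for _, v in group))


def run_locally(lines):
    # Sorting by the text before the first tab mimics the shuffle/sort step.
    shuffled = sorted(mapper(lines), key=lambda l: l.split("\t", 1)[0])
    return list(reducer(shuffled))


output = run_locally(["b a", "a c a"])
print("\n".join(output))
```

In an actual streaming job the two functions would live in separate scripts that read `sys.stdin` and print to stdout, and the framework would perform the sort between them.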

Hands-On (lmr)
Downloads
■ ngramcount.py: https://goo.gl/wZ41MH
■ localmapreduce: https://goo.gl/8UlChs
■ citeseerx.40000: https://goo.gl/RmbfYm
Mac setup
■ Install Homebrew http://brew.sh/
■ Install GNU parallel
brew install parallel

Hands-On (lmr)
■ Test ngramcount.py with a pipe
$ head -100 citeseerx.40000 | ./ngramcount.py -m |
> sort -k1,1 -t$'\t' | ./ngramcount.py -r | less
■ Run it distributed with lmr (localmapreduce)
$ pv citeseerx.40000 |
> lmr 300k 20 './ngramcount.py -m' './ngramcount.py -r' out
$ ls out
reducer-00 reducer-02 reducer-04 reducer-06 reducer-08 reducer-10 re
reducer-01 reducer-03 reducer-05 reducer-07 reducer-09 reducer-11 re
$ less out/*

Ngram Count I
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals, print_function


def ngrams(words):
    # Yield every n-gram of length 1 to 5 as a tuple of words
    for length in range(1, 5 + 1):
        for ngram in zip(*(words[i:] for i in range(length))):
            yield ngram


def mapper(line):
    # from nltk.tokenize import word_tokenize
    # words = word_tokenize(line.lower())
    import re
    words = re.findall(r'[a-z]+', line.lower())
    for ngram in ngrams(words):

Ngram Count II
        yield ' '.join(ngram), 1


def reducer(key, values):
    count = sum(int(v) for v in values)
    yield key, count


def do_mapper(files):
    import fileinput
    for line in fileinput.input(files):
        for key, value in mapper(line):
            print('{}\t{}'.format(key, value))


def line_to_keyvalue(line):
    # Decode only when given bytes, so this runs on Python 2 and 3
    if isinstance(line, bytes):
        line = line.decode('utf8')
    key, value = line.split('\t', 1)
    return key, value

Ngram Count III
def do_reducer(files):
    import fileinput
    from itertools import groupby
    # A generator expression stays lazy on both Python 2 and 3
    keyvalues = (line_to_keyvalue(line) for line in fileinput.input(files))
    for key, grouped_keyvalues in groupby(keyvalues,
                                          key=lambda x: x[0]):
        values = (v for k, v in grouped_keyvalues)
        for key, value in reducer(key, values):
            print('{}\t{}'.format(key, value))


def argparser():
    import argparse
    parser = argparse.ArgumentParser(description='N-gram counter')
    mode_group = parser.add_mutually_exclusive_group(required=True)
    mode_group.add_argument(
        '-r', '--reducer', action='store_true', help='reducer mode')
    mode_group.add_argument(

Ngram Count IV
        '-m', '--mapper', action='store_true', help='mapper mode')
    parser.add_argument('files', metavar='FILE', type=str, nargs='*',
                        help='input files')
    return parser.parse_args()


if __name__ == '__main__':
    args = argparser()
    if args.mapper:
        do_mapper(args.files)
    elif args.reducer:
        do_reducer(args.files)
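The `zip` idiom inside `ngrams` is compact enough to deserve a small demonstration. The sketch below repeats the function with one change that is not in the original script: a hypothetical `max_n` parameter, added only so the output stays short.

```python
def ngrams(words, max_n=5):
    """Yield every n-gram of length 1..max_n as a tuple of words."""
    for length in range(1, max_n + 1):
        # words[0:], words[1:], ... zipped together form sliding windows;
        # zip stops at the shortest slice, so no partial window is produced.
        for ngram in zip(*(words[i:] for i in range(length))):
            yield ngram


words = "to be or not".split()
grams = [" ".join(g) for g in ngrams(words, max_n=2)]
print(grams)
```

For a four-word input this yields the four single words followed by the three adjacent pairs.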

Modification Exercise
Modify ngramcount.py
■ Generate only 1-grams and 2-grams
■ Keep only n-grams with count > 5

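Exercise: A Linguistic Search Engine
■ Build a search engine for looking up English word usage.
■ It can assist English learning and writing.
Search examples
■ adj. beach: search for adjectives that appear before beach.
■ play * role: search for the most common word sequences between play and role.
■ go ?to home: whether to is needed between go and home.
■ go * movie: search for the most common word sequences between go and movie.
■ kill the _: what most often gets killed.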

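Query Syntax Design
Syntax                   Description
_                        any single word
*                        zero or more arbitrary words
?term                    term is optional
term1 | term2            term1 or term2
adj. det. n. v. prep.    adjective, determiner (article), noun, verb, preposition
Search examples
■ adj. beach: search for adjectives that appear before beach.
■ play * role: search for the most common word sequences between play and role.
■ go ?to home: whether to is needed between go and home.
■ go * movie: search for the most common word sequences between go and movie.
■ kill the _: what most often gets killed.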

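Linguistic Search Engine
■ Goal: implement the first syntax item, _
■ _ may appear at any position
■ N-grams are at most 4 words long
Query examples
■ play _ _ role
■ kill the _
■ a _ beach
■ Input data: many sentences from citeseerx
■ Output:
■ key: every query that has results
■ value: the top 100 n-grams matching the query, with their counts.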

Linguistic Search Engine - Output
■ key: every query that has results
■ value: the top 100 n-grams matching the query, with their counts.
Sample output
Key           Ngram                Count
a _ beach     a sandy beach        486
              a private beach      416
              a beautiful beach    314
              a small beach        175
              ...
kill the _    kill the people      189
              kill the other       174
              kill the process     163
              kill the enemy       160
              ...

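In-Class Exercise
Goal
■ Following the MapReduce architecture, design the input and output of the mapper and reducer at each stage to complete Lab 12
■ Writing simple example input and output key-value pairs on paper is enough to convey the idea
Hints
■ The pipeline may have one or more map/reduce stages
■ Consider how the splitting of the mapper's input data affects the result
■ Mapper input is a value or a key-value pair; output is key-value pairs
■ Reducer input is grouped key-values; output is key-value pairs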

Bi-gram Count
Mapper example
Input (value)    Output (key => value)
C D C D          C D => 2
                 D C => 1
B C D A          B C => 1
                 C D => 1
                 D A => 1
C D A B          C D => 1
                 D A => 1
                 A B => 1
Reducer example
Input (key => value)    Output (key => value)
A B => 1                A B => 1
B C => 1                B C => 1
C D => 2                C D => 4
C D => 1
C D => 1
D A => 1                D A => 2
D A => 1
D C => 1                D C => 1
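The tables above can be reproduced with a short script. This is a sketch under the assumption that the mapper pre-aggregates bigrams within each line (which is what the `C D => 2` row suggests); the function names are illustrative.

```python
from collections import Counter, defaultdict


def mapper(line):
    """Count each bigram within one line, e.g. 'C D C D' -> C D: 2, D C: 1."""
    words = line.split()
    return sorted(Counter(zip(words, words[1:])).items())


def reducer(key, values):
    """Sum the per-line counts collected for one bigram."""
    return key, sum(values)


lines = ["C D C D", "B C D A", "C D A B"]

# Group mapper output by key, as the shuffle step would.
grouped = defaultdict(list)
for line in lines:
    for bigram, count in mapper(line):
        grouped[bigram].append(count)

totals = dict(reducer(k, v) for k, v in grouped.items())
for bigram in sorted(totals):
    print(" ".join(bigram), "=>", totals[bigram])
```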

Linguistic Search Engine
Mapper example
Input (value)    Output (key => value)
A B C 200        A B C => A B C 200
                 _ B C => A B C 200
                 A _ C => A B C 200
                 A B _ => A B C 200
                 _ _ C => A B C 200
                 _ B _ => A B C 200
                 A _ _ => A B C 200
A D C 300        _ D C => A D C 300
                 A _ C => A D C 300
                 ...
A E C 100        _ E C => A E C 100
                 A _ C => A E C 100
                 ...
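One way to generate the query keys shown above is to try every combination of "keep the word" and "replace it with `_`", skipping the combination that hides every word. The sketch below is one possible approach, not the reference solution; `queries` and `mapper` are hypothetical names, and the input is assumed to be an n-gram followed by whitespace and its count.

```python
from itertools import product


def queries(ngram):
    """Yield each query formed by replacing some (not all) words with '_'."""
    words = ngram.split()
    for mask in product([False, True], repeat=len(words)):
        if all(mask):
            continue  # a query made only of wildcards is skipped
        yield " ".join("_" if hide else w for w, hide in zip(words, mask))


def mapper(line):
    # The last whitespace-separated field is the count; the rest is the n-gram.
    ngram, count = line.rsplit(None, 1)
    for query in queries(ngram):
        yield query, (ngram, int(count))


for key, value in mapper("A B C 200"):
    print(key, "=>", value)
```

A 3-gram produces 2^3 - 1 = 7 keys, matching the seven rows listed for `A B C 200`.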

Linguistic Search Engine
Reducer example
Input (key => value)    Output (key => value)
A _ C => A B C 200      A _ C => A D C 300,
A _ C => A D C 300               A B C 200,
A _ C => A E C 100               A E C 100
A B _ => A B C 200      A B _ => A B C 200
A D _ => A D C 300      A D _ => A D C 300
A E _ => A E C 100      A E _ => A E C 100
A _ _ => A B C 200      A _ _ => A D C 300,
A _ _ => A D C 300               A B C 200,
A _ _ => A E C 100               A E C 100
_ B C => A B C 200      _ B C => A B C 200
_ D C => A D C 300      _ D C => A D C 300
_ E C => A E C 100      _ E C => A E C 100
...                     ...
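The reducer step in the table amounts to ranking the n-grams collected for one query by count and keeping the best ones. A minimal sketch using `heapq.nlargest` from the standard library follows; the `top` parameter and the tuple layout are assumptions made for illustration.

```python
import heapq


def reducer(query, values, top=100):
    """Keep the `top` highest-count (ngram, count) pairs for one query."""
    best = heapq.nlargest(top, values, key=lambda pair: pair[1])
    return query, best


values = [("A B C", 200), ("A D C", 300), ("A E C", 100)]
query, best = reducer("A _ C", values)
print(query, "=>", ", ".join("{} {}".format(n, c) for n, c in best))
```

`heapq.nlargest` accepts any iterable, so the values can be streamed in without holding more than `top` candidates in its internal heap.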

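Homework
Six programs to complete
■ Mapper and reducer that produce the n-gram counts
■ Mapper and reducer that produce the query results
■ A program that converts the query results into a database (try Python's built-in shelve or sqlite3 module)
■ A database front-end program that lets users enter a query and get results immediately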

python shelve
import shelve
d = shelve.open('data.shelve')
d['odds'] = [1, 3, 5, 7, 9]
print(d['odds'])
d['evens'] = [2, 4, 6, 8, 10]
d['hello'] = 'world'
del d['hello']
d['zipcodes'] = {'hsinchu': 300, 'zhongli': 320}
print(list(d.keys()))
d.close()
Search for "python shelve" to find the official documentation
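As a starting point for the database part of the homework, the sketch below stores query results in a shelf and looks them up again. It is one possible layout, not a required design: the file name, the `lookup` helper, and the choice of storing a list of (ngram, count) tuples per query are all assumptions. The sample data comes from the output slide.

```python
import os
import shelve
import tempfile

# A few rows from the sample output, keyed by query.
results = {
    "a _ beach": [("a sandy beach", 486), ("a private beach", 416)],
    "kill the _": [("kill the people", 189), ("kill the other", 174)],
}

# Write to a temporary directory so the example leaves no files behind in cwd.
path = os.path.join(tempfile.mkdtemp(), "queries.shelve")

db = shelve.open(path)
for query, ngrams in results.items():
    db[query] = ngrams
db.close()


def lookup(query):
    """Return the stored n-grams for a query, or an empty list if unknown."""
    db = shelve.open(path)
    try:
        return db.get(query, [])
    finally:
        db.close()


print(lookup("a _ beach"))
print(lookup("no such query"))
```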
