Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 6
                 September 29, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
        Course design and slides based on
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Today’s Agenda
• Automatic Spelling Correction
  – Review: Information Retrieval (IR)
     • Boolean Search
     • Vector Space Modeling
     • Inverted Indexing in MapReduce
  – Probabilistic modeling via noisy channel
• Index Compression
  – Order inversion in MapReduce
• In-class exercise
• Hadoop: Pipelined & Chained jobs
Automatic Spelling Correction
Automatic Spelling Correction
   Three main stages
       Error detection
       Candidate generation
       Candidate ranking / choose best candidate



   Usage cases
       Flagging possible misspellings / spell checker
       Suggesting possible corrections
       Automatically correcting (inferred) misspellings
         •   “as you type” correction
         •   web queries
         •   real-time closed captioning
         •   …
Types of spelling errors
    Unknown words: “She is their favorite acress in town.”
        Can be identified using a dictionary…
        …but could be a valid word not in the dictionary
        Dictionary could be automatically constructed from large corpora
          • Filter out rare words (misspellings, or valid but unlikely)…
          • Why filter out rare words that are valid?
    Unknown words violating phonotactics:
        e.g. “There isn’t enough room in this tonw for the both of us.”
        Given dictionary, could automatically construct “n-gram dictionary”
         of all character n-grams known in the language
          • e.g. English words don’t end with “nw”, so flag tonw
    Incorrect homophone: “She drove their.”
        Valid word, wrong usage; infer appropriateness from context
    Typing errors reflecting kayout of leyboard
Candidate generation
   How to generate possible corrections for acress?
   Inspiration: how do people do it?
       People may suggest words like actress, across, access, acres,
        caress, and cress – what do these have in common?
       What about “blam” and “zigzag”?
   Two standard strategies for candidate generation
       Minimum edit distance
         • Generate all candidates within 1+ edit step(s)
             • Possible edit operations: insertion, deletion, substitution, transposition, …
         • Filter through a dictionary
          • See Peter Norvig’s post: http://norvig.com/spell-correct.html (a Java sketch of this strategy appears below)


       Character ngrams: see next slide…
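To make the edit-distance strategy concrete, here is a minimal Java sketch in the spirit of Norvig's post (an illustration, not the slides' code): generate every string within one edit of the typo, then keep only those found in a supplied dictionary.

import java.util.HashSet;
import java.util.Set;

public class EditCandidates {
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // All strings within one edit (deletion, transposition, substitution, insertion) of w
    static Set<String> edits1(String w) {
        Set<String> edits = new HashSet<>();
        for (int i = 0; i < w.length(); i++)                       // deletions
            edits.add(w.substring(0, i) + w.substring(i + 1));
        for (int i = 0; i < w.length() - 1; i++)                   // transpositions
            edits.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
        for (int i = 0; i < w.length(); i++)                       // substitutions
            for (char c : ALPHABET.toCharArray())
                edits.add(w.substring(0, i) + c + w.substring(i + 1));
        for (int i = 0; i <= w.length(); i++)                      // insertions
            for (char c : ALPHABET.toCharArray())
                edits.add(w.substring(0, i) + c + w.substring(i));
        return edits;
    }

    // Filter the generated strings through a dictionary of valid words
    static Set<String> candidates(String typo, Set<String> dictionary) {
        Set<String> result = new HashSet<>();
        for (String e : edits1(typo))
            if (dictionary.contains(e)) result.add(e);
        return result;
    }
}

Candidates two edits away can be generated by applying edits1 to each result and filtering again.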
Character ngram Spelling Correction
   Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is set of character ngrams
   Let’s use n=3 (trigram), with # to mark word start/end
   Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
   Uhm, IR model???
       Review…
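Before reviewing the IR machinery, a small Java helper (illustrative, not from the slides) shows how the boundary-marked character trigrams above can be produced.

import java.util.ArrayList;
import java.util.List;

public class CharNgrams {
    // Character trigrams of a word, with '#' marking word start and end,
    // e.g. "across" -> [#ac, acr, cro, ros, oss, ss#]
    static List<String> trigrams(String word) {
        String padded = "#" + word + "#";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }
}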
Abstract IR Architecture

            Query (online)                           Documents (offline)
                |                                         |
        Representation Function                   Representation Function
                |                                         |
        Query Representation                      Document Representation
                |                                         |
        Comparison Function  ◄──────────────────────────  Index
                |
             Results
Document  Boolean Representation
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries
nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items
healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's
a win-win for our customers because they are getting the same great french-fry taste along with
an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use,
but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates)
were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and
Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit.
Neither company could immediately be reached for comment.
…

“Bag of Words”: McDonalds, fat, fries, new, french, Company, Said, nutrition, …
Boolean Retrieval
 Doc 1: dogs     Doc 2: dolphins     Doc 3: football     Doc 4: football dolphins
Inverted Index: Boolean Retrieval
 Doc 1                    Doc 2                 Doc 3            Doc 4
 one fish, two fish       red fish, blue fish   cat in the hat   green eggs and ham



                 Doc 1   Doc 2   Doc 3   Doc 4             Postings

         blue            1                                  blue  → [2]
         cat                     1                          cat   → [3]
         egg                             1                  egg   → [4]
         fish    1       1                                  fish  → [1, 2]
         green                           1                  green → [4]
         ham                             1                  ham   → [4]
         hat                     1                          hat   → [3]
         one     1                                          one   → [1]
         red             1                                  red   → [2]
         two     1                                          two   → [1]
Inverted Indexing via MapReduce
      Doc 1                        Doc 2                         Doc 3
      one fish, two fish            red fish, blue fish          cat in the hat

Map output
         Doc 1: (one, 1), (two, 1), (fish, 1)
         Doc 2: (red, 2), (blue, 2), (fish, 2)
         Doc 3: (cat, 3), (hat, 3)

                    Shuffle and Sort: aggregate values by keys

Reduce output
         Reducer 1: cat → [3],  fish → [1, 2],  one → [1],  red → [2]
         Reducer 2: blue → [2],  hat → [3],  two → [1]
Inverted Indexing in MapReduce

1: class Mapper
2: procedure Map(docid n; doc d)
3:        H = new Set
4:        for all term t in doc d do
5:           H.add(t)
6:        for all term t in H do
7:           Emit(term t, n)


1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3:        List P = docids.values()
4:        Emit(term t; P)
Scalability Bottleneck
    Desired output format: <term, [doc1, doc2, …]>
        Just emitting each <term, docID> pair won’t produce this
        How to produce this without buffering?
    Side-effect: write directly to HDFS instead of emitting
        Complications?
          • Persistent data must be cleaned up if reducer restarted…
Using the Inverted Index
    Boolean Retrieval: to execute a Boolean query
        Build query syntax tree
          ( blue AND fish ) OR ham   →   OR( ham, AND( blue, fish ) )

        For each clause, look up postings
          blue → [2]
          fish → [1, 2]

        Traverse postings and apply Boolean operator
    Efficiency analysis
        Start with shortest posting first
        Postings traversal is linear (if postings are sorted)
          • Oops… we didn’t actually do this in building our index…
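As a concrete illustration of the linear traversal above, a minimal Java sketch of intersecting two sorted postings lists for the AND operator (docIDs as sorted int arrays is a simplifying assumption):

import java.util.ArrayList;
import java.util.List;

public class PostingsOps {
    // Intersect two sorted postings lists (AND): linear in the total length
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { result.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return result;
    }
}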
Inverted Indexing in MapReduce

1: class Mapper
2: procedure Map(docid n; doc d)
3:        H = new Set
4:        for all term t in doc d do
5:           H.add(t)
6:        for all term t in H do
7:           Emit(term t, n)


1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3:        List P = docids.values()
4:        Emit(term t; P)
Inverted Indexing in MapReduce: try 2

1: class Mapper
2: procedure Map(docid n; doc d)
3:        H = new Set
4:        for all term t in doc d do
5:           H.add(t)
6:        for all term t in H do
7:           Emit(term t, n)


1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3:        List P = docids.values()
4:        Sort(P)                        // e.g. fish → [1, 2]
5:        Emit(term t; P)
(Another) Scalability Bottleneck
    Reducer buffers all docIDs associated with a term (to sort)
        What if term occurs in many documents?
    Secondary sorting
        Use composite key
        Partition function
        Key Comparator
    Side-effect: write directly to HDFS as before…
Inverted index for spelling correction
   Like search, spelling correction must be fast
       How can we quickly identify candidate corrections?
   Inverted index (II): map each character ngram → list of all words containing it
       #ac -> { act, across, actress, acquire, … }
       acr -> { across, acrimony, macro, … }
       cre -> { crest, acre, acres, … }
       res -> { arrest, rest, rescue, restaurant, … }
       ess -> { less, lesson, necessary, actress, … }
       ss# -> { less, mess, moss, across, actress, … }
   How do we build the inverted index in MapReduce?
Exercise
   Write a MapReduce algorithm for creating an inverted
    index for trigram spelling correction, given a corpus
Exercise
   Write a MapReduce algorithm for creating an inverted
    index for trigram spelling correction, given a corpus

    Map(String docid, String text):
      for each word w in text:
          for each trigram t in w:
              Emit(t, w)

    Reduce(String trigram, Iterator<Text> values):
      Emit(trigram, values.toSet)



   Also other alternatives, e.g. in-mapper combining, pairs
   Is MapReduce even necessary for this?
       Dictionary vs. token frequency
Spelling correction as Boolean search
   Given inverted index, how to find set of possible corrections?
       Compute union of all words indexed by any of its character ngrams
       = Boolean search
          • Query “acress”  →  “#ac OR acr OR cre OR res OR ess OR ss#”
   Are all corrections equally likely / good?
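A sketch of that Boolean OR in Java, assuming the inverted index is held in memory as a Map from trigram to the set of words containing it (types and names are illustrative):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CandidateLookup {
    // Union the word sets indexed by each boundary-marked trigram of the typo
    static Set<String> candidates(Iterable<String> typoTrigrams, Map<String, Set<String>> index) {
        Set<String> result = new HashSet<>();
        for (String gram : typoTrigrams) {
            Set<String> words = index.get(gram);
            if (words != null) result.addAll(words);
        }
        return result;
    }
}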
Ranked Information Retrieval
   Order documents by probability of relevance
       Estimate relevance of each document to the query
       Rank documents by relevance
   How do we estimate relevance?
   Vector space paradigm
       Approximate relevance by vector similarity (e.g. cosine)
       Represent queries and documents as vectors
       Rank documents by vector similarity to the query
Vector Space Model
   (Figure: documents d1–d5 plotted as vectors in term space t1, t2, t3; θ and φ are angles between document vectors)
   Assumption: Documents that are “close” in vector space
   “talk about” the same things

   Retrieve documents based on how close the document
   vector is to the query vector (i.e., similarity ~ “closeness”)
Similarity Metric
    Use “angle” between the vectors
                      
\[
\cos\theta = \frac{\vec{d}_j \cdot \vec{d}_k}{\|\vec{d}_j\|\,\|\vec{d}_k\|}
\]

\[
\mathrm{sim}(d_j, d_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{\|\vec{d}_j\|\,\|\vec{d}_k\|}
                       = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}
\]

     Given pre-normalized vectors, just compute inner product

\[
\mathrm{sim}(d_j, d_k) = \vec{d}_j \cdot \vec{d}_k = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}
\]
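A small Java sketch of the cosine computation above, with sparse vectors represented as maps from component (e.g. a character ngram) to weight:

import java.util.Map;

public class Cosine {
    // cos(theta) = (a . b) / (|a| |b|) for sparse vectors
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}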
Boolean Character ngram correction
   Boolean Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is set of character ngrams
   Let’s use n=3 (trigram), with # to mark word start/end
   Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
Ranked Character ngram correction
   Vector space Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is vector of character ngram value
       Rank candidate corrections according to vector similarity (cosine)
   Trigram Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
Spelling Correction in Vector Space
   (Figure: same vector space picture, with words plotted as vectors in character-ngram space; θ and φ are angles between word vectors)

   Assumption: Words that are “close together” in ngram
   vector space have similar orthography

   Therefore, retrieve words in the dictionary based on how
   close the word is to the typo (i.e., similarity ~ “closeness”)
Ranked Character ngram correction
   Vector space Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is vector of character ngram value
       Rank candidate corrections according to vector similarity (cosine)
   Trigram Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
   “value” here expresses relative importance of different
    vector components for the similarity comparison
       Use simple count here, what else might we do?
IR Term Weighting
   Term weights consist of two components
       Local: how important is the term in this document?
       Global: how important is the term in the collection?
   Here’s the intuition:
       Terms that appear often in a document should get high weights
       Terms that appear in many documents should get low weights
   How do we capture this mathematically?
       Term frequency (local)
       Inverse document frequency (global)
TF.IDF Term Weighting


\[
w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\frac{N}{n_i}
\]

             w_{i,j}    weight assigned to term i in document j
             tf_{i,j}   number of occurrences of term i in document j
             N          number of documents in the entire collection
             n_i        number of documents containing term i
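The weighting as a one-line Java helper (a sketch; real systems differ in log base and in smoothing, e.g. adding 1 inside the log):

public class TfIdf {
    // w_{i,j} = tf_{i,j} * log(N / n_i)
    static double weight(int tf, long numDocs, long docFreq) {
        return tf * Math.log((double) numDocs / docFreq);
    }
}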
Inverted Index: TF.IDF
 Doc 1                    Doc 2                  Doc 3            Doc 4
 one fish, two fish        red fish, blue fish   cat in the hat   green eggs and ham


                        tf                                  postings with (docid, tf)
                 Doc 1   Doc 2   Doc 3   Doc 4     df

         blue            1                         1        blue  → [(2, 1)]
         cat                     1                 1        cat   → [(3, 1)]
         egg                             1         1        egg   → [(4, 1)]
         fish    2       2                         2        fish  → [(1, 2), (2, 2)]
         green                           1         1        green → [(4, 1)]
         ham                             1         1        ham   → [(4, 1)]
         hat                     1                 1        hat   → [(3, 1)]
         one     1                                 1        one   → [(1, 1)]
         red             1                         1        red   → [(2, 1)]
         two     1                                 1        two   → [(1, 1)]
Inverted Indexing via MapReduce
      Doc 1                        Doc 2                         Doc 3
      one fish, two fish            red fish, blue fish          cat in the hat

Map output
         Doc 1: (one, 1), (two, 1), (fish, 1)
         Doc 2: (red, 2), (blue, 2), (fish, 2)
         Doc 3: (cat, 3), (hat, 3)

                    Shuffle and Sort: aggregate values by keys

Reduce output
         Reducer 1: cat → [3],  fish → [1, 2],  one → [1],  red → [2]
         Reducer 2: blue → [2],  hat → [3],  two → [1]
Inverted Indexing via MapReduce (2)
      Doc 1                      Doc 2                        Doc 3
      one fish, two fish         red fish, blue fish          cat in the hat

Map output
         Doc 1: (one, (1, 1)), (two, (1, 1)), (fish, (1, 2))
         Doc 2: (red, (2, 1)), (blue, (2, 1)), (fish, (2, 2))
         Doc 3: (cat, (3, 1)), (hat, (3, 1))

                 Shuffle and Sort: aggregate values by keys

Reduce output
         Reducer 1: cat → [(3, 1)],  fish → [(1, 2), (2, 2)],  one → [(1, 1)],  red → [(2, 1)]
         Reducer 2: blue → [(2, 1)],  hat → [(3, 1)],  two → [(1, 1)]
Inverted Indexing: Pseudo-Code




     Further exacerbates earlier scalability issues…
Ranked Character ngram correction
   Vector space Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is vector of character ngram value
       Rank candidate corrections according to vector similarity (cosine)
   Trigram Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
   “value” here expresses relative importance of different
    vector components for the similarity comparison
       What else might we do? TF.IDF for character n-grams?
TF.IDF for character n-grams
   Think about what makes an ngram more discriminating
       e.g. in acquire, acq and cqu are more indicative than qui and ire.
       Schematically, we want something like:

         • acquire: [ #ac,   acq, cqu, qui, uir, ire, re# ]
   Possible solution: TF-IDF, where
       TF is the frequency of the ngram in the word
       IDF is the inverse of the number of vocabulary words the ngram occurs in
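A sketch of computing trigram document frequencies over a dictionary, where each "document" is a word; the resulting df (or its inverse) can feed the TF-IDF weighting just described. Names are illustrative.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NgramIdf {
    // df of each trigram = number of dictionary words containing it
    static Map<String, Integer> trigramDf(Set<String> dictionary) {
        Map<String, Integer> df = new HashMap<>();
        for (String word : dictionary) {
            String padded = "#" + word + "#";
            Set<String> seen = new HashSet<>();
            for (int i = 0; i + 3 <= padded.length(); i++) seen.add(padded.substring(i, i + 3));
            for (String gram : seen) df.merge(gram, 1, Integer::sum);
        }
        return df;
    }

    // idf of a trigram given the dictionary (vocabulary) size
    static double idf(String gram, Map<String, Integer> df, int vocabSize) {
        Integer n = df.get(gram);
        return (n == null) ? 0.0 : Math.log((double) vocabSize / n);
    }
}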
Correction Beyond Orthography
   So far we’ve focused on orthography alone
   The context of a typo also tells us a great deal
   How can we compare contexts?
Correction Beyond Orthography
   So far we’ve focused on orthography alone
   The context of a typo also tells us a great deal
   How can we compare contexts?
   Idea: use the co-occurrence matrices built during HW2
       We have a vector of co-occurrence counts for each word

       Extract a similar vector for the typo given its immediate context
          • “She is their favorite acress in town.”  →
           acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]


       Possible enhancement: make vectors sensitive to word order
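A minimal sketch of extracting such a context vector for the typo from its tokenized sentence (whitespace tokenization and lowercasing are simplifying assumptions; the vectors for candidate words would come from the HW2 co-occurrence matrices):

import java.util.HashMap;
import java.util.Map;

public class ContextVector {
    // Count the words surrounding the typo in its sentence (excluding the typo itself)
    static Map<String, Integer> contextCounts(String[] tokens, int typoIndex) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == typoIndex) continue;
            counts.merge(tokens[i].toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }
}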
Combining evidence
   We have orthographic similarity and contextual similarity
   We can do a simple weighted combination of the two, e.g.:

\[
\mathrm{simCombined}(d_j, d_k) \;=\; \lambda\,\mathrm{simOrth}(d_j, d_k) \;+\; (1 - \lambda)\,\mathrm{simContext}(d_j, d_k)
\]


   How to do this more efficiently?
       Compute top candidates based on simOrth
       Take top k for consideration with simContext
       …or other way around…



   The combined model might also be expressed by a similar
    probabilistic model…

Paradigm: Noisy-Channel Modeling
   
   s  arg max P( S | O)  arg max P( S ) P(O | S )
             S                 S

Want to recover most likely latent (correct) source
 word underlying the observed (misspelled) word

P(S): language model gives probability distribution
  over possible (candidate) source words

P(O|S): channel model gives probability of each
  candidate source word being “corrupted” into the
  observed typo
Noisy Channel Model for correction
     We want to rank candidates by P(cand | typo)
     Using Bayes law, the chain rule, an independence
      assumption, and logs, we have:
\[
\begin{aligned}
P(\mathit{cand} \mid \mathit{typo}, \mathit{context})
  &= \frac{P(\mathit{cand}, \mathit{typo}, \mathit{context})}{P(\mathit{typo}, \mathit{context})}
   \;\propto\; P(\mathit{cand}, \mathit{typo}, \mathit{context}) \\
  &= P(\mathit{typo} \mid \mathit{cand}, \mathit{context})\, P(\mathit{cand}, \mathit{context}) \\
  &\approx P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand}, \mathit{context}) \\
  &= P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand} \mid \mathit{context})\, P(\mathit{context}) \\
  &\propto P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand} \mid \mathit{context}) \\
  &\Rightarrow \text{rank by } \log P(\mathit{typo} \mid \mathit{cand}) + \log P(\mathit{cand} \mid \mathit{context})
\end{aligned}
\]
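A sketch of ranking candidates by the final line above, log P(typo | cand) + log P(cand | context); the channel-model and language-model interfaces here are assumptions standing in for whatever estimators are available (e.g. edit-based channel probabilities and n-gram language model probabilities):

import java.util.Comparator;
import java.util.List;

public class NoisyChannelRanker {
    interface ChannelModel  { double logProbTypoGivenCand(String typo, String cand); }
    interface LanguageModel { double logProbCandGivenContext(String cand, List<String> context); }

    // Return candidates sorted by decreasing noisy-channel score
    static List<String> rank(String typo, List<String> candidates, List<String> context,
                             ChannelModel channel, LanguageModel lm) {
        candidates.sort(Comparator.comparingDouble(
                (String cand) -> channel.logProbTypoGivenCand(typo, cand)
                               + lm.logProbCandGivenContext(cand, context)).reversed());
        return candidates;
    }
}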
Probabilistic vs. vector space model
     Both measure orthographic & contextual “fit” of the
      candidate given the typo and its usage context
     Noisy channel:
\[
P(\mathit{cand} \mid \mathit{typo}, \mathit{context}) \;\propto\; P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand} \mid \mathit{context})
\quad\Rightarrow\quad \text{rank by } \log P(\mathit{typo} \mid \mathit{cand}) + \log P(\mathit{cand} \mid \mathit{context})
\]
     IR approach:
\[
\mathrm{simCombined}(d_j, d_k) \;=\; \lambda\,\mathrm{simOrth}(d_j, d_k) \;+\; (1 - \lambda)\,\mathrm{simContext}(d_j, d_k)
\]
     Both can benefit from “big” data (i.e. bigger samples)
          Better estimates of probabilities and population frequencies
     Usual probabilistic vs. non-probabilistic tradeoffs
          Principled theory and methodology for modeling and estimation
          How to extend the feature space to include additional information?
            • Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
Index Compression
Postings Encoding
 Conceptually, postings are (docID, term frequency) pairs:

   fish   →   (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

 In Practice:
    • Instead of document IDs, encode deltas (or d-gaps)
    • But it’s not obvious that this saves space…

   fish   →   (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
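A sketch of the d-gap conversion: each docID is replaced by its gap from the previous docID (term frequencies are left unchanged). On the example above, docIDs [1, 9, 21, 34, 35, 80] become gaps [1, 8, 12, 13, 1, 45].

public class DGaps {
    // Replace sorted docIDs with gaps from the previous docID
    static int[] toGaps(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return gaps;
    }
}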
Overview of Index Compression
   Byte-aligned vs. bit-aligned
   Non-parameterized bit-aligned
       Unary codes
       γ (gamma) codes
       δ (delta) codes
   Parameterized bit-aligned
        Golomb codes




    Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
But First... General Data Compression
   Run Length Encoding
       7 7 7 8 8 9 = (7, 3), (8,2), (9,1)
   Binary Equivalent
       0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3
       Good with sparse binary data
   Huffman Coding
       Optimal when data is distributed by negative powers of two
       e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8
         • a = 0, b = 10, c= 110, d=111
       Prefix codes: no codeword is the prefix of another codeword
          • If we read 0, we know it’s an “a”; the following bits start a new codeword
         • Similarly 10 is a b (no other codeword starts with 10), etc.
         • Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)
Unary Codes
   Encode number as a run of 1s, specifically…
   x  1 coded as x-1 1s, followed by zero bit terminator
       1=0
       2 = 10
       3 = 110
       4 = 1110
       ...
   Great for small numbers… horrible for large numbers
       Overly-biased for very small gaps
γ codes
   x ≥ 1 is coded in two parts: unary length : offset
       Start with x in binary, remove the highest-order bit = offset
       Length is the number of binary digits, encoded in unary
       Concatenate length + offset codes
   Example: 9 in binary is 1001
       Offset = 001
       Length = 4, in unary code = 1110
       γ code = 1110:001
       Another example: 7 (111 in binary)
         • offset = 11, length = 3 (110 in unary)   γ code = 110:11
   Analysis
       Offset = ⌊log x⌋ bits
       Length = ⌊log x⌋ + 1 bits
       Total = 2⌊log x⌋ + 1 bits (9 → 7 bits, 7 → 5 bits, …)
δ codes
   As with γ codes, two parts: unary length & offset
       Offset is same as before
       Length is encoded by its γ code
   Example: 9 (= 1001 in binary)
       Offset = 001
       Length = 4 (100 in binary): offset = 00, length 3 = 110 in unary
         • γ code of the length = 110:00
       δ code = 110:00:001
   Comparison
       γ codes better for smaller numbers
       δ codes better for larger numbers
Golomb Codes
   x  1, parameter b
   x encoded in two parts
       Part 1: q = ( x - 1 ) / b , code q + 1 in unary
       Part 2: remainder r<b, r = x - qb – 1 coded in truncated binary
   Truncated binary defines prefix code
       if b is a power of 2
         • easy case: truncated binary = regular binary
       else
         • First 2^(log b + 1) – b values encoded in log b bits
         • Remaining values encoded in log b + 1 bits
   Let’s see some examples
Golomb Code Examples
   b = 3, r = [0:2]
       First 2^(⌊log 3⌋ + 1) − 3 = 2^2 − 3 = 1 value, in ⌊log 3⌋ = 1 bit
       First 1 value in 1 bit: 0
       Remaining 3 − 1 = 2 values in 1 + 1 = 2 bits with prefix 1: 10, 11
   b = 5, r = [0:4]
       First 2^(⌊log 5⌋ + 1) − 5 = 2^3 − 5 = 3 values, in ⌊log 5⌋ = 2 bits
       First 3 values in 2 bits: 00, 01, 10
       Remaining 5 − 3 = 2 values in 2 + 1 = 3 bits with prefix 11: 110, 111
         • Two prefix bits needed since the single leading 1 is already used in “10”
   b = 6, r = [0:5]
       First 2^(⌊log 6⌋ + 1) − 6 = 2^3 − 6 = 2 values, in ⌊log 6⌋ = 2 bits
       First 2 values in 2 bits: 00, 01
       Remaining 6 − 2 = 4 values in 2 + 1 = 3 bits with prefix 1: 100, 101, 110, 111
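A sketch of Golomb encoding with the truncated-binary remainder, following the rules above (again as bit strings for readability; assumes b ≥ 2):

public class GolombCode {
    // Encode x >= 1 with parameter b: unary quotient, then truncated-binary remainder
    static String encode(int x, int b) {
        int q = (x - 1) / b;                      // quotient, coded as q + 1 in unary
        int r = x - q * b - 1;                    // remainder in [0, b)
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < q; i++) out.append('1');
        out.append('0');
        // truncated binary for r
        int k = 32 - Integer.numberOfLeadingZeros(b - 1);   // ceil(log2 b) for b >= 2
        int cutoff = (1 << k) - b;                // first 'cutoff' values use k-1 bits
        if (r < cutoff) out.append(toBits(r, k - 1));
        else out.append(toBits(r + cutoff, k));
        return out.toString();
    }

    private static String toBits(int value, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) sb.append((value >> i) & 1);
        return sb.toString();
    }
}

For example, encode(9, 3) returns "11011", matching the 110:11 entry for x = 9, b = 3 in the comparison table on the next slide.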
Comparison of Coding Schemes


                          x     Unary          γ            δ            Golomb b=3   Golomb b=6

                          1     0              0            0            0:0          0:00
                          2     10             10:0         100:0        0:10         0:01
                          3     110            10:1         100:1        0:11         0:100
                          4     1110           110:00       101:00       10:0         0:101
                          5     11110          110:01       101:01       10:10        0:110
                          6     111110         110:10       101:10       10:11        0:111
                          7     1111110        110:11       101:11       110:0        10:00
                          8     11111110       1110:000     11000:000    110:10       10:01
                          9     111111110      1110:001     11000:001    110:11       10:100
                          10    1111111110     1110:010     11000:010    1110:0       10:101

                                      See Figure 4.5 in Lin & Dyer p. 77 for b=5 and b=10

Witten, Moffat, Bell, Managing Gigabytes (1999)
Index Compression: Performance

                                      Comparison of Index Size (bits per pointer)

                                                       Bible        TREC
                                      Unary              262        1918
                                      Binary              15          20
                                      γ                 6.51        6.63
                                      δ                 6.23        6.38
                                      Golomb            6.09        5.84

                                 Use Golomb for d-gaps, γ codes for term frequencies
                                 Optimal b ≈ 0.69 (N/df): different b for every term!

                                 Bible: King James version of the Bible; 31,101 verses (4.3 MB)
                                 TREC: TREC disks 1+2; 741,856 docs (2070 MB)

Witten, Moffat, Bell, Managing Gigabytes (1999)
Where are we without compression?
  (key)    (values)                          (keys)         (values)

  fish     1   2   [2,4]                     (fish, 1)      [2,4]
           34  1   [23]                      (fish, 9)      [9]
           21  3   [1,8,22]                  (fish, 21)     [1,8,22]
           35  2   [8,41]                    (fish, 34)     [23]
           80  3   [2,9,76]                  (fish, 35)     [8,41]
           9   1   [9]                       (fish, 80)     [2,9,76]




                                How is this different?
                                 • Let the framework do the sorting
                                 • Directly write postings to disk
                                 • Term frequency implicitly stored
Index Compression in MapReduce
   Need df to compress the postings for each term
   How do we compute df?
       Count the # of postings in reduce(), then compress
       Problem?
Order Inversion Pattern
    In the mapper:
        Emit “special” key-value pairs to keep track of df
    In the reducer:
        Make sure “special” key-value pairs come first: process them to
         determine df
    Remember: proper partitioning!
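One way to satisfy the "proper partitioning" requirement is a custom partitioner keyed on the term alone; a sketch in the old Hadoop API to match the code later in the deck. The composite-key convention here ("term" for the special pair, "term\tdocid" for regular pairs) is an illustrative assumption, not the slides' exact implementation.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Partition on the term alone, so a term's special pair and all of its
// regular (term, docid) pairs reach the same reducer.
public class TermPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) { }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String term = key.toString().split("\t", 2)[0];   // "term" or "term\tdocid"
        return (term.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A side benefit of this convention: the bare special key "fish" sorts before every "fish\t…" key, which gives the required sort order for free; for numeric docID ordering the docid part would still need fixed-width encoding or a custom key comparator.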
Getting the df: Modified Mapper
   Doc 1
    one fish, two fish   Input document…


   (key)       (value)

  fish     1    [2,4]    Emit normal key-value pairs…

  one      1    [1]


  two      1    [3]



  fish   ★     [1]      Emit “special” key-value pairs to keep track of df…

  one    ★     [1]

  two    ★     [1]
Getting the df: Modified Reducer
   (key)           (value)
  fish   ★        [63]   [82]   [27]   …             First, compute the df by summing contributions
                                                     from all “special” key-value pairs…

                                           Compress postings incrementally as they arrive
  fish     1        [2,4]


  fish     9        [9]


  fish   21         [1,8,22]                        Important: properly define sort order to make
                                                    sure “special” key-value pairs come first!
  fish   34         [23]


  fish   35         [8,41]


  fish   80         [2,9,76]

               …                                  Write postings directly to disk




                                                    Where have we seen this before?
In-class Exercise
Exercise: where have all the ngrams gone?
 For each observed (word) trigram in collection,
 output its observed (docID, wordIndex) locations
   Input
       Doc 1                     Doc 2                 Doc 3
        one fish two fish        one fish two salmon   two fish two fish



   Output                                      Possible Tools:
                                               * pairs/stripes?
     one fish two     [(1,1),(2,1)]
                                               * combining?
      fish two fish   [(1,2),(3,2)]
                                               * secondary sorting?
    fish two salmon    [(2,2)]
                                               * order inversion?
      two fish two     [(3,1)]
                                               * side effects?
Exercise: shingling
Given observed (docID, wordIndex) ngram locations
For each document, for each of its ngrams (in order),
give a list of the ngram locations for that ngram

 Input
            one fish two        [(1,1),(2,1)]
            fish two fish       [(1,2),(3,2)]
            fish two salmon     [(2,2)]
            two fish two        [(3,1)]
                                                   Possible Tools:
                                                   * pairs/stripes?
 Output                                            * combining?
    Doc 1    [ [(1,1),(2,1)], [(1,2),(3,2)] ]      * secondary sorting?
    Doc 2    [ [(1,1),(2,1)], [(2,2)] ]            * order inversion?
    Doc 3    [ [(3,1)], [(1,2),(3,2)] ]            * side effects?
Exercise: shingling (2)
How can we recognize when longer ngrams are
aligned across documents?
Example
      doc 1: a b c d e
      doc 2: a b c d f
      doc 3: e b c d f
      doc 4: a b c d e

Find “a b c d” in docs 1 2 and 4,
      “b c d f” in 2 & 3
      “a b c d e” in 1 and 4
class Alignment
             int index      // start position in this document
             int length     // sequence length in ngrams          typedef Pair<int docID, int position> Ngram;
             int otherID    // ID of other document
             int otherIndex // start position in other document

class NgramExtender
   Set<Alignment> alignments = empty set
   index=0;
   NgramExtender(int docID) { _docID = docID }
   close() { foreach Alignment a, emit(_docID, a) }

  AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document

              ...
                 @inproceedings{Kolak:2008,
                  author = {Kolak, Okan and Schilit, Bill N.},
                  title = {Generating links by mining quotations},
                  booktitle = {19th ACM conference on Hypertext and hypermedia},
                  year = {2008},
                  pages = {117--126}
                 }
class Alignment
             int index      // start position in this document
             int length     // sequence length in ngrams          typedef Pair<int docID, int position> Ngram;
             int otherID    // ID of other document
             int otherIndex // start position in other document

class NgramExtender
   Set<Alignment> alignments = empty set
   index=0;
   NgramExtender(int docID) { _docID = docID }
   close() { foreach Alignment a, emit(_docID, a) }

  AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document
     ++index;
     foreach Alignment a in alignments
             Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
             if (ngrams.contains(next)) // extend alignment
                 a.length += 1;    ngrams.remove(next)
             else                       // terminate alignment
                 emit _docID, (a); alignments.remove(a)

     foreach ngram in ngrams
               alignments.add( new Alignment( index, 1, ngram.docID, ngram.position ) )
Sequences of MapReduce Jobs
Building more complex MR algorithms
   Monolithic single Map + single Reduce
       What we’ve done so far
       Fitting all computation to this model can be difficult and ugly
       We generally strive for modularization when possible
   What else can we do?
       Pipeline: [Map Reduce] [Map Reduce] … (multiple sequential jobs)
       Chaining: [Map+ Reduce Map*]
         • 1 or more Mappers
         • 1 reducer
         • 0 or more Mappers
       Pipelined Chain: [Map+ Reduce Map*] [Map+ Reduce Map*] …
       Express arbitrary dependencies between jobs
Modularization and WordCount
   General benefits of modularization
       Re-use for easier/faster development
       Consistent behavior across applications
       Easier/faster to maintain/extend for benefit of many applications
   Even basic word count can be broken down
       Pre-processing
         • How will we tokenize? Perform stemming? Remove stopwords?
       Main computation: count tokenized tokens and group by word
       Post-processing
         • Transform the values? (e.g. log-damping)
   Let’s separate tokenization into its own module
       Many other tasks can likely benefit
   First approach: pipeline…
Pipeline WordCount Modules

Tokenize
  Tokenizer Mapper (no Reducer)
  • String -> List[String]
  • Keep doc ID key
  • E.g. (10032, “the 10 cats sleep”) -> (10032, [“the”, “10”, “cats”, “sleep”])

Count
  Observer Mapper
  • List[String] -> List[(String, Int)]
  • E.g. (10032, [“the”, “10”, “cats”, “sleep”]) -> [(“the”, 1), (“10”, 1), (“cats”, 1), (“sleep”, 1)]
  LongSumReducer
  • Sum token counts
  • E.g. (“sleep”, [1, 5, 2]) -> (“sleep”, 8)
Pipeline WordCount in Hadoop
   Two distinct jobs: tokenize and count
        Data sharing between jobs via persistent output
        Can use combiners and partitioners as usual (won’t bother here)
   Let’s use SequenceFileOutputFormat rather than TextOutputFormat
        sequence of binary key-value pairs; faster / smaller
        tokenization output will stick around unless we delete it
   Tokenize job
        Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer
        Output goes to directory we specify
           Files will be read back in by the counting job
        Output is array of tokens
            We need to make a suitable Writable for String arrays (see the sketch after this slide)
   Count job
        Input types defined by the input SequenceFile (don’t need to be specified)
        Mapper is trivial
           observes tokens from incoming data
           Key: (docid) & Value: (Array of Strings, encoded as a Writable)
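One common way to get a Writable for String arrays, as mentioned above, is to subclass Hadoop's ArrayWritable and pin the element class to Text; a minimal sketch:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Writable for an array of Strings: ArrayWritable needs the element class
// to deserialize, so we fix it to Text in the no-arg constructor.
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }

    public TextArrayWritable(String[] strings) {
        super(Text.class, toTexts(strings));
    }

    private static Text[] toTexts(String[] strings) {
        Text[] texts = new Text[strings.length];
        for (int i = 0; i < strings.length; i++) texts[i] = new Text(strings[i]);
        return texts;
    }
}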
Pipeline WordCount (old Hadoop API)
Configuration conf = new Configuration();
String tmpDir1to2 = "/tmp/intermediate1to2";

// Tokenize job
JobConf tokenizationJob = new JobConf(conf);
tokenizationJob.setJarByClass(PipelineWordCount.class);
FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
tokenizationJob.setOutputKeyClass(LongWritable.class);
tokenizationJob.setOutputValueClass(TextArrayWritable.class);
tokenizationJob.setNumReduceTasks(0);

// Count job
JobConf countingJob = new JobConf(conf);
countingJob.setJarByClass(PipelineWordCount.class);
countingJob.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
countingJob.setMapperClass(TrivialWordObserver.class);
countingJob.setReducerClass(MapRedIntSumReducer.class);
countingJob.setOutputKeyClass(Text.class);
countingJob.setOutputValueClass(IntWritable.class);
countingJob.setNumReduceTasks(reduceTasks);

JobClient.runJob(tokenizationJob);
JobClient.runJob(countingJob);
Pipeline jobs in Hadoop
   Old API
        JobClient.runJob(…) does not return until the job finishes
   New API
       Use Job rather than JobConf
       Use job.waitForCompletion instead of JobClient.runJob
   Why Old API?
       In 0.20.2, chaining only possible under old API
       We want to re-use the same components for chaining (next…)
Chaining in Hadoop
   Map+ Reduce Map*
       1 or more Mappers
         • Can use IdentityMapper
       1 reducer
         • No reducers: conf.setNumReduceTasks(0)?
       0 or more Mappers
   Usual combiners and partitioners
   By default, data passed between Mappers by usual writing of
    intermediate data to disk
       Can always use side-effects…
       There is a better, built-in way to bypass this and pass
        (Key,Value) pairs by reference instead
         • Requires different Mapper semantics!

   (Figure: two parallel chains of Mapper 1 → Intermediates → Mapper 2 → Reducer → Mapper 3 → Persistent Output)
Hadoop: ChainMapper & ChainReducer
   Built on JobConf objects (deprecated in Hadoop 0.20.2)
       No undeprecated replacement in 0.20.2…
   Examples here work for later versions with small changes
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
...

boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?

JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class,
                      Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);

JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class,
                      Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);

JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class,
                    ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);

JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class,
                       Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);

JobClient.runJob(job);
Chaining in Hadoop
   Let’s continue our running example:
       Mapper 1: Tokenize
       Mapper 2: Observe (count) words
       Reducer: same IntSum reducer as always
        Mapper 3: Log-dampen counts
         • We didn’t have this in our pipeline example but we’ll add here…
Chained Tokenizer + WordCount
// Set up configuration and intermediate directory location
Configuration conf = new Configuration();
JobConf chainJob = new JobConf(conf);
chainJob.setJobName("Chain job");
chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
chainJob.setNumReduceTasks(reduceTasks);
FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));

// pass output (Key,Value) pairs to next Mapper by reference?
boolean passByRef = false;

JobConf map1 = new JobConf(false); // tokenization
ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class,
                     LongWritable.class, Text.class,
                     LongWritable.class, TextArrayWritable.class, passByRef, map1);

JobConf map2 = new JobConf(false); // Add token observer job
ChainMapper.addMapper(chainJob, TrivialWordObserver.class,
                      LongWritable.class, TextArrayWritable.class,
                      Text.class, LongWritable.class, passByRef, map2);

JobConf reduce = new JobConf(false); // Set the int sum reducer
ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class,
                        Text.class, LongWritable.class, passByRef, reduce);

JobConf map3 = new JobConf(false); // log-scaling of counts
ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class,
                       Text.class, FloatWritable.class, passByRef, map3);

JobClient.runJob(chainJob);
Hadoop Chaining: Pass by Reference
   Chaining allows possible optimization
       Chained mappers run in same JVM thread, so opportunity to avoid
        serialization to/from disk with pipelined jobs
       Also lesser benefit of avoiding extra object destruction / construction
   Gotchas
        OutputCollector.collect(K k, V v) promises not to
         alter the content of k and v
       But if Map1 passes (k,v) by reference to Map2 via collect(),
        Map2 may alter (k,v) & thereby violate the contract
   What to do?
       Option 1: Honor the contract – don’t alter input (k,v) in Map2
       Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()
       Document carefully to avoid later changes silently breaking this…
Setting Dependencies Between Jobs
   JobControl and Job provide the mechanism
 // create jobconf1 and jobconf2 as appropriate
 // …

 Job job1 = new Job(jobconf1);
 Job job2 = new Job(jobconf2);
 job2.addDependingJob(job1);

 JobControl jbcntrl = new JobControl("jbcntrl");
 jbcntrl.addJob(job1);
 jbcntrl.addJob(job2);
 jbcntrl.run();




   New API: no JobConf, create Job from Configuration, …
Higher Level Abstractions
   Pig: language and execution environment for expressing
    MapReduce data flows. (pretty much the standard)
       See White, Chapter 11
   Cascading: another environment with a higher level of
    abstraction for composing complex data flows
       See White, Chapter 16, pp 539-552
   Cascalog: query language based on Cascading that uses
    Clojure (a JVM-based LISP variant)
       Word count in Cascalog
       Certainly more concise – though you need to grok the syntax.

      (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))

 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
 

Último

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 

Último (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 

Lecture 6: Data-Intensive Computing for Text Analysis (Fall 2011)

  • 1. Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M University of Texas at Austin, Fall 2011 Lecture 6 September 29, 2011 Jason Baldridge Matt Lease Department of Linguistics School of Information University of Texas at Austin University of Texas at Austin Jasonbaldridge at gmail dot com ml at ischool dot utexas dot edu
  • 2. Acknowledgments Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park Some figures courtesy of the following excellent Hadoop books (order yours today!) • Chuck Lam’s Hadoop In Action (2010) • Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
  • 3. Today’s Agenda • Automatic Spelling Correction – Review: Information Retrieval (IR) • Boolean Search • Vector Space Modeling • Inverted Indexing in MapReduce – Probabilisitic modeling via noisy channel • Index Compression – Order inversion in MapReduce • In-class exercise • Hadoop: Pipelined & Chained jobs
  • 5. Automatic Spelling Correction  Three main stages  Error detection  Candidate generation  Candidate ranking / choose best candidate  Usage cases  Flagging possible misspellings / spell checker  Suggesting possible corrections  Automatically correcting (inferred) misspellings • “as you type” correction • web queries • real-time closed captioning • …
  • 6. Types of spelling errors  Unknown words: “She is their favorite acress in town.”  Can be identified using a dictionary…  …but could be a valid word not in the dictionary  Dictionary could be automatically constructed from large corpora • Filter out rare words (misspellings, or valid but unlikely)… • Why filter out rare words that are valid?  Unknown words violating phonotactics:  e.g. “There isn’t enough room in this tonw for the both of us.”  Given dictionary, could automatically construct “n-gram dictionary” of all character n-grams known in the language • e.g. English words don’t end with “nw”, so flag tonw  Incorrect homophone: “She drove their.”  Valid word, wrong usage; infer appropriateness from context  Typing errors reflecting kayout of leyboard
  • 7. Candidate generation  How to generate possible corrections for acress?  Inspiration: how do people do it?  People may suggest words like actress, across, access, acres, caress, and cress – what do these have in common?  What about “blam” and “zigzag”?  Two standard strategies for candidate generation  Minimum edit distance • Generate all candidates within 1+ edit step(s) • Possible edit operations: insertion, deletion, substitution, transposition, … • Filter through a dictionary • See Peter Norvig’s post: http://norvig.com/spell-correct.html  Character ngrams: see next slide…
  • 8. Character ngram Spelling Correction  Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is set of character ngrams  Let’s use n=3 (trigram), with # to mark word start/end  Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]  Uhm, IR model???  Review…
  • 9. Abstract IR Architecture  (diagram) Online side: a Query passes through a Representation Function to produce a Query Representation. Offline side: Documents pass through a Representation Function to produce Document Representations, which are stored in an Index. A Comparison Function matches the query representation against the index to produce Results.
  • 10. Document  Boolean Representation McDonald's slims down spuds Fast-food chain to reduce certain types of “Bag of Words” fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is McDonalds cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items fat healthier. But does that mean the popular shoestring fries fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with new an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. french But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with Company the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's Said (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger nutrition King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately … be reached for comment. …
  • 11. Boolean Retrieval  (example: a term–document matrix over Doc 1–Doc 4 with terms such as dogs, dolphins, football)
  • 12. Inverted Index: Boolean Retrieval  Doc 1: one fish, two fish  Doc 2: red fish, blue fish  Doc 3: cat in the hat  Doc 4: green eggs and ham  Postings: blue → [2], cat → [3], egg → [4], fish → [1, 2], green → [4], ham → [4], hat → [3], one → [1], red → [2], two → [1]
  • 13. Inverted Indexing via MapReduce Doc 1 Doc 2 Doc 3 one fish, two fish red fish, blue fish cat in the hat one 1 red 2 cat 3 Map two 1 blue 2 hat 3 fish 1 fish 2 Shuffle and Sort: aggregate values by keys cat 3 blue 2 Reduce fish 1 2 hat 3 one 1 two 1 red 2
  • 14. Inverted Indexing in MapReduce 1: class Mapper 2: procedure Map(docid n; doc d) 3: H = new Set 4: for all term t in doc d do 5: H.add(t) 6: for all term t in H do 7: Emit(term t, n) 1: class Reducer 2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …]) 3: List P = docids.values() 4: Emit(term t; P)
  • 15. Scalability Bottleneck  Desired output format: <term, [doc1, doc2, …]>  Just emitting each <term, docID> pair won’t produce this  How to produce this without buffering?  Side-effect: write directly to HDFS instead of emitting  Complications? • Persistent data must be cleaned up if reducer restarted…
  • 16. Using the Inverted Index  Boolean Retrieval: to execute a Boolean query  Build the query syntax tree, e.g. ( blue AND fish ) OR ham  For each clause, look up postings, e.g. blue → [2], fish → [1, 2]  Traverse postings and apply the Boolean operator  Efficiency analysis  Start with the shortest postings list first  Postings traversal is linear (if postings are sorted) • Oops… we didn’t actually do this in building our index…
  • 17. Inverted Indexing in MapReduce 1: class Mapper 2: procedure Map(docid n; doc d) 3: H = new Set 4: for all term t in doc d do 5: H.add(t) 6: for all term t in H do 7: Emit(term t, n) 1: class Reducer 2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …]) 3: List P = docids.values() 4: Emit(term t; P)
  • 18. Inverted Indexing in MapReduce: try 2  1: class Mapper 2: procedure Map(docid n; doc d) 3: H = new Set 4: for all term t in doc d do 5: H.add(t) 6: for all term t in H do 7: Emit(term t, n)  1: class Reducer 2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …]) 3: List P = docids.values() 4: Sort(P) 5: Emit(term t; P)  (e.g. fish → [1, 2])
  • 19. (Another) Scalability Bottleneck  The reducer buffers all docIDs associated with a term (to sort them)  What if the term occurs in many documents?  Secondary sorting  Use a composite key  Partition function  Key Comparator  Side-effect: write directly to HDFS as before…
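As a concrete reference for the secondary-sorting fix above, here is a minimal Java sketch of the composite key and partitioner, written against the newer org.apache.hadoop.mapreduce API rather than the older mapred API used elsewhere in these slides; the class and method names are illustrative assumptions, not code from the course. A grouping comparator on the term alone would complete the pattern so that all (term, docID) keys for one term arrive in a single reduce call.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key (term, docID): the framework sorts postings for us.
class TermDocPair implements WritableComparable<TermDocPair> {
  private Text term = new Text();
  private LongWritable docId = new LongWritable();
  public Text getTerm() { return term; }
  public void set(String t, long d) { term.set(t); docId.set(d); }
  public void write(DataOutput out) throws IOException { term.write(out); docId.write(out); }
  public void readFields(DataInput in) throws IOException { term.readFields(in); docId.readFields(in); }
  public int compareTo(TermDocPair o) {
    int c = term.compareTo(o.term);                  // group by term first...
    return c != 0 ? c : docId.compareTo(o.docId);    // ...then sort by docID
  }
  public int hashCode() { return term.hashCode(); }
  public boolean equals(Object o) {
    return o instanceof TermDocPair && compareTo((TermDocPair) o) == 0;
  }
}

// Partition on the term alone so every docID for a term reaches the same reducer.
class TermPartitioner extends Partitioner<TermDocPair, LongWritable> {
  public int getPartition(TermDocPair key, LongWritable value, int numReduceTasks) {
    return (key.getTerm().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}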
  • 20. Inverted index for spelling correction  Like search, spelling correction must be fast  How can we quickly identify candidate corrections?  II (inverted index): map each character ngram -> list of all words containing it  #ac -> { act, across, actress, acquire, … }  acr -> { across, acrimony, macro, … }  cre -> { crest, acre, acres, … }  res -> { arrest, rest, rescue, restaurant, … }  ess -> { less, lesson, necessary, actress, … }  ss# -> { less, mess, moss, across, actress, … }  How do we build the inverted index in MapReduce?
  • 21. Exercise  Write a MapReduce algorithm for creating an inverted index for trigram spelling correction, given a corpus
  • 22. Exercise  Write a MapReduce algorithm for creating an inverted index for trigram spelling correction, given a corpus
Map(String docid, String text):
  for each word w in text:
    for each trigram t in w:
      Emit(t, w)
Reduce(String trigram, Iterator<Text> values):
  Emit(trigram, values.toSet)
Also other alternatives, e.g. in-mapper combining, pairs  Is MapReduce even necessary for this?  Dictionary vs. token frequency
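A hedged Java sketch of the same algorithm under the newer mapreduce API; the class names are illustrative, and the byte-offset key supplied by TextInputFormat stands in for the docid of the pseudocode since the mapper never uses it:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TrigramIndex {
  public static class TrigramMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String word : line.toString().toLowerCase().split("\\W+")) {
        if (word.isEmpty()) continue;
        String padded = "#" + word + "#";                 // mark word start/end
        for (int i = 0; i + 3 <= padded.length(); i++) {
          context.write(new Text(padded.substring(i, i + 3)), new Text(word));
        }
      }
    }
  }
  public static class TrigramReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text trigram, Iterable<Text> words, Context context)
        throws IOException, InterruptedException {
      Set<String> unique = new HashSet<String>();         // de-duplicate words per trigram
      for (Text w : words) unique.add(w.toString());
      context.write(trigram, new Text(unique.toString()));
    }
  }
}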
  • 23. Spelling correction as Boolean search  Given inverted index, how to find set of possible corrections?  Compute union of all words indexed by any of its character ngrams  = Boolean search • Query “acress”  “#ac OR acr OR cre OR res OR ess OR ss# “  Are all corrections equally likely / good?
  • 24. Ranked Information Retrieval  Order documents by probability of relevance  Estimate relevance of each document to the query  Rank documents by relevance  How do we estimate relevance?  Vector space paradigm  Approximate relevance by vector similarity (e.g. cosine)  Represent queries and documents as vectors  Rank documents by vector similarity to the query
  • 25. Vector Space Model  (diagram: documents d1–d5 as vectors in a space with term axes t1, t2, t3; θ and φ are angles between vectors)  Assumption: Documents that are “close” in vector space “talk about” the same things. Retrieve documents based on how close the document vector is to the query vector (i.e., similarity ~ “closeness”)
  • 26. Similarity Metric  Use the “angle” between the vectors: cos(θ) = (d_j · d_k) / (|d_j| |d_k|)  sim(d_j, d_k) = Σ_{i=1..n} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1..n} w_{i,j}²) · sqrt(Σ_{i=1..n} w_{i,k}²) )  Given pre-normalized vectors, just compute the inner product: sim(d_j, d_k) = d_j · d_k = Σ_{i=1..n} w_{i,j} w_{i,k}
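For reference, a small plain-Java sketch of this cosine computation over sparse ngram vectors, with each vector stored as a map from ngram to weight; the class and method names are just placeholders:

import java.util.Map;

class CosineSketch {
  // cos(theta) between two sparse vectors stored as ngram -> weight maps
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      normA += e.getValue() * e.getValue();
      Double w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w;   // only shared ngrams contribute to the dot product
    }
    for (double w : b.values()) normB += w * w;
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}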
  • 27. Boolean Character ngram correction  Boolean Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is set of character ngrams  Let’s use n=3 (trigram), with # to mark word start/end  Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]
  • 28. Ranked Character ngram correction  Vector space Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is vector of character ngram value  Rank candidate corrections according to vector similarity (cosine)  Trigram Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
  • 29. Spelling Correction in Vector Space  (same diagram as before: vectors d1–d5 over term axes t1, t2, t3 with angles θ, φ)  Assumption: Words that are “close together” in ngram vector space have similar orthography. Therefore, retrieve words in the dictionary based on how close the word is to the typo (i.e., similarity ~ “closeness”)
  • 30. Ranked Character ngram correction  Vector space Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is vector of character ngram value  Rank candidate corrections according to vector similarity (cosine)  Trigram Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]  “value” here expresses relative importance of different vector components for the similarity comparison  Use simple count here, what else might we do?
  • 31. IR Term Weighting  Term weights consist of two components  Local: how important is the term in this document?  Global: how important is the term in the collection?  Here’s the intuition:  Terms that appear often in a document should get high weights  Terms that appear in many documents should get low weights  How do we capture this mathematically?  Term frequency (local)  Inverse document frequency (global)
  • 32. TF.IDF Term Weighting  w_{i,j} = tf_{i,j} × log(N / n_i)  where w_{i,j} = weight assigned to term i in document j, tf_{i,j} = number of occurrences of term i in document j, N = number of documents in the entire collection, n_i = number of documents containing term i
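A one-method Java sketch of this weight; the names are illustrative, and natural log is used since the log base only rescales all weights uniformly:

class TfIdfSketch {
  // w_{i,j} = tf_{i,j} * log(N / n_i)
  static double weight(int tf, int ni, int n) {
    return tf * Math.log((double) n / ni);
  }
}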
  • 33. Inverted Index: TF.IDF  Doc 1: one fish, two fish  Doc 2: red fish, blue fish  Doc 3: cat in the hat  Doc 4: green eggs and ham  Postings now carry df and (docID, tf) pairs: blue (df 1) → [(2, 1)], cat (df 1) → [(3, 1)], egg (df 1) → [(4, 1)], fish (df 2) → [(1, 2), (2, 2)], green (df 1) → [(4, 1)], ham (df 1) → [(4, 1)], hat (df 1) → [(3, 1)], one (df 1) → [(1, 1)], red (df 1) → [(2, 1)], two (df 1) → [(1, 1)]
  • 34. Inverted Indexing via MapReduce Doc 1 Doc 2 Doc 3 one fish, two fish red fish, blue fish cat in the hat one 1 red 2 cat 3 Map two 1 blue 2 hat 3 fish 1 fish 2 Shuffle and Sort: aggregate values by keys cat 3 blue 2 Reduce fish 1 2 hat 3 one 1 two 1 red 2
  • 35. Inverted Indexing via MapReduce (2) Doc 1 Doc 2 Doc 3 one fish, two fish red fish, blue fish cat in the hat one 1 1 red 2 1 cat 3 1 Map two 1 1 blue 2 1 hat 3 1 fish 1 2 fish 2 2 Shuffle and Sort: aggregate values by keys cat 3 1 blue 2 1 Reduce fish 1 2 2 2 hat 3 1 one 1 1 two 1 1 red 2 1
  • 36. Inverted Indexing: Pseudo-Code  Further exacerbates earlier scalability issues …
  • 37. Ranked Character ngram correction  Vector space Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is vector of character ngram value  Rank candidate corrections according to vector similarity (cosine)  Trigram Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]  “value” here expresses relative importance of different vector components for the similarity comparison  What else might we do? TF.IDF for character n-grams?
  • 38. TF.IDF for character n-grams  Think about what makes an ngram more discriminating  e.g. in acquire, acq and cqu are more indicative than qui and ire.  Schematically, we want something like: • acquire: [ #ac, acq, cqu, qui, uir, ire, re# ]  Possible solution: TF-IDF, where  TF is the frequency of the ngram in the word  IDF is based on the number of vocabulary words the ngram occurs in (fewer words → higher weight)
  • 39. Correction Beyond Orthography  So far we’ve focused on orthography alone  The context of a typo also tells us a great deal  How can we compare contexts?
  • 40. Correction Beyond Orthography  So far we’ve focused on orthography alone  The context of a typo also tells us a great deal  How can we compare contexts?  Idea: use the co-occurrence matrices built during HW2  We have a vector of co-occurrence counts for each word  Extract a similar vector for the typo given its immediate context • “She is their favorite acress in town.”  acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]  Possible enhancement: make vectors sensitive to word order
  • 41. Combining evidence  We have orthographic similarity and contextual similarity  We can do a simple weighted combination of the two, e.g.: simCombined(d_j, d_k) = λ · simOrth(d_j, d_k) + (1 − λ) · simContext(d_j, d_k)  How to do this more efficiently?  Compute top candidates based on simOrth  Take top k for consideration with simContext  …or other way around…  The combined model might also be expressed by a similar probabilistic model…
  • 42. Paradigm: Noisy-Channel Modeling  ŝ = argmax_S P(S | O) = argmax_S P(S) P(O | S)  Want to recover most likely latent (correct) source word underlying the observed (misspelled) word  P(S): language model gives probability distribution over possible (candidate) source words  P(O|S): channel model gives probability of each candidate source word being “corrupted” into the observed typo
  • 43. Noisy Channel Model for correction  We want to rank candidates by P(cand | typo)  Using Bayes law, the chain rule, an independence assumption, and logs, we have:
P(cand | typo, context) = P(cand, typo, context) / P(typo, context)
  ∝ P(cand, typo, context)
  = P(typo | cand, context) · P(cand, context)
  ≈ P(typo | cand) · P(cand, context)
  = P(typo | cand) · P(cand | context) · P(context)
  ∝ P(typo | cand) · P(cand | context)
  → rank by log P(typo | cand) + log P(cand | context)
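A minimal Java sketch of ranking candidates by this final quantity, assuming the channel model and the contextual language model are available behind the two hypothetical interfaces shown (neither interface comes from the lecture, and smoothing is assumed so the probabilities are nonzero):

import java.util.Collection;
import java.util.List;

class NoisyChannelRanker {
  // Hypothetical model interfaces: any smoothed channel / language model would do.
  interface ChannelModel { double probTypoGivenWord(String typo, String word); }
  interface LanguageModel { double probWordGivenContext(String word, List<String> context); }

  // Pick the argmax of log P(typo | cand) + log P(cand | context) over the candidate set.
  static String bestCorrection(String typo, List<String> context,
                               Collection<String> candidates,
                               ChannelModel channel, LanguageModel lm) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String cand : candidates) {
      double score = Math.log(channel.probTypoGivenWord(typo, cand))
                   + Math.log(lm.probWordGivenContext(cand, context));
      if (score > bestScore) { bestScore = score; best = cand; }
    }
    return best;
  }
}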
  • 44. Probabilistic vs. vector space model  Both measure orthographic & contextual “fit” of the candidate given the typo and its usage context  Noisy channel: rank by log P(typo | cand) + log P(cand | context)  IR approach: simCombined(d_j, d_k) = λ · simOrth(d_j, d_k) + (1 − λ) · simContext(d_j, d_k)  Both can benefit from “big” data (i.e. bigger samples)  Better estimates of probabilities and population frequencies  Usual probabilistic vs. non-probabilistic tradeoffs  Principled theory and methodology for modeling and estimation  How to extend the feature space to include additional information? • Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
  • 46. Postings Encoding  Conceptually: fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …  In practice: • Instead of document IDs, encode deltas (or d-gaps) • But it’s not obvious that this saves space…  fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
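A plain-Java sketch of the d-gap transformation and its inverse (method and class names are illustrative):

class DGaps {
  // e.g. [1, 9, 21, 34, 35, 80] -> [1, 8, 12, 13, 1, 45]
  static int[] toGaps(int[] sortedDocIds) {
    int[] gaps = new int[sortedDocIds.length];
    int prev = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      gaps[i] = sortedDocIds[i] - prev;   // difference from the previous docID
      prev = sortedDocIds[i];
    }
    return gaps;
  }
  // Inverse: a running sum recovers the original docIDs.
  static int[] fromGaps(int[] gaps) {
    int[] ids = new int[gaps.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      prev += gaps[i];
      ids[i] = prev;
    }
    return ids;
  }
}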
  • 47. Overview of Index Compression  Byte-aligned vs. bit-aligned  Non-parameterized bit-aligned  Unary codes  γ (gamma) codes  δ (delta) codes  Parameterized bit-aligned  Golomb codes  Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
  • 48. But First... General Data Compression  Run Length Encoding  7 7 7 8 8 9 = (7, 3), (8,2), (9,1)  Binary Equivalent  0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3  Good with sparse binary data  Huffman Coding  Optimal when data is distributed by negative powers of two  e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8 • a = 0, b = 10, c= 110, d=111  Prefix codes: no codeword is the prefix of another codeword • If we read 0, we know it’s an “a”; the following bits are a new codeword • Similarly 10 is a b (no other codeword starts with 10), etc. • Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)
  • 49. Unary Codes  Encode number as a run of 1s, specifically…  x ≥ 1 coded as x−1 1s, followed by a zero-bit terminator  1 = 0  2 = 10  3 = 110  4 = 1110  ...  Great for small numbers… horrible for large numbers  Overly-biased for very small gaps
  • 50. γ codes  x ≥ 1 is coded in two parts: unary length : offset  Start with binary encoding, remove highest-order bit = offset  Length is number of binary digits, encoded in unary  Concatenate length + offset codes  Example: 9 in binary is 1001  Offset = 001  Length = 4, in unary code = 1110  γ code = 1110:001  Another example: 7 (111 in binary) • offset = 11, length = 3 (110 in unary)  γ code = 110:11  Analysis  Offset = ⌊log x⌋ bits  Length = ⌊log x⌋ + 1 bits  Total = 2⌊log x⌋ + 1 bits (97 bits, 75 bits, …)
  • 51. δ codes  As with γ codes, two parts: unary length & offset  Offset is same as before  Length is encoded by its γ code  Example: 9 (= 1001 in binary)  Offset = 001  Length = 4 (100 in binary): offset = 00, length 3 = 110 in unary • γ code of 4 = 110:00  δ code = 110:00:001  Comparison  γ codes better for smaller numbers  δ codes better for larger numbers
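A small Java sketch of these three encoders, emitting the same human-readable bit strings (with ':' separators) used on these slides; a real index would pack raw bits rather than build Strings, and the class name is just a placeholder:

class EliasCodes {
  // Unary: x >= 1 encoded as (x - 1) ones followed by a terminating zero.
  static String unary(int x) {
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i < x; i++) sb.append('1');
    return sb.append('0').toString();
  }
  // Gamma: unary length, then the binary offset (binary with the leading 1 removed).
  static String gamma(int x) {
    String binary = Integer.toBinaryString(x);                    // 9 -> "1001"
    return unary(binary.length()) + ":" + binary.substring(1);    // -> "1110:001"
  }
  // Delta: like gamma, but the length itself is gamma-coded.
  static String delta(int x) {
    String binary = Integer.toBinaryString(x);
    return gamma(binary.length()) + ":" + binary.substring(1);    // 9 -> "110:00:001"
  }
}

For example, gamma(9) returns "1110:001" and delta(9) returns "110:00:001", matching the worked examples above.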
  • 52. Golomb Codes  x ≥ 1, parameter b  x encoded in two parts  Part 1: q = ⌊(x − 1) / b⌋, code q + 1 in unary  Part 2: remainder r < b, r = x − qb − 1, coded in truncated binary  Truncated binary defines a prefix code  if b is a power of 2 • easy case: truncated binary = regular binary  else • First 2^(⌊log b⌋ + 1) − b values encoded in ⌊log b⌋ bits • Remaining values encoded in ⌊log b⌋ + 1 bits  Let’s see some examples
  • 53. Golomb Code Examples  b = 3, r = [0:2]  First 2^(⌊log 3⌋ + 1) − 3 = 2^2 − 3 = 1 value, in ⌊log 3⌋ = 1 bit  First 1 value in 1 bit: 0  Remaining 3−1 = 2 values in 1+1 = 2 bits with prefix 1: 10, 11  b = 5, r = [0:4]  First 2^(⌊log 5⌋ + 1) − 5 = 2^3 − 5 = 3 values, in ⌊log 5⌋ = 2 bits  First 3 values in 2 bits: 00, 01, 10  Remaining 5−3 = 2 values in 2+1 = 3 bits with prefix 11: 110, 111 • Two prefix bits needed since single leading 1 already used in “10”  b = 6, r = [0:5]  First 2^(⌊log 6⌋ + 1) − 6 = 2^3 − 6 = 2 values, in ⌊log 6⌋ = 2 bits  First 2 values in 2 bits: 00, 01  Remaining 6−2 = 4 values in 2+1 = 3 bits with prefix 1: 100, 101, 110, 111
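A Java sketch of the Golomb encoder with the truncated-binary remainder, assuming b ≥ 2; unary() is repeated from the previous sketch so this stands alone, and the class name is a placeholder:

class GolombCode {
  // Golomb code of x >= 1 with parameter b >= 2: quotient in unary, remainder in truncated binary.
  static String golomb(int x, int b) {
    int q = (x - 1) / b;                   // integer division = floor
    int r = x - q * b - 1;                 // remainder in [0, b)
    return unary(q + 1) + ":" + truncatedBinary(r, b);
  }
  // First 2^k - b values get k-1 bits, the rest get k bits, where k = ceil(log2 b).
  static String truncatedBinary(int r, int b) {
    int k = 32 - Integer.numberOfLeadingZeros(b - 1);   // ceil(log2 b) for b >= 2
    int u = (1 << k) - b;                               // number of short codewords
    return (r < u) ? toBits(r, k - 1) : toBits(r + u, k);
  }
  static String toBits(int value, int width) {
    StringBuilder sb = new StringBuilder();
    for (int i = width - 1; i >= 0; i--) sb.append((value >> i) & 1);
    return sb.toString();
  }
  static String unary(int x) {             // repeated here so the sketch stands alone
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i < x; i++) sb.append('1');
    return sb.append('0').toString();
  }
}

For example, golomb(7, 3) returns "110:0" and golomb(9, 6) returns "10:100", agreeing with the comparison table on the next slide.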
  • 54. Comparison of Coding Schemes
   x   Unary        γ          δ           Golomb b=3   Golomb b=6
   1   0            0          0           0:0          0:00
   2   10           10:0       100:0       0:10         0:01
   3   110          10:1       100:1       0:11         0:100
   4   1110         110:00     101:00      10:0         0:101
   5   11110        110:01     101:01      10:10        0:110
   6   111110       110:10     101:10      10:11        0:111
   7   1111110      110:11     101:11      110:0        10:00
   8   11111110     1110:000   11000:000   110:10       10:01
   9   111111110    1110:001   11000:001   110:11       10:100
  10   1111111110   1110:010   11000:010   1110:0       10:101
See Figure 4.5 in Lin & Dyer p. 77 for b=5 and b=10. Witten, Moffat, Bell, Managing Gigabytes (1999)
  • 55. Index Compression: Performance  Comparison of Index Size (bits per pointer)
           Bible   TREC
  Unary    262     1918
  Binary   15      20
  γ        6.51    6.63
  δ        6.23    6.38
  Golomb   6.09    5.84
Use Golomb for d-gaps, γ codes for term frequencies  Optimal b ≈ 0.69 (N/df): Different b for every term!  Bible: King James version of the Bible; 31,101 verses (4.3 MB)  TREC: TREC disks 1+2; 741,856 docs (2070 MB)  Witten, Moffat, Bell, Managing Gigabytes (1999)
  • 56. Where are we without compression?  (key) → (values), arriving unsorted: fish → (1, 2, [2,4]), (34, 1, [23]), (21, 3, [1,8,22]), (35, 2, [8,41]), (80, 3, [2,9,76]), (9, 1, [9])  (keys) → (values), composite keys sorted by the framework: (fish, 1) → [2,4], (fish, 9) → [9], (fish, 21) → [1,8,22], (fish, 34) → [23], (fish, 35) → [8,41], (fish, 80) → [2,9,76]  How is this different? • Let the framework do the sorting • Directly write postings to disk • Term frequency implicitly stored
  • 57. Index Compression in MapReduce  Need df to compress posting for each term  How do we compute df?  Count the # of postings in reduce(), then compress  Problem?
  • 58. Order Inversion Pattern  In the mapper:  Emit “special” key-value pairs to keep track of df  In the reducer:  Make sure “special” key-value pairs come first: process them to determine df  Remember: proper partitioning!
  • 59. Getting the df: Modified Mapper  Input document: Doc 1, “one fish, two fish”  Emit normal key-value pairs: (fish, 1) → [2,4], (one, 1) → [1], (two, 1) → [3]  Emit “special” key-value pairs to keep track of df: (fish, ★) → [1], (one, ★) → [1], (two, ★) → [1]
  • 60. Getting the df: Modified Reducer  First, compute the df by summing contributions from all “special” key-value pairs: (fish, ★) → [63] [82] [27] …  Then compress postings incrementally as they arrive: (fish, 1) → [2,4], (fish, 9) → [9], (fish, 21) → [1,8,22], (fish, 34) → [23], (fish, 35) → [8,41], (fish, 80) → [2,9,76], …  Important: properly define the sort order to make sure “special” key-value pairs come first!  Write postings directly to disk  Where have we seen this before?
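To make the reducer-side bookkeeping concrete, here is a minimal plain-Java sketch (not the course's actual reducer) that fixes df and the Golomb parameter when the special pair arrives and then compresses each posting as it streams in. It reuses GolombCode and EliasCodes from the earlier sketches, and writing df as a gamma-coded header is an assumption about the on-disk layout:

class PostingsWriterSketch {
  private int df;                      // document frequency, fixed by the special pair
  private int b = 2;                   // Golomb parameter for the d-gaps
  private int lastDocId = 0;
  private final StringBuilder out = new StringBuilder();

  // Called once per term, when the "special" (term, *) pair arrives first.
  void onSpecialPair(int documentFrequency, int collectionSize) {
    df = documentFrequency;
    // b roughly 0.69 * (N / df), clamped to 2 so the truncated-binary sketch above applies
    b = Math.max(2, (int) Math.round(0.69 * collectionSize / df));
    out.append(EliasCodes.gamma(df)).append(' ');            // assumed layout: df heads the list
  }

  // Called for each real posting, in increasing docID order.
  void onPosting(int docId, int termFrequency) {
    int gap = docId - lastDocId;                             // d-gap, positive since docIDs are sorted
    lastDocId = docId;
    out.append(GolombCode.golomb(gap, b)).append(' ');       // gaps: Golomb codes
    out.append(EliasCodes.gamma(termFrequency)).append(' '); // term frequencies: gamma codes
  }

  String bits() { return out.toString(); }                   // a real reducer would flush to HDFS
}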
  • 62. Exercise: where have all the ngrams gone?  For each observed (word) trigram in the collection, output its observed (docID, wordIndex) locations
Input:  Doc 1: one fish two fish   Doc 2: one fish two salmon   Doc 3: two fish two fish
Output:
  one fish two → [(1,1),(2,1)]
  fish two fish → [(1,2),(3,2)]
  fish two salmon → [(2,2)]
  two fish two → [(3,1)]
Possible tools: * pairs/stripes? * combining? * secondary sorting? * order inversion? * side effects?
  • 63. Exercise: shingling  Given observed (docID, wordIndex) ngram locations, for each document, for each of its ngrams (in order), give the list of ngram locations for that ngram
Input:
  one fish two → [(1,1),(2,1)]
  fish two fish → [(1,2),(3,2)]
  fish two salmon → [(2,2)]
  two fish two → [(3,1)]
Output:
  Doc 1 → [ [(1,1),(2,1)], [(1,2),(3,2)] ]
  Doc 2 → [ [(1,1),(2,1)], [(2,2)] ]
  Doc 3 → [ [(3,1)], [(1,2),(3,2)] ]
Possible tools: * pairs/stripes? * combining? * secondary sorting? * order inversion? * side effects?
  • 64. Exercise: shingling (2) How can we recognize when longer ngrams are aligned across documents? Example doc 1: a b c d e doc 2: a b c d f doc 3: e b c d f doc 4: a b c d e Find “a b c d” in docs 1 2 and 4, “b c d f” in 2 & 3 “a b c d e” in 1 and 4
  • 65.
class Alignment
  int index       // start position in this document
  int length      // sequence length in ngrams
  int otherID     // ID of other document
  int otherIndex  // start position in other document

typedef Pair<int docID, int position> Ngram;

class NgramExtender
  Set<Alignment> alignments = empty set
  index = 0;
  NgramExtender(int docID) { _docID = docID }
  close() { foreach Alignment a, emit(_docID, a) }
  AlignNgrams(List<Ngram> ngrams)
    // call this function iteratively in order of ngrams observed in this document
    ...

@inproceedings{Kolak:2008, author = {Kolak, Okan and Schilit, Bill N.}, title = {Generating links by mining quotations}, booktitle = {19th ACM conference on Hypertext and hypermedia}, year = {2008}, pages = {117--126}}
  • 66.
class Alignment
  int index       // start position in this document
  int length      // sequence length in ngrams
  int otherID     // ID of other document
  int otherIndex  // start position in other document

typedef Pair<int docID, int position> Ngram;

class NgramExtender
  Set<Alignment> alignments = empty set
  index = 0;
  NgramExtender(int docID) { _docID = docID }
  close() { foreach Alignment a, emit(_docID, a) }
  AlignNgrams(List<Ngram> ngrams)
    // call this function iteratively in order of ngrams observed in this document
    ++index;
    foreach Alignment a in alignments
      Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
      if (ngrams.contains(next))
        // extend alignment
        a.length += 1; ngrams.remove(next)
      else
        // terminate alignment
        emit(_docID, a); alignments.remove(a)
    foreach ngram in ngrams
      alignments.add( new Alignment( index, 1, ngram.docID, ngram.otherIndex ) )
  • 68. Building more complex MR algorithms  Monolithic single Map + single Reduce  What we’ve done so far  Fitting all computation to this model can be difficult and ugly  We generally strive for modularization when possible  What else can we do?  Pipeline: [Map → Reduce] [Map → Reduce] … (multiple sequential jobs)  Chaining: [Map+ → Reduce → Map*] • 1 or more Mappers • 1 reducer • 0 or more Mappers  Pipelined Chain: [Map+ → Reduce → Map*] [Map+ → Reduce → Map*] …  Express arbitrary dependencies between jobs
  • 69. Modularization and WordCount  General benefits of modularization  Re-use for easier/faster development  Consistent behavior across applications  Easier/faster to maintain/extend for benefit of many applications  Even basic word count can be broken down  Pre-processing • How will we tokenize? Perform stemming? Remove stopwords?  Main computation: count tokenized tokens and group by word  Post-processing • Transform the values? (e.g. log-damping)  Let’s separate tokenization into its own module  Many other tasks can likely benefit  First approach: pipeline…
  • 70. Pipeline WordCount Modules
Tokenize: Tokenizer Mapper, no Reducer • String -> List[String] • Keep doc ID key • E.g. (10032, “the 10 cats sleep”) -> (10032, [“the”, “10”, “cats”, “sleep”])
Count: Observer Mapper + LongSumReducer • Mapper: List[String] -> List[(String, Int)], e.g. (10032, [“the”, “10”, “cats”, “sleep”]) -> [(“the”,1), (“10”, 1), (“cats”,1), (“sleep”,1)] • Reducer: sum token counts, e.g. (“sleep”, [1, 5, 2]) -> (“sleep”, 8)
  • 71. Pipeline WordCount in Hadoop  Two distinct jobs: tokenize and count  Data sharing between jobs via persistent output  Can use combiners and partitioners as usual (won’t bother here)  Let’s use SequenceFileOutputFormat rather than TextOutputFormat  sequence of binary key-value pairs; faster / smaller  tokenization output will stick around unless we delete it  Tokenize job  Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer  Output goes to directory we specify  Files will be read back in by the counting job  Output is array of tokens  We need to make a suitable Writable for String arrays  Count job  Input types defined by the input SequenceFile (don’t need to be specified)  Mapper is trivial  observes tokens from incoming data  Key: (docid) & Value: (Array of Strings, encoded as a Writable)
  • 72. Pipeline WordCount (old Hadoop API)
Configuration conf = new Configuration();
String tmpDir1to2 = "/tmp/intermediate1to2";
// Tokenize job
JobConf tokenizationJob = new JobConf(conf);
tokenizationJob.setJarByClass(PipelineWordCount.class);
FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
tokenizationJob.setOutputKeyClass(LongWritable.class);
tokenizationJob.setOutputValueClass(TextArrayWritable.class);
tokenizationJob.setNumReduceTasks(0);
// Count job
JobConf countingJob = new JobConf(conf);
countingJob.setJarByClass(PipelineWordCount.class);
countingJob.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
countingJob.setMapperClass(TrivialWordObserver.class);
countingJob.setReducerClass(MapRedIntSumReducer.class);
countingJob.setOutputKeyClass(Text.class);
countingJob.setOutputValueClass(IntWritable.class);
countingJob.setNumReduceTasks(reduceTasks);
JobClient.runJob(tokenizationJob);
JobClient.runJob(countingJob);
  • 73. Pipeline jobs in Hadoop  Old API  JobClient.runJob(..) does not return until the job finishes  New API  Use Job rather than JobConf  Use job.waitForCompletion instead of JobClient.runJob  Why the old API?  In 0.20.2, chaining is only possible under the old API  We want to re-use the same components for chaining (next…)
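A rough sketch of the same two-job pipeline under the new API as described above (Hadoop 0.20.x-era constructors; job configuration details are elided and would mirror the old-API listing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NewApiPipelineSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job tokenizationJob = new Job(conf, "tokenize");   // new Job(...) in 0.20.x; Job.getInstance(...) in later releases
    // ... set jar, mapper, zero reducers, input path, and intermediate output path as in the old-API listing ...
    if (!tokenizationJob.waitForCompletion(true)) System.exit(1);   // blocks until the job finishes

    Job countingJob = new Job(conf, "count");
    // ... set mapper, reducer, intermediate input path, and final output path ...
    System.exit(countingJob.waitForCompletion(true) ? 0 : 1);
  }
}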
  • 74. Chaining in Hadoop  (diagram: each chained task runs Mapper 1 → Mapper 2 → Reducer → Mapper 3, with intermediates flowing between stages and persistent output written at the end)  Map+ → Reduce → Map*  1 or more Mappers • Can use IdentityMapper  1 reducer • No reducers: conf.setNumReduceTasks(0)?  0 or more Mappers  Usual combiners and partitioners  By default, data passed between Mappers by usual writing of intermediate data to disk  Can always use side-effects…  There is a better, built-in way to bypass this and pass (Key,Value) pairs by reference instead • Requires different Mapper semantics!
  • 75. Hadoop: ChainMapper & ChainReducer  Uses JobConf objects (deprecated in Hadoop 0.20.2).  No undeprecated replacement in 0.20.2…  Examples here work for later versions with small changes
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
...
boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class, Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);
JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class, Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class, ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);
JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class, Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);
JobClient.runJob(job);
  • 76. Chaining in Hadoop  Let’s continue our running example:  Mapper 1: Tokenize  Mapper 2: Observe (count) words  Reducer: same IntSum reducer as always  Mapper 3 Log-dampen counts • We didn’t have this in our pipeline example but we’ll add here…
  • 77. Chained Tokenizer + WordCount
// Set up configuration and intermediate directory location
Configuration conf = new Configuration();
JobConf chainJob = new JobConf(conf);
chainJob.setJobName("Chain job");
chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
chainJob.setNumReduceTasks(reduceTasks);
FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));
// pass output (Key,Value) pairs to next Mapper by reference?
boolean passByRef = false;
// tokenization
JobConf map1 = new JobConf(false);
ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class, LongWritable.class, Text.class, LongWritable.class, TextArrayWritable.class, passByRef, map1);
// Add token observer job
JobConf map2 = new JobConf(false);
ChainMapper.addMapper(chainJob, TrivialWordObserver.class, LongWritable.class, TextArrayWritable.class, Text.class, LongWritable.class, passByRef, map2);
// Set the int sum reducer
JobConf reduce = new JobConf(false);
ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class, Text.class, LongWritable.class, passByRef, reduce);
// log-scaling of counts
JobConf map3 = new JobConf(false);
ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class, Text.class, FloatWritable.class, passByRef, map3);
JobClient.runJob(chainJob);
  • 78. Hadoop Chaining: Pass by Reference  Chaining allows a possible optimization  Chained mappers run in the same JVM thread, so there is an opportunity to avoid serialization to/from disk with pipelined jobs  Also a lesser benefit of avoiding extra object destruction / construction  Gotchas  OutputCollector.collect(K k, V v) promises not to alter the content of k and v  But if Map1 passes (k,v) by reference to Map2 via collect(), Map2 may alter (k,v) & thereby violate the contract  What to do?  Option 1: Honor the contract – don’t alter input (k,v) in Map2  Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()  Document carefully to avoid later changes silently breaking this…
  • 79. Setting Dependencies Between Jobs  JobControl and Job provide the mechanism
// create jobconf1 and jobconf2 as appropriate
// …
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
job2.addDependingJob(job1);
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
jbcntrl.run();
New API: no JobConf, create Job from Configuration, …
  • 80. Higher Level Abstractions  Pig: language and execution environment for expressing MapReduce data flows. (pretty much the standard)  See White, Chapter 11  Cascading: another environment with a higher level of abstraction for composing complex data flows  See White, Chapter 16, pp 539-552  Cascalog: query language based on Cascading that uses Clojure (a JVM-based LISP variant)  Word count in Cascalog  Certainly more concise – though you need to grok the syntax. (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))