Data-Intensive Computing for Text Analysis
                 CS395T / INF385T / LIN386M
            University of Texas at Austin, Fall 2011



                      Lecture 6
                 September 29, 2011

        Jason Baldridge                      Matt Lease
   Department of Linguistics           School of Information
  University of Texas at Austin    University of Texas at Austin
Jasonbaldridge at gmail dot com   ml at ischool dot utexas dot edu
Acknowledgments
        Course design and slides based on
     Jimmy Lin’s cloud computing courses at
    the University of Maryland, College Park

Some figures courtesy of the following
excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide,
  2nd Edition (2010)
Today’s Agenda
• Automatic Spelling Correction
  – Review: Information Retrieval (IR)
     • Boolean Search
     • Vector Space Modeling
     • Inverted Indexing in MapReduce
  – Probabilistic modeling via noisy channel
• Index Compression
  – Order inversion in MapReduce
• In-class exercise
• Hadoop: Pipelined & Chained jobs
Automatic Spelling Correction
Automatic Spelling Correction
   Three main stages
       Error detection
       Candidate generation
       Candidate ranking / choose best candidate



   Usage cases
       Flagging possible misspellings / spell checker
       Suggesting possible corrections
       Automatically correcting (inferred) misspellings
         •   “as you type” correction
         •   web queries
         •   real-time closed captioning
         •   …
Types of spelling errors
    Unknown words: “She is their favorite acress in town.”
        Can be identified using a dictionary…
        …but could be a valid word not in the dictionary
        Dictionary could be automatically constructed from large corpora
          • Filter out rare words (misspellings, or valid but unlikely)…
          • Why filter out rare words that are valid?
    Unknown words violating phonotactics:
        e.g. “There isn’t enough room in this tonw for the both of us.”
        Given dictionary, could automatically construct “n-gram dictionary”
         of all character n-grams known in the language
          • e.g. English words don’t end with “nw”, so flag tonw
    Incorrect homophone: “She drove their.”
        Valid word, wrong usage; infer appropriateness from context
    Typing errors reflecting kayout of leyboard
Candidate generation
   How to generate possible corrections for acress?
   Inspiration: how do people do it?
       People may suggest words like actress, across, access, acres,
        caress, and cress – what do these have in common?
       What about “blam” and “zigzag”?
   Two standard strategies for candidate generation
       Minimum edit distance
         • Generate all candidates within 1+ edit step(s)
             • Possible edit operations: insertion, deletion, substitution, transposition, …
         • Filter through a dictionary
          • See Peter Norvig’s post: http://norvig.com/spell-correct.html (a Java sketch of this strategy appears below)


       Character ngrams: see next slide…
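To make the edit-distance strategy concrete, here is a minimal Java sketch in the spirit of Norvig's post (an illustration, not the slides' code): generate every string within one edit of the typo, then keep only those found in a supplied dictionary.

import java.util.HashSet;
import java.util.Set;

public class EditCandidates {
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // All strings within one edit (deletion, transposition, substitution, insertion) of w
    static Set<String> edits1(String w) {
        Set<String> edits = new HashSet<>();
        for (int i = 0; i < w.length(); i++)                       // deletions
            edits.add(w.substring(0, i) + w.substring(i + 1));
        for (int i = 0; i < w.length() - 1; i++)                   // transpositions
            edits.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
        for (int i = 0; i < w.length(); i++)                       // substitutions
            for (char c : ALPHABET.toCharArray())
                edits.add(w.substring(0, i) + c + w.substring(i + 1));
        for (int i = 0; i <= w.length(); i++)                      // insertions
            for (char c : ALPHABET.toCharArray())
                edits.add(w.substring(0, i) + c + w.substring(i));
        return edits;
    }

    // Filter the generated strings through a dictionary of valid words
    static Set<String> candidates(String typo, Set<String> dictionary) {
        Set<String> result = new HashSet<>();
        for (String e : edits1(typo))
            if (dictionary.contains(e)) result.add(e);
        return result;
    }
}

Candidates two edits away can be generated by applying edits1 to each result and filtering again.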
Character ngram Spelling Correction
   Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is set of character ngrams
   Let’s use n=3 (trigram), with # to mark word start/end
   Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
   Uhm, IR model???
       Review…
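Before reviewing the IR machinery, a small Java helper (illustrative, not from the slides) shows how the boundary-marked character trigrams above can be produced.

import java.util.ArrayList;
import java.util.List;

public class CharNgrams {
    // Character trigrams of a word, with '#' marking word start and end,
    // e.g. "across" -> [#ac, acr, cro, ros, oss, ss#]
    static List<String> trigrams(String word) {
        String padded = "#" + word + "#";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            grams.add(padded.substring(i, i + 3));
        }
        return grams;
    }
}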
Abstract IR Architecture

            Query (online)                           Documents (offline)
                |                                         |
        Representation Function                   Representation Function
                |                                         |
        Query Representation                      Document Representation
                |                                         |
        Comparison Function  ◄──────────────────────────  Index
                |
             Results
Document  Boolean Representation
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries
nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items
healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's
a win-win for our customers because they are getting the same great french-fry taste along with
an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use,
but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates)
were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and
Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit.
Neither company could immediately be reached for comment.
…

“Bag of Words”: McDonalds, fat, fries, new, french, Company, Said, nutrition, …
Boolean Retrieval
 Doc 1: dogs     Doc 2: dolphins     Doc 3: football     Doc 4: football dolphins
Inverted Index: Boolean Retrieval
 Doc 1                    Doc 2                 Doc 3            Doc 4
 one fish, two fish       red fish, blue fish   cat in the hat   green eggs and ham



                 Doc 1   Doc 2   Doc 3   Doc 4             Postings

         blue            1                                  blue  → [2]
         cat                     1                          cat   → [3]
         egg                             1                  egg   → [4]
         fish    1       1                                  fish  → [1, 2]
         green                           1                  green → [4]
         ham                             1                  ham   → [4]
         hat                     1                          hat   → [3]
         one     1                                          one   → [1]
         red             1                                  red   → [2]
         two     1                                          two   → [1]
Inverted Indexing via MapReduce
      Doc 1                        Doc 2                         Doc 3
      one fish, two fish            red fish, blue fish          cat in the hat

Map output
         Doc 1: (one, 1), (two, 1), (fish, 1)
         Doc 2: (red, 2), (blue, 2), (fish, 2)
         Doc 3: (cat, 3), (hat, 3)

                    Shuffle and Sort: aggregate values by keys

Reduce output
         Reducer 1: cat → [3],  fish → [1, 2],  one → [1],  red → [2]
         Reducer 2: blue → [2],  hat → [3],  two → [1]
Inverted Indexing in MapReduce

1: class Mapper
2: procedure Map(docid n; doc d)
3:        H = new Set
4:        for all term t in doc d do
5:           H.add(t)
6:        for all term t in H do
7:           Emit(term t, n)


1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3:        List P = docids.values()
4:        Emit(term t; P)
Scalability Bottleneck
    Desired output format: <term, [doc1, doc2, …]>
        Just emitting each <term, docID> pair won’t produce this
        How to produce this without buffering?
    Side-effect: write directly to HDFS instead of emitting
        Complications?
          • Persistent data must be cleaned up if reducer restarted…
Using the Inverted Index
    Boolean Retrieval: to execute a Boolean query
        Build query syntax tree
          ( blue AND fish ) OR ham   →   OR( ham, AND( blue, fish ) )

        For each clause, look up postings
          blue → [2]
          fish → [1, 2]

        Traverse postings and apply Boolean operator
    Efficiency analysis
        Start with shortest posting first
        Postings traversal is linear (if postings are sorted)
          • Oops… we didn’t actually do this in building our index…
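As a concrete illustration of the linear traversal above, a minimal Java sketch of intersecting two sorted postings lists for the AND operator (docIDs as sorted int arrays is a simplifying assumption):

import java.util.ArrayList;
import java.util.List;

public class PostingsOps {
    // Intersect two sorted postings lists (AND): linear in the total length
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { result.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return result;
    }
}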
Inverted Indexing in MapReduce

1: class Mapper
2: procedure Map(docid n; doc d)
3:        H = new Set
4:        for all term t in doc d do
5:           H.add(t)
6:        for all term t in H do
7:           Emit(term t, n)


1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3:        List P = docids.values()
4:        Emit(term t; P)
Inverted Indexing in MapReduce: try 2

1: class Mapper
2: procedure Map(docid n; doc d)
3:        H = new Set
4:        for all term t in doc d do
5:           H.add(t)
6:        for all term t in H do
7:           Emit(term t, n)


1: class Reducer
2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …])
3:        List P = docids.values()
4:        Sort(P)                        // e.g. fish → [1, 2]
5:        Emit(term t; P)
(Another) Scalability Bottleneck
    Reducer buffers all docIDs associated with a term (to sort)
        What if term occurs in many documents?
    Secondary sorting
        Use composite key
        Partition function
        Key Comparator
    Side-effect: write directly to HDFS as before…
Inverted index for spelling correction
   Like search, spelling correction must be fast
       How can we quickly identify candidate corrections?
   Inverted index (II): map each character ngram → list of all words containing it
       #ac -> { act, across, actress, acquire, … }
       acr -> { across, acrimony, macro, … }
       cre -> { crest, acre, acres, … }
       res -> { arrest, rest, rescue, restaurant, … }
       ess -> { less, lesson, necessary, actress, … }
       ss# -> { less, mess, moss, across, actress, … }
   How do we build the inverted index in MapReduce?
Exercise
   Write a MapReduce algorithm for creating an inverted
    index for trigram spelling correction, given a corpus
Exercise
   Write a MapReduce algorithm for creating an inverted
    index for trigram spelling correction, given a corpus

    Map(String docid, String text):
      for each word w in text:
          for each trigram t in w:
              Emit(t, w)

    Reduce(String trigram, Iterator<Text> values):
      Emit(trigram, values.toSet)



   Also other alternatives, e.g. in-mapper combining, pairs
   Is MapReduce even necessary for this?
       Dictionary vs. token frequency
Spelling correction as Boolean search
   Given inverted index, how to find set of possible corrections?
       Compute union of all words indexed by any of its character ngrams
       = Boolean search
          • Query “acress”  →  “#ac OR acr OR cre OR res OR ess OR ss#”
   Are all corrections equally likely / good?
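A sketch of that Boolean OR in Java, assuming the inverted index is held in memory as a Map from trigram to the set of words containing it (types and names are illustrative):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CandidateLookup {
    // Union the word sets indexed by each boundary-marked trigram of the typo
    static Set<String> candidates(Iterable<String> typoTrigrams, Map<String, Set<String>> index) {
        Set<String> result = new HashSet<>();
        for (String gram : typoTrigrams) {
            Set<String> words = index.get(gram);
            if (words != null) result.addAll(words);
        }
        return result;
    }
}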
Ranked Information Retrieval
   Order documents by probability of relevance
       Estimate relevance of each document to the query
       Rank documents by relevance
   How do we estimate relevance?
   Vector space paradigm
       Approximate relevance by vector similarity (e.g. cosine)
       Represent queries and documents as vectors
       Rank documents by vector similarity to the query
Vector Space Model
   (Figure: documents d1–d5 plotted as vectors in term space t1, t2, t3; θ and φ are angles between document vectors)
   Assumption: Documents that are “close” in vector space
   “talk about” the same things

   Retrieve documents based on how close the document
   vector is to the query vector (i.e., similarity ~ “closeness”)
Similarity Metric
    Use “angle” between the vectors
                      
\[
\cos\theta = \frac{\vec{d}_j \cdot \vec{d}_k}{\|\vec{d}_j\|\,\|\vec{d}_k\|}
\]

\[
\mathrm{sim}(d_j, d_k) = \frac{\vec{d}_j \cdot \vec{d}_k}{\|\vec{d}_j\|\,\|\vec{d}_k\|}
                       = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}
\]

     Given pre-normalized vectors, just compute inner product

\[
\mathrm{sim}(d_j, d_k) = \vec{d}_j \cdot \vec{d}_k = \sum_{i=1}^{n} w_{i,j}\, w_{i,k}
\]
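A small Java sketch of the cosine computation above, with sparse vectors represented as maps from component (e.g. a character ngram) to weight:

import java.util.Map;

public class Cosine {
    // cos(theta) = (a . b) / (|a| |b|) for sparse vectors
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}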
Boolean Character ngram correction
   Boolean Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is set of character ngrams
   Let’s use n=3 (trigram), with # to mark word start/end
   Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, iss, ssi, sis, sip, ipp, ppi, pi#]
Ranked Character ngram correction
   Vector space Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is vector of character ngram value
       Rank candidate corrections according to vector similarity (cosine)
   Trigram Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
Spelling Correction in Vector Space
   (Figure: same vector space picture, with words plotted as vectors in character-ngram space; θ and φ are angles between word vectors)

   Assumption: Words that are “close together” in ngram
   vector space have similar orthography

   Therefore, retrieve words in the dictionary based on how
   close the word is to the typo (i.e., similarity ~ “closeness”)
Ranked Character ngram correction
   Vector space Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is vector of character ngram value
       Rank candidate corrections according to vector similarity (cosine)
   Trigram Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
   “value” here expresses relative importance of different
    vector components for the similarity comparison
       Use simple count here, what else might we do?
IR Term Weighting
   Term weights consist of two components
       Local: how important is the term in this document?
       Global: how important is the term in the collection?
   Here’s the intuition:
       Terms that appear often in a document should get high weights
       Terms that appear in many documents should get low weights
   How do we capture this mathematically?
       Term frequency (local)
       Inverse document frequency (global)
TF.IDF Term Weighting


\[
w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\frac{N}{n_i}
\]

             w_{i,j}    weight assigned to term i in document j
             tf_{i,j}   number of occurrences of term i in document j
             N          number of documents in the entire collection
             n_i        number of documents containing term i
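The weighting as a one-line Java helper (a sketch; real systems differ in log base and in smoothing, e.g. adding 1 inside the log):

public class TfIdf {
    // w_{i,j} = tf_{i,j} * log(N / n_i)
    static double weight(int tf, long numDocs, long docFreq) {
        return tf * Math.log((double) numDocs / docFreq);
    }
}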
Inverted Index: TF.IDF
 Doc 1                    Doc 2                  Doc 3            Doc 4
 one fish, two fish        red fish, blue fish   cat in the hat   green eggs and ham


                        tf                                  postings with (docid, tf)
                 Doc 1   Doc 2   Doc 3   Doc 4     df

         blue            1                         1        blue  → [(2, 1)]
         cat                     1                 1        cat   → [(3, 1)]
         egg                             1         1        egg   → [(4, 1)]
         fish    2       2                         2        fish  → [(1, 2), (2, 2)]
         green                           1         1        green → [(4, 1)]
         ham                             1         1        ham   → [(4, 1)]
         hat                     1                 1        hat   → [(3, 1)]
         one     1                                 1        one   → [(1, 1)]
         red             1                         1        red   → [(2, 1)]
         two     1                                 1        two   → [(1, 1)]
Inverted Indexing via MapReduce
      Doc 1                        Doc 2                         Doc 3
      one fish, two fish            red fish, blue fish          cat in the hat

Map output
         Doc 1: (one, 1), (two, 1), (fish, 1)
         Doc 2: (red, 2), (blue, 2), (fish, 2)
         Doc 3: (cat, 3), (hat, 3)

                    Shuffle and Sort: aggregate values by keys

Reduce output
         Reducer 1: cat → [3],  fish → [1, 2],  one → [1],  red → [2]
         Reducer 2: blue → [2],  hat → [3],  two → [1]
Inverted Indexing via MapReduce (2)
      Doc 1                      Doc 2                        Doc 3
      one fish, two fish         red fish, blue fish          cat in the hat

Map output
         Doc 1: (one, (1, 1)), (two, (1, 1)), (fish, (1, 2))
         Doc 2: (red, (2, 1)), (blue, (2, 1)), (fish, (2, 2))
         Doc 3: (cat, (3, 1)), (hat, (3, 1))

                 Shuffle and Sort: aggregate values by keys

Reduce output
         Reducer 1: cat → [(3, 1)],  fish → [(1, 2), (2, 2)],  one → [(1, 1)],  red → [(2, 1)]
         Reducer 2: blue → [(2, 1)],  hat → [(3, 1)],  two → [(1, 1)]
Inverted Indexing: Pseudo-Code




     Further exacerbates earlier scalability issues…
Ranked Character ngram correction
   Vector space Information Retrieval (IR) model
       Query=typo word
       Document collection = dictionary (i.e. set of valid words)
       Representation: word is vector of character ngram value
       Rank candidate corrections according to vector similarity (cosine)
   Trigram Examples
        across: [#ac, acr, cro, ros, oss, ss#]
       acress: [#ac, acr, cre, res, ess, ss#]
       actress: [#ac, act, ctr, tre, res, ess, ss#]
       blam: [#bl, bla, lam, am#]
        mississippi: [#mi, mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
   “value” here expresses relative importance of different
    vector components for the similarity comparison
       What else might we do? TF.IDF for character n-grams?
TF.IDF for character n-grams
   Think about what makes an ngram more discriminating
       e.g. in acquire, acq and cqu are more indicative than qui and ire.
       Schematically, we want something like:

         • acquire: [ #ac,   acq, cqu, qui, uir, ire, re# ]
   Possible solution: TF-IDF, where
       TF is the frequency of the ngram in the word
       IDF is the inverse of the number of vocabulary words the ngram occurs in
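A sketch of computing trigram document frequencies over a dictionary, where each "document" is a word; the resulting df (or its inverse) can feed the TF-IDF weighting just described. Names are illustrative.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NgramIdf {
    // df of each trigram = number of dictionary words containing it
    static Map<String, Integer> trigramDf(Set<String> dictionary) {
        Map<String, Integer> df = new HashMap<>();
        for (String word : dictionary) {
            String padded = "#" + word + "#";
            Set<String> seen = new HashSet<>();
            for (int i = 0; i + 3 <= padded.length(); i++) seen.add(padded.substring(i, i + 3));
            for (String gram : seen) df.merge(gram, 1, Integer::sum);
        }
        return df;
    }

    // idf of a trigram given the dictionary (vocabulary) size
    static double idf(String gram, Map<String, Integer> df, int vocabSize) {
        Integer n = df.get(gram);
        return (n == null) ? 0.0 : Math.log((double) vocabSize / n);
    }
}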
Correction Beyond Orthography
   So far we’ve focused on orthography alone
   The context of a typo also tells us a great deal
   How can we compare contexts?
Correction Beyond Orthography
   So far we’ve focused on orthography alone
   The context of a typo also tells us a great deal
   How can we compare contexts?
   Idea: use the co-occurrence matrices built during HW2
       We have a vector of co-occurrence counts for each word

       Extract a similar vector for the typo given its immediate context
          • “She is their favorite acress in town.”  →
           acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]


       Possible enhancement: make vectors sensitive to word order
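A minimal sketch of extracting such a context vector for the typo from its tokenized sentence (whitespace tokenization and lowercasing are simplifying assumptions; the vectors for candidate words would come from the HW2 co-occurrence matrices):

import java.util.HashMap;
import java.util.Map;

public class ContextVector {
    // Count the words surrounding the typo in its sentence (excluding the typo itself)
    static Map<String, Integer> contextCounts(String[] tokens, int typoIndex) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == typoIndex) continue;
            counts.merge(tokens[i].toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }
}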
Combining evidence
   We have orthographic similarity and contextual similarity
   We can do a simple weighted combination of the two, e.g.:

\[
\mathrm{simCombined}(d_j, d_k) \;=\; \lambda\,\mathrm{simOrth}(d_j, d_k) \;+\; (1 - \lambda)\,\mathrm{simContext}(d_j, d_k)
\]


   How to do this more efficiently?
       Compute top candidates based on simOrth
       Take top k for consideration with simContext
       …or other way around…



   The combined model might also be expressed by a similar
    probabilistic model…

Paradigm: Noisy-Channel Modeling
   
   s  arg max P( S | O)  arg max P( S ) P(O | S )
             S                 S

Want to recover most likely latent (correct) source
 word underlying the observed (misspelled) word

P(S): language model gives probability distribution
  over possible (candidate) source words

P(O|S): channel model gives probability of each
  candidate source word being “corrupted” into the
  observed typo
Noisy Channel Model for correction
     We want to rank candidates by P(cand | typo)
     Using Bayes law, the chain rule, an independence
      assumption, and logs, we have:
\[
\begin{aligned}
P(\mathit{cand} \mid \mathit{typo}, \mathit{context})
  &= \frac{P(\mathit{cand}, \mathit{typo}, \mathit{context})}{P(\mathit{typo}, \mathit{context})}
   \;\propto\; P(\mathit{cand}, \mathit{typo}, \mathit{context}) \\
  &= P(\mathit{typo} \mid \mathit{cand}, \mathit{context})\, P(\mathit{cand}, \mathit{context}) \\
  &\approx P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand}, \mathit{context}) \\
  &= P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand} \mid \mathit{context})\, P(\mathit{context}) \\
  &\propto P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand} \mid \mathit{context}) \\
  &\Rightarrow \text{rank by } \log P(\mathit{typo} \mid \mathit{cand}) + \log P(\mathit{cand} \mid \mathit{context})
\end{aligned}
\]
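A sketch of ranking candidates by the final line above, log P(typo | cand) + log P(cand | context); the channel-model and language-model interfaces here are assumptions standing in for whatever estimators are available (e.g. edit-based channel probabilities and n-gram language model probabilities):

import java.util.Comparator;
import java.util.List;

public class NoisyChannelRanker {
    interface ChannelModel  { double logProbTypoGivenCand(String typo, String cand); }
    interface LanguageModel { double logProbCandGivenContext(String cand, List<String> context); }

    // Return candidates sorted by decreasing noisy-channel score
    static List<String> rank(String typo, List<String> candidates, List<String> context,
                             ChannelModel channel, LanguageModel lm) {
        candidates.sort(Comparator.comparingDouble(
                (String cand) -> channel.logProbTypoGivenCand(typo, cand)
                               + lm.logProbCandGivenContext(cand, context)).reversed());
        return candidates;
    }
}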
Probabilistic vs. vector space model
     Both measure orthographic & contextual “fit” of the
      candidate given the typo and its usage context
     Noisy channel:
\[
P(\mathit{cand} \mid \mathit{typo}, \mathit{context}) \;\propto\; P(\mathit{typo} \mid \mathit{cand})\, P(\mathit{cand} \mid \mathit{context})
\quad\Rightarrow\quad \text{rank by } \log P(\mathit{typo} \mid \mathit{cand}) + \log P(\mathit{cand} \mid \mathit{context})
\]
     IR approach:
\[
\mathrm{simCombined}(d_j, d_k) \;=\; \lambda\,\mathrm{simOrth}(d_j, d_k) \;+\; (1 - \lambda)\,\mathrm{simContext}(d_j, d_k)
\]
     Both can benefit from “big” data (i.e. bigger samples)
          Better estimates of probabilities and population frequencies
     Usual probabilistic vs. non-probabilistic tradeoffs
          Principled theory and methodology for modeling and estimation
          How to extend the feature space to include additional information?
            • Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
Index Compression
Postings Encoding
 Conceptually, postings are (docID, term frequency) pairs:

   fish   →   (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

 In Practice:
    • Instead of document IDs, encode deltas (or d-gaps)
    • But it’s not obvious that this saves space…

   fish   →   (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
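A sketch of the d-gap conversion: each docID is replaced by its gap from the previous docID (term frequencies are left unchanged). On the example above, docIDs [1, 9, 21, 34, 35, 80] become gaps [1, 8, 12, 13, 1, 45].

public class DGaps {
    // Replace sorted docIDs with gaps from the previous docID
    static int[] toGaps(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return gaps;
    }
}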
Overview of Index Compression
   Byte-aligned vs. bit-aligned
   Non-parameterized bit-aligned
       Unary codes
       γ (gamma) codes
       δ (delta) codes
   Parameterized bit-aligned
        Golomb codes




    Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
But First... General Data Compression
   Run Length Encoding
       7 7 7 8 8 9 = (7, 3), (8,2), (9,1)
   Binary Equivalent
       0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3
       Good with sparse binary data
   Huffman Coding
       Optimal when data is distributed by negative powers of two
       e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8
         • a = 0, b = 10, c= 110, d=111
       Prefix codes: no codeword is the prefix of another codeword
          • If we read 0, we know it’s an “a”; the following bits start a new codeword
         • Similarly 10 is a b (no other codeword starts with 10), etc.
         • Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)
Unary Codes
   Encode number as a run of 1s, specifically…
   x  1 coded as x-1 1s, followed by zero bit terminator
       1=0
       2 = 10
       3 = 110
       4 = 1110
       ...
   Great for small numbers… horrible for large numbers
       Overly-biased for very small gaps
γ codes
   x ≥ 1 is coded in two parts: unary length : offset
       Start with x in binary, remove the highest-order bit = offset
       Length is the number of binary digits, encoded in unary
       Concatenate length + offset codes
   Example: 9 in binary is 1001
       Offset = 001
       Length = 4, in unary code = 1110
       γ code = 1110:001
       Another example: 7 (111 in binary)
         • offset = 11, length = 3 (110 in unary)   γ code = 110:11
   Analysis
       Offset = ⌊log x⌋ bits
       Length = ⌊log x⌋ + 1 bits
       Total = 2⌊log x⌋ + 1 bits (9 → 7 bits, 7 → 5 bits, …)
δ codes
   As with γ codes, two parts: unary length & offset
       Offset is same as before
       Length is encoded by its γ code
   Example: 9 (= 1001 in binary)
       Offset = 001
       Length = 4 (100 in binary): offset = 00, length 3 = 110 in unary
         • γ code of the length = 110:00
       δ code = 110:00:001
   Comparison
       γ codes better for smaller numbers
       δ codes better for larger numbers
Golomb Codes
   x  1, parameter b
   x encoded in two parts
       Part 1: q = ( x - 1 ) / b , code q + 1 in unary
       Part 2: remainder r<b, r = x - qb – 1 coded in truncated binary
   Truncated binary defines prefix code
       if b is a power of 2
         • easy case: truncated binary = regular binary
       else
         • First 2^(log b + 1) – b values encoded in log b bits
         • Remaining values encoded in log b + 1 bits
   Let’s see some examples
Golomb Code Examples
   b = 3, r = [0:2]
       First 2^(⌊log 3⌋ + 1) − 3 = 2^2 − 3 = 1 value, in ⌊log 3⌋ = 1 bit
       First 1 value in 1 bit: 0
       Remaining 3 − 1 = 2 values in 1 + 1 = 2 bits with prefix 1: 10, 11
   b = 5, r = [0:4]
       First 2^(⌊log 5⌋ + 1) − 5 = 2^3 − 5 = 3 values, in ⌊log 5⌋ = 2 bits
       First 3 values in 2 bits: 00, 01, 10
       Remaining 5 − 3 = 2 values in 2 + 1 = 3 bits with prefix 11: 110, 111
         • Two prefix bits needed since the single leading 1 is already used in “10”
   b = 6, r = [0:5]
       First 2^(⌊log 6⌋ + 1) − 6 = 2^3 − 6 = 2 values, in ⌊log 6⌋ = 2 bits
       First 2 values in 2 bits: 00, 01
       Remaining 6 − 2 = 4 values in 2 + 1 = 3 bits with prefix 1: 100, 101, 110, 111
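A sketch of Golomb encoding with the truncated-binary remainder, following the rules above (again as bit strings for readability; assumes b ≥ 2):

public class GolombCode {
    // Encode x >= 1 with parameter b: unary quotient, then truncated-binary remainder
    static String encode(int x, int b) {
        int q = (x - 1) / b;                      // quotient, coded as q + 1 in unary
        int r = x - q * b - 1;                    // remainder in [0, b)
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < q; i++) out.append('1');
        out.append('0');
        // truncated binary for r
        int k = 32 - Integer.numberOfLeadingZeros(b - 1);   // ceil(log2 b) for b >= 2
        int cutoff = (1 << k) - b;                // first 'cutoff' values use k-1 bits
        if (r < cutoff) out.append(toBits(r, k - 1));
        else out.append(toBits(r + cutoff, k));
        return out.toString();
    }

    private static String toBits(int value, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) sb.append((value >> i) & 1);
        return sb.toString();
    }
}

For example, encode(9, 3) returns "11011", matching the 110:11 entry for x = 9, b = 3 in the comparison table on the next slide.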
Comparison of Coding Schemes


                          x     Unary          γ            δ            Golomb b=3   Golomb b=6

                          1     0              0            0            0:0          0:00
                          2     10             10:0         100:0        0:10         0:01
                          3     110            10:1         100:1        0:11         0:100
                          4     1110           110:00       101:00       10:0         0:101
                          5     11110          110:01       101:01       10:10        0:110
                          6     111110         110:10       101:10       10:11        0:111
                          7     1111110        110:11       101:11       110:0        10:00
                          8     11111110       1110:000     11000:000    110:10       10:01
                          9     111111110      1110:001     11000:001    110:11       10:100
                          10    1111111110     1110:010     11000:010    1110:0       10:101

                                      See Figure 4.5 in Lin & Dyer p. 77 for b=5 and b=10

Witten, Moffat, Bell, Managing Gigabytes (1999)
Index Compression: Performance

                                      Comparison of Index Size (bits per pointer)

                                                       Bible        TREC
                                      Unary              262        1918
                                      Binary              15          20
                                      γ                 6.51        6.63
                                      δ                 6.23        6.38
                                      Golomb            6.09        5.84

                                 Use Golomb for d-gaps, γ codes for term frequencies
                                 Optimal b ≈ 0.69 (N/df): different b for every term!

                                 Bible: King James version of the Bible; 31,101 verses (4.3 MB)
                                 TREC: TREC disks 1+2; 741,856 docs (2070 MB)

Witten, Moffat, Bell, Managing Gigabytes (1999)
Where are we without compression?
  (key)    (values)                          (keys)         (values)

  fish     1   2   [2,4]                     (fish, 1)      [2,4]
           34  1   [23]                      (fish, 9)      [9]
           21  3   [1,8,22]                  (fish, 21)     [1,8,22]
           35  2   [8,41]                    (fish, 34)     [23]
           80  3   [2,9,76]                  (fish, 35)     [8,41]
           9   1   [9]                       (fish, 80)     [2,9,76]




                                How is this different?
                                 • Let the framework do the sorting
                                 • Directly write postings to disk
                                 • Term frequency implicitly stored
Index Compression in MapReduce
   Need df to compress the postings for each term
   How do we compute df?
       Count the # of postings in reduce(), then compress
       Problem?
Order Inversion Pattern
    In the mapper:
        Emit “special” key-value pairs to keep track of df
    In the reducer:
        Make sure “special” key-value pairs come first: process them to
         determine df
    Remember: proper partitioning!
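One way to satisfy the "proper partitioning" requirement is a custom partitioner keyed on the term alone; a sketch in the old Hadoop API to match the code later in the deck. The composite-key convention here ("term" for the special pair, "term\tdocid" for regular pairs) is an illustrative assumption, not the slides' exact implementation.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Partition on the term alone, so a term's special pair and all of its
// regular (term, docid) pairs reach the same reducer.
public class TermPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) { }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String term = key.toString().split("\t", 2)[0];   // "term" or "term\tdocid"
        return (term.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A side benefit of this convention: the bare special key "fish" sorts before every "fish\t…" key, which gives the required sort order for free; for numeric docID ordering the docid part would still need fixed-width encoding or a custom key comparator.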
Getting the df: Modified Mapper
   Doc 1
    one fish, two fish   Input document…


   (key)       (value)

  fish     1    [2,4]    Emit normal key-value pairs…

  one      1    [1]


  two      1    [3]



  fish   ★     [1]      Emit “special” key-value pairs to keep track of df…

  one    ★     [1]

  two    ★     [1]
Getting the df: Modified Reducer
   (key)           (value)
  fish   ★        [63]   [82]   [27]   …             First, compute the df by summing contributions
                                                     from all “special” key-value pairs…

                                           Compress postings incrementally as they arrive
  fish     1        [2,4]


  fish     9        [9]


  fish   21         [1,8,22]                        Important: properly define sort order to make
                                                    sure “special” key-value pairs come first!
  fish   34         [23]


  fish   35         [8,41]


  fish   80         [2,9,76]

               …                                  Write postings directly to disk




                                                    Where have we seen this before?
In-class Exercise
Exercise: where have all the ngrams gone?
 For each observed (word) trigram in collection,
 output its observed (docID, wordIndex) locations
   Input
       Doc 1                     Doc 2                 Doc 3
        one fish two fish        one fish two salmon   two fish two fish



   Output                                      Possible Tools:
                                               * pairs/stripes?
     one fish two     [(1,1),(2,1)]
                                               * combining?
      fish two fish   [(1,2),(3,2)]
                                               * secondary sorting?
    fish two salmon    [(2,2)]
                                               * order inversion?
      two fish two     [(3,1)]
                                               * side effects?
Exercise: shingling
Given observed (docID, wordIndex) ngram locations
For each document, for each of its ngrams (in order),
give a list of the ngram locations for that ngram

 Input
            one fish two        [(1,1),(2,1)]
            fish two fish       [(1,2),(3,2)]
            fish two salmon     [(2,2)]
            two fish two        [(3,1)]
                                                   Possible Tools:
                                                   * pairs/stripes?
 Output                                            * combining?
    Doc 1    [ [(1,1),(2,1)], [(1,2),(3,2)] ]      * secondary sorting?
    Doc 2    [ [(1,1),(2,1)], [(2,2)] ]            * order inversion?
    Doc 3    [ [(3,1)], [(1,2),(3,2)] ]            * side effects?
Exercise: shingling (2)
How can we recognize when longer ngrams are
aligned across documents?
Example
      doc 1: a b c d e
      doc 2: a b c d f
      doc 3: e b c d f
      doc 4: a b c d e

Find “a b c d” in docs 1 2 and 4,
      “b c d f” in 2 & 3
      “a b c d e” in 1 and 4
class Alignment
             int index      // start position in this document
             int length     // sequence length in ngrams          typedef Pair<int docID, int position> Ngram;
             int otherID    // ID of other document
             int otherIndex // start position in other document

class NgramExtender
   Set<Alignment> alignments = empty set
   index=0;
   NgramExtender(int docID) { _docID = docID }
   close() { foreach Alignment a, emit(_docID, a) }

  AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document

              ...
                 @inproceedings{Kolak:2008,
                  author = {Kolak, Okan and Schilit, Bill N.},
                  title = {Generating links by mining quotations},
                  booktitle = {19th ACM conference on Hypertext and hypermedia},
                  year = {2008},
                  pages = {117--126}
                 }
class Alignment
             int index      // start position in this document
             int length     // sequence length in ngrams          typedef Pair<int docID, int position> Ngram;
             int otherID    // ID of other document
             int otherIndex // start position in other document

class NgramExtender
   Set<Alignment> alignments = empty set
   index=0;
   NgramExtender(int docID) { _docID = docID }
   close() { foreach Alignment a, emit(_docID, a) }

  AlignNgrams(List<Ngram> ngrams) // call this function iteratively in order of ngrams observed in this document
     ++index;
     foreach Alignment a in alignments
             Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
             if (ngrams.contains(next)) // extend alignment
                 a.length += 1;    ngrams.remove(next)
             else                       // terminate alignment
                 emit _docID, (a); alignments.remove(a)

     foreach ngram in ngrams
               alignments.add( new Alignment( index, 1, ngram.docID, ngram.position ) )
Sequences of MapReduce Jobs
Building more complex MR algorithms
   Monolithic single Map + single Reduce
       What we’ve done so far
       Fitting all computation to this model can be difficult and ugly
       We generally strive for modularization when possible
   What else can we do?
       Pipeline: [Map Reduce] [Map Reduce] … (multiple sequential jobs)
       Chaining: [Map+ Reduce Map*]
         • 1 or more Mappers
         • 1 reducer
         • 0 or more Mappers
       Pipelined Chain: [Map+ Reduce Map*] [Map+ Reduce Map*] …
       Express arbitrary dependencies between jobs
Modularization and WordCount
   General benefits of modularization
       Re-use for easier/faster development
       Consistent behavior across applications
       Easier/faster to maintain/extend for benefit of many applications
   Even basic word count can be broken down
       Pre-processing
         • How will we tokenize? Perform stemming? Remove stopwords?
       Main computation: count tokenized tokens and group by word
       Post-processing
         • Transform the values? (e.g. log-damping)
   Let’s separate tokenization into its own module
       Many other tasks can likely benefit
   First approach: pipeline…
Pipeline WordCount Modules

Tokenize
  Tokenizer Mapper (no Reducer)
  • String -> List[String]
  • Keep doc ID key
  • E.g. (10032, “the 10 cats sleep”) -> (10032, [“the”, “10”, “cats”, “sleep”])

Count
  Observer Mapper
  • List[String] -> List[(String, Int)]
  • E.g. (10032, [“the”, “10”, “cats”, “sleep”]) -> [(“the”, 1), (“10”, 1), (“cats”, 1), (“sleep”, 1)]
  LongSumReducer
  • Sum token counts
  • E.g. (“sleep”, [1, 5, 2]) -> (“sleep”, 8)
Pipeline WordCount in Hadoop
   Two distinct jobs: tokenize and count
        Data sharing between jobs via persistent output
        Can use combiners and partitioners as usual (won’t bother here)
   Let’s use SequenceFileOutputFormat rather than TextOutputFormat
        sequence of binary key-value pairs; faster / smaller
        tokenization output will stick around unless we delete it
   Tokenize job
        Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer
        Output goes to directory we specify
           Files will be read back in by the counting job
        Output is array of tokens
            We need to make a suitable Writable for String arrays (see the sketch after this slide)
   Count job
        Input types defined by the input SequenceFile (don’t need to be specified)
        Mapper is trivial
           observes tokens from incoming data
           Key: (docid) & Value: (Array of Strings, encoded as a Writable)
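One common way to get a Writable for String arrays, as mentioned above, is to subclass Hadoop's ArrayWritable and pin the element class to Text; a minimal sketch:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

// Writable for an array of Strings: ArrayWritable needs the element class
// to deserialize, so we fix it to Text in the no-arg constructor.
public class TextArrayWritable extends ArrayWritable {
    public TextArrayWritable() {
        super(Text.class);
    }

    public TextArrayWritable(String[] strings) {
        super(Text.class, toTexts(strings));
    }

    private static Text[] toTexts(String[] strings) {
        Text[] texts = new Text[strings.length];
        for (int i = 0; i < strings.length; i++) texts[i] = new Text(strings[i]);
        return texts;
    }
}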
Pipeline WordCount (old Hadoop API)
Configuration conf = new Configuration();
String tmpDir1to2 = "/tmp/intermediate1to2";

// Tokenize job
JobConf tokenizationJob = new JobConf(conf);
tokenizationJob.setJarByClass(PipelineWordCount.class);
FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
tokenizationJob.setOutputKeyClass(LongWritable.class);
tokenizationJob.setOutputValueClass(TextArrayWritable.class);
tokenizationJob.setNumReduceTasks(0);

// Count job
JobConf countingJob = new JobConf(conf);
countingJob.setJarByClass(PipelineWordCount.class);
countingJob.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
countingJob.setMapperClass(TrivialWordObserver.class);
countingJob.setReducerClass(MapRedIntSumReducer.class);
countingJob.setOutputKeyClass(Text.class);
countingJob.setOutputValueClass(IntWritable.class);
countingJob.setNumReduceTasks(reduceTasks);

JobClient.runJob(tokenizationJob);
JobClient.runJob(countingJob);
Pipeline jobs in Hadoop
   Old API
        JobClient.runJob(…) does not return until the job finishes
   New API
       Use Job rather than JobConf
       Use job.waitForCompletion instead of JobClient.runJob
   Why Old API?
       In 0.20.2, chaining only possible under old API
       We want to re-use the same components for chaining (next…)
Chaining in Hadoop
   Map+ Reduce Map*
       1 or more Mappers
         • Can use IdentityMapper
       1 reducer
         • No reducers: conf.setNumReduceTasks(0)?
       0 or more Mappers
   Usual combiners and partitioners
   By default, data passed between Mappers by usual writing of
    intermediate data to disk
       Can always use side-effects…
       There is a better, built-in way to bypass this and pass
        (Key,Value) pairs by reference instead
         • Requires different Mapper semantics!

   (Figure: two parallel chains of Mapper 1 → Intermediates → Mapper 2 → Reducer → Mapper 3 → Persistent Output)
Hadoop: ChainMapper & ChainReducer
   Built on JobConf objects (deprecated in Hadoop 0.20.2)
       No undeprecated replacement in 0.20.2…
   Examples here work for later versions with small changes
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
...

boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?

JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class,
                      Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);

JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class,
                      Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);

JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class,
                    ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);

JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class,
                       Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);

JobClient.runJob(job);
Chaining in Hadoop
   Let’s continue our running example:
       Mapper 1: Tokenize
       Mapper 2: Observe (count) words
       Reducer: same IntSum reducer as always
        Mapper 3: Log-dampen counts
         • We didn’t have this in our pipeline example but we’ll add here…
Chained Tokenizer + WordCount
// Set up configuration and intermediate directory location
Configuration conf = new Configuration();
JobConf chainJob = new JobConf(conf);
chainJob.setJobName("Chain job");
chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
chainJob.setNumReduceTasks(reduceTasks);
FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));

// pass output (Key,Value) pairs to next Mapper by reference?
boolean passByRef = false;

JobConf map1 = new JobConf(false); // tokenization
ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class,
                     LongWritable.class, Text.class,
                     LongWritable.class, TextArrayWritable.class, passByRef, map1);

JobConf map2 = new JobConf(false); // Add token observer job
ChainMapper.addMapper(chainJob, TrivialWordObserver.class,
                      LongWritable.class, TextArrayWritable.class,
                      Text.class, LongWritable.class, passByRef, map2);

JobConf reduce = new JobConf(false); // Set the int sum reducer
ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class,
                        Text.class, LongWritable.class, passByRef, reduce);

JobConf map3 = new JobConf(false); // log-scaling of counts
ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class,
                       Text.class, FloatWritable.class, passByRef, map3);

JobClient.runJob(chainJob);
Hadoop Chaining: Pass by Reference
   Chaining allows possible optimization
       Chained mappers run in same JVM thread, so opportunity to avoid
        serialization to/from disk with pipelined jobs
       Also lesser benefit of avoiding extra object destruction / construction
   Gotchas
        OutputCollector.collect(K k, V v) promises not to
         alter the content of k and v
       But if Map1 passes (k,v) by reference to Map2 via collect(),
        Map2 may alter (k,v) & thereby violate the contract
   What to do?
       Option 1: Honor the contract – don’t alter input (k,v) in Map2
       Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()
       Document carefully to avoid later changes silently breaking this…
Setting Dependencies Between Jobs
   JobControl and Job provide the mechanism
 // create jobconf1 and jobconf2 as appropriate
 // …

 Job job1 = new Job(jobconf1);
 Job job2 = new Job(jobconf2);
 job2.addDependingJob(job1);

 JobControl jbcntrl = new JobControl("jbcntrl");
 jbcntrl.addJob(job1);
 jbcntrl.addJob(job2);
 jbcntrl.run();




   New API: no JobConf, create Job from Configuration, …
Higher Level Abstractions
   Pig: language and execution environment for expressing
    MapReduce data flows. (pretty much the standard)
       See White, Chapter 11
   Cascading: another environment with a higher level of
    abstraction for composing complex data flows
       See White, Chapter 16, pp 539-552
   Cascalog: query language based on Cascading that uses
    Clojure (a JVM-based LISP variant)
       Word count in Cascalog
       Certainly more concise – though you need to grok the syntax.

      (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))

 
But Who Protects the Moderators?
But Who Protects the Moderators?But Who Protects the Moderators?
But Who Protects the Moderators?
 
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
Believe it or not: Designing a Human-AI Partnership for Mixed-Initiative Fact...
 
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
Mix and Match: Collaborative Expert-Crowd Judging for Building Test Collectio...
 
Fact Checking & Information Retrieval
Fact Checking & Information RetrievalFact Checking & Information Retrieval
Fact Checking & Information Retrieval
 
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
Your Behavior Signals Your Reliability: Modeling Crowd Behavioral Traces to E...
 
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
What Can Machine Learning & Crowdsourcing Do for You? Exploring New Tools for...
 
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
 
Systematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s ClothingSystematic Review is e-Discovery in Doctor’s Clothing
Systematic Review is e-Discovery in Doctor’s Clothing
 
The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)The Rise of Crowd Computing (July 7, 2016)
The Rise of Crowd Computing (July 7, 2016)
 
The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016The Rise of Crowd Computing - 2016
The Rise of Crowd Computing - 2016
 
The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)The Rise of Crowd Computing (December 2015)
The Rise of Crowd Computing (December 2015)
 
Toward Better Crowdsourcing Science
 Toward Better Crowdsourcing Science Toward Better Crowdsourcing Science
Toward Better Crowdsourcing Science
 
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work PlatformsBeyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
Beyond Mechanical Turk: An Analysis of Paid Crowd Work Platforms
 

Último

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 

Último (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 

Lecture 6: Data-Intensive Computing for Text Analysis (Fall 2011)

  • 1. Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M University of Texas at Austin, Fall 2011 Lecture 6 September 29, 2011 Jason Baldridge Matt Lease Department of Linguistics School of Information University of Texas at Austin University of Texas at Austin Jasonbaldridge at gmail dot com ml at ischool dot utexas dot edu
  • 2. Acknowledgments Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park Some figures courtesy of the following excellent Hadoop books (order yours today!) • Chuck Lam’s Hadoop In Action (2010) • Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
  • 3. Today’s Agenda • Automatic Spelling Correction – Review: Information Retrieval (IR) • Boolean Search • Vector Space Modeling • Inverted Indexing in MapReduce – Probabilisitic modeling via noisy channel • Index Compression – Order inversion in MapReduce • In-class exercise • Hadoop: Pipelined & Chained jobs
  • 5. Automatic Spelling Correction  Three main stages  Error detection  Candidate generation  Candidate ranking / choose best candidate  Usage cases  Flagging possible misspellings / spell checker  Suggesting possible corrections  Automatically correcting (inferred) misspellings • “as you type” correction • web queries • real-time closed captioning • …
  • 6. Types of spelling errors  Unknown words: “She is their favorite acress in town.”  Can be identified using a dictionary…  …but could be a valid word not in the dictionary  Dictionary could be automatically constructed from large corpora • Filter out rare words (misspellings, or valid but unlikely)… • Why filter out rare words that are valid?  Unknown words violating phonotactics:  e.g. “There isn’t enough room in this tonw for the both of us.”  Given dictionary, could automatically construct “n-gram dictionary” of all character n-grams known in the language • e.g. English words don’t end with “nw”, so flag tonw  Incorrect homophone: “She drove their.”  Valid word, wrong usage; infer appropriateness from context  Typing errors reflecting kayout of leyboard
  • 7. Candidate generation  How to generate possible corrections for acress?  Inspiration: how do people do it?  People may suggest words like actress, across, access, acres, caress, and cress – what do these have in common?  What about “blam” and “zigzag”?  Two standard strategies for candidate generation  Minimum edit distance • Generate all candidates within 1+ edit step(s) • Possible edit operations: insertion, deletion, substitution, transposition, … • Filter through a dictionary • See Peter Norvig’s post: http://norvig.com/spell-correct.html  Character ngrams: see next slide…
  • 8. Character ngram Spelling Correction  Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is set of character ngrams  Let’s use n=3 (trigram), with # to mark word start/end  Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]  Uhm, IR model???  Review…
  • 9. Abstract IR Architecture  (diagram) Online side: a Query passes through a Representation Function to produce a Query Representation. Offline side: Documents pass through a Representation Function to produce Document Representations, which are stored in an Index. A Comparison Function matches the query representation against the index to produce Results.
  • 10. Document  Boolean Representation McDonald's slims down spuds Fast-food chain to reduce certain types of “Bag of Words” fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is McDonalds cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items fat healthier. But does that mean the popular shoestring fries fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with new an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. french But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with Company the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's Said (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger nutrition King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately … be reached for comment. …
  • 11. Boolean Retrieval  (example: a term–document matrix over Doc 1–Doc 4 with terms such as dogs, dolphins, football)
  • 12. Inverted Index: Boolean Retrieval  Doc 1: one fish, two fish  Doc 2: red fish, blue fish  Doc 3: cat in the hat  Doc 4: green eggs and ham  Postings: blue → [2], cat → [3], egg → [4], fish → [1, 2], green → [4], ham → [4], hat → [3], one → [1], red → [2], two → [1]
  • 13. Inverted Indexing via MapReduce Doc 1 Doc 2 Doc 3 one fish, two fish red fish, blue fish cat in the hat one 1 red 2 cat 3 Map two 1 blue 2 hat 3 fish 1 fish 2 Shuffle and Sort: aggregate values by keys cat 3 blue 2 Reduce fish 1 2 hat 3 one 1 two 1 red 2
  • 14. Inverted Indexing in MapReduce 1: class Mapper 2: procedure Map(docid n; doc d) 3: H = new Set 4: for all term t in doc d do 5: H.add(t) 6: for all term t in H do 7: Emit(term t, n) 1: class Reducer 2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …]) 3: List P = docids.values() 4: Emit(term t; P)
  • 15. Scalability Bottleneck  Desired output format: <term, [doc1, doc2, …]>  Just emitting each <term, docID> pair won’t produce this  How to produce this without buffering?  Side-effect: write directly to HDFS instead of emitting  Complications? • Persistent data must be cleaned up if reducer restarted…
  • 16. Using the Inverted Index  Boolean Retrieval: to execute a Boolean query  Build the query syntax tree, e.g. ( blue AND fish ) OR ham  For each clause, look up postings, e.g. blue → [2], fish → [1, 2]  Traverse postings and apply the Boolean operator  Efficiency analysis  Start with the shortest postings list first  Postings traversal is linear (if postings are sorted) • Oops… we didn’t actually do this in building our index…
  • 17. Inverted Indexing in MapReduce 1: class Mapper 2: procedure Map(docid n; doc d) 3: H = new Set 4: for all term t in doc d do 5: H.add(t) 6: for all term t in H do 7: Emit(term t, n) 1: class Reducer 2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …]) 3: List P = docids.values() 4: Emit(term t; P)
  • 18. Inverted Indexing in MapReduce: try 2  1: class Mapper 2: procedure Map(docid n; doc d) 3: H = new Set 4: for all term t in doc d do 5: H.add(t) 6: for all term t in H do 7: Emit(term t, n)  1: class Reducer 2: procedure Reduce(term t; Iterator<integer> docids [n1, n2, …]) 3: List P = docids.values() 4: Sort(P) 5: Emit(term t; P)  (e.g. fish → [1, 2])
  • 19. (Another) Scalability Bottleneck  The reducer buffers all docIDs associated with a term (to sort them)  What if the term occurs in many documents?  Secondary sorting  Use a composite key  Partition function  Key Comparator  Side-effect: write directly to HDFS as before…
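As a concrete reference for the secondary-sorting fix above, here is a minimal Java sketch of the composite key and partitioner, written against the newer org.apache.hadoop.mapreduce API rather than the older mapred API used elsewhere in these slides; the class and method names are illustrative assumptions, not code from the course. A grouping comparator on the term alone would complete the pattern so that all (term, docID) keys for one term arrive in a single reduce call.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key (term, docID): the framework sorts postings for us.
class TermDocPair implements WritableComparable<TermDocPair> {
  private Text term = new Text();
  private LongWritable docId = new LongWritable();
  public Text getTerm() { return term; }
  public void set(String t, long d) { term.set(t); docId.set(d); }
  public void write(DataOutput out) throws IOException { term.write(out); docId.write(out); }
  public void readFields(DataInput in) throws IOException { term.readFields(in); docId.readFields(in); }
  public int compareTo(TermDocPair o) {
    int c = term.compareTo(o.term);                  // group by term first...
    return c != 0 ? c : docId.compareTo(o.docId);    // ...then sort by docID
  }
  public int hashCode() { return term.hashCode(); }
  public boolean equals(Object o) {
    return o instanceof TermDocPair && compareTo((TermDocPair) o) == 0;
  }
}

// Partition on the term alone so every docID for a term reaches the same reducer.
class TermPartitioner extends Partitioner<TermDocPair, LongWritable> {
  public int getPartition(TermDocPair key, LongWritable value, int numReduceTasks) {
    return (key.getTerm().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}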
  • 20. Inverted index for spelling correction  Like search, spelling correction must be fast  How can we quickly identify candidate corrections?  II (inverted index): map each character ngram -> list of all words containing it  #ac -> { act, across, actress, acquire, … }  acr -> { across, acrimony, macro, … }  cre -> { crest, acre, acres, … }  res -> { arrest, rest, rescue, restaurant, … }  ess -> { less, lesson, necessary, actress, … }  ss# -> { less, mess, moss, across, actress, … }  How do we build the inverted index in MapReduce?
  • 21. Exercise  Write a MapReduce algorithm for creating an inverted index for trigram spelling correction, given a corpus
  • 22. Exercise  Write a MapReduce algorithm for creating an inverted index for trigram spelling correction, given a corpus
Map(String docid, String text):
  for each word w in text:
    for each trigram t in w:
      Emit(t, w)
Reduce(String trigram, Iterator<Text> values):
  Emit(trigram, values.toSet)
Also other alternatives, e.g. in-mapper combining, pairs  Is MapReduce even necessary for this?  Dictionary vs. token frequency
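A hedged Java sketch of the same algorithm under the newer mapreduce API; the class names are illustrative, and the byte-offset key supplied by TextInputFormat stands in for the docid of the pseudocode since the mapper never uses it:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TrigramIndex {
  public static class TrigramMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String word : line.toString().toLowerCase().split("\\W+")) {
        if (word.isEmpty()) continue;
        String padded = "#" + word + "#";                 // mark word start/end
        for (int i = 0; i + 3 <= padded.length(); i++) {
          context.write(new Text(padded.substring(i, i + 3)), new Text(word));
        }
      }
    }
  }
  public static class TrigramReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text trigram, Iterable<Text> words, Context context)
        throws IOException, InterruptedException {
      Set<String> unique = new HashSet<String>();         // de-duplicate words per trigram
      for (Text w : words) unique.add(w.toString());
      context.write(trigram, new Text(unique.toString()));
    }
  }
}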
  • 23. Spelling correction as Boolean search  Given inverted index, how to find set of possible corrections?  Compute union of all words indexed by any of its character ngrams  = Boolean search • Query “acress”  “#ac OR acr OR cre OR res OR ess OR ss# “  Are all corrections equally likely / good?
  • 24. Ranked Information Retrieval  Order documents by probability of relevance  Estimate relevance of each document to the query  Rank documents by relevance  How do we estimate relevance?  Vector space paradigm  Approximate relevance by vector similarity (e.g. cosine)  Represent queries and documents as vectors  Rank documents by vector similarity to the query
  • 25. Vector Space Model  (diagram: documents d1–d5 as vectors in a space with term axes t1, t2, t3; θ and φ are angles between vectors)  Assumption: Documents that are “close” in vector space “talk about” the same things. Retrieve documents based on how close the document vector is to the query vector (i.e., similarity ~ “closeness”)
  • 26. Similarity Metric  Use the “angle” between the vectors: cos(θ) = (d_j · d_k) / (|d_j| |d_k|)  sim(d_j, d_k) = Σ_{i=1..n} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1..n} w_{i,j}²) · sqrt(Σ_{i=1..n} w_{i,k}²) )  Given pre-normalized vectors, just compute the inner product: sim(d_j, d_k) = d_j · d_k = Σ_{i=1..n} w_{i,j} w_{i,k}
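For reference, a small plain-Java sketch of this cosine computation over sparse ngram vectors, with each vector stored as a map from ngram to weight; the class and method names are just placeholders:

import java.util.Map;

class CosineSketch {
  // cos(theta) between two sparse vectors stored as ngram -> weight maps
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      normA += e.getValue() * e.getValue();
      Double w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w;   // only shared ngrams contribute to the dot product
    }
    for (double w : b.values()) normB += w * w;
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}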
  • 27. Boolean Character ngram correction  Boolean Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is set of character ngrams  Let’s use n=3 (trigram), with # to mark word start/end  Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, iss, ssi, sis, sip, ipp, ppi, pi#]
  • 28. Ranked Character ngram correction  Vector space Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is vector of character ngram value  Rank candidate corrections according to vector similarity (cosine)  Trigram Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]
  • 29. Spelling Correction in Vector Space  (same diagram as before: vectors d1–d5 over term axes t1, t2, t3 with angles θ, φ)  Assumption: Words that are “close together” in ngram vector space have similar orthography. Therefore, retrieve words in the dictionary based on how close the word is to the typo (i.e., similarity ~ “closeness”)
  • 30. Ranked Character ngram correction  Vector space Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is vector of character ngram value  Rank candidate corrections according to vector similarity (cosine)  Trigram Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]  “value” here expresses relative importance of different vector components for the similarity comparison  Use simple count here, what else might we do?
  • 31. IR Term Weighting  Term weights consist of two components  Local: how important is the term in this document?  Global: how important is the term in the collection?  Here’s the intuition:  Terms that appear often in a document should get high weights  Terms that appear in many documents should get low weights  How do we capture this mathematically?  Term frequency (local)  Inverse document frequency (global)
  • 32. TF.IDF Term Weighting  w_{i,j} = tf_{i,j} × log(N / n_i)  where w_{i,j} = weight assigned to term i in document j, tf_{i,j} = number of occurrences of term i in document j, N = number of documents in the entire collection, n_i = number of documents containing term i
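A one-method Java sketch of this weight; the names are illustrative, and natural log is used since the log base only rescales all weights uniformly:

class TfIdfSketch {
  // w_{i,j} = tf_{i,j} * log(N / n_i)
  static double weight(int tf, int ni, int n) {
    return tf * Math.log((double) n / ni);
  }
}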
  • 33. Inverted Index: TF.IDF  Doc 1: one fish, two fish  Doc 2: red fish, blue fish  Doc 3: cat in the hat  Doc 4: green eggs and ham  Postings now carry df and (docID, tf) pairs: blue (df 1) → [(2, 1)], cat (df 1) → [(3, 1)], egg (df 1) → [(4, 1)], fish (df 2) → [(1, 2), (2, 2)], green (df 1) → [(4, 1)], ham (df 1) → [(4, 1)], hat (df 1) → [(3, 1)], one (df 1) → [(1, 1)], red (df 1) → [(2, 1)], two (df 1) → [(1, 1)]
  • 34. Inverted Indexing via MapReduce Doc 1 Doc 2 Doc 3 one fish, two fish red fish, blue fish cat in the hat one 1 red 2 cat 3 Map two 1 blue 2 hat 3 fish 1 fish 2 Shuffle and Sort: aggregate values by keys cat 3 blue 2 Reduce fish 1 2 hat 3 one 1 two 1 red 2
  • 35. Inverted Indexing via MapReduce (2) Doc 1 Doc 2 Doc 3 one fish, two fish red fish, blue fish cat in the hat one 1 1 red 2 1 cat 3 1 Map two 1 1 blue 2 1 hat 3 1 fish 1 2 fish 2 2 Shuffle and Sort: aggregate values by keys cat 3 1 blue 2 1 Reduce fish 1 2 2 2 hat 3 1 one 1 1 two 1 1 red 2 1
  • 36. Inverted Indexing: Pseudo-Code  Further exacerbates earlier scalability issues …
  • 37. Ranked Character ngram correction  Vector space Information Retrieval (IR) model  Query=typo word  Document collection = dictionary (i.e. set of valid words)  Representation: word is vector of character ngram value  Rank candidate corrections according to vector similarity (cosine)  Trigram Examples  across: [#ac, acr, cro, oss, ss#]  acress: [#ac, acr, cre, res, ess, ss#]  actress: [#ac, act, ctr, tre, res, ess, ss#]  blam: [#bl, bla, lam, am#]  mississippi: [#mis, (iss, 2), (ssi, 2), sis, sip, ipp, ppi, pi#]  “value” here expresses relative importance of different vector components for the similarity comparison  What else might we do? TF.IDF for character n-grams?
  • 38. TF.IDF for character n-grams  Think about what makes an ngram more discriminating  e.g. in acquire, acq and cqu are more indicative than qui and ire.  Schematically, we want something like: • acquire: [ #ac, acq, cqu, qui, uir, ire, re# ]  Possible solution: TF-IDF, where  TF is the frequency of the ngram in the word  IDF is based on the number of vocabulary words the ngram occurs in (fewer words → higher weight)
  • 39. Correction Beyond Orthography  So far we’ve focused on orthography alone  The context of a typo also tells us a great deal  How can we compare contexts?
  • 40. Correction Beyond Orthography  So far we’ve focused on orthography alone  The context of a typo also tells us a great deal  How can we compare contexts?  Idea: use the co-occurrence matrices built during HW2  We have a vector of co-occurrence counts for each word  Extract a similar vector for the typo given its immediate context • “She is their favorite acress in town.”  acress: [ she:1, is:1, their:1, favorite:1, in:1, town:1 ]  Possible enhancement: make vectors sensitive to word order
  • 41. Combining evidence  We have orthographic similarity and contextual similarity  We can do a simple weighted combination of the two, e.g.: simCombined(d_j, d_k) = λ · simOrth(d_j, d_k) + (1 − λ) · simContext(d_j, d_k)  How to do this more efficiently?  Compute top candidates based on simOrth  Take top k for consideration with simContext  …or other way around…  The combined model might also be expressed by a similar probabilistic model…
  • 42. Paradigm: Noisy-Channel Modeling  ŝ = argmax_S P(S | O) = argmax_S P(S) P(O | S)  Want to recover most likely latent (correct) source word underlying the observed (misspelled) word  P(S): language model gives probability distribution over possible (candidate) source words  P(O|S): channel model gives probability of each candidate source word being “corrupted” into the observed typo
  • 43. Noisy Channel Model for correction  We want to rank candidates by P(cand | typo)  Using Bayes law, the chain rule, an independence assumption, and logs, we have:
P(cand | typo, context) = P(cand, typo, context) / P(typo, context)
  ∝ P(cand, typo, context)
  = P(typo | cand, context) · P(cand, context)
  ≈ P(typo | cand) · P(cand, context)
  = P(typo | cand) · P(cand | context) · P(context)
  ∝ P(typo | cand) · P(cand | context)
  → rank by log P(typo | cand) + log P(cand | context)
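A minimal Java sketch of ranking candidates by this final quantity, assuming the channel model and the contextual language model are available behind the two hypothetical interfaces shown (neither interface comes from the lecture, and smoothing is assumed so the probabilities are nonzero):

import java.util.Collection;
import java.util.List;

class NoisyChannelRanker {
  // Hypothetical model interfaces: any smoothed channel / language model would do.
  interface ChannelModel { double probTypoGivenWord(String typo, String word); }
  interface LanguageModel { double probWordGivenContext(String word, List<String> context); }

  // Pick the argmax of log P(typo | cand) + log P(cand | context) over the candidate set.
  static String bestCorrection(String typo, List<String> context,
                               Collection<String> candidates,
                               ChannelModel channel, LanguageModel lm) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String cand : candidates) {
      double score = Math.log(channel.probTypoGivenWord(typo, cand))
                   + Math.log(lm.probWordGivenContext(cand, context));
      if (score > bestScore) { bestScore = score; best = cand; }
    }
    return best;
  }
}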
  • 44. Probabilistic vs. vector space model  Both measure orthographic & contextual “fit” of the candidate given the typo and its usage context  Noisy channel: rank by log P(typo | cand) + log P(cand | context)  IR approach: simCombined(d_j, d_k) = λ · simOrth(d_j, d_k) + (1 − λ) · simContext(d_j, d_k)  Both can benefit from “big” data (i.e. bigger samples)  Better estimates of probabilities and population frequencies  Usual probabilistic vs. non-probabilistic tradeoffs  Principled theory and methodology for modeling and estimation  How to extend the feature space to include additional information? • Typing haptics (key proximity)? Cognitive errors (e.g. homonyms)?
  • 46. Postings Encoding  Conceptually: fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …  In practice: • Instead of document IDs, encode deltas (or d-gaps) • But it’s not obvious that this saves space…  fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …
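A plain-Java sketch of the d-gap transformation and its inverse (method and class names are illustrative):

class DGaps {
  // e.g. [1, 9, 21, 34, 35, 80] -> [1, 8, 12, 13, 1, 45]
  static int[] toGaps(int[] sortedDocIds) {
    int[] gaps = new int[sortedDocIds.length];
    int prev = 0;
    for (int i = 0; i < sortedDocIds.length; i++) {
      gaps[i] = sortedDocIds[i] - prev;   // difference from the previous docID
      prev = sortedDocIds[i];
    }
    return gaps;
  }
  // Inverse: a running sum recovers the original docIDs.
  static int[] fromGaps(int[] gaps) {
    int[] ids = new int[gaps.length];
    int prev = 0;
    for (int i = 0; i < gaps.length; i++) {
      prev += gaps[i];
      ids[i] = prev;
    }
    return ids;
  }
}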
  • 47. Overview of Index Compression  Byte-aligned vs. bit-aligned  Non-parameterized bit-aligned  Unary codes  γ (gamma) codes  δ (delta) codes  Parameterized bit-aligned  Golomb codes  Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!
  • 48. But First... General Data Compression  Run Length Encoding  7 7 7 8 8 9 = (7, 3), (8,2), (9,1)  Binary Equivalent  0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 = 6, 1, 3, 2, 3  Good with sparse binary data  Huffman Coding  Optimal when data is distributed by negative powers of two  e.g. P(a)= ½, P(b) = ¼, P(c)=1/8, P(d)=1/8 • a = 0, b = 10, c= 110, d=111  Prefix codes: no codeword is the prefix of another codeword • If we read 0, we know it’s an “a”; the following bits are a new codeword • Similarly 10 is a b (no other codeword starts with 10), etc. • Prefix is 1* (i.e. path to internal nodes is all 1s, output on leaves)
  • 49. Unary Codes  Encode number as a run of 1s, specifically…  x ≥ 1 coded as x−1 1s, followed by a zero-bit terminator  1 = 0  2 = 10  3 = 110  4 = 1110  ...  Great for small numbers… horrible for large numbers  Overly-biased for very small gaps
  • 50. γ codes  x ≥ 1 is coded in two parts: unary length : offset  Start with binary encoding, remove highest-order bit = offset  Length is number of binary digits, encoded in unary  Concatenate length + offset codes  Example: 9 in binary is 1001  Offset = 001  Length = 4, in unary code = 1110  γ code = 1110:001  Another example: 7 (111 in binary) • offset = 11, length = 3 (110 in unary)  γ code = 110:11  Analysis  Offset = ⌊log x⌋ bits  Length = ⌊log x⌋ + 1 bits  Total = 2⌊log x⌋ + 1 bits (97 bits, 75 bits, …)
  • 51. δ codes  As with γ codes, two parts: unary length & offset  Offset is same as before  Length is encoded by its γ code  Example: 9 (= 1001 in binary)  Offset = 001  Length = 4 (100 in binary): offset = 00, length 3 = 110 in unary • γ code of 4 = 110:00  δ code = 110:00:001  Comparison  γ codes better for smaller numbers  δ codes better for larger numbers
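A small Java sketch of these three encoders, emitting the same human-readable bit strings (with ':' separators) used on these slides; a real index would pack raw bits rather than build Strings, and the class name is just a placeholder:

class EliasCodes {
  // Unary: x >= 1 encoded as (x - 1) ones followed by a terminating zero.
  static String unary(int x) {
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i < x; i++) sb.append('1');
    return sb.append('0').toString();
  }
  // Gamma: unary length, then the binary offset (binary with the leading 1 removed).
  static String gamma(int x) {
    String binary = Integer.toBinaryString(x);                    // 9 -> "1001"
    return unary(binary.length()) + ":" + binary.substring(1);    // -> "1110:001"
  }
  // Delta: like gamma, but the length itself is gamma-coded.
  static String delta(int x) {
    String binary = Integer.toBinaryString(x);
    return gamma(binary.length()) + ":" + binary.substring(1);    // 9 -> "110:00:001"
  }
}

For example, gamma(9) returns "1110:001" and delta(9) returns "110:00:001", matching the worked examples above.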
  • 52. Golomb Codes  x ≥ 1, parameter b  x encoded in two parts  Part 1: q = ⌊(x − 1) / b⌋, code q + 1 in unary  Part 2: remainder r < b, r = x − qb − 1, coded in truncated binary  Truncated binary defines a prefix code  if b is a power of 2 • easy case: truncated binary = regular binary  else • First 2^(⌊log b⌋ + 1) − b values encoded in ⌊log b⌋ bits • Remaining values encoded in ⌊log b⌋ + 1 bits  Let’s see some examples
  • 53. Golomb Code Examples  b = 3, r = [0:2]  First 2^(⌊log 3⌋ + 1) − 3 = 2^2 − 3 = 1 value, in ⌊log 3⌋ = 1 bit  First 1 value in 1 bit: 0  Remaining 3−1 = 2 values in 1+1 = 2 bits with prefix 1: 10, 11  b = 5, r = [0:4]  First 2^(⌊log 5⌋ + 1) − 5 = 2^3 − 5 = 3 values, in ⌊log 5⌋ = 2 bits  First 3 values in 2 bits: 00, 01, 10  Remaining 5−3 = 2 values in 2+1 = 3 bits with prefix 11: 110, 111 • Two prefix bits needed since single leading 1 already used in “10”  b = 6, r = [0:5]  First 2^(⌊log 6⌋ + 1) − 6 = 2^3 − 6 = 2 values, in ⌊log 6⌋ = 2 bits  First 2 values in 2 bits: 00, 01  Remaining 6−2 = 4 values in 2+1 = 3 bits with prefix 1: 100, 101, 110, 111
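A Java sketch of the Golomb encoder with the truncated-binary remainder, assuming b ≥ 2; unary() is repeated from the previous sketch so this stands alone, and the class name is a placeholder:

class GolombCode {
  // Golomb code of x >= 1 with parameter b >= 2: quotient in unary, remainder in truncated binary.
  static String golomb(int x, int b) {
    int q = (x - 1) / b;                   // integer division = floor
    int r = x - q * b - 1;                 // remainder in [0, b)
    return unary(q + 1) + ":" + truncatedBinary(r, b);
  }
  // First 2^k - b values get k-1 bits, the rest get k bits, where k = ceil(log2 b).
  static String truncatedBinary(int r, int b) {
    int k = 32 - Integer.numberOfLeadingZeros(b - 1);   // ceil(log2 b) for b >= 2
    int u = (1 << k) - b;                               // number of short codewords
    return (r < u) ? toBits(r, k - 1) : toBits(r + u, k);
  }
  static String toBits(int value, int width) {
    StringBuilder sb = new StringBuilder();
    for (int i = width - 1; i >= 0; i--) sb.append((value >> i) & 1);
    return sb.toString();
  }
  static String unary(int x) {             // repeated here so the sketch stands alone
    StringBuilder sb = new StringBuilder();
    for (int i = 1; i < x; i++) sb.append('1');
    return sb.append('0').toString();
  }
}

For example, golomb(7, 3) returns "110:0" and golomb(9, 6) returns "10:100", agreeing with the comparison table on the next slide.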
  • 54. Comparison of Coding Schemes
   x   Unary        γ          δ           Golomb b=3   Golomb b=6
   1   0            0          0           0:0          0:00
   2   10           10:0       100:0       0:10         0:01
   3   110          10:1       100:1       0:11         0:100
   4   1110         110:00     101:00      10:0         0:101
   5   11110        110:01     101:01      10:10        0:110
   6   111110       110:10     101:10      10:11        0:111
   7   1111110      110:11     101:11      110:0        10:00
   8   11111110     1110:000   11000:000   110:10       10:01
   9   111111110    1110:001   11000:001   110:11       10:100
  10   1111111110   1110:010   11000:010   1110:0       10:101
See Figure 4.5 in Lin & Dyer p. 77 for b=5 and b=10. Witten, Moffat, Bell, Managing Gigabytes (1999)
  • 55. Index Compression: Performance  Comparison of Index Size (bits per pointer)
           Bible   TREC
  Unary    262     1918
  Binary   15      20
  γ        6.51    6.63
  δ        6.23    6.38
  Golomb   6.09    5.84
Use Golomb for d-gaps, γ codes for term frequencies  Optimal b ≈ 0.69 (N/df): Different b for every term!  Bible: King James version of the Bible; 31,101 verses (4.3 MB)  TREC: TREC disks 1+2; 741,856 docs (2070 MB)  Witten, Moffat, Bell, Managing Gigabytes (1999)
  • 56. Where are we without compression?  (key) → (values), arriving unsorted: fish → (1, 2, [2,4]), (34, 1, [23]), (21, 3, [1,8,22]), (35, 2, [8,41]), (80, 3, [2,9,76]), (9, 1, [9])  (keys) → (values), composite keys sorted by the framework: (fish, 1) → [2,4], (fish, 9) → [9], (fish, 21) → [1,8,22], (fish, 34) → [23], (fish, 35) → [8,41], (fish, 80) → [2,9,76]  How is this different? • Let the framework do the sorting • Directly write postings to disk • Term frequency implicitly stored
  • 57. Index Compression in MapReduce  Need df to compress posting for each term  How do we compute df?  Count the # of postings in reduce(), then compress  Problem?
  • 58. Order Inversion Pattern  In the mapper:  Emit “special” key-value pairs to keep track of df  In the reducer:  Make sure “special” key-value pairs come first: process them to determine df  Remember: proper partitioning!
  • 59. Getting the df: Modified Mapper  Input document: Doc 1, “one fish, two fish”  Emit normal key-value pairs: (fish, 1) → [2,4], (one, 1) → [1], (two, 1) → [3]  Emit “special” key-value pairs to keep track of df: (fish, ★) → [1], (one, ★) → [1], (two, ★) → [1]
  • 60. Getting the df: Modified Reducer  First, compute the df by summing contributions from all “special” key-value pairs: (fish, ★) → [63] [82] [27] …  Then compress postings incrementally as they arrive: (fish, 1) → [2,4], (fish, 9) → [9], (fish, 21) → [1,8,22], (fish, 34) → [23], (fish, 35) → [8,41], (fish, 80) → [2,9,76], …  Important: properly define the sort order to make sure “special” key-value pairs come first!  Write postings directly to disk  Where have we seen this before?
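To make the reducer-side bookkeeping concrete, here is a minimal plain-Java sketch (not the course's actual reducer) that fixes df and the Golomb parameter when the special pair arrives and then compresses each posting as it streams in. It reuses GolombCode and EliasCodes from the earlier sketches, and writing df as a gamma-coded header is an assumption about the on-disk layout:

class PostingsWriterSketch {
  private int df;                      // document frequency, fixed by the special pair
  private int b = 2;                   // Golomb parameter for the d-gaps
  private int lastDocId = 0;
  private final StringBuilder out = new StringBuilder();

  // Called once per term, when the "special" (term, *) pair arrives first.
  void onSpecialPair(int documentFrequency, int collectionSize) {
    df = documentFrequency;
    // b roughly 0.69 * (N / df), clamped to 2 so the truncated-binary sketch above applies
    b = Math.max(2, (int) Math.round(0.69 * collectionSize / df));
    out.append(EliasCodes.gamma(df)).append(' ');            // assumed layout: df heads the list
  }

  // Called for each real posting, in increasing docID order.
  void onPosting(int docId, int termFrequency) {
    int gap = docId - lastDocId;                             // d-gap, positive since docIDs are sorted
    lastDocId = docId;
    out.append(GolombCode.golomb(gap, b)).append(' ');       // gaps: Golomb codes
    out.append(EliasCodes.gamma(termFrequency)).append(' '); // term frequencies: gamma codes
  }

  String bits() { return out.toString(); }                   // a real reducer would flush to HDFS
}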
  • 62. Exercise: where have all the ngrams gone?  For each observed (word) trigram in the collection, output its observed (docID, wordIndex) locations
Input:  Doc 1: one fish two fish   Doc 2: one fish two salmon   Doc 3: two fish two fish
Output:
  one fish two → [(1,1),(2,1)]
  fish two fish → [(1,2),(3,2)]
  fish two salmon → [(2,2)]
  two fish two → [(3,1)]
Possible tools: * pairs/stripes? * combining? * secondary sorting? * order inversion? * side effects?
  • 63. Exercise: shingling  Given observed (docID, wordIndex) ngram locations, for each document, for each of its ngrams (in order), give the list of ngram locations for that ngram
Input:
  one fish two → [(1,1),(2,1)]
  fish two fish → [(1,2),(3,2)]
  fish two salmon → [(2,2)]
  two fish two → [(3,1)]
Output:
  Doc 1 → [ [(1,1),(2,1)], [(1,2),(3,2)] ]
  Doc 2 → [ [(1,1),(2,1)], [(2,2)] ]
  Doc 3 → [ [(3,1)], [(1,2),(3,2)] ]
Possible tools: * pairs/stripes? * combining? * secondary sorting? * order inversion? * side effects?
  • 64. Exercise: shingling (2) How can we recognize when longer ngrams are aligned across documents? Example doc 1: a b c d e doc 2: a b c d f doc 3: e b c d f doc 4: a b c d e Find “a b c d” in docs 1 2 and 4, “b c d f” in 2 & 3 “a b c d e” in 1 and 4
  • 65.
class Alignment
  int index       // start position in this document
  int length      // sequence length in ngrams
  int otherID     // ID of other document
  int otherIndex  // start position in other document

typedef Pair<int docID, int position> Ngram;

class NgramExtender
  Set<Alignment> alignments = empty set
  index = 0;
  NgramExtender(int docID) { _docID = docID }
  close() { foreach Alignment a, emit(_docID, a) }
  AlignNgrams(List<Ngram> ngrams)
    // call this function iteratively in order of ngrams observed in this document
    ...

@inproceedings{Kolak:2008, author = {Kolak, Okan and Schilit, Bill N.}, title = {Generating links by mining quotations}, booktitle = {19th ACM conference on Hypertext and hypermedia}, year = {2008}, pages = {117--126}}
  • 66.
class Alignment
  int index       // start position in this document
  int length      // sequence length in ngrams
  int otherID     // ID of other document
  int otherIndex  // start position in other document

typedef Pair<int docID, int position> Ngram;

class NgramExtender
  Set<Alignment> alignments = empty set
  index = 0;
  NgramExtender(int docID) { _docID = docID }
  close() { foreach Alignment a, emit(_docID, a) }
  AlignNgrams(List<Ngram> ngrams)
    // call this function iteratively in order of ngrams observed in this document
    ++index;
    foreach Alignment a in alignments
      Ngram next = new Ngram(a.otherID, a.otherIndex + a.length)
      if (ngrams.contains(next))
        // extend alignment
        a.length += 1; ngrams.remove(next)
      else
        // terminate alignment
        emit(_docID, a); alignments.remove(a)
    foreach ngram in ngrams
      alignments.add( new Alignment( index, 1, ngram.docID, ngram.otherIndex ) )
  • 68. Building more complex MR algorithms  Monolithic single Map + single Reduce  What we’ve done so far  Fitting all computation to this model can be difficult and ugly  We generally strive for modularization when possible  What else can we do?  Pipeline: [Map → Reduce] [Map → Reduce] … (multiple sequential jobs)  Chaining: [Map+ → Reduce → Map*] • 1 or more Mappers • 1 reducer • 0 or more Mappers  Pipelined Chain: [Map+ → Reduce → Map*] [Map+ → Reduce → Map*] …  Express arbitrary dependencies between jobs
  • 69. Modularization and WordCount  General benefits of modularization  Re-use for easier/faster development  Consistent behavior across applications  Easier/faster to maintain/extend for benefit of many applications  Even basic word count can be broken down  Pre-processing • How will we tokenize? Perform stemming? Remove stopwords?  Main computation: count tokenized tokens and group by word  Post-processing • Transform the values? (e.g. log-damping)  Let’s separate tokenization into its own module  Many other tasks can likely benefit  First approach: pipeline…
  • 70. Pipeline WordCount Modules
Tokenize: Tokenizer Mapper, no Reducer • String -> List[String] • Keep doc ID key • E.g. (10032, “the 10 cats sleep”) -> (10032, [“the”, “10”, “cats”, “sleep”])
Count: Observer Mapper + LongSumReducer • Mapper: List[String] -> List[(String, Int)], e.g. (10032, [“the”, “10”, “cats”, “sleep”]) -> [(“the”,1), (“10”, 1), (“cats”,1), (“sleep”,1)] • Reducer: sum token counts, e.g. (“sleep”, [1, 5, 2]) -> (“sleep”, 8)
  • 71. Pipeline WordCount in Hadoop  Two distinct jobs: tokenize and count  Data sharing between jobs via persistent output  Can use combiners and partitioners as usual (won’t bother here)  Let’s use SequenceFileOutputFormat rather than TextOutputFormat  sequence of binary key-value pairs; faster / smaller  tokenization output will stick around unless we delete it  Tokenize job  Just a mapper, no reducer: conf.setNumReduceTasks(0) or IdentityReducer  Output goes to directory we specify  Files will be read back in by the counting job  Output is array of tokens  We need to make a suitable Writable for String arrays  Count job  Input types defined by the input SequenceFile (don’t need to be specified)  Mapper is trivial  observes tokens from incoming data  Key: (docid) & Value: (Array of Strings, encoded as a Writable)
  • 72. Pipeline WordCount (old Hadoop API)
Configuration conf = new Configuration();
String tmpDir1to2 = "/tmp/intermediate1to2";
// Tokenize job
JobConf tokenizationJob = new JobConf(conf);
tokenizationJob.setJarByClass(PipelineWordCount.class);
FileInputFormat.setInputPaths(tokenizationJob, new Path(inputPath));
FileOutputFormat.setOutputPath(tokenizationJob, new Path(tmpDir1to2));
tokenizationJob.setOutputFormat(SequenceFileOutputFormat.class);
tokenizationJob.setMapperClass(AggressiveTokenizerMapper.class);
tokenizationJob.setOutputKeyClass(LongWritable.class);
tokenizationJob.setOutputValueClass(TextArrayWritable.class);
tokenizationJob.setNumReduceTasks(0);
// Count job
JobConf countingJob = new JobConf(conf);
countingJob.setJarByClass(PipelineWordCount.class);
countingJob.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(countingJob, new Path(tmpDir1to2));
FileOutputFormat.setOutputPath(countingJob, new Path(outputPath));
countingJob.setMapperClass(TrivialWordObserver.class);
countingJob.setReducerClass(MapRedIntSumReducer.class);
countingJob.setOutputKeyClass(Text.class);
countingJob.setOutputValueClass(IntWritable.class);
countingJob.setNumReduceTasks(reduceTasks);
JobClient.runJob(tokenizationJob);
JobClient.runJob(countingJob);
  • 73. Pipeline jobs in Hadoop  Old API  JobClient.runJob(..) does not return until the job finishes  New API  Use Job rather than JobConf  Use job.waitForCompletion instead of JobClient.runJob  Why the old API?  In 0.20.2, chaining is only possible under the old API  We want to re-use the same components for chaining (next…)
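A rough sketch of the same two-job pipeline under the new API as described above (Hadoop 0.20.x-era constructors; job configuration details are elided and would mirror the old-API listing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NewApiPipelineSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job tokenizationJob = new Job(conf, "tokenize");   // new Job(...) in 0.20.x; Job.getInstance(...) in later releases
    // ... set jar, mapper, zero reducers, input path, and intermediate output path as in the old-API listing ...
    if (!tokenizationJob.waitForCompletion(true)) System.exit(1);   // blocks until the job finishes

    Job countingJob = new Job(conf, "count");
    // ... set mapper, reducer, intermediate input path, and final output path ...
    System.exit(countingJob.waitForCompletion(true) ? 0 : 1);
  }
}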
  • 74. Chaining in Hadoop  (diagram: each chained task runs Mapper 1 → Mapper 2 → Reducer → Mapper 3, with intermediates flowing between stages and persistent output written at the end)  Map+ → Reduce → Map*  1 or more Mappers • Can use IdentityMapper  1 reducer • No reducers: conf.setNumReduceTasks(0)?  0 or more Mappers  Usual combiners and partitioners  By default, data passed between Mappers by usual writing of intermediate data to disk  Can always use side-effects…  There is a better, built-in way to bypass this and pass (Key,Value) pairs by reference instead • Requires different Mapper semantics!
  • 75. Hadoop: ChainMapper & ChainReducer  Uses JobConf objects (deprecated in Hadoop 0.20.2).  No undeprecated replacement in 0.20.2…  Examples here work for later versions with small changes
Configuration conf = new Configuration();
JobConf job = new JobConf(conf);
...
boolean passByRef = false; // pass output (Key,Value) pairs to next Mapper by reference?
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class, Map1InputKey.class, Map1InputValue.class, Map1OutputKey.class, Map1OutputValue.class, passByRef, map1Conf);
JobConf map2Conf = new JobConf(false);
ChainMapper.addMapper(job, Map2.class, Map1OutputKey.class, Map1OutputValue.class, Map2OutputKey.class, Map2OutputValue.class, passByRef, map2Conf);
JobConf reduceConf = new JobConf(false);
ChainReducer.setReducer(job, Reducer.class, Map2OutputKey.class, Map2OutputValue.class, ReducerOutputKey.class, ReducerOutputValue.class, passByRef, reduceConf);
JobConf map3Conf = new JobConf(false);
ChainReducer.addMapper(job, Map3.class, ReducerOutputKey.class, ReducerOutputValue.class, Map3OutputKey.class, Map3OutputValue.class, passByRef, map3Conf);
JobClient.runJob(job);
  • 76. Chaining in Hadoop  Let’s continue our running example:  Mapper 1: Tokenize  Mapper 2: Observe (count) words  Reducer: same IntSum reducer as always  Mapper 3 Log-dampen counts • We didn’t have this in our pipeline example but we’ll add here…
  • 77. Chained Tokenizer + WordCount
// Set up configuration and intermediate directory location
Configuration conf = new Configuration();
JobConf chainJob = new JobConf(conf);
chainJob.setJobName("Chain job");
chainJob.setJarByClass(ChainWordCount.class); // single jar for all Mappers and Reducers…
chainJob.setNumReduceTasks(reduceTasks);
FileInputFormat.setInputPaths(chainJob, new Path(inputPath));
FileOutputFormat.setOutputPath(chainJob, new Path(outputPath));
// pass output (Key,Value) pairs to next Mapper by reference?
boolean passByRef = false;
// tokenization
JobConf map1 = new JobConf(false);
ChainMapper.addMapper(chainJob, AggressiveTokenizerMapper.class, LongWritable.class, Text.class, LongWritable.class, TextArrayWritable.class, passByRef, map1);
// Add token observer job
JobConf map2 = new JobConf(false);
ChainMapper.addMapper(chainJob, TrivialWordObserver.class, LongWritable.class, TextArrayWritable.class, Text.class, LongWritable.class, passByRef, map2);
// Set the int sum reducer
JobConf reduce = new JobConf(false);
ChainReducer.setReducer(chainJob, LongSumReducer.class, Text.class, LongWritable.class, Text.class, LongWritable.class, passByRef, reduce);
// log-scaling of counts
JobConf map3 = new JobConf(false);
ChainReducer.addMapper(chainJob, ComputeLogMapper.class, Text.class, LongWritable.class, Text.class, FloatWritable.class, passByRef, map3);
JobClient.runJob(chainJob);
  • 78. Hadoop Chaining: Pass by Reference  Chaining allows a possible optimization  Chained mappers run in the same JVM thread, so there is an opportunity to avoid serialization to/from disk with pipelined jobs  Also a lesser benefit of avoiding extra object destruction / construction  Gotchas  OutputCollector.collect(K k, V v) promises not to alter the content of k and v  But if Map1 passes (k,v) by reference to Map2 via collect(), Map2 may alter (k,v) & thereby violate the contract  What to do?  Option 1: Honor the contract – don’t alter input (k,v) in Map2  Option 2: Re-negotiate terms – don’t re-use (k,v) in Map1 after collect()  Document carefully to avoid later changes silently breaking this…
  • 79. Setting Dependencies Between Jobs  JobControl and Job provide the mechanism
// create jobconf1 and jobconf2 as appropriate
// …
Job job1 = new Job(jobconf1);
Job job2 = new Job(jobconf2);
job2.addDependingJob(job1);
JobControl jbcntrl = new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
jbcntrl.run();
New API: no JobConf, create Job from Configuration, …
  • 80. Higher Level Abstractions  Pig: language and execution environment for expressing MapReduce data flows. (pretty much the standard)  See White, Chapter 11  Cascading: another environment with a higher level of abstraction for composing complex data flows  See White, Chapter 16, pp 539-552  Cascalog: query language based on Cascading that uses Clojure (a JVM-based LISP variant)  Word count in Cascalog  Certainly more concise – though you need to grok the syntax. (?<- (stdout) [?word ?count] (sentence ?s) (split ?s :> ?word) (c/count ?count))