SlideShare una empresa de Scribd logo
1 de 72
Descargar para leer sin conexión
Building Mini-Google in Ruby


                                                                              Ilya Grigorik
                                                                                          @igrigorik


Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
postrank.com/topic/ruby




                               The slides…                           Twitter                           My blog


Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank             @igrigorik #railsconf
Ruby + Math
                                                                            PageRank
              Optimization




                               Examples                                          Indexing
   Misc Fun




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
PageRank                                       PageRank + Ruby




      Tools
        +                       Examples                                       Indexing
   Optimization

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Consume with care…
                     everything that follows is based on released / public domain info




Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Search-engine graveyard
                                                                  Google did pretty well…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Query: Ruby




                                                                                           Results




       1. Crawl                           2. Index                                         3. Rank




                                                                  Search pipeline
                                                                                  50,000-foot view



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Query: Ruby




                                                                                                Results




       1. Crawl                           2. Index                                          3. Rank




            Bah                           Interesting                                     Fun




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
CPU Speed                                     333Mhz
          RAM                                           32-64MB

          Index                                         27,000,000 documents
          Index refresh                                 once a month~ish
          PageRank computation                          several days

          Laptop CPU                                    2.1Ghz
          VM RAM                                        1GB
          1-Million page web                            ~10 minutes


                                                                  circa 1997-1998



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Creating & Maintaining an Inverted Index
                                                                    DIY and the gotchas within




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
require 'set'
                                                      {
                                                         quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
    pages = {
                                                         quot;aquot;=>#<Set: {quot;3quot;}>,
     quot;1quot; => quot;it is what it isquot;,
                                                         quot;bananaquot;=>#<Set: {quot;3quot;}>,
     quot;2quot; => quot;what is itquot;,
                                                         quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
     quot;3quot; => quot;it is a bananaquot;
                                                         quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
    }
                                                        }
    index = {}

    pages.each do |page, content|
     content.split(/s/).each do |word|
      if index[word]
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'set'
                                                      {
                                                         quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
    pages = {
                                                         quot;aquot;=>#<Set: {quot;3quot;}>,
     quot;1quot; => quot;it is what it isquot;,
                                                         quot;bananaquot;=>#<Set: {quot;3quot;}>,
     quot;2quot; => quot;what is itquot;,
                                                         quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
     quot;3quot; => quot;it is a bananaquot;
                                                         quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
    }
                                                        }
    index = {}

    pages.each do |page, content|
     content.split(/s/).each do |word|
      if index[word]
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'set'
                                                      {
                                                         quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
    pages = {
                                                         quot;aquot;=>#<Set: {quot;3quot;}>,
     quot;1quot; => quot;it is what it isquot;,
                                                         quot;bananaquot;=>#<Set: {quot;3quot;}>,
     quot;2quot; => quot;what is itquot;,
                                                         quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
     quot;3quot; => quot;it is a bananaquot;
                                                         quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
    }
                                                        }
    index = {}

    pages.each do |page, content|
                                                                  Word => [Document]
     content.split(/s/).each do |word|
      if index[word]
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>


                                                                    1                        3
                                                                                 2
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                  Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>


                                                                    1                        3
                                                                                 2
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                  Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>


                                                                    1                        3
                                                                                 2
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                  Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
# query: quot;what is bananaquot;
 p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;]
 # > #<Set: {}>


 # query: quot;a bananaquot;
 p index[quot;aquot;] & index[quot;bananaquot;]
 # > #<Set: {quot;3quot;}>

                                                                  What order?
 # query: quot;what isquot;
 p index[quot;whatquot;] & index[quot;isquot;]
                                                                  [1, 2] or [2,1]
 # > #<Set: {quot;1quot;, quot;2quot;}>


 {
   quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>,
   quot;aquot;=>#<Set: {quot;3quot;}>,
   quot;bananaquot;=>#<Set: {quot;3quot;}>,
   quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>,
   quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>}
                                                                    Querying the index
  }



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank            @igrigorik #railsconf
require 'set'

    pages = {
     quot;1quot; => quot;it is what it isquot;,
     quot;2quot; => quot;what is itquot;,
     quot;3quot; => quot;it is a bananaquot;
    }
                                                                    PDF, HTML, RSS?
    index = {}
                                                                  Lowercase / Upcase?
    pages.each do |page, content|                                   Compact Index?
                                                                        Hmmm?
     content.split(/s/).each do |word|                               Stop words?
      if index[word]                                                  Persistence?
        index[word] << page
      else
        index[word] = Set.new(page)
      end
     end
    end


                                                Building an Inverted Index

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank         @igrigorik #railsconf
Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Ferret is a high-performance, full-featured text search engine library written for Ruby



Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => quot;1quot;, :content => quot;it is what it isquot;}
    index << {:title => quot;2quot;, :content => quot;what is itquot;}
    index << {:title => quot;3quot;, :content => quot;it is a bananaquot;}

    index.search_each('content:quot;bananaquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end


    > Score: 1.0, 3




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => quot;1quot;, :content => quot;it is what it isquot;}
    index << {:title => quot;2quot;, :content => quot;what is itquot;}
    index << {:title => quot;3quot;, :content => quot;it is a bananaquot;}

    index.search_each('content:quot;bananaquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end


    > Score: 1.0, 3


                               Hmmm?




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
class Ferret::Analysis::Analyzer                                        class Ferret::Search::BooleanQuery
   class Ferret::Analysis::AsciiLetterAnalyzer                             class Ferret::Search::ConstantScoreQuery
   class Ferret::Analysis::AsciiLetterTokenizer                            class Ferret::Search::Explanation
   class Ferret::Analysis::AsciiLowerCaseFilter                            class Ferret::Search::Filter
   class Ferret::Analysis::AsciiStandardAnalyzer                           class Ferret::Search::FilteredQuery
   class Ferret::Analysis::AsciiStandardTokenizer                          class Ferret::Search::FuzzyQuery
   class Ferret::Analysis::AsciiWhiteSpaceAnalyzer                         class Ferret::Search::Hit
   class Ferret::Analysis::AsciiWhiteSpaceTokenizer                        class Ferret::Search::MatchAllQuery
   class Ferret::Analysis::HyphenFilter                                    class Ferret::Search::MultiSearcher
   class Ferret::Analysis::LetterAnalyzer                                  class Ferret::Search::MultiTermQuery
   class Ferret::Analysis::LetterTokenizer                                 class Ferret::Search::PhraseQuery
   class Ferret::Analysis::LowerCaseFilter                                 class Ferret::Search::PrefixQuery
   class Ferret::Analysis::MappingFilter                                   class Ferret::Search::Query
   class Ferret::Analysis::PerFieldAnalyzer                                class Ferret::Search::QueryFilter
   class Ferret::Analysis::RegExpAnalyzer                                  class Ferret::Search::RangeFilter
   class Ferret::Analysis::RegExpTokenizer                                 class Ferret::Search::RangeQuery
   class Ferret::Analysis::StandardAnalyzer                                class Ferret::Search::Searcher
   class Ferret::Analysis::StandardTokenizer                               class Ferret::Search::Sort
   class Ferret::Analysis::StemFilter                                      class Ferret::Search::SortField
   class Ferret::Analysis::StopFilter                                      class Ferret::Search::TermQuery
   class Ferret::Analysis::Token                                           class Ferret::Search::TopDocs
   class Ferret::Analysis::TokenStream                                     class Ferret::Search::TypedRangeFilter
   class Ferret::Analysis::WhiteSpaceAnalyzer                              class Ferret::Search::TypedRangeQuery
                                                                           class Ferret::Search::WildcardQuery
   class Ferret::Analysis::WhiteSpaceTokenizer



Building Mini-Google in Ruby            http://bit.ly/railsconf-pagerank              @igrigorik #railsconf
ferret.davebalmain.com/trac




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Ranking Results
                                                                      0-60 with PageRank…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;the brown cowquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end

    > Score: 0.827, 3
    > Score: 0.523, 5                                                 Relevance?
    > Score: 0.125, 4

                               3                     5                   4
            the                4                     3                   5
          brown                1                     3                   1
            cow                1                     4                   1
     Score                     6                    10                   7


                                                              Naïve: Term Frequency

Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank          @igrigorik #railsconf
index.search_each('content:quot;the brown cowquot;') do |id, score|
     puts quot;Score: #{score}, #{index[id][:title]} quot;
    end

    > Score: 0.827, 3
    > Score: 0.523, 5
    > Score: 0.125, 4

                               3                     5                4
            the                4                     3                5
                                                                                                  Skew
          brown                1                     3                1
            cow                1                     4                1
     Score                     6                    10                7


                                                              Naïve: Term Frequency

Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
3                         5                4
            the                4                         3                5
                                                                                                       Skew
          brown                1                         3                1
            cow                1                         4                1


                               # of docs
                                                              Score = TF * IDF
                   the             6
                                                              TF = # occurrences / # words
                  brown            3
                                                              IDF = # docs / # docs with W
                   cow             4


     Total # of documents:                     10


                                                                                                      TF-IDF
                                           Term Frequency * Inverse Document Frequency


Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
3                         5                  4
            the                 4                         3                  5
          brown                 1                         3                  1
            cow                 1                         4                  1


                                                                      Doc # 3 score for ‘the’:
                               # of docs
                                                                      4/10 * ln(10/6) = 0.204
                   the              6
                                                                      Doc # 3 score for ‘brown’:
                  brown             3
                                                                      1/10 * ln(10/3) = 0.120
                   cow              4
                                                                      Doc # 3 score for ‘cow’:
                                                                      1/10 * ln(10/4) = 0.092
     Total # of documents:                      10
     # words in document:                       10


                                                                                                         TF-IDF
                               Score = 0.204 + 0.120 + 0.092 = 0.416



Building Mini-Google in Ruby            http://bit.ly/railsconf-pagerank         @igrigorik #railsconf
W1        W2   …           …             …        …       …             …     WN

         Doc 1       15        23   …
         Doc 2       24        12   …
         …            …        …    …
         …
         Doc K

         Size = N * K * size of Ruby object
                                                                                   Ouch.
          Pages = N = 10,000
          Words = K = 2,000
          Ruby Object = 20+ bytes

                                                                       Frequency Matrix
          Footprint = 384 MB



Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
NArray is an Numerical N-dimensional Array class (implemented in C)



                                                      #    create new NArray. initialize with 0.
       NArray.new(typecode, size, ...)
                                                      #    1 byte unsigned integer
       NArray.byte(size,...)
                                                      #    2 byte signed integer
       NArray.sint(size,...)
                                                      #    4 byte signed integer
       NArray.int(size,...)
                                                      #    single precision float
       NArray.sfloat(size,...)
                                                      #    double precision float
       NArray.float(size,...)
                                                      #    single precision complex
       NArray.scomplex(size,...)
                                                      #    double precision complex
       NArray.complex(size,...)
                                                      #    Ruby object
       NArray.object(size,...)




                                                                                              NArray
                                                                    http://narray.rubyforge.org/



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
NArray is an Numerical N-dimensional Array class (implemented in C)




                                                                                            NArray
                                                                  http://narray.rubyforge.org/



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
Links as votes




                                                                                        PageRank
                                                                                     the google juice
               Problem: link gaming




Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
P = 0.85



                               Follow link from page he/she is currently on.



                               Teleport to a random location on the web.



                                   P = 0.15

                                                                               Random Surfer
                                                                                    powerful abstraction




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank        @igrigorik #railsconf
Follow link from page he/she is currently on.
                                                                                                      Page K

                               Teleport to a random location on the web.



        Page N                         Page M
                                                                                                    Surfin’
                                                                          rinse & repeat, ad naseum




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
On Page P, clicks on link to K
                                                                                         P = 0.85


                               On Page K clicks on link to M
                                                                                         P = 0.85


                               On Page M teleports to X

           P = 0.15
                                                                                                    Surfin’
                                                …
                                                                          rinse & repeat, ad naseum




Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
P = 0.05                                                      P = 0.20
                                                               X
                               N

                                                                                     P = 0.15
                                                                      M
                                   K
                  P = 0.6




                                                       Analyzing the Web Graph
                                                                               extracting PageRank




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
What is PageRank?
                                                                                             It’s a scalar!

Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
P = 0.05                                                      P = 0.20
                                                               X
                               N

                                                                                     P = 0.15
                                                                      M
                                   K
                  P = 0.6




                                                                      What is PageRank?
                                                                                       it’s a probability!




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
P = 0.05                                                      P = 0.20
                                                               X
                               N

                                                                                     P = 0.15
                                                                      M
                                   K
                  P = 0.6




                                                                      What is PageRank?
          Higher Pr, Higher Importance?
                                                                                       it’s a probability!




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Teleportation?
                                                                                           sci-fi fans, … ?




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank    @igrigorik #railsconf
1. No in-links!                                                        3. Isolated Web



                                           X
         N
                          K
                                                                                       2. No out-links!
                                        M
                                                                  M



                                                  Reasons for teleportation
                                                                      enumerating edge cases



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
•Breadth First Search
                               •Depth First Search
                               •A* Search
                               •Lexicographic Search
                               •Dijkstra’s Algorithm
                               •Floyd-Warshall
                               •Triangulation and Comparability detection

require 'gratr/import'

dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

dg.directed? # true
dg.vertex?(4) # true
dg.edge?(2,4) # true
dg.vertices # [5, 6, 1, 2, 3, 4]
                                                                       Exploring Graphs
Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]                                  gratr.rubyforge.com
Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]



Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank    @igrigorik #railsconf
P(T) = 0.03
                                                                     P(T) = 0.15 / # of pages
        P(T) = 0.03
                                                                     P(T) = 0.03
                                            X
         N
                          K                                        P(T) = 0.03

                                         M
                P(T) = 0.03
                                                                   M
                               P(T) = 0.03


                                                                             Teleportation
                                                                                                probabilities



Building Mini-Google in Ruby    http://bit.ly/railsconf-pagerank        @igrigorik #railsconf
Assume the web is N pages big
    Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85
    Assume that teleportation probability (E) is uniform
    Assume that you start on any random page (uniform distribution L), then

                                                  0.15
                                                            ������
                                ������ = ������ =            ⋮
                                                  0.15
                                                            ������
    Then after one step, the probability your on page X is:
                                      ������ ∗ ������������ + ������������

                               ������ ∗ (0.85 ∗ ������ + 0.15 ∗ ������)


                     PageRank: Simplified Mathematical Def’n
                                                                    cause that’s how we roll



Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Link Graph                                                 No link from 1 to N



                           1   2                …                 …         N
             1             1   0                 …                …          0

             2             0   1                 …                …          1

             …             …   …                 …                …         …

             …             …   …                 …                …         …

             N             0   1                 …                …          1



                                                                  G = The Link Graph
                     Huge!
                                                                        ginormous and sparse



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Links to…

                               {
                                   quot;1quot;      =>         [25, 26],
                                   quot;2quot;      =>         [1],
                  Page
                                   quot;5quot;      =>         [123,2],
                                   quot;6quot;      =>         [67, 1]
                               }



                                                                          G as a dictionary
                                                                                         more compact…



Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Follow link from page he/she is currently on.
                                                                                                       Page K

                               Teleport to a random location on the web.




                                                                          Computing PageRank
                                                                                              the tedious way



Building Mini-Google in Ruby           http://bit.ly/railsconf-pagerank        @igrigorik #railsconf
Don’t trust me! Verify it yourself!


                                                                                ������1
                                                               −1                ⋮
                               ������ = ������ ������ − ������������                         ������ =
                                                                                ������������
                                    Identity matrix




                                                                         Computing PageRank
                                                                                                    in one swoop



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank          @igrigorik #railsconf
Enough hand-waving, dammit!
                                                                                  show me the code




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Hot, Fast, Awesome

                                                                  Birth of EM-Proxy
                                                                            flash of the obvious




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
http://rb-gsl.rubyforge.org/




                                                                        Hot, Fast, Awesome




                       Click there! … Give yourself a weekend.


Building Mini-Google in Ruby         http://bit.ly/railsconf-pagerank      @igrigorik #railsconf
http://ruby-gsl.sourceforge.net/
                       Click there! … Give yourself a weekend.


Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank(g)
                                                                         Verify NxN
    raise if g.size1 != g.size2

     i = Matrix.I(g.size1)                      # identity matrix
     p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector

     s = 0.85              # probability of following a link
     t = 1-s               # probability of teleportation

    t*((i-s*g).invert)*p
   end



                                                                         PageRank in Ruby
                                                                                              6 lines, or less



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank(g)                                                          Constants…
    raise if g.size1 != g.size2

     i = Matrix.I(g.size1)                      # identity matrix
     p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector

     s = 0.85              # probability of following a link
     t = 1-s               # probability of teleportation

    t*((i-s*g).invert)*p
   end



                                                                         PageRank in Ruby
                                                                                              6 lines, or less



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
require quot;gslquot;
   include GSL

   # INPUT: link structure matrix (NxN)
   # OUTPUT: pagerank scores
   def pagerank(g)
    raise if g.size1 != g.size2

     i = Matrix.I(g.size1)                      # identity matrix
     p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector

     s = 0.85              # probability of following a link
     t = 1-s               # probability of teleportation

    t*((i-s*g).invert)*p
   end



                                                                         PageRank in Ruby
                  PageRank!
                                                                                              6 lines, or less



Building Mini-Google in Ruby          http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
X
                 P = 0.33                                              P = 0.33
                               N


                                                                  P = 0.33
                                                    K


        pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
        > [0.33, 0.33, 0.33]


                                                                  Ex: Circular Web
                                                                               testing intuition…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank    @igrigorik #railsconf
X
                 P = 0.05                                               P = 0.07
                               N


                                                                   P = 0.87
                                                    K


        pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
        > [0.05, 0.07, 0.87]


                                                              Ex: All roads lead to K
                                                                                testing intuition…




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank     @igrigorik #railsconf
PageRank + Ferret
                                                                             awesome search, ftw!




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
2
                               P = 0.05                                                          P = 0.07
                                                           1


require 'ferret'                                                                              P = 0.87
                                                                      3
include Ferret

index = Index::Index.new()

index << {:title => quot;1quot;, :content => quot;it is what it isquot;, :pr => 0.05 }
index << {:title => quot;2quot;, :content => quot;what is itquot;, :pr => 0.07 }
index << {:title => quot;3quot;, :content => quot;it is a bananaquot;, :pr => 0.87 }



                                                                               Store PageRank




Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot;
end

                               TF-IDF Search
puts quot;*quot; * 50

sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot;
end


#   Score: 0.267119228839874, 3 (PR: 0.87)
#   Score: 0.17807948589325, 1 (PR: 0.05)
#   Score: 0.17807948589325, 2 (PR: 0.07)
#   ***********************************
#   Score: 0.267119228839874, 3, (PR: 0.87)
#   Score: 0.17807948589325, 2, (PR: 0.07)
#   Score: 0.17807948589325, 1, (PR: 0.05)



Building Mini-Google in Ruby       http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot;
end
                                                PageRank FTW!
puts quot;*quot; * 50

sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot;
end


#   Score: 0.267119228839874, 3 (PR: 0.87)
#   Score: 0.17807948589325, 1 (PR: 0.05)
#   Score: 0.17807948589325, 2 (PR: 0.07)
#   ***********************************
#   Score: 0.267119228839874, 3, (PR: 0.87)
#   Score: 0.17807948589325, 2, (PR: 0.07)
#   Score: 0.17807948589325, 1, (PR: 0.05)



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
index.search_each('content:quot;worldquot;') do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot;
end

puts quot;*quot; * 50

sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score|
 puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot;
end


#    Score: 0.267119228839874, 3 (PR: 0.87)
#    Score: 0.17807948589325, 1 (PR: 0.05)                                         Others
#    Score: 0.17807948589325, 2 (PR: 0.07)
#    ***********************************
#    Score: 0.267119228839874, 3, (PR: 0.87)
#    Score: 0.17807948589325, 2, (PR: 0.07)                                        Google
#    Score: 0.17807948589325, 1, (PR: 0.05)



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Search*: Graphs are ubiquitous!
                                                 PageRank is a general purpose hammer




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Username               GitCred
                                                                    ==============================
                                                                    37signals              10.00
                                                                    imbriaco               9.76
                                                                    why                    8.74
                                                                    rails                  8.56
                                                                    defunkt                8.17
                                                                    technoweenie           7.83
                                                                    jeresig                7.60
                                                                    mojombo                7.51
                                                                    yui                    7.34
                                                                    drnic                  7.34
                                                                    pjhyett                6.91
                                                                    wycats                 6.85
                                                                    dhh                    6.84

           http://bit.ly/3YQPU

                                                       PageRank + Social Graph
                                                                                                    GitHub




Building Mini-Google in Ruby     http://bit.ly/railsconf-pagerank           @igrigorik #railsconf
Hmm…




                                                                  Analyze the social graph:
                                                                  - Filter messages by ‘TwitterRank’
                                                                  - Suggest users by ‘TwitterRank’
                                                                  -…
                                                     PageRank + Social Graph
                                                                                                   Twitter




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank            @igrigorik #railsconf
PageRank + Product Graph
                                                                                          E-commerce

                                  Link items purchased in same cart… Run PR on it.



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
PageRank = Powerful Hammer
                                                                                          use it!




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
Personalization
                                                                      how would you do it?




Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
0.15
                               ������         Teleportation distribution doesn’t
       ������ =            ⋮                          have to be uniform!
                    0.15
                               ������


                               yahoo.com is
                               my homepage!


                                                  PageRank + Personalization
                                                                 customize the teleportation vector




Building Mini-Google in Ruby        http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
Make pages with links!




                                                                      Gaming PageRank
       http://bit.ly/pagerank-spam                       for fun and profit (I don’t endorse it)




Building Mini-Google in Ruby    http://bit.ly/railsconf-pagerank          @igrigorik #railsconf
Slides: http://bit.ly/railsconf-pagerank

    Ferret: http://bit.ly/ferret
    RB-GSL: http://bit.ly/rb-gsl

    PageRank on Wikipedia: http://bit.ly/wp-pagerank
    Gaming PageRank: http://bit.ly/pagerank-spam

    Michael Nielsen’s lectures on PageRank:
    http://michaelnielsen.org/blog


                                                                                     Questions?

                               The slides…                           Twitter                           My blog


Building Mini-Google in Ruby      http://bit.ly/railsconf-pagerank             @igrigorik #railsconf

Más contenido relacionado

Destacado

Rails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity PresentationRails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity Presentationrailsconf
 
Quality Code With Cucumber Presentation
Quality Code With Cucumber PresentationQuality Code With Cucumber Presentation
Quality Code With Cucumber Presentationrailsconf
 
J Ruby On Rails Presentation
J Ruby On Rails PresentationJ Ruby On Rails Presentation
J Ruby On Rails Presentationrailsconf
 
Rails Is From Mars Ruby Is From Venus Presentation 1
Rails Is From Mars  Ruby Is From Venus Presentation 1Rails Is From Mars  Ruby Is From Venus Presentation 1
Rails Is From Mars Ruby Is From Venus Presentation 1railsconf
 
Integrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby AmfIntegrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby Amfrailsconf
 
TDD (Test Driven Design)
TDD (Test Driven Design)TDD (Test Driven Design)
TDD (Test Driven Design)nedirtv
 
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...railsconf
 
Don T Mock Yourself Out Presentation
Don T Mock Yourself Out PresentationDon T Mock Yourself Out Presentation
Don T Mock Yourself Out Presentationrailsconf
 

Destacado (8)

Rails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity PresentationRails 3 And The Real Secret To High Productivity Presentation
Rails 3 And The Real Secret To High Productivity Presentation
 
Quality Code With Cucumber Presentation
Quality Code With Cucumber PresentationQuality Code With Cucumber Presentation
Quality Code With Cucumber Presentation
 
J Ruby On Rails Presentation
J Ruby On Rails PresentationJ Ruby On Rails Presentation
J Ruby On Rails Presentation
 
Rails Is From Mars Ruby Is From Venus Presentation 1
Rails Is From Mars  Ruby Is From Venus Presentation 1Rails Is From Mars  Ruby Is From Venus Presentation 1
Rails Is From Mars Ruby Is From Venus Presentation 1
 
Integrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby AmfIntegrating Flex And Rails With Ruby Amf
Integrating Flex And Rails With Ruby Amf
 
TDD (Test Driven Design)
TDD (Test Driven Design)TDD (Test Driven Design)
TDD (Test Driven Design)
 
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...Below And Beneath Tdd  Test Last Development And Other Real World Test Patter...
Below And Beneath Tdd Test Last Development And Other Real World Test Patter...
 
Don T Mock Yourself Out Presentation
Don T Mock Yourself Out PresentationDon T Mock Yourself Out Presentation
Don T Mock Yourself Out Presentation
 

Similar a Building A Mini Google High Performance Computing In Ruby

Building A Mini Google High Performance Computing In Ruby Presentation 1
Building A Mini Google  High Performance Computing In Ruby Presentation 1Building A Mini Google  High Performance Computing In Ruby Presentation 1
Building A Mini Google High Performance Computing In Ruby Presentation 1elliando dias
 
It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.Tomohiro Nishimura
 
Monitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagiosMonitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagiosLindsay Holmwood
 
Google G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The WebGoogle G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The WebQConLondon2008
 
Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1QConLondon2008
 
Intro To Django
Intro To DjangoIntro To Django
Intro To DjangoUdi Bauman
 
Beautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesBeautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesLincoln III
 
Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)Tobias Pfeiffer
 
Exploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your pluginsExploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your pluginsMarian Marinov
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Abhishek Mishra
 
Rapid RIA development with Netzke
Rapid RIA development with NetzkeRapid RIA development with Netzke
Rapid RIA development with Netzkenetzke
 
When To Use Ruby On Rails
When To Use Ruby On RailsWhen To Use Ruby On Rails
When To Use Ruby On Railsdosire
 
Rails bestpractices.com
Rails bestpractices.comRails bestpractices.com
Rails bestpractices.comRichard Huang
 
Lightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClientLightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClientAdam Wiggins
 
More Secrets of JavaScript Libraries
More Secrets of JavaScript LibrariesMore Secrets of JavaScript Libraries
More Secrets of JavaScript Librariesjeresig
 

Similar a Building A Mini Google High Performance Computing In Ruby (20)

Building A Mini Google High Performance Computing In Ruby Presentation 1
Building A Mini Google  High Performance Computing In Ruby Presentation 1Building A Mini Google  High Performance Computing In Ruby Presentation 1
Building A Mini Google High Performance Computing In Ruby Presentation 1
 
It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.It's Mechanize for it. Ruby as a Finder.
It's Mechanize for it. Ruby as a Finder.
 
Monitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagiosMonitoring web application behaviour with cucumber-nagios
Monitoring web application behaviour with cucumber-nagios
 
Happy Coding with Ruby on Rails
Happy Coding with Ruby on RailsHappy Coding with Ruby on Rails
Happy Coding with Ruby on Rails
 
Web application intro
Web application introWeb application intro
Web application intro
 
Google G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The WebGoogle G Data Reading And Writing Data On The Web
Google G Data Reading And Writing Data On The Web
 
Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1Google G Data Reading And Writing Data On The Web 1
Google G Data Reading And Writing Data On The Web 1
 
SearchMonkey
SearchMonkeySearchMonkey
SearchMonkey
 
Intro To Django
Intro To DjangoIntro To Django
Intro To Django
 
Beautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFacesBeautiful Java EE - PrettyFaces
Beautiful Java EE - PrettyFaces
 
Rails + Webpack
Rails + WebpackRails + Webpack
Rails + Webpack
 
Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)Web application intro + a bit of ruby (revised)
Web application intro + a bit of ruby (revised)
 
Exploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your pluginsExploiting the newer perl to improve your plugins
Exploiting the newer perl to improve your plugins
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
 
Rapid RIA development with Netzke
Rapid RIA development with NetzkeRapid RIA development with Netzke
Rapid RIA development with Netzke
 
When To Use Ruby On Rails
When To Use Ruby On RailsWhen To Use Ruby On Rails
When To Use Ruby On Rails
 
Cucumber
CucumberCucumber
Cucumber
 
Rails bestpractices.com
Rails bestpractices.comRails bestpractices.com
Rails bestpractices.com
 
Lightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClientLightweight Webservices with Sinatra and RestClient
Lightweight Webservices with Sinatra and RestClient
 
More Secrets of JavaScript Libraries
More Secrets of JavaScript LibrariesMore Secrets of JavaScript Libraries
More Secrets of JavaScript Libraries
 

Último

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Último (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Building A Mini Google High Performance Computing In Ruby

  • 1. Building Mini-Google in Ruby Ilya Grigorik @igrigorik Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 2. postrank.com/topic/ruby The slides… Twitter My blog Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 3. Ruby + Math PageRank Optimization Examples Indexing Misc Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 4. PageRank PageRank + Ruby Tools + Examples Indexing Optimization Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 5. Consume with care… everything that follows is based on released / public domain info Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 6. Search-engine graveyard Google did pretty well… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 7. Query: Ruby Results 1. Crawl 2. Index 3. Rank Search pipeline 50,000-foot view Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 8. Query: Ruby Results 1. Crawl 2. Index 3. Rank Bah Interesting Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 9. CPU Speed 333Mhz RAM 32-64MB Index 27,000,000 documents Index refresh once a month~ish PageRank computation several days Laptop CPU 2.1Ghz VM RAM 1GB 1-Million page web ~10 minutes circa 1997-1998 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 10. Creating & Maintaining an Inverted Index DIY and the gotchas within Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 11. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 12. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 13. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| Word => [Document] content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 14. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 15. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 16. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> 1 3 2 # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 17. # query: quot;what is bananaquot; p index[quot;whatquot;] & index[quot;isquot;] & index[quot;bananaquot;] # > #<Set: {}> # query: quot;a bananaquot; p index[quot;aquot;] & index[quot;bananaquot;] # > #<Set: {quot;3quot;}> What order? # query: quot;what isquot; p index[quot;whatquot;] & index[quot;isquot;] [1, 2] or [2,1] # > #<Set: {quot;1quot;, quot;2quot;}> { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, quot;aquot;=>#<Set: {quot;3quot;}>, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} Querying the index } Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 18. require 'set' pages = { quot;1quot; => quot;it is what it isquot;, quot;2quot; => quot;what is itquot;, quot;3quot; => quot;it is a bananaquot; } PDF, HTML, RSS? index = {} Lowercase / Upcase? pages.each do |page, content| Compact Index? Hmmm? content.split(/s/).each do |word| Stop words? if index[word] Persistence? index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 19. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 20. Ferret is a high-performance, full-featured text search engine library written for Ruby Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 21. require 'ferret' include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;} index << {:title => quot;2quot;, :content => quot;what is itquot;} index << {:title => quot;3quot;, :content => quot;it is a bananaquot;} index.search_each('content:quot;bananaquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 1.0, 3 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 22. require 'ferret' include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;} index << {:title => quot;2quot;, :content => quot;what is itquot;} index << {:title => quot;3quot;, :content => quot;it is a bananaquot;} index.search_each('content:quot;bananaquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 1.0, 3 Hmmm? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 23. class Ferret::Analysis::Analyzer class Ferret::Search::BooleanQuery class Ferret::Analysis::AsciiLetterAnalyzer class Ferret::Search::ConstantScoreQuery class Ferret::Analysis::AsciiLetterTokenizer class Ferret::Search::Explanation class Ferret::Analysis::AsciiLowerCaseFilter class Ferret::Search::Filter class Ferret::Analysis::AsciiStandardAnalyzer class Ferret::Search::FilteredQuery class Ferret::Analysis::AsciiStandardTokenizer class Ferret::Search::FuzzyQuery class Ferret::Analysis::AsciiWhiteSpaceAnalyzer class Ferret::Search::Hit class Ferret::Analysis::AsciiWhiteSpaceTokenizer class Ferret::Search::MatchAllQuery class Ferret::Analysis::HyphenFilter class Ferret::Search::MultiSearcher class Ferret::Analysis::LetterAnalyzer class Ferret::Search::MultiTermQuery class Ferret::Analysis::LetterTokenizer class Ferret::Search::PhraseQuery class Ferret::Analysis::LowerCaseFilter class Ferret::Search::PrefixQuery class Ferret::Analysis::MappingFilter class Ferret::Search::Query class Ferret::Analysis::PerFieldAnalyzer class Ferret::Search::QueryFilter class Ferret::Analysis::RegExpAnalyzer class Ferret::Search::RangeFilter class Ferret::Analysis::RegExpTokenizer class Ferret::Search::RangeQuery class Ferret::Analysis::StandardAnalyzer class Ferret::Search::Searcher class Ferret::Analysis::StandardTokenizer class Ferret::Search::Sort class Ferret::Analysis::StemFilter class Ferret::Search::SortField class Ferret::Analysis::StopFilter class Ferret::Search::TermQuery class Ferret::Analysis::Token class Ferret::Search::TopDocs class Ferret::Analysis::TokenStream class Ferret::Search::TypedRangeFilter class Ferret::Analysis::WhiteSpaceAnalyzer class Ferret::Search::TypedRangeQuery class Ferret::Search::WildcardQuery class Ferret::Analysis::WhiteSpaceTokenizer Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 24. ferret.davebalmain.com/trac Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 25. Ranking Results 0-60 with PageRank… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 26. index.search_each('content:quot;the brown cowquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 0.827, 3 > Score: 0.523, 5 Relevance? > Score: 0.125, 4 3 5 4 the 4 3 5 brown 1 3 1 cow 1 4 1 Score 6 10 7 Naïve: Term Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 27. index.search_each('content:quot;the brown cowquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} quot; end > Score: 0.827, 3 > Score: 0.523, 5 > Score: 0.125, 4 3 5 4 the 4 3 5 Skew brown 1 3 1 cow 1 4 1 Score 6 10 7 Naïve: Term Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 28. 3 5 4 the 4 3 5 Skew brown 1 3 1 cow 1 4 1 # of docs Score = TF * IDF the 6 TF = # occurrences / # words brown 3 IDF = # docs / # docs with W cow 4 Total # of documents: 10 TF-IDF Term Frequency * Inverse Document Frequency Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 29. 3 5 4 the 4 3 5 brown 1 3 1 cow 1 4 1 Doc # 3 score for ‘the’: # of docs 4/10 * ln(10/6) = 0.204 the 6 Doc # 3 score for ‘brown’: brown 3 1/10 * ln(10/3) = 0.120 cow 4 Doc # 3 score for ‘cow’: 1/10 * ln(10/4) = 0.092 Total # of documents: 10 # words in document: 10 TF-IDF Score = 0.204 + 0.120 + 0.092 = 0.416 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 30. W1 W2 … … … … … … WN Doc 1 15 23 … Doc 2 24 12 … … … … … … Doc K Size = N * K * size of Ruby object Ouch. Pages = N = 10,000 Words = K = 2,000 Ruby Object = 20+ bytes Frequency Matrix Footprint = 384 MB Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 31. NArray is an Numerical N-dimensional Array class (implemented in C) # create new NArray. initialize with 0. NArray.new(typecode, size, ...) # 1 byte unsigned integer NArray.byte(size,...) # 2 byte signed integer NArray.sint(size,...) # 4 byte signed integer NArray.int(size,...) # single precision float NArray.sfloat(size,...) # double precision float NArray.float(size,...) # single precision complex NArray.scomplex(size,...) # double precision complex NArray.complex(size,...) # Ruby object NArray.object(size,...) NArray http://narray.rubyforge.org/ Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 32. NArray is an Numerical N-dimensional Array class (implemented in C) NArray http://narray.rubyforge.org/ Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 33. Links as votes PageRank the google juice Problem: link gaming Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 34. P = 0.85 Follow link from page he/she is currently on. Teleport to a random location on the web. P = 0.15 Random Surfer powerful abstraction Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 35. Follow link from page he/she is currently on. Page K Teleport to a random location on the web. Page N Page M Surfin’ rinse & repeat, ad naseum Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 36. On Page P, clicks on link to K P = 0.85 On Page K clicks on link to M P = 0.85 On Page M teleports to X P = 0.15 Surfin’ … rinse & repeat, ad naseum Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 37. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 Analyzing the Web Graph extracting PageRank Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 38. What is PageRank? It’s a scalar! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 39. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 What is PageRank? it’s a probability! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 40. P = 0.05 P = 0.20 X N P = 0.15 M K P = 0.6 What is PageRank? Higher Pr, Higher Importance? it’s a probability! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 41. Teleportation? sci-fi fans, … ? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 42. 1. No in-links! 3. Isolated Web X N K 2. No out-links! M M Reasons for teleportation enumerating edge cases Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 43. •Breadth First Search •Depth First Search •A* Search •Lexicographic Search •Dijkstra’s Algorithm •Floyd-Warshall •Triangulation and Comparability detection require 'gratr/import' dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6] dg.directed? # true dg.vertex?(4) # true dg.edge?(2,4) # true dg.vertices # [5, 6, 1, 2, 3, 4] Exploring Graphs Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5] gratr.rubyforge.com Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4] Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 44. P(T) = 0.03 P(T) = 0.15 / # of pages P(T) = 0.03 P(T) = 0.03 X N K P(T) = 0.03 M P(T) = 0.03 M P(T) = 0.03 Teleportation probabilities Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 45. Assume the web is N pages big Assume that probability of teleportation (t) is 0.15, and following link (s) is 0.85 Assume that teleportation probability (E) is uniform Assume that you start on any random page (uniform distribution L), then 0.15 ������ ������ = ������ = ⋮ 0.15 ������ Then after one step, the probability your on page X is: ������ ∗ ������������ + ������������ ������ ∗ (0.85 ∗ ������ + 0.15 ∗ ������) PageRank: Simplified Mathematical Def’n cause that’s how we roll Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 46. Link Graph No link from 1 to N 1 2 … … N 1 1 0 … … 0 2 0 1 … … 1 … … … … … … … … … … … … N 0 1 … … 1 G = The Link Graph Huge! ginormous and sparse Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 47. Links to… { quot;1quot; => [25, 26], quot;2quot; => [1], Page quot;5quot; => [123,2], quot;6quot; => [67, 1] } G as a dictionary more compact… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 48. Follow link from page he/she is currently on. Page K Teleport to a random location on the web. Computing PageRank the tedious way Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 49. Don’t trust me! Verify it yourself! ������1 −1 ⋮ ������ = ������ ������ − ������������ ������ = ������������ Identity matrix Computing PageRank in one swoop Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 50. Enough hand-waving, dammit! show me the code Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 51. Hot, Fast, Awesome Birth of EM-Proxy flash of the obvious Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 52. http://rb-gsl.rubyforge.org/ Hot, Fast, Awesome Click there! … Give yourself a weekend. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 53. http://ruby-gsl.sourceforge.net/ Click there! … Give yourself a weekend. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 54. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) Verify NxN raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 55. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) Constants… raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 56. require quot;gslquot; include GSL # INPUT: link structure matrix (NxN) # OUTPUT: pagerank scores def pagerank(g) raise if g.size1 != g.size2 i = Matrix.I(g.size1) # identity matrix p = (1.0/g.size1) * Matrix.ones(g.size1,1) # teleportation vector s = 0.85 # probability of following a link t = 1-s # probability of teleportation t*((i-s*g).invert)*p end PageRank in Ruby PageRank! 6 lines, or less Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 57. X P = 0.33 P = 0.33 N P = 0.33 K pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]]) > [0.33, 0.33, 0.33] Ex: Circular Web testing intuition… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 58. X P = 0.05 P = 0.07 N P = 0.87 K pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]]) > [0.05, 0.07, 0.87] Ex: All roads lead to K testing intuition… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 59. PageRank + Ferret awesome search, ftw! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 60. 2 P = 0.05 P = 0.07 1 require 'ferret' P = 0.87 3 include Ferret index = Index::Index.new() index << {:title => quot;1quot;, :content => quot;it is what it isquot;, :pr => 0.05 } index << {:title => quot;2quot;, :content => quot;what is itquot;, :pr => 0.07 } index << {:title => quot;3quot;, :content => quot;it is a bananaquot;, :pr => 0.87 } Store PageRank Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 61. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end TF-IDF Search puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true) index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 62. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end PageRank FTW! puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true) index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 63. index.search_each('content:quot;worldquot;') do |id, score| puts quot;Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})quot; end puts quot;*quot; * 50 sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true) index.search_each('content:quot;worldquot;', :sort => sf_pr) do |id, score| puts quot;Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})quot; end # Score: 0.267119228839874, 3 (PR: 0.87) # Score: 0.17807948589325, 1 (PR: 0.05) Others # Score: 0.17807948589325, 2 (PR: 0.07) # *********************************** # Score: 0.267119228839874, 3, (PR: 0.87) # Score: 0.17807948589325, 2, (PR: 0.07) Google # Score: 0.17807948589325, 1, (PR: 0.05) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 64. Search*: Graphs are ubiquitous! PageRank is a general purpose hammer Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 65. Username GitCred ============================== 37signals 10.00 imbriaco 9.76 why 8.74 rails 8.56 defunkt 8.17 technoweenie 7.83 jeresig 7.60 mojombo 7.51 yui 7.34 drnic 7.34 pjhyett 6.91 wycats 6.85 dhh 6.84 http://bit.ly/3YQPU PageRank + Social Graph GitHub Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 66. Hmm… Analyze the social graph: - Filter messages by ‘TwitterRank’ - Suggest users by ‘TwitterRank’ -… PageRank + Social Graph Twitter Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 67. PageRank + Product Graph E-commerce Link items purchased in same cart… Run PR on it. Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 68. PageRank = Powerful Hammer use it! Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 69. Personalization how would you do it? Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 70. 0.15 ������ Teleportation distribution doesn’t ������ = ⋮ have to be uniform! 0.15 ������ yahoo.com is my homepage! PageRank + Personalization customize the teleportation vector Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 71. Make pages with links! Gaming PageRank http://bit.ly/pagerank-spam for fun and profit (I don’t endorse it) Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 72. Slides: http://bit.ly/railsconf-pagerank Ferret: http://bit.ly/ferret RB-GSL: http://bit.ly/rb-gsl PageRank on Wikipedia: http://bit.ly/wp-pagerank Gaming PageRank: http://bit.ly/pagerank-spam Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog Questions? The slides… Twitter My blog Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf