In Search Of...
   integrating site search

                      Ian Barber
                     @ianbarber
               http://phpir.com
             ian@ibuildings.com
http://joind.in/talk/view/1462
what do you want?
How Search Works
Integrating Search
Improving Results
Using Search
Search Performance
Questions
[Diagram: queries pass through a parser to the index, which returns results; documents pass through an analyser into the index.]
Tokenisation

“With AT&T’s help, the F.B.I Miami-Dade office had recovered
$1.1 million from O’Healy’s Ponzi scheme, 10-15% more than
expected.”
PHP Tokenisation

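function tokenise($string) {
    // Lower-case so that matching is case-insensitive.
    $string = strtolower($string);
    // \w+ matches runs of letters, digits and underscores; with
    // PREG_OFFSET_CAPTURE each match is array(token, byte offset).
    preg_match_all('/\w+/', $string,
            $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}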
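To see what a word-character tokeniser does with awkward input like the quote on the Tokenisation slide, here is a small sketch. It assumes the pattern is `\w+`, and the sample string is a shortened, ASCII-only version of the quote; `$tokens` is just an illustrative name.

```php
<?php
// Tokeniser in the style of the slide: lower-case the text, then
// capture every run of word characters along with its byte offset.
function tokenise($string) {
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

// Shortened sample text (an assumption, not the full slide quote).
$pairs = tokenise("AT&T's F.B.I recovered \$1.1 million");

// Keep only the token text, dropping the offsets.
$tokens = array_map(function ($pair) {
    return $pair[0];
}, $pairs);

echo implode(' | ', $tokens), "\n";
```

Names such as AT&T and F.B.I are split into single letters and $1.1 becomes two separate "1" tokens, which is presumably why the slide picks such a punctuation-heavy sentence.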




Document Term Pairs
Document ID    Term
1              the
1              best
1              of
1              the
...            ...
204            and
204            what
204            would
Inverted Index
Term     Documents
best     1 (4, 16), 4 (422), 129 (344) ...
what     24 (50, 98), 75 (33, 208) ...
would    99 (32, 599), 201 (344) ...
...      ...
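As a rough sketch of how the document/term pairs turn into an inverted index like the one above: the code below assumes the numbers in parentheses are token offsets within each document, and `buildIndex()` and the two sample documents are made up for illustration.

```php
<?php
// Tokeniser in the style of the "PHP Tokenisation" slide.
function tokenise($string) {
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

// Hypothetical helper: build term => (document ID => list of offsets).
function buildIndex(array $documents) {
    $index = array();
    foreach ($documents as $docID => $text) {
        foreach (tokenise($text) as $match) {
            list($term, $offset) = $match;
            // Missing keys are created on first use.
            $index[$term][$docID][] = $offset;
        }
    }
    ksort($index); // keep terms in sorted order
    return $index;
}

// Made-up sample documents keyed by document ID.
$index = buildIndex(array(
    1 => 'The best of the best',
    2 => 'What would be best',
));

print_r($index['best']);
```

Each term maps to the documents that contain it, so a lookup for a query term is a single array access rather than a scan of every document.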


Boolean Query Merge
Query: Best Western Hotel
 best     1    4    129   298   305   338
western   4   95    194   204   298   305


working   4   298   305
 hotel    2   40    200   298   355   402

Result: Document 298
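The merge above can be sketched as a running intersection of posting lists, using the document IDs from the slide. `booleanAnd()` is a hypothetical helper name, and `array_intersect()` is used for brevity rather than a hand-written merge of sorted lists.

```php
<?php
// Hypothetical helper: AND-merge lists of document IDs, in order.
function booleanAnd(array $postingLists) {
    if (count($postingLists) === 0) {
        return array();
    }
    // The first list is the initial working set.
    $working = array_shift($postingLists);
    foreach ($postingLists as $list) {
        // Keep only the IDs that also appear in the next list.
        $working = array_values(array_intersect($working, $list));
    }
    return $working;
}

// Posting lists copied from the slide.
$best    = array(1, 4, 129, 298, 305, 338);
$western = array(4, 95, 194, 204, 298, 305);
$hotel   = array(2, 40, 200, 298, 355, 402);

$working = booleanAnd(array($best, $western));
$result  = booleanAnd(array($best, $western, $hotel));

print_r($working);
print_r($result);
```

The intermediate `$working` set corresponds to the "working" row on the slide, and the final result contains only document 298.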
TF-IDF

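// $term is assumed to be one term's posting list: docID => array of positions
function getWeight($docID, $term, $total) {
  $tf = count($term[$docID]);            // occurrences in this document
  $idf = log($total / count($term), 2);  // rarer terms score higher
  return $tf * $idf;
}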
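As a rough worked example, assume each posting list maps a document ID to the positions where the term occurs. The posting list and collection size below are made up for illustration; they are not taken from the talk.

```php
<?php
// Same weighting as the slide; $term is assumed to be a posting list
// of the form docID => array of positions.
function getWeight($docID, $term, $total) {
    $tf  = count($term[$docID]);           // term frequency in this document
    $idf = log($total / count($term), 2);  // log2(total docs / docs containing the term)
    return $tf * $idf;
}

// Hypothetical posting list for "bacon": two hits in doc 1, one hit in doc 2.
$bacon     = array(1 => array(3, 9), 2 => array(5));
$totalDocs = 5; // assumed collection size

$w1 = getWeight(1, $bacon, $totalDocs); // 2 * log2(5/2)
$w2 = getWeight(2, $bacon, $totalDocs); // 1 * log2(5/2)
printf("doc 1: %.3f, doc 2: %.3f\n", $w1, $w2);
```

Doc 1 scores twice as high as doc 2 because the term frequency doubles while the IDF part stays the same for both.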




Document Vector
        socket   what   heavy   steel   ...

Doc 1    0.02    0.3    0.001    0      ...

Doc 2     0       0       0      0      ...

Doc 3   0.001    0.2      0      0      ...

Doc 4     0       0     0.002   0.003   ...


Ranked Query Merge
  best     23    42   179   246   333   703

 weight   0.008 0.002 0.023 0.039 0.014 0.001

western    42    88   120   179   246   798

 weight   0.003 0.004 0.023 0.001 0.034 0.004

1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023
PHP Similarity
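function score($queryString, $index) {
  $matches = array();
  // tokenise() returns array(token, offset) pairs
  foreach(tokenise($queryString) as $token) {
    $qterm = $token[0];
    if(!isset($index[$qterm])) { continue; }
    foreach($index[$qterm] as $id => $posting) {
      if(!isset($matches[$id])) { $matches[$id] = 0; }
      $matches[$id] += $posting['score'];
    }
  }
  arsort($matches); // sorts in place and returns a bool
  return $matches;
}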
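As a quick check of the accumulate-and-sort idea, the sketch below feeds the two posting lists from the Ranked Query Merge slide through a simplified scorer. The function name `rankedMerge` and the flat docID => weight layout are assumptions made for this example, not part of the talk's code.

```php
<?php
// Posting lists copied from the Ranked Query Merge slide: docID => weight.
$index = array(
    'best'    => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                       246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
);

// Hypothetical helper: sum each document's weights over the query terms,
// then sort by total score, highest first.
function rankedMerge(array $terms, array $index) {
    $scores = array();
    foreach ($terms as $term) {
        if (!isset($index[$term])) {
            continue; // term not in the index
        }
        foreach ($index[$term] as $docID => $weight) {
            if (!isset($scores[$docID])) {
                $scores[$docID] = 0.0;
            }
            $scores[$docID] += $weight;
        }
    }
    arsort($scores); // sorts in place and keeps the docID keys
    return $scores;
}

$ranked = rankedMerge(array('best', 'western'), $index);
$top3   = array_slice($ranked, 0, 3, true); // true = preserve docID keys
foreach ($top3 as $docID => $score) {
    printf("%d: %.3f\n", $docID, $score);
}
```

Documents 246, 179 and 120 come out on top, in that order, matching the ranking on the slide.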

Integrating Search
MySQL Full Text Search
CREATE TABLE example (
    id INT(11) NOT NULL auto_increment,
    title VARCHAR(255),
    content TEXT,
    PRIMARY KEY(id),
    FULLTEXT(title,content)
) Engine=MyISAM;

INSERT INTO example (title, content) VALUES
('Mikko & Bacon','Mikko loves bacon'),
('Marcello & Bacon','Marcello hates bacon'),
('Jo & Sausages','Johanna loves sausages'),
('Hollywood & Garlic','Lorenzo hates garlic'),
('James & Cheddar','James is keen on cheeses');
MySQL FTI Query
SELECT * FROM example WHERE
MATCH(title,content) AGAINST('loves bacon');

+----+------------------+------------------------+
| id | title            | content                |
+----+------------------+------------------------+
|  1 | Mikko & Bacon    | Mikko loves bacon      |
|  2 | Marcello & Bacon | Marcello hates bacon   |
|  3 | Jo & Sausages    | Johanna loves sausages |
+----+------------------+------------------------+
3 rows in set (0.00 sec)




Looking At The Index
/var/lib/mysql/fttest# myisam_ftdump example 1

Total rows: 5
Total words: 17
Unique words: 14
Longest word: 9 chars (hollywood)
Median length: 5
Average global weight: 1.176117
Most common word: 2 times, weight: 0.405465 (bacon)


Sphinx
http://www.sphinxsearch.com




Sphinx Configuration
source posts
{
  type             =   mysql
  sql_host         =   localhost
  sql_user         =   user
  sql_pass         =   password
  sql_db           =   search

  sql_query      = \
    SELECT id, title, content FROM example;
  sql_attr_multi = uint tag from query; \
    SELECT example_id, tag_id FROM tags;
}
index posts
{
  source     = posts
  path       = /var/data/sphinx/example
  morphology = stem_en

  min_word_len   = 3
  min_prefix_len = 3
  min_infix_len  = 0
  enable_star    = 1
}




Stemming
        http://tartarus.org/~martin/PorterStemmer




happening - happen
happened - happen
happens   - happen
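
To make the idea concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm linked above, which applies several ordered rule steps with conditions on the remaining stem. The function name `naiveStem` and its three rules are assumptions chosen only to reproduce the three examples on this slide.

```php
<?php
// Toy stemmer: strips a few common English suffixes.
// NOT the Porter stemmer; for illustration only.
function naiveStem($word) {
    $word = strtolower($word);
    foreach (array('ing', 'ed', 's') as $suffix) {
        $len = strlen($suffix);
        // Only strip when a stem of at least three characters remains.
        if (strlen($word) - $len >= 3 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

foreach (array('happening', 'happened', 'happens') as $w) {
    echo $w, ' - ', naiveStem($w), "\n";
}
```

All three variants collapse to the same index term, so a query for any one of them can match documents containing the others. A real stemmer handles far more cases (and avoids many of the mistakes this one would make on other words).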




Command Line Searching
indexer --config /etc/sphinx.conf --all
search --config /etc/sphinx.conf love bacon

displaying matches:
1. document=1, weight=3, tag=(1,2)
! id=1
! title=Mikko & Bacon
! content=Mikko loves bacon
words:
1. 'love': 2 documents, 2 hits
2. 'bacon': 2 documents, 4 hits

searchd --config /etc/sphinx.conf

Sphinx From PHP

$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ANY);

$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);

$cl->SetFilter('tag', array(1));
$result = $cl->Query('bac*');
$docIDs = array_keys($result["matches"]);


Swish-E
   http://swish-e.org
pecl install swish-beta
Filesystem Index With Swish-E
/usr/local/bin/swish-e -S fs -c fs-swish-e.conf


fs-swish-e.conf
IndexDir            /var/data/documents
IndexFile           fs-swish-e.index
IndexOnly           .doc .docx .pdf
FuzzyIndexingMode   Stemming_en1

FileFilter .pdf /usr/local/bin/swish_filter.pl
FileFilter .doc /usr/local/bin/swish_filter.pl
Crawling Content
/usr/local/bin/swish-e -S prog -c www-swish-e.conf


www-swish-e.conf
IndexDir      /usr/local/lib/swish-e/spider.pl
IndexFile     www-swish-e.index
SwishProgParameters default http://phpir.com/

FuzzyIndexingMode Stemming_en1
DefaultContents   HTML
Swish-E With Multiple Indices
$swish     = new Swish(
   'www-swish-e.index fs-swish-e.index'
);
$search    = $swish->prepare();

$queryStr = 'search string goes here';
$result   = $search->execute($queryStr);
$total    = $result->hits;

while($r = $result->nextResult()) {
  echo $r->swishdocpath; // url
}
Lucene




Build Index
$index = Zend_Search_Lucene::create('idx');
foreach($documents as $title => $content) {
  $doc = new Zend_Search_Lucene_Document();
  $doc->addField(
    Zend_Search_Lucene_Field::Text(
      'title', $title));
  $doc->addField(
    Zend_Search_Lucene_Field::UnStored(
      'content', $content));
  $index->addDocument($doc);
}
Query Zend Search Lucene
$results = $index->find('loves bacon');
foreach($results as $result) {
  echo $result->score, " ";
  echo $result->title, "\n";
}

Output:
0.81656279309067 Mikko and Bacon
0.24800278854758 Marcello & Bacon
Index HTML
$file = file_get_contents($url);

$doc = Zend_Search_Lucene_Document_Html::
                           loadHTML($file);

$doc->addField(
   Zend_Search_Lucene_Field::Text(
     'url', $url));
$index->addDocument($doc);
Solr
http://lucene.apache.org/solr/
Solr Search Index
$options = array( 'hostname' => 'localhost',
                  'port'     => 8983 );

$client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', $id);
$doc->addField('cat', $category);
$doc->addField('title', $title);
$doc->addField('text', $text);
$response = $client->addDocument($doc);
$client->commit();


Solr Search Client
$client = new SolrClient($options);

$query = new SolrQuery('bacon');
$response = $client->query($query);
$r = $response->getResponse();

foreach($r['response']['docs'] as $d) {
  echo $d->title[0] . "\n";
}



Xapian
http://xapian.org




Xapian In PHP
$db = new XapianWritableDatabase(
      'idx', Xapian::DB_CREATE_OR_OPEN);
$i = new XapianTermGenerator();
$i->set_stemmer(new XapianStem("english"));

$doc = new XapianDocument();
$doc->set_data($content);
$doc->add_value(1, $title);

$i->set_document($doc);
$i->index_text($content);
$db->add_document($doc);
Xapian Search In PHP

$database = new XapianDatabase('idx');
$enquire = new XapianEnquire($database);
$qp = new XapianQueryParser();
$qp->set_stemmer(new XapianStem("english"));
$qp->set_database($database);
$qp->set_stemming_strategy(
    XapianQueryParser::STEM_SOME);
$query = $qp->parse_query($queryString);

$enquire->set_query($query);


$matches = $enquire->get_mset(0, 10);

$i = $matches->begin();
while(!$i->equals($matches->end())) {
  $n = $i->get_rank() + 1;
  $data = $i->get_document()->get_data();
  $title = $i->get_document()->get_value(1);
  $score = $i->get_percent();
  $i->next();
}




Improving Results




Anchor Text




Parse Anchor Text
$p = file_get_contents('http://phpir.com');

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($p);
$links = $dom->getElementsByTagName('a');

foreach($links as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
}


Zone Weighting
ZSL Zone Weighting
$doc = new Zend_Search_Lucene_Document();

$tfield = Zend_Search_Lucene_Field::Text
   ('title', $title);
$tfield->boost = 1.3;
$doc->addField($tfield);

$doc->addField(
  Zend_Search_Lucene_Field::UnStored
   ('content', $content));

$index->addDocument($doc);
Document Authority




Document Weights in ZSL
$doc = new Zend_Search_Lucene_Document();
$doc->addField(
  Zend_Search_Lucene_Field::Text
   ('title', $title));
$doc->addField(
  Zend_Search_Lucene_Field::UnStored
   ('content', $content));

$doc->boost = 1 + ($numComments / 100);

$index->addDocument($doc);

Using Search




Summaries & Highlighting




Sphinx Extract & Highlight
$cl = new SphinxClient();
$cl->SetServer( "localhost", 3312 );
$q = 'bacon';
$r = $cl->Query($q);
foreach ($r["matches"] as $doc => $info) {
  $text[$doc] = getTextFromDB($doc);
}

$e = $cl->BuildExcerpts($text, 'posts', $q);
foreach($e as $extract) {
  echo $extract;
}
Xapian Spelling Correction
Indexer
$indexer = new XapianTermGenerator();
$indexer->set_database($database);
$indexer->set_flags(
   XapianTermGenerator::FLAG_SPELLING);
Searcher
$queryString = "strreplace or str_cmp";
$q = new XapianQueryParser();
$q->set_database($database);
$query = $q->parse_query($queryString,
XapianQueryParser::FLAG_SPELLING_CORRECTION);
echo "Did you mean: " .
  $q->get_corrected_query_string() . "\n";
Spelling Correction Output
 php xapsearch.php

Did you mean: str_replace or strcmp

4644 results found for “strreplace or str_cmp”:
1: 2% docid=572
  [phpdocs/html/cc.license.html]
2: 2% docid=7169
  [phpdocs/html/imagick.constants.html]
3: 2% docid=10086
  [phpdocs/html/sqlite3result.fetcharray.html]
4: 2% docid=6132
  [phpdocs/html/function.swf-posround.html]

Results Sorting




Sorting in ZSL

$q = Zend_Search_Lucene_Search_QueryParser::
 parse('search string');

$results = $index->find($q, 'title');
foreach($results as $result) {
  echo '<h3>', $result->title, "</h3>\n";
  $doc = getDocumentFromDB($result->did);
  echo
    $q->htmlFragmentHighlightMatches($doc);
}


Faceted Search




Faceted Search In Solr
$client = new SolrClient($options);
$query = new SolrQuery('bacon');
// Enable faceting before the query is sent
$query->setFacet(true);
$query->addFacetField('cat');
$response = $client->query($query);
$r = $response->getResponse();
$f = $r['facet_counts']['facet_fields'];
foreach($f['cat'] as $facet => $count) {
  echo $facet . " " . $count . "\n";
}


More Like This




More Like This
$rset = new XapianRset();
$rset->add_document(5959); // str_replace
$e = $enquire->get_eset(40, $rset);

$t = $e->begin();
for($t; !$t->equals($e->end()); $t->next()){
  $qs[] = new XapianQuery($t->get_term(),
                  intval($t->get_weight()));
}

$query = new XapianQuery(
                  XapianQuery::OP_OR, $qs);
More Like This Example
 php xapsim.php

1656 results found:
1: 100% docid=5959
    [phpdocs/html/function.str-replace.html]
2: 47% docid=5956
    [phpdocs/html/function.str-ireplace.html]
3: 24% docid=5328
    [phpdocs/html/function.preg-replace.html]
4: 18% docid=5958
    [phpdocs/html/function.str-repeat.html]


Search Performance




Index Updates
[Diagram: new docs, delta index, main index, query]
Search Speed
Zend Search Lucene
$index = Zend_Search_Lucene::open('index');
$index->optimize();
Sphinx
 indexer --merge main delta --rotate
Solr
$client = new SolrClient($options);
$client->optimize();

Xapian
 xapian-compact xapindex xapindex2
Distributing Search
[Diagram: documents, three indexes, application]
Large Scale Search


      http://www.nutch.org




   http://hadoop.apache.org



Image Credits
Title                http://www.flickr.com/photos/generated/2084287794/
What Do You Want     http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here         http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search   http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx               http://www.flickr.com/photos/generated/2084287794/
Lucene               http://www.flickr.com/photos/mypanda/7731447/
Swish-e              http://www.flickr.com/photos/ryan_fung/2239687100/
Solr                 http://www.flickr.com/photos/m-j-s/2724756177/
Xapian               http://www.flickr.com/photos/olibac/3522056495/
Using Search         http://www.flickr.com/photos/eneas/175027945/
Improving Search     http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance   http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search   http://www.flickr.com/photos/zedzap/3663508847/

Questions?




Thank you!

                      Ian Barber
                     @ianbarber
               http://phpir.com
             ian@ibuildings.com
http://joind.in/talk/view/1462

Nulla ut turpisgravida elementum semper sodales quis purus enim ornare quam, vel gravida est vitae enim elementum semper sodales quis felis sollicitudin dictum sed non ipsum. Aliquam vel condimentum vel nibh. vel condimentum neque. enim vel nibh. ipsum. enim neque. fringilla arcu, et semper lacus egestas non. Praesent ut risus nulla, sed blandit leo. nisi, eget fringilla Nam non eros ipsum. Aliquam Curabitur ornare feugiatjusto. Donec ornare. Curabitur ornare feugiat ornare. Donec consectetur elit metus. Nulla eleifend Curabitur volutpateros nisi,lacus, fringilla vitae mauris vehicula consectetur elit metus. Nulla eleifend Nam non laoreet egetvel risus justo. Fusce ut Nam non eros nisi, eget fringilla justo. Quisque eu purus ut lacus egestas dapibus. consectetur arcu vestibulum vel. Donec inmassa et euismod. Vestibulum tincidunt Fusce vel risus vitae mauris amet mi. Nulla ut turpis tincidunt massavitae mauris Vestibulum facilisis sit vehicula id Fusce vel risus et euismod. vehicula felis sollicitudin dictum vel egestas vestibulum,amet in mi. Nulla elementum, vestibulum, justo dapibus fringilla arcu, et semper lacus turpis id sed non ipsum. facilisis sit amet in mi. Nulla ut elementum, facilisis sit justo vel egestas ut turpis id Integer in velit id est dictum bibendum in id mi. purus enim ornareblandit vel gravida est felis sollicitudin dictum sed non ipsum. sed Praesent ut risus nulla, enim vel nibh. quam, leo. purus enim ornare quam, velnon ipsum. felis sollicitudin dictum sed gravida est Praesent ut risus Curabitur volutpat laoreet lacus, ut enim vel ut risus nulla, sed blandit leo. nulla, sed blandit leo. Praesent nibh. consectetur arcu vestibulum vel. Donec Curabitur volutpat laoreet lacus, ut Curabitur volutpat laoreet lacus, ut dapibus Nam nonarcu, nisi, eget fringilla justo. arcu vestibulum vel. Donec fringilla eros consectetur arcu vestibulum vel. Donec et semper lacus consectetur Nam non eros nisi, eget fringilla justo. 
Fusce vel risus vitae mauris vehicula dapibus fringilla arcu, et semper lacus Fusce velfringilla arcu, et semper lacus dapibus risus vitae mauris vehicula facilisis sit amet in mi. Nulla ut turpis id facilisis sit amet in mi. Nulla ut turpis id felis sollicitudin dictum sed non ipsum. felis sollicitudin dictum sed non ipsum. Praesent ut risus nulla, sed blandit leo. Praesent ut risus nulla, sed blandit leo. Curabitur volutpat laoreet lacus, ut Curabitur volutpat laoreet lacus, ut consectetur arcu vestibulum vel. Donec consectetur arcu vestibulum vel. Donec dapibus fringilla arcu, et semper lacus dapibus fringilla arcu, et semper lacus
  • 12. TF-IDF
      function getWeight($docID, $term, $total) {
          // $term is the posting list for one term: docID => positions
          $tf  = count($term[$docID]);
          $idf = log($total / count($term), 2);
          return $tf * $idf;
      }
  • 13. Document Vector
               socket   what    heavy   steel   ...
      Doc 1    0.02     0.3     0.001   0       ...
      Doc 2    0        0       0       0       ...
      Doc 3    0.001    0.2     0       0       ...
      Doc 4    0        0       0.002   0.003   ...
  • 14. Ranked Query Merge
      best      23      42      179     246     333     703
      weight    0.008   0.002   0.023   0.039   0.014   0.001
      western   42      88      120     179     246     798
      weight    0.003   0.004   0.023   0.001   0.034   0.004
      1 - 246: 0.073
      2 - 179: 0.024
      3 - 120: 0.023
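  • 15. PHP Similarity
      function score($queryString, $index) {
          $matches = array();
          foreach (tokenise($queryString) as $token) {
              $qterm = $token[0]; // tokenise() returns (term, offset) pairs
              if (!isset($index[$qterm])) { continue; }
              foreach ($index[$qterm] as $id => $posting) {
                  if (!isset($matches[$id])) { $matches[$id] = 0; }
                  $matches[$id] += $posting['score'];
              }
          }
          arsort($matches); // sorts in place; its return value is a bool
          return $matches;
      }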
  • 17. MySQL Full Text Search
      CREATE TABLE example (
          id INT(11) NOT NULL auto_increment,
          title VARCHAR(255),
          content TEXT,
          PRIMARY KEY(id),
          FULLTEXT(title,content)
      ) Engine=MyISAM;

      INSERT INTO example (title, content) VALUES
          ('Mikko & Bacon','Mikko loves bacon'),
          ('Marcello & Bacon','Marcello hates bacon'),
          ('Jo & Sausages','Johanna loves sausages'),
          ('Hollywood & Garlic','Lorenzo hates garlic'),
          ('James & Cheddar','James is keen on cheeses');
  • 18. MySQL FTI Query
      SELECT * FROM example
      WHERE MATCH(title,content) AGAINST('loves bacon');
      +----+------------------+------------------------+
      | id | title            | content                |
      +----+------------------+------------------------+
      |  1 | Mikko & Bacon    | Mikko loves bacon      |
      |  2 | Marcello & Bacon | Marcello hates bacon   |
      |  3 | Jo & Sausages    | Johanna loves sausages |
      +----+------------------+------------------------+
      3 rows in set (0.00 sec)
  • 19. Looking At The Index
      /var/lib/mysql/fttest# myisam_ftdump example 1
      Total rows: 5
      Total words: 17
      Unique words: 14
      Longest word: 9 chars (hollywood)
      Median length: 5
      Average global weight: 1.176117
      Most common word: 2 times, weight: 0.405465 (bacon)
  • 21. Sphinx Configuration
      source posts {
          type     = mysql
          sql_host = localhost
          sql_user = user
          sql_pass = password
          sql_db   = search
          sql_query = SELECT id, title, content FROM example;
          sql_attr_multi = uint tag from query; SELECT example_id, tag_id FROM tags;
      }
  • 22.
      index posts {
          source         = posts
          path           = /var/data/sphinx/example
          morphology     = stem_en
          min_word_len   = 3
          min_prefix_len = 3
          min_infix_len  = 0
          enable_star    = 1
      }
  • 23. Stemming
      http://tartarus.org/~martin/PorterStemmer
      happening - happen
      happened - happen
      happens - happen
  • 24. Command Line Searching
      indexer --config /etc/sphinx.conf --all
      search --config /etc/sphinx.conf love bacon
      displaying matches:
      1. document=1, weight=3, tag=(1,2)
          id=1
          title=Mikko & Bacon
          content=Mikko loves bacon
      words:
      1. 'love': 2 documents, 2 hits
      2. 'bacon': 2 documents, 4 hits
      searchd --config /etc/sphinx.conf
  • 25. Sphinx From PHP
      $cl = new SphinxClient();
      $cl->SetServer('localhost', 3312);
      $cl->SetMatchMode(SPH_MATCH_ANY);
      $result = $cl->Query('bac*');
      $docIDs = array_keys($result["matches"]);

      $cl->SetFilter('tag', array(1));
      $result = $cl->Query('bac*');
      $docIDs = array_keys($result["matches"]);
  • 26. Swish-E
      http://swish-e.org
      pecl install swish-beta
  • 27. Filesystem Index With Swish-E
      /usr/local/bin/swish-e -S fs -c fs-swish-e.conf

      fs-swish-e.conf:
      IndexDir /var/data/documents
      IndexFile fs-swish-e.index
      IndexOnly .doc .docx .pdf
      FuzzyIndexingMode Stemming_en1
      FileFilter .pdf /usr/local/bin/swish_filter.pl
      FileFilter .doc /usr/local/bin/swish_filter.pl
  • 28. Crawling Content
      /usr/local/bin/swish-e -S prog -c www-swish-e.conf

      www-swish-e.conf:
      IndexDir /usr/local/lib/swish-e/spider.pl
      IndexFile www-swish-e.index
      SwishProgParameters default http://phpir.com/
      FuzzyIndexingMode Stemming_en1
      DefaultContents HTML
  • 29. Swish-E With Multiple Indices
      $swish = new Swish('www-swish-e.index fs-swish-e.index');
      $search = $swish->prepare();
      $queryStr = 'search string goes here';
      $result = $search->execute($queryStr);
      $total = $result->hits;
      while ($r = $result->nextResult()) {
          echo $r->swishdocpath; // url
      }
  • 30. Lucene
  • 31. Build Index
      $index = Zend_Search_Lucene::create('idx');
      foreach ($documents as $title => $content) {
          $doc = new Zend_Search_Lucene_Document();
          $doc->addField(
              Zend_Search_Lucene_Field::Text('title', $title));
          $doc->addField(
              Zend_Search_Lucene_Field::UnStored('content', $content));
          $index->addDocument($doc);
      }
  • 32. Query Zend Search Lucene
      $results = $index->find('loves bacon');
      foreach ($results as $result) {
          echo $result->score, " ";
          echo $result->title, "\n";
      }

      Output:
      0.81656279309067 Mikko and Bacon
      0.24800278854758 Marcello & Bacon
  • 33. Index HTML
      $file = file_get_contents($url);
      $doc = Zend_Search_Lucene_Document_Html::loadHTML($file);
      $doc->addField(
          Zend_Search_Lucene_Field::Text('url', $url));
      $index->addDocument($doc);
  • 35. Solr Search Index
      $options = array(
          'hostname' => 'localhost',
          'port' => 8983
      );
      $client = new SolrClient($options);
      $doc = new SolrInputDocument();
      $doc->addField('id', $id);
      $doc->addField('cat', $category);
      $doc->addField('title', $title);
      $doc->addField('text', $text);
      $response = $client->addDocument($doc);
      $client->commit();
  • 36. Solr Search Client
      $client = new SolrClient($options);
      $query = new SolrQuery('bacon');
      $response = $client->query($query);
      $r = $response->getResponse();
      foreach ($r['response']['docs'] as $d) {
          echo $d->title[0] . "\n";
      }
  • 38. Xapian In PHP
      $db = new XapianWritableDatabase(
          'idx', Xapian::DB_CREATE_OR_OPEN);
      $i = new XapianTermGenerator();
      $i->set_stemmer(new XapianStem("english"));
      $doc = new XapianDocument();
      $doc->set_data($content);
      $doc->add_value(1, $title);
      $i->set_document($doc);
      $i->index_text($content);
      $db->add_document($doc);
  • 39. Xapian Search In PHP
      $database = new XapianDatabase('idx');
      $enquire = new XapianEnquire($database);
      $qp = new XapianQueryParser();
      $qp->set_stemmer(new XapianStem("english"));
      $qp->set_database($database);
      $qp->set_stemming_strategy(
          XapianQueryParser::STEM_SOME);
      $query = $qp->parse_query($queryString);
      $enquire->set_query($query);
  • 40.
      $matches = $enquire->get_mset(0, 10);
      $i = $matches->begin();
      while (!$i->equals($matches->end())) {
          $n = $i->get_rank() + 1;
          $data = $i->get_document()->get_data();
          $title = $i->get_document()->get_value(1);
          $score = $i->get_percent();
          $i->next();
      }
  • 43. Parse Anchor Text
      $p = file_get_contents('http://phpir.com');
      libxml_use_internal_errors(true);
      // loadHTML() is an instance method; calling it statically fails on PHP 8
      $dom = new DOMDocument();
      $dom->loadHTML($p);
      $links = $dom->getElementsByTagName('a');
      foreach ($links as $link) {
          $href = $link->getAttribute('href');
          $text = $link->nodeValue;
      }
  • 44. Zone Weighting [Figure: a web page with zones numbered 1, 2 and 3]
  • 45. ZSL Zone Weighting
      $doc = new Zend_Search_Lucene_Document();
      $tfield = Zend_Search_Lucene_Field::Text('title', $title);
      $tfield->boost = 1.3;
      $doc->addField($tfield);
      $doc->addField(
          Zend_Search_Lucene_Field::UnStored('content', $content));
      $index->addDocument($doc);
  • 47. Document Weights in ZSL
      $doc = new Zend_Search_Lucene_Document();
      $doc->addField(
          Zend_Search_Lucene_Field::Text('title', $title));
      $doc->addField(
          Zend_Search_Lucene_Field::UnStored('content', $content));
      $doc->boost = 1 + ($numComments / 100);
      $index->addDocument($doc);
  • 50. Sphinx Extract & Highlight
      $cl = new SphinxClient();
      $cl->SetServer("localhost", 3312);
      $q = 'bacon';
      $r = $cl->Query($q);
      $text = array();
      foreach ($r["matches"] as $doc => $info) {
          $text[$doc] = getTextFromDB($doc);
      }
      // BuildExcerpts() returns the list of highlighted snippets
      $extracts = $cl->BuildExcerpts($text, 'posts', $q);
      foreach ($extracts as $extract) {
          echo $extract;
      }
  • 51.
  • 52. Xapian Spelling Correction
      Indexer:
          $indexer = new XapianTermGenerator();
          $indexer->set_database($database);
          $indexer->set_flags(
              XapianTermGenerator::FLAG_SPELLING);
      Searcher:
          $queryString = "strreplace or str_cmp";
          $q = new XapianQueryParser();
          $q->set_database($database);
          $query = $q->parse_query($queryString,
              XapianQueryParser::FLAG_SPELLING_CORRECTION);
          echo "Did you mean: " .
              $q->get_corrected_query_string() . "\n";
  • 53. Spelling Correction Output
      php xapsearch.php
      Did you mean: str_replace or strcmp
      4644 results found for “strreplace or str_cmp”:
      1: 2% docid=572 [phpdocs/html/cc.license.html]
      2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
      3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
      4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
  • 55. Sorting in ZSL
      $q = Zend_Search_Lucene_Search_QueryParser::parse('search string');
      $results = $index->find($q, 'title');
      foreach ($results as $result) {
          echo '<h3>', $result->title, "</h3>\n";
          $doc = getDocumentFromDB($result->did);
          echo $q->htmlFragmentHighlightMatches($doc);
      }
  • 57. Faceted Search In Solr
      $client = new SolrClient($options);
      $query = new SolrQuery('bacon');
      // facets must be enabled on the query before it is sent
      $query->setFacet(true);
      $query->addFacetField('cat');
      $response = $client->query($query);
      $r = $response->getResponse();
      $f = $r['facet_counts']['facet_fields'];
      foreach ($f['cat'] as $facet => $count) {
          echo $facet . " " . $count . "\n";
      }
  • 59. More Like This
      $rset = new XapianRset();
      $rset->add_document(5959); // str_replace
      $e = $enquire->get_eset(40, $rset);
      $t = $e->begin();
      for ($t; !$t->equals($e->end()); $t->next()) {
          $qs[] = new XapianQuery($t->get_term(),
              intval($t->get_weight()));
      }
      $query = new XapianQuery(XapianQuery::OP_OR, $qs);
  • 60. More Like This Example
      php xapsim.php
      1656 results found:
      1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
      2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
      3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
      4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
  • 62. Index Updates [Diagram; labels: New Docs, Docs, Delta, Main, Query]
  • 63. Search Speed
      Zend Search Lucene:
          $index = Zend_Search_Lucene::open('index');
          $index->optimize();
      Sphinx:
          indexer --merge main delta --rotate
      Solr:
          $client = new SolrClient($options);
          $client->optimize();
      Xapian:
          xapian-compact xapindex xapindex2
  • 64. Distributing Search [Diagram; labels: Document, Index, Application]
  • 65. Large Scale Search
      http://www.nutch.org
      http://hadoop.apache.org
  • 66. Image Credits
      Title: http://www.flickr.com/photos/generated/2084287794/
      What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
      You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
      Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
      Sphinx: http://www.flickr.com/photos/generated/2084287794/
      Lucene: http://www.flickr.com/photos/mypanda/7731447/
      Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
      Solr: http://www.flickr.com/photos/m-j-s/2724756177/
      Xapian: http://www.flickr.com/photos/olibac/3522056495/
      Using Search: http://www.flickr.com/photos/eneas/175027945/
      Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
      Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
      Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/
  • 68. Thank you! Ian Barber @ianbarber http://phpir.com ian@ibuildings.com http://joind.in/talk/view/1462

Editor's notes

  1. Contact Details
  2. This is a question we’d often like to ask our users. But with search, they tell us. Search is about getting content to users that want it. Searching users are active and engaged. Give them what they want and they are more likely to buy, read, comment, share, etc.
  3. This talk covers how full text search works, looks at some different options for integration, and looks at making it better. Time for questions at the end, but one does spring to mind now:
  4. Why search, why not let Google do it? Private: intranet, FB inbox, offline. Bad at: Twitter for a long time, blogs for a long time. Product focus, like Amazon. Speed of update, like a forum. Now, let’s look at how a full text search operates.
  5. Search engine structure: raw text, documents (add URL, title, split up etc.), text analysis, index, query parser, query, results, search UI.
  6. Simplified structure of a search engine. Start with a pool of raw data, chunked into documents. Analyser processes text in docs, index stores. Other side: search UI. Query parsed by query parser, like the analyser, searched on the index, and results sorted and returned.
  7. Tokenising is taking a document and splitting it into tokens to index. Can be difficult, even with the space char. Commas - remove punctuation - then send 1.1 mil to 11 mil! Hyphens. Apostrophes.
  8. That said, starting with something simple isn’t a bad idea. Here we look for continuous sequences of word chars. Capture with offset, which is for phrase matching. More advanced SEs have better tokenisation: & in AT&T. Some instead have a buzzwords file of specific terms: C++.
  9. Pair extracted tokens with assigned doc ID. Filter stop words - an, the, of - which don’t distinguish. Position info included.
  10. Invert and merge pairs, so terms -> docs. Positions still stored, represented by (), e.g. best @ 4 and 16 in doc 1. Often stored separately, or just a straight count. List of docs == posting list. Enough to start a search.
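
The pairing and inversion steps in notes 9 and 10 could be sketched in PHP roughly as follows. This is a minimal illustration rather than the code of any engine covered in the talk: `buildIndex()` and the two sample documents are hypothetical, and positions are counted in tokens instead of the byte offsets the slide's tokeniser captures.

```php
<?php
// Hypothetical helper: builds term => (docID => list of token positions).
function buildIndex(array $docs) {
    $index = array();
    foreach ($docs as $docID => $text) {
        preg_match_all('/\w+/', strtolower($text), $m);
        foreach ($m[0] as $pos => $term) {
            // Appending creates the nested arrays on first use.
            $index[$term][$docID][] = $pos;
        }
    }
    ksort($index); // keep terms in a stable, sorted order
    return $index;
}

// Made-up sample documents, keyed by document ID.
$index = buildIndex(array(
    1 => 'The best of the best',
    4 => 'What would be best',
));
print_r($index['best']); // posting list for "best"
```

Each entry of `$index` is a posting list; stop word filtering is left out to keep the sketch short.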
  11. Take the search query and tokenise it the same way. Important! For each term we array_intersect. Can do boolean searches by doing array union for OR etc. BUT no RANK - any result with all words is as good as any other. Must store importance of terms to documents - weight.
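
As a rough sketch of that merge, using the posting lists shown on the Boolean Query Merge slide: `booleanAnd()` and `booleanOr()` are hypothetical helper names, and a real engine would walk sorted lists rather than call `array_intersect` on whole arrays.

```php
<?php
// Posting lists (document IDs only) copied from the "Boolean Query Merge" slide.
$index = array(
    'best'    => array(1, 4, 129, 298, 305, 338),
    'western' => array(4, 95, 194, 204, 298, 305),
    'hotel'   => array(2, 40, 200, 298, 355, 402),
);

// Hypothetical helper: AND query, every term must be present.
function booleanAnd(array $terms, array $index) {
    $result = null;
    foreach ($terms as $term) {
        $postings = isset($index[$term]) ? $index[$term] : array();
        $result = ($result === null)
            ? $postings
            : array_intersect($result, $postings);
    }
    return array_values((array) $result);
}

// Hypothetical helper: OR query, any term may be present.
function booleanOr(array $terms, array $index) {
    $result = array();
    foreach ($terms as $term) {
        if (isset($index[$term])) {
            $result = array_merge($result, $index[$term]);
        }
    }
    $result = array_unique($result);
    sort($result);
    return $result;
}

$and = booleanAnd(array('best', 'western', 'hotel'), $index);
print_r($and); // only document 298 contains all three terms
```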
  12. The weighting scheme includes two measures. TF - term frequency, the count of terms in the document. IDF - inverse document frequency, the rareness of the term in the collection.
  13. Simple but usable weight algo, basis of most. TF - count of times term appears. IDF - total docs / docs with term, e.g. 10 total / 3 with term. Log to smooth. Store this score with the document in the posting list for the term. Normalise scores over a doc to account for length - but still boosts short text.
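
To make the 10 total / 3 with term case concrete, here is a small worked sketch. The posting list reuses the positions shown for "best" on the Inverted Index slide; the collection size of 10 and the function name `tfIdf()` are assumptions for illustration.

```php
<?php
// Posting list for one term: docID => positions in that document.
$postings = array(
    1   => array(4, 16),
    4   => array(422),
    129 => array(344),
);
$totalDocs = 10; // assumed collection size

function tfIdf($docID, array $postings, $totalDocs) {
    $tf  = count($postings[$docID]);               // occurrences in this doc
    $idf = log($totalDocs / count($postings), 2);  // log2(total / docs with term)
    return $tf * $idf;
}

foreach (array_keys($postings) as $docID) {
    printf("doc %d: %.3f\n", $docID, tfIdf($docID, $postings, $totalDocs));
}
```

Document 1 contains the term twice, so it scores twice the weight of documents 4 and 129; length normalisation is not applied here.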
  14. TF-IDF PHP code
  15. TF-IDF PHP code
  16. Document is a position in N-dimensional space. One dimension for each term ever seen. Mostly 0. Normalised to length 1 (sqrt of sum of squares of vals).
  17. Just look at 2 terms here to keep it simple. Here, rather than just looking for matches, we accumulate a score for each matching document. 246 is our highest scoring document, picking up two good scores. But 120 makes it in at number three, despite not having ‘best’ in it.
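
The accumulation can be reproduced with the weights from the Ranked Query Merge slide. This is a sketch only; `rankedMerge()` is a hypothetical name and the index is a plain array of term => (docID => weight).

```php
<?php
// docID => weight, copied from the "Ranked Query Merge" slide.
$index = array(
    'best'    => array(23 => 0.008, 42 => 0.002, 179 => 0.023,
                       246 => 0.039, 333 => 0.014, 703 => 0.001),
    'western' => array(42 => 0.003, 88 => 0.004, 120 => 0.023,
                       179 => 0.001, 246 => 0.034, 798 => 0.004),
);

function rankedMerge(array $terms, array $index) {
    $scores = array();
    foreach ($terms as $term) {
        if (!isset($index[$term])) { continue; }
        foreach ($index[$term] as $docID => $weight) {
            if (!isset($scores[$docID])) { $scores[$docID] = 0.0; }
            $scores[$docID] += $weight;
        }
    }
    arsort($scores); // highest score first, keys (doc IDs) preserved
    return $scores;
}

// Keep the three best documents, preserving their IDs as keys.
$top = array_slice(rankedMerge(array('best', 'western'), $index), 0, 3, true);
print_r($top);
```

Documents 246, 179 and 120 come out on top, in that order, matching the slide.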
  18. For a 2 term, 2 dimension case, that looks like this. Calculate cosine of the angle between with dot product. Similarity: 1 = same, 0 = orthogonal (no shared terms). We can treat a query as a new document. The documents it is most similar to are the best results. Only need to compare to documents that share terms - the rest will score 0.
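
A minimal sketch of that cosine calculation over sparse vectors follows. The document weights are taken from the Document Vector slide; the function names and the single-term query are assumptions for illustration.

```php
<?php
// Vectors are sparse arrays: term => weight. Missing terms count as 0.
function dot(array $a, array $b) {
    $sum = 0.0;
    foreach ($a as $term => $w) {
        if (isset($b[$term])) { $sum += $w * $b[$term]; }
    }
    return $sum;
}

function cosine(array $a, array $b) {
    $norm = sqrt(dot($a, $a)) * sqrt(dot($b, $b));
    return $norm > 0 ? dot($a, $b) / $norm : 0.0;
}

$doc1  = array('socket' => 0.02, 'what' => 0.3, 'heavy' => 0.001);
$doc4  = array('heavy' => 0.002, 'steel' => 0.003);
$query = array('what' => 1.0); // the query treated as a tiny document

printf("%.3f %.3f\n", cosine($query, $doc1), cosine($query, $doc4));
```

Doc 4 shares no terms with the query, so its similarity is 0 and it never needs to be fetched from the index.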
  19. Look at query terms, retrieve posting list from index. Treat query term weights as 1 - incorrect, but OK for relative results. Index merge, and calc dot product by summing weights. So, don’t need a full match. Could add phrase search, or positional bias.
  20. Two main questions: Where does the data come from? How is the index accessed? Look at 6 PHP-friendly engines. Each a different integration method. Each with new bits of functionality.
  21. Data from database columns in one table. Simplest of all to implement - integrate through query. Note the fulltext index. Straight vector space search impl. as described before. Can only be used for MyISAM, not InnoDB. If you’re using Postgres, tsearch is built in since 8.3.
  22. MATCH AGAINST syntax. Boolean too - all engines have this, we focus on natural. Only one document has both words. Ranked in score order - MATCH AGAINST returns a float. Note there’s some tricky default config: min word length 4, and 50% fill exclusion.
  23. One interesting option is Query Expansion - blindly expand the search based on words returned. Usually not a very good idea, because we want more precision than recall. Precision is quality of results, recall is completeness. In this case it’s expanded to Lorenzo, because of Marcello’s hatred for bacon.
  24. Can actually interrogate the index with myisam_ftdump. Run from the database directory. However, let’s say you want to search on a normalised schema directly - multiple tables.
  25. Using Sphinx you can index a more complex query. Used on Craigslist, and apparently on The Pirate Bay. There is a PHP API for access, or the extension pecl/sphinx. Same interface but faster.
  26. Once installed, set up the sphinx.conf file. Top: connection stuff - also works with PostgreSQL. Indexing on sql_query - could use a view, complex query etc. Adding attributes - non-indexed elements of a doc - numeric or timestamp only in Sphinx. Using multi-valued attributes to support tags, many to many. Other options, such as sql_query_pre or post.
  27. Next tell Sphinx about the index. Minimum length of indexed word. Prefix for wildcard search - infix anywhere, prefix end. We also enable a stemmer.
  28. Stemming consistently collapses different forms of the same word to a stem. Here each version is reduced to happen, but an English word is not always generated, just a consistent one. This allows us to match more words, and is often, but not always, helpful. The most common algorithm is the Porter stemmer; there is a PHP implementation on the site.
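
To show the idea of collapsing word forms, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which has many more rules and conditions; `naiveStem()` and its three suffixes are assumptions chosen only so that the slide's examples work.

```php
<?php
// NOT the Porter algorithm: a deliberately naive suffix stripper.
function naiveStem($word) {
    $word = strtolower($word);
    foreach (array('ing', 'ed', 's') as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when a reasonably long stem is left behind.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

foreach (array('happening', 'happened', 'happens', 'happen') as $w) {
    echo $w, ' - ', naiveStem($w), "\n";
}
```

The same function must be applied to both documents and queries, otherwise stemmed index terms will not match unstemmed query terms.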
  29. Indexer command to build index. Might lock DB table; there is a ranged table workaround. Command line search, defaults to require all. Stemmer - love vs loves. Last line - start the search daemon.
  30. Match any word. Wildcard search - prefix search. Returns both ‘bacon’ docs. Add filters - limiting to certain values of attributes. Now we just get 1 result. Sphinx can be built into MySQL as a table type, and queried via a where arg.
  31. From the other end - Swish is easy to plug in to an existing system at short notice. Swish-e is an engine with a long pedigree, and a PHP extension. Used by quite a few universities. Doesn’t support multibyte charsets, which is a bit of a downside. Great for combinations where you have a bunch of Word docs or similar documents, and a website, and you want to search both.
  32. First ‘fs’ for file system index - we create a conf file for the indexer. In the conf we tell it where to look for files. FileFilters extract text from non-text formats, doc/pdf. Can specify IndexDir multiple times for different doc stores. Requires wvWare and xpdf. Apache Tika.
  33. Includes an effective web crawler, another way to get data. Getting it through the web loses some of the advantages. Can plug into a website you have no real control over. Mode is prog to call out to the spider script. Index file is a different name.
  34. Being able to query across the two indexes is very handy. Here we search fs and www indexes and give combined results. Can use various filters to limit search to parts of HTML documents. Or filter on file system paths.
  35. Now we’ll look at engines where we index from within PHP. Lucene, Apache Foundation search engine. Very successful, but has ports instead of bindings. Native PHP port in Zend Framework, Zend Search Lucene.
  36. Hook right into the application, easy addition/plugin. Lots of control, easy to add metadata/attributes. Lucene calls them fields: string keys, multiple value types. Text: indexed and original stored - UnStored: not stored. Index compatible w/ Java Lucene 2.3 - can index in Java, search in PHP.
  37. Querying is straightforward, and quick. The scores are only really interesting as a relative value.
  38. Includes some handy utilities such as an HTML doc parser. Spits out various fields such as title and body automatically. Allows you to add other fields as required. Advantage of PHP - easy to hack at, add new doc types. HOWEVER - doesn’t scale to large collections, so you may prefer to use one of the Java-based versions... and the easiest way is with Solr.
  39. Solr uses Java Lucene - wraps it in a REST+XML/JSON web service. Convenient for all the usual SOA reasons. Solr is in use by CNET, Digg, Netflix and other high profile sites. There is a PHP extension, or a PHP client API.
  40. Not massively different from ZSL. Solr needs you to create a schema first, to define the fields of docs. Note the client commit down the bottom. Until a document is committed it is not visible to searches. Hardly know you’re using a web service.
  41. Searching is similar. XML-based response format means a more complex return struct. Solr is great for larger scale collections. Provides good admin functionality - enterprise friendly.
  42. Our last engine is Xapian. High performance C++ based search engine. There is a Solr-like service called Flax, but we’ll look at the engine directly. PHP SWIG-based extension and low level API. Gives some cool features, and a lot of control. Creates database on FS, or can be accessed remotely.
  43. Separation between the document and the indexer. Integration of stemmer - English here. We have a numeric indexed attribute, referred to as a ‘value’ here, for the title.
  44. Xapian index (local etc.)
  45. The searching is more complicated. We have more control - STEM_SOME: don’t stem words that start with a capital letter (proper nouns).
  46. Xapian query
  47. Retrieving the result relies on these functions wrapping around C iterators. Note the percentage score value - overtly relative, but can be thresholded if needed.
  48. Xapian query result
  49. We have a search engine. We know where the data is coming from. How can we improve results?
  50. Link text can be a great source of keywords. To use a classic example, from one of the early papers about Google, if someone types ‘big blue’ into Google, one of the top results is IBM.com. But the page it points to doesn’t contain the phrase. Things link to it that do contain that phrase, and Google indexes against it. Big win for things like images and videos, where there may be no text.
  51. Need to parse the document. Easy in PHP with the DOM parser. We could then add these to the index, as a new field on a document. ZSL has a built-in HTML document type, but the getLinks function doesn’t include anchor text.
  52. Anchor text extract
  53. The next idea is zone weighting. This is a page from my blog. I know what’s important on this page - 1 to 3. Google has to guess, based on appearance. Green = boilerplate - don’t index. Index these zones as fields, and weight differently.
  54. If we break our content down into fields, we can set different ‘boost’ values on them. Boosts > 1 more important, < 1 less important. E.g. de-emphasise comments.
  55. Document weight - importance, authority. In general - not tied to a specific query. PageRank - but that won’t work on a small collection. Comments - “great post”, comment count. Inbound visitors. Retweets - Google uses a UserRank PR-type calculation on follower counts.
  56. Similar to zones, boost at document level. The default is 1. Adding one 100th for each comment. This of course could be tuned for individual circumstances.
  57. Got engine, got data, got good results. Now, look at ways to improve search user experience.
  58. With UI - do what other websites do. With search - do what Google et al do. Summaries or snippets show a selection of the page.
  59. Sphinx build highlights
  60. Most search engines have some support for this. With Sphinx here, we can pass the query and index name to the BuildExcerpts function to get highlighted contextual snippets. getTextFromDB is just a pretend function that would wrap retrieving the raw full text.
  61. We can do this by storing some of the original text in the SE. We’ve added a StoreDescription based on the body, for 1000 characters. This will appear in the result object as swishdescription. We may want to index more, then choose the bit we display based on the presence of query words.
  62. Google highlighted search terms on summaries. Can do on whole document as well. Easy to do in many engines. ZSL highlight matches - could use stored field or external. HighlightMatches without fragment will add HTML headers.
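
Where an engine offers no highlighter, a simple one can be approximated in plain PHP. The sketch below is an assumption-laden stand-in, not how ZSL or Sphinx do it: `highlight()` is a hypothetical helper that matches whole words case-insensitively and does no stemming.

```php
<?php
// Hypothetical helper: wraps each query term found in $text in <b> tags.
function highlight($text, $query) {
    preg_match_all('/\w+/', strtolower($query), $m);
    $terms = array_unique($m[0]);
    if (!$terms) { return htmlspecialchars($text); }
    $alternatives = implode('|', array_map('preg_quote', $terms));
    // Escape the text first so the only markup in the output is our own.
    return preg_replace(
        '/\b(' . $alternatives . ')\b/i',
        '<b>$1</b>',
        htmlspecialchars($text)
    );
}

echo highlight('Mikko loves bacon', 'bacon'), "\n";
```

One known limitation of this sketch: a query term such as "amp" would also match inside escaped entities like `&amp;`, so a production version would need to tokenise before escaping.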
  63. Spelling correction is a really handy function. Important to correct to known words from the index, rather than a default dictionary.
  64. Xapian example - set flag on indexer & queryparser. We had an index based on PHP documentation. Have mistyped str_replace and strcmp.
  65. Function names were corrected, despite not being ‘words’. They featured in the index, and had low edit distance from the query. Some low quality results returned - where we might use a threshold. Solr/Lucene has a similar plugin.
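
The edit distance idea can be illustrated with PHP's built-in `levenshtein()`. This is a simplified sketch, not Xapian's actual spelling algorithm: `suggest()`, the term list and the distance cutoff of 2 are all assumptions for illustration.

```php
<?php
// Hypothetical helper: closest index term by edit distance, or null.
function suggest($word, array $indexTerms, $maxDistance = 2) {
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($indexTerms as $term) {
        $d = levenshtein($word, $term);
        if ($d < $bestDistance) {
            $best = $term;
            $bestDistance = $d;
        }
    }
    return $best;
}

// Assumed sample of terms that appear in the index.
$terms = array('str_replace', 'strcmp', 'strlen', 'preg_replace');
echo suggest('strreplace', $terms), "\n";
echo suggest('str_cmp', $terms), "\n";
```

Because the candidates come from the index itself, suggestions are always terms that will return results, which is the point made in note 63.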
  66. Another useful idea is sorting result sets on something other than rank. This is an example from Google News. E.g. file search, email, private messages may want others (sender, date, subject).
  67. Here we’ve added a sort on title. Can be expensive as SEs can’t do normal shortcuts. But normally straightforward.
  68. We’ve got a search here on Epicurious, the food and cooking site. Shows categories and result counts. This is called faceted search, categories = facets. Document has many categories. Good for product-based search. Solr was built with faceted search in mind for CNET reviews.
  69. Enable faceted mode, set one facet, ‘cat’. If we’d been duplicating Epicurious, each of the options on the left would have been a facet. Get results plus enumeration of options in each facet + count.
  70. User can offer feedback by selecting more like this. Find documents like this one. Good for search with many meanings - ‘creed’ (game, band, belief). Example from a dissertation search engine.
  71. Generate search from the document the user selected. Xapian has this built in, can do in Solr as well. Top 40 most important terms extracted (can do more than one doc). Using str_replace from index of phpdoc. Combine terms with ORs.
  72. Finds itself, and other good matches. MySQL FTI has blind query expansion, which gets more results based on the results retrieved - not as good, and hella slow!
  73. Search can be expensive. Lots of data to process. Most engines have some sort of query cache built in. We’ll take a quick look at some different aspects of performance.
  74. Indexes designed for more read than write. Adding data to a large index can be expensive. Have two indexes. Merge. Lucene uses segments automatically.
  75. Smaller index: less IO, better O/S cache, faster results. But slower update speed. Recombine segments, merge deltas. Optimise and compress index. This can be an expensive operation though. Try to keep index on local disk, not network.
  76. When demands are too big for a single server, need to look at distributing. Replication tends not to give such a boost here, as you generally have too large an index which is too slow for single queries, rather than a scale problem. Need to shard contents based on hash - something not searched for. Most systems have a way of working with remote backends, to give a single search and sort point.
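
Sharding on a hash could look something like the following. It is a sketch under assumptions: `shardFor()` is a hypothetical helper, the document ID is used as the hash key because it is not something users search for, and three shards is an arbitrary choice.

```php
<?php
// Hypothetical helper: maps a document ID to one of $shardCount shards.
function shardFor($docID, $shardCount) {
    // crc32() can be negative on 32-bit PHP builds, so format it as unsigned.
    $hash = (float) sprintf('%u', crc32((string) $docID));
    return (int) fmod($hash, $shardCount);
}

// Distribute twelve sample document IDs over three shards.
$shards = array_fill(0, 3, array());
foreach (range(1, 12) as $docID) {
    $shards[shardFor($docID, 3)][] = $docID;
}
print_r($shards);
```

A query would then be sent to every shard and the partial results merged by score, which is the single search and sort point the note mentions.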
  77. The systems we’ve talked about will all index tens of thousands of documents. Xapian and Solr should handle into the millions on one server. 100s of millions/billions = webscale. Challenges: data size, rate of update. Nutch is a FOSS webscale SE/crawler created by Doug Cutting, of Lucene. Also did Hadoop: MapReduce, distributes files etc. (not being sued by Google). Used on thousands of nodes at Yahoo, among others.
  78. Thanks!