Find it,
possibly also near you!
       Paul Borgermans
About me
●   Currently employed by eZ Systems http://ez.no
●   Active in open source community for a while
     –   Squid HTTP proxy server (about 15 years ago)
     –   PHP based CMS solutions (mostly eZ Publish)
     –              executive committee

●   Currently fancying:
     –   PHP as the master glue language for almost everything
     –   Apache Lucene family of projects (mainly Solr)
     –   NoSQL (Not only SQL) and scalable architectures
     –   CMS systems & information management
Outline
●   Overview of Apache Solr
●   Concepts & internals
●   How to use it with PHP
●   Use cases & tips
●   Resources
Overview of Apache Solr
Apache Solr Curriculum Vitae
●   Open source Apache Lucene project,
    started by Yonik Seeley
●   Standalone, enterprise grade search
    server built on top of Lucene
●   Lives in a Java servlet container
●   Access through a REST-ful API
        –   HTTP
        –   Primary payload in requests: XML
        –   Other response formats: PHP, JSON, …
Used by ..
[slide image: logos of organisations using Solr]
And many more ...
Solr in a nutshell
●   State of the art, advanced full text search and
    information retrieval
●   Fast, scalable with native replication features
●   Flexible configuration
●   Document oriented storage
●   Geospatial search
●   Native cloud features
Full text search main features
●   Tunable relevancy ranking on top of internal
    similarity algorithms
●   Highlighting
●   Sorting
●   Filtering
●   “Drill-down” navigation (facets)
●   Automatic related content
●   Spell checking
●   Multilingual text analysis
At a glance ..
Tunable relevancy ranking
●   “Boosting” at index and query time
        –   certain types of content
        –   certain parts of content (“fields”)
        –   PageRank-like boosting if the content has relations
●   Elevate request component
        –   predefined “pages/documents” to the top when certain
              keywords are entered
●   With customised functions
        –   more recent articles
        –   proximity (geolocations)
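The "more recent articles" boost is usually a reciprocal function of document age. Solr's function queries include `recip(x,m,a,b)`, which computes a/(m*x+b). The sketch below only reproduces that arithmetic in Python. The constants (m set to one over the number of milliseconds in a year, a = b = 1) are assumptions chosen for illustration, not values from the talk.

```python
# Illustrative arithmetic only: mirrors the shape of a reciprocal recency boost.
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # roughly 3.16e10 milliseconds


def recip(x, m, a, b):
    """Return a / (m*x + b), the reciprocal function used for decay boosts."""
    return a / (m * x + b)


def recency_boost(age_ms, m=1.0 / MS_PER_YEAR, a=1.0, b=1.0):
    """Boost factor for a document that is age_ms milliseconds old.

    With these assumed constants a brand-new document gets 1.0 and a
    one-year-old document gets about 0.5.
    """
    return recip(age_ms, m, a, b)
```

The boost decays smoothly, so older documents are demoted rather than excluded.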
Filtering
●   Does not influence the relevancy
●   Narrows down the scope
●   Very powerful: full boolean, wildcards,
    fuzzy, and unlimited combinations
●   Ranges (dates, numbers,
    alphanumeric, ...)


     Also for implementing security!
Facets
●   Alongside the main query, “facet fields” may be defined,
    usually operating on meta-data:
        –   Type of content
        –   Publication year
        –   Keywords
        –   Author ....
●   The result set is returned with the number of hits
    within each “facet”
●   You can use the selected facet as a subsequent filter
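A facet request, and the follow-up request that turns a selected facet into a filter, can be sketched as a query string. The parameters `facet`, `facet.field` and `fq` are standard Solr request parameters. The field names `author` and `year` are hypothetical, and the helper itself is only an illustration in Python.

```python
from urllib.parse import urlencode


def facet_query(text, facet_fields, filters=()):
    """Build a Solr query string with facet fields and optional filter queries."""
    params = [("q", text), ("facet", "true")]
    # One facet.field parameter per field to count on.
    params += [("facet.field", field) for field in facet_fields]
    # Each fq narrows the result set without affecting relevancy scores.
    params += [("fq", fq) for fq in filters]
    return urlencode(params)


# First request: ask for counts per author.
first = facet_query("solr", ["author"])
# Follow-up: the user clicked a year facet, so it becomes a filter.
drill_down = facet_query("solr", ["author"], filters=["year:2010"])
```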
Facets: example
Automatic related content (“More Like This”)
●   The search engine itself determines which terms of a
    page are important and runs a query with them
●   All other normal features can be used
       –   Filtering
       –   Sorting
       –   Facets
Spell checking
●   Two possible strategies
        –   Dictionary look-up
        –   Using the indexed words themselves (recommended)
●   Possible “Google” approach using the “best guess”
        –   Search for “Grein botle”
             => suggests “Green bottle”
●   Let Solr return individual keyword suggestions
      => more client side processing required
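The "indexed words" strategy can be illustrated on the client side: each query word is replaced by its closest match among the terms already in the index. This sketch uses Python's `difflib` as a stand-in for the similarity measure. It is not Solr's spell check component, and the term list is made up.

```python
from difflib import get_close_matches


def suggest(query, indexed_terms):
    """Return a 'best guess' query built from the closest indexed terms."""
    corrected = []
    for word in query.lower().split():
        # n=1 keeps only the single best candidate above difflib's default cutoff.
        matches = get_close_matches(word, indexed_terms, n=1)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


# Hypothetical list of terms taken from the index.
terms = ["green", "bottle", "grass"]
best_guess = suggest("Grein botle", terms)
```

Words with no close match are passed through unchanged, so the suggestion never drops part of the query.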
Multilingual features
●   Adapted tokenizers
●   Stemming (reducing words to common form)
        –   Reduces some spelling errors too!
        –   May decrease accuracy
●   Different algorithms per language
●   Normalisation (“latin 1 characters”)
        –   élève = eleve, Spaß = spass, ...
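The normalisation step can be illustrated outside Solr with Python's standard library: case folding expands "ß" to "ss", and Unicode decomposition lets accents be dropped. This is a sketch of the idea only. In Solr itself the mapping is configured in the analysis chain.

```python
import unicodedata


def fold(text):
    """Lowercase, expand ß to ss, and strip accents from Latin characters."""
    folded = text.casefold()  # "Spaß" becomes "spass"
    decomposed = unicodedata.normalize("NFKD", folded)
    # Drop combining marks (the accents split off by the decomposition).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applying the same folding at index and query time is what makes “élève” and “eleve” match.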
Geospatial search
Performance
●   Solr employs intelligent caches
        –   filters
        –   queries
        –   internal indexes
●   Optimized for search/retrieval
●   Possible autowarming on start up
●   When updates are done, caches are
    reconstructed on the fly in the background
Performance (2)
●   Replication
        –   master-slave for now
        –   works across platforms with same configuration
        –   no native OS features needed (or rsync)
        –   more cloud features under development
●   Sharding (client driven)
Concepts and internals
The Solr/Lucene index
●   Inverted index
●   Holds a collection of “documents” (hello NoSQL)
●   Document
        –   Collection of fields
        –   Flexible schema!
        –   Unique ID (user defined)
●   Solr uses an XML-based config file:

    schema.xml
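The inverted index idea fits in a few lines: instead of storing documents and scanning them, map every term to the set of documents that contain it. This toy Python version splits on whitespace and lowercases. Lucene's real index structure is far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Return the sorted IDs of documents containing the term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical documents keyed by a user-defined unique ID.
docs = {"a": "Green bottle", "b": "green grass"}
index = build_inverted_index(docs)
```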
Fields
●   Various field types, derived from base classes
●   Indexed
        –    contains the inverted index
        –    usually analyzed & tokenized
        –    makes it searchable and sortable
●   Stored
        –    contains also the original content
        –    content can be part of the request response
●   Can be multi-valued!
        –    opens possibilities beyond full text search
Field definitions: schema.xml
●   Field types
        –   text
        –   numerical
        –   dates
        –   location
        –   … (about 25 in total)
●   Actual fields (name, definition, properties)
●   Dynamic fields
●   Copy fields (as aggregators)
schema.xml: simple field type examples
    <fieldType name="string" class="solr.StrField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField"
               sortMissingLast="true" omitNorms="true"/>

    <!-- A Trie based date field for faster date range queries
         and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField"
               omitNorms="true" precisionStep="6"
               positionIncrementGap="0"/>

    <!-- A text field that only splits on whitespace for exact
         matching of words -->
    <fieldType name="text_ws" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
schema.xml: more complex field type

    <!-- A general unstemmed text field - good if one does not know
         the language of the field -->
    <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="false"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
Huh?
Analysis
●   Solr does not really search your text, but rather
    the terms that result from the analysis of text
●   Typically a chain of
        –   Character filter(s)
        –   Tokenisation
        –   Filter A
        –   Filter B
        –   …
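The chain can be mimicked in plain Python to show how each stage feeds the next: a character filter cleans the raw text, one tokenizer splits it, and token filters are applied in order. The stop word list and the regular expression are simplifications assumed for illustration. Solr's own factories do considerably more.

```python
import re


def strip_html(text):
    """Character filter: replace markup tags with a space before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter A: lowercase every token."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Token filter B: drop tokens found in a (hypothetical) stop word list."""
    return [token for token in tokens if token not in stopwords]


def analyze(text):
    """Run the chain: character filter, then tokenizer, then filters in order."""
    tokens = whitespace_tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens
```

The order matters: lowercasing runs before stop word removal so that “The” is recognised as a stop word.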
Solr comes with many tokenizers and filters

●   Some are language specific
●   Others are very specialised
●   It is very important to get this right

    otherwise, you may not get what you expect!
Text analysis examples
Field type “text”:

String          term position 1    term position 2
iPad         => i                  pad, ipad
élève.       => elev
PowerShot    => power              shot, powershot
Character filters
●   Used to clean up text before tokenizing
       –   HTMLStripCharFilter (strips html, xml, js, css)
       –   MappingCharFilter (normalisation of characters,
            removing accents)
       –   Regular expression filter
Tokenizers
●   Convert text to tokens (terms)
●   You can define only one per field/analyzer
●   Examples
        –   WhitespaceTokenizer (splits on white space)
        –   StandardTokenizer
        –   CJK variants
Additional filters
●   Many possible per field/analyzer
●   Many delivered with Solr out of the box
●   If not enough, write a tiny bit of Java or look for
    contributions



●   Examples ...
Phonetic filters
●   PhoneticFilterFactory
●   “sounds like” transformations and matching
●   Algorithms:
       –   Metaphone
       –   Double Metaphone
       –   Soundex
       –   Refined Soundex
Reversing Filter
●   Reverses the order of characters
●   Use: allow “leading wildcards”
●   *thing => gniht*
●   A lot faster (prefixes)
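Why reversing helps: a leading wildcard such as *thing would require scanning every term, whereas on reversed terms it becomes the prefix gniht*, which a sorted term list can answer with a binary search. A small Python sketch of that trick, using a made-up term list:

```python
from bisect import bisect_left


def build_reversed_terms(terms):
    """Index-time step: store each term reversed, in sorted order."""
    return sorted(term[::-1] for term in terms)


def leading_wildcard(reversed_terms, suffix):
    """Answer '*suffix' as a prefix lookup on the reversed terms."""
    prefix = suffix[::-1]
    start = bisect_left(reversed_terms, prefix)  # jump to the first candidate
    matches = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(reversed_term[::-1])
    return sorted(matches)


reversed_terms = build_reversed_terms(["thing", "something", "think", "nothing"])
```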
Synonyms
●   Inject synonyms for certain terms
●   Language specific
●   Best used for query time analysis
       –   may inflate the search index too much
       –   decreases relevancy
Stemming
●   Reduce terms to their root form
       –   Plural forms
       –   Conjugations
●   Language specific (or not relevant, CJK)
●   Many specialised stemmers available
       –   Most european languages
       –   Dutch (!)
Copy fields
●   Analysis is done differently for
        –   searching/filtering
        –   faceting/sorting
●   Stemming and not stemming in different fields
    can increase relevance of results

●   Use copy fields in schema.xml or do it client
    side
Geospatial search
●   Solr dedicated fields
        –   Latitude Longitude type
●   Special geospatial functions in filtering &
    boosting
        –   Haversine distance (geosphere)
        –   Simple ranges (squares in 2-D)
        –   Special query constructs (upcoming)
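For reference, the Haversine (great-circle) distance mentioned above can be written out directly. This Python sketch assumes a spherical Earth with a mean radius of 6371 km and takes coordinates in decimal degrees.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

A simple range filter (a square in 2-D) is cheaper and is often used first to narrow the candidates before computing exact distances.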
How to use it with PHP
Get the data and feed it
●   Most *AMP applications have databases
●   Map your data to a “document model”
       –   denormalization, flattening
       –   most DB fields can be fed unaltered, Solr takes
            care of the rest
●   Send it through HTTP as XML

●   One constraint: it must be UTF-8!
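A flattened record can be turned into Solr's XML update message (`<add><doc><field name="...">`) with any XML library. The sketch below builds the UTF-8 payload in Python. The field names are hypothetical and the HTTP POST itself is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize one flat document as a UTF-8 encoded <add><doc> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, values in doc.items():
        # A list or tuple becomes a multi-valued field: one <field> per value.
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    # encoding="utf-8" returns bytes, satisfying the UTF-8 constraint.
    return ET.tostring(add, encoding="utf-8")


payload = build_add_xml({"id": "1", "title": "élève", "keywords": ["solr", "php"]})
```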
Searching
  ●   Construct a GET/POST query
  ●   Base parameters
            –   “q” for query text
            –   “start” for offset
            –   “rows” for max number of results to return
Example:
http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
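The same kind of URL can be assembled programmatically so that the query text is escaped correctly. A minimal Python sketch, assuming the local endpoint from the example above and JSON as the response format:

```python
from urllib.parse import urlencode


def select_url(query, start=0, rows=10, wt="json",
               base="http://localhost:8983/solr/select"):
    """Build a search URL from the base parameters q, start, rows and wt."""
    params = {"q": query, "start": start, "rows": rows, "wt": wt}
    return base + "?" + urlencode(params)
```

Paging is then a matter of increasing `start` by `rows` on each request.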
Searching (2)
●   Additional parameters
         –   response format (wt)
                   ●   php = array(), json, ...
         –   type of search handler (qt)
         –   highlighting (hl.*)
         –   facets (f.<fieldName>.<FacetParam>=<value>)
         –   spellcheck (spellcheck)
         –   …
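On the client side, the response is ordinary structured data. With `wt=json`, facet counts for a field arrive by default as a flat list alternating term and count. The Python sketch below turns that list into a dictionary. The sample response is hand-written for illustration and is not real server output.

```python
import json

# Hypothetical, hand-written sample in the shape of a Solr JSON response.
SAMPLE = """{
  "response": {"numFound": 2, "start": 0, "docs": [{"id": "1"}, {"id": "2"}]},
  "facet_counts": {"facet_fields": {"author": ["paul", 2, "yonik", 1]}}
}"""


def facet_counts(response_text, field):
    """Convert the flat [term, count, term, count, ...] list into a dict."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


def document_ids(response_text):
    """Return the IDs of the documents in the current result page."""
    data = json.loads(response_text)
    return [doc["id"] for doc in data["response"]["docs"]]
```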
PHP client side
●   Roll your own classes & functions
         –   Not difficult, it's REST after all
         –   Some Curl, XML, Json or native PHP array parsing
●   Use existing libraries
         –   PECL: http://pecl.php.net/package/solr
         –   http://code.google.com/p/solr-php-client/
                    (follows ZF coding standards)
         –   eZ Components: ezcSearch
●   PHP CMSs usually come with their own client
         –   eZ Publish, Drupal, Symfony ...
Use-cases & tips
Indexing binary files
●   Solr includes the Apache Tika libraries
        –   convert about any format to plain text
        –   you can activate a dedicated requesthandler for it

                 OR
●   Use it standalone (command line) for integration into
    existing code

       See: http://lucene.apache.org/tika/
Integrate legacy data
●   Use the Solr Data Import Handler
●   Can index databases directly
        –   define the schema to use (including possible
             joins)
        –   fire simple requests to Solr to actually
               index/update
●   Also XML feeds, files (csv), ...
e-Commerce
●   If you want to sell, make sure users find the products
    they want
        –   Use facets (categories, drill-down, …)
        –   Push high margin / hot / new products with elevation
        –   Pay a lot of attention to index and query time analysis
●   Feed additional meta-data and use it to tune
        –   Ratings
        –   Analytics (Google, Omniture, ...)
Have multilingual content?
●   Multi-core configuration
        –   Setup a dedicated Solr core per language
        –   Each has its own schema definitions, while you
             can still use common field names
●   If using one index
        –   Use dynamic fields and create language-specific
             analyzers for dedicated language
             suffixes/prefixes
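With the single-index approach, the client decides which field name a value goes into, so that a matching dynamic field rule can apply the right analyzer. A Python sketch of that mapping follows. The suffix convention (`title_fr`), the fallback suffix `_txt` and the language list are all assumptions for illustration and would have to match the dynamic fields actually declared in schema.xml.

```python
KNOWN_LANGUAGES = ("en", "fr", "nl", "de")  # hypothetical set of configured languages


def localized_field(base, lang):
    """Return a language-suffixed field name, e.g. title_fr."""
    if lang not in KNOWN_LANGUAGES:
        return f"{base}_txt"  # hypothetical language-neutral fallback
    return f"{base}_{lang}"


def localize_document(doc, lang):
    """Rename every field except the unique ID to its language-specific name."""
    return {
        (name if name == "id" else localized_field(name, lang)): value
        for name, value in doc.items()
    }
```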
Resources
●   Solr: wiki, mailing lists, downloads
    http://lucene.apache.org/solr/
●   Free book, articles (by core Solr devs)
    http://www.lucidimagination.com/
●   Bother me ;)
Thank you!

                Questions?

email: paul dot borgermans at gmail dot com
      http://twitter.com/paulborgermans

        Please rate this talk/slides:
        http://joind.in/talk/view/1504

Find it, possibly also near you!

  • 1. Find it, possibly also near you! Paul Borgermans
  • 2. About me ● Currently employed by eZ Systems http://ez.no ● Active in open source community for a while – Squid http proxy server (about 15 y ago) – PHP based CMS solutions (mostly eZ Publish) – executive committee ● Currently fancying : – PHP as the master glue language for almost everything – Apache Lucene family of projects (mainly Solr) – NoSQL (Not only SQL) and scalable architectures – CMS systems & information management
  • 3. Outline ● Overview of Apache Solr ● Concepts & internals ● How to use it with PHP ● Use cases & tips ● Resources
  • 5. Apache Solr Curriculum Vitae ● Open source Apache Lucene project, started by Yonik Seeley ● Standalone, enterprise grade search server built on top of Lucene ● Lives in a Java servlet container ● Access through a REST-ful API – HTTP – Primary payload in requests: XML – Other response formats: PHP, JSON, …
  • 6. Used by .. And many more ...
  • 7. Solr in a nutshell ● State of the art, advanced full text search and information retrieval ● Fast, scalable with native replication features ● Flexible configuration ● Document oriented storage ● Geospatial search ● Native cloud features
  • 8. Full text search main features ● Tuneable relevancy ranking on top of internal similarity algorithms ● Highlighting ● Sorting ● Filtering ● “Drill-down” navigation (facets) ● Automatic related content ● Spell checking ● Multilingual text analysis
  • 10. Tunable relevancy ranking ● “Boosting” at index and query time – certain types of content – certain parts of content (“fields”) – page-rank like if the content has relations ● Elevate request component – predefined “pages/documents” to the top when certain keywords are entered ● With customised functions – more recent articles – proximity (geolocations)
  • 11. Filtering ● Does not influence the relevancy ● Narrows down the scope ● Very powerful: full boolean, wildcards, fuzzy, and unlimited combinations ● Ranges (dates, numbers, alphanumeric, ...) Also for implementing security!
  • 12. Facets ● Along the main query, “facet fields” may be defined, usually operating on meta-data: – Type of content – Publication year – Keywords – Author .... ● The result set is returned offering the number hits within each “facet” ● You can use the selected facet as a subsequent filter
  • 14. Automatic related content (“More Like This”) ● Search engine determines itself which are the important terms of a page and performs a query ● All other normal features can be used – Filtering – Sorting – Facets
  • 15.
  • 16. Spell checking ● Two possible strategies – Dictionary look-up – Using the indexed words itself (recommended) ● Possible “Google” approach using the “best guess” – Search for “Grein botle“ => suggests “Green bottle” ● Let Solr return individual keyword suggestions => more client side processing required
  • 17. Multilingual features ● Adapted tokenizers ● Stemming (reducing words to common form) – Reduces some spelling errors too! – May decrease accuracy ● Different algorithms per language ● Normalisation (“latin 1 characters”) – élève = eleve, Spaß = spass, ...
  • 19. Performance ● Solr employs intelligent caches – filters – queries – internal indexes ● Optimized for search/retrieval ● Possible autowarming on start up ● When updates are done, caches are reconstructed on the fly in the background
  • 20. Performance (2) ● Replication – master-slave for now – works across platforms with same configuration – no native OS features needed (or rsync) – more cloud features under development ● Sharding (client driven)
  • 22. The Solr/Lucene index ● Inverted index ● Holds a collection of “documents” (hello NoSQL) ● Document – Collection of fields – Flexible schema! – Unique ID (user defined) ● Solr uses a XML based config file: schema.xml
  • 23. Fields ● Various field types, derived from base classes ● Indexed – contains the inverted index – usually analyzed & tokenized – makes it searchable and sortable ● Stored – contains also the original content – content can be part of the request response ● Can be multi-valued! – opens possibilities beyond full text search
  • 24. Field definitions: schema.xml ● Field types – text – numerical – dates – location – … (about 25 in total) ● Actual fields (name, definition, properties) ● Dynamic fields ● Copy fields (as aggregators)
  • 25. schema.xml: simple field type examples <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/> <!-- boolean type: "true" or "false" --> <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/> <!-- A Trie based date field for faster date range queries and date faceting. --> <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/> <!-- A text field that only splits on whitespace for exact matching of words --> <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> </analyzer> </fieldType>
  • 26. schema.xml: more complex field type
    <!-- A general unstemmed text field - good if one does not know the language of the field -->
    <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  • 27. Huh?
  • 28. Analysis ● Solr does not really search your text, but rather the terms that result from the analysis of text ● Typically a chain of – Character filter(s) – Tokenisation – Filter A – Filter B – …
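The chain on this slide (character filter, then one tokenizer, then filters in order) can be mimicked in a few lines. The Python sketch below is only a stand-in for Solr's analyzers: the regular expression for stripping tags is deliberately naive and the stop word list is an assumption.

```python
import re

STOPWORDS = frozenset({"the", "a", "of"})  # assumed, tiny stop word list


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like a tag by a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text: str) -> list:
    """Tokenizer: split on runs of white space."""
    return text.split()


def lowercase(tokens: list) -> list:
    """Filter A: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: list) -> list:
    """Filter B: drop tokens found in the stop word list."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text: str) -> list:
    """Run the chain in order: char filter, tokenizer, filter A, filter B."""
    tokens = whitespace_tokenize(strip_html(text))
    return remove_stopwords(lowercase(tokens))


print(analyze("<p>The Power of <b>Solr</b></p>"))  # ['power', 'solr']
```

The order matters: running the stop word filter before lowercasing would let "The" through, since the list only holds lowercase entries.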
  • 29. Solr comes with many tokenizers and filters ● Some are language specific ● Others are very specialised ● It is very important to get this right; otherwise you may not get what you expect!
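  • 30. Text analysis examples (field type “text”)
    String       Resulting terms
    iPad      => i (term position 1), pad (term position 2), plus the catenated term ipad
    élève.    => elev
    PowerShot => power (term position 1), shot (term position 2), plus the catenated term powershot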
  • 31. Character filters ● Used to clean up text before tokenizing – HTMLStripCharFilter (strips html, xml, js, css) – MappingCharFilter (normalisation of characters, removing accents) – Regular expression filter
  • 32. Tokenizers ● Convert text to tokens (terms) ● You can define only one per field/analyzer ● Examples – WhitespaceTokenizer (splits on white space) – StandardTokenizer – CJK variants
  • 33. Additional filters ● Many possible per field/analyzer ● Many delivered with Solr out of the box ● If not enough, write a tiny bit of Java or look for contributions ● Examples ...
  • 34. Phonetic filters ● PhoneticFilterFactory ● “sounds like” transformations and matching ● Algorithms: – Metaphone – Double Metaphone – Soundex – Refined Soundex
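To make "sounds like" matching concrete, here is a small Python implementation of classic American Soundex, one of the algorithms listed. It is an independent sketch written for this explanation and is not the code Solr's PhoneticFilterFactory runs.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code (letter + 3 digits) of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Tymczak"))                    # T522
```

Two names match phonetically when their codes are equal, which is what happens when the codes are indexed as terms.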
  • 35. Reversing Filter ● Reverses the order of characters ● Use: allow “leading wildcards” ● *thing => gniht* ● A lot faster: the leading wildcard becomes a prefix query
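The trick can be sketched in Python as follows. This is a toy model of the idea only, with invented function names: terms are stored reversed, and a query with a single leading wildcard is rewritten into a prefix on the reversed term.

```python
def index_term(term: str) -> str:
    """Store the term with its characters reversed."""
    return term[::-1]


def rewrite_leading_wildcard(query: str) -> str:
    """Turn '*thing' into 'gniht*'; leave every other query untouched."""
    if query.startswith("*") and "*" not in query[1:]:
        return query[1:][::-1] + "*"
    return query


def matches(reversed_term: str, rewritten_query: str) -> bool:
    """Prefix match of a rewritten query against a reversed term."""
    return reversed_term.startswith(rewritten_query.rstrip("*"))


print(rewrite_leading_wildcard("*thing"))                    # gniht*
print(matches(index_term("something"), "gniht*"))            # True
```

A prefix lookup can use the sorted term dictionary directly, whereas a leading wildcard would otherwise force a scan over all terms.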
  • 36. Synonyms ● Inject synonyms for certain terms ● Language specific ● Best used for query time analysis; at index time they – may inflate the search index too much – can decrease relevancy
  • 37. Stemming ● Reduce terms to their root form – Plural forms – Conjugations ● Language specific (or not relevant, CJK) ● Many specialised stemmers available – Most European languages – Dutch (!)
  • 38. Copy fields ● Analysis is done differently for – searching/filtering – faceting/sorting ● Stemming and not stemming in different fields can increase relevance of results ● Use copy fields in schema.xml or do it client side
  • 39. Geospatial search ● Solr dedicated fields – Latitude Longitude type ● Special geospatial functions in filtering & boosting – Haversine distance (geosphere) – Simple ranges (squares in 2-D) – Special query constructs (upcoming)
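The Haversine distance mentioned on this slide is the great-circle distance between two latitude/longitude points. A self-contained Python version is shown below for reference; the mean Earth radius of 6371 km is an assumed constant and the function is not taken from Solr.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Brussels to Paris, coordinates rounded to four decimals.
print(round(haversine_km(50.8503, 4.3517, 48.8566, 2.3522)))
```

The "simple ranges" alternative on the slide avoids the trigonometry by filtering on a latitude range and a longitude range, at the cost of matching a square instead of a circle.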
  • 40. How to use it with PHP
  • 41. Get the data and feed it ● Most *AMP applications have databases ● Map your data to a “document model” – denormalization, flattening – most DB fields can be fed unaltered, Solr takes care of the rest ● Send it through HTTP as XML ● One constraint: it must be UTF-8!
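As an illustration of the "send it as XML" step, the Python sketch below flattens records into Solr's `<add><doc><field name="...">` update format and encodes the result as UTF-8. The field names (`id`, `title`, `tags`) are hypothetical and must exist in your own schema.xml; a PHP client would build the same payload.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs: list) -> bytes:
    """Serialise a list of dicts as a UTF-8 encoded Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple becomes several <field> elements (multi-valued).
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="utf-8")


payload = build_add_xml([
    {"id": 1, "title": "Un élève", "tags": ["school", "fr"]},
])
print(payload.decode("utf-8"))
```

Such a payload would typically be POSTed to Solr's update handler and followed by a `<commit/>` message before the documents become searchable.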
  • 42. Searching ● Construct a GET/POST query ● Base parameters – “q” for query text – “start” for offset – “rows” for max number of results to return Example: http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
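Building such a request is plain URL encoding. A minimal Python sketch using the base parameters from this slide is given below; the helper name `build_select_url` and the default base URL (copied from the example above) are assumptions.

```python
from urllib.parse import urlencode


def build_select_url(query, start=0, rows=10, wt="json",
                     base="http://localhost:8983/solr/select"):
    """Return a GET URL carrying the q, start, rows and wt parameters."""
    params = {"q": query, "start": start, "rows": rows, "wt": wt}
    # urlencode escapes spaces and other reserved characters in the query.
    return base + "?" + urlencode(params)


print(build_select_url("content"))
# http://localhost:8983/solr/select?q=content&start=0&rows=10&wt=json
```

Paging is done by keeping `rows` fixed and increasing `start` by that amount for each page.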
  • 43. Searching (2) ● Additional parameters – response format (wt) ● php = array(), json, ... – type of search handler (qt) – highlighting (hl.*) – facets (f.<fieldName>.<FacetParam>=<value>) – spellcheck (spellcheck) – …
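The additional parameters are appended to the same query string. The Python sketch below shows one way to assemble facet, per-field facet and highlighting parameters; the helper name and the field names in the usage line are hypothetical, while `facet`, `facet.field`, `f.<field>.facet.limit`, `hl` and `hl.fl` are regular Solr request parameters.

```python
from urllib.parse import urlencode


def build_params(query, facet_fields=(), highlight_fields=(),
                 facet_limits=None):
    """Return a query string with optional facet and highlight parameters."""
    params = [("q", query), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, hence a list of pairs, not a dict.
        params += [("facet.field", f) for f in facet_fields]
    for field, limit in (facet_limits or {}).items():
        # Per-field override, following the f.<fieldName>.<FacetParam> form.
        params.append((f"f.{field}.facet.limit", str(limit)))
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return urlencode(params)


print(build_params("solr", facet_fields=["category"],
                   highlight_fields=["title"], facet_limits={"category": 5}))
```

Using a list of pairs keeps repeated keys and their order, which a dictionary would silently collapse.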
  • 44. PHP client side ● Roll your own classes & functions – Not difficult, it's REST after all – Some cURL, XML, JSON or native PHP array parsing ● Use existing libraries – PECL: http://pecl.php.net/package/solr – http://code.google.com/p/solr-php-client/ (follows ZF coding standards) – eZ Components: ezcSearch ● PHP CMSs usually come with their own – eZ Publish, Drupal, Symfony ...
  • 46. Indexing binary files ● Solr includes the Apache Tika libraries – convert just about any format to plain text – you can activate a dedicated request handler for it OR ● Use it standalone (command line) for integration into existing code See: http://lucene.apache.org/tika/
  • 47. Integrate legacy data ● Use the Solr Data Import Handler ● Able to index databases directly – define the schema to use (including possible joins) – fire simple requests to Solr to actually index/update ● Also XML feeds, files (csv), ...
  • 48. e-Commerce ● If you want to sell, make sure users find the products they want – Use facets (categories, drill-down, …) – Push high margin / hot / new products with elevation – Pay a lot of attention to index and query time analysis ● Feed additional meta-data and use it to tune – Ratings – Analytics (Google, Omniture, ...)
  • 49. Have multilingual content? ● Multi-core configuration – Set up a dedicated Solr core per language – Each has its own schema definitions, while you can still use common field names ● If using one index – Use dynamic fields and create language specific analyzers for dedicated language suffixes/prefixes
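For the single-index variant, the client has to route each text field to the field name carrying the right language suffix. The Python sketch below shows that mapping under stated assumptions: the schema is assumed to define dynamic fields such as `*_en` and `*_fr` with matching analyzers, and the function and field names are invented for the example.

```python
def localize_fields(doc: dict, language: str, text_fields: set) -> dict:
    """Rename language-dependent fields to '<name>_<language>'."""
    out = {}
    for name, value in doc.items():
        if name in text_fields:
            # e.g. title -> title_fr, picked up by an assumed *_fr dynamic field
            out[f"{name}_{language}"] = value
        else:
            out[name] = value  # IDs, dates, numbers stay language neutral
    return out


doc = {"id": "42", "title": "Un élève", "body": "Texte", "published": "2010-06-01"}
print(localize_fields(doc, "fr", {"title", "body"}))
```

At query time the same suffix is used to pick the fields to search, so a French query only touches the `_fr` fields and their French analysis chain.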
  • 50. Resources ● Solr: wiki, mailing lists, downloads http://lucene.apache.org/solr/ ● Free book, articles (by core Solr devs) http://www.lucidimagination.com/ ● Bother me ;)
  • 51. Thank you! Questions? email: paul dot borgermans at gmail dot com http://twitter.com/paulborgermans Please rate this talk/slides: http://joind.in/talk/view/1504