Apache Solr is a state-of-the-art, high-performance, scalable search server that you can use in your (PHP) application to provide a feature-rich search experience. Besides full-text search, it provides spell checking, highlighting, facets and powerful functions that put it in the realm of a general information retrieval engine, replacing complex database queries you would otherwise need.
Use cases range from e-commerce, real-estate database search, intranets/extranets, content management systems and document management systems to anything that offers exploration of structured and/or unstructured information. The recent addition of geo-aware features makes location searches possible too.
2. About me
● Currently employed by eZ Systems http://ez.no
● Active in open source community for a while
– Squid HTTP proxy server (about 15 years ago)
– PHP based CMS solutions (mostly eZ Publish)
– executive committee
● Currently fancying:
– PHP as the master glue language for almost everything
– Apache Lucene family of projects (mainly Solr)
– NoSQL (Not only SQL) and scalable architectures
– CMS systems & information management
3. Outline
● Overview of Apache Solr
● Concepts & internals
● How to use it with PHP
● Use cases & tips
● Resources
5. Apache Solr Curriculum Vitae
● Open source Apache Lucene project,
started by Yonik Seeley
● Standalone, enterprise grade search
server built on top of Lucene
● Lives in a Java servlet container
● Access through a REST-ful API
– HTTP
– Primary payload in requests: XML
– Other response formats: PHP, JSON, …
7. Solr in a nutshell
● State of the art, advanced full text search and
information retrieval
● Fast, scalable with native replication features
● Flexible configuration
● Document oriented storage
● Geospatial search
● Native cloud features
8. Full text search main features
● Tunable relevancy ranking on top of internal
similarity algorithms
● Highlighting
● Sorting
● Filtering
● “Drill-down” navigation (facets)
● Automatic related content
● Spell checking
● Multilingual text analysis
10. Tunable relevancy ranking
● “Boosting” at index and query time
– certain types of content
– certain parts of content (“fields”)
– page-rank like if the content has relations
● Elevate request component
– predefined “pages/documents” to the top when certain
keywords are entered
● With customised functions
– more recent articles
– proximity (geolocations)
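A minimal sketch of query-time boosting, written in Python for brevity (the same parameters can be sent from PHP). It assumes the dismax query parser and hypothetical fields named title, body and published. The recip/ms function follows the shape shown in the Solr function query documentation; treat the constants as a starting point, not a recommendation.

```python
from urllib.parse import urlencode


def boosted_query(text):
    """Build a dismax query string that weights fields and favours recent documents."""
    params = {
        "q": text,
        "defType": "dismax",
        # Query-time field boost: a match in the (hypothetical) title field
        # weighs three times as much as a match in body.
        "qf": "title^3 body",
        # Boost function: the bonus shrinks as the document gets older.
        # "published" is an assumed date field.
        "bf": "recip(ms(NOW,published),3.16e-11,1,1)",
    }
    return urlencode(params)


print(boosted_query("green bottle"))
```

The resulting string is appended to the select URL; index-time boosts and the elevate component are configured separately and are not covered by this sketch.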
11. Filtering
● Does not influence the relevancy
● Narrows down the scope
● Very powerful: full boolean, wildcards,
fuzzy, and unlimited combinations
● Ranges (dates, numbers,
alphanumeric, ...)
Also for implementing security!
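Filters are usually sent as fq parameters next to the main q parameter. The sketch below (Python, for illustration) combines a term filter and a range filter; the field names type and year are assumptions, not part of any standard schema.

```python
from urllib.parse import urlencode


def filtered_query(text, content_type, year_from, year_to):
    """Build a query string with two filter queries that narrow the result set."""
    params = [
        ("q", text),
        # Each fq restricts the scope without influencing the relevancy score.
        ("fq", f"type:{content_type}"),
        ("fq", f"year:[{year_from} TO {year_to}]"),
    ]
    return urlencode(params)


print(filtered_query("solr", "article", 2008, 2010))
```

A security filter can be added the same way, for example one more fq on a field that lists the groups allowed to see each document.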
12. Facets
● Along the main query, “facet fields” may be defined,
usually operating on meta-data:
– Type of content
– Publication year
– Keywords
– Author ....
● The result set is returned with the number of hits
within each “facet”
● You can use the selected facet as a subsequent filter
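Faceting is switched on with facet=true and one facet.field parameter per field. In the default JSON response layout the counts for a field arrive as a flat list of alternating terms and counts. A small Python sketch, with a hypothetical type field, of reading such a list and turning a selected facet into a follow-up filter:

```python
def facet_counts(flat):
    """Turn a flat [term, count, term, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def drill_down(field, value):
    """Build the filter query for the facet value the user clicked."""
    return f'{field}:"{value}"'


counts = facet_counts(["article", 12, "blog", 5])
print(counts)
print(drill_down("type", "article"))
```

The string returned by drill_down would be sent as an fq parameter with the next request.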
14. Automatic related content
(“More Like This”)
● Search engine determines itself which are the
important terms of a page and performs a query
● All other normal features can be used
– Filtering
– Sorting
– Facets
16. Spell checking
● Two possible strategies
– Dictionary look-up
– Using the indexed words themselves (recommended)
● Possible “Google” approach using the “best guess”
– Search for “Grein botle”
=> suggests “Green bottle”
● Let Solr return individual keyword suggestions
=> more client side processing required
17. Multilingual features
● Adapted tokenizers
● Stemming (reducing words to common form)
– Reduces some spelling errors too!
– May decrease accuracy
● Different algorithms per language
● Normalisation (“latin 1 characters”)
– élève = eleve, Spaß = spass, ...
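The normalisation idea can be illustrated in a few lines of Python. This is only a sketch of the concept using the standard library; Solr does this work with its own character and token filters, not with this code.

```python
import unicodedata


def fold(text):
    """Lowercase, expand ß to ss, and strip accents from a string."""
    # casefold() lowercases and maps ß to ss.
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    # After decomposition accents are separate combining marks; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold("élève"), fold("Spaß"))
```

Applying the same folding at index time and at query time is what makes “eleve” find “élève”.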
19. Performance
● Solr employs intelligent caches
– filters
– queries
– internal indexes
● Optimized for search/retrieval
● Possible autowarming on start up
● When updates are done, caches are
reconstructed on the fly in the background
20. Performance (2)
● Replication
– master-slave for now
– works across platforms with same configuration
– no native OS features needed (or rsync)
– more cloud features under development
● Sharding (client driven)
22. The Solr/Lucene index
● Inverted index
● Holds a collection of “documents” (hello NoSQL)
● Document
– Collection of fields
– Flexible schema!
– Unique ID (user defined)
● Solr uses an XML-based config file:
schema.xml
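To make the inverted index concrete, here is a toy model in Python: a map from each term to the set of document IDs that contain it. Lucene's real index is far more elaborate (positions, norms, compressed segments); the document IDs and texts below are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Look up a single term; no scan over the documents is needed."""
    return sorted(index.get(term.lower(), set()))


docs = {"doc1": "Green bottle", "doc2": "Green tea", "doc3": "Blue bottle"}
index = build_inverted_index(docs)
print(search(index, "green"))
```

A lookup costs one dictionary access regardless of how many documents are indexed, which is the reason search engines invert the data.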
23. Fields
● Various field types, derived from base classes
● Indexed
– contains the inverted index
– usually analyzed & tokenized
– makes it searchable and sortable
● Stored
– contains also the original content
– content can be part of the request response
● Can be multi-valued!
– opens possibilities beyond full text search
24. Field definitions: schema.xml
● Field types
– text
– numerical
– dates
– location
– … (about 25 in total)
● Actual fields (name, definition, properties)
● Dynamic fields
● Copy fields (as aggregators)
25. schema.xml: simple field type examples
<fieldType name="string" class="solr.StrField"
           sortMissingLast="true" omitNorms="true"/>

<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField"
           sortMissingLast="true" omitNorms="true"/>

<!-- A Trie based date field for faster date range
     queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField"
           omitNorms="true" precisionStep="6"
           positionIncrementGap="0"/>

<!-- A text field that only splits on whitespace for exact matching
     of words -->
<fieldType name="text_ws" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
26. schema.xml: more complex field type
<!-- A general unstemmed text field - good if one does not know the language of the field -->
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
            expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
28. Analysis
● Solr does not really search your text, but rather
the terms that result from the analysis of text
● Typically a chain of
– Character filter(s)
– Tokenisation
– Filter A
– Filter B
– …
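The chain can be pictured as plain function composition. The Python sketch below is a toy stand-in for a character filter, a whitespace tokenizer and two token filters; the function names and the stop word list are made up for illustration and are not Solr classes.

```python
import re

STOPWORDS = frozenset({"the", "a", "of"})  # tiny illustrative list


def strip_html(text):
    """Character filter: replace tags with a space before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenize(text):
    """Tokenizer: split the cleaned text on white space."""
    return text.split()


def lowercase(tokens):
    """Filter A: lowercase every token."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Filter B: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text):
    """Run the whole chain: char filter, tokenizer, then each filter in order."""
    tokens = whitespace_tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stopwords):
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>The Power of Solr</p>"))
```

The order matters: the stop word filter here only works because lowercasing runs before it.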
29. Solr comes with many tokenizers and filters
● Some are language specific
● Others are very specialised
● It is very important to get this right;
otherwise you may not get what you expect!
30. Text analysis examples
Field type “text”: input string => resulting terms (term position)
iPad => i (1), pad (2), ipad
élève. => elev (1)
PowerShot => power (1), shot (2), powershot
31. Character filters
● Used to clean up text before tokenizing
– HTMLStripCharFilter (strips HTML, XML, JS, CSS)
– MappingCharFilter (normalisation of characters,
removing accents)
– Regular expression filter
32. Tokenizers
● Convert text to tokens (terms)
● You can define only one per field/analyzer
● Examples
– WhitespaceTokenizer (splits on white space)
– StandardTokenizer
– CJK variants
33. Additional filters
● Many possible per field/analyzer
● Many delivered with Solr out of the box
● If not enough, write a tiny bit of Java or look for
contributions
● Examples ...
35. Reversing Filter
● Reverses the order of characters
● Use: allow “leading wildcards”
● *thing => gniht*
● A lot faster (prefixes)
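Why reversing helps can be shown with a sorted term list in Python: a leading wildcard on the original term becomes a prefix lookup on the reversed term, and prefixes can be located with a binary search. This is a sketch of the principle only, with made-up terms, not of Solr's implementation.

```python
from bisect import bisect_left


def build_reversed_terms(terms):
    """Store every term reversed, in sorted order."""
    return sorted(term[::-1] for term in terms)


def leading_wildcard(reversed_terms, suffix):
    """Answer the query *suffix with a prefix scan over the reversed terms."""
    prefix = suffix[::-1]  # "thing" becomes "gniht"
    start = bisect_left(reversed_terms, prefix)
    matches = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no further matches are possible
        matches.append(reversed_term[::-1])
    return sorted(matches)


terms = build_reversed_terms(["something", "nothing", "thin", "bottle"])
print(leading_wildcard(terms, "thing"))
```

Without the reversed copy, every term in the index would have to be checked against the pattern.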
36. Synonyms
● Inject synonyms for certain terms
● Language specific
● Best used for query time analysis
– may inflate the search index too much
– decreases relevancy
37. Stemming
● Reduce terms to their root form
– Plural forms
– Conjugations
● Language specific (or not relevant, CJK)
● Many specialised stemmers available
– Most European languages
– Dutch (!)
38. Copy fields
● Analysis is done differently for
– searching/filtering
– faceting/sorting
● Stemming and not stemming in different fields
can increase relevance of results
● Use copy fields in schema.xml or do it client
side
39. Geospatial search
● Solr dedicated fields
– Latitude Longitude type
● Special geospatial functions in filtering &
boosting
– Haversine distance (geosphere)
– Simple ranges (squares in 2-D)
– Special query constructs (upcoming)
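For reference, the Haversine distance itself is short enough to write out. The Python sketch below computes the great-circle distance between two latitude/longitude points; the mean Earth radius of 6371 km is an assumed default, and Solr computes this internally when you use its geospatial functions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
print(haversine_km(0, 0, 0, 90))
```

Simple ranges (a bounding box on latitude and longitude) are cheaper and are often used as a first filter before the exact distance is computed.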
41. Get the data and feed it
● Most *AMP applications have databases
● Map your data to a “document model”
– denormalization, flattening
– most DB fields can be fed unaltered, Solr takes
care of the rest
● Send it through HTTP as XML
● One constraint: it must be UTF-8!
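A minimal sketch of building the XML update message in Python, using the add/doc/field structure of Solr's XML update format. The field names are hypothetical and must match your own schema.xml; a list value produces one field element per value, which suits multi-valued fields.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialise one flattened record as a UTF-8 encoded <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(item)
    # Returns bytes, already encoded as UTF-8.
    return ET.tostring(add, encoding="utf-8")


payload = build_add_xml({"id": 42, "title": "élève", "keywords": ["solr", "php"]})
print(payload.decode("utf-8"))
```

The payload would then be POSTed to the update handler with a text/xml content type, followed by a commit so the documents become searchable.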
42. Searching
● Construct a GET/POST query
● Base parameters
– “q” for query text
– “start” for offset
– “rows” for max number of results to return
Example:
http://localhost:8983/solr/select/?q=content&version=2.2&start=0&rows=10&indent=on&wt=php
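The same request can be assembled and read programmatically. The Python sketch below reuses the host and path from the example URL, asks for JSON instead of the PHP response writer, and pulls the hit count and documents out of a response body. No request is sent here, and the sample response is hand-written.

```python
import json
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # host and path from the example above


def build_search_url(query, start=0, rows=10):
    """Assemble the select URL from the three base parameters."""
    params = {"q": query, "start": start, "rows": rows, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)


def extract_hits(raw_json):
    """Return (total number of matches, documents on this page)."""
    response = json.loads(raw_json)["response"]
    return response["numFound"], response["docs"]


print(build_search_url("content"))
sample = '{"response": {"numFound": 1, "start": 0, "docs": [{"id": "42"}]}}'
print(extract_hits(sample))
```

Paging is a matter of increasing start by rows for each subsequent page.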
44. PHP client side
● Roll your own classes & functions
– Not difficult, it's REST after all
– Some cURL, XML, JSON or native PHP array parsing
● Use existing libraries
– PECL: http://pecl.php.net/package/solr
– http://code.google.com/p/solr-php-client/
(follows ZF coding standards)
– eZ Components: ezcSearch
● PHP CMSs usually come with their own integration
– eZ Publish, Drupal, Symfony ...
46. Indexing binary files
● Solr includes the Apache Tika libraries
– convert about any format to plain text
– you can activate a dedicated requesthandler for it
OR
● Use it standalone (command line) for integration into
existing code
See: http://lucene.apache.org/tika/
47. Integrate legacy data
● Use the Solr Data Import Handler
● Able to index databases directly
– define the schema to use (including possible
joins)
– fire simple requests to Solr to actually
index/update
● Also XML feeds, files (csv), ...
48. e-Commerce
● If you want to sell, make sure users find the products
they want
– Use facets (categories, drill-down, …)
– Push high margin / hot / new products with elevation
– Pay a lot of attention to index and query time analysis
● Feed additional meta-data and use it to tune
– Ratings
– Analytics (Google, Omniture, ...)
49. Have multilingual content?
● Multi-core configuration
– Setup a dedicated Solr core per language
– Each has its own schema definitions, while you
can still use common field names
● If using one index
– Use dynamic fields and create language-specific
analyzers for dedicated language
suffixes/prefixes
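On the client side, the single-index approach comes down to routing each value to the field with the right suffix. The Python sketch below assumes a naming convention of base name plus underscore plus language code, and assumes matching dynamic field patterns (for example one ending in _en) are defined in schema.xml; both are illustrative choices.

```python
SUPPORTED_LANGUAGES = ("en", "fr", "nl")  # assumed set of configured suffixes


def localized_field(base, lang):
    """Return the language-specific field name, for example title_fr."""
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"no analyzer configured for language: {lang}")
    return f"{base}_{lang}"


def localize_document(doc_id, lang, fields):
    """Rename every text field so it hits the analyzer for its language."""
    doc = {"id": doc_id}
    for name, value in fields.items():
        doc[localized_field(name, lang)] = value
    return doc


print(localize_document("42", "fr", {"title": "élève"}))
```

The same helper is used at query time so that a French search is run against the French fields.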
51. Thank you!
Questions?
email: paul dot borgermans at gmail dot com
http://twitter.com/paulborgermans
Please rate this talk/slides:
http://joind.in/talk/view/1504