2. SOLR
●
SOLR is a standalone search server that can scale separately from
the application that uses it
●
e.g. it avoids the case where an e-commerce server is slowed down by
users searching the product catalog
●
SOLR is accessed using HTTP/XML REST-like and JSON APIs
●
Multi-platform, multi-language and client-independent
●
Results in XML, CSV, or JSON (with custom variations for
Ruby, Python, PHP)
●
100% open source, written in Java, runs on the JVM
●
Apache Foundation top-level project
●
One of the most widely used search servers in industry
3. SOLR : A Lucene server
●
Solr is a search platform that provides all the features of the Lucene search engine *
●
High-performance indexing
●
Incremental and batch indexing
●
Small footprint (RAM and disk)
●
And has all of Lucene's features
●
Ranked searching
●
Many query types (phrase, wildcard, regexp, range, geospatial proximity)
●
Many field types, meaningful sorting
●
Multi-index search and merge of results
● Faceting
●
Language-specific text analysis (e.g. stemming)
● Suggestions
* (the two projects merged in March 2010; SOLR 3.1, released in March 2011, was the first release after the merge)
4. Simple SOLR Example
●
Index a product catalog (e.g. an iPod Video)
●
Data in XML format
<doc>
<field name="id">MA147LL/A</field>
<field name="name">Apple 60 GB iPod with Video Playback Black</field>
<field name="features">2.5-inch, 320x240 color TFT LCD display with LED backlight</field>
<field name="features">Up to 20 hours of battery life</field>
<field name="features">Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video</field>
<field name="price">399.00</field>
<field name="inStock">true</field>
<field name="store">37.7752,-100.0232</field> <!-- Dodge City store -->
</doc>
●
Schema configuration
<field name="id" type="string" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="features" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true" />
<field name="store" type="location" indexed="true" stored="true"/>
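The XML document on this slide can also be generated from application data. The following is a minimal sketch, assuming the catalog entry is available as a Python dict; the helper name `build_add_xml` is hypothetical, and the `<add>` wrapper is the envelope that Solr's XML update format expects around one or more `<doc>` elements.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc):
    """Serialize one document (a dict) as a Solr XML <add> message.

    List values become repeated <field> elements, which is how a
    multiValued field such as "features" is sent.
    (Hypothetical helper, not part of any Solr client library.)
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")

# Same data as the slide's example document.
ipod = {
    "id": "MA147LL/A",
    "name": "Apple 60 GB iPod with Video Playback Black",
    "features": [
        "2.5-inch, 320x240 color TFT LCD display with LED backlight",
        "Up to 20 hours of battery life",
        "Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video",
    ],
    "price": "399.00",
    "inStock": "true",
    "store": "37.7752,-100.0232",
}

payload = build_add_xml(ipod)
print(payload)
```

Posting `payload` to the collection's `/update` handler with a `text/xml` content type, followed by a commit, would make the document searchable; that step is left out here so the sketch runs without a server.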
5. Simple SOLR Example
●
Query
●
Return all products with « video » in any field, sorted by descending
price, showing just name, price and inStock
curl "http://localhost:8983/solr/collection1/select?q=video&sort=price+desc&fl=name,price,inStock&indent=true"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">name,price,inStock</str>
<str name="sort">price desc</str>
<str name="indent">true</str>
<str name="q">video</str>
</lst>
</lst>
<result name="response" numFound="3" start="0">
<doc>
<str name="name">ATI Radeon X1900 XTX 512 MB PCIE Video Card</str>
<float name="price">649.99</float>
<bool name="inStock">false</bool></doc>
<doc>
<str name="name">ASUS Extreme N7800GTX/2DHTV (256 MB)</str>
<float name="price">479.95</float>
<bool name="inStock">false</bool></doc>
<doc>
<str name="name">Apple 60 GB iPod with Video Playback Black</str>
<float name="price">399.0</float>
<bool name="inStock">true</bool></doc>
</result>
</response>
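The same request can be assembled and its XML answer decoded from a program. This is a rough sketch using only the Python standard library; `select_url`, `parse_docs` and the embedded `SAMPLE` string (a shortened copy of the response on this slide) are illustrative assumptions, and no server is contacted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/collection1/select"

def select_url(**params):
    """Build a /select URL; commas are left unescaped to match the slide."""
    return BASE + "?" + urlencode(params, safe=",")

url = select_url(q="video", sort="price desc",
                 fl="name,price,inStock", indent="true")

# Shortened copy of the XML response shown on the slide.
SAMPLE = """<response>
<result name="response" numFound="3" start="0">
<doc>
<str name="name">ATI Radeon X1900 XTX 512 MB PCIE Video Card</str>
<float name="price">649.99</float>
<bool name="inStock">false</bool></doc>
<doc>
<str name="name">ASUS Extreme N7800GTX/2DHTV (256 MB)</str>
<float name="price">479.95</float>
<bool name="inStock">false</bool></doc>
<doc>
<str name="name">Apple 60 GB iPod with Video Playback Black</str>
<float name="price">399.0</float>
<bool name="inStock">true</bool></doc>
</result>
</response>"""

# The element tag carries the field type, the "name" attribute the field name.
CONVERT = {
    "str": str,
    "int": int,
    "float": float,
    "bool": lambda text: text == "true",
}

def parse_docs(xml_text):
    """Turn each <doc> of a Solr XML response into a typed Python dict."""
    root = ET.fromstring(xml_text)
    docs = []
    for doc in root.iter("doc"):
        docs.append({el.get("name"): CONVERT.get(el.tag, str)(el.text)
                     for el in doc})
    return docs

docs = parse_docs(SAMPLE)
print(url)
for d in docs:
    print(d["price"], d["inStock"], d["name"])
```

Adding `wt=json` to the parameters would return JSON instead, which avoids the type mapping above.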
7. Simple SOLR Example
●
Filter Query
●
Uses a different cache (the filter cache) than the query result cache, which is useful for large result sets
Filter Query: all products priced from 300 to 499 USD
q=*:*&fl=name,price&fq=price:[300 TO 499]
<result name="response" numFound="4" start="0">
<doc>
<str name="name">Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300</str>
<float name="price">350.0</float>
</doc>
<doc>
<str name="name">Apple 60 GB iPod with Video Playback Black</str>
<float name="price">399.0</float>
</doc>
<doc>
<str name="name">Canon PowerShot SD500</str>
<float name="price">329.95</float>
</doc>
<doc>
<str name="name">ASUS Extreme N7800GTX/2DHTV (256 MB)</str>
<float name="price">479.95</float>
</doc>
</result>
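The range syntax in `fq` uses square brackets for inclusive bounds, and the brackets, colon and spaces must be URL-encoded when sent over HTTP. A small sketch of building such a query string follows; `range_fq` is a hypothetical helper, and the round trip through `parse_qs` only shows what the server would receive after decoding.

```python
from urllib.parse import urlencode, parse_qs

def range_fq(field, low, high):
    """Inclusive range filter in Lucene/Solr syntax: field:[low TO high]."""
    return f"{field}:[{low} TO {high}]"

params = {
    "q": "*:*",                          # match every document
    "fl": "name,price",                  # fields to return
    "fq": range_fq("price", 300, 499),   # filter, cached separately from q
}

query_string = urlencode(params)
decoded = parse_qs(query_string)

print(query_string)
print(decoded["fq"][0])
```

Because the filter is kept apart from `q`, the same price range can be reused with different main queries without being recomputed each time.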
8. Simple SOLR Example
●
Spatial Query
●
Store data:
– <field name="store">45.17614,-93.87341</field> <!-- Buffalo store -->
– <field name="store">40.7143,-74.006</field> <!-- NYC store -->
– <field name="store">37.7752,-122.4232</field> <!-- San Francisco store -->
●
We are at 45.15,-93.85 (about 3.437 km from the Buffalo store)
●
Find all products in a store within 5 km of our position:
QUERY : &fl=name,store&q=*:*&fq={!geofilt%20pt=45.15,-93.85%20sfield=store%20d=5}
"response":{"numFound":3,"start":0,"docs":[
{
"name":"Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300",
"store":"45.17614,-93.87341"},
{
"name":"Belkin Mobile Power Cord for iPod w/ Dock",
"store":"45.18014,-93.87741"},
{
"name":"A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM",
"store":"45.18414,-93.88141"}]
}
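The distance quoted on this slide can be checked by hand with the haversine formula. The sketch below assumes a spherical Earth with a mean radius of 6371 km; it is a client-side illustration of what a 5 km radius filter means, not Solr's own implementation of `geofilt`.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Store coordinates from the slide.
stores = {
    "Buffalo": (45.17614, -93.87341),
    "NYC": (40.7143, -74.006),
    "San Francisco": (37.7752, -122.4232),
}
here = (45.15, -93.85)

distances = {name: haversine_km(*here, *pos) for name, pos in stores.items()}
nearby = [name for name, km in distances.items() if km <= 5]

for name, km in distances.items():
    print(f"{name}: {km:.3f} km")
print("within 5 km:", nearby)
```

Only the Buffalo store falls inside the radius, with a distance close to the 3.437 km quoted above; the other two stores are far outside it.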
9. SOLR Features
●
SOLR Cloud
●
Cluster configuration using ZooKeeper
●
Easy sharding and failover management
●
Self-healing, no single point of failure
●
SOLR Cell (aka ExtractingRequestHandler)
●
TIKA integration for binary document parsing
●
Parses DOC, PDF, XLS, MIME, etc.
●
DataImportHandler
●
Automatically fetches and indexes SQL databases, e-mails, RSS feeds,
files in a folder, etc.
10. SOLR Features
●
Multiple Solr Cores
●
Many index collections in the same server
●
Different schema definitions for each collection
●
Different configurations for storage, replication, etc
●
Caching
●
Repeated searches are cached, which improves speed
●
Advanced warming techniques
●
Adding content triggers just a partial cache update
● Advanced
●
Language detection
●
Natural Language Processing
●
Clustering to scale both search and document retrieval
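The Caching bullets on this slide describe a general idea: a repeated search is answered from memory instead of touching the index again. The toy sketch below illustrates only that idea with Python's `functools.lru_cache`; the `CATALOG` list and `search` function are made up for the example and say nothing about how Solr's caches are actually implemented.

```python
from functools import lru_cache

# Tiny stand-in for an index: product names taken from the earlier slides.
CATALOG = (
    "ATI Radeon X1900 XTX 512 MB PCIE Video Card",
    "Apple 60 GB iPod with Video Playback Black",
    "Canon PowerShot SD500",
)

index_lookups = []  # records every time the "index" is really scanned

@lru_cache(maxsize=128)
def search(query):
    """Case-insensitive substring search over CATALOG, memoized per query."""
    index_lookups.append(query)
    needle = query.lower()
    return tuple(name for name in CATALOG if needle in name.lower())

first = search("video")   # miss: scans the catalog
second = search("video")  # hit: served from the cache
info = search.cache_info()

print(first)
print("index lookups:", len(index_lookups), "cache hits:", info.hits)
```

The second call returns the same result without scanning the catalog again, which is the speed-up the slide refers to.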
12. SOLR TIKA integration
●
SOLRCell embeds TIKA for binary file parsing
●
TIKA parses DOC, PDF, XLSX, HTML... and represents them
using XHTML, JSON or CSV
●
Full list of accepted formats :
http://tika.apache.org/1.3/formats.html
●
For some files, it can index only the metadata (MP3, JPG, AVI)
●
SOLRCell internally takes the TIKA output and indexes it so
we can search it
●
SOLR does not store the original binary file
15. SOLR Use Cases
●
Liferay Search
●
As Liferay already uses Lucene, we can connect it to a SOLR server
●
Offloads the Liferay server by letting the SOLR cluster handle all the
user searches in the portal
●
Magento E-Commerce
●
Avoids using MySQL for searching
●
Better search results
●
Better overall performance
●
Alfresco Search
●
Currently, Alfresco recommends setting up SOLR from the beginning
●
By default, Lucene+Tika is used internally