2. Agenda
• Apache Lucene overview
• Why do we need Apache Solr?
• Everyman tales from Solr
• Enterprise what?
• One step beyond...
3. Apache Lucene overview
• Information Retrieval library
• Inverted indexes are quick and efficient
• Vector space model
• Advanced search options (synonyms, stopwords, similarity, nearness)
• Different language implementations (Java, .NET, C, Python)
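To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is a toy illustration, not Lucene's actual data structure: the function names and the sample documents are made up, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all(index, query):
    """Return the ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical sample documents, keyed by document id.
docs = {1: "Apache Lucene overview", 2: "Apache Solr overview", 3: "Lucene API"}
index = build_inverted_index(docs)
print(search_all(index, "apache overview"))  # [1, 2]
```

A lookup touches only the posting lists of the query terms instead of scanning every document, which is why this structure is quick.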
4. The Lucene API
• Lucene indexes are built on a Directory
• Directory can be accessed by IndexReaders and IndexWriters
• IndexSearchers are built on top of Directories and IndexReaders
• IndexWriters can write Documents inside the index
• Documents are made of Fields
• Fields have value(s) and options
• Directory > IndexReader/Writer > Document > Field
6. Indexing Lucene
• A Lucene index has one or more segments and a generation
• Changes to the index must be committed (and optimized)
• No fixed schema
• Each field can be STORED, INDEXED and ANALYZED
• Each field can have NORMS and TERM VECTORS
7. Searching Lucene
• Open an IndexSearcher on top of an IndexReader over a Directory
• Many query types: TermQuery, MultiTermQuery, BooleanQuery,
WildcardQuery, PhraseQuery, PrefixQuery, MultiPhraseQuery, FuzzyQuery,
TermRangeQuery, NumericRangeQuery
• Get results from a TopDocs object
8. Why do we need Apache Solr?
• Lucene is a library
• Lucene by itself can only be queried programmatically
• Often the search system has to be fully independent of other
systems (e.g. a CMS)
• A ready-to-deploy search server is what you need
• Need to scale both vertically and horizontally
11. Apache Solr - Overview
• Ready to use enterprise search server
• REST (and programmatic) API
• Results in XML, JSON, PHP, Ruby, etc...
• Exploit Lucene power
• Scaling capabilities (replication, distributed search)
• Easy administration interface
• Easy to extend and customize (plugin architecture)
12. Apache Solr - Project status
• Latest release: 1.4.1 (June 2010)
• Lots of new features on trunk
• Most of the new features on branch 3.0
• A large, very active community
• Lucid Imagination powered project
13. Solr - 5 minutes tutorial
• Download latest release (1.4.1)
• cd $SOLR_HOME/example
• java -jar -server start.jar
• You now have a running Solr instance (on top of Jetty) that you can access via
http://localhost:8983/solr
• cd $SOLR_HOME/example/exampledocs
• Index with the command: sh post.sh *.xml
• Search with your browser
14. Solr - Query syntax
• Default operator is OR (override it by adding &q.op=AND to the HTTP request)
• You can query fields with fieldname:value
• Common + - AND OR NOT modifiers
• Range queries on date or numeric fields timestamp:[* TO NOW]
• Boost terms, e.g.: roma^2 inter
• Fuzzy search roam~0.6
• ...
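Because these queries travel in a URL, special characters such as ^, [ and spaces have to be percent-encoded. The sketch below builds a few of the queries above with Python's standard library. It assumes the select handler of the tutorial instance at http://localhost:8983/solr/select; the helper name select_url and the field name "name" are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: the select handler of the example instance from the tutorial.
SOLR_SELECT = "http://localhost:8983/solr/select"

def select_url(query, extra=None):
    """Build a GET URL for a query string plus optional extra parameters."""
    params = [("q", query)] + list((extra or {}).items())
    return SOLR_SELECT + "?" + urlencode(params)

examples = {
    "field query": select_url("name:solr"),  # "name" is a hypothetical field
    "AND as default operator": select_url("roma inter", {"q.op": "AND"}),
    "range query": select_url("timestamp:[* TO NOW]"),
    "boosted term": select_url("roma^2 inter"),
}
for label, url in examples.items():
    print(label, "->", url)
```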
15. Solr - Basic configuration steps
• Define fields, types and analysis inside schema.xml
• Play with solrconfig.xml:
• request handlers (update, search)
• index parameters
• caches
• deletion policy
• autowarming
• replication, clustering, etc...
16. Solr - schema.xml
• Types
• Analyzers to use for each type
• Fields with name, type and options
• Unique key
• Dynamic fields
• Copy fields
• Don’t use the default schema.xml, write it from scratch!
17. Solr - Type definition
Analyzers for querying and indexing
inside the schema
18. Solr - solrconfig.xml
• Where Solr will write the index
• Index merge factor
• Control different caches: documents, query results, filters
• Request handlers that consume (HTTP) requests; typically at least a standard
search handler and an update handler exist
• Update request processor chains to configure indexing behavior
• Event listeners (newSearcher, firstSearcher)
• and much more...
19. Solr - Indexing
• Index updates are sent as XML commands via HTTP POST
• <add> to insert and update
• <delete> to remove by unique key or query
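As a rough illustration of what these XML commands look like, the sketch below builds them with Python's standard library. Note that Solr's XML update format names the removal element <delete>. The helper names and the field names ("id", "title", "category") are hypothetical; the messages would typically be POSTed to the update handler (for the tutorial instance, assumed to be http://localhost:8983/solr/update) and followed by a <commit/>.

```python
import xml.etree.ElementTree as ET

def add_message(docs):
    """Build an <add> command from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def delete_by_id(doc_id):
    """Build a <delete> command that removes one document by unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")

def delete_by_query(query):
    """Build a <delete> command that removes every document matching a query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

# Hypothetical document and criteria, for illustration only.
print(add_message([{"id": "1", "title": "Hello"}]))
print(delete_by_id("1"))
print(delete_by_query("category:old"))
```

Adding a document whose unique key already exists replaces the old document, which is why <add> covers both insert and update.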
20. Solr - Searching
• HTTP GET to the Solr instance with the mandatory q parameter, which specifies the
query
• df - the default field to query
• fl - the list of fields to return (stored fields only)
• sort - fields used for sorting, defaults to score (which is not a field)
• start, rows - paging attributes
• wt - response type, defaults to xml, can be json, php, ruby, etc.
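The parameters above combine into a single GET URL. The following sketch shows one way to assemble it, translating a zero-based page number into start. The function name search_url, the base URL of the tutorial instance, and the field names ("id", "name", "price") are assumptions made for illustration.

```python
from urllib.parse import urlencode

def search_url(base, q, df=None, fl=None, sort=None, page=0, rows=10, wt="json"):
    """Assemble a search URL; `page` is zero-based and is translated into `start`."""
    params = [("q", q)]
    if df:
        params.append(("df", df))
    if fl:
        params.append(("fl", ",".join(fl)))
    if sort:
        params.append(("sort", sort))
    params += [("start", page * rows), ("rows", rows), ("wt", wt)]
    return base + "?" + urlencode(params)

# Third page of ten results, sorted by a hypothetical "price" field.
url = search_url(
    "http://localhost:8983/solr/select",
    "solr",
    fl=["id", "name"],
    sort="price asc",
    page=2,
)
print(url)
```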
21. Solr - Data import
• Typically “old” systems rely on databases
• Data can be imported from DBs using the DataImportHandler component
• Define datasource, driver and mappings
22. Solr - Highlighting
• Useful when a snippet of the search results is needed
• In Solr 1.4.1 only stored fields can be highlighted
• Add &hl=true&hl.fl=field1,field2 to HTTP search request in order to enable
highlighting on field1 and field2
23. Solr - Faceting
• Break up search results into multiple categories showing counts for each
• Often used in stores
• Can be very useful in guiding user experience
• Users can then drill down to the results of a single category
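In Solr, field faceting is switched on with the facet=true and facet.field request parameters. The sketch below builds such a query string and turns a list of facet counts into a dictionary. It assumes the JSON response writer's default layout, where each facet field is a flat list alternating value and count; the helper names, the field name "cat", and the sample counts are made up.

```python
from urllib.parse import urlencode

def facet_query(q, facet_fields):
    """Build the query string for a search that also returns field facets."""
    params = [("q", q), ("facet", "true")]
    params += [("facet.field", name) for name in facet_fields]
    return urlencode(params)

def pair_counts(flat):
    """Turn a flat [value, count, value, count, ...] list into {value: count}."""
    return dict(zip(flat[0::2], flat[1::2]))

print(facet_query("ipod", ["cat"]))

# Made-up facet counts in the assumed flat layout.
sample = ["electronics", 14, "memory", 3, "software", 0]
print(pair_counts(sample))
```

Drilling down into a category then amounts to repeating the search restricted to the chosen facet value.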
24. Solr - Filter queries
• Queries used as filters against the actual query
• Restrict the set of documents that can be returned, without influencing score
• Useful for domain specific queries where you want the user to search only in
certain “areas” of the index
• Add &fq=somefilterquery with the default Solr syntax
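The fq parameter can be repeated, and each occurrence narrows the result set further. A minimal sketch of building such a request follows; the helper name and the field names ("category", "inStock") are hypothetical.

```python
from urllib.parse import urlencode

def filtered_query(q, filters):
    """Build a query string with one fq parameter per filter query."""
    return urlencode([("q", q)] + [("fq", f) for f in filters])

# The main query is scored; the two filters only restrict which documents qualify.
print(filtered_query("ipod", ["category:electronics", "inStock:true"]))
```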
26. Solr - Multi core
• Define multiple Solr cores inside a single Solr instance
• Each core maintains its own index
• Unified administration interface
• Runtime commands to create, swap, load, unload, delete cores
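The runtime commands are plain HTTP requests to the core admin handler. The sketch below only builds the URLs. It assumes the handler is mounted at /admin/cores (the adminPath used in the example multicore setup, which may differ in your solr.xml) and uses the hypothetical core names core0 and core1.

```python
from urllib.parse import urlencode

# Assumed location of the core admin handler; it depends on adminPath in solr.xml.
CORE_ADMIN = "http://localhost:8983/solr/admin/cores"

def core_admin_url(action, **params):
    """Build a core admin URL for the given action and its parameters."""
    pairs = [("action", action)] + sorted(params.items())
    return CORE_ADMIN + "?" + urlencode(pairs)

print(core_admin_url("SWAP", core="core0", other="core1"))
print(core_admin_url("UNLOAD", core="core0"))
```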
27. Solr - Replication
• Under high traffic it’s useful to replicate a Solr instance and split the queries
among the replicas (possibly with a load balancer in front)
• Master has the original index
• The slave polls the master for the latest version of the index
• If the slave has an older version of the index, it asks the master for the difference
(rsync-like)
• In the meantime the indexes remain available
28. Solr - Distributed search
• When an index is too large, in terms of space or memory required, it can be
useful to define two or more shards
• A shard is a Solr instance and can be searched or indexed independently
• At the same time it’s possible to query all the shards and have the result
merged from the sub-results of each shard
• http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=category:information
• Note that the document distribution among indexes is up to the user (or whoever
feeds the indexes)
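Since document distribution is left to whoever feeds the indexes, one common approach is to pick the shard from a stable hash of the document id. The sketch below shows that idea together with a builder for the distributed query URL. The hash-based routing is an assumption about how a feeder might work, not a Solr feature, and the function names are hypothetical; the shard addresses are the ones from the example URL.

```python
import zlib
from urllib.parse import urlencode

# Shard addresses taken from the example URL.
SHARDS = ["localhost:8983/solr", "localhost:7574/solr"]

def shard_for(doc_id, shards=SHARDS):
    """Pick the shard a document should be indexed on, using a stable CRC32 hash."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]

def distributed_url(q, shards=SHARDS):
    """Build a query URL that asks the first shard to search all shards and merge."""
    query = urlencode([("shards", ",".join(shards)), ("q", q)], safe=":/,")
    return "http://" + shards[0] + "/select?" + query

print(shard_for("doc-42"))
print(distributed_url("category:information"))
```

A stable hash keeps each document on the same shard across re-indexing, so an update always replaces the earlier copy instead of creating a duplicate on another shard.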