3. Setting Expectations
● This presentation assumes the reader has
been familiar with the Solr/Lucene technology
for a while;
● The goal is to update the overall knowledge
around Solr and its features;
● This presentation is not an exhaustive list of all
Solr features or capabilities.
4. Solr 5
● Refreshing memories..
● Solr (pronounced "solar") is an open source enterprise search
platform, written in Java, from the Apache Lucene project. Its
major features include full-text search, hit highlighting, faceted
search, real-time indexing, dynamic clustering, database
integration, NoSQL features and rich document (e.g., Word,
PDF) handling. Providing distributed search and index
replication, Solr is designed for scalability and fault tolerance.
Solr is the most popular enterprise search engine; (source:
Wikipedia)
6. Quick review: Elasticsearch or Solr?
● Both are released under Apache Software License;
● Solr and ES have lively user and developer communities and are
rapidly being developed;
● If you need to customize or actively contribute stick with Solr;
● Both have good commercial support;
● Solr is still much more text-search-oriented. ES is a more natural
fit for building analytical applications that rely on complex
features like filtering and grouping;
● Elasticsearch is a bit easier to get started with and deploy;
● Fully distributed deployment is a little harder with Solr; you will
need a ZooKeeper setup;
● If you already use Solr or ES, you don't need to switch unless you
face a real motivation.
9. Core Concepts - Document
● Despite the NoSQL hype, Solr is essentially a search
engine. That's it: you feed it tons of information and
expect to retrieve it later, fast!
● The basic unit of information is called a document and a
document is composed of several fields;
● An example JSON document with 5 fields:
{
"population": 33576,
"state": "SC",
"city": "LEXINGTON",
"location": "33.972383, -81.23586",
"id": "29072"
}
10. Core Concepts - Schema
● Each document field can be digested (analyzed)
according to the user's needs;
● To accomplish this, the user can define a schema
for each kind of document expected to be indexed;
● Solr has a field type for almost anything: BinaryField,
BoolField, CollationField, CurrencyField, DateRangeField,
ExternalFileField, LatLonType, PointType,
PreAnalyzedField, SpatialRecursivePrefixTreeFieldType,
StrField, TextField, UUIDField, etc. .. ..
● Solr is very extensible. You can build your own type too.
11. Core Concepts – Dynamic Fields
● Dynamic fields give the power of convention over configuration. For
instance, any field name ending with _is will be treated as a multivalued
integer field.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_ls" type="long" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_bs" type="boolean" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_fs" type="float" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="double" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false" />
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
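The matching convention above can be sketched in a few lines of Python. This is an illustrative toy, not Solr's actual IndexSchema logic, and it only maps a handful of the suffixes listed:

```python
# Toy sketch of Solr's dynamic-field convention: map a field name's
# suffix to a (type, multiValued) pair, mimicking the rules above.
SUFFIX_TYPES = {
    "_is": ("int", True),
    "_i": ("int", False),
    "_ss": ("string", True),
    "_s": ("string", False),
    "_ls": ("long", True),
    "_l": ("long", False),
    "_dt": ("date", False),
}

def guess_field_type(name):
    # Longest suffix wins, so "_is" is checked before "_i".
    for suffix in sorted(SUFFIX_TYPES, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_TYPES[suffix]
    return None

print(guess_field_type("pubyear_i"))   # ('int', False)
print(guess_field_type("authors_ss"))  # ('string', True)
```

This is why the sample documents later in the deck can use names like pubyear_i and authors_ss without declaring any field explicitly.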
12. Core Concepts - Schemaless
● The schema is built (guessed) automatically while the
index is being filled;
● schema.xml is manipulated only through the new Schema
REST API; from now on this is a managed schema;
● Previously unseen fields are run through a cascading set
of value-based parsers, which guess the Java class of
field values - parsers for Boolean, Integer, Long, Float,
Double, and Date are currently available;
● Automatic schema field addition, based on field value
class(es): Previously unseen fields are added to the
schema, based on field value Java classes, which are
mapped to schema field types
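That cascade can be illustrated with a toy sketch (this is not Solr's actual parser chain, just the idea of trying progressively looser parsers and falling back to string):

```python
# Toy sketch of schemaless value-based type guessing: try parsers in a
# cascade (boolean, long, double, date) and fall back to string.
from datetime import datetime

def guess_type(value: str) -> str:
    if value.lower() in ("true", "false"):
        return "boolean"
    try:
        int(value)
        return "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return "date"
    except ValueError:
        return "string"

print(guess_type("42"))                    # long
print(guess_type("3.14"))                  # double
print(guess_type("2015-01-03T14:30:00Z"))  # date
print(guess_type("hello"))                 # string
```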
13. Core Concepts – Schemaless (cont.)
● The Solr distribution has an example of a
managed schema. Basically you'll need to:
– Configure the schemaFactory in solrconfig.xml;
– Define an UpdateRequestProcessorChain;
– Make that UpdateRequestProcessorChain
the default for the UpdateRequestHandler;
● You can also use schemaless mode and pre-emptively
create fields before indexing.
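A minimal sketch of those three steps in solrconfig.xml (the chain name and processor list mirror the distribution's schemaless example; double-check against the sample config shipped with your version):

```xml
<!-- 1. Use the managed (mutable) schema -->
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

<!-- 2. A guessing chain: parse values, then add unknown fields -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory"/>
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- 3. Make the chain the default for updates -->
<initParams path="/update/**">
  <lst name="defaults">
    <str name="update.chain">add-unknown-fields-to-the-schema</str>
  </lst>
</initParams>
```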
14. Core Concepts - Analyze Chain
● A field type can be quite complex, defining an
analysis chain for index time and query time;
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
15. Core Concepts - Field Properties
● A document field is declared using one of the predefined field
types. Each field carries properties that can impact index and
search performance, like the stored property, which keeps a
copy of the field's original value.
<field name="price" type="float" default="0.0" indexed="true" stored="true"/>
● There are advanced options like omitNorms: norms hold index-time
boosts and length-normalization factors used in scoring, and can
be turned off to save some memory.
16. Core Concepts - DocValues
● Since version 4 of Lucene, fields can have a special
property called docValues;
● The well-known inverted index (term-to-document) is not
suitable and does not scale well for operations like sorting,
faceting and highlighting;
● When a field enables the docValues property, the data is
stored in a column-oriented way with a document-to-value
mapping built at index time;
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
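The difference can be sketched with two tiny Python dicts: an inverted index answers "which docs contain X?", while a docValues-style column answers "what is doc N's value?", which is the shape sorting and faceting need. The data here is made up:

```python
from collections import Counter

# Inverted index: term -> matching documents (great for full-text search).
inverted = {"apple": [1, 3], "samsung": [2]}

# DocValues-style column: document -> value (great for sorting/faceting).
doc_values = {1: "apple", 2: "samsung", 3: "apple"}

# Sorting a result set by field value is a direct per-document lookup...
results = [3, 2, 1]
print(sorted(results, key=lambda d: doc_values[d]))  # [3, 1, 2]

# ...and faceting is just counting the looked-up values.
print(Counter(doc_values[d] for d in results))  # apple: 2, samsung: 1
```

Doing the same with only the inverted index would mean scanning every term's posting list to reconstruct each document's value.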
18. Query Parsers
● Solr offers great control over how to parse the user's input
query. There are 3 main parsers:
– Standard Query Parser
– Dismax Query Parser
– Extended Dismax Query Parser
● There are other specialized parsers, like spatial,
boost, MLT etc.;
● All parsers share a common set of parameters like
sort, start, rows, fq, fl, timeAllowed, wt etc.
19. Query Parsers – Standard Parser
● Also known as the Lucene parser;
● Exposes all Lucene features, allowing you to build
complex queries;
● But it's very intolerant of syntax errors;
20. Query Parsers - Dismax
● Designed to process simple phrases entered by users and to search
for individual terms across several fields using different weighting
based on the significance of each field;
● It's very lenient. Rarely produces an error;
● Uses a simplified subset of the Lucene Query Parser;
● Users can use quotes to group phrases, +/- to mark mandatory or
optional clauses, and AND/OR operators. Everything else is escaped to
simplify the user experience;
● DisMax stands for Maximum Disjunction: A query that generates the
union of documents produced by its subqueries, and that scores
each document with the maximum score for that document as
produced by any subquery, plus a tie breaking increment for any
additional matching subqueries.
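That "maximum disjunction" score can be sketched in a few lines (tie is the tie-breaker parameter; the scores are made-up numbers for a term matched across three fields):

```python
def dismax_score(subquery_scores, tie=0.1):
    # Score = best subquery score + tie * (sum of the other matching scores).
    best = max(subquery_scores)
    return best + tie * (sum(subquery_scores) - best)

# A term matching in three fields with different per-field scores:
print(dismax_score([2.0, 0.5, 1.0], tie=0.1))  # 2.0 + 0.1 * 1.5
```

With tie=0 only the best field counts; with tie=1 it degenerates into a plain sum over all matching fields.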
21. Query Parsers - eDismax
● Supports the full Lucene query parser syntax;
● Supports queries such as AND, OR, NOT, -, and +;
● Treats "and" and "or" as "AND" and "OR" in Lucene syntax
mode;
● Supports pure negative nested queries: e.g., +foo (-foo) will
match all documents;
● Lets you specify which fields the end user is allowed to query,
and to disallow direct fielded searches.
23. Faceting
● Faceted search, also called faceted navigation or faceted
browsing, is a technique for accessing information organized
according to a faceted classification system, allowing users to
explore a collection of information by applying multiple filters. A
faceted classification system classifies each information
element along multiple explicit dimensions, called facets,
enabling the classifications to be accessed and ordered in
multiple ways rather than in a single, pre-determined,
taxonomic order. (source: Wikipedia)
25. Faceting
● Can be done by Field-Value. Very common
when the field holds some kind of
categorization or tagging values;
● By range on any date or numeric field;
● By query to define your custom facets.
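All three facet styles can ride on a single request. Here's a hedged sketch that just builds the URL with Python's urllib; the collection and field names are made up, but the facet parameter names are Solr's:

```python
from urllib.parse import urlencode

# Field-value, range and query facets combined in one request.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "cat_s"),             # field-value facet
    ("facet.range", "pubyear_i"),         # range facet on a numeric field
    ("f.pubyear_i.facet.range.start", "2000"),
    ("f.pubyear_i.facet.range.end", "2020"),
    ("f.pubyear_i.facet.range.gap", "5"),
    ("facet.query", "stars_i:[4 TO 5]"),  # custom query facet
]
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(url)
```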
28. Nested Objects
● Nested objects appear frequently in relational databases;
● Historically, search engines only indexed flat data, with no
hierarchy at all;
● Every major DBMS today has some sort of textual
search, but it's far from ideal in some scenarios;
● If you truly need a full-text search engine, I'm afraid
you'll have to maintain another moving part in your
architecture.
29. Nested Documents
● A JSON document example with nested objects:
[{
"id": "book1",
"title_s": "The Way of Kings",
"authors_ss": ["Brandon Lee"],
"cat_s": "fantasy",
"pubyear_i": 2010,
"publisher_s": "Tor",
"reviews": [{
"pubdate_dt": "2015-01-03T14:30:00Z",
"stars_i": 5,
"author_s": "Robert Youh",
"comment_s": "A great start to what looks like an epic series!"
}, {
"pubdate_dt": "2014-03-15T12:00:00Z",
"stars_i": 3,
"author_s": "Daniel K",
"comment_s": "This book was too long."
}]
}]
30. Let's Index
curl -X POST -H "Content-Type: application/json" -d '[{
"id": "book1",
"title": "The Way of Kings",
"authors": ["Brandon Lee"],
"cat": "fantasy",
"pubyear": 2010,
"publisher": "Tor",
"reviews": [{
"pubdate": "2015-01-03T14:30:00Z",
"stars": 5,
"author": "Robert Youh",
"comment": "A great start to what looks like an epic series!"
}, {
"pubdate": "2014-03-15T12:00:00Z",
"stars": 3,
"author": "Daniel K",
"comment": "This book was too long."
}]
}]' 'http://localhost:8983/solr/books/update'
31. Oops!
{
"responseHeader": {
"status": 400,
"QTime": 1
},
"error": {
"msg": "Error parsing JSON field value. Unexpected OBJECT_START at [150],
field=reviews",
"code": 400
}
}
● Sadly, Solr/Lucene can't understand nested objects directly
from the document hierarchy.
32. Oops! Solutions
● Assuming you need all of Lucene's power, you can:
– Denormalize your data before indexing. This can be quite
problematic, resulting in duplicated content when you join
all tables/collections, and of course may not scale well;
– Index each entity into a separate index and do your own
join at the application level. This does not scale well, adds
application logic overhead and hurts overall relevance
when you split the data into uncorrelated indices;
– Use Lucene's join features, already integrated with Solr,
with some limitations.
33. Solr Joins
● Solr nested objects are implemented using Lucene's
Block Join feature;
● Block Join and (query-time) Join are different
beasts;
● Block Join arranges children and parents contiguously
in the index and requires extra information at
query time;
● Query-time Join does not rely on any special
arrangement at the index level.
34. Block Joins (index-time-join)
● Documents need to be converted to a special syntax:
[{
"id": "book1",
"title_s": "The Way of Kings",
"authors_ss": ["Brandon Lee"],
"cat_s": "fantasy",
"pubyear_i": 2010,
"publisher_s": "Tor",
"type_s": "book",
"_childDocuments_": [{
"id": "book1_c1",
"type_s": "reviews",
"pubdate_dt": "2015-01-03T14:30:00Z",
"stars_i": 5,
"author_s": "Robert Youh",
"comment_t": "A great start to what looks like an epic series!"
}, {
"id": "book1_c2",
"type_s": "reviews",
"pubdate_dt": "2014-03-15T12:00:00Z",
"stars_i": 3,
"author_s": "Daniel K",
"comment_t": "This book was too long."
}]
}]
35. Querying Nested Documents
● Querying books with a 3-star review using the Block Join Parent
Query Parser:
curl -X GET 'http://localhost:8983/solr/books/select?q={!parent which="type_s:book"}stars_i:3'
● Querying all comments of fantasy books using the Block
Join Children Query Parser:
curl -X GET 'http://localhost:8983/solr/books/select?q={!child of="type_s:book"}cat_s:fantasy'
36. Join - JoinQueryParser
● Allows normalizing relationships between documents with a
join operation. This is different from the concept of a join in a
relational database because no information is truly being
joined. An appropriate SQL analogy would be an "inner query";
● Suppose you have two indices (entities) named movies (id,
title, director_id) and movie_directors (id, name,
has_oscar);
fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
37. Join x BlockJoin
● Join:
– Does not rely on any index arrangement beforehand;
– Can be much slower, but doesn't require extra disk space;
– The documents stay flattened, no hierarchy, as usual;
– The joined fields should use compatible types.
● BlockJoin:
– It's faster, but parent and child documents must be
updated as a block; no partial updates (including deletions);
– Uses a lot of extra disk space, as each child and parent
counts as a separate document;
– The different types must live in the same index.
39. Clustering Results
● The clustering plugin attempts to automatically discover groups
of related search hits (documents) and assign human-readable
labels to these groups;
● Think of it as a kind of unsupervised faceting;
● It's computed online, using the query results;
● Useful to explore the data and discover facets dynamically;
● For simple queries, the clustering time will usually dominate the
fetch time. If the document content is very long the retrieval of
stored content can become a bottleneck;
● Each cluster result has a label, a score and the documents
that fall into this cluster (facet).
40. World Bank Projects Dataset
● Generating clusters from data of projects funded by the
World Bank;
{
"region_name_s": "East Asia and Pacific",
"project_abstract_t": "The development objective of the Second Power Transmission Development Project for
Indonesia is to meet growing electricity demand and increase access to electricity in the project area through
strengthening and expanding the capacity of the power transmission networks in the project area in a sustainable
manner. The project has single component with following two parts: first part is extension and rehabilitation of
selected existing 150-20 Kilovolt (kV) substations and 70-20 kV substations in the project area, including adding one
or more new transformers and associated equipment; and or replacing existing transformers with new transformers
and associated equipment with higher capacity; and second part is construction of selected new 150-20 kV
substations in the project area, including installation of transformers and associated equipment.",
"project_name_s": "Indonesia Second Power Transmission Development Project",
"country_code_s": "ID",
"country_name_s": "Republic of Indonesia",
"source_s": "IBRD",
"total_amt_i": 325000000,
"status_s": "Active",
"id": "P123994",
}
41. World Bank Project Dataset (cont.)
● Facets based on the fields project_name_s and project_abstract_t
using the first 100 documents. Nothing tweaked, and we already get
some good results. (Note: the output is from Python code.)
[([u'Additional Financing'], 54.23808447032144),
([u'Program'], 31.94900144597348),
([u'Agricultural'], 29.989495158991218),
([u'Development Policy'], 46.552690339024544),
([u'Improvement Project'], 51.00742730020363),
([u'Management'], 27.078107046125666),
([u'Education'], 21.679421779033508),
([u'National'], 22.636815589946526),
([u'Regional'], 21.280823272486778),
([u'DPL'], 14.733639303266868),
([u'Development Policy Operation'], 29.228788836371688),
([u'Ecosystem'], 24.59586749086255),
([u'Implementation'], 15.251178555217848),
([u'Industries Transparency Initiative'], 35.386053776359034) ...
43. Okapi BM25
● BM25 is a competitor to the classic TF/IDF vector space
model, which is currently the default scoring model for Solr;
● It is a probabilistic relevance model, considered the state
of the art for information retrieval;
● Both use term frequency, inverse document frequency, and
field-length normalization, but the definition of each of these
factors is a little different;
● Lucene 6 will use BM25 as the default score model;
● In Solr, users can define fields with different scoring models;
● Understanding the math differences is out of the scope of this
presentation.
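Still, the core BM25 term weight fits in a few lines. The sketch below uses the usual k1 and b free parameters and a common IDF form; Lucene's exact formulation differs in details:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Saturating term frequency, normalized by field length:
    # repeating a term gives diminishing returns, unlike raw TF.
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# A term occurring 3 times in an average-length doc, present in 10 of 1000 docs:
print(round(bm25_term_score(3, 100, 100, 1000, 10), 3))
```

The saturation is the practical difference from TF/IDF: the 50th occurrence of a term barely moves the score, and longer fields are penalized more smoothly.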
45. What is a Spatial Query?
● A spatial query is a special type of database
query supported by geodatabases and spatial
databases. The queries differ from non-spatial
SQL queries in several important ways. Two of
the most important are that they allow for the
use of geometry data types such as points,
lines and polygons and that these queries
consider the spatial relationship between
these geometries. (source: Wikipedia)
46. Solr Spatial Features
● Solr can index location data for spatial or
geospatial queries;
● Index points or any other shapes;
● Filtering results by bounding box, circle or by
other shapes;
● Sort or boost scoring by distance between
points, or relative area between rectangles;
● Heatmaps / Spatial Grid Faceting
● Standards: GeoJSON, WKT
47. Spatial Field Types
● LatLonType: Index only points;
● SpatialRecursivePrefixTreeFieldType (RPT): Index
other shapes like circle or polygon;
● BBOXField: Bounding box specialized;
● Example of configuration in schema.xml:
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
geo="true" distErrPct="0.025" maxDistErr="0.005" distanceUnits="kilometers" />
48. LatLonType
● Only points can be indexed;
● Can be queried by circle or bounding-box;
● Better for distance sorting or boosting;
<field name="store">45.17614,-93.87341</field>
<field name="store">40.7143,-74.006</field>
<field name="store">37.7752,-122.4232</field>
49. SpatialRecursivePrefixTreeFieldType
● Query by polygons and other complex shapes, in
addition to lat-long circles & bounding-boxes;
● Configurable precision which can vary per shape at query
time;
● Index non-point shapes as well as point shapes;
● Multi-valued field, useful for geocoding;
● Well-Known-Text (WKT) shape syntax for indexing too;
<field name="store">45.17614,-93.87341</field>
<field name="geo">-74.093 41.042 -69.347 44.558</field>
<field name="geo">Circle(4.56,1.23 d=0.0710)</field>
<field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
50. BBoxField
● Only indexes bounding-boxes;
● Queried by another bounding-box;
● Predicates: Intersects, Within, Contains, Disjoint, Equals;
● Supports relevancy sort/boost like overlapRatio or simply
the area.
52. GeoFilt
● Retrieve results based on the geospatial distance from a given
point;
&q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
53. BBox
● Like geofilt, but computes the bounding box of the circle;
● The rectangle is faster to compute, hence this filter is useful
when it's acceptable to return results outside the radius.
&q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
54. Distance Function Queries
● Geodist: Geodetic points distance;
● Dist: Distance between multi-dimensional
vectors;
● Hsin: Distance between two points on a
sphere;
● Sqedist: Squared Euclidean distance between
points;
● Remember that Solr can sort or boost by any
function query;
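What geodist (and hsin) computes is the great-circle distance, which the haversine formula gives directly. A minimal sketch, reusing two of the example store points from earlier slides:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance between two lat/lon points, in kilometers.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Distance between the first two example store points:
print(round(haversine_km(45.17614, -93.87341, 40.7143, -74.006), 1))
```

Sorting by geodist in Solr is this computation applied per document, which is why LatLonType (point-only) is the recommended type when distance sorting dominates.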
56. Spatial Clustering
● Suppose you have a lot of points and want to query to
get results to plot on a map;
● You can just scroll through the results with the rows
parameter and use your map SDK to render them;
● But since version 5.1 Solr can facet on these points too;
● One way to look at this problem is grid based heatmap
clustering. All points in a grid square get counted to give a
grid square a numeric value, and those values correspond
to a color scale.
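The grid idea boils down to bucketing each point into a cell and counting. A toy sketch (the cell size is arbitrary here; Solr's heatmap faceting picks grid levels from the prefix tree):

```python
from collections import Counter

def grid_counts(points, cell_deg=1.0):
    # Bucket (lat, lon) points into square grid cells and count per cell;
    # each count maps to a color intensity in the heatmap.
    return Counter(
        (int(lat // cell_deg), int(lon // cell_deg)) for lat, lon in points
    )

pts = [(45.1, -93.8), (45.6, -93.2), (40.7, -74.0), (45.9, -93.4)]
counts = grid_counts(pts)
print(counts[(45, -94)])  # three of the points fall in the same 1-degree cell
```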
59. SolrCloud
● Solr is no longer distributed as a WAR application. It is a full-fledged
server ready for deployment.
● Solr can be executed in two modes:
– Standalone Server
– SolrCloud
● Standalone mode resembles a multi-core setup running in a single
servlet container;
● SolrCloud delivers all the goodness of a Solr cluster, combining fault
tolerance and high availability;
● Now it's much easier to deploy and treat Solr as a backend
dependency in your infrastructure;
● A lot of improvements to the startup scripts and command-line tools
have been made to simplify the user experience.
60. Solr Cluster
● A cluster is a set of Solr nodes managed by ZooKeeper
as a single unit. When you have a cluster, you can
always make requests to the cluster and if the request
is acknowledged, you can be sure that it will be
managed as a unit and be durable, i.e., you won't lose
data. Updates can be seen right after they are made
and the cluster can be expanded or contracted.
61. Sharding & Replication
● When your data is too large for one node, you can break it up and
store it in sections by creating one or more shards;
● A shard is a way of splitting a core over a number of "servers", or
nodes. For example, you might have a shard for data that represents
each state, or different categories that are likely to be searched
independently, but are often combined;
● Each shard has a replica set to achieve robustness;
● Shards and replicas are orchestrated by ZooKeeper, which provides
load balancing and failover;
● In SolrCloud there are no masters or slaves. Instead, there are
leaders and replicas.
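The document-routing idea behind sharding can be sketched as hashing the document id into one of N shards. This is an illustration only: Solr's actual compositeId router uses a MurmurHash over the id (or an id prefix for co-locating related docs), not MD5:

```python
import hashlib

def pick_shard(doc_id: str, num_shards: int) -> int:
    # Stable hash of the id -> shard number, so the same document
    # always routes to the same shard.
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

for doc in ("book1", "book2", "book3"):
    print(doc, "-> shard", pick_shard(doc, 2))
```

Because the mapping is deterministic, any node (or the client) can route an update or a realtime-get straight to the owning shard's leader.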
63. Scripts Improved
● bin/solr
$ solr
Usage: solr COMMAND OPTIONS
where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection,
delete, version
Standalone server example (start Solr running in the background on port 8984):
./solr start -p 8984
SolrCloud example (start Solr running in SolrCloud mode using localhost:2181 to connect to ZooKeeper, with
1g max heap size and remote Java debug options enabled):
./solr start -c -m 1g -z localhost:2181 -a "-Xdebug
-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"
Pass -help after any COMMAND to see command-specific usage information,
such as: ./solr start -help or ./solr stop -help
64. Quick start
$ solr start
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=20034). Happy searching!
$ solr status
Found 1 Solr nodes:
Solr process 20034 running on port 8983
{
"solr_home":"/home/lsouza/solr/latest/server/solr",
"version":"5.4.0 1718046 - upayavira - 2015-12-04 23:16:46",
"startTime":"2015-11-22T22:05:57.659Z",
"uptime":"0 days, 0 hours, 0 minutes, 11 seconds",
"memory":"92.1 MB (%18.8) of 490.7 MB"}
$ solr stop
Sending stop command to Solr running on port 8983 ... waiting 5 seconds to allow Jetty process 20034 to stop
gracefully.