17 December 2015
Leonardo Souza
lsouza@amtera.com.br
Setting Expectations
● This presentation assumes the reader has been
familiar with the Solr/Lucene technology for a
while;
● The goal is to update the overall knowledge
around Solr and its features;
● This presentation is not an exhaustive list of all
Solr features or capabilities.
Solr 5
● Refreshing memories:
● Solr (pronounced "solar") is an open source enterprise search
platform, written in Java, from the Apache Lucene project. Its
major features include full-text search, hit highlighting, faceted
search, real-time indexing, dynamic clustering, database
integration, NoSQL features and rich document (e.g., Word,
PDF) handling. Providing distributed search and index
replication, Solr is designed for scalability and fault tolerance.
Solr is the most popular enterprise search engine; (source:
Wikipedia)
Google Trends
Quick review: Elasticsearch or Solr?
● Both are released under the Apache Software License;
● Solr and ES have lively user and developer communities and are
being developed rapidly;
● If you need to customize or actively contribute, stick with Solr;
● Both have good commercial support;
● Solr is still much more text-search-oriented. ES is a more natural
fit when it comes to building analytical applications that rely on
complex features like filtering and grouping;
● Elasticsearch is a bit easier to get started with and deploy;
● Fully distributed deployment is a little harder with Solr: you will need
a ZooKeeper setup;
● If you already use Solr or ES you don't need to change unless you
have a real motivation.
Solr - Agenda
● Core Concepts;
● Query Parsers;
● Faceting;
● Nested Documents;
● Clustering results;
● Okapi BM25;
● Spatial Search;
● SolrCloud.
Solr Core Concepts
Core Concepts - Document
● Despite the NoSQL hype, Solr is essentially a search
engine. That's it: you feed it tons of information and
expect to retrieve it later, fast!
● The basic unit of information is called a document, and a
document is composed of several fields;
● A JSON document with 5 fields:
{
"population": 33576,
"state": "SC",
"city": "LEXINGTON",
"location": "33.972383, -81.23586",
"id": "29072"
}
Core Concepts - Schema
● Each document field can be digested (analyzed)
according to the user's needs;
● To accomplish this, the user can define a schema
for each kind of document expected to be indexed;
● Solr has a field type for almost anything: BinaryField,
BoolField, CollationField, CurrencyField, DateRangeField,
ExternalFileField, LatLonType, PointType,
PreAnalyzedField, SpatialRecursivePrefixTreeFieldType,
StrField, TextField, UUIDField, etc.;
● Solr is very extensible. You can build your own type too.
Core Concepts – Dynamic Fields
● Dynamic fields give you the power of convention over configuration. For
instance, any field name ending with _is will be treated as a multivalued
integer field.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true" />
<dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_ls" type="long" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_bs" type="boolean" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_fs" type="float" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_ds" type="double" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false" />
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
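The naming convention can be illustrated with a small Python sketch. It is not Solr code: it copies a handful of the suffixes declared above and assumes that, if several suffixes match, the longest one wins. The function name resolve_dynamic_field is made up for this illustration.

```python
# A subset of the dynamicField declarations above, as suffix -> (type, multiValued).
DYNAMIC_FIELDS = {
    "_i": ("int", False),
    "_is": ("int", True),
    "_s": ("string", False),
    "_ss": ("string", True),
    "_t": ("text_general", False),
    "_dt": ("date", False),
}


def resolve_dynamic_field(name):
    """Return (type, multi_valued) for the longest matching suffix, or None."""
    matches = [suffix for suffix in DYNAMIC_FIELDS if name.endswith(suffix)]
    if not matches:
        return None
    # Assumption of this sketch: the longest matching suffix takes precedence.
    return DYNAMIC_FIELDS[max(matches, key=len)]


for field in ("pubyear_i", "authors_ss", "pubdate_dt", "title"):
    print(field, "->", resolve_dynamic_field(field))
```

A name such as title matches no suffix, so it would need an explicit field declaration (or schemaless mode).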
Core Concepts - Schemaless
● The schema is built (guessed) automatically while the
index is being filled;
● Schema.xml is manipulated only by the new Schema
REST API. From now on this is a Managed Schema;
● Previously unseen fields are run through a cascading set
of value-based parsers, which guess the Java class of
field values - parsers for Boolean, Integer, Long, Float,
Double, and Date are currently available;
● Automatic schema field addition, based on field value
class(es): Previously unseen fields are added to the
schema, based on field value Java classes, which are
mapped to schema field types
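The cascade of value-based parsers can be pictured with the Python sketch below. It only imitates the idea (try the strictest parser first, fall back to the next one); it is not how Solr implements it, it collapses Float and Double into one case, and the int/long boundary and the single date format are assumptions of this example.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a type name for a raw string value by trying parsers in order."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "boolean"
    try:
        number = int(text)
        # Assumed rule: values outside the signed 32-bit range become "long".
        return "int" if -2**31 <= number < 2**31 else "long"
    except ValueError:
        pass
    try:
        float(text)
        return "double"
    except ValueError:
        pass
    try:
        # Only one ISO-8601 layout is tried here, for brevity.
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
        return "date"
    except ValueError:
        pass
    return "string"


for raw in ("true", "33576", "3000000000", "33.97", "2015-01-03T14:30:00Z", "LEXINGTON"):
    print(raw, "->", guess_field_type(raw))
```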
Core Concepts – Schemaless (cont.)
● The Solr distribution has an example of a
managed schema. Basically you'll need to:
– Configure the schemaFactory in solrconfig.xml;
– Define an UpdateRequestProcessorChain;
– Make the UpdateRequestProcessorChain
the default for the UpdateRequestHandler;
● You can also use schemaless mode and pre-emptively
create fields before indexing.
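Creating a field ahead of time goes through the Schema REST API. The Python sketch below only builds the HTTP request for an add-field command and does not send it. The collection name books, the field name price and the helper name build_add_field_request are illustrative assumptions.

```python
import json
import urllib.request


def build_add_field_request(base_url, collection, name, field_type, stored=True):
    """Build (but do not send) a Schema API request that adds one field."""
    payload = {"add-field": {"name": name, "type": field_type, "stored": stored}}
    return urllib.request.Request(
        url=f"{base_url}/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_add_field_request("http://localhost:8983/solr", "books", "price", "float")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it against a running Solr: urllib.request.urlopen(request)
```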
Core Concepts - Analyze Chain
● A field type can be quite complex and defines an
analysis chain that is applied at index time and at query time;
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
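To see what such a chain does to a piece of text, here is a toy Python imitation of three of its stages (whitespace tokenizer, stop filter, lowercase filter). It is only a sketch: the stop-word set is a made-up stand-in for lang/stopwords_en.txt, and the word delimiter, keyword marker and stemming filters are left out.

```python
# Hypothetical stand-in for the contents of lang/stopwords_en.txt.
STOPWORDS = {"a", "an", "the", "to", "of"}


def whitespace_tokenize(text):
    """Split on whitespace only, like a whitespace tokenizer."""
    return text.split()


def stop_filter(tokens, stopwords=STOPWORDS):
    """Drop stop words, comparing case-insensitively (ignoreCase="true")."""
    return [token for token in tokens if token.lower() not in stopwords]


def lowercase_filter(tokens):
    """Lower-case every remaining token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the tokenizer, then each filter in declaration order."""
    tokens = whitespace_tokenize(text)
    for token_filter in (stop_filter, lowercase_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Way of Kings"))
```

Because the same kind of chain runs on the query side, a user typing "way of KINGS" ends up with the same tokens as the indexed title.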
Core Concepts - Field Properties
● A document field is declared using one of the predefined field
types. Each field carries some properties that can impact
index and search performance, like the stored property,
which keeps a verbatim copy of the field value.
<field name="price" type="float" default="0.0" indexed="true" stored="true"/>
● There are advanced options like omitNorms. Norms hold length
normalization and index-time boost data, and setting
omitNorms="true" drops them to save some memory.
Core Concepts - DocValues
● Since version 4 of Lucene, fields can have a special
property called DocValues;
● The well-known inverted index (term-to-document) is not
suitable and does not scale well for operations like sorting,
faceting and highlighting;
● When a field enables the docValues property, the data is
stored in a column-oriented way with a document-to-value
mapping built at index time;
<field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
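The difference between the two layouts can be sketched in Python with plain dictionaries and lists. This is a conceptual model only, not Lucene's on-disk format. The three documents and the value "Ace" are made up for the example.

```python
from collections import defaultdict

# Three tiny documents with one field each (values are illustrative).
docs = {
    0: {"manu_exact": "Tor"},
    1: {"manu_exact": "Ace"},
    2: {"manu_exact": "Tor"},
}

# Inverted index: term -> document ids. Good for "which documents contain X?".
inverted = defaultdict(list)
for doc_id, doc in docs.items():
    inverted[doc["manu_exact"]].append(doc_id)

# Column-oriented view: document id -> value. Good for sorting and faceting.
column = [docs[doc_id]["manu_exact"] for doc_id in sorted(docs)]


def facet_counts(hit_ids):
    """Count field values over a result set with one lookup per hit."""
    counts = defaultdict(int)
    for doc_id in hit_ids:
        counts[column[doc_id]] += 1
    return dict(counts)


print(inverted["Tor"])
print(facet_counts([0, 1, 2]))
print(sorted([2, 1, 0], key=lambda doc_id: column[doc_id]))
```

With only the inverted index, answering "what is the value for document 2?" would mean scanning terms; the column gives it directly.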
Query Parsers
Query Parsers
● Solr offers great control over how to parse the user's input
query. There are 3 main parsers:
– Standard Query Parser
– Dismax Query Parser
– Extended Dismax Query Parser
● There are other specialized parsers, like spatial,
boost, MLT etc;
● All parsers share a common set of parameters like
sort, start, rows, fq, fl, timeAllowed, wt etc.
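A short Python sketch of how these common parameters end up in a request URL. The host, the collection name collection_name and the field names (taken from the document example earlier) are assumptions; the helper build_select_url is hypothetical and only assembles the URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, collection, params):
    """Assemble a /select URL with properly URL-encoded parameters."""
    return f"{base_url}/{collection}/select?{urlencode(params, doseq=True)}"


params = {
    "q": "city:LEXINGTON",       # main query
    "fq": "state:SC",            # filter query, does not affect scoring
    "fl": "id,city,population",  # fields to return
    "sort": "population desc",
    "start": 0,                  # paging offset
    "rows": 10,                  # page size
    "timeAllowed": 1000,         # milliseconds
    "wt": "json",                # response format
}
url = build_select_url("http://localhost:8983/solr", "collection_name", params)
decoded = parse_qs(urlsplit(url).query)
print(url)
```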
Query Parsers – Standard Parser
● Also known as the Lucene parser;
● Exposes all Lucene features, allowing you to build
complex queries;
● But it's very intolerant of syntax errors;
Query Parsers - Dismax
● Designed to process simple phrases entered by users and to search
for individual terms across several fields using different weighting
based on the significance of each field;
● It's very lenient. Rarely produces an error;
● Uses a simplified subset of the Lucene Query Parser;
● Users can use quotes to group phrases, + to mark a clause as
mandatory and - to mark it as prohibited, plus AND/OR operators.
Everything else is escaped to simplify the user's experience;
● DisMax stands for Maximum Disjunction: A query that generates the
union of documents produced by its subqueries, and that scores
each document with the maximum score for that document as
produced by any subquery, plus a tie breaking increment for any
additional matching subqueries.
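The scoring rule in the last bullet fits in a few lines of Python. This is a sketch of the formula only (maximum subquery score plus a tie-breaker times the sum of the others). The function name and the sample scores are made up.

```python
def dismax_score(subquery_scores, tie=0.0):
    """Maximum subquery score plus tie * (sum of the other subquery scores)."""
    if not subquery_scores:
        return 0.0
    best = max(subquery_scores)
    return best + tie * (sum(subquery_scores) - best)


# Hypothetical scores of one document against three fields (subqueries).
scores = [0.8, 0.5, 0.1]
print(dismax_score(scores, tie=0.0))  # pure "max" behaviour
print(dismax_score(scores, tie=0.1))  # other matches add a small increment
print(dismax_score(scores, tie=1.0))  # degenerates into a plain sum
```

With tie=0.0 only the best-matching field counts; raising it rewards documents that match in several fields.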
Query Parsers - eDismax
● Supports the full Lucene query parser syntax;
● Supports operators such as AND, OR, NOT, -, and +;
● Treats "and" and "or" as "AND" and "OR" in Lucene syntax
mode;
● Supports pure negative nested queries: a query such as +foo (-foo)
will match all documents;
● Lets you specify which fields the end user is allowed to query,
and to disallow direct fielded searches.
Faceting
Faceting
● Faceted search, also called faceted navigation or faceted
browsing, is a technique for accessing information organized
according to a faceted classification system, allowing users to
explore a collection of information by applying multiple filters. A
faceted classification system classifies each information
element along multiple explicit dimensions, called facets,
enabling the classifications to be accessed and ordered in
multiple ways rather than in a single, pre-determined,
taxonomic order. (source: Wikipedia)
Faceting – Explained Visually
Faceting
● Can be done by Field-Value. Very common
when the field holds some kind of
categorization or tagging values;
● By range on any date or numeric field;
● By query to define your custom facets.
Faceting Example
● Schema.xml
<field name="category" type="string" />
<field name="manufacturer" type="string" />
● Faceting on fields:
http://localhost:8983/solr/collection_name/select?q=*:*&
facet=true&
facet.field=category&
facet.field=manufacturer
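In the JSON response, each faceted field typically comes back as a flat list that alternates value and count. The Python sketch below reshapes such a list into pairs. The response fragment is invented for illustration (values and counts are not real data), and the helper name parse_facet_field is hypothetical.

```python
def parse_facet_field(flat):
    """Turn ["value1", count1, "value2", count2, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Made-up response fragment for the two fields faceted above.
response = {
    "facet_counts": {
        "facet_fields": {
            "category": ["electronics", 12, "books", 7],
            "manufacturer": ["acme", 9, "globex", 4],
        }
    }
}

facets = {
    field: parse_facet_field(flat)
    for field, flat in response["facet_counts"]["facet_fields"].items()
}
for field, buckets in facets.items():
    print(field, buckets)
```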
Nested Documents
Nested Objects
● Nested objects appear frequently in relational databases;
● Historically, search engines have only indexed flat data, with no
hierarchy at all;
● Every major DBMS today has some sort of textual
search, but it's far from ideal in some scenarios;
● If you truly need a full-text search engine, I am afraid
you'll have to maintain another moving part in your
architecture.
Nested Documents
● A JSON document example with nested objects:
[{
"id": "book1",
"title_s": "The Way of Kings",
"authors_ss": ["Brandon Sanderson"],
"cat_s": "fantasy",
"pubyear_i": 2010,
"publisher_s": "Tor",
"reviews": [{
"pubdate_dt": "2015-01-03T14:30:00Z",
"stars_i": 5,
"author_s": "Robert Youh",
"comment_s": "A great start to what looks like an epic series!"
}, {
"pubdate_dt": "2014-03-15T12:00:00Z",
"stars_i": 3,
"author_s": "Daniel K",
"comment_s": "This book was too long."
}]
}]
Let's Index
curl -X POST -H "Content-Type: application/json" -d '[{
"id": "book1",
"title": "The Way of Kings",
"authors": ["Brandon Sanderson"],
"cat": "fantasy",
"pubyear": 2010,
"publisher": "Tor",
"reviews": [{
"pubdate": "2015-01-03T14:30:00Z",
"stars": 5,
"author": "Robert Youh",
"comment": "A great start to what looks like an epic series!"
}, {
"pubdate": "2014-03-15T12:00:00Z",
"stars": 3,
"author": "Daniel K",
"comment": "This book was too long."
}]
}]' 'http://localhost:8983/solr/books/update'
Oops!
{
"responseHeader": {
"status": 400,
"QTime": 1
},
"error": {
"msg": "Error parsing JSON field value. Unexpected OBJECT_START at [150], field=reviews",
"code": 400
}
}
● Sadly Solr/Lucene can't understand nested objects directly
from the document hierarchy.
Oops! Solutions
● Assuming you need all of Lucene's power, you can:
– Denormalize your data before indexing. This can be quite
problematic, resulting in duplicated content when you join
all tables/collections, and of course may not scale well;
– Index all entities into separate indices and do your
own join at the application level. This does not scale well, adds
application logic overhead and affects overall relevance
when you split the data into uncorrelated indices;
– Use Lucene's join features, already integrated with Solr,
with some limitations;
Solr Joins
● Solr nested objects are implemented using Lucene's
Block Join feature;
● Block Join and (query-time) Join are different
beasts;
● Block Join arranges children and parent contiguously
in the index and depends on more information at
query time;
● Query-time Join does not rely on any special
arrangement at the index level.
Block Joins (index-time-join)
● Documents need to be converted to a special syntax:
[{
"id": "book1",
"title_s": "The Way of Kings",
"authors_ss": ["Brandon Sanderson"],
"cat_s": "fantasy",
"pubyear_i": 2010,
"publisher_s": "Tor",
"type_s": "book",
"_childDocuments_": [{
"id": "book1_c1",
"type_s": "reviews",
"pubdate_dt": "2015-01-03T14:30:00Z",
"stars_i": 5,
"author_s": "Robert Youh",
"comment_t": "A great start to what looks like an epic series!"
}, {
"id": "book1_c2",
"type_s": "reviews",
"pubdate_dt": "2014-03-15T12:00:00Z",
"stars_i": 3,
"author_s": "Daniel K",
"comment_t": "This book was too long."
}]
}]
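The conversion from a naturally nested document to the block join syntax is mechanical, so it can be scripted. The Python sketch below is one possible way to do it. The helper name to_block_join, the type_s discriminator field and the parentid_cN child id scheme follow the example above but are otherwise assumptions of this sketch.

```python
import json


def to_block_join(doc, child_key, parent_type, child_type):
    """Move doc[child_key] into _childDocuments_ and tag parent and children."""
    parent = {key: value for key, value in doc.items() if key != child_key}
    parent["type_s"] = parent_type
    children = []
    for position, child in enumerate(doc.get(child_key, []), start=1):
        child_doc = dict(child)
        # Every child is a document of its own, so it needs a unique id.
        child_doc["id"] = f"{doc['id']}_c{position}"
        child_doc["type_s"] = child_type
        children.append(child_doc)
    parent["_childDocuments_"] = children
    return parent


nested = {
    "id": "book1",
    "title_s": "The Way of Kings",
    "reviews": [
        {"stars_i": 5, "author_s": "Robert Youh"},
        {"stars_i": 3, "author_s": "Daniel K"},
    ],
}
block = to_block_join(nested, "reviews", "book", "reviews")
print(json.dumps([block], indent=2))
```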
Querying Nested Documents
● Querying books that have a 3-star review using the Block Join Parent
Query Parser:
curl -G 'http://localhost:8983/solr/books/select' --data-urlencode 'q={!parent which="type_s:book"}stars_i:3'
● Querying all reviews of fantasy books using the Block
Join Children Query Parser:
curl -G 'http://localhost:8983/solr/books/select' --data-urlencode 'q={!child of="type_s:book"}cat_s:fantasy'
Join - JoinQueryParser
● Allows normalizing relationships between documents with a
join operation. This is different from the concept of a join in a
relational database because no information is being truly
joined. An appropriate SQL analogy would be an "inner query";
● Suppose you have two indices (entities) named movies(id,
title, director_id) and movie_directors(id, name,
has_oscar):
fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
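The "inner query" behaviour can be simulated in plain Python over two lists of dictionaries. This mimics the semantics only, not how Solr executes the join. The sample movies and directors are invented, and join_filter is a hypothetical helper.

```python
movies = [
    {"id": "m1", "title": "Movie A", "director_id": "d1"},
    {"id": "m2", "title": "Movie B", "director_id": "d2"},
    {"id": "m3", "title": "Movie C", "director_id": "d1"},
]
movie_directors = [
    {"id": "d1", "name": "Director X", "has_oscar": True},
    {"id": "d2", "name": "Director Y", "has_oscar": False},
]


def join_filter(to_docs, from_docs, from_field, to_field, predicate):
    """Filter to_docs by the keys produced by an inner query on from_docs."""
    # Step 1: run the inner query on the "from" side and collect its join keys.
    keys = {doc[from_field] for doc in from_docs if predicate(doc)}
    # Step 2: keep the "to" documents whose field matches one of those keys.
    return [doc for doc in to_docs if doc[to_field] in keys]


winners = join_filter(
    movies, movie_directors, "id", "director_id", lambda doc: doc["has_oscar"]
)
print([doc["title"] for doc in winners])
```

The returned movies carry no director fields: the join only filters, which is why nothing is "truly joined".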
Join x BlockJoin
● Join:
– Does not rely on any index arrangement beforehand;
– Can be much slower but doesn't require extra disk space;
– The documents should be flattened, with no hierarchy, as usual;
– The joined fields should use compatible types.
● BlockJoin:
– It's faster, but parent and children documents must be
updated as a block, with no partial updates (including deletions);
– Uses a lot of extra disk space, as each child and parent
counts as a separate document;
– Different types should live in the same index.
Clustering Results
Clustering Results
● The clustering plugin attempts to automatically discover groups
of related search hits (documents) and assign human-readable
labels to these groups;
● Think of it as a kind of unsupervised faceting;
● It's built online, using the query results;
● Useful to explore the data and discover facets dynamically;
● For simple queries, the clustering time will usually dominate the
fetch time. If the document content is very long, the retrieval of
stored content can become a bottleneck;
● Each cluster result has a label, a score and some documents
that fall into this cluster (facet).
World Bank Projects Dataset
● Generating clusters from data of projects funded by the
World Bank;
{
"region_name_s": "East Asia and Pacific",
"project_abstract_t": "The development objective of the Second Power Transmission Development Project for Indonesia is to meet growing electricity demand and increase access to electricity in the project area through strengthening and expanding the capacity of the power transmission networks in the project area in a sustainable manner. The project has single component with following two parts: first part is extension and rehabilitation of selected existing 150-20 Kilovolt (kV) substations and 70-20 kV substations in the project area, including adding one or more new transformers and associated equipment; and or replacing existing transformers with new transformers and associated equipment with higher capacity; and second part is construction of selected new 150-20 kV substations in the project area, including installation of transformers and associated equipment.",
"project_name_s": "Indonesia Second Power Transmission Development Project",
"country_code_s": "ID",
"country_name_s": "Republic of Indonesia",
"source_s": "IBRD",
"total_amt_i": 325000000,
"status_s": "Active",
"id": "P123994"
}
World Bank Project Dataset (cont.)
● Clusters based on the fields project_name_s and project_abstract_t
using the first 100 documents. Nothing was tweaked and we already get
some good results. (Note: the output comes from a Python script)
[([u'Additional Financing'], 54.23808447032144),
([u'Program'], 31.94900144597348),
([u'Agricultural'], 29.989495158991218),
([u'Development Policy'], 46.552690339024544),
([u'Improvement Project'], 51.00742730020363),
([u'Management'], 27.078107046125666),
([u'Education'], 21.679421779033508),
([u'National'], 22.636815589946526),
([u'Regional'], 21.280823272486778),
([u'DPL'], 14.733639303266868),
([u'Development Policy Operation'], 29.228788836371688),
([u'Ecosystem'], 24.59586749086255),
([u'Implementation'], 15.251178555217848),
([u'Industries Transparency Initiative'], 35.386053776359034) ...
Okapi BM25
Okapi BM25
● BM25 is a competitor of the classic TF/IDF vector space
model, which is currently the default Solr scoring model;
● It is a probabilistic relevance model, often considered the state
of the art for information retrieval;
● Both use term frequency, inverse document frequency, and
field-length normalization, but the definition of each of these
factors is a little different;
● Lucene 6 will use BM25 as the default scoring model;
● In Solr, users can define field types with different scoring models;
● Understanding the math differences is out of the scope of this
presentation.
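Without going through the full math, the Python sketch below shows the textbook BM25 score for a single term and contrasts its term-frequency saturation with the square-root term frequency of classic TF/IDF. The defaults k1=1.2 and b=0.75 are the commonly used values; all the collection statistics in the example are made up, and this is not a copy of Lucene's implementation.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """Inverse document frequency in the smoothed, always-positive form."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """BM25 contribution of one term in one field of one document."""
    # Field-length normalization: longer-than-average fields are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(doc_count, doc_freq) * tf * (k1 + 1) / (tf + norm)


def classic_tf(tf):
    """Term-frequency factor of classic TF/IDF: grows without an upper bound."""
    return math.sqrt(tf)


# Hypothetical collection: 1000 documents, the term appears in 10 of them.
for tf in (1, 10, 100):
    score = bm25_term_score(tf, doc_len=100, avg_doc_len=100, doc_count=1000, doc_freq=10)
    print(tf, round(score, 3), round(classic_tf(tf), 3))
```

The BM25 column flattens out as tf grows (it can never exceed idf * (k1 + 1)), while the classic factor keeps growing.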
Spatial Search
What is a Spatial Query?
● A spatial query is a special type of database
query supported by geodatabases and spatial
databases. The queries differ from non-spatial
SQL queries in several important ways. Two of
the most important are that they allow for the
use of geometry data types such as points,
lines and polygons and that these queries
consider the spatial relationship between
these geometries. (source: Wikipedia)
Solr Spatial Features
● Solr can index location data for spatial or
geospatial queries;
● Index points or any other shapes;
● Filtering results by bounding box, circle or by
other shapes;
● Sort or boost scoring by distance between
points, or relative area between rectangles;
● Heatmaps / Spatial Grid Faceting
● Standards: GeoJSON, WKT
Spatial Field Types
● LatLonType: Index only points;
● SpatialRecursivePrefixTreeFieldType (RPT): Index
other shapes like circle or polygon;
● BBoxField: Specialized for bounding boxes;
● Example of configuration in schema.xml:
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
geo="true" distErrPct="0.025" maxDistErr="0.005" distanceUnits="kilometers" />
LatLonType
● Only points can be indexed;
● Can be queried by circle or bounding-box;
● Better for distance sorting or boosting;
<field name="store">45.17614,-93.87341</field>
<field name="store">40.7143,-74.006</field>
<field name="store">37.7752,-122.4232</field>
SpatialRecursivePrefixTreeFieldType
● Query by polygons and other complex shapes, in
addition to lat-long circles & bounding-boxes;
● Configurable precision which can vary per shape at query
time;
● Index non-point shapes as well as point shapes;
● Multi-valued field, useful for geodecoding;
● Well-Known-Text (WKT) shape syntax for indexing too;
<field name="store">45.17614,-93.87341</field>
<field name="geo">-74.093 41.042 -69.347 44.558</field>
<field name="geo">Circle(4.56,1.23 d=0.0710)</field>
<field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
BBoxField
● Only indexes bounding-boxes;
● Queried by another bounding-box;
● Predicates: Intersects, Within, Contains, Disjoint, Equals;
● Supports relevancy sort/boost like overlapRatio or simply
the area.
Spatial Filters
● Solr has two types of spatial filters:
– GeoFilt;
– BBox;
GeoFilt
● Retrieves results based on the geospatial distance from a given
point:
&q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
BBox
● Works like geofilt but uses the bounding box of the circle instead
of the circle itself;
● The rectangle is faster to compute, hence this filter is useful
when it's acceptable to return results outside the radius:
&q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
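The difference between the two filters can be checked with a small Python sketch: a haversine distance test stands in for geofilt and a simple degree-based rectangle stands in for bbox. This is an approximation for illustration, not Solr's own computation. The Earth radius constant and the corner point are assumptions of the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_circle(point, center, d_km):
    """geofilt-like test: is the point within d_km of the center?"""
    return haversine_km(*point, *center) <= d_km


def in_bounding_box(point, center, d_km):
    """bbox-like test: is the point inside the rectangle around the circle?"""
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(d_km / (EARTH_RADIUS_KM * math.cos(math.radians(center[0]))))
    return abs(point[0] - center[0]) <= dlat and abs(point[1] - center[1]) <= dlon


center = (45.15, -93.85)                    # pt from the queries above
store = (45.17614, -93.87341)               # one of the indexed store points
corner = (45.15 + 0.04, -93.85 + 0.055)     # made-up point near a box corner

print(in_circle(store, center, 5), in_bounding_box(store, center, 5))
print(in_circle(corner, center, 5), in_bounding_box(corner, center, 5))
```

The corner point passes the rectangle test but lies a bit more than 6 km away, which is exactly the kind of extra hit bbox may return.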
Distance Function Queries
● geodist: Distance between geodetic points;
● dist: Distance between multi-dimensional
vectors;
● hsin: Distance between two points on a
sphere (haversine);
● sqedist: Squared Euclidean distance between
points;
● Remember that Solr can sort or boost by any
function query.
Spatial Predicates
● Intersects
● IsWithin
● Contains
● BBoxField only:
– IsDisjointTo
– IsEqualTo
Spatial Clustering
● Suppose you have a lot of points and want to query them
and plot the results on a map;
● You can just page through the results with the start and rows
parameters and use your map SDK to render them;
● But since version 5.1 Solr can facet on these points too;
● One way to look at this problem is grid based heatmap
clustering. All points in a grid square get counted to give a
grid square a numeric value, and those values correspond
to a color scale.
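Grid-based counting is simple enough to sketch in Python. The example buckets lat/lon points into square cells of a fixed size in degrees and counts the points per cell. The cell size, the function name grid_heatmap and the choice of points (the store locations used earlier plus the query center) are assumptions; Solr picks its own grid.

```python
from collections import Counter


def grid_heatmap(points, cell_size_deg):
    """Count points per (row, col) grid cell of cell_size_deg degrees."""
    counts = Counter()
    for lat, lon in points:
        # Shift to non-negative ranges so cell indices start at 0.
        row = int((lat + 90) // cell_size_deg)
        col = int((lon + 180) // cell_size_deg)
        counts[(row, col)] += 1
    return counts


points = [
    (45.17614, -93.87341),
    (45.15, -93.85),
    (40.7143, -74.006),
    (37.7752, -122.4232),
]
heat = grid_heatmap(points, cell_size_deg=1.0)
for cell, count in sorted(heat.items()):
    print(cell, count)
```

Each count would then be mapped to a color when the cell is drawn on the map.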
Heatmaps - worldwidegeoweb.com
SolrCloud
SolrCloud
● Solr is no longer distributed as a WAR application. It is a full-fledged
server ready for deployment;
● Solr can be executed in two modes:
– Standalone Server
– SolrCloud
● Standalone mode resembles a multi-core setup running on a single
Servlet container;
● SolrCloud delivers all the goodness of a Solr cluster, combining fault
tolerance and high availability;
● Now it's much easier to deploy and treat Solr as a backend
dependency of your infrastructure;
● A lot of improvements to the startup scripts and command line tools
have been made to simplify the user experience.
Solr Cluster
● A cluster is a set of Solr nodes managed by ZooKeeper
as a single unit. When you have a cluster, you can
always make requests to the cluster and if the request
is acknowledged, you can be sure that it will be
managed as a unit and be durable, i.e., you won't lose
data. Updates can be seen right after they are made
and the cluster can be expanded or contracted.
Sharding & Replication
● When your data is too large for one node, you can break it up and
store it in sections by creating one or more shards;
● A shard is a way of splitting a core over a number of "servers", or
nodes. For example, you might have a shard for data that represents
each state, or different categories that are likely to be searched
independently, but are often combined;
● Each shard has a set of replicas to achieve robustness;
● Shards and replicas are orchestrated by ZooKeeper, which provides
balancing and failover;
● In SolrCloud there are no masters or slaves. Instead, there are
leaders and replicas.
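A toy Python model of the two ideas: documents are spread over shards by hashing their id, and a request for a shard goes to any live replica. The CRC32 hash, the node names and both helper functions are stand-ins chosen for this sketch; SolrCloud uses its own router and cluster state.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard deterministically from the document id (toy hashing)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def pick_replica(replicas, is_alive):
    """Return the first live replica of a shard, mimicking failover."""
    for replica in replicas:
        if is_alive(replica):
            return replica
    raise RuntimeError("no live replica for this shard")


# Hypothetical layout: 2 shards, each hosted on two nodes.
layout = {0: ["node1", "node2"], 1: ["node3", "node4"]}
down = {"node1", "node3"}

for doc_id in ("29072", "book1", "P123994"):
    shard = shard_for(doc_id, num_shards=len(layout))
    replica = pick_replica(layout[shard], lambda node: node not in down)
    print(doc_id, "-> shard", shard, "served by", replica)
```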
SolrCloud
● Encyclopedia. Sharded and Replicated.
Scripts Improved
● bin/solr
$ solr
Usage: solr COMMAND OPTIONS
where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection,
delete, version
Standalone server example (start Solr running in the background on port 8984):
./solr start -p 8984
SolrCloud example (start Solr running in SolrCloud mode using localhost:2181 to connect to ZooKeeper, with
1g max heap size and remote Java debug options enabled):
./solr start -c -m 1g -z localhost:2181 -a "-Xdebug
-Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"
Pass -help after any COMMAND to see command-specific usage information,
such as: ./solr start -help or ./solr stop -help
Quick start
$ solr start
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=20034). Happy searching!
$ solr status
Found 1 Solr nodes:
Solr process 20034 running on port 8983
{
"solr_home":"/home/lsouza/solr/latest/server/solr",
"version":"5.4.0 1718046 - upayavira - 2015-12-04 23:16:46",
"startTime":"2015-11-22T22:05:57.659Z",
"uptime":"0 days, 0 hours, 0 minutes, 11 seconds",
"memory":"92.1 MB (%18.8) of 490.7 MB"}
$ solr stop
Sending stop command to Solr running on port 8983 ... waiting 5 seconds to allow Jetty process 20034 to stop
gracefully.
Thank You!
References
● http://lucene.apache.org/solr/
● http://geojson.org/
● http://pt.slideshare.net/DavidSmiley2/lucenesolr-spatial-in-2015
● http://boundingbox.klokantech.com/
● https://cwiki.apache.org/confluence/display/solr/Spatial+Search
● https://www.usps.com/
● https://www.census.gov
● https://github.com/GeospatialPython/pyshp
● http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
●
17 December 2015
Leonardo Souza
lsouza@amtera.com.br

Más contenido relacionado

La actualidad más candente

Elasticsearch V/s Relational Database
Elasticsearch V/s Relational DatabaseElasticsearch V/s Relational Database
Elasticsearch V/s Relational DatabaseRicha Budhraja
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHPPaul Borgermans
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampKais Hassan, PhD
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to SolrErik Hatcher
 
JSON in der Oracle Datenbank
JSON in der Oracle DatenbankJSON in der Oracle Datenbank
JSON in der Oracle DatenbankUlrike Schwinn
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Adrien Grand
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Lucidworks
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCampGokulD
 
Faster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrFaster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrChitturi Kiran
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsLucidworks
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with SolrErik Hatcher
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development TutorialErik Hatcher
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiRobert Calcavecchia
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic IntroductionMayur Rathod
 

La actualidad más candente (20)

Elasticsearch V/s Relational Database
Elasticsearch V/s Relational DatabaseElasticsearch V/s Relational Database
Elasticsearch V/s Relational Database
 
Get the most out of Solr search with PHP
Get the most out of Solr search with PHPGet the most out of Solr search with PHP
Get the most out of Solr search with PHP
 
Information Retrieval - Data Science Bootcamp
Information Retrieval - Data Science BootcampInformation Retrieval - Data Science Bootcamp
Information Retrieval - Data Science Bootcamp
 
Introduction to Solr
Introduction to SolrIntroduction to Solr
Introduction to Solr
 
JSON in der Oracle Datenbank
JSON in der Oracle DatenbankJSON in der Oracle Datenbank
JSON in der Oracle Datenbank
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
Webinar: What's New in Solr 7
Webinar: What's New in Solr 7 Webinar: What's New in Solr 7
Webinar: What's New in Solr 7
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Faster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache SolrFaster Data Analytics with Apache Spark using Apache Solr
Faster Data Analytics with Apache Spark using Apache Solr
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data AnalyticsWebinar: Solr & Spark for Real Time Big Data Analytics
Webinar: Solr & Spark for Real Time Big Data Analytics
 
Apache Solr Workshop
Apache Solr WorkshopApache Solr Workshop
Apache Solr Workshop
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Solr Application Development Tutorial
Solr Application Development TutorialSolr Application Development Tutorial
Solr Application Development Tutorial
 
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya BhamidpatiPhilly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
Philly PHP: April '17 Elastic Search Introduction by Aditya Bhamidpati
 
ElasticSearch Basic Introduction
ElasticSearch Basic IntroductionElasticSearch Basic Introduction
ElasticSearch Basic Introduction
 

Solr5

  • 1. 17 December 2015 Leonardo Souza lsouza@amtera.com.br
  • 2.
  • 3. Setting Expectations
    ● This presentation assumes the reader has been familiar with Solr/Lucene technology for a while;
    ● The goal is to update the reader's overall knowledge of Solr and its features;
    ● This presentation is not an exhaustive list of all Solr features or capabilities.
  • 4. Solr 5
    ● Refreshing memories...
    ● Solr (pronounced "solar") is an open source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is the most popular enterprise search engine; (source: Wikipedia)
  • 6. Quick review: Elasticsearch or Solr?
    ● Both are released under the Apache Software License;
    ● Solr and ES both have lively user and developer communities and are being developed rapidly;
    ● If you need to customize or actively contribute, stick with Solr;
    ● Both have good commercial support;
    ● Solr is still much more text-search-oriented. ES is a more natural fit when it comes to building analytical applications that rely on complex features like filtering and grouping;
    ● Elasticsearch is a bit easier to get started with and to deploy;
    ● Fully distributed deployment is a little harder with Solr: you will need a ZooKeeper setup;
    ● If you already use Solr or ES, you don't need to switch unless you have a real motivation.
  • 7. Solr - Agenda
    ● Core Concepts;
    ● Query Parsers;
    ● Faceting;
    ● Nested Documents;
    ● Clustering results;
    ● Okapi BM25;
    ● Spatial Search;
    ● SolrCloud.
  • 9. Core Concepts - Document
    ● Despite the NoSQL hype, Solr is essentially a search engine. That's it: you feed it tons of information and expect to retrieve it later, fast!
    ● The basic unit of information is called a document, and a document is composed of several fields;
    ● A JSON document with 5 fields:
      {
        "population": 33576,
        "state": "SC",
        "city": "LEXINGTON",
        "location": "33.972383, -81.23586",
        "id": "29072"
      }
  • 10. Core Concepts - Schema
    ● Each document field can be digested (analyzed) according to the user's needs;
    ● To accomplish this, the user can define a schema for each kind of document expected to be indexed;
    ● Solr has a field type for almost anything: BinaryField, BoolField, CollationField, CurrencyField, DateRangeField, ExternalFileField, LatLonType, PointType, PreAnalyzedField, SpatialRecursivePrefixTreeFieldType, StrField, TextField, UUIDField, etc.
    ● Solr is very extensible. You can build your own type too.
  • 11. Core Concepts – Dynamic Fields
    ● Dynamic fields give you the power of convention over configuration. For instance, any field whose name ends with _is will be treated as a multivalued integer field.
      <dynamicField name="*_i" type="int" indexed="true" stored="true"/>
      <dynamicField name="*_is" type="int" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_s" type="string" indexed="true" stored="true" />
      <dynamicField name="*_ss" type="string" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_l" type="long" indexed="true" stored="true"/>
      <dynamicField name="*_ls" type="long" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
      <dynamicField name="*_txt" type="text_general" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
      <dynamicField name="*_bs" type="boolean" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_f" type="float" indexed="true" stored="true"/>
      <dynamicField name="*_fs" type="float" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_d" type="double" indexed="true" stored="true"/>
      <dynamicField name="*_ds" type="double" indexed="true" stored="true" multiValued="true"/>
      <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false" />
      <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
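The suffix convention above can be illustrated with a small Python sketch. This is not Solr code: the `DYNAMIC_FIELDS` table copies only a few of the patterns from the slide, `resolve_dynamic_field` is a hypothetical helper, and the rule that the longest matching pattern wins is an assumption made for the illustration.

```python
from typing import Optional, Tuple

# A few suffix patterns from the slide, mapped to (field type, multiValued).
DYNAMIC_FIELDS = {
    "*_i": ("int", False),
    "*_is": ("int", True),
    "*_s": ("string", False),
    "*_ss": ("string", True),
    "*_txt": ("text_general", True),
    "*_dt": ("date", False),
}


def resolve_dynamic_field(name: str) -> Optional[Tuple[str, bool]]:
    """Return (type, multiValued) for a field name, or None if no pattern matches.

    Assumption for this sketch: when several patterns match, the longest wins.
    """
    matches = [p for p in DYNAMIC_FIELDS if name.endswith(p[1:])]
    if not matches:
        return None
    return DYNAMIC_FIELDS[max(matches, key=len)]


print(resolve_dynamic_field("stars_is"))    # multivalued int
print(resolve_dynamic_field("pubdate_dt"))  # single-valued date
print(resolve_dynamic_field("title"))       # no dynamic field applies
```

With this convention a document can introduce a field such as `authors_ss` without any explicit field declaration in the schema.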
  • 12. Core Concepts - Schemaless
    ● The schema is built (guessed) automatically while the index is being filled;
    ● The schema is no longer edited by hand in schema.xml; it is changed only through the new Schema REST API. This is called a managed schema;
    ● Previously unseen fields are run through a cascading set of value-based parsers, which guess the Java class of field values. Parsers for Boolean, Integer, Long, Float, Double, and Date are currently available;
    ● Previously unseen fields are then added to the schema automatically, based on the Java classes of their values, which are mapped to schema field types.
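The cascade of value-based parsers can be pictured with the following Python sketch. It is only an analogy for the idea: `guess_field_type` is a hypothetical function, the returned type names are illustrative, Float and Double are collapsed into one step, and only one date format is tried.

```python
from datetime import datetime


def guess_field_type(value: str) -> str:
    """Try progressively more permissive parsers and return the first type that fits."""
    if value.strip().lower() in ("true", "false"):
        return "boolean"
    try:
        number = int(value)
        # Values outside the 32-bit range fall through to "long".
        return "int" if -2**31 <= number < 2**31 else "long"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return "date"
    except ValueError:
        # Nothing matched: keep the value as a plain string.
        return "string"


for sample in ["true", "33576", "3000000000", "33.97", "2015-01-03T14:30:00Z", "LEXINGTON"]:
    print(sample, "->", guess_field_type(sample))
```

The order of the parsers matters: a value such as "33576" would also parse as a double, so the more specific types are tried first.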
  • 13. Core Concepts – Schemaless (cont.)
    ● The Solr distribution ships with an example of a managed schema. Basically you'll need to:
      – Configure the schemaFactory in solrconfig.xml;
      – Define an UpdateRequestProcessorChain;
      – Make the UpdateRequestProcessorChain the default for the UpdateRequestHandler;
    ● You can also use schemaless mode and pre-emptively create fields before indexing.
  • 14. Core Concepts - Analyze Chain
    ● A field type can be quite complex: it defines the analysis chain applied at index time and at query time;
      <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
          <filter class="solr.PorterStemFilterFactory"/>
        </analyzer>
      </fieldType>
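To make the idea of a chain concrete, here is a toy Python pipeline that mimics only three of the steps above (whitespace tokenizing, case-insensitive stop word removal, lowercasing). The `STOPWORDS` set is a made-up stand-in for lang/stopwords_en.txt, and word delimiting, protected words and stemming are deliberately left out.

```python
# Hypothetical stop word list; the real one lives in lang/stopwords_en.txt.
STOPWORDS = {"a", "an", "the", "of", "to"}


def analyze(text: str) -> list:
    """Run a tiny tokenizer + filter chain and return the resulting terms."""
    tokens = text.split()                                         # whitespace tokenizer
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]    # stop filter, ignoreCase
    return [t.lower() for t in tokens]                            # lowercase filter


print(analyze("The Way of Kings"))
```

Each stage receives the token stream produced by the previous one, which is why the order of the filters in the XML matters.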
  • 15. Core Concepts - Field Properties
    ● A document field is declared using one of the predefined field types. Each field carries properties that can affect index size and search performance, such as stored, which keeps a copy of the entire field value.
      <field name="price" type="float" default="0.0" indexed="true" stored="true"/>
    ● There are advanced options like omitNorms. Norms support length normalization and index-time boosting; setting omitNorms="true" drops them to save some memory.
  • 16. Core Concepts - DocValues
    ● Since Lucene 4, fields can have a special property called DocValues;
    ● The well-known inverted index (term-to-document) is not suitable, and does not scale well, for operations like sorting, faceting and highlighting;
    ● When a field enables the docValues property, its data is stored at index time in a column-oriented way, as a document-to-value mapping;
      <field name="manu_exact" type="string" indexed="false" stored="false" docValues="true" />
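The difference between the two layouts can be sketched in Python. This is a conceptual model only, with made-up manufacturer values: a dictionary stands in for the inverted index and a list indexed by document id stands in for the column-oriented DocValues.

```python
from collections import defaultdict

# Hypothetical values of a manufacturer field, keyed by internal document id.
docs = {0: "Samsung", 1: "Apple", 2: "Samsung"}

# Inverted index: term -> documents. Good for "which docs contain this term?".
inverted = defaultdict(list)
for doc_id, value in docs.items():
    inverted[value].append(doc_id)

# Column-oriented mapping: document -> value. Good for "what is the value of this doc?".
doc_values = [docs[i] for i in sorted(docs)]


def facet_counts(hits):
    """Count field values over a result set with one lookup per hit."""
    counts = defaultdict(int)
    for doc_id in hits:
        counts[doc_values[doc_id]] += 1
    return dict(counts)


def sort_hits(hits):
    """Sort a result set by field value using the same per-document lookup."""
    return sorted(hits, key=lambda doc_id: doc_values[doc_id])


print(facet_counts([0, 2]))
print(sort_hits([0, 1, 2]))
```

Faceting and sorting both start from a list of matching documents, so they benefit from the document-to-value direction; answering them from the inverted index would mean scanning terms to find each document's value.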
  • 18. Query Parsers
    ● Solr offers great control over how the user's query is parsed. There are 3 main parsers:
      – Standard Query Parser
      – Dismax Query Parser
      – Extended Dismax Query Parser
    ● There are other specialized parsers, like spatial, boost, MLT, etc.;
    ● All parsers share a common set of parameters like sort, start, rows, fq, fl, timeAllowed, wt, etc.
  • 19. Query Parsers – Standard Parser
    ● Also known as the Lucene parser;
    ● Exposes all Lucene features, allowing you to build complex queries;
    ● But it is very intolerant of syntax errors.
  • 20. Query Parsers - Dismax
    ● Designed to process simple phrases entered by users and to search for individual terms across several fields, using different weights based on the significance of each field;
    ● It's very lenient and rarely produces an error;
    ● Uses a simplified subset of the Lucene query parser syntax;
    ● Users can use quotes to group phrases, +/- to mark mandatory (+) or prohibited (-) clauses, and the AND/OR operators. Everything else is escaped to simplify the user's experience;
    ● DisMax stands for Maximum Disjunction: a query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie-breaking increment for any additional matching subqueries.
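The scoring rule in the DisMax definition can be written down in a few lines of Python. This is a sketch of the formula only, not of the query parser: `dismax_score` is a hypothetical name and the per-field scores in the example are invented numbers.

```python
def dismax_score(subquery_scores, tie=0.0):
    """Maximum subquery score plus a tie-breaking share of the other matching subqueries."""
    if not subquery_scores:
        return 0.0
    best = max(subquery_scores)
    return best + tie * (sum(subquery_scores) - best)


# Invented scores of one document for the same term in title, description and tags.
scores = [2.0, 1.0, 0.5]
print(dismax_score(scores))           # pure maximum
print(dismax_score(scores, tie=0.1))  # maximum plus 10% of the other scores
print(dismax_score(scores, tie=1.0))  # degenerates into a plain sum
```

With tie set to 0 only the best-matching field counts; values closer to 1 reward documents that match in several fields.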
  • 21. Query Parsers - eDismax
    ● Supports the full Lucene query parser syntax;
    ● Supports operators such as AND, OR, NOT, -, and +;
    ● Treats "and" and "or" as "AND" and "OR" in Lucene syntax mode;
    ● Supports pure negative nested queries: a query such as +foo (-foo) will match all documents;
    ● Lets you specify which fields the end user is allowed to query, and lets you disallow direct fielded searches.
  • 23. Faceting ● Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters. A faceted classification system classifies each information element along multiple explicit dimensions, called facets, enabling the classifications to be accessed and ordered in multiple ways rather than in a single, pre-determined, taxonomic order. (source: Wikipedia)
  • 25. Faceting
    ● By field value: very common when the field holds some kind of categorization or tagging values;
    ● By range: on any date or numeric field;
    ● By query: to define your own custom facets.
  • 26. Faceting Example
    ● schema.xml
      <field name="category" type="string" />
      <field name="manufacturer" type="string" />
    ● Faceting on fields:
      http://localhost:8983/solr/collection_name/select?q=*:*&facet=true&facet.field=category&facet.field=manufacturer
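A client would normally build that URL programmatically and then read the counts back. The Python sketch below assumes Solr's default JSON layout for field facets, a flat list alternating value and count; `facet_url` and `pair_counts` are hypothetical helper names and the sample counts are invented.

```python
from urllib.parse import urlencode


def facet_url(core_url, fields):
    """Build a select URL that matches everything and facets on the given fields."""
    params = [("q", "*:*"), ("facet", "true")]
    params += [("facet.field", field) for field in fields]  # repeated parameter
    return core_url + "/select?" + urlencode(params)


def pair_counts(flat):
    """Turn ["electronics", 12, "memory", 3] into {"electronics": 12, "memory": 3}."""
    return dict(zip(flat[::2], flat[1::2]))


print(facet_url("http://localhost:8983/solr/collection_name", ["category", "manufacturer"]))
print(pair_counts(["electronics", 12, "memory", 3]))
```

Note that `urlencode` percent-encodes `*:*`; Solr decodes it back to the same query.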
  • 28. Nested Objects
    ● Appear frequently in relational databases;
    ● Historically, search engines indexed only flat data, with no hierarchy at all;
    ● Every major DBMS today has some sort of text search, but it is far from ideal in some scenarios;
    ● If you truly need a full-text search engine, I am afraid you'll have to maintain another moving part in your architecture.
  • 29. Nested Documents
    ● A JSON document example with nested objects:
      [{
        "id": "book1",
        "title_s": "The Way of Kings",
        "authors_ss": ["Brandon Lee"],
        "cat_s": "fantasy",
        "pubyear_i": 2010,
        "publisher_s": "Tor",
        "reviews": [{
          "pubdate_dt": "2015-01-03T14:30:00Z",
          "stars_i": 5,
          "author_s": "Robert Youh",
          "comment_s": "A great start to what looks like an epic series!"
        }, {
          "pubdate_dt": "2014-03-15T12:00:00Z",
          "stars_i": 3,
          "author_s": "Daniel K",
          "comment_s": "This book was too long."
        }]
      }]
  • 30. Let's Index
      curl -X POST -H "Content-Type: application/json" -d '[{
        "id": "book1",
        "title": "The Way of Kings",
        "authors": ["Brandon Lee"],
        "cat": "fantasy",
        "pubyear": 2010,
        "publisher": "Tor",
        "reviews": [{
          "pubdate": "2015-01-03T14:30:00Z",
          "stars": 5,
          "author": "Robert Youh",
          "comment": "A great start to what looks like an epic series!"
        }, {
          "pubdate": "2014-03-15T12:00:00Z",
          "stars": 3,
          "author": "Daniel K",
          "comment": "This book was too long."
        }]
      }]' 'http://localhost:8983/solr/books/update'
  • 31. Oops!
      {
        "responseHeader": {
          "status": 400,
          "QTime": 1
        },
        "error": {
          "msg": "Error parsing JSON field value. Unexpected OBJECT_START at [150], field=reviews",
          "code": 400
        }
      }
    ● Sadly, Solr/Lucene can't understand nested objects directly from the document hierarchy.
  • 32. Oops! Solutions
    ● Assuming you need all of Lucene's power, you can:
      – Denormalize your data before indexing. This can be quite problematic: joining all tables/collections duplicates content and may not scale well;
      – Index each entity into a separate index and do your own join at the application level. This does not scale well, adds application logic overhead and hurts overall relevance, because the data is split across uncorrelated indices;
      – Use Lucene's join features, which are already integrated into Solr, with some limitations.
  • 33. Solr Joins
    ● Solr nested objects are implemented using Lucene's Block Join feature;
    ● Block Join and (query-time) Join are different beasts;
    ● Block Join stores children and their parent contiguously in the index and needs additional information at query time;
    ● Query-time Join does not rely on any special arrangement at the index level.
  • 34. Block Joins (index-time-join)
    ● Documents need to be converted to a special syntax:
      [{
        "id": "book1",
        "title_s": "The Way of Kings",
        "authors_ss": ["Brandon Lee"],
        "cat_s": "fantasy",
        "pubyear_i": 2010,
        "publisher_s": "Tor",
        "type_s": "book",
        "_childDocuments_": [{
          "id": "book1_c1",
          "type_s": "reviews",
          "pubdate_dt": "2015-01-03T14:30:00Z",
          "stars_i": 5,
          "author_s": "Robert Youh",
          "comment_t": "A great start to what looks like an epic series!"
        }, {
          "id": "book1_c2",
          "type_s": "reviews",
          "pubdate_dt": "2014-03-15T12:00:00Z",
          "stars_i": 3,
          "author_s": "Daniel K",
          "comment_t": "This book was too long."
        }]
      }]
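The conversion from an ordinary nested document to this syntax is mechanical, so it can be done by a small helper before indexing. The Python sketch below is one possible way to do it: `to_block_join` is a hypothetical function, and the `type_s` discriminator field and the `<parent id>_c<n>` child ids simply follow the convention used on the slide.

```python
def to_block_join(doc, child_key, parent_type, child_type):
    """Move the objects under child_key into _childDocuments_ and tag every document with a type."""
    parent = {k: v for k, v in doc.items() if k != child_key}
    parent["type_s"] = parent_type
    parent["_childDocuments_"] = [
        # Each child becomes a full document and therefore needs its own unique id.
        {"id": f"{doc['id']}_c{i}", "type_s": child_type, **child}
        for i, child in enumerate(doc.get(child_key, []), start=1)
    ]
    return parent


book = {
    "id": "book1",
    "title_s": "The Way of Kings",
    "reviews": [
        {"stars_i": 5, "author_s": "Robert Youh"},
        {"stars_i": 3, "author_s": "Daniel K"},
    ],
}
print(to_block_join(book, "reviews", "book", "reviews"))
```

The result can be serialized with `json.dumps` and posted to the update handler in place of the original nested document.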
  • 35. Querying Nested Documents
    ● Querying books that have a 3-star review, using the Block Join Parent Query Parser:
      curl -X GET 'http://localhost:8983/solr/books/select?q={!parent which="type_s:book"}stars_i:3'
    ● Querying all reviews of fantasy books, using the Block Join Children Query Parser:
      curl -X GET 'http://localhost:8983/solr/books/select?q={!child of="type_s:book"}cat_s:fantasy'
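The braces, quotes and spaces in these queries are not URL-safe, and curl treats braces as glob patterns unless globbing is turned off, so in practice the q parameter is usually percent-encoded. A minimal Python sketch of that step follows; `select_url` is a hypothetical helper and the core URL is the one used on the slide.

```python
from urllib.parse import urlencode

CORE_URL = "http://localhost:8983/solr/books"  # core URL taken from the slide


def select_url(core_url, query):
    """Return a select URL with the query percent-encoded in the q parameter."""
    return f"{core_url}/select?{urlencode({'q': query})}"


# Parents (books) that have at least one child (review) with 3 stars.
print(select_url(CORE_URL, '{!parent which="type_s:book"}stars_i:3'))
# Children (reviews) whose parent is a fantasy book.
print(select_url(CORE_URL, '{!child of="type_s:book"}cat_s:fantasy'))
```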
  • 36. Join - JoinQueryParser
    ● Allows normalizing relationships between documents with a join operation. This is different from a join in a relational database because no information is truly joined. An appropriate SQL analogy would be an "inner query";
    ● Suppose you have two indices (entities) named movies(id, title, director_id) and movie_directors(id, name, has_oscar):
      fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
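The "inner query" analogy can be modelled in a few lines of Python. This is a conceptual sketch with invented sample data, not how Solr executes the join: `join_filter` is a hypothetical function that first collects the `from` values of the documents matching the inner query and then keeps the documents whose `to` field holds one of those values.

```python
def join_filter(from_docs, from_field, to_docs, to_field, predicate):
    """Keep the to_docs whose to_field matches a from_field value of a matching from_doc."""
    keys = {doc[from_field] for doc in from_docs if predicate(doc)}
    # Only documents from the "to" side are returned; nothing is merged into them.
    return [doc for doc in to_docs if doc[to_field] in keys]


# Invented sample data for the two indices.
movie_directors = [
    {"id": "d1", "name": "Director A", "has_oscar": True},
    {"id": "d2", "name": "Director B", "has_oscar": False},
]
movies = [
    {"id": "m1", "title": "Movie 1", "director_id": "d1"},
    {"id": "m2", "title": "Movie 2", "director_id": "d2"},
    {"id": "m3", "title": "Movie 3", "director_id": "d1"},
]

# fq={!join from=id fromIndex=movie_directors to=director_id}has_oscar:true
print(join_filter(movie_directors, "id", movies, "director_id", lambda d: d["has_oscar"]))
```

The returned movies carry no director fields, which is the sense in which no information is truly joined.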
  • 37. Join x BlockJoin
    ● Join:
      – Does not rely on any index arrangement beforehand;
      – Can be much slower, but doesn't require extra disk space;
      – The documents should be flattened, with no hierarchy, as usual;
      – The joined fields should use compatible types.
    ● BlockJoin:
      – It's faster, but parent and child documents must be updated together as a block; partial updates (including deletions) are not possible;
      – Uses a lot of extra disk space, as each child and each parent counts as a separate document;
      – Documents of different types have to live in the same index.
  • 39. Clustering Results
    ● The clustering plugin attempts to automatically discover groups of related search hits (documents) and assign human-readable labels to these groups;
    ● Think of it as a kind of unsupervised faceting;
    ● It's built online, from the query results;
    ● Useful to explore the data and discover facets dynamically;
    ● For simple queries, the clustering time will usually dominate the fetch time. If the document content is very long, the retrieval of stored content can become a bottleneck;
    ● Each cluster result has a label, a score and the documents that fall into this cluster (facet).
  • 40. World Bank Projects Dataset
    ● Generating clusters from data about projects funded by the World Bank;
      {
        "region_name_s": "East Asia and Pacific",
        "project_abstract_t": "The development objective of the Second Power Transmission Development Project for Indonesia is to meet growing electricity demand and increase access to electricity in the project area through strengthening and expanding the capacity of the power transmission networks in the project area in a sustainable manner. The project has single component with following two parts: first part is extension and rehabilitation of selected existing 150-20 Kilovolt (kV) substations and 70-20 kV substations in the project area, including adding one or more new transformers and associated equipment; and or replacing existing transformers with new transformers and associated equipment with higher capacity; and second part is construction of selected new 150-20 kV substations in the project area, including installation of transformers and associated equipment.",
        "project_name_s": "Indonesia Second Power Transmission Development Project",
        "country_code_s": "ID",
        "country_name_s": "Republic of Indonesia",
        "source_s": "IBRD",
        "total_amt_i": 325000000,
        "status_s": "Active",
        "id": "P123994"
      }
  • 41. World Bank Project Dataset (cont.)
    ● Clusters based on the fields project_name_s and project_abstract_t, using the first 100 documents. Nothing was tweaked and we already get some good results. (Note: the output comes from a Python script.)
      [([u'Additional Financing'], 54.23808447032144),
       ([u'Program'], 31.94900144597348),
       ([u'Agricultural'], 29.989495158991218),
       ([u'Development Policy'], 46.552690339024544),
       ([u'Improvement Project'], 51.00742730020363),
       ([u'Management'], 27.078107046125666),
       ([u'Education'], 21.679421779033508),
       ([u'National'], 22.636815589946526),
       ([u'Regional'], 21.280823272486778),
       ([u'DPL'], 14.733639303266868),
       ([u'Development Policy Operation'], 29.228788836371688),
       ([u'Ecosystem'], 24.59586749086255),
       ([u'Implementation'], 15.251178555217848),
       ([u'Industries Transparency Initiative'], 35.386053776359034)
       ...
  • 43. Okapi BM25
    ● BM25 is a competitor of the classic TF/IDF vector space model, which is currently Solr's default scoring model;
    ● It is a probabilistic relevance model and is considered the state of the art for information retrieval;
    ● Both use term frequency, inverse document frequency, and field-length normalization, but the definition of each of these factors is a little different;
    ● Lucene 6 will use BM25 as the default scoring model;
    ● In Solr, users can define fields with different scoring models;
    ● Understanding the mathematical differences is out of the scope of this presentation.
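For readers who still want a feel for the three factors, the following Python sketch computes a textbook BM25 score for a single term. It is an illustration under stated assumptions rather than Lucene's exact implementation: `bm25_term_score` is a hypothetical name, the parameter values k1=1.2 and b=0.75 are the commonly quoted defaults, and the corpus statistics in the example are invented.

```python
import math


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score one term in one document from term frequency, document frequency and field length."""
    # Inverse document frequency: rare terms weigh more.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: longer-than-average fields are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less than the previous one.
    return idf * tf * (k1 + 1) / (tf + norm)


# Invented statistics: 1000 documents, the term appears in 50 of them.
for tf in (1, 2, 10, 100):
    print(tf, round(bm25_term_score(tf, df=50, n_docs=1000, doc_len=100, avg_doc_len=100), 3))
```

The printed scores grow with the term frequency but flatten out quickly, which is the saturation behaviour usually cited as the main practical difference from classic TF/IDF.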
  • 45. What is a Spatial Query? ● A spatial query is a special type of database query supported by geodatabases and spatial databases. The queries differ from non-spatial SQL queries in several important ways. Two of the most important are that they allow for the use of geometry data types such as points, lines and polygons and that these queries consider the spatial relationship between these geometries. (source: Wikipedia)
  • 46. Solr Spatial Features ● Solr can index location data for spatial or geospatial queries; ● Index points or any other shapes; ● Filtering results by bounding box, circle or by other shapes; ● Sort or boost scoring by distance between points, or relative area between rectangles; ● Heatmaps / Spatial Grid Faceting ● Standards: GeoJSON, WKT
  • 47. Spatial Field Types ● LatLonType: Indexes only points; ● SpatialRecursivePrefixTreeFieldType (RPT): Indexes other shapes such as circles or polygons; ● BBoxField: Specialized for bounding boxes; ● Example of configuration in schema.xml: <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" geo="true" distErrPct="0.025" maxDistErr="0.005" distanceUnits="kilometers" />
  • 48. LatLonType ● Only points can be indexed; ● Can be queried by circle or bounding-box; ● Better for distance sorting or boosting; <field name="store">45.17614,-93.87341</field> <field name="store">40.7143,-74.006</field> <field name="store">37.7752,-122.4232</field>
  • 49. SpatialRecursivePrefixTreeFieldType ● Query by polygons and other complex shapes, in addition to lat-long circles & bounding-boxes; ● Configurable precision which can vary per shape at query time; ● Index non-point shapes as well as point shapes; ● Multi-valued field, useful for geodecoding; ● Well-Known-Text (WKT) shape syntax for indexing too; <field name="store">45.17614,-93.87341</field> <field name="geo">-74.093 41.042 -69.347 44.558</field> <field name="geo">Circle(4.56,1.23 d=0.0710)</field> <field name="geo">POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))</field>
  • 50. BBoxField ● Only indexes bounding-boxes; ● Queried by another bounding-box; ● Predicates: Intersects, Within, Contains, Disjoint, Equals; ● Supports relevancy sort/boost like overlapRatio or simply the area.
  • 51. Spatial Filters ● Solr has two types of spatial filters: – GeoFilt; – BBox;
  • 52. GeoFilt ● Retrieve results based on the geospatial distance from a given point; ● &q=*:*&fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5
  • 53. BBox ● Works like geofilt, but filters by the bounding box of the circle instead of the circle itself; ● The rectangle is faster to compute, so this filter is useful when it is acceptable to return some results outside the radius; ● &q=*:*&fq={!bbox sfield=store}&pt=45.15,-93.85&d=5
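The geofilt and bbox requests shown on the last two slides differ only in the filter name, so a client can build both from the same parameters. The Python sketch below does that with the standard library only. The helper names and the base URL (host, port and a collection called "stores") are assumptions for illustration and do not come from the slides.

```python
from urllib.parse import urlencode


def spatial_filter_params(filter_name, sfield, lat, lon, distance):
    """Build the request parameters for a geofilt or bbox filter query."""
    if filter_name not in ("geofilt", "bbox"):
        raise ValueError("filter_name must be 'geofilt' or 'bbox'")
    return {
        "q": "*:*",                                            # match all documents
        "fq": "{!%s sfield=%s}" % (filter_name, sfield),       # the spatial filter
        "pt": "%s,%s" % (lat, lon),                            # center point: lat,lon
        "d": str(distance),                                    # distance from the point
    }


def select_url(base_url, params):
    """Append URL-encoded parameters to a select handler URL."""
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance and collection name.
    base = "http://localhost:8983/solr/stores/select"
    for name in ("geofilt", "bbox"):
        print(select_url(base, spatial_filter_params(name, "store", 45.15, -93.85, 5)))
```

Because the local-params syntax contains braces, spaces and an exclamation mark, letting urlencode escape the values is safer than concatenating the query string by hand.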
  • 54. Distance Function Queries ● geodist: Distance between two geographic points; ● dist: Distance between multi-dimensional vectors; ● hsin: Haversine distance between two points on a sphere; ● sqedist: Squared Euclidean distance between points; ● Remember that Solr can sort or boost by any function query.
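To make the difference between these distances concrete, the following Python sketch computes a haversine distance and a squared Euclidean distance. It only illustrates the math behind the names: the function names, argument order and the rounded Earth radius of 6371 km are assumptions, not Solr's function signatures.

```python
import math

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumed value)


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    return sum((a - b) ** 2 for a, b in zip(p, q))


if __name__ == "__main__":
    # Two of the store locations used as sample data in the earlier slides.
    print(haversine_km(45.17614, -93.87341, 40.7143, -74.006))
    print(squared_euclidean((0, 0), (3, 4)))
```

The squared variant skips the square root, which is why it is cheaper when distances are only compared or ranked.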
  • 55. Spatial Predicates ● Intersects ● IsWithin ● Contains ● BBoxField only: – IsDisjointTo – IsEqualTo
  • 56. Spatial Clustering ● Suppose you have a lot of points and want to query them to plot the results on a map; ● You can simply page through the results with the rows parameter and use your map SDK to render them; ● But since version 5.1 Solr can facet on these points too; ● One way to look at this problem is grid-based heatmap clustering. All points in a grid square are counted to give that square a numeric value, and those values correspond to a color scale.
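The grid-based counting idea can be sketched in a few lines of Python. This is a client-side illustration of the concept only, not how Solr computes its heatmap facets. The function name, the grid origin and the cell size in degrees are assumptions chosen for the example.

```python
from collections import Counter


def grid_heatmap(points, min_lat, min_lon, cell_deg):
    """Count (lat, lon) points per square grid cell of cell_deg degrees.

    Returns a Counter keyed by (row, col), where row grows with latitude
    and col grows with longitude, both starting at the grid origin.
    """
    if cell_deg <= 0:
        raise ValueError("cell_deg must be positive")
    counts = Counter()
    for lat, lon in points:
        row = int((lat - min_lat) // cell_deg)
        col = int((lon - min_lon) // cell_deg)
        counts[(row, col)] += 1
    return counts


if __name__ == "__main__":
    # Store locations used as sample data in the earlier slides.
    stores = [(45.17614, -93.87341), (40.7143, -74.006), (37.7752, -122.4232)]
    # A coarse 10-degree grid anchored at the south-west corner of the world.
    for cell, count in sorted(grid_heatmap(stores, -90.0, -180.0, 10.0).items()):
        print(cell, count)
```

Each count can then be mapped to a color, so the map renders one colored square per cell instead of one marker per document.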
  • 59. SolrCloud ● Solr is no longer distributed as a WAR application. It is a full-fledged server ready for deployment. ● Solr can be executed in two modes: – Standalone Server – SolrCloud ● Standalone mode resembles a multi-core setup running on a single Servlet container; ● SolrCloud delivers all the benefits of a Solr cluster, combining fault tolerance and high availability; ● It is now much easier to deploy Solr and treat it as a backend dependency in your infrastructure; ● Many improvements have been made to the startup scripts and command line tools to simplify the user experience.
  • 60. Solr Cluster ● A cluster is a set of Solr nodes managed by ZooKeeper as a single unit. When you have a cluster, you can always make requests to the cluster and, if the request is acknowledged, you can be sure that it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be expanded or contracted.
  • 61. Sharding & Replication ● When your data is too large for one node, you can break it up and store it in sections by creating one or more shards; ● A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for the data of each state, or for different categories that are likely to be searched independently but are often combined; ● Each shard has a set of replicas for robustness; ● Shards and replicas are orchestrated by ZooKeeper, which provides balancing and failover; ● In SolrCloud there are no masters or slaves. Instead, there are leaders and replicas.
  • 63. Scripts Improved ● bin/solr $ solr Usage: solr COMMAND OPTIONS where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection, delete, version Standalone server example (start Solr running in the background on port 8984): ./solr start -p 8984 SolrCloud example (start Solr running in SolrCloud mode using localhost:2181 to connect to ZooKeeper, with 1g max heap size and remote Java debug options enabled): ./solr start -c -m 1g -z localhost:2181 -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044" Pass -help after any COMMAND to see command-specific usage information, such as: ./solr start -help or ./solr stop -help
  • 64. Quick start $ solr start Waiting up to 30 seconds to see Solr running on port 8983 [/] Started Solr server on port 8983 (pid=20034). Happy searching! $ solr status Found 1 Solr nodes: Solr process 20034 running on port 8983 { "solr_home":"/home/lsouza/solr/latest/server/solr", "version":"5.4.0 1718046 - upayavira - 2015-12-04 23:16:46", "startTime":"2015-11-22T22:05:57.659Z", "uptime":"0 days, 0 hours, 0 minutes, 11 seconds", "memory":"92.1 MB (%18.8) of 490.7 MB"} $ solr stop Sending stop command to Solr running on port 8983 ... waiting 5 seconds to allow Jetty process 20034 to stop gracefully.
  • 66. References ● http://lucene.apache.org/solr/ ● http://geojson.org/ ● http://pt.slideshare.net/DavidSmiley2/lucenesolr-spatial-in-2015 ● http://boundingbox.klokantech.com/ ● https://cwiki.apache.org/confluence/display/solr/Spatial+Search ● https://www.usps.com/ ● https://www.census.gov ● https://github.com/GeospatialPython/pyshp ● http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/