3. Agenda
What is Solr
When to use it
When not to use it
How it works
Demo
Pleeeeease ask questions.
Otherwise it will be boring :(
4. What is Solr
Open source
Search engine
Based on Apache Lucene*
Distributed (SolrCloud) or not (master-slave)
* Actually, the two projects merged in 2010
5. More on search: the term dictionary and its friends
Documents:
1) Big data is fun
2) Other text
3) Bucharest is big

After analysis, the term dictionary looks like this:

Term       Docs   Positions   Counts, stored, etc.
big        1,3    [0],[2]     ...
bucharest  3      [0]
data       1      [1]
fun        1      ...
is         1,3
other      2
text       2

Example queries:
big AND data
“big data”
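To make the term dictionary concrete, here is a minimal sketch of a positional inverted index over the three example documents, with a toy version of the two queries above. The analysis chain (lowercase, split on non-alphanumerics) and all function names are assumptions for illustration, not Solr's or Lucene's actual implementation.

```python
import re
from collections import defaultdict

DOCS = {1: "Big data is fun", 2: "Other text", 3: "Bucharest is big"}

def analyze(text):
    """Toy analysis chain: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, *terms):
    """Docs containing every term: intersect the postings lists."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

def search_phrase(index, *terms):
    """Docs where the terms occur at consecutive positions."""
    hits = []
    for doc_id in search_and(index, *terms):
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits

INDEX = build_index(DOCS)
```

Here `search_and(INDEX, "big", "data")` only needs the Docs column, while `search_phrase(INDEX, "big", "data")` also needs Positions, which is why positions are stored per term and per document.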
7. The [relevancy] score
BM25: bag-of-words based on TF-IDF
q=big AND data
Example document: I have big big big data
Term Frequency: more occurrences in the document, more weight
Inverse Document Frequency: fewer occurrences in the index, more weight
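The two rules above can be sketched with the textbook Okapi BM25 formula. This is a simplified illustration, not necessarily the exact formula of any given Lucene version; the index statistics used in the usage note (1000 documents, "big" in 500 of them, "data" in 50) are made up, and k1=1.2, b=0.75 are the commonly used defaults.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs,
                    k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    # IDF: the fewer documents contain the term, the higher the weight.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # TF: more occurrences give more weight, but with diminishing returns,
    # and longer-than-average documents are penalized via b.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len):
    """Bag-of-words score: sum of per-term scores, ignoring word order."""
    return sum(
        bm25_term_score(doc_terms.count(t), len(doc_terms), avg_doc_len,
                        doc_freqs[t], num_docs)
        for t in query_terms
    )
```

For example, `bm25_score(["big", "data"], "i have big big big data".split(), {"big": 500, "data": 50}, 1000, 6.0)` scores the example document. Under these made-up statistics, the single occurrence of the rarer "data" contributes more than the three occurrences of "big", because term frequency saturates while the inverse document frequency does not.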
8. Relevancy tuning
title: Big Data
description: this is a book about big data
published: 2016
title: Spark Rulz
description: big data big data big data big data
published: 2015
q=big AND data
boost fields
boost values
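One way to express both kinds of boost is through the eDisMax query parser: `qf` weights fields and `bq` adds a boost query for specific values. The sketch below only builds the request URL. The host, the `books` collection name and the boost factors are assumptions for illustration.

```python
from urllib.parse import urlencode

def build_query(base_url, user_query):
    """Build a Solr select URL that boosts a field and a value."""
    params = {
        "defType": "edismax",
        "q": user_query,
        # Boost fields: a match in title counts 3x a match in description.
        "qf": "title^3 description",
        # Boost values: favor books published in 2016.
        "bq": "published:2016^2",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint; adjust host and collection to your setup.
URL = build_query("http://localhost:8983/solr/books/select", "big AND data")
```

With boosts like these, the "Big Data" book would be expected to outrank "Spark Rulz", even though the latter repeats the query terms more often in its description.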
9. Back to sorting: where the inverted index fails
Term      Docs
1 [star]  1,2,8,5,128
2         7,84,129
3         3,29,345
4         11,123,455
5         12,14,16,17
Search returned docs 84, 455, 12 and 8.
Now sort them by rating.
¯\_(ツ)_/¯
10. Enter doc values
Doc   Terms
8     1
12    3
84    5
129   4
455   2
Search returned docs 84, 455, 12 and 8.
Now sort them by rating.
Similar, but not quite like stored fields*
* Faster retrieval for doc values. For analyzed text, you’re stuck with stored fields and in-memory field cache
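The difference between the two structures can be sketched in a few lines, using the doc-to-rating table from this slide. The function names are hypothetical. The point is only that doc values give a direct doc-to-value lookup, while the inverted index has to be scanned term by term to find a document's value.

```python
# Doc values: doc ID -> rating (the table from this slide).
DOC_VALUES = {8: 1, 12: 3, 84: 5, 129: 4, 455: 2}

def invert(doc_values):
    """Build the term -> docs view of the same data."""
    inverted = {}
    for doc_id, value in doc_values.items():
        inverted.setdefault(value, []).append(doc_id)
    return inverted

def rating_via_inverted_index(inverted, doc_id):
    """Slow path: scan every term's postings until the doc shows up."""
    for term, docs in inverted.items():
        if doc_id in docs:
            return term
    raise KeyError(doc_id)

def sort_hits(hits, doc_values, reverse=False):
    """Fast path: one direct lookup per hit."""
    return sorted(hits, key=doc_values.__getitem__, reverse=reverse)

HITS = [84, 455, 12, 8]
BY_RATING = sort_hits(HITS, DOC_VALUES)  # lowest rating first
```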
11. Facets
search returns doc IDs
facet=true
facet.field=host
doc1: host=server01
doc2: host=server02
doc3: host=server01
doc4: host=server01
server01: 3
server02: 1
Counts are computed from doc values, usually*
* Can be the filter cache on low-cardinality fields (depends on facet.method)
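Counting facets over doc values amounts to looking up the field value of each matching document and tallying. A minimal sketch using the four example documents (the function name is hypothetical):

```python
from collections import Counter

# Doc values for the "host" field: doc ID -> value.
HOST_DOC_VALUES = {
    "doc1": "server01",
    "doc2": "server02",
    "doc3": "server01",
    "doc4": "server01",
}

def facet_counts(hits, doc_values):
    """Tally the field value of every doc the search returned."""
    return Counter(doc_values[d] for d in hits if d in doc_values)

COUNTS = facet_counts(["doc1", "doc2", "doc3", "doc4"], HOST_DOC_VALUES)
```

Only the documents returned by the search are counted, so a narrower query such as one matching just doc2 and doc3 would yield one hit per host.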
12. Facets can be hierarchical
Request:

top_genres:{
  terms:{
    field: genre,
    limit: 5,
    facet:{
      top_authors:{
        terms:{
          field: author,
          limit: 2
        }
      }
    }
  }
}

Response (truncated):

"top_genres":{
  "buckets":[
    {
      "val":"Fantasy",
      "count":5432,
      "top_authors":{ // top authors in the "Fantasy" genre
        "buckets":[{
          "val":"Mercedes Lackey",
          "count":121},
          {
          "val":"Piers Anthony",
          "count":98}
        ]
      }
    },
    {
      "val":"Mystery",
      "count":4322,
      "top_authors":{ // top authors in the "Mystery" genre
        "buckets":[{
          "val":"James Patterson",
          "count":146},
          ...
Can also be numeric/date ranges or functions like avg, sum, unique or percentile
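What a nested terms facet computes can be sketched in plain Python: count the outer field, then count the inner field within each outer bucket. The documents below are made-up placeholders and the function is a hypothetical illustration of the idea, not how Solr executes it.

```python
from collections import Counter, defaultdict

def nested_terms_facet(docs, outer, inner, sub_name, outer_limit, inner_limit):
    """Top outer values, each with its own top inner values."""
    outer_counts = Counter(d[outer] for d in docs)
    inner_counts = defaultdict(Counter)
    for d in docs:
        inner_counts[d[outer]][d[inner]] += 1
    buckets = []
    for val, count in outer_counts.most_common(outer_limit):
        sub = [{"val": v, "count": c}
               for v, c in inner_counts[val].most_common(inner_limit)]
        buckets.append({"val": val, "count": count,
                        sub_name: {"buckets": sub}})
    return {"buckets": buckets}

# Made-up sample data: (genre, author, number of books).
SAMPLE = [("Fantasy", "author_a", 3), ("Fantasy", "author_b", 2),
          ("Fantasy", "author_c", 1), ("Mystery", "author_d", 2),
          ("Mystery", "author_e", 1)]
BOOKS = [{"genre": g, "author": a} for g, a, n in SAMPLE for _ in range(n)]

TOP_GENRES = nested_terms_facet(BOOKS, "genre", "author", "top_authors", 5, 2)
```

The result has the same shape as the response above: a list of genre buckets, each carrying a `top_authors` sub-facet limited to two entries.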
16. Master-slave: high-QPS on static data
Diagram: the indexer sends docs to the master; the master replicates segments to slave1, slave2 and slave3; the searcher sends queries to the slaves.
Simple
Battle-tested
Index data only once
Slaves can cache like crazy
Separate roles ⇒ separate (read: optimized) hardware and configs
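The flow of docs, segments and queries can be sketched as a toy model. All class and method names are hypothetical, and the model ignores segment merges and deletes. It only shows that data is indexed once on the master, that slaves copy just the segments they are missing, and that queries are spread over the slaves.

```python
import itertools

class Master:
    """Indexes docs once; each batch becomes a new immutable segment."""
    def __init__(self):
        self.segments = {}  # segment name -> list of docs

    def index(self, docs):
        name = f"segment_{len(self.segments)}"
        self.segments[name] = list(docs)
        return name

class Slave:
    """Serves queries from its local copy of the master's segments."""
    def __init__(self):
        self.segments = {}

    def replicate(self, master):
        """Copy only the segments not held locally; return their names."""
        missing = [n for n in master.segments if n not in self.segments]
        for name in missing:
            self.segments[name] = list(master.segments[name])
        return missing

    def search(self, word):
        return [d for seg in self.segments.values() for d in seg
                if word in d.lower().split()]

class Searcher:
    """Spreads queries over the slaves round-robin."""
    def __init__(self, slaves):
        self._next = itertools.cycle(slaves)

    def search(self, word):
        return next(self._next).search(word)
```

A slave that has not replicated recently simply answers from older segments, which is acceptable for the static, high-QPS workloads this setup targets.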
19. In a nutshell
Typical use-cases
Product search (books, movies, bikes, weapons… anything that requires relevancy)
Time-series data (logs, metrics, social media...)
Search on top of (or as a source of) other Big Data tools (Spark, HDFS…)
Search on top of (or alongside) relational DBs

Typical challenges
Updates (though there’s WiP for numeric doc values in SOLR-5944)
Not really schema-less (schema can only be appended)
Doesn’t like sparse data (again, there’s ongoing work to make it better, see LUCENE-7253)
Some relational, stream and batch processing capabilities, but not the tool for those jobs