Search Engine-Building
with Lucene and Solr
Part 2
Kai Chan
SoCal Code Camp, November 2013
Overview
● indexing process
● searching process
● advanced features
● scaling/redundancy
● resources
● demo
● questions/answers
Indexing Process
● request handler
○ data are read to create documents

● update request processor chain
○ optional document-wide processing
○ fields can be added, changed, removed
○ analysis
○ creation of indexed and stored fields

● update handler
○ the index is updated
Update Request Processor Chain
● de-duplication
○ creates a signature (hash) for each document to be added
○ replaces (deletes) existing documents with the same signature
○ MD5Signature
■ exact hashing
○ Lookup3Signature
■ faster calculation and smaller hash than MD5
○ TextProfileSignature
■ fuzzy hashing, near-duplicate detection
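
A minimal sketch of signature-based de-duplication, assuming a plain dict stands in for the index. The function names and the choice of signature fields are hypothetical and are not Solr's API; only the idea (hash selected fields, replace any document with the same signature) comes from the slide.

```python
import hashlib

def md5_signature(doc, fields):
    """Exact hash over the chosen fields, in the spirit of MD5Signature."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()

def add_document(index, doc, fields=("title", "body")):
    """Add doc to a toy index keyed by signature.

    Returns True if an existing document with the same signature was replaced.
    """
    sig = md5_signature(doc, fields)
    replaced = sig in index
    index[sig] = doc
    return replaced

index = {}
add_document(index, {"id": 1, "title": "Solr", "body": "search server"})
add_document(index, {"id": 2, "title": "Solr", "body": "search server"})  # same content
add_document(index, {"id": 3, "title": "Lucene", "body": "search library"})
print(len(index))  # 2: document 2 replaced document 1
```

An exact hash only catches identical field values; near-duplicate detection needs a fuzzy signature instead.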
Update Request Processor Chain
● language detection
○ detects the language used in field(s)
○ adds a language field to the document
○ TikaLanguageIdentifierUpdateProcessorFactory
■ uses Apache Tika
○ LangDetectLanguageIdentifierUpdateProcessorFactory
■ uses language-detection library
○ external programs
■ e.g. Chromium Compact Language Detector
See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/languagedetection-with-googles-compact.html>
Analysis
● analyzed
○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”)
○ manipulation of tokens

● not analyzed
○ the whole content treated as 1 unit for searching

● analyzed vs. not analyzed
○ are individual tokens meaningful on their own?
○ are individual tokens used in queries?
Example 1: book title
field value: Lucene in Action, Second Edition: Covers Apache Lucene 3.0
● not tokenized: the whole title is a single unit
○ search for “Lucene”: no match
● tokenized: the title is broken into individual words
● makes more sense to tokenize

Example 2: ISBN
field value: 1-933-98817-7
● tokenized: 1 933 98817 7
○ search for “933”: match
● makes more sense to not tokenize
Analysis
analyzed:
● text

How about URL?

not analyzed:
● number
● serial number
● GUID
● checksum
Analysis
● character filter(s)
○ character replacement
○ e.g. replacing accented characters with their base forms
café → cafe
jalapeño → jalapeno

● tokenizer
● token filter(s)
Analysis
● character filter(s)
● tokenizer
○ create tokens (“words”) from characters
○ sometimes straightforward
○ many unusual cases:
e-mail address, URL, code, etc.

● token filter(s)
Analysis
● character filter(s)
● tokenizer
● token filter(s)
○ token replacement
■ change case, remove apostrophe
■ remove stop words (a, and, the, for)
■ split/join words (ice-cream, ice cream, icecream)
■ stemming (importing, imported → import)
■ synonym (nation → country)
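
The three analysis stages can be sketched as a small pipeline. This is a toy illustration in plain Python, not Lucene's analyzer API: the function names, the stop-word set, and the synonym map are assumptions taken from the examples on the slides, and stemming is left out.

```python
import re
import unicodedata

STOP_WORDS = {"a", "and", "the", "for"}
SYNONYMS = {"nation": "country"}

def char_filter(text):
    """Replace accented characters with their base forms (café -> cafe)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def tokenizer(text):
    """Create tokens from runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

def token_filters(tokens):
    """Lowercase, drop possessives and apostrophes, remove stop words, map synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok.endswith("'s"):
            tok = tok[:-2]
        tok = tok.replace("'", "")
        if not tok or tok in STOP_WORDS:
            continue
        out.append(SYNONYMS.get(tok, tok))
    return out

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("The café's jalapeño for a Nation"))
# ['cafe', 'jalapeno', 'country']
```

The same chain has to run on queries as well as on documents, otherwise a query for "café" would not match the indexed token "cafe".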
Field value:
Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free WiFi!

Tokens (text_general), positions 1-17:
let's sign up for the amazing so cal code camp at http bit.ly oZiZsu free wi fi

Tokens (text_en), positions 1-17 with gaps where tokens were removed:
let sign up amaz so cal code camp http bit.li ozizsu free wi fi

Tokens (text_en_splitting), positions 1-20:
let sign up amaz so cal code camp http bit ly o zi zsu free wi fi
additional joined tokens: socal, httpbitlyozizsu, wifi
Searching Process
● query parsing
● analysis
● scoring
● sorting
● loading of stored fields
● optional search components
○ faceting
○ term vector
○ More Like This
○ highlighting
Scoring
● for a given query, each document not filtered
out gets a score (float)
● higher score: higher in the results
● scoring algorithms
○ default: TF-IDF
○ other: Okapi BM25, etc.
○ very customizable

See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
Scoring - TF-IDF
● term frequency (TF)
○ how many times does this term appear in this
document?

● inverse document frequency (IDF)
○ how many documents contain this term?
○ score proportional to the inverse of document
frequency
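
A simplified sketch of TF-IDF scoring. The square-root damping of TF and the logarithmic IDF are assumptions loosely modeled on Lucene's classic similarity; coord, norm, and boosts are left out, so the numbers will not match real Lucene scores.

```python
import math

def tf_idf_scores(query_terms, docs):
    """Score each document (a list of tokens) against the query terms."""
    n_docs = len(docs)
    scores = []
    for tokens in docs:
        score = 0.0
        for term in query_terms:
            freq = tokens.count(term)                      # term frequency
            doc_freq = sum(1 for d in docs if term in d)   # document frequency
            tf = math.sqrt(freq)
            idf = 1.0 + math.log(n_docs / (doc_freq + 1))
            score += tf * idf
        scores.append(score)
    return scores

docs = [
    ["lucene", "in", "action"],
    ["solr", "in", "action"],
    ["lucene", "lucene", "search"],
]
scores = tf_idf_scores(["lucene"], docs)
print(scores)  # the third document scores highest, the second scores 0.0
```

A term that appears in fewer documents ("solr" here) gets a larger IDF than one that appears in many ("lucene"), so a single match on a rare term counts for more.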
Scoring - Other Factors
● coordination factor (coord)
○ documents that contain all or most query terms get higher scores

● normalizing factor (norm)
○ adjust for field length and query complexity
Scoring - Boost
● manual override: ask Lucene/Solr to give a
higher score to some particular thing(s)
● index-time
○ per document
○ per field (of a particular document)

● search-time
○ per query
More Like This
● finds documents similar in content (of one
field) to those matched
● constructs a query based on the highest
scoring terms in a document
● requires the field to:
○ have stored term vectors (recommended), or
○ be stored

Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
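
A rough sketch of the idea behind More Like This: pick the highest-weighted terms of one document and turn them into a query. The weighting formula, the function names, and the `text` field name are assumptions for illustration; this is not the MoreLikeThis implementation.

```python
import math
from collections import Counter

def interesting_terms(doc_tokens, corpus, max_terms=2):
    """Return the terms of doc_tokens with the highest TF-IDF-style weight."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)

    def weight(term):
        doc_freq = sum(1 for d in corpus if term in d)
        return counts[term] * (1.0 + math.log(n_docs / (doc_freq + 1)))

    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda t: (-weight(t), t))
    return ranked[:max_terms]

def more_like_this_query(doc_tokens, corpus, field="text", max_terms=2):
    """Build an OR query from the most interesting terms of one document."""
    terms = interesting_terms(doc_tokens, corpus, max_terms)
    return " OR ".join(f"{field}:{t}" for t in terms)

corpus = [
    ["solr", "search", "server", "search"],
    ["lucene", "search", "library"],
    ["jetty", "web", "server"],
]
print(more_like_this_query(corpus[0], corpus))
# text:search OR text:solr
```

The term counts used here are what stored term vectors provide directly; without them, the field's stored text has to be re-analyzed to obtain the counts.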
Spell Checking
● typos in queries happen
● returns spelling suggestions (if any) within the same response
● can also be used for auto-complete
○ treating a prefix as a spelling mistake
○ returning full words as suggestions
/select?q=text:"busness comunication"&spellcheck=true&wt=xml

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion">
        <str>business</str>
      </arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion">
        <str>communication</str>
      </arr>
    </lst>
  </lst>
</lst>
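
One common way to produce such suggestions is to look for dictionary words within a small edit distance of the misspelled word. The sketch below assumes a tiny hand-written dictionary and a hypothetical `suggest` helper; it is not how Solr's spell check component is implemented or configured.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def suggest(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or None if the word is known or nothing is close."""
    if word in dictionary:
        return None
    distance, best = min((edit_distance(word, w), w) for w in dictionary)
    return best if distance <= max_distance else None

dictionary = ["business", "communication", "busy", "community"]
print(suggest("busness", dictionary))       # business
print(suggest("comunication", dictionary))  # communication
```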
Query Elevation
● a.k.a. “sponsored search”
● make sure certain documents appear at the
top of the results for a certain query
Credit: Google Web Search <http://www.google.com/>
Query Elevation
● configure the elevator search component
in solrconfig.xml
● in elevate.xml, specify the queries and
the list of documents (by id) to elevate or
exclude
● enable query elevation:
enableElevation=true
● (optional) override the sort parameter:
forceElevation=true
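
The effect of elevation on a result list can be sketched as a reordering step. The `apply_elevation` function is hypothetical, and placing elevated ids first whether or not they matched the query is an assumption of this sketch, not a statement about the elevator component's internals.

```python
def apply_elevation(results, elevate_ids=(), exclude_ids=()):
    """Put elevated ids first (in the given order) and drop excluded ids."""
    excluded = set(exclude_ids)
    elevated = [doc_id for doc_id in elevate_ids if doc_id not in excluded]
    pinned = set(elevated)
    rest = [doc_id for doc_id in results
            if doc_id not in excluded and doc_id not in pinned]
    return elevated + rest

ranked = ["doc1", "doc2", "doc3", "doc4"]
print(apply_elevation(ranked, elevate_ids=["doc3"], exclude_ids=["doc2"]))
# ['doc3', 'doc1', 'doc4']
```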
Function Query
● like formulas in Excel
● apply functions to field values for filtering
and scoring
Function Query
● query:
q={!func} cos(angle)
● query (range):
q={!frange l=0.5 u=1} cos(angle)
● field:
fl=angle,cos(angle)
● sort:
sort=cos(angle) desc
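
What the range filter and the sort amount to can be sketched outside Solr. The documents and the `frange` helper below are made up for illustration, and the sketch assumes `angle` is stored in radians.

```python
import math

def frange(docs, func, lower, upper):
    """Keep documents whose function value lies in [lower, upper]."""
    return [d for d in docs if lower <= func(d) <= upper]

def cos_angle(doc):
    return math.cos(doc["angle"])

docs = [
    {"id": 1, "angle": 0.0},  # cos = 1.0
    {"id": 2, "angle": 1.0},  # cos is about 0.54
    {"id": 3, "angle": 2.0},  # cos is negative
]

matched = frange(docs, cos_angle, 0.5, 1.0)               # like {!frange l=0.5 u=1}
ranked = sorted(matched, key=cos_angle, reverse=True)     # like sort=cos(angle) desc
print([d["id"] for d in ranked])  # [1, 2]
```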
Spatial Search
● data: contains locations
(longitudes, latitudes)
○ e.g. merchants with store locations

● search: filter and/or sort by location
Credit: Google Maps <http://maps.google.com/>
Spatial Search
● geofilt
○ circle centered at a given point
○ distance from a given point
○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5

● bbox
○ square (“bounding box”) centered at a given point
○ distance from a given point + corners
○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
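
The difference between the two filters can be sketched with a great-circle distance and a simple latitude/longitude box. The Earth radius constant, the box approximation, and the function names are assumptions of this sketch, not Solr's spatial implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def in_geofilt(point, center, d_km):
    """Circle: the point is within d_km of the center."""
    return haversine_km(*point, *center) <= d_km

def in_bbox(point, center, d_km):
    """Square: the point is within d_km of the center along each axis."""
    lat, lon = point
    clat, clon = center
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(d_km / (EARTH_RADIUS_KM * math.cos(math.radians(clat))))
    return abs(lat - clat) <= dlat and abs(lon - clon) <= dlon

center = (45.15, -93.85)
near = (45.16, -93.85)    # about 1 km north of the center
corner = (45.19, -93.79)  # near a corner of the box, more than 5 km away

print(in_geofilt(near, center, 5), in_bbox(near, center, 5))      # True True
print(in_geofilt(corner, center, 5), in_bbox(corner, center, 5))  # False True
```

The box test is cheaper than the circle test but also accepts points near the corners that lie beyond the requested distance.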
[Figure: geofilt (circle of radius 5 km) and bbox (square) around the point (45.15, -93.85), with sample points marked o and x]
Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
Spatial Search
● geodist
○ returns the distance between the location given in a
field and a certain coordinate
○ e.g. sort by ascending distance from (45.15,-93.85),
and return the distances as the score:
q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc

Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
Scaling/Redundancy - Problems
● collection too large for a single machine
● too many requests for a single machine
● a machine can go down
Scaling/Redundancy - Solutions
● collection too large for a single machine
○ distribution
■ spread the collection across multiple machines

● too many requests for a single machine
○ distribution
■ spread the requests across multiple machines

● a machine can go down
○ replication
■ copy data and configuration across multiple
machines
■ make sure there is no single point of failure
SolrCloud
● Solr instances
● ZooKeeper instances
SolrCloud
● Solr instances
○ collection (logical index) divided into one or more
partial collections (“shards”)
○ for each shard, one or more Solr instances keep
copies of the data
■ one as leader - handles reads and writes
■ others as replicas - handle reads

● ZooKeeper instances
SolrCloud
● Solr instances
● ZooKeeper instances
○ management of Solr instances
○ leader election
○ node discovery
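
A toy sketch of the two ideas: documents are routed to shards by a stable hash of their id, and a shard keeps serving as long as one of its instances is live. CRC32 modulo the shard count and "first live instance becomes leader" are stand-ins chosen for brevity, and the instance names are made up; SolrCloud's actual routing and ZooKeeper-based election work differently in detail.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Map a document id to a shard number with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def current_leader(instances):
    """Return the name of the first live instance of a shard, or None if all are down."""
    for inst in instances:
        if inst["live"]:
            return inst["name"]
    return None

shard_instances = [
    {"name": "solr-a", "live": True},
    {"name": "solr-b", "live": True},
    {"name": "solr-c", "live": True},
]

print(shard_for("doc-42", 3))            # always the same shard for the same id
print(current_leader(shard_instances))   # solr-a
shard_instances[0]["live"] = False       # the leader goes offline
print(current_leader(shard_instances))   # solr-b
```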
[Figure sequence (4 slides): a collection (i.e. logical index) split into three shards, each holding ⅓ of the collection; each shard has one leader and several replicas; one slide shows an instance marked (offline)]
Resources - Books
● Lucene in Action
○ written by 3 committers and PMC members
○ somewhat outdated (2010; covers Lucene 3.0)
○ http://www.manning.com/hatcher3/

● Solr in Action
○ early access; coming out later this year
○ http://www.manning.com/grainger/

● Apache Solr 4 Cookbook
○ common problems and useful tips
○ http://www.packtpub.com/apache-solr-4cookbook/book
Resources - Books
● Introduction to Information Retrieval
○ not specific to Lucene/Solr, but about IR concepts
○ free e-book
○ http://nlp.stanford.edu/IR-book/

● Managing Gigabytes
○ indexing, compression and other topics
○ accompanied by MG4J - a full-text search software
○ http://mg4j.di.unimi.it/
Resources - Web
● official websites
○ Lucene Core - http://lucene.apache.org/core/
○ Solr - http://lucene.apache.org/solr/

● mailing lists
● Wiki sites
○ Lucene Core - http://wiki.apache.org/lucene-java/
○ Solr - http://wiki.apache.org/solr/

● reference guides
○ API Documentation for Lucene and Solr
○ Apache Solr Reference Guide
Getting Started
● download Solr
○ requires Java 6 or newer to run

● Solr comes bundled/configured with Jetty
○ <Solr directory>/example/start.jar

● "exampledocs" directory contains sample
documents
○ <Solr directory>/example/exampledocs/post.jar
○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml

● use the Solr admin interface
○ http://localhost:8983/solr/

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)

  • 1. Search Engine-Building with Lucene and Solr Part 2 Kai Chan SoCal Code Camp, November 2013
  • 2. Overview ● indexing process ● searching process ● advanced features ● scaling/redundancy ● resources ● demo ● questions/answers
  • 3. Indexing Process ● request handler ○ data are read to create documents ● update request processor chain ○ optional document-wide processing ○ fields can be added, changed, removed ○ analysis ○ creation of indexed and stored fields ● update handler ○ the index is updated
  • 4. Update Request Processor Chain ● de-duplication ○ creates a signature (hash) for each document to be added ○ replaces (deletes) existing documents with the same signature ○ MD5Signature ■ exact hashing ○ Lookup3Signature ■ faster calculation and smaller hash than MD5 ○ TextProfileSignature ■ fuzzy hashing, near-duplicate detection
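The de-duplication step above can be sketched in a few lines of Python. This is a minimal illustration of exact-hash signatures, not Solr's implementation: the `signature` and `add_with_dedup` helpers and the dict-based index are hypothetical names introduced only for this example.

```python
import hashlib


def signature(doc, fields):
    """Exact-hash signature: MD5 over the values of the chosen fields."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def add_with_dedup(index, doc, fields):
    """Add doc to index (a dict keyed by signature).

    Returns True if an existing document with the same signature was replaced.
    """
    sig = signature(doc, fields)
    replaced = sig in index
    index[sig] = doc
    return replaced


index = {}
add_with_dedup(index, {"id": "1", "title": "Lucene in Action"}, ["title"])
add_with_dedup(index, {"id": "2", "title": "Lucene in Action"}, ["title"])
print(len(index))  # 1: the second document replaced the first
```

A fuzzy signature such as TextProfileSignature would replace the hash function here with one that tolerates small differences in the text; that part is not shown.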
  • 5. Update Request Processor Chain ● language detection ○ detects the language used in field(s) ○ adds a language field to the document ○ TikaLanguageIdentifierUpdateProcessorFactory ■ uses Apache Tika ○ LangDetectLanguageIdentifierUpdateProcessorFactory ■ uses language-detection library ○ external programs ■ e.g. Chromium Compact Language Detector See Also: Language detection with Google's Compact Language Detector <http://blog.mikemccandless.com/2011/10/language-detection-with-googles-compact.html>
  • 6. Analysis ● analyzed ○ tokenization, i.e. breaking down the content to be searched into smaller units (“tokens”) ○ manipulation of tokens ● not analyzed ○ the whole content treated as 1 unit for searching ● analyzed vs. not analyzed ○ are individual tokens meaningful on their own? ○ are individual tokens used in queries?
  • 7. Example 1: book title ● Lucene in Action, Second Edition: Covers Apache Lucene 3.0 ● not tokenized: search for “Lucene”: no match ● makes more sense to tokenize Example 2: ISBN ● 1-933-98817-7 ● tokenized (1 933 98817 7): search for “933”: match ● makes more sense to not tokenize
  • 8. Analysis ● analyzed: ○ text ● not analyzed: ○ number ○ serial number ○ GUID ○ checksum ● How about URL?
  • 9. Analysis ● character filter(s) ○ character replacement ○ e.g. accent marks with their base forms café → cafe jalapeño → jalapeno ● tokenizer ● token filter(s)
  • 10. Analysis ● character filter(s) ● tokenizer ○ create tokens (“words”) from characters ○ sometimes straightforward ○ many unusual cases: e-mail address, URL, code, etc. ● token filter(s)
  • 11. Analysis ● character filter(s) ● tokenizer ● token filter(s) ○ token replacement ■ change case, remove apostrophe ■ remove stop words (a, and, the, for) ■ split/join words (ice-cream, ice cream, icecream) ■ stemming (importing, imported → import) ■ synonym (nation → country)
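The three analysis stages (character filter, tokenizer, token filters) can be sketched as a small pipeline. This is an illustrative toy, not Lucene's analyzers: the function names are hypothetical, the tokenizer is a naive regular expression, and stemming and word splitting/joining are left out.

```python
import re
import unicodedata

STOP_WORDS = {"a", "and", "the", "for"}  # the stop words listed on the slide
SYNONYMS = {"nation": "country"}         # the synonym example from the slide


def char_filter(text):
    """Replace accented characters with their base forms (café -> cafe)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text):
    """Naive tokenizer: runs of ASCII letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def token_filters(tokens):
    """Lowercase, remove apostrophes, drop stop words, map synonyms."""
    out = []
    for tok in tokens:
        tok = tok.lower().replace("'", "")
        if not tok or tok in STOP_WORDS:
            continue
        out.append(SYNONYMS.get(tok, tok))
    return out


def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filters."""
    return token_filters(tokenize(char_filter(text)))


print(analyze("The café and the jalapeño for a nation"))
# ['cafe', 'jalapeno', 'country']
```

The same chain has to run on both the indexed content and the query text; otherwise a query for "café" would not match the indexed token "cafe".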
  • 12. Field value: Let's sign up for the amazing So-Cal Code Camp® at http://bit.ly/oZiZsu. Free WiFi! Tokens (text_general): let's sign up for the amazing so cal code camp at http bit.ly oZiZsu free wi fi Tokens (text_en): let sign up amaz so cal code camp http bit.li ozizsu free wi fi Tokens (text_en_splitting): let sign up amaz so cal code camp socal http bit ly o zi zsu httpbitlyozizsu free wi fi wifi
  • 13. Searching Process ● query parsing ● analysis ● scoring ● sorting ● loading of stored fields ● optional search components ○ faceting ○ term vector ○ More Like This ○ highlighting
  • 14. Scoring ● for a given query, each document not filtered out gets a score (float) ● higher score: higher in the results ● scoring algorithms ○ default: TF-IDF ○ other: Okapi BM25, etc. ○ very customizable See Also: Lucene/Solr Revolution 2013 presentation “Beyond TF-IDF: Why, What and How”
  • 15. Scoring - TF-IDF ● term frequency (TF) ○ how many times does this term appear in this document? ● inverse document frequency (IDF) ○ how many documents contain this term? ○ score proportional to the inverse of document frequency
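To make the two factors concrete, here is a small sketch that scores tokenized documents with a textbook TF-IDF variant (raw term count times log(N / document frequency)). This is an assumption chosen for clarity; Lucene's actual similarity uses different damping and the additional factors described on the next slide.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each tokenized document as the sum over query terms of tf * idf."""
    n = len(docs)
    df = Counter()  # document frequency: number of documents containing a term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        score = sum(tf[t] * math.log(n / df[t]) for t in query_terms if df[t])
        scores.append(score)
    return scores


docs = [
    ["lucene", "in", "action"],
    ["solr", "in", "action"],
    ["lucene", "lucene", "solr"],
]
print(tf_idf_scores(["lucene"], docs))
```

The third document scores highest because "lucene" appears in it twice; the second scores zero because the term is absent. A term that occurs in every document gets an IDF of log(1) = 0 and contributes nothing.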
  • 16. Scoring - Other Factors ● coordination factor (coord) ○ documents that contain all or most query terms get higher scores ● normalizing factor (norm) ○ adjusts for field length and query complexity
  • 17. Scoring - Boost ● manual override: ask Lucene/Solr to give a higher score to some particular thing(s) ● index-time ○ per document ○ per field (of a particular document) ● search-time ○ per query
  • 18. More Like This ● finds documents similar in content (of one field) to those matched ● constructs a query based on the highest scoring terms in a document ● requires the field to: ○ have stored term vectors (recommended), or ○ be stored Credit: How MoreLikeThis Works in Lucene <http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/>
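The idea of building a query from a document's highest-scoring terms can be sketched as follows. This is a simplified illustration, not the MoreLikeThis implementation: the weighting (term count times a smoothed IDF), the `more_like_this_query` helper, and the `text` field name are all assumptions made for the example.

```python
import math
from collections import Counter


def more_like_this_query(doc_tokens, corpus, field="text", max_terms=2):
    """Build an OR query from the highest-weighted terms of one document."""
    n = len(corpus)
    df = Counter()
    for d in corpus:
        df.update(set(d))
    tf = Counter(doc_tokens)

    def weight(term):
        # Smoothed IDF so unseen terms do not divide by zero.
        return tf[term] * math.log((n + 1) / (df[term] + 1))

    # Highest weight first; ties broken alphabetically for a stable result.
    top = sorted(tf, key=lambda t: (-weight(t), t))[:max_terms]
    return " OR ".join(f"{field}:{t}" for t in top)


corpus = [
    ["solr", "search", "engine"],
    ["lucene", "search", "library"],
    ["solr", "cloud", "search", "solr"],
]
print(more_like_this_query(corpus[2], corpus))
# text:cloud OR text:solr
```

"search" is dropped because it occurs in every document and therefore says nothing about what makes this document distinctive.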
  • 19. Spell Checking ● typos in queries happen ● returns spell checking suggestion (if any) within the same result ● can also be used for auto-complete ○ treating a prefix as a spelling mistake ○ returning full words as suggestions
  • 20. /select?q=text:"busness comunication"&spellcheck=true&wt=xml <lst name="spellcheck"> <lst name="suggestions"> <lst name="busness"> <int name="numFound">1</int> <int name="startOffset">6</int> <int name="endOffset">13</int> <arr name="suggestion"> <str>business</str> </arr> </lst> <lst name="comunication"> <int name="numFound">1</int> <int name="startOffset" >14</int> <int name="endOffset">26</int> <arr name="suggestion"> <str>communication</str> </arr> </lst> </lst> </lst>
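A client has to pull the suggestions out of that response. The sketch below parses the XML fragment shown on the slide with Python's standard library; the `suggestions` helper is a hypothetical name, and a real response would wrap this fragment in a larger `<response>` element.

```python
import xml.etree.ElementTree as ET

# The spellcheck section from the slide, reproduced as a standalone fragment.
RESPONSE_FRAGMENT = """
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="busness">
      <int name="numFound">1</int>
      <int name="startOffset">6</int>
      <int name="endOffset">13</int>
      <arr name="suggestion"><str>business</str></arr>
    </lst>
    <lst name="comunication">
      <int name="numFound">1</int>
      <int name="startOffset">14</int>
      <int name="endOffset">26</int>
      <arr name="suggestion"><str>communication</str></arr>
    </lst>
  </lst>
</lst>
"""


def suggestions(xml_text):
    """Map each misspelled word to its list of suggested corrections."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.findall("./lst[@name='suggestions']/lst"):
        words = [s.text for s in entry.findall("./arr[@name='suggestion']/str")]
        result[entry.get("name")] = words
    return result


print(suggestions(RESPONSE_FRAGMENT))
# {'busness': ['business'], 'comunication': ['communication']}
```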
  • 21. Query Elevation ● a.k.a. “sponsored search” ● make sure certain documents appear at the top of the results for a certain query
  • 22. Credit: Google Web Search <http://www.google.com/>
  • 23. Query Elevation ● configure the elevator search component in solrconfig.xml ● in elevate.xml, specify the queries and the list of documents (by id) to elevate or exclude ● enable query elevation: enableElevation=true ● (optional) override the sort parameter: forceElevation=true
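The effect of elevation on a result list can be sketched in plain Python. This only illustrates the reordering; it is not how the elevator component is implemented, and the `apply_elevation` helper and the document ids are hypothetical. In this sketch, elevated ids are placed at the top whether or not they were in the original results.

```python
def apply_elevation(results, elevate=(), exclude=()):
    """Put elevated ids first (in configured order) and drop excluded ids."""
    excluded = set(exclude)
    elevated = [doc_id for doc_id in elevate if doc_id not in excluded]
    pinned = set(elevated)
    rest = [d for d in results if d not in excluded and d not in pinned]
    return elevated + rest


ranked = ["doc7", "doc3", "doc9", "doc1"]  # ids ordered by score
print(apply_elevation(ranked, elevate=["doc1", "doc9"], exclude=["doc3"]))
# ['doc1', 'doc9', 'doc7']
```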
  • 24. Function Query ● like formulas in Excel ● apply functions to field values for filtering and scoring
  • 25. Function Query ● query: q={!func} cos(angle) ● query (range): q={!frange l=0.5 u=1} cos(angle) ● field: fl=angle,cos(angle) ● sort: sort=cos(angle) desc
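Function queries contain braces, spaces, and parentheses, so they need URL encoding when sent over HTTP. The sketch below builds the query string for the range, field list, and sort examples above using only the standard library; `function_query_params` is a hypothetical helper, and `angle` is the field name from the slide.

```python
from urllib.parse import parse_qs, urlencode


def function_query_params(field="angle"):
    """Parameters for filtering, returning, and sorting by cos(field)."""
    func = f"cos({field})"
    return {
        "q": "{!frange l=0.5 u=1}" + func,  # keep documents with 0.5 <= cos <= 1
        "fl": f"{field},{func}",            # return the field and the function value
        "sort": f"{func} desc",             # highest function value first
    }


query_string = urlencode(function_query_params())
print("/select?" + query_string)

# Decoding the string gives back the original parameter values.
print(parse_qs(query_string)["q"][0])
```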
  • 26. Spatial Search ● data: contains locations (longitudes, latitudes) ○ e.g. merchants with store locations ● search: filter and/or sort by location
  • 27. Credit: Google Maps <http://maps.google.com/>
  • 28. Spatial Search ● geofilt ○ circle centered at a given point ○ distance from a given point ○ fq={!geofilt sfield=store}&pt=45.15,-93.85&d=5 ● bbox ○ square (“bounding box”) centered at a given point ○ distance from a given point + corners ○ fq={!bbox sfield=store}&pt=45.15,-93.85&d=5 Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • 29. [Figure: geofilt (circle) and bbox (square), each 5 km around (45.15, -93.85)] Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • 30. [Figure: the geofilt circle and bbox square, each 5 km around (45.15, -93.85), with sample points marked o and x] Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
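The difference between the two filters shows up near the corners of the box: a point there is inside the bounding box but farther than the given distance from the center. The sketch below illustrates this with a haversine distance and a simple latitude/longitude box. It is an approximation for illustration only, not Solr's geometry code; the helper names, the mean Earth radius, and the sample corner point are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed for this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_geofilt(point, center, d_km):
    """Circle test: the point is within d_km of the center."""
    return haversine_km(*point, *center) <= d_km


def in_bbox(point, center, d_km):
    """Box test: the point is within d_km north/south and east/west of the center."""
    lat, lon = point
    clat, clon = center
    dlat = math.degrees(d_km / EARTH_RADIUS_KM)
    dlon = math.degrees(d_km / (EARTH_RADIUS_KM * math.cos(math.radians(clat))))
    return abs(lat - clat) <= dlat and abs(lon - clon) <= dlon


center = (45.15, -93.85)
corner = (45.15 + 0.04, -93.85 + 0.055)  # a point near the box's north-east corner

print(in_bbox(corner, center, 5))     # True: inside the square
print(in_geofilt(corner, center, 5))  # False: more than 5 km from the center
```

The box test needs only comparisons on latitude and longitude, which is why a bounding box is typically cheaper to evaluate than an exact circle.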
  • 31. Spatial Search ● geodist ○ returns the distance between the location given in a field and a certain coordinate ○ e.g. sort by ascending distance from (45.15,-93.85), and return the distances as the score: q={!func}geodist()&sfield=store&pt=45.15,-93.85&sort=score+asc Credit: Apache Solr Reference Guide 4.5 <http://lucene.apache.org/>
  • 32. Scaling/Redundancy - Problems ● collection too large for a single machine ● too many requests for a single machine ● a machine can go down
  • 33. Scaling/Redundancy - Solutions ● collection too large for a single machine ○ distribution ■ spread the collection across multiple machines ● too many requests for a single machine ○ distribution ■ spread the requests across multiple machines ● a machine can go down ○ replication ■ copy data and configuration across multiple machines ■ make sure no single point of failure
  • 34. SolrCloud ● Solr instances ● ZooKeeper instances
  • 35. SolrCloud ● Solr instances ○ collection (logical index) divided into one or more partial collections (“shards”) ○ for each shard, one or more Solr instances keep copies of the data ■ one as leader - handles reads and writes ■ others as replicas - handle reads ● ZooKeeper instances
  • 36. SolrCloud ● Solr instances ● ZooKeeper instances ○ management of Solr instances ○ leader election ○ node discovery
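The division of labor between leaders and replicas can be sketched with a toy model. This is not SolrCloud code: the `Shard` class, the instance names, and the CRC32-based document routing are assumptions for illustration, and the "election" here simply promotes the next live instance, whereas SolrCloud coordinates this through ZooKeeper.

```python
import random
import zlib


class Shard:
    """One shard: the first live instance acts as leader, the rest as replicas."""

    def __init__(self, instances):
        self.live = list(instances)

    @property
    def leader(self):
        return self.live[0]

    def instance_for_write(self):
        """Writes always go to the leader."""
        return self.leader

    def instance_for_read(self, rng=random):
        """Reads can be served by the leader or any replica."""
        return rng.choice(self.live)

    def mark_down(self, instance):
        """Remove a failed instance; if it was the leader, the next one takes over."""
        self.live.remove(instance)


def shard_for(doc_id, shards):
    """Pick a shard from a hash of the document id (routing scheme assumed)."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


# Three shards with one leader and two replicas each.
shards = [Shard([f"shard{s}-node{n}" for n in range(1, 4)]) for s in range(1, 4)]

target = shard_for("doc-42", shards)
print("write goes to", target.instance_for_write())

target.mark_down(target.leader)  # simulate the leader going offline
print("after failure, leader is", target.leader)
```

With at least one replica per shard, losing any single Solr instance leaves every part of the collection readable and writable.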
  • 37. collection (i.e. logical index) shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection leader replica replica leader replica replica leader replica replica replica
  • 38. collection (i.e. logical index) shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection leader replica replica replica leader replica replica replica leader replica replica
  • 39. collection (i.e. logical index) shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection leader replica replica replica (offline) leader replica replica leader replica replica
  • 40. collection (i.e. logical index) shard 1: ⅓ of the collection shard 2: ⅓ of the collection shard 3: ⅓ of the collection leader replica replica replica replica leader replica replica leader replica replica
  • 41. Resources - Books ● Lucene in Action ○ written by 3 committers and PMC members ○ somewhat outdated (2010; covers Lucene 3.0) ○ http://www.manning.com/hatcher3/ ● Solr in Action ○ early access; coming out later this year ○ http://www.manning.com/grainger/ ● Apache Solr 4 Cookbook ○ common problems and useful tips ○ http://www.packtpub.com/apache-solr-4-cookbook/book
  • 42. Resources - Books ● Introduction to Information Retrieval ○ not specific to Lucene/Solr, but about IR concepts ○ free e-book ○ http://nlp.stanford.edu/IR-book/ ● Managing Gigabytes ○ indexing, compression and other topics ○ accompanied by MG4J - a full-text search software ○ http://mg4j.di.unimi.it/
  • 43. Resources - Web ● official websites ○ Lucene Core - http://lucene.apache.org/core/ ○ Solr - http://lucene.apache.org/solr/ ● mailing lists ● Wiki sites ○ Lucene Core - http://wiki.apache.org/lucene-java/ ○ Solr - http://wiki.apache.org/solr/ ● reference guides ○ API Documentation for Lucene and Solr ○ Apache Solr Reference Guide
  • 44. Getting Started ● download Solr ○ requires Java 6 or newer to run ● Solr comes bundled/configured with Jetty ○ <Solr directory>/example/start.jar ● "exampledocs" directory contains sample documents ○ <Solr directory>/example/exampledocs/post.jar ○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml ● use the Solr admin interface ○ http://localhost:8983/solr/