Apache Solr
Masterclass
From zero to hero
June 2014
www.slideshare.net/arafalov/solr-masterclass-bangkok-june-2014
2
Alexandre Rafalovitch
www.outerthoughts.com
Web search engines are quite sophisticated
3
4
But the real search needs are much DEEPER and BROADER
5
Searching code
6
Searching people and companies
7
Searching products
8
Searching library material
9
Searching languages
10
Understanding full-text search
SELECT * FROM database WHERE field LIKE '%word%'
This DOES NOT scale
Instead:
  break text into tokens
  apply domain-specific processing (e.g. lower-casing)
  build fast-access structures
  use algorithms for term, phrase, and proximity search
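The contrast between a LIKE scan and a token-based index can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene is implemented; the function names (tokenize, build_index, search) and the sample documents are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, keyed by id.
docs = {
    1: "Solr is a search server",
    2: "Lucene is a search library",
    3: "A database stores rows",
}
index = build_index(docs)
print(search(index, "SEARCH library"))  # {2}
```

Each lookup touches only the entries for the query tokens, instead of scanning every row as LIKE '%word%' does.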
11
Basic search engine features
Search (Duh!): keyword, phrase, field-specific
Positive and negative terms
Sort: relevancy, recency
Pagination
Compact summary in results
SPEED
12
Advanced search engine features
Facet/taxonomy-based navigation with live counts
Language-specific processing
Domain-specific text processing (WiFi = Wi-Fi = WIFI)
Geographic search
More-like-this, did-you-mean, autocomplete
Scaling/Clustering
NOT web crawling - different, but related
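The "WiFi = Wi-Fi = WIFI" point above boils down to mapping spelling variants to one indexed form. A minimal sketch, assuming that lower-casing and dropping hyphens is enough for this particular term (a real setup would do this inside an analyzer chain; the normalize function is hypothetical):

```python
def normalize(term):
    """Lower-case a term and drop hyphens so spelling variants collapse."""
    return term.lower().replace("-", "")


variants = ["WiFi", "Wi-Fi", "WIFI"]
print({normalize(v) for v in variants})  # {'wifi'}
```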
13
Search engine solutions?
Solr
Elasticsearch
Xapian
Sphinx
Groonga
Searchdaimon
{F}lexSearch
Algolia (SaaS)
Searchify (SaaS)
ForageJS
Lunr.js
FACT-Finder
DtSearch
MarkLogic
Verity
Fast
Most databases
…AND MORE
14
Used with permission from SemaText
Open Source Search Evolution
15
Secret Ingredient - Lucene
Solr
Elasticsearch
SwiftType
Galene (LinkedIn's)
PyLucene (Python wrapper)
Lucene.net (C# port)
Scalable, high-performance indexing
Incremental indexing
Full-text search
Information-retrieval algorithms
Implemented in Java
Written in 1999, still going strong
16
Secret Ingredient - Solr
Certified distributions
  LucidWorks
  HelioSearch
Big Data platforms
  Cloudera
  Hortonworks HDP
Hosted and SaaS
  Amazon CloudSearch
  WebSolr, SolrHQ, SearchBox
Lucene full-text search
XML and REST config
Schema/Schemaless
SolrCloud (clustering)
Caching
Near real-time
Rich-document indexing (Tika inside)
Plugins, components, processors
17
Solr Ecosystem sample
Drupal
Project Blacklight
LuxDB
SolrMeter
CrafterCMS
TYPO3
Magento
HippoCMS
ColdFusion
SolrNet
DataStax
Dovecot
NGData Lily
Basho Riak
YaCy
Apache ManifoldCF
Apache Camel
Franz AllegroGraph
BitNami Solr Stack
Carrot2
Broadleaf Commerce
Cloudera CDK
CodeLibs Fess (フェス)
Splunk
Alfresco
Rosette by BasisTech
Luwak by Flax
Quepid by OSC
TwigKit
SPM by SemaText
SILK by LucidWorks
Banana (O/S Solr Kibana)
18
DEMO Time
19
DEMO - Basic
Unzip
Go to example directory
Run Solr
Import some documents from example docs
grep -l store *.xml | xargs ./post.sh
Show off Solr 4 admin panel
20
DEMO - Browse handler
Restart Solr with -Dsolr.clustering.enabled=true
Visit http://localhost:8983/solr/browse/
Show off
  Search
  Facets - Categories and Ranges
  Spatial/Geo-distance
  Clusters
21
Getting into Solr
22
Start for free
Download, unzip, cd example; java -jar start.jar
Go through the basic tutorial in docs/tutorial.html
Copy the example directory, modify schema.xml until happy
If coming from Elasticsearch, look at example-schemaless
Do NOT follow this path to production
The example schema is a kitchen sink! Read it as a story.
<solr>/example/solr/collection1/conf/{schema.xml|solrconfig.xml}
23
Simplest Solr - directory layout
solr-home - point here with -Dsolr.solr.home
  collection1 - default collection name, without solr.xml
    conf - configuration directory for the collection
      schema.xml - defines fields and types
      solrconfig.xml - defines low-level configuration, but also components, handlers, and chains for UpdateRequestProcessor
24
Simplest Solr - schema.xml
<?xml version="1.0" encoding="UTF-8" ?>
<schema version="1.5" name="simplest-solr">
  <fieldType name="string" class="solr.StrField"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
  <uniqueKey>id</uniqueKey>
</schema>
25
Simplest Solr - solrconfig.xml
<?xml version="1.0" encoding="UTF-8" ?>
<config>
<luceneMatchVersion>LUCENE_4_9</luceneMatchVersion>
<requestDispatcher handleSelect="false">
<httpCaching never304="true" />
</requestDispatcher>
<requestHandler name="/select" class="solr.SearchHandler" />
<requestHandler name="/update" class="solr.UpdateRequestHandler" />
<requestHandler name="/admin" class="solr.admin.AdminHandlers" />
<requestHandler name="/analysis/field"
class="solr.FieldAnalysisRequestHandler" startup="lazy" />
</config>
26
DEMO
https://github.com/arafalov/simplest-solr-config
java -Dsolr.solr.home=…./simplest-solr -jar start.jar
Go to <solr>/example/exampledocs
grep -l store *.xml | xargs ./post.sh (same as before)
Check Admin UI
Query - same, but different (multivalue, date)
Schema browser
27
Lots of things missing
Some admin UI items disabled (Ping, Files)
No Near-Real-Time or atomic/partial update
No types (apart from String)
No dynamic schema
No SolrCloud
DOES NOT MATTER. NOT YET!
28
Two ways of learning
You can follow a path (going forward)
  A tutorial
  A book
  Learn what it teaches
You can reach for the goal (going backwards)
  Have an idea
  Try to achieve it
  Learn what's on the critical path
Both are valuable. The second is harder, but gives you more.
29
Goal-driven Solr
1. Start with the simplest configuration that works
2. Get something in (import data)
3. Get something out (display data)
4. Celebrate!
5. Decide/Fine-tune what/how you want to find things
6. Change the schema to match
7. Change the import/display to match
8. GOTO 5 (never really stops)
30
Getting data in
curl
post.jar (in example/exampledocs); try "java -jar post.jar -h" for help
Admin UI (core/Documents)
Clients (SolrJ, among 33 at various levels of support: https://leanpub.com/solr-clients/)
Formats: XML, JSON, CSV, other formats (processed with Tika)
DataImportHandler to pull data from external sources
Big Data connectors (Hadoop, Flume, etc.)
Big Data integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS)
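As one illustration of the "clients" route, a JSON update request can be prepared with nothing but the Python standard library. This is a sketch under assumptions: the URL assumes a local Solr 4 with the default collection1 core and an /update handler that accepts JSON by content type; the document fields and the build_update_request helper are hypothetical. The request is only built here, not sent.

```python
import json
from urllib.request import Request

# Assumed URL: local Solr 4, default collection1 core, commit right away.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(docs):
    """Build (but do not send) a JSON update request for a list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document; only "id" is required by the simplest schema.
request = build_update_request([{"id": "book-1", "title": "Solr basics"}])
print(request.get_method(), request.full_url)
# To actually send it, pass the request to urllib.request.urlopen().
```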
31
Getting data out
curl
Web browser
Admin UI (core/Query)
Clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV)
UI toolkits (Cloudera HUE, TwigKit)
Internal post-processors (we saw VelocityResponseWriter at /browse)
Needs middleware or a strong proxy - not secure otherwise
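Getting data out is, at its simplest, an HTTP GET against the /select handler. The sketch below only builds the URL; it assumes a local Solr with the default collection1 core, and the build_query_url helper and the sample query are made up for the example. The q, rows and wt parameters are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumed URL: local Solr, default collection1 core.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_query_url(query, rows=10, writer="json"):
    """Return a /select URL asking for `rows` results in the given format."""
    params = {"q": query, "rows": rows, "wt": writer}
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_query_url("name:ipod", rows=5))
```

In line with the last point above, a URL like this is meant to be called from middleware, not exposed to end users directly.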
32
Celebrate!
You achieved a basic end-to-end test
You got Solr running
You figured out how to display it
You now know where the issues are
FIX THOSE NEXT
33
Fine-tune schema
Solr is not friends with your data; it's here to get your documents found.
<field name="features" stored="true" indexed="true" type="text_general" multiValued="true"/>
stored=true - that's for you
indexed=true - that's for Solr, where the magic happens
type="type_name" - defines which analyzer chain to use
See Admin UI core/Analysis
See http://www.solr-start.com/info/analyzers/ for full list
34
Analyzers - English
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    ….
  </analyzer>
  ….
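To get a feel for what such a chain does to text, here is a rough Python imitation that applies the same steps in the same order. Everything in it is a simplification: the stop-word and protected-word sets stand in for stopwords_en.txt and protwords.txt, the tokenizer is a plain regular expression, and toy_stem only strips a trailing "s" (it is NOT the Porter algorithm). The Admin UI Analysis screen shows the real behaviour.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "is"}  # stand-in for stopwords_en.txt
PROTECTED = {"news"}                               # stand-in for protwords.txt


def tokenize(text):
    """Rough stand-in for the standard tokenizer: words, keeping apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def remove_stopwords(tokens):
    """Drop stop words, ignoring case."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def lowercase(tokens):
    return [t.lower() for t in tokens]


def strip_possessive(tokens):
    """Remove a trailing 's (English possessive)."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def toy_stem(tokens):
    """Toy stemmer: strip one trailing 's' unless the token is protected."""
    out = []
    for t in tokens:
        if t not in PROTECTED and len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
            t = t[:-1]
        out.append(t)
    return out


def analyze(text):
    """Run the text through every step of the chain, in order."""
    tokens = tokenize(text)
    for step in (remove_stopwords, lowercase, strip_possessive, toy_stem):
        tokens = step(tokens)
    return tokens


print(analyze("The Engine's documents and the News"))
# ['engine', 'document', 'news']
```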
35
Analyzers - Persian
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PersianCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt"/>
  </analyzer>
</fieldType>
36
copyField FTW
<copyField source="cat" dest="text"/>
<copyField source="*_t" dest="text" maxChars="3000"/>
Indexing book authors: "Schildt, Herbert; Wolpert, Lewis; Davies, P."
For searching: tokenized, case-folded, punctuation-stripped:
  schildt / herbert / wolpert / lewis / davies / p
For sorting: untokenized, case-folded, punctuation-stripped:
  schildt herbert wolpert lewis davies p
For faceting: primary author only, using a solr.StrField:
  Schildt, Herbert
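The three representations of the author string can be reproduced with a short Python sketch. In Solr they would come from copying the source field into three fields with different types; the functions below are hypothetical stand-ins that only show the resulting values.

```python
import re

AUTHORS = "Schildt, Herbert; Wolpert, Lewis; Davies, P."


def for_searching(value):
    """Tokenized, case-folded, punctuation-stripped."""
    return re.findall(r"\w+", value.lower())


def for_sorting(value):
    """One untokenized, case-folded, punctuation-stripped string."""
    return " ".join(for_searching(value))


def for_faceting(value):
    """Primary author only, kept verbatim."""
    return value.split(";")[0].strip()


print(for_searching(AUTHORS))  # ['schildt', 'herbert', 'wolpert', 'lewis', 'davies', 'p']
print(for_sorting(AUTHORS))    # schildt herbert wolpert lewis davies p
print(for_faceting(AUTHORS))   # Schildt, Herbert
```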
37
Fine-tune search
Default query parser supports Lucene search syntax:
  text +compulsory -negated field:value
  uses the default field or an explicit field
  not very good for complex analysis
eDisMax supports that plus searching across many fields
Many more specialised types: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
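The meaning of the + and - prefixes can be illustrated with a toy matcher. It is a sketch, not Solr's parser: it ignores field:value, phrases and scoring, works on pre-tokenized documents, and treats a query with only negated terms as matching nothing. Both function names are hypothetical.

```python
def parse_query(query):
    """Split a query into required (+), prohibited (-) and optional terms."""
    required, prohibited, optional = set(), set(), set()
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.add(term[1:])
        elif term.startswith("-") and len(term) > 1:
            prohibited.add(term[1:])
        else:
            optional.add(term)
    return required, prohibited, optional


def matches(doc_tokens, query):
    """Decide whether a tokenized document satisfies the query."""
    required, prohibited, optional = parse_query(query)
    tokens = set(doc_tokens)
    if not required <= tokens:
        return False
    if prohibited & tokens:
        return False
    # With no required terms, at least one optional term must match.
    return bool(required) or bool(optional & tokens)


doc = ["solr", "search", "server"]
print(matches(doc, "solr +search -database"))  # True
print(matches(doc, "+search -server"))         # False
```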
38
Fine-tune indexing
UpdateRequestProcessor
  runs after you send your data to Solr
  and before it hits the schema
Deals with missing values, does pre-processing, identifies languages; the secret to schemaless mode (see example-schemaless)
Defined in solrconfig.xml; search for updateRequestProcessorChain
Full list at: http://www.solr-start.com/info/update-request-processors/
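Conceptually, a processor chain is a list of functions that each take a document and hand a (possibly modified) document to the next one. The Python sketch below shows only that idea; the processor names and fields are hypothetical and do not correspond to actual Solr classes.

```python
def trim_strings(doc):
    """Strip surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def add_default_category(doc):
    """Fill in a missing value before the document reaches the schema."""
    doc.setdefault("category", "unknown")
    return doc


def run_chain(doc, processors):
    """Pass the document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


chain = [trim_strings, add_default_category]
print(run_chain({"id": " 42 ", "title": " Solr "}, chain))
# {'id': '42', 'title': 'Solr', 'category': 'unknown'}
```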
39
Fine-tune display
Sorting
Faceting - automatic taxonomy with counts (indexed value)
Highlighting
MoreLikeThis
Statistics
Grouping, Pivoting
Debug for troubleshooting
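Faceting, at its core, is counting how many matching documents carry each value of a field. A minimal sketch with made-up documents (Solr computes this from the indexed values; the facet_counts function is hypothetical):

```python
from collections import Counter

# Hypothetical result set with a multi-valued "cat" field.
docs = [
    {"id": 1, "cat": ["electronics", "music"]},
    {"id": 2, "cat": ["electronics"]},
    {"id": 3, "cat": ["books"]},
]


def facet_counts(docs, field):
    """Count documents per distinct value of a (multi-valued) field."""
    counts = Counter()
    for doc in docs:
        # set() makes sure a document is counted once per value.
        counts.update(set(doc.get(field, [])))
    return counts.most_common()


print(facet_counts(docs, "cat"))
```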
40
Documentation
Solr Wiki - old but still has a lot of information
Solr Reference Guide - new; online and downloadable
http://www.solr-start.com/ - my resources for learners
http://heliosearch.org/author/joel-bernstein/ - about new features
41
With Solr, how far can I go?
Cloudera (Big Data) has received more than USD 1,000,000,000 in investment - opportunities?
8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg: http://bit.ly/1jmG72G (via @FlaxSearch)
42
Hackathon
43
First steps
Install Solr 4.9
Go through the tutorial - gives you the basics and an end-to-end test
Join the Slack chat (invitations are coming)
Tweet with #SolrMasterclassBkk and @SolrStart, if you have space :-)
Attend breakout sessions
Choose your own adventure (next)
44
Path 1 - Solr indexing book
Great for first-timers
Gets you from zero to comfortable
All examples are provided
If you are stuck, I will help you
Probably will not win you any prizes...
Do it for the skills
45
Path 2 - Your own dataset
Get it in at any cost
Get it displayed
Start iterating
Book a time slot to discuss your questions
Demo tips
  Explain the problem domain (what is your dataset)
  Show how far you got
  Discuss the challenges
46
Path 3 - Need a dataset
Index your favourite Git repository (e.g. Solr): https://github.com/arafalov/git-to-solr
Your own WordPress blog export (with DataImportHandler)
Your own hard drive
Demo tips
  How far did you get
  Concentrate on displaying something cool (statistics?)
  Coolest Solr feature you found
47
Path 4 - A bigger challenge
Project Gutenberg (ask me for a copy of the RDF dump)
World Cup matches data: http://worldcup.sfg.io/
Twitter feed (e.g. with Spring XD/Integration)
Your own photographs collection (Tika extracts metadata)
48
DEMO Rules
There are no rules
And the prizes are not terribly important
What we are looking for is learning
Make something new out of something old
Learn a new feature and show others
Learn, teach, share - everybody wins
49
For later
50
Accelerate your learning
If you still feel like a beginner, buy my book - seriously. That's what it's for
All code/data is at: https://github.com/arafalov/solr-indexing-book
Buy Solr in Action - it was published recently and is a great reference; follow @ManningBooks for discounts
Use my www.solr-start.com resources and join the mailing list (I'll do that for you this time)
Join the solr-user mailing list - full of advanced hackers
Watch Lucene Revolution videos for background
Start helping out on Stack Overflow #solr
Blog about what you learned, tweet with #Solr
51
Other Search-related books
Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1
Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid
Enterprise Search by Martin White
52
53
Alexandre Rafalovitch
www.outerthoughts.com

Master Solr with this 34-slide presentation

  • 1. Apache Solr Masterclass From zero to hero June 2014 www.slideshare.net/arafalov/solr-masterclass-bangkok-june-2014
  • 3. Web search engines ! are quite sophisticated 3
  • 4. 4
  • 5. But the real search needs ! are! much DEEPER and BROADER 5
  • 7. Searching people and companies 7
  • 11. Understanding full-text search SELECT * 
 FROM database
 WHERE field LIKE ‘%word%’# This DOES NOT Scale# Instead: # break text into tokens# domain-specific processing (e.g. lower-casing)# build fast-access structures# algorithms for term, phrases, proximity search 11
  • 12. Basic search engine features Search (Duh!): keyword, phrase, field-specific# Positive and negative terms# Sort: relevancy, recency# Pagination# Compact summary in results# SPEED 12
  • 13. Advanced search engine features Facets/Taxonomy - based navigation with live counts# Language-specific processing# Domain-specific text processing (WiFi = Wi-Fi = WIFI)# Geographic search# More-like-this, did-you-mean, autocomplete# Scaling/Clustering# NOT web crawling - different, but related 13
  • 14. Search engine solutions? Solr# Elastic Search# Xapian# Sphinx# Groonga# Searchdaimon# {F}lexSearch# Algolia (SaaS)# Searchify (SaaS)# ForageJS# Lunr.js# FACT-Finder# DtSearch# MarkLogic# Verity# Fast# Most databases# ! ! …AND MORE 14
  • 15. Used with permission from SemaText Open Source Search Evolution 15
  • 16. Secret Ingredient - Lucene Solr# Elastic Search# SwiftType# Galene (LinkedIn’s)# PyLucene (Python wrapper)# Lucene.net (C# port) Scalable, high-performance indexing# Incremental indexing# Full-text search# Information-Retrieval algorithms# Implemented in Java# Written in 1999, still going strong 16
  • 17. Secret Ingredient - Solr Certified distributions# LucidWorks# HelioSearch# Big Data platforms# Cloudera# Hortonworks HDP# Hosted and SaaS# Amazon CloudSearch# WebSolr, SolrHQ, SearchBox Lucene full-text-search# XML and REST config# Schema/Schemaless# SolrCloud (clustering)# Caching# Near real-time# Rich-document indexing (Tika inside)# Plugins, components, processors 17
  • 18. Solr Ecosystem sample Drupal# Project Blacklight# LuxDB# SolrMeter# CrafterCMS# Typo3# Magenta# HippoCMS# ColdFusion# SolrNet# DataStax# Dovecot# NGData Lily# Basho Riak# YaCy# Apache ManifoldCF# Apache Camel# FranzAllegrograph# BitNami Solr Stack# Carrot2! Broadleaf Commerce# Cloudera CDK! CodeLibs Fess (フェス)! Splunk# Alfresco# Rosette by BasisTech! Luwak by Flax! Quepid by OSC! TwigKit! SPM by SemaText! SILK by LucidWorks! Banana (O/S Solr Kibana) 18
  • 20. DEMO - Basic Unzip# Go to example directory# Run Solr# Import some documents from example docs# grep -l store *.xml | xargs ./post.sh# Show off Solr 4 admin panel 20
  • 21. DEMO - Browse handler: restart Solr with -Dsolr.clustering.enabled=true; visit http://localhost:8983/solr/browse/ and show off search, facets (categories and ranges), spatial/geo-distance, and clusters.
  • 23. Start for free: download, unzip, cd example; java -jar start.jar. Go through the basic tutorial in docs/tutorial.html. Copy the example directory and modify schema.xml until happy. If coming from ElasticSearch, look at example-schemaless. Do NOT follow this path to production. The example schema is a kitchen sink! Read it as a story: <solr>/example/solr/collection1/conf/{schema.xml|solrconfig.xml}
  • 24. Simplest Solr - directory layout:
 solr-home - point here with -Dsolr.solr.home
   collection1 - default collection name, without solr.xml
     conf - configuration directory for the collection
       schema.xml - defines fields and types
       solrconfig.xml - defines low-level configuration but also components, handlers, and chains for UpdateRequestProcessor
  • 25. Simplest Solr - schema.xml:
 <?xml version="1.0" encoding="UTF-8" ?>
 <schema version="1.5" name="simplest-solr">
   <fieldType name="string" class="solr.StrField"/>
   <field name="id" type="string" indexed="true" stored="true" required="true"/>
   <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
   <uniqueKey>id</uniqueKey>
 </schema>
  • 26. Simplest Solr - solrconfig.xml:
 <?xml version="1.0" encoding="UTF-8" ?>
 <config>
   <luceneMatchVersion>LUCENE_4_9</luceneMatchVersion>
   <requestDispatcher handleSelect="false">
     <httpCaching never304="true" />
   </requestDispatcher>
   <requestHandler name="/select" class="solr.SearchHandler" />
   <requestHandler name="/update" class="solr.UpdateRequestHandler" />
   <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
   <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
 </config>
  • 27. DEMO: https://github.com/arafalov/simplest-solr-config
 java -Dsolr.solr.home=…./simplest-solr
 Go to <solr>/example/exampledocs
 grep -l store *.xml | xargs ./post.sh (same, same)
 Check Admin UI: Query - same, but different (multivalue, date); Schema browser
  • 28. Lots of things missing: some admin UI items disabled (Ping, Files); no Near-Real-Time or atomic/partial update; no types (apart from String); no dynamic schema; no SolrCloud. DOES NOT MATTER. NOT YET!
  • 29. Two ways of learning. You can follow a path (going forward): a tutorial or a book; you learn what it teaches. You can reach for the goal (going backwards): have an idea, try to achieve it, and learn what's on the critical path. Both are valuable. The second is harder, but gives you more.
  • 30. Goal-driven Solr: 1. Start with the simplest configuration that works. 2. Get something in (import data). 3. Get something out (display data). 4. Celebrate! 5. Decide/fine-tune what/how you want to find things. 6. Change the schema to match. 7. Change the import/display to match. 8. GOTO 5 (never really stops).
  • 31. Getting data in: curl; post.jar (in example/exampledocs; try "java -jar post.jar -h" for help); Admin UI (core/Documents); clients (SolrJ, among 33 at various levels of support: https://leanpub.com/solr-clients/); formats: XML, JSON, CSV, and others (processed with Tika); DataImportHandler to pull data from external sources; BigData connectors (Hadoop, Flume, etc.); BigData integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS).
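As one possible illustration of the "clients" and "JSON" options, here is a minimal Python sketch that builds a JSON update request for the /update handler. The URL assumes the default collection1 core on localhost:8983, and the documents and field names are hypothetical. The request is only constructed, not sent, so the sketch runs without a Solr instance.

```python
import json
import urllib.request

# Assumption: default Solr 4 example core running locally.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for a list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; field names are for illustration only.
docs = [
    {"id": "book-1", "title": "Example title", "author": ["Schildt, Herbert"]},
    {"id": "book-2", "title": "Another title"},
]
request = build_update_request(docs)

# To actually index the documents, with Solr running:
#   urllib.request.urlopen(request).read()
```

The commit=true parameter makes the documents visible immediately, which is convenient for a first end-to-end test but is usually not what you want for bulk loads.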
  • 32. Getting data out: curl; web browser; Admin UI (core/Query); clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV); UI toolkits (Cloudera HUE, TwigKit); internal post-processors (we saw VelocityResponseWriter at /browse). Needs middleware or a strong proxy - not secure otherwise.
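A small sketch of the query side, assuming the same local collection1 core: it builds a /select URL with pagination (start, rows), a field list (fl), and the JSON response writer (wt=json). The helper name and the sample query are hypothetical. The resulting URL can be pasted into a browser or passed to curl.

```python
from urllib.parse import urlencode

# Assumption: default Solr 4 example core running locally.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_select_url(query, start=0, rows=10, fields=None, base=SOLR_SELECT_URL):
    """Return a /select URL asking for a JSON response."""
    params = [("q", query), ("start", start), ("rows", rows), ("wt", "json")]
    if fields:
        # fl limits which stored fields come back in each document.
        params.append(("fl", ",".join(fields)))
    return base + "?" + urlencode(params)


# Second page of five results, returning only id and title.
url = build_select_url("title:solr", start=5, rows=5, fields=["id", "title"])
print(url)
```

In line with the slide's warning, a URL like this would normally be issued by middleware rather than exposed directly to end users.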
  • 33. Celebrate! You achieved a basic end-to-end test: you got Solr running; you figured out how to display it; you now know where the issues are. FIX THOSE NEXT.
  • 34. Fine-tune schema: Solr is not friends with your data, it's here to get your documents found.
 <field name="features" stored="true" indexed="true" type="text_general" multiValued="true"/>
 stored=true - that's for you
 indexed=true - that's for Solr, where the magic happens
 type="type_name" - defines what analyser chain to use. See Admin UI core/Analysis
 See http://www.solr-start.com/info/analyzers/ for the full list
  • 35. Analyzers - English:
 <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.EnglishPossessiveFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>….
   </analyzer>….
  • 36. Analyzers - Persian:
 <fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <charFilter class="solr.PersianCharFilterFactory"/>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ArabicNormalizationFilterFactory"/>
     <filter class="solr.PersianNormalizationFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt" />
   </analyzer>
 </fieldType>
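To make the idea of an analyzer chain concrete, here is a toy Python model of the first steps of the English chain: tokenize, drop stop words, lower-case, strip possessives. It is not Lucene's implementation. The regular expression only approximates StandardTokenizer, the stop-word set is a tiny stand-in for lang/stopwords_en.txt, and the keyword-marker and Porter stemming steps are left out.

```python
import re

# Assumption: a tiny stand-in for lang/stopwords_en.txt; the real file is longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is"}


def tokenize(text):
    """Rough stand-in for StandardTokenizer: words, keeping inner apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)*", text)


def stop_filter(tokens):
    """Drop stop words, ignoring case (like ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def lower_case_filter(tokens):
    return [t.lower() for t in tokens]


def possessive_filter(tokens):
    """Strip a trailing 's, as an English possessive filter would."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def analyze(text):
    """Run the token stream through each filter in order."""
    tokens = tokenize(text)
    for token_filter in (stop_filter, lower_case_filter, possessive_filter):
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Author's Guide to Solr and Lucene"))
# prints ['author', 'guide', 'solr', 'lucene']
```

The order of the filters matters, just as it does in the fieldType definition: each filter sees the output of the previous one. The Admin UI Analysis screen shows the same step-by-step view for a real chain.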
  • 37. copyField FTW: <copyField source="cat" dest="text"/> <copyField source="*_t" dest="text" maxChars="3000"/>
 Indexing book authors "Schildt, Herbert; Wolpert, Lewis; Davies, P."
 For searching: tokenized, case-folded, punctuation-stripped: schildt / herbert / wolpert / lewis / davies / p
 For sorting: untokenized, case-folded, punctuation-stripped: schildt herbert wolpert lewis davies p
 For faceting: primary author only, using a solr.StrField: Schildt, Herbert
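The three author representations above can be reproduced with a short Python sketch. In Solr they would be three separate fields with different field types, filled via copyField. The function names here are hypothetical and the code only shows the target values, not how Solr computes them.

```python
import re


def search_tokens(authors):
    """Tokenized, case-folded, punctuation-stripped: one token per name part."""
    return re.findall(r"[a-z0-9]+", authors.lower())


def sort_key(authors):
    """Untokenized, case-folded, punctuation-stripped: a single sortable string."""
    return " ".join(search_tokens(authors))


def facet_value(authors):
    """Primary author only, kept verbatim for display in a facet list."""
    return authors.split(";")[0].strip()


authors = "Schildt, Herbert; Wolpert, Lewis; Davies, P."
print(" / ".join(search_tokens(authors)))
print(sort_key(authors))
print(facet_value(authors))
```

One source value feeding several differently analyzed fields is the point of the slide: each use (search, sort, facet) gets the form that suits it.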
  • 38. Fine-tune search: the default query parser supports Lucene search syntax (text +compulsory -negated field:value); it uses the default field or an explicit field and is not very good for complex analysis. eDisMax supports that plus searching across many fields. Many more specialised types: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
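The difference between the two parsers shows up mostly in the request parameters. The sketch below builds parameter sets for each, using the standard df, defType, and qf parameters. The field names (text, name, features) and boosts are assumptions borrowed from the example schema, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def lucene_query_params(query, default_field="text"):
    """Default parser: Lucene syntax, unqualified terms go to one default field."""
    return {"q": query, "df": default_field, "wt": "json"}


def edismax_query_params(user_text, fields_with_boosts):
    """eDisMax: the same user text is searched across several boosted fields."""
    qf = " ".join(f"{field}^{boost}" for field, boost in fields_with_boosts.items())
    return {"q": user_text, "defType": "edismax", "qf": qf, "wt": "json"}


lucene_params = lucene_query_params("ipod +video -nano cat:electronics")
edismax_params = edismax_query_params("ipod video", {"name": 2, "features": 1})

print(urlencode(lucene_params))
print(urlencode(edismax_params))
```

With eDisMax the field weighting lives in qf, so the user-facing search box can pass plain text through without embedding field names in the query.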
  • 39. Fine-tune indexing: an UpdateRequestProcessor runs after you send your data to Solr and before it hits the schema. Use it to deal with missing values, do pre-processing, or identify languages; it is also the secret to schemaless mode (see example-schemaless). Defined in solrconfig.xml; search for updateRequestProcessorChain. Full list at: http://www.solr-start.com/info/update-request-processors/
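Conceptually, an update request processor chain is a pipeline of small document transformations. The following Python model is only an analogy: real processors are Java classes configured in solrconfig.xml, and the functions and field names here are hypothetical. The two steps loosely mirror what trimming and default-value processors do.

```python
def trim_strings(doc):
    """Strip surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def default_value(field, value):
    """Return a processor that fills in `field` when the document lacks it."""
    def processor(doc):
        out = dict(doc)
        out.setdefault(field, value)
        return out
    return processor


def run_chain(doc, processors):
    """Pass the document through each processor in order, before 'the schema'."""
    for processor in processors:
        doc = processor(doc)
    return doc


chain = [trim_strings, default_value("language", "en")]
result = run_chain({"id": " doc-1 ", "title": "Solr  "}, chain)
print(result)
# prints {'id': 'doc-1', 'title': 'Solr', 'language': 'en'}
```

As with analyzers, order matters: a processor that guesses field types, for example, has to run after the ones that clean up the raw values.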
  • 40. Fine-tune display: sorting; faceting - automatic taxonomy with counts (indexed value); highlighting; MoreLikeThis; statistics; grouping, pivoting; debug for troubleshooting.
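Faceting is requested with facet=true and one or more facet.field parameters. With the JSON response writer's default settings, each field's counts come back as a flat list that alternates value and count. The sketch below turns that list into pairs for display. The response fragment and its counts are made up for illustration.

```python
def facet_pairs(flat):
    """Convert [value, count, value, count, ...] into [(value, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical fragment of a Solr JSON response for facet.field=cat.
response = {
    "facet_counts": {
        "facet_fields": {
            "cat": ["electronics", 14, "currency", 4, "memory", 3],
        }
    }
}

for value, count in facet_pairs(response["facet_counts"]["facet_fields"]["cat"]):
    print(f"{value} ({count})")
```

The values come from the indexed form of the field, which is why facet fields are usually untokenized string fields rather than analyzed text.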
  • 41. Documentation: Solr WIKI - old but still has a lot of information; Solr Reference Guide - new, online and downloadable; http://www.solr-start.com/ - my resources for learners; http://heliosearch.org/author/joel-bernstein/ - about new features.
  • 42. With Solr, how far can I go? Cloudera (BigData) has more than USD 1,000,000,000 in investments - opportunities? 8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)
  • 44. First steps: install Solr 4.9; go through the tutorial - it gives you the basics and an end-to-end test; join the Slack chat (invitations are coming); tweet #SolrMasterclassBkk, @SolrStart, if you have space :-); attend breakout sessions; choose your own adventure (next).
  • 45. Path 1 - Solr indexing book: great for first-timers; gets you from zero to comfortable; all examples are provided; if you are stuck, I will help you. Probably will not win you any prizes… Do it for the skills.
  • 46. Path 2 - Your own dataset: get it in at any cost; get it displayed; start iterating; book a time slot to discuss your questions. Demo tips: explain the problem domain (what is your dataset); show how far you got; discuss the challenges.
  • 47. Path 3 - Need a dataset: index your favourite Git repository (e.g. Solr):
 https://github.com/arafalov/git-to-solr
 Other options: your own WordPress blog export (with DataImportHandler); your own hard drive. Demo tips: how far did you get; concentrate on displaying something cool (statistics?); coolest Solr feature you found.
  • 48. Path 4 - A bigger challenge: Project Gutenberg (ask me for a copy of the RDF dump); World Cup match data (http://worldcup.sfg.io/); Twitter feed (e.g. with Spring XD/Integration); your own photograph collection (Tika extracts metadata).
  • 49. DEMO Rules: there are no rules, and the prizes are not terribly important. What we are looking for is learning: make something new out of something old; learn a new feature and show others. Learn, teach, share - everybody wins.
  • 51. Accelerate your learning: if you still feel like a beginner, buy my book - seriously. That's what it's for. All code/data is at: https://github.com/arafalov/solr-indexing-book
 Buy Solr in Action - recently published and a great reference; follow @ManningBooks for discounts. Use my www.solr-start.com resources and join the mailing list (I'll do that for you this time). Join the solr-user mailing list - full of advanced hackers. Watch Lucid Revolution videos for background. Start helping out on Stack Overflow #solr. Blog what you learned; tweet with #Solr.
  • 52. Other search-related books: Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1; Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid; Enterprise Search by Martin White.