This document provides an overview of the Apache Solr search engine. It begins with an introduction to full-text search and how it differs from basic SQL queries. It then covers the basic and advanced features of Solr, highlighting facets, language-specific processing, and geographic search. The document reviews how Solr uses Lucene for indexing and search capabilities. It concludes by discussing ways to get started with Solr, including downloading the software and importing sample data for testing.
11. Understanding full-text search
SELECT *
FROM database
WHERE field LIKE '%word%'
This DOES NOT scale
Instead:
  break text into tokens
  domain-specific processing (e.g. lower-casing)
  build fast-access structures
  algorithms for term, phrase, and proximity search
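The steps on this slide can be sketched in a few lines of plain Python. This is only a toy illustration of the idea (tokenize, lower-case, build an inverted index, then answer term and phrase queries); it is not how Lucene is implemented, and the function names are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into lower-cased word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build a fast-access structure: token -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index


def search_term(index, term):
    """Return sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), {}))


def search_phrase(index, phrase):
    """Return sorted ids of documents containing the tokens consecutively."""
    tokens = tokenize(phrase)
    if not tokens:
        return []
    hits = []
    for doc_id, starts in index.get(tokens[0], {}).items():
        for start in starts:
            # Every following token must sit at the next position.
            if all(
                start + offset in index.get(token, {}).get(doc_id, [])
                for offset, token in enumerate(tokens)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "Solr is a search engine",
    2: "Search engines scale; SQL LIKE does not",
    3: "A fast search engine",
}
index = build_index(docs)
print(search_term(index, "Search"))
print(search_phrase(index, "search engine"))
```

A lookup in the index touches only the documents that contain the token, whereas LIKE '%word%' has to scan every row.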
12. Basic search engine features
Search (Duh!): keyword, phrase, field-specific
Positive and negative terms
Sort: relevancy, recency
Pagination
Compact summary in results
SPEED
13. Advanced search engine features
Facet/taxonomy-based navigation with live counts
Language-specific processing
Domain-specific text processing (WiFi = Wi-Fi = WIFI)
Geographic search
More-like-this, did-you-mean, autocomplete
Scaling/Clustering
NOT web crawling - different, but related
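The "WiFi = Wi-Fi = WIFI" point can be shown with a tiny normalization function. This is a plain-Python sketch of the idea only; in Solr the equivalent work is done by analyzer filters configured in the schema, and the `normalize` helper here is a made-up name.

```python
import re


def normalize(term):
    """Collapse case, hyphen, and spacing variants into a single index key."""
    return re.sub(r"[\s\-]+", "", term).lower()


variants = ["WiFi", "Wi-Fi", "WIFI", "wi fi"]
keys = {normalize(variant) for variant in variants}
print(keys)
```

Because the same normalization is applied to both the indexed text and the query, any of the variants finds documents written with any other variant.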
20. DEMO - Basic
Unzip
Go to example directory
Run Solr
Import some documents from example docs
  grep -l store *.xml | xargs ./post.sh
Show off Solr 4 admin panel
21. DEMO - Browse handler
Restart Solr with -Dsolr.clustering.enabled=true
Visit http://localhost:8983/solr/browse/
Show off
  Search
  Facets - Categories and Ranges
  Spatial/Geo-distance
  Clusters
23. Start for free
Download, unzip, cd example; java -jar start.jar
Go through basic tutorial in docs/tutorial.html
Copy example directory, modify schema.xml until happy
If coming from ElasticSearch, look at example-schemaless
Do NOT follow this path to production
Example schema is a kitchen sink! Read it as a story.
<solr>/example/solr/collection1/conf/{schema.xml|solrconfig.xml}
24. Simplest Solr - directory layout
solr-home - point here with -Dsolr.solr.home
  collection1 - default collection name when there is no solr.xml
    conf - configuration directory for the collection
      schema.xml - defines fields and types
      solrconfig.xml - defines low-level configuration, plus components, handlers, and UpdateRequestProcessor chains
28. Lots of things missing
Some admin UI items disabled (Ping, Files)
No Near-Real-Time or atomic/partial update
No types (apart from String)
No dynamic schema
No SolrCloud
DOES NOT MATTER. NOT YET!
29. Two ways of learning
You can follow a path (going forward)
  A tutorial
  A book
  Learn what it teaches
You can reach for the goal (going backwards)
  Have an idea
  Try to achieve it
  Learn what’s on the critical path
Both are valuable. The second is harder, but gives you more.
30. Goal-driven Solr
1. Start with the simplest configuration that works
2. Get something in (import data)
3. Get something out (display data)
4. Celebrate!!
5. Decide and fine-tune what you want to find and how
6. Change the schema to match
7. Change the import/display to match
8. GOTO 5 (never really stops)
31. Getting data in
curl
post.jar (in example/exampledocs); try “java -jar post.jar -h” for help
Admin UI (core/Documents)
Clients (SolrJ, among 33 at various levels of support: https://leanpub.com/solr-clients/)
Formats: XML, JSON, CSV, other formats (processed with Tika)
DataImportHandler to pull data from external sources
BigData connectors (Hadoop, Flume, etc.)
BigData integrations (DataStax for Solr on Cassandra, Cloudera for Solr on HDFS)
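As one hedged illustration of the "curl" route, the same HTTP request can be built from Python's standard library. The sketch assumes the stock example instance on localhost:8983 with the default collection1 core, and document fields (id, title, cat) like those in the example schema; adjust all of these for a real setup. The request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: local example Solr instance with the default core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical sample document; the field names are assumptions.
docs = [{"id": "book-1", "title": "A sample title", "cat": ["book"]}]
request = build_update_request(docs)
print(request.get_method(), request.full_url)

# To actually index the documents against a running Solr:
# urllib.request.urlopen(request).read()
```

The commit=true parameter makes the documents visible to searches immediately, which is convenient for a first end-to-end test.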
32. Getting data out
curl
Web browser
Admin UI (core/Query)
Clients (ResponseWriters for JSON, XML, Python, Ruby, PHP, CSV)
UI toolkits (Cloudera HUE, TwigKit)
Internal post-processors (we saw VelocityResponseWriter at /browse)
Needs middleware or a strong proxy in front - not secure otherwise
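A query is just an HTTP GET against a search handler, so a small helper that builds the URL covers keyword search, sorting, pagination, and the choice of ResponseWriter. The sketch assumes the default /select handler on a local collection1 core; the field names in the sample query (features, price) are assumptions borrowed from the example data.

```python
from urllib.parse import urlencode

# Assumption: local example Solr instance with the default core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_select_url(query, start=0, rows=10, sort=None, base=SOLR_SELECT_URL):
    """Build a search URL with pagination, optional sort, and JSON output."""
    params = {"q": query, "wt": "json", "start": start, "rows": rows}
    if sort:
        params["sort"] = sort
    return base + "?" + urlencode(params)


url = build_select_url("features:store", rows=5, sort="price asc")
print(url)
```

Paste the printed URL into a browser or pass it to curl to see the raw response.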
33. Celebrate!
You achieved a basic end-to-end test
You got Solr running
You figured out how to display it
You now know where the issues are
FIX THOSE NEXT
34. Fine-tune schema
Solr is not friends with your data; it’s here to get your documents found.
<field name="features" stored="true" indexed="true" type="text_general" multiValued="true"/>
stored=true - that’s for you
indexed=true - that’s for Solr, where the magic happens
type="type_name" - defines what analyser chain to use!
  See Admin UI core/Analysis
  See http://www.solr-start.com/info/analyzers/ for full list
37. copyField FTW
<copyField source="cat" dest="text"/>
<copyField source="*_t" dest="text" maxChars="3000"/>
Indexing book authors
“Schildt, Herbert; Wolpert, Lewis; Davies, P.”
For searching: Tokenized, case-folded, punctuation-stripped:
  schildt / herbert / wolpert / lewis / davies / p
For sorting: Untokenized, case-folded, punctuation-stripped:
  schildt herbert wolpert lewis davies p
For faceting: Primary author only, using a solr.StrField:
  Schildt, Herbert
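The three author representations above can be reproduced with a short plain-Python sketch. In Solr each form would come from a separate field (filled via copyField) with its own field type and analyzer chain; the functions below are hypothetical stand-ins that only mimic the end result.

```python
import re


def for_searching(value):
    """Tokenized, case-folded, punctuation-stripped."""
    return re.findall(r"\w+", value.lower())


def for_sorting(value):
    """Untokenized (one single value), case-folded, punctuation-stripped."""
    return " ".join(re.findall(r"\w+", value.lower()))


def for_faceting(value):
    """Primary author only, kept exactly as written."""
    return value.split(";")[0].strip()


authors = "Schildt, Herbert; Wolpert, Lewis; Davies, P."
print(" / ".join(for_searching(authors)))
print(for_sorting(authors))
print(for_faceting(authors))
```

One source value ends up in three shapes because searching, sorting, and faceting each need something different from the same text.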
38. Fine-tune search
Default query parser supports Lucene search syntax:
  text +compulsory -negated field:value
  uses default field or explicit field
  not very good for complex analysis
eDisMax supports that plus searching across many fields
Many more specialised parsers: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
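Switching to eDisMax is a matter of request parameters: defType selects the parser and qf lists the fields to search, optionally with boosts. The sketch below only assembles those parameters; the field names and the title^2 boost are assumptions for illustration, not values from the example configuration.

```python
from urllib.parse import urlencode


def edismax_params(user_query, fields):
    """Assemble request parameters for an eDisMax search across several fields."""
    return {
        "defType": "edismax",     # select the eDisMax query parser
        "q": user_query,          # the user's query, Lucene syntax still allowed
        "qf": " ".join(fields),   # fields to search, with optional ^boost
        "wt": "json",
    }


# Hypothetical field list: title matches count double.
params = edismax_params("text +compulsory -negated", ["title^2", "features", "text"])
query_string = urlencode(params)
print(query_string)
```

Append the printed query string to a search handler URL such as /select to try it.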
39. Fine-tune indexing
UpdateRequestProcessor
  after you send your data to Solr
  before it hits the schema
Deal with missing values, do pre-processing, identify languages; the secret behind schemaless mode (see example-schemaless)
Defined in solrconfig.xml, search for updateRequestProcessorChain
Full list at: http://www.solr-start.com/info/update-request-processors/
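Conceptually, a processor chain is a pipeline of small steps that each receive a document and pass it on, possibly modified. The sketch below models that idea in plain Python. It is an analogy only: real processors are Java classes wired together in solrconfig.xml, and every name here is hypothetical.

```python
def trim_strings(doc):
    """Pre-processing step: strip surrounding whitespace from string values."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in doc.items()
    }


def add_default(field, value):
    """Make a step that fills in a missing field with a default value."""
    def processor(doc):
        doc.setdefault(field, value)
        return doc
    return processor


def run_chain(doc, chain):
    """Pass a copy of the document through each step in order."""
    doc = dict(doc)
    for processor in chain:
        doc = processor(doc)
    return doc


chain = [trim_strings, add_default("cat", "unknown")]
incoming = {"id": "1", "title": "  Hello Solr  "}
result = run_chain(incoming, chain)
print(result)
```

The order of the steps matters, just as the order of processors in a chain definition does.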
41. Documentation
Solr WIKI - old but still has a lot of information
Solr Reference Guide - new; online and downloadable
http://www.solr-start.com/ - my resources for learners
http://heliosearch.org/author/joel-bernstein/ - about new features
42. With Solr, how far can I go?
Cloudera (BigData) has over $1,000,000,000 USD in investments - opportunities?
8M+ searches/day, 40 languages, 100ms NRT, 1024 cores, 256 shards, 32 servers on #solr at Bloomberg http://bit.ly/1jmG72G (via @FlaxSearch)
44. First steps
Install Solr 4.9
Go through the tutorial - gives you basics and end-to-end test
Join the Slack chat (invitations are coming)
Tweet #SolrMasterclassBkk, @SolrStart if you have space :-)
Attend breakout sessions
Choose your own adventure (next)
45. Path 1 - Solr indexing book
Great for first timers
Gets you from zero to comfortable
All examples are provided
If you are stuck, I will help you
Probably will not win you any prizes...
Do it for the skills
46. Path 2 - Your own dataset
Get it in at any cost
Get it displayed
Start iterating
Book a time slot to discuss your questions
Demo tips
  Explain problem domain (what is your dataset)
  Show how far you got
  Discuss the challenges
47. Path 3 - Need a dataset
Index your favourite Git repository (e.g. Solr): https://github.com/arafalov/git-to-solr
Your own WordPress blog export (with DataImportHandler)
Your own hard-drive
Demo tips
  How far did you get
  Concentrate on displaying something cool (statistics?)
  Coolest Solr feature you found
48. Path 4 - A bigger challenge
Project Gutenberg (ask me for a copy of RDF dump)
WorldCup matches data: http://worldcup.sfg.io/
Twitter feed (e.g. with Spring XD/Integration)
Your own photograph collection (Tika extracts metadata)
49. DEMO Rules
There are no rules
And the prizes are not terribly important
What we are looking for is learning
Make something new out of something old
Learn a new feature and show others
Learn, teach, share - everybody wins
51. Accelerate your learning
If you still feel like a beginner, buy my book - seriously. That’s what it’s for
  All code/data is at: https://github.com/arafalov/solr-indexing-book
Buy Solr in Action - recently published and a great reference; follow @ManningBooks for discounts
Use my www.solr-start.com resources and join the mailing list (I’ll do that for you this time)
Join solr-user mailing list - full of advanced hackers
Watch Lucid Revolution videos for background
Start helping out on Stack Overflow #solr
Blog what you learned, tweet with #Solr
52. Other Search-related books
Designing the Search Experience: The Information Architecture of Discovery - by a TwigKit creator +1
Search Analytics for Your Site: Conversations with Your Customers by Louis Rosenfeld - see also Quepid
Enterprise Search by Martin White