8. What is “Search”?
• Information/Document Retrieval
• Basic Definition:
• Finding previously seen documents that are
related to some user-supplied terms.
• Advanced Definition:
• Finding relevant content for some query by
understanding the contextual meaning of
terms in the search index and query.
• Semantic Search
30. Solr Documents
• A document represents a distinct piece of
content that can be stored/retrieved
• Bible Verse
• Journal Article
• Commentary Chapter/Section
• Web Page
50. Solr Fields
• The “String” Field Type
• <fieldType
name="string"
class="solr.StrField" />
• No Filter; No Tokenizer
• Field content won’t be split or changed
58. Put Data in Solr
• Remember, Solr communicates using XML
over HTTP
• No concept of updating a document -
delete, then add
• To add, POST XML to update handler
• http://localhost:8080/solr/bible/update
59. Add XML
<add>
<doc>
<field name="id">1</field>
<field name="net">In the beginning God created the heavens and the earth.</field>
</doc>
</add>
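A minimal sketch of generating that add message programmatically, assuming Solr's XML update format in which each value is wrapped in a `<field name="...">` element. The helper name `build_add_xml` is hypothetical; the field names `id` and `net` come from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return a Solr <add> message for one document (hypothetical helper)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        # Each value becomes <field name="...">value</field>
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_xml({
    "id": 1,
    "net": "In the beginning God created the heavens and the earth.",
})
print(xml_body)
```

The resulting string is what would be POSTed to the update handler; ElementTree also escapes characters such as `&` and `<` in the verse text.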
60. PHP API
• No XML!
• $client = new SolrClient($options);
$doc = new SolrInputDocument();
$doc->addField('id', 1); // Must be an integer
$doc->addField('net', 'In the beginning God created the heavens and the earth.');
$client->addDocument($doc);
66. Querying Solr
• HTTP GET Request
• http://localhost:8080/solr/bible3/select?q=god
• Path to Solr: http://localhost:8080/solr | Core: bible3 | Handler: select | Query: q=god
• Returns XML By Default
• Can return JSON and more
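The URL anatomy above (path to Solr, core, handler, query) can be sketched as a small builder. The function name `build_select_url` is hypothetical; `wt` is Solr's response-writer parameter, used here on the assumption that `wt=json` selects JSON output.

```python
from urllib.parse import urlencode


def build_select_url(solr_base, core, query, response_format=None):
    """Assemble <path to Solr>/<core>/select?q=<query> (hypothetical helper)."""
    params = {"q": query}
    if response_format is not None:
        # wt picks the response writer; omit it to get the default (XML)
        params["wt"] = response_format
    return f"{solr_base}/{core}/select?{urlencode(params)}"


print(build_select_url("http://localhost:8080/solr", "bible3", "god"))
print(build_select_url("http://localhost:8080/solr", "bible3", "god", "json"))
```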
73. Querying Solr
• Queries the defaultSearchField by default
• <defaultSearchField>all_index</defaultSearchField>
• Can query other fields by using the syntax field:value
• http://localhost:8080/solr/bible3/select?q=id:27974
• Multiple queries / Booleans
• http://localhost:8080/solr/bible3/select?q=god%20AND%20book:40
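A short sketch of composing a fielded Boolean query and percent-encoding it for the `q` parameter. The helper names are hypothetical, and the sketch does not escape special query characters inside values, so it only suits simple terms like the ones on the slide.

```python
from urllib.parse import quote


def field_clause(field, value):
    """Build a field:value clause (no escaping of special characters)."""
    return f"{field}:{value}"


def all_of(*clauses):
    """Join clauses with the Boolean AND operator."""
    return " AND ".join(clauses)


def encode_query(query):
    """Percent-encode the query for a URL, leaving ':' readable."""
    return quote(query, safe=":")


q = all_of("god", field_clause("book", 40))
print(q)                # god AND book:40
print(encode_query(q))  # god%20AND%20book:40
```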
85. Search Multiple Translations
• + Quasi Synonym term/phrase injection
• + Less variation across translations leads to stronger
possible matches
• + Matches verses when the source translation isn’t
known
• - No control over which translation gets more weight
• - No control over scoring of matches
86. Search Multiple Translations
• Another way: Dismax
• Can score a document (verse) match based on scores/matches
from multiple fields.
• net_index^1 kjv_index^1
• Not exponents - weights
• We’re searching the net_index and kjv_index fields, each with
a boost/weight of 1.
• net_index^6 kjv_index^.5
• http://localhost:8080/solr/bible4/select?q=respect%20for%20god&defType=dismax&tie=.1&qf=net_index^1%20kjv_index^1&fl=score
• http://localhost:8080/solr/bible4/select?q=respect%20for%20god&defType=dismax&tie=.1&qf=net_index^6%20kjv_index^.5&fl=score
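To make the effect of the `qf` weights and the `tie` parameter concrete, here is a simplified sketch of how a disjunction-max query combines per-field scores: the best field score, plus `tie` times the sum of the remaining field scores. The per-field scores below are made-up illustration values, and applying the boost as a plain multiplier is a simplifying assumption (in Lucene the boost enters through the query weights).

```python
def dismax_score(field_scores, boosts, tie=0.0):
    """Combine per-field scores: max + tie * (sum of the other fields)."""
    weighted = [
        score * boosts.get(field, 1.0)
        for field, score in field_scores.items()
    ]
    if not weighted:
        return 0.0
    best = max(weighted)
    return best + tie * (sum(weighted) - best)


# Hypothetical raw scores for one verse in each translation field
scores = {"net_index": 0.8, "kjv_index": 0.5}

equal = dismax_score(scores, {"net_index": 1, "kjv_index": 1}, tie=0.1)
net_heavy = dismax_score(scores, {"net_index": 6, "kjv_index": 0.5}, tie=0.1)
print(equal, net_heavy)  # roughly 0.85 and 4.825
```

With `tie=0` only the best-matching field counts; with `tie=1` the field scores are simply summed.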
95. Scoring
• $$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{norm}(t,d) \right)$$
• Basic Factors
• Term Frequency in a document (↑ is better)
• Term Frequency in Corpus (↓ is Better)
• Length of matching document (↓ is Better)
• “Jesus Wept” - John 11:35
• http://localhost:8080/solr/bible3/select?q=wept
• http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
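The three basic factors can be sketched for a single-term query using the default formulas from Lucene's classic similarity: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), and a length norm of 1 / sqrt(number of terms). This sketch omits coord, queryNorm, boosts, and the lossy encoding of norms, and the corpus numbers are hypothetical.

```python
import math


def tf(freq):
    """More occurrences in the document score higher."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Terms that are rare across the corpus score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms)


def term_score(freq, doc_freq, num_docs, num_terms):
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * length_norm(num_terms)


# Hypothetical corpus: 31,000 verses, "wept" found in 60 of them
short_verse = term_score(freq=1, doc_freq=60, num_docs=31000, num_terms=2)
long_verse = term_score(freq=1, doc_freq=60, num_docs=31000, num_terms=25)
print(short_verse > long_verse)  # True
```

This is why a two-word verse such as "Jesus wept" tends to rank first for `q=wept`: the term frequency and idf are the same as for longer verses, but its length norm is much larger.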
110. Topic Tagging
• Use a topically-tagged Bible/concordance to mark up each verse, or just key verses
• Helpful for “theme” based queries.
• “Social Justice” - no good matches
• “Satan” - Many Names
• Name Tagging in general can be very helpful
116. Searching Strong’s
• Add a field for Strong’s: strongs_index
• 1473 1510 2316 11 2316 2464 2532 2316 2384 1510 3756
2316 3498 235 2198
• Most of the benefits of text searching
• “Word” frequency
• Document vs. corpus frequency of search terms
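Because the Strong's numbers are stored as whitespace-separated tokens, they behave like ordinary words at index time. A small sketch of the per-document "word" frequency, using the field value from the slide (the helper name is hypothetical):

```python
from collections import Counter

# Field value for one verse, copied from the slide
strongs_index = (
    "1473 1510 2316 11 2316 2464 2532 2316 2384 1510 3756 "
    "2316 3498 235 2198"
)


def term_frequencies(field_value):
    """Count how often each Strong's number occurs in one field value."""
    return Counter(field_value.split())


counts = term_frequencies(strongs_index)
print(counts.most_common(2))  # [('2316', 4), ('1510', 2)]
```

A number that is frequent in this verse but rare across the corpus would then score well under the same tf/idf reasoning used for text fields.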
122. Searching Articles
• Similar approach to text-based queries
• Stem words
• Use Synonyms
• Remove Stop Words
• Without manual tagging, there’s no automatic way
to index/search by Bible Reference
126. Searching Articles
• Article contains reference: “John 3”
• User searches for “John 3:16” or “John 2-4”
• Results: no meaningful matches at best
(unless the documents happen to match the query term “John”)
132. Searching Articles
• Solr-based Solutions:
• Identify and index references and their
composite verses using a grammar.
• John 1:1-3 -> John 1:1; John 1:2; John 1:3
• Store in a multivalued field - each
reference is a “term”
• Must also parse and expand references in
queries in order to match
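A minimal sketch of the expansion step, using a regular expression rather than a full grammar. It assumes references of the form `Book C:V` or `Book C:V-V` only; whole-chapter references such as "John 3" or chapter ranges such as "John 2-4" would need per-chapter verse counts, which are not modelled here. The same function would be applied to both article text at index time and user queries at search time.

```python
import re

# Book name (optionally prefixed with 1-3), chapter, verse or verse range
REFERENCE = re.compile(
    r"^(?P<book>(?:[1-3]\s)?[A-Za-z]+(?:\s[A-Za-z]+)*)\s"
    r"(?P<chapter>\d+):(?P<start>\d+)(?:-(?P<end>\d+))?$"
)


def expand_reference(reference):
    """Expand 'John 1:1-3' into one term per verse (hypothetical helper)."""
    match = REFERENCE.match(reference.strip())
    if match is None:
        raise ValueError(f"unsupported reference: {reference!r}")
    book = match["book"]
    chapter = int(match["chapter"])
    start = int(match["start"])
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError(f"descending verse range: {reference!r}")
    return [f"{book} {chapter}:{verse}" for verse in range(start, end + 1)]


print(expand_reference("John 1:1-3"))
# ['John 1:1', 'John 1:2', 'John 1:3']
```

Each returned string would become one value in the multivalued reference field.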
139. Searching Articles
• Relational database-based solution:
• Assign an id to every verse
• Store: id, articleId, verseId
• Parse user query to ids.
• SELECT articleId, COUNT(id)
FROM article_verse -- hypothetical table name
WHERE verseId IN (ID_LIST)
GROUP BY articleId
• Higher count -> Article is more likely to be
about that reference than other articles with a
lower count
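The counting approach can be tried end to end with an in-memory SQLite database. This is a sketch under assumptions: the table name `article_verse` and the sample rows are hypothetical, and the explicit `ORDER BY` is added so that the most relevant article comes first.

```python
import sqlite3


def rank_articles(conn, verse_ids):
    """Return (articleId, hits) rows, highest count first."""
    if not verse_ids:
        return []
    placeholders = ",".join("?" for _ in verse_ids)
    sql = (
        "SELECT articleId, COUNT(id) AS hits "
        "FROM article_verse "
        f"WHERE verseId IN ({placeholders}) "
        "GROUP BY articleId "
        "ORDER BY hits DESC, articleId ASC"
    )
    return conn.execute(sql, list(verse_ids)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_verse ("
    "id INTEGER PRIMARY KEY, articleId INTEGER, verseId INTEGER)"
)
# Hypothetical verse occurrences: (articleId, verseId)
conn.executemany(
    "INSERT INTO article_verse (articleId, verseId) VALUES (?, ?)",
    [(1, 100), (1, 101), (1, 100), (2, 100), (3, 999)],
)

print(rank_articles(conn, [100, 101]))  # [(1, 3), (2, 1)]
```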
147. Searching Articles
• Relational database-based solution:
• Large number of rows.
• 15,000 Journal articles have > 9,000,000 rows
(verse occurrences)
• Can store id, articleId, verseId, count
• Then SUM() the counts for each articleId.
• Negligibly faster.
• Only approx. 3,000,000 rows
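A sketch of the pre-aggregated variant: repeated (articleId, verseId) occurrences are collapsed into one row with a count, and the query sums those counts. The table name `article_verse_count` and the sample data are hypothetical, and the slide's `count` column is named `verseCount` here to avoid confusion with the SQL `COUNT()` function.

```python
import sqlite3
from collections import Counter


def collapse(occurrences):
    """Turn (articleId, verseId) pairs into (articleId, verseId, count) rows."""
    return [(a, v, n) for (a, v), n in Counter(occurrences).items()]


def rank_articles_by_sum(conn, verse_ids):
    """Return (articleId, hits) rows, summing the stored counts."""
    if not verse_ids:
        return []
    placeholders = ",".join("?" for _ in verse_ids)
    sql = (
        "SELECT articleId, SUM(verseCount) AS hits "
        "FROM article_verse_count "
        f"WHERE verseId IN ({placeholders}) "
        "GROUP BY articleId "
        "ORDER BY hits DESC, articleId ASC"
    )
    return conn.execute(sql, list(verse_ids)).fetchall()


# Hypothetical verse occurrences: five raw rows collapse to four
occurrences = [(1, 100), (1, 101), (1, 100), (2, 100), (3, 999)]
rows = collapse(occurrences)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article_verse_count ("
    "id INTEGER PRIMARY KEY, articleId INTEGER, "
    "verseId INTEGER, verseCount INTEGER)"
)
conn.executemany(
    "INSERT INTO article_verse_count (articleId, verseId, verseCount) "
    "VALUES (?, ?, ?)",
    rows,
)

print(len(rows))                               # 4
print(rank_articles_by_sum(conn, [100, 101]))  # [(1, 3), (2, 1)]
```

The ranking is the same as with one row per occurrence; only the row count shrinks.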
153. Heterogeneous Indexes
• Not all content is created equal.
• Content quality and its effect on the quality of
your results becomes a factor when you move
from one resource to more than one
• One Bible, One website, One Journal
• Apply a field or document boost to help
normalize results
• Some content gets bumped up and some down
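The idea of bumping some sources up and others down can be illustrated outside Solr with a per-source multiplier applied to raw scores. This is only a conceptual sketch: Solr applies boosts inside its own scoring rather than as a post-processing step, and the source names, scores, and boost values below are all hypothetical.

```python
def apply_source_boosts(results, boosts):
    """Rescale (doc_id, source, score) rows by a per-source boost and re-rank."""
    boosted = [
        (doc_id, source, score * boosts.get(source, 1.0))
        for doc_id, source, score in results
    ]
    return sorted(boosted, key=lambda row: row[2], reverse=True)


# Hypothetical raw results from a mixed index
results = [
    ("web-1", "website", 2.0),
    ("verse-1", "bible", 1.5),
    ("article-1", "journal", 1.0),
]
# Hypothetical boosts: trust the Bible text more, the website less
boosts = {"bible": 2.0, "website": 0.4}

for doc_id, source, score in apply_source_boosts(results, boosts):
    print(doc_id, source, score)
```

Here the verse moves to the top and the web page drops to the bottom, even though the web page had the highest raw score.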