18. …r boost for a query on ferrari than the … get from a query on insurance.
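Inverse document frequency of a term is used to scale its weight. Writing $\mathrm{df}_t$ for the document frequency of term $t$ and $N$ for the total number of documents in a corpus, the inverse document frequency (idf) of a term $t$ is defined as follows:

\[ \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}. \]

The idf of a rare term is high, whereas the idf of a frequent term is low.
Figure 6.4 gives an example of idf’s in a co…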
19. The tf-idf weighting scheme assigns to term $t$ a weight in document $d$ given by

\[ \text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t. \]
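
The two formulas above can be tried out in a few lines of Ruby. This is a minimal sketch, not code from the talk: the slide does not state the logarithm base, so base 10 is assumed here, and the corpus size and document frequencies are made-up numbers chosen only for illustration.

```ruby
# idf_t = log(N / df_t). Base 10 is an assumption; the slide gives no base.
def idf(total_docs, doc_freq)
  Math.log10(total_docs.to_f / doc_freq)
end

# tf-idf_{t,d} = tf_{t,d} * idf_t
def tf_idf(term_freq, total_docs, doc_freq)
  term_freq * idf(total_docs, doc_freq)
end

# Hypothetical corpus of 1,000 documents.
n = 1_000
rare_idf   = idf(n, 10)     # term that occurs in only 10 documents
common_idf = idf(n, 1_000)  # term that occurs in every document

puts format("idf(rare) = %.2f, idf(common) = %.2f", rare_idf, common_idf)
puts format("tf-idf for tf = 3 on the rare term: %.2f", tf_idf(3, n, 10))
```

A term that appears in every document gets an idf of zero, so it contributes nothing to the weight no matter how often it occurs in a given document.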
21. 7 Vector space re…
[Figure 7.1: Cosine similarity illustrated; labeled vectors include v(q) and v(d2).]
22.
23. Q: “gold silver truck”
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
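
To see how this query and these three documents could be ranked, the sketch below weights every term by tf-idf and compares the query vector with each document vector using cosine similarity. Several choices are assumptions rather than anything stated on the slide: base-10 logarithms, raw term counts as tf, lowercasing with a simple letters-only tokenizer, and weighting the query the same way as the documents. All method names are hypothetical.

```ruby
DOCS = {
  "D1" => "Shipment of gold damaged in a fire",
  "D2" => "Delivery of silver arrived in a silver truck",
  "D3" => "Shipment of gold arrived in a truck"
}.freeze
QUERY = "gold silver truck"

# Lowercase and keep runs of ASCII letters (an assumed, very simple tokenizer).
def tokenize(text)
  text.downcase.scan(/[a-z]+/)
end

# idf_t = log10(N / df_t), computed over the document collection only.
def idf_table(docs)
  n = docs.size.to_f
  df = Hash.new(0)
  docs.each_value { |text| tokenize(text).uniq.each { |term| df[term] += 1 } }
  df.each_with_object({}) { |(term, count), table| table[term] = Math.log10(n / count) }
end

# Sparse tf-idf vector: term => tf * idf (terms unknown to the collection get 0).
def weights(text, idf)
  counts = tokenize(text).each_with_object(Hash.new(0)) { |term, h| h[term] += 1 }
  counts.each_with_object({}) { |(term, tf), w| w[term] = tf * idf.fetch(term, 0.0) }
end

# Cosine of the angle between two sparse vectors; 0.0 if either has zero length.
def cosine(a, b)
  dot = a.sum { |term, w| w * b.fetch(term, 0.0) }
  length = ->(v) { Math.sqrt(v.values.sum { |w| w * w }) }
  denom = length.call(a) * length.call(b)
  denom.zero? ? 0.0 : dot / denom
end

idf = idf_table(DOCS)
query_vector = weights(QUERY, idf)
scores = DOCS.map { |name, text| [name, cosine(query_vector, weights(text, idf))] }
ranking = scores.sort_by { |_, score| -score }

ranking.each { |name, score| puts format("%s  %.4f", name, score) }
```

Under these assumptions D2 ranks first, followed by D3 and then D1: D2 contains the rarest query term, silver, twice, while words such as "in", "a" and "of" occur in every document and so receive zero weight.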
35. Lucene includes several built-in analyzers. The primary ones are shown in table 4.2.
We’ll leave discussion of the two language-specific analyzers, RussianAnalyzer
and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper,
PerFieldAnalyzerWrapper, to section 4.4.
Table 4.2 Primary analyzers available in Lucene
Analyzer            Steps taken
WhitespaceAnalyzer  Splits tokens at whitespace
SimpleAnalyzer      Divides text at nonletter characters and lowercases
StopAnalyzer        Divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer    Tokenizes based on a sophisticated grammar that recognizes e-mail addresses, acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more; lowercases; and removes stop words
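
The first three rows of the table can be imitated in plain Ruby to get a feel for how the token streams differ. This is only a rough approximation of the described steps, not Lucene's implementation: the method names are hypothetical, the stop-word list is a small illustrative one, and StandardAnalyzer is left out because its grammar-based tokenizing does not reduce to a one-liner.

```ruby
# Illustrative stop-word list (an assumption, not Lucene's actual list).
STOP_WORDS = %w[a an and in of the].freeze

# WhitespaceAnalyzer-like: split at whitespace, leave tokens untouched.
def whitespace_analyze(text)
  text.split
end

# SimpleAnalyzer-like: split at non-letter characters, then lowercase.
def simple_analyze(text)
  text.scan(/[[:alpha:]]+/).map(&:downcase)
end

# StopAnalyzer-like: same as simple_analyze, minus the stop words.
def stop_analyze(text)
  simple_analyze(text) - STOP_WORDS
end

sample = "The XY&Z Corporation - xyz@example.com"

p whitespace_analyze(sample)
p simple_analyze(sample)
p stop_analyze(sample)
```

Note how splitting at non-letter characters breaks the e-mail address and "XY&Z" into several tokens, which is the kind of input the table says StandardAnalyzer's grammar is meant to recognize.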
The built-in analyzers we discuss in this section (WhitespaceAnalyzer,
SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer) are designed to work with text in
almost any Western (European-based) language. You can see the effect of each of
these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and
SimpleAnalyzer are both trivial and we don’t cover them in more detail here. We explore
the StopAnalyzer and StandardAnalyzer in more depth because they have non-
38. Index options: store
Value         Description
:no           Don’t store field.
:yes          Store field in its original format. Use this value if you want to highlight matches or print match excerpts à la Google search.
:compressed   Store field in compressed format.
39. Index options: index
Value                     Description
:no                       Do not make this field searchable.
:yes                      Make this field searchable and tokenize its contents.
:untokenized              Make this field searchable but do not tokenize its contents. Use this value for fields you wish to sort by.
:omit_norms               Same as :yes except omit the norms file. The norms file can be omitted if you don’t boost any fields and you don’t need scoring based on field length.
:untokenized_omit_norms   Same as :untokenized except omit the norms file.
Ruby Day Kraków: Full Text Search with Ferret
40. Index options: term_vector
Value                     Description
:no                       Don’t store term-vectors.
:yes                      Store term-vectors without storing positions or offsets.
:with_positions           Store term-vectors with positions.
:with_offsets             Store term-vectors with offsets.
:with_positions_offsets   Store term-vectors with positions and offsets.
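
The three option groups (store, index, term_vector) are set per field. The fragment below sketches one way they might be combined using Ferret's `FieldInfos` class; it assumes the `ferret` gem is installed, and the field names (`:title`, `:body`, `:created_on`) are hypothetical examples, not fields from the talk.

```ruby
require 'ferret'

# Defaults for any field that is not configured explicitly below.
field_infos = Ferret::Index::FieldInfos.new(:store       => :no,
                                            :index       => :yes,
                                            :term_vector => :no)

# Stored and searchable as a single untokenized value, e.g. for sorting.
field_infos.add_field(:title, :store => :yes, :index => :untokenized)

# Stored with positions and offsets, e.g. for highlighting matches.
field_infos.add_field(:body, :store => :yes,
                             :term_vector => :with_positions_offsets)

# Searchable but neither tokenized nor length-normalized.
field_infos.add_field(:created_on, :index => :untokenized_omit_norms)
```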