2. Abstract
You’re Solr powered, and needing to customize its
capabilities. Apache Solr is flexibly architected, with
practically everything pluggable. Under the hood, Solr is
driven by the well-known Apache Lucene. Lucene for
Solr Developers will guide you through the various ways
in which Solr can be extended, customized, and enhanced
with a bit of Lucene API know-how. We’ll delve into
improving analysis with custom character mapping,
tokenizing, and token filtering extensions; show why and
how to implement specialized query parsing, and how to
add your own search and update request handling.
3. About me...
• Co-author, “Lucene in Action”
• Committer, Lucene and Solr
• Lucene PMC and ASF member
• Member of Technical Staff / co-founder,
Lucid Imagination
4. ... works
search platform
www.lucidimagination.com
5. What is Lucene?
• An open source search library (not an application)
• 100% Java
• Continuously improved and tuned over more than
10 years
• Compact, portable index representation
• Programmable text analyzers, spell checking and
highlighting
• Not a crawler or a text extraction tool
6. Inverted Index
• Lucene stores input data in what is known as an
inverted index
• In an inverted index each indexed term points to a
list of documents that contain the term
• Similar to the index provided at the end of a book
• In this case "inverted" simply means the list of terms points to documents
• It is much faster to find a term in an index than to scan all the documents
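The lookup idea can be sketched in a few lines of plain Java (a toy illustration, not Lucene's actual data structures):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: each term maps to the set of doc ids containing it
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    public void add(int docId, String text) {
        // naive whitespace "analysis" for illustration only
        for (String term : text.toLowerCase().split("\\s+")) {
            Set<Integer> postings = index.get(term);
            if (postings == null) {
                postings = new TreeSet<Integer>();
                index.put(term, postings);
            }
            postings.add(docId);
        }
    }

    // a single hash lookup, instead of scanning every document for the term
    public Set<Integer> docsContaining(String term) {
        Set<Integer> postings = index.get(term.toLowerCase());
        return postings == null ? new TreeSet<Integer>() : postings;
    }
}
```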
8. Segments and Merging
• A Lucene index is a collection of one or more sub-indexes
called segments
• Each segment is a fully independent index
• A multi-way merge algorithm is used to periodically merge
segments
• New segments are created when an IndexWriter flushes new
documents and pending deletes to disk
• Trying to balance large-scale performance against small-scale updates
• Optimization merges all segments into one
10. Segments
• When a document is deleted it still exists
in an index segment until that segment is
merged
• At certain trigger points, buffered documents are flushed to the Directory
• Can be forced by calling commit
• Segments are periodically merged
15. Lucene Scoring
• Lucene uses a similarity scoring formula to rank results by measuring the
similarity between a query and the documents that match the query. The
factors that form the scoring formula are:
• Term Frequency: tf (t in d). How often the term occurs in the document.
• Inverse Document Frequency: idf (t). A measure of how rare the term is in the whole collection - roughly one over the number of documents that contain the term.
• Terms that are rare throughout the entire collection score higher.
16. Coord and Norms
• Coord: The coordination factor, coord (q, d).
Boosts documents that match more of the
search terms than other documents.
• If 4 of 4 terms match coord = 4/4
• If 3 of 4 terms match coord = 3/4
• Length Normalization - Adjust the score based
on length of fields in the document.
• shorter fields that match get a boost
17. Scoring Factors (cont)
• Boost: (t.field in d). A way to boost a field
or a whole document above others.
• Query Norm: (q). Normalization value
for a query, given the sum of the squared
weights of each of the query terms.
• You will often hear the Lucene scoring
simply referred to as
TF·IDF.
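Assembled from the factors on the last few slides, the classic Lucene practical scoring function (as documented in the Similarity javadocs of this era) is, for a query q and document d:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)
```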
18. Explanation
• Lucene has a feature called Explanation
• Solr uses the debugQuery parameter to
retrieve scoring explanations
0.2987913 = (MATCH) fieldWeight(text:lucen in 688), product of:
1.4142135 = tf(termFreq(text:lucen)=2)
9.014501 = idf(docFreq=3, maxDocs=12098)
0.0234375 = fieldNorm(field=text, doc=688)
21. Customizing - Don't do it!
• Unless you need to.
• In other words... ensure you've given the built-in capabilities a try, asked on the e-mail list, and spelunked at least a bit into Solr's code to make some sense of the situation.
• But we're here to roll up our sleeves, because we
need to...
22. But first...
• Look at Lucene and/or Solr source code as
appropriate
• Carefully read javadocs and wiki pages - lots of tips
there
• And, hey, search for what you're trying to do...
• Google, of course
• But try out LucidFind and other Lucene ecosystem
specific search systems -
http://www.lucidimagination.com/search/
24. Factories
• FooFactory (almost) everywhere; sometimes there's a BarPlugin style instead
• for sake of discussion... let's just skip the "factory" part
• In Solr, factories and plugins are used during configuration loading to parameterize and construct the actual implementations
25. "Installing" plugins
• Compile .java to .class, JAR it up
• Put JAR files in either:
• <solr-home>/lib
• a shared lib when using multicore
• anywhere, and register location in
solrconfig.xml
• Hook in plugins as appropriate
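For the "register location in solrconfig.xml" option, the config looks roughly like this (directory and JAR names are made-up examples):

```xml
<!-- in solrconfig.xml: pull in every JAR from a directory, or a single JAR -->
<lib dir="/path/to/custom-plugins/lib" />
<lib path="/path/to/custom-plugins/my-plugin.jar" />
```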
31. CharFilter
• extend BaseCharFilter
• enables pre-tokenization filtering/morphing
of incoming field value
• only affects tokenization, not stored value
• Built-in CharFilters: HTMLStripCharFilter,
PatternReplaceCharFilter, and
MappingCharFilter
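As a point of reference before writing your own, here is how a built-in one, MappingCharFilter, is typically wired into an analyzer in schema.xml (the field type name is an example; the mapping file ships with the Solr example config):

```xml
<fieldType name="text_mapped" class="solr.TextField">
  <analyzer>
    <!-- runs before the tokenizer; affects tokens, not the stored value -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```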
32. Tokenizer
• common to extend CharTokenizer
• implement -
• protected abstract boolean isTokenChar(int c);
• optionally override -
• protected int normalize(int c)
• extend Tokenizer directly for finer control
• Popular built-in Tokenizers include: WhitespaceTokenizer,
StandardTokenizer, PatternTokenizer, KeywordTokenizer,
ICUTokenizer
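The isTokenChar/normalize contract can be mimicked in plain Java to see what CharTokenizer does for you (a sketch of the idea, not the real Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class LetterTokenizerSketch {
    // mirrors CharTokenizer.isTokenChar(int): which chars belong inside a token
    protected boolean isTokenChar(int c) {
        return Character.isLetter(c);
    }

    // mirrors CharTokenizer.normalize(int): per-char transform, identity here
    protected int normalize(int c) {
        return c;
    }

    // emit maximal runs of token chars, dropping everything else
    public List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int c = input.codePointAt(i);
            if (isTokenChar(c)) {
                current.appendCodePoint(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }
}
```

Subclasses override isTokenChar (and optionally normalize) to get a different tokenizer for free.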
33. TokenFilter
• a TokenStream whose input is another
TokenStream
• Popular TokenFilters include:
LowerCaseFilter, CommonGramsFilter,
SnowballFilter, StopFilter,
WordDelimiterFilter
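The wrapping idea can be sketched with plain Java iterators (illustrative only; real TokenFilters use Lucene's attribute-based API):

```java
import java.util.Iterator;

// stand-in for TokenStream: anything that yields tokens one at a time
interface TokenStreamSketch extends Iterator<String> {}

// stand-in for a filter like LowerCaseFilter: its input is another stream,
// and filters chain simply by wrapping one around another
class LowerCaseFilterSketch implements TokenStreamSketch {
    private final Iterator<String> input;

    LowerCaseFilterSketch(Iterator<String> input) {
        this.input = input;
    }

    public boolean hasNext() { return input.hasNext(); }

    public String next() { return input.next().toLowerCase(); }
}
```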
34. Lucene's analysis APIs
• tricky business, what with Attributes (Source/Factory's), State, characters, code points, Version, etc...
• Test!!!
• BaseTokenStreamTestCase
• Look at Lucene and Solr's test cases
39. Built-in QParsers
from QParserPlugin.java
/** internal use - name to class mappings of builtin parsers */
public static final Object[] standardPlugins = {
LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
BoostQParserPlugin.NAME, BoostQParserPlugin.class,
DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
FieldQParserPlugin.NAME, FieldQParserPlugin.class,
RawQParserPlugin.NAME, RawQParserPlugin.class,
TermQParserPlugin.NAME, TermQParserPlugin.class,
NestedQParserPlugin.NAME, NestedQParserPlugin.class,
FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
JoinQParserPlugin.NAME, JoinQParserPlugin.class,
};
40. Local Parameters
• {!qparser_name param=value}expression
• or
• {!qparser_name param=value v=expression}
• Can substitute $references from request
parameters
44. Built-in Update
Processors
• RunUpdateProcessor
• Actually performs the operations, such as
adding the documents to the index
• LogUpdateProcessor
• Logs each operation
• SignatureUpdateProcessor
• duplicate detection and optionally rejection
46. Update Processor
Chain
• UpdateProcessors sequence into a chain
• Each processor can abort the entire update or hand processing to the next processor in the chain
• Chains of update processor factories are specified in solrconfig.xml
• Update requests can specify an
update.processor parameter
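In solrconfig.xml, a chain declaration looks roughly like this (the chain name and parameter values are made-up examples; the custom factory is the one shown later in this deck):

```xml
<updateRequestProcessorChain name="fields-used">
  <processor class="FieldsUsedUpdateProcessorFactory">
    <str name="fieldsUsedFieldName">attribute_fields</str>
    <str name="fieldNameRegex">.*_attribute</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```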
47. Default update
processor chain
From SolrCore.java
// construct the default chain
UpdateRequestProcessorFactory[] factories =
new UpdateRequestProcessorFactory[]{
new RunUpdateProcessorFactory(),
new LogUpdateProcessorFactory()
};
Note: these steps have been swapped on trunk recently
48. Example Update
Processor
• What are the best facets to show for a particular
query? Wouldn't it be nice to see the distribution of
document "attributes" represented across a result
set?
• Learned this trick from the Smithsonian, who were doing it manually - add an indexed field containing the names of the other interesting fields on the document.
• Facet on that field "of field names" initially, then
request facets on the top values returned.
50. FieldsUsedUpdateProcessorFactory
public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
private String fieldsUsedFieldName;
private Pattern fieldNamePattern;
public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
UpdateRequestProcessor next) {
return new FieldsUsedUpdateProcessor(req, rsp, this, next);
}
// ... next slide ...
}
51. FieldsUsedUpdateProcessorFactory
@Override
public void init(NamedList args) {
if (args == null) return;
SolrParams params = SolrParams.toSolrParams(args);
fieldsUsedFieldName = params.get("fieldsUsedFieldName");
if (fieldsUsedFieldName == null) {
throw new SolrException
(SolrException.ErrorCode.SERVER_ERROR,
"fieldsUsedFieldName must be specified");
}
// TODO check that fieldsUsedFieldName is a valid field name and multiValued
String fieldNameRegex = params.get("fieldNameRegex");
if (fieldNameRegex == null) {
throw new SolrException
(SolrException.ErrorCode.SERVER_ERROR,
"fieldNameRegex must be specified");
}
fieldNamePattern = Pattern.compile(fieldNameRegex);
super.init(args);
}
52. class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
// keep a handle on the factory so its configured settings are reachable
// (assumes this class can see the factory's fields, e.g. as a nested class)
private final FieldsUsedUpdateProcessorFactory factory;
public FieldsUsedUpdateProcessor(SolrQueryRequest req,
SolrQueryResponse rsp,
FieldsUsedUpdateProcessorFactory factory,
UpdateRequestProcessor next) {
super(next);
this.factory = factory;
}
@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
SolrInputDocument doc = cmd.getSolrInputDocument();
ArrayList<String> usedFields = new ArrayList<String>();
for (String f : doc.getFieldNames()) {
if (factory.fieldNamePattern.matcher(f).matches()) {
usedFields.add(f);
}
}
doc.addField(factory.fieldsUsedFieldName, usedFields.toArray());
super.processAdd(cmd);
}
}
55. Example - auto facet
select
• It sure would be nice if Solr could automatically select field(s) for faceting, dynamically, based on the profile of the results. For example, you're indexing disparate types of products, all with varying attributes (color and size - for apparel; memory_size - for electronics; subject - for books; etc.), and a user searches for "ipod" where most matching products have color and memory_size attributes... let's automatically facet on those fields.
• https://issues.apache.org/jira/browse/SOLR-2641
56. AutoFacetSelection
Component
• Too much code for a slide, let's take a look in
an IDE...
• Basically -
• process() gets the autofacet.field and autofacet.n request params, facets on that field, takes the top N values, and sets those as facet.field's
• Gotcha - need to call rb.setNeedDocSet(true) in prepare() as faceting needs it