1. Lucene for Solr Developers
UberConf - July 14, 2011
Presented by Erik Hatcher
erik.hatcher@lucidimagination.com
Lucid Imagination
http://www.lucidimagination.com
4. Customizing - Don't do it!
• Unless you need to.
• In other words... make sure you've given the built-in
capabilities a try, asked on the e-mail list, and
spelunked into Solr's code at least a bit to make
some sense of the situation.
• But we're here to roll up our sleeves, because we
need to...
5. But first...
• Look at Lucene and/or Solr source code as
appropriate
• Carefully read javadocs and wiki pages - lots of tips
there
• And, hey, search for what you're trying to do...
• Google, of course
• But try out LucidFind and other Lucene ecosystem
specific search systems -
http://www.lucidimagination.com/search/
7. Factories
• FooFactory is (almost) everywhere;
sometimes there's a BarPlugin style instead
• For the sake of discussion... let's just skip the
"factory" part
• In Solr, configuration loading uses Factories and
Plugins to parameterize and construct the
underlying components
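To make the "parameterize and construct" idea concrete, here is a minimal plain-Java sketch. It does not use Solr's classes: SketchFilter, TruncateFilterFactory, and the maxLength argument are all hypothetical names made up for illustration. The factory is configured once from a map of arguments, the way a config loader might do it, and is then asked for instances.

```java
import java.util.Map;

/** Hypothetical component type; stands in for whatever the factory builds. */
interface SketchFilter {
    String apply(String token);
}

/** Hypothetical factory: configured once from arguments, then creates instances. */
class TruncateFilterFactory {
    private int maxLength = 5; // default used when no argument is given

    /** Parameterize: read settings from the (assumed) configuration arguments. */
    void init(Map<String, String> args) {
        String value = args.get("maxLength");
        if (value != null) {
            maxLength = Integer.parseInt(value);
        }
    }

    /** Construct: build a component that uses the configured settings. */
    SketchFilter create() {
        final int limit = maxLength;
        return new SketchFilter() {
            public String apply(String token) {
                return token.length() <= limit ? token : token.substring(0, limit);
            }
        };
    }
}
```

The point of the split is that configuration is parsed once in init(), while create() can be called as often as needed.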
8. "Installing" plugins
• Compile .java to .class, JAR it up
• Put JAR files in one of:
  • <solr-home>/lib
  • a shared lib when using multicore
  • anywhere, and register the location in
    solrconfig.xml (with a <lib> element)
• Hook in plugins as appropriate
14. CharFilter
• extend BaseCharFilter
• enables pre-tokenization filtering/morphing
of incoming field value
• only affects tokenization, not stored value
• Built-in CharFilters: HTMLStripCharFilter,
PatternReplaceCharFilter, and
MappingCharFilter
15. Tokenizer
• common to extend CharTokenizer
  • implement -
    protected abstract boolean isTokenChar(int c);
  • optionally override -
    protected int normalize(int c)
• extend Tokenizer directly for finer control
• Popular built-in Tokenizers include: WhitespaceTokenizer,
StandardTokenizer, PatternTokenizer, KeywordTokenizer,
ICUTokenizer
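The isTokenChar/normalize contract above can be sketched in plain Java. This is a model of the idea only, not Lucene's CharTokenizer: SketchCharTokenizer, its tokenize() method, and LowerCaseAlnumTokenizer are hypothetical names, and the real class works with a Reader and Attributes rather than returning a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the CharTokenizer contract; not Lucene's class. */
abstract class SketchCharTokenizer {
    /** Return true if the code point belongs inside a token. */
    protected abstract boolean isTokenChar(int c);

    /** Optional per-code-point normalization; identity by default. */
    protected int normalize(int c) {
        return c;
    }

    /** Split text into maximal runs of token characters. */
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c); // step by code point, not by char
            if (isTokenChar(c)) {
                current.appendCodePoint(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

/** Hypothetical example: letters and digits form tokens, lowercased. */
class LowerCaseAlnumTokenizer extends SketchCharTokenizer {
    @Override
    protected boolean isTokenChar(int c) {
        return Character.isLetterOrDigit(c);
    }

    @Override
    protected int normalize(int c) {
        return Character.toLowerCase(c);
    }
}
```

With this sketch, new LowerCaseAlnumTokenizer().tokenize("Hello, Solr 3.3!") yields the tokens hello, solr, 3, 3.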
16. TokenFilter
• a TokenStream whose input is another
TokenStream
• Popular TokenFilters include:
LowerCaseFilter, CommonGramsFilter,
SnowballFilter, StopFilter,
WordDelimiterFilter
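The "stream wrapping another stream" shape can be sketched with plain iterators. This is an analogy only, assuming a token is just a String; Lucene's TokenStream uses incrementToken() and Attributes instead. SketchLowerCaseFilter and SketchStopFilter are hypothetical names.

```java
import java.util.Iterator;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;

/** Lowercases every token pulled from the wrapped stream. */
class SketchLowerCaseFilter implements Iterator<String> {
    private final Iterator<String> input;

    SketchLowerCaseFilter(Iterator<String> input) {
        this.input = input;
    }

    public boolean hasNext() {
        return input.hasNext();
    }

    public String next() {
        return input.next().toLowerCase(Locale.ROOT);
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}

/** Drops tokens that appear in a stop set. */
class SketchStopFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final Set<String> stopWords;
    private String pending; // next surviving token, or null if none buffered

    SketchStopFilter(Iterator<String> input, Set<String> stopWords) {
        this.input = input;
        this.stopWords = stopWords;
    }

    public boolean hasNext() {
        // Pull from the wrapped stream until a non-stop token is found.
        while (pending == null && input.hasNext()) {
            String candidate = input.next();
            if (!stopWords.contains(candidate)) {
                pending = candidate;
            }
        }
        return pending != null;
    }

    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String result = pending;
        pending = null;
        return result;
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}
```

Order matters when chaining: wrapping the lowercase filter inside the stop filter means a lowercase stop set also catches "The".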
17. Lucene's analysis APIs
• tricky business, what with Attributes
(AttributeSource, AttributeFactory), State, characters,
code points, Version, etc...
• Test!!!
• BaseTokenStreamTestCase
• Look at Lucene and Solr's test cases
22. Built-in QParsers
From QParserPlugin.java
  /** internal use - name to class mappings of builtin parsers */
  public static final Object[] standardPlugins = {
    LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
    OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
    FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
    PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
    BoostQParserPlugin.NAME, BoostQParserPlugin.class,
    DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
    ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
    FieldQParserPlugin.NAME, FieldQParserPlugin.class,
    RawQParserPlugin.NAME, RawQParserPlugin.class,
    TermQParserPlugin.NAME, TermQParserPlugin.class,
    NestedQParserPlugin.NAME, NestedQParserPlugin.class,
    FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
    SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
    SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
    JoinQParserPlugin.NAME, JoinQParserPlugin.class,
  };
23. Local Parameters
• {!qparser_name param=value}expression
• or
• {!qparser_name param=value v=expression}
• Can substitute $references from request
parameters
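To show how the two local-params forms above break down, here is a simplified plain-Java sketch. It is not Solr's parser: LocalParamsSketch is a hypothetical class, it ignores quoting and escaping entirely, and storing the parser name under "type" and the expression under "v" is this sketch's convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LocalParamsSketch {
    /**
     * Parses "{!name k=v ...}expression" into a map. The parser name goes
     * under "type" and the expression under "v". Values starting with '$'
     * are looked up in requestParams. Quoting and escaping are not handled.
     */
    static Map<String, String> parse(String query, Map<String, String> requestParams) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        if (!query.startsWith("{!")) {
            result.put("v", query); // no local params: the whole string is the expression
            return result;
        }
        int end = query.indexOf('}');
        if (end < 0) {
            throw new IllegalArgumentException("missing closing '}' in: " + query);
        }
        for (String part : query.substring(2, end).trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            int eq = part.indexOf('=');
            if (eq < 0) {
                result.put("type", part); // bare word: the parser name
            } else {
                String value = part.substring(eq + 1);
                if (value.startsWith("$")) {
                    value = requestParams.get(value.substring(1)); // $reference
                }
                result.put(part.substring(0, eq), value);
            }
        }
        if (!result.containsKey("v")) {
            result.put("v", query.substring(end + 1)); // trailing expression form
        }
        return result;
    }
}
```

Both "{!dismax qf=title}ipod" and "{!dismax qf=title v=$qq}" with a request parameter qq=ipod end up with the same type, qf, and v entries.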
25. Custom QParser
• Implement a QParserPlugin that creates your
custom QParser
• Register in solrconfig.xml
• <queryParser name="myparser"
class="com.mycompany.MyQParserPlugin"/>
27. Built-in Update Processors
• RunUpdateProcessor
  • Actually performs the operations, such as
    adding the documents to the index
• LogUpdateProcessor
  • Logs each operation
• SignatureUpdateProcessor
  • Duplicate detection and, optionally, rejection
29. Update Processor Chain
• UpdateProcessors are sequenced into a chain
• Each processor can abort the entire update
or hand processing to the next processor in
the chain
• Chains of update processor factories are
specified in solrconfig.xml
• Update requests can specify an
update.processor parameter to select a chain
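The "abort or hand off" behavior of a chain can be sketched in plain Java. None of these are Solr classes: SketchUpdateProcessor, RejectEmptyDocProcessor, AddFieldProcessor, and CollectingProcessor are hypothetical, and a document is modeled as a simple map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Base link in the chain: by default, hand the document to the next link. */
abstract class SketchUpdateProcessor {
    private final SketchUpdateProcessor next;

    SketchUpdateProcessor(SketchUpdateProcessor next) {
        this.next = next;
    }

    void processAdd(Map<String, Object> doc) {
        if (next != null) {
            next.processAdd(doc);
        }
    }
}

/** Aborts the update by throwing; later links never see the document. */
class RejectEmptyDocProcessor extends SketchUpdateProcessor {
    RejectEmptyDocProcessor(SketchUpdateProcessor next) {
        super(next);
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        if (doc.isEmpty()) {
            throw new IllegalArgumentException("empty document");
        }
        super.processAdd(doc);
    }
}

/** Modifies the document, then passes it along. */
class AddFieldProcessor extends SketchUpdateProcessor {
    private final String name;
    private final Object value;

    AddFieldProcessor(String name, Object value, SketchUpdateProcessor next) {
        super(next);
        this.name = name;
        this.value = value;
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        doc.put(name, value);
        super.processAdd(doc);
    }
}

/** End of the chain: stands in for the step that actually writes to the index. */
class CollectingProcessor extends SketchUpdateProcessor {
    final List<Map<String, Object>> added = new ArrayList<Map<String, Object>>();

    CollectingProcessor() {
        super(null);
    }

    @Override
    void processAdd(Map<String, Object> doc) {
        added.add(doc);
    }
}
```

A chain is built back to front, for example new RejectEmptyDocProcessor(new AddFieldProcessor("source", "demo", sink)), so each link receives its successor at construction time.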
30. Default update processor chain
From SolrCore.java
  // construct the default chain
  UpdateRequestProcessorFactory[] factories =
    new UpdateRequestProcessorFactory[]{
      new RunUpdateProcessorFactory(),
      new LogUpdateProcessorFactory()
    };
Note: these steps have been swapped on trunk recently
31. Example Update Processor
• What are the best facets to show for a particular
query? Wouldn't it be nice to see the distribution of
document "attributes" represented across a result
set?
• Learned this trick from the Smithsonian, who were
doing it manually - add an indexed field containing the
field names of the interesting other fields on the
document.
• Facet on that field "of field names" initially, then
request facets on the top values returned.
33. FieldsUsedUpdateProcessorFactory
  public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
    private String fieldsUsedFieldName;  // field that receives the matching field names
    private Pattern fieldNamePattern;    // which incoming field names to record

    // Called per update request to create this factory's link in the chain
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
      return new FieldsUsedUpdateProcessor(req, rsp, this, next);
    }

    // ... next slide ...
  }
34. FieldsUsedUpdateProcessorFactory
  // Reads the factory's settings from its solrconfig.xml arguments
  @Override
  public void init(NamedList args) {
    if (args == null) return;

    SolrParams params = SolrParams.toSolrParams(args);

    fieldsUsedFieldName = params.get("fieldsUsedFieldName");
    if (fieldsUsedFieldName == null) {
      throw new SolrException
        (SolrException.ErrorCode.SERVER_ERROR,
          "fieldsUsedFieldName must be specified");
    }
    // TODO check that fieldsUsedFieldName is a valid field name and multiValued

    String fieldNameRegex = params.get("fieldNameRegex");
    if (fieldNameRegex == null) {
      throw new SolrException
        (SolrException.ErrorCode.SERVER_ERROR,
          "fieldNameRegex must be specified");
    }
    fieldNamePattern = Pattern.compile(fieldNameRegex);

    super.init(args);
  }
35. FieldsUsedUpdateProcessor
  // Assumed to be an inner class of the factory: it reads the factory's
  // fieldNamePattern and fieldsUsedFieldName fields directly
  class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
    public FieldsUsedUpdateProcessor(SolrQueryRequest req,
                                     SolrQueryResponse rsp,
                                     FieldsUsedUpdateProcessorFactory factory,
                                     UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();

      // Collect the names of the incoming fields that match the pattern
      Collection<String> incomingFieldNames = doc.getFieldNames();
      ArrayList<String> usedFields = new ArrayList<String>();
      for (String f : incomingFieldNames) {
        if (fieldNamePattern.matcher(f).matches()) {
          usedFields.add(f);
        }
      }

      // Record them on the document, then hand off to the next processor
      doc.addField(fieldsUsedFieldName, usedFields.toArray());
      super.processAdd(cmd);
    }
  }
38. Example - auto facet select
• It sure would be nice if Solr could automatically
select field(s) for faceting, based dynamically on the
profile of the results. For example, you're indexing
disparate types of products, all with varying attributes
(color and size for apparel, memory_size for
electronics, subject for books, etc.), and a user searches
for "ipod", where most matching products have
color and memory_size attributes... let's automatically
facet on those fields.
• https://issues.apache.org/jira/browse/SOLR-2641
39. AutoFacetSelection Component
• Too much code for a slide, let's take a look in
an IDE...
• Basically -
  • process() gets the autofacet.field and autofacet.n
    request params, facets on that field, takes the top N
    values, and sets those as facet.field's
  • Gotcha - need to call rb.setNeedDocSet(true)
    in prepare(), as faceting needs it
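The "take the top N values" step described above can be sketched in plain Java. This is not the component's code from SOLR-2641: AutoFacetSketch and topFacetFields are hypothetical names, and the input is assumed to be the facet counts of the "field names" field, already computed as a map from field name to document count.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class AutoFacetSketch {
    /**
     * Returns the n field names with the highest counts, highest first.
     * Ties are broken alphabetically so the result is deterministic.
     */
    static List<String> topFacetFields(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        entries.sort(new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

In the "ipod" scenario from the previous slide, counts dominated by color and memory_size with n = 2 would select exactly those two fields as the facet.field values.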