Lucene for Solr Developers

uberconf - July 14, 2011
Presented by Erik Hatcher
erik.hatcher@lucidimagination.com
Lucid Imagination
http://www.lucidimagination.com

Lucene Core
• IndexWriter
• Directory
• IndexReader, IndexSearcher
• analysis: Analyzer, TokenStream,
Tokenizer, TokenFilter
• Query

Customizing - Don't do it!

• Unless you need to.
• In other words... ensure you've given the built-in
capabilities a try, asked on the e-mail list, and
spelunked into at least Solr's code a bit to make
some sense of the situation.
• But we're here to roll up our sleeves, because we
need to...

But first...
• Look at Lucene and/or Solr source code as
appropriate

• Carefully read javadocs and wiki pages - lots of tips
there

• And, hey, search for what you're trying to do...

• Google, of course

• But try out LucidFind and other Lucene ecosystem-
specific search systems -
http://www.lucidimagination.com/search/

Extension points
• Tokenizer, TokenFilter, CharFilter
• SearchComponent
• RequestHandler
• ResponseWriter
• FieldType
• Similarity
• QParser
• DataImportHandler hooks: data sources, entity processors, transformers
• several others

Factories
• FooFactory (most) everywhere.
Sometimes there's BarPlugin style

• for the sake of discussion... let's just skip the
"factory" part
• In Solr, Factories and Plugins are used by
configuration loading to parameterize and
construct components

"Installing" plugins
• Compile .java to .class, JAR it up
• Put JAR files in either:
• <solr-home>/lib
• a shared lib when using multicore
• anywhere, and register location in
solrconfig.xml
• Hook in plugins as appropriate

Multicore sharedLib

<solr sharedLib="/usr/local/solr/customlib"
persistent="true">
<cores adminPath="/admin/cores">
<core instanceDir="core1" name="core1"/>
<core instanceDir="core2" name="core2"/>
</cores>
</solr>

Plugins via
solrconfig.xml

• <lib dir="/path/to/your/custom/jars" />

Analysis

• CharFilter
• Tokenizer
• TokenFilter

Primer

• Tokens, Terms
• Attributes: Type, Payloads, Offsets,
Positions, Term Vectors
• part of the picture:

Version

• enum:
• Version.LUCENE_31,
Version.LUCENE_32, etc
• Version.onOrAfter(Version other)
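
To make the version check concrete, here is a minimal sketch of gating behavior on a match version. It does not use Lucene: SketchVersion and VersionGateSketch are hypothetical stand-ins, and the comparison by enum order is an assumption about how such a check can be written, not a copy of Lucene's source.

```java
// Hypothetical stand-in for a version enum; not org.apache.lucene.util.Version.
enum SketchVersion {
  LUCENE_30, LUCENE_31, LUCENE_32;

  /** True when this version is the same as or newer than the other (declaration order). */
  boolean onOrAfter(SketchVersion other) {
    return compareTo(other) >= 0;
  }
}

// Illustrative component that switches behavior on the configured match version.
final class VersionGateSketch {
  static String describe(SketchVersion matchVersion) {
    // Newer behavior is only enabled when the caller opts in to 3.1 or later.
    return matchVersion.onOrAfter(SketchVersion.LUCENE_31)
        ? "new behavior"
        : "legacy behavior";
  }
}
```

The point of the pattern is backwards compatibility: an index built with older analysis rules keeps getting those rules until the configured version is raised.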

CharFilter
• extend BaseCharFilter
• enables pre-tokenization filtering/morphing
of incoming field value
• only affects tokenization, not stored value
• Built-in CharFilters: HTMLStripCharFilter,
PatternReplaceCharFilter, and
MappingCharFilter

Tokenizer
• common to extend CharTokenizer

• implement -

• protected abstract boolean isTokenChar(int c);

• optionally override -

• protected int normalize(int c)

• extend Tokenizer directly for finer control

• Popular built-in Tokenizers include: WhitespaceTokenizer,
StandardTokenizer, PatternTokenizer, KeywordTokenizer,
ICUTokenizer
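
The isTokenChar/normalize contract can be tried out without Lucene on the classpath. The sketch below is a plain-Java stand-in, not Lucene's CharTokenizer: SimpleCharTokenizer, its tokenize method, and LowerCaseLetterTokenizer are hypothetical names, and the real class works through the TokenStream attribute API rather than returning a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the CharTokenizer contract; not a Lucene class.
abstract class SimpleCharTokenizer {
  /** Returns true if the code point belongs inside a token. */
  protected abstract boolean isTokenChar(int c);

  /** Hook for per-character normalization; identity by default. */
  protected int normalize(int c) {
    return c;
  }

  /** Splits text into maximal runs of token characters. */
  public List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < text.length(); ) {
      int c = text.codePointAt(i);
      i += Character.charCount(c);
      if (isTokenChar(c)) {
        current.appendCodePoint(normalize(c));
      } else if (current.length() > 0) {
        // A non-token character ends the current token.
        tokens.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      tokens.add(current.toString());
    }
    return tokens;
  }
}

// Example subclass: letters form tokens, and each letter is lower-cased.
class LowerCaseLetterTokenizer extends SimpleCharTokenizer {
  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);
  }

  @Override
  protected int normalize(int c) {
    return Character.toLowerCase(c);
  }
}
```

Working on code points (int) rather than char is what lets the same two methods handle supplementary characters.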

TokenFilter

• a TokenStream whose input is another
TokenStream
• Popular TokenFilters include:
LowerCaseFilter, CommonGramsFilter,
SnowballFilter, StopFilter,
WordDelimiterFilter
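
The "stream whose input is another stream" idea is plain decoration. The sketch below models it in stdlib Java only: SimpleTokenStream and the classes around it are hypothetical, and a token is just a String here, whereas Lucene's real TokenStream exposes tokens through attributes and incrementToken().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical, simplified model: a stream yields tokens until it returns null.
interface SimpleTokenStream {
  String next();
}

// Source stream that splits on whitespace (stands in for a Tokenizer).
final class WhitespaceSource implements SimpleTokenStream {
  private final Iterator<String> parts;

  WhitespaceSource(String text) {
    List<String> list = new ArrayList<String>();
    for (String p : text.trim().split("\\s+")) {
      if (!p.isEmpty()) list.add(p);
    }
    this.parts = list.iterator();
  }

  public String next() {
    return parts.hasNext() ? parts.next() : null;
  }
}

// Filter that lower-cases every token from its input stream.
final class LowerCaseSketchFilter implements SimpleTokenStream {
  private final SimpleTokenStream input;

  LowerCaseSketchFilter(SimpleTokenStream input) {
    this.input = input;
  }

  public String next() {
    String token = input.next();
    return token == null ? null : token.toLowerCase(Locale.ROOT);
  }
}

// Filter that skips tokens found in a stop word set.
final class StopSketchFilter implements SimpleTokenStream {
  private final SimpleTokenStream input;
  private final Set<String> stopWords;

  StopSketchFilter(SimpleTokenStream input, Set<String> stopWords) {
    this.input = input;
    this.stopWords = stopWords;
  }

  public String next() {
    String token;
    while ((token = input.next()) != null) {
      if (!stopWords.contains(token)) return token;
    }
    return null;
  }
}

final class FilterChainDemo {
  /** Runs text through source -> lower-case -> stop filter and collects the tokens. */
  static List<String> analyze(String text) {
    Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "a"));
    SimpleTokenStream stream =
        new StopSketchFilter(new LowerCaseSketchFilter(new WhitespaceSource(text)), stopWords);
    List<String> out = new ArrayList<String>();
    for (String t = stream.next(); t != null; t = stream.next()) {
      out.add(t);
    }
    return out;
  }
}
```

Order matters: the stop filter sits after the lower-case filter, so "The" is lower-cased before it is compared against the stop set.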

Lucene's analysis APIs
• tricky business, what with Attributes
(Source/Factory), State, characters, code
points, Version, etc...
• Test!!!
• BaseTokenStreamTestCase
• Look at Lucene and Solr's test cases

Solr's Analysis Tools

• Admin analysis tool
• Field analysis request handler
• DEMO

Query Parsing

• String -> org.apache.lucene.search.Query

QParserPlugin
public abstract class QParserPlugin
    implements NamedListInitializedPlugin {

  public abstract QParser createParser(
      String qstr,
      SolrParams localParams,
      SolrParams params,
      SolrQueryRequest req);
}

QParser
public abstract class QParser {

  public abstract Query parse()
      throws ParseException;

}

Built-in QParsers
from QParserPlugin.java
/** internal use - name to class mappings of builtin parsers */
public static final Object[] standardPlugins = {
LuceneQParserPlugin.NAME, LuceneQParserPlugin.class,
OldLuceneQParserPlugin.NAME, OldLuceneQParserPlugin.class,
FunctionQParserPlugin.NAME, FunctionQParserPlugin.class,
PrefixQParserPlugin.NAME, PrefixQParserPlugin.class,
BoostQParserPlugin.NAME, BoostQParserPlugin.class,
DisMaxQParserPlugin.NAME, DisMaxQParserPlugin.class,
ExtendedDismaxQParserPlugin.NAME, ExtendedDismaxQParserPlugin.class,
FieldQParserPlugin.NAME, FieldQParserPlugin.class,
RawQParserPlugin.NAME, RawQParserPlugin.class,
TermQParserPlugin.NAME, TermQParserPlugin.class,
NestedQParserPlugin.NAME, NestedQParserPlugin.class,
FunctionRangeQParserPlugin.NAME, FunctionRangeQParserPlugin.class,
SpatialFilterQParserPlugin.NAME, SpatialFilterQParserPlugin.class,
SpatialBoxQParserPlugin.NAME, SpatialBoxQParserPlugin.class,
JoinQParserPlugin.NAME, JoinQParserPlugin.class,
};

Local Parameters

• {!qparser_name param=value}expression
• or
• {!qparser_name param=value v=expression}
• Can substitute $references from request
parameters
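
To see how the pieces of the local params syntax fit together, here is a deliberately simplified parser in plain Java. LocalParamsSketch is a hypothetical class, not Solr's implementation: it assumes whitespace-separated key=value pairs with no quoting or escaping, treats a leading bare word as the parser name, and resolves $references against a plain Map standing in for the request parameters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified parser for the {!name key=value}expression syntax.
// Unlike Solr's real parsing, it does not handle quoted values or escapes.
final class LocalParamsSketch {
  final String parserName;           // null when no parser name was given
  final Map<String, String> params;  // local params; "v" holds the expression

  private LocalParamsSketch(String parserName, Map<String, String> params) {
    this.parserName = parserName;
    this.params = params;
  }

  static LocalParamsSketch parse(String q, Map<String, String> requestParams) {
    Map<String, String> params = new LinkedHashMap<String, String>();
    String name = null;
    String expression = q;
    if (q.startsWith("{!")) {
      int end = q.indexOf('}');
      if (end < 0) {
        throw new IllegalArgumentException("Unterminated local params: " + q);
      }
      expression = q.substring(end + 1);
      String[] parts = q.substring(2, end).trim().split("\\s+");
      for (int i = 0; i < parts.length; i++) {
        String part = parts[i];
        int eq = part.indexOf('=');
        if (eq < 0) {
          // A leading bare word names the query parser; other bare words are ignored.
          if (i == 0 && !part.isEmpty()) name = part;
          continue;
        }
        String value = part.substring(eq + 1);
        if (value.startsWith("$")) {
          // $reference: look the value up in the request parameters.
          value = requestParams.get(value.substring(1));
        }
        params.put(part.substring(0, eq), value);
      }
    }
    // The text after the closing brace is the expression unless v= was given explicitly.
    if (!params.containsKey("v")) {
      params.put("v", expression);
    }
    return new LocalParamsSketch(name, params);
  }
}
```

With this sketch, parsing {!term f=id v=$id} against a request that carries id=FOO37 yields parser name term, f=id, and v=FOO37.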

Param Substitution
solrconfig.xml
<requestHandler name="/document"
class="solr.SearchHandler">
<lst name="invariants">
<str name="q">{!term f=id v=$id}</str>
</lst>
</requestHandler>

Solr request
http://localhost:8983/solr/document?id=FOO37

Custom QParser

• Implement a QParserPlugin that creates your
custom QParser
• Register in solrconfig.xml
• <queryParser name="myparser"
class="com.mycompany.MyQParserPlugin"/>

Update Processor

• Responsible for handling these commands:
• add/update
• delete
• commit
• merge indexes

Built-in Update
Processors
• RunUpdateProcessor
• Actually performs the operations, such as
adding the documents to the index
• LogUpdateProcessor
• Logs each operation
• SignatureUpdateProcessor
• duplicate detection and, optionally, rejection

UIMA Update
Processor
• UIMA - Unstructured Information Management
Architecture - http://uima.apache.org/

• Enables UIMA components to augment
documents

• Entity extraction, automated categorization,
language detection, etc

• "contrib" plugin

• http://wiki.apache.org/solr/SolrUIMA

Update Processor
Chain
• UpdateProcessors are sequenced into a chain
• Each processor can abort the entire update
or hand processing to the next processor in
the chain
• Chains of update processor factories are
specified in solrconfig.xml
• Update requests can specify an
update.processor parameter
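
The "abort or hand to the next processor" mechanics are a chain of responsibility. The sketch below shows only that control flow in plain Java: SketchProcessor, RequireIdProcessor and CollectingProcessor are hypothetical names, a document is modeled as a Map, and the in-memory list stands in for the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of a processor chain; these are not Solr classes.
abstract class SketchProcessor {
  private final SketchProcessor next;

  SketchProcessor(SketchProcessor next) {
    this.next = next;
  }

  /** Default behavior: hand the document to the next processor, if any. */
  void processAdd(Map<String, Object> doc) {
    if (next != null) {
      next.processAdd(doc);
    }
  }
}

// Aborts the whole update by throwing when the document has no id.
final class RequireIdProcessor extends SketchProcessor {
  RequireIdProcessor(SketchProcessor next) {
    super(next);
  }

  @Override
  void processAdd(Map<String, Object> doc) {
    if (!doc.containsKey("id")) {
      throw new IllegalArgumentException("document is missing an id");
    }
    super.processAdd(doc); // continue down the chain
  }
}

// Terminal processor: plays the "actually perform the add" role by collecting documents.
final class CollectingProcessor extends SketchProcessor {
  final List<Map<String, Object>> index = new ArrayList<Map<String, Object>>();

  CollectingProcessor() {
    super(null);
  }

  @Override
  void processAdd(Map<String, Object> doc) {
    index.add(doc);
    super.processAdd(doc);
  }
}
```

Because each processor decides whether to call super.processAdd, a processor placed early in the chain can stop a document before it ever reaches the one that writes to the index.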

Default update
processor chain
From SolrCore.java
// construct the default chain
UpdateRequestProcessorFactory[] factories =
new UpdateRequestProcessorFactory[]{
new RunUpdateProcessorFactory(),
new LogUpdateProcessorFactory()
};

Note: these steps have been swapped on trunk recently

Example Update
Processor
• What are the best facets to show for a particular
query? Wouldn't it be nice to see the distribution of
document "attributes" represented across a result
set?

• Learned this trick from the Smithsonian, who were
doing it manually - add an indexed field containing the
field names of the interesting other fields on the
document.

• Facet on that field "of field names" initially, then
request facets on the top values returned.

Config for custom
update processor
<updateRequestProcessorChain name="fields_used" default="true">
<processor class="solr.processor.FieldsUsedUpdateProcessorFactory">
<str name="fieldsUsedFieldName">attribute_fields</str>
<str name="fieldNameRegex">.*_attribute</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

FieldsUsedUpdateProcessorFactory

public class FieldsUsedUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  private String fieldsUsedFieldName;
  private Pattern fieldNamePattern;

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new FieldsUsedUpdateProcessor(req, rsp, this, next);
  }

  // ... next slide ...

}

FieldsUsedUpdateProcessorFactory
  @Override
  public void init(NamedList args) {
    if (args == null) return;

    SolrParams params = SolrParams.toSolrParams(args);

    fieldsUsedFieldName = params.get("fieldsUsedFieldName");
    if (fieldsUsedFieldName == null) {
      throw new SolrException(
          SolrException.ErrorCode.SERVER_ERROR,
          "fieldsUsedFieldName must be specified");
    }

    // TODO check that fieldsUsedFieldName is a valid field name and multiValued

    String fieldNameRegex = params.get("fieldNameRegex");
    if (fieldNameRegex == null) {
      throw new SolrException(
          SolrException.ErrorCode.SERVER_ERROR,
          "fieldNameRegex must be specified");
    }
    fieldNamePattern = Pattern.compile(fieldNameRegex);

    super.init(args);
  }

  // Inner class of the factory, so it can read fieldsUsedFieldName and fieldNamePattern
  class FieldsUsedUpdateProcessor extends UpdateRequestProcessor {
    public FieldsUsedUpdateProcessor(SolrQueryRequest req,
                                     SolrQueryResponse rsp,
                                     FieldsUsedUpdateProcessorFactory factory,
                                     UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();

      // Collect the names of incoming fields that match the configured pattern
      ArrayList<String> usedFields = new ArrayList<String>();
      for (String f : doc.getFieldNames()) {
        if (fieldNamePattern.matcher(f).matches()) {
          usedFields.add(f);
        }
      }

      // Record the matching field names, then pass the document down the chain
      doc.addField(fieldsUsedFieldName, usedFields.toArray());
      super.processAdd(cmd);
    }
  }

FieldsUsedUpdateProcessor
in action
schema.xml
<dynamicField name="*_attribute" type="string" indexed="true" stored="true" multiValued="true"/>

Add some documents
solr.add([{:id=>1, :name => "Big Blue Shoes", :size_attribute => 'L', :color_attribute => 'Blue'},
{:id=>2, :name => "Cool Gizmo", :memory_attribute => "16GB", :color_attribute => 'White'}])
solr.commit

Facet on attribute_fields
- http://localhost:8983/solr/select?q=*:*&facet=on&facet.field=attribute_fields&wt=json&indent=on
"facet_fields":{
"attribute_fields":[
"color_attribute",2,
"memory_attribute",1,
"size_attribute",1]}

Search Components
• Built-in: Clustering, Debug, Facet, Highlight,
MoreLikeThis, Query, QueryElevation,
SpellCheck, Stats, TermVector, Terms
• Non-distributed API:
• prepare(ResponseBuilder rb)
• process(ResponseBuilder rb)

Example - auto facet
select
• It sure would be nice if Solr could automatically select
the field(s) to facet on, based on the profile of the
results. For example, you're indexing disparate types of
products, all with varying attributes (color and size for
apparel, memory_size for electronics, subject for books,
etc.), and a user searches for "ipod", where most matching
products have color and memory_size attributes... let's
automatically facet on those fields.

• https://issues.apache.org/jira/browse/SOLR-2641

AutoFacetSelection
Component
• Too much code for a slide, let's take a look in
an IDE...

• Basically -

• process() gets autofacet.field and autofacet.n
request params, facets on field, takes top N
values, sets those as facet.field's

• Gotcha - need to call rb.setNeedDocSet(true)
in prepare() as faceting needs it
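
The "takes top N values" step is the heart of the component and can be shown on its own. This is a stdlib-only sketch, not code from SOLR-2641: TopFacetValuesSketch and topValues are hypothetical names, and the facet counts are assumed to arrive as a Map from field name to count.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for the "top N facet values" step; not Solr code.
final class TopFacetValuesSketch {
  /** Returns up to n values, highest count first; ties keep the map's iteration order. */
  static List<String> topValues(Map<String, Integer> counts, int n) {
    List<Map.Entry<String, Integer>> entries =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    // Collections.sort is stable, so equal counts stay in their original order.
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return b.getValue().compareTo(a.getValue());
      }
    });
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < Math.min(n, entries.size()); i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }
}
```

The returned names are what the component would then set as the facet.field values for the regular facet component to pick up.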

SearchComponent
config
<searchComponent name="autofacet"
class="solr.AutoFacetSelectionComponent"/>
<requestHandler name="/searchplus"
class="solr.SearchHandler">
<arr name="components">
<str>query</str>
<str>autofacet</str>
<str>facet</str>
<str>debug</str>
</arr>
</requestHandler>

autofacet success
http://localhost:8983/solr/searchplus
?q=*:*&facet=on&autofacet.field=attribute_fields&wt=json&indent=on
{
"response":{"numFound":2,"start":0,"docs":[
{
"size_attribute":["L"],
"color_attribute":["Blue"],
"name":"Big Blue Shoes",
"id":"1",
"attribute_fields":["size_attribute",
"color_attribute"]},
{
"color_attribute":["White"],
"name":"Cool Gizmo",
"memory_attribute":["16GB"],
"id":"2",
"attribute_fields":["color_attribute",
"memory_attribute"]}]
},
"facet_counts":{
"facet_queries":{},
"facet_fields":{
"color_attribute":[
"Blue",1,
"White",1],
"memory_attribute":[
"16GB",1]}}}

Distributed-aware
SearchComponents
• SearchComponent has a few distributed mode
methods:

• distributedProcess(ResponseBuilder)

• modifyRequest(ResponseBuilder rb,
SearchComponent who, ShardRequest sreq)

• handleResponses(ResponseBuilder rb,
ShardRequest sreq)

• finishStage(ResponseBuilder rb)

Testing

• AbstractSolrTestCase
• SolrTestCaseJ4
• SolrMeter
• http://code.google.com/p/solrmeter/

For more information...
• http://www.lucidimagination.com

• LucidFind

• search Lucene ecosystem: mailing lists, wikis, JIRA, etc

• http://search.lucidimagination.com

• Getting started with LucidWorks Enterprise:

• http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise

• http://lucene.apache.org/solr - wiki, e-mail lists
