Open source enterprise search and retrieval platform

Date: 21 August 2010
Enterprise Search
EAI
Semantic Web
Open Source
Search & Retrieval
Platform
Marc Teutelink

How Apache open source software is used
during the implementation of an
Enterprise Search and Retrieval Platform
(Lucene/SOLR, Nutch, Tika, ServiceMix/Camel, Felix/Ace)

Marc Teutelink
marc.teutelink@luminis.eu
@mteutelink
•Software architect at Luminis
•15+ years of experience in software development; specialized in
Enterprise Search, Enterprise Application Integration and
Semantic Web technology
•Currently writing “Enterprise Search in Action” for Manning
(Mid-2011)

Agenda
•Enterprise Search
• What is Enterprise Search: Functions and features
• Challenges
• Logical Architecture
•Enterprise Search Solution
• Technology Stack
• Collection Process
• Publication Process
• Enricher framework
• Deployment
•Conclusion

What is Enterprise Search?
“Enterprise Search offers a solution for searching,
finding and presenting enterprise related information
in the larger sense of the word”
Enterprise search is all about searching through documents of
any type and format, from any source, located anywhere, with the
utmost flexibility
• Web search: limited to public documents on the web
• Desktop search: limited to private documents on the local machine
• Enterprise search: no limitations on document type and location

Enterprise Search
(features)
•Information Sources and Types
• Wide range of sources: local and remote filesystems, content repositories,
e-mail, databases, internet, intranet and extranet
• Type not limited: any type ranging from structured to unstructured data, text
and binary formats and compound formats (zip)
•Usage
• Not limited to interactive use → automated business processes
•Security
• Integrations with enterprise security infrastructure
•User Interaction and personalization
• Identity enables more personalized search results

Enterprise Search
(features)
•Extended metadata
• More metadata → better and more precise search results
• More control over schema (for example Dynamic Fields)
•Ranking
• More control over ranking: personalized ranking (group)
•Data extraction and derivation
• Extract data using various techniques: XPath, XQuery
• Derive data: using external knowledge models: RDBMS, RDF Store, Web Services
• Conditional extraction & derivation
•Managing and monitoring
• On-the-fly management (JMX)
• Real time monitoring

Enterprise Search
(features)
•User Interfaces
• Web search
• All about selling advertisements to the masses
• General, minimalist screens; focus on ads
• Enterprise search
• All about finding: rich navigation; focus on quick find
• Small targeted audience
• Specialized and customized screens (use of ontologies, taxonomies
and classifications)
• Use of identity (results customized to user) and web 2.0
• Grouping
• field collapsing, faceted search & clustering

Enterprise Search
(Challenges)
•Performance and scalability
•Rich functions and features
•Manageability
•Flexibility
•Easy maintenance
•Quick issue and problem solving
•Reduce total cost of ownership

Enterprise Search
(Challenges)
•Manageability
•Flexibility
•Easy maintenance
Commercial Search Engines?

Enterprise Search
(Challenges)
•Manageability
•Flexibility
•Easy maintenance
Apache Based (Open Source)
Search & Retrieval Platform

Enterprise Search
(Logical Architecture)
[Diagram: logical architecture, two processes around the Search Engine]
•Collection Process (Sources → Search Engine)
• Content Inbound: Pull (Crawling), Pull (Harvesting), Push (SOAP/REST)
• Content Validation: Syntactic, Semantic
• Content Enrichment: Extraction, Enhancement, Filtering
• Indexing
•Publication Process (Actors ↔ Search Engine)
• Request Inbound: HTTP/Get (URL), HTTP/Post (SOAP/REST), API (Java, Perl, ...)
• Request Validation: Syntactic, Semantic
• Request Enrichment: Redirection (Suggestions), Enhancement (add/remove clauses)
• Searching & Ordering: Filtering, Grouping, Sorting
• Response Enrichment: Redirection (more like this), Enhancement (metadata, editorial)
• Response Outbound: Stateless (XSLT, SolrJS), Stateful (Webapp Framework)

Enterprise Search
(Collection Process)
Sources
• Any document format
• Any type
• Structured and unstructured
• Textual and binary
• Compound
• Residing Anywhere
• Security
[Diagram: the Collection Process half of the logical architecture: Sources, Content Inbound, Content Validation, Content Enrichment, Indexing, Search Engine]

Enterprise Search
Content Inbound
• Pull (Crawling/Spidering)
• Internet, intranet & extranet
• Local and remote filesystems
• Pull (Harvesting)
• Databases
• Content Repositories / Mgmt Systems
• Webservices inbound
• Push
• Webservices (SOAP/REST)
• Real time indexing

Enterprise Search
Content Validation
• Syntactic validation
• Based on DTD / XML-Schema
• Structure and limited content
• Semantic validation
• Based on algorithms:
• Groovy, XPath, Regex, …
• Think about exception handling
• Placed anywhere in flow
• During inbound: XML-Schema validation
• After Enrichment: Validate derived metadata
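The two validation levels can be sketched in Python with only the standard library. This is an illustration under assumptions, not the platform's implementation: the syntactic check is reduced to XML well-formedness (the standard library cannot validate against a DTD or XML Schema), and the message layout, the `zipcode` rule, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical semantic rules: field name -> regex the value must fully match.
SEMANTIC_RULES = {"zipcode": re.compile(r"\d{4} ?[A-Z]{2}")}  # Dutch postal code shape


def validate(message: str):
    """Return (ok, reason): syntactic check first, then semantic rules."""
    try:
        doc = ET.fromstring(message)  # syntactic: must be well-formed XML
    except ET.ParseError as exc:
        return False, f"syntactic: {exc}"
    for field in doc.iter("field"):
        rule = SEMANTIC_RULES.get(field.get("name"))
        value = (field.text or "").strip()
        if rule and not rule.fullmatch(value):
            return False, f"semantic: bad value for {field.get('name')!r}"
    return True, "ok"


def route(messages):
    """Split messages into a valid channel and an invalid-message channel."""
    valid, invalid = [], []
    for msg in messages:
        ok, reason = validate(msg)
        (valid if ok else invalid).append((msg, reason))
    return valid, invalid
```

Keeping the rejection reason next to the message is one way to address the "think about exception handling" point: whatever consumes the invalid-message channel can report why a document was dropped.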

Enterprise Search
Content Enrichment
• Extraction
• Metadata
• Content (free text of document)
• Enhancing
• Derive new and alter existing metadata
• Filtering
• Remove (parts of) metadata
• Leverage external knowledge models
• Conditional enrichment

Enterprise Search
Indexing
• Store in search engine(s)
• Content based routing
• Document boosting
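Content-based routing and document boosting can be sketched as follows. Everything specific here is an assumption for illustration: the `type` and `editorial` fields, the index names, and the boost rule are hypothetical, and the "indexes" are plain Python lists rather than real search engine cores.

```python
from dataclasses import dataclass


@dataclass
class Document:
    fields: dict
    boost: float = 1.0  # index-time boost; 1.0 is neutral


# Hypothetical routing table: value of the "type" field -> target index.
ROUTES = {"news": "news-index", "permit": "government-index"}
DEFAULT_INDEX = "general-index"


def route(doc: Document) -> str:
    """Content-based routing: pick the target index from the document content."""
    return ROUTES.get(doc.fields.get("type"), DEFAULT_INDEX)


def apply_boost(doc: Document) -> Document:
    """Assumed rule: editorially approved documents count double."""
    if doc.fields.get("editorial") == "true":
        doc.boost *= 2.0
    return doc


def index_all(docs):
    """Group boosted documents per target index."""
    indexes = {}
    for doc in docs:
        indexes.setdefault(route(doc), []).append(apply_boost(doc))
    return indexes
```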

Enterprise Search
(Publication Process)
Request Inbound
• HTTP/Get
• URL based with parameters
• Response in XML, JSON, …
• HTTP/Post
• XML (SOAP, REST) request
• XML (SOAP, REST) response
• API
• Java, Perl, …
• Wrappers on HTTP/Get
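A URL-based request with parameters might be assembled like this. The sketch assumes a Solr-style `/select` endpoint with the common `q`, `fq`, `wt` and `rows` parameters; the host, the field names, and the helper `build_search_url` are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, filters=(), response_format="json", rows=10):
    """Build an HTTP/Get search URL; repeated fq parameters narrow the result set."""
    params = [("q", query), ("wt", response_format), ("rows", str(rows))]
    params += [("fq", f) for f in filters]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and fields, for illustration only.
url = build_search_url(
    "http://localhost:8983/solr",
    "title:search",
    filters=["source:Rijksweb"],
)
```

An API wrapper in Java or Perl would do the same thing: build the URL, issue the GET, and parse the XML or JSON response.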
[Diagram: the Publication Process half of the logical architecture: Actors, Request Inbound, Request Validation, Request Enrichment, Searching & Ordering, Response Enrichment, Response Outbound]

Enterprise Search
Request Validation
• Syntactic Validation
• Correct Query syntax?
• Semantic Validation
• Correct Field Filters?
• Based on algorithms: Groovy, Regex
• Placed anywhere in flow
• @inbound: XML-Schema validation
• @enrichment: Validate derived request clauses

Enterprise Search
Request Enrichment
• Redirection
• Spelling suggestions
• Metadata suggestions
• Enhancing
• Add/Remove clauses
• Stemming, Synonyms, stop words
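Adding and removing clauses can be illustrated with a small query rewriter. This is a sketch under assumptions: the stop-word and synonym lists are made up, the boolean output syntax is only Lucene-like, and in a real Solr setup much of this is usually handled by analyzers rather than by rewriting the request.

```python
# Hypothetical word lists, for illustration only.
STOP_WORDS = {"the", "a", "of"}
SYNONYMS = {"permit": ["license"], "town": ["municipality"]}


def enrich_query(query: str) -> str:
    """Remove stop-word clauses and expand the remaining terms with synonyms."""
    clauses = []
    for term in query.lower().split():
        if term in STOP_WORDS:
            continue  # remove clause
        alternatives = [term] + SYNONYMS.get(term, [])
        if len(alternatives) > 1:
            clauses.append("(" + " OR ".join(alternatives) + ")")  # add clauses
        else:
            clauses.append(term)
    return " AND ".join(clauses)
```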

Enterprise Search
Searching & Ordering
• Filtering
• Field Search
• Grouping
• Add group information
• Field collapsing, Faceted Search & Clustering
• Sorting
• Sort on Field
• Ranking
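Field filtering, sorting on a field, and facet counts can be shown on an in-memory list. This is a toy model for illustration, not how a search engine computes these; the documents and field names are hypothetical.

```python
from collections import Counter

# Hypothetical documents.
DOCS = [
    {"id": 1, "source": "Rijksweb", "type": "news", "year": 2010},
    {"id": 2, "source": "Rijksweb", "type": "permit", "year": 2009},
    {"id": 3, "source": "Gemeente Deventer", "type": "permit", "year": 2010},
]


def search(docs, filters=None, sort_field=None, facet_field=None):
    """Filter on exact field values, sort on one field, count facet values."""
    filters = filters or {}
    hits = [d for d in docs if all(d.get(k) == v for k, v in filters.items())]
    if sort_field:
        hits = sorted(hits, key=lambda d: d[sort_field])
    facets = Counter(d[facet_field] for d in hits) if facet_field else Counter()
    return hits, dict(facets)
```

The facet counts are the "group information" added to the response: they tell the user interface how many hits each navigation link would yield.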

Enterprise Search
Response Enrichment
• Redirection
• Suggestions
• More like this
• Enhancing
• Add/Remove response fields
• Schema information
• Editorial information

Enterprise Search
Response Outbound
• Stateless
• No security
• XSLT, SolrJS
• Stateful
• Security
• Web2.0
• Web Application Framework

Technology Stack
•Use ESB for the flow: Apache ServiceMix with Camel
• Leverage standard ESB components (Transformers, Validation, Splitter,
Filter, Routers, Scripting)
• Leverage standard ESB transports (WS, SMTP, JMS, JCR, JDBC, FILE)
• Custom: Crawler Apache Nutch
• Leverage only crawl framework
• Extend NutchIndexWriter; asynchronously pushing crawled documents
back into ESB flow (reply-to)
•ESB makes distributed flow possible → content-based routing
•Hot deploy → easy maintenance
•Reusing services across collection processes
•Search Engine independent

Collection Process Flow
[Diagram: collection flow built from enterprise integration patterns]
• Content Inbound: Push Inbound (Message Endpoint) receives a Documents Message
• Content Validation: Syntactic Validation (Channel Purger) and Semantic Validation (Channel Purger); rejected messages go to an Invalid Message Channel
• Transformer (Message Translator) and Splitter: the Documents Message becomes individual Document Messages
• Content Enrichment: Enricher (Content Filter, Content Enricher)
• Content Indexer: HTTP Transport (Channel Adapter) passes each SOLR Document Message to Lucene/SOLR (SolrJ) and the Lucene/Solr index

Technology Stack
•Use flow from Apache Lucene/Solr
• Leverage standard Solr components (synonyms, stopwords,
stemming, MLT, spelling, faceted search, …)
• Custom components: using Solr’s extensibility framework
• Security: authority field in schema with Apache Shiro integration
• Field filters (zipcode, …)
•User interfaces
• Stateless: SolrJS, XSLTResponseWriter & VelocityResponseWriter
• Stateful: Apache Wicket with Spring
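One way an authority field could restrict results is by adding a filter clause built from the user's roles. The sketch below only builds that clause. The field name `authority`, the `public` fallback, and the function itself are assumptions; the actual Apache Shiro integration is not shown in the slides and is not modelled here.

```python
def authority_filter(roles, field="authority"):
    """Build a filter clause limiting hits to documents the roles may see."""
    if not roles:
        return f'{field}:"public"'  # assumed fallback for anonymous users

    def quote(value):
        # Escape backslashes and double quotes inside a quoted term.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    return f"{field}:(" + " OR ".join(quote(r) for r in sorted(roles)) + ")"
```

Such a clause would typically be appended to the request on the server side, so that a client cannot remove it.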


[Diagram: the logical architecture annotated with the technology used: Lucene/SOLR, ServiceMix/Camel, Nutch, Apache Wicket, SolrJS/XSLT]

Luminis Enricher Framework

•Custom Enricher Framework
• Existing ESB & SOLR enricher capabilities not sufficient.
• Enriching = one or more actions (extraction, enhancing &
filtering) performed on documents with fields
• Same enricher to be used for:
• Collection process:
• Documents → enriching, filtering & splitting
• Publication process:
• Search requests → ‘first-components’ SearchComponent
• Search response → ‘last-components’ SearchComponent

[Diagram: collection flow with the Enricher. Components shown: Push Inbound (Message Endpoint), Semantic Validation (Channel Purger), Splitter, Enricher (Content Filter, Content Enricher), SOLR Indexer (Channel Adapter, SolrJ), Lucene/Solr index, Invalid Message Channel]

[Diagram: Solr publication flow. A Query (SolrQueryRequest) enters the SearchHandler (RequestHandler), which runs "first-components", then "components" (SearchComponents: query, facet, mlt, highlight, stats, debug), then "last-components". The XML Response is rendered by the XSLTResponseWriter (QueryResponseWriter) with an XML2HTML XSLT into the (X)HTML Result]

(architecture)
•Pipe-and-filter architecture
• Documents flow through series of actions
• Output from one action is input to another action
• Fields from input document can be used in action’s clauses: values in
expressions filled by replacing velocity type patterns with field values
•Conditional flows supported
•Reuse of flows & Subflows supported
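The pipe-and-filter idea, including pattern replacement and a conditional flow, can be sketched in a few lines of Python. This is an illustration only, not the framework's code: documents are modelled as dicts of value lists, `${field}` is replaced by the first value of that field, and all action names and expressions are hypothetical.

```python
import re

PATTERN = re.compile(r"\$\{(\w+)\}")  # velocity-type pattern: ${field}


def fill(expression, doc):
    """Replace each ${field} with the first value of that field."""
    return PATTERN.sub(lambda m: str(doc[m.group(1)][0]), expression)


def copy(doc):
    return {k: list(v) for k, v in doc.items()}


def remove_value(field, value):
    def action(doc):
        out = copy(doc)
        out[field] = [x for x in out.get(field, []) if x != value]
        return out
    return action


def add_field(field, expression):
    def action(doc):
        out = copy(doc)
        out.setdefault(field, []).append(fill(expression, doc))
        return out
    return action


def when(test, then_action, else_action):
    """Conditional flow: choose the next action based on the document."""
    return lambda doc: (then_action if test(doc) else else_action)(doc)


def run(flow, doc):
    for action in flow:  # output of one action is input to the next
        doc = action(doc)
    return doc


flow = [
    remove_value("A", "A2"),
    when(lambda d: d["B"] == ["3"],
         add_field("C", "looked-up-by-B=${B}"),
         add_field("C", "looked-up-by-A=${A}")),
]
result = run(flow, {"A": ["A1", "A2"], "B": ["3"]})
```

Because every action takes a document and returns a new one, a flow is just a list, and a sub-flow can be reused by wrapping `run` in another action.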

[Diagram: example flow]
• Document [[A1,A2],[B]] → Action (remove A2) → Document [[A1],[B]]
• If [B=3] splits the flow into a YES branch and a NO branch
• Action (select C where ${B}) → Document [[A1],[B],[C1]]
• Action (select C where ${A}) → Document [[A1],[B],[C2]]

(Configuration)
•Enricher flow and expression configuration via XML based DSL
• Conditional: if-then-else & switch-case-else (with regex support)
• Actions: Add & remove fields and field values using expressions
• Expression handlers currently supported:
• Field
• Function (execute methods via Java Reflection)
• HttpClient (retrieve content by URL described by field values)
• XSLT, XPath, XQuery (external XML databases)
• JDBC
• SPARQL (OpenRDF)
• Apache Lucene/Solr
• Apache Tika (Meta and Text extraction)

(Examples)
<enricher name="Field">
  <field name="a">AA1</field>
  <field name="b">BB1</field>
  <multivalue-field name="c">CC1</multivalue-field>
  <if test="field::c" pattern="CC2">
    <then>
      <field name="e">EE1</field>
    </then>
  </if>
  <if test="field::a">
    <then>
      <field name="f">FF1</field>
    </then>
  </if>
  <rename-field name="b">d</rename-field>
  <remove-field name="a"/>
</enricher>
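To make the effect of the "Field" example concrete, here is a toy interpreter for that one configuration. The semantics are guesses, not the framework's documented behaviour: `field` sets a single value, `multivalue-field` appends, an `if` without `pattern` tests that the field exists, and `pattern` must fully match at least one value.

```python
import re
import xml.etree.ElementTree as ET

CONFIG = """
<enricher name="Field">
  <field name="a">AA1</field>
  <field name="b">BB1</field>
  <multivalue-field name="c">CC1</multivalue-field>
  <if test="field::c" pattern="CC2"><then><field name="e">EE1</field></then></if>
  <if test="field::a"><then><field name="f">FF1</field></then></if>
  <rename-field name="b">d</rename-field>
  <remove-field name="a"/>
</enricher>
"""


def apply(node, doc):
    """Apply the child actions of node to doc, in document order."""
    for el in node:
        name = el.get("name")
        if el.tag == "field":
            doc[name] = [el.text]
        elif el.tag == "multivalue-field":
            doc.setdefault(name, []).append(el.text)
        elif el.tag == "rename-field":
            if name in doc:
                doc[el.text] = doc.pop(name)
        elif el.tag == "remove-field":
            doc.pop(name, None)
        elif el.tag == "if":
            values = doc.get(el.get("test").split("::", 1)[1], [])
            pattern = el.get("pattern")
            if pattern is None:
                matched = bool(values)  # assumed: existence test
            else:
                matched = any(re.fullmatch(pattern, v) for v in values)
            branch = el.find("then" if matched else "else")
            if branch is not None:
                apply(branch, doc)
    return doc


enriched = apply(ET.fromstring(CONFIG), {})
```

Under these assumptions the first `if` does not fire (no value of `c` equals `CC2`), the second does, and the document ends with the fields `c`, `f` and `d`.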

(Examples)
<enricher name="XPath"
    xmlns:str="http://exslt.org/strings"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:html="http://www.w3.org/1999/xhtml">
  <field name="Description" expression-type="xpath">
    //html:meta[@name='DC.description']/@content
  </field>
  <multivalue-field name="Type" expression-type="xpath">
    //html:meta[@name='DC.type' and
      (@scheme='OVERHEIDbm.bekendmakingtypeGemeente' or
       @scheme='OVERHEIDbm.bekendmakingtypeProvincie' or
       @scheme='OVERHEIDbm.bekendmakingtypeWaterschap')
    ]/@content
  </multivalue-field>
  <field name="publisher" expression-type="xpath">
    fn:string-join(('Blow, ', 'blow, ', 'thou ', 'winter ', 'wind!'), '')
  </field>
  <field name="…" expression-type="xpath">
    fn:concat(//html:meta[@name='OVERHEID.organisationType']/@content,
              //html:meta[@name='DC.creator']/@content)
  </field>
</enricher>

(Examples)
<enricher name="SPARQL">
  <field name="place">http://www.my.com/#channels</field>
  <field expression-type="sparql" repository="TESTRDF">
    <![CDATA[
      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
      SELECT ?definition
      WHERE {
        ?${place} skos:definition ?definition.
      }
    ]]>
  </field>
</enricher>

(Examples)
<enricher name="HttpAndTika">
  <field name="content.url"><![CDATA[http://na.apachecon.com/c/acna2010/speakers/501]]></field>
  <field expression-type="http" name="content.file">field:content.url</field>
  <field name="auteur" source="field::content.file">xpath://H1</field>
  <multivalue-field expression-type="tika.meta" source="field::content.file"/>
  <field name="content" expression-type="tika.text" source="field::content.file"/>
  <switch test="field::content.url">
    <case pattern=".*.rijksweb.nl.*"><field name="source">Rijksweb</field></case>
    <case pattern=".*.deventer.nl.*"><field name="source">Gemeente Deventer</field></case>
    <case pattern="file:.*"><field name="source">Locale Harde Schijf</field></case>
    <else><field name="source">Overige</field></else>
  </switch>
</enricher>
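The `switch` at the end of this example classifies a document by its URL. A plain Python equivalent might look like the sketch below, assuming that the first matching case wins and that a pattern must match the whole value; both are assumptions about the DSL. The patterns and labels are copied from the example.

```python
import re

# (pattern, source label) pairs in the order of the <case> elements.
CASES = [
    (r".*.rijksweb.nl.*", "Rijksweb"),
    (r".*.deventer.nl.*", "Gemeente Deventer"),
    (r"file:.*", "Locale Harde Schijf"),
]


def classify(url, default="Overige"):
    """Return the label of the first case whose pattern matches the URL."""
    for pattern, source in CASES:
        if re.fullmatch(pattern, url):
            return source
    return default  # the <else> branch
```

Note that the dots in these patterns are unescaped, so they match any character; writing `\.` would make the host name match stricter.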

(Technology)
•Enricher and expression handlers are Java-based OSGi
services:
• Hot pluggable and updatable
• Flow and expression configuration changes → no restart
• Extensible: new expression handlers immediately available in
actions after installing the OSGi bundle
•Runs in Apache Felix
• Collection Process: ServiceMix contains OSGi container
• Publication Process: Custom OSGi loader for Lucene/Solr
•Centralized & transactional provisioning (Apache Ace)
• Components & Configuration

Deployment Architecture
[Diagram: deployment architecture]
•Firewall and HTTP Load Balancer (devices)
•Slave Publication Servers (Slave1, Slave2)
• Apache Tomcat container: Enricher (Luminis), Lucene/SOLR (Apache), Wicket (Apache), Felix OSGi (Apache)
• Config: SOLR::schema.xml, SOLR::solrconfig.xml, Luminis:Enricher.xml
•Master Collection Server
• Apache Tomcat container: Enricher (Luminis), Nutch (Apache), ServiceMix (Apache), Tika (Apache), Lucene/SOLR (Apache), Felix OSGi (Apache)
• Config: SOLR::schema.xml, servicemix::config.xml
•Knowledge Models
• OpenRDF (RDF triple store)
• SQL (database)
•Deployment Server
• Ace (Apache), Felix OSGi (Apache): provisioning
•Connections between nodes: HTTP, HTTP/REST, JDBC

Conclusions
•Enterprise Search Solution is not Google search
•Open Source paves the way; misses some ingredients
• Useful ingredients: Lucene/Solr, Nutch, Tika, ServiceMix/Camel,
Wicket, MySQL, OpenRDF, Felix/Ace
• Missing ingredients: Enricher
•Interesting developments:
• Apache Chemistry (CMIS)
• Apache Clerezza
• Apache Nutch
• Apache Connectors Framework (ManifoldCF)

Questions & (answers?)
Marc Teutelink
marc.teutelink@luminis.eu
@mteutelink
MEAP December 2010 
