Migration from Fast ESP to Lucene Solr - Michael McIntosh

Migration from FAST ESP to
Lucene Solr
Presented by Michael McIntosh
michaelm@tnrglobal.com, Oct 19th, 2011


What will we cover?
Core Aspects of ESP to Solr Migration
Migration Overview
Crawling Content
Processing Content
Searching Content
Scaling for Growth
Questions?
© 2011 TNR Global, LLC.


Who am I?

• 7+ Years FAST ESP
• 10+ Years in Search
• 15+ Years in Software
• Early Lycos Developer
• I also develop brain-computer interfaces :)


Who are we?

• 7+ Years in Search
• 15+ Years in Web Dev
• 30+ Years in Software
• Focus on ESP, Solr, Lucene, and the Cloud
• Scalable Web & Search Solution Experts


Migration Overview



Migration Challenges

• Our clients depend on ESP 5.3
• No future support for Linux ESP
• We need a viable exit strategy
• We want a fairly painless approach
• How do we provide an alternative?


Migration Use Case

Federated Product Search
...millions of parts and services...

• XML documents (highly-structured)
• PDF documents (semi-structured)
• HTML documents (unstructured)



Our Approach
Solr Search Platform (SolrSP)
• Custom Scalable Crawler using Heritrix
• Events & Queues managed with RabbitMQ
• Caching & Persistence supported via Riak
• Python pipeline replacement using Pypes
• Advanced Linguistics via NLTK or Rosette


Crawling Content



Crawling for ESP

• For XML content, our scripts query a
service, download resources and feed them
• For PDF content, our scripts query a
database, download the PDFs from their URLs
and feed them
• For HTML, our scripts query a database,
download seed URLs and launch ESP’s
Enterprise Crawler



Crawling for Solr

• For XML & PDF content, the approach
remains the same with a different writer
• We tried the Nutch crawler, but found it
challenging to make it do what we needed
• We tried the crawler bundled with Lucid
Works, but found the exposed functionality did
not offer the level of flexibility we needed
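
The "different writer" for XML and PDF content could look roughly like the sketch below. It is a minimal illustration, not our production feeder: it assumes a Solr core that accepts a JSON array of documents on its update handler, and the endpoint URL, the core name `products`, the field names and the helper `build_update_request` are all hypothetical. On older Solr releases the JSON update handler may live at a different path.

```python
import json
import urllib.request

# Hypothetical Solr update endpoint; adjust host, port and core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/products/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying docs as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative document; the field names are assumptions, not a real schema.
docs = [{"id": "part-001", "title": "Example part", "body": "Sample text"}]
request = build_update_request(docs)

# To actually feed Solr, send the request:
#   urllib.request.urlopen(request)
```

Keeping request construction separate from sending means the same download scripts can be pointed at either back end by swapping only the writer.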



Crawling with Heritrix

• Heritrix, created by the Internet Archive,
supports much of the same functionality
that the ESP Enterprise Crawler provides
• We wrapped Heritrix to provide a higher
level interface for service management
• Made it scalable and added document
caching via Riak to support refresh crawling
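
One plausible way a document cache supports refresh crawling is change detection: keep a digest of each fetched page and only reprocess pages whose content differs. The sketch below illustrates that idea with an in-memory dictionary standing in for Riak. The class `DocumentCache` and its method are hypothetical and do not reflect the actual wrapper or the Riak client API.

```python
import hashlib


class DocumentCache:
    """In-memory stand-in for a Riak-backed document cache (hypothetical)."""

    def __init__(self):
        self._digests = {}

    def has_changed(self, url, content):
        """Return True when content at url is new or differs from the cached copy."""
        digest = hashlib.sha256(content).hexdigest()
        if self._digests.get(url) == digest:
            return False
        # Remember the latest digest so the next refresh compares against it.
        self._digests[url] = digest
        return True


cache = DocumentCache()
first = cache.has_changed("http://example.com/a", b"<html>v1</html>")
repeat = cache.has_changed("http://example.com/a", b"<html>v1</html>")
updated = cache.has_changed("http://example.com/a", b"<html>v2</html>")
```

Here `first` and `updated` are true while `repeat` is false, so an unchanged page can be skipped on a refresh crawl instead of being sent through processing again.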



Crawler Architecture
[Diagram, top to bottom: Crawl Job Request and Crawler Manager; Queue Cluster (RabbitMQ); three Heritrix Messengers; three Heritrix Crawlers; Persistence Cluster (Riak)]



Processing Content



Processing for ESP
ESP Processing is document-centric
• For XML, we transform, tag metadata,
classify content before indexing
• For PDF, we split pages, generate
thumbnails, tag metadata and classify before
indexing
• For HTML, we normalize, clean content,
tag metadata and classify before indexing



Processing for Solr
Solr Processing is field-centric
• Solr analyzers work on a field-by-field basis
and lack the flexible workflow ESP provides
• Using some Solr analyzers for now, but
evaluating alternatives (Rosette, NLTK)
• Hadoop + Cascading looks promising
• We use Stackless Python with Pypes to
make ESP stage migration less painful


Processing with Pypes
• Written in Python
• Easy stage migration
• Very flexible & robust
• Branching & Merging
• Single Input, Many Outputs
• Trivial to embed and extend
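
The document-centric stage model can be illustrated in plain Python. The sketch below does not use the Pypes API. It only shows the shape of the idea under simple assumptions: every stage takes one document and returns a list of documents, which covers the "single input, many outputs" case. All stage names and fields are hypothetical.

```python
def split_pages(doc):
    """One input, many outputs: emit one document per page."""
    return [
        {"id": f"{doc['id']}-p{n}", "text": page}
        for n, page in enumerate(doc["pages"], start=1)
    ]


def normalize(doc):
    """Collapse runs of whitespace and lowercase the text."""
    return [dict(doc, text=" ".join(doc["text"].split()).lower())]


def tag_metadata(doc):
    """Attach a simple word-count field as example metadata."""
    return [dict(doc, words=len(doc["text"].split()))]


def run_pipeline(docs, stages):
    """Pass every document through each stage in order."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs


pdf = {"id": "manual", "pages": ["Intro  PAGE", "Part  list  here"]}
result = run_pipeline([pdf], [split_pages, normalize, tag_metadata])
```

Because each stage sees the whole document rather than a single field, an ESP stage that splits, tags or classifies a document can be ported as one function.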



Processor Migration

...From ESP



Processor Migration

...to Pypes



Searching Content



Feature Differences
• ESP has robust faceting support but facets must be
defined at index time, unlike Solr faceting

• Solr does most of the heavy lifting at query time,
which allows for more flexible approaches

• Solr now directly supports taxonomy (hierarchical)
faceting functionality (for drill down categories)

• Solr now supports field collapsing, which we use
heavily in our ESP installation to collapse result sets

• ESP to Solr schema mapping is fairly straightforward
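
Since Solr facets and collapses at query time, both are requested through query parameters rather than index configuration. The sketch below builds such a query string using Solr's `facet`, `facet.field`, `group` and `group.field` parameters. The helper name and the field names `manufacturer`, `category` and `product_family` are hypothetical.

```python
from urllib.parse import urlencode


def build_select_params(query, facet_fields, collapse_field):
    """Build a query string that facets on several fields and collapses on one."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("facet", "true"),
        ("group", "true"),
        ("group.field", collapse_field),
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


query_string = build_select_params(
    "bearing", ["manufacturer", "category"], "product_family"
)
```

The resulting string would be appended to a select handler URL. Changing which fields are faceted only requires a different parameter list, not a reindex.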



Search Interface
• Solr has no direct equivalent to FAST Query
Language (FQL) but function queries look like a
possible option for complex queries

• If you don’t have overly complex queries, the
edismax query parser looks like a good option

• Solr doesn’t have an easily extendable search-front
component like ESP, but we like TwigKit for that

• Default Solr stemmer isn’t as good as the ESP
lemmatizer, so if you need good lemmatization
consider Rosette Linguistics Platform or NLTK
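
For queries that are not overly complex, an edismax request mostly comes down to choosing which fields to search and how to weight them. The sketch below builds the parameters with Solr's `defType` and `qf` options. The helper name, field names and boost values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_boosts):
    """Build edismax parameters that search several fields with boosts."""
    # qf takes space-separated entries of the form field^boost.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return urlencode({"q": user_query, "defType": "edismax", "qf": qf})


params = build_edismax_params("hydraulic pump", {"title": 3, "description": 1})
```

The user's text goes into `q` unchanged, which is what makes this parser a reasonable fit for a plain search box.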



Scaling for Growth



About the hardware...
• Solr allows you to use the familiar rows /
columns layout ESP uses
• Add shards to scale content, add search
slaves to scale queries
• We’re currently using a master/slave indexer/search
setup, but options are numerous
• We’re developing a solution to support
scaling at will, a pain point for ESP as well
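
The shard side of this layout can be sketched as two small helpers: one that picks a shard for each document at index time, and one that builds the comma-separated `shards` parameter for a distributed query. This is a minimal sketch with hypothetical host names and a simple hash-modulo routing rule. It is not a description of our deployment.

```python
import zlib

# Hypothetical shard locations, one entry per content partition.
SHARDS = ["solr1:8983/solr", "solr2:8983/solr", "solr3:8983/solr"]


def shard_for(doc_id, shards=SHARDS):
    """Pick a shard deterministically from the document id."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


def shards_param(shards=SHARDS):
    """Build the value of the shards parameter for a distributed query."""
    return ",".join(shards)


target = shard_for("part-001")
distributed = shards_param()
```

With modulo routing, changing the number of shards remaps most document ids, so growing the cluster means reindexing. That is one reason scaling at will takes extra work.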



It's not just hardware...
• Use Fabric to automate cluster installs, data
builds and deployment tasks
• Use Jenkins to automate, manage and track
Fabric tasks
• Use Supervisor to manage multiple services
running on each node
• Use Lucid Works for better out-of-the-box
stemming, alerts, services and support



Migration In a Nutshell

• We now consider Solr robust enough to be a
viable replacement for a FAST ESP solution

• You supply the glue, or work with someone like us
to tie the different components together

• If you have many custom pipeline stages, consider
using Pypes to ease your initial ESP migration

• Fully supported versions of Solr with the latest
cutting-edge features are available via Lucid Works



Resources
Lucid Works http://www.lucidimagination.com/
Rosette http://www.basistech.com/lucene/
Heritrix http://crawler.archive.org/
TwigKit http://twigkit.com/
Pypes https://bitbucket.org/diji/pypes/
Riak http://basho.com/
NLTK http://www.nltk.org/
RabbitMQ http://www.rabbitmq.com/
Cascading http://www.cascading.org/
Fabric http://fabfile.org/
Jenkins http://jenkins-ci.org/
Supervisor http://supervisord.org/



Questions?
• Contact Us!
• Website: http://www.tnrglobal.com
• E-Mail: fast2solr@tnrglobal.com
• Phone: 001-413-425-1499

Thank you for your time!

