1. Migration from FAST ESP to
Lucene Solr
Presented by Michael McIntosh
michaelm@tnrglobal.com, Oct 19th, 2011
Wednesday, October 19, 11
2. What will we cover?
Core Aspects of ESP to Solr Migration
Migration Overview
Crawling Content
Processing Content
Searching Content
Scaling for Growth
Questions?
© 2011 TNR Global, LLC.
3. Who am I?
• 7+ Years FAST ESP
• 10+ Years in Search
• 15+ Years in Software
• Early Lycos Developer
• I also develop brain-computer interfaces :)
4. Who are we?
• 7+ Years in Search
• 15+ Years in Web Dev
• 30+ Years in Software
• Focus on ESP, Solr, Lucene, and the Cloud
• Scalable Web & Search Solution Experts
6. Migration Challenges
• Our clients depend on ESP 5.3
• No future support for Linux ESP
• We need a viable exit strategy
• We want a fairly painless approach
• How do we provide an alternative?
7. Migration Use Case
Federated Product Search
...millions of parts and services...
• XML documents (highly-structured)
• PDF documents (semi-structured)
• HTML documents (unstructured)
8. Our Approach
Solr Search Platform (SolrSP)
• Custom Scalable Crawler using Heritrix
• Events & Queues managed with RabbitMQ
• Caching & Persistence supported via Riak
• Python pipeline replacement using Pypes
• Advanced Linguistics via NLTK or Rosette
10. Crawling for ESP
• For XML content, our scripts query a
service, download resources and feed
• For PDF content, our scripts query a
database, download PDF URLs and feed
• For HTML, our scripts query a database,
download seed URLs and launch ESP’s
Enterprise Crawler
11. Crawling for Solr
• For XML & PDF content, the approach
remains the same with a different writer
• We tried the Nutch crawler, but found it
challenging to make it do what we needed
• We tried the crawler bundled with Lucid
Works, but found the exposed functionality
did not offer the level of flexibility we needed
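The "different writer" for XML and PDF content could, for example, emit Solr's XML update format. The sketch below is only an illustration under that assumption: the slides do not say which update format was used, and the function name and the sample field names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_solr_add(docs):
    """Serialize a list of dicts as a Solr XML <add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes repeated <field> elements (multi-valued field).
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical part record; the field names are illustrative only.
    print(build_solr_add([{"id": "part-1", "title": "Hex bolt",
                           "category": ["fasteners", "bolts"]}]))
```

The resulting string would be posted to a Solr update handler; the download and query scripts stay unchanged, which is the point of swapping only the writer.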
12. Crawling with Heritrix
• Heritrix, created by the Internet Archive,
supports much of the same functionality
that the ESP Enterprise Crawler provides
• We wrapped Heritrix to provide a higher
level interface for service management
• We made it scalable and added document
caching via Riak to support refresh crawling
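One way a document cache can support refresh crawling is to store a digest of each fetched page and re-process a page only when the digest changes. The sketch below assumes that approach; the slides do not describe the actual mechanism, and the `DocumentCache` class is a hypothetical in-memory stand-in for the Riak-backed store.

```python
import hashlib


class DocumentCache:
    """In-memory stand-in for a Riak-backed cache (hypothetical interface)."""

    def __init__(self):
        self._digests = {}

    def has_changed(self, url, content):
        """Return True if content (bytes) is new or differs from the cached copy.

        The cache entry is updated as a side effect.
        """
        digest = hashlib.sha256(content).hexdigest()
        changed = self._digests.get(url) != digest
        self._digests[url] = digest
        return changed


if __name__ == "__main__":
    cache = DocumentCache()
    for body in (b"<html>v1</html>", b"<html>v1</html>", b"<html>v2</html>"):
        # Only changed pages would be sent on for processing and indexing.
        print(cache.has_changed("http://example.com/part", body))
```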
13. Crawler Architecture
[Diagram: Crawl Job Request, Crawler Manager, Queue Cluster (RabbitMQ), three nodes each pairing a Heritrix Messenger with a Heritrix Crawler, Persistence Cluster (Riak)]
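The queue-based layout on this slide can be approximated in a few lines. The sketch below is an assumption-laden stand-in, not the actual system: a standard-library `queue.Queue` plays the role of the RabbitMQ queue cluster, threads play the role of the Heritrix crawler nodes, and `fetch` is a hypothetical callable supplied by the caller.

```python
import queue
import threading


def run_crawl_job(seed_urls, fetch, num_workers=3):
    """Distribute seed URLs to worker threads through a shared queue."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = jobs.get()
            if url is None:  # sentinel: no more work for this worker
                return
            content = fetch(url)
            with lock:
                results[url] = content

    for url in seed_urls:
        jobs.put(url)
    for _ in range(num_workers):
        jobs.put(None)  # one sentinel per worker

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # A fake fetcher keeps the example offline.
    print(run_crawl_job(["a", "b", "c"], lambda url: url.upper()))
```

Because workers only pull from the queue, adding crawl capacity means starting more workers rather than changing the component that submits jobs.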
15. Processing for ESP
ESP Processing is document-centric
• For XML, we transform, tag metadata and
classify content before indexing
• For PDF, we split pages, generate
thumbnails, tag metadata and classify before
indexing
• For HTML, we normalize, clean content,
tag metadata and classify before indexing
16. Processing for Solr
Solr Processing is field-centric
• Solr analyzers work on a field by field basis
and lack the flexible workflow ESP provides
• We use some Solr analyzers for now, but are
evaluating alternatives (Rosette, NLTK)
• Hadoop + Cascading looks promising
• We use Stackless Python with Pypes to
make ESP stage migration less painful
17. Processing with Pypes
• Written in Python
• Easy stage migration
• Very flexible & robust
• Branching & Merging
• Single Input, Many
Outputs
• Trivial to embed and
extend
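To illustrate what a document-centric stage pipeline with "single input, many outputs" looks like, here is a minimal sketch in plain Python. It does not use the Pypes API; the stage functions, the `run_pipeline` helper and the document layout are all hypothetical.

```python
def split_pages(doc):
    """Stage: emit one document per page (single input, many outputs)."""
    for number, page in enumerate(doc["body"].split("\f"), start=1):
        yield {"id": f"{doc['id']}-p{number}", "body": page}


def normalize(doc):
    """Stage: collapse runs of whitespace in the body (one output per input)."""
    doc = dict(doc)  # copy so the caller's document is not mutated
    doc["body"] = " ".join(doc["body"].split())
    yield doc


def run_pipeline(docs, stages):
    """Pass every document through each stage in order."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs


if __name__ == "__main__":
    source = [{"id": "d1", "body": "page  one\fpage   two"}]
    for doc in run_pipeline(source, [split_pages, normalize]):
        print(doc)
```

Each stage sees a whole document and may emit zero, one or many documents, which is the workflow style that per-field Solr analyzers do not provide on their own.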
21. Feature Differences
• ESP has robust faceting support but facets must be
defined at index time, unlike Solr faceting
• Solr does most of the heavy lifting at query time,
which allows for more flexible approaches
• Solr now directly supports taxonomy (hierarchical)
faceting functionality (for drill down categories)
• Solr now supports field collapsing, which we use
heavily in our ESP installation to collapse result sets
• ESP to Solr schema mapping is fairly straightforward
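Because Solr facets and collapses at query time, both are requested through query parameters rather than index-time definitions. The sketch below builds such a query string with Solr's `facet`, `facet.field`, `group` and `group.field` parameters; the helper name and the field names are hypothetical.

```python
from urllib.parse import urlencode


def build_select_params(query, facet_fields=(), collapse_field=None, rows=10):
    """Build a Solr select query string with faceting and result grouping."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # Facet fields are chosen per request, not fixed at index time.
        params += [("facet.field", name) for name in facet_fields]
    if collapse_field:
        # Result grouping is Solr's field collapsing feature.
        params += [("group", "true"), ("group.field", collapse_field)]
    return urlencode(params)


if __name__ == "__main__":
    print(build_select_params("bolt", ["category", "vendor"],
                              collapse_field="product_id"))
```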
22. Search Interface
• Solr has no direct equivalent to FAST Query
Language (FQL) but function queries look like a
possible option for complex queries
• If you don’t have overly complex queries, the
edismax query parser looks like a good option
• Solr doesn’t have an easily extendable search-front
component like ESP, but we like TwigKit for that
• Default Solr stemmer isn’t as good as the ESP
lemmatizer, so if you need good lemmatization
consider Rosette Linguistics Platform or NLTK
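For queries that are not overly complex, an edismax request mostly comes down to choosing the fields to search and their boosts. The sketch below uses Solr's `defType`, `qf` and `mm` parameters; the helper name, field names and boost values are hypothetical.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_boosts, min_match=None):
    """Build a Solr query string that uses the edismax query parser."""
    # qf lists the searched fields with optional boosts, e.g. "title^2 body^1".
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    params = {"q": user_query, "defType": "edismax", "qf": qf}
    if min_match is not None:
        params["mm"] = min_match  # minimum-should-match, e.g. "75%"
    return urlencode(params)


if __name__ == "__main__":
    print(build_edismax_params("hex bolt", {"title": 2, "body": 1},
                               min_match="75%"))
```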
24. About the hardware...
• Solr allows you to use the familiar rows /
columns layout ESP uses
• Add shards to scale content, add search
slaves to scale queries
• We’re currently using a master/slave indexer/
search setup, but options are numerous
• We’re developing a solution to support
scaling at will, a pain point for ESP as well
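When content is split across shards, a distributed Solr query names the shards to search in the `shards` parameter. The sketch below shows how such a request could be assembled; the helper name, host names and the `/solr` path are hypothetical and depend on the deployment.

```python
from urllib.parse import urlencode


def build_distributed_params(query, shard_hosts, core_path="/solr"):
    """Build a Solr query string that fans out to several shards."""
    # Solr expects a comma-separated list such as "host1:8983/solr,host2:8983/solr".
    shards = ",".join(f"{host}{core_path}" for host in shard_hosts)
    return urlencode({"q": query, "shards": shards})


if __name__ == "__main__":
    print(build_distributed_params("bolt", ["idx1:8983", "idx2:8983"]))
```

Adding a shard then means adding a host to the list, while query capacity is scaled separately by replicating each shard to search slaves.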
25. It's not just hardware...
• Use Fabric to automate cluster installs, data
builds and deployment tasks
• Use Jenkins to automate, manage and track
Fabric tasks
• Use Supervisor to manage multiple services
running on each node
• Use Lucid Works for better out-of-the-box
stemming, alerts, services and support
26. Migration In a Nutshell
• We now consider Solr robust enough to be a
viable replacement of a FAST ESP solution
• You supply the glue, or work with someone like us
to tie the different components together
• If you have many custom pipeline stages, consider
using Pypes to ease your initial ESP migration
• Fully supported versions of Solr with the latest
cutting-edge features are available via Lucid Works
27. Resources
Lucid Works http://www.lucidimagination.com/
Rosette http://www.basistech.com/lucene/
Heritrix http://crawler.archive.org/
TwigKit http://twigkit.com/
Pypes https://bitbucket.org/diji/pypes/
Riak http://basho.com/
NLTK http://www.nltk.org/
RabbitMQ http://www.rabbitmq.com/
Cascading http://www.cascading.org/
Fabric http://fabfile.org/
Jenkins http://jenkins-ci.org/
Supervisor http://supervisord.org/
28. Questions?
• Contact Us!
• Website: http://www.tnrglobal.com
• E-Mail: fast2solr@tnrglobal.com
• Phone: 001-413-425-1499
Thank you for your time!