- 1. Migration from FAST ESP to
Lucene Solr
Presented by Michael McIntosh
michaelm@tnrglobal.com, Oct 19th, 2011
- 2. What will we cover?
Core Aspects of ESP to Solr Migration
Migration Overview
Crawling Content
Processing Content
Searching Content
Scaling for Growth
Questions?
© 2011 TNR Global, LLC.
- 3. Who am I?
• 7+ Years FAST ESP
• 10+ Years in Search
• 15+ Years in Software
• Early Lycos Developer
• I also develop brain-computer interfaces :)
- 4. Who are we?
• 7+ Years in Search
• 15+ Years in Web Dev
• 30+ Years in Software
• Focus on ESP, Solr, Lucene, and the Cloud
• Scalable Web & Search Solution Experts
- 6. Migration Challenges
• Our clients depend on ESP 5.3
• No future support for Linux ESP
• We need a viable exit strategy
• We want a fairly painless approach
• How do we provide an alternative?
- 7. Migration Use Case
Federated Product Search
...millions of parts and services...
• XML documents (highly-structured)
• PDF documents (semi-structured)
• HTML documents (unstructured)
- 8. Our Approach
Solr Search Platform (SolrSP)
• Custom Scalable Crawler using Heritrix
• Events & Queues managed with RabbitMQ
• Caching & Persistence supported via Riak
• Python pipeline replacement using Pypes
• Advanced Linguistics via NLTK or Rosette
- 10. Crawling for ESP
• For XML content, our scripts query a
service, download resources and feed them to ESP
• For PDF content, our scripts query a
database, download PDF URLs and feed them to ESP
• For HTML, our scripts query a database,
download seed URLs and launch ESP’s
Enterprise Crawler
- 11. Crawling for Solr
• For XML & PDF content, the approach
remains the same with a different writer
• We tried the Nutch crawler, but found it
challenging to make it do what we needed
• We tried the crawler bundled with Lucid Works,
but found the exposed functionality did not
offer the level of flexibility we needed
- 12. Crawling with Heritrix
• Heritrix, created by the Internet Archive,
supports much of the same functionality
that the ESP Enterprise Crawler provides
• We wrapped Heritrix to provide a higher
level interface for service management
• We made it scalable and added document
caching via Riak to support refresh crawling
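A minimal sketch of how a document cache can support refresh crawling, assuming one content digest is stored per URL. A plain dict stands in for Riak here, and `cache` and `needs_refeed` are hypothetical names, not part of Heritrix or SolrSP:

```python
import hashlib

# Stand-in for the Riak bucket (assumption): URL -> digest of the last crawled body.
cache = {}

def needs_refeed(url, body, cache):
    """Return True if the page is new or has changed since the last crawl."""
    digest = hashlib.sha1(body).hexdigest()
    if cache.get(url) == digest:
        return False      # unchanged: skip processing and indexing
    cache[url] = digest   # new or changed: remember the new digest
    return True
```

On a refresh crawl, only the pages for which `needs_refeed` returns True would need to go back through processing and indexing.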
- 13. Crawler Architecture
[Diagram] Components shown:
• Crawl Job Request
• Crawler Manager
• Queue Cluster (RabbitMQ)
• Heritrix Messenger (x3)
• Heritrix Crawler (x3)
• Persistence Cluster (Riak)
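One way the components on this slide could fit together, sketched under assumptions: the manager places crawl jobs on a queue, and each worker pulls jobs, fetches the page and stores the result. The standard-library `queue.Queue` stands in for the RabbitMQ cluster and a dict for Riak. `submit_jobs`, `run_worker` and the `fetch` callback are hypothetical names, and the data flow itself is inferred from the diagram rather than stated in the slides:

```python
import queue

def submit_jobs(job_queue, seed_urls):
    """Manager side: enqueue one crawl job per seed URL."""
    for url in seed_urls:
        job_queue.put({"url": url})

def run_worker(job_queue, fetch, store):
    """Worker side: drain the queue, fetch each URL, persist the body.

    Returns the number of jobs handled.
    """
    handled = 0
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            return handled
        store[job["url"]] = fetch(job["url"])
        handled += 1
```

Because workers only talk to the queue and the store, more workers can be added without changing the manager.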
- 15. Processing for ESP
ESP Processing is document-centric
• For XML, we transform, tag metadata,
classify content before indexing
• For PDF, we split pages, generate
thumbnails, tag metadata and classify before
indexing
• For HTML, we normalize, clean content,
tag metadata and classify before indexing
- 16. Processing for Solr
Solr Processing is field-centric
• Solr analyzers work on a field-by-field basis
and lack the flexible workflow ESP provides
• Using some Solr analyzers for now, but
evaluating alternatives (Rosette, NLTK)
• Hadoop + Cascading looks promising
• We use Stackless Python with Pypes to
make ESP stage migration less painful
- 17. Processing with Pypes
• Written in Python
• Easy stage migration
• Very flexible & robust
• Branching & Merging
• Single Input, Many
Outputs
• Trivial to embed and
extend
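To illustrate the document-centric stage model (including "single input, many outputs"), here is a minimal sketch in plain Python. It does not use the Pypes API. The stage functions, `run_pipeline` and the document fields are all hypothetical, and form feeds are assumed to mark page breaks:

```python
def split_pages(doc):
    """Stage: emit one document per page (single input, many outputs)."""
    for number, page in enumerate(doc["body"].split("\f"), start=1):
        yield {**doc, "body": page, "page": number}

def normalize(doc):
    """Stage: collapse runs of whitespace in the body (one input, one output)."""
    doc = dict(doc)
    doc["body"] = " ".join(doc["body"].split())
    yield doc

def run_pipeline(docs, stages):
    """Pass every document through each stage in order."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs
```

For example, `run_pipeline([{"id": "pdf-1", "body": "page  one\fpage   two"}], [split_pages, normalize])` yields two cleaned page documents. Each stage is an ordinary function, so an ESP stage already written in Python could be wrapped this way with few changes.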
- 21. Feature Differences
• ESP has robust faceting support but facets must be
defined at index time, unlike Solr faceting
• Solr does most of the heavy lifting at query time,
which allows for more flexible approaches
• Solr now directly supports taxonomy (hierarchical)
faceting functionality (for drill down categories)
• Solr now supports field collapsing, which we use
heavily in our ESP installations to collapse result sets
• ESP to Solr schema mapping is fairly straightforward
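Since Solr facets and collapses at query time, both can be requested per search without re-indexing. A hedged sketch that builds such a request URL using Solr's `facet`, `facet.field`, `group` and `group.field` parameters. The host and the field names in the usage note are assumptions, and `build_search_url` is a hypothetical helper:

```python
from urllib.parse import urlencode

def build_search_url(base_url, query, facet_fields, collapse_field):
    """Build a Solr select URL with query-time faceting and field collapsing."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("facet", "true"),
        ("group", "true"),                # result grouping (field collapsing)
        ("group.field", collapse_field),
    ]
    # One facet.field parameter per facet; the parameter may repeat.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "/select?" + urlencode(params)
```

A call such as `build_search_url("http://localhost:8983/solr", "bearing", ["manufacturer", "category"], "part_number")` would facet on two fields and collapse results by part number.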
- 22. Search Interface
• Solr has no direct equivalent to FAST Query
Language (FQL) but function queries look like a
possible option for complex queries
• If you don’t have overly complex queries, the
edismax query parser looks like a good option
• Solr doesn’t have an easily extendable search-front
component like ESP, but we like TwigKit for that
• The default Solr stemmer isn’t as good as the ESP
lemmatizer, so if you need good lemmatization
consider Rosette Linguistics Platform or NLTK
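For queries that are not overly complex, an edismax request mostly comes down to a few parameters: `defType`, the query fields `qf` with optional boosts, and a minimum-should-match value `mm`. A minimal sketch, in which the helper name, the field names and the default `mm` of 75% are illustrative assumptions:

```python
def edismax_params(user_query, field_boosts, min_should_match="75%"):
    """Return Solr request parameters for an edismax query.

    field_boosts maps field name -> boost, e.g. {"title": 3, "body": 1}.
    """
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": qf,
        "mm": min_should_match,
    }
```

The resulting dict can be passed to `urllib.parse.urlencode` and appended to the select URL.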
- 24. About the hardware...
• Solr allows you to use the familiar rows /
columns layout ESP uses
• Add shards to scale content, add search
slaves to scale queries
• We’re currently using a master/slave
indexer/search setup, but options are numerous
• We’re developing a solution to support
scaling at will, a pain point for ESP as well
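A sketch of the "add shards to scale content" idea, assuming documents are routed to a shard by hashing their id at feed time and that queries fan out through Solr's `shards` request parameter. The host names in the usage note and both helper names are hypothetical:

```python
import zlib

def shard_for(doc_id, shard_urls):
    """Pick a shard by hashing the document id (stable across runs)."""
    index = zlib.crc32(doc_id.encode("utf-8")) % len(shard_urls)
    return shard_urls[index]

def shards_param(shard_urls):
    """Value for Solr's `shards` parameter: comma-separated host:port/path entries."""
    return ",".join(shard_urls)
```

With `["solr1:8983/solr", "solr2:8983/solr"]`, a given document always feeds to the same shard, and `shards_param` gives the value to send with each distributed query. Note that simple modulo routing remaps most documents when a shard is added, which is one reason scaling at will takes extra work.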
- 25. It’s not just hardware...
• Use Fabric to automate cluster installs, data
builds and deployment tasks
• Use Jenkins to automate, manage and track
Fabric tasks
• Use Supervisor to manage multiple services
running on each node
• Use Lucid Works for better out-of-the-box
stemming, alerts, services and support
- 26. Migration In a Nutshell
• We now consider Solr robust enough to be a
viable replacement for a FAST ESP solution
• You supply the glue, or work with someone like us
to tie the different components together
• If you have many custom pipeline stages, consider
using Pypes to ease your initial ESP migration
• Fully supported versions of Solr, with the latest
cutting-edge features, are available via Lucid Works
- 27. Resources
Lucid Works http://www.lucidimagination.com/
Rosette http://www.basistech.com/lucene/
Heritrix http://crawler.archive.org/
TwigKit http://twigkit.com/
Pypes https://bitbucket.org/diji/pypes/
Riak http://basho.com/
NLTK http://www.nltk.org/
RabbitMQ http://www.rabbitmq.com/
Cascading http://www.cascading.org/
Fabric http://fabfile.org/
Jenkins http://jenkins-ci.org/
Supervisor http://supervisord.org/
- 28. Questions?
• Contact Us!
• Website: http://www.tnrglobal.com
• E-Mail: fast2solr@tnrglobal.com
• Phone: 001-413-425-1499
Thank you for your time!