Migration from FAST ESP to
                                    Lucene Solr
                                   Presented by Michael McIntosh
                               michaelm@tnrglobal.com, Oct 19th, 2011




Wednesday, October 19, 11
What will we cover?
                Core Aspects of ESP to Solr Migration
                            Migration Overview
                            Crawling Content
                            Processing Content
                            Searching Content
                            Scaling for Growth
                            Questions?
                                            © 2011 TNR Global, LLC.

Who am I?

                    • 7+ Years FAST ESP
                    • 10+ Years in Search
                    • 15+ Years in Software
                    • Early Lycos Developer
                    • I also develop brain-computer interfaces :)

Who are we?

                    • 7+ Years in Search
                    • 15+ Years in Web Dev
                    • 30+ Years in Software
                    • Focus on ESP, Solr, Lucene, and the Cloud
                    • Scalable Web & Search Solution Experts

Migration Overview


Migration Challenges

                    • Our clients depend on ESP 5.3
                    • No future support for Linux ESP
                    • We need a viable exit strategy
                    • We want a fairly painless approach
                    • How do we provide an alternative?

Migration Use Case

                            Federated Product Search
                            ...millions of parts and services...

                    • XML documents (highly-structured)
                    • PDF documents (semi-structured)
                    • HTML documents (unstructured)

Our Approach
                            Solr Search Platform (SolrSP)
                    • Custom Scalable Crawler using Heritrix
                    • Events & Queues managed with RabbitMQ
                    • Caching & Persistence supported via Riak
                    • Python pipeline replacement using Pypes
                    • Advanced Linguistics via NLTK or Rosette

Crawling Content


Crawling for ESP

                    • For XML content, our scripts query a
                            service, download resources and feed
                            them to ESP
                    • For PDF content, our scripts query a
                            database, download the PDFs from their
                            URLs and feed them to ESP
                    • For HTML, our scripts query a database,
                            download seed URLs and launch ESP’s
                            Enterprise Crawler

Crawling for Solr

                    • For XML & PDF content, the approach
                            remains the same with a different writer
                    • We tried the Nutch crawler, but found it
                            challenging to make it do what we needed
                    • We tried the crawler bundled with Lucid
                            Works, but found the exposed functionality
                            did not offer the flexibility we needed
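
The "different writer" for XML and PDF content could be as small as a function that serializes an already-fetched document into Solr's XML update message (`<add><doc><field name="...">`). The following is a minimal sketch, not the presenters' code: the function name and the field names (`id`, `title`, `category`) are assumptions, and posting the message to Solr's update handler is left out so the sketch runs without a server.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts into a Solr <add> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A multi-valued field is written as repeated <field> elements.
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical product document; the field names are illustrative only.
message = to_solr_add_xml([
    {"id": "part-001", "title": "Hex bolt", "category": ["fasteners", "bolts"]},
])
```

Keeping the writer separate from the fetch scripts is what lets the same scripts feed either engine.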

Crawling with Heritrix

                    • Heritrix, created by the Internet Archive,
                            supports much of the same functionality
                            that the ESP Enterprise Crawler provides
                    • We wrapped Heritrix to provide a
                            higher-level interface for service management
                    • We made it scalable and added document
                            caching via Riak to support refresh crawling
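
One way a document cache can support refresh crawling is to store a digest of each fetched page and skip reprocessing when the digest has not changed. This is a sketch under stated assumptions, not the presenters' implementation: a plain dict stands in for a Riak bucket, and the function name is hypothetical.

```python
import hashlib


def needs_reprocessing(cache, url, content):
    """Return True if the content is new or changed since the last crawl.

    `cache` maps URL -> digest; here it is a dict standing in for Riak.
    """
    digest = hashlib.sha1(content).hexdigest()
    if cache.get(url) == digest:
        return False  # unchanged since the previous crawl
    cache[url] = digest  # remember the latest version
    return True


cache = {}
first = needs_reprocessing(cache, "http://example.com/a", b"<html>v1</html>")
second = needs_reprocessing(cache, "http://example.com/a", b"<html>v1</html>")
third = needs_reprocessing(cache, "http://example.com/a", b"<html>v2</html>")
```

Only the first and third fetches would be sent on for processing and indexing.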

Crawler Architecture
                    Crawl Job Request        Crawler Manager

                                Queue Cluster (RabbitMQ)

                    Heritrix Messenger   Heritrix Messenger   Heritrix Messenger

                    Heritrix Crawler     Heritrix Crawler     Heritrix Crawler

                                Persistence Cluster (Riak)
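
The tiers in the diagram can be sketched in a few lines, assuming the flow runs top to bottom as the layout suggests (the slide text itself does not spell out the arrows). Everything here is a stand-in: `queue.Queue` replaces the RabbitMQ cluster, a dict replaces the Riak cluster, and `fake_crawl` replaces a Heritrix crawl. All function names are hypothetical.

```python
import queue


def submit_crawl_job(job_queue, seeds):
    """Crawler manager: turn a crawl job request into one message per seed."""
    for url in seeds:
        job_queue.put({"seed": url})


def run_messenger(job_queue, crawl, store):
    """Messenger: drain the queue, hand each seed to a crawler, persist results."""
    while True:
        try:
            msg = job_queue.get_nowait()
        except queue.Empty:
            break  # no more work queued
        store[msg["seed"]] = crawl(msg["seed"])


def fake_crawl(url):
    # Placeholder for a real Heritrix crawl; returns fake content.
    return "content of " + url


jobs = queue.Queue()   # stand-in for the queue cluster
store = {}             # stand-in for the persistence cluster
submit_crawl_job(jobs, ["http://example.com/", "http://example.org/"])
run_messenger(jobs, fake_crawl, store)
```

Because messengers only talk to the queue and the store, more messenger/crawler pairs can be added without changing the manager.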



Processing Content


Processing for ESP
                            ESP Processing is document-centric
                    • For XML, we transform, tag metadata and
                            classify content before indexing
                    • For PDF, we split pages, generate
                            thumbnails, tag metadata and classify before
                            indexing
                    • For HTML, we normalize, clean content,
                            tag metadata and classify before indexing

Processing for Solr
                              Solr Processing is field-centric
                    • Solr analyzers work on a field-by-field basis
                            and lack the flexible workflow ESP provides
                    • We’re using some Solr analyzers for now, but
                            evaluating alternatives (Rosette, NLTK)
                    • Hadoop + Cascading looks promising
                    • We use Stackless Python with Pypes to
                            make ESP stage migration less painful

Processing with Pypes
                              •   Written in Python

                              •   Easy stage migration

                              •   Very flexible & robust

                              •   Branching & Merging

                              •   Single Input, Many
                                  Outputs

                              •   Trivial to embed and
                                  extend
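
To make "single input, many outputs" concrete, here is a minimal document-centric pipeline in plain Python. It does not use the Pypes API; it only illustrates the idea that a stage receives a whole document and may emit several. The stage names, field names and the fixed language tag are assumptions for illustration.

```python
def split_pages(doc):
    """One input document, many outputs: one document per page."""
    return [
        {"id": "%s#p%d" % (doc["id"], n), "text": page, "source": doc["id"]}
        for n, page in enumerate(doc["pages"], start=1)
    ]


def tag_language(doc):
    """Add a metadata field; a stand-in for a real tagging stage."""
    tagged = dict(doc)
    tagged["language"] = "en"  # hard-coded for the sketch
    return [tagged]


def run_pipeline(stages, docs):
    """Feed every output of one stage into the next stage."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs


result = run_pipeline(
    [split_pages, tag_language],
    [{"id": "manual.pdf", "pages": ["intro", "specs"]}],
)
```

A stage written this way sees every field of the document at once, which is the flexibility that field-by-field analyzers lack.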

Processor Migration

                                ...From ESP




Processor Migration

                                ...to Pypes




Searching Content


Feature Differences
                    •       ESP has robust faceting support, but facets must be
                            defined at index time, unlike Solr faceting

                    •       Solr does most of the heavy lifting at query time,
                            which allows for more flexible approaches

                    •       Solr now directly supports taxonomy (hierarchical)
                            faceting functionality (for drill-down categories)

                    •       Solr now supports field collapsing, which we use
                            heavily in our ESP installation to collapse result sets

                    •       ESP to Solr schema mapping is fairly straightforward
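
Because Solr decides facets and collapsing at query time, both are just request parameters. The sketch below builds a query string with Solr's `facet`, `facet.field`, `group` and `group.field` parameters. The helper name and the field names (`manufacturer`, `category`, `part_family`) are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode, parse_qs


def build_search_params(query, facet_fields, collapse_field=None):
    """Build a Solr query string with query-time facets and optional grouping."""
    params = [("q", query), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", f) for f in facet_fields)
    if collapse_field:
        # Result grouping is Solr's field-collapsing feature.
        params.append(("group", "true"))
        params.append(("group.field", collapse_field))
    return urlencode(params)


qs = build_search_params(
    "hex bolt", ["manufacturer", "category"], collapse_field="part_family"
)
```

Changing which fields are faceted only changes the request; nothing has to be reindexed, provided the fields are already indexed.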

Search Interface
                    •       Solr has no direct equivalent to FAST Query
                            Language (FQL) but function queries look like a
                            possible option for complex queries

                    •       If you don’t have overly complex queries, the
                            edismax query parser looks like a good option

                    •       Solr doesn’t have an easily extensible search front-end
                            component like ESP, but we like TwigKit for that

                    •       The default Solr stemmer isn’t as good as the ESP
                            lemmatizer, so if you need good lemmatization
                            consider Rosette Linguistics Platform or NLTK
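
For queries that are not overly complex, an edismax request mostly comes down to choosing which fields to search and how to weight them. A minimal sketch using Solr's `defType` and `qf` parameters; the helper name, field names and boost values are assumptions, and no request is sent.

```python
from urllib.parse import urlencode, parse_qs


def edismax_params(user_query, field_boosts):
    """Build a Solr query string that searches several weighted fields."""
    # qf is a space-separated list of field^boost pairs.
    qf = " ".join("%s^%s" % (field, boost) for field, boost in field_boosts)
    return urlencode({"defType": "edismax", "q": user_query, "qf": qf})


qs = edismax_params("stainless hex bolt", [("title", 3), ("description", 1)])
```

The user's text goes into `q` unchanged, so the front end does not need to build a structured query the way an FQL-based front end would.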

Scaling for Growth


About the hardware...
                    • Solr allows you to use the familiar rows /
                            columns layout ESP uses
                    • Add shards to scale content, add search
                            slaves to scale queries
                    • We’re currently using a master/slave
                            (indexer/search) setup, but options are numerous
                    • We’re developing a solution to support
                            scaling at will, a pain point for ESP as well
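
The rows/columns layout maps onto Solr's distributed search roughly like this: each column holds a different slice of the index (a shard), and each row is a complete set of search nodes. A query sent to one row lists that row's hosts in Solr's `shards` parameter. The host names and the helper below are hypothetical, and this is only a sketch of the idea.

```python
def shards_param(grid, row):
    """Return the `shards` value for one row of search nodes.

    grid[r][c] is the host in row r, column c. Columns hold different
    slices of the index; rows are replicas that share the query load.
    """
    return ",".join(grid[row % len(grid)])


# Two rows (replicas) of two columns (shards); host names are made up.
grid = [
    ["search1a:8983/solr", "search1b:8983/solr"],
    ["search2a:8983/solr", "search2b:8983/solr"],
]
row0 = shards_param(grid, 0)
row1 = shards_param(grid, 3)  # wraps around to row 1
```

Adding a column means re-partitioning content, while adding a row only means copying existing shards, which is why the two scale differently.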

It’s not just hardware...
                    • Use Fabric to automate cluster installs, data
                            builds and deployment tasks
                    • Use Jenkins to automate, manage and track
                            Fabric tasks
                    • Use Supervisor to manage multiple services
                            running on each node
                    • Use Lucid Works for better out-of-the-box
                            stemming, alerts, services and support
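
As an illustration of the Supervisor point above, the per-node service list can be rendered into `[program:x]` sections from a small script, which is also the kind of step a deployment task could run. This is a sketch only: the program names and commands are placeholders, and only the well-known `command`, `autostart` and `autorestart` keys are used.

```python
import configparser
import io


def supervisor_programs(services):
    """Render a mapping of name -> command into Supervisor program sections."""
    config = configparser.ConfigParser()
    for name, command in services.items():
        config["program:" + name] = {
            "command": command,
            "autostart": "true",
            "autorestart": "true",
        }
    buf = io.StringIO()
    config.write(buf)
    return buf.getvalue()


# Placeholder services for one node; adjust names and commands to your setup.
text = supervisor_programs({
    "solr": "java -jar start.jar",
    "messenger": "python crawler_messenger.py",
})
```

Generating the file from one mapping keeps every node's service list in a single place.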

Migration In a Nutshell

                    •       We now consider Solr robust enough to be a
                            viable replacement for a FAST ESP solution

                    •       You supply the glue, or work with someone like us
                            to tie the different components together

                    •       If you have many custom pipeline stages, consider
                            using Pypes to ease your initial ESP migration

                    •       Fully supported versions of Solr with the latest
                            cutting-edge features are available via Lucid Works

Resources
                       Lucid Works   http://www.lucidimagination.com/
                         Rosette     http://www.basistech.com/lucene/
                         Heritrix    http://crawler.archive.org/
                         TwigKit     http://twigkit.com/
                           Pypes     https://bitbucket.org/diji/pypes/
                            Riak     http://basho.com/
                           NLTK      http://www.nltk.org/
                        RabbitMQ     http://www.rabbitmq.com/
                        Cascading    http://www.cascading.org/
                           Fabric    http://fabfile.org/
                          Jenkins    http://jenkins-ci.org/
                        Supervisor   http://supervisord.org/

Questions?
                    • Contact Us!
                     • Website: http://www.tnrglobal.com
                     • E-Mail: fast2solr@tnrglobal.com
                     • Phone: 001-413-425-1499

                      Thank you for your time!
