Migration from FAST ESP to Lucene Solr
                                   Presented by Michael McIntosh
                               michaelm@tnrglobal.com, Oct 19th, 2011




Wednesday, October 19, 11
What will we cover?
                Core Aspects of ESP to Solr Migration
                            Migration Overview
                            Crawling Content
                            Processing Content
                            Searching Content
                            Scaling for Growth
                            Questions?
                                            © 2011 TNR Global, LLC.

Who am I?

                    • 7+ Years FAST ESP
                    • 10+ Years in Search
                    • 15+ Years in Software
                    • Early Lycos Developer
                    • I also develop brain-computer interfaces :)

Who are we?

                    • 7+ Years in Search
                    • 15+ Years in Web Dev
                    • 30+ Years in Software
                    • Focus on ESP, Solr, Lucene, and the Cloud
                    • Scalable Web & Search Solution Experts

Migration Overview


Migration Challenges

                    • Our clients depend on ESP 5.3
                    • No future support for Linux ESP
                    • We need a viable exit strategy
                    • We want a fairly painless approach
                    • How do we provide an alternative?

Migration Use Case

                            Federated Product Search
                            ...millions of parts and services...

                    • XML documents (highly-structured)
                    • PDF documents (semi-structured)
                    • HTML documents (unstructured)

Our Approach
                            Solr Search Platform (SolrSP)
                    • Custom Scalable Crawler using Heritrix
                    • Events & Queues managed with RabbitMQ
                    • Caching & Persistence supported via Riak
                    • Python pipeline replacement using Pypes
                    • Advanced Linguistics via NLTK or Rosette

Crawling Content


Crawling for ESP

                    • For XML content, our scripts query a service, download resources, and feed them to ESP
                    • For PDF content, our scripts query a database, download PDF URLs, and feed them to ESP
                    • For HTML, our scripts query a database, download seed URLs, and launch ESP’s Enterprise Crawler

Crawling for Solr

                    • For XML & PDF content, the approach remains the same, with a different writer
                    • We tried the Nutch crawler, but found it challenging to make it do what we needed
                    • We tried the crawler bundled with Lucid Works, but found that the exposed functionality did not offer the level of flexibility we needed
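As a rough illustration of what a "different writer" for Solr might produce, the sketch below serializes crawled records into Solr's XML update message (`<add><doc><field name="...">`), which would then be POSTed to the index's `/update` handler. The function name `to_solr_add_xml` and the field names (`id`, `title`, `category`) are assumptions made for this example, not details from the talk.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(docs):
    """Serialize a list of dicts into a Solr XML <add> update message.

    List or tuple values are emitted as repeated <field> elements,
    which is how multi-valued fields are expressed in this format.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical record; the field names are placeholders for your own schema.
print(to_solr_add_xml([
    {"id": "part-001", "title": "Widget & Co", "category": ["parts", "services"]},
]))
```

Keeping the writer behind one small function like this means the query and download scripts do not need to know which engine they are feeding.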

Crawling with Heritrix

                    • Heritrix, created by the Internet Archive, supports much of the same functionality that the ESP Enterprise Crawler provides
                    • We wrapped Heritrix to provide a higher-level interface for service management
                    • We made it scalable and added document caching via Riak to support refresh crawling
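One way a document cache can support refresh crawling is to remember a digest of each page and re-feed only pages whose content has changed. The sketch below shows that idea under stated assumptions: an in-memory dict stands in for Riak, and the class name `DocumentCache` and the use of SHA-256 digests are illustrative choices, not the presenters' implementation.

```python
import hashlib


class DocumentCache:
    """In-memory stand-in for a Riak-backed document cache (an assumption)."""

    def __init__(self):
        self._digests = {}  # url -> hex digest of the last content seen

    def has_changed(self, url, content):
        """Return True if content for url is new or differs from the cached copy."""
        digest = hashlib.sha256(content).hexdigest()
        if self._digests.get(url) == digest:
            return False
        self._digests[url] = digest
        return True


cache = DocumentCache()
page = b"<html><body>Part 42</body></html>"
print(cache.has_changed("http://example.com/part/42", page))  # first crawl
print(cache.has_changed("http://example.com/part/42", page))  # refresh, same content
```

Only documents for which `has_changed` returns True would need to be pushed through processing and indexing again.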

Crawler Architecture
                    [Diagram, layers from top to bottom]
                    1. Crawl Job Request, Crawler Manager
                    2. Queue Cluster (RabbitMQ)
                    3. Heritrix Messenger (x3)
                    4. Heritrix Crawler (x3)
                    5. Persistence Cluster (Riak)
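The layering in this diagram can be sketched as a manager that publishes crawl jobs to a queue and several workers that consume them. In the sketch below, the standard library's `queue.Queue` stands in for the RabbitMQ cluster and each thread stands in for a messenger/crawler pair. The function name `run_crawl_jobs` and the None-sentinel shutdown are assumptions for illustration only.

```python
import queue
import threading


def run_crawl_jobs(seed_urls, worker_count=3):
    """Distribute seed URLs over workers through a shared queue.

    Returns a dict mapping worker id to the list of URLs that worker handled.
    """
    jobs = queue.Queue()
    handled = {i: [] for i in range(worker_count)}

    def messenger(worker_id):
        while True:
            url = jobs.get()
            if url is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            # A real messenger would launch a Heritrix crawl job here.
            handled[worker_id].append(url)
            jobs.task_done()

    workers = [threading.Thread(target=messenger, args=(i,))
               for i in range(worker_count)]
    for w in workers:
        w.start()
    for url in seed_urls:
        jobs.put(url)
    for _ in workers:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for w in workers:
        w.join()
    return handled


result = run_crawl_jobs(["http://example.com/a", "http://example.com/b",
                         "http://example.com/c", "http://example.com/d"])
print(sum(len(urls) for urls in result.values()))
```

Because workers only talk to the queue, adding crawl capacity means starting more consumers rather than changing the manager.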



Processing Content


Processing for ESP
                            ESP Processing is document-centric
                    • For XML, we transform, tag metadata, and classify content before indexing
                    • For PDF, we split pages, generate thumbnails, tag metadata, and classify before indexing
                    • For HTML, we normalize, clean content, tag metadata, and classify before indexing

Processing for Solr
                              Solr Processing is field-centric
                    • Solr analyzers work on a field-by-field basis and lack the flexible workflow ESP provides
                    • We are using some Solr analyzers for now, but evaluating alternatives (Rosette, NLTK)
                    • Hadoop + Cascading looks promising
                    • We use Stackless Python with Pypes to make ESP stage migration less painful

Processing with Pypes
                              •   Written in Python
                              •   Easy stage migration
                              •   Very flexible & robust
                              •   Branching & Merging
                              •   Single Input, Many Outputs
                              •   Trivial to embed and extend
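To make the document-centric, "single input, many outputs" idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the Pypes API. Each stage is an ordinary function that takes one document and returns a list of documents, and the stage names, document fields, and `run_pipeline` helper are all assumptions made for illustration.

```python
def split_pages(doc):
    """Stage: one input document, many outputs (one per form-feed-separated page)."""
    pages = doc["body"].split("\f")
    return [dict(doc, id="%s-p%d" % (doc["id"], n), body=page)
            for n, page in enumerate(pages, 1)]


def normalize(doc):
    """Stage: collapse runs of whitespace in the body."""
    return [dict(doc, body=" ".join(doc["body"].split()))]


def tag_metadata(doc):
    """Stage: attach a simple metadata field (body length)."""
    return [dict(doc, length=len(doc["body"]))]


def run_pipeline(stages, docs):
    """Feed every document through each stage in order, flattening the outputs."""
    for stage in stages:
        docs = [out for doc in docs for out in stage(doc)]
    return docs


# Pages are split before whitespace is normalized, because the form feed
# used as the page separator is itself whitespace.
for d in run_pipeline([split_pages, normalize, tag_metadata],
                      [{"id": "pdf-1", "body": "Page one\fPage   two"}]):
    print(d)
```

A stage written this way sees the whole document, which is the property that Solr's per-field analyzers do not provide.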

Processor Migration

                                ...From ESP




Processor Migration

                                ...to Pypes




Searching Content


Feature Differences
                    •       ESP has robust faceting support, but facets must be defined at index time, unlike Solr faceting

                    •       Solr does most of the heavy lifting at query time, which allows for more flexible approaches

                    •       Solr now directly supports taxonomy (hierarchical) faceting (for drill-down categories)

                    •       Solr now supports field collapsing, which we use heavily in our ESP installation to collapse result sets

                    •       ESP-to-Solr schema mapping is fairly straightforward
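Because Solr resolves facets and field collapsing at query time, both can be requested per query through parameters. The sketch below builds such a request URL with Solr's `facet`, `facet.field`, `group`, and `group.field` parameters. The host, the `/solr/select` path, and the field names (`brand`, `category`, `vendor`) are placeholder assumptions, as is the helper name `build_search_url`.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, facet_fields, collapse_field=None, rows=10):
    """Build a Solr select URL that asks for facets and, optionally, grouping."""
    params = [("q", query), ("rows", rows), ("wt", "json"), ("facet", "true")]
    # One facet.field parameter per field; no index-time facet definition is needed.
    params += [("facet.field", f) for f in facet_fields]
    if collapse_field:
        # Result grouping is Solr's field-collapsing feature.
        params += [("group", "true"), ("group.field", collapse_field)]
    return base_url + "?" + urlencode(params)


print(build_search_url("http://localhost:8983/solr/select", "pump",
                       ["brand", "category"], collapse_field="vendor"))
```

Changing which fields are faceted or collapsed is then a matter of changing the request, not rebuilding the index.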

Search Interface
                    •       Solr has no direct equivalent to FAST Query Language (FQL), but function queries look like a possible option for complex queries

                    •       If you don’t have overly complex queries, the edismax query parser looks like a good option

                    •       Solr doesn’t have an easily extendable search-front component like ESP, but we like TwigKit for that

                    •       The default Solr stemmer isn’t as good as the ESP lemmatizer, so if you need good lemmatization, consider the Rosette Linguistics Platform or NLTK
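For queries that are not overly complex, an edismax request is mostly a matter of choosing which fields to search and how to weight them. The sketch below assembles the standard parameters `defType`, `qf`, `pf`, and `mm`. The helper name `edismax_params`, the field names, and the boost values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def edismax_params(user_query, field_boosts, phrase_fields=None, min_should_match=None):
    """Return request parameters for Solr's edismax query parser."""
    params = {
        "defType": "edismax",
        "q": user_query,
        # qf: fields to search, each with a boost, e.g. "title^2 body^1"
        "qf": " ".join("%s^%s" % (f, b) for f, b in field_boosts.items()),
    }
    if phrase_fields:
        params["pf"] = " ".join(phrase_fields)  # boost docs matching the whole phrase
    if min_should_match:
        params["mm"] = min_should_match  # how many optional clauses must match
    return params


params = edismax_params("hydraulic pump", {"title": 2, "body": 1},
                        phrase_fields=["title"], min_should_match="2<75%")
print(urlencode(params))
```

The user's text goes into `q` unchanged, so the search front end does not need to construct a structured query the way it would with FQL.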

Scaling for Growth


About the hardware...
                    • Solr allows you to use the familiar rows / columns layout ESP uses
                    • Add shards to scale content, add search slaves to scale queries
                    • We’re currently using a master/slave indexer/search setup, but options are numerous
                    • We’re developing a solution to support scaling at will, a pain point for ESP as well
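Scaling content with shards has two sides: deciding which shard each document is indexed on, and telling Solr which shards to query. The sketch below illustrates both, using Solr's `shards` request parameter for distributed search. The CRC32-based routing, the host names, and the helper names `shard_for` and `distributed_query_url` are assumptions for this example, not the presenters' setup.

```python
import zlib
from urllib.parse import parse_qs, urlencode, urlparse


def shard_for(doc_id, shard_count):
    """Pick a shard index for a document id using a stable hash (CRC32)."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


def distributed_query_url(entry_host, shard_hosts, query):
    """Build a select URL that fans the query out over all listed shards."""
    params = {
        "q": query,
        # shards: comma-separated host:port/path entries, one per shard
        "shards": ",".join(host + "/solr" for host in shard_hosts),
    }
    return "http://%s/solr/select?%s" % (entry_host, urlencode(params))


hosts = ["search1:8983", "search2:8983"]  # hypothetical shard hosts
print(shard_for("part-001", len(hosts)))
url = distributed_query_url(hosts[0], hosts, "pump")
print(parse_qs(urlparse(url).query)["shards"])
```

Note that simple modulo routing like this moves most documents to a different shard when the shard count changes, which is one reason scaling at will is hard.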

It's not just hardware...
                    • Use Fabric to automate cluster installs, data
                            builds and deployment tasks
                    • Use Jenkins to automate, manage and track
                            Fabric tasks
                    • Use Supervisor to manage multiple services
                            running on each node
                    • Use Lucid Works for better out-of-the-box
                            stemming, alerts, services and support

Migration In a Nutshell

                    •       We now consider Solr robust enough to be a viable replacement for a FAST ESP solution

                    •       You supply the glue, or work with someone like us to tie the different components together

                    •       If you have many custom pipeline stages, consider using Pypes to ease your initial ESP migration

                    •       Fully supported versions of Solr with the latest cutting-edge features are available via Lucid Works

Resources
                       Lucid Works   http://www.lucidimagination.com/
                         Rosette     http://www.basistech.com/lucene/
                         Heritrix    http://crawler.archive.org/
                         TwigKit     http://twigkit.com/
                           Pypes     https://bitbucket.org/diji/pypes/
                            Riak     http://basho.com/
                           NLTK      http://www.nltk.org/
                        RabbitMQ     http://www.rabbitmq.com/
                        Cascading    http://www.cascading.org/
                           Fabric    http://fabfile.org/
                          Jenkins    http://jenkins-ci.org/
                        Supervisor   http://supervisord.org/

Questions?
                    • Contact Us!
                     • Website: http://www.tnrglobal.com
                     • E-Mail: fast2solr@tnrglobal.com
                     • Phone: 001-413-425-1499

                      Thank you for your time!

How AI, OpenAI, and ChatGPT impact business and software.
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Migrating from FAST ESP to Lucene Solr

  • 1. Migration from FAST ESP to Lucene Solr Presented by Michael McIntosh michaelm@tnrglobal.com, Oct 19th, 2011 Wednesday, October 19, 11
  • 2. What will we cover? Core Aspects of ESP to Solr Migration Migration Overview Crawling Content Processing Content Searching Content Scaling for Growth Questions? © 2011 TNR Global, LLC. Wednesday, October 19, 11
  • 3. Who am I? • 7+ Years FAST ESP • 10+ Years in Search • 15+ Years in Software • Early Lycos Developer • I also develop brain-computer interfaces :)
  • 4. Who are we? • 7+ Years in Search • 15+ Years in Web Dev • 30+ Years in Software • Focus on ESP, Solr, Lucene, and the Cloud • Scalable Web & Search Solution Experts
  • 5. Migration Overview
  • 6. Migration Challenges • Our clients depend on ESP 5.3 • No future support for Linux ESP • We need a viable exit strategy • We want a fairly painless approach • How do we provide an alternative?
  • 7. Migration Use Case Federated Product Search ...millions of parts and services... • XML documents (highly-structured) • PDF documents (semi-structured) • HTML documents (unstructured)
  • 8. Our Approach Solr Search Platform (SolrSP) • Custom Scalable Crawler using Heritrix • Events & Queues managed with RabbitMQ • Caching & Persistence supported via Riak • Python pipeline replacement using Pypes • Advanced Linguistics via NLTK or Rosette
  • 9. Crawling Content
  • 10. Crawling for ESP • For XML content, our scripts query a service, download resources and feed • For PDF content, our scripts query a database, download PDF URLs and feed • For HTML, our scripts query a database, download seed URLs and launch ESP’s Enterprise Crawler
  • 11. Crawling for Solr • For XML & PDF content, the approach remains the same with a different writer • We tried the Nutch crawler, but found it challenging to make it do what we needed • We tried the crawler bundled with Lucid Works, but found the exposed functionality did not offer the level of flexibility we needed
  • 12. Crawling with Heritrix • Heritrix, created by the Internet Archive, supports much of the same functionality that the ESP Enterprise Crawler provides • We wrapped Heritrix to provide a higher level interface for service management • We made it scalable and added document caching via Riak to support refresh crawling
  • 13. Crawler Architecture [Diagram. Components shown: Crawl Job Request, Crawler Manager, Queue Cluster (RabbitMQ), three Heritrix Messenger and Heritrix Crawler pairs, Persistence Cluster (Riak)]
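
The queue-plus-cache arrangement on slides 12 and 13 can be illustrated with a minimal sketch. Nothing here is the presenters' code: `queue.Queue` stands in for the RabbitMQ cluster, a plain dict stands in for the Riak cache, and the hypothetical `FAKE_WEB` table replaces real fetching by Heritrix. It only shows how a content cache lets a refresh crawl report just the documents that changed.

```python
import hashlib
import queue
import threading

# Stand-ins (assumptions): FAKE_WEB replaces real HTTP fetching,
# queue.Queue replaces RabbitMQ, and a dict replaces Riak.
FAKE_WEB = {
    "http://example.com/a": "page A v1",
    "http://example.com/b": "page B v1",
}

def crawl_worker(jobs, cache, changed, lock):
    """Pull URLs off the job queue; record only documents whose content changed."""
    while True:
        url = jobs.get()
        if url is None:  # sentinel: no more work for this worker
            return
        body = FAKE_WEB[url]
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        with lock:
            if cache.get(url) != digest:  # new or modified since the last crawl
                cache[url] = digest
                changed.append(url)

def run_crawl(urls, cache, workers=3):
    """Run one crawl pass and return the sorted list of new or changed URLs."""
    jobs = queue.Queue()
    changed = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawl_worker, args=(jobs, cache, changed, lock))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(changed)
```

Calling `run_crawl` twice with the same cache returns every URL on the first pass and an empty list on the second, which is the behaviour a refresh crawl relies on to avoid re-feeding unchanged documents.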
  • 14. Processing Content
  • 15. Processing for ESP ESP Processing is document-centric • For XML, we transform, tag metadata and classify content before indexing • For PDF, we split pages, generate thumbnails, tag metadata and classify before indexing • For HTML, we normalize, clean content, tag metadata and classify before indexing
  • 16. Processing for Solr Solr Processing is field-centric • Solr analyzers work on a field-by-field basis and lack the flexible workflow ESP provides • We are using some Solr analyzers for now, but evaluating alternatives (Rosette, NLTK) • Hadoop + Cascading looks promising • We use Stackless Python with Pypes to make ESP stage migration less painful
  • 17. Processing with Pypes • Written in Python • Easy stage migration • Very flexible & robust • Branching & Merging • Single Input, Many Outputs • Trivial to embed and extend
  • 18. Processor Migration ...From ESP
  • 19. Processor Migration ...to Pypes
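
The document-centric processing described on slides 15 to 17 can be sketched in plain Python. This is not the Pypes API and not an ESP stage: the stage names, the dict-based document shape and the `PIPELINES` routing table are all hypothetical. Each stage is a generator, so a single input document may yield many outputs, and documents are routed to a different stage list by content type.

```python
def split_pages(doc):
    """Stage: emit one document per page (single input, many outputs)."""
    for number, page in enumerate(doc["body"].split("\f"), start=1):
        out = dict(doc)
        out["body"] = page
        out["page"] = number
        yield out

def normalize(doc):
    """Stage: collapse runs of whitespace in the body."""
    out = dict(doc)
    out["body"] = " ".join(doc["body"].split())
    yield out

def tag_metadata(doc):
    """Stage: attach a simple derived field."""
    out = dict(doc)
    out["length"] = len(out["body"])
    yield out

# Hypothetical routing table: one stage list per content type (branching).
PIPELINES = {
    "pdf": [split_pages, normalize, tag_metadata],
    "html": [normalize, tag_metadata],
}

def process(doc):
    """Run a document through the stage list chosen by its type."""
    docs = [doc]
    for stage in PIPELINES[doc["type"]]:
        docs = [out for item in docs for out in stage(item)]
    return docs
```

A two-page PDF document passed to `process` comes back as two documents, each with its own `page` and `length` fields, while an HTML document passes through as one.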
  • 20. Searching Content
  • 21. Feature Differences • ESP has robust faceting support but facets must be defined at index time, unlike Solr faceting • Solr does most of the heavy lifting at query time, which allows for more flexible approaches • Solr now directly supports taxonomy (hierarchical) faceting functionality (for drill-down categories) • Solr now supports field collapsing, which we use heavily in our ESP installation to collapse result sets • ESP to Solr schema mapping is fairly straightforward
  • 22. Search Interface • Solr has no direct equivalent to FAST Query Language (FQL), but function queries look like a possible option for complex queries • If you don’t have overly complex queries, the edismax query parser looks like a good option • Solr doesn’t have an easily extendable search-front component like ESP, but we like TwigKit for that • The default Solr stemmer isn’t as good as the ESP lemmatizer, so if you need good lemmatization consider Rosette Linguistics Platform or NLTK
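
The query-time features from slides 21 and 22 map onto ordinary Solr request parameters. The sketch below builds a select URL that combines the edismax parser (`defType`, `qf`), faceting (`facet`, `facet.field`) and field collapsing (`group`, `group.field`). The endpoint and every field name (`title`, `body`, `brand`, `category`, `product_id`) are assumptions for illustration, not the presenters' schema.

```python
from urllib.parse import urlencode

# Assumed endpoint; adjust host, port and path for a real install.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_query(text, facet_fields, collapse_field):
    """Return a Solr select URL using edismax, query-time facets and grouping."""
    params = [
        ("q", text),
        ("defType", "edismax"),   # extended dismax query parser
        ("qf", "title^2 body"),   # hypothetical fields, title boosted
        ("facet", "true"),
    ]
    # Facets are requested per query; the field only needs to be indexed.
    params += [("facet.field", name) for name in facet_fields]
    params += [
        ("group", "true"),        # field collapsing
        ("group.field", collapse_field),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)
```

Because facet fields are plain request parameters, changing which facets a page shows is a query change rather than a re-index, which is the flexibility slide 21 contrasts with ESP.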
  • 23. Scaling for Growth
  • 24. About the hardware... • Solr allows you to use the familiar rows / columns layout ESP uses • Add shards to scale content, add search slaves to scale queries • We’re currently using a master/slave indexer/search setup, but options are numerous • We’re developing a solution to support scaling at will, a pain point for ESP as well
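
The rows / columns layout on slide 24 can be sketched as a small grid, assuming each column holds one partition of the index and each row is a full searchable copy. Hostnames are hypothetical, and the hash-based routing is one simple client-side choice rather than anything the talk prescribes. The comma-separated value returned by `shards_param` is the form Solr's distributed-search `shards` request parameter expects.

```python
import itertools
import zlib

# GRID[row][column]: columns partition the content, rows replicate it.
# Hostnames are hypothetical.
GRID = [
    ["solr-r0c0:8983/solr", "solr-r0c1:8983/solr"],
    ["solr-r1c0:8983/solr", "solr-r1c1:8983/solr"],
]

def column_for(doc_id, columns=len(GRID[0])):
    """Pick the column (shard) a document is indexed into, by stable hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % columns

_rows = itertools.cycle(range(len(GRID)))

def shards_param():
    """Return the `shards` value for the next query, rotating across rows."""
    return ",".join(GRID[next(_rows)])
```

Adding a column spreads content over more shards, while adding a row adds query capacity; with hash routing like this, changing the column count means re-routing existing documents, which is part of why scaling at will is called a pain point.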
  • 25. It's not just hardware... • Use Fabric to automate cluster installs, data builds and deployment tasks • Use Jenkins to automate, manage and track Fabric tasks • Use Supervisor to manage multiple services running on each node • Use Lucid Works for better out-of-the-box stemming, alerts, services and support
  • 26. Migration In a Nutshell • We now consider Solr robust enough to be a viable replacement for a FAST ESP solution • You supply the glue, or work with someone like us to tie the different components together • If you have many custom pipeline stages, consider using Pypes to ease your initial ESP migration • Fully supported versions of Solr with the latest cutting-edge features are available via Lucid Works
  • 27. Resources Lucid Works: http://www.lucidimagination.com/; Rosette: http://www.basistech.com/lucene/; Heritrix: http://crawler.archive.org/; TwigKit: http://twigkit.com/; Pypes: https://bitbucket.org/diji/pypes/; Riak: http://basho.com/; NLTK: http://www.nltk.org/; RabbitMQ: http://www.rabbitmq.com/; Cascading: http://www.cascading.org/; Fabric: http://fabfile.org/; Jenkins: http://jenkins-ci.org/; Supervisor: http://supervisord.org/
  • 28. Questions? • Contact Us! • Website: http://www.tnrglobal.com • E-Mail: fast2solr@tnrglobal.com • Phone: 001-413-425-1499 Thank you for your time!