The Future of Search in
                                     Plone
                                          Sally Kleinfeldt
                                            and friends
                                  Plone Conference, San Francisco
                                        November 3, 2011




Tuesday, November 29, 2011
Motivation


                             •   Raise awareness

                             •   Promote discussion

                             •   Forge consensus




Agenda


                             •   Introduction to IR concepts

                             •   Description of Solr and ZCatalog

                             •   Discussion




IR 101




IR 101

                             •   Transformations

                             •   Terms

                             •   Models

                             •   Measures




IR 101
                                     Transformations
                             •   Turn binary, HTML, or other document
                                 formats into fields and strings

                             •   Parse the strings into a set of terms

                             •   Build indexes of the terms specific to the IR
                                 model used

                             •   Queries are parsed into query operators and
                                 strings, which are parsed into terms




IR 101
                                     String => Terms
                             •   Tokenization - locate word boundaries

                             •   Normalization - remove capitals and diacritics

                             •   Stopping - remove stop words (a, of, on,
                                 the...)

                             •   Stemming - reduce to word stems (walks,
                                 walking => walk)

                             •   Recognizers - concepts, parts of speech,
                                 names, locations...

                             •   Must be identical for documents and queries
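The string-to-terms steps above can be sketched in a few lines of Python. This is a minimal illustration, not how ZCatalog or Solr implement analysis: the stop list, the suffix rules, and the function names are all assumptions made up for this example, and the stemmer is deliberately naive.

```python
import re
import unicodedata

# Hypothetical stop list and suffix rules, chosen only for illustration.
STOP_WORDS = {"a", "of", "on", "the"}
SUFFIXES = ("ing", "s")


def tokenize(text):
    """Locate word boundaries: return runs of word characters."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Lowercase and strip diacritics (e.g. 'Café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Naive suffix stripping; real stemmers (e.g. Porter) are far more careful."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full pipeline; call it for documents and for queries alike."""
    tokens = (normalize(t) for t in tokenize(text))
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

Because `analyze` is the single entry point, a document containing "walking" and a query for "walks" both reduce to the same term, which is the reason the pipeline must be identical on both sides.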


IR 101
                                              Terms

                             •   Application specific

                             •   Words or phrases

                             •   IR models assign weights to terms in
                                 documents




IR 101
                                       Term Weighting
                             •   Simplest: Yes/No Boolean value

                             •   Better: Term Frequency - # occurrences

                             •   More meaningful: tf-idf

                                 •   Term Freq * Inverse Document Freq

                                 •   How many documents contain the term?

                                 •   Increase weight of rare terms and vice
                                     versa
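As a rough sketch of tf-idf weighting, the function below uses raw term counts and idf = log(N / df). This is one common variant among several; the function name and input shape (a list of already tokenized documents) are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of term lists (already tokenized).
    Weight = term frequency * log(number of docs / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

A term that appears in every document gets weight 0 under this variant, while a term found in only one document gets the largest idf, which matches the "increase weight of rare terms" bullet.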




IR 101
                                      Boolean Model
                             •   First and most widely adopted

                             •   Based on Boolean logic + set theory

                             •   Does a document contain query terms - Y/N

                             •   Intuitive, easy to implement

                             •   No ranking, special query language, too many
                                 or too few results

                             •   Typical for library systems
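The Boolean model maps directly onto set operations over an inverted index. The sketch below assumes documents are given as a dict of id to term list; all function names are made up for this example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_and(index, *terms):
    """Documents containing every term (set intersection)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Documents containing at least one term (set union)."""
    return set().union(*(index.get(t, set()) for t in terms))


def search_not(index, all_ids, term):
    """Documents that do not contain the term (set difference)."""
    return set(all_ids) - index.get(term, set())
```

Every result is an unordered set: a document either matches or it does not, which is why this model offers no ranking.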




IR 101
                                 Vector Space Models
                             •   Represent documents and queries as vectors
                                 of terms

                             •   Term values are weighted - by count or tf-idf

                             •   Use vector operations to compare
                                 documents with queries

                             •   Relevance score based on cosine of angle
                                 between doc/query vectors
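A minimal sketch of cosine-based ranking, assuming documents and queries are already represented as sparse {term: weight} dicts (the weights could be counts or tf-idf values). The function names are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scores = ((doc_id, cosine(query, vec)) for doc_id, vec in docs.items())
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Since the cosine depends only on direction, a long document is not favored merely for repeating terms more often than a short one.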




IR 101
                                 Probabilistic Models
                             •   Compute probability that a document is
                                 relevant to a query

                             •   Relevance ranking functions range from
                                 simple to complex

                             •   Sophisticated ranking functions include

                                 •   Okapi BM25 (uses tf and idf)

                                 •   Machine learning formulas (use training
                                     data)
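To show how Okapi BM25 combines tf and idf, here is a small sketch of the textbook formula. It is not the ZCTextIndex implementation; the parameter values k1=1.2 and b=0.75 are commonly cited defaults used here as assumptions, and the idf variant with the added 1 inside the logarithm is chosen so scores never go negative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query terms.

    Assumes docs is non-empty. k1 controls term-frequency saturation,
    b controls how strongly scores are normalized by document length.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency of each term across the collection.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

With equal term counts, a shorter document scores higher than a longer one, and repeated occurrences of a term add less and less to the score.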




IR 101
                             Extending the Models
                             •   Many, many refinements possible

                                 •   Term interdependencies

                                 •   Fuzzy sets

                                 •   Semantic analysis, link analysis

                                 •   Combining models (Extended Boolean)

                             •   The best search engines represent thousands
                                 of engineering hours




IR 101
                                              Measures
                             •   Search engine results are measured by:

                                 •   Precision - Percent of results that are
                                     relevant

                                 •   Recall - Percent of relevant documents that
                                     are returned

                                 •   F-Score - Harmonic mean of precision and
                                     recall
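The three measures can be computed from two sets: the document ids a search returned and the ids judged relevant. A minimal sketch, with a hypothetical function name:

```python
def precision_recall_f1(returned, relevant):
    """Return (precision, recall, F-score) for one query.

    returned: ids the engine returned; relevant: ids judged relevant.
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, returning documents 1 to 4 when documents 2, 4, and 6 are relevant gives precision 2/4, recall 2/3, and an F-score of 4/7.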




ZCatalog and Solr




ZCatalog
                             •   Zope/Plone search engine

                             •   Full text and field searching

                             •   Probabilistic model using Okapi BM25

                             •   Out-of-the-box ZCTextIndex is very simple

                             •   TextIndexNG adds multilingual, better parsing
                                 components, binary transforms, synonyms




Solr
                             •   Popular open source enterprise search
                                 platform

                             •   Eliminating smaller commercial search
                                 companies

                             •   Java, based on Lucene Java search library,
                                 sophisticated vector space ++ model

                             •   RESTful APIs

                             •   Large, active community

                             •   Powers Twitter, Wikipedia, Netflix...



What Does Solr Have
                             That ZCatalog Doesn’t?
                             •   Better relevance ranking

                             •   More search features: snippets, hit
                                 highlighting, spelling suggestions, synonyms,
                                 more like this, faceted search

                             •   More configurable: stop words, field
                                 boosting, parsing components

                             •   An army of engineers working on it
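Several of these features are switched on per request through query parameters on Solr's HTTP search handler. The sketch below only builds such a URL and sends nothing. The host, the core name `plone`, and the field names are assumptions for illustration, and spelling suggestions additionally require the spellcheck component to be configured on the Solr side.

```python
from urllib.parse import urlencode

# Hypothetical host, core name, and field names; adjust to the actual schema.
SOLR_SELECT = "http://localhost:8983/solr/plone/select"


def build_search_url(text, facet_field="portal_type", highlight_field="SearchableText"):
    """Build a select URL asking for facets, highlighted snippets, and spelling suggestions."""
    params = [
        ("q", text),                   # the user's query
        ("wt", "json"),                # response format
        ("facet", "true"),             # faceted search
        ("facet.field", facet_field),
        ("hl", "true"),                # hit highlighting / snippets
        ("hl.fl", highlight_field),
        ("spellcheck", "true"),        # spelling suggestions, if configured
    ]
    return SOLR_SELECT + "?" + urlencode(params)
```

Calling `build_search_url("future search")` yields a single GET URL; the response would carry results, facet counts, and highlighted snippets together, something a plain ZCatalog query has no equivalent for.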




Plone + Solr
                                              Today
                             •   Two add-ons available

                                 •   collective.solr - Intercepts catalog queries
                                     and dispatches them to Solr

                                 •   alm.solrindex - adds a new index type to
                                     the catalog, SolrIndex

                             •   Plus a buildout recipe:
                                 collective.recipe.solrinstance




Conclusions from
                             Conference Discussion




Why Does Plone
                                     Need Solr?
                             •   Certain types of projects need it, for features
                                 or because ZCatalog can’t scale to very large
                                 sites

                             •   We need it to keep up with the enterprise
                                 CMS pack




Points of Agreement
                             •   It will be impossible to completely replace
                                 ZCatalog with Solr

                                 •   Solr indexing will never be transactional

                                 •   Removing ZCatalog from Zope would be
                                     very difficult

                                 •   Tackle small, focused ZCatalog
                                     improvements when possible - like
                                     improving indexing interface




Points of Agreement

                             •   Navigation and search should be handled
                                 separately

                                 •   Navigation needs to be transactional,
                                     search does not

                                 •   Split out a catalog used for navigation from
                                     the general catalog

                                 •   Explore a non-catalog utility to support
                                     navigation, optimize for speed




Points of Agreement
                             •   Treating Solr integration simply as ZCatalog
                                 replacement does not take best advantage of
                                 Solr features

                                 •   ZCatalog can’t represent the richness of
                                     Solr, focus on the Solr API

                                 •   Take advantage of spelling suggestions,
                                     facets, results snippets with hit highlighting,
                                     synonyms, more like this, etc.

                                 •   Provide Solr indexing, field weighting, etc.
                                     configuration choices in the control panel



Points of Agreement
                             •   Neither of the current Solr add-ons provides
                                 the best foundation for the future

                                 •   But they’ve taught us how to do things
                                     better

                             •   Non-Solr approaches to improved Plone
                                 search should be deprecated

                                 •   Andreas Jung is not planning improvements
                                     to TextIndexNG!




Points of Agreement


                             •   Stop investing in ZCatalog as a search engine,
                                 Solr is the future




Plone + Solr
                                            Roadmap
                             •   Short term: Make Solr integration easy with
                                 an approved add-on (like LDAP)

                                 •   Build on what we’ve learned and create a
                                     better add-on to replace collective.solr and
                                     alm.solrindex

                                 •   Who wants to sponsor a sprint?




Plone + Solr
                                            Roadmap
                             •   Long term: Ship Solr integration with Plone,
                                 but don’t require Solr

                                 •   Solr has a lot of overhead and is not always
                                     needed

                                 •   But using it should be as easy as answering
                                     yes to a “Build with Solr?” installation
                                     option




