The Future of Search in Plone
                                          Sally Kleinfeldt
                                            and friends
                                  Plone Conference, San Francisco
                                        November 3, 2011




Tuesday, November 29, 2011
Motivation


                             •   Raise awareness

                             •   Promote discussion

                             •   Forge consensus




Agenda


                             •   Introduction to IR concepts

                             •   Description of Solr and ZCatalog

                             •   Discussion




IR 101




IR 101

                             •   Transformations

                             •   Terms

                             •   Models

                             •   Measures




IR 101
                                     Transformations
                             •   Turn binary, HTML, or other document
                                 formats into fields and strings

                             •   Parse the strings into a set of terms

                             •   Build indexes of the terms specific to the IR
                                 model used

                             •   Queries are parsed into query operators and
                                 strings, which are parsed into terms




IR 101
                                     String => Terms
                             •   Tokenization - locate word boundaries

                             •   Normalization - remove capitals and diacritics

                             •   Stopping - remove stop words (a, of, on,
                                 the...)

                             •   Stemming - reduce to word stems (walks,
                                 walking => walk)

                             •   Recognizers - concepts, parts of speech,
                                 names, locations...

                             •   Must be identical for documents and queries
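
The string-to-terms steps above can be sketched in a few lines of Python. This is only an illustration: the stop word list is the tiny one from the slide, and the hypothetical `stem` helper is a naive suffix stripper standing in for a real stemmer. The same `analyze` function is applied to documents and to queries, which is the point of the last bullet.

```python
import re
import unicodedata

# Tiny illustrative stop word list taken from the slide; real lists are longer.
STOP_WORDS = {"a", "of", "on", "the"}


def tokenize(text):
    """Tokenization: locate word boundaries with a simple regex."""
    return re.findall(r"\w+", text)


def normalize(token):
    """Normalization: lowercase and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Stemming: naive suffix stripping (an assumption, not a real stemmer)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the full pipeline; use it for both documents and queries."""
    tokens = [normalize(t) for t in tokenize(text)]
    return [stem(t) for t in tokens if t not in STOP_WORDS]


document_terms = analyze("Walking the caf\u00e9s of Plone")
query_terms = analyze("WALKS")
```

Because the document and the query go through the same pipeline, "Walking" in the document and "WALKS" in the query end up as the same term.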


IR 101
                                              Terms

                             •   Application specific

                             •   Words or phrases

                             •   IR models assign weights to terms in
                                 documents




IR 101
                                       Term Weighting
                             •   Simplest: Yes/No Boolean value

                             •   Better: Term Frequency - # occurrences

                             •   More meaningful: tf-idf

                                 •   Term Freq * Inverse Document Freq

                                 •   How many documents contain the term?

                                 •   Increase weight of rare terms and vice
                                     versa
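
A minimal sketch of tf-idf weighting, assuming documents are already lists of terms. The idf formula used here, log(N / df), is one common variant among several; the function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (docs are lists of terms)."""
    n_docs = len(docs)
    # Document frequency: how many documents contain the term?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency: number of occurrences in this doc
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["plone", "search", "search"],
    ["plone", "catalog"],
    ["plone", "solr", "search"],
]
weights = tf_idf(docs)
```

In this sample, "plone" occurs in every document and gets weight 0, while the rare term "catalog" outweighs the more frequent but more common term "search".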




IR 101
                                      Boolean Model
                             •   First and most widely adopted

                             •   Based on Boolean logic + set theory

                             •   Does a document contain query terms - Y/N

                             •   Intuitive, easy to implement

                             •   Drawbacks: no ranking, needs a special query
                                 language, returns too many or too few results

                             •   Typical for library systems
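
A small sketch of the Boolean model, assuming an inverted index that maps each term to the set of document ids containing it. AND, OR and NOT then become plain set operations, and a document either matches or it does not. The helper name `build_index` and the sample data are illustrative only.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


docs = {
    1: ["plone", "search"],
    2: ["plone", "catalog"],
    3: ["solr", "search"],
}
index = build_index(docs)

and_hits = index["plone"] & index["search"]   # plone AND search
or_hits = index["solr"] | index["catalog"]    # solr OR catalog
not_hits = index["plone"] - index["solr"]     # plone AND NOT solr
```

The results are unordered sets, which shows the "no ranking" drawback directly.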




IR 101
                                 Vector Space Models
                             •   Represent documents and queries as vectors
                                 of terms

                             •   Term values are weighted - by count or tf-idf

                             •   Use vector operations to compare
                                 documents with queries

                             •   Relevance score based on cosine of angle
                                 between doc/query vectors
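
A sketch of vector space ranking, assuming raw term counts as weights (tf-idf weights would slot in the same way). Documents and the query are sparse vectors stored as dicts, and the score is the cosine of the angle between them. The names `cosine` and `rank` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_terms, docs):
    """Return (score, doc_id) pairs, best match first."""
    query_vector = Counter(query_terms)
    scored = [
        (cosine(query_vector, Counter(terms)), doc_id)
        for doc_id, terms in docs.items()
    ]
    return sorted(scored, reverse=True)


docs = {
    "a": ["plone", "search", "solr"],
    "b": ["plone", "catalog"],
    "c": ["zope"],
}
ranking = rank(["plone", "search"], docs)
```

Unlike the Boolean model, every document gets a graded score, so partial matches such as document "b" still appear, just lower in the list.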




IR 101
                                 Probabilistic Models
                             •   Compute probability that a document is
                                 relevant to a query

                             •   Relevance ranking functions range from
                                 simple to complex

                             •   Sophisticated ranking functions include

                                 •   Okapi BM25 (uses tf and idf)

                                 •   Machine learning formulas (use training
                                     data)
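
A rough sketch of an Okapi BM25 scoring function. It uses one common formulation, with the non-negative idf variant log((N - n + 0.5) / (n + 0.5) + 1) and the usual defaults k1 = 1.2 and b = 0.75; other implementations, including the one in ZCTextIndex, may differ in detail. The function name `bm25_scores` is made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document (a list of terms) against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency of each term across the collection.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents are penalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    ["solr", "search", "search"],
    ["plone", "catalog", "search"],
    ["zope", "plone"],
]
scores = bm25_scores(["search"], docs)
```

Both tf and idf appear in the formula, as the slide notes; k1 controls how quickly repeated occurrences stop adding to the score.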




IR 101
                             Extending the Models
                             •   Many refinements possible

                                 •   Term interdependencies

                                 •   Fuzzy sets

                                 •   Semantic analysis, link analysis

                                 •   Combining models (Extended Boolean)

                             •   The best search engines represent thousands
                                 of engineering hours




IR 101
                                              Measures
                             •   Search engine results are measured by:

                                 •   Precision - Percent of returned results
                                     that are relevant

                                 •   Recall - Percent of relevant documents that
                                     are returned

                                 •   F-Score - Harmonic mean of precision and
                                     recall
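
The three measures can be computed from two sets of document ids: what the engine returned and what a human judged relevant. A minimal sketch, with a hypothetical function name and made-up ids:

```python
def precision_recall_f1(returned, relevant):
    """Compute precision, recall and F-score (F1) from two id collections."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Engine returned 4 documents; 3 documents are actually relevant; 2 overlap.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5})
```

Here precision is 2/4, recall is 2/3, and the F-score is 4/7.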




ZCatalog and Solr




ZCatalog
                             •   Zope/Plone search engine

                             •   Full text and field searching

                             •   Probabilistic model using Okapi BM25

                             •   Out-of-the-box ZCTextIndex is very simple

                             •   TextIndexNG adds multilingual, better parsing
                                 components, binary transforms, synonyms




Solr
                             •   Popular open source enterprise search
                                 platform

                             •   Eliminating smaller commercial search
                                 companies

                             •   Written in Java on top of the Lucene search
                                 library; extended vector space model

                             •   RESTful APIs

                             •   Large, active community

                             •   Powers Twitter, Wikipedia, Netflix...



What Does Solr Have That ZCatalog Doesn’t?
                             •   Better relevance ranking

                             •   More search features: snippets, hit
                                 highlighting, spelling suggestions, synonyms,
                                 more like this, faceted search

                             •   More configurable: stop words, field
                                 boosting, parsing components

                             •   An army of engineers working on it




Plone + Solr
                                              Today
                             •   Two add-ons available

                                 •   collective.solr - Intercepts catalog queries
                                     and dispatches them to Solr

                                 •   alm.solrindex - adds a new index type to
                                     the catalog, SolrIndex

                             •   Plus a buildout recipe:
                                 collective.recipe.solrinstance




Conclusions from Conference Discussion




Why Does Plone Need Solr?
                             •   Certain types of projects need it, for features
                                 or because ZCatalog can’t scale to very large
                                 sites

                             •   We need it to keep up with the enterprise
                                 CMS pack




Points of Agreement
                             •   It will be impossible to completely replace
                                 ZCatalog with Solr

                                 •   Solr indexing will never be transactional

                                 •   Removing ZCatalog from Zope would be
                                     very difficult

                                 •   Tackle small, focused ZCatalog
                                     improvements when possible, such as
                                     improving the indexing interface




Points of Agreement

                             •   Navigation and search should be handled
                                 separately

                                 •   Navigation needs to be transactional,
                                     search does not

                                 •   Split out a catalog used for navigation from
                                     the general catalog

                                 •   Explore a non-catalog utility to support
                                     navigation, optimize for speed




Points of Agreement
                             •   Treating Solr integration simply as ZCatalog
                                 replacement does not take best advantage of
                                 Solr features

                                 •   ZCatalog can’t represent the richness of
                                     Solr, focus on the Solr API

                                 •   Take advantage of spelling suggestions,
                                     facets, results snippets with hit highlighting,
                                     synonyms, more like this, etc.

                                 •   Expose Solr configuration choices (indexing,
                                     field weighting, etc.) in the control panel
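
To give a feel for talking to the Solr API directly, here is a sketch that builds a select URL asking for facets, hit highlighting and spelling suggestions in one request. The parameters (`q`, `wt`, `facet`, `facet.field`, `hl`, `hl.fl`, `spellcheck`) are standard Solr request parameters. The endpoint URL, the core name `plone`, and the field names `portal_type` and `SearchableText` are assumptions, and spelling suggestions only come back if the spellcheck component is configured on the Solr side.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/plone/select"


def build_solr_query(text, facet_field="portal_type",
                     highlight_field="SearchableText"):
    """Build a Solr select URL requesting facets, highlighting and spellcheck."""
    params = [
        ("q", text),                   # the user's query
        ("wt", "json"),                # response format
        ("facet", "true"),             # enable faceted search
        ("facet.field", facet_field),  # field to compute facet counts on
        ("hl", "true"),                # enable hit highlighting
        ("hl.fl", highlight_field),    # field to build snippets from
        ("spellcheck", "true"),        # needs the spellcheck component set up
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_solr_query("future of search")
```

None of these options have an equivalent in a plain catalog query, which is why the discussion favored working against the Solr API rather than hiding it behind ZCatalog.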



Points of Agreement
                             •   Neither of the current Solr add-ons provides
                                 the best foundation for the future

                                 •   But they’ve taught us how to do things
                                     better

                             •   Non-Solr approaches to improved Plone
                                 search should be deprecated

                                 •   Andreas Jung is not planning improvements
                                     to TextIndexNG!




Points of Agreement


                             •   Stop investing in ZCatalog as a search engine,
                                 Solr is the future




Plone + Solr
                                            Roadmap
                             •   Short term: Make Solr integration easy with
                                 an approved add-on (like LDAP)

                                 •   Build on what we’ve learned and create a
                                     better add-on to replace collective.solr and
                                     alm.solrindex

                                 •   Who wants to sponsor a sprint?




Plone + Solr
                                            Roadmap
                             •   Long term: Ship Solr integration with Plone,
                                 but don’t require Solr

                                 •   Solr has a lot of overhead and is not always
                                     needed

                                 •   But using it should be as easy as answering
                                     yes to a “Build with Solr?” installation
                                     option




