SlideShare una empresa de Scribd logo
1 de 22
About us
• Founded in 2007
• B2B Startup
• Semantic Services started in
2011
• Big Data projects in 2012
• Training, Consultancy and
Support on Solr and Hadoop
About me
• Leo Oliveira
• 15 years of experience with
websites and search engines
• Specialized in Relevancy and Semantics
• Graduated in Business Management & IT
Innovation
The Importance of Relevancy
 The game changer in Search
 Google became what it is due to relevancy algorithms and PageRank
 Can be achieved through many different types of data
 Cross-reference, log analysis, social media, market research, new ideas etc
 It’s about the user
 And not about what we think of the user
 If you don’t understand your data, find what is relevant through
research
 Research user behavior and needs and you’ll find a way of finding relevant
information or you will find a way of discovering what is relevant automatically.
 If the user can’t find what they want using your search…
 … they’ll leave.
The Importance of having a resultset
 Sometimes you have little data and users get a lot of zero-result
queries.
 Then it’s time to find the right words for the job
 Synonym discovery, stemming
 Analysis: understand what is the user searching for
 Suggest other options
 Sometimes a zero-result query is just typo. Use suggestions to find the right
results.
 Create an interface that makes it easy for the user to try different keywords to get
results
 Analyze words that are searched together by the same users to create special
suggestions or second searches that brings some results
Relevancy and Semantics
 THE RELEVANCY RULES (the most forgotten ones)
 Phrases are more relevant the single words
 “Pure” words, that is, words the way they were typed, are more relevant than
stemmed words or even synonyms
 In general, newest docs are more relevant than old docs, but this rule could be
reversed.
 Closer venues and locations are more relevant than distant ones.
 Give the right weights to the right fields (seems simple, but it’s not)
 When you have a mix of relevant things…
 for instance, in an e-commerce, that you must consider the freshest docs, plus the
smaller prices and the best sellers, sometimes you need to create a math formula to
cope with all these items for relevancy. And it’s tricky.
 You can boost each parameter separately, but then you must test your formulas
very well not to prioritize one parameter more than the other.
Relevancy and Weighting
 USING DISMAX (an example of relevancy configuration)
 By using Dismax, you have two main options: qf and pf.
 Phrase Fields (pf) should have bigger weights than all of your Query Fields (qf)
 Field weights must be decided carefully. You can use a scale, such as Fibonacci
numbers, so as to not lose control of what is more relevant and what is not that relevant in
your “search formula”
 EXAMPLE OF pf for a movies website:
 MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8
 EXAMPLE OF qf:
 MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
Synonyms Theory Vs Synonyms in Search
Synonym Theory Vs Synonym in Search
 In search, synonyms can get complicated
 You don’t need to use only “real” synonyms
 You must focus on user needs rather than “thesaurus”
 You can use it for other useful stuff, such as the most common typos to be
corrected or even some unrelated words that could do the job for you
 Sometimes you just need to get creative.
1-way transformations
 SynonymFilter
 E. g. Compter => computer
 Very useful for common typos or misconceptions
 Sometimes requires expansion.
 In our setup, we’ll set a fieldtype that will do the job.
2-way transformations
 SynonymFilter
 E. g. computer, pc, mac, notebook, server
 Expansion of meanings
 Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can
be used.
 It’s hard to configure and maintain 1-way and 2-way in a single synonyms.txt file.
Index Time Vs Query Time
 Most synonyms should work with Query Time only
 But there are exceptions. E. g.: photoalbum, photo album
 When using a phrase synonym, this might require an index time synonym to work
correctly
 In our fieldtype, we’ll discuss how to achieve this.
 This index time synonym file is also useful for search when you need symbols in some
terms that could be droped by the tokenizer. E. g. C++ => cpp or .NET => dotnet etc
 Keep in mind that this index time file will have the exceptions that don’t work well in query-
time. To figure this out, sometimes it requires testing, sometimes it is easy to identify, such
as phrase synonyms.
Some examples
 Query time
 Enginer => Engineer (1-way)
 Engineer, engineering (2-way with expand=true)
 Computer, pc, mac
 Index time
 C# => csharp
 .NET => dotnet
 Human resources, HR
 Business intelligence, BI
 CRM, Costumer Relationship Management
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
<filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1"
generateWordParts="1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1"
generateWordParts="1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.BrazilianStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt"
ignoreCase="true" expand="true"/>
3. <filter catenateAll="0" catenateNumbers="1" catenateWords="1"
class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
4. <filter class="solr.LowerCaseFilterFactory"/>
5. <filter class="solr.BrazilianStemFilterFactory"/>
6. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
Creating a fieldtype with semantic relevancy
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt"
ignoreCase="true" expand="false"/>
3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt"
ignoreCase="true" expand="true"/>
4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0"
catenateWords="0" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
5. <filter class="solr.LowerCaseFilterFactory"/>
6. <filter class="solr.BrazilianStemFilterFactory"/>
7. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt"
ignoreCase="true" expand="false"/>
3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt"
ignoreCase="true" expand="true"/>
4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0"
catenateWords="0" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
5. <filter class="solr.LowerCaseFilterFactory"/>
6. <filter class="solr.BrazilianStemFilterFactory"/>
7. <filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldType>
Creating a fieldtype with semantic relevancy
<fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100”>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true"
words="stopwords_pt.txt"/>
<filter catenateAll="0" catenateNumbers="1" catenateWords="1"
class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
Creating a relevant formula for search
 Imagine you have a field named “Product_name”
 To keep search result semantically relevant, you’ll need a “Product_name” field and also a
“Product_name_pure”
 Keep in mind that phrases are also more relevant than individual words
 In that case, here’s how Solrconfig would be:
 QF (query fields, single words)
 Product_name_pure^5 Product_name^3
 PF (phrase fields)
 Product_name_pure^8 Product_nameˆ5
What will you get?
 Relevant results semantically
 Users will understand the results
 Synonyms or stemmed tokens won’t disturb or create noise
 Similar to what main search websites are doing for many different things, such as
document search, e-commerce, intranet search, document search, library search etc
 Be able to find relevant results even in a scenario with millions or billions of documents.
I N F I N I T E P O S S I B I L I T I ES
THANK YOU!
Get in touch!
Email: loliveira@semantix.com.br
Twitter: @SemantixBR

Más contenido relacionado

Destacado

Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...
Daryl Chow
 
Situaciones deseadas y no deseadas
Situaciones deseadas y no deseadasSituaciones deseadas y no deseadas
Situaciones deseadas y no deseadas
Emerson Salcedo
 
pl-visit Shop
pl-visit Shoppl-visit Shop
pl-visit Shop
plvisit
 
Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22
solomonmin
 
Edu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experimentsEdu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experiments
6primariaparecollvic
 
Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)
PriSkills Knowledge Solutions
 
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian ZamoraLa práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
cristiansamisan
 
Trignometria 5º primera parte
Trignometria 5º   primera parteTrignometria 5º   primera parte
Trignometria 5º primera parte
cjperu
 
DHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenationDHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenation
DHIFAF BAGDAH
 

Destacado (17)

Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...Suicides and suicide attempts during long term treatment with antidepressants...
Suicides and suicide attempts during long term treatment with antidepressants...
 
Seleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de AsturiasSeleccion de trabajos publicados en La Voz de Asturias
Seleccion de trabajos publicados en La Voz de Asturias
 
Situaciones deseadas y no deseadas
Situaciones deseadas y no deseadasSituaciones deseadas y no deseadas
Situaciones deseadas y no deseadas
 
pl-visit Shop
pl-visit Shoppl-visit Shop
pl-visit Shop
 
ReuniÓn De Padres
ReuniÓn De PadresReuniÓn De Padres
ReuniÓn De Padres
 
Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22Wifi sip ip phone sc 6600 catalog march 22
Wifi sip ip phone sc 6600 catalog march 22
 
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops PresentationYoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
Yoel Kenan - AFRICORI - World Intellectual Property Day Workshops Presentation
 
Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...Food systems transformation: what is the role of pulses in the sustainability...
Food systems transformation: what is the role of pulses in the sustainability...
 
Blusens Networks IPTV 2013
Blusens Networks IPTV 2013Blusens Networks IPTV 2013
Blusens Networks IPTV 2013
 
Regulatory Information Management
Regulatory Information ManagementRegulatory Information Management
Regulatory Information Management
 
Edu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experimentsEdu,jharvey,biel i marc sala experiments
Edu,jharvey,biel i marc sala experiments
 
Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)Presentación de Internet TV (IPTV)
Presentación de Internet TV (IPTV)
 
Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)Lean six sigma training and certification from TUV SUD ( Globally recognized)
Lean six sigma training and certification from TUV SUD ( Globally recognized)
 
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian ZamoraLa práctica del atletismo en personas con discapacidad (1) Cristian Zamora
La práctica del atletismo en personas con discapacidad (1) Cristian Zamora
 
GUILLERMO SIMÓN.
GUILLERMO SIMÓN.GUILLERMO SIMÓN.
GUILLERMO SIMÓN.
 
Trignometria 5º primera parte
Trignometria 5º   primera parteTrignometria 5º   primera parte
Trignometria 5º primera parte
 
DHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenationDHIFAF BAGHDAD company presenation
DHIFAF BAGHDAD company presenation
 

Similar a Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Gabriel Moreira
 
Full text search
Full text searchFull text search
Full text search
deleteman
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
Gabriel Moreira
 

Similar a Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA (20)

How publishing works in the digital era
How publishing works in the digital eraHow publishing works in the digital era
How publishing works in the digital era
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”Smartlogic, Semaphore and Semantically Enhanced Search –  For “Discovery”
Smartlogic, Semaphore and Semantically Enhanced Search – For “Discovery”
 
Oops Concepts
Oops ConceptsOops Concepts
Oops Concepts
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in SalesforceMoyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
Moyez Dreamforce 2017 presentation on Large Data Volumes in Salesforce
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementations
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
 
Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?Why Are Taxonomies Necessary?
Why Are Taxonomies Necessary?
 
Full text search
Full text searchFull text search
Full text search
 
How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.How To Write The Conclusion Of A. Online assignment writing service.
How To Write The Conclusion Of A. Online assignment writing service.
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
E1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7dE1f14172a91215931ed787d97dee1301fe7d
E1f14172a91215931ed787d97dee1301fe7d
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
 
Classification, Tagging & Search
Classification, Tagging & SearchClassification, Tagging & Search
Classification, Tagging & Search
 
You Don't Know SEO
You Don't Know SEOYou Don't Know SEO
You Don't Know SEO
 
Essential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003compEssential Tools Of An Xml Workflow2003comp
Essential Tools Of An Xml Workflow2003comp
 
11.pptx
11.pptx11.pptx
11.pptx
 
Automating With Excel An Object Oriented Approach
Automating  With  Excel    An  Object  Oriented  ApproachAutomating  With  Excel    An  Object  Oriented  Approach
Automating With Excel An Object Oriented Approach
 
How To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay ExampleHow To Write A Compare And Contrast Essay Example
How To Write A Compare And Contrast Essay Example
 

Último

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

  • 1.
  • 2. About us • Founded in 2007 • B2B Startup • Semantic Services started in 2011 • Big Data projects in 2012 • Training, Consultancy and Support on Solr and Hadoop
  • 3. About me • Leo Oliveira • 15 years of experience with websites and search engines • Specialized in Relevancy and Semantics • Graduated in Business Management & IT Innovation
  • 4. The Importance of Relevancy  The game changer in Search  Google became what it is due to relevancy algorithms and PageRank  Can be achieved through many different types of data  Cross-reference, log analysis, social media, market research, new ideas etc  It’s about the user  And not about what we think of the user  If you don’t understand your data, find what is relevant through research  Research user behavior and needs and you’ll find a way of finding relevant information or you will find a way of discovering what is relevant automatically.  If the user can’t find what they want using your search…  … they’ll leave.
  • 5. The Importance of having a resultset  Sometimes you have little data and users get a lot of zero-result queries.  Then it’s time to find the right words for the job  Synonym discovery, stemming  Analysis: understand what is the user searching for  Suggest other options  Sometimes a zero-result query is just typo. Use suggestions to find the right results.  Create an interface that makes it easy for the user to try different keywords to get results  Analyze words that are searched together by the same users to create special suggestions or second searches that brings some results
  • 6. Relevancy and Semantics  THE RELEVANCY RULES (the most forgotten ones)  Phrases are more relevant the single words  “Pure” words, that is, words the way they were typed, are more relevant than stemmed words or even synonyms  In general, newest docs are more relevant than old docs, but this rule could be reversed.  Closer venues and locations are more relevant than distant ones.  Give the right weights to the right fields (seems simple, but it’s not)  When you have a mix of relevant things…  for instance, in an e-commerce, that you must consider the freshest docs, plus the smaller prices and the best sellers, sometimes you need to create a math formula to cope with all these items for relevancy. And it’s tricky.  You can boost each parameter separately, but then you must test your formulas very well not to prioritize one parameter more than the other.
  • 7. Relevancy and Weighting  USING DISMAX (an example of relevancy configuration)  By using Dismax, you have two main options: qf and pf.  Phrase Fields (pf) should have bigger weights than all of your Query Fields (qf)  Field weights must be decided carefully. You can use a scale, such as Fibonacci numbers, so as to not lose control of what is more relevant and what is not that relevant in your “search formula”  EXAMPLE OF pf for a movies website:  MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8  EXAMPLE OF qf:  MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
  • 8. Synonyms Theory Vs Synonyms in Search
  • 9. Synonym Theory Vs Synonym in Search  In search, synonyms can get complicated  You don’t need to use only “real” synonyms  You must focus on user needs rather than “thesaurus”  You can use it for other useful stuff, such as the most common typos to be corrected or even some unrelated words that could do the job for you  Sometimes you just need to get creative.
  • 10. 1-way transformations  SynonymFilter  E. g. Compter => computer  Very useful for common typos or misconceptions  Sometimes requires expansion.  In our setup, we’ll set a fieldtype that will do the job.
  • 11. 2-way transformations  SynonymFilter  E. g. computer, pc, mac, notebook, server  Expansion of meanings  Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can be used.  It’s hard to configure and maintain 1-way and 2-way in a single synonyms.txt file.
  • 12. Index Time Vs Query Time  Most synonyms should work with Query Time only  But there are exceptions. E. g.: photoalbum, photo album  When using a phrase synonym, this might require an index time synonym to work correctly  In our fieldtype, we’ll discuss how to achieve this.  This index time synonym file is also useful for search when you need symbols in some terms that could be droped by the tokenizer. E. g. C++ => cpp or .NET => dotnet etc  Keep in mind that this index time file will have the exceptions that don’t work well in query- time. To figure this out, sometimes it requires testing, sometimes it is easy to identify, such as phrase synonyms.
  • 13. Some examples  Query time  Enginer => Engineer (1-way)  Engineer, engineering (2-way with expand=true)  Computer, pc, mac  Index time  C# => csharp  .NET => dotnet  Human resources, HR  Business intelligence, BI  CRM, Costumer Relationship Management
  • 14. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BrazilianStemFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.BrazilianStemFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 15. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/> 3. <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 4. <filter class="solr.LowerCaseFilterFactory"/> 5. <filter class="solr.BrazilianStemFilterFactory"/> 6. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer>
  • 16. Creating a fieldtype with semantic relevancy <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> 3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> 4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.BrazilianStemFilterFactory"/> 7. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 17. Creating a fieldtype with semantic relevancy <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> 1. <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> 2. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/> 3. <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/> 4. <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> 5. <filter class="solr.LowerCaseFilterFactory"/> 6. <filter class="solr.BrazilianStemFilterFactory"/> 7. <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer> </fieldType>
  • 18. Creating a fieldtype with semantic relevancy <fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100”> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> </analyzer>
  • 19. Creating a relevant formula for search  Imagine you have a field named “Product_name”  To keep search result semantically relevant, you’ll need a “Product_name” field and also a “Product_name_pure”  Keep in mind that phrases are also more relevant than individual words  In that case, here’s how Solrconfig would be:  QF (query fields, single words)  Product_name_pure^5 Product_name^3  PF (phrase fields)  Product_name_pure^8 Product_nameˆ5
  • 20. What will you get?  Relevant results semantically  Users will understand the results  Synonyms or stemmed tokens won’t disturb or create noise  Similar to what main search websites are doing for many different things, such as document search, e-commerce, intranet search, document search, library search etc  Be able to find relevant results even in a scenario with millions or billions of documents.
  • 21. I N F I N I T E P O S S I B I L I T I ES
  • 22. THANK YOU! Get in touch! Email: loliveira@semantix.com.br Twitter: @SemantixBR