Search
Find the rabbit..
24.4.2015 © Sanoma Media
Agenda
• Search Basics
• Features
• Search solutions
»MySQL (Full-Text search and Sphinx)
»Solr
»ElasticSearch
• Sanoma Content Library
• Common gotchas
Basics
A.B.C. of search
High level components
Filtering Indexing Querying Ranking
Filtering techniques
• Tokenizing
• Stop Words
• Synonyms
• Stemming
• Term occurrence
• Phonetics
Filtering techniques - Tokenizing
The quick brown fox jumps over a lazy dog
=> The, quick, brown, fox, jumps, over, a, lazy, dog
• Special characters: +/-,.;!@#$%^& etc.
»I.B.M.
• Case and numeric changes:
»PowerShot, TransAM, SD500, iPod
• Decide what you want to happen with:
»Canon Power-Shot SD500 (Canon Power shot SD-500, Canon Powershot SD 500)
»O’neill’s
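A minimal Python sketch of one possible tokenizing policy for the cases above. The function name `tokenize` and the rules (drop apostrophes and dots inside words, split on case and letter/digit changes, lowercase everything) are assumptions for illustration, not the behaviour of any particular search engine.

```python
import re

def tokenize(text):
    """Split text into lowercase tokens (one possible policy, not the only one)."""
    # Drop apostrophes and dots between word characters, so "I.B.M." -> "IBM.".
    text = re.sub(r"(?<=\w)['’.](?=\w)", "", text)
    # Insert a space at lower->upper and letter<->digit boundaries,
    # so "PowerShot" -> "Power Shot" and "SD500" -> "SD 500".
    text = re.sub(
        r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])",
        " ",
        text,
    )
    # Everything that is not a letter or digit acts as a separator.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]
```

With this policy "Canon Power-Shot SD500" and "Canon Power shot SD-500" produce the same tokens, but "Canon Powershot SD 500" does not, and "iPod" is split into "i" and "pod". That is exactly the kind of outcome you have to decide on up front.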
Filtering techniques - Stop Words
• Remove stop words so they are not indexed
• They add no value, since they’re too common
The quick brown fox jumps over a lazy dog
=> quick, brown, fox, jumps, lazy, dog
Stop words: a, able, about, across, after, all, almost, also, am, among, an, and, any, are, as, at, be, because, been, but, by, can, cannot, could, dear, did, do, does, either, else, ever, every, for, from, ..., got, had, have, ...
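A small Python sketch of stop word removal. The `STOP_WORDS` set is a short illustrative subset, not a complete list; "the" and "over" are included as an assumption because the example on the slide removes them.

```python
# Illustrative subset only; real stop word lists are longer and language specific.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
              "for", "from", "over", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (case-insensitive check)."""
    return [token for token in tokens if token.lower() not in stop_words]
```

Calling `remove_stop_words("The quick brown fox jumps over a lazy dog".split())` keeps quick, brown, fox, jumps, lazy and dog.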
Filtering techniques - Synonyms
• Map different words and spellings for the same thing to one term:
»bicycle, cycle, bike
»i-pod, ipot => iPod
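A minimal Python sketch of synonym mapping at filter time. The `SYNONYMS` table and the choice of "bicycle" and "ipod" as the canonical terms are assumptions for illustration.

```python
# Hypothetical synonym table: variant -> canonical term.
SYNONYMS = {
    "cycle": "bicycle",
    "bike": "bicycle",
    "i-pod": "ipod",
    "ipot": "ipod",
}

def apply_synonyms(tokens, synonyms=SYNONYMS):
    """Lowercase each token and replace known variants by their canonical term."""
    return [synonyms.get(token.lower(), token.lower()) for token in tokens]
```

A query for "bike repair" and a document about "bicycle repair" then end up with the same terms.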
Filtering techniques - Stemming
• Determine the stem of a word:
»Dogs => dog
»Recharging => recharg
»Rechargeable => recharg
• Language specific:
»Porter for English (-s, -ed, -ly, -ing, etc.)
»SnowballPorter or Kraaij-Pohlmann for Dutch (ge-, -en, etc.)
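A toy suffix stripper in Python to show the idea. This is not the Porter algorithm; the suffix list and the minimum stem length are assumptions chosen so that the three English examples above come out the same way.

```python
# Longest suffixes first, so "eable" wins over "able" and "s".
SUFFIXES = ("eable", "able", "ing", "ed", "ly", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

Real stemmers apply many more rules and conditions per language; use the one shipped with your search engine rather than a hand-rolled list like this.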
Filtering techniques - Term occurrence
• Options for limiting the size of the index:
»Minimum term frequency
»Minimum term length
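A short Python sketch of both options. It assumes "term frequency" means the number of occurrences in the token stream passed in; the function name and the default thresholds are illustrative.

```python
from collections import Counter

def prune_terms(tokens, min_frequency=2, min_length=3):
    """Keep only terms that occur often enough and are long enough to index."""
    counts = Counter(tokens)
    return {
        term: count
        for term, count in counts.items()
        if count >= min_frequency and len(term) >= min_length
    }
```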
Filtering techniques - Phonetics
• Handling “sounds like” queries:
» Robert => R163 <= Rupert
» Smith => (SM0,XMT) ∩ (XMT,SMT) <= Schmith
• Various methods available:
» DoubleMetaphone
» Metaphone
» Soundex
» RefinedSoundex
» Caverphone
» BeiderMorse
• Levenshtein distance can be used during querying
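A Python sketch of two of the techniques named above: classic American Soundex (which gives R163 for both Robert and Rupert) and Levenshtein edit distance. Both are plain textbook versions written for illustration; the function names are assumptions and production code would normally use the implementations in the search engine.

```python
# Soundex digit for each consonant group; vowels, h, w and y have no code.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]

def levenshtein(a, b):
    """Number of single-character inserts, deletes and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # delete
                               current[j - 1] + 1,              # insert
                               previous[j - 1] + (ca != cb)))   # substitute
        previous = current
    return previous[-1]
```

Phonetic codes are computed once at index time, while an edit distance is typically evaluated against candidate terms at query time.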
Apply the filters on indexing and querying
• Use the same filters at index time and at query time
Indexing
• Filters: stop words, stemming, synonyms, etc.
Querying
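A minimal Python sketch of why indexing and querying share the same filters: a tiny inverted index whose `add` and `search` both go through one `analyze` function. The class, the method names and the three-word stop list are assumptions for illustration.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "over"}  # illustrative subset

def analyze(text):
    """Shared filter chain: lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, text):
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every query term."""
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

If the query skipped `analyze`, a search for "FOX" would miss a document indexed as "fox", which is the mismatch the shared filters prevent.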
DEMO: Stemming, Phonetics
Ranking
TF-IDF: Term Frequency-Inverse Document Frequency
• Term frequency = how often the search term occurs in the text / how many words are in the entire text
Ranking – TF-IDF
• 3/12 = 0.25 versus 5/24 = 0.21: the text with the higher term frequency (0.25) is more relevant
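The arithmetic above, sketched in Python. The term frequency follows the slide; the inverse document frequency uses the common textbook form log(N / df), which is an assumption here since the slides do not spell out an IDF formula. Function names are illustrative.

```python
import math

def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the text."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, documents):
    """log(N / df): terms that occur in few documents weigh more (textbook variant)."""
    containing = sum(1 for tokens in documents if term in tokens)
    if containing == 0:
        return 0.0
    return math.log(len(documents) / containing)

def tf_idf(term, tokens, documents):
    return term_frequency(term, tokens) * inverse_document_frequency(term, documents)
```

A text with 3 hits in 12 words gets a term frequency of 0.25, and one with 5 hits in 24 words gets about 0.21, so the shorter text ranks higher even though it has fewer hits.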
USER PATTERNS
User patterns
• Features should be adjusted to the users and the usage patterns you’re seeing
• What are users searching for on your site?
• How are they searching for it?
• Use web analytics to track search behavior and improve your search
Image credits: http://www.flickr.com/photos/morville/collections/72157604060564791/
User pattern - Quit
User patterns – Pogosticking
User patterns - Thrashing
User patterns - Narrow
User patterns – Others
• Pearl Growing
• Expand
Search
Features
Search Features
• Faceting
• Autocomplete
• More like this..
• Highlighting
• Spellchecking: “did you mean”
• Geospatial: “bike repair” in the area of [long,lat],[long,lat]
• Boosting: when the title is more relevant than the content
• Elevation: always get a certain result at position n, e.g. the current weather or current traffic at 1st position, or to insert ads
Search Features - Faceting
From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to "drill down" or further restrict their search results based on those facets.
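A small Python sketch of the two halves of faceting: counting the values of a field across the results, and drilling down on one value. Documents are plain dicts here and the function and field names are assumptions; a search engine computes these counts from its index instead.

```python
from collections import Counter

def _values(doc, field):
    """Return the field's values as a list (the field may hold one value or a list)."""
    value = doc.get(field, [])
    return value if isinstance(value, list) else [value]

def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    counts = Counter()
    for doc in results:
        counts.update(_values(doc, field))
    return counts

def drill_down(results, field, value):
    """Restrict the results to documents that carry the chosen facet value."""
    return [doc for doc in results if value in _values(doc, field)]
```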
Search Features - Autocomplete
Search Features - More like this..
• Gives you related items based on a document
• Compares the term vectors of various documents
• Creates a query with boosting:
body:pre body:username^.56974 body:column^.57123 body:oracle^.61915 ...
Term      Instances of term in document  Documents matching term  IDF value  Score
pre       18                             26                       4.609916   82.978
username  10                             23                       4.7276993  47.276
column    9                              13                       5.266696   47.400264
oracle    9                              8                        5.7085285  51.376
alter     7                              1                        7.212606   50.488
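A Python sketch that reproduces the table: each score is the number of instances multiplied by the IDF value. The IDF column appears consistent with the classic formula used by older Lucene releases, 1 + ln(numDocs / (docFreq + 1)), for a corpus of 998 documents. That corpus size is inferred from the numbers and is an assumption, not something stated on the slide.

```python
import math

def lucene_classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# term -> (instances in the document, documents matching the term), from the table.
TERMS = {
    "pre": (18, 26),
    "username": (10, 23),
    "column": (9, 13),
    "oracle": (9, 8),
    "alter": (7, 1),
}
NUM_DOCS = 998  # assumption: inferred because it reproduces the IDF column

def score_terms(terms, num_docs):
    """Return (term, score) pairs, highest score first."""
    scores = {
        term: instances * lucene_classic_idf(doc_freq, num_docs)
        for term, (instances, doc_freq) in terms.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The highest scoring terms are the ones a "more like this" query would be built from.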
Search Features - Highlighting
• Highlighting the search terms
• Includes stemming and other logic
DEMO SOLR
SOLUTIONS
Services
Common search options
• MySQL based
»Native Full-Text search
»Sphinx Search Plugin
• Lucene based (Java)
»Apache Lucene/Solr
»ElasticSearch
[Figure: the common search options plotted by ease of use against power]
MySQL Based
Native Full-Text vs Sphinx
MySQL Full-Text search
• Only for MyISAM tables (InnoDB as of MySQL 5.6), and only on CHAR, VARCHAR and TEXT columns
• Only standard English stop words
• Limited query capabilities
• Slow on large collections (1GB+)
• Building faceting is “hard” and “expensive”
• No stemming, no synonyms, no custom fields, no highlighting
Sphinx
• External plugin
• All storage engines
• Also on numeric field types
• ~3x faster on index and query
• Simple stemming and synonyms
• No custom fields, no highlighting
Querying is easy
• MySQL Full-Text query:
    SELECT * FROM articles
    WHERE MATCH (title, body)
          AGAINST ('database');
• Getting the score:
    SELECT id, MATCH (title, body)
               AGAINST ('Tutorial') AS score
    FROM articles;
• Sphinx query, the index is a separate table:
    -- X and Y are placeholders for the bounds of the time range
    SELECT id, created_time, @weight
    FROM my_sphinx_index
    WHERE created_time BETWEEN X AND Y
      AND MATCH ('Android phone')
    ORDER BY @weight DESC,
             created_time DESC;
Lucene based
ElasticSearch
• Simpler Solr
• No need for a schema
• Easy to cluster
• Focus on scaling and realtime
• Go with the defaults
• Configuration = 3 lines
• Percolation!
• Versions and TTLs
Solr
• Exposes all of the Lucene power
• Clustering possible, but harder
• Focus on complete and customizable
• Defaults?
• Configuration = 3,000 lines
Solr vs ElasticSearch
[Chart: search time in ms, fresh index while idle, ElasticSearch vs Solr; lower is better]
[Chart: search time in ms, fresh index while indexing 1 doc/3 sec, ElasticSearch vs Solr; lower is better]
[Chart: search time in ms, full index while indexing 1 doc/3 sec, ElasticSearch vs Solr; lower is better]
[Chart: summary of the idle, indexing and full + indexing runs, Solr vs ElasticSearch; lower is better]
Querying with Solr and ElasticSearch
Solr
• Normal query
http://../solr?q=field:banana
• Faceting
http://../solr?q=field:banana&facet=on&facet.field=tags
ElasticSearch
• Normal query
http://../_search?q=field:value
• Advanced queries, via POST:
POST http://../collection/_search
{
  "query": { "query_string": { "query": "T*" } },
  "facets": {
    "tags": { "terms": { "field": "tags" } }
  }
}
ElasticSearch
SANOMA CONTENT LIBRARY
Sanoma Content Library
Search
.. in site
.. in cluster
.. in network
Elevation (ads)
Faceting
Related
More like this
Relevant ads
Products
Reuse
Sharing
Variants
(simple) DRM
Images
Analyse
Sentiment
Named Entities
Tagging
Classification
Key phrases
Services: Content Library
[Architecture diagram. Labels shown: Content Library, Analyse Pipeline (NER, Sentiment, Keyphrase extractor, Classifier), Crawler, Indexer, Loader, Search index, Solr, Mongo, API, Edge, Redirects, CMS / JCR; Search (nu.nl, wtf), Related (Vrouwen, Kieskeurig), Relevant (Txel), Integration (Vrouwen, Wordpress, SAS)]
Common gotchas
• Use the right stop word and stemming settings for your language
• Indexing too much or in too much detail:
»Timestamps
END

Más contenido relacionado

Similar a Search Basics

Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Lucidworks
 
Candidate selection tutorial
Candidate selection tutorialCandidate selection tutorial
Candidate selection tutorialYiqun Liu
 
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...Aman Grover
 
How to Apply Your Taxonomy to Your Content Automatically
How to Apply Your Taxonomy to Your Content AutomaticallyHow to Apply Your Taxonomy to Your Content Automatically
How to Apply Your Taxonomy to Your Content AutomaticallyAccess Innovations, Inc.
 
Sumo Logic QuickStart
Sumo Logic QuickStartSumo Logic QuickStart
Sumo Logic QuickStartSumo Logic
 
Análisis de las novedades del Elastic Stack
Análisis de las novedades del Elastic StackAnálisis de las novedades del Elastic Stack
Análisis de las novedades del Elastic StackElasticsearch
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search TrainingCloudera, Inc.
 
Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016Sumo Logic
 
Sumo Logic QuickStart Webinar Sep 2016
Sumo Logic QuickStart Webinar Sep 2016Sumo Logic QuickStart Webinar Sep 2016
Sumo Logic QuickStart Webinar Sep 2016Sumo Logic
 
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...lucenerevolution
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformLucidworks (Archived)
 
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!Richard Robinson
 
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...Grokking VN
 
GWAVACon 2015: GWAVA - Sneak Peek
GWAVACon 2015: GWAVA - Sneak PeekGWAVACon 2015: GWAVA - Sneak Peek
GWAVACon 2015: GWAVA - Sneak PeekGWAVA
 
Elastic Stack roadmap deep dive
Elastic Stack roadmap deep diveElastic Stack roadmap deep dive
Elastic Stack roadmap deep diveElasticsearch
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Lucidworks
 
RedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir ShpilraienRedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir ShpilraienRedis Labs
 
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...Dr. Haxel Consult
 

Similar a Search Basics (20)

Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
Solr Under the Hood at S&P Global- Sumit Vadhera, S&P Global
 
Candidate selection tutorial
Candidate selection tutorialCandidate selection tutorial
Candidate selection tutorial
 
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Reco...
 
2016 Cymer Intern
2016 Cymer Intern2016 Cymer Intern
2016 Cymer Intern
 
How to Apply Your Taxonomy to Your Content Automatically
How to Apply Your Taxonomy to Your Content AutomaticallyHow to Apply Your Taxonomy to Your Content Automatically
How to Apply Your Taxonomy to Your Content Automatically
 
Sumo Logic QuickStart
Sumo Logic QuickStartSumo Logic QuickStart
Sumo Logic QuickStart
 
Análisis de las novedades del Elastic Stack
Análisis de las novedades del Elastic StackAnálisis de las novedades del Elastic Stack
Análisis de las novedades del Elastic Stack
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
 
Search
SearchSearch
Search
 
Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016Sumo Logic QuickStart - May 2016
Sumo Logic QuickStart - May 2016
 
Sumo Logic QuickStart Webinar Sep 2016
Sumo Logic QuickStart Webinar Sep 2016Sumo Logic QuickStart Webinar Sep 2016
Sumo Logic QuickStart Webinar Sep 2016
 
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
Extending Solr: Behind CareerBuilder’s Cloud-like Knowledge Discovery Platfor...
 
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery PlatformExtending Solr: Building a Cloud-like Knowledge Discovery Platform
Extending Solr: Building a Cloud-like Knowledge Discovery Platform
 
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!
SCRIMPS-STD: Test Automation Design Principles - and asking the right questions!
 
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world se...
 
GWAVACon 2015: GWAVA - Sneak Peek
GWAVACon 2015: GWAVA - Sneak PeekGWAVACon 2015: GWAVA - Sneak Peek
GWAVACon 2015: GWAVA - Sneak Peek
 
Elastic Stack roadmap deep dive
Elastic Stack roadmap deep diveElastic Stack roadmap deep dive
Elastic Stack roadmap deep dive
 
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
Query-time Nonparametric Regression with Temporally Bounded Models - Patrick ...
 
RedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir ShpilraienRedisSearch / CRDT: Kyle Davis, Meir Shpilraien
RedisSearch / CRDT: Kyle Davis, Meir Shpilraien
 
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
II-SDV 2017: Localizing International Content for Search, Data Mining and Ana...
 

Último

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringWSO2
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingWSO2
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTopCSSGallery
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 

Último (20)

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 

Search Basics

  • 2. 224.4.2015 © Sanoma Media Agenda • Search Basics • Features • Search solutions »MySQL (Full-Text search and Sphinx) »Solr »ElasticSearch • Sanoma Content Library • Common gotcha’s
  • 4. High level components Filtering Indexing Querying Ranking 424.4.2015 © Sanoma Media
  • 5. High level components Filtering techniques Filtering Indexing Querying Ranking 54/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics
  • 6. High level components Filtering techniques Filtering Indexing Querying Ranking 64/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics The quick brown fox jumps over a lazy dog The quick brown fox jumps over a lazy dog
  • 7. High level components Filtering techniques Filtering Indexing Querying Ranking 74/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics • Special characters: +/-,.;!@#$%^& etc. »I.B.M. • Case and numeric changes: »PowerShot, TransAM, SD500, iPod • Decide what you want to happened with: »Canon Power-Shot SD500 (Canon Power shot SD-500, Canon Powershot SD 500) »O’neill’s
  • 8. • Remove stop words from being indexed • No value, since they’re to common The quick quick brown brown fox fox jumps jumps over a lazy lazy dog dog Stop words a,able,about, across,after,al l,almost,also, am,among,an ,and,any,are, as,at,be,beca use,been,but, by,can,cannot ,could,dear,di d,do,does,eith er,else,ever,e very,for,from, ,got,had have,h High level components Filtering techniques Filtering Indexing Querying Ranking 84/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics
  • 9. High level components Filtering techniques Filtering Indexing Querying Ranking 94/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics • De-duplicate various words: »bicycle, cycle, bike »i-pod, ipot => iPod
  • 10. High level components Filtering techniques Filtering Indexing Querying Ranking 104/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics • Determine the stem of a word: »Dogs => dog »Recharging => recharg »Rechargeable => recharg • Language specific: »Porter for English (-s, -ed, -ly, -ing, etc.) »SnowballPorter or Kraaij-Pohlmann for Dutch (ge-, -en, etc.)
  • 11. High level components Filtering techniques Filtering Indexing Querying Ranking 114/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics • Options for limiting the size of the index: »Minimum Term frequency »Minimum Term Length
  • 12. High level components Filtering techniques Filtering Indexing Querying Ranking 124/24/2015 © Sanoma Media • Tokenizing • Stop Words • Synonyms • Stemming • Term occurrence • Phonetics • Handling sounds like queries: » Robert => R163 <= Rupert » Smith => (SM0,XMT) ∩ (XMT,SMT) <= Schmith • Various methods available: » DoubleMetaphone » Metaphone » Soundex » RefinedSoundex » Caverphone » BeiderMorse • Levenstein can be used during quering
  • 13. High level components Apply the filters on Filtering and querying Filtering Indexing Querying Ranking 1324.4.2015 © Sanoma Media Same filters
  • 14. Stopwords,stemming, synonyms,etc. Filters High level components Indexing Filtering Indexing Querying Ranking 1424.4.2015 © Sanoma Media
  • 15. High level components Querying Filtering Indexing Querying Ranking 1524.4.2015 © Sanoma Media
  • 17. High level components Ranking Filtering Indexing Querying Ranking 1724.4.2015 © Sanoma Media TF-IDF Term Frequency-Inverse Document Frequency How often does the search term occur in the text How many words are in the entire text
  • 18. High level components Ranking – TF-IDF Filtering Indexing Querying Ranking 1824.4.2015 © Sanoma Media 3/12 = 0,25 5/24 = 0,21 More relevant
  • 20. User patterns
    • Features should be adjusted to the users and usage patterns you're seeing
    • What are users searching for on your site?
    • How are they searching for it?
    • Use web analytics to track search behavior and improve your search
    • Image credits (slides 20-25): http://www.flickr.com/photos/morville/collections/72157604060564791/
  • 21. User patterns - Quit
  • 22. User patterns - Pogosticking
  • 23. User patterns - Thrashing
  • 24. User patterns - Narrow
  • 25. User patterns - Others
    • Pearl Growing
    • Expand
  • 27. Search Features
    • Faceting
    • Autocomplete
    • More like this..
    • Highlighting
    • Spellchecking: did you mean
    • Geospatial: “bike repair” in area of [long,lat],[long,lat]
    • Boosting: when title is more relevant than content
    • Elevation: always get a certain result at position n, e.g. get the current weather or current traffic at 1st position, or ingest ads
  • 28. Search Features - Faceting
    • From the user perspective, faceted search (also called faceted navigation, guided navigation, or parametric search) breaks up search results into multiple categories, typically showing counts for each, and allows the user to "drill down" or further restrict their search results based on those facets.
  • 29. Search Features - Autocomplete
  • 30. Search Features - More like this..
    • Gives you the related items based on a document
    • Compares the term vectors of various documents
    • Creates a query with boosting:
      body:pre body:username^.56974 body:column^.57123 body:oracle^.61915 ...

      Term      Instances of term in document  Documents matching term  IDF value  Score
      pre       18                             26                       4.609916   82.978
      username  10                             23                       4.7276993  47.276
      column    9                              13                       5.266696   47.400264
      oracle    9                              8                        5.7085285  51.376
      alter     7                              1                        7.212606   50.488
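
The Score column in the table on slide 30 matches the number of instances multiplied by the IDF value. The sketch below recomputes it from the slide's own figures and ranks the terms; how the boost values in the generated query are derived from these scores is not stated on the slide, so that step is left out. The variable names are illustrative.

```python
# (term, instances in document, documents matching term, IDF value)
rows = [
    ("pre", 18, 26, 4.609916),
    ("username", 10, 23, 4.7276993),
    ("column", 9, 13, 5.266696),
    ("oracle", 9, 8, 5.7085285),
    ("alter", 7, 1, 7.212606),
]

# Score each term as term frequency times inverse document frequency.
scores = {term: tf * idf for term, tf, _df, idf in rows}

# The highest-scoring terms are the ones worth putting in the query.
ranked = sorted(scores, key=scores.get, reverse=True)
for term in ranked:
    print(f"{term:<9}{scores[term]:.3f}")
```

Note that "alter" outranks "column" and "username" despite fewer occurrences, because it appears in only one document and so carries a high IDF.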
  • 31. Search Features - Highlighting
    • Highlighting the search terms
    • Includes stemming and other logic
  • 32. DEMO SOLR
  • 34. Services: Common search options
    • MySQL based
      » Native Full-Text search
      » Sphinx Search Plugin
    • Lucene based (Java)
      » Apache Lucene/Solr
      » ElasticSearch
  • 35. Services: Common search options [chart with axes: ease of use, power]
  • 36. MySQL Based: Native Full-Text vs Sphinx
    • MySQL Full-Text search
      » Only for MyISAM tables, and only on CHAR, VARCHAR and TEXT fields
      » Only standard English stop words
      » Limited query capabilities
      » Slow on large collections (1GB+)
      » Building faceting is “hard” and “expensive”
      » No stemming, no synonyms, no custom fields, no highlighting
    • Sphinx
      » External plugin
      » All storage engines
      » Also on numeric field types
      » ~3x faster on index and query
      » Simple stemming and synonyms
      » No custom fields, no highlighting
  • 37. Querying is easy
    • MySQL Full-Text query:
      SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('database');
    • Getting the score:
      SELECT id, MATCH (title,body) AGAINST ('Tutorial') FROM articles;
    • Sphinx query, where the index is a separate table:
      SELECT id, created_time, @weight FROM my_sphinx_index WHERE created_time BETWEEN X AND Y AND MATCH ('Android phone') ORDER BY @weight DESC, created_time DESC
  • 38. Lucene based
    • ElasticSearch
      » Simpler Solr
      » No need for a schema
      » Easy to cluster
      » Focus on scaling and realtime
      » Go with the defaults
      » Configuration = 3 lines
      » Percolation!
      » Versions and TTLs
    • Solr
      » Exposing all of the Lucene power
      » Clustering possible, but harder
      » Focus on complete and customizable
      » Defaults?
      » Configuration = 3,000 lines
  • 39. Solr vs ElasticSearch: search on a fresh index while idle [bar chart: search time in ms, ElasticSearch vs Solr; lower is better]
  • 40. Solr vs ElasticSearch: search on a fresh index while indexing 1 doc/3 sec [bar chart: search time in ms, ElasticSearch vs Solr; lower is better]
  • 41. Solr vs ElasticSearch: search on a full index while indexing 1 doc/3 sec [bar chart: search time in ms, ElasticSearch vs Solr; lower is better]
  • 42. Solr vs ElasticSearch: search on a full index while indexing 1 doc/3 sec [bar chart: search time in ms, ElasticSearch vs Solr, with the series Idle, Indexing, and Full + Indexing; lower is better]
  • 43. Solr vs ElasticSearch [chart: Solr vs ElasticSearch; lower is better]
  • 44. Querying with Solr and ElasticSearch
    • Solr
      » Normal query: http://../solr?q=field:banana
      » Faceting: http://../solr?q=field:banana&facet=on&facet.field=tags
    • ElasticSearch
      » Normal query: http://../_search?q=field:value
      » Advanced queries, via POST: POST http://../collection/_search { "query": { "query_string": {"query": "T*"} }, "facets": { "tags": { "terms": {"field": "tags"} } } }
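
The two request styles on slide 44 can be assembled with the Python standard library. This is a sketch only: `SOLR_BASE` and `ES_BASE` are hypothetical endpoints (the slide elides the real hosts and paths), the helper names are invented here, and nothing is sent over the network. The `facets` body mirrors the slide; later ElasticSearch releases replaced facets with aggregations, so treat it as period syntax.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoints; substitute your own host, core and index names.
SOLR_BASE = "http://localhost:8983/solr/articles/select"
ES_BASE = "http://localhost:9200/articles/_search"


def solr_facet_url(field, value, facet_field):
    """Solr: the query and the facet options all travel as URL parameters."""
    params = {"q": f"{field}:{value}", "facet": "on", "facet.field": facet_field}
    return f"{SOLR_BASE}?{urlencode(params)}"


def es_facet_body(query, facet_field):
    """ElasticSearch: the same request is expressed as a JSON document."""
    return json.dumps({
        "query": {"query_string": {"query": query}},
        "facets": {facet_field: {"terms": {"field": facet_field}}},
    })


url = solr_facet_url("tags", "banana", "tags")
body = es_facet_body("T*", "tags")
print(url)
print("POST", ES_BASE, body)
```

The contrast is the design point of the slide: Solr keeps everything in the query string, while ElasticSearch moves anything beyond a simple `q=` lookup into a structured request body.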
  • 47. Sanoma Content Library
    • Search
      » .. in site
      » .. in cluster
      » .. in network
      » Elevation (ads)
      » Faceting
    • Related
      » More like this
      » Relevant ads
      » Products
    • Reuse
      » Sharing
      » Variants (simple)
      » DRM
      » Images
    • Analyse
      » Sentiment
      » Named Entities
      » Tagging
      » Classification
      » Key phrases
  • 48. Services: Content Library [architecture diagram; labels: Content Library, Analyse Pipeline, NER, Sentiment, Keyphrase extractor, Classifier, Crawler, Indexer, Search index, Search (nu.nl, wtf), Related (Vrouwen, Kieskeurig), Relevant (Txel), API, Edge, Redirects, Loader, Solr, Mongo, Integration (Vrouwen, Wordpress, SAS), CMS / JCR]
  • 49. Common gotchas
    • Use the right stop word and stemming settings for your language
    • Indexing too much or in too much detail:
      » Timestamps

Editor's notes

  17. MySQL stop words => myisam/ft_static.h. Also no multi-term.
  18. Search Percolation