Using SweetSpotSimilarity for
Solr Fulltext Indexing
(A Public Service Message)
Jay Luker
SAO/NASA Astrophysics Data System
http://adsabs.harvard.edu/
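From http://lucene.apache.org/java/2_9_3/api/all/org/apache/lucene/search/Similarity.html
[Figure: the scoring formula from the Similarity javadoc, with three annotations]
● The formula as a whole: score for a particular result.
● Most of its terms: buncha stuff you probably ought to read up on.
● norm(t,d): "encapsulates a few (indexing time) boost and length factors"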
norm(t,d)
Includes...
● Document boost - e.g. <doc boost="2.5">
● Field boost - e.g. <field boost="3.0">
and what we're concerned with...
● lengthNorm(field) - computed at index time based
on the number of tokens in the field of the input
document.
These factors, multiplied together, make up the norm(t,d) for a given document.
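To make the multiplication concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class NormSketch and its norm helper are hypothetical names, the boost values are the ones from the examples above, and the 100-token field length is an assumption.

```java
public class NormSketch {
    // Hypothetical helper: multiplies the three index-time factors listed above.
    public static float norm(float docBoost, float fieldBoost, float lengthNorm) {
        return docBoost * fieldBoost * lengthNorm;
    }

    public static void main(String[] args) {
        // Assumed field length of 100 tokens, using the default 1/sqrt(numTokens).
        float lengthNorm = (float) (1.0 / Math.sqrt(100));
        // Boosts taken from the <doc boost="2.5"> and <field boost="3.0"> examples.
        System.out.println(norm(2.5f, 3.0f, lengthNorm));
    }
}
```

Lucene also packs the stored norm into a single byte, so the value actually kept in the index is coarser than this product.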
lengthNorm(String fieldName, int numTokens)
From the javadoc:
"Matches in longer fields are less precise, so implementations of
this method usually return smaller values when numTokens is
large, and larger values when numTokens is small."
Translation:
SHORTER DOCUMENTS SCORE HIGHER
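A rough way to see the "shorter documents score higher" effect is to evaluate the default formula, 1/sqrt(numTokens), at a few field lengths. The sketch below is standalone Java, not the Lucene implementation; the class name and the sample lengths are assumptions chosen for illustration.

```java
public class DefaultLengthNormSketch {
    // Default-style length normalization: 1 / sqrt(numTokens).
    public static float lengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        // Hypothetical field lengths, from abstract-sized to book-sized.
        int[] lengths = {100, 1000, 10000, 100000};
        for (int n : lengths) {
            System.out.printf("%7d tokens -> lengthNorm %.4f%n", n, lengthNorm(n));
        }
    }
}
```

A 100-token field gets a factor of 0.1 and a 10,000-token field gets 0.01, so a long fulltext body starts out at a tenfold disadvantage before any term matching is considered.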
SweetSpotSimilarity
lucene/contrib/misc/...
changes this ...
lengthNorm(L) = 1 / sqrt(L)
to this ...
lengthNorm(L) = 1 / sqrt(steepness * (|L - min| + |L - max| - (max - min)) + 1)
min/max = your "sweet spot" range. Lengths within
this range compute to a constant, i.e., 1.
steepness = controls the curve up to and down from
the sweet spot "plateau".
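The plateau can be checked by evaluating the sweet-spot formula directly. This is a minimal sketch that mirrors the formula in plain Java rather than calling SweetSpotSimilarity itself; the class name is hypothetical, and min = 1000, max = 20000, steepness = 0.5 are assumed example values.

```java
public class SweetSpotSketch {
    // lengthNorm(L) = 1 / sqrt(steepness * (|L - min| + |L - max| - (max - min)) + 1)
    public static float lengthNorm(int numTokens, int min, int max, float steepness) {
        int distance = Math.abs(numTokens - min) + Math.abs(numTokens - max) - (max - min);
        return (float) (1.0 / Math.sqrt(steepness * distance + 1.0));
    }

    public static void main(String[] args) {
        // Assumed sweet spot: 1000 to 20000 tokens, steepness 0.5.
        int min = 1000;
        int max = 20000;
        float steepness = 0.5f;
        // Sample lengths below, inside, and above the sweet spot.
        int[] lengths = {500, 1000, 5000, 20000, 40000};
        for (int n : lengths) {
            System.out.printf("%6d tokens -> lengthNorm %.5f%n", n, lengthNorm(n, min, max, steepness));
        }
    }
}
```

For any length between min and max the two absolute values sum to exactly max - min, so the term under the square root is 1 and the factor is 1. Outside that range the factor falls off on both sides, faster for larger steepness.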
(termcounts for all ADS's searchable fulltext since 01/2000)
In schema.xml:
<similarity class="org.ads.solr.SweetSpotSimilarityFactory">
  <str name="min">1000</str>
  <str name="max">20000</str>
  <str name="steepness">0.5</str>
</similarity>
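package org.ads.solr;

import org.apache.lucene.misc.SweetSpotSimilarity;
import org.apache.lucene.search.Similarity;
import org.apache.solr.schema.SimilarityFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SweetSpotSimilarityFactory extends SimilarityFactory {
    // Log under this factory's own class rather than SolrResourceLoader
    public static final Logger log =
        LoggerFactory.getLogger(SweetSpotSimilarityFactory.class);

    @Override
    public Similarity getSimilarity() {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();
        // Read the <str> parameters declared on <similarity> in schema.xml
        int max = this.params.getInt("max");
        int min = this.params.getInt("min");
        float steepness = this.params.getFloat("steepness");
        log.info("max: " + max);
        log.info("min: " + min);
        log.info("steepness: " + steepness);
        // yuck! hardcoded field settings for now
        sim.setLengthNormFactors("body", min, max, steepness, true);
        return sim;
    }
}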
Thanks!
Further reading:
"Lucene and Juru at TREC 2007: 1-Million Queries Track"
http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
Also, check out our Blacklight beta search!
http://labs.adsabs.harvard.edu/fulltext
