Information Retrieval (IR) approaches for semantic web search engines have become very populars in the last years. Popularization of different IR libraries, like Lucene, that allows IR implementations almost out-of-the-box have make easier IR integration in Semantic Web search engines. However, one of the most important features of Semantic Web documents is the structure, since this structure allow us to represent
semantic in a machine readable format. In this paper we analyze the specific problems of structured IR and how to adapt weighting schemas for semantic document retrieval.
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Using BM25F for Semantic Search
1. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Using BM25F for Semantic Search
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin
Perez-Iglesias, Victor Fresno
Metadata Research Center, UNC, UCM, UNED
April 26 - 2010
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
2. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Outline
1 Introduction
2 Indexing RDF using inverted indexes
Indexing based on links
3 Ranking based retrieval for RDF objects based on structured IR
Ranking for structured documents
Dangers to combine scores from different document fields
4 A Semantic Search evaluation framework
5 The case study: Lucene Vs BM25F
Results and discussion
6 Conclusions and Future Work
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
3. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Keyword-based Semantic Search
Keyword-based Semantic Web search engine development has
become a major research area garnering much attention in the
Semantic Web community over the last seven years.
Just for the sake of curiosity
It is possible to improve quality results in terms of relevance
applying just classical IR approaches to RDF semantic structure?
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
4. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Two main problems
Indexing RDF triples using inverted indexes
Ranking based retrieval for RDF objects
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
5. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Two main problems
Indexing RDF triples using inverted indexes
Ranking based retrieval for RDF objects
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
6. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Outline
1 Introduction
2 Indexing RDF using inverted indexes
Indexing based on links
3 Ranking based retrieval for RDF objects based on structured IR
Ranking for structured documents
Dangers to combine scores from different document fields
4 A Semantic Search evaluation framework
5 The case study: Lucene Vs BM25F
Results and discussion
6 Conclusions and Future Work
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
7. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Problem
How to store RDF triples in inverted indexes OR how to represent
subjects, predicates, and objects information in a n × m matrix.
Solutions
SIREN (based on XML indexing techniques)
SEMPLORE model (based on the idea of artificial documents
with fields)
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
8. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Problem
How to store RDF triples in inverted indexes OR how to represent
subjects, predicates, and objects information in a n × m matrix.
Solutions
SIREN (based on XML indexing techniques)
SEMPLORE model (based on the idea of artificial documents
with fields)
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
9. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Problem
How to store RDF triples in inverted indexes OR how to represent
subjects, predicates, and objects information in a n × m matrix.
Solutions
SIREN (based on XML indexing techniques)
SEMPLORE model (based on the idea of artificial documents
with fields)
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
10. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
We follow SEMPLORE model with some changes.
Index Structure based on SEMPLORE model
FIELD CONTENT
text plain text
title keywords from the URI
obj objects
inlinks incoming link defined by a predicate
type rdf:type
Table: Fields used to represent RDF structure in the inverted index.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
11. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Two dbpedia entries
The Godfather http://dbpedia.org/page/The Godfather
Francis Ford Coppola
http://dbpedia.org/page/Francis Ford Coppola
Triple
The Goodfather (Subject) - dbpprop:director (Predicate)- Francis
Ford Coppola (Object)
Using inlink text to index the landing URL
The word director can be used as keyword to index the entry
described by this URI
http://dbpedia.org/page/Francis Ford Coppola.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
12. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Two dbpedia entries
The Godfather http://dbpedia.org/page/The Godfather
Francis Ford Coppola
http://dbpedia.org/page/Francis Ford Coppola
Triple
The Goodfather (Subject) - dbpprop:director (Predicate)- Francis
Ford Coppola (Object)
Using inlink text to index the landing URL
The word director can be used as keyword to index the entry
described by this URI
http://dbpedia.org/page/Francis Ford Coppola.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
13. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Indexing based on links
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Two dbpedia entries
The Godfather http://dbpedia.org/page/The Godfather
Francis Ford Coppola
http://dbpedia.org/page/Francis Ford Coppola
Triple
The Goodfather (Subject) - dbpprop:director (Predicate)- Francis
Ford Coppola (Object)
Using inlink text to index the landing URL
The word director can be used as keyword to index the entry
described by this URI
http://dbpedia.org/page/Francis Ford Coppola.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
14. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
Outline
1 Introduction
2 Indexing RDF using inverted indexes
Indexing based on links
3 Ranking based retrieval for RDF objects based on structured IR
Ranking for structured documents
Dangers to combine scores from different document fields
4 A Semantic Search evaluation framework
5 The case study: Lucene Vs BM25F
Results and discussion
6 Conclusions and Future Work
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
15. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
Classical IR
For long time, search engines have been dealing with flat
documents, that is, without structure.
Consequence
The main consequence of this approach is the fact that terms
within a document are considered to have the same relevance (or
value), disregarding their role in the document.
Simplification
This assumption implies a relevance model simplification based on
bag of words, and, therefore, useful information is lost.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
16. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
Classical IR
For long time, search engines have been dealing with flat
documents, that is, without structure.
Consequence
The main consequence of this approach is the fact that terms
within a document are considered to have the same relevance (or
value), disregarding their role in the document.
Simplification
This assumption implies a relevance model simplification based on
bag of words, and, therefore, useful information is lost.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
17. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
Classical IR
For long time, search engines have been dealing with flat
documents, that is, without structure.
Consequence
The main consequence of this approach is the fact that terms
within a document are considered to have the same relevance (or
value), disregarding their role in the document.
Simplification
This assumption implies a relevance model simplification based on
bag of words, and, therefore, useful information is lost.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
18. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
Structured IR
Structured IR uses the document’s structure to identify where
the most representative terms of the document are (e.g. title,
abstract,HTML or XML tags, etc)
Boost factors are used to modify the impact of every term in
the ranking function in order to take into account the
document’s structure.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
19. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
Ranking functions
State of the art models have been adapted to this situation.
BM25F
LM for structured documents
... but this adaptation have some tricks.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
20. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
The Problem
The linear combination of weigths for each field of the document is
√
not enough if a saturation function, like log (tf ) or tf is used in
the TF function
Figure: Source: Robertson et al.2004
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
21. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents
A Semantic Search evaluation framework Dangers to combine scores from different document fields
The case study: Lucene Vs BM25F
Conclusions and Future Work
Lucene’s ranking function
The method used by Lucene to compute the score of an structured
document is based on the linear combination of the scores for each
field of the document.
score(q, d) = score(q, c) (1)
c∈d
where
score(q, c) = tfc (t, d) ∗ idf (t) ∗ wc (2)
t∈q
and
tfc (t, d) = freq(t) (3)
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
22. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Outline
1 Introduction
2 Indexing RDF using inverted indexes
Indexing based on links
3 Ranking based retrieval for RDF objects based on structured IR
Ranking for structured documents
Dangers to combine scores from different document fields
4 A Semantic Search evaluation framework
5 The case study: Lucene Vs BM25F
Results and discussion
6 Conclusions and Future Work
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
23. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
The collection
INEX evaluation framework fits good enough to the goal of
evaluating Semantic Search systems with small changes
We have mapped Dbpedia to the Wikipedia version used in
the INEX contest
Dbpedia entries contain semantic information drawn from
Wikipedia pages
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
24. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
DBpedia entries: A sort of structured documents
http://dbpedia.org/resource/The Lord of the Rings
http://dbpedia.org/resource/Berlin
http://dbpedia.org/resource/Semantic Web
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
25. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Statistics of the collections
Currently Dbpedia contains almost three millions of entries and the
INEX Wikipedia collection contains 2,666,190 documents. As a
result, our corpus only takes into account the 2,233,718 document
or entities that result from the intersection of both collections.
Topics (Queries)
Given the corpus, INEX 2009 topics and assessments are adapted
to this intersection. The result of this operation have been 68
topics and a modified assessments file.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
26. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Results and discussion
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Outline
1 Introduction
2 Indexing RDF using inverted indexes
Indexing based on links
3 Ranking based retrieval for RDF objects based on structured IR
Ranking for structured documents
Dangers to combine scores from different document fields
4 A Semantic Search evaluation framework
5 The case study: Lucene Vs BM25F
Results and discussion
6 Conclusions and Future Work
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
27. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Results and discussion
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Semantic Search Engines
Sindice
Watson
Falcon
SEMPLORE
Everybody is using Lucene, but are they using Lucene’s ranking
function? I don’t know.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
28. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Results and discussion
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Using TITLE, DESCRIPTION and NARRATIVE from Topics
MAP P@5 P@10 GMAP R-Prec
Lucene .1560 .4147 .3368 .0957 .2100
LuceneF .1200 .3971 .2971 .0578 .1632
BM25 .1746 ..4735 .3868 .1081 .2257
BM25F .1822 .4647 .3824 .1170 .2262
Table: MAP, P@5, P@10, GMAP, R-Prec for long queries. All this
measures ranges from 0 to 1
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
29. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Results and discussion
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Sensibility test for BM25F. All this measures ranges from 0 to 1.
te = text, ti = title, in = inlinks, ob = obj, ty = type,
all = allfields
te te+ti te+in te+ob te+ty all
MAP .1756 .1867 .1760 .1749 .1750 .1822
GMAP .1084 .1190 .1098 .1080 .1080 .1170
P@5 .4529 .4559 .4500 .4500 .4559 .4746
P@10 .3882 .3941 .3897 .3853 .3853 .3824
Table: Sensibility test for BM25F. All this measures ranges from 0 to 1.
te = text, ti = title, in = inlinks, ob = obj, ty = type, all = allfields
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
30. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
Results and discussion
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
density.default(x = data4$map)
5
4
3
Density
2
1
0
0.0 0.2 0.4 0.6
N = 69 Bandwidth = 0.03185
Figure: Density of the MAP values for different ranking approaches
(BM25=blue, BM25F=red, Lucene=yellow, Lucene multifield=black)
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
31. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Outline
1 Introduction
2 Indexing RDF using inverted indexes
Indexing based on links
3 Ranking based retrieval for RDF objects based on structured IR
Ranking for structured documents
Dangers to combine scores from different document fields
4 A Semantic Search evaluation framework
5 The case study: Lucene Vs BM25F
Results and discussion
6 Conclusions and Future Work
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
32. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Conclusions
Lucene hurts the retrieval performance, while BM25F does
not when we are working on structured information, which is
very important for Semantic Search.
IR ranking functions are not able to take profit from the
semantic information contained in the fields with less text.
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
33. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
Future work
It is necessary more work on how to adapt IR ranking
functions to Semantic Search
It is not trivial to use semantic information from the Web of
data to improve search on the Web for keywords based
retrieval
It is important to identify what kind of information needs can
be solved using semantic information
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search
34. Introduction
Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
A Semantic Search evaluation framework
The case study: Lucene Vs BM25F
Conclusions and Future Work
BM25F implementation for Lucene is available
Joaqu´ P´rez-Iglesias, Jos´ R. P´rez-Ag¨era, V´
ın e e e u ıctor Fresno, Yuval
Z. Feinstein: Integrating the Probabilistic Models BM25/BM25F
into Lucene CoRR abs/0911.5046: (2009)
http://nlp.uned.es/ jperezi/Lucene-BM25/
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
Using BM25F for Semantic Search