SlideShare a Scribd company logo
1 of 34
Download to read offline
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




                          Using BM25F for Semantic Search

           Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin
                           Perez-Iglesias, Victor Fresno

                                 Metadata Research Center, UNC, UCM, UNED


                                                    April 26 - 2010




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work


 Outline

       1    Introduction
       2    Indexing RDF using inverted indexes
              Indexing based on links
       3    Ranking based retrieval for RDF objects based on structured IR
              Ranking for structured documents
              Dangers to combine scores from different document fields
       4    A Semantic Search evaluation framework
       5    The case study: Lucene Vs BM25F
              Results and discussion
       6    Conclusions and Future Work

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Keyword-based Semantic Search
       Keyword-based Semantic Web search engine development has
       become a major research area garnering much attention in the
       Semantic Web community over the last seven years.

       Just for the sake of curiosity
       It is possible to improve quality results in terms of relevance
       applying just classical IR approaches to RDF semantic structure?




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Two main problems
          Indexing RDF triples using inverted indexes
               Ranking based retrieval for RDF objects




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Two main problems
          Indexing RDF triples using inverted indexes
               Ranking based retrieval for RDF objects




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work


 Outline

       1    Introduction
       2    Indexing RDF using inverted indexes
              Indexing based on links
       3    Ranking based retrieval for RDF objects based on structured IR
              Ranking for structured documents
              Dangers to combine scores from different document fields
       4    A Semantic Search evaluation framework
       5    The case study: Lucene Vs BM25F
              Results and discussion
       6    Conclusions and Future Work

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Problem
       How to store RDF triples in inverted indexes OR how to represent
       subjects, predicates, and objects information in a n × m matrix.

       Solutions
            SIREN (based on XML indexing techniques)
               SEMPLORE model (based on the idea of artificial documents
               with fields)




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Problem
       How to store RDF triples in inverted indexes OR how to represent
       subjects, predicates, and objects information in a n × m matrix.

       Solutions
            SIREN (based on XML indexing techniques)
               SEMPLORE model (based on the idea of artificial documents
               with fields)




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Problem
       How to store RDF triples in inverted indexes OR how to represent
       subjects, predicates, and objects information in a n × m matrix.

       Solutions
            SIREN (based on XML indexing techniques)
               SEMPLORE model (based on the idea of artificial documents
               with fields)




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       We follow SEMPLORE model with some changes.
       Index Structure based on SEMPLORE model
                          FIELD           CONTENT
                          text            plain text
                          title           keywords from the URI
                          obj             objects
                          inlinks         incoming link defined by a predicate
                          type            rdf:type
           Table: Fields used to represent RDF structure in the inverted index.




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work



       Two dbpedia entries
               The Godfather http://dbpedia.org/page/The Godfather
               Francis Ford Coppola
               http://dbpedia.org/page/Francis Ford Coppola

       Triple
       The Goodfather (Subject) - dbpprop:director (Predicate)- Francis
       Ford Coppola (Object)

       Using inlink text to index the landing URL
       The word director can be used as keyword to index the entry
       described by this URI
       http://dbpedia.org/page/Francis Ford Coppola.

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work



       Two dbpedia entries
               The Godfather http://dbpedia.org/page/The Godfather
               Francis Ford Coppola
               http://dbpedia.org/page/Francis Ford Coppola

       Triple
       The Goodfather (Subject) - dbpprop:director (Predicate)- Francis
       Ford Coppola (Object)

       Using inlink text to index the landing URL
       The word director can be used as keyword to index the entry
       described by this URI
       http://dbpedia.org/page/Francis Ford Coppola.

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Indexing based on links
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work



       Two dbpedia entries
               The Godfather http://dbpedia.org/page/The Godfather
               Francis Ford Coppola
               http://dbpedia.org/page/Francis Ford Coppola

       Triple
       The Goodfather (Subject) - dbpprop:director (Predicate)- Francis
       Ford Coppola (Object)

       Using inlink text to index the landing URL
       The word director can be used as keyword to index the entry
       described by this URI
       http://dbpedia.org/page/Francis Ford Coppola.

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR   Ranking for structured documents
                     A Semantic Search evaluation framework      Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work


 Outline

       1    Introduction
       2    Indexing RDF using inverted indexes
              Indexing based on links
       3    Ranking based retrieval for RDF objects based on structured IR
              Ranking for structured documents
              Dangers to combine scores from different document fields
       4    A Semantic Search evaluation framework
       5    The case study: Lucene Vs BM25F
              Results and discussion
       6    Conclusions and Future Work

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR   Ranking for structured documents
                     A Semantic Search evaluation framework      Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Classical IR
       For long time, search engines have been dealing with flat
       documents, that is, without structure.

       Consequence
       The main consequence of this approach is the fact that terms
       within a document are considered to have the same relevance (or
       value), disregarding their role in the document.

       Simplification
       This assumption implies a relevance model simplification based on
       bag of words, and, therefore, useful information is lost.


Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR   Ranking for structured documents
                     A Semantic Search evaluation framework      Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Classical IR
       For long time, search engines have been dealing with flat
       documents, that is, without structure.

       Consequence
       The main consequence of this approach is the fact that terms
       within a document are considered to have the same relevance (or
       value), disregarding their role in the document.

       Simplification
       This assumption implies a relevance model simplification based on
       bag of words, and, therefore, useful information is lost.


Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR   Ranking for structured documents
                     A Semantic Search evaluation framework      Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Classical IR
       For long time, search engines have been dealing with flat
       documents, that is, without structure.

       Consequence
       The main consequence of this approach is the fact that terms
       within a document are considered to have the same relevance (or
       value), disregarding their role in the document.

       Simplification
       This assumption implies a relevance model simplification based on
       bag of words, and, therefore, useful information is lost.


Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR   Ranking for structured documents
                     A Semantic Search evaluation framework      Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Structured IR
            Structured IR uses the document’s structure to identify where
            the most representative terms of the document are (e.g. title,
            abstract,HTML or XML tags, etc)
               Boost factors are used to modify the impact of every term in
               the ranking function in order to take into account the
               document’s structure.




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR   Ranking for structured documents
                     A Semantic Search evaluation framework      Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Ranking functions
       State of the art models have been adapted to this situation.
               BM25F
               LM for structured documents
       ... but this adaptation have some tricks.




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR   Ranking for structured documents
                     A Semantic Search evaluation framework      Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work


       The Problem
       The linear combination of weigths for each field of the document is
                                                             √
       not enough if a saturation function, like log (tf ) or tf is used in
       the TF function




                                   Figure: Source: Robertson et al.2004
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR       Ranking for structured documents
                     A Semantic Search evaluation framework          Dangers to combine scores from different document fields
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Lucene’s ranking function
       The method used by Lucene to compute the score of an structured
       document is based on the linear combination of the scores for each
       field of the document.


                                        score(q, d) =                   score(q, c)                                (1)
                                                                  c∈d

       where
                               score(q, c) =                     tfc (t, d) ∗ idf (t) ∗ wc                         (2)
                                                       t∈q

       and
                                               tfc (t, d) =             freq(t)                                    (3)


Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work


 Outline

       1    Introduction
       2    Indexing RDF using inverted indexes
              Indexing based on links
       3    Ranking based retrieval for RDF objects based on structured IR
              Ranking for structured documents
              Dangers to combine scores from different document fields
       4    A Semantic Search evaluation framework
       5    The case study: Lucene Vs BM25F
              Results and discussion
       6    Conclusions and Future Work

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       The collection
           INEX evaluation framework fits good enough to the goal of
           evaluating Semantic Search systems with small changes
               We have mapped Dbpedia to the Wikipedia version used in
               the INEX contest
               Dbpedia entries contain semantic information drawn from
               Wikipedia pages




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       DBpedia entries: A sort of structured documents
               http://dbpedia.org/resource/The Lord of the Rings
               http://dbpedia.org/resource/Berlin
               http://dbpedia.org/resource/Semantic Web




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Statistics of the collections
       Currently Dbpedia contains almost three millions of entries and the
       INEX Wikipedia collection contains 2,666,190 documents. As a
       result, our corpus only takes into account the 2,233,718 document
       or entities that result from the intersection of both collections.

       Topics (Queries)
       Given the corpus, INEX 2009 topics and assessments are adapted
       to this intersection. The result of this operation have been 68
       topics and a modified assessments file.




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Results and discussion
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work


 Outline

       1    Introduction
       2    Indexing RDF using inverted indexes
              Indexing based on links
       3    Ranking based retrieval for RDF objects based on structured IR
              Ranking for structured documents
              Dangers to combine scores from different document fields
       4    A Semantic Search evaluation framework
       5    The case study: Lucene Vs BM25F
              Results and discussion
       6    Conclusions and Future Work

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Results and discussion
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Semantic Search Engines
          Sindice
               Watson
               Falcon
               SEMPLORE
       Everybody is using Lucene, but are they using Lucene’s ranking
       function? I don’t know.




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Results and discussion
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Using TITLE, DESCRIPTION and NARRATIVE from Topics
                                     MAP            P@5          P@10          GMAP          R-Prec
                   Lucene            .1560          .4147        .3368         .0957         .2100
                   LuceneF           .1200          .3971        .2971         .0578         .1632
                   BM25              .1746          ..4735       .3868         .1081         .2257
                   BM25F             .1822          .4647        .3824         .1170         .2262
       Table: MAP, P@5, P@10, GMAP, R-Prec for long queries. All this
       measures ranges from 0 to 1




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                 Results and discussion
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Sensibility test for BM25F. All this measures ranges from 0 to 1.
       te = text, ti = title, in = inlinks, ob = obj, ty = type,
       all = allfields
                                te           te+ti          te+in      te+ob          te+ty         all
               MAP              .1756        .1867          .1760      .1749          .1750         .1822
               GMAP             .1084        .1190          .1098      .1080          .1080         .1170
               P@5              .4529        .4559          .4500      .4500          .4559         .4746
               P@10             .3882        .3941          .3897      .3853          .3853         .3824
       Table: Sensibility test for BM25F. All this measures ranges from 0 to 1.
       te = text, ti = title, in = inlinks, ob = obj, ty = type, all = allfields




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                                                                      Results and discussion
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work



                                                     density.default(x = data4$map)




                                          5
                                          4
                                          3
                                Density

                                          2
                                          1
                                          0




                                              0.0         0.2            0.4          0.6

                                                        N = 69 Bandwidth = 0.03185




       Figure: Density of the MAP values for different ranking approaches
       (BM25=blue, BM25F=red, Lucene=yellow, Lucene multifield=black)
Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work


 Outline

       1    Introduction
       2    Indexing RDF using inverted indexes
              Indexing based on links
       3    Ranking based retrieval for RDF objects based on structured IR
              Ranking for structured documents
              Dangers to combine scores from different document fields
       4    A Semantic Search evaluation framework
       5    The case study: Lucene Vs BM25F
              Results and discussion
       6    Conclusions and Future Work

Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Conclusions
           Lucene hurts the retrieval performance, while BM25F does
           not when we are working on structured information, which is
           very important for Semantic Search.
               IR ranking functions are not able to take profit from the
               semantic information contained in the fields with less text.




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       Future work
           It is necessary more work on how to adapt IR ranking
           functions to Semantic Search
               It is not trivial to use semantic information from the Web of
               data to improve search on the Web for keywords based
               retrieval
               It is important to identify what kind of information needs can
               be solved using semantic information




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search
Introduction
                          Indexing RDF using inverted indexes
Ranking based retrieval for RDF objects based on structured IR
                     A Semantic Search evaluation framework
                            The case study: Lucene Vs BM25F
                                 Conclusions and Future Work




       BM25F implementation for Lucene is available
       Joaqu´ P´rez-Iglesias, Jos´ R. P´rez-Ag¨era, V´
             ın e                  e    e        u     ıctor Fresno, Yuval
       Z. Feinstein: Integrating the Probabilistic Models BM25/BM25F
       into Lucene CoRR abs/0911.5046: (2009)
       http://nlp.uned.es/ jperezi/Lucene-BM25/




Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno
                                                                  Using BM25F for Semantic Search

More Related Content

Recently uploaded

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 

Featured

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 

Featured (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

Using BM25F for Semantic Search

  • 1. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Using BM25F for Semantic Search Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Metadata Research Center, UNC, UCM, UNED April 26 - 2010 Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 2. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Outline 1 Introduction 2 Indexing RDF using inverted indexes Indexing based on links 3 Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents Dangers to combine scores from different document fields 4 A Semantic Search evaluation framework 5 The case study: Lucene Vs BM25F Results and discussion 6 Conclusions and Future Work Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 3. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Keyword-based Semantic Search Keyword-based Semantic Web search engine development has become a major research area garnering much attention in the Semantic Web community over the last seven years. Just for the sake of curiosity It is possible to improve quality results in terms of relevance applying just classical IR approaches to RDF semantic structure? Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 4. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Two main problems Indexing RDF triples using inverted indexes Ranking based retrieval for RDF objects Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 5. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Two main problems Indexing RDF triples using inverted indexes Ranking based retrieval for RDF objects Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 6. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Outline 1 Introduction 2 Indexing RDF using inverted indexes Indexing based on links 3 Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents Dangers to combine scores from different document fields 4 A Semantic Search evaluation framework 5 The case study: Lucene Vs BM25F Results and discussion 6 Conclusions and Future Work Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 7. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Problem How to store RDF triples in inverted indexes OR how to represent subjects, predicates, and objects information in a n × m matrix. Solutions SIREN (based on XML indexing techniques) SEMPLORE model (based on the idea of artificial documents with fields) Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 8. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Problem How to store RDF triples in inverted indexes OR how to represent subjects, predicates, and objects information in a n × m matrix. Solutions SIREN (based on XML indexing techniques) SEMPLORE model (based on the idea of artificial documents with fields) Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 9. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Problem How to store RDF triples in inverted indexes OR how to represent subjects, predicates, and objects information in a n × m matrix. Solutions SIREN (based on XML indexing techniques) SEMPLORE model (based on the idea of artificial documents with fields) Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 10. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work We follow SEMPLORE model with some changes. Index Structure based on SEMPLORE model FIELD CONTENT text plain text title keywords from the URI obj objects inlinks incoming link defined by a predicate type rdf:type Table: Fields used to represent RDF structure in the inverted index. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 11. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Two dbpedia entries The Godfather http://dbpedia.org/page/The Godfather Francis Ford Coppola http://dbpedia.org/page/Francis Ford Coppola Triple The Goodfather (Subject) - dbpprop:director (Predicate)- Francis Ford Coppola (Object) Using inlink text to index the landing URL The word director can be used as keyword to index the entry described by this URI http://dbpedia.org/page/Francis Ford Coppola. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 12. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Two dbpedia entries The Godfather http://dbpedia.org/page/The Godfather Francis Ford Coppola http://dbpedia.org/page/Francis Ford Coppola Triple The Goodfather (Subject) - dbpprop:director (Predicate)- Francis Ford Coppola (Object) Using inlink text to index the landing URL The word director can be used as keyword to index the entry described by this URI http://dbpedia.org/page/Francis Ford Coppola. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 13. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Indexing based on links A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Two dbpedia entries The Godfather http://dbpedia.org/page/The Godfather Francis Ford Coppola http://dbpedia.org/page/Francis Ford Coppola Triple The Goodfather (Subject) - dbpprop:director (Predicate)- Francis Ford Coppola (Object) Using inlink text to index the landing URL The word director can be used as keyword to index the entry described by this URI http://dbpedia.org/page/Francis Ford Coppola. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 14. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work Outline 1 Introduction 2 Indexing RDF using inverted indexes Indexing based on links 3 Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents Dangers to combine scores from different document fields 4 A Semantic Search evaluation framework 5 The case study: Lucene Vs BM25F Results and discussion 6 Conclusions and Future Work Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 15. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work Classical IR For long time, search engines have been dealing with flat documents, that is, without structure. Consequence The main consequence of this approach is the fact that terms within a document are considered to have the same relevance (or value), disregarding their role in the document. Simplification This assumption implies a relevance model simplification based on bag of words, and, therefore, useful information is lost. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 16. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work Classical IR For long time, search engines have been dealing with flat documents, that is, without structure. Consequence The main consequence of this approach is the fact that terms within a document are considered to have the same relevance (or value), disregarding their role in the document. Simplification This assumption implies a relevance model simplification based on bag of words, and, therefore, useful information is lost. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 17. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work Classical IR For long time, search engines have been dealing with flat documents, that is, without structure. Consequence The main consequence of this approach is the fact that terms within a document are considered to have the same relevance (or value), disregarding their role in the document. Simplification This assumption implies a relevance model simplification based on bag of words, and, therefore, useful information is lost. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 18. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work Structured IR Structured IR uses the document’s structure to identify where the most representative terms of the document are (e.g. title, abstract,HTML or XML tags, etc) Boost factors are used to modify the impact of every term in the ranking function in order to take into account the document’s structure. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 19. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work Ranking functions State of the art models have been adapted to this situation. BM25F LM for structured documents ... but this adaptation have some tricks. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 20. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work The Problem The linear combination of weigths for each field of the document is √ not enough if a saturation function, like log (tf ) or tf is used in the TF function Figure: Source: Robertson et al.2004 Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 21. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents A Semantic Search evaluation framework Dangers to combine scores from different document fields The case study: Lucene Vs BM25F Conclusions and Future Work Lucene’s ranking function The method used by Lucene to compute the score of an structured document is based on the linear combination of the scores for each field of the document. score(q, d) = score(q, c) (1) c∈d where score(q, c) = tfc (t, d) ∗ idf (t) ∗ wc (2) t∈q and tfc (t, d) = freq(t) (3) Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 22. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Outline 1 Introduction 2 Indexing RDF using inverted indexes Indexing based on links 3 Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents Dangers to combine scores from different document fields 4 A Semantic Search evaluation framework 5 The case study: Lucene Vs BM25F Results and discussion 6 Conclusions and Future Work Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 23. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work The collection INEX evaluation framework fits good enough to the goal of evaluating Semantic Search systems with small changes We have mapped Dbpedia to the Wikipedia version used in the INEX contest Dbpedia entries contain semantic information drawn from Wikipedia pages Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 24. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work DBpedia entries: A sort of structured documents http://dbpedia.org/resource/The Lord of the Rings http://dbpedia.org/resource/Berlin http://dbpedia.org/resource/Semantic Web Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 25. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Statistics of the collections Currently Dbpedia contains almost three millions of entries and the INEX Wikipedia collection contains 2,666,190 documents. As a result, our corpus only takes into account the 2,233,718 document or entities that result from the intersection of both collections. Topics (Queries) Given the corpus, INEX 2009 topics and assessments are adapted to this intersection. The result of this operation have been 68 topics and a modified assessments file. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 26. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Results and discussion A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Outline 1 Introduction 2 Indexing RDF using inverted indexes Indexing based on links 3 Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents Dangers to combine scores from different document fields 4 A Semantic Search evaluation framework 5 The case study: Lucene Vs BM25F Results and discussion 6 Conclusions and Future Work Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 27. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Results and discussion A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Semantic Search Engines Sindice Watson Falcon SEMPLORE Everybody is using Lucene, but are they using Lucene’s ranking function? I don’t know. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 28. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Results and discussion A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Using TITLE, DESCRIPTION and NARRATIVE from Topics MAP P@5 P@10 GMAP R-Prec Lucene .1560 .4147 .3368 .0957 .2100 LuceneF .1200 .3971 .2971 .0578 .1632 BM25 .1746 ..4735 .3868 .1081 .2257 BM25F .1822 .4647 .3824 .1170 .2262 Table: MAP, P@5, P@10, GMAP, R-Prec for long queries. All this measures ranges from 0 to 1 Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 29. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Results and discussion A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Sensibility test for BM25F. All this measures ranges from 0 to 1. te = text, ti = title, in = inlinks, ob = obj, ty = type, all = allfields te te+ti te+in te+ob te+ty all MAP .1756 .1867 .1760 .1749 .1750 .1822 GMAP .1084 .1190 .1098 .1080 .1080 .1170 P@5 .4529 .4559 .4500 .4500 .4559 .4746 P@10 .3882 .3941 .3897 .3853 .3853 .3824 Table: Sensibility test for BM25F. All this measures ranges from 0 to 1. te = text, ti = title, in = inlinks, ob = obj, ty = type, all = allfields Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 30. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR Results and discussion A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work density.default(x = data4$map) 5 4 3 Density 2 1 0 0.0 0.2 0.4 0.6 N = 69 Bandwidth = 0.03185 Figure: Density of the MAP values for different ranking approaches (BM25=blue, BM25F=red, Lucene=yellow, Lucene multifield=black) Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 31. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Outline 1 Introduction 2 Indexing RDF using inverted indexes Indexing based on links 3 Ranking based retrieval for RDF objects based on structured IR Ranking for structured documents Dangers to combine scores from different document fields 4 A Semantic Search evaluation framework 5 The case study: Lucene Vs BM25F Results and discussion 6 Conclusions and Future Work Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 32. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Conclusions Lucene hurts the retrieval performance, while BM25F does not when we are working on structured information, which is very important for Semantic Search. IR ranking functions are not able to take profit from the semantic information contained in the fields with less text. Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 33. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work Future work It is necessary more work on how to adapt IR ranking functions to Semantic Search It is not trivial to use semantic information from the Web of data to improve search on the Web for keywords based retrieval It is important to identify what kind of information needs can be solved using semantic information Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search
  • 34. Introduction Indexing RDF using inverted indexes Ranking based retrieval for RDF objects based on structured IR A Semantic Search evaluation framework The case study: Lucene Vs BM25F Conclusions and Future Work BM25F implementation for Lucene is available Joaqu´ P´rez-Iglesias, Jos´ R. P´rez-Ag¨era, V´ ın e e e u ıctor Fresno, Yuval Z. Feinstein: Integrating the Probabilistic Models BM25/BM25F into Lucene CoRR abs/0911.5046: (2009) http://nlp.uned.es/ jperezi/Lucene-BM25/ Jose R. Perez-Aguera, Javier Arroyo, Jane Greenberg, Joaquin Perez-Iglesias, Victor Fresno Using BM25F for Semantic Search