Keyword Search on Structured Data using
Relevance Models*

Veli Bicer




                                                                                                      INFORMATIK
FZI Research Center for Information Technology
Karlsruhe, Germany




                                                                                      FZI FORSCHUNGSZENTRUM
Joint work with Thanh Tran from Semantic Search Group, AIFB
Institute, KIT




* based on the papers @ 20th ACM Conference on Information and Knowledge
Management (CIKM’11) and @ 10th International Semantic Web Conference (ISWC’11)
                           © FZI Forschungszentrum Informatik                     1
About the presenter

Veli Bicer
        Research Scientist at FZI Research Center for Information
         Technology, Karlsruhe, Germany
        Associated Researcher at Karlsruhe Service Research Institute (KSRI)
           KSRI founded by IBM Germany
Research Interests
        Semantic Data Management/Search
        Relational Learning
        Software Engineering (for Services)
Projects
        German Internet Research Programme THESEUS
           KOIOS Semantic Search in Core Technology Cluster
           TEXO Internet-of-Services Use-case
        Previously, EU ICT Artemis, Satine, Saphire and Ride



10.04.2012                    © FZI Forschungszentrum Informatik                2
Agenda

Introduction
    Keyword search on structured data
    Relevance models
Approach
      Ranking scheme using relevance models
      Top-k Query processing
Experiments
Application
      Search on environmental data
Conclusion




      Introduction




Keyword Search on Structured Data

Rationale
   4 billion web searches daily
   Data-driven websites have a relational database backend
      Predefined search forms constrain retrieval
   SQL is difficult to learn
      Keyword search simplifies data retrieval by avoiding SQL




Keyword Search on Structured Data

Example
      Who is the character played by Audrey Hepburn in Roman Holiday?
Query result
      A tree of tuples that is reduced with respect to the query.

Which would you rather write?

  SELECT C.name
  FROM Person P, Character C, Movie M
  WHERE P.id = C.pid
  AND C.mid = M.id
  AND P.name = 'Audrey Hepburn'
  AND M.title = 'Roman Holiday';

      or “Hepburn Holiday”

Person
  id   name
  p1   Audrey Hepburn
  p3   Kate Winslet
  …    …

Character
  id   name            pid   mid
  c1   Princess Ann    p1    m1
  c3   Iris Simpkins   p3    m2
  …    …

Movie
  id   title           plot
  m1   Roman Holiday   Princess Ann is a royal princess of …
  m2   The Holiday     Iris swaps her cottage for the holiday along the next two …
  m3   The Aviator     Hughes and Hepburn go to a holiday and fly together …
  …    …               …
Keyword Search on Structured Data
Many approaches have been proposed recently
   Focus on performance
   Less consideration of ranking


Recent study (Coffman and Weaver, CIKM 2010)
     Effectiveness of previous work is below expectations
     The problem lies in the ranking strategies, not in performance

Two major types of ranking schemes:
   IR-inspired TF-IDF ranking
      (Liu et al, 2006) (SPARK, 2007)
   Proximity-based approaches
      (BANKS, 2002) (Bidirectional, 2005)


Problem:
     A robust and principled approach is missing!

Relevance Models

Proposed by Lavrenko and Croft (SIGIR 01)
Assumes that
    queries and documents are samples from a hidden representation space and
    generated from the same generative model
Initial representation of relevance is unknown
      Estimated from query

[Figure: three diagrams relating query Q, document D and relevance R, titled Classical Model, Language Model and Relevance Model]
      Approach




Overview of Approach
Steps
    1   Query
    2   PRF
    3   Query RM
    4   Res. RM
    5   Res. Score: D(RMQ||RMR)
    6   Query Generation
    7   Structured Queries
    8   Top-k Query Proc.
    9   Result Ranking

Word distributions shown for steps 1, 3 and 4

    words        Query p   Query RM p   Res. RM p
    hepburn      0.5       0.21         0.12
    holiday      0.5       0.15         0.18
    audrey                 0.13         0.11
    katharine              0.09         0.05
    princess               0.01         0.00
    roman                  0.01         0.06
    …                      …            …

Ranked results (step 9)

    Title                Name
    Roman Holiday        Audrey Hepburn
    Breakfast at Tiff.   Audrey Hepburn
    The Aviator          Katharine Hepburn
    The Holiday          Kate Winslet
Data Model

Different kinds of data
            e.g. relational, XML and RDF data
Data graph of nodes and edges, G = (V, E)
Resource nodes and attribute nodes
            Every resource is typed
            Resources have unique ids (e.g. primary keys)
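
The data model above can be made concrete with a small sketch. The class name `DataGraph`, its fields and its methods are hypothetical and chosen only for illustration; the resource ids, attribute values and edge labels are taken from the movie example used in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class DataGraph:
    """Toy data graph: typed resource nodes, attribute nodes and edges."""
    types: dict = field(default_factory=dict)       # resource id -> type name
    attributes: dict = field(default_factory=dict)  # (resource id, attribute) -> text
    edges: list = field(default_factory=list)       # (source id, label, target id)

    def add_resource(self, rid, rtype, **attrs):
        # Every resource is typed and identified by a unique id.
        self.types[rid] = rtype
        for attr_name, value in attrs.items():
            self.attributes[(rid, attr_name)] = value

    def link(self, source, label, target):
        # Edges may only connect resources that are already known.
        if source not in self.types or target not in self.types:
            raise KeyError("both ends of an edge must be known resources")
        self.edges.append((source, label, target))

    def neighbors(self, rid):
        """Resources one edge away from rid, in either direction."""
        outgoing = {t for s, _, t in self.edges if s == rid}
        incoming = {s for s, _, t in self.edges if t == rid}
        return outgoing | incoming


graph = DataGraph()
graph.add_resource("p1", "Person", name="Audrey Hepburn")
graph.add_resource("c1", "Character", name="Princess Ann")
graph.add_resource("m1", "Movie", title="Roman Holiday")
graph.link("c1", "pid_fk", "p1")  # foreign key Character -> Person
graph.link("c1", "mid_fk", "m1")  # foreign key Character -> Movie
```

Relational tuples map to resources, columns to attribute nodes and foreign keys to edges; XML or RDF data would need their own loaders, which this sketch does not cover.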




Edge-Specific Relevance Models (steps 1-3)


       A set of feedback resources FR is retrieved from an inverted keyword index:
               E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}
       Edge-specific relevance model for each unique edge e:
               [Equation not recoverable from the extracted text. Its two factors are annotated
                "Probability of word at resource" and "Importance of resource w.r.t. query".]

Inverted index
    princess     m1, c1
    breakfast    m3
    hepburn      m3, p1, p4, c2
    melbourne    p2
    iris         c3
    holiday      m1, m2, m3
    ann          m1, c2
    …            …

[Figure: the inverted index lookup yields the feedback resources FR, e.g. p1 (name: Audrey Hepburn; birthplace: Ixelles Belgium) and m3 (title: The Holiday; plot: Iris swaps her cottage for the holiday along the next two …), from which the edge-specific relevance models are built]
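
Because the formula on this slide did not survive extraction, the following is only a sketch in the spirit of Lavrenko and Croft's relevance model, not necessarily the exact estimator used in this work. It weights the word distribution of one attribute of each feedback resource by how well the resource matches the query. The function names, the additive constant `eps` and the use of plain relative frequencies are assumptions made for illustration.

```python
from collections import Counter


def word_probs(text):
    """Relative word frequencies of a text (maximum-likelihood estimate)."""
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def relevance_model(feedback, query, attribute, eps=0.01):
    """Relevance model for one attribute (edge) over the feedback resources.

    feedback: {resource id: {attribute name: text}}
    eps: hypothetical additive constant so that a missing query word
         does not zero out a resource completely.
    """
    rm = Counter()
    for attrs in feedback.values():
        if attribute not in attrs:
            continue
        resource_lm = word_probs(" ".join(attrs.values()))
        # Importance of the resource w.r.t. the query.
        weight = 1.0
        for q in query:
            weight *= resource_lm.get(q, 0.0) + eps
        # Probability of each word at this attribute of the resource.
        for word, p in word_probs(attrs[attribute]).items():
            rm[word] += p * weight
    total = sum(rm.values())
    return {w: s / total for w, s in rm.items()} if total else {}


feedback = {
    "p1": {"name": "Audrey Hepburn", "birthplace": "Ixelles Belgium"},
    "p4": {"name": "Katharine Hepburn", "birthplace": "Connecticut USA"},
}
name_rm = relevance_model(feedback, ["hepburn", "holiday"], "name")
```

In this toy input both resources match the query equally well, so `hepburn`, which occurs in both names, receives twice the probability of `audrey` or `katharine`.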
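Edge-Specific Resource Models (steps 4-5)


Each resource (a tuple) is also represented as an RM
      …as final results (joint tuples) are obtained by combining resources
Edge-specific resource model (ResM):
      [Equation not recoverable from the extracted text.]

The score of a resource is the cross-entropy of the edge-specific RM and
   the ResM:
      [Equation not recoverable from the extracted text.]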
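
The scoring formula itself is missing from the extracted text, so the sketch below assumes the textbook definition of cross-entropy, H(RM, ResM) = -Σ_v RM(v) · log ResM(v), for a single edge. The `floor` parameter is a hypothetical guard against log(0) and stands in for the smoothing described on the following slides. The example numbers are the truncated word distributions from the overview slide.

```python
import math


def cross_entropy(query_rm, resource_rm, floor=1e-6):
    """Cross-entropy of a query relevance model and a resource model.

    Lower values mean the resource model is closer to the query model.
    floor: hypothetical minimum probability used to avoid log(0).
    """
    total = 0.0
    for word, p in query_rm.items():
        q = max(resource_rm.get(word, 0.0), floor)
        total -= p * math.log(q)
    return total


# Truncated distributions from the overview slide (they do not sum to 1).
query_rm = {"hepburn": 0.21, "holiday": 0.15, "audrey": 0.13,
            "katharine": 0.09, "princess": 0.01, "roman": 0.01}
res_rm = {"hepburn": 0.12, "holiday": 0.18, "audrey": 0.11,
          "katharine": 0.05, "princess": 0.00, "roman": 0.06}

score = cross_entropy(query_rm, res_rm)
```

How the cross-entropy is turned into a ranking score (for example by negating it so that larger is better) is not shown in the extracted slides.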




Smoothing

Well-known technique to address data sparseness and improve the
  accuracy of RMs (and LMs)
The word probability at a resource (symbol missing in the extracted text)
  is the core probability for both the query RM and the resource RM
Local smoothing
      [Equation not recoverable from the extracted text.]

The neighborhood of an attribute a consists of other attributes a’ where:
    a and a’ share the same resource
    the resources of a and a’ are of the same type
    the resources of a and a’ are connected over a foreign key (FK)

[Figure: Neighborhood of a]
Smoothing
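[Figure: data graph excerpt. Person resources p1 (name: Audrey Hepburn; birthplace: Ixelles Belgium) and p4 (name: Katharine Hepburn; birthplace: Connecticut USA); Character resource c1 (name: Princess Ann), linked to p1 over pid_fk]

P^name(v | p1) in four successive columns
(column meaning inferred from the words that appear in each: 1 = unsmoothed, 2 = plus attributes of the same resource, 3 = plus resources of the same type, 4 = plus resources connected over a FK)

    words          1      2      3      4
    audrey         0.5    0.4    0.37   0.36
    hepburn        0.5    0.4    0.39   0.38
    ixelles               0.1    0.09   0.08
    belgium               0.1    0.09   0.08
    katharine                    0.02   0.01
    connecticut                  0.02   0.01
    usa                          0.02   0.01
    princess                            0.035
    ann                                 0.035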



   Smoothing of each type is controlled
     by weights:
       [Equation not recoverable from the extracted text.]

   where γ1, γ2, γ3 are control parameters
     set in the experiments
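
The weighting formula is not in the extracted text. The sketch below assumes a plain linear interpolation, in which the attribute's own word distribution keeps the weight 1 - γ1 - γ2 - γ3 and each neighborhood type contributes with its own γ. The function name `smooth` and the concrete γ value are hypothetical; this is an illustration of the idea, not the estimator from the paper.

```python
def smooth(own, neighbor_models, gammas):
    """Linearly interpolate an attribute's word distribution with its neighbors.

    own:             {word: prob} of the attribute being smoothed
    neighbor_models: one {word: prob} per neighborhood type
    gammas:          one weight per neighborhood type (assumed to sum to <= 1)
    """
    own_weight = 1.0 - sum(gammas)
    if own_weight < 0:
        raise ValueError("gammas must sum to at most 1")
    smoothed = {w: own_weight * p for w, p in own.items()}
    for gamma, model in zip(gammas, neighbor_models):
        for word, p in model.items():
            smoothed[word] = smoothed.get(word, 0.0) + gamma * p
    return smoothed


name_p1 = {"audrey": 0.5, "hepburn": 0.5}          # name of p1
birthplace_p1 = {"ixelles": 0.5, "belgium": 0.5}   # same-resource neighbor

# With a hypothetical weight of 0.2 for the same-resource neighborhood,
# mass moves from the name words to the birthplace words.
smoothed = smooth(name_p1, [birthplace_p1], [0.2])
```

With this hypothetical weight the result matches the second column of the table on this slide (0.4 for the name words, 0.1 for the birthplace words); the later columns are not reproduced exactly by this simple mixture.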



Ranking JRTs (step 9)


Ranking aggregated JRTs:
      Cross-entropy between the edge-specific RM (query model) and the geometric
       mean of the combined edge-specific ResMs:
       [Equation not recoverable from the extracted text.]

The proposed score is monotonic w.r.t. the individual resource scores
      …a desired property for most top-k algorithms
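
One way to see the monotonicity claim: if the geometric mean is taken per word and not renormalized (an assumption, since the formula is missing from the extracted text), the logarithm of the mean is the average of the logarithms, so the cross-entropy against the combined model equals the average of the cross-entropies against the individual resource models. The sketch below checks this numerically; the function names and the `floor` guard are hypothetical.

```python
import math


def cross_entropy(query_rm, model, floor=1e-6):
    """-sum_v query_rm(v) * log model(v), with a floor to avoid log(0)."""
    return -sum(p * math.log(max(model.get(w, 0.0), floor))
                for w, p in query_rm.items())


def geometric_mean(models, floor=1e-6):
    """Per-word geometric mean of several word distributions (not renormalized)."""
    vocabulary = set().union(*models)
    n = len(models)
    return {
        w: math.exp(sum(math.log(max(m.get(w, 0.0), floor)) for m in models) / n)
        for w in vocabulary
    }


def jrt_score(query_rm, resource_models):
    """Cross-entropy between the query model and the combined resource models."""
    return cross_entropy(query_rm, geometric_mean(resource_models))


query_rm = {"hepburn": 0.6, "holiday": 0.4}
person_rm = {"hepburn": 0.7, "holiday": 0.3}
movie_rm = {"hepburn": 0.2, "holiday": 0.8}

combined = jrt_score(query_rm, [person_rm, movie_rm])
average = (cross_entropy(query_rm, person_rm)
           + cross_entropy(query_rm, movie_rm)) / 2
```

Since `combined` equals `average` up to rounding, improving any single resource score can never worsen the score of the joined result, which is the property that threshold-based top-k algorithms rely on.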




Query Translation* (steps 6-7)


Mapping of keywords to data elements
       Results in a set of keyword elements
Data graph exploration
     Search for substructures (query graphs) connecting keyword elements
     Bi-directional exploration of query graphs operates on the summary of the
      data graph only
Top-k computation
       Search guided by a scoring function to output only the top-k queries
Query graphs to be processed
       Free vs. non-free variables

[Figure, top: keyword Hepburn maps to the name of p4 and p1; keyword Holiday maps to the title of m1 and m3]
[Figure, middle: summary graph with the types Person, Character, Movie, Location, Producer and Studio and the edge labels pid_fk, mid_fk, bornIn, Is-a, hasLoc, hasDist and worksFor]
[Figure, bottom: query graph with ?p (type Person, name Hepburn), ?c (type Character) and ?m (type Movie, title Holiday), joined over pid_fk and mid_fk]

*[Tran et al. ICDE’09]
Top-k Query Processing (step 8)


Top-k query processing (TQP) is very common in Web-accessible
  databases
   return the K highest-ranked answers
   avoid unnecessary accesses to the database
TQP assumes
   Scoring function and attribute values are known a priori (e.g. RankJoin)
   Attribute values are combined by an aggregation function
   Sorted access (SA) and random access (RA) probes
How to adapt TQP to return the top-k relevant results?
   Results are joined sets of resources
   Scores are query-dependent
      No indexing is possible
Idea:
   Retrieve resources for non-free variables and rank
   Use SA on those initially retrieved resources
   Use RA to find other resources
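
The idea can be illustrated with a simplified best-first search over partial candidates. This is a sketch, not the scheduling algorithm from the paper: it starts from the ranked Person resources, binds the remaining variables by lookups, and keeps every candidate in a priority queue ordered by an upper bound on its final score. The resource scores are those of the walkthrough on the following slides, the character links come from the example tables earlier in the deck, and summing the scores is an assumption that is consistent with the walkthrough numbers. The function name `top_k_joins` is hypothetical.

```python
import heapq
from itertools import count


def top_k_joins(person, character, movie, k):
    """Best-first top-k over (person, character, movie) joins.

    person, movie: {id: score}
    character:     {id: (person id, movie id, score)}
    """
    max_c = max(score for _, _, score in character.values())
    max_m = max(movie.values())
    tie = count()  # tie-breaker so the heap never compares bindings
    heap = []
    for pid, score in person.items():
        # Unbound variables contribute their best possible score.
        heapq.heappush(heap, (-(score + max_c + max_m), next(tie), (pid, None, None)))
    results = []
    while heap and len(results) < k:
        neg_bound, _, (pid, cid, mid) = heapq.heappop(heap)
        if mid is not None:
            # Complete candidate whose exact score is at least as high as
            # every remaining upper bound: safe to output.
            results.append(((pid, cid, mid), -neg_bound))
        elif cid is None:
            # Bind ?c: look up characters that join with this person.
            for c, (c_pid, _, c_score) in character.items():
                if c_pid == pid:
                    bound = person[pid] + c_score + max_m
                    heapq.heappush(heap, (-bound, next(tie), (pid, c, None)))
        else:
            # Bind ?m: follow the character's movie link.
            _, c_mid, c_score = character[cid]
            if c_mid in movie:
                exact = person[pid] + c_score + movie[c_mid]
                heapq.heappush(heap, (-exact, next(tie), (pid, cid, c_mid)))
    return results


person = {"p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12}
movie = {"m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08}
character = {"c1": ("p1", "m1", 0.10), "c3": ("p3", "m2", 0.05)}

best = top_k_joins(person, character, movie, k=1)
```

Because every partial candidate carries an upper bound, a complete candidate popped from the queue cannot be beaten by anything still waiting, so the search stops without scoring the remaining resources. For K=1 this returns the binding (p1, c1, m1), the same result as in the walkthrough.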

Top-k Query Processing
  Result candidate c = <(x1,…,xk), score>
         complete when all variables are bound to some resources
         xi = * indicates an unbound variable
  Binding operator
         c’ = (c, xi → ri)
  Threshold determines upper bound for unseen resources
         Scheduling between SA and RA
         Tight bound is desired

  Query graph: ?p (type Person, name Hepburn), ?c (type Character), ?m (type Movie, title Holiday); ?c joins ?p over pid_fk and ?m over mid_fk

  Walkthrough state (K=1)
         Threshold: 0.50
         Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>
         Output: none yet

  Person
    id   name                S(r)
    p1   Audrey Hepburn      0.20
    p3   Katharine Hepburn   0.18
    p5   Philip Hepburn      0.13
    p6   Anna Hepburn        0.12

  Character (value shown above the S(r) column: 0.11)
    id   name                S(r)
    c1   Princess Ann
    c2   Katharine Hepburn
    c3   Iris Simpkins
    c4   Louise

  Movie
    id   title               S(r)
    m2   The Holiday         0.19
    m1   Roman Holiday       0.18
    m3   Holiday Blues       0.09
    m4   Family Holiday      0.08
Top-k Query Processing
  Walkthrough, next state (definitions, query graph and Person/Movie scores as on the first walkthrough slide; K=1)
         Threshold: 0.48
         Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>, <(p3,*,*),0.48>
         Character S(r): none computed yet (value shown above the column: 0.11)
         Output: none yet
Top-k Query Processing
  Walkthrough, next state (definitions, query graph and Person/Movie scores as on the first walkthrough slide; K=1)
         Threshold: 0.47
         Priority queue: <(*,*,m2),0.50>, <(p1,c1,*),0.49>, <(p3,*,*),0.48>
         Character S(r): c1 = 0.10 (value shown above the column: 0.10)
         Output: none yet
Top-k Query Processing
  Walkthrough, next state (definitions, query graph and Person/Movie scores as on the first walkthrough slide; K=1)
         Threshold: 0.46
         Priority queue: <(p1,c1,*),0.49>, <(p3,*,*),0.48>, <(*,c3,m2),0.44>
         Character S(r): c1 = 0.10, c3 = 0.05 (value shown above the column: 0.09)
         Output: none yet
Top-k Query Processing
  Walkthrough, final state (definitions, query graph and Person/Movie scores as on the first walkthrough slide; K=1)
         Threshold: 0.46
         Priority queue: <(p3,*,*),0.48>, <(*,c3,m2),0.44>
         Character S(r): c1 = 0.10, c3 = 0.05 (value shown above the column: 0.09)
         Output: <(p1,c1,m1),0.48>
Experiments




Experiments

Datasets: Subsets of the Wikipedia, IMDB and Mondial Web
  databases
Queries: 50 queries for each dataset, including “TREC style”
  queries and “single resource” queries
Metrics: (1) the number of top-1 relevant results, (2) reciprocal
  rank and (3) Mean Average Precision (MAP)
Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK,
  CoveredDensity (TF-IDF)
RM-S: Our approach
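
For reference, the two rank-based metrics can be computed as in the following sketch, which uses their standard definitions. The function names and the toy ranking are hypothetical and are not taken from the evaluation.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant results."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked results, set of relevant results), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
rr = reciprocal_rank(ranked, relevant)    # first relevant result at rank 2
ap = average_precision(ranked, relevant)  # precision 1/2 at rank 2, 2/4 at rank 4
```

Reciprocal rank suits the "single resource" queries, where exactly one answer is correct, while MAP also rewards finding all relevant results of a "TREC style" query.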




Experiments




[Figure: MAP scores for all queries]

[Figure: Reciprocal rank for single resource queries]
Experiments




[Figure: Precision-recall for TREC-style queries on Wikipedia]
Application




Large amount of environmental data

Environmental issues stir public interest
      Increase transparency, awareness, responsibility, protection
Growing amount of data
    Public access through EU directive 2003/4/EC
    PortalU (Germany) http://www.portalu.de/
    EDP (UK) http://www.edp.nerc.ac.uk
    Envirofacts (USA) http://www.epa.gov/enviro/index.html
Linking data in international context
    Local government environmental databases as part of the LOD cloud
    Linked environment data for the life sciences




Opportunity: mass dissemination and
consumption of environmental data
The percentage of people who actively find environmental
   information is significantly lower than the percentage of those with
   frequent access to it!
Complex results
      CO emission values around Karlsruhe area in Germany
Analytics
      CO emission values around Karlsruhe area in Germany
         Sorted by year
         Bar chart
      Emission values of US and Germany
         Compare average
         Timeline visualization




KOIOS – Overview

A semantic search system
    Exploit semantics in the data for keyword interpretation to hide
     the complexity of query languages and data representation
    Keyword search for searching structured data
    Lower access barriers while enabling the richness of data to be fully
     harnessed
Contribution
    Transfer research results to commercial EIS
    Selector mechanism
Process
    Input: keywords
    Facet-based refinement
    Selector (result and view template) initialization
    Output: query results embedded in specific views



KOIOS – Architecture




Facets generation
Derive facets from query results (not from query!) for refinement
     Attributes serve as facet categories
     Attribute values as facet values
E.g. for ?s
     Statistics.description: “CO-Emission , PKW”, “CO-Emission , LKW”…
     Value.year: 2005,2006,…
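
Facet derivation as described above can be sketched in a few lines. This is an illustrative sketch rather than the KOIOS implementation: the result rows are hypothetical, and representing each result as a dictionary from attribute name to value is an assumption made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def derive_facets(results: Iterable[Dict[str, object]]) -> Dict[str, List[object]]:
    """Collect facets from query results (not from the query itself).

    Attributes become facet categories; the distinct attribute values
    observed in the results become the facet values.
    """
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    # Sort by string form so mixed value types still order deterministically.
    return {attr: sorted(values, key=str) for attr, values in facets.items()}


# Hypothetical result rows bound to ?s, mirroring the example attributes.
results = [
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, LKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2006},
]
facets = derive_facets(results)
for category, values in facets.items():
    print(category, values)
```

Because the facets come from the results, every facet value offered for refinement is guaranteed to match at least one result.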




Selectors

Selector: parameterized, predefined result and view templates
    Data parameters: specify the scope of the information need, initialized to
     particular values based on facet categories and values
    Query parameter: additional data processing for analysis tasks
     (GROUP-BY, SORT, MIN, MAX, AVERAGE etc.)
    Presentation parameter: visualization types (data value, data series,
     data table, map-based, specific diagram type, etc.)




Selector initialization

Selectors
      capture templates for information needs and presentation of their
       results
Map facets to selectors and initialize them
      Applicable selectors: cover facet categories
      Initialize selectors based on facet values
      Initialized values are captured in the WHERE clause
      Non-initialized parameters are included in the SELECT clause
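
The initialization rules above can be illustrated with a small sketch. Everything named here (the `Selector` class, its fields, the `emissions` source and its columns) is hypothetical and only mirrors the slide's description: a selector is applicable when its data parameters cover the chosen facet categories, initialized parameters end up in the WHERE clause, and the remaining ones in the SELECT clause. The generated text is SQL-like for readability; the slides do not state which query language KOIOS emits.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Selector:
    """Hypothetical selector: a parameterized result and view template."""
    name: str
    source: str                       # assumed table or view name
    data_parameters: List[str]        # scope of the information need
    sort_by: Optional[str] = None     # query parameter (e.g. SORT)
    view: str = "data table"          # presentation parameter
    bindings: Dict[str, object] = field(default_factory=dict)

    def covers(self, facet_categories: Iterable[str]) -> bool:
        """A selector is applicable if it covers the facet categories."""
        return set(facet_categories) <= set(self.data_parameters)

    def initialize(self, facet_values: Dict[str, object]) -> None:
        """Bind data parameters to the facet values chosen by the user."""
        if not self.covers(facet_values):
            raise ValueError("selector does not cover the chosen facets")
        self.bindings = dict(facet_values)

    def to_query(self) -> Tuple[str, List[object]]:
        """Initialized parameters go to WHERE, the others to SELECT."""
        free = [p for p in self.data_parameters if p not in self.bindings]
        bound = [p for p in self.data_parameters if p in self.bindings]
        sql = f"SELECT {', '.join(free) or '*'} FROM {self.source}"
        if bound:
            sql += " WHERE " + " AND ".join(f"{p} = ?" for p in bound)
        if self.sort_by:
            sql += f" ORDER BY {self.sort_by}"
        # Values are returned separately so they can be passed as parameters.
        return sql, [self.bindings[p] for p in bound]


selector = Selector(
    name="emissions_by_year",
    source="emissions",
    data_parameters=["description", "year", "value"],
    sort_by="year",
    view="bar chart",
)
selector.initialize({"description": "CO-Emission, PKW"})
sql, params = selector.to_query()
print(sql, params)
```

Returning the bound values separately from the query text keeps user-chosen facet values out of the query string.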




Deployment
Hippolytos project (Theseus)
      Easy access to a spatial data warehouse (disy Cadenza) built for
       the domain of environmental administration
Data about
      Emission and waste
      From the state of Baden-Württemberg
      Provided by: Umweltinformationssystem (UIS) Baden-Württemberg,
       Landesamt für Geoinformation und Landentwicklung (LGL)
       Baden-Württemberg and Statistisches Landesamt Baden-Württemberg


Facets and selectors




Chart-based visualization
Map-based visualization
Conclusions

Keyword search on structured data is a popular problem for
  which various solutions exist.

We focus on the aspect of result ranking, providing a principled
  approach that employs relevance models.

Experiments show that RMs are promising for searching
  structured data.

Top-k query processing is proposed to retrieve only the most
  relevant results.

Application on environmental data enables intuitive
    Access
    Visualization
    Analysis of environmental information!

Thank you for your attention!
Questions?
Opportunity: mass dissemination and
consumption of environmental data
Increase transparency, awareness, responsibility, protection




Challenges: intuitive access and visualization of
structured environmental data and analytics
The percentage of people who actively find environmental
   information is significantly lower than the percentage of those
   with frequent access to it!

  Complex structured queries
      Knowledge of the underlying data / query language
  Complex structured data
      Heterogeneity and distribution of environmental data is overwhelming
  Complex structured results
      Understanding results and extracting relevant information /
      analytics are difficult tasks


KOIOS

Semantic search system, KOIOS, for intuitive access, analysis,
  and visualization of structured environmental information

    Overview and architecture
    Structured query generation from keywords
    Facet-based browsing and refinement
    Selector initialization for final result and view construction
    Implementation and deployment
    Conclusions




Conclusions

Replace predefined forms and hard-coded visualization
Semantic search using lightweight semantics in data and
  schema to dynamically
    Translate keywords to queries
    Generate facets for results
    Initialize result and presentation templates
Enables intuitive
    Access
    Visualization
    Analysis of environmental information!




Inverted Index
                                             princess      m1, c1
                                             breakfast     m3
                                             hepburn       m3,p1,p4,c2
                                             melbourne     p2
                                             iris          c3
                                             holiday       m1,m2,m3
                                             breakfast     m3
                                             ann           m1,c2
                                             ……….         … …….
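
As a small illustration of how such an index is used, the sketch below looks up the postings of each query keyword and takes their union to obtain candidate resources. The postings are copied from the table above; treating the union of postings as the retrieved set, and the function name, are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Set

# Postings copied from the slide's table: keyword -> resource ids.
INVERTED_INDEX: Dict[str, List[str]] = {
    "princess": ["m1", "c1"],
    "breakfast": ["m3"],
    "hepburn": ["m3", "p1", "p4", "c2"],
    "melbourne": ["p2"],
    "iris": ["c3"],
    "holiday": ["m1", "m2", "m3"],
    "ann": ["m1", "c2"],
}


def lookup(query: Iterable[str], index: Dict[str, List[str]]) -> Set[str]:
    """Return every resource that contains at least one query keyword."""
    found: Set[str] = set()
    for keyword in query:
        # Keywords are lower-cased to match the index; unknown ones add nothing.
        found.update(index.get(keyword.lower(), []))
    return found


resources = lookup(["Hepburn", "Holiday"], INVERTED_INDEX)
print(sorted(resources))
```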


Ranking Schemes

Proximity between keyword nodes
          EASE:




          XRank:
             w is the smallest text window in n that contains all search keywords




2012-4-10
SIGMOD09 Tutorial                         50
Ranking Schemes

Based on graph structure
          BANKS
             Nodes:
             Edges :
          PageRank-like methods
             XRank [Guo et al, SIGMOD03]
             ObjectRank [Balmin et al, VLDB04] : considers both
              Global ObjectRank and Keyword-specific
              ObjectRank




Ranking Schemes
 TF*IDF based:
            Discover/EASE
            [Liu et al, SIGMOD06]

            $\mathrm{Score}(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N + 1}{df}$
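
A minimal sketch of this TF*IDF score, assuming the slide's formula is the pivoted-normalization form Score(n, Q) = sum over w in Q and n of (1 + ln(1 + ln(tf))) / ((1 - s) + s * dl / avdl) * ln((N + 1) / df). The function and parameter names are chosen for this example, and the default s = 0.2 is an assumption, not a value given on the slide.

```python
import math
from typing import Dict, Iterable


def tfidf_score(
    query_terms: Iterable[str],
    node_tf: Dict[str, int],   # tf: term frequencies inside node n
    doc_len: float,            # dl: length of node n
    avg_doc_len: float,        # avdl: average node length
    doc_freq: Dict[str, int],  # df: number of nodes containing each term
    num_docs: int,             # N: total number of nodes
    s: float = 0.2,            # length-normalization slope (assumed default)
) -> float:
    score = 0.0
    for w in set(query_terms):
        tf = node_tf.get(w, 0)
        df = doc_freq.get(w, 0)
        if tf == 0 or df == 0:
            continue  # only terms occurring in both Q and n contribute
        norm_tf = 1 + math.log(1 + math.log(tf))
        norm_len = (1 - s) + s * doc_len / avg_doc_len
        idf = math.log((num_docs + 1) / df)
        score += norm_tf / norm_len * idf
    return score


# Toy node of average length containing "hepburn" once; 2 of 9 nodes contain it.
example = tfidf_score(
    ["hepburn", "holiday"],
    node_tf={"hepburn": 1},
    doc_len=10,
    avg_doc_len=10,
    doc_freq={"hepburn": 2, "holiday": 3},
    num_docs=9,
)
print(example)
```

With tf = 1 and dl = avdl, both normalization factors equal 1, so the toy score reduces to the IDF term alone.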




            SPARK
               but not at the node level




Relevance Models


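[Figure: the relevance model M generates the query words q1 = israeli, q2 = palestinian, q3 = raids; which word w = ??? is sampled next?]

Sample probabilities P(w|Q):
    .077  palestinian
    .055  israel
    .034  jerusalem
    .033  protest
    .027  raid
    .011  clash
    .010  bank
    .010  west
    .010  troop
    …

$P(w \mid q_1 \ldots q_k) = \frac{P(w)}{P(q_1 \ldots q_k)} \prod_{q} \sum_{M} P(q \mid M)\, P(M \mid w)$

where $\sum_{M} P(q \mid M)\, P(M \mid w) = P(q \mid w)$ and $\prod_{q} P(q \mid w) = P(q_1 \ldots q_k \mid w)$.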
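
The estimate on this slide, P(w | q1...qk) = P(w) / P(q1...qk) * prod over q of sum over M of P(q | M) P(M | w), can be sketched on a toy collection. This is an illustrative sketch, not the presenter's implementation: it assumes a uniform prior P(M), linearly smoothed document models with an arbitrary weight `lam`, and P(M | w) obtained from P(w | M) by Bayes' rule. The three toy documents are made up.

```python
from collections import Counter
from typing import Dict, List, Sequence


def relevance_model(
    query: Sequence[str], docs: List[List[str]], lam: float = 0.6
) -> Dict[str, float]:
    """Estimate P(w | q1..qk) over the vocabulary of non-empty documents."""
    models = [Counter(doc) for doc in docs]
    lengths = [sum(m.values()) for m in models]
    collection = Counter()
    for m in models:
        collection.update(m)
    coll_len = sum(lengths)
    p_m = 1.0 / len(docs)  # uniform prior over document models (assumption)

    def p_word(w: str, i: int) -> float:
        # P(w | M_i), smoothed with collection statistics (assumed scheme).
        return lam * models[i][w] / lengths[i] + (1 - lam) * collection[w] / coll_len

    scores = {}
    for w in collection:
        p_w = sum(p_word(w, i) * p_m for i in range(len(docs)))
        joint = p_w
        for q in query:
            # P(q | w) = sum_M P(q | M) P(M | w), with P(M | w) via Bayes' rule.
            joint *= sum(
                p_word(q, i) * p_word(w, i) * p_m / p_w for i in range(len(docs))
            )
        scores[w] = joint

    total = sum(scores.values())  # plays the role of P(q1..qk)
    if total == 0:
        raise ValueError("no query term occurs in the collection")
    return {w: score / total for w, score in scores.items()}


docs = [
    ["israeli", "palestinian", "raids", "jerusalem"],
    ["palestinian", "protest", "jerusalem"],
    ["stock", "market", "bank"],
]
rm = relevance_model(["israeli", "palestinian", "raids"], docs)
for word, prob in sorted(rm.items(), key=lambda item: -item[1])[:5]:
    print(f"{prob:.3f} {word}")
```

Words that co-occur with the query terms, such as "jerusalem" in the toy data, receive more probability mass than words from unrelated documents, which is the effect the sample probabilities on the slide illustrate.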

Keyword Search on Structured Data using Relevance Models

  • 1. Keyword Search on Structured Data using Relevance Models* Veli Bicer INFORMATIK FZI Research Center for Information Technology Karlsruhe, Germany FZI FORSCHUNGSZENTRUM Joint work with Thanh Tran from Semantic Search Group, AIFB Institute, KIT * based on the papers @ 20th ACM Conference on Information and Knowledge Management (CIKM’11) and @ 10th International Semantic Web Conference (ISWC’11) © FZI Forschungszentrum Informatik 1
  • 2. About the presenter Veli Bicer  Research Scientist at FZI Research Center for Information Technology, Karlsruhe, Germany  Associated Researcher at Karlsruhe Service Research Institute (KSRI)  KSRI founded by IBM Germany Research Interests  Semantic Data Management/Search  Relational Learning  Software Engineering (for Services) Projects  German Internet Research Programme THESEUS  KOIOS Semantic Search in Core Technology Cluster  TEXO Internet-of-Services Use-case  Previously, EU ICT Artemis, Satine, Saphire and Ride 10.04.2012 © FZI Forschungszentrum Informatik 2
  • 3. Agenda Introduction  Keyword search on structured data  Relevance models Approach  Ranking scheme using relevance models  Top-k Query processing Experiments Application  Search on environmental data Conclusion © FZI Forschungszentrum Informatik 3
  • 4. INFORMATIK FZI FORSCHUNGSZENTRUM Introduction 10.04.2012 © FZI Forschungszentrum Informatik 4
  • 5. Keyword Search on Structured Data Rationale  4 billion web searches daily  Data-driven websites have relational database backend  Predefined search forms constrain retrieval  SQL difficult to learn  simplify data retrieval by not using SQL © FZI Forschungszentrum Informatik 5
  • 6. Keyword Search on Structured Data Example  Who is the character played by Audrey Hepburn in Roman Holiday? Query result Person Character  A tree of tuples that is reduced id name id name pid mid with respect to the query. p1 Audrey Hepburn c1 Princess p1 m1 Ann Which would you rather write? p3 Kate Winslet c3 Iris p3 m2 … ……… Simpkins SELECT C.name … …….. FROM Person, Character, Movie WHERE Person.id = Character.pId Movie AND Character.mid = Movie.id id title plot AND Person.name = ‘Audrey Hepburn' m1 Roman Holiday Princess Ann is a royal princess AND Movie.title = ‘Roman Holiday' ; of unknow of an … m2 The Holiday Iris swaps her cottage for the  or “Hepburn Holiday” holiday along the next two … m3 The Aviator Hughes and Hepburn go to a holiday and fly together .. … …… ….. © FZI Forschungszentrum Informatik 6
  • 7. Keyword Search on Structured Data Many approaches are proposed recently  Performance focus  Less consideration of ranking Recent study (Coffman and Weaver, CIKM 2010)  effectiveness of previous works are below expectations  problem about ranking strategies, not performance Two major types of ranking schemes:  IR-inspired TF-IDF ranking  (Liu et al, 2006) (SPARK, 2007)  Proximity based approaches  (Banks, 2002) (Bidirectional, 2005) Problem:  Missing a robust and principled approach!! © FZI Forschungszentrum Informatik 7
  • 8. Relevance Models Proposed by Lavrenko and Croft (SIGIR 01) Q D Assumes that Classical Model  queries and documents are samples from a hidden representation space and  generated from the same generative model Initial representation of relevance is R unknown  Estimated from query Q D Language Model R Q D © FZI Forschungszentrum Informatik 8 Relevance Model
  • 9. INFORMATIK FZI FORSCHUNGSZENTRUM Approach 10.04.2012 © FZI Forschungszentrum Informatik 9
  • 10. Overview of Approach 1 Query 2 PRF 3 Query RM 4 Res. RM words p words p words p hepburn 0.5 hepburn 0.21 5 Res. Score hepburn 0.12 holiday 0.5 holiday 0.15 holiday 0.18 audrey 0.13 audrey 0.11 katharine 0.09 D(RMQ||RMR) katharine 0.05 princess 0.01 princess 0.00 roman 0.01 roman 0.06 …. … …. … Title Name Roman Holiday Audrey Hepburn Breakfast at Tiff. Audrey Hepburn The Aviator Katharine Hepbun The Holiday Kate Winslet 6 Query Generation 7 Structured Queries 8 Top-k Query Proc. 9 Result Ranking © FZI Forschungszentrum Informatik 10
  • 11. Data Model Different kinds of data  e.g. relational, XML and RDF data Data Graph of nodes and edges (G=(V,E)) Resource nodes, attribute nodes  Every resource is typed  Resources have unique ids, (e.g. primary keys) 10.04.2012 © FZI Forschungszentrum Informatik 11
  • 12. Edge-Specific Relevance Models 1 2 3 A set of feedback resources FR are retrieved from an inverted keyword index:  E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4,m2, c2,m3} Edge-specific relevance model for each unique edge e: Probability of word at resource Importance of resource w.r.t. query Inverted Index FR Edge-specific Relevance Models princess  m1, c1 breakfast  m3 p1 name birthplace hepburn hepburn  m3,p1,p4,c2 Audrey Hepburn Ixelles Belgium melbourne  p2 iris  c3 m3 title The Holiday holiday holiday  m1,m2,m3 plot breakfast  m3 Iris swaps her cottage for the ann  m1,c2 holiday along the next two ….. ………. … ……. © FZI Forschungszentrum Informatik 12
  • 13. Edge Specific Resource Models 4 5 Each resource (a tuple) is also represented as a RM  …as final results (joint tuples) are obtained by combining resources Edge-specific resource model: The score of resource: cross-entropy of edge-specific RM and ResM: © FZI Forschungszentrum Informatik 13
  • 14. Smoothing Well-known technique to address data sparseness and improve accuracy of RMs (and LMs)  is the core probability for both query and resource RM Local smoothing Neighborhood of attribute a is another attribute a’:  a and a’ shares the same resources  resources of a and a’ are of the same type  resources of a and a’ are connected over a FK Neighborhood of a © FZI Forschungszentrum Informatik 14
  • 15. Smoothing words P name (v | p1 ) r a Person Character audrey 0.5 0.4 0.37 0.36 type type type hepburn 0.5 0.4 0.39 0.38 pid_fk p1 c1 ixelles 0.1 0.09 0.08 p4 birthplace belgium name name 0.1 0.09 0.08 name Audrey Hepburn Ixelles Belgium Princess Ann katharine 0.02 0.01 Katharine Hepburn birthplace connecticut 0.02 0.01 Connecticut USA usa 0.02 0.01 princess 0.035 ann 0.035 Smoothing of each type is controlled by weights: where γ1 ,γ2 ,γ3 are control parameters set in experiments 10.04.2012 © FZI Forschungszentrum Informatik 15
  • 16. Ranking JRTs (step 9) Ranking aggregated JRTs: cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs. The proposed score is monotonic w.r.t. the individual resource scores, a desired property for most top-k algorithms.
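The monotonicity claim can be checked numerically. The sketch below assumes that the geometric mean is taken word by word over resource models sharing one vocabulary and is not renormalized; under that assumption the cross-entropy against the geometric mean equals the average of the individual cross-entropies, so improving any single resource score improves the aggregate. All distributions and function names are hypothetical.

```python
import math


def cross_entropy(rm, resm):
    """H(RM, ResM) = -sum_w P(w|RM) * log P(w|ResM)."""
    return -sum(p * math.log(resm[w]) for w, p in rm.items() if p > 0)


def geometric_mean_model(models):
    """Word-wise geometric mean of resource models over a shared vocabulary."""
    vocabulary = set(models[0])
    return {
        w: math.exp(sum(math.log(m[w]) for m in models) / len(models))
        for w in vocabulary
    }


def jrt_score(rm, models):
    """Cross-entropy between the query RM and the combined resource models."""
    return cross_entropy(rm, geometric_mean_model(models))


rm = {"hepburn": 0.5, "holiday": 0.5}
person = {"hepburn": 0.8, "holiday": 0.2}
movie = {"hepburn": 0.3, "holiday": 0.7}
better_movie = {"hepburn": 0.4, "holiday": 0.6}  # lower cross-entropy than `movie`

combined = jrt_score(rm, [person, movie])
average = (cross_entropy(rm, person) + cross_entropy(rm, movie)) / 2
improved = jrt_score(rm, [person, better_movie])
```

`combined` and `average` agree up to rounding, and `improved` is lower than `combined` because only the movie's individual score changed for the better.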
  • 17. Query Translation* (steps 6 to 7) Mapping of keywords to data elements: results in a set of keyword elements. Data graph exploration: search for substructures (query graphs) connecting the keyword elements; bi-directional exploration of query graphs operates on a summary of the data graph only. Top-k computation: search guided by a scoring function to output only the top-k queries. Query graphs to be processed: free vs. non-free variables. [Figure: keywords Hepburn and Holiday mapped to name elements (p4, p1) and title elements (m1, m3); summary graph with the nodes Person, Character, Movie, Location, Producer and Studio and the edge labels pid_fk, mid_fk, bornIn, is-a, hasDist, hasLoc and worksFor; resulting query graph with ?p (Person, name: Hepburn), ?c (Character) and ?m (Movie, title: Holiday), joined via pid_fk and mid_fk.] *[Tran et al. ICDE’09]
  • 18. Top-k Query Processing (step 8) Top-k query processing (TQP) is highly common in Web-accessible databases: return the K highest-ranked answers; avoid unnecessary accesses to the database. TQP assumes: scoring function and attribute values are known a priori (e.g. RankJoin); attribute values are combined by an aggregation function; sorted access (SA) and random access (RA) probes. How to adapt TQP to return the top-k relevant results? Results are joined sets of resources; scores are query-dependent; no indexing is possible. Idea: retrieve resources for non-free variables and rank them; use SA on those initially retrieved resources; use RA to find other resources.
  • 19. Top-k Query Processing Result candidate c = <(x1,…,xk), score>: complete when all variables are bound to some resources; xi = * indicates an unbound variable. Binding operator: c' = (c, xi → ri). The threshold determines an upper bound for unseen resources: it drives the scheduling between SA and RA, and a tight bound is desired. Example (K = 1) for the query graph ?p (Person, name: Hepburn), ?c (Character), ?m (Movie, title: Holiday), joined via pid_fk and mid_fk. Person (id, name, S(r)): p1 Audrey Hepburn 0.20; p3 Katharine Hepburn 0.18; p5 Philip Hepburn 0.13; p6 Anna Hepburn 0.12. Character (id, name; not yet scored, bound 0.11): c1 Princess Ann; c2 Katharine Hepburn; c3 Iris Simpkins; c4 Louise. Movie (id, title, S(r)): m2 The Holiday 0.19; m1 Roman Holiday 0.18; m3 Holiday Blues 0.09; m4 Family Holiday 0.08. State: threshold 0.50; priority queue <(p1,*,*),0.50>, <(*,*,m2),0.50>.
  • 20. Top-k Query Processing (continued; same definitions and tables as slide 19) State: threshold 0.48; Character bound 0.11; priority queue <(p1,*,*),0.50>, <(*,*,m2),0.50>, <(p3,*,*),0.48>.
  • 21. Top-k Query Processing (continued; same definitions and tables as slide 19) Character c1 Princess Ann is now scored with S(r) = 0.10. State: threshold 0.47; Character bound 0.10; priority queue <(*,*,m2),0.50>, <(p1,c1,*),0.49>, <(p3,*,*),0.48>.
  • 22. Top-k Query Processing (continued; same definitions and tables as slide 19) Character c3 Iris Simpkins is now scored with S(r) = 0.05 (c1 Princess Ann: 0.10). State: threshold 0.46; Character bound 0.09; priority queue <(p1,c1,*),0.49>, <(p3,*,*),0.48>, <(*,c3,m2),0.44>.
  • 23. Top-k Query Processing (continued; same definitions and tables as slide 19) State: threshold 0.46; Character bound 0.09; priority queue <(p3,*,*),0.48>, <(*,c3,m2),0.44>. Output (K = 1): <(p1,c1,m1),0.48>.
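The walkthrough on slides 19 to 23 can be approximated by a best-first search over partial result candidates. The sketch below is a simplification and not the authors' algorithm: it assumes all resource scores are known up front (the slides score Character resources lazily and tighten the bound as they go), it binds variables in a fixed order instead of scheduling SA and RA, and it adds scores as the slide's numbers suggest. The scores of c2 and c4 and the joins other than (p1, c1, m1) are hypothetical.

```python
import heapq
from itertools import count


def top_k(scores, joins, k):
    """Return the k best complete candidates as (bindings, score) pairs.

    scores: {variable: {resource: score}}, variables in binding order
    joins:  list of complete bindings (one resource per variable) that
            actually join in the data
    """
    variables = list(scores)
    best = {v: max(scores[v].values()) for v in variables}

    def upper_bound(candidate):
        # Bound variables contribute their score, unbound ones the best possible score.
        return sum(
            scores[v][r] if r is not None else best[v]
            for v, r in zip(variables, candidate)
        )

    tie_breaker = count()  # avoids comparing candidates when bounds are equal
    empty = (None,) * len(variables)
    queue = [(-upper_bound(empty), next(tie_breaker), empty)]
    results = []
    while queue and len(results) < k:
        negative_bound, _, candidate = heapq.heappop(queue)
        if None not in candidate:
            # A complete candidate on top of the queue cannot be beaten any more.
            results.append((candidate, -negative_bound))
            continue
        i = candidate.index(None)
        # Binding: only resources that still join with the already-bound ones.
        options = {
            join[i]
            for join in joins
            if all(c is None or c == j for c, j in zip(candidate, join))
        }
        for resource in options:
            extended = candidate[:i] + (resource,) + candidate[i + 1:]
            heapq.heappush(queue, (-upper_bound(extended), next(tie_breaker), extended))
    return results


SCORES = {
    "p": {"p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12},
    "c": {"c1": 0.10, "c2": 0.04, "c3": 0.05, "c4": 0.03},  # c2, c4: hypothetical
    "m": {"m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08},
}
JOINS = [("p1", "c1", "m1"), ("p3", "c2", "m3"), ("p5", "c3", "m2")]  # last two hypothetical

best_result = top_k(SCORES, JOINS, k=1)
```

For K = 1 the search returns the candidate (p1, c1, m1) with a score of about 0.48, the same output as on slide 23, after expanding only the candidates on the path to that result. The design choice that makes early termination safe is that the bound never underestimates the score of any completion of a partial candidate.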
  • 24. Experiments
  • 25. Experiments Datasets: subsets of Wikipedia, IMDB and Mondial Web databases. Queries: 50 queries for each dataset, including “TREC-style” queries and “single-resource” queries. Metrics: (1) the number of top-1 relevant results, (2) reciprocal rank and (3) mean average precision (MAP). Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF). RM-S: our approach.
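For readers unfamiliar with the metrics, the standard definitions of reciprocal rank and average precision can be sketched as below. The function names and the sample relevance judgments are illustrative; the slides do not give the evaluation code.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0 if there is none.

    relevance: list of booleans (or 0/1) in ranked order.
    """
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def average_precision(relevance, total_relevant=None):
    """Mean of the precision values at the ranks of the relevant results.

    total_relevant defaults to the number of relevant results retrieved.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(runs):
    """MAP: average of the per-query average precision values."""
    return sum(average_precision(run) for run in runs) / len(runs)


rr = reciprocal_rank([False, True, False])
ap = average_precision([True, False, True])
map_score = mean_average_precision([[True, False, True], [False, True]])
```

A ranking whose first relevant result sits at position 2 has a reciprocal rank of 0.5; a ranking with relevant results at positions 1 and 3 has an average precision of (1 + 2/3) / 2.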
  • 26. Experiments [Figures: MAP scores for all queries; reciprocal rank for single-resource queries.]
  • 27. Experiments [Figure: precision-recall for TREC-style queries on Wikipedia.]
  • 28. Application
  • 29. Large amount of environmental data Environmental issues stir public interest: increase transparency, awareness, responsibility, protection. Growing amount of data: public access through EU Directive 2003/4/EC; PortalU (Germany) http://www.portalu.de/; EDP (UK) http://www.edp.nerc.ac.uk; Envirofacts (USA) http://www.epa.gov/enviro/index.html. Linking data in an international context: local government databases of the environmental part of the LOD cloud; linked environment data for the life sciences.
  • 30. Opportunity: mass dissemination and consumption of environmental data The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it! Complex results: CO emission values around the Karlsruhe area in Germany. Analytics: CO emission values around the Karlsruhe area in Germany, sorted by year, as a bar chart; emission values of the US and Germany, compared by average, as a timeline visualization.
  • 31. KOIOS – Overview A semantic search system: exploits semantics in the data for keyword interpretation, to hide the complexity of query languages and data representation; keyword search for searching structured data; lowers access barriers while enabling the richness of the data to be fully harnessed. Contribution: transfer of research results to commercial EIS; selector mechanism. Process: input: keywords; facet-based refinement; selector (result and view template) initialization; output: query results embedded in specific views.
  • 32. KOIOS – Architecture
  • 33. Facets generation Derive facets from query results (not from the query!) for refinement: attributes serve as facet categories, attribute values as facet values. E.g. for ?s: Statistics.description: “CO-Emission, PKW”, “CO-Emission, LKW”, …; Value.year: 2005, 2006, …
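Deriving facets from result rows can be sketched as a simple grouping of attribute values. The row layout and the function name are assumptions for illustration; the attribute names and sample values follow the slide's example.

```python
from collections import defaultdict


def derive_facets(results):
    """Collect, per attribute (facet category), the distinct values in the results."""
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    return {attribute: sorted(values) for attribute, values in facets.items()}


# Hypothetical query results for the variable ?s.
rows = [
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, LKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission, PKW", "Value.year": 2006},
]
facets = derive_facets(rows)
```

Because the facets come from the results and not from the query, every facet value offered for refinement is guaranteed to match at least one result.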
  • 34. Selectors Selector: parameterized, predefined result and view templates. Data parameters: specify the scope of the information need, initialized to particular values based on facet categories and values. Query parameters: additional data processing for analysis tasks (GROUP-BY, SORT, MIN, MAX, AVERAGE, etc.). Presentation parameters: visualization types (data value, data series, data table, map-based, specific diagram type, etc.).
  • 35. Selector initialization Selectors capture templates for information needs and the presentation of their results. Map facets to selectors and initialize them: applicable selectors cover the facet categories; selectors are initialized based on facet values; initialized values are captured in the WHERE clause; non-initialized parameters are included in the SELECT clause.
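The mapping from facets to an initialized selector can be sketched as follows. This is an illustration only, not the KOIOS implementation: the function names, the parameter names, the table name `emissions` and the use of SQL with `?` placeholders are all assumptions. It shows the rule from the slide that initialized parameters go into the WHERE clause and the remaining ones into the SELECT clause.

```python
def is_applicable(parameters, facet_values):
    """A selector is applicable if its parameters cover all chosen facet categories."""
    return all(category in parameters for category in facet_values)


def initialize_selector(parameters, facet_values, table="emissions"):
    """Build a parameterized query from a selector and the chosen facet values.

    parameters:   data parameters of the selector (hypothetical column names)
    facet_values: {facet category: chosen value}
    table:        hypothetical table name
    """
    bound = {p: facet_values[p] for p in parameters if p in facet_values}
    free = [p for p in parameters if p not in facet_values]
    select = ", ".join(free) if free else "*"
    where = " AND ".join(f"{p} = ?" for p in bound)
    sql = f"SELECT {select} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql, list(bound.values())


selector_parameters = ["description", "year", "value"]
chosen = {"description": "CO-Emission, PKW"}
sql, arguments = initialize_selector(selector_parameters, chosen)
```

With the facet value above, the sketch yields `SELECT year, value FROM emissions WHERE description = ?` together with the argument list `["CO-Emission, PKW"]`; query and presentation parameters such as GROUP-BY or the chart type would be applied on top of this.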
  • 36. Deployment Hippolytos project (Theseus): easy access to a spatial data warehouse (disy Cadenza) built for the domain of environmental administration. Data about emission and waste from Baden-Württemberg, provided by Umweltinformationssystem (UIS) Baden-Württemberg, Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg and Statistisches Landesamt Baden-Württemberg.
  • 37. Facets and selectors
  • 40. Conclusions Keyword search on structured data is a popular problem for which various solutions exist. We focus on the aspect of result ranking, providing a principled approach that employs relevance models. Experiments show that RMs are promising for searching structured data. Top-k query processing is proposed to retrieve only the most relevant results. The application to environmental data enables intuitive access, visualization and analysis of environmental information!
  • 41. Thank you for your attention! Questions?
  • 42. Opportunity: mass dissemination and consumption of environmental data Increase transparency, awareness, responsibility, protection.
  • 43. Challenges: intuitive access and visualization of structured environmental data and analytics The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it! Complex structured queries: knowledge of the underlying data / query language is required. Complex structured data: the heterogeneity and distribution of environmental data is overwhelming. Complex structured results: understanding results and extracting relevant information / analytics are difficult tasks.
  • 44. KOIOS Semantic search system, KOIOS, for intuitive access, analysis, and visualization of structured environmental information. Outline: overview and architecture; structured query generation from keywords; facet-based browsing and refinement; selector initialization for final result and view construction; implementation and deployment; conclusions.
  • 45. Conclusions Replace predefined forms and hard-coded visualization. Semantic search using lightweight semantics in data and schema to dynamically translate keywords to queries, generate facets for results, and initialize result and presentation templates. Enables intuitive access, visualization and analysis of environmental information!
  • 46. Inverted Index princess → m1, c1; breakfast → m3; hepburn → m3, p1, p4, c2; melbourne → p2; iris → c3; holiday → m1, m2, m3; ann → m1, c2; …
  • 47. Ranking Schemes Proximity between keyword nodes: EASE; XRank, where w is the smallest text window in n that contains all search keywords. 2012-4-10 SIGMOD09 Tutorial 50
  • 48. Ranking Schemes Based on graph structure: BANKS (scores for nodes and edges); PageRank-like methods: XRank [Guo et al., SIGMOD03] and ObjectRank [Balmin et al., VLDB04], which considers both Global ObjectRank and Keyword-specific ObjectRank.
  • 49. Ranking Schemes TF*IDF based: $Score(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N + 1}{df}$. Discover/EASE; [Liu et al., SIGMOD06]; SPARK, but not at the node level.
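The TF*IDF score on this slide has the shape of the pivoted-normalization weighting: a dampened term frequency, divided by a length normalization, times an inverse document frequency. The sketch below assumes that reading, with tf the term frequency in node n, dl the node length, avdl the average length, N the number of nodes and df the number of nodes containing the word. The function name, the toy statistics and the default s = 0.2 are assumptions for illustration.

```python
import math


def tfidf_score(query, node_terms, doc_freq, n_nodes, avdl, s=0.2):
    """Pivoted-normalization TF*IDF score of one node for a keyword query.

    query:      list of query keywords
    node_terms: list of the terms contained in the node
    doc_freq:   {word: number of nodes containing the word}
    n_nodes:    total number of nodes (N)
    avdl:       average node length
    s:          slope parameter of the length normalization (assumed value)
    """
    dl = len(node_terms)
    score = 0.0
    for word in set(query):
        tf = node_terms.count(word)
        if tf == 0:
            continue  # only words in both the query and the node contribute
        dampened_tf = 1 + math.log(1 + math.log(tf))
        length_norm = (1 - s) + s * dl / avdl
        idf = math.log((n_nodes + 1) / doc_freq[word])
        score += dampened_tf / length_norm * idf
    return score


score = tfidf_score(
    query=["hepburn", "holiday"],
    node_terms=["audrey", "hepburn"],
    doc_freq={"hepburn": 1, "holiday": 3},  # hypothetical statistics
    n_nodes=9,
    avdl=2,
)
```

In this toy setting the node has average length and contains "hepburn" once, so both the dampened tf and the length normalization equal 1 and the score reduces to the idf term ln(10).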
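  • 50. Relevance Models The query words q1 = israeli, q2 = palestinian, q3 = raids are treated as samples from the relevance model M; the question is the probability of a further word w (w = ???). Sample probabilities P(w|Q): palestinian .077; israel .055; jerusalem .034; protest .033; raid .027; clash .011; bank .010; west .010; troop .010; … Estimation: $P(w \mid q_1 \ldots q_k) = \frac{P(w)\, P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}$, with $P(q_1 \ldots q_k \mid w) = \prod_{q} \sum_{M} P(q \mid M)\, P(M \mid w)$.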

Editor's notes

  1. Top-K queries are a long-studied topic in the database and information retrieval communities. The main objective of these queries is to return the K highest-ranked answers quickly and efficiently. A Top-K query returns the subset of most relevant answers, instead of ALL answers, for two reasons: (i) to minimize the cost metric that is associated with the retrieval of all answers (e.g., disk, network, etc.); (ii) to maximize the quality of the answer set, such that the user is not overwhelmed with irrelevant results.