Automated Ranking of
         Database Query Results


Sanjay Agrawal, Surajit Chaudhuri, Gautam
Das, Aristides Gionis


                                  Presented By: Upa Gupta
Contents
   Introduction
   IDF Similarity
   QF Similarity
   Breaking Ties
   Implementation
       ITA Algorithm
   Conclusion
Introduction
   Databases use a Boolean query model
       E.g., SELECT * WHERE MFR_Country =
        “Germany” AND Type = “Sports” AND
        MFR = “Volkswagen”
   Problems with the Boolean model
       Empty Answers
           A query that is too selective returns an empty result set
       Many Answers
           A query that is too general returns too many results
Introduction
   Ranking of database query results using
    IR techniques.
       Apply the TF-IDF concept to databases, based
        on the frequency of attribute
        values.
       TF-IDF must be extended to numerical
        domains
            IDF Similarity, as discussed in the paper
       Collect a WORKLOAD of past queries and use it
        for ranking.
            QF Similarity, leveraging workload information
Introduction

   The Many Answers problem is solved using
    top-K query processing

   An Index-based Threshold Algorithm (ITA) is
    developed that exploits IDF/QF Similarity.
IDF Similarity
   What is the TF-IDF technique?
       Given a set of documents and a query,
        documents are ranked based on the TF and IDF
        of the query words in each document.


   Adapting the IDF concept to a database
    containing only categorical attributes
    t = <t1, …, tm>: values of attribute A
    n: number of tuples in the database
IDF Similarity
   For every value t of attribute A:
       Frequency F(t) is defined as the number of tuples
        having A = t
       IDF is calculated as:
                      IDF(t) = \log(n / F(t))
       For a pair of values u and v in the domain of
        attribute A:
               S(u,v) = IDF(u) if u = v, otherwise 0
       For tuple T and query Q over all the attributes
        (A1…Am):
               SIM(T,Q) = \sum_{k=1}^{m} S_k(t_k, q_k)
IDF Similarity
   Example:
    CAR_ID MODEL     MFR           MFR_Country   Type
    1       SLR      Mercedes      Germany       Sports
    2       A6       Audi          Germany       Executive
    3       R8       Audi          Germany       Sports
    4       Gallardo Lamborghini   Italy         Sports



        Query Q: SELECT * WHERE MFR_Country =
         “Germany” AND Type = “Sports” AND MFR =
         “Volkswagen”
IDF Similarity
n=4
F (MFR_Country = Germany) = 3
IDF(MFR_Country = Germany) = log(n/F(MFR_Country = Germany))
                            = log(4/3) = 0.287
Similarly,
   IDF(MFR_Country = Italy) = 1.38
   IDF(MFR = Audi) = 0.69
   IDF(MFR = Lamborghini) = 1.38
   IDF(MFR = Mercedes) = 1.38
   IDF(Type = Sports) = 0.287
   IDF(Type = Executive) = 1.38

Similarity of 1st tuple with Q = SIM(T,Q)
   = S(Germany, Germany) + S(Sports, Sports) + S(Mercedes,
   Volkswagen)
   = IDF(MFR_Country = Germany) + IDF(Type = Sports) + 0
   = 0.287+0.287+0 = 0.574
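
The worked example above can be reproduced with a short Python sketch. The names `CARS`, `idf_table` and `idf_similarity` are hypothetical helpers introduced only for illustration, and the natural logarithm is assumed because it matches the slide's values. At full precision the first tuple scores about 0.575; the slide's 0.574 comes from adding the truncated values.

```python
import math
from collections import Counter

# The four-row car table from the example.
CARS = [
    {"MODEL": "SLR", "MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "A6", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Executive"},
    {"MODEL": "R8", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "Gallardo", "MFR": "Lamborghini", "MFR_Country": "Italy", "Type": "Sports"},
]


def idf_table(rows, attribute):
    """IDF(t) = log(n / F(t)) for every value t of one attribute."""
    n = len(rows)
    frequency = Counter(row[attribute] for row in rows)
    return {value: math.log(n / count) for value, count in frequency.items()}


def idf_similarity(row, query, rows):
    """SIM(T, Q): sum of IDF(q_k) over the query attributes where t_k == q_k."""
    total = 0.0
    for attribute, wanted in query.items():
        if row[attribute] == wanted:  # S(u, v) is 0 unless the values match
            total += idf_table(rows, attribute)[wanted]
    return total


QUERY = {"MFR_Country": "Germany", "Type": "Sports", "MFR": "Volkswagen"}
scores = [idf_similarity(row, QUERY, CARS) for row in CARS]
```

Tuples 1 and 3 receive the same score here, which is the tie situation discussed later in the deck.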
IDF Similarity
   Consider a numeric attribute in the DB, e.g. PRICE
   SIMPLE SOLUTION: discretize the data into
    ranges
   Consider two ranges: (0, 50) and (51, 100)
       Values 49 and 52 are considered completely dissimilar, although they are close.
   Frequency of a numeric value t of an attribute is defined
    as the sum of contributions to t from every value t_i in the
    database:
         F(t) = \sum_{i=1}^{n} e^{-\frac{1}{2} \left( \frac{t_i - t}{h} \right)^2}
         IDF(t) = \log(n / F(t))
         S(t,q) = e^{-\frac{1}{2} \left( \frac{t - q}{h} \right)^2} \, IDF(q)
    h = bandwidth parameter
    S(t,q) = density at t of a Gaussian distribution centered at q, scaled by IDF(q).
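
A minimal Python sketch of the numeric case, assuming a Gaussian kernel with bandwidth h as described on this slide. The function names, the `PRICES` sample and the bandwidth value are hypothetical and chosen only for illustration.

```python
import math


def numeric_frequency(t, values, h):
    """F(t): sum of Gaussian contributions to t from every value in the column."""
    return sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)


def numeric_idf(t, values, h):
    """IDF(t) = log(n / F(t)), with n the number of tuples."""
    return math.log(len(values) / numeric_frequency(t, values, h))


def numeric_similarity(t, q, values, h):
    """S(t, q): Gaussian kernel centered at q, scaled by IDF(q)."""
    return math.exp(-0.5 * ((t - q) / h) ** 2) * numeric_idf(q, values, h)


# Hypothetical PRICE column and bandwidth.
PRICES = [10.0, 20.0, 49.0, 52.0, 90.0]
H = 10.0

near = numeric_similarity(49.0, 52.0, PRICES, H)  # close to the query value
far = numeric_similarity(90.0, 52.0, PRICES, H)   # far from the query value
```

Unlike fixed ranges, the kernel gives 49 a high similarity to a query for 52, and the similarity decays smoothly with distance.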
IDF Similarity
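   Consider the following query with an IN clause:
   SELECT * WHERE MFR_Country IN (“Germany”, “Italy”,
    “Japan”)
   Each attribute contributes its best match over the set Q_k of values listed for it:
   SIM(T,Q) = \sum_{k=1}^{m} \max_{q \in Q_k} S_k(t_k, q)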
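
The max-over-values rule can be sketched as follows, assuming the IN list applies to the MFR_Country attribute and reusing the IDF values from the car example. `IDF`, `value_similarity` and `in_query_similarity` are hypothetical names introduced for illustration.

```python
import math

# IDF values for MFR_Country, taken from the four-row car example.
IDF = {"MFR_Country": {"Germany": math.log(4 / 3), "Italy": math.log(4 / 1)}}


def value_similarity(attribute, t, q):
    """S_k(t, q): IDF of the value on an exact match, otherwise 0."""
    return IDF[attribute][t] if t == q else 0.0


def in_query_similarity(row, query):
    """SIM(T, Q): per attribute, take the best match over the listed values, then sum."""
    return sum(
        max(value_similarity(attribute, row[attribute], q) for q in allowed)
        for attribute, allowed in query.items()
    )


QUERY = {"MFR_Country": ["Germany", "Italy", "Japan"]}
german_car = {"MFR_Country": "Germany"}
italian_car = {"MFR_Country": "Italy"}
```

With these values the Italian car scores higher than the German one, because Italy is the rarer value in the table.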
QF Similarity
   Problems with IDF:
       In a realtor database, more homes were built in
        recent years such as 2007 and 2008 than in
        1980 and 1981. Thus recent years have a small
        IDF, yet newer homes are in higher demand.

       In a bookstore DB, demand for an author is
        driven by factors other than the number of
        books he or she has written.
QF Similarity
   WORKLOAD: past queries
   Importance of attribute values is
    determined by the frequency of their
    occurrence in the workload.
   As in the example above, queries requesting
    homes built in 2010 are more frequent than
    queries requesting homes built in 1981.
QF Similarity
   For categorical data
       RQF(q) = raw frequency of occurrence of value q
        of attribute A in the query strings of the workload

       RQFMax = raw frequency of the most frequently
        occurring value in the workload

       Query frequency QF(q) = RQF(q)/RQFMax

       S(t,q) = QF(q) if q = t, otherwise 0
   QF resembles TF
QF Similarity
   Consider a workload containing the following
    values of attribute TYPE:

    {Sports, Executive, Luxury, Sports, Sports,
      Executive}

    RQF(Executive) = 2, RQFMax = RQF(Sports) = 3
    QF(Executive) = RQF(Executive)/RQFMax
                = 2/3
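
The same computation as a small Python sketch. `query_frequencies` is a hypothetical helper name; the workload list is the one from this slide.

```python
from collections import Counter


def query_frequencies(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value seen in the workload."""
    raw = Counter(workload_values)   # RQF: raw frequency of each value
    rqf_max = max(raw.values())      # RQFMax: frequency of the most requested value
    return {value: count / rqf_max for value, count in raw.items()}


TYPE_WORKLOAD = ["Sports", "Executive", "Luxury", "Sports", "Sports", "Executive"]
qf = query_frequencies(TYPE_WORKLOAD)
```

Dividing by RQFMax keeps every QF value in the range (0, 1], with the most requested value at exactly 1.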
QF Similarity
   Similarity between pairs of different categorical
    attribute values can also be derived from the
    workload, e.g. to find S(Audi, Mercedes)

   The similarity coefficient between t and q in this case
    is defined by the Jaccard coefficient scaled by the QF
    factor:
     S(t,q) = J(W(t),W(q)) * QF(q)
       W(t) = subset of queries in workload W in which
        categorical value t occurs in an IN clause
       J(W(t),W(q)) = |W(t) ∩ W(q)| / |W(t) ∪ W(q)|
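
A short sketch of the Jaccard coefficient J(W(t), W(q)) that this similarity is built on. The `IN_CLAUSES` workload below is invented purely for illustration, and `queries_containing` and `jaccard` are hypothetical helper names. The QF scaling from the formula above would then be applied to the returned coefficient.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b| (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def queries_containing(value, in_clauses):
    """W(t): ids of the workload queries whose IN clause mentions the value."""
    return {i for i, clause in enumerate(in_clauses) if value in clause}


# Hypothetical workload: the IN-clause value set of four past queries on MFR.
IN_CLAUSES = [
    {"Audi", "Mercedes"},
    {"Audi", "Mercedes", "BMW"},
    {"Audi"},
    {"Lamborghini"},
]

j_audi_mercedes = jaccard(
    queries_containing("Audi", IN_CLAUSES),
    queries_containing("Mercedes", IN_CLAUSES),
)
j_audi_lamborghini = jaccard(
    queries_containing("Audi", IN_CLAUSES),
    queries_containing("Lamborghini", IN_CLAUSES),
)
```

Values that users often ask for together in one IN clause end up with a high coefficient, while values that never co-occur get 0.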
QF-IDF



   For QF-IDF Similarity
    S(t,q) = QF(q) * IDF(q) if t = q, otherwise 0
BREAKING TIES
   If SIM(t1, q) = SIM(t2, q),
       which tuple should be ranked higher?

   QF and IDF partition the database into
    classes of tuples with the same similarity score
    CAR_ID MODEL     MFR           MFR_Country   Type
    1       SLR      Mercedes      Germany       Sports
    2       A6       Audi          Germany       Executive
    3       R8       Audi          Germany       Sports
    4       Gallardo Lamborghini   Italy         Sports

           Q: SELECT * WHERE Type = “Sports” AND
            MFR_Country = “Germany”
           Tuples 1 and 3 both match Q on every specified attribute, so they tie.
Breaking Ties with QF
   Determine weights of missing attribute values
    that reflect their “global importance” using the
    workload.
   Global Imp = \sum_{k} \log(QF(t_k)), where t_k ranges over the tuple's values for the missing attributes
   Missing attributes for Q (not specified in the query): MFR and Model
Breaking Ties with QF
   Consider a workload with the following values of MFR
    and Model
    MFR {Audi, Audi, Lamborghini, Mercedes,
    Lamborghini, Audi}
    Model {R8, A6, Gallardo, SLR, Gallardo, A6}
   For tuple 1:
        1       SLR      Mercedes      Germany       Sports
   QF(SLR) = 1/2 = 0.5
   QF(Mercedes) = 1/3 = 0.33
   Global Imp = log(0.5) + log(0.33)
   NEGATIVE VALUES of Global Imp??
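
The tie-breaking score can be sketched in Python as below, using the workload from this slide. `query_frequencies`, `QF` and `global_importance` are hypothetical names, and the natural logarithm is an assumption. One way to read the question about negative values: every QF is at most 1, so each log term is at most 0 and the totals are negative or zero; only their relative order is needed to break a tie.

```python
import math
from collections import Counter


def query_frequencies(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value seen in the workload."""
    raw = Counter(workload_values)
    rqf_max = max(raw.values())
    return {value: count / rqf_max for value, count in raw.items()}


QF = {
    "MFR": query_frequencies(
        ["Audi", "Audi", "Lamborghini", "Mercedes", "Lamborghini", "Audi"]
    ),
    "MODEL": query_frequencies(["R8", "A6", "Gallardo", "SLR", "Gallardo", "A6"]),
}


def global_importance(row, missing_attributes):
    """Sum of log(QF(t_k)) over the attributes the query did not specify.

    Values that never occur in the workload raise KeyError here; handling
    them (e.g. by smoothing) is outside this sketch.
    """
    return sum(math.log(QF[a][row[a]]) for a in missing_attributes)


# The two tuples that tie on Q (Type = Sports, MFR_Country = Germany).
tuple_1 = {"MODEL": "SLR", "MFR": "Mercedes"}
tuple_3 = {"MODEL": "R8", "MFR": "Audi"}
imp_1 = global_importance(tuple_1, ["MFR", "MODEL"])
imp_3 = global_importance(tuple_3, ["MFR", "MODEL"])
```

Under this workload tuple 3 gets the larger (less negative) score, so it would be ranked above tuple 1.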
Breaking Ties with IDF
   Option 1: tuples with large IDF (occurring infrequently)
    of missing attributes are ranked higher
       Cars that are not popular are ranked higher


   Option 2: tuples with small IDF of missing attributes
    are ranked higher
       Cars with a moonroof, which is a desirable
        feature, are ranked lower.
Implementation

   Pre-processing component



   Query–processing component
Implementation
   Pre-processing component

       Compute and store a representation of the
        similarity function (QF-IDF, QF, IDF) in
        auxiliary database tables
Implementation
   Query-processing component
       Job: retrieve the top-K results from the database

       ITA algorithm: combines Fagin’s Threshold
        Algorithm with the similarity function
            Sorted access: along any attribute Ak, TIDs of
             tuples are retrieved from an ordered list.
            Random access: the entire tuple corresponding to a
             TID is retrieved.
ITA Algorithm
   Initialize Top-K buffer to empty
   Repeat
      For each k = 1 to p
         TID = index of the next tuple retrieved from the ordered
          list Lk
         T = complete tuple retrieved for TID
         Compute the value of the ranking function for T
         If the rank of T is higher than the rank of the lowest-ranking tuple
          in the Top-K buffer, then update the Top-K buffer
         If the stopping condition has been reached, then exit
      End For
   Until the indexes of all tuples have been seen.
ITA Algorithm
Stopping Condition
   Hypothetical tuple: takes the current values a1,…,
    ap for A1,…, Ap, corresponding to the index
    seeks on L1,…, Lp, and the values qp+1,…, qm for the
    remaining columns directly from the query.
   Termination: stop when the similarity of the hypothetical
    tuple to the query is less than that of the tuple in the
    Top-K buffer with the least similarity.
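
The loop and its stopping condition can be sketched in Python as follows. This is a simplified illustration, not the paper's implementation: in-memory sorted lists stand in for index scans, every query attribute is assumed to have an ordered list (so p = m), and the similarity is the categorical IDF similarity. `CARS`, `make_idf_similarity` and `ita_top_k` are hypothetical names.

```python
import heapq
import math
from collections import Counter

CARS = {
    1: {"MFR_Country": "Germany", "Type": "Sports"},
    2: {"MFR_Country": "Germany", "Type": "Executive"},
    3: {"MFR_Country": "Germany", "Type": "Sports"},
    4: {"MFR_Country": "Italy", "Type": "Sports"},
}


def make_idf_similarity(rows):
    """Build S_k(t, q) = IDF(q) on an exact match, otherwise 0."""
    n = len(rows)
    freq = Counter((a, v) for row in rows.values() for a, v in row.items())

    def similarity(attribute, t, q):
        return math.log(n / freq[(attribute, t)]) if t == q else 0.0

    return similarity


def ita_top_k(rows, query, similarity, k):
    attrs = list(query)

    def score(tid):
        # Random access: fetch the whole tuple and compute SIM(T, Q).
        return sum(similarity(a, rows[tid][a], query[a]) for a in attrs)

    # Sorted access: per attribute, TIDs in order of decreasing similarity.
    lists = {
        a: sorted(rows, key=lambda tid, a=a: -similarity(a, rows[tid][a], query[a]))
        for a in attrs
    }
    current = {a: math.inf for a in attrs}  # per-list similarity last seen
    top_k = []                              # min-heap of (score, TID)
    seen = set()
    for position in range(len(rows)):
        for a in attrs:
            tid = lists[a][position]
            current[a] = similarity(a, rows[tid][a], query[a])
            if tid not in seen:
                seen.add(tid)
                s = score(tid)
                if len(top_k) < k:
                    heapq.heappush(top_k, (s, tid))
                elif s > top_k[0][0]:
                    heapq.heapreplace(top_k, (s, tid))
            # Stopping condition: the hypothetical tuple built from the current
            # list values scores below the weakest tuple in a full buffer.
            if len(top_k) == k and sum(current.values()) < top_k[0][0]:
                return sorted(top_k, reverse=True)
    return sorted(top_k, reverse=True)


QUERY = {"MFR_Country": "Germany", "Type": "Sports"}
RESULT = ita_top_k(CARS, QUERY, make_idf_similarity(CARS), k=2)
```

No unseen tuple can score above the hypothetical tuple, because each list is read in decreasing order of similarity; that is what makes the early exit safe.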
ITA for Numeric columns
   Consider a query with the condition Ak = qk for
    a numeric column Ak.

   Two index scans are performed on Ak.
       The first retrieves TIDs of tuples with values greater than qk in increasing order.
       The second retrieves TIDs of tuples with values less than qk in decreasing
        order.

   TIDs are then picked from the merged
    stream.
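
A sketch of the two-scan merge in Python, assuming a list of (value, TID) pairs sorted by value stands in for the index on Ak, and that the merge always picks the value closer to qk (which gives decreasing similarity under a Gaussian kernel). Tuples whose value equals qk are taken by the upward scan here; that choice, like the name `tids_by_closeness`, is an assumption of this sketch.

```python
import bisect


def tids_by_closeness(sorted_index, q):
    """Yield TIDs in order of increasing distance |value - q|."""
    split = bisect.bisect_left(sorted_index, (q,))
    up = split         # scan 1: values >= q, walked in increasing order
    down = split - 1   # scan 2: values < q, walked in decreasing order
    while up < len(sorted_index) or down >= 0:
        take_up = down < 0 or (
            up < len(sorted_index)
            and sorted_index[up][0] - q <= q - sorted_index[down][0]
        )
        if take_up:
            yield sorted_index[up][1]
            up += 1
        else:
            yield sorted_index[down][1]
            down -= 1


# Hypothetical PRICE index: (value, TID) pairs sorted by value.
PRICE_INDEX = [(10, "a"), (20, "b"), (49, "c"), (52, "d"), (90, "e")]
merged = list(tids_by_closeness(PRICE_INDEX, 50))
```

Because the generator is lazy, a top-K algorithm can stop consuming it as soon as its stopping condition holds, without scanning the whole column.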
Conclusion
   Automated ranking infrastructure for SQL
    databases.
   Extended TF-IDF based techniques from
    information retrieval to numeric and mixed
    data.
   Implemented a ranking function that
    exploits Fagin’s TA.
THANK YOU
