new Slideshow!

Automated Ranking of
Database Query Results

Sanjay Agrawal, Surajit Chaudhuri, Gautam
Das, Aristides Gionis

Presented By: Upa Gupta

Contents
 Introduction
 IDF Similarity
 QF Similarity
 Breaking Ties
 Implementation
 ITA Algorithm
 Conclusion

Introduction
 Databases use a Boolean query model
 E.g. SELECT * WHERE MFR_Country =
“Germany” AND Type = “Sports” AND
MFR = “Volkswagen”
 Problems with the Boolean model
 Empty Answers
 An overly selective query returns an empty result set
 Many Answers
 An overly general query returns too many results

Introduction
 Ranking of Database Query Results using
IR techniques.
 Applying the TF-IDF concept to databases, based
on the frequency of attribute
values.
 Need to extend TF-IDF to numerical
domains
 IDF Similarity is discussed in the paper
 Collecting a WORKLOAD and using it for
ranking.
 QF Similarity leverages workload information

Introduction

 Many Answers Problem is solved using
Top-K Query Processing

 Index-based Threshold Algorithm (ITA)
developed exploiting IDF/QF Similarity.

IDF Similarity
 What is TF-IDF Technique?
 Given a set of documents and a query,
documents are ranked based on TF and IDF
of the words of the document.

 Adapting IDF concept to Database
containing only categorical Attributes
t = <t1, …, tm>: values of attribute A
n: number of tuples in the database

IDF Similarity
 For all the values of t:
 Frequency F(t) is defined as no. of tuples
having Attribute A = t
 IDF is calculated as:
IDF(t) = log(n/F(t))
 For pair of values u and v in Attribute A
domain
S(u,v) = IDF (u) if u=v otherwise 0
 For tuple T and query Q, summing over all the attributes
(A1…Am):
SIM(T,Q) = \sum_{k=1}^{m} S_k(t_k, q_k)

IDF Similarity
 Example:
CAR_ID  MODEL     MFR          MFR_Country  Type
1       SLR       Mercedes     Germany      Sports
2       A6        Audi         Germany      Executive
3       R8        Audi         Germany      Sports
4       Gallardo  Lamborghini  Italy        Sports

Query Q: SELECT * WHERE MFR_Country =
“Germany” AND Type = “Sports” AND MFR =
“Volkswagen”

IDF Similarity
n=4
F (MFR_Country = Germany) = 3
IDF(MFR_Country = Germany) = log(n/F(MFR_Country = Germany))
= log(4/3) = 0.287
Similarly,
IDF(MFR_Country = Italy) = 1.38
IDF(MFR = Audi) = 0.69
IDF(MFR = Lamborghini) = 1.38
IDF(MFR = Mercedes) = 1.38
IDF(Type = Sports) = 0.287
IDF(Type = Executive) = 1.38

Similarity of 1st tuple with Q = SIM(T,Q)
= S(Germany, Germany) + S(Sports, Sports) + S(Mercedes,
Volkswagen)
= IDF(MFR_Country = Germany) + IDF(Type = Sports) + 0
= 0.287+0.287+0 = 0.574
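
The worked example above can be reproduced with a short Python sketch. It assumes natural logarithms (which match the slide's values) and uses hypothetical helper names idf_table and idf_sim; it is an illustration, not the paper's implementation. With unrounded logarithms the first tuple scores about 0.575; the slide's 0.574 comes from adding values truncated to 0.287.

```python
import math
from collections import Counter

# Example table from the slides (CAR_ID omitted).
CARS = [
    {"MODEL": "SLR", "MFR": "Mercedes", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "A6", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Executive"},
    {"MODEL": "R8", "MFR": "Audi", "MFR_Country": "Germany", "Type": "Sports"},
    {"MODEL": "Gallardo", "MFR": "Lamborghini", "MFR_Country": "Italy", "Type": "Sports"},
]


def idf_table(rows, attr):
    """IDF(t) = ln(n / F(t)) for every value t of one attribute."""
    n = len(rows)
    freq = Counter(row[attr] for row in rows)
    return {value: math.log(n / count) for value, count in freq.items()}


def idf_sim(row, query, rows):
    """SIM(T, Q): sum of IDF(q) over the query attributes the tuple matches."""
    total = 0.0
    for attr, q in query.items():
        if row[attr] == q:  # S(u, v) = IDF(u) if u = v, otherwise 0
            total += idf_table(rows, attr)[q]
    return total


QUERY = {"MFR_Country": "Germany", "Type": "Sports", "MFR": "Volkswagen"}
scores = [idf_sim(row, QUERY, CARS) for row in CARS]
```

Tuples 1 and 3 receive the same score here, which is the tie situation the deck returns to later.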

IDF Similarity
 Consider a numeric attribute in the DB, e.g. PRICE
 SIMPLE SOLUTION: Discretize the data into
ranges
 Consider two ranges: (0, 50) and (51, 100)
 Values 49 and 52 are considered completely dissimilar, although they are close.
 Frequency of a numeric value t of an attribute is defined
as the sum of contributions to t from every ti in the database:
F(t) = \sum_{i=1}^{n} e^{-\frac{1}{2}\left(\frac{t_i - t}{h}\right)^2}
IDF(t) = log(n/F(t))
h = bandwidth parameter
S(t,q) = e^{-\frac{1}{2}\left(\frac{t - q}{h}\right)^2} IDF(q)
i.e. IDF(q) scaled by the density at t of a Gaussian distribution centered at q.
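
A minimal sketch of the numeric case, assuming the frequency of t is a sum of Gaussian kernel contributions exp(-((ti - t)/h)^2 / 2) from every database value ti, and that S(t,q) multiplies the same kernel between t and q by IDF(q). The price list and the bandwidth h = 5 are made-up illustration values, and the function names are hypothetical.

```python
import math


def numeric_frequency(t, values, h):
    """F(t): sum of Gaussian contributions to t from every value in the column."""
    return sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)


def numeric_idf(t, values, h):
    """IDF(t) = ln(n / F(t))."""
    return math.log(len(values) / numeric_frequency(t, values, h))


def numeric_sim(t, q, values, h):
    """S(t, q): Gaussian closeness of t to q, scaled by IDF(q)."""
    return math.exp(-0.5 * ((t - q) / h) ** 2) * numeric_idf(q, values, h)


# Hypothetical PRICE column and bandwidth (not from the slides).
PRICES = [10, 49, 52, 95]
H = 5.0

exact = numeric_sim(52, 52, PRICES, H)  # tuple value equals the query value
near = numeric_sim(49, 52, PRICES, H)   # close value keeps most of the score
far = numeric_sim(10, 52, PRICES, H)    # distant value scores almost 0
```

Unlike the discretized ranges, 49 and 52 now come out as similar, and similarity decays smoothly with distance.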

IDF Similarity
 Consider the following query:
 SELECT * WHERE MFR_Country IN (“Germany”, “Italy”,
“Japan”)

 With Qk the set of values allowed for attribute Ak:
SIM(T,Q) = \sum_{k=1}^{m} \max_{q \in Q_k} S_k(t_k, q)
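
For IN queries, a small sketch of taking the best match per attribute and summing over attributes. The IDF values are those of the car example (natural log); the query on MFR_Country is an assumption, since the listed values are countries, and the names s and sim_in_query are hypothetical.

```python
import math

# IDF values for MFR_Country in the four-car example (n = 4).
IDF = {"MFR_Country": {"Germany": math.log(4 / 3), "Italy": math.log(4)}}


def s(attr, t, q):
    """S_k(t, q) = IDF(q) if t == q, otherwise 0 (0 for values absent from the DB)."""
    return IDF[attr].get(q, 0.0) if t == q else 0.0


def sim_in_query(row, query_sets):
    """SIM(T, Q): for each attribute take the max over q in Q_k, then sum."""
    return sum(
        max(s(attr, row[attr], q) for q in values)
        for attr, values in query_sets.items()
    )


QUERY_SETS = {"MFR_Country": {"Germany", "Italy", "Japan"}}
italian_car = sim_in_query({"MFR_Country": "Italy"}, QUERY_SETS)
german_car = sim_in_query({"MFR_Country": "Germany"}, QUERY_SETS)
```

The Italian car scores higher than the German ones because Italy is the rarer value in the example table.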

QF Similarity
 Problems with IDF:
 In a realtor database, more homes were built in
recent years such as 2007 and 2008 than in
1980 and 1981. Thus recent years have a small
IDF, yet newer homes are in higher demand.

 In a bookstore DB, demand for an author is
due to factors other than the number of books
the author has written

QF Similarity
 WORKLOAD: Past Queries
 Importance of attribute values is
determined by frequency of their
occurrence in workload.
 As in the realtor example above, queries
requesting homes built in 2010 are more frequent
than queries for the year 1981

QF Similarity
 For categorical data
 RQF(q) = raw frequency of occurrence of value q
of attribute A in query strings of workload

 RQFMax = raw frequency of most frequently
occurring value in workload

 Query frequency QF(q) = RQF(q)/RQFMax

 S(t, q) = QF(q) if q = t, otherwise 0
 QF resembles TF

QF Similarity
 Consider Workload containing following
values of Attribute TYPE:

{Sports, Executive, Luxury, Sports, Sports,
Executive}

QF(Executive) = RQF(Executive)/RQFMax
= 2/3
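
The same computation as a Python sketch, using the TYPE workload above. The function name query_frequencies is hypothetical.

```python
from collections import Counter


def query_frequencies(workload_values):
    """QF(q) = RQF(q) / RQFMax for every value seen in the workload."""
    rqf = Counter(workload_values)   # raw query frequency of each value
    rqf_max = max(rqf.values())      # frequency of the most requested value
    return {value: count / rqf_max for value, count in rqf.items()}


TYPE_WORKLOAD = ["Sports", "Executive", "Luxury", "Sports", "Sports", "Executive"]
qf = query_frequencies(TYPE_WORKLOAD)
```

Sports is the most requested value, so QF(Sports) = 1 and every other value is expressed relative to it.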

QF Similarity
 Similarity between pairs of different categorical
attribute values can also be derived from the
workload, e.g. to find S(Audi, Mercedes)

 The similarity coefficient between t and q in this case
is defined as the Jaccard coefficient scaled by the QF
factor:
S(t,q) = J(W(t),W(q)) * QF(q)
 W(t) = subset of queries in workload W in which
categorical value t occurs in an IN clause
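
A sketch of the workload-derived similarity, assuming the QF scaling is a multiplication by QF(q) and J is the usual Jaccard coefficient |W(t) ∩ W(q)| / |W(t) ∪ W(q)|. The IN clauses and the QF value below are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def workload_sim(t, q, in_clauses, qf):
    """S(t, q) = J(W(t), W(q)) * QF(q), with W(x) = queries whose IN clause has x."""
    w_t = {i for i, values in enumerate(in_clauses) if t in values}
    w_q = {i for i, values in enumerate(in_clauses) if q in values}
    return jaccard(w_t, w_q) * qf[q]


# Hypothetical workload: the IN clause on MFR of four past queries.
IN_CLAUSES = [
    {"Audi", "Mercedes"},
    {"Audi", "Mercedes", "Lamborghini"},
    {"Audi"},
    {"Lamborghini"},
]
QF = {"Mercedes": 0.5}  # assumed query frequency of Mercedes

audi_vs_mercedes = workload_sim("Audi", "Mercedes", IN_CLAUSES, QF)
```

Audi appears in three queries and Mercedes in two of those, so J = 2/3 and the similarity is 2/3 * 0.5.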

QF-IDF

 For QF-IDF Similarity
S(t,q)=QF(q) *IDF(q) when t=q otherwise
0
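
Combining the two factors is a one-line product. The sketch below reuses values already computed in this deck (QF from the TYPE workload, IDF from the car table, natural log); the function name is hypothetical.

```python
import math


def qf_idf_sim(t, q, qf, idf):
    """S(t, q) = QF(q) * IDF(q) when t == q, otherwise 0."""
    return qf[q] * idf[q] if t == q else 0.0


QF_TYPE = {"Sports": 1.0, "Executive": 2 / 3}
IDF_TYPE = {"Sports": math.log(4 / 3), "Executive": math.log(4)}

executive = qf_idf_sim("Executive", "Executive", QF_TYPE, IDF_TYPE)
sports = qf_idf_sim("Sports", "Sports", QF_TYPE, IDF_TYPE)
mismatch = qf_idf_sim("Sports", "Executive", QF_TYPE, IDF_TYPE)
```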

BREAKING TIES
 If SIM(T1, Q) = SIM(T2, Q),
which tuple should be ranked higher?

 QF and IDF partition the database into
classes of tuples with the same similarity score
CAR_ID  MODEL     MFR          MFR_Country  Type
1       SLR       Mercedes     Germany      Sports
2       A6        Audi         Germany      Executive
3       R8        Audi         Germany      Sports
4       Gallardo  Lamborghini  Italy        Sports

 Q: SELECT * WHERE Type = “Sports” AND
MFR_Country = “Germany”
 Tuples 1 and 3 both satisfy Q and tie

Breaking Ties with QF
 Determine weights of missing attribute values
that reflect their “global importance” using
workload.
 Global Imp = \sum_{k} \log(QF(t_k)), summed over the missing attributes (tk = value of a missing attribute)

 Missing Attributes for Q: MFR and Model

Breaking Ties with QF
 Consider a workload with the following values of MFR
and MODEL
MFR {Audi, Audi, Lamborghini, Mercedes,
Lamborghini, Audi}
MODEL {R8, A6, Gallardo, SLR, Gallardo, A6}
 For tuple 1 (SLR, Mercedes, Germany, Sports):
 QF(SLR) = 1/2 = 0.5
 QF(Mercedes) = 1/3 = 0.33

 Global Imp = log(0.5) + log(0.33).
 Negative values of Global Imp? (QF ≤ 1, so every log(QF) term is ≤ 0)
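
A sketch of the tie-break for tuples 1 and 3, using the MFR and MODEL workload above and natural logarithms (an assumption; the slide does not state the base). It assumes every missing-attribute value occurs in the workload, since log(0) is undefined. Function names are hypothetical.

```python
import math
from collections import Counter


def query_frequencies(workload_values):
    """QF(q) = RQF(q) / RQFMax."""
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values())
    return {value: count / rqf_max for value, count in rqf.items()}


QF_BY_ATTR = {
    "MFR": query_frequencies(
        ["Audi", "Audi", "Lamborghini", "Mercedes", "Lamborghini", "Audi"]),
    "MODEL": query_frequencies(
        ["R8", "A6", "Gallardo", "SLR", "Gallardo", "A6"]),
}


def global_importance(row, missing_attrs):
    """Sum of log(QF(t_k)) over the attributes the query does not mention."""
    return sum(math.log(QF_BY_ATTR[attr][row[attr]]) for attr in missing_attrs)


MISSING = ["MFR", "MODEL"]
tuple1 = global_importance({"MFR": "Mercedes", "MODEL": "SLR"}, MISSING)
tuple3 = global_importance({"MFR": "Audi", "MODEL": "R8"}, MISSING)
```

Both values are negative; tuple 3 (Audi R8) is closer to zero than tuple 1, so under this sketch it would win the tie.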

Breaking Ties with IDF
 Option 1: tuples with large IDF (values occurring infrequently)
of missing attributes are ranked higher
 Problem: cars which are not popular are ranked higher

 Option 2: tuples with small IDF of missing attributes
are ranked higher
 Problem: cars with a moonroof, a desirable but
infrequent feature, are ranked lower

Implementation

 Pre-processing component

 Query–processing component

Implementation
 Pre-processing Component

 Compute and store a representation of the
similarity function (QF-IDF, QF, IDF) in
auxiliary database tables

Implementation
 Query Processing Component
 Job: Retrieving Top-K results from Database

 ITA Algorithm: combines Fagin’s Threshold
Algorithm with the similarity function
 Sorted Access: along any attribute Ak, TIDs of
tuples are retrieved one by one in order of decreasing similarity to qk.
 Random Access: the entire tuple corresponding to a
TID is retrieved.

ITA Algorithm
 Initialize Top-K Buffer to empty
 Repeat
 For each k = 1 to p
 TID = index of the next tuple retrieved from ordered list Lk
 T = complete tuple retrieved for TID
 Compute value of ranking function for T
 If rank of T is higher than the rank of the lowest-ranking tuple
in Top-K Buffer, then update Top-K Buffer
 If stopping condition has been reached, then exit
 End For
 Until the indexes of all tuples have been seen.

ITA Algorithm
Stopping Condition
Hypothetical tuple – takes the current values a1, …,
ap for A1, …, Ap, corresponding to the index
seeks on L1, …, Lp, and the values qp+1, …, qm for the
remaining columns directly from the query.
Termination – stop when the similarity of the hypothetical
tuple to the query is less than that of the tuple in the Top-K
buffer with the least similarity.
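
The loop and stopping condition can be sketched in Python as follows. This is a simplified, in-memory illustration, not the paper's implementation: every query attribute is assumed to have an ordered list (p = m), the lists are built by sorting rather than read from indexes, and the stopping test runs once per round instead of after every tuple. The names ita_top_k and idf_sim are hypothetical.

```python
import heapq
import math
from collections import Counter

ROWS = {  # TID -> tuple, from the example table
    1: {"MFR_Country": "Germany", "Type": "Sports"},
    2: {"MFR_Country": "Germany", "Type": "Executive"},
    3: {"MFR_Country": "Germany", "Type": "Sports"},
    4: {"MFR_Country": "Italy", "Type": "Sports"},
}


def idf_sim(attr, t, q):
    """S_k(t, q) = IDF(q) if t == q, otherwise 0."""
    if t != q:
        return 0.0
    freq = Counter(row[attr] for row in ROWS.values())
    return math.log(len(ROWS) / freq[q])


def ita_top_k(rows, query, sim, k):
    attrs = list(query)
    # Ordered lists L_k: TIDs by decreasing similarity on attribute A_k.
    lists = {
        a: sorted(rows, key=lambda tid: -sim(a, rows[tid][a], query[a]))
        for a in attrs
    }
    top, seen = [], set()  # min-heap of (score, TID); TIDs already scored
    for pos in range(len(rows)):
        current = {}  # similarity last seen on each list (hypothetical tuple)
        for a in attrs:
            tid = lists[a][pos]  # sorted access
            row = rows[tid]      # random access
            current[a] = sim(a, row[a], query[a])
            if tid in seen:
                continue
            seen.add(tid)
            score = sum(sim(b, row[b], query[b]) for b in attrs)
            if len(top) < k:
                heapq.heappush(top, (score, tid))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, tid))
        # Stop once no unseen tuple can beat the weakest buffered tuple.
        if len(top) == k and sum(current.values()) < top[0][0]:
            break
    return sorted(top, reverse=True)


result = ita_top_k(ROWS, {"MFR_Country": "Germany", "Type": "Sports"}, idf_sim, 2)
```

For this query the top-2 buffer ends up holding TIDs 1 and 3, the two German sports cars.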

ITA for Numeric columns
 Consider a query with the condition Ak = qk for
a numeric column Ak.

 Two index scans are performed on Ak.
 The first retrieves TIDs with values > qk in increasing order.
 The second retrieves TIDs with values < qk in decreasing
order.

 TIDs are then picked from the merged
stream.
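
A sketch of the two-scan merge, assuming the merged stream should yield TIDs in order of increasing distance from qk (which is decreasing Gaussian similarity). The sorted list of (value, TID) pairs stands in for an index on Ak; the upward scan here starts at values ≥ qk so exact matches are not lost, which is an assumption beyond the slide. The data and the name tids_by_closeness are hypothetical.

```python
import bisect


def tids_by_closeness(index, q):
    """Yield TIDs from a value-sorted (value, TID) list, nearest to q first."""
    values = [value for value, _ in index]
    up = bisect.bisect_left(values, q)  # upward scan: values >= q, increasing
    down = up - 1                       # downward scan: values < q, decreasing
    while down >= 0 or up < len(index):
        take_up = down < 0 or (
            up < len(index) and index[up][0] - q <= q - index[down][0]
        )
        if take_up:
            yield index[up][1]
            up += 1
        else:
            yield index[down][1]
            down -= 1


# Hypothetical PRICE index: (value, TID) pairs sorted by value.
PRICE_INDEX = [(10, "a"), (49, "b"), (52, "c"), (95, "d")]
order = list(tids_by_closeness(PRICE_INDEX, 50))
```

For qk = 50 the stream starts with the tuples priced 49 and 52 and only then reaches 10 and 95.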

Conclusion
 Automated Ranking Infrastructure for SQL
databases.
 Extended TF-IDF based techniques from
Information retrieval to numeric and mixed
data.
 Implementation of Ranking function that
exploited Fagin’s TA
