SlideShare una empresa de Scribd logo
1 de 125
Descargar para leer sin conexión
Identifying Task-based Sessions
in Search Engine Query Logs
Gabriele
Tolomei
Ca’ Foscari University ofVenice
ISTI-CNR, Pisa
Italy
February, 12 2011
Fabrizio
Silvestri
Raffaele
Perego
Claudio
Lucchese
Salvatore
Orlando
Agenda
• Introduction
• Contributions
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
Agenda
• Introduction
• Contributions
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
Problem Statement:TSDP
Task-based Session Discovery Problem:
Discover sets of possibly non contiguous queries
issued by users and collected by Web Search Engine
Query Logs whose aim is to carry out specific “tasks”
4
Gabriele Tolomei - February, 12 2011
Background
• What is a Web task?
5
Gabriele Tolomei - February, 12 2011
Background
• What is a Web task?
• A “template” for representing any (atomic) activity that can be
achieved by exploiting the information available on the Web, e.g.,
“find a recipe”,“book a flight”,“read news”, etc.
5
Gabriele Tolomei - February, 12 2011
Background
• What is a Web task?
• A “template” for representing any (atomic) activity that can be
achieved by exploiting the information available on the Web, e.g.,
“find a recipe”,“book a flight”,“read news”, etc.
• Why WSE Query Logs?
5
Gabriele Tolomei - February, 12 2011
Background
• What is a Web task?
• A “template” for representing any (atomic) activity that can be
achieved by exploiting the information available on the Web, e.g.,
“find a recipe”,“book a flight”,“read news”, etc.
• Why WSE Query Logs?
• Users rely on WSEs for satisfying their information needs by issuing
possibly interleaved stream of related queries
5
Gabriele Tolomei - February, 12 2011
Background
• What is a Web task?
• A “template” for representing any (atomic) activity that can be
achieved by exploiting the information available on the Web, e.g.,
“find a recipe”,“book a flight”,“read news”, etc.
• Why WSE Query Logs?
• Users rely on WSEs for satisfying their information needs by issuing
possibly interleaved stream of related queries
• WSEs collect the search activities, i.e., sessions, of their users by
means of issued queries, timestamps, clicked results, etc.
5
Gabriele Tolomei - February, 12 2011
Background
• What is a Web task?
• A “template” for representing any (atomic) activity that can be
achieved by exploiting the information available on the Web, e.g.,
“find a recipe”,“book a flight”,“read news”, etc.
• Why WSE Query Logs?
• Users rely on WSEs for satisfying their information needs by issuing
possibly interleaved stream of related queries
• WSEs collect the search activities, i.e., sessions, of their users by
means of issued queries, timestamps, clicked results, etc.
• User search sessions (especially long-term ones) might contain
interesting patterns that can be mined, e.g., sub-sessions whose
queries aim to perform the same Web task
5
Gabriele Tolomei - February, 12 2011
Motivation
• “Addiction to Web search”: no matter what your
information need is, ask it to a WSE and it will give you
the answer, e.g., people querying Google for “google”!
6
Gabriele Tolomei - February, 12 2011
Motivation
• “Addiction to Web search”: no matter what your
information need is, ask it to a WSE and it will give you
the answer, e.g., people querying Google for “google”!
• Everyone who is now at WSDM 2011 has dealt with a
lot of “stuff” for organizing her/his attendance
6
Gabriele Tolomei - February, 12 2011
Motivation
• “Addiction to Web search”: no matter what your
information need is, ask it to a WSE and it will give you
the answer, e.g., people querying Google for “google”!
• Everyone who is now at WSDM 2011 has dealt with a
lot of “stuff” for organizing her/his attendance
• Conference Web site is full of useful information but
still some tasks have to be performed (e.g., book flight,
reserve hotel room, rent car, etc.)
6
Gabriele Tolomei - February, 12 2011
Motivation
• “Addiction to Web search”: no matter what your
information need is, ask it to a WSE and it will give you
the answer, e.g., people querying Google for “google”!
• Everyone who is now at WSDM 2011 has dealt with a
lot of “stuff” for organizing her/his attendance
• Conference Web site is full of useful information but
still some tasks have to be performed (e.g., book flight,
reserve hotel room, rent car, etc.)
• Discovering tasks from WSE logs will allow us to better
understand user search intents at a “higher level of
abstraction”:
• from query-by-query to task-by-task Web search
6
Gabriele Tolomei - February, 12 2011
The Big Picture
query
hong kong
flights
...
7
Gabriele Tolomei - February, 12 2011
The Big Picture
query
fly to
hong kong
...
7
Gabriele Tolomei - February, 12 2011
The Big Picture
query
nba sport
news
...
7
Gabriele Tolomei - February, 12 2011
The Big Picture
query
pisa to
hong kong
...
7
Gabriele Tolomei - February, 12 2011
The Big Picture
... ......
7
Gabriele Tolomei - February, 12 2011
long-term session
The Big Picture
... ......
7
1 2 n
...
Gabriele Tolomei - February, 12 2011
Δt > tφ long-term session
The Big Picture
7
1 2 ... n
Gabriele Tolomei - February, 12 2011
The Big Picture
7
1 2 ... n
fly to
Hong Kong
nba news shopping in
Hong Kong
Gabriele Tolomei - February, 12 2011
Related Work
• Previous work on session identification
can be classified into:
1. time-based
2. content-based
3. novel heuristics (combining 1. and 2.)
8
Gabriele Tolomei - February, 12 2011
Related Work: time-based
• 1999: Silverstein et al. [1] firstly defined the
concept of “session”:
• 2 adjacent queries (qi, qi+1) are part of the same
session if their time submission gap is at most 5
minutes
• 2000: He and Göker [2] used different timeouts to
split user sessions (from 1 to 50 minutes)
• 2006: Jansen and Spink [4] described a session as
the time gap between the first and last recorded
timestamp on the WSE server
9
Gabriele Tolomei - February, 12 2011
Related Work: time-based
• 1999: Silverstein et al. [1] firstly defined the
concept of “session”:
• 2 adjacent queries (qi, qi+1) are part of the same
session if their time submission gap is at most 5
minutes
• 2000: He and Göker [2] used different timeouts to
split user sessions (from 1 to 50 minutes)
• 2006: Jansen and Spink [4] described a session as
the time gap between the first and last recorded
timestamp on the WSE server
PROs
✓ ease of implementation
CONs
✓ unable to deal with multi-
tasking behaviors
9
Gabriele Tolomei - February, 12 2011
Related Work: content-based
• Some work exploit lexical content of the queries
for determining a topic shift in the stream, i.e.,
session boundary [3, 5, 6, 7]
• Several string similarity scores have been
proposed, e.g., Levenshtein, Jaccard, etc.
• 2005: Shen et al. [8] compared “expanded
representation” of queries
• expansion of a query q is obtained by concatenating
titles and Web snippets for the top-50 results
provided by a WSE for q
10
Gabriele Tolomei - February, 12 2011
Related Work: content-based
• Some work exploit lexical content of the queries
for determining a topic shift in the stream, i.e.,
session boundary [3, 5, 6, 7]
• Several string similarity scores have been
proposed, e.g., Levenshtein, Jaccard, etc.
• 2005: Shen et al. [8] compared “expanded
representation” of queries
• expansion of a query q is obtained by concatenating
titles and Web snippets for the top-50 results
provided by a WSE for q
PROs
✓ effectiveness improvement
CONs
✓ vocabulary-mismatch problem:
e.g., (“nba”,“kobe bryant”)
10
Gabriele Tolomei - February, 12 2011
Related Work: novel
• 2005: Radlinski and Joachims [3] introduced query
chains, i.e., sequence of queries with similar
information need
• 2008: Boldi et al. [9] introduce the query-flow
graph as a model for representing WSE log data
• session identification as Traveling Salesman Problem
• 2008: Jones and Klinkner [10] address a problem
similar to the TSDP
• hierarchical search: mission vs. goal
• supervised approach: learn a suitable binary classifier
to detect whether two queries (qi, qj) belong to the
same task or not
11
Gabriele Tolomei - February, 12 2011
Related Work: novel
• 2005: Radlinski and Joachims [3] introduced query
chains, i.e., sequence of queries with similar
information need
• 2008: Boldi et al. [9] introduce the query-flow
graph as a model for representing WSE log data
• session identification as Traveling Salesman Problem
• 2008: Jones and Klinkner [10] address a problem
similar to the TSDP
• hierarchical search: mission vs. goal
• supervised approach: learn a suitable binary classifier
to detect whether two queries (qi, qj) belong to the
same task or not
PROs
✓ effectiveness improvement
CONs
✓ computational complexity
11
Gabriele Tolomei - February, 12 2011
Agenda
• Introduction
• Contributions
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
Outline
• Formalize the Task-based Session Discovery Problem
13
Gabriele Tolomei - February, 12 2011
Outline
• Formalize the Task-based Session Discovery Problem
• Analyze a real long-term WSE log of queries
13
Gabriele Tolomei - February, 12 2011
Outline
• Formalize the Task-based Session Discovery Problem
• Analyze a real long-term WSE log of queries
• Build a ground-truth of tasks by manually grouping a
sample of task-related queries in the given WSE log
13
Gabriele Tolomei - February, 12 2011
Outline
• Formalize the Task-based Session Discovery Problem
• Analyze a real long-term WSE log of queries
• Build a ground-truth of tasks by manually grouping a
sample of task-related queries in the given WSE log
• Perform some statistics on top of the ground-truth
13
Gabriele Tolomei - February, 12 2011
Outline
• Formalize the Task-based Session Discovery Problem
• Analyze a real long-term WSE log of queries
• Build a ground-truth of tasks by manually grouping a
sample of task-related queries in the given WSE log
• Perform some statistics on top of the ground-truth
• Propose several techniques for addressing the TSDP
13
Gabriele Tolomei - February, 12 2011
Agenda
• Introduction
• Contributions
• Query Log Analysis
• Ground-truth Analysis
• Approaching TSDP
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
Data Set:AOL Query Log
15
Original Data Set
✓ 3-months collection
✓ ~20M queries
✓ ~657K users
Gabriele Tolomei - February, 12 2011
Data Set:AOL Query Log
15
Original Data Set
Sample Data Set
✓ 1-week collection
✓ ~100K queries
✓ 1,000 users
✓ removed empty queries
✓ removed “non-sense” queries
✓ removed stop-words
✓ applied Porter stemming algorithm
✓ 3-months collection
✓ ~20M queries
✓ ~657K users
Gabriele Tolomei - February, 12 2011
16
Data Analysis: query time gap
Gabriele Tolomei - February, 12 2011
16
Data Analysis: query time gap
Gabriele Tolomei - February, 12 2011
tφ = 26 min.
84.1% of adjacent query pairs
are issued within 26 minutes
Agenda
• Introduction
• Contributions
• Query Log Analysis
• Ground-truth Analysis
• Approaching TSDP
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
• Long-term sessions of sample data set are first split
using the threshold tφ devised before (i.e., 26 minutes)
• obtaining several time-gap sessions
18
Ground-truth: construction
Gabriele Tolomei - February, 12 2011
• Long-term sessions of sample data set are first split
using the threshold tφ devised before (i.e., 26 minutes)
• obtaining several time-gap sessions
• Human annotators group queries that they claim to be
task-related inside each time-gap session
18
Ground-truth: construction
Gabriele Tolomei - February, 12 2011
• Long-term sessions of sample data set are first split
using the threshold tφ devised before (i.e., 26 minutes)
• obtaining several time-gap sessions
• Human annotators group queries that they claim to be
task-related inside each time-gap session
• Represents the true task-based partitioning manually
built from actual WSE query log data
18
Ground-truth: construction
Gabriele Tolomei - February, 12 2011
• Long-term sessions of sample data set are first split
using the threshold tφ devised before (i.e., 26 minutes)
• obtaining several time-gap sessions
• Human annotators group queries that they claim to be
task-related inside each time-gap session
• Represents the true task-based partitioning manually
built from actual WSE query log data
• Useful both for statistical purposes and evaluation of
automatic task-based session discovery methods
18
Ground-truth: construction
Gabriele Tolomei - February, 12 2011
19
Ground-truth: statistics
✓ 2,004 queries
✓ 446 time-gap sessions
✓ 1,424 annotated queries
✓ 307 annotated time-gap sessions
✓ 554 detected task-based sessions
Gabriele Tolomei - February, 12 2011
20
Ground-truth: statistics
Gabriele Tolomei - February, 12 2011
✓ 4.49 avg. queries per
time-gap session
✓ more than 70% time-gap
session contains at most
5 queries
20
Ground-truth: statistics
Gabriele Tolomei - February, 12 2011
✓ 4.49 avg. queries per
time-gap session
✓ more than 70% time-gap
session contains at most
5 queries
✓ 2.57 avg. queries per task
✓ ~75% tasks contains at
most 3 queries
20
Ground-truth: statistics
Gabriele Tolomei - February, 12 2011
✓ 4.49 avg. queries per
time-gap session
✓ more than 70% time-gap
session contains at most
5 queries
✓ 2.57 avg. queries per task
✓ ~75% tasks contains at
most 3 queries
✓ 1.80 avg. task per time-
gap session
✓ ~47% time-gap session
contains more than one
task (multi-tasking)
✓ 1,046 over 1,424 queries
(i.e., ~74%) included in
multi-tasking sessions
21
Ground-truth: statistics
✓ overlapping degree of
multi-tasking sessions
✓ jump occurs whenever
two queries of the same
task are not originally
adjacent
✓ ratio of task in a time-gap
session that contains at
least one jump
Gabriele Tolomei - February, 12 2011
Agenda
• Introduction
• Contributions
• Query Log Analysis
• Ground-truth Analysis
• Approaching TSDP
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
PROs:
✓ ease of implementation
✓ O(n) time complexity (linear in the number n of
queries)
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
PROs:
✓ ease of implementation
✓ O(n) time complexity (linear in the number n of
queries)
CONs:
✓ unable to deal with multi-tasking
✓ unawareness of other discriminating query
features (e.g., lexical content)
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
PROs:
✓ ease of implementation
✓ O(n) time complexity (linear in the number n of
queries)
Methods: TS-5,TS-15,TS-26, etc.
CONs:
✓ unable to deal with multi-tasking
✓ unawareness of other discriminating query
features (e.g., lexical content)
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
2) QueryClustering-m
Description:
Queries are grouped using clustering algorithms,
which exploit several query features. Clustering
algorithms assembly such features using two
different distance functions for computing query-
pair similarity.
Two queries (qi, qj) are in the same task-based
session if and only if they are in the same cluster.
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
PROs:
✓ ease of implementation
✓ O(n) time complexity (linear in the number n of
queries)
Methods: TS-5,TS-15,TS-26, etc.
CONs:
✓ unable to deal with multi-tasking
✓ unawareness of other discriminating query
features (e.g., lexical content)
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
2) QueryClustering-m
Description:
Queries are grouped using clustering algorithms,
which exploit several query features. Clustering
algorithms assembly such features using two
different distance functions for computing query-
pair similarity.
Two queries (qi, qj) are in the same task-based
session if and only if they are in the same cluster.
PROs:
✓ able to detect multi-tasking sessions
✓ able to deal with “noisy queries” (i.e., outliers)
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
PROs:
✓ ease of implementation
✓ O(n) time complexity (linear in the number n of
queries)
Methods: TS-5,TS-15,TS-26, etc.
CONs:
✓ unable to deal with multi-tasking
✓ unawareness of other discriminating query
features (e.g., lexical content)
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
2) QueryClustering-m
Description:
Queries are grouped using clustering algorithms,
which exploit several query features. Clustering
algorithms assembly such features using two
different distance functions for computing query-
pair similarity.
Two queries (qi, qj) are in the same task-based
session if and only if they are in the same cluster.
PROs:
✓ able to detect multi-tasking sessions
✓ able to deal with “noisy queries” (i.e., outliers)
CONs:
✓ O(n2) time complexity (i.e. quadratic in the
number n of queries due to all-pairs-similarity
computational step)
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
PROs:
✓ ease of implementation
✓ O(n) time complexity (linear in the number n of
queries)
Methods: TS-5,TS-15,TS-26, etc.
CONs:
✓ unable to deal with multi-tasking
✓ unawareness of other discriminating query
features (e.g., lexical content)
Gabriele Tolomei - February, 12 2011
23
TSDP: approaches
2) QueryClustering-m
Description:
Queries are grouped using clustering algorithms,
which exploit several query features. Clustering
algorithms assembly such features using two
different distance functions for computing query-
pair similarity.
Two queries (qi, qj) are in the same task-based
session if and only if they are in the same cluster.
PROs:
✓ able to detect multi-tasking sessions
✓ able to deal with “noisy queries” (i.e., outliers)
CONs:
✓ O(n2) time complexity (i.e. quadratic in the
number n of queries due to all-pairs-similarity
computational step)
Methods: QC-MEANS, QC-SCAN, QC-WCC, and
QC-HTC
1) TimeSplitting-t
Description:
The idea is that if two consecutive queries are far
away enough then they are also likely to be
unrelated.
Two consecutive queries (qi, qi+1) are in the same
task-based session if and only if their time
submission gap is lower than a certain threshold t.
PROs:
✓ ease of implementation
✓ O(n) time complexity (linear in the number n of
queries)
Methods: TS-5,TS-15,TS-26, etc.
CONs:
✓ unable to deal with multi-tasking
✓ unawareness of other discriminating query
features (e.g., lexical content)
Gabriele Tolomei - February, 12 2011
24
Query Features
Gabriele Tolomei - February, 12 2011
Content-based (µcontent)
✓ two queries (qi, qj) sharing common
terms are likely related
✓ µjaccard: Jaccard index on query
character 3-grams
✓ µlevenshtein: normalized Levenshtein
distance
24
Query Features
Gabriele Tolomei - February, 12 2011
Semantic-based (µsemantic)
✓ using Wikipedia and Wiktionary for
“expanding” a query q
✓ “wikification” of q using vector-space
model
✓ relatedness between (qi, qj) computed
using cosine-similarity
Content-based (µcontent)
✓ two queries (qi, qj) sharing common
terms are likely related
✓ µjaccard: Jaccard index on query
character 3-grams
✓ µlevenshtein: normalized Levenshtein
distance
25
Distance Functions: µ1 vs. µ2
Gabriele Tolomei - February, 12 2011
✓ Convex combination µ1
✓ Conditional formula µ2
Idea: if two queries are close in term of lexical
content, the semantic expansion could be
unhelpful.Vice-versa, nothing can be said when
queries do not share any content feature
✓ Both µ1 and µ2 rely on the estimation of
some parameters, i.e., α, t, and b
✓ Use ground-truth for tuning parameters
• Models each time-gap session φ as a complete weighted
undirected graph Gφ = (V, E, w)
26
QC-WCC
Gabriele Tolomei - February, 12 2011
• Models each time-gap session φ as a complete weighted
undirected graph Gφ = (V, E, w)
• set of nodesV are the queries in φ
26
QC-WCC
Gabriele Tolomei - February, 12 2011
• Models each time-gap session φ as a complete weighted
undirected graph Gφ = (V, E, w)
• set of nodesV are the queries in φ
• set of edges E are weighted by the similarity of the
corresponding nodes
26
QC-WCC
Gabriele Tolomei - February, 12 2011
• Models each time-gap session φ as a complete weighted
undirected graph Gφ = (V, E, w)
• set of nodesV are the queries in φ
• set of edges E are weighted by the similarity of the
corresponding nodes
• Drop weak edges, i.e., with low similarity, assuming the
corresponding queries are not related and obtaining G’φ
26
QC-WCC
Gabriele Tolomei - February, 12 2011
• Models each time-gap session φ as a complete weighted
undirected graph Gφ = (V, E, w)
• set of nodesV are the queries in φ
• set of edges E are weighted by the similarity of the
corresponding nodes
• Drop weak edges, i.e., with low similarity, assuming the
corresponding queries are not related and obtaining G’φ
• Clusters are built on the basis of strong edges by finding
all the connected components of the pruned graph G’φ
26
QC-WCC
Gabriele Tolomei - February, 12 2011
• Models each time-gap session φ as a complete weighted
undirected graph Gφ = (V, E, w)
• set of nodesV are the queries in φ
• set of edges E are weighted by the similarity of the
corresponding nodes
• Drop weak edges, i.e., with low similarity, assuming the
corresponding queries are not related and obtaining G’φ
• Clusters are built on the basis of strong edges by finding
all the connected components of the pruned graph G’φ
• O(n2) time complexity where n = |V|
26
QC-WCC
Gabriele Tolomei - February, 12 2011
27
QC-WCC
Gabriele Tolomei - February, 12 2011
1 8765432φ
27
QC-WCC
Gabriele Tolomei - February, 12 2011
1 2
3
4
56
7
8
Build similarity graph Gφ
27
QC-WCC
Gabriele Tolomei - February, 12 2011
1 2
3
4
56
7
8
Drop “weak edges”
27
QC-WCC
Gabriele Tolomei - February, 12 2011
1 2
3
4
56
7
8
• Variation of QC-WCC based on head-tail components
28
QC-HTC
Gabriele Tolomei - February, 12 2011
• Variation of QC-WCC based on head-tail components
• Does not need to compute the full similarity graph
28
QC-HTC
Gabriele Tolomei - February, 12 2011
• Variation of QC-WCC based on head-tail components
• Does not need to compute the full similarity graph
• Exploits the sequentiality of query submissions to
reduce the number of similarity computations
28
QC-HTC
Gabriele Tolomei - February, 12 2011
• Variation of QC-WCC based on head-tail components
• Does not need to compute the full similarity graph
• Exploits the sequentiality of query submissions to
reduce the number of similarity computations
• Performs 2 steps:
1. sequential clustering
2. merging
28
QC-HTC
Gabriele Tolomei - February, 12 2011
• Partition each time-gap session into sequential
clusters containing only queries issued in a row
29
QC-HTC: sequential clustering
Gabriele Tolomei - February, 12 2011
• Partition each time-gap session into sequential
clusters containing only queries issued in a row
• Each query in every sequential cluster has to be
“similar enough” to the chronologically next one
29
QC-HTC: sequential clustering
Gabriele Tolomei - February, 12 2011
• Partition each time-gap session into sequential
clusters containing only queries issued in a row
• Each query in every sequential cluster has to be
“similar enough” to the chronologically next one
• Need to compute only the similarity between one
query and the next in the original data
29
QC-HTC: sequential clustering
Gabriele Tolomei - February, 12 2011
• Merge together related sequential clusters due to
multi-tasking
30
QC-HTC: merging
Gabriele Tolomei - February, 12 2011
• Merge together related sequential clusters due to
multi-tasking
• Hyp: a cluster is represented by its chronologically-
first and last queries, i.e., head and tail, respectively
30
QC-HTC: merging
Gabriele Tolomei - February, 12 2011
• Merge together related sequential clusters due to
multi-tasking
• Hyp: a cluster is represented by its chronologically-
first and last queries, i.e., head and tail, respectively
• Given two sequential clusters ci, cj and hi, ti, and hj,
tj, their corresponding head and tail queries the
similarity s(ci, cj) is computed as follow:
30
QC-HTC: merging
Gabriele Tolomei - February, 12 2011
• Merge together related sequential clusters due to
multi-tasking
• Hyp: a cluster is represented by its chronologically-
first and last queries, i.e., head and tail, respectively
• Given two sequential clusters ci, cj and hi, ti, and hj,
tj, their corresponding head and tail queries the
similarity s(ci, cj) is computed as follow:
30
QC-HTC: merging
s(ci, cj) = min w(e(qi, qj)) s.t. qi ∈ {hi, ti} and qj ∈ {hj, tj}
Gabriele Tolomei - February, 12 2011
• Merge together related sequential clusters due to
multi-tasking
• Hyp: a cluster is represented by its chronologically-
first and last queries, i.e., head and tail, respectively
• Given two sequential clusters ci, cj and hi, ti, and hj,
tj, their corresponding head and tail queries the
similarity s(ci, cj) is computed as follow:
30
QC-HTC: merging
s(ci, cj) = min w(e(qi, qj)) s.t. qi ∈ {hi, ti} and qj ∈ {hj, tj}
• ci and cj are merged as long as s(ci, cj) > η
• hi, ti and hj, tj are updated consequently
Gabriele Tolomei - February, 12 2011
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 8765432φ
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 2 1) Sequential Clustering
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 2
3
1) Sequential Clustering
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 2
3
4
1) Sequential Clustering
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 2
3
4
56
7
8
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 2
3
4
56
7
8
2) Merging
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 2
3
4
56
7
8
2) Merging
31
QC-HTC
Gabriele Tolomei - February, 12 2011
1 2
4
56
7
8
• In the first step the algorithm computes the similarity
only between one query and the next in the original data
• O(n) where n is the size of the time-gap session
32
QC-HTC: time complexity
Gabriele Tolomei - February, 12 2011
• In the first step the algorithm computes the similarity
only between one query and the next in the original data
• O(n) where n is the size of the time-gap session
• In the second step the algorithm computes the pairwise
similarity between each sequential cluster
• O(k2) where k is the number of sequential clusters
• if k = β·n with 0<β≤1 then time complexity is O(β2·n2)
• e.g. β = 1/2 O(n2/4) 4 times better than QC-WCC
32
QC-HTC: time complexity
Gabriele Tolomei - February, 12 2011
Agenda
• Introduction
• Contributions
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
• Run and compare all the proposed
approaches with:
34
Setup
Gabriele Tolomei - February, 12 2011
• Run and compare all the proposed
approaches with:
• TS-26: time-splitting technique (baseline)
34
Setup
Gabriele Tolomei - February, 12 2011
• Run and compare all the proposed
approaches with:
• TS-26: time-splitting technique (baseline)
• QFG: session extraction method based on the
query-flow graph model (state of the art)
34
Setup
Gabriele Tolomei - February, 12 2011
• Measure the degree of correspondence between true tasks,
i.e., manually-extracted ground-truth, and predicted tasks,
i.e., output by algorithms
35
Evaluation
Gabriele Tolomei - February, 12 2011
• Measure the degree of correspondence between true tasks,
i.e., manually-extracted ground-truth, and predicted tasks,
i.e., output by algorithms
35
Evaluation
a) F-MEASURE
✓ evaluates the extent to
which a predicted task
contains only and all the
queries of a true task
✓ combines p(i, j) and r(i, j)
the precision and recall
of task i w.r.t. class j
Gabriele Tolomei - February, 12 2011
• Measure the degree of correspondence between true tasks,
i.e., manually-extracted ground-truth, and predicted tasks,
i.e., output by algorithms
35
Evaluation
a) F-MEASURE
✓ evaluates the extent to
which a predicted task
contains only and all the
queries of a true task
✓ combines p(i, j) and r(i, j)
the precision and recall
of task i w.r.t. class j
b) RAND
✓ pairs of queries instead
of singleton
✓ f00, f01, f10, f11
Gabriele Tolomei - February, 12 2011
• Measure the degree of correspondence between true tasks,
i.e., manually-extracted ground-truth, and predicted tasks,
i.e., output by algorithms
35
Evaluation
a) F-MEASURE
✓ evaluates the extent to
which a predicted task
contains only and all the
queries of a true task
✓ combines p(i, j) and r(i, j)
the precision and recall
of task i w.r.t. class j
b) RAND
✓ pairs of queries instead
of singleton
✓ f00, f01, f10, f11
c) JACCARD
✓ pairs of queries instead
of singleton
✓ f01, f10, f11
Gabriele Tolomei - February, 12 2011
• 3 time thresholds used: 5, 15, and 26 minutes
36
Results:TS-t
Gabriele Tolomei - February, 12 2011
• 3 time thresholds used: 5, 15, and 26 minutes
• Note:TS-26 was used for splitting sample data set
• task-based sessions == time-gap sessions
36
Results:TS-t
Gabriele Tolomei - February, 12 2011
37
Results: QFG
Gabriele Tolomei - February, 12 2011
✓ trained on a segment of
our sample data set
✓ best results using η = 0.7
✓ vs. baseline:
• +16% F-measure
• +52% Rand
• +15% Jaccard
38
Results: QC-WCC
Gabriele Tolomei - February, 12 2011
✓ best results using µ2 and
η = 0.3
✓ vs. baseline:
• +20% F-measure
• +56% Rand
• +23% Jaccard
✓ vs. QFG:
• +5% F-measure
• +9% Rand
• +10% Jaccard
39
Results: QC-HTC
Gabriele Tolomei - February, 12 2011
✓ best results using µ2 and
η = 0.3
✓ vs. baseline:
• +19% F-measure
• +56% Rand
• +21% Jaccard
✓ vs. QFG:
• +4% F-measure
• +9% Rand
• +8% Jaccard
40
Results: best
Gabriele Tolomei - February, 12 2011
• Benefit of using Wikipedia instead of only lexical
content when computing query distance function
41
Results:Wiki impact
Gabriele Tolomei - February, 12 2011
• Benefit of using Wikipedia instead of only lexical
content when computing query distance function
• Capturing other two queries that are lexically
different but somehow “semantically” similar
41
Results:Wiki impact
Gabriele Tolomei - February, 12 2011
• Benefit of using Wikipedia instead of only lexical
content when computing query distance function
• Capturing other two queries that are lexically
different but somehow “semantically” similar
• Try going here: http://en.wikipedia.org/wiki/Cancun
41
Results:Wiki impact
Gabriele Tolomei - February, 12 2011
41
Results:Wiki impact
Gabriele Tolomei - February, 12 2011
Agenda
• Introduction
• Contributions
• Experiments and Results
• Conclusions and Future Work
Gabriele Tolomei - February, 12 2011
• Introduced the Task-based Session Discovery Problem
• from a WSE log of user activities extract several sets of
queries which are all related to the same task
43
Conclusions
Gabriele Tolomei - February, 12 2011
• Introduced the Task-based Session Discovery Problem
• from a WSE log of user activities extract several sets of
queries which are all related to the same task
• Compared clustering solutions exploiting two
distance functions based on query content and
semantic expansion (i.e.,Wiktionary and Wikipedia)
43
Conclusions
Gabriele Tolomei - February, 12 2011
• Introduced the Task-based Session Discovery Problem
• from a WSE log of user activities extract several sets of
queries which are all related to the same task
• Compared clustering solutions exploiting two
distance functions based on query content and
semantic expansion (i.e.,Wiktionary and Wikipedia)
• Proposed novel graph-based heuristic QC-HTC, lighter
than QC-WCC, outperforming other methods in
terms of F-measure, Rand and Jaccard index
43
Conclusions
Gabriele Tolomei - February, 12 2011
• Why should we stop here?
44
Future Work
Gabriele Tolomei - February, 12 2011
• Why should we stop here?
• Once discovered, smaller tasks might be part of
larger and more complex tasks
44
Future Work
Gabriele Tolomei - February, 12 2011
• Why should we stop here?
• Once discovered, smaller tasks might be part of
larger and more complex tasks
• The task “fly to Hong Kong” might be a step of
a larger task, e.g.,“holidays in Hong Kong”,
which in turn could involve several other tasks...
44
Future Work
Gabriele Tolomei - February, 12 2011
• Make Web Search Engine the “universal driver” for
executing our daily activities on the Web
45
Vision
Gabriele Tolomei - February, 12 2011
• Make Web Search Engine the “universal driver” for
executing our daily activities on the Web
• Once user types in a query,WSE should “infer the
tasks” user aims to perform (if any) serendipity!
45
Vision
Gabriele Tolomei - February, 12 2011
• Make Web Search Engine the “universal driver” for
executing our daily activities on the Web
• Once user types in a query,WSE should “infer the
tasks” user aims to perform (if any) serendipity!
• Results should be no longer only list of plain links but
also tasks, either simple and complex
45
Vision
Gabriele Tolomei - February, 12 2011
• Make Web Search Engine the “universal driver” for
executing our daily activities on the Web
• Once user types in a query,WSE should “infer the
tasks” user aims to perform (if any) serendipity!
• Results should be no longer only list of plain links but
also tasks, either simple and complex
• Recommendation of queries and/or Web pages both
intra- and inter-task
45
Vision
Gabriele Tolomei - February, 12 2011
task vs. query recommendation
46
References
Gabriele Tolomei - February, 12 2011
[1] Silverstein, Marais, Henzinger, and Moricz.“Analysis of a very large web search engine query log”. In SIGIR Forum, 1999
[2] He and Göker.“Detecting session boundaries from web user logs”. In BCS-IRSG, 2000
[3] Radlinski and Joachims.“Query chains: Learning to rank from implicit feedback”. In KDD '05
[4] Jansen and Spink.“How are we searching the world wide web?: a comparison of nine search engine transaction logs”.
In IPM, 2006
[5] Lau and Horvitz.“Patterns of search:Analyzing and modeling web query refinement”. In UM '99
[6] He and Harper.“Combining evidence for automatic web session identification”. In IPM, 2002
[7] Ozmutlu and Çavdur.“Application of automatic topic identification on excite web search engine data logs”. In IPM, 2005
[8] Shen,Tan, and Zhai.“Implicit user modeling for personalized search”. In CIKM '05
[9] Boldi, Bonchi, Castillo, Donato, Gionis, andVigna.“The query-flow graph: model and applications”. In CIKM '08
[10] Jones and Klinkner.“Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs”.
In CIKM '08
[11] MacQueen.“Some methods for classification and analysis of multivariate observations”. In BSMSP, 1967
[12] Ester, Kriegel, Sander, and Xu.“A density-based algorithm for discovering clusters in large spatial databases with noise”.
In KDD '96
ThankYou!
Questions?

Más contenido relacionado

Similar a WSDM 2011

Site search analytics workshop presentation
Site search analytics workshop presentationSite search analytics workshop presentation
Site search analytics workshop presentationLouis Rosenfeld
 
Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)
Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)
Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)Lora Aroyo
 
Matapihi 'The National Digital Project'. The University of Auckland Library P...
Matapihi 'The National Digital Project'. The University of Auckland Library P...Matapihi 'The National Digital Project'. The University of Auckland Library P...
Matapihi 'The National Digital Project'. The University of Auckland Library P...Rose Holley
 
Digital learning martin bazley gem conference swansea
Digital learning martin bazley gem conference swanseaDigital learning martin bazley gem conference swansea
Digital learning martin bazley gem conference swanseaMartin Bazley
 
Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...
Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...
Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...Karen Reiman-Sendi
 
Pg Cert HE Chapter 3 October 2010
Pg Cert HE Chapter 3 October 2010Pg Cert HE Chapter 3 October 2010
Pg Cert HE Chapter 3 October 2010NewportCELT
 
How to Get Lightning Fast Answers with Power BI Q&A and Cortana
How to Get Lightning Fast Answers with Power BI Q&A and CortanaHow to Get Lightning Fast Answers with Power BI Q&A and Cortana
How to Get Lightning Fast Answers with Power BI Q&A and CortanaVishal Pawar
 
OLE Project Regional Workshop - University of Kansas - Day 1
OLE Project Regional Workshop - University of Kansas - Day 1OLE Project Regional Workshop - University of Kansas - Day 1
OLE Project Regional Workshop - University of Kansas - Day 1Beth Warner
 
Me13sh task overview
Me13sh task overviewMe13sh task overview
Me13sh task overviewRobin Aly
 
VU University Amsterdam - The Social Web 2016 - Lecture 3
VU University Amsterdam - The Social Web 2016 - Lecture 3VU University Amsterdam - The Social Web 2016 - Lecture 3
VU University Amsterdam - The Social Web 2016 - Lecture 3Davide Ceolin
 
Data scientist enablement dse 400 week 5 roadmap
Data scientist enablement   dse 400   week 5 roadmapData scientist enablement   dse 400   week 5 roadmap
Data scientist enablement dse 400 week 5 roadmapDr. Mohan K. Bavirisetty
 
Introduction to NLP.pptx
Introduction to NLP.pptxIntroduction to NLP.pptx
Introduction to NLP.pptxbuivantan_uneti
 
Intranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your libraryIntranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your libraryChris Evjy
 
Ontology-based information extraction in the DERI Reading Group
Ontology-based information extraction in the DERI Reading GroupOntology-based information extraction in the DERI Reading Group
Ontology-based information extraction in the DERI Reading GroupTobias Wunner
 
An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...
An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...
An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...BI Brainz
 

Similar a WSDM 2011 (20)

Site search analytics workshop presentation
Site search analytics workshop presentationSite search analytics workshop presentation
Site search analytics workshop presentation
 
Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)
Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)
Lecture3: What is the DATA on the Social Web (VU Amsterdam Social Web Course)
 
Matapihi 'The National Digital Project'. The University of Auckland Library P...
Matapihi 'The National Digital Project'. The University of Auckland Library P...Matapihi 'The National Digital Project'. The University of Auckland Library P...
Matapihi 'The National Digital Project'. The University of Auckland Library P...
 
Digital learning martin bazley gem conference swansea
Digital learning martin bazley gem conference swanseaDigital learning martin bazley gem conference swansea
Digital learning martin bazley gem conference swansea
 
Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...
Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...
Can U Help? How to Deliver an Instant Messaging Service to Improve Communicat...
 
Estrat search
Estrat searchEstrat search
Estrat search
 
Sem web tutorial general
Sem web tutorial generalSem web tutorial general
Sem web tutorial general
 
Pg Cert HE Chapter 3 October 2010
Pg Cert HE Chapter 3 October 2010Pg Cert HE Chapter 3 October 2010
Pg Cert HE Chapter 3 October 2010
 
How to Get Lightning Fast Answers with Power BI Q&A and Cortana
How to Get Lightning Fast Answers with Power BI Q&A and CortanaHow to Get Lightning Fast Answers with Power BI Q&A and Cortana
How to Get Lightning Fast Answers with Power BI Q&A and Cortana
 
Misceb search2014
Misceb search2014Misceb search2014
Misceb search2014
 
OLE Project Regional Workshop - University of Kansas - Day 1
OLE Project Regional Workshop - University of Kansas - Day 1OLE Project Regional Workshop - University of Kansas - Day 1
OLE Project Regional Workshop - University of Kansas - Day 1
 
Me13sh task overview
Me13sh task overviewMe13sh task overview
Me13sh task overview
 
Day3 wayne beaton eclipse community mgt
Day3 wayne beaton eclipse  community mgtDay3 wayne beaton eclipse  community mgt
Day3 wayne beaton eclipse community mgt
 
VU University Amsterdam - The Social Web 2016 - Lecture 3
VU University Amsterdam - The Social Web 2016 - Lecture 3VU University Amsterdam - The Social Web 2016 - Lecture 3
VU University Amsterdam - The Social Web 2016 - Lecture 3
 
Virtual meetings bapmn_20170201
Virtual meetings bapmn_20170201Virtual meetings bapmn_20170201
Virtual meetings bapmn_20170201
 
Data scientist enablement dse 400 week 5 roadmap
Data scientist enablement   dse 400   week 5 roadmapData scientist enablement   dse 400   week 5 roadmap
Data scientist enablement dse 400 week 5 roadmap
 
Introduction to NLP.pptx
Introduction to NLP.pptxIntroduction to NLP.pptx
Introduction to NLP.pptx
 
Intranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your libraryIntranet 2.0 School: Building the essential staff intranet for your library
Intranet 2.0 School: Building the essential staff intranet for your library
 
Ontology-based information extraction in the DERI Reading Group
Ontology-based information extraction in the DERI Reading GroupOntology-based information extraction in the DERI Reading Group
Ontology-based information extraction in the DERI Reading Group
 
An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...
An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...
An In-Depth Look at Pinpointing and Addressing Sources of Performance Problem...
 

Último

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

WSDM 2011

  • 1. Identifying Task-based Sessions in Search Engine Query Logs Gabriele Tolomei Ca’ Foscari University ofVenice ISTI-CNR, Pisa Italy February, 12 2011 Fabrizio Silvestri Raffaele Perego Claudio Lucchese Salvatore Orlando
  • 2. Agenda • Introduction • Contributions • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 3. Agenda • Introduction • Contributions • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 4. Problem Statement:TSDP Task-based Session Discovery Problem: Discover sets of possibly non contiguous queries issued by users and collected by Web Search Engine Query Logs whose aim is to carry out specific “tasks” 4 Gabriele Tolomei - February, 12 2011
  • 5. Background • What is a Web task? 5 Gabriele Tolomei - February, 12 2011
  • 6. Background • What is a Web task? • A “template” for representing any (atomic) activity that can be achieved by exploiting the information available on the Web, e.g., “find a recipe”,“book a flight”,“read news”, etc. 5 Gabriele Tolomei - February, 12 2011
  • 7. Background • What is a Web task? • A “template” for representing any (atomic) activity that can be achieved by exploiting the information available on the Web, e.g., “find a recipe”,“book a flight”,“read news”, etc. • Why WSE Query Logs? 5 Gabriele Tolomei - February, 12 2011
  • 8. Background • What is a Web task? • A “template” for representing any (atomic) activity that can be achieved by exploiting the information available on the Web, e.g., “find a recipe”,“book a flight”,“read news”, etc. • Why WSE Query Logs? • Users rely on WSEs for satisfying their information needs by issuing possibly interleaved stream of related queries 5 Gabriele Tolomei - February, 12 2011
  • 9. Background • What is a Web task? • A “template” for representing any (atomic) activity that can be achieved by exploiting the information available on the Web, e.g., “find a recipe”,“book a flight”,“read news”, etc. • Why WSE Query Logs? • Users rely on WSEs for satisfying their information needs by issuing possibly interleaved stream of related queries • WSEs collect the search activities, i.e., sessions, of their users by means of issued queries, timestamps, clicked results, etc. 5 Gabriele Tolomei - February, 12 2011
  • 10. Background • What is a Web task? • A “template” for representing any (atomic) activity that can be achieved by exploiting the information available on the Web, e.g., “find a recipe”,“book a flight”,“read news”, etc. • Why WSE Query Logs? • Users rely on WSEs for satisfying their information needs by issuing possibly interleaved stream of related queries • WSEs collect the search activities, i.e., sessions, of their users by means of issued queries, timestamps, clicked results, etc. • User search sessions (especially long-term ones) might contain interesting patterns that can be mined, e.g., sub-sessions whose queries aim to perform the same Web task 5 Gabriele Tolomei - February, 12 2011
  • 11. Motivation • “Addiction to Web search”: no matter what your information need is, ask it to a WSE and it will give you the answer, e.g., people querying Google for “google”! 6 Gabriele Tolomei - February, 12 2011
  • 12. Motivation • “Addiction to Web search”: no matter what your information need is, ask it to a WSE and it will give you the answer, e.g., people querying Google for “google”! • Everyone who is now at WSDM 2011 has dealt with a lot of “stuff” for organizing her/his attendance 6 Gabriele Tolomei - February, 12 2011
  • 13. Motivation • “Addiction to Web search”: no matter what your information need is, ask it to a WSE and it will give you the answer, e.g., people querying Google for “google”! • Everyone who is now at WSDM 2011 has dealt with a lot of “stuff” for organizing her/his attendance • Conference Web site is full of useful information but still some tasks have to be performed (e.g., book flight, reserve hotel room, rent car, etc.) 6 Gabriele Tolomei - February, 12 2011
  • 14. Motivation • “Addiction to Web search”: no matter what your information need is, ask it to a WSE and it will give you the answer, e.g., people querying Google for “google”! • Everyone who is now at WSDM 2011 has dealt with a lot of “stuff” for organizing her/his attendance • Conference Web site is full of useful information but still some tasks have to be performed (e.g., book flight, reserve hotel room, rent car, etc.) • Discovering tasks from WSE logs will allow us to better understand user search intents at a “higher level of abstraction”: • from query-by-query to task-by-task Web search 6 Gabriele Tolomei - February, 12 2011
  • 15. The Big Picture query hong kong flights ... 7 Gabriele Tolomei - February, 12 2011
  • 16. The Big Picture query fly to hong kong ... 7 Gabriele Tolomei - February, 12 2011
  • 17. The Big Picture query nba sport news ... 7 Gabriele Tolomei - February, 12 2011
  • 18. The Big Picture query pisa to hong kong ... 7 Gabriele Tolomei - February, 12 2011
  • 19. The Big Picture ... ...... 7 Gabriele Tolomei - February, 12 2011 long-term session
  • 20. The Big Picture ... ...... 7 1 2 n ... Gabriele Tolomei - February, 12 2011 Δt > tφ long-term session
  • 21. The Big Picture 7 1 2 ... n Gabriele Tolomei - February, 12 2011
  • 22. The Big Picture 7 1 2 ... n fly to Hong Kong nba news shopping in Hong Kong Gabriele Tolomei - February, 12 2011
  • 23. Related Work • Previous work on session identification can be classified into: 1. time-based 2. content-based 3. novel heuristics (combining 1. and 2.) 8 Gabriele Tolomei - February, 12 2011
  • 24. Related Work: time-based • 1999: Silverstein et al. [1] firstly defined the concept of “session”: • 2 adjacent queries (qi, qi+1) are part of the same session if their time submission gap is at most 5 minutes • 2000: He and Göker [2] used different timeouts to split user sessions (from 1 to 50 minutes) • 2006: Jansen and Spink [4] described a session as the time gap between the first and last recorded timestamp on the WSE server 9 Gabriele Tolomei - February, 12 2011
  • 25. Related Work: time-based • 1999: Silverstein et al. [1] firstly defined the concept of “session”: • 2 adjacent queries (qi, qi+1) are part of the same session if their time submission gap is at most 5 minutes • 2000: He and Göker [2] used different timeouts to split user sessions (from 1 to 50 minutes) • 2006: Jansen and Spink [4] described a session as the time gap between the first and last recorded timestamp on the WSE server PROs ✓ ease of implementation CONs ✓ unable to deal with multi- tasking behaviors 9 Gabriele Tolomei - February, 12 2011
  • 26. Related Work: content-based • Some work exploit lexical content of the queries for determining a topic shift in the stream, i.e., session boundary [3, 5, 6, 7] • Several string similarity scores have been proposed, e.g., Levenshtein, Jaccard, etc. • 2005: Shen et al. [8] compared “expanded representation” of queries • expansion of a query q is obtained by concatenating titles and Web snippets for the top-50 results provided by a WSE for q 10 Gabriele Tolomei - February, 12 2011
  • 27. Related Work: content-based • Some work exploit lexical content of the queries for determining a topic shift in the stream, i.e., session boundary [3, 5, 6, 7] • Several string similarity scores have been proposed, e.g., Levenshtein, Jaccard, etc. • 2005: Shen et al. [8] compared “expanded representation” of queries • expansion of a query q is obtained by concatenating titles and Web snippets for the top-50 results provided by a WSE for q PROs ✓ effectiveness improvement CONs ✓ vocabulary-mismatch problem: e.g., (“nba”,“kobe bryant”) 10 Gabriele Tolomei - February, 12 2011
  • 28. Related Work: novel • 2005: Radlinski and Joachims [3] introduced query chains, i.e., sequence of queries with similar information need • 2008: Boldi et al. [9] introduce the query-flow graph as a model for representing WSE log data • session identification as Traveling Salesman Problem • 2008: Jones and Klinkner [10] address a problem similar to the TSDP • hierarchical search: mission vs. goal • supervised approach: learn a suitable binary classifier to detect whether two queries (qi, qj) belong to the same task or not 11 Gabriele Tolomei - February, 12 2011
  • 29. Related Work: novel • 2005: Radlinski and Joachims [3] introduced query chains, i.e., sequence of queries with similar information need • 2008: Boldi et al. [9] introduce the query-flow graph as a model for representing WSE log data • session identification as Traveling Salesman Problem • 2008: Jones and Klinkner [10] address a problem similar to the TSDP • hierarchical search: mission vs. goal • supervised approach: learn a suitable binary classifier to detect whether two queries (qi, qj) belong to the same task or not PROs ✓ effectiveness improvement CONs ✓ computational complexity 11 Gabriele Tolomei - February, 12 2011
  • 30. Agenda • Introduction • Contributions • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 31. Outline • Formalize the Task-based Session Discovery Problem 13 Gabriele Tolomei - February, 12 2011
  • 32. Outline • Formalize the Task-based Session Discovery Problem • Analyze a real long-term WSE log of queries 13 Gabriele Tolomei - February, 12 2011
  • 33. Outline • Formalize the Task-based Session Discovery Problem • Analyze a real long-term WSE log of queries • Build a ground-truth of tasks by manually grouping a sample of task-related queries in the given WSE log 13 Gabriele Tolomei - February, 12 2011
  • 34. Outline • Formalize the Task-based Session Discovery Problem • Analyze a real long-term WSE log of queries • Build a ground-truth of tasks by manually grouping a sample of task-related queries in the given WSE log • Perform some statistics on top of the ground-truth 13 Gabriele Tolomei - February, 12 2011
  • 35. Outline • Formalize the Task-based Session Discovery Problem • Analyze a real long-term WSE log of queries • Build a ground-truth of tasks by manually grouping a sample of task-related queries in the given WSE log • Perform some statistics on top of the ground-truth • Propose several techniques for addressing the TSDP 13 Gabriele Tolomei - February, 12 2011
  • 36. Agenda • Introduction • Contributions • Query Log Analysis • Ground-truth Analysis • Approaching TSDP • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 37. Data Set:AOL Query Log 15 Original Data Set ✓ 3-months collection ✓ ~20M queries ✓ ~657K users Gabriele Tolomei - February, 12 2011
  • 38. Data Set:AOL Query Log 15 Original Data Set Sample Data Set ✓ 1-week collection ✓ ~100K queries ✓ 1,000 users ✓ removed empty queries ✓ removed “non-sense” queries ✓ removed stop-words ✓ applied Porter stemming algorithm ✓ 3-months collection ✓ ~20M queries ✓ ~657K users Gabriele Tolomei - February, 12 2011
  • 39. 16 Data Analysis: query time gap Gabriele Tolomei - February, 12 2011
  • 40. 16 Data Analysis: query time gap Gabriele Tolomei - February, 12 2011 tφ = 26 min. 84.1% of adjacent query pairs are issued within 26 minutes
  • 41. Agenda • Introduction • Contributions • Query Log Analysis • Ground-truth Analysis • Approaching TSDP • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 42. • Long-term sessions of sample data set are first split using the threshold tφ devised before (i.e., 26 minutes) • obtaining several time-gap sessions 18 Ground-truth: construction Gabriele Tolomei - February, 12 2011
  • 43. • Long-term sessions of sample data set are first split using the threshold tφ devised before (i.e., 26 minutes) • obtaining several time-gap sessions • Human annotators group queries that they claim to be task-related inside each time-gap session 18 Ground-truth: construction Gabriele Tolomei - February, 12 2011
  • 44. • Long-term sessions of sample data set are first split using the threshold tφ devised before (i.e., 26 minutes) • obtaining several time-gap sessions • Human annotators group queries that they claim to be task-related inside each time-gap session • Represents the true task-based partitioning manually built from actual WSE query log data 18 Ground-truth: construction Gabriele Tolomei - February, 12 2011
  • 45. • Long-term sessions of sample data set are first split using the threshold tφ devised before (i.e., 26 minutes) • obtaining several time-gap sessions • Human annotators group queries that they claim to be task-related inside each time-gap session • Represents the true task-based partitioning manually built from actual WSE query log data • Useful both for statistical purposes and evaluation of automatic task-based session discovery methods 18 Ground-truth: construction Gabriele Tolomei - February, 12 2011
  • 46. 19 Ground-truth: statistics ✓ 2,004 queries ✓ 446 time-gap sessions ✓ 1,424 annotated queries ✓ 307 annotated time-gap sessions ✓ 554 detected task-based sessions Gabriele Tolomei - February, 12 2011
  • 47. 20 Ground-truth: statistics Gabriele Tolomei - February, 12 2011 ✓ 4.49 avg. queries per time-gap session ✓ more than 70% time-gap session contains at most 5 queries
  • 48. 20 Ground-truth: statistics Gabriele Tolomei - February, 12 2011 ✓ 4.49 avg. queries per time-gap session ✓ more than 70% time-gap session contains at most 5 queries ✓ 2.57 avg. queries per task ✓ ~75% tasks contains at most 3 queries
  • 49. 20 Ground-truth: statistics Gabriele Tolomei - February, 12 2011 ✓ 4.49 avg. queries per time-gap session ✓ more than 70% time-gap session contains at most 5 queries ✓ 2.57 avg. queries per task ✓ ~75% tasks contains at most 3 queries ✓ 1.80 avg. task per time- gap session ✓ ~47% time-gap session contains more than one task (multi-tasking) ✓ 1,046 over 1,424 queries (i.e., ~74%) included in multi-tasking sessions
  • 50. 21 Ground-truth: statistics ✓ overlapping degree of multi-tasking sessions ✓ jump occurs whenever two queries of the same task are not originally adjacent ✓ ratio of task in a time-gap session that contains at least one jump Gabriele Tolomei - February, 12 2011
  • 51. Agenda • Introduction • Contributions • Query Log Analysis • Ground-truth Analysis • Approaching TSDP • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 52. 23 TSDP: approaches 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. Gabriele Tolomei - February, 12 2011
  • 53. 23 TSDP: approaches 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) Gabriele Tolomei - February, 12 2011
  • 54. 23 TSDP: approaches 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) CONs: ✓ unable to deal with multi-tasking ✓ unawareness of other discriminating query features (e.g., lexical content) Gabriele Tolomei - February, 12 2011
  • 55. 23 TSDP: approaches 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) Methods: TS-5,TS-15,TS-26, etc. CONs: ✓ unable to deal with multi-tasking ✓ unawareness of other discriminating query features (e.g., lexical content) Gabriele Tolomei - February, 12 2011
  • 56. 23 TSDP: approaches 2) QueryClustering-m Description: Queries are grouped using clustering algorithms, which exploit several query features. Clustering algorithms assembly such features using two different distance functions for computing query- pair similarity. Two queries (qi, qj) are in the same task-based session if and only if they are in the same cluster. 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) Methods: TS-5,TS-15,TS-26, etc. CONs: ✓ unable to deal with multi-tasking ✓ unawareness of other discriminating query features (e.g., lexical content) Gabriele Tolomei - February, 12 2011
  • 57. 23 TSDP: approaches 2) QueryClustering-m Description: Queries are grouped using clustering algorithms, which exploit several query features. Clustering algorithms assembly such features using two different distance functions for computing query- pair similarity. Two queries (qi, qj) are in the same task-based session if and only if they are in the same cluster. PROs: ✓ able to detect multi-tasking sessions ✓ able to deal with “noisy queries” (i.e., outliers) 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) Methods: TS-5,TS-15,TS-26, etc. CONs: ✓ unable to deal with multi-tasking ✓ unawareness of other discriminating query features (e.g., lexical content) Gabriele Tolomei - February, 12 2011
  • 58. 23 TSDP: approaches 2) QueryClustering-m Description: Queries are grouped using clustering algorithms, which exploit several query features. Clustering algorithms assembly such features using two different distance functions for computing query- pair similarity. Two queries (qi, qj) are in the same task-based session if and only if they are in the same cluster. PROs: ✓ able to detect multi-tasking sessions ✓ able to deal with “noisy queries” (i.e., outliers) CONs: ✓ O(n2) time complexity (i.e. quadratic in the number n of queries due to all-pairs-similarity computational step) 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) Methods: TS-5,TS-15,TS-26, etc. CONs: ✓ unable to deal with multi-tasking ✓ unawareness of other discriminating query features (e.g., lexical content) Gabriele Tolomei - February, 12 2011
  • 59. 23 TSDP: approaches 2) QueryClustering-m Description: Queries are grouped using clustering algorithms, which exploit several query features. Clustering algorithms assembly such features using two different distance functions for computing query- pair similarity. Two queries (qi, qj) are in the same task-based session if and only if they are in the same cluster. PROs: ✓ able to detect multi-tasking sessions ✓ able to deal with “noisy queries” (i.e., outliers) CONs: ✓ O(n2) time complexity (i.e. quadratic in the number n of queries due to all-pairs-similarity computational step) Methods: QC-MEANS, QC-SCAN, QC-WCC, and QC-HTC 1) TimeSplitting-t Description: The idea is that if two consecutive queries are far away enough then they are also likely to be unrelated. Two consecutive queries (qi, qi+1) are in the same task-based session if and only if their time submission gap is lower than a certain threshold t. PROs: ✓ ease of implementation ✓ O(n) time complexity (linear in the number n of queries) Methods: TS-5,TS-15,TS-26, etc. CONs: ✓ unable to deal with multi-tasking ✓ unawareness of other discriminating query features (e.g., lexical content) Gabriele Tolomei - February, 12 2011
  • 60. 24 Query Features Gabriele Tolomei - February, 12 2011 Content-based (µcontent) ✓ two queries (qi, qj) sharing common terms are likely related ✓ µjaccard: Jaccard index on query character 3-grams ✓ µlevenshtein: normalized Levenshtein distance
  • 61. 24 Query Features Gabriele Tolomei - February, 12 2011 Semantic-based (µsemantic) ✓ using Wikipedia and Wiktionary for “expanding” a query q ✓ “wikification” of q using vector-space model ✓ relatedness between (qi, qj) computed using cosine-similarity Content-based (µcontent) ✓ two queries (qi, qj) sharing common terms are likely related ✓ µjaccard: Jaccard index on query character 3-grams ✓ µlevenshtein: normalized Levenshtein distance
  • 62. 25 Distance Functions: µ1 vs. µ2 Gabriele Tolomei - February, 12 2011 ✓ Convex combination µ1 ✓ Conditional formula µ2 Idea: if two queries are close in term of lexical content, the semantic expansion could be unhelpful.Vice-versa, nothing can be said when queries do not share any content feature ✓ Both µ1 and µ2 rely on the estimation of some parameters, i.e., α, t, and b ✓ Use ground-truth for tuning parameters
  • 63. • Models each time-gap session φ as a complete weighted undirected graph Gφ = (V, E, w) 26 QC-WCC Gabriele Tolomei - February, 12 2011
  • 64. • Models each time-gap session φ as a complete weighted undirected graph Gφ = (V, E, w) • set of nodesV are the queries in φ 26 QC-WCC Gabriele Tolomei - February, 12 2011
  • 65. • Models each time-gap session φ as a complete weighted undirected graph Gφ = (V, E, w) • set of nodesV are the queries in φ • set of edges E are weighted by the similarity of the corresponding nodes 26 QC-WCC Gabriele Tolomei - February, 12 2011
  • 66. • Models each time-gap session φ as a complete weighted undirected graph Gφ = (V, E, w) • set of nodesV are the queries in φ • set of edges E are weighted by the similarity of the corresponding nodes • Drop weak edges, i.e., with low similarity, assuming the corresponding queries are not related and obtaining G’φ 26 QC-WCC Gabriele Tolomei - February, 12 2011
  • 67. • Models each time-gap session φ as a complete weighted undirected graph Gφ = (V, E, w) • set of nodesV are the queries in φ • set of edges E are weighted by the similarity of the corresponding nodes • Drop weak edges, i.e., with low similarity, assuming the corresponding queries are not related and obtaining G’φ • Clusters are built on the basis of strong edges by finding all the connected components of the pruned graph G’φ 26 QC-WCC Gabriele Tolomei - February, 12 2011
  • 68. • Models each time-gap session φ as a complete weighted undirected graph Gφ = (V, E, w) • set of nodesV are the queries in φ • set of edges E are weighted by the similarity of the corresponding nodes • Drop weak edges, i.e., with low similarity, assuming the corresponding queries are not related and obtaining G’φ • Clusters are built on the basis of strong edges by finding all the connected components of the pruned graph G’φ • O(n2) time complexity where n = |V| 26 QC-WCC Gabriele Tolomei - February, 12 2011
  • 69. 27 QC-WCC Gabriele Tolomei - February, 12 2011 1 8765432φ
  • 70. 27 QC-WCC Gabriele Tolomei - February, 12 2011 1 2 3 4 56 7 8 Build similarity graph Gφ
  • 71. 27 QC-WCC Gabriele Tolomei - February, 12 2011 1 2 3 4 56 7 8 Drop “weak edges”
  • 72. 27 QC-WCC Gabriele Tolomei - February, 12 2011 1 2 3 4 56 7 8
  • 73. • Variation of QC-WCC based on head-tail components 28 QC-HTC Gabriele Tolomei - February, 12 2011
  • 74. • Variation of QC-WCC based on head-tail components • Does not need to compute the full similarity graph 28 QC-HTC Gabriele Tolomei - February, 12 2011
  • 75. • Variation of QC-WCC based on head-tail components • Does not need to compute the full similarity graph • Exploits the sequentiality of query submissions to reduce the number of similarity computations 28 QC-HTC Gabriele Tolomei - February, 12 2011
  • 76. • Variation of QC-WCC based on head-tail components • Does not need to compute the full similarity graph • Exploits the sequentiality of query submissions to reduce the number of similarity computations • Performs 2 steps: 1. sequential clustering 2. merging 28 QC-HTC Gabriele Tolomei - February, 12 2011
  • 77. • Partition each time-gap session into sequential clusters containing only queries issued in a row 29 QC-HTC: sequential clustering Gabriele Tolomei - February, 12 2011
  • 78. • Partition each time-gap session into sequential clusters containing only queries issued in a row • Each query in every sequential cluster has to be “similar enough” to the chronologically next one 29 QC-HTC: sequential clustering Gabriele Tolomei - February, 12 2011
  • 79. • Partition each time-gap session into sequential clusters containing only queries issued in a row • Each query in every sequential cluster has to be “similar enough” to the chronologically next one • Need to compute only the similarity between one query and the next in the original data 29 QC-HTC: sequential clustering Gabriele Tolomei - February, 12 2011
  • 80. • Merge together related sequential clusters due to multi-tasking 30 QC-HTC: merging Gabriele Tolomei - February, 12 2011
  • 81. • Merge together related sequential clusters due to multi-tasking • Hyp: a cluster is represented by its chronologically- first and last queries, i.e., head and tail, respectively 30 QC-HTC: merging Gabriele Tolomei - February, 12 2011
  • 82. • Merge together related sequential clusters due to multi-tasking • Hyp: a cluster is represented by its chronologically- first and last queries, i.e., head and tail, respectively • Given two sequential clusters ci, cj and hi, ti, and hj, tj, their corresponding head and tail queries the similarity s(ci, cj) is computed as follow: 30 QC-HTC: merging Gabriele Tolomei - February, 12 2011
  • 83. • Merge together related sequential clusters due to multi-tasking • Hyp: a cluster is represented by its chronologically- first and last queries, i.e., head and tail, respectively • Given two sequential clusters ci, cj and hi, ti, and hj, tj, their corresponding head and tail queries the similarity s(ci, cj) is computed as follow: 30 QC-HTC: merging s(ci, cj) = min w(e(qi, qj)) s.t. qi ∈ {hi, ti} and qj ∈ {hj, tj} Gabriele Tolomei - February, 12 2011
  • 84. • Merge together related sequential clusters due to multi-tasking • Hyp: a cluster is represented by its chronologically- first and last queries, i.e., head and tail, respectively • Given two sequential clusters ci, cj and hi, ti, and hj, tj, their corresponding head and tail queries the similarity s(ci, cj) is computed as follow: 30 QC-HTC: merging s(ci, cj) = min w(e(qi, qj)) s.t. qi ∈ {hi, ti} and qj ∈ {hj, tj} • ci and cj are merged as long as s(ci, cj) > η • hi, ti and hj, tj are updated consequently Gabriele Tolomei - February, 12 2011
  • 85. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 8765432φ
  • 86. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 2 1) Sequential Clustering
  • 87. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 2 3 1) Sequential Clustering
  • 88. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 2 3 4 1) Sequential Clustering
  • 89. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 2 3 4 56 7 8
  • 90. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 2 3 4 56 7 8 2) Merging
  • 91. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 2 3 4 56 7 8 2) Merging
  • 92. 31 QC-HTC Gabriele Tolomei - February, 12 2011 1 2 4 56 7 8
  • 93. • In the first step the algorithm computes the similarity only between one query and the next in the original data • O(n) where n is the size of the time-gap session 32 QC-HTC: time complexity Gabriele Tolomei - February, 12 2011
  • 94. • In the first step the algorithm computes the similarity only between one query and the next in the original data • O(n) where n is the size of the time-gap session • In the second step the algorithm computes the pairwise similarity between each sequential cluster • O(k2) where k is the number of sequential clusters • if k = β·n with 0<β≤1 then time complexity is O(β2·n2) • e.g. β = 1/2 O(n2/4) 4 times better than QC-WCC 32 QC-HTC: time complexity Gabriele Tolomei - February, 12 2011
  • 95. Agenda • Introduction • Contributions • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 96. • Run and compare all the proposed approaches with: 34 Setup Gabriele Tolomei - February, 12 2011
  • 97. • Run and compare all the proposed approaches with: • TS-26: time-splitting technique (baseline) 34 Setup Gabriele Tolomei - February, 12 2011
  • 98. • Run and compare all the proposed approaches with: • TS-26: time-splitting technique (baseline) • QFG: session extraction method based on the query-flow graph model (state of the art) 34 Setup Gabriele Tolomei - February, 12 2011
  • 99. • Measure the degree of correspondence between true tasks, i.e., manually-extracted ground-truth, and predicted tasks, i.e., output by algorithms 35 Evaluation Gabriele Tolomei - February, 12 2011
  • 100. • Measure the degree of correspondence between true tasks, i.e., manually-extracted ground-truth, and predicted tasks, i.e., output by algorithms 35 Evaluation a) F-MEASURE ✓ evaluates the extent to which a predicted task contains only and all the queries of a true task ✓ combines p(i, j) and r(i, j) the precision and recall of task i w.r.t. class j Gabriele Tolomei - February, 12 2011
  • 101. • Measure the degree of correspondence between true tasks, i.e., manually-extracted ground-truth, and predicted tasks, i.e., output by algorithms 35 Evaluation a) F-MEASURE ✓ evaluates the extent to which a predicted task contains only and all the queries of a true task ✓ combines p(i, j) and r(i, j) the precision and recall of task i w.r.t. class j b) RAND ✓ pairs of queries instead of singleton ✓ f00, f01, f10, f11 Gabriele Tolomei - February, 12 2011
  • 102. • Measure the degree of correspondence between true tasks, i.e., manually-extracted ground-truth, and predicted tasks, i.e., output by algorithms 35 Evaluation a) F-MEASURE ✓ evaluates the extent to which a predicted task contains only and all the queries of a true task ✓ combines p(i, j) and r(i, j) the precision and recall of task i w.r.t. class j b) RAND ✓ pairs of queries instead of singleton ✓ f00, f01, f10, f11 c) JACCARD ✓ pairs of queries instead of singleton ✓ f01, f10, f11 Gabriele Tolomei - February, 12 2011
  • 103. • 3 time thresholds used: 5, 15, and 26 minutes 36 Results:TS-t Gabriele Tolomei - February, 12 2011
  • 104. • 3 time thresholds used: 5, 15, and 26 minutes • Note:TS-26 was used for splitting sample data set • task-based sessions == time-gap sessions 36 Results:TS-t Gabriele Tolomei - February, 12 2011
  • 105. 37 Results: QFG Gabriele Tolomei - February, 12 2011 ✓ trained on a segment of our sample data set ✓ best results using η = 0.7 ✓ vs. baseline: • +16% F-measure • +52% Rand • +15% Jaccard
  • 106. 38 Results: QC-WCC Gabriele Tolomei - February, 12 2011 ✓ best results using µ2 and η = 0.3 ✓ vs. baseline: • +20% F-measure • +56% Rand • +23% Jaccard ✓ vs. QFG: • +5% F-measure • +9% Rand • +10% Jaccard
  • 107. 39 Results: QC-HTC Gabriele Tolomei - February, 12 2011 ✓ best results using µ2 and η = 0.3 ✓ vs. baseline: • +19% F-measure • +56% Rand • +21% Jaccard ✓ vs. QFG: • +4% F-measure • +9% Rand • +8% Jaccard
  • 108. 40 Results: best Gabriele Tolomei - February, 12 2011
  • 109. • Benefit of using Wikipedia instead of only lexical content when computing query distance function 41 Results:Wiki impact Gabriele Tolomei - February, 12 2011
  • 110. • Benefit of using Wikipedia instead of only lexical content when computing query distance function • Capturing other two queries that are lexically different but somehow “semantically” similar 41 Results:Wiki impact Gabriele Tolomei - February, 12 2011
  • 111. • Benefit of using Wikipedia instead of only lexical content when computing query distance function • Capturing other two queries that are lexically different but somehow “semantically” similar • Try going here: http://en.wikipedia.org/wiki/Cancun 41 Results:Wiki impact Gabriele Tolomei - February, 12 2011
  • 113. Agenda • Introduction • Contributions • Experiments and Results • Conclusions and Future Work Gabriele Tolomei - February, 12 2011
  • 114. • Introduced the Task-based Session Discovery Problem • from a WSE log of user activities extract several sets of queries which are all related to the same task 43 Conclusions Gabriele Tolomei - February, 12 2011
  • 115. • Introduced the Task-based Session Discovery Problem • from a WSE log of user activities extract several sets of queries which are all related to the same task • Compared clustering solutions exploiting two distance functions based on query content and semantic expansion (i.e.,Wiktionary and Wikipedia) 43 Conclusions Gabriele Tolomei - February, 12 2011
  • 116. • Introduced the Task-based Session Discovery Problem • from a WSE log of user activities extract several sets of queries which are all related to the same task • Compared clustering solutions exploiting two distance functions based on query content and semantic expansion (i.e.,Wiktionary and Wikipedia) • Proposed novel graph-based heuristic QC-HTC, lighter than QC-WCC, outperforming other methods in terms of F-measure, Rand and Jaccard index 43 Conclusions Gabriele Tolomei - February, 12 2011
  • 117. • Why should we stop here? 44 Future Work Gabriele Tolomei - February, 12 2011
  • 118. • Why should we stop here? • Once discovered, smaller tasks might be part of larger and more complex tasks 44 Future Work Gabriele Tolomei - February, 12 2011
  • 119. • Why should we stop here? • Once discovered, smaller tasks might be part of larger and more complex tasks • The task “fly to Hong Kong” might be a step of a larger task, e.g.,“holidays in Hong Kong”, which in turn could involve several other tasks... 44 Future Work Gabriele Tolomei - February, 12 2011
  • 120. • Make Web Search Engine the “universal driver” for executing our daily activities on the Web 45 Vision Gabriele Tolomei - February, 12 2011
  • 121. • Make Web Search Engine the “universal driver” for executing our daily activities on the Web • Once user types in a query,WSE should “infer the tasks” user aims to perform (if any) serendipity! 45 Vision Gabriele Tolomei - February, 12 2011
  • 122. • Make Web Search Engine the “universal driver” for executing our daily activities on the Web • Once user types in a query,WSE should “infer the tasks” user aims to perform (if any) serendipity! • Results should be no longer only list of plain links but also tasks, either simple and complex 45 Vision Gabriele Tolomei - February, 12 2011
  • 123. • Make Web Search Engine the “universal driver” for executing our daily activities on the Web • Once user types in a query,WSE should “infer the tasks” user aims to perform (if any) serendipity! • Results should be no longer only list of plain links but also tasks, either simple and complex • Recommendation of queries and/or Web pages both intra- and inter-task 45 Vision Gabriele Tolomei - February, 12 2011 task vs. query recommendation
  • 124. 46 References Gabriele Tolomei - February, 12 2011 [1] Silverstein, Marais, Henzinger, and Moricz.“Analysis of a very large web search engine query log”. In SIGIR Forum, 1999 [2] He and Göker.“Detecting session boundaries from web user logs”. In BCS-IRSG, 2000 [3] Radlinski and Joachims.“Query chains: Learning to rank from implicit feedback”. In KDD '05 [4] Jansen and Spink.“How are we searching the world wide web?: a comparison of nine search engine transaction logs”. In IPM, 2006 [5] Lau and Horvitz.“Patterns of search:Analyzing and modeling web query refinement”. In UM '99 [6] He and Harper.“Combining evidence for automatic web session identification”. In IPM, 2002 [7] Ozmutlu and Çavdur.“Application of automatic topic identification on excite web search engine data logs”. In IPM, 2005 [8] Shen,Tan, and Zhai.“Implicit user modeling for personalized search”. In CIKM '05 [9] Boldi, Bonchi, Castillo, Donato, Gionis, andVigna.“The query-flow graph: model and applications”. In CIKM '08 [10] Jones and Klinkner.“Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs”. In CIKM '08 [11] MacQueen.“Some methods for classification and analysis of multivariate observations”. In BSMSP, 1967 [12] Ester, Kriegel, Sander, and Xu.“A density-based algorithm for discovering clusters in large spatial databases with noise”. In KDD '96