Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" → "detroit lions". Likewise, "london" → "things to do in london" and "new york" → "new york tourist attractions" can also be considered similar transitions in intent. The reformulations "movies" → "new movies" and "york" → "new york", however, are clearly different despite their lexical similarities. In this paper, we study the distributed representations of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model (CLSM), and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties, such as mapping semantically and syntactically similar query changes closer together in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that using features based on both representations together achieves better performance than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
Exploring Session Context using Distributed Representations of Queries and Reformulations (SIGIR 2015)
1. Exploring Session Context using Distributed Representations of Queries and Reformulations
Bhaskar Mitra
Microsoft
(Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728)
6. Questions
• Can we learn intuitively “meaningful” vector representations for query reformulations?
• Can we use them for modelling session context for tasks such as query auto-completion (QAC)?
8. Session Context for QAC
[Figure: For the prefix "f", a context-free QAC ranker suggests "facebook", "fandango", "forever 21", "fox news"; given the previous query "muscle cars", a session-aware ranker instead suggests "facebook", "ford", "ford mustang", "fast and furious".]
9. Session Context
What’s the more likely query after “big ben”?
Topical disambiguation (symmetrical) vs. transition likelihood (asymmetrical)
[Figure: after "big ben", candidate next queries include "big ben height" and "london clock tower".]
10. Distributed Representation
A (low-dimensional) vector representation for items (e.g., words, sentences, images, etc.) such that all the values in a vector are necessary to determine the exact item.
Imaginary example: [6 3 0 4 1 7 2 8]
Also called embeddings.
11. As opposed to…
One-hot representation scheme, where all except one of the values of the vector are zeros.
Imaginary example: [0 1 0 0 0 0 0 0]
12. For Neural Networks…
Localist Representations
• One neuron to represent each item
• One-to-one relationship
• For few items / classes only
Distributed Representations
• Multiple neurons to represent each item
• Many-to-many relationship
• For many items with shared attributes
13. Vector Algebra on Word Embeddings
Word2vec linguistic regularities
vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)
T. Mikolov, et al. Efficient estimation of word representations in vector space. arXiv preprint, 2013.
T. Mikolov, et al. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
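To make the offset-vector intuition concrete, here is a minimal sketch of the analogy operation, assuming `emb` is a hypothetical dict mapping words to unit-normalised embedding vectors (e.g., loaded from pre-trained word2vec); with real embeddings, `analogy(emb, "man", "king", "woman")` should return "queen".

```python
import numpy as np

def analogy(emb, a, b, c, topk=1):
    # Find words w maximising cosine(vec(b) - vec(a) + vec(c), vec(w)),
    # i.e. "a is to b as c is to ?". emb maps word -> unit-normalised vector.
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    scores = {w: float(v @ target) for w, v in emb.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topk]
```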
14. Convolutional Latent Semantic Model
• DNN trained on clickthrough data
• Maximize cosine similarity
• Tri-gram hashing over raw terms
• Convolutional-Pooling structure
P.-S. Huang, et al. Learning deep structured semantic models for web search using clickthrough data. CIKM, 2013.
Y. Shen, et al. Learning semantic representations using convolutional neural networks for web search. WWW, 2014.
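As a rough illustration of the tri-gram hashing step (a sketch of DSSM/CLSM-style word hashing, not the authors' implementation), each term is padded with boundary markers and decomposed into letter tri-grams:

```python
from collections import Counter

def letter_trigrams(query):
    # "cat" is padded to "#cat#" and hashed to {"#ca", "cat", "at#"};
    # the query becomes a sparse count vector over the tri-gram vocabulary,
    # which the convolutional layers of the model then consume.
    grams = Counter()
    for term in query.lower().split():
        padded = f"#{term}#"
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams
```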
16. Main Contributions
• CLSM models trained on Session Pairs (SP)
• Demonstrate semantic regularities in the CLSM query embedding space
• Leverage the regularities to explicitly represent query reformulations as vectors (see the sketch below)
• Improved Mean Reciprocal Rank (MRR) for session context-aware QAC ranking by more than 10% using CLSM-based features
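A minimal sketch of the reformulation-as-vector idea, assuming `clsm_embed` is a hypothetical function returning a query's CLSM embedding: the reformulation q1 → q2 is represented by the offset between the two query vectors.

```python
import numpy as np

def reformulation_vector(clsm_embed, q_from, q_to):
    # Offset vector for the reformulation q_from -> q_to. Semantically
    # similar reformulations (e.g. "san francisco" -> "san francisco 49ers"
    # and "detroit" -> "detroit lions") should lie close together here.
    return np.asarray(clsm_embed(q_to)) - np.asarray(clsm_embed(q_from))
```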
17. Training on Session Pairs
• Pairs of consecutive queries from search sessions
• Pre-Query and Post-Query model
• Symmetric vs. Asymmetric models
[Figure: a session q1 → q2 → q3 → q4 yields the consecutive pairs (q1, q2), (q2, q3), (q3, q4).]
Advantages
1. Demonstrates higher levels of reformulation regularities (discussed next)
2. Trains on time-stamped query logs, no need for clickthrough data
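Extracting the training pairs is straightforward; a minimal sketch, assuming sessions are already segmented into ordered query lists:

```python
def session_pairs(session):
    # A session ["q1", "q2", "q3", "q4"] yields the consecutive pairs
    # (q1, q2), (q2, q3), (q3, q4); a symmetric variant would also add
    # each pair reversed.
    return list(zip(session, session[1:]))
```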
22. Session Context-Aware QAC
• Evaluation setup based on Shokouhi (SIGIR 2013):
  • Temporally separated background, train, validation and test sets
  • Sample queries and extract all possible prefixes
  • Submitted query as ground truth
• Re-rank top N suggestion candidates using a LambdaMART model
• Two testbeds: search logs from AOL & Bing
M. Shokouhi. Learning to personalize query auto-completion. SIGIR, 2013.
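For reference, MRR over QAC impressions can be computed as below (a minimal sketch; each impression pairs a re-ranked suggestion list with the query the user actually submitted):

```python
def mean_reciprocal_rank(ranked_lists, ground_truths):
    # Reciprocal rank of the submitted query within each re-ranked
    # suggestion list (0 if absent), averaged over all impressions.
    total = 0.0
    for suggestions, truth in zip(ranked_lists, ground_truths):
        rr = 0.0
        for rank, s in enumerate(suggestions, start=1):
            if s == truth:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)
```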
23. Features
• Non-contextual features: prefix length, suggestion length, vowels-to-alphabet ratio, contains numeric, etc.
• N-gram similarity features: character tri-gram similarity between previous queries and the suggestion candidate
• Pairwise frequency feature: pairwise frequency based on popular session pairs in the background data
• CLSM topical similarity features: CLSM similarity between previous queries and the suggestion candidate
• CLSM reformulation features: values along each dimension of the reformulation vector based on the previous query and the suggestion candidate (see the combined sketch below)
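Putting the groups together, a hypothetical feature extractor might look like the following, reusing `letter_trigrams` from the earlier sketch; `clsm_embed` and `pair_freq` are assumed lookups for illustration, not the paper's actual pipeline:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def qac_features(prefix, candidate, prev_query, clsm_embed, pair_freq):
    # One feature vector per (prefix, candidate) pair, fed to LambdaMART.
    tg_prev, tg_cand = letter_trigrams(prev_query), letter_trigrams(candidate)
    overlap = sum((tg_prev & tg_cand).values()) / max(sum(tg_cand.values()), 1)
    v_prev = np.asarray(clsm_embed(prev_query))
    v_cand = np.asarray(clsm_embed(candidate))
    features = [
        len(prefix),                                 # non-contextual
        len(candidate),                              # non-contextual
        overlap,                                     # n-gram similarity
        pair_freq.get((prev_query, candidate), 0),   # pairwise frequency
        cosine(v_prev, v_cand),                      # CLSM topical similarity
    ]
    features.extend(v_cand - v_prev)                 # CLSM reformulation dims
    return np.asarray(features)
```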
31. Summary of Contributions
• CLSM models trained on Session Pairs (SP)
• Demonstrate semantic regularities in the CLSM query embedding space
• Leverage the regularities to explicitly represent query reformulations as vectors
• Improved Mean Reciprocal Rank (MRR) for session context-aware QAC ranking by more than 10% using CLSM-based features
32. Potential Future Work
• Studying search trails (White et al.) in the embedding space
• Query change retrieval model (Guan et al.) using reformulation embeddings
• Generating user embeddings for search personalization
• Studying how reformulations vary by user expertise and device