Learning to rank (LTR) for information retrieval (IR) involves the application of machine learning models to rank artifacts, such as items to be recommended, in response to a user's need. LTR models typically employ training data, such as human relevance labels and click data, to discriminatively train towards an IR objective. The focus of this tutorial will be on the fundamentals of neural networks and their applications to learning to rank.
3. Reading material
An Introduction to Neural Information Retrieval
Foundations and Trends® in Information Retrieval (December 2018)
Download PDF: http://bit.ly/fntir-neural
5. Learning to Rank (LTR)
"… the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance."
– Liu [2009]
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
11. Neural networks
Chains of parameterized linear transforms (e.g., multiply by a weight, add a bias) followed by non-linear functions (φ)
Popular choices for φ: Tanh, ReLU
Parameters trained using backpropagation
End-to-end training over millions of samples in batched mode
Many choices of architecture and hyper-parameters
[Figure: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the loss between predicted and expected output drives the backward pass]
14. The softmax function
In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes:
p_i = e^(γ·s_i) / Σ_j e^(γ·s_j)
where s_i is the score for class i and γ is a shape parameter
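The normalization above can be sketched in a few lines of plain Python (no framework assumed; the max-shift is a standard numerical-stability trick, not part of the slide's formula):

```python
import math

def softmax(scores, gamma=1.0):
    """Normalize raw scores into a probability distribution over classes."""
    # Subtract the max (scaled) score before exponentiating to avoid overflow;
    # this does not change the resulting probabilities.
    scaled = [gamma * s for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # sums to 1; preserves score ordering
```

The output is a valid probability distribution, so higher-scoring classes receive proportionally higher probabilities.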
15. Cross entropy
The cross entropy between two probability distributions p and q over a discrete set of events is given by,
CE(p, q) = −Σ_i p_i · log(q_i)
If p_correct = 1 and p_i = 0 for all other values of i then,
CE(p, q) = −log(q_correct)
16. Cross entropy with softmax loss
Cross entropy with softmax is a popular loss function for classification:
ℓ = −log( e^(γ·s_correct) / Σ_j e^(γ·s_j) )
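Because the target distribution is one-hot, the cross entropy with softmax collapses to the negative log-probability of the correct class; a minimal sketch (with γ = 1):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_with_softmax(scores, correct):
    """With a one-hot target (p_correct = 1), the cross entropy reduces
    to -log(q_correct), where q is the softmax of the scores."""
    return -math.log(softmax(scores)[correct])
```

Raising the score of the correct class lowers the loss, which is exactly the behavior a classifier should be trained toward.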
17. Gradient Descent
We are given training data: ⟨x, y⟩ pairs, where x is the input and y is the expected output
Step 1: Define model and randomly initialize learnable model parameters
Step 2: Given x, compute model output
Step 3: Given model output and y, compute loss l
Step 4: Compute gradient ∂l/∂w of loss l w.r.t. each parameter w
Step 5: Update parameter as w_new = w_old − η × ∂l/∂w, where η is the learning rate
Step 6: Go back to step 2 and repeat till convergence
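The six steps can be sketched end to end on a deliberately tiny toy model (a one-parameter linear regressor with squared loss; the data and learning rate below are illustrative choices, not from the tutorial):

```python
import random

# Step 1: define model (y_hat = w * x) and randomly initialize w.
random.seed(0)
w = random.random()

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # <x, y> pairs; true w is 2
eta = 0.05  # learning rate

for _ in range(200):                   # Step 6: repeat till convergence
    for x, y in data:
        y_hat = w * x                  # Step 2: compute model output
        l = (y - y_hat) ** 2           # Step 3: compute loss l
        dl_dw = -2 * (y - y_hat) * x   # Step 4: gradient of l w.r.t. w
        w = w - eta * dl_dw            # Step 5: update the parameter
```

After a few hundred updates, w converges to the value that minimizes the squared loss on this data.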
18–24. Gradient Descent
Goal: iteratively update the learnable parameters such that the loss l is minimized
Task: regression
Training data: ⟨x, y⟩ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
  y₁ = tanh(w₁·x + b₁)
  y₂ = tanh(w₂·y₁ + b₂)
  l = (y − y₂)²
Learnable parameters: w₁, b₁, w₂, b₂
Compute the gradient of the loss l w.r.t. each parameter (e.g., w₁) by repeatedly applying the chain rule:
∂l/∂w₁ = ∂l/∂y₂ × ∂y₂/∂y₁ × ∂y₁/∂w₁
= ∂(y − y₂)²/∂y₂ × ∂y₂/∂y₁ × ∂y₁/∂w₁
= −2·(y − y₂) × ∂y₂/∂y₁ × ∂y₁/∂w₁
= −2·(y − y₂) × ∂tanh(w₂·y₁ + b₂)/∂y₁ × ∂y₁/∂w₁
= −2·(y − y₂) × (1 − tanh²(w₂·y₁ + b₂))·w₂ × ∂y₁/∂w₁
= −2·(y − y₂) × (1 − tanh²(w₂·y₁ + b₂))·w₂ × ∂tanh(w₁·x + b₁)/∂w₁
= −2·(y − y₂) × (1 − tanh²(w₂·y₁ + b₂))·w₂ × (1 − tanh²(w₁·x + b₁))·x
Update the parameter value based on the gradient, with η as the learning rate:
w₁_new = w₁_old − η × ∂l/∂w₁
…and repeat
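The closed-form gradient derived above can be verified numerically against a central finite difference; a from-scratch sketch (the concrete parameter values are arbitrary test points):

```python
import math

def forward(x, w1, b1, w2, b2):
    """The two-layer tanh network from the slides."""
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    return y1, y2

def dl_dw1(x, y, w1, b1, w2, b2):
    """Analytic gradient of l = (y - y2)^2 w.r.t. w1, from the chain rule."""
    y1, y2 = forward(x, w1, b1, w2, b2)
    return (-2 * (y - y2)
            * (1 - math.tanh(w2 * y1 + b2) ** 2) * w2
            * (1 - math.tanh(w1 * x + b1) ** 2) * x)

# Finite-difference check: perturb w1 by +/- eps and difference the losses.
x, y = 0.5, 0.8
w1, b1, w2, b2 = 0.3, -0.1, 0.7, 0.2
eps = 1e-6
_, y2_plus = forward(x, w1 + eps, b1, w2, b2)
_, y2_minus = forward(x, w1 - eps, b1, w2, b2)
numeric = ((y - y2_plus) ** 2 - (y - y2_minus) ** 2) / (2 * eps)
```

If the derivation is correct, the analytic and numeric gradients agree to several decimal places.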
25. Exercise
Simple Neural Network from Scratch
Implement a simple multi-layer neural network with a single input feature, a single output, and a single neuron per layer using (i) PyTorch and (ii) from scratch, and demonstrate that both approaches produce identical outcomes.
https://github.com/spacemanidol/AFIRMDeepLearning2020/blob/master/NNPrimer.ipynb
26. Computation Networks
The "Lego" approach to specifying neural architectures
Library of neural layers, where each layer defines logic for:
1. Forward pass: compute layer output given layer input
2. Backward pass:
a) compute gradient of layer output w.r.t. layer inputs
b) compute gradient of layer output w.r.t. layer parameters (if any)
Chain nodes to create bigger and more complex networks
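The "Lego" pattern can be sketched as layer objects, each implementing a forward and a backward method (a minimal scalar illustration of the idea, not any particular library's API; real libraries operate on tensors):

```python
import math

class Linear:
    """y = w * x + b for scalars."""
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return self.w * x + self.b
    def backward(self, grad_out):
        # Gradients w.r.t. parameters are stored; gradient w.r.t. the
        # input is returned so the chain can continue to earlier layers.
        self.dw = grad_out * self.x
        self.db = grad_out
        return grad_out * self.w

class Tanh:
    def forward(self, x):
        self.y = math.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1 - self.y ** 2)

# Chain nodes to build a bigger network, then run forward and backward.
layers = [Linear(0.3, -0.1), Tanh(), Linear(0.7, 0.2), Tanh()]
x, y = 0.5, 0.8
out = x
for layer in layers:
    out = layer.forward(out)
grad = -2 * (y - out)  # gradient of the squared loss w.r.t. the output
for layer in reversed(layers):
    grad = layer.backward(grad)
```

Chaining backward calls in reverse order reproduces exactly the hand-derived chain-rule gradient from the earlier slides.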
29. Bias-variance trade-off in the deep learning era
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. In PNAS, 2019.
30. The lottery ticket hypothesis
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's hidden in a randomly weighted neural network? In arXiv, 2019.
33. Problem formulation
LTR models represent a rankable item (e.g., a document, a movie, or a song), given some context (e.g., a user-issued query or the user's historical interactions with other items), as a numerical vector x ∈ ℝⁿ.
The ranking model f: x → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher.
34. Why is ranking challenging?
Examples of ranking metrics
Discounted Cumulative Gain (DCG):
DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log₂(i + 1)
Reciprocal Rank (RR):
RR@k = max_{1 ≤ i ≤ k} rel_i / i
Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable
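Both metrics can be computed directly from the list of relevance grades in rank order; a small sketch:

```python
import math

def dcg_at_k(rels, k):
    """DCG@k = sum over the top k of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    """RR@k = max over the top k of rel_i / i (binary relevance)."""
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)),
               default=0.0)
```

Note that both functions depend on the items' integer ranks, not smoothly on the model scores, which is exactly why these metrics are non-differentiable training objectives.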
35. Features
Traditional L2R models employ hand-crafted features that encode IR insights
They can often be categorized as:
Query-independent or static features (e.g., incoming link count and document length)
Query-dependent or dynamic features (e.g., BM25)
Query-level features (e.g., query length)
36. Features
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
37. Approaches
Liu [2009] categorizes different LTR approaches based on training objectives:
Pointwise approach
The relevance label y_{q,d} is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict y_{q,d} given x_{q,d}.
Pairwise approach
Pairwise preference between documents for a query (d_i ≻ d_j w.r.t. q) as label. Reduces to binary classification to predict the more relevant document.
Listwise approach
Directly optimize for a rank-based metric, such as NDCG. This is difficult because these metrics are often not differentiable w.r.t. model parameters.
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
38. Pointwise objectives
Regression loss
Given ⟨q, d⟩, predict the value of y_{q,d}
e.g., square loss for binary or categorical labels,
ℓ = ‖y_{q,d} − f(x_{q,d})‖²
where y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label
Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
39. Pointwise objectives
Classification loss
Given ⟨q, d⟩, predict the class y_{q,d}
e.g., cross-entropy with softmax over categorical labels Y [Li et al., 2008],
ℓ = −log( e^(γ·s_{y_{q,d}}) / Σ_{y ∈ Y} e^(γ·s_y) )
where s_{y_{q,d}} is the model's score for label y_{q,d}
Ping Li, Qiang Wu, and Christopher J Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
40. Pairwise objectives
Given ⟨q, d_i, d_j⟩, predict the more relevant document
For ⟨q, d_i⟩ and ⟨q, d_j⟩,
Feature vectors: x_i and x_j
Model scores: s_i = f(x_i) and s_j = f(x_j)
Pairwise loss minimizes the average number of inversions in ranking, i.e., cases where d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i
Pairwise loss generally has the following form [Chen et al., 2009],
ℓ = φ(s_i − s_j)
where φ can be,
• Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]
• Exponential function φ(z) = e^(−z) [Freund et al., 2003]
• Logistic function φ(z) = log(1 + e^(−z)) [Burges et al., 2005]
• Others…
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
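The three choices of φ can be written out directly; a sketch, where s_i and s_j are the model scores of the more and less relevant document respectively:

```python
import math

def hinge(z):
    return max(0.0, 1.0 - z)

def exponential(z):
    return math.exp(-z)

def logistic(z):
    return math.log(1.0 + math.exp(-z))

def pairwise_loss(phi, s_i, s_j):
    """phi applied to the score difference; near zero when s_i >> s_j,
    large when the preferred document is scored lower (an inversion)."""
    return phi(s_i - s_j)
```

All three penalize score differences in the wrong direction; they differ in how hard they push once the pair is already correctly ordered (hinge goes exactly to zero past the margin, the others only asymptotically).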
41. Pairwise objectives
RankNet loss
Pairwise loss function proposed by Burges et al. [2005], an industry favourite [Burges, 2015]
Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^(γ·s_i) / (e^(γ·s_i) + e^(γ·s_j)) = 1 / (1 + e^(−γ·(s_i − s_j)))
Desired probabilities: p̄_ij = 1 and p̄_ji = 0
Computing cross-entropy between p̄ and p,
ℓ_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^(−γ·(s_i − s_j)))
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
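The RankNet loss follows directly from the formulas above; a sketch (γ is the shape parameter of the sigmoid):

```python
import math

def p_ij(s_i, s_j, gamma=1.0):
    """Predicted probability that item i should rank above item j."""
    return 1.0 / (1.0 + math.exp(-gamma * (s_i - s_j)))

def ranknet_loss(s_i, s_j, gamma=1.0):
    """Cross entropy against the desired p_ij = 1, which reduces to
    log(1 + exp(-gamma * (s_i - s_j)))."""
    return math.log(1.0 + math.exp(-gamma * (s_i - s_j)))
```

When the two scores are equal the model is maximally uncertain (p_ij = 0.5), and the loss decreases smoothly as the margin s_i − s_j grows, which is what makes it trainable by gradient descent.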
42. A generalized cross-entropy loss
An alternative loss function assumes a single relevant document d⁺ and compares it against the full collection D
Predicted probabilities: p(d⁺|q) = e^(γ·s(q,d⁺)) / Σ_{d ∈ D} e^(γ·s(q,d))
The cross-entropy loss is then given by,
ℓ_CE(q, d⁺, D) = −log(p(d⁺|q)) = −log( e^(γ·s(q,d⁺)) / Σ_{d ∈ D} e^(γ·s(q,d)) )
Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
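In practice the denominator runs over the positive document plus a handful of sampled negatives rather than all of D; a sketch (the scores would come from the ranking model f(q, d), which is assumed given here):

```python
import math

def ce_with_sampled_negatives(pos_score, neg_scores, gamma=1.0):
    """-log softmax probability of the positive document, where the
    normalizer covers only the positive and the sampled negatives.
    Uses the log-sum-exp trick for numerical stability."""
    scaled = [gamma * pos_score] + [gamma * s for s in neg_scores]
    m = max(scaled)
    logsumexp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -(scaled[0] - logsumexp)
```

This approximates the full-collection softmax at a tiny fraction of the cost; the quality of the approximation depends on how the negatives are sampled.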
43. Listwise objectives [Burges, 2010]
[Figure: two rankings; blue: relevant, gray: non-relevant. NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors.]
Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks
But listwise metrics are non-continuous and non-differentiable
Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
44. Listwise objectives
LambdaRank loss
Burges et al. [2006] make two observations:
1. To train a model we don't need the costs themselves, only the gradients (of the costs w.r.t. model scores)
2. It is desirable that the gradient be bigger for pairs of documents whose swap in rank positions produces a bigger impact on NDCG
LambdaRank: multiply the actual gradients by the change in NDCG from swapping the rank positions of the two documents
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
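The idea can be sketched for a single pair: take the RankNet gradient w.r.t. s_i and scale it by |ΔNDCG| of swapping the two documents (a simplified illustration; the ideal-DCG normalizer is omitted, and production implementations accumulate these lambdas over all pairs):

```python
import math

def dcg_gain(rel, rank):
    """Contribution of a document with grade rel at a 1-based rank."""
    return (2 ** rel - 1) / math.log2(rank + 1)

def lambda_ij(s_i, s_j, rel_i, rel_j, rank_i, rank_j, gamma=1.0):
    """RankNet gradient w.r.t. s_i (for a pair where i should rank
    above j), scaled by |delta DCG| of swapping ranks rank_i and rank_j."""
    ranknet_grad = -gamma / (1.0 + math.exp(gamma * (s_i - s_j)))
    delta = abs(dcg_gain(rel_i, rank_i) + dcg_gain(rel_j, rank_j)
                - dcg_gain(rel_i, rank_j) - dcg_gain(rel_j, rank_i))
    return ranknet_grad * delta
```

Swapping two equally relevant documents changes the metric not at all, so the pair contributes no gradient; swaps near the top of the ranking, where the position discount is steep, contribute much larger gradients than the same swap far down the list.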
45. Listwise objectives
According to the Luce model [Luce, 2005], given four items {d₁, d₂, d₃, d₄} the probability of observing a particular rank-order, say ⟨d₂, d₁, d₄, d₃⟩, is given by:
p(π) = ∏_{i=1}^{4} φ(s_{π(i)}) / Σ_{j=i}^{4} φ(s_{π(j)})
e.g., p(⟨d₂, d₁, d₄, d₃⟩) = φ(s₂)/(φ(s₁)+φ(s₂)+φ(s₃)+φ(s₄)) × φ(s₁)/(φ(s₁)+φ(s₃)+φ(s₄)) × φ(s₄)/(φ(s₃)+φ(s₄))
where π is a particular permutation and φ is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i
ListNet loss
Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model scores and ground-truth labels. The loss is then given by the K-L divergence between these two distributions.
This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive.
ListMLE loss
Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible.
R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
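ListNet is commonly implemented with the top-one approximation: under the Luce model, each document's probability of being ranked first reduces to a softmax, and the loss is the cross entropy between the label-derived and score-derived distributions. A sketch of that simplified variant (with exponential φ):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_top1_loss(scores, labels):
    """Cross entropy between top-one probabilities derived from the
    ground-truth labels (target) and from the model scores (predicted)."""
    p = softmax(labels)   # target distribution over documents
    q = softmax(scores)   # model distribution over documents
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is minimized when the score distribution matches the label distribution, i.e., when the model orders (and spaces) documents like the ground truth does.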
46. Listwise objectives
Smooth DCG
Wu et al. [2009] compute a "smooth" rank of documents as a function of their scores
This "smooth" rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
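One common way to realize this idea (an illustration of the general technique, not necessarily Wu et al.'s exact formulation, which uses smoothed hinge functions): approximate each document's rank by a sum of sigmoids over score differences, then plug that smooth rank into the DCG discount.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smooth_rank(scores, i, temp=1.0):
    """Approximate 1-based rank of document i: one plus a soft count of
    documents that score higher, with the hard comparison replaced by a
    sigmoid; temp controls the sharpness of the approximation."""
    return 1.0 + sum(sigmoid((s_j - scores[i]) / temp)
                     for j, s_j in enumerate(scores) if j != i)

def smooth_dcg(scores, rels, temp=1.0):
    """DCG with the hard rank replaced by the smooth rank, making the
    objective differentiable w.r.t. the model scores."""
    return sum((2 ** rel - 1) / math.log2(smooth_rank(scores, i, temp) + 1)
               for i, rel in enumerate(rels))
```

As the temperature goes to zero the smooth rank approaches the true integer rank, recovering ordinary DCG; at higher temperatures the objective is smoother and easier to optimize by gradient descent.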