Neural Learning to Rank
Bhaskar Mitra
Principal Applied Scientist, Microsoft
PhD candidate, University College London
@UnderdogGeek
Topics
A quick recap of neural networks
The fundamentals of learning to rank
Reading material
An Introduction to
Neural Information Retrieval
Foundations and Trends® in Information Retrieval
(December 2018)
Download PDF: http://bit.ly/fntir-neural
Most information retrieval
(IR) systems present a ranked
list of retrieved artifacts
Learning to Rank (LTR)
โ€... the task to automatically construct a
ranking model using training data, such
that the model can sort new objects
according to their degrees of relevance,
preference, or importance.โ€
- Liu [2009]
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
A quick recap of
neural networks
Vectors, matrices,
and tensors
Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
matrix transpose matrix addition
dot product matrix multiplication
Supervised learning
Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
Neural networks
Chains of parameterized linear transforms (e.g., multiply by a weight, add a
bias) followed by non-linear functions (σ)
Popular choices for σ: Tanh, ReLU
Parameters trained using backpropagation
E2E training over millions of samples in batched mode
Many choices of architecture and hyper-parameters
[Diagram: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the forward pass computes the predicted output, and the backward pass propagates the loss computed against the expected output back through the network]
Basic machine
learning tasks
Squared loss
The squared loss is a popular loss function for regression tasks:
l_squared = (y − ŷ)², where y is the expected output and ŷ is the model's prediction
The softmax function
In neural classification models, the softmax function is popularly used
to normalize the neural network output scores across all the classes:
p_i = e^(z_i) / Σ_j e^(z_j)
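A minimal sketch of the softmax normalization above, using NumPy; the max-subtraction is an assumption added here for numerical stability rather than something stated on the slide:

```python
import numpy as np

def softmax(scores):
    """Normalize raw output scores into a probability distribution over classes."""
    # Subtracting the max score does not change the result (softmax is invariant
    # to adding a constant to every score) but avoids overflow in exp().
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g., [0.659 0.242 0.099]
```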
Cross entropy
The cross entropy between two
probability distributions p and q
over a discrete set of events is
given by,
CE(p, q) = −Σ_i p_i × log(q_i)
If p_correct = 1 and p_i = 0 for all
other values of i, then
CE(p, q) = −log(q_correct)
Cross entropy with
softmax loss
Cross entropy with softmax is a popular loss
function for classification
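A small sketch of the cross-entropy-with-softmax loss for a single training example, computed both by hand and with PyTorch's built-in F.cross_entropy (which applies log-softmax and the negative log-likelihood in one call); the scores and the label are made-up values:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[1.5, 0.3, -0.8]])  # model scores for 3 classes (batch of 1)
label = torch.tensor([0])                  # index of the correct class

# By hand: softmax over the scores, then negative log-probability of the correct class
probs = torch.softmax(scores, dim=1)
manual_loss = -torch.log(probs[0, label[0]])

# Built-in: combines log-softmax and negative log-likelihood internally
builtin_loss = F.cross_entropy(scores, label)

print(manual_loss.item(), builtin_loss.item())  # both ≈ 0.34
```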
Gradient Descent
We are given training data: ⟨x, y⟩ pairs, where x is the input and y is the expected output
Step 1: Define the model and randomly initialize the learnable model parameters
Step 2: Given x, compute the model output
Step 3: Given the model output and y, compute the loss l
Step 4: Compute the gradient ∂l/∂w of the loss l w.r.t. each parameter w
Step 5: Update each parameter as w_new = w_old − η × ∂l/∂w, where η is the learning rate
Step 6: Go back to Step 2 and repeat till convergence
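A minimal PyTorch sketch of Steps 1 to 6 for a toy regression problem; the data, the model shape, and the learning rate are illustrative assumptions:

```python
import torch

# Step 1: define the model and randomly initialize its learnable parameters
model = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # η = 0.01

x = torch.rand(64, 1)   # toy inputs
y = 3.0 * x + 0.5       # toy expected outputs

for step in range(1000):
    y_pred = model(x)                   # Step 2: compute model output
    loss = ((y - y_pred) ** 2).mean()   # Step 3: compute loss l
    optimizer.zero_grad()
    loss.backward()                     # Step 4: compute ∂l/∂w for every parameter w
    optimizer.step()                    # Step 5: w_new = w_old − η × ∂l/∂w
                                        # Step 6: repeat till convergence
```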
Gradient Descent
Task: regression
Training data: (x, y) pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: w1, b1, w2, b2
Computation graph: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2), l = (y − y2)²
Goal: iteratively update the learnable parameters such that the loss l is minimized
Compute the gradient of the loss l w.r.t. each parameter (e.g., w1):
∂l/∂w1 = ∂l/∂y2 × ∂y2/∂y1 × ∂y1/∂w1
Update the parameter value based on the gradient, with η as the learning rate:
w1_new = w1_old − η × ∂l/∂w1
…and repeat
Gradient Descent (continued)
Same setup as above; the slide is repeated, expanding the chain rule one term at a time:
∂l/∂w1 = ∂(y − y2)²/∂y2 × ∂y2/∂y1 × ∂y1/∂w1
= −2 × (y − y2) × ∂y2/∂y1 × ∂y1/∂w1
= −2 × (y − y2) × ∂tanh(w2·y1 + b2)/∂y1 × ∂y1/∂w1
= −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂y1/∂w1
= −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂tanh(w1·x + b1)/∂w1
= −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (1 − tanh²(w1·x + b1)) × x
Update the parameter value based on the gradient, with η as the learning rate:
w1_new = w1_old − η × ∂l/∂w1
…and repeat
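A small sketch that checks the hand-derived gradient above against PyTorch autograd for the same one-feature, one-hidden-node network; the parameter and data values are arbitrary:

```python
import torch

# Arbitrary values for the data point and the four learnable parameters
x, y = torch.tensor(0.7), torch.tensor(0.3)
w1, b1 = torch.tensor(0.5, requires_grad=True), torch.tensor(-0.1, requires_grad=True)
w2, b2 = torch.tensor(1.2, requires_grad=True), torch.tensor(0.2, requires_grad=True)

# Forward pass: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2), l = (y − y2)²
y1 = torch.tanh(w1 * x + b1)
y2 = torch.tanh(w2 * y1 + b2)
loss = (y - y2) ** 2
loss.backward()  # autograd applies the same chain rule to compute ∂l/∂w1

# Hand-derived gradient from the slide
manual = (-2 * (y - y2) * (1 - torch.tanh(w2 * y1 + b2) ** 2) * w2
          * (1 - torch.tanh(w1 * x + b1) ** 2) * x)

print(w1.grad.item(), manual.item())  # the two values should match
```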
Exercise
Simple Neural Network from Scratch
Implement a simple multi-layer neural network
with single input feature, single output, and
single neuron per layer using (i) PyTorch and
(ii) from scratch, and demonstrate that both
approaches produce the same outcome.
https://github.com/spacemanidol/AFIRMDeep
Learning2020/blob/master/NNPrimer.ipynb
Computation
Networks
The "Lego" approach to specifying neural architectures
Library of neural layers, each layer defines logic for:
1. Forward pass: compute layer output given layer input
2. Backward pass:
a) compute gradient of layer output w.r.t. layer inputs
b) compute gradient of layer output w.r.t. layer parameters (if any)
Chain nodes to create bigger and more complex networks
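A toy sketch of the "Lego" idea: each layer implements its own forward pass and its own backward pass (gradients w.r.t. its input and its parameters), and layers are chained into a network. This is a from-scratch illustration under simple scalar assumptions, not any particular library's API:

```python
import numpy as np

class Linear:
    """y = w·x + b for scalars, with hand-written forward and backward logic."""
    def __init__(self):
        self.w, self.b = np.random.randn(), 0.0
    def forward(self, x):
        self.x = x                       # cache the input for the backward pass
        return self.w * x + self.b
    def backward(self, grad_out):
        self.grad_w = grad_out * self.x  # gradient w.r.t. the layer parameter w
        self.grad_b = grad_out           # gradient w.r.t. the layer parameter b
        return grad_out * self.w         # gradient w.r.t. the layer input

class Tanh:
    def forward(self, x):
        self.y = np.tanh(x)
        return self.y
    def backward(self, grad_out):
        return grad_out * (1 - self.y ** 2)  # gradient w.r.t. the layer input

# Chain the layer nodes to build a bigger network
layers = [Linear(), Tanh(), Linear(), Tanh()]
x = 0.7
for layer in layers:
    x = layer.forward(x)         # forward pass
grad = 1.0
for layer in reversed(layers):
    grad = layer.backward(grad)  # backward pass
```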
Why adding depth helps
http://playground.tensorflow.org
Bias-Variance trade-
off
https://medium.com/@akgone38/what-the-heck-bias-variance-tradeoff-is-fe4681c0e71b
Bias-variance trade-off in the deep
learning era
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical biasโ€“variance trade-off. In PNAS, 2019.
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
The lottery ticket
hypothesis
Questions?
The fundamentals of
learning to rank
Problem formulation
LTR models represent a rankable item (e.g., a document, a movie, or a
song), given some context (e.g., a user-issued query or the user's historical
interactions with other items), as a numerical vector x ∈ ℝⁿ
The ranking model f: x → ℝ is trained to map the vector to a real-valued
score such that relevant items are scored higher.
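A minimal sketch of such a ranking model f: x → ℝ as a small feed-forward network over an n-dimensional feature vector; the feature dimensionality, width, and depth are arbitrary choices for illustration:

```python
import torch

n = 10  # number of features describing the (context, item) pair

# f maps a feature vector x ∈ ℝⁿ to a single real-valued relevance score
f = torch.nn.Sequential(
    torch.nn.Linear(n, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

x = torch.rand(5, n)        # 5 candidate items for the same context
scores = f(x).squeeze(-1)   # one score per item
ranking = torch.argsort(scores, descending=True)  # rank items by descending score
```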
Why is ranking challenging?
Examples of ranking
metrics
Discounted Cumulative Gain (DCG):
DCG@k = Σ_{i=1}^{k} (2^rel_i − 1) / log2(i + 1)
Reciprocal Rank (RR):
RR@k = max_{1 ≤ i ≤ k} rel_i / i
Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable
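A small sketch of the two metrics above, computed from a list of graded relevance labels ordered by model rank; the label values are illustrative:

```python
import math

def dcg_at_k(rels, k):
    """DCG@k = Σ_{i=1..k} (2^rel_i − 1) / log2(i + 1), with i the 1-based rank."""
    return sum((2 ** rel - 1) / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    """RR@k = max_{1≤i≤k} rel_i / i, assuming binary relevance labels."""
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)), default=0.0)

ranked_rels = [0, 1, 0, 2, 1]   # relevance grades in ranked order
print(dcg_at_k(ranked_rels, 5))
print(rr_at_k([0, 1, 0, 1, 1], 5))  # first relevant document at rank 2 → 0.5
```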
Features
They can often be categorized as:
Query-independent or static features
e.g., incoming link count and document length
Query-dependent or dynamic features
e.g., BM25
Query-level features
e.g., query length
Traditional L2R models employ
hand-crafted features that
encode IR insights
Features
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
Approaches
Pointwise approach
Relevance label y_{q,d} is a number, derived from binary or graded human
judgments or implicit user feedback (e.g., CTR). Typically, a regression or
classification model is trained to predict y_{q,d} given x_{q,d}.
Pairwise approach
Pairwise preference between documents for a query (d_i ≻ d_j w.r.t. q) is used as
the label. The task reduces to binary classification: predict the more relevant document of the pair.
Listwise approach
Directly optimize for a rank-based metric, such as NDCG. This is difficult because
these metrics are often not differentiable w.r.t. the model parameters.
Liu [2009] categorizes
different LTR approaches
based on training objectives:
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
Pointwise objectives
Regression loss
Given ⟨q, d⟩, predict the value of y_{q,d}
e.g., squared loss for binary or categorical
labels,
where y_{q,d} is the one-hot representation
[Fuhr, 1989] or the actual value [Cossock and
Zhang, 2006] of the label
Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
Pointwise objectives
Classification loss
Given ⟨q, d⟩, predict the class y_{q,d}
e.g., cross-entropy with softmax over
categorical labels Y [Li et al., 2008],
L_classification = −log( e^(s_{y_{q,d}}) / Σ_{y ∈ Y} e^(s_y) )
where s_{y_{q,d}} is the model's score for the label y_{q,d}
Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
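A brief sketch of the two pointwise objectives above for a single ⟨q, d⟩ feature vector: a regression head trained with squared loss on the graded label, and a classification head trained with cross entropy over the label categories; the dimensions and values are made up:

```python
import torch
import torch.nn.functional as F

x_qd = torch.rand(1, 10)          # feature vector for one query-document pair
y_graded = torch.tensor([[2.0]])  # graded relevance label (e.g., 0-4)
y_class = torch.tensor([2])       # the same label treated as a class index

# Pointwise regression: predict the label value, squared loss
reg_model = torch.nn.Linear(10, 1)
reg_loss = F.mse_loss(reg_model(x_qd), y_graded)

# Pointwise classification: predict the label class, cross entropy with softmax
clf_model = torch.nn.Linear(10, 5)   # one score per relevance grade
clf_loss = F.cross_entropy(clf_model(x_qd), y_class)
```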
Pairwise objectives
Given ⟨q, d_i, d_j⟩, predict the more relevant document
For ⟨q, d_i⟩ and ⟨q, d_j⟩,
Feature vectors: x_i and x_j
Model scores: s_i = f(x_i) and s_j = f(x_j)
Pairwise loss generally has the following form [Chen et al., 2009]:
L_pairwise = φ(s_i − s_j)
where φ can be,
• Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]
• Exponential function φ(z) = e^(−z) [Freund et al., 2003]
• Logistic function φ(z) = log(1 + e^(−z)) [Burges et al., 2005]
• Others…
Pairwise loss minimizes the average number of
inversions in ranking, i.e., cases where d_i ≻ d_j w.r.t. q but d_j is
ranked higher than d_i
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
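A short sketch of the general pairwise form above: score both documents with the same model f and apply a choice of φ to the score difference s_i − s_j. The hinge, logistic, and exponential variants are shown; the model and features are placeholders:

```python
import torch

f = torch.nn.Linear(10, 1)                        # shared scoring model f(x)
x_i, x_j = torch.rand(1, 10), torch.rand(1, 10)   # d_i is more relevant than d_j

s_i, s_j = f(x_i), f(x_j)
z = s_i - s_j                                     # the loss depends only on the score difference

hinge_loss = torch.clamp(1 - z, min=0)            # φ(z) = max(0, 1 − z)
logistic_loss = torch.log1p(torch.exp(-z))        # φ(z) = log(1 + e^−z)
exponential_loss = torch.exp(-z)                  # φ(z) = e^−z
```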
Pairwise objectives
RankNet loss
Pairwise loss function proposed by Burges et al. [2005], an industry favourite
[Burges, 2015]
Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^(γ·s_i) / (e^(γ·s_i) + e^(γ·s_j)) = 1 / (1 + e^(−γ·(s_i − s_j)))
Desired probabilities: p̄_ij = 1 and p̄_ji = 0
Computing the cross entropy between p̄ and p,
L_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^(−γ·(s_i − s_j)))
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
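A compact sketch of the RankNet loss above, assuming γ = 1 and that d_i is the more relevant document of the pair (so the target probability p̄_ij = 1); the scoring model and features are placeholders:

```python
import torch
import torch.nn.functional as F

def ranknet_loss(s_i, s_j, gamma=1.0):
    """log(1 + exp(−γ·(s_i − s_j))), i.e., cross entropy with target p̄_ij = 1."""
    return F.softplus(-gamma * (s_i - s_j))   # softplus(x) = log(1 + e^x)

f = torch.nn.Linear(10, 1)
x_i, x_j = torch.rand(1, 10), torch.rand(1, 10)   # d_i ≻ d_j w.r.t. the query
loss = ranknet_loss(f(x_i), f(x_j)).mean()
loss.backward()   # gradients flow into the shared scoring model f
```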
A generalized cross-entropy loss
An alternative loss function assumes a single relevant document d+ and compares it
against the full collection D
Predicted probability: p(d+ | q) = e^(γ·s(q, d+)) / Σ_{d ∈ D} e^(γ·s(q, d))
The cross-entropy loss is then given by,
L_CE(q, d+, D) = −log(p(d+ | q)) = −log( e^(γ·s(q, d+)) / Σ_{d ∈ D} e^(γ·s(q, d)) )
Computing the softmax over the full collection is prohibitively expensive, so LTR models
typically consider a small set of negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
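A small sketch of the sampled approximation above: the softmax is computed over one relevant document and a handful of sampled negatives rather than the full collection; the number of negatives and the feature dimensionality are arbitrary:

```python
import torch
import torch.nn.functional as F

f = torch.nn.Linear(10, 1)      # scoring model s(q, d) over joint query-document features
x_pos = torch.rand(1, 10)       # features for (q, d+)
x_neg = torch.rand(4, 10)       # features for 4 sampled negative documents

scores = torch.cat([f(x_pos), f(x_neg)]).view(1, -1)   # [s(q,d+), s(q,d−), ...]
target = torch.zeros(1, dtype=torch.long)              # the relevant document sits at index 0

# −log softmax probability of d+ over {d+} ∪ sampled negatives
loss = F.cross_entropy(scores, target)
```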
LISTWISE OBJECTIVES
[Figure: two example rankings, blue = relevant, gray = non-relevant; NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors]
Due to the strong position-based discounting in
IR measures, errors at higher ranks are much
more problematic than at lower ranks
But listwise metrics are non-continuous and
non-differentiable
Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010.
[Burges, 2010]
Listwise objectives
Burges et al. [2006] make two observations:
1. To train a model we don't need the costs
themselves, only the gradients (of the costs
w.r.t. the model scores)
2. It is desirable that the gradient be bigger for
pairs of documents whose swap produces a
bigger change in NDCG
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
LambdaRank loss
Multiply the actual pairwise gradients by the change in
NDCG obtained by swapping the rank positions of the
two documents
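A rough sketch of the LambdaRank idea above: compute a RankNet-style pairwise loss and weight each pair by |ΔNDCG|, the change in NDCG obtained by swapping the two documents' rank positions. The DCG bookkeeping here is a simplified illustration, not the exact formulation of Burges et al.:

```python
import math
import torch

def delta_ndcg(rels, i, j, ideal_dcg):
    """|ΔNDCG| from swapping the documents currently at (0-based) ranks i and j."""
    gain = lambda rel, rank: (2 ** rel - 1) / math.log2(rank + 2)
    before = gain(rels[i], i) + gain(rels[j], j)
    after = gain(rels[i], j) + gain(rels[j], i)
    return abs(after - before) / ideal_dcg

scores = torch.tensor([2.0, 1.5, 0.3], requires_grad=True)  # model scores, in ranked order
rels = [0, 2, 1]                                            # relevance labels at those ranks
ideal = (2 ** 2 - 1) / math.log2(2) + (2 ** 1 - 1) / math.log2(3)  # ideal DCG for normalization

# RankNet-style pairwise loss, weighted by |ΔNDCG| for each pair where i is more relevant than j
loss = 0.0
for i in range(len(rels)):
    for j in range(len(rels)):
        if rels[i] > rels[j]:
            weight = delta_ndcg(rels, i, j, ideal)
            loss = loss + weight * torch.log1p(torch.exp(-(scores[i] - scores[j])))
loss.backward()
```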
Listwise objectives
According to the Luce model [Luce, 1959],
given four items {d1, d2, d3, d4} the probability
of observing a particular rank-order, say
⟨d2, d1, d4, d3⟩, is given by:
p(⟨d2, d1, d4, d3⟩) = φ(s2) / (φ(s1) + φ(s2) + φ(s3) + φ(s4)) × φ(s1) / (φ(s1) + φ(s3) + φ(s4)) × φ(s4) / (φ(s3) + φ(s4)) × φ(s3) / φ(s3)
where the product follows the permutation π and φ is a
transformation (e.g., linear, exponential, or
sigmoid) over the score s_i corresponding to
item d_i
R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
ListNet loss
Cao et al. [2007] propose to compute the
probability distribution over all possible
permutations based on the model scores and the
ground-truth labels. The loss is then given by the K-L
divergence between these two distributions.
This is computationally very costly; computing
permutations of only the top-K items makes it
slightly less prohibitive.
ListMLE loss
Xia et al. [2008] propose to compute the
probability of the ideal permutation based on the
ground truth. However, with categorical labels
more than one permutation is possible.
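A brief sketch of a common top-one simplification of the ListNet idea (an assumption here; the slide describes the full permutation distributions): compare the softmax distribution over the model scores with the softmax distribution over the labels via cross entropy. The scores and labels are made-up values:

```python
import torch

scores = torch.tensor([1.2, 0.4, -0.3], requires_grad=True)  # model scores for one query's documents
labels = torch.tensor([2.0, 1.0, 0.0])                       # graded relevance labels

# Top-one probabilities from the labels (target) and from the scores (model)
target_dist = torch.softmax(labels, dim=0)
log_model_dist = torch.log_softmax(scores, dim=0)

# Cross entropy between the two distributions (equivalent to K-L divergence up to a constant)
listnet_loss = -(target_dist * log_model_dist).sum()
listnet_loss.backward()
```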
Listwise objectives
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
Smooth DCG
Wu et al. [2009] compute a "smooth" rank of
documents as a function of their scores
This "smooth" rank can be plugged into a
ranking metric, such as MRR or DCG, to
produce a smooth ranking loss
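A rough sketch of the "smooth rank" idea: approximate each document's rank as a sum of sigmoids over score differences, then plug the smooth ranks into the DCG formula. The temperature and the exact form are assumptions for illustration, not the precise formulation of Wu et al.:

```python
import torch

def smooth_dcg(scores, rels, temperature=0.1):
    """Differentiable DCG using a sigmoid-based smooth rank for each document."""
    # smooth_rank_i ≈ 1 + Σ_{j≠i} sigmoid((s_j − s_i) / temperature)
    diff = (scores.unsqueeze(0) - scores.unsqueeze(1)) / temperature  # diff[i, j] = (s_j − s_i) / t
    smooth_rank = 1.0 + torch.sigmoid(diff).sum(dim=1) - 0.5          # drop the j = i term (sigmoid(0) = 0.5)
    gains = 2 ** rels - 1
    return (gains / torch.log2(smooth_rank + 1)).sum()

scores = torch.tensor([1.0, 0.2, -0.5], requires_grad=True)
rels = torch.tensor([2.0, 0.0, 1.0])
loss = -smooth_dcg(scores, rels)   # maximize smooth DCG by minimizing its negative
loss.backward()
```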
Questions?
@UnderdogGeek bmitra@microsoft.com
Mรกs contenido relacionado

La actualidad mรกs candente

Talk@rmit 09112017
Talk@rmit 09112017Talk@rmit 09112017
Talk@rmit 09112017Shuai Zhang
ย 
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...Simplilearn
ย 
Machine learning
Machine learningMachine learning
Machine learningInfoFarm
ย 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkYan Xu
ย 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )Mohammad Junaid Khan
ย 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsBenjamin Le
ย 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balรกzs Hidasi
ย 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Alexandros Karatzoglou
ย 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
ย 
Deep learning
Deep learning Deep learning
Deep learning Rajgupta258
ย 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Simplilearn
ย 
Deep Learning With Python Tutorial | Edureka
Deep Learning With Python Tutorial | EdurekaDeep Learning With Python Tutorial | Edureka
Deep Learning With Python Tutorial | EdurekaEdureka!
ย 
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Simplilearn
ย 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...Balรกzs Hidasi
ย 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...Simplilearn
ย 
Overview of recommender system
Overview of recommender systemOverview of recommender system
Overview of recommender systemStanley Wang
ย 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Edureka!
ย 
Mask R-CNN
Mask R-CNNMask R-CNN
Mask R-CNNChanuk Lim
ย 

La actualidad mรกs candente (20)

Talk@rmit 09112017
Talk@rmit 09112017Talk@rmit 09112017
Talk@rmit 09112017
ย 
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
ย 
Machine learning
Machine learningMachine learning
Machine learning
ย 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
ย 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
ย 
Deep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender SystemsDeep Learning for Personalized Search and Recommender Systems
Deep Learning for Personalized Search and Recommender Systems
ย 
Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
ย 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial
ย 
Autoencoders
AutoencodersAutoencoders
Autoencoders
ย 
Deep learning
Deep learning Deep learning
Deep learning
ย 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
ย 
Deep Learning With Python Tutorial | Edureka
Deep Learning With Python Tutorial | EdurekaDeep Learning With Python Tutorial | Edureka
Deep Learning With Python Tutorial | Edureka
ย 
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
Machine Learning Tutorial Part - 2 | Machine Learning Tutorial For Beginners ...
ย 
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
GRU4Rec v2 - Recurrent Neural Networks with Top-k Gains for Session-based Rec...
ย 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
ย 
Overview of recommender system
Overview of recommender systemOverview of recommender system
Overview of recommender system
ย 
Cnn
CnnCnn
Cnn
ย 
K Nearest Neighbors
K Nearest NeighborsK Nearest Neighbors
K Nearest Neighbors
ย 
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorf...
ย 
Mask R-CNN
Mask R-CNNMask R-CNN
Mask R-CNN
ย 

Similar a Neural Learning to Rank

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
ย 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
ย 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
ย 
Learning to Rank with Neural Networks
Learning to Rank with Neural NetworksLearning to Rank with Neural Networks
Learning to Rank with Neural NetworksBhaskar Mitra
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
ย 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Universitat Politรจcnica de Catalunya
ย 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptxPrabhuSelvaraj15
ย 
Lesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfLesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfssuser7f0b19
ย 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptxssuserf07225
ย 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptxEmanAl15
ย 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesArvind Rapaka
ย 
Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxPlacementsBCA
ย 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
ย 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdfMariaKhan905189
ย 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with pythonSimone Piunno
ย 
ๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน Adaboostๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน AdaboostShocky1
ย 
ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfTigabu Yaya
ย 

Similar a Neural Learning to Rank (20)

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
ย 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
ย 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
ย 
Learning to Rank with Neural Networks
Learning to Rank with Neural NetworksLearning to Rank with Neural Networks
Learning to Rank with Neural Networks
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
ย 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
ย 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptx
ย 
Lesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfLesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdf
ย 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
ย 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptx
ย 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial Usecases
ย 
Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptx
ย 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
ย 
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
ย 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf
ย 
Xgboost
XgboostXgboost
Xgboost
ย 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
ย 
ๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน Adaboostๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน Adaboost
ย 
ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdf
ย 

Mรกs de Bhaskar Mitra

Joint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationJoint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
ย 
Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?Bhaskar Mitra
ย 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...Bhaskar Mitra
ย 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Bhaskar Mitra
ย 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
ย 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressBhaskar Mitra
ย 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
ย 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
ย 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBhaskar Mitra
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
ย 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
ย 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
ย 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
ย 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
ย 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
ย 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcomeBhaskar Mitra
ย 
The Duet model
The Duet modelThe Duet model
The Duet modelBhaskar Mitra
ย 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
ย 

Mรกs de Bhaskar Mitra (20)

Joint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationJoint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and Recommendation
ย 
Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?
ย 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
ย 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...
ย 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and Recommendation
ย 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progress
ย 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
ย 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
ย 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
ย 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
ย 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
ย 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
ย 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
ย 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
ย 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcome
ย 
The Duet model
The Duet modelThe Duet model
The Duet model
ย 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
ย 

รšltimo

Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
ย 
Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
ย 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSรฉrgio Sacani
ย 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sรฉrgio Sacani
ย 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyDrAnita Sharma
ย 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
ย 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
ย 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
ย 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
ย 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
ย 
Hire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls AgencyHire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
ย 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
ย 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
ย 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
ย 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
ย 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
ย 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sรฉrgio Sacani
ย 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
ย 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
ย 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
ย 

รšltimo (20)

Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
ย 
Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow ๐Ÿ’‹ Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
ย 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
ย 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
ย 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
ย 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
ย 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
ย 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
ย 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
ย 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
ย 
Hire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls AgencyHire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire ๐Ÿ’• 9907093804 Hooghly Call Girls Service Call Girls Agency
ย 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
ย 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
ย 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
ย 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
ย 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
ย 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
ย 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
ย 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
ย 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
ย 

Neural Learning to Rank

  • 1. Neural Learning to Rank Bhaskar Mitra Principal Applied Scientist, Microsoft PhD candidate, University College London @UnderdogGeek
  • 2. Topics A quick recap of neural networks The fundamentals of learning to rank
  • 3. Reading material An Introduction to Neural Information Retrieval Foundations and Trendsยฎ in Information Retrieval (December 2018) Download PDF: http://bit.ly/fntir-neural
  • 4. Most information retrieval (IR) systems present a ranked list of retrieved artifacts
  • 5. Learning to Rank (LTR) โ€... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.โ€ - Liu [2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 6.
  • 7. A quick recap of neural networks
  • 8. Vectors, matrices, and tensors Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66 Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/ matrix transpose matrix addition dot product matrix multiplication
  • 9.
  • 10. Supervised learning Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
  • 11. Neural networks Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (ฯƒ) Popular choices for ฯƒ: Parameters trained using backpropagation E2E training over millions of samples in batched mode Many choices of architecture and hyper-parameters Non-linearity Input Linear transform Non-linearity Linear transform Predicted output forwardpass backwardpass Expected output loss Tanh ReLU
  • 13. Squared loss The squared loss is a popular loss function for regression tasks
  • 14. The softmax function In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
  • 15. Cross entropy The cross entropy between two probability distributions ๐‘ and ๐‘ž over a discrete set of events is given by, If ๐‘ ๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก = 1and ๐‘๐‘– = 0 for all other values of ๐‘– then,
  • 16. Cross entropy with softmax loss Cross entropy with softmax is a popular loss function for classification
  • 17. We are given training data: < ๐‘ฅ, ๐‘ฆ > pairs, where ๐‘ฅ is input and ๐‘ฆ is expected output Step 1: Define model and randomly initialize learnable model parameters Step 2: Given ๐‘ฅ, compute model output Step 3: Given model output and ๐‘ฆ, compute loss ๐‘™ Step 4: Compute gradient ๐œ•๐‘™ ๐œ•๐‘ค of loss ๐‘™ w.r.t. each parameter ๐‘ค Step 5: Update parameter as ๐‘ค ๐‘›๐‘’๐‘ค = ๐‘ค ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค , where ๐œ‚ is learning rate Step 6: Go back to step 2 and repeat till convergence Gradient Descent
  • 18. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = ๐œ•๐‘™ ๐œ•๐‘ฆ2 ร— ๐œ•๐‘ฆ2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 19. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = ๐œ• ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐œ•๐‘ฆ2 ร— ๐œ•๐‘ฆ2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 20. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— ๐œ•๐‘ฆ2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 21. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— ๐œ•๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 22. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— 1 โˆ’ ๐‘ก๐‘Ž๐‘›โ„Ž2 ๐‘ค2. ๐‘ฅ + ๐‘2 ร— ๐‘ค2 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 23. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— 1 โˆ’ ๐‘ก๐‘Ž๐‘›โ„Ž2 ๐‘ค2. ๐‘ฅ + ๐‘2 ร— ๐‘ค2 ร— ๐œ•๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 24. Gradient Descent Goal: iteratively update the learnable parameters such that the loss l is minimized. Compute the gradient of the loss l w.r.t. each parameter (e.g., w1): ∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (1 − tanh²(w1·x + b1)) × x. Update the parameter value based on the gradient, with η as the learning rate: w1_new = w1_old − η × ∂l/∂w1. Task: regression. Training data: (x, y) pairs. Model: NN (1 feature, 1 hidden layer, 1 hidden node). Learnable parameters: w1, b1, w2, b2. [Diagram: y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2), l = (y − y2)²] …and repeat
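The chain-rule expansion above can be checked numerically. Below is a minimal sketch (not part of the slides; the input, label, and parameter values are made up) that computes ∂l/∂w1 exactly as derived and applies one gradient descent update:

    import numpy as np

    def forward(x, w1, b1, w2, b2):
        y1 = np.tanh(w1 * x + b1)          # hidden node
        y2 = np.tanh(w2 * y1 + b2)         # prediction
        return y1, y2

    def sgd_step_w1(x, y, w1, b1, w2, b2, lr=0.1):
        y1, y2 = forward(x, w1, b1, w2, b2)
        loss = (y - y2) ** 2
        # Chain rule, exactly as expanded on the slide:
        dl_dy2 = -2 * (y - y2)
        dy2_dy1 = (1 - np.tanh(w2 * y1 + b2) ** 2) * w2
        dy1_dw1 = (1 - np.tanh(w1 * x + b1) ** 2) * x
        dl_dw1 = dl_dy2 * dy2_dy1 * dy1_dw1
        return loss, w1 - lr * dl_dw1      # w1_new = w1_old - eta * dl/dw1

    loss, w1_new = sgd_step_w1(x=0.5, y=0.8, w1=0.1, b1=0.0, w2=0.2, b2=0.0)

The other parameters (b1, w2, b2) are updated the same way using their own partial derivatives.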
  • 25. Exercise Simple Neural Network from Scratch Implement a simple multi-layer neural network with a single input feature, a single output, and a single neuron per layer (i) in PyTorch and (ii) from scratch, and demonstrate that both approaches produce identical outcomes. https://github.com/spacemanidol/AFIRMDeepLearning2020/blob/master/NNPrimer.ipynb
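As a starting point for the exercise, here is a minimal sketch (illustrative values, not the linked notebook) of the same one-node-per-layer network in PyTorch; autograd should reproduce the hand-derived gradient from the previous slides:

    import torch

    x, y = torch.tensor(0.5), torch.tensor(0.8)
    w1 = torch.tensor(0.1, requires_grad=True)
    b1 = torch.tensor(0.0, requires_grad=True)
    w2 = torch.tensor(0.2, requires_grad=True)
    b2 = torch.tensor(0.0, requires_grad=True)

    y1 = torch.tanh(w1 * x + b1)
    y2 = torch.tanh(w2 * y1 + b2)
    loss = (y - y2) ** 2
    loss.backward()                  # autograd fills w1.grad, b1.grad, w2.grad, b2.grad

    with torch.no_grad():            # one gradient descent step with learning rate 0.1
        for p in (w1, b1, w2, b2):
            p -= 0.1 * p.grad

Comparing w1.grad against the manually computed ∂l/∂w1 is a quick way to verify the from-scratch implementation.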
  • 26. Computation Networks The "Lego" approach to specifying neural architectures. Library of neural layers, each layer defines logic for: 1. Forward pass: compute layer output given layer input. 2. Backward pass: (a) compute gradient of layer output w.r.t. layer inputs; (b) compute gradient of layer output w.r.t. layer parameters (if any). Chain nodes to create bigger and more complex networks.
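A minimal sketch of this idea (illustrative only, not a real library API): each node exposes a forward and a backward method, and nodes are chained to form the network.

    import numpy as np

    class Linear:
        def __init__(self, w, b):
            self.w, self.b = w, b
        def forward(self, x):
            self.x = x
            return self.w * x + self.b
        def backward(self, grad_out):
            self.grad_w = grad_out * self.x      # gradient w.r.t. layer parameters
            self.grad_b = grad_out
            return grad_out * self.w             # gradient w.r.t. layer input

    class Tanh:
        def forward(self, x):
            self.out = np.tanh(x)
            return self.out
        def backward(self, grad_out):
            return grad_out * (1 - self.out ** 2)

    layers = [Linear(0.1, 0.0), Tanh(), Linear(0.2, 0.0), Tanh()]
    out = 0.5
    for layer in layers:                         # forward pass, left to right
        out = layer.forward(out)
    grad = 1.0
    for layer in reversed(layers):               # backward pass, right to left
        grad = layer.backward(grad)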
  • 27. Why adding depth helps http://playground.tensorflow.org
  • 29. Bias-variance trade-off in the deep learning era Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. In PNAS, 2019.
  • 30. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019. The lottery ticket hypothesis
  • 33. Problem formulation LTR models represent a rankable item (e.g., a document, a movie, or a song), given some context (e.g., a user-issued query or the user's historical interactions with other items), as a numerical vector x ∈ ℝⁿ. The ranking model f: x → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher.
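As a concrete (assumed) instance of f, a small feedforward scorer over the feature vector might look like the following sketch; the layer sizes and feature count are arbitrary:

    import torch
    import torch.nn as nn

    n_features = 10                            # assumed size of the LTR feature vector x
    scorer = nn.Sequential(
        nn.Linear(n_features, 32),
        nn.ReLU(),
        nn.Linear(32, 1),                      # single real-valued relevance score
    )

    x = torch.randn(5, n_features)             # e.g., 5 candidate items for one query
    scores = scorer(x).squeeze(-1)
    ranking = torch.argsort(scores, descending=True)   # higher score means ranked higher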
  • 34. Why is ranking challenging? Examples of ranking metrics: Discounted Cumulative Gain (DCG): DCG@k = Σ_{i=1}^{k} (2^rel_i − 1) / log2(i + 1). Reciprocal Rank (RR): RR@k = max_{1 ≤ i ≤ k} rel_i / i. Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable.
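A minimal sketch of the two metrics, taking relevance labels already sorted in ranked order (rels[0] belongs to the top-ranked item):

    import numpy as np

    def dcg_at_k(rels, k):
        rels = np.asarray(rels, dtype=float)[:k]
        positions = np.arange(1, len(rels) + 1)
        return np.sum((2 ** rels - 1) / np.log2(positions + 1))

    def rr_at_k(rels, k):
        rels = np.asarray(rels, dtype=float)[:k]
        positions = np.arange(1, len(rels) + 1)
        return np.max(rels / positions) if len(rels) else 0.0

    # The metrics depend on the scores only through the induced ordering, so swapping two
    # documents changes them in discrete jumps; that is why they are non-smooth.
    print(dcg_at_k([3, 2, 0, 1], k=4), rr_at_k([0, 0, 1, 0], k=4))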
  • 35. Features Traditional L2R models employ hand-crafted features that encode IR insights. They can often be categorized as: query-independent or static features (e.g., incoming link count and document length), query-dependent or dynamic features (e.g., BM25), and query-level features (e.g., query length).
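To make the three categories concrete, a hand-crafted feature vector might be assembled as in this illustrative sketch (the feature names and values are made up, not from a specific benchmark):

    def make_features(query_terms, doc_stats, bm25_score):
        return [
            doc_stats["inlink_count"],   # query-independent / static
            doc_stats["doc_length"],     # query-independent / static
            bm25_score,                  # query-dependent / dynamic
            len(query_terms),            # query-level
        ]

    x = make_features(["neural", "ranking"], {"inlink_count": 12, "doc_length": 340}, 7.8)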
  • 36. Features Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
  • 37. Approaches Liu [2009] categorizes different LTR approaches based on training objectives: Pointwise approach: the relevance label y_{q,d} is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR); typically, a regression or classification model is trained to predict y_{q,d} given x_{q,d}. Pairwise approach: pairwise preference between documents for a query (d_i ≻ d_j w.r.t. q) serves as the label; reduces to binary classification to predict the more relevant document. Listwise approach: directly optimize for a rank-based metric, such as NDCG, which is difficult because these metrics are often not differentiable w.r.t. model parameters. Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
  • 38. Pointwise objectives Regression loss: given (q, d), predict the value of y_{q,d}, e.g., squared loss for binary or categorical labels, where y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. [Figure: labels vs. model predictions] Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
  • 39. Pointwise objectives Classification loss: given (q, d), predict the class y_{q,d}, e.g., cross-entropy with softmax over categorical labels Y [Li et al., 2008], where s_{y_{q,d}} is the model's score for label y_{q,d}. [Figure: labels vs. model predictions] Ping Li, Qiang Wu, and Christopher J Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
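A minimal sketch (made-up tensors) of the two pointwise objectives from the last two slides:

    import torch
    import torch.nn.functional as F

    # Regression loss: one score per (q, d), regressed toward the label value.
    scores = torch.tensor([0.3, 0.9, 0.6])
    labels = torch.tensor([0.0, 1.0, 1.0])
    regression_loss = F.mse_loss(scores, labels)                 # squared loss

    # Classification loss: one score per relevance grade, cross-entropy with softmax over Y.
    grade_scores = torch.tensor([[0.1, 0.5, 0.2],                # scores for grades {0, 1, 2}
                                 [0.7, 0.2, 0.9]])
    grade_labels = torch.tensor([1, 2])
    classification_loss = F.cross_entropy(grade_scores, grade_labels)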
  • 40. Pairwise objectives Given (q, d_i, d_j), predict the more relevant document. For (q, d_i) and (q, d_j), the feature vectors are x_i and x_j, and the model scores are s_i = f(x_i) and s_j = f(x_j). Pairwise loss generally has the form φ(s_i − s_j) for d_i ≻ d_j [Chen et al., 2009], where φ can be: the hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]; the exponential function φ(z) = e^(−z) [Freund et al., 2003]; the logistic function φ(z) = log(1 + e^(−z)) [Burges et al., 2005]; and others. Pairwise loss minimizes the average number of inversions in ranking, i.e., cases where d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i. Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
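A minimal sketch of the three φ choices applied to the score margin z = s_i − s_j, where d_i is the more relevant document:

    import torch

    def pairwise_loss(s_i, s_j, phi="logistic"):
        z = s_i - s_j
        if phi == "hinge":                     # Herbrich et al. [2000]
            return torch.clamp(1 - z, min=0)
        if phi == "exponential":               # Freund et al. [2003]
            return torch.exp(-z)
        if phi == "logistic":                  # Burges et al. [2005]
            return torch.log1p(torch.exp(-z))
        raise ValueError(phi)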
  • 41. Pairwise objectives RankNet loss: pairwise loss function proposed by Burges et al. [2005], an industry favourite [Burges, 2015]. Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^(γ·s_i) / (e^(γ·s_i) + e^(γ·s_j)) = 1 / (1 + e^(−γ(s_i − s_j))). Desired probabilities: p̄_ij = 1 and p̄_ji = 0. Computing the cross-entropy between the desired and predicted probabilities gives ℒ_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^(−γ(s_i − s_j))). [Figure: pairwise preference probability as a function of the score difference] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
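A minimal sketch of the RankNet loss for a single pair where d_i is preferred over d_j (so the desired probability is 1); the score values are made up:

    import torch

    def ranknet_loss(s_i, s_j, gamma=1.0):
        # -log p_ij = log(1 + exp(-gamma * (s_i - s_j)))
        return torch.log1p(torch.exp(-gamma * (s_i - s_j)))

    s_i = torch.tensor(2.0, requires_grad=True)
    s_j = torch.tensor(1.5, requires_grad=True)
    loss = ranknet_loss(s_i, s_j)
    loss.backward()            # gradient descent pushes s_i up and s_j down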
  • 42. A generalized cross-entropy loss An alternative loss function assumes a single relevant document d+ and compares it against the full collection D. Predicted probability: p(d+|q) = e^(γ·s(q,d+)) / Σ_{d∈D} e^(γ·s(q,d)). The cross-entropy loss is then given by ℒ_CE(q, d+, D) = −log p(d+|q) = −log [ e^(γ·s(q,d+)) / Σ_{d∈D} e^(γ·s(q,d)) ]. Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
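A minimal sketch (made-up scores) of this loss with sampled negatives: one positive document and a handful of negatives stand in for the full collection D.

    import torch
    import torch.nn.functional as F

    gamma = 10.0
    s_pos = torch.tensor([4.2])                 # score s(q, d+)
    s_negs = torch.tensor([1.3, 2.1, 0.7])      # scores for a few sampled negative documents
    logits = gamma * torch.cat([s_pos, s_negs]).unsqueeze(0)    # positive sits at index 0
    loss = F.cross_entropy(logits, torch.tensor([0]))           # -log softmax probability of d+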
  • 43. Listwise objectives [Figure from Burges, 2010: two rankings of relevant (blue) and non-relevant (gray) documents; NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors.] Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than errors at lower ranks. But listwise metrics are non-continuous and non-differentiable. Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
  • 44. Listwise objectives LambdaRank loss: Burges et al. [2006] make two observations: 1. To train a model we don't need the costs themselves, only the gradients of the costs w.r.t. the model scores. 2. It is desirable that the gradient be bigger for pairs of documents whose swap produces a bigger change in NDCG. LambdaRank therefore multiplies the actual (RankNet) gradients by the change in NDCG obtained by swapping the rank positions of the two documents. Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
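A minimal sketch of that lambda weighting: each pair's gradient is scaled by |ΔNDCG|, the change in NDCG if the documents at ranks i and j swapped positions (ranks are 0-indexed; idcg is the ideal DCG used for normalization; the helper names are illustrative).

    import numpy as np

    def abs_delta_ndcg(rels, i, j, idcg):
        gain = lambda r: 2.0 ** r - 1.0
        discount = lambda pos: 1.0 / np.log2(pos + 2.0)
        before = gain(rels[i]) * discount(i) + gain(rels[j]) * discount(j)
        after = gain(rels[i]) * discount(j) + gain(rels[j]) * discount(i)
        return abs(after - before) / idcg

    # lambda_ij = (gradient of the RankNet loss w.r.t. s_i) * abs_delta_ndcg(rels, i, j, idcg),
    # accumulated over all pairs (i, j) for the query.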
  • 45. Listwise objectives According to the Luce model [Luce, 2005], given four items {d1, d2, d3, d4}, the probability of observing a particular rank-order, say (d2, d1, d4, d3), is given by: P(π) = φ(s2) / (φ(s1) + φ(s2) + φ(s3) + φ(s4)) × φ(s1) / (φ(s1) + φ(s3) + φ(s4)) × φ(s4) / (φ(s3) + φ(s4)) × φ(s3) / φ(s3), where π is a particular permutation and φ is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i. ListNet loss: Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model scores and ground-truth labels; the loss is then given by the KL divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss: Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth; however, with categorical labels more than one ideal permutation is possible. R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
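A minimal sketch of the ListMLE objective under this model, taking φ = exp (the softmax choice); the loss is the negative log-probability of the ground-truth ordering, and the scores and ordering below are made up:

    import torch

    def listmle_loss(scores, ideal_order):
        # scores: model scores for all items; ideal_order: item indices in ground-truth rank order
        s = scores[ideal_order]
        # log P(order) = sum_i [ s_i - logsumexp(s_i, s_{i+1}, ..., s_n) ]
        log_probs = s - torch.stack([torch.logsumexp(s[i:], dim=0) for i in range(len(s))])
        return -log_probs.sum()

    scores = torch.tensor([1.2, 3.1, 0.4, 2.0], requires_grad=True)
    loss = listmle_loss(scores, torch.tensor([1, 3, 0, 2]))     # items listed by decreasing relevance
    loss.backward()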
  • 46. Listwise objectives Smooth DCG: Wu et al. [2009] compute a "smooth" rank of documents as a function of their scores. This smooth rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss. Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
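A minimal sketch of the smooth-rank idea, using a sigmoid approximation rather than the exact smoothed hinge functions of Wu et al. [2009]: each document's rank becomes a soft count of the documents scoring above it, which keeps DCG differentiable in the scores.

    import torch

    def smooth_dcg(scores, rels, temperature=1.0):
        diffs = (scores.unsqueeze(0) - scores.unsqueeze(1)) / temperature   # diffs[i, j] = s_j - s_i
        soft_rank = 0.5 + torch.sigmoid(diffs).sum(dim=1)                   # soft position of each document
        gains = 2.0 ** rels - 1.0
        return (gains / torch.log2(soft_rank + 1.0)).sum()

    scores = torch.tensor([2.0, 1.0, 0.5], requires_grad=True)
    rels = torch.tensor([3.0, 0.0, 1.0])
    loss = -smooth_dcg(scores, rels)        # maximize smooth DCG by minimizing its negative
    loss.backward()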