This is a presentation given on 13 August 2014 at the SF Data Mining Meetup at Trulia. It covers Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex.
7. Founded: January 2013
January 2014: a Data Science Studio powered team wins a Challenge
February 2014: Data Science Studio's GA
July 2014: Data Science Studio available for free with a Community Edition
Now: 15 people
15. Using historical logs of a search engine (QUERIES, RESULTS, CLICKS) and a set of new QUERIES and RESULTS, rerank the RESULTS in order to optimize relevance.
Personalized Web Search challenge (Yandex)
Fri 11 Oct 2013 – Fri 10 Jan 2014
194 teams
$9,000 cash prize
16. The Team
No researcher. No experience in reranking. Not much experience in ML for most of us. Not exactly our job. No expectations.
Kenji Lefevre, 37: algebraic geometry, learning Python
Christophe Bourguignat, 37: signal processing engineer, learning Scikit
Mathieu Scordia, 24: data scientist
Paul Masurel, 33: software engineer
28. PERSONALIZED RERANKING IS ABOUT REORDERING THE N-BEST RESULTS BASED ON THE USER'S PAST SEARCH HISTORY
Results obtained in the contest:
Original NDCG: 0.79056
Re-ranked NDCG: 0.80714
Equivalent to:
~ raising the rank of a relevant (relevancy = 2) result from rank #6 to rank #5 on every query
~ raising the rank of a relevant (relevancy = 2) result from rank #6 to rank #2 in 20% of the queries
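For context (the formula is not spelled out on the slide, but this is the standard definition with graded relevance in {0, 1, 2}, and its 2^rel − 1 gain is what later justifies the 1/3 weights used for re-ranking):

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```

where rel_i is the relevance of the result at position i and IDCG@k is the DCG of the ideal ordering of the same results.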
30. Simple, pointwise approach
For each (URL, session) pair, predict its relevance (0, 1 or 2).
[Diagram: sessions (Session 1, Session 2, ...), each URL labelled 0, 1 or 2]
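As a rough illustration of this pointwise setup (not the team's actual code; the two feature columns below are placeholders for the features described later), a scikit-learn classifier can be trained to predict the relevance class of each (URL, session) row:

```python
# Pointwise approach: one row per (URL, session), label = relevance in {0, 1, 2}.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.DataFrame({
    "display_rank":    [1, 2, 3, 1, 2],      # illustrative features only
    "past_click_prob": [0.6, 0.1, 0.0, 0.3, 0.2],
    "relevance":       [2, 0, 0, 1, 0],
})

features = ["display_rank", "past_click_prob"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train[features], train["relevance"])

# The class probabilities P(relevance = 0 / 1 / 2) are what the
# re-ranking step on the next slides actually consumes.
proba = clf.predict_proba(train[features])
```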
31. Supervised Learning on History
We split the 27 days of the train dataset into 24 days (history) + 3 days (annotated).
Stop at a random session within the last 3 days and treat it as the "test" session (like Yandex does).
[Diagram: Train Set (24 days of history) | Train Set (annotation) | Test Set]
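A minimal sketch of that split, assuming each logged session carries a day index between 1 and 27 (the `day` field and session structure are illustrative):

```python
# Split the 27-day training log: days 1-24 become "history" (used only to
# build features), days 25-27 become the annotated part used for learning.
def split_by_day(sessions, history_days=24, total_days=27):
    history   = [s for s in sessions if s["day"] <= history_days]
    annotated = [s for s in sessions if history_days < s["day"] <= total_days]
    return history, annotated

sessions = [{"id": i, "day": 1 + i % 27} for i in range(100)]  # toy sessions
history, annotated = split_by_day(sessions)
print(len(history), len(annotated))
```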
33. Feature construction: team members work independently.
Learning: team members work independently.
[Workflow: labelled 30-day data → features computed on the 30 days → split into train & validation]
34. REGRESSION or CLASSIFICATION?
Regression: we keep the hierarchy between the classes, but optimizing NDCG is cookery.
Classification: we lose the hierarchy, but we can optimize the NDCG (more on that later).
According to P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007:
classification outperforms regression.
35. Compute the probabilities P(relevance = x) for each class.
Build a sorted list: sort by P(relevance = 1) + 3 P(relevance = 2).
36. Hence order by decreasing P(relevance = 1) + 3 P(relevance = 2).
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
We got slightly better results with linear weighting.
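Since the NDCG gain of a result with relevance r is 2^r − 1, the expected gain of a candidate URL is 0·P(rel=0) + 1·P(rel=1) + 3·P(rel=2), so the re-ranking reduces to a sort on that score. A minimal sketch, reusing the class probabilities of a pointwise classifier:

```python
import numpy as np

def rerank(urls, proba, classes=(0, 1, 2)):
    """Sort URLs by decreasing expected gain P(rel=1) + 3 * P(rel=2).

    `proba` is an (n_urls, 3) array of class probabilities, with columns
    ordered as in `classes` (e.g. a classifier's predict_proba output).
    """
    p1 = proba[:, classes.index(1)]
    p2 = proba[:, classes.index(2)]
    score = p1 + 3.0 * p2
    order = np.argsort(-score)          # highest expected gain first
    return [urls[i] for i in order]

urls = ["url_a", "url_b", "url_c"]
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.5, 0.5, 0.0]])
print(rerank(urls, proba))             # url_b first: 0.3 + 3 * 0.6 = 2.1
```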
38. RANK AS A FEATURE
First of all, the rank. In this contest, the rank is both:
THE DISPLAY RANK: the rank that was displayed to the user.
THE NON-PERSONALIZED RANK: the rank computed by Yandex using PageRank, non-personalized log analysis(?), TF-IDF, machine learning, etc.
41. 53% OF THE COMPETITORS COULD NOT IMPROVE THE BASELINE
[Chart: 53% worse than the baseline, 47% better]
42. IDEAL
1. Compute the non-personalized rank.
2. Select the 10 best hits and serve them in order.
3. Re-rank using log analysis.
4. Put the new ranking algorithm in prod (yeah right!).
5. Compute NDCG on the new logs.
6. …
7. Profit!!
43. REAL
1. Compute the non-personalized rank.
2. Select the 10 best hits.
3. Serve the 10 best hits ranked in random order.
4. Re-rank using log analysis, including the non-personalized rank as a feature.
5. Compute the score against the log with the former rank.
44. PROBLEM
Users tend to click on the first few URLs. The user satisfaction metric is influenced by the display rank: our score is not aligned with our goal.
We cannot discriminate the effect of the signal of the non-personalized rank from the effect of the display rank.
45. THIS PROMOTES AN OVER-CONSERVATIVE RE-RANKING POLICY
Even if we know for sure that the URL at rank 9 would be clicked by the user if it were presented at rank 1, it would probably be a bad idea to re-rank it to rank 1 in this contest.
[Chart: average per session of the max position jump]
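For context, one way to read that chart: the "max position jump" of a session is how far the re-ranker moved any single result at most, and a conservative policy keeps it small. A small sketch of that statistic (the rank dictionaries are illustrative, not the contest data format):

```python
def max_position_jump(original_rank, new_rank):
    """Largest absolute move of any URL between original and re-ranked positions."""
    return max(abs(original_rank[url] - new_rank[url]) for url in original_rank)

original = {"a": 1, "b": 2, "c": 3, "d": 4}
reranked = {"a": 1, "b": 4, "c": 2, "d": 3}
print(max_position_jump(original, reranked))   # 2: "b" moved from rank 2 to rank 4
```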
48. REVISITS
In the past, when the user was shown this URL for the exact same query, what is the probability that:
• satisfaction = 2
• satisfaction = 1
• satisfaction = 0
• missed (not clicked)
• skipped (after the last click)
11 base features:
• 5 conditional probability features
• 1 overall display counter
• 4 mean reciprocal ranks (kind of the harmonic mean of the rank)
• 1 snippet quality score (a twisted formula used to compute snippet quality)
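A sketch of how one of these conditional-probability features could be estimated from the logs, using the additive smoothing discussed in the blog post referenced at the end (the outcome labels and the prior weight `alpha` are illustrative):

```python
from collections import Counter

OUTCOMES = ["sat2", "sat1", "sat0", "missed", "skipped"]

def revisit_features(past_displays, alpha=1.0):
    """P(outcome | this URL was displayed for this user and query in the past),
    with additive (Laplace) smoothing so rarely seen URLs are not over-trusted.

    `past_displays` is a list of outcome labels, one per past display.
    """
    counts = Counter(past_displays)
    denom = len(past_displays) + alpha * len(OUTCOMES)
    return {o: (counts[o] + alpha) / denom for o in OUTCOMES}

# Example: the URL was shown 4 times to this user for this query.
print(revisit_features(["sat2", "sat2", "skipped", "missed"]))
```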
49. MANY VARIATIONS
With the same user:
• (in the past | within the same session) → x 2
• (with this very query | whatever query | a subquery | a super query) → x 3
• and was offered (this URL | this domain) → x 2
→ 12 variants
Without being the same user (URL - query features):
• same domain
• same URL
• same query and same URL
→ 3 variants
15 variants × 11 base features = 165 features
51. CUMULATIVE FEATURES
Aggregate the features of the URLs above in the ranking list.
Rationale: if a URL above is likely to be clicked, those below are likely to be missed.
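A sketch of the cumulative aggregation with pandas, assuming one row per displayed URL sorted by display rank within its session (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "session": [1, 1, 1, 2, 2],
    "rank":    [1, 2, 3, 1, 2],
    "p_click": [0.8, 0.3, 0.1, 0.2, 0.6],   # any base feature works here
})

df = df.sort_values(["session", "rank"])
# For each URL, sum the feature over the URLs displayed *above* it:
# shift(1) excludes the URL itself, cumsum aggregates everything before it.
df["p_click_above"] = (
    df.groupby("session")["p_click"]
      .transform(lambda s: s.shift(1, fill_value=0.0).cumsum())
)
print(df)
```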
52. QUERY FEATURES
How complex and ambiguous is a query?
• Click entropy
• Number of times it has been queried
• Number of terms
• Average position within a session
• Average number of occurrences in a session
• MRR of its clicks
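Click entropy, for instance, is the Shannon entropy of the distribution of clicks over URLs for a given query: near zero for navigational queries with one dominant answer, higher for ambiguous queries. A minimal sketch:

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the distribution of clicks over URLs."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Navigational query: everyone clicks the same URL -> entropy 0.
print(click_entropy(["u1", "u1", "u1", "u1"]))    # 0.0
# Ambiguous query: clicks spread over several URLs -> higher entropy.
print(click_entropy(["u1", "u2", "u3", "u1"]))    # 1.5
```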
53. USER FEATURES
What are the user's habits?
• Click entropy
• User click rank counters: rank {1, 2} clicks, rank {3, 4, 5} clicks, rank {6, 7, 8, 9, 10} clicks
• Average number of terms
• Average number of different terms in a session
• Total number of queries issued by the user
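A sketch of the click-rank counters, bucketing each of the user's past clicks by the position it occurred at (bucket boundaries follow the slide):

```python
from collections import Counter

BUCKETS = {"rank_1_2": {1, 2}, "rank_3_5": {3, 4, 5}, "rank_6_10": {6, 7, 8, 9, 10}}

def click_rank_counters(click_ranks):
    """Count how often the user clicks at the top, middle and bottom of the page."""
    counters = Counter()
    for r in click_ranks:
        for name, ranks in BUCKETS.items():
            if r in ranks:
                counters[name] += 1
    return {name: counters[name] for name in BUCKETS}

print(click_rank_counters([1, 1, 2, 4, 9]))   # {'rank_1_2': 3, 'rank_3_5': 1, 'rank_6_10': 1}
```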
57. Short Story
Pointwise, Random Forest, 30 features, 4th place (*)
Listwise, LambdaMART, 90 features, 1st place (*)
(*) A Yandex "PaceMaker" team was also displaying results on the leaderboard and sat in first place during the whole competition, even though it was not officially a contestant.
Trained in 2 days, 1135 trees
Optimized & trained in ~1 hour (12 cores), 24 trees
58. LambdaMART
LambdaMART = LambdaRank + MART
From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82.
59. LambdaRank
[Figure from the Burges overview: an original ranking with 13 pairwise errors next to a re-ranked list with 11, high-quality vs. low-quality hits, and arrows contrasting the RankNet gradient with the LambdaRank "gradient"]
From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82.
60. Grid Search
We are not doing typical classification here: it is extremely important to perform the grid search directly against the final NDCG score.
NDCG "conservatism" ends up favoring a large "min samples per leaf" (between 40 and 80).
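Because NDCG is computed per session rather than per row, an off-the-shelf row-wise scorer does not apply directly; a hand-rolled grid search that re-ranks held-out sessions and averages their NDCG is enough. A minimal, self-contained sketch on toy data (not the contest pipeline, and illustrated with a random forest's min_samples_leaf rather than the LambdaMART implementation actually used):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)

def make_session(n_urls=10):
    """Toy session: 2 features per URL, relevance loosely tied to feature 0."""
    X = rng.rand(n_urls, 2)
    y = (X[:, 0] * 3).astype(int)          # relevance in {0, 1, 2}
    return X, y

train_sessions = [make_session() for _ in range(50)]
valid_sessions = [make_session() for _ in range(20)]
X_train = np.vstack([X for X, _ in train_sessions])
y_train = np.concatenate([y for _, y in train_sessions])

def dcg(rels):
    return sum((2 ** r - 1) / np.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 1.0

def mean_ndcg(model, sessions):
    """Re-rank each validation session with the model and average the NDCG."""
    scores = []
    for X, y in sessions:
        proba = model.predict_proba(X)
        gain = proba[:, 1] + 3.0 * proba[:, 2]   # expected NDCG gain, as above
        order = np.argsort(-gain)
        scores.append(ndcg([y[i] for i in order]))
    return float(np.mean(scores))

# Grid search min_samples_leaf directly against the final NDCG score.
for min_samples_leaf in [10, 20, 40, 80]:
    model = RandomForestClassifier(n_estimators=50,
                                   min_samples_leaf=min_samples_leaf,
                                   random_state=0).fit(X_train, y_train)
    print(min_samples_leaf, round(mean_ndcg(model, valid_sessions), 4))
```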
61. Feature Selection
Top-down approach: starting from a high number of features, iteratively remove subsets of features. This approach led to the subset of 90 features used by the winning LambdaMART solution. (A similar strategy is now implemented by sklearn.feature_selection.RFECV.)
Bottom-up approach: starting from a low number of features, add the feature that produces the best marginal improvement. This gave the 30 features that led to the best solution with the pointwise approach.
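A sketch of the bottom-up (greedy forward) selection described above; `evaluate` stands in for whatever validation score is used, e.g. the held-out mean NDCG from the previous slide:

```python
def forward_selection(all_features, evaluate, max_features=30):
    """Greedily add the feature with the best marginal improvement in score.

    `evaluate(features)` must train a model on the given feature subset and
    return its validation score (e.g. mean NDCG on held-out sessions).
    """
    selected, best_score = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in all_features if f not in selected]
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feature = max(scored)
        if score <= best_score:          # no candidate improves the score: stop
            break
        selected.append(feature)
        best_score = score
    return selected

# Toy usage: this "evaluate" simply prefers a fixed set of useful features.
useful = {"f1", "f4", "f7"}
features = [f"f{i}" for i in range(10)]
print(forward_selection(features, lambda fs: len(useful & set(fs)) - 0.01 * len(fs)))
```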
64. References
Contest URL: https://www.kaggle.com/c/yandex-personalized-web-search-challenge
These slides: http://www.slideshare.net/Dataiku
Paper with a detailed description of the solution: http://research.microsoft.com/en-us/um/people/nickcr/wscd2014/papers/wscdchallenge2014dataiku.pdf
Blog posts about the solution:
http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/
http://www.dataiku.com/blog/2014/01/14/winning-kaggle.html
Blog post about additive smoothing: http://fumicoton.com/posts/bayesian_rating
RankLib (implementation of LambdaMART): http://sourceforge.net/p/lemur/wiki/RankLib/
Research papers:
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82.
65. Random Thoughts
Dependency analysis, and comparing the rank with the predicted "relevance", could help identify general cases where the existing engine is not relevant enough. How does this compare to a purely statistical approach?
Applying personalization techniques this way might not be practical because of the amount of live information about users (each query, each click) that would have to be maintained in real time to make actionable predictions. How could a machine learning challenge enforce this kind of constraint?
Is data science a science, a sport or a hobby? Newcomers can discover a field, improve existing results, and seemingly obtain incrementally more effective results, with little plateau effect.
Are we just at the very beginning, the non-industrial era, of this discipline?