write your own data story!
short story
Founded: January 2013
January 2014: A Data Science Studio powered team wins a Challenge
February 2014: Data Science Studio's GA
July 2014: Data Science Studio available for free with a Community Edition
Now: 15 People
Roles: BI Developer, Business Analyst, Data Scientist
Steps: Data Preparation, Build Algorithm, Build Application, Run Application
"I don't want to be a data cleaner anymore"
"Finding Leaks in my Data Pipelines"
"Waiting for the (gradient boosted) trees to grow"
MPP Databases, Statistical Software, Machine Learning, No-SQL, Hadoop
Demo Time
Challenge
Personalized Web Search
Yandex
Fri 11 Oct 2013 – Fri 10 Jan 2014
194 Teams
$9,000 cash prize
Using historical logs of a search engine (QUERIES, RESULTS, CLICKS) and a set of new QUERIES and RESULTS, rerank the RESULTS in order to optimize relevance.
No researcher.
No experience in reranking.
Not much experience in ML for most of us.
Not exactly our job. No expectations.
The Team
Kenji Lefevre, 37, Algebraic Geometry, Learning Python
Christophe Bourguignat, 37, Signal Processing Eng., Learning Scikit
Mathieu Scordia, 24, Data Scientist
Paul Masurel, 33, Soft. Engineer
A-Team? "HOBBITS"
YANDEX SUPPLIED 27 DAYS OF ANONYMOUS LOG
Challenge Data
34,573,630 Sessions with user id
21,073,569 Queries
64,693,054 Clicks
~ 15GB
Example
Relevance?
A METRIC FOR RELEVANCE RIGHT FROM THE LOG?
ASSUMING WE SEARCH FOR "FRENCH NEWSPAPER", WE TAKE A LOOK AT THE LOGS.
WE COMPUTE THE SO-CALLED DWELL TIME OF A CLICK, I.E. THE TIME ELAPSED BEFORE THE NEXT ACTION.
DWELL TIME
DWELL TIME HAS BEEN SHOWN TO BE CORRELATED WITH THE RELEVANCE.
GOOD, WE HAVE A MEASURE OF RELEVANCE!
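A minimal sketch of the dwell-time idea, in Python. The record layout (timestamp, kind, url) is hypothetical, and the thresholds of 50 and 400 time units, as well as the choice to grade a final click as 2, are assumptions made for illustration; the contest rules are the authority on the exact grading.

```python
from typing import List, Optional, Tuple

# Each action is (timestamp, kind, url); kind is "query" or "click".
# This record layout is hypothetical, chosen only for the example.
Action = Tuple[int, str, Optional[str]]


def dwell_times(session: List[Action]) -> List[Tuple[Optional[str], Optional[int]]]:
    """Return (url, dwell) for each click; dwell is None for a final action."""
    out = []
    for i, (t, kind, url) in enumerate(session):
        if kind != "click":
            continue
        if i + 1 < len(session):
            # Dwell time: time elapsed before the next action in the session.
            out.append((url, session[i + 1][0] - t))
        else:
            out.append((url, None))  # no next action: dwell time unknown
    return out


def grade(dwell: Optional[int], low: int = 50, high: int = 400) -> int:
    """Map a dwell time to a relevance grade 0, 1 or 2 (assumed thresholds)."""
    if dwell is None or dwell >= high:
        return 2  # assumption: a final click counts as satisfied
    return 1 if dwell >= low else 0


session = [
    (0, "query", None),
    (10, "click", "lemonde.fr"),
    (30, "click", "lefigaro.fr"),
    (500, "query", None),
    (510, "click", "liberation.fr"),
]
grades = [(url, grade(d)) for url, d in dwell_times(session)]
```

In this toy session the first click is abandoned after 20 time units and is graded 0, while the two other clicks are graded 2.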
CAN WE GET AN OVERALL SCORE FOR OUR SEARCH ENGINE NOW?
Discounted Cumulative Gain (DCG): emphasis on relevant documents, discount per ranking
Normalized Discounted Cumulative Gain (NDCG): just normalize between 0 and 1
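As a sketch of how such a score can be computed: the version below uses the common formulation with gain 2^rel - 1 and a log2 discount on the rank. That choice of gain and discount is an assumption here, as is returning 0 for a list with no relevant result.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[int]) -> float:
    """Discounted cumulative gain of grades listed in displayed order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances: Sequence[int]) -> float:
    """DCG divided by the DCG of the ideal (sorted) ordering, in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True))
    # Convention assumed here: a list with no relevant result scores 0.
    return dcg(relevances) / ideal if ideal > 0 else 0.0


perfect = ndcg([2, 1, 0])    # already in the ideal order
shuffled = ndcg([0, 2, 1])   # the best result is displayed second
```

A list that is already sorted by relevance gets the maximum score, and any misordering lowers it, with mistakes near the top costing the most.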
PERSONALIZED RERANKING
IS ABOUT REORDERING THE N-BEST RESULTS BASED ON THE USER'S PAST SEARCH HISTORY
Results obtained in the contest:
Original NDCG 0.79056
ReRanked NDCG 0.80714
Equivalent to:
~ Raising the rank of a relevant (relevancy = 2) result from Rank #6 to Rank #5 on each query
~ Raising the rank of a relevant (relevancy = 2) result from Rank #6 to Rank #2 in 20% of the queries
How they did it
Simple, pointwise approach
For each (URL, Session) predict relevance (0, 1 or 2)
[Figure: Session 1, Session 2, ... with relevance labels 0, 1, 2]
Supervised Learning on History
We split the 27 days of the train dataset into 24 days (history) + 3 days (annotated).
Stop randomly in the last 3 days at a "test" session (like Yandex).
Train Set (24 days of history), Train Set (annotation), Test Set
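One possible reading of this split, sketched in Python. The session records (dicts with "id", "user" and "day" keys) and the rule of drawing one random annotated session per user are assumptions for illustration, not the team's actual code.

```python
import random
from typing import Dict, List, Tuple

Session = Dict[str, int]  # hypothetical record: {"id": ..., "user": ..., "day": ...}


def split_sessions(
    sessions: List[Session], history_days: int = 24
) -> Tuple[List[Session], List[Session]]:
    """Days 1..history_days feed the history; later days are annotated."""
    history = [s for s in sessions if s["day"] <= history_days]
    annotated = [s for s in sessions if s["day"] > history_days]
    return history, annotated


def pick_test_sessions(annotated: List[Session], seed: int = 0) -> Dict[int, Session]:
    """For each user, stop at one randomly chosen session of the last days."""
    rng = random.Random(seed)
    by_user: Dict[int, List[Session]] = {}
    for s in annotated:
        by_user.setdefault(s["user"], []).append(s)
    return {user: rng.choice(group) for user, group in by_user.items()}


# Toy data: 81 sessions spread over 27 days and 3 users.
sessions = [{"id": i, "user": i % 3, "day": 1 + i % 27} for i in range(81)]
history, annotated = split_sessions(sessions)
test_sessions = pick_test_sessions(annotated)
```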
Working with a ML workflow collaboratively
Features Construction: team members work independently
Learning: team members work independently
Split Train & Validation
Features on 30 days
Labelled 30 days data
REGRESSION or CLASSIFICATION
Regression: we keep the hierarchy between the classes, but optimizing NDCG is cookery.
Classification: we lose the hierarchy but we can optimize the NDCG (more on that later).
According to P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007:
Classification outperforms regression.
Compute the probabilities P(relevance = X)
Build a sorted list
Hence order by decreasing P(Relevance=1) + 3 P(Relevance=2)
P. Li, C. J. C. Burges, and Q. Wu (McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007) get slightly better results with linear weighting.
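A short sketch of this sorting rule. The weights 1 and 3 coincide with the usual NDCG gain 2^rel - 1 for grades 1 and 2. The candidate URLs and probabilities are made up, and the linear variant (weights 1 and 2) is one reading of the "linear weighting" mentioned above.

```python
from typing import Dict, List, Tuple


def expected_gain(p1: float, p2: float, w1: float = 1.0, w2: float = 3.0) -> float:
    """Sorting score: w1 * P(relevance=1) + w2 * P(relevance=2)."""
    return w1 * p1 + w2 * p2


def rerank(candidates: Dict[str, Tuple[float, float]], w2: float = 3.0) -> List[str]:
    """Order URLs by decreasing expected gain."""
    return sorted(
        candidates,
        key=lambda url: expected_gain(*candidates[url], w2=w2),
        reverse=True,
    )


# url -> (P(relevance=1), P(relevance=2)), illustrative classifier output
candidates = {"a": (0.5, 0.1), "b": (0.1, 0.3), "c": (0.3, 0.0)}
by_gain = rerank(candidates)             # weights 1 and 3
by_linear = rerank(candidates, w2=2.0)   # linear weighting: weights 1 and 2
```

With the weights 1 and 3, URL "b" moves ahead of "a" because a high chance of grade 2 outweighs a higher chance of grade 1.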
Features
FIRST OF ALL THE RANK (RANK AS feature)
In this contest, the rank is both:
THE DISPLAY RANK: the rank that has been displayed to the user
THE NON-PERSONALIZED RANK: the rank that is computed by Yandex using PageRank, non-personalized log analysis?, TF-IDF, and machine learning etc.
Digression
THE PROBLEM WITH RERANKING
53% OF THE COMPETITORS COULD NOT IMPROVE THE BASELINE
Worse: 53%
Better: 47%
IDEAL
1. compute non-personalized rank
2. select 10 best hits and serve them in order
3. re-rank using log analysis.
4. put new ranking algorithm in prod (yeah right!)
5. compute NDCG on new logs
6. …
7. Profits !!
REAL
1. compute non-personalized rank
2. select 10 best hits
3. serve 10 best hits ranked in random order
4. re-rank using log analysis, including non-personalized rank as a feature
5. compute score against the log with the former rank
PROBLEM
Users tend to click on the first few urls.
The user satisfaction metric is influenced by the display rank. Our score is not aligned with our goal.
We cannot discriminate the effect of the signal of the non-personalized rank from the effect of the display rank.
PROMOTES AN OVER-CONSERVATIVE RE-RANKING POLICY
Even if we know for sure that the url with rank 9 would be clicked by the user if it was presented at rank 1, it would probably be a bad idea to rerank it to rank 1 in this contest.
Average per session of the max position jump
end digression
FEATURES
Revisits (Query-(User)-URL) features and variants
Query Features
Cumulative Features
User Click Habits
Collaborative Filtering
Seasonality
REVISITS
In the past, when the user was displayed this url, with the exact same query, what is the probability of:
• satisfaction=2
• satisfaction=1
• satisfaction=0
• miss (not-clicked)
• skipped (after the last click)
11 Base Features:
5 conditional probability features
1 overall counter of display
4 mean reciprocal rank (kind of the harmonic mean of the rank)
1 snippet quality score (twisted formula used to compute snippet quality)
MANY VARIATIONS
With the same user: X 2 X 3 X 2 = 12 variants
• (In the past | within the same session),
• (with this very query | whatever query | a subquery | a super query)
• and was offered (this url | this domain)
Without being the same user (URL - query features): 3 variants
• Same Domain
• Same URL
• Same Query and Same URL
15 variants X 11 base features = 165 features
ADDITIVE SMOOTHING
http://fumicoton.com/posts/bayesian_rating
• book A : 1 rating of 5. Average rating of 5.
• book B : 50 ratings. Average rating of 4.5
In our case to evaluate the probability that a (URL|query) should have a label l, under predicate P:
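The formula from the slide is not available here. As a generic illustration of additive smoothing only, the sketch below shrinks an estimate towards a prior when few observations are available. The prior mean of 3, the prior strength of 5 and the function names are assumptions made for the example, not the team's settings.

```python
def smoothed_mean(total: float, count: int, prior_mean: float, strength: float) -> float:
    """Average pulled towards prior_mean by `strength` pseudo-observations."""
    return (total + strength * prior_mean) / (count + strength)


def smoothed_probability(
    label_count: int, total_count: int, n_labels: int, alpha: float = 1.0
) -> float:
    """Additive (Laplace) estimate of P(label) from counts."""
    return (label_count + alpha) / (total_count + alpha * n_labels)


# The two books above: A has one rating of 5, B has 50 ratings averaging 4.5.
book_a = smoothed_mean(total=5.0, count=1, prior_mean=3.0, strength=5.0)
book_b = smoothed_mean(total=50 * 4.5, count=50, prior_mean=3.0, strength=5.0)

# A (URL, query) pair never seen before falls back to the uniform prior.
unseen = smoothed_probability(label_count=0, total_count=0, n_labels=3)
```

With these settings book B ends up ahead of book A, because a single rating of 5 carries little evidence.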
CUMULATIVE FEATURES
Aggregate the features of the URLs above in the ranking list
Rationale: if a URL above is likely to be clicked, those below are likely to be missed
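A minimal sketch of such a cumulative feature. Summing is an assumption here, since the slides only say "aggregate"; a max or a mean would follow the same pattern.

```python
from typing import List


def cumulative_feature(values: List[float]) -> List[float]:
    """For each rank, sum the feature over the URLs displayed above it."""
    out: List[float] = []
    running = 0.0
    for v in values:
        out.append(running)  # only the URLs strictly above this rank
        running += v
    return out


# Illustrative click probabilities for ranks 1, 2, 3
click_probability = [0.5, 0.25, 0.125]
above = cumulative_feature(click_probability)
```

The first URL has nothing above it, so its cumulative value is 0; the third one sees the combined click probability of the first two.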
QUERY FEATURES
How complex and ambiguous is a query?
Click entropy
Number of times it has been queried for
Number of terms
Average position within a session
Average number of occurrences in a session
MRR of its clicks
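Click entropy can be sketched as the Shannon entropy of the distribution of clicked URLs for a query. The input format (a flat list of clicked URLs) is an assumption for the example.

```python
import math
from collections import Counter
from typing import Iterable


def click_entropy(clicked_urls: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the clicked-URL distribution."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no click observed: treated as zero entropy here
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


navigational = click_entropy(["lemonde.fr"] * 3)             # everyone clicks the same URL
ambiguous = click_entropy(["lemonde.fr", "lefigaro.fr"])     # clicks are split evenly
```

A query whose clicks always land on the same URL has zero entropy, while clicks spread over many URLs suggest an ambiguous query.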
USER FEATURES
What are the user's habits?
Click entropy
User click rank counters
Rank {1, 2} clicks
Rank {3, 4, 5} clicks
Rank {6,7,8,9,10 } clicks
Average number of terms
Average number of different terms in a session
Total number of queries issued by the user
SEASONALITY
What day is monday ?
COLLABORATIVE FILTERING (ATTEMPT)
User / Domain interaction matrix.
FunkSVD Algorithm (Simon Funk)
http://sifter.org/~simon/journal/20061211.html
Cython implementation: https://github.com/commonsense/divisi/blob/master/svdlib/_svdlib.pyx
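For readers unfamiliar with the technique, here is a small pure-Python sketch of matrix factorization by stochastic gradient descent in the spirit of FunkSVD. It updates all factors at once, which is a simplification of Funk's original factor-by-factor training, and it is not the Cython implementation linked above. The toy user/domain values and the hyperparameters are assumptions.

```python
import random
from typing import List, Tuple

Rating = Tuple[int, int, float]  # (user index, domain index, observed value)


def predict(P: List[List[float]], Q: List[List[float]], u: int, i: int) -> float:
    """Predicted interaction: dot product of user and domain factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def funk_svd(ratings: List[Rating], n_users: int, n_items: int, n_factors: int = 2,
             lr: float = 0.05, reg: float = 0.02, n_epochs: int = 200, seed: int = 0):
    """Learn factors P (users) and Q (domains) from the observed entries only."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# Toy user/domain interactions (illustrative values only)
ratings = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 0.0)]
P, Q = funk_svd(ratings, n_users=3, n_items=3)
```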
Marginal increase of 5.10^-5 of the NDCG
Why?
learning
Short Story
Point Wise, Random Forest, 30 Features, 4th Place (*)
List Wise , LambdaMART, 90 Features, 1st Place (*)
(*) A Yandex "PaceMaker" team was also displaying results on the leaderboard and was in first place during the whole competition, even if not officially a contestant.

Trained in 2 days, 1135 Trees
Optimize & Train in ~ 1 hour (12 cores), 24 trees
Lambda Mart
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges
Microsoft Research Technical Report MSR-TR-2010-82
LambdaMART = LambdaRank + MART
Lambda Rank
[Figure: Original Ranking (13 errors) vs Re Ranked (11 errors), with High Quality Hits and Low Quality Hits; RankNet Gradient vs LambdaRank "Gradient". Source: From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82]
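A rough sketch of the LambdaRank idea described in the Burges report: the RankNet pairwise term is scaled by the change in NDCG obtained by swapping the two documents. The code below fixes sigma to 1 and returns "pushes" (positive means the document should move up), i.e. the negative of the cost gradient. It is an illustration under those assumptions, not RankLib's implementation.

```python
import math
from typing import List


def lambda_pushes(scores: List[float], rels: List[int]) -> List[float]:
    """Per-document push: RankNet pairwise term weighted by |delta NDCG|."""
    n = len(scores)
    order = sorted(range(n), key=lambda k: scores[k], reverse=True)
    rank = {doc: pos + 1 for pos, doc in enumerate(order)}
    gain = [2 ** r - 1 for r in rels]
    disc = [1.0 / math.log2(rank[k] + 1) for k in range(n)]
    ideal = sum(g / math.log2(p + 2) for p, g in enumerate(sorted(gain, reverse=True)))
    pushes = [0.0] * n
    if ideal == 0:
        return pushes  # no relevant document: nothing to reorder
    for i in range(n):
        for j in range(n):
            if rels[i] > rels[j]:  # i should be ranked above j
                # NDCG change if documents i and j swapped positions.
                delta = abs((gain[i] - gain[j]) * (disc[i] - disc[j])) / ideal
                # RankNet term: close to 1 when the pair is badly misordered.
                rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
                pushes[i] += rho * delta
                pushes[j] -= rho * delta
    return pushes


# The only relevant document currently sits at the bottom of the list.
pushes = lambda_pushes(scores=[3.0, 2.0, 1.0], rels=[0, 0, 2])
```

The relevant document receives a positive push, and the document at rank 1 is pushed down harder than the one at rank 2 because swapping with it would change NDCG the most.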
Grid Search
We are not doing typical classification here. It is extremely important to perform grid search directly against the final NDCG score.
NDCG "conservatism" ends up with large "min samples per leaf" (between 40 and 80).
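A sketch of a grid search driven by a custom score. Here `score_fn` is a hypothetical callable that would train a model with the given parameters and return its NDCG on validation sessions; the toy function below only stands in for it, and its numbers are made up.

```python
import itertools
from typing import Any, Callable, Dict, Sequence, Tuple


def grid_search(param_grid: Dict[str, Sequence[Any]],
                score_fn: Callable[[Dict[str, Any]], float]
                ) -> Tuple[Dict[str, Any], float]:
    """Try every combination and keep the one with the highest score."""
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)  # e.g. validation NDCG, not accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_ndcg(params: Dict[str, Any]) -> float:
    """Stand-in for 'train, rerank the validation sessions, compute NDCG'."""
    return (0.80
            - 0.0001 * abs(params["min_samples_leaf"] - 60)
            + 0.00001 * params["n_estimators"])


grid = {"min_samples_leaf": [10, 40, 60, 80], "n_estimators": [12, 24]}
best_params, best_score = grid_search(grid, toy_validation_ndcg)
```

The point of the design is that the selection criterion is the ranking metric itself rather than a classification metric.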
Feature Selection
Top-Down approach: starting from a high number of features, iteratively remove subsets of features. This approach led to the subset of 90 features for the winning LambdaMART solution.
(Similar strategy now implemented by sklearn.feature_selection.RFECV)
Bottom-up approach: starting from a low number of features, add the features that produce the best marginal improvement. Gave the 30 features that lead to the best solution with the point-wise approach.
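The bottom-up approach can be sketched as a greedy forward selection. As before, `score_fn` is a hypothetical callable that would train on the given feature subset and return a validation score such as NDCG; the feature names and the toy scoring below are made up for the example.

```python
from typing import Callable, Dict, List, Sequence


def forward_selection(features: Sequence[str],
                      score_fn: Callable[[List[str]], float],
                      max_features: int) -> List[str]:
    """Greedily add the feature with the best marginal improvement."""
    selected: List[str] = []
    remaining = list(features)
    best = float("-inf")
    while remaining and len(selected) < max_features:
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


# Toy stand-in: each feature contributes a fixed amount to the score.
contribution: Dict[str, float] = {"rank": 0.5, "revisit": 0.3, "noise": -0.1}


def toy_score(subset: List[str]) -> float:
    return sum(contribution[f] for f in subset)


chosen = forward_selection(list(contribution), toy_score, max_features=3)
```

The loop stops as soon as no candidate improves the score, so the harmful "noise" feature is never added.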
Top Features
References
Research Papers
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82.
RankLib (implementation of LambdaMART): http://sourceforge.net/p/lemur/wiki/RankLib/
These slides: http://www.slideshare.net/Dataiku
Blog post about additive smoothing: http://fumicoton.com/posts/bayesian_rating
Blog posts about the solution:
http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/
http://www.dataiku.com/blog/2014/01/14/winning-kaggle.html
Paper with detailed description: http://research.microsoft.com/en-us/um/people/nickcr/wscd2014/papers/wscdchallenge2014dataiku.pdf
Contest URL: https://www.kaggle.com/c/yandex-personalized-web-search-challenge
Random Thoughts
Dependency analysis and comparing rank with predictive "relevance" could help determine general cases where the existing engine is not relevant enough.
How does it compare to a pure statistical approach?
Applying personalisation techniques this way might not be practical because of the amount of live information to be maintained (in real time) about users (each query, each click) to perform actionable predictions.
How could a machine learning challenge enforce this kind of constraint?
Is data science a science, a sport or a hobby? Newcomers can discover a field, improve existing results, and seemingly obtain incrementally more effective results, with little plateau effect.
Are we just at the very beginning, the non-industrial era, of this discipline?
THANK YOU!
Florian DOUETTEAU
florian.douetteau@daitaku.com

+33 6 70 56 88 97

Más contenido relacionado

La actualidad más candente

Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopRTTS
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Databricks
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataDataWorks Summit/Hadoop Summit
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
devopsdays Warsaw 2018 - Chaos while deploying ML
devopsdays Warsaw 2018 - Chaos while deploying MLdevopsdays Warsaw 2018 - Chaos while deploying ML
devopsdays Warsaw 2018 - Chaos while deploying MLThiago de Faria
 
Data kitchen 7 agile steps - big data fest 9-18-2015
Data kitchen   7 agile steps - big data fest 9-18-2015Data kitchen   7 agile steps - big data fest 9-18-2015
Data kitchen 7 agile steps - big data fest 9-18-2015DataKitchen
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP TestingRTTS
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper:  Volume Testing Thick Clients and DatabasesWhitepaper:  Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and DatabasesRTTS
 
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...Databricks
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World DistilledRTTS
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupBlake Irvine
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamSri Ambati
 
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Rehgan Avon
 
H2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicH2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicSri Ambati
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityRTTS
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureNiels Naglé
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIAmazon Web Services
 

La actualidad más candente (20)

Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
Taking Jupyter Notebooks and Apache Spark to the Next Level PixieDust with Da...
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
devopsdays Warsaw 2018 - Chaos while deploying ML
devopsdays Warsaw 2018 - Chaos while deploying MLdevopsdays Warsaw 2018 - Chaos while deploying ML
devopsdays Warsaw 2018 - Chaos while deploying ML
 
Data kitchen 7 agile steps - big data fest 9-18-2015
Data kitchen   7 agile steps - big data fest 9-18-2015Data kitchen   7 agile steps - big data fest 9-18-2015
Data kitchen 7 agile steps - big data fest 9-18-2015
 
How to Automate your Enterprise Application / ERP Testing
How to Automate your  Enterprise Application / ERP TestingHow to Automate your  Enterprise Application / ERP Testing
How to Automate your Enterprise Application / ERP Testing
 
Whitepaper: Volume Testing Thick Clients and Databases
Whitepaper:  Volume Testing Thick Clients and DatabasesWhitepaper:  Volume Testing Thick Clients and Databases
Whitepaper: Volume Testing Thick Clients and Databases
 
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World Distilled
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum ShachamH2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
H2O World - Data Science w/ Big Data in a Corporate Environment - Nachum Shacham
 
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
 
H2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom KraljevicH2O World - Building a Smarter Application - Tom Kraljevic
H2O World - Building a Smarter Application - Tom Kraljevic
 
Big Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data QualityBig Data Testing: Ensuring MongoDB Data Quality
Big Data Testing: Ensuring MongoDB Data Quality
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architecture
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Initiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AIInitiate Edinburgh 2019 - Big Data Meets AI
Initiate Edinburgh 2019 - Big Data Meets AI
 

Destacado

Florian Douetteau @ Dataiku
Florian Douetteau @ DataikuFlorian Douetteau @ Dataiku
Florian Douetteau @ DataikuPAPIs.io
 
User Engagement as Evaluation: a Ranking or a Regression Problem?
User Engagement as Evaluation: a Ranking or a Regression Problem?User Engagement as Evaluation: a Ranking or a Regression Problem?
User Engagement as Evaluation: a Ranking or a Regression Problem?Frédéric Guillou
 
Mining model for hotel recommendations (Kaggle Challenge)
Mining model for hotel recommendations (Kaggle Challenge)Mining model for hotel recommendations (Kaggle Challenge)
Mining model for hotel recommendations (Kaggle Challenge)Arjun Varma
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015 Dataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Dataiku
 
Learning to Rank: An Introduction to LambdaMART
Learning to Rank: An Introduction to LambdaMARTLearning to Rank: An Introduction to LambdaMART
Learning to Rank: An Introduction to LambdaMARTJulian Qian
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHDataiku
 
Tribology in Medicine
Tribology in MedicineTribology in Medicine
Tribology in MedicineLibin Thomas
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuDataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare IndustryDataiku
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Dataiku
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Advances in tribology
Advances in tribologyAdvances in tribology
Advances in tribologyApurv Tanay
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorialAlexandros Karatzoglou
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Dataiku
 

Destacado (20)

Florian Douetteau @ Dataiku
Florian Douetteau @ DataikuFlorian Douetteau @ Dataiku
Florian Douetteau @ Dataiku
 
User Engagement as Evaluation: a Ranking or a Regression Problem?
User Engagement as Evaluation: a Ranking or a Regression Problem?User Engagement as Evaluation: a Ranking or a Regression Problem?
User Engagement as Evaluation: a Ranking or a Regression Problem?
 
Mining model for hotel recommendations (Kaggle Challenge)
Mining model for hotel recommendations (Kaggle Challenge)Mining model for hotel recommendations (Kaggle Challenge)
Mining model for hotel recommendations (Kaggle Challenge)
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Learning to Rank: An Introduction to LambdaMART
Learning to Rank: An Introduction to LambdaMARTLearning to Rank: An Introduction to LambdaMART
Learning to Rank: An Introduction to LambdaMART
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
 
Tribology in Medicine
Tribology in MedicineTribology in Medicine
Tribology in Medicine
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Advances in tribology
Advances in tribologyAdvances in tribology
Advances in tribology
 
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems -  ACM RecSys 2013 tutorialLearning to Rank for Recommender Systems -  ACM RecSys 2013 tutorial
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorial
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
 

Similar a Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge

Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems Using Compass to Diagnose Performance Problems
Using Compass to Diagnose Performance Problems MongoDB
 
Using Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterUsing Compass to Diagnose Performance Problems in Your Cluster
Using Compass to Diagnose Performance Problems in Your ClusterMongoDB
 
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data

Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge

  • 1. write your own data story!
  • 4-7. Founded January 2013. January 2014: a Data Science Studio powered team wins a Challenge. February 2014: Data Science Studio’s GA. July 2014: Data Science Studio available for free with a Community Edition. Now: 15 people.
  • 9. “I don’t want to be a data cleaner anymore”
  • 10. Finding Leaks in my Data Pipelines
  • 12. MPP Databases, Statistical Software, Machine Learning, No-SQL, Hadoop
  • 15. Challenge: Personalized Web Search (Yandex). Fri 11 Oct 2013 – Fri 10 Jan 2014, 194 teams, $9,000 cash prize. Using historical logs of a search engine (QUERIES, RESULTS, CLICKS) and a set of new QUERIES and RESULTS, rerank the RESULTS in order to optimize relevance.
  • 16. The Team: Kenji Lefevre (37, Algebraic Geometry, learning Python); Christophe Bourguignat (37, Signal Processing Eng., learning Scikit); Mathieu Scordia (24, Data Scientist); Paul Masurel (33, Soft. Engineer). No researcher. No experience in reranking. Not much experience in ML for most of us. Not exactly our job. No expectations.
  • 19. Challenge Data: YANDEX SUPPLIED 27 DAYS OF ANONYMOUS LOGS. 34,573,630 sessions with user id; 21,073,569 queries; 64,693,054 clicks; ~15 GB. [Example log excerpt]
  • 21. A METRIC FOR RELEVANCE RIGHT FROM THE LOG? ASSUMING WE SEARCH FOR "FRENCH NEWSPAPER", WE TAKE A LOOK AT THE LOGS.
  • 22. DWELL TIME: WE COMPUTE THE SO-CALLED DWELL TIME OF A CLICK, I.E. THE TIME ELAPSED BEFORE THE NEXT ACTION.
  • 23. DWELL TIME HAS BEEN SHOWN TO BE CORRELATED WITH THE RELEVANCE
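The dwell-time idea can be sketched in a few lines of Python. This is only an illustration: the `(timestamp, kind, url)` tuple layout and the function names are hypothetical, and the grade thresholds (50 and 400 time units, with the last click of a session graded 2) are an assumption taken from the contest rules as generally described, not from these slides.

```python
def dwell_times(actions):
    """Return (url, dwell) for each click in one session.

    actions: list of (timestamp, kind, url) sorted by timestamp, where kind
    is "Q" (query) or "C" (click). This layout is hypothetical.
    dwell is the time until the next action, or None for the last action.
    """
    out = []
    for i, (t, kind, url) in enumerate(actions):
        if kind != "C":
            continue
        next_t = actions[i + 1][0] if i + 1 < len(actions) else None
        out.append((url, None if next_t is None else next_t - t))
    return out


def grade(dwell, short=50, long=400):
    """Map a dwell time to a relevance grade 0, 1 or 2 (assumed thresholds)."""
    if dwell is None or dwell >= long:  # last click of the session, or long dwell
        return 2
    if dwell >= short:
        return 1
    return 0


session = [(0, "Q", None), (10, "C", "a"), (30, "C", "b"),
           (200, "C", "c"), (700, "Q", None), (710, "C", "d")]
grades = [(url, grade(d)) for url, d in dwell_times(session)]
```

URLs that were displayed but never clicked would get grade 0 in this scheme.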
  • 24. GOOD, WE HAVE A MEASURE OF RELEVANCE! CAN WE GET AN OVERALL SCORE FOR OUR SEARCH ENGINE NOW?
  • 26. Discounted Cumulative Gain (DCG): emphasis on relevant documents, discount per ranking.
  • 27. Normalized Discounted Cumulative Gain (NDCG): just normalize between 0 and 1.
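A minimal sketch of DCG and NDCG, assuming the common exponential-gain form DCG = sum over ranks i of (2^rel_i - 1) / log2(i + 1). The slide formula itself did not survive in the transcript, so the exact variant and the cut-off `k=10` are assumptions. Returning 0.0 when no document is relevant is a convention chosen here.

```python
import math


def dcg(rels, k=10):
    """DCG of relevance grades listed in ranked order (rank 1 first)."""
    # enumerate() is 0-based, so rank i+1 gets the discount log2(i + 2).
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg(rels, k=10):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([2, 1, 0])` is 1.0 because the list is already ideally ordered, while pushing the only relevant result to rank 3, as in `ndcg([0, 0, 2])`, halves the score.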
  • 28. PERSONALIZED RERANKING IS ABOUT REORDERING THE N-BEST RESULTS BASED ON THE USER'S PAST SEARCH HISTORY. Results obtained in the contest: original NDCG 0.79056, reranked NDCG 0.80714. Equivalent to: raising the rank of a relevant (relevancy = 2) result from rank #6 to rank #5 on each query, or raising the rank of a relevant (relevancy = 2) result from rank #6 to rank #2 in 20% of the queries.
  • 30. Simple, point-wise approach: for each (URL, session), predict relevance (0, 1 or 2).
  • 31. Supervised Learning on History: we split the 27 days of the train dataset into 24 days (history) + 3 days (annotated). Stop randomly in the last 3 days at a “test" session (like Yandex). Train set (24 days of history), train set (annotation), test set.
  • 32. Working with an ML workflow collaboratively
  • 33. Feature construction: team members work independently. Learning: team members work independently. Split train & validation. Features on 30 days. Labelled 30 days data.
  • 34. REGRESSION or CLASSIFICATION? Regression: we keep the hierarchy between the classes, but optimizing NDCG is cookery. Classification: we lose the hierarchy, but we can optimize the NDCG (more on that later). According to P. Li, C. J. C. Burges, and Q. Wu (McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007), classification outperforms regression.
  • 35. Compute the probabilities P(relevance = X). Build a sorted list: sort by P(Relevance=1) + 3 P(Relevance=2).
  • 36. Hence order by decreasing P(Relevance=1) + 3 P(Relevance=2). P. Li, C. J. C. Burges, and Q. Wu (McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007) get slightly better results with linear weighting.
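The weights 1 and 3 match the gains 2^1 - 1 and 2^2 - 1 of an exponential-gain DCG, so the score is the expected gain of a result. A minimal sketch of the sort, where the function names and the `(p0, p1, p2)` tuple layout are hypothetical:

```python
def expected_gain(probs):
    """probs = (P(rel=0), P(rel=1), P(rel=2)) from a classifier."""
    return probs[1] + 3 * probs[2]


def rerank(urls, probs):
    """Order URLs by decreasing expected gain (ties keep the original order)."""
    scored = sorted(zip(urls, probs), key=lambda x: expected_gain(x[1]), reverse=True)
    return [url for url, _ in scored]


order = rerank(["a", "b", "c"],
               [(0.6, 0.3, 0.1), (0.2, 0.1, 0.7), (0.1, 0.8, 0.1)])
```

Here "b" comes first (score 2.2), then "c" (1.1), then "a" (0.6).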
  • 38. RANK AS FEATURE. FIRST OF ALL, THE RANK. In this contest, the rank is both: the rank that has been displayed to the user (THE DISPLAY RANK), and the rank that is computed by Yandex using PageRank, non-personalized log analysis(?), TF-IDF, machine learning, etc. (THE NON-PERSONALIZED RANK).
  • 41. 53% OF THE COMPETITORS COULD NOT IMPROVE THE BASELINE. Worse: 53%. Better: 47%.
  • 42. IDEAL
 1. compute non-personalized rank
 2. select the 10 best hits and serve them in order
 3. re-rank using log analysis
 4. put the new ranking algorithm in prod (yeah right!)
 5. compute NDCG on new logs
 6. …
 7. Profits!!
  • 43. REAL
 1. compute non-personalized rank
 2. select the 10 best hits
 3. serve the 10 best hits ranked in random order
 4. re-rank using log analysis, including the non-personalized rank as a feature
 5. compute the score against the log with the former rank
  • 44. PROBLEM: Users tend to click on the first few URLs. The user satisfaction metric is influenced by the display rank. Our score is not aligned with our goal. We cannot separate the effect of the signal of the non-personalized rank from the effect of the display rank.
  • 45. PROMOTES AN OVERLY CONSERVATIVE RE-RANKING POLICY. Even if we know for sure that the URL with rank 9 would be clicked by the user if it were presented at rank 1, it would probably be a bad idea to rerank it to rank 1 in this contest. [Figure: average per session of the max position jump]
  • 47. FEATURES: Revisits (Query-(User)-URL) features and variants; Query features; Cumulative features; User click habits; Collaborative filtering; Seasonality.
  • 48. REVISITS. In the past, when the user was displayed this URL with the exact same query, what is the probability of: satisfaction=2; satisfaction=1; satisfaction=0; miss (not clicked); skipped (after the last click)?
 11 base features: 5 conditional probability features; 1 overall counter of displays; 4 mean reciprocal ranks (kind of the harmonic mean of the rank); 1 snippet quality score (twisted formula used to compute snippet quality).
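The remark about the harmonic mean can be made concrete: the mean reciprocal rank (MRR) of a set of ranks is the inverse of their harmonic mean. A small sketch with a hypothetical function name; returning 0.0 for an empty history is a convention chosen here.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over the 1-based ranks at which a URL was seen or clicked."""
    ranks = list(ranks)
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks) / len(ranks)


mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 1/2 + 1/4) / 3
```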
  • 49. MANY VARIATIONS. With the same user: (in the past | within the same session) × (with this very query | whatever query | a subquery | a super query) × was offered (this URL | this domain): 2 × 3 × 2 = 12 variants. Without being the same user (URL-query features): same domain; same URL; same query and same URL: 3 variants. 15 variants × 11 base features = 165 features.
  • 50. ADDITIVE SMOOTHING (http://fumicoton.com/posts/bayesian_rating). Book A: 1 rating of 5, average rating of 5. Book B: 50 ratings, average rating of 4.5. In our case, to evaluate the probability that a (URL|query) should have a label l, under predicate P: [formula shown as an image on the slide]
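The slide's own formula is not in the transcript, so the following is only a sketch of generic additive smoothing, not necessarily the exact form the team used. The `prior` and `strength` parameters are assumptions: with few observations the estimate stays close to the prior, and with many it approaches the raw frequency.

```python
def smoothed_probability(label_count, total_count, prior, strength=1.0):
    """Additive smoothing: (n_l + strength * prior) / (n + strength)."""
    return (label_count + strength * prior) / (total_count + strength)


def smoothed_mean(rating_sum, n_ratings, prior_mean, strength):
    """The same idea applied to an average rating (the book example)."""
    return (rating_sum + strength * prior_mean) / (n_ratings + strength)


# Book A: one rating of 5. Book B: 50 ratings averaging 4.5.
# The prior mean of 3 and strength of 5 are made-up illustration values.
book_a = smoothed_mean(5.0, 1, prior_mean=3.0, strength=5.0)
book_b = smoothed_mean(50 * 4.5, 50, prior_mean=3.0, strength=5.0)
```

With these values book B ends up ahead of book A, even though A has the higher raw average.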
  • 51. CUMULATIVE FEATURES: aggregate the features of the URLs above in the ranking list. Rationale: if a URL above is likely to be clicked, those below are likely to be missed.
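One way to read "aggregate the features of the URLs above" is a running sum down the result list. The choice of sum as the aggregate and the function name are assumptions; the slides do not say which aggregate was used.

```python
def cumulative_above(values):
    """For each position, the sum of a feature over the URLs ranked above it."""
    out, running = [], 0.0
    for v in values:
        out.append(running)  # only the URLs strictly above this one
        running += v
    return out


# Hypothetical per-URL click probabilities, in display order.
above = cumulative_above([0.5, 0.2, 0.1])
```

The top result gets 0.0, and each later result gets the total click probability of everything displayed before it.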
  • 52. QUERY FEATURES: How complex and ambiguous is a query? Click entropy; number of times it has been queried; number of terms; average position within a session; average number of occurrences in a session; MRR of its clicks.
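Click entropy is usually defined as the Shannon entropy of the distribution of clicked URLs for a query. A minimal sketch under that assumption: a low value means most users click the same result, a high value suggests an ambiguous query.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the clicked-URL distribution for one query."""
    counts = Counter(clicked_urls)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    # p * log2(1/p) summed over the distinct URLs, with p = c / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

The same function applies to the user-level click entropy if the input is the list of URLs (or ranks) a given user clicked.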
  • 53. USER FEATURES: What are the user's habits? Click entropy; user click rank counters (rank {1, 2} clicks, rank {3, 4, 5} clicks, rank {6, 7, 8, 9, 10} clicks); average number of terms; average number of different terms in a session; total number of queries issued by the user.
  • 55. COLLABORATIVE FILTERING (ATTEMPT): User / domain interaction matrix. FunkSVD algorithm (Simon Funk, http://sifter.org/~simon/journal/20061211.html). Cython implementation: https://github.com/commonsense/divisi/blob/master/svdlib/_svdlib.pyx. Marginal increase of 5×10^-5 in NDCG. Why?
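For readers unfamiliar with FunkSVD, here is a small pure-Python sketch of the idea: factorize a sparse user/item matrix by stochastic gradient descent on the observed entries only. It is not the Cython implementation linked above. It updates all factors at once, whereas Funk's original description trains one factor at a time, and every hyperparameter value is an arbitrary illustration.

```python
import random


def funk_svd(ratings, n_users, n_items, n_factors=2,
             lr=0.05, reg=0.02, epochs=200, seed=0):
    """ratings: list of (user, item, value). Returns factor matrices P and Q."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error on the observed entries."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


# Toy user/domain interactions (made up): 1.0 = satisfied click, 0.0 = not.
toy = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 0.0)]
P, Q = funk_svd(toy, n_users=3, n_items=3, epochs=500)
```

The predicted interaction for a (user, domain) pair is the dot product of the corresponding rows of `P` and `Q`, which could then be fed to the ranker as one more feature.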
  • 57. Short Story. Point-wise, Random Forest, 30 features, 4th place (*). List-wise, LambdaMART, 90 features, 1st place (*). Trained in 2 days, 1135 trees. Optimize & train in ~1 hour (12 cores), 24 trees.
 (*) A Yandex “PaceMaker" team was also displaying results on the leaderboard and was in first place during the whole competition, even though it was not officially a contestant.
  • 58. LambdaMART = LambdaRank + MART. Source: From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82.
  • 59. LambdaRank. [Figure: original ranking (13 errors) vs. re-ranked (11 errors); high quality hits and low quality hits; RankNet gradient vs. LambdaRank "gradient".] Source: From RankNet to LambdaRank to LambdaMART: An Overview, Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82.
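The LambdaRank "gradient" can be sketched as follows, loosely following the Burges overview: for each pair where document i is more relevant than document j, a RankNet-style pairwise term is scaled by the NDCG change that swapping the two would cause. The sign convention (positive means "push this score up"), the exponential gain and the function name are choices made for this sketch, not details confirmed by the slides.

```python
import math


def lambda_forces(scores, rels, sigma=1.0):
    """Per-document LambdaRank forces for one query."""
    n = len(scores)
    order = sorted(range(n), key=lambda d: scores[d], reverse=True)
    pos = {doc: p for p, doc in enumerate(order)}  # 0-based current position
    ideal = sum((2 ** r - 1) / math.log2(p + 2)
                for p, r in enumerate(sorted(rels, reverse=True)))
    forces = [0.0] * n
    if ideal == 0:
        return forces
    for i in range(n):
        for j in range(n):
            if rels[i] <= rels[j]:
                continue
            # |delta NDCG| if documents i and j swapped positions
            gain_diff = 2 ** rels[i] - 2 ** rels[j]
            disc_diff = 1 / math.log2(pos[i] + 2) - 1 / math.log2(pos[j] + 2)
            delta = abs(gain_diff * disc_diff) / ideal
            # RankNet pairwise term: large when i is scored below j
            rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
            forces[i] += sigma * rho * delta
            forces[j] -= sigma * rho * delta
    return forces


# A highly relevant document scored below an irrelevant one.
forces = lambda_forces(scores=[0.1, 0.9], rels=[2, 0])
```

In LambdaMART these per-document values play the role of the gradients that each new regression tree is fitted to.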
  • 60. Grid Search: We are not doing typical classification here. It is extremely important to perform grid search directly against the final NDCG score. NDCG “conservatism” ends up favoring a large “min samples per leaf” (between 40 and 80).
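A minimal sketch of a grid search driven by an arbitrary ranking score rather than classification accuracy. The `evaluate` callback is hypothetical: in practice it would train a model with the given parameters and return the mean NDCG on validation sessions. Here it is replaced by a made-up toy function so the example runs on its own.

```python
import itertools


def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. mean validation NDCG
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for 'train, rerank, compute NDCG'. The numbers are invented."""
    return (0.80
            - 0.0001 * abs(params["min_samples_leaf"] - 60)
            - 0.0001 * abs(params["n_trees"] - 24))


best_params, best_score = grid_search(
    {"min_samples_leaf": [10, 40, 60, 80], "n_trees": [12, 24]}, toy_evaluate)
```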
  • 61. Feature Selection. Top-down approach: starting from a high number of features, iteratively remove subsets of features. This approach led to the subset of 90 features for the winning LambdaMART solution (a similar strategy is now implemented by sklearn.feature_selection.RFECV). Bottom-up approach: starting from a low number of features, add the features that produce the best marginal improvement. This gave the 30 features that led to the best solution with the point-wise approach.
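The bottom-up approach can be sketched as greedy forward selection. The `score` callback is hypothetical: it would train on the given feature subset and return the validation NDCG. The toy scorer and feature names below are made up for illustration.

```python
def forward_selection(features, score, max_features=None, min_gain=0.0):
    """Repeatedly add the feature with the best marginal improvement in score."""
    selected, remaining = [], list(features)
    best = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        cand_score, cand = max(((score(selected + [f]), f) for f in remaining),
                               key=lambda x: x[0])
        if cand_score - best <= min_gain:
            break  # no remaining feature improves the score
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best


# Toy scorer: each feature contributes a fixed, invented amount.
weights = {"display_rank": 0.5, "revisit_prob": 0.3, "noise": 0.0}
chosen, final_score = forward_selection(
    list(weights), lambda subset: sum(weights[f] for f in subset))
```

The top-down variant works the other way around: start from all features and repeatedly drop the subset whose removal hurts the score least.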
  • 64. References
 Research papers:
 P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
 Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82.
 RankLib (implementation of LambdaMART): http://sourceforge.net/p/lemur/wiki/RankLib/
 Blog post about additive smoothing: http://fumicoton.com/posts/bayesian_rating
 Blog posts about the solution: http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/ and http://www.dataiku.com/blog/2014/01/14/winning-kaggle.html
 Paper with detailed description: http://research.microsoft.com/en-us/um/people/nickcr/wscd2014/papers/wscdchallenge2014dataiku.pdf
 Contest URL: https://www.kaggle.com/c/yandex-personalized-web-search-challenge
 These slides: http://www.slideshare.net/Dataiku
  • 65. Random Thoughts. Dependency analysis and comparing rank with predictive “relevance" could help determine general cases where the existing engine is not relevant enough. How does it compare to a pure statistical approach? Applying personalisation techniques this way might not be practical because of the amount of live information to be maintained (in real time) about users (each query, each click) to perform actionable predictions. How could a machine learning challenge enforce this kind of constraint? Is data science a science, a sport or a hobby? Newcomers can discover a field, improve existing results, and seemingly obtain incrementally more effective results, with little plateau effect. Are we just at the very beginning, the non-industrial era, of this discipline?