Data Science in
E-commerce industry
Telecom Paris - Séminaires Big Data 2016/06/09
Vincent Michel
Big Data Europe, BDD, Rakuten Inc. / PriceMinister
vincent.michel@rakuten.com
@HowIMetYourData
2
Short Bio
Education
ESPCI: engineer in Physics / Biology
ENS Cachan: MVA Master Mathematics Vision and Learning
INRIA Parietal team: PhD in Computer Science
Understanding the visual cortex by using classification techniques
Experience
Logilab – Development and data science consulting
Data.bnf.fr (French National Library open-data platform)
Brainomics (platform for heterogeneous medical data)
Rakuten PriceMinister – Senior developer and data scientist
Data engineering and data science consulting
Software engineering
Lessons learned from (painful) experiences
4
Do not redo it yourself !
Lots of interesting open-source libraries for all your needs:
Test first on a small POC, then contribute/develop
Scikit-learn, pandas, Caffe, Scikit-image, opencv, ….
Be careful: it is easy to do something wrong !
Open-data:
More and more open-data for catalogs, …
E.g. data.bnf.fr
~ 2.000.000 authors
~ 200.000 works
~ 200.000 topics
Contribute to open-source:
Is there a need / pool of potential developers ?
Do it well (documentation / test)
Unless you are doing some kind of super magical algorithm
May bring you help, bug fixes, and engineers ! But it takes time and energy
5
Quality in data science software engineering
Never underestimate integration costs
It is easy to write 20 lines of Python code doing some
fancy Random Forests…
…that could be hard to deploy (data pipeline, packaging, monitoring)
Developer != DevOps != Sys admin
Make it clean from the start (> 2 days of dev or > 100 lines of code):
Tests, tests, tests, tests, tests, tests, tests, …
Documentation
Packaging / supervision / monitoring
Release early, release often
Agile development, Pull request, code versioning
Choose the right tool:
Do you really need this super fancy NoSQL database
to store your transactions?
6
Monitoring and metrics
Always monitor:
Your development: continuous integration (Jenkins)
Your service: nagios/shinken
Your business data (BI): Kibana
Your user: tracker
Your data science process : e.g. A/B test
Evaluation:
Choose the right metric
Prediction accuracy / Precision-recall …
Always A/B test rather than relying on personal thoughts
Good question leads to good answer: Define your problem
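As an illustration of the "Precision-recall" metric mentioned above, here is a minimal sketch for a list of recommendations. The function name `precision_recall_at_k` and the sample data are assumptions made for this example, not part of the talk.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision, recall) of the top-k recommended items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / float(len(top_k)) if top_k else 0.0
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall


# Hypothetical data: 5 recommended items, 3 items actually purchased.
recommended = ["a", "b", "c", "d", "e"]
relevant = ["b", "e", "z"]
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
```

Precision penalizes irrelevant recommendations, while recall penalizes missed relevant items. Which one matters more depends on the problem being defined.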
Hiring remarks
Selling yourself as a (good) data scientist
8
Few remarks on hiring – my personal opinion
Be careful of CVs with buzzwords!
E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, Hierarchical),
Random Forests, Regularization (L1, L2, Elastic net…) …”
It is like someone saying “IT skills: Python (for loop, if/else pattern, …)”
Often found in Junior CVs (ok), but huge warning in Senior CVs
Hungry for data?
Loving data is the most important thing to check
Open data? Personal project? Curious about data? (Hackathon?)
Multidisciplinary == knowing how to handle various datasets
Check for IT skills:
Should be able to install/develop new libraries/algorithms
A huge part of the job could be to format / cleanup the data
Experience VS education -> Autonomy
Recommendations @Rakuten
Data science use-case
10
Rakuten Group Worldwide
Recommendation
challenges
Different languages
User behavior
Business areas
11
Rakuten Group in Numbers
Rakuten in Japan
> 12.000 employees
> 48 billion euros of GMS
> 100.000.000 users
> 250.000.000 items
> 40.000 merchants
Rakuten Group
Kobo 18.000.000 users
Viki 28.000.000 users
Viber 345.000.000 users
12
Rakuten Ecosystem
Rakuten global ecosystem :
Member-based business model that connects Rakuten services
Rakuten ID common to various Rakuten services
Online shopping and services;
Main business areas
E-commerce
Internet finance
Digital content
Recommendation challenges
Cross-services
Aggregated data
Complex user features
13
Rakuten’s e-commerce: B2B2C Business Model
Business to Business to Consumer:
Merchants located in different regions / online virtual shopping mall
Main profit sources
• Fixed fees from merchants
• Fees based on each transaction and other services
Recommendation
challenges
Many shops
Items references
Global catalog
14
Big Data Department @ Rakuten
Big Data Department
150+ engineers – Japan / Europe / US
Missions
Development and operations of internal
systems for:
Recommendations
Search
Targeting
User behavior tracking
Average traffic
> 100.000.000 events / day
> 40.000.000 item views / day
> 50.000.000 searches / day
> 750.000 purchases / day
Technology stack
Java / Python / Ruby
Solr / Lucene
Cassandra / Couchbase
Hadoop / Hive / Pig
Redis / Kafka
15
Recommendations on Rakuten Marketplaces
Non-personalized recommendations
All-shop recommendations:
Item to item
User to item
In-shop recommendations
Review-based recommendations
Personalized recommendations
Purchase history recommendations
Cart add recommendations
Order confirmation recommendations
System status and scale
In production in over 35 services of Rakuten Group worldwide
Several hundreds of servers running:
Hadoop
Cassandra
APIs
Recommendations
The big picture
17
Challenges in Recommendations
[Figure: items catalogue, items similarity, recommendations engine, evaluation process]
Items catalogues
Catalogue for multiple shops with different items
references ?
Items similarity / distances
Cross services aggregation ?
Lots of parameters ?
Recommendations engine
Best / optimal recommendations logic ?
Evaluation process
Offline / online evaluation ?
Long-tail ? KPI ?
18
Recommendations Architecture: Constantly Evolving
Browsing events
Purchase events
Co-counts storage
Catalogue(s)
Distribution layer
Recommendations: offline / materialized
Recommendations: online algebra / multi-arm
19
Items Catalogues
Use different levels of aggregation to improve recommendations
Category-level (e.g. food, soda, clothes, …)
Product-level (manufactured items)
Item-in-shop level (specific product sold by a specific shop)
Increased statistical power in co-events computation
Easier business handling (picking the right item)
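The gain in statistical power from aggregation can be sketched as follows: co-counts observed at the item-in-shop level are summed at the product level. The function `aggregate_cocounts`, the `shop:product` identifiers and the counts are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def aggregate_cocounts(cocounts, item_to_product):
    """Sum item-level co-counts into product-level co-counts."""
    aggregated = defaultdict(int)
    for (item_a, item_b), n in cocounts.items():
        prod_a, prod_b = item_to_product[item_a], item_to_product[item_b]
        if prod_a != prod_b:  # skip pairs that collapse to the same product
            aggregated[(prod_a, prod_b)] += n
    return dict(aggregated)


# Hypothetical data: the same two products sold by two different shops.
item_to_product = {"shop1:tv": "tv", "shop2:tv": "tv",
                   "shop1:cable": "cable", "shop2:cable": "cable"}
cocounts = {("shop1:tv", "shop1:cable"): 1,
            ("shop2:tv", "shop2:cable"): 2,
            ("shop1:tv", "shop2:tv"): 4}
product_cocounts = aggregate_cocounts(cocounts, item_to_product)
```

Two weak item-level signals (1 and 2 co-events) become a single stronger product-level signal (3 co-events).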
20
Enriching Catalogues using Record Linkage
Marketplace 1 / Marketplace 2 / Reference database
Record linkage
• Use external sources (e.g., Wikidata) to align the marketplaces' products
• Fuzzy matching of 600K vs 350K items for the movie alignment use case
• Blocking algorithm
Cross recommendation
• Global catalog
• Items aggregation
• Helps with cold start issues
• Improved navigation
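A minimal sketch of record linkage with blocking and fuzzy matching, using only the standard library. The blocking key (first three letters of the title), the 0.85 threshold and the sample titles are assumptions for illustration; the talk does not say which blocking scheme or matcher was used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(title):
    """Blocking key: first 3 letters of the lowercased title (an assumption)."""
    return title.lower().strip()[:3]


def link_records(catalog_a, catalog_b, threshold=0.85):
    """Return (title_a, title_b, score) pairs whose fuzzy score >= threshold."""
    blocks = defaultdict(list)
    for title in catalog_b:
        blocks[block_key(title)].append(title)
    links = []
    for title_a in catalog_a:
        # Blocking: only compare against candidates sharing the same key.
        for title_b in blocks.get(block_key(title_a), []):
            score = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
            if score >= threshold:
                links.append((title_a, title_b, score))
    return links


# Hypothetical catalogs with slightly different spellings.
marketplace_1 = ["Terminator 2: Judgment Day", "Alien", "Blade Runner"]
marketplace_2 = ["Terminator 2 - Judgment Day", "Blade Runer", "Die Hard"]
links = link_records(marketplace_1, marketplace_2)
```

Blocking avoids comparing every pair of items (600K × 350K in the use case above), at the cost of missing matches whose keys differ.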
21
Semantic-web and RDF format
Triples: <subject> <relation> <object>
URI: unique identifier
http://dbpedia.org/page/Terminator_2:_Judgment_Day
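The triple model can be sketched with plain Python tuples. The URIs and the `match` helper below are illustrative assumptions; a real project would more likely rely on a dedicated RDF library and a SPARQL endpoint.

```python
# Each triple is (subject, relation, object); the URIs are illustrative.
T2 = "http://dbpedia.org/resource/Terminator_2:_Judgment_Day"
DIRECTOR = "http://dbpedia.org/ontology/director"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = [
    (T2, DIRECTOR, "http://dbpedia.org/resource/James_Cameron"),
    (T2, RDF_TYPE, "http://dbpedia.org/ontology/Film"),
]


def match(triples, subject=None, relation=None, obj=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return [
        (s, r, o) for s, r, o in triples
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]


directors = match(triples, subject=T2, relation=DIRECTOR)
```

Because every resource has a unique URI, triples coming from different sources can be merged simply by concatenating the lists.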
Recommendations
Co-counts and matrices
23
Recommendation datatypes
Ratings
Numerical feedback from the users
Sources: Stars, reviews, …
✔ Qualitative and valuable data
✖ Hard to obtain
Scaling and normalization !
[Figure: sparse users × items matrix of ratings from 1 to 5]
Unitary data
Only 0/1 without any quality
feedback
Sources: Click, purchase…
✔ Easy to obtain (e.g. tracker)
✖ No direct rating
[Figure: sparse users × items matrix containing only 1s]
24
Collaborative filtering
User-user
#items < #users
Items are changing quickly
[Figure: sparse users × items ratings matrix with one missing rating (?) to predict]
1 – Compute users similarities
(cosine-similarity, Pearson)
2 – Weighted average of ratings
Item-item
#items >> #users
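The two user-user steps (compute similarities, then take a weighted average of ratings) can be sketched with NumPy. This is a simplified illustration: a 0 is treated as "not rated", the cosine is computed over the full rows, and the small matrix is made up.

```python
import numpy as np


def predict_rating(ratings, user, item):
    """Predict ratings[user, item] from similar users (0 means 'not rated')."""
    norms = np.linalg.norm(ratings, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for empty rows
    # 1 - Cosine similarity between the target user and every user.
    sims = ratings.dot(ratings[user]) / (norms * norms[user])
    sims[user] = 0.0  # ignore the target user
    # 2 - Weighted average over the users who actually rated the item.
    rated = ratings[:, item] > 0
    weights = sims * rated
    if weights.sum() == 0:
        return 0.0
    return float(weights.dot(ratings[:, item]) / weights.sum())


# Hypothetical 4 users x 3 items matrix.
ratings = np.array([
    [5.0, 4.0, 0.0],
    [5.0, 4.0, 2.0],
    [1.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])
prediction = predict_rating(ratings, user=0, item=2)
```

User 0 rates like user 1, so the prediction is pulled toward user 1's rating of 2 rather than toward the 4 and 5 given by the less similar users.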
25
Matrix factorization
[Figure: the users × items ratings matrix approximated (~) by the product (X) of a users latent-factor matrix and an items latent-factor matrix]
Choose a number of latent variables to decompose the data
Predict new rating using the product of latent vectors
Use gradient descent techniques (e.g. SGD)
Add some regularization
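The four points above can be put together in a short NumPy sketch of matrix factorization trained by SGD with L2 regularization. The hyper-parameters (`n_factors`, `lr`, `reg`, `n_epochs`) and the tiny rating list are arbitrary choices for illustration, not values from the talk.

```python
import numpy as np


def factorize(triples, n_users, n_items, n_factors=2, lr=0.02, reg=0.02,
              n_epochs=1000, seed=0):
    """Fit user and item latent vectors on (user, item, rating) triples."""
    rng = np.random.RandomState(seed)
    users = 0.1 * rng.randn(n_users, n_factors)
    items = 0.1 * rng.randn(n_items, n_factors)
    for _ in range(n_epochs):
        for u, i, r in triples:
            # Prediction = product of the two latent vectors.
            err = r - users[u].dot(items[i])
            u_old = users[u].copy()
            # SGD step on the squared error, with L2 regularization.
            users[u] += lr * (err * items[i] - reg * users[u])
            items[i] += lr * (err * u_old - reg * items[i])
    return users, items


def rmse(triples, users, items):
    """Root mean squared error on the given ratings."""
    errors = [r - users[u].dot(items[i]) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errors))))


# Hypothetical observed ratings: (user index, item index, rating).
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
            (2, 1, 2.0), (2, 2, 5.0)]
users, items = factorize(observed, n_users=3, n_items=3)
predicted = users[0].dot(items[2])  # rating of user 0 for unseen item 2
```

On real data the error should be measured on held-out ratings, and the number of latent variables and the regularization strength tuned accordingly.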
26
Matrix factorization – MovieLens example
Read files
import csv

movies_fname = '/path/ml-latest/movies.csv'
with open(movies_fname) as fobj:
    # Map movieId -> title
    movies = dict((r[0], r[1]) for r in csv.reader(fobj))

ratings_fname = '/path/ml-latest/ratings.csv'
with open(ratings_fname) as fobj:
    header = next(fobj)  # skip the header line
    # (userId, movie title, rating) triples
    ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]
Build sparse matrix
import scipy.sparse as sp

user_idx, item_idx = {}, {}
data, rows, cols = [], [], []
for u, i, s in ratings:
    # Assign consecutive row / column indices to users and items
    rows.append(user_idx.setdefault(u, len(user_idx)))
    cols.append(item_idx.setdefault(i, len(item_idx)))
    data.append(s)
ratings = sp.csr_matrix((data, (rows, cols)))
# Reverse lookups: matrix index -> original identifier
reverse_item_idx = dict((v, k) for k, v in item_idx.items())
reverse_user_idx = dict((v, k) for k, v in user_idx.items())
27
Matrix factorization – MovieLens example
Fit Non-negative Matrix Factorization
from sklearn.decomposition import NMF
nmf = NMF(n_components=50)
user_mat = nmf.fit_transform(ratings)
item_mat = nmf.components_
Plot results
component_ind = 3
# Movies with a non-zero weight in the chosen latent component
component = [(reverse_item_idx[i], s)
             for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
    print('%s %.1f' % (movie, round(score)))
Terminator 2: Judgment Day (1991) 24.0
Terminator, The (1984) 23.0
Die Hard (1988) 19.0
Aliens (1986) 17.0
Alien (1979) 16.0
Exorcist, The (1973) 8.0
Halloween (1978) 7.0
Nightmare on Elm Street, A (1984) 7.0
Shining, The (1980) 7.0
Carrie (1976) 7.0
Star Trek II: The Wrath of Khan (1982) 10.0
Star Trek: First Contact (1996) 10.0
Star Trek IV: The Voyage Home (1986) 9.0
Contact (1997) 8.0
Star Trek VI: The Undiscovered Country (1991) 8.0
Blade Runner (1982) 8.0
28
Binary / Unitary data
Only occurrences of items views/purchases/…
Jaccard distance
Cosine similarity
Conditional probability
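For binary data, each item can be represented by the set of users who viewed or purchased it, and the three measures reduce to set operations. The function names and the sample sets below are chosen for this sketch.

```python
def jaccard(users_a, users_b):
    """|A & B| / |A | B|; the Jaccard distance is 1 minus this value."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / float(union) if union else 0.0


def cosine(users_a, users_b):
    """|A & B| / sqrt(|A| * |B|) for binary vectors."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / (len(users_a) * len(users_b)) ** 0.5


def conditional_probability(users_a, users_b):
    """P(B | A) = |A & B| / |A|; asymmetric by construction."""
    return len(users_a & users_b) / float(len(users_a)) if users_a else 0.0


# Hypothetical sets of users who purchased items A and B.
item_a = {"u1", "u2", "u3", "u4"}
item_b = {"u3", "u4"}
```

Here `conditional_probability(item_a, item_b)` is 0.5 while `conditional_probability(item_b, item_a)` is 1.0: unlike Jaccard and cosine, the conditional probability depends on the direction.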
29
Co-occurrences and Similarities Computation
Only access to unitary data (purchase / browsing)
Use co-occurrences for computing items similarity
Multiple possible parameters:
• Size of the time window to be considered:
Do browsing and purchase data reflect similar behavior?
• Threshold on co-occurrences
Is one co-occurrence significant enough to be used? Two? Three?
• Symmetric or asymmetric
Is the order important in the co-occurrence? A then B == B then A?
• Similarity metrics
Which similarity metric should be used on the co-occurrences?
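The first three parameters (time window, threshold, symmetry) can be sketched as arguments of a co-occurrence counter. The function `cooccurrences`, its defaults and the sample events are assumptions for illustration, not the production implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cooccurrences(events, window=timedelta(days=1), threshold=1,
                  symmetric=True):
    """Count item pairs seen by the same user within `window`.

    `events` is a list of (user, item, timestamp) tuples.
    """
    by_user = defaultdict(list)
    for user, item, ts in events:
        by_user[user].append((ts, item))
    counts = defaultdict(int)
    for user_events in by_user.values():
        user_events.sort()
        for idx, (ts_a, item_a) in enumerate(user_events):
            for ts_b, item_b in user_events[idx + 1:]:
                if ts_b - ts_a > window:
                    break  # sorted events: later ones are out of the window
                if item_a == item_b:
                    continue
                counts[(item_a, item_b)] += 1  # A then B
                if symmetric:
                    counts[(item_b, item_a)] += 1  # also count B then A
    # Keep only the pairs seen at least `threshold` times.
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Hypothetical purchase events.
events = [
    ("u1", "tv", datetime(2015, 11, 7, 10)),
    ("u1", "cable", datetime(2015, 11, 8, 9)),
    ("u2", "tv", datetime(2015, 11, 8, 10)),
    ("u2", "cable", datetime(2015, 11, 8, 11)),
    ("u2", "phone", datetime(2015, 11, 24, 12)),
]
asym = cooccurrences(events, threshold=2, symmetric=False)
sym = cooccurrences(events, threshold=2, symmetric=True)
```

The resulting counts are the input of the similarity metrics (Jaccard, cosine, conditional probability), which is the fourth parameter to choose.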
30
Co-occurrences Example
[Figure: timeline of dated browsing and purchase events, asking whether to group them by session (?) or by time window (time window 1, time window 2)]
31
Co-occurrences Computation
Classical co-occurrences
Co-purchases: complementary items (“You may also want…”)
Co-browsing: substitute items (“Similar items…”)
Other possible co-occurrences
Items browsed and bought together
Items browsed and not bought together
Recommendations
Development and evaluation
33
Recommendations Algebra
Algebra for defining and combining recommendations
engines
Key ideas
• Reuse existing logic and combine it easily.
• Write business logic, not code!
• Handle multiple input/output formats.
Available Logics
Content-based
Collaborative-filtering
Item-item
User-item
(personalization)
Available Backends
In-memory
HDF5 files
Cassandra
Couchbase
Available Hybridization
Linear algebra /
weighting
Mixed
Cascade engines
Meta-level
34
Python Algebra Example
Purchase-based engine: Top-20, asymmetric, conditional probability
Browsing-based engine: similarity > 0.01, symmetric, cosine similarity
Composite engine = purchase-based + 0.2 × browsing-based
>>> engine1 = RecommendationsEngine(nb_recos=20,
...                                 datatype='purchase',
...                                 asymmetric=True,
...                                 distance='conditional_probability')
>>> engine2 = RecommendationsEngine(similarity_th=0.01,
...                                 datatype='browsing',
...                                 asymmetric=False,
...                                 distance='cosine_similarity')
>>> composite_engine = engine1 + 0.2 * engine2
Get recommendations from items (item-to-item)
>>> recos = composite_engine.recommendations_by_items([123, 456, 789, …])
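How an expression such as `engine1 + 0.2 * engine2` can work is sketched below with Python operator overloading. `ToyEngine` is a hypothetical stand-in written for this example: it stores precomputed scores in a dictionary and is not the Rakuten `RecommendationsEngine`, whose internals the slides do not show.

```python
from collections import defaultdict


class ToyEngine(object):
    """Hypothetical engine: item -> {recommended item: score}."""

    def __init__(self, scores):
        self.scores = scores

    def __rmul__(self, weight):
        # weight * engine: scale every score.
        return ToyEngine({item: {reco: weight * s for reco, s in recos.items()}
                          for item, recos in self.scores.items()})

    __mul__ = __rmul__  # engine * weight behaves the same

    def __add__(self, other):
        # engine + engine: sum the scores of both engines.
        merged = defaultdict(lambda: defaultdict(float))
        for engine in (self, other):
            for item, recos in engine.scores.items():
                for reco, s in recos.items():
                    merged[item][reco] += s
        return ToyEngine({item: dict(recos) for item, recos in merged.items()})

    def recommendations_by_items(self, items, nb_recos=20):
        # Sum the scores over the input items, excluding the inputs themselves.
        totals = defaultdict(float)
        for item in items:
            for reco, s in self.scores.get(item, {}).items():
                if reco not in items:
                    totals[reco] += s
        return sorted(totals, key=totals.get, reverse=True)[:nb_recos]


purchase = ToyEngine({123: {456: 0.5, 789: 0.1}})
browsing = ToyEngine({123: {789: 1.0, 111: 0.4}})
composite = purchase + 0.2 * browsing
recos = composite.recommendations_by_items([123])
```

The weight 0.2 keeps the browsing signal from dominating the purchase signal: item 789 ends up second even though it is first in the browsing engine.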
35
Python Algebra with Personalization
Purchase-based engine: Top-20, asymmetric, conditional probability
Browsing-based engine: similarity > 0.01, symmetric, cosine similarity
Composite engine = purchase-based + 0.2 × browsing-based
Purchase-history engine: time window 180 days, time decay 0.01
>>> history = HistoryEngine(datatype='purchase', time_window=180,
...                         time_decay=0.01)
>>> engine1.register_history_engine(history)
…same code as previously (user-to-item)
>>> recos = composite_engine.recommendations_by_user('userid')
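One possible reading of the `time_window` and `time_decay` parameters is sketched below. The exponential decay `exp(-time_decay * age_in_days)` is an assumption: the slides do not state which decay function is used. The function `history_weights` and the purchases are hypothetical.

```python
import math
from datetime import date


def history_weights(purchases, today, time_window=180, time_decay=0.01):
    """Weight each purchased item by exp(-time_decay * age_in_days).

    The exponential form is an assumption; purchases older than
    `time_window` days are dropped.
    """
    weights = {}
    for item, day in purchases:
        age = (today - day).days
        if 0 <= age <= time_window:
            weight = math.exp(-time_decay * age)
            # Keep the most recent (highest) weight for repeated items.
            weights[item] = max(weight, weights.get(item, 0.0))
    return weights


# Hypothetical purchase history of one user.
purchases = [("tv", date(2015, 11, 8)), ("cable", date(2015, 9, 10)),
             ("phone", date(2015, 1, 1))]
weights = history_weights(purchases, today=date(2015, 11, 18))
```

The weighted items of the history can then be used as the inputs of an item-to-item engine to produce user-to-item recommendations.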
36
Python Algebra – Complete Example
Composite engine:
Purchase-based engine: Top-20, asymmetric, conditional probability
+ 0.2 × Browsing-based engine: similarity > 0.01, symmetric, cosine similarity
Purchase-history engine: time window 180 days, time decay 0.01
X (cascade)
Composite engine:
Purchase-based engine: category-level, similarity > 0.01, asymmetric, conditional probability
+ 0.1 × Browsing-based engine: category-level, similarity > 0.1, symmetric, cosine similarity
37
Recommendation Quality Challenges
Recommendations categories
Cold start issue
• External data ?
• Cross-services ?
Hot products (A)
• Top-N items ?
Short tail (B)
Long tail (C + D)
[Figure: products plotted from minor to major (popular) and from new to old, split into regions (A), (B), (C) and (D)]
38
Long Tail is Fat
Long tail numbers
• Most of the items are long tail
• They still represent a large
portion of the traffic
Long tail approaches
• Content-based
• Aggregation / clustering
• Personalization
[Figure: browsing share and number of items for the long tail, short tail, and popular items]
39
Recommendations Offline Evaluation
Pros/Cons
• Convenient way to
try new ideas
• Fast and cheap
• But hard to align
with online KPI
Approaches
• Rescoring
• Prediction game
• Business simulator
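One way to read the "prediction game" approach is sketched below: hide the last item of each user history and check whether the recommender gives it back in its top-k. This interpretation, the function `hit_rate_at_k` and the toy recommender are assumptions; the slides do not detail the protocol.

```python
def hit_rate_at_k(histories, recommend, k=10):
    """Hide each user's last item and check whether it is recommended back."""
    hits, trials = 0, 0
    for history in histories:
        if len(history) < 2:
            continue  # nothing to predict from
        known, hidden = history[:-1], history[-1]
        trials += 1
        if hidden in recommend(known)[:k]:
            hits += 1
    return hits / float(trials) if trials else 0.0


# Hypothetical recommender: always suggests the same popular items.
def popular_items(known):
    return [item for item in ["tv", "cable", "phone"] if item not in known]


histories = [["tv", "cable"], ["phone", "tv"], ["tv", "book"], ["cable"]]
score = hit_rate_at_k(histories, popular_items, k=2)
```

Such a score is fast and cheap to compute, but a popularity baseline often does well on it, which is one reason offline results are hard to align with online KPIs.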
40
Public Initiative – Viki Recommendation Challenge
567 submissions from 132 participants
http://www.dextra.sg/challenges/rakuten-viki-video-challenge
41
Data science everywhere!
Rakuten provides marketplaces worldwide
Specific challenges for recommendations
Items catalogue: reinforce the statistical power of co-occurrences
across shops and services;
Items similarities: find the right parameters for the different use cases;
Recommendations models: what are the best models for in-shop,
all-shops, personalization?
Evaluation: handling long-tail? Comparing different models?
42
THANKS !
Questions ?
More on Rakuten tech initiatives
http://www.slideshare.net/rakutentech
http://rit.rakuten.co.jp/oss.html
http://rit.rakuten.co.jp/opendata.html
Positions
• http://global.rakuten.com/corp/careers/bigdata/
• http://www.priceminister.com/recrutement/?p=197
43
We are Hiring!
Big Data Department – team in Paris
http://global.rakuten.com/corp/careers/bigdata/
http://www.priceminister.com/recrutement/?p=197
Data Scientist / Software Developer
• Build algorithms for recommendations, search, targeting
• Predictive modeling, machine learning, natural language processing
• Working close to business
• Python, Java, Hadoop, Couchbase, Cassandra…
• Also hiring: search engine developers, big data system administrators, etc.


Telecom datascience master_public

  • 1. Data Science in E-commerce industry Telecom Paris - Séminaires Big Data 2016/06/09 Vincent Michel Big Data Europe, BDD, Rakuten Inc. / PriceMinister vincent.michel@rakuten.com @HowIMetYourData
  • 2. 2 Short Bio ESPCI: engineer in Physics / Biology ENS Cachan: MVA Master Mathematics Vision and Learning INRIA Parietal team: PhD in Computer Science Understanding the visual cortex by using classification techniques Logilab – Development and data science consulting Data.bnf.fr (French National Library open-data platform) Brainomics (platform for heterogeneous medical data) Education Experience Rakuten PriceMinister– Senior Developer and data scientist Data engineer and data science consulting
  • 3. Software engineering Lessons learned from (painful) experiences
  • 4. 4 Do not redo it yourself ! Lots of interesting open-source libraries for all your needs: Test first on a small POC, then contribute/develop Scikit-learn, pandas, Caffe, Scikit-image, opencv, …. Be careful: it is easy to do something wrong ! Open-data: More and more open-data for catalogs, … E.g. data.bnf.fr ~ 2.000.000 authors ~ 200.000 works ~ 200.000 topics Contribute to open-source: Is there a need / pool of potential developers ? Do it well (documentation / test) Unless you are doing some kind of super magical algorithm May bring you help, bug fixes, and engineers ! But it takes time and energy
  • 5. 5 Quality in data science software engineering Never underestimates integration cost Easy to write a 20 lines Python code doing some fancy Random Forests… …that could be hard to deploy (data pipeline, packaging, monitoring) Developer != DevOps != Sys admin Make it clean from the start (> 2 days of dev or > 100 lines of code): Tests, tests, tests, tests, tests, tests, tests, … Documentation Packaging / supervision / monitoring Release often release earlier Agile development, Pull request, code versioning Choose the right tool: Do you really need this super fancy NoSQL database to store your transactions?
  • 6. 6 Monitoring and metrics Always monitor: Your development: continuous integration (Jenkins) Your service: nagios/shinken Your business data (BI): Kibana Your user: tracker Your data science process : e.g. A/B test Evaluation: Choose the right metric Prediction accuracy / Precision-recall … Always A/B test rather than relying on personal thoughts Good question leads to good answer: Define your problem
  • 7. Hiring remarks Selling yourself as a (good) data scientist
  • 8. 8 Few remarks on hiring – my personal opinion Be careful of CVs with buzzwords! E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, Hierarchical), Random Forests, Regularization (L1, L2, Elastic net…) …” It is like as someone saying “ IT skills: Python (for loop, if/else pattern, …) Often found in Junior CVs (ok), but huge warning in Senior CVs Hungry for data? Loving data is the most important thing to check Opendata? Personal project? Curious about data? (Hackaton?) Pluridisciplinary == knowing how to handle various datasets Check for IT skills: Should be able to install/develop new libraries/algorithms A huge part of the job could be to format / cleanup the data Experience VS education -> Autonomy
  • 10. Rakuten Group Worldwide
      Recommendation challenges:
        • Different languages
        • User behavior
        • Business areas
  • 11. Rakuten Group in Numbers
      Rakuten in Japan:
        • > 12,000 employees
        • > 48 billion euros of GMS
        • > 100,000,000 users
        • > 250,000,000 items
        • > 40,000 merchants
      Rakuten Group:
        • Kobo: 18,000,000 users
        • Viki: 28,000,000 users
        • Viber: 345,000,000 users
  • 12. Rakuten Ecosystem
      Rakuten global ecosystem:
        • Member-based business model that connects Rakuten services
        • Rakuten ID common to various Rakuten services
        • Online shopping and services
      Main business areas:
        • E-commerce
        • Internet finance
        • Digital content
      Recommendation challenges:
        • Cross-services
        • Aggregated data
        • Complex user features
  • 13. Rakuten’s e-commerce: B2B2C Business Model
      Business to Business to Consumer:
        • Merchants located in different regions / online virtual shopping mall
      Main profit sources:
        • Fixed fees from merchants
        • Fees based on each transaction and other services
      Recommendation challenges:
        • Many shops
        • Item references
        • Global catalog
  • 14. Big Data Department @ Rakuten
      Big Data Department:
        • 150+ engineers in Japan / Europe / US
      Missions: development and operations of internal systems for
        • Recommendations
        • Search
        • Targeting
        • User behavior tracking
      Average traffic:
        • > 100,000,000 events / day
        • > 40,000,000 item views / day
        • > 50,000,000 searches / day
        • > 750,000 purchases / day
      Technology stack:
        • Java / Python / Ruby
        • Solr / Lucene
        • Cassandra / Couchbase
        • Hadoop / Hive / Pig
        • Redis / Kafka
  • 15. Recommendations on Rakuten Marketplaces
      Non-personalized recommendations:
        • All-shop recommendations: item to item, user to item
        • In-shop recommendations
        • Review-based recommendations
      Personalized recommendations:
        • Purchase history recommendations
        • Cart add recommendations
        • Order confirmation recommendations
      System status and scale:
        • In production in over 35 services of Rakuten Group worldwide
        • Several hundred servers running Hadoop, Cassandra, APIs
  • 17. Challenges in Recommendations
      [Diagram stages: Items Catalogue, Items Similarity, Recommendations Engine, Evaluation Process]
      Items catalogues:
        • Catalogue for multiple shops with different item references?
      Items similarity / distances:
        • Cross-service aggregation?
        • Lots of parameters?
      Recommendations engine:
        • Best / optimal recommendations logic?
      Evaluation process:
        • Offline / online evaluation?
        • Long tail?
        • KPI?
  • 18. Recommendations Architecture: Constantly Evolving
      [Architecture diagram; components: Browsing Events, Purchase Events, Catalogue(s), Co-counts Storage, Distribution layer, Recommendations (offline / materialized), Recommendations (online algebra / multi-arm)]
  • 19. Items Catalogues
      Use different levels of aggregation to improve recommendations:
        • Category level (e.g. food, soda, clothes, …)
        • Product level (manufactured items)
        • Item-in-shop level (a specific product sold by a specific shop)
      Increased statistical power in co-events computation
      Easier business handling (picking the right item)
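The gain in statistical power can be sketched on a toy example. The identifiers, the item-to-product mapping, and the baskets below are all made up for illustration; the point is only that pairs which are each seen once at the item-in-shop level pool into a single, better-supported pair at the product level.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping: item-in-shop id -> product id (same product sold by several shops).
ITEM_TO_PRODUCT = {
    "shopA:cola-33cl": "cola-33cl",
    "shopB:cola-33cl": "cola-33cl",
    "shopA:chips": "chips",
    "shopC:chips": "chips",
}

# Hypothetical baskets of item-in-shop ids bought together.
baskets = [
    ["shopA:cola-33cl", "shopA:chips"],
    ["shopB:cola-33cl", "shopC:chips"],
    ["shopB:cola-33cl", "shopA:chips"],
]


def co_counts(baskets, level=lambda item: item):
    """Count unordered co-events after mapping every item to the given level."""
    counts = Counter()
    for basket in baskets:
        keys = sorted({level(item) for item in basket})
        counts.update(combinations(keys, 2))
    return counts


item_level = co_counts(baskets)
product_level = co_counts(baskets, level=ITEM_TO_PRODUCT.get)

print(item_level)      # three different pairs, each observed once
print(product_level)   # one pair that pools the three observations
```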
  • 20. Enriching Catalogues using Record Linkage
      [Diagram: Marketplace 1, Marketplace 2, Reference database]
      Record linkage:
        • Use external sources (e.g., Wikidata) to align the marketplaces' products
        • Fuzzy matching of 600K vs 350K items for the movie alignment use case
        • Blocking algorithm
      Cross recommendation:
        • Global catalog
        • Items aggregation
        • Helps with cold-start issues
        • Improved navigation
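A minimal sketch of blocking followed by fuzzy matching, using only the standard library. This is not the pipeline used in the talk: the records, the choice of release year as blocking key, and the 0.85 threshold are assumptions for illustration. Blocking restricts comparisons to records sharing a key, which avoids comparing every item of one catalogue with every item of the other.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical movie records from two marketplaces: (title, release year).
market_1 = [("Terminator 2: Judgment Day", 1991), ("Alien", 1979), ("Die Hard", 1988)]
market_2 = [("Terminator 2 - Judgement Day", 1991), ("Aliens", 1986), ("Die hard", 1988)]


def normalize(title):
    """Lowercase and keep only letters, digits and single spaces."""
    kept = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(kept.split())


def link(records_a, records_b, threshold=0.85):
    """Block on release year, then fuzzy-match titles inside each block."""
    blocks = defaultdict(list)
    for title, year in records_b:
        blocks[year].append(title)               # blocking key: release year (assumption)
    matches = []
    for title, year in records_a:
        for candidate in blocks.get(year, []):   # only compare within the same block
            score = SequenceMatcher(None, normalize(title), normalize(candidate)).ratio()
            if score >= threshold:
                matches.append((title, candidate, round(score, 2)))
    return matches


matches = link(market_1, market_2)
for match in matches:
    print(match)
```

In this toy data the year block also keeps "Alien" (1979) and "Aliens" (1986) apart, although their titles alone would pass the similarity threshold.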
  • 21. Semantic web and RDF format
      Triples: <subject> <relation> <object>
      URI: unique identifier, e.g. http://dbpedia.org/page/Terminator_2:_Judgment_Day
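The triple model can be sketched with plain tuples. The "ex:" identifiers below are invented placeholders standing in for real URIs, and the `match` helper is a hypothetical name; a real application would rely on an RDF library and a SPARQL endpoint instead.

```python
# Triples as (subject, relation, object); the "ex:" identifiers are made up for this sketch.
triples = [
    ("ex:Terminator_2", "ex:director", "ex:James_Cameron"),
    ("ex:Terminator_2", "ex:releaseYear", "1991"),
    ("ex:Aliens", "ex:director", "ex:James_Cameron"),
]


def match(subject=None, relation=None, obj=None):
    """Return the triples matching the given pattern (None acts as a wildcard)."""
    return [t for t in triples
            if subject in (None, t[0]) and relation in (None, t[1]) and obj in (None, t[2])]


# Which subjects have ex:James_Cameron as director?
directed = [s for s, _, _ in match(relation="ex:director", obj="ex:James_Cameron")]
print(directed)
```

Because each subject is a unique identifier, two marketplaces that attach the same identifier to their own listings can be aligned without comparing titles.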
  • 23. Recommendation datatypes
      Ratings:
        • Numerical feedback from the users
        • Sources: stars, reviews, …
        • ✔ Qualitative and valuable data
        • ✖ Hard to obtain
        • Scaling and normalization needed!
        • [Figure: sparse users × items matrix filled with ratings from 1 to 5]
      Unitary data:
        • Only 0/1, without any quality feedback
        • Sources: clicks, purchases, …
        • ✔ Easy to obtain (e.g. tracker)
        • ✖ No direct rating
        • [Figure: sparse users × items matrix filled with 1s]
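One common form of rating normalization is per-user mean-centering, sketched below on made-up ratings. The slide does not say which normalization was used, so treat this as one possible option: it removes the offset between generous and harsh raters before similarities are computed.

```python
# Hypothetical ratings: user -> {item: stars}
ratings = {
    "alice": {"alien": 5, "die_hard": 3, "carrie": 4},   # generous rater
    "bob": {"alien": 3, "die_hard": 1, "carrie": 2},     # harsh rater
}


def mean_center(user_ratings):
    """Subtract the user's mean rating so that 0 means "average for this user"."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {item: stars - mean for item, stars in user_ratings.items()}


centered = {user: mean_center(user_ratings) for user, user_ratings in ratings.items()}
print(centered)
```

After centering, the two users of this toy example turn out to have identical preferences, which the raw star values hide.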
  • 24. Collaborative filtering
      User-user:
        • #items < #users
        • Items are changing quickly
        • Step 1: compute user similarities (cosine similarity, Pearson correlation)
        • Step 2: weighted average of ratings
        • [Figure: users × items rating matrix with a missing rating "?" to predict]
      Item-item:
        • #items >> #users
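The two user-user steps can be sketched in a few lines of plain Python. The ratings are made up, and restricting the cosine similarity to co-rated items is one of several possible conventions, chosen here for simplicity.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 5, "d": 4},
    "u3": {"a": 1, "b": 5, "d": 2},
}


def cosine(r1, r2):
    """Step 1: cosine similarity computed on the items rated by both users."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in common)
    norm1 = sqrt(sum(r1[i] ** 2 for i in common))
    norm2 = sqrt(sum(r2[i] ** 2 for i in common))
    return dot / (norm1 * norm2)


def predict(user, item):
    """Step 2: similarity-weighted average of the ratings other users gave to `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


print(predict("u1", "d"))   # between u3's rating (2) and u2's rating (4), closer to u2
```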
  • 25. Matrix factorization
      [Figure: users × items rating matrix ≈ product (X) of a users latent-factor matrix and an items latent-factor matrix]
        • Choose a number of latent variables to decompose the data
        • Predict a new rating using the product of latent vectors
        • Use gradient descent techniques (e.g. SGD)
        • Add some regularization
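A minimal sketch of regularized matrix factorization trained with SGD, in plain Python. The ratings, the number of latent factors, the learning rate and the regularization strength are arbitrary assumptions; a production system would use an optimized library rather than this loop.

```python
import random

# Hypothetical observed ratings: (user index, item index, rating)
observed = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
            (2, 1, 2), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
n_users, n_items, n_factors = 4, 3, 2

rng = random.Random(0)
P = [[rng.uniform(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]  # user factors
Q = [[rng.uniform(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]  # item factors


def predict(u, i):
    """Predicted rating: dot product of the user and item latent vectors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse():
    """Root mean squared error on the observed ratings."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in observed) / len(observed)) ** 0.5


def sgd_epoch(lr=0.02, reg=0.02):
    """One pass over the observed ratings with L2-regularized updates."""
    for u, i, r in observed:
        err = r - predict(u, i)
        for k in range(n_factors):
            pu, qi = P[u][k], Q[i][k]
            P[u][k] += lr * (err * qi - reg * pu)
            Q[i][k] += lr * (err * pu - reg * qi)


rmse_before = rmse()
for _ in range(500):
    sgd_epoch()
rmse_after = rmse()
print(rmse_before, rmse_after)
print(predict(0, 2))   # predicted rating for a (user, item) pair that was never observed
```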
  • 26. Matrix factorization: MovieLens example
      Read files:
          import csv

          # movies.csv columns: movieId, title, genres
          movies_fname = '/path/ml-latest/movies.csv'
          with open(movies_fname, newline='') as fobj:
              movies = dict((r[0], r[1]) for r in csv.reader(fobj))

          # ratings.csv columns: userId, movieId, rating, timestamp
          ratings_fname = '/path/ml-latest/ratings.csv'
          with open(ratings_fname, newline='') as fobj:
              header = next(fobj)  # skip the header line
              ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]
      Build sparse matrix:
          import scipy.sparse as sp

          user_idx, item_idx = {}, {}
          data, rows, cols = [], [], []
          for u, i, s in ratings:
              # setdefault assigns the next free index to unseen users/items
              rows.append(user_idx.setdefault(u, len(user_idx)))
              cols.append(item_idx.setdefault(i, len(item_idx)))
              data.append(s)
          ratings = sp.csr_matrix((data, (rows, cols)))
          reverse_item_idx = dict((v, k) for k, v in item_idx.items())
          reverse_user_idx = dict((v, k) for k, v in user_idx.items())
  • 27. Matrix factorization: MovieLens example
      Fit Non-negative Matrix Factorization:
          from sklearn.decomposition import NMF

          nmf = NMF(n_components=50)
          user_mat = nmf.fit_transform(ratings)   # shape: (n_users, 50)
          item_mat = nmf.components_              # shape: (50, n_items)
      Print the top movies of one component:
          component_ind = 3
          component = [(reverse_item_idx[i], s)
                       for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
          for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
              print(movie, round(float(score), 0))
      Example output (top movies of three different components):
          Terminator 2: Judgment Day (1991) 24.0
          Terminator, The (1984) 23.0
          Die Hard (1988) 19.0
          Aliens (1986) 17.0
          Alien (1979) 16.0

          Exorcist, The (1973) 8.0
          Halloween (1978) 7.0
          Nightmare on Elm Street, A (1984) 7.0
          Shining, The (1980) 7.0
          Carrie (1976) 7.0

          Star Trek II: The Wrath of Khan (1982) 10.0
          Star Trek: First Contact (1996) 10.0
          Star Trek IV: The Voyage Home (1986) 9.0
          Contact (1997) 8.0
          Star Trek VI: The Undiscovered Country (1991) 8.0
          Blade Runner (1982) 8.0
  • 28. Binary / Unitary data
      Only occurrences of item views/purchases/…
        • Jaccard distance
        • Cosine similarity
        • Conditional probability
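For unitary data, each item can be represented by the set of users who viewed or purchased it, and the three measures reduce to set arithmetic. The sketch below uses the standard textbook definitions on made-up data; the slide's own formulas did not survive extraction, so the exact variants used in the talk are not known.

```python
from math import sqrt

# Hypothetical data: item -> set of users who viewed/purchased it
users_by_item = {
    "phone": {"u1", "u2", "u3", "u4"},
    "case": {"u1", "u2"},
}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|"""
    return 1 - len(a & b) / len(a | b)


def cosine_similarity(a, b):
    """|A ∩ B| / sqrt(|A| * |B|) for binary vectors"""
    return len(a & b) / sqrt(len(a) * len(b))


def conditional_probability(a, b):
    """P(b | a): share of the users of item a who also interacted with item b."""
    return len(a & b) / len(a)


phone, case = users_by_item["phone"], users_by_item["case"]
print(jaccard_distance(phone, case))
print(cosine_similarity(phone, case))
print(conditional_probability(phone, case), conditional_probability(case, phone))
```

Jaccard and cosine are symmetric, whereas the conditional probability is not: here every buyer of the case also bought the phone, but only half of the phone buyers bought the case.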
  • 29. Co-occurrences and Similarities Computation
      Only access to unitary data (purchase / browsing)
      Use co-occurrences for computing item similarity
      Multiple possible parameters:
        • Size of the time window to be considered: do browsing and purchase data reflect similar behavior?
        • Threshold on co-occurrences: is one co-occurrence significant enough to be used? Two? Three?
        • Symmetric or asymmetric: is the order important in the co-occurrence? A then B == B then A?
        • Similarity metrics: which similarity metric should be used on the co-occurrences?
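The first three parameters can be made concrete with a small counting routine. The event log, the function name and the default values are assumptions for this sketch, not the production implementation.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event log: (user, item, timestamp)
events = [
    ("u1", "A", datetime(2015, 11, 7)), ("u1", "B", datetime(2015, 11, 8)),
    ("u1", "C", datetime(2015, 11, 24)),
    ("u2", "A", datetime(2015, 9, 8)), ("u2", "B", datetime(2015, 9, 10)),
    ("u3", "B", datetime(2015, 11, 8)), ("u3", "A", datetime(2015, 11, 8, 12)),
]


def co_occurrences(events, window=timedelta(days=2), symmetric=True, threshold=1):
    """Count pairs of distinct items seen by the same user within `window`."""
    by_user = {}
    for user, item, ts in sorted(events, key=lambda e: e[2]):
        by_user.setdefault(user, []).append((item, ts))
    counts = Counter()
    for history in by_user.values():
        for pos, (first, t1) in enumerate(history):
            for second, t2 in history[pos + 1:]:
                if t2 - t1 > window:
                    break                      # history is time-ordered
                if first == second:
                    continue
                # Asymmetric: keep "first then second"; symmetric: ignore the order.
                pair = tuple(sorted((first, second))) if symmetric else (first, second)
                counts[pair] += 1
    # Keep only the pairs observed at least `threshold` times.
    return {pair: n for pair, n in counts.items() if n >= threshold}


print(co_occurrences(events))
print(co_occurrences(events, symmetric=False))
print(co_occurrences(events, symmetric=False, threshold=2))
print(co_occurrences(events, window=timedelta(days=30)))
```

Changing a single parameter changes which pairs survive, which is why each use case needs its own tuning.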
  • 30. Co-occurrences Example
      [Figure: timeline of browsing and purchase events dated between 08/09/2015 and 24/11/2015, with candidate groupings labeled "Session ?", "Time window 1" and "Time window 2"]
  • 31. Co-occurrences Computation
      Classical co-occurrences:
        • Co-purchases: complementary items
        • Co-browsing: substitute items
      Other possible co-occurrences:
        • Items browsed and bought together: “You may also want…”
        • Items browsed and not bought together: “Similar items…”
      [Figure: example browsing and purchase events with their dates]
  • 33. Recommendations Algebra
      Algebra for defining and combining recommendations engines
      Key ideas:
        • Reuse existing logic and combine it easily
        • Write business logic, not code!
        • Handle multiple input/output formats
      Available logics:
        • Content-based
        • Collaborative filtering
        • Item-item
        • User-item (personalization)
      Available backends:
        • In-memory
        • HDF5 files
        • Cassandra
        • Couchbase
      Available hybridization:
        • Linear algebra / weighting
        • Mixed
        • Cascade engines
        • Meta-level
  • 34. Python Algebra Example
      [Diagram: composite engine = purchase-based engine (top-20, asymmetric, conditional probability) + 0.2 × browsing-based engine (similarity > 0.01, symmetric, cosine similarity)]
      Composite engine:
          >>> engine1 = RecommendationsEngine(nb_recos=20, datatype='purchase',
          ...                                 asymmetric=True, distance='conditional_probability')
          >>> engine2 = RecommendationsEngine(similarity_th=0.01, datatype='browsing',
          ...                                 asymmetric=False, distance='cosine_similarity')
          >>> composite_engine = engine1 + 0.2 * engine2
      Get recommendations from items (item-to-item):
          >>> recos = composite_engine.recommendations_by_items([123, 456, 789, …])
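How an expression such as `engine1 + 0.2 * engine2` can work in Python is sketched below with operator overloading. `ToyEngine` and `_Combined` are hypothetical classes written for this illustration; they are not the `RecommendationsEngine` API shown on the slide, whose internals are not described in the talk.

```python
from collections import defaultdict


class ToyEngine:
    """Hypothetical stand-in for an engine: maps an item to {recommended item: score}."""

    def __init__(self, scores):
        self.scores = scores

    def recommend(self, item):
        return dict(self.scores.get(item, {}))

    def __add__(self, other):
        return _Combined([(1.0, self), (1.0, other)])

    def __rmul__(self, weight):          # enables `0.2 * engine`
        return _Combined([(weight, self)])


class _Combined(ToyEngine):
    """Weighted sum of engines, built by the `+` and `*` operators."""

    def __init__(self, parts):
        self.parts = parts

    def recommend(self, item):
        total = defaultdict(float)
        for weight, engine in self.parts:
            for reco, score in engine.recommend(item).items():
                total[reco] += weight * score
        return dict(total)


# Made-up item-to-item scores for two data sources.
purchase = ToyEngine({"phone": {"case": 0.5, "charger": 0.25}})
browsing = ToyEngine({"phone": {"case": 0.5, "tablet": 1.0}})

composite = purchase + 0.2 * browsing
print(composite.recommend("phone"))
```

The design choice is that a combined engine exposes the same `recommend` method as a basic one, so composites can themselves be combined further.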
  • 35. Python Algebra with Personalization
      [Diagram: purchase-history engine (time window 180 days, time decay 0.01) feeding the composite engine: purchase-based (top-20, asymmetric, conditional probability) + 0.2 × browsing-based (similarity > 0.01, symmetric, cosine similarity)]
          >>> history = HistoryEngine(datatype='purchase', time_window=180, time_decay=0.01)
          >>> engine1.register_history_engine(history)
          …same code as previously
      Get recommendations for a user (user-to-item):
          >>> recos = composite_engine.recommendations_by_user('userid')
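The slide does not say how `time_window` and `time_decay` are applied. One plausible reading, sketched below, is that purchases older than the window are ignored and the remaining ones are weighted by an exponential decay of their age. The exponential form, the function names and the data are all assumptions made for this illustration.

```python
from math import exp


def history_weights(purchases, time_window=180, time_decay=0.01):
    """Weight each purchased item by exp(-time_decay * age), dropping old purchases.

    `purchases` is a list of (item, age_in_days). The exponential form is an
    assumption made for this sketch.
    """
    weights = {}
    for item, age in purchases:
        if age <= time_window:
            weights[item] = max(weights.get(item, 0.0), exp(-time_decay * age))
    return weights


def recommend_for_user(purchases, item_to_item):
    """Sum item-to-item scores, weighted by the history weights, and rank the result."""
    scores = {}
    for item, weight in history_weights(purchases).items():
        for reco, score in item_to_item.get(item, {}).items():
            scores[reco] = scores.get(reco, 0.0) + weight * score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up item-to-item scores and purchase history (item, age in days).
item_to_item = {"phone": {"case": 0.5}, "tent": {"sleeping_bag": 0.5}, "vhs": {"vcr": 0.9}}
purchases = [("phone", 0), ("tent", 100), ("vhs", 400)]

print(history_weights(purchases))
print(recommend_for_user(purchases, item_to_item))
```

With these settings the 400-day-old purchase is ignored, and the recent purchase outweighs the 100-day-old one even though both have the same item-to-item score.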
  • 36. Python Algebra: Complete Example
      [Diagram: cascade (X) of two composite engines, with a purchase-history engine (time window 180 days, time decay 0.01)]
        • Composite engine 1: purchase-based (top-20, asymmetric, conditional probability) + 0.2 × browsing-based (similarity > 0.01, symmetric, cosine similarity)
        • Composite engine 2: purchase-based (category level, similarity > 0.01, asymmetric, conditional probability) + 0.1 × browsing-based (category level, similarity > 0.1, symmetric, cosine similarity)
  • 37. Recommendation Quality Challenges
      Recommendations categories:
        • Cold start issue: external data? Cross-services?
        • Hot products (A): top-N items?
        • Short tail (B)
        • Long tail (C + D)
      [Figure: quadrant chart with axes minor product / major (popular) product and new product / old product, showing regions (A), (B), (C), (D)]
  • 38. Long Tail is Fat
      Long tail numbers:
        • Most of the items are long tail
        • They still represent a large portion of the traffic
      Long tail approaches:
        • Content-based
        • Aggregation / clustering
        • Personalization
      [Figures: browsing share and number of items for popular, short-tail and long-tail items]
  • 39. Recommendations Offline Evaluation
      Pros/Cons:
        • Convenient way to try new ideas
        • Fast and cheap
        • But hard to align with online KPIs
      Approaches:
        • Rescoring
        • Prediction game
        • Business simulator
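The talk does not define the "prediction game". A common offline protocol that fits the name is to hide each user's last purchase and check whether the engine ranks it in its top-k recommendations (hit rate). The sketch below follows that interpretation, which is an assumption, on made-up data with a hypothetical `recommend` function.

```python
def hit_rate(histories, recommend, k=2):
    """Share of users whose hidden last item appears in the top-k recommendations."""
    hits = 0
    for history in histories:
        known, hidden = history[:-1], history[-1]
        if hidden in recommend(known)[:k]:
            hits += 1
    return hits / len(histories)


# Hypothetical item-to-item scores and time-ordered purchase histories.
item_to_item = {
    "phone": {"case": 0.9, "charger": 0.6},
    "tent": {"sleeping_bag": 0.8, "stove": 0.4},
}
histories = [
    ["phone", "case"],
    ["tent", "stove"],
    ["phone", "tent", "lamp"],
    ["tent", "sleeping_bag"],
]


def recommend(known_items):
    """Rank candidates by the sum of item-to-item scores over the known items."""
    scores = {}
    for item in known_items:
        for reco, score in item_to_item.get(item, {}).items():
            if reco not in known_items:
                scores[reco] = scores.get(reco, 0.0) + score
    return sorted(scores, key=scores.get, reverse=True)


print(hit_rate(histories, recommend, k=1))
print(hit_rate(histories, recommend, k=2))
```

Such a score is cheap to compute for many engine variants, but it only measures agreement with past behavior, which is one reason it can be hard to align with online KPIs.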
  • 40. Public Initiative: Viki Recommendation Challenge
      567 submissions from 132 participants
      http://www.dextra.sg/challenges/rakuten-viki-video-challenge
  • 41. Data science everywhere!
      Rakuten provides marketplaces worldwide
      Specific challenges for recommendations:
        • Items catalogue: reinforce the statistical power of co-occurrences across shops and services
        • Items similarities: find the right parameters for the different use cases
        • Recommendations models: what are the best models for in-shop, all-shop, and personalized recommendations?
        • Evaluation: handling the long tail? Comparing different models?
  • 42. THANKS! Questions?
      More on Rakuten tech initiatives:
        • http://www.slideshare.net/rakutentech
        • http://rit.rakuten.co.jp/oss.html
        • http://rit.rakuten.co.jp/opendata.html
      Positions:
        • http://global.rakuten.com/corp/careers/bigdata/
        • http://www.priceminister.com/recrutement/?p=197
  • 43. We are Hiring!
      Big Data Department: team in Paris
        • http://global.rakuten.com/corp/careers/bigdata/
        • http://www.priceminister.com/recrutement/?p=197
      Data Scientist / Software Developer:
        • Build algorithms for recommendations, search, targeting
        • Predictive modeling, machine learning, natural language processing
        • Working closely with the business
        • Python, Java, Hadoop, Couchbase, Cassandra, …
        • Also hiring: search engine developers, big data system administrators, etc.