1. Data Science in the E-commerce Industry
DSSP 2016/05/20
Vincent Michel
Big Data Europe, BDD, Rakuten Inc. / PriceMinister
vincent.michel@rakuten.com
@HowIMetYourData
2. Short Bio
Education
ESPCI: engineer in Physics / Biology
ENS Cachan: MVA Master (Mathematics, Vision and Learning)
INRIA Parietal team: PhD in Computer Science
Understanding the visual cortex by using classification techniques
Experience
Logilab: development and data science consulting
Data.bnf.fr (French National Library open-data platform)
Brainomics (platform for heterogeneous medical data)
Rakuten PriceMinister: senior developer and data scientist
Data engineering and data science consulting
4. Do not redo it yourself!
Lots of really interesting open-source libraries for all your needs:
Test first on a small POC, then contribute/develop
scikit-learn, pandas, Caffe, scikit-image, OpenCV, …
Be careful: it is really easy to do something wrong!
Open-data:
More and more open-data for catalogs, …
E.g. data.bnf.fr
~ 2,000,000 authors
~ 200,000 works
~ 200,000 topics
Contribute to open-source:
Is there a need / a pool of potential developers?
Do it well (documentation / tests)
Unless you are doing some kind of super magical algorithm
May bring you help, bug fixes, and engineers! But it takes time and energy
5. Quality in data science software engineering
Never underestimate the integration cost
It is really easy to write 20 lines of Python code doing some fancy Random Forests…
…that could be really hard to deploy (data pipeline, packaging, monitoring)
Developer != DevOps != Sys admin
Make it clean from the start (> 2 days of dev or > 100 lines of code):
Tests, tests, tests, tests, tests, tests, tests, …
Documentation
Packaging / supervision / monitoring
Release early, release often
Agile development, Pull request, code versioning
Choose the right tool:
Do you really need this super fancy NoSQL database to store your transactions?
6. Monitoring and metrics
Always monitor:
Your development: continuous integration (Jenkins)
Your service: Nagios / Shinken
Your business data (BI): Kibana
Your users: tracker
Your data science process: e.g. A/B tests
Evaluation:
Choose the right metric
Prediction accuracy / Precision-recall …
Always A/B test rather than relying on personal opinions
A good question leads to a good answer: define your problem
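To make the A/B testing advice concrete, here is a minimal sketch of how two variants could be compared on conversion rate with a two-proportion z-test. The function name `ab_test_z` and the visitor / conversion counts are invented for illustration and do not come from the talk; a production setup would also need to handle sample-size planning and repeated looks at the data.

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test on conversion counts.

    conv_x: number of conversions, n_x: number of visitors (hypothetical).
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative numbers only: 10% vs 15% conversion on 1000 visitors each.
z, p_value = ab_test_z(100, 1000, 150, 1000)
```

A small p-value suggests the difference between variants is unlikely to be noise; which metric is tested (clicks, purchases, revenue) remains a business decision.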
8. Finding your data scientist
Do not try to find a unicorn!
(and unicorns no longer exist…)
Define your needs
9. A few remarks on hiring – my personal opinion
Be careful of CVs with buzzwords!
E.g. “IT skills: SVM (linear, non-linear), Clustering (K-means, Hierarchical),
Random Forests, Regularization (L1, L2, Elastic net…) …”
It is like someone saying “IT skills: Python (for loop, if/else pattern, …)”
Often found in junior CVs (OK), but a huge warning sign in senior CVs
Hungry for data?
Loving data is the most important thing to check
Open data? Personal projects? Curious about data? (Hackathons?)
Multidisciplinary == knowing how to handle various datasets
Check for IT skills:
Should be able to install/develop new libraries/algorithms
A huge part of the job could be to format / cleanup the data
Experience VS education -> Autonomy
12. Rakuten Group in Numbers
Rakuten in Japan
> 12,000 employees
> 48 billion euros of GMS (gross merchandise sales)
> 100,000,000 users
> 250,000,000 items
> 40,000 merchants
Rakuten Group
Kobo: 18,000,000 users
Viki: 28,000,000 users
Viber: 345,000,000 users
13. Rakuten Ecosystem
Rakuten global ecosystem:
Member-based business model that connects Rakuten services
Rakuten ID common to various Rakuten services
Online shopping and services
Main business areas
E-commerce
Internet finance
Digital content
Recommendation challenges
Cross-services
Aggregated data
Complex user features
14. Rakuten’s e-commerce: B2B2C Business Model
Business to Business to Consumer:
Merchants located in different regions / online virtual shopping mall
Main profit sources
• Fixed fees from merchants
• Fees based on each transaction and other services
Recommendation challenges
Many shops
Item references
Global catalog
15. Big Data Department @ Rakuten
Big Data Department
150+ engineers – Japan / Europe / US
Missions
Development and operations of internal systems for:
Recommendations
Search
Targeting
User behavior tracking
Average traffic
> 100,000,000 events / day
> 40,000,000 item views / day
> 50,000,000 searches / day
> 750,000 purchases / day
Technology stack
Java / Python / Ruby
Solr / Lucene
Cassandra / Couchbase
Hadoop / Hive / Pig
Redis / Kafka
16. Recommendations on Rakuten Marketplaces
Non-personalized recommendations
All-shop recommendations:
Item to item
User to item
In-shop recommendations
Review-based recommendations
Personalized recommendations
Purchase history recommendations
Cart add recommendations
Order confirmation recommendations
System status and scale
In production in over 35 services of Rakuten Group worldwide
Several hundred servers running:
Hadoop
Cassandra
APIs
19. Item Catalogues
Use different levels of aggregation to improve recommendations
Category level (e.g. food, soda, clothes, …)
Product level (manufactured items)
Item-in-shop level (a specific product sold by a specific shop)
Increased statistical power in co-event computation
Easier business handling (picking the right item)
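A small sketch of why aggregation increases statistical power: the same baskets are counted once at item-in-shop level and once after mapping every item to its product. The mapping `ITEM_TO_PRODUCT`, the identifiers and the baskets are all hypothetical; the talk does not describe how the real catalogue mapping is stored.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping: item-in-shop id -> product id.
ITEM_TO_PRODUCT = {
    "shopA:cola-33cl": "cola-33cl",
    "shopB:cola-33cl": "cola-33cl",
    "shopA:chips": "chips",
    "shopC:chips": "chips",
}

def co_event_counts(baskets, level=None):
    """Count unordered pairs of ids seen together in a basket.

    If `level` is given, each id is first mapped through it
    (e.g. item-in-shop -> product) before pairs are counted.
    """
    counts = Counter()
    for basket in baskets:
        ids = {level.get(i, i) if level else i for i in basket}
        for pair in combinations(sorted(ids), 2):
            counts[pair] += 1
    return counts

baskets = [
    ["shopA:cola-33cl", "shopA:chips"],
    ["shopB:cola-33cl", "shopC:chips"],
    ["shopB:cola-33cl", "shopA:chips"],
]
item_level = co_event_counts(baskets)                      # three pairs, each seen once
product_level = co_event_counts(baskets, ITEM_TO_PRODUCT)  # one pair, seen three times
```

At item-in-shop level every pair is observed only once, while at product level the single (chips, cola) pair accumulates all three observations.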
20. Enriching Catalogues using Record Linkage
[Figure: Marketplace 1, Marketplace 2, Reference database]
Record linkage
Use external sources (e.g., Wikidata) to align the marketplaces' products
Fuzzy matching of 600K vs 350K items for the movie alignment use case
Blocking algorithm
Cross recommendation
Global catalog
Items aggregation
Helps with cold start issues
Improved navigation
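The blocking and fuzzy matching steps mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the blocking key (release year plus first letter of the title), the 0.85 threshold, the record fields and the sample movies are all hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever string similarity is used in practice.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(record):
    """Hypothetical blocking key: release year + first letter of the title."""
    return (record["year"], record["title"].strip().lower()[:1])

def link_records(left, right, threshold=0.85):
    """Return (left_id, right_id, score) for fuzzy title matches.

    Only records sharing a blocking key are compared, which avoids
    the full len(left) * len(right) comparison.
    """
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)

    links = []
    for rec in left:
        for cand in blocks.get(block_key(rec), []):
            score = SequenceMatcher(
                None, rec["title"].lower(), cand["title"].lower()
            ).ratio()
            if score >= threshold:
                links.append((rec["id"], cand["id"], round(score, 2)))
    return links

market_1 = [{"id": "m1-1", "title": "The Matrix", "year": 1999},
            {"id": "m1-2", "title": "Amelie", "year": 2001}]
market_2 = [{"id": "m2-7", "title": "The Matrix.", "year": 1999},
            {"id": "m2-8", "title": "The Mummy", "year": 1999},
            {"id": "m2-9", "title": "Alien", "year": 1979}]
links = link_records(market_1, market_2)
```

A stricter blocking key speeds things up but misses matches whose key differs (for example a wrong year), so the key is itself a parameter to tune.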
21. Co-occurrences and Similarity Computation
Only access to unitary data (purchase / browsing)
Use co-occurrences for computing item similarity
Multiple possible parameters:
Size of the time window to be considered:
Do browsing and purchase data reflect similar behavior?
Threshold on co-occurrences
Is one co-occurrence significant enough to be used? Two? Three?
Symmetric or asymmetric
Is the order important in the co-occurrence? A then B == B then A?
Similarity metrics
Which similarity metric should be used on the co-occurrences?
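The parameters listed above can be made explicit in a small sketch. It assumes events arrive as `(user, item, timestamp)` tuples, treats co-occurrences as symmetric, counts each pair at most once per user, and uses a cosine-style normalisation; the function name, the default window of one hour and the threshold of two are illustrative choices, not the settings used at Rakuten.

```python
import math
from collections import Counter, defaultdict

def item_similarities(events, window=3600, min_cooc=2):
    """Cosine-style item similarity from per-user event streams.

    events: iterable of (user, item, timestamp_in_seconds).
    window: two events co-occur if the same user produced them
            within `window` seconds.
    min_cooc: pairs seen for fewer users than this are dropped.
    """
    by_user = defaultdict(list)
    for user, item, ts in events:
        by_user[user].append((ts, item))

    item_users = defaultdict(set)  # item -> users who touched it
    cooc = Counter()               # unordered item pair -> number of users
    for user, evts in by_user.items():
        evts.sort()
        pairs = set()
        for i, (ts_a, a) in enumerate(evts):
            item_users[a].add(user)
            for ts_b, b in evts[i + 1:]:
                if ts_b - ts_a > window:
                    break  # events are sorted, later ones are even further away
                if a != b:
                    pairs.add(tuple(sorted((a, b))))  # symmetric: A,B == B,A
        cooc.update(pairs)

    return {
        pair: n / math.sqrt(len(item_users[pair[0]]) * len(item_users[pair[1]]))
        for pair, n in cooc.items()
        if n >= min_cooc
    }

events = [
    ("u1", "A", 0), ("u1", "B", 100),
    ("u2", "A", 0), ("u2", "B", 50),
    ("u3", "A", 0), ("u3", "C", 10),
    ("u4", "A", 0), ("u4", "B", 10000),  # outside the one-hour window
]
sims = item_similarities(events)
```

Making the pairs ordered (dropping the `sorted`) would give the asymmetric variant, and swapping the normalisation changes the similarity metric; each choice can then be evaluated per use case.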
24. Recommendation Quality Challenges
Recommendation categories
Cold start issue
• External data?
• Cross-services?
Hot products (A)
• Top-N items?
Short tail (B)
Long tail (C + D)
[Figure: products split into four quadrants (A), (B), (C), (D) along two axes: minor vs. major (popular) product, and new vs. old product]
25. Long Tail is Fat
Long tail numbers
• Most of the items are long tail
• They still represent a large portion of the traffic
Long tail approaches
• Content-based
• Aggregation / clustering
• Personalization
[Figure: browsing share and number of items for long-tail, short-tail and popular items]
26. Recommendations Offline Evaluation
Pros/Cons
• Convenient way to try new ideas
• Fast and cheap
• But hard to align with online KPIs
Approaches
• Rescoring
• Prediction game
• Business simulator
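One possible reading of the "prediction game" approach is a leave-last-out evaluation: hide each user's last item and check whether the model recommends it from the earlier ones. The sketch below follows that reading; the `recommend(past_items, k)` interface, the popularity baseline and the toy histories are assumptions made for illustration and may differ from the evaluation actually used.

```python
from collections import Counter

def hit_rate_at_k(histories, recommend, k=5):
    """Leave-last-out evaluation.

    histories: dict user -> chronologically ordered list of items.
    recommend: callable(past_items, k) -> list of recommended items
               (hypothetical interface for the model under test).
    Returns the fraction of users whose held-out last item appears
    in the top-k recommendations built from their earlier items.
    """
    hits = 0
    played = 0
    for items in histories.values():
        if len(items) < 2:
            continue  # nothing to hold out
        past, target = items[:-1], items[-1]
        played += 1
        if target in recommend(past, k)[:k]:
            hits += 1
    return hits / played if played else 0.0

def popularity_recommender(histories):
    """Toy baseline: most popular items, ignoring held-out last items."""
    counts = Counter(i for items in histories.values() for i in items[:-1])
    ranked = [item for item, _ in counts.most_common()]

    def recommend(past, k):
        return [i for i in ranked if i not in past][:k]

    return recommend

histories = {
    "u1": ["A", "B", "C"],
    "u2": ["A", "C", "B"],
    "u3": ["B", "D"],
    "u4": ["E"],
}
score = hit_rate_at_k(histories, popularity_recommender(histories), k=1)
```

Such a score is fast and cheap to compute, but a popularity baseline tends to do well on it while ignoring the long tail, which is one reason offline numbers are hard to align with online KPIs.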
27. Public Initiative – Viki Recommendation Challenge
567 submissions from 132 participants
http://www.dextra.sg/challenges/rakuten-viki-video-challenge
28. Data science everywhere!
Rakuten provides marketplaces worldwide
Specific challenges for recommendations
Item catalogue: reinforce the statistical power of co-occurrences across shops and services;
Item similarities: find the right parameters for the different use cases;
Recommendation models: what are the best models for in-shop, all-shops, personalization?
Evaluation: how to handle the long tail? How to compare different models?
29. Thanks!
Questions?
More on Rakuten tech initiatives
http://www.slideshare.net/rakutentech
http://rit.rakuten.co.jp/oss.html
http://rit.rakuten.co.jp/opendata.html
Positions
• http://global.rakuten.com/corp/careers/bigdata/
• http://www.priceminister.com/recrutement/?p=197
30. We are Hiring!
Big Data Department – team in Paris
http://global.rakuten.com/corp/careers/bigdata/
http://www.priceminister.com/recrutement/?p=197
Data Scientist / Software Developer
Build algorithms for recommendations, search, targeting
Predictive modeling, machine learning, natural language processing
Working close to business
Python, Java, Hadoop, Couchbase, Cassandra…
Also hiring: search engine developers, big data system administrators, etc.