3. A bit about myself...
● PhD on Audio and Music Signal Processing and Modeling
● Researcher in Recommender Systems for several years
● Led ML Research/Engineering at Netflix
● VP of Engineering at Quora
● Currently co-founder/CTO at Curai (providing the world's best healthcare to everyone)
5. What are we doing?
● Mission: Provide the world's best healthcare for everyone
● Product: User-facing mobile primary care app
● Team: Building an awesome and diverse team
● Approach: State-of-the-art AI/ML + product/UX/clinical
  ○ AI-based interaction
  ○ AI + health coaches
  ○ AI + doctors
10. More data or better models?
Sometimes, it's not about more data
11. More data or better models?
Norvig: "Google does not have better algorithms, only more data"
[Chart: with many features / low-bias models, accuracy keeps improving as data grows]
12. More data or better models?
Sometimes you might not need all your "Big Data"
[Figure: testing accuracy vs. number of training examples (0-20 millions)]
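A quick way to check whether more data would actually help is to plot a learning curve. A minimal scikit-learn sketch, with synthetic data standing in for a real dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your real dataset
X, y = make_classification(n_samples=20000, n_features=40, random_state=0)

# Evaluate test accuracy at increasing training-set sizes
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

# If the test curve has already flattened, more data is unlikely to help
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:>6} examples -> test accuracy {score:.3f}")
```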
13. What about Deep Learning?

Year | Breakthrough in AI | Datasets (first available) | Algorithms (first proposal)
-----|--------------------|----------------------------|----------------------------
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka "The Extended Book" (1991) | Negascout planning algorithm (1983)
2005 | Google's Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2005) | Mixture-of-Experts algorithm (1991)
2014 | Google's GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional neural network algorithm (1989)
2015 | Google's DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning algorithm (1992)

Average no. of years to breakthrough: ~3 years (datasets) vs. ~18 years (algorithms)

The average elapsed time between key algorithm proposals and corresponding advances was about 18 years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than 3 years, about 6 times faster.
14. What about Deep Learning?
Pretrained models and recipes: models trained using OpenNMT are available for
→ English → German
→ German → English
→ English summarization
→ Multi-way: FR, ES, PT, IT, RO <-> FR, ES, PT, IT, RO
More models coming soon:
→ Ubuntu Dialog Dataset
→ Syntactic parsing
→ Image-to-text
18. Occam's razor
Given two models that perform more or less equally, you should always prefer the less complex one.
Deep Learning might not be preferred, even if it squeezes out a +1% in accuracy.
20. Reasons to prefer a simpler model
→ System complexity
→ Maintenance
→ Explainability
→ ... and many others
[Figure: "GoogLeNet network with all the bells and whistles" (Figure 3 from the GoogLeNet paper)]
21. A real-life example
Goal: supervised classification
→ 40 features
→ 10k examples
What did the ML engineer choose?
→ A multi-layer ANN trained with TensorFlow
What was his proposed next step?
→ Try ConvNets
Where is the problem?
→ Hours to train, already looking into distributing
→ There are much simpler approaches (see the sketch below)
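At this scale, a standard tabular baseline trains in seconds on one machine. A minimal sketch, with synthetic data standing in for the 10k x 40 dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the 10k examples x 40 features from the example
X, y = make_classification(n_samples=10000, n_features=40, random_state=0)

# Trains in seconds on a laptop; no GPUs or distribution needed
model = GradientBoostingClassifier()
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```

If this baseline matches the ANN, Occam's razor says ship the baseline.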
23. Better models and features that "don't work"
E.g. you have a linear model and have been selecting and optimizing features for that model:
→ A more complex model with the same features -> improvement not likely
→ More expressive features -> improvement not likely
More complex features may require a more complex model.
A more complex model may not show improvements with a feature set that is too simple.
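A toy illustration of this interplay, on synthetic XOR-like data (for illustration only): a linear model on raw features fails, but either an engineered interaction feature or a more expressive model recovers the signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like target
X_inter = np.c_[X, X[:, 0] * X[:, 1]]     # add an interaction feature

for name, model, data in [
    ("linear, raw features", LogisticRegression(), X),
    ("linear, + interaction feature", LogisticRegression(), X_inter),
    ("random forest, raw features", RandomForestClassifier(), X),
]:
    print(name, cross_val_score(model, data, y, cv=5).mean().round(3))
```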
25. Feature Engineering Example - Answer Ranking
What is a good Quora answer?
→ Truthful, reusable, provides explanation, well formatted, ...
How are those dimensions translated into features?
→ Features that relate to the answer quality itself
→ Interaction features (upvotes/downvotes, clicks, comments...)
→ User features (e.g. expertise in topic)
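As a hypothetical sketch (the feature names and data layout below are invented for illustration, not Quora's actual features), those dimensions might be encoded per answer like this:

```python
def answer_features(answer, author):
    """Hypothetical feature extraction for answer ranking."""
    return {
        # Quality features of the answer itself
        "length": len(answer["text"]),
        "has_formatting": int("\n" in answer["text"]),
        # Interaction features
        "upvotes": answer["upvotes"],
        "downvotes": answer["downvotes"],
        "clicks": answer["clicks"],
        "comments": answer["comments"],
        # User features
        "author_topic_expertise": author["expertise"].get(answer["topic"], 0.0),
    }
```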
26. Feature Engineering
Properties of a well-behaved ML feature:
→ Interpretable
→ Reliable
→ Reusable
→ Transformable
Deep learning: automating feature discovery
[Figure (I. Goodfellow): from rule-based systems (hand-designed program), to classic machine learning (hand-designed features), to representation learning (learned features), to deep learning (simple learned features composed into increasingly complex ones), each mapping input through features to output]
34. Ensembles
The Netflix Prize was won by an ensemble
→ Initially BellKor was using GBDTs
→ BigChaos introduced an ANN-based ensemble
Most practical applications of ML run an ensemble
→ Why wouldn't you?
→ At least as good as the best of your methods
→ Can add completely different approaches (e.g. CF and content-based)
→ You can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs...
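A minimal stacking sketch with scikit-learn, combining heterogeneous base models under a logistic-regression ensemble layer (the base models and synthetic data are chosen for illustration); this is also the "models as features" idea from the next slide:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Base models can be completely different approaches; the ensemble
# layer is a simple LR trained on their out-of-fold predictions.
ensemble = StackingClassifier(
    estimators=[("gbdt", GradientBoostingClassifier()),
                ("rf", RandomForestClassifier())],
    final_estimator=LogisticRegression())

print("Stacked CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())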
35. Ensembles & Feature Engineering
Ensembles are the way to turn any model into a feature!
E.g. don't know if the way to go is Factorization Machines, Tensor Factorization, or RNNs?
→ Treat each model as a "feature"
→ Feed them into an ensemble
[Figure: Wide, Deep, and Wide & Deep models: sparse features, dense embeddings, hidden layers (rectified linear units), and output units (sigmoid)]
37. Defining training/testing data
Training a simple binary classifier for good/bad answers:
→ Defining positive and negative labels is a non-trivial task
→ Is this a positive or a negative?
   → a funny but uninformative answer with many upvotes
   → a short uninformative answer by a well-known expert in the field
   → a very long informative answer that nobody reads/upvotes
   → an informative answer with grammar/spelling mistakes
   → ...
38. The curse of presentation bias
Users can only click on what you decide to show
→ But, what you decide to show is the result of what your model predicted is good
Simply treating things you show as negatives is not likely to work
Better options (see the weighting sketch below):
→ Correcting for the probability a user will click on a position -> attention models
→ Explore/exploit approaches such as MAB
[Figure: items shown higher on the page are more likely to be seen than those shown lower]
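One common correction is inverse propensity weighting: up-weight clicks that happened at positions users rarely examine instead of treating everything unclicked as a negative. A minimal sketch, assuming per-position examination propensities have been estimated elsewhere (the numbers here are made up):

```python
import numpy as np

# Hypothetical examination probability per display position
# (in practice estimated from randomization or a click model)
propensity = np.array([0.9, 0.6, 0.4, 0.2])

# Logged impressions: (position shown, clicked?)
logs = [(0, 1), (1, 0), (2, 1), (3, 0)]

# Clicks are weighted by 1/propensity; unclicked impressions keep
# weight 1 rather than being treated as strong negatives.
sample_weight = np.array(
    [1.0 / propensity[pos] if clicked else 1.0 for pos, clicked in logs])

# These weights can then be passed to most learners, e.g.
# model.fit(X, y, sample_weight=sample_weight)
```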
41. AI in the wild: Desired properties
● Easily extensible
  ○ Incrementally/iteratively learn from "human-in-the-loop" or from additional data
● Knows what it does not know
  ○ Models uncertainty in prediction
  ○ Enables fall-back to manual
42. Assisted diagnosis in the wild
1. Extensibility
   a. Diagnosis as a ML task
      i. Expert systems as a prior
   b. Modeling less prevalent diseases
      i. Low-shot learning
2. Knowing what you don't know
   a. Measures of uncertainty in prediction
   b. Allows fall-back to "physician-in-the-loop"
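A minimal sketch of the "knows what it does not know" fall-back, using predicted class probabilities as a (crude) confidence proxy; the threshold and routing below are assumptions for illustration, not the actual Curai system:

```python
import numpy as np

def predict_or_escalate(model, x, threshold=0.9):
    """Return the model's prediction, or defer to a human expert
    when the model is not confident enough (hypothetical policy)."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= threshold:
        return int(np.argmax(proba))
    return "escalate_to_physician"   # fall back to human-in-the-loop
```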
43. Data and Models are great. You know what's even better? The right evaluation approach!
Lesson 9
44. Offline/Online testing process
Offline experimentation: initial hypothesis → choose model → train model → test offline → hypothesis validated? If NO, try a different model or reformulate the hypothesis; if YES, move to online experimentation.
Online experimentation: design A/B test → choose control → deploy prototype → observe behavior → analyze results → significant improvements? If YES, deploy the feature; if NO, reformulate the hypothesis.
45. Executing A/B tests
Measure differences in metrics across statistically identical populations that each experience a different algorithm.
Overall Evaluation Criteria (OEC) = e.g. member retention at Netflix
→ Use long-term metrics whenever possible
→ Short-term metrics can be informative and allow faster decisions
   ⁻ But they are not always aligned with the OEC
Decisions on the product are always data-driven
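A minimal sketch of analyzing a binary metric from an A/B test with a two-proportion z-test (the counts below are invented for illustration):

```python
from statsmodels.stats.proportion import proportions_ztest

# Retained users / total users in control vs. treatment (made-up numbers)
successes = [4120, 4300]
totals = [10000, 10000]

stat, p_value = proportions_ztest(successes, totals)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# Only ship if the improvement on the OEC is statistically significant
```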
46. Offline testing
Measure model performance using (IR) metrics
Offline performance is an indication used to decide on follow-up A/B tests
A critical (and mostly unsolved) issue is how offline metrics correlate with A/B test results.
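For ranking problems a typical offline (IR) metric is NDCG; a minimal scikit-learn sketch, with made-up relevance labels and model scores:

```python
from sklearn.metrics import ndcg_score

# True relevance of 5 candidate answers, and the model's ranking scores
y_true = [[3, 2, 0, 1, 0]]
y_score = [[0.9, 0.7, 0.5, 0.4, 0.1]]

print("NDCG@5:", ndcg_score(y_true, y_score, k=5))
```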
48. ML vs Software
Can you treat your ML infrastructure as you would your software one?
→ Yes and no
You should apply software engineering best practices (e.g. encapsulation, abstraction, cohesion, low coupling...)
However, design patterns for machine learning software are not well known/documented
51. Machine Learning Infrastructure
→ Whenever you develop any ML infrastructure, you need to target two different modes:
Mode 1: ML experimentation
− Flexibility
− Ease of use
− Reusability
Mode 2: ML production
− All of the above + performance & scalability
→ Ideally you want the two modes to be as similar as possible
→ How do you combine them?
52. Machine Learning Infrastructure
Option 1
→ Favor experimentation and only invest in productionizing once something shows results
→ E.g. have ML researchers use R, and then ask engineers to implement things in production when they work
Option 2
→ Favor production and have "researchers" struggle to figure out how to run experiments
→ E.g. implement highly optimized C++ code and have ML researchers experiment only through data available in logs/DB
54. Machine Learning Infrastructure
Good intermediate options:
→ Have ML "researchers" experiment in Jupyter Notebooks using Python tools (scikit-learn, PyTorch, TF...). Use the same tools in production whenever possible, and implement optimized versions only when needed.
→ Implement abstraction layers on top of optimized implementations so they can be accessed from regular/friendly experimentation tools (see the sketch below)
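A minimal sketch of such an abstraction layer: one fit/predict interface backed by either a notebook-friendly implementation or an optimized production one. The `fast_ranker` backend below is hypothetical, standing in for an optimized C++ binding:

```python
from sklearn.linear_model import LogisticRegression

class Ranker:
    """Common fit/predict interface for experimentation and production."""

    def __init__(self, backend="sklearn"):
        if backend == "sklearn":        # easy to use in notebooks
            self._impl = LogisticRegression(max_iter=1000)
        elif backend == "optimized":    # hypothetical optimized module
            import fast_ranker          # assumption: C++ binding, not a real package
            self._impl = fast_ranker.Model()
        else:
            raise ValueError(f"unknown backend: {backend}")

    def fit(self, X, y):
        self._impl.fit(X, y)
        return self

    def predict(self, X):
        return self._impl.predict(X)
```

Researchers and production code then share the same call sites; only the backend flag changes.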
60.
01. Choose the right metric
02. Be thoughtful about your data
03. Understand dependencies between data, models & systems
04. Optimize only what matters, beware of biases
05. Be thoughtful about your ML infrastructure/tools, and about organizing your teams