3. A bit about myself...
● PhD on Audio and Music Signal Processing and Modeling
● Researcher in Recommender Systems for several years
● Led ML Research/Engineering at Netflix
● VP of Engineering at Quora
● Currently co-founder/CTO at Curai (providing the world's best healthcare to everyone)
5. What are we doing?
● Mission: Provide the world's best healthcare for everyone
● Product: User-facing mobile primary care app
● Team: Building an awesome and diverse team
● Approach: State-of-the-art AI/ML + product/UX/clinical
  ○ AI-based interaction
  ○ AI + health coaches
  ○ AI + doctors
10. More data or better models?
Sometimes, it's not about more data
11. More data or better models?
Norvig: "Google does not have better algorithms, only more data"
[Chart: with many features / low-bias models, accuracy keeps improving as data grows]
12. More data or better models?
Sometimes you might not need all your "Big Data"
[Figure: testing accuracy vs. number of training examples (0-20 millions)]
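A quick way to check whether more data would actually help is to plot a learning curve. A minimal scikit-learn sketch, with synthetic data standing in for a real dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for your real dataset
X, y = make_classification(n_samples=20000, n_features=40, random_state=0)

# Evaluate test accuracy at increasing training-set sizes
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

# If the test curve has already flattened, more data is unlikely to help
for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:>6} examples -> test accuracy {score:.3f}")
```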
13. What about Deep Learning?

Year | Breakthrough in AI | Datasets (first available) | Algorithms (first proposal)
-----|--------------------|----------------------------|----------------------------
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka "The Extended Book" (1991) | Negascout planning algorithm (1983)
2005 | Google's Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, Wikiquote, and Project Gutenberg (updated in 2005) | Mixture-of-Experts algorithm (1991)
2014 | Google's GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional neural network algorithm (1989)
2015 | Google's DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning algorithm (1992)

Average no. of years to breakthrough: ~3 years (datasets) vs. ~18 years (algorithms)

The average elapsed time between key algorithm proposals and corresponding advances was about 18 years, whereas the average elapsed time between key dataset availabilities and corresponding advances was less than 3 years, about 6 times faster.
14. What about Deep Learning?
Pretrained models and recipes: models trained using OpenNMT are available for
→ English → German
→ German → English
→ English summarization
→ Multi-way: FR, ES, PT, IT, RO <-> FR, ES, PT, IT, RO
More models coming soon:
→ Ubuntu Dialog Dataset
→ Syntactic parsing
→ Image-to-text
18. Occam's razor
Given two models that perform more or less equally, you should always prefer the less complex one.
Deep Learning might not be preferred, even if it squeezes out a +1% in accuracy.
20. Reasons to prefer a simpler model
→ System complexity
→ Maintenance
→ Explainability
→ ... and many others
[Figure: "GoogLeNet network with all the bells and whistles" (Figure 3 from the GoogLeNet paper)]
21. A real-life example
Goal: supervised classification
→ 40 features
→ 10k examples
What did the ML engineer choose?
→ A multi-layer ANN trained with TensorFlow
What was his proposed next step?
→ Try ConvNets
Where is the problem?
→ Hours to train, already looking into distributing
→ There are much simpler approaches (see the sketch below)
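At this scale, a standard tabular baseline trains in seconds on one machine. A minimal sketch, with synthetic data standing in for the 10k x 40 dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the 10k examples x 40 features from the example
X, y = make_classification(n_samples=10000, n_features=40, random_state=0)

# Trains in seconds on a laptop; no GPUs or distribution needed
model = GradientBoostingClassifier()
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```

If this baseline matches the ANN, Occam's razor says ship the baseline.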
23. Better models and features that "don't work"
E.g. you have a linear model and have been selecting and optimizing features for that model:
→ A more complex model with the same features -> improvement not likely
→ More expressive features -> improvement not likely
More complex features may require a more complex model.
A more complex model may not show improvements with a feature set that is too simple.
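A toy illustration of this interplay, on synthetic XOR-like data (for illustration only): a linear model on raw features fails, but either an engineered interaction feature or a more expressive model recovers the signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like target
X_inter = np.c_[X, X[:, 0] * X[:, 1]]     # add an interaction feature

for name, model, data in [
    ("linear, raw features", LogisticRegression(), X),
    ("linear, + interaction feature", LogisticRegression(), X_inter),
    ("random forest, raw features", RandomForestClassifier(), X),
]:
    print(name, cross_val_score(model, data, y, cv=5).mean().round(3))
```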
25. Feature Engineering Example - Answer Ranking
What is a good Quora answer?
→ Truthful, reusable, provides explanation, well formatted, ...
How are those dimensions translated into features?
→ Features that relate to the answer quality itself
→ Interaction features (upvotes/downvotes, clicks, comments...)
→ User features (e.g. expertise in topic)
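As a hypothetical sketch (the feature names and data layout below are invented for illustration, not Quora's actual features), those dimensions might be encoded per answer like this:

```python
def answer_features(answer, author):
    """Hypothetical feature extraction for answer ranking."""
    return {
        # Quality features of the answer itself
        "length": len(answer["text"]),
        "has_formatting": int("\n" in answer["text"]),
        # Interaction features
        "upvotes": answer["upvotes"],
        "downvotes": answer["downvotes"],
        "clicks": answer["clicks"],
        "comments": answer["comments"],
        # User features
        "author_topic_expertise": author["expertise"].get(answer["topic"], 0.0),
    }
```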
26. Feature Engineering
Properties of a well-behaved ML feature:
→ Interpretable
→ Reliable
→ Reusable
→ Transformable
Deep learning: automating feature discovery
[Figure (I. Goodfellow): from rule-based systems (hand-designed program), to classic machine learning (hand-designed features), to representation learning (learned features), to deep learning (simple learned features composed into increasingly complex ones), each mapping input through features to output]
34. Ensembles
The Netflix Prize was won by an ensemble
→ Initially BellKor was using GBDTs
→ BigChaos introduced an ANN-based ensemble
Most practical applications of ML run an ensemble
→ Why wouldn't you?
→ At least as good as the best of your methods
→ Can add completely different approaches (e.g. CF and content-based)
→ You can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs...
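A minimal stacking sketch with scikit-learn, combining heterogeneous base models under a logistic-regression ensemble layer (the base models and synthetic data are chosen for illustration); this is also the "models as features" idea from the next slide:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Base models can be completely different approaches; the ensemble
# layer is a simple LR trained on their out-of-fold predictions.
ensemble = StackingClassifier(
    estimators=[("gbdt", GradientBoostingClassifier()),
                ("rf", RandomForestClassifier())],
    final_estimator=LogisticRegression())

print("Stacked CV accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())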
35. Ensembles & Feature Engineering
Ensembles are the way to turn any model into a feature!
E.g. don't know if the way to go is Factorization Machines, Tensor Factorization, or RNNs?
→ Treat each model as a "feature"
→ Feed them into an ensemble
[Figure: Wide, Deep, and Wide & Deep models: sparse features, dense embeddings, hidden layers (rectified linear units), and output units (sigmoid)]
37. Defining training/testing data
Training a simple binary classifier for good/bad answers:
→ Defining positive and negative labels is a non-trivial task
→ Is this a positive or a negative?
   → a funny but uninformative answer with many upvotes
   → a short uninformative answer by a well-known expert in the field
   → a very long informative answer that nobody reads/upvotes
   → an informative answer with grammar/spelling mistakes
   → ...
38. The curse of presentation bias
Users can only click on what you decide to show
→ But, what you decide to show is the result of what your model predicted is good
Simply treating things you show as negatives is not likely to work
Better options (see the weighting sketch below):
→ Correcting for the probability a user will click on a position -> attention models
→ Explore/exploit approaches such as MAB
[Figure: items shown higher on the page are more likely to be seen than those shown lower]
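One common correction is inverse propensity weighting: up-weight clicks that happened at positions users rarely examine instead of treating everything unclicked as a negative. A minimal sketch, assuming per-position examination propensities have been estimated elsewhere (the numbers here are made up):

```python
import numpy as np

# Hypothetical examination probability per display position
# (in practice estimated from randomization or a click model)
propensity = np.array([0.9, 0.6, 0.4, 0.2])

# Logged impressions: (position shown, clicked?)
logs = [(0, 1), (1, 0), (2, 1), (3, 0)]

# Clicks are weighted by 1/propensity; unclicked impressions keep
# weight 1 rather than being treated as strong negatives.
sample_weight = np.array(
    [1.0 / propensity[pos] if clicked else 1.0 for pos, clicked in logs])

# These weights can then be passed to most learners, e.g.
# model.fit(X, y, sample_weight=sample_weight)
```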
41. AI in the wild: Desired properties
● Easily extensible
  ○ Incrementally/iteratively learn from "human-in-the-loop" or from additional data
● Knows what it does not know
  ○ Models uncertainty in prediction
  ○ Enables fall-back to manual
42. Assisted diagnosis in the wild
1. Extensibility
   a. Diagnosis as a ML task
      i. Expert systems as a prior
   b. Modeling less prevalent diseases
      i. Low-shot learning
2. Knowing what you don't know
   a. Measures of uncertainty in prediction
   b. Allows fall-back to "physician-in-the-loop"
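A minimal sketch of the "knows what it does not know" fall-back, using predicted class probabilities as a (crude) confidence proxy; the threshold and routing below are assumptions for illustration, not the actual Curai system:

```python
import numpy as np

def predict_or_escalate(model, x, threshold=0.9):
    """Return the model's prediction, or defer to a human expert
    when the model is not confident enough (hypothetical policy)."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    if proba.max() >= threshold:
        return int(np.argmax(proba))
    return "escalate_to_physician"   # fall back to human-in-the-loop
```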
43. Data and Models are great. You know what's even better? The right evaluation approach!
Lesson 9
44. Offline/Online testing process
Offline experimentation: initial hypothesis → choose model → train model → test offline → hypothesis validated? If NO, try a different model or reformulate the hypothesis; if YES, move to online experimentation.
Online experimentation: design A/B test → choose control → deploy prototype → observe behavior → analyze results → significant improvements? If YES, deploy the feature; if NO, reformulate the hypothesis.
45. Executing A/B tests
Measure differences in metrics across statistically identical populations that each experience a different algorithm.
Overall Evaluation Criteria (OEC) = e.g. member retention at Netflix
→ Use long-term metrics whenever possible
→ Short-term metrics can be informative and allow faster decisions
   ⁻ But they are not always aligned with the OEC
Decisions on the product are always data-driven
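A minimal sketch of analyzing a binary metric from an A/B test with a two-proportion z-test (the counts below are invented for illustration):

```python
from statsmodels.stats.proportion import proportions_ztest

# Retained users / total users in control vs. treatment (made-up numbers)
successes = [4120, 4300]
totals = [10000, 10000]

stat, p_value = proportions_ztest(successes, totals)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# Only ship if the improvement on the OEC is statistically significant
```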
46. Offline testing
Measure model performance using (IR) metrics
Offline performance is an indication used to decide on follow-up A/B tests
A critical (and mostly unsolved) issue is how offline metrics correlate with A/B test results.
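For ranking problems a typical offline (IR) metric is NDCG; a minimal scikit-learn sketch, with made-up relevance labels and model scores:

```python
from sklearn.metrics import ndcg_score

# True relevance of 5 candidate answers, and the model's ranking scores
y_true = [[3, 2, 0, 1, 0]]
y_score = [[0.9, 0.7, 0.5, 0.4, 0.1]]

print("NDCG@5:", ndcg_score(y_true, y_score, k=5))
```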
48. ML vs Software
Can you treat your ML infrastructure as you would your software one?
→ Yes and no
You should apply software engineering best practices (e.g. encapsulation, abstraction, cohesion, low coupling...)
However, design patterns for machine learning software are not well known/documented
51. Machine Learning Infrastructure
→ Whenever you develop any ML infrastructure, you need to target two different modes:
Mode 1: ML experimentation
− Flexibility
− Ease of use
− Reusability
Mode 2: ML production
− All of the above + performance & scalability
→ Ideally you want the two modes to be as similar as possible
→ How do you combine them?
52. Machine Learning Infrastructure
Option 1
→ Favor experimentation and only invest in productionizing once something shows results
→ E.g. have ML researchers use R, and then ask engineers to implement things in production when they work
Option 2
→ Favor production and have "researchers" struggle to figure out how to run experiments
→ E.g. implement highly optimized C++ code and have ML researchers experiment only through data available in logs/DB
54. Machine Learning Infrastructure
Good intermediate options:
→ Have ML "researchers" experiment in Jupyter Notebooks using Python tools (scikit-learn, PyTorch, TF...). Use the same tools in production whenever possible, and implement optimized versions only when needed.
→ Implement abstraction layers on top of optimized implementations so they can be accessed from regular/friendly experimentation tools (see the sketch below)
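A minimal sketch of such an abstraction layer: one fit/predict interface backed by either a notebook-friendly implementation or an optimized production one. The `fast_ranker` backend below is hypothetical, standing in for an optimized C++ binding:

```python
from sklearn.linear_model import LogisticRegression

class Ranker:
    """Common fit/predict interface for experimentation and production."""

    def __init__(self, backend="sklearn"):
        if backend == "sklearn":        # easy to use in notebooks
            self._impl = LogisticRegression(max_iter=1000)
        elif backend == "optimized":    # hypothetical optimized module
            import fast_ranker          # assumption: C++ binding, not a real package
            self._impl = fast_ranker.Model()
        else:
            raise ValueError(f"unknown backend: {backend}")

    def fit(self, X, y):
        self._impl.fit(X, y)
        return self

    def predict(self, X):
        return self._impl.predict(X)
```

Researchers and production code then share the same call sites; only the backend flag changes.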
60.
01. Choose the right metric
02. Be thoughtful about your data
03. Understand dependencies between data, models & systems
04. Optimize only what matters, beware of biases
05. Be thoughtful about your ML infrastructure/tools, and about organizing your teams