2. Context
• In 5 years, how many people will interact with an NLP application daily? ≈ 7.5 billion
• What is the size of the NLP market, in billions, after five years? ≈ $116 B
• What % of AI/NLP projects fail to make it from idea to production? ≈ 87%
Let us deliver robust and responsible NLP
Photo credits: https://wallpaperaccess.com/full/818001.jpg
3. Sandya Mannarswamy
sandyasm@gmail.com
• 20 years of industry and research experience working with Microsoft, HP, IBM & Xerox Labs
• PhD in Computer Science from IISc
• Research interests span natural language processing and machine learning; earlier work on compilers
• 52 papers and pending patents
• Code Sport columnist in Open Source For You
• Currently an independent researcher & consultant
4. Agenda
• How robust are current state-of-the-art NLP models?
• How can we make NLP models robust?
5. NLP Applications Are Everywhere
• People communicate almost everything in language: web search, advertising, emails, customer service, language translation, virtual agents, medical reports
• "AI beats humans in Stanford reading comprehension test" (CNET)
• "Google Search Now Reads at a Higher Level" (WIRED)
6. Guess What Is Common Between These News Headlines?
• "Algorithms grading millions of students' essays"
• "AI – Key to recruiting diverse workforce"
• "Microsoft unveiled Tay – a Twitter bot that…"
• "The Warren Buffett And Anne Hathaway Trade"
All of these are NLP models that failed in production.
7. Building an NLP Model – the Current Recipe
• Take a representative dataset; split it into train/validation/test sets
• Use the latest BERT/RoBERTa/ALBERT model
• Or build your own fancy deep architecture
• Show the model achieves > X% on the test dataset
• Voila! We are ready to go for live deployment
8. NLP Model Development Cycle
Collect labelled dataset → Build model → > x% test-data performance → Manual validation → Deploy → Failures → repeat
• 2/3rds of models fail after they have gone live
• Iteratively fixing issues: ↑ time, ↑ $$$
9. Real World Data Can Be Highly Diverse!
• Sentiment analysis is a well-known NLP task
• State-of-the-art (SOTA) models exceed 95% on benchmark datasets
• But they perform badly on many real-world utterances: varying tone/formality, code-switched and transliterated text
I ❤️ this movie,
I love this flick,
I love this படம்,
Movie Aacha Hai!,
IMO, Gr8 movie!,
Value for your money!
Luv this movee
Arnie Killed it!
One word can mean 100 different things, and 100 words can mean the same thing!
10. NLP Models Match Human Performance on GLUE and SuperGLUE Benchmarks
But do models really understand the task?
Credits: https://super.gluebenchmark.com/leaderboard
11. Current NLP Paradigm
• Models are based on associative learning (correlation)
• Models learn statistical cues (superficial patterns) present in the training data that are predictive of the label
• Examples of statistical cues:
o presence of specific words in the data instances mapping to a specific label
o lexical overlap between two sentences (in sentence-pair classification)
o presence of linguistic phenomena such as negation
• Statistical cues need not be reflective of the underlying task
• This can lead to models with high "test set" performance, but with poor generalization
How well do NLP models generalize?
12. Task #1 Sentiment Analysis
• Sentence: "A great white shark bit me on the beach"
• Is this sentence positive or negative?
• Predicted as 'Highly Positive'!
o in Google Cloud NLP
o in Microsoft's Text Analytics service
o and even in Stanford's deep-learning-based sentiment model
Models get misled by the surface cue words "great" and "white".
13. Task #2 Machine Reading Comprehension
• Model – Bi-Directional Attention Flow (BiDAF)
• Given paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined…
• Given question: The number of new Huguenot colonists declined after what year?
• Machine answer: 1700
[Chart: SQuAD 1.0 leaderboard – human performance vs. logistic regression baseline; https://rajpurkar.github.io/SQuAD-explorer]
14. Did the Machine Understand the Question?
Original:
• Given paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined.
• Given question: The number of new Huguenot colonists declined after what year?
• Machine answer: 1700
Let's add one more line…
• Given paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675.
• Given question: The number of new Huguenot colonists declined after what year?
• Machine answer: 1675
Surface cue: extract the answer from the sentence most similar to the question.
15. Task #3 Natural Language Inference (NLI)
• Given a premise (P) and a hypothesis (H), determine the relation as Entailment, Contradiction or Neutral
• Example #1
P: A man is standing in front of the statue on the beach
H: A man is sleeping on the beach
Label: Contradiction
• Example #2
P: A man is standing on roof
H: There is a man on roof
Label: Entailment
• Example #3
P: A man is standing on the roof
H: The man has a hammer in hand
Label: Neutral
[Chart: SOTA numbers for MNLI]
16. How Robust Are the NLI Models?
All the models drop in performance by > 48% on examples like these:
• P: The judge was paid by the actor
H: The actor was paid by the judge
Actual: Contradict | Predicted: Entail | Cue: Lexical overlap
• P: Enthusiasm for Disney's Lion King dwindles
H: Disney's Lion King is no longer enthusiastically attended
Actual: Entail | Predicted: Contradict | Cue: Negation handling
• P: Child Services will receive a $200,000 grant
H: A grant of $900,000 will go to Child Services
Actual: Contradict | Predicted: Entail | Cue: Numerical reasoning
• P: It was still at night
H: The sun has not risen yet, and the moon was still shining
Actual: Entail | Predicted: Contradict | Cue: World knowledge
• P: "have her show the message" – said Paul
H: Paul told her to hide the message
Actual: Contradict | Predicted: Entail | Cue: Antonym relation
17. Task #4 Argument Reasoning Comprehension
• Given a claim & reason, the task is to predict the warrant which makes the claim valid
Claim: Google is not a harmful monopoly
Reason: People can choose not to use Google
Warrant: Other search engines don't redirect to Google
Alternative: All other search engines redirect to Google
• BERT achieves 78%!
• BERT picked up surface cues such as the words "not", "do", "is" in the warrant sentences
• After eliminating 'label-correlated' shallow statistical cues, BERT performance drops to 50%
BERT is a strong learner, but depends on shallow surface cues in solving this task!
18. NLP's Clever Hans Effect
• NLP models often end up learning from shallow surface cues present in the training data
• Such cues may not correlate with the task being solved
• Yet models can show strong test-set performance!
• Real-world data may not contain those shallow cues, and performance can drop sharply
Be skeptical when a model achieves near-human performance on complex NLP tasks!
Photo credits: https://en.wikipedia.org/wiki/Clever_Hans
19. Improving Model Robustness
• Spend time exploring your data
• Understand why your model is making its decisions
o Use interpretability tools to visualize the reason for a model decision (a short sketch with LIME follows)
o Check whether your model is depending on shallow surface cues unrelated to the actual task
• A number of model interpretability tools are available:
o LIME
o AllenNLP Interpret
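As a hedged illustration of the first tool, here is a minimal LIME probe of the "great white shark" sentence from Task #1. The tiny TF-IDF classifier and its four training sentences are stand-ins invented for this sketch; in practice, point predict_proba at the model you are debugging.

    # Minimal sketch: probing which words drive a sentiment prediction with LIME.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from lime.lime_text import LimeTextExplainer

    train_texts = ["great movie, loved it", "what a great view",
                   "terrible film, hated it", "the attack was horrifying"]
    train_labels = [1, 1, 0, 0]                  # 1 = positive, 0 = negative

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_texts, train_labels)

    explainer = LimeTextExplainer(class_names=["negative", "positive"])
    exp = explainer.explain_instance(
        "A great white shark bit me on the beach",
        clf.predict_proba,       # any fn returning (n_samples, n_classes) probs
        num_features=6)
    print(exp.as_list())         # per-word weights driving the prediction

If "great" carries most of the positive weight on a clearly negative sentence, the model is reading surface cues, not meaning.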
24. It Is Not Just Me – Even Andrej Says It!
Andrej Karpathy
Well-known AI researcher, now Director of AI at Tesla
25. Identifying Misleading Cues in a Dataset Using PMI
• Models pick up spurious correlations or cues present in the dataset
• Cues aligned with the task improve performance
• Cues are detrimental to classifier performance if they:
o are not representative of the actual task
o occur in the training data but not in real-world data
• We can use Pointwise Mutual Information (PMI) to identify such cue words:
$\mathrm{PMI}(\mathit{word}, \mathit{class}) = \log \frac{p(\mathit{word}, \mathit{class})}{p(\mathit{word})\, p(\mathit{class})}$
• Examine the top-k PMI words to check whether they are task-representative or not (a small sketch follows)
Example from the SQuAD dataset, PMI-ranked word cues:
Question type | Cue word | PMI score
Why           | because  | 0.92
When          | did      | 0.91
When          | since    | 0.89
When          | year     | 0.76
Which         | people   | 0.69
Which         | into     | 0.68
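A minimal sketch of computing these PMI scores over a labelled dataset; the function name and the (text, label) input format are my own, illustrative choices:

    # Rank (word, class) pairs by PMI to surface potential cue words.
    import math
    from collections import Counter

    def top_pmi_cues(data, k=10):
        """data: iterable of (text, label) pairs; returns top-k (word, class) cues."""
        word_counts, class_counts, joint_counts = Counter(), Counter(), Counter()
        n = 0
        for text, label in data:
            n += 1
            class_counts[label] += 1
            for w in set(text.lower().split()):   # word presence, not frequency
                word_counts[w] += 1
                joint_counts[(w, label)] += 1
        scores = {
            (w, c): math.log((joint / n) /
                             ((word_counts[w] / n) * (class_counts[c] / n)))
            for (w, c), joint in joint_counts.items()
        }
        return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

Manually inspect the top-k cues: "because" for Why-questions is task-representative; a random artifact word mapping to one label is not.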
26. Do Named Entities in Data Carry Any Unintended Biases?
• Large NLP models are often trained on public data (Web/Wikipedia/News)
• A person's name can often be mentioned in negative contexts
• Models learn a spurious negative association between that named entity and sentiment
• A model should ideally be independent of the entities mentioned in the text!
• But it ends up being sensitive to the named entities in the text
Sentence                 | Sentiment
I hate Justin Timberlake | -0.3
I hate Katy Perry        | -0.1
I hate Taylor Swift      | -0.4
I hate Rihanna           | -0.6
(Example from the FB-pub dataset using a toxicity model)
27. Perturbation Sensitivity Analysis (PSA)
• Use PSA to measure unintended biases, i.e., whether the model has learnt any unwanted associations with named entities
For each sentence containing a named entity:
1. PSA perturbs the sentence by replacing the entity with other, equivalent entities
2. Measure the sensitivity of the model by running it on the generated perturbed sentences
3. Identify any unintended biases associated with named entities
(a minimal code sketch follows)
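A minimal sketch of the loop above, assuming a score function that returns a scalar sentiment/toxicity score; the helper name and the '{name}' template format are illustrative:

    # Perturbation sensitivity: re-score the same sentence with different
    # names substituted in, then look at the spread of scores.
    from statistics import mean

    def perturbation_sensitivity(template, names, score):
        """template: sentence with a '{name}' slot at the anchor entity;
        names: equivalent entities to substitute; score: model scoring fn."""
        scores = {name: score(template.format(name=name)) for name in names}
        spread = max(scores.values()) - min(scores.values())
        return scores, mean(scores.values()), spread

    # Example (with the toxicity model from the previous slide as `score`):
    # scores, avg, spread = perturbation_sensitivity(
    #     "I hate {name}",
    #     ["Justin Timberlake", "Katy Perry", "Taylor Swift", "Rihanna"],
    #     score)

A large spread means the output depends on who is mentioned, not on what is said: an unintended bias.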
28. Dataset Ablation Analysis
• Do dataset ablation (in addition to model ablation):
o Test your model with partial input (sketched below)
o Test using random labels
• Augment your training data with counterexamples, e.g.:
Original: "Every film student should see this thing just so they'll know the very definition of a perfect movie."
Counterexample: "Every film student should see this thing just so they'll know the very definition of a bad movie."
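As a hedged sketch of the partial-input test for an NLI dataset: train a simple classifier on the hypothesis alone, with the premise thrown away. If it still beats chance by a wide margin, the labels are leaking through surface cues (cf. "Hypothesis Only Baselines in NLI" in the references). The function name and data format are illustrative.

    # Partial-input ablation: can labels be predicted from the hypothesis only?
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    def hypothesis_only_baseline(pairs, labels):
        """pairs: list of (premise, hypothesis); labels: NLI gold labels."""
        hypotheses = [h for _, h in pairs]        # drop the premise entirely
        X_tr, X_te, y_tr, y_te = train_test_split(
            hypotheses, labels, test_size=0.2, random_state=0)
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                            LogisticRegression(max_iter=1000))
        clf.fit(X_tr, y_tr)
        return clf.score(X_te, y_te)  # chance for 3-way NLI is ~0.33

Accuracy well above 1/3 here means the dataset itself, not reading comprehension, is doing much of the work.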
29. Use Data Augmentation to Improve the Model
• Generate synthetic data to stress-test your model
• Augment data to break the spurious (pattern, label) correlations
• Word/phrase-level methods:
o Synonym replacement (using WordNet/embeddings) – see the sketch after this list
o Random word swap/insertion/deletion
o Phrase-level paraphrase replacement (using PPDB)
• Readily available Python libraries:
o NLP Augment
o Easy Data Augmentation for NLP
• Use back-translation for paraphrasing: translate from language L1 to language L2 and back to L1
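A small sketch of one of the word-level methods above, WordNet synonym replacement; it assumes nltk with the wordnet corpus downloaded (nltk.download('wordnet')), and the function name and parameters are illustrative:

    # WordNet synonym replacement: swap a few words for synonyms to generate
    # augmented variants that break word-level (pattern, label) correlations.
    import random
    from nltk.corpus import wordnet

    def synonym_augment(sentence, n_replacements=2, seed=0):
        rng = random.Random(seed)
        words = sentence.split()
        candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
        rng.shuffle(candidates)
        for i in candidates[:n_replacements]:
            synonyms = {l.name().replace("_", " ")
                        for s in wordnet.synsets(words[i])
                        for l in s.lemmas()} - {words[i]}
            if synonyms:
                words[i] = rng.choice(sorted(synonyms))
        return " ".join(words)

    print(synonym_augment("I love this movie"))   # e.g. "I love this film"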
30. Preventing Models from Learning Surface Cues
• How can we design models which do not learn the spurious surface cues in a dataset?
• Train using an ensemble of classifiers (one variant is sketched below):
o Train a naïve model which predicts based on surface cues only
o Train a stronger model which focuses on the other patterns in the data, excluding the surface cues
o Use only the stronger model for inference
• The stronger model generalizes well to out-of-domain examples
• Different types of ensembles can be used
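One common variant in the literature is product-of-experts debiasing. A hedged PyTorch sketch, assuming the naive, cue-only model has already been trained and is kept frozen:

    # Product-of-experts debiasing: combine log-probabilities of a frozen,
    # cue-only "naive" model with the main model during training only.
    import torch.nn.functional as F

    def poe_loss(main_logits, bias_logits, labels):
        # Where the naive model is already confident from surface cues,
        # the main model gets little gradient, so it must learn the rest.
        combined = F.log_softmax(main_logits, dim=-1) \
                 + F.log_softmax(bias_logits.detach(), dim=-1)
        return F.cross_entropy(combined, labels)  # re-normalizes internally

    # Training:  loss = poe_loss(main(x), naive(x), y); backprop into main only.
    # Inference: use main(x) alone; the naive model is discarded.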
31. Takeaways
• SOTA performance on a test dataset does not imply production readiness
• Understand your model using interpretability tools
• Test your model on diverse inputs
• Do dataset ablation
• Do targeted data augmentation to improve your model
o Do not add data indiscriminately
o Add only the data you need!
• Continue monitoring after deployment
• Adversarial Examples for Evaluating Reading Comprehension Systems – https://arxiv.org/abs/1707.07328 (model performance drops from 75% F1 to 35% F1)
• Model used: BiDAF – https://arxiv.org/abs/1611.01603 – which was SOTA on SQuAD 1.0
• Right for the Wrong Reasons – https://arxiv.org/abs/1902.01007
• Stress Test Evaluation for Natural Language Inference – https://arxiv.org/abs/1806.00692
• Clever Hans was a horse that was claimed to perform arithmetic and other intellectual tasks, but was actually picking up signals from its trainer for the correct answer
Perturbation sensitivity analysis:
Let X be the set of sentences containing the entity type we want to perturb.
Let N be the set of target entity names.
E is the anchor entity in each sentence, which we replace with every entity in N.
Measure the difference in classifier score and take the average.
What about "He is like Gandhi" vs. "He is like Hitler"?
Partial input baselines:
• Hypothesis Only Baselines in Natural Language Inference – https://www.aclweb.org/anthology/S18-2023.pdf
• How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks – Divyansh Kaushik, Zachary C. Lipton