2. Context
• In 5 years, how many people will interact with an NLP application daily? ≈ 7.5 billion
• What is the size of the NLP market, in billions, after five years? ≈ $116 B
• What % of AI/NLP projects fail to make it from idea to production? ≈ 87%
Let us deliver robust and responsible NLP
Photo credits: https://wallpaperaccess.com/full/818001.jpg
3. Sandya Mannarswamy
sandyasm@gmail.com
• 20 years of industry and research experience working with Microsoft, HP, IBM & Xerox Labs
• PhD in Computer Science from IISc
• Research interests span natural language processing and machine learning; earlier work on compilers
• 52 papers and pending patents
• Code Sport columnist in Open Source For You
• Currently an independent researcher & consultant
4. Agenda
• How robust are current state-of-the-art NLP models?
• How can we make NLP models robust?
5. NLP Applications Are Everywhere
• People communicate almost everything in language: web search, advertising, emails, customer service, language translation, virtual agents, medical reports
• "AI beats humans in Stanford reading comprehension test" (CNET)
• "Google Search Now Reads at a Higher Level" (WIRED)
6. Guess What Is Common Between These News Headlines?
• "Algorithms grading millions of students' essays"
• "AI – Key to recruiting diverse workforce"
• "Microsoft unveiled Tay – a Twitter bot that…"
• "The Warren Buffett And Anne Hathaway Trade"
All of these are NLP models that failed in production.
7. Building an NLP Model – the Current Recipe
• Take a representative dataset; split it into train/validation/test sets
• Use the latest BERT/RoBERTa/ALBERT model
• Or build your own fancy deep architecture
• Show the model achieves > X% on the test dataset
• Voila! We are ready to go for live deployment
8. NLP Model Development Cycle
Collect labelled dataset → Build model → > x% test-data performance → Manual validation → Deploy → Failures → repeat
• 2/3rds of models fail after they have gone live
• Iteratively fixing issues: ↑ time, ↑ $$$
9. Real World Data Can Be Highly Diverse!
• Sentiment analysis is a well-known NLP task
• State-of-the-art (SOTA) models exceed 95% on benchmark datasets
• But they perform badly on many real-world utterances: varying tone/formality, code-switched and transliterated text
I ❤️ this movie,
I love this flick,
I love this படம்,
Movie Aacha Hai!,
IMO, Gr8 movie!,
Value for your money!
Luv this movee
Arnie Killed it!
One word can mean 100 different things, and 100 words can mean the same thing!
10. NLP Models Match Human Performance on GLUE and SuperGLUE Benchmarks
But do models really understand the task?
Credits: https://super.gluebenchmark.com/leaderboard
11. Current NLP Paradigm
• Models are based on associative learning (correlation)
• Models learn statistical cues (superficial patterns) present in the training data that are predictive of the label
• Examples of statistical cues:
o presence of specific words in the data instances mapping to a specific label
o lexical overlap between two sentences (in sentence-pair classification)
o presence of linguistic phenomena such as negation
• Statistical cues need not be reflective of the underlying task
• This can lead to models with high "test set" performance, but with poor generalization
How well do NLP models generalize?
12. Task #1 Sentiment Analysis
• Sentence: "A great white shark bit me on the beach"
• Is this sentence positive or negative?
• Predicted as 'Highly Positive'!
o in Google Cloud NLP
o in Microsoft's Text Analytics service
o and even in Stanford's deep-learning-based sentiment model
Models get misled by the surface cue words "great" and "white".
13. Task #2 Machine Reading Comprehension
• Model – Bi-Directional Attention Flow (BiDAF)
• Given paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined…
• Given question: The number of new Huguenot colonists declined after what year?
• Machine answer: 1700
[Chart: SQuAD 1.0 leaderboard – human performance vs. logistic regression baseline; https://rajpurkar.github.io/SQuAD-explorer]
14. Did the Machine Understand the Question?
Original:
• Given paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined.
• Given question: The number of new Huguenot colonists declined after what year?
• Machine answer: 1700
Let's add one more line…
• Given paragraph: The largest portion of the Huguenots to settle in the Cape arrived between 1688 and 1689…but quite a few arrived as late as 1700; thereafter, the numbers declined. The number of old Acadian colonists declined after the year of 1675.
• Given question: The number of new Huguenot colonists declined after what year?
• Machine answer: 1675
Surface cue: extract the answer from the sentence most similar to the question.
15. Task #3 Natural Language Inference (NLI)
• Given a premise (P) and a hypothesis (H), determine the relation as Entailment, Contradiction or Neutral
• Example #1
P: A man is standing in front of the statue on the beach
H: A man is sleeping on the beach
Label: Contradiction
• Example #2
P: A man is standing on roof
H: There is a man on roof
Label: Entailment
• Example #3
P: A man is standing on the roof
H: The man has a hammer in hand
Label: Neutral
[Chart: SOTA numbers for MNLI]
16. How Robust Are the NLI Models?
All the models drop in performance by > 48% on examples like these:
• P: The judge was paid by the actor
H: The actor was paid by the judge
Actual: Contradict | Predicted: Entail | Cue: Lexical overlap
• P: Enthusiasm for Disney's Lion King dwindles
H: Disney's Lion King is no longer enthusiastically attended
Actual: Entail | Predicted: Contradict | Cue: Negation handling
• P: Child Services will receive a $200,000 grant
H: A grant of $900,000 will go to Child Services
Actual: Contradict | Predicted: Entail | Cue: Numerical reasoning
• P: It was still at night
H: The sun has not risen yet, and the moon was still shining
Actual: Entail | Predicted: Contradict | Cue: World knowledge
• P: "have her show the message" – said Paul
H: Paul told her to hide the message
Actual: Contradict | Predicted: Entail | Cue: Antonym relation
17. Task #4 Argument Reasoning Comprehension
• Given a claim & reason, the task is to predict the warrant which makes the claim valid
Claim: Google is not a harmful monopoly
Reason: People can choose not to use Google
Warrant: Other search engines don't redirect to Google
Alternative: All other search engines redirect to Google
• BERT achieves 78%!
• BERT picked up surface cues such as the words "not", "do", "is" in the warrant sentences
• After eliminating 'label-correlated' shallow statistical cues, BERT performance drops to 50%
BERT is a strong learner, but depends on shallow surface cues in solving this task!
18. NLP's Clever Hans Effect
• NLP models often end up learning from shallow surface cues present in the training data
• Such cues may not correlate with the task being solved
• Yet models can show strong test-set performance!
• Real-world data may not contain those shallow cues, and performance can drop sharply
Be skeptical when a model achieves near-human performance on complex NLP tasks!
Photo credits: https://en.wikipedia.org/wiki/Clever_Hans
19. Improving Model Robustness
• Spend time exploring your data
• Understand why your model is making its decisions
o Use interpretability tools to visualize the reason for a model decision (a short sketch with LIME follows)
o Check whether your model is depending on shallow surface cues unrelated to the actual task
• A number of model interpretability tools are available:
o LIME
o AllenNLP Interpret
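As a hedged illustration of the first tool, here is a minimal LIME probe of the "great white shark" sentence from Task #1. The tiny TF-IDF classifier and its four training sentences are stand-ins invented for this sketch; in practice, point predict_proba at the model you are debugging.

    # Minimal sketch: probing which words drive a sentiment prediction with LIME.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from lime.lime_text import LimeTextExplainer

    train_texts = ["great movie, loved it", "what a great view",
                   "terrible film, hated it", "the attack was horrifying"]
    train_labels = [1, 1, 0, 0]                  # 1 = positive, 0 = negative

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_texts, train_labels)

    explainer = LimeTextExplainer(class_names=["negative", "positive"])
    exp = explainer.explain_instance(
        "A great white shark bit me on the beach",
        clf.predict_proba,       # any fn returning (n_samples, n_classes) probs
        num_features=6)
    print(exp.as_list())         # per-word weights driving the prediction

If "great" carries most of the positive weight on a clearly negative sentence, the model is reading surface cues, not meaning.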
24. It Is Not Just Me – Even Andrej Says It!
Andrej Karpathy
Well-known AI researcher, now Director of AI at Tesla
25. Identifying Misleading Cues in a Dataset Using PMI
• Models pick up spurious correlations or cues present in the dataset
• Cues aligned with the task improve performance
• Cues are detrimental to classifier performance if they:
o are not representative of the actual task
o occur in the training data but not in real-world data
• We can use Pointwise Mutual Information (PMI) to identify such cue words:
$\mathrm{PMI}(\mathit{word}, \mathit{class}) = \log \frac{p(\mathit{word}, \mathit{class})}{p(\mathit{word})\, p(\mathit{class})}$
• Examine the top-k PMI words to check whether they are task-representative or not (a small sketch follows)
Example from the SQuAD dataset, PMI-ranked word cues:
Question type | Cue word | PMI score
Why           | because  | 0.92
When          | did      | 0.91
When          | since    | 0.89
When          | year     | 0.76
Which         | people   | 0.69
Which         | into     | 0.68
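A minimal sketch of computing these PMI scores over a labelled dataset; the function name and the (text, label) input format are my own, illustrative choices:

    # Rank (word, class) pairs by PMI to surface potential cue words.
    import math
    from collections import Counter

    def top_pmi_cues(data, k=10):
        """data: iterable of (text, label) pairs; returns top-k (word, class) cues."""
        word_counts, class_counts, joint_counts = Counter(), Counter(), Counter()
        n = 0
        for text, label in data:
            n += 1
            class_counts[label] += 1
            for w in set(text.lower().split()):   # word presence, not frequency
                word_counts[w] += 1
                joint_counts[(w, label)] += 1
        scores = {
            (w, c): math.log((joint / n) /
                             ((word_counts[w] / n) * (class_counts[c] / n)))
            for (w, c), joint in joint_counts.items()
        }
        return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

Manually inspect the top-k cues: "because" for Why-questions is task-representative; a random artifact word mapping to one label is not.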
26. Do Named Entities in Data Carry Any Unintended Biases?
• Large NLP models are often trained on public data (Web/Wikipedia/News)
• A person's name can often be mentioned in negative contexts
• Models learn a spurious negative association between that named entity and sentiment
• A model should ideally be independent of the entities mentioned in the text!
• But it ends up being sensitive to the named entities in the text
Sentence                 | Sentiment
I hate Justin Timberlake | -0.3
I hate Katy Perry        | -0.1
I hate Taylor Swift      | -0.4
I hate Rihanna           | -0.6
(Example from the FB-pub dataset using a toxicity model)
27. Perturbation Sensitivity Analysis (PSA)
• Use PSA to measure unintended biases, i.e., whether the model has learnt any unwanted associations with named entities
For each sentence containing a named entity:
1. PSA perturbs the sentence by replacing the entity with other, equivalent entities
2. Measure the sensitivity of the model by running it on the generated perturbed sentences
3. Identify any unintended biases associated with named entities
(a minimal code sketch follows)
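A minimal sketch of the loop above, assuming a score function that returns a scalar sentiment/toxicity score; the helper name and the '{name}' template format are illustrative:

    # Perturbation sensitivity: re-score the same sentence with different
    # names substituted in, then look at the spread of scores.
    from statistics import mean

    def perturbation_sensitivity(template, names, score):
        """template: sentence with a '{name}' slot at the anchor entity;
        names: equivalent entities to substitute; score: model scoring fn."""
        scores = {name: score(template.format(name=name)) for name in names}
        spread = max(scores.values()) - min(scores.values())
        return scores, mean(scores.values()), spread

    # Example (with the toxicity model from the previous slide as `score`):
    # scores, avg, spread = perturbation_sensitivity(
    #     "I hate {name}",
    #     ["Justin Timberlake", "Katy Perry", "Taylor Swift", "Rihanna"],
    #     score)

A large spread means the output depends on who is mentioned, not on what is said: an unintended bias.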
28. Dataset Ablation Analysis
• Do dataset ablation (in addition to model ablation):
o Test your model with partial input (sketched below)
o Test using random labels
• Augment your training data with counterexamples, e.g.:
Original: "Every film student should see this thing just so they'll know the very definition of a perfect movie."
Counterexample: "Every film student should see this thing just so they'll know the very definition of a bad movie."
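As a hedged sketch of the partial-input test for an NLI dataset: train a simple classifier on the hypothesis alone, with the premise thrown away. If it still beats chance by a wide margin, the labels are leaking through surface cues (cf. "Hypothesis Only Baselines in NLI" in the references). The function name and data format are illustrative.

    # Partial-input ablation: can labels be predicted from the hypothesis only?
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    def hypothesis_only_baseline(pairs, labels):
        """pairs: list of (premise, hypothesis); labels: NLI gold labels."""
        hypotheses = [h for _, h in pairs]        # drop the premise entirely
        X_tr, X_te, y_tr, y_te = train_test_split(
            hypotheses, labels, test_size=0.2, random_state=0)
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                            LogisticRegression(max_iter=1000))
        clf.fit(X_tr, y_tr)
        return clf.score(X_te, y_te)  # chance for 3-way NLI is ~0.33

Accuracy well above 1/3 here means the dataset itself, not reading comprehension, is doing much of the work.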
29. Use Data Augmentation to Improve the Model
• Generate synthetic data to stress-test your model
• Augment data to break the spurious (pattern, label) correlations
• Word/phrase-level methods:
o Synonym replacement (using WordNet/embeddings) – see the sketch after this list
o Random word swap/insertion/deletion
o Phrase-level paraphrase replacement (using PPDB)
• Readily available Python libraries:
o NLP Augment
o Easy Data Augmentation for NLP
• Use back-translation for paraphrasing: translate from language L1 to language L2 and back to L1
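A small sketch of one of the word-level methods above, WordNet synonym replacement; it assumes nltk with the wordnet corpus downloaded (nltk.download('wordnet')), and the function name and parameters are illustrative:

    # WordNet synonym replacement: swap a few words for synonyms to generate
    # augmented variants that break word-level (pattern, label) correlations.
    import random
    from nltk.corpus import wordnet

    def synonym_augment(sentence, n_replacements=2, seed=0):
        rng = random.Random(seed)
        words = sentence.split()
        candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
        rng.shuffle(candidates)
        for i in candidates[:n_replacements]:
            synonyms = {l.name().replace("_", " ")
                        for s in wordnet.synsets(words[i])
                        for l in s.lemmas()} - {words[i]}
            if synonyms:
                words[i] = rng.choice(sorted(synonyms))
        return " ".join(words)

    print(synonym_augment("I love this movie"))   # e.g. "I love this film"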
30. Preventing Models from Learning Surface Cues
• How can we design models which do not learn the spurious surface cues in a dataset?
• Train using an ensemble of classifiers (one variant is sketched below):
o Train a naïve model which predicts based on surface cues only
o Train a stronger model which focuses on the other patterns in the data, excluding the surface cues
o Use only the stronger model for inference
• The stronger model generalizes well to out-of-domain examples
• Different types of ensembles can be used
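One common variant in the literature is product-of-experts debiasing. A hedged PyTorch sketch, assuming the naive, cue-only model has already been trained and is kept frozen:

    # Product-of-experts debiasing: combine log-probabilities of a frozen,
    # cue-only "naive" model with the main model during training only.
    import torch.nn.functional as F

    def poe_loss(main_logits, bias_logits, labels):
        # Where the naive model is already confident from surface cues,
        # the main model gets little gradient, so it must learn the rest.
        combined = F.log_softmax(main_logits, dim=-1) \
                 + F.log_softmax(bias_logits.detach(), dim=-1)
        return F.cross_entropy(combined, labels)  # re-normalizes internally

    # Training:  loss = poe_loss(main(x), naive(x), y); backprop into main only.
    # Inference: use main(x) alone; the naive model is discarded.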
31. Takeaways
• SOTA performance on a test dataset does not imply production readiness
• Understand your model using interpretability tools
• Test your model on diverse inputs
• Do dataset ablation
• Do targeted data augmentation to improve your model
o Do not add data indiscriminately
o Add only the data you need!
• Continue monitoring after deployment
• Adversarial Examples for Evaluating Reading Comprehension Systems – https://arxiv.org/abs/1707.07328 (model performance drops from 75% F1 to 35% F1)
• Model used: BiDAF – https://arxiv.org/abs/1611.01603 – which was SOTA on SQuAD 1.0
• Right for the Wrong Reasons – https://arxiv.org/abs/1902.01007
• Stress Test Evaluation for Natural Language Inference – https://arxiv.org/abs/1806.00692
• Clever Hans was a horse that was claimed to perform arithmetic and other intellectual tasks, but was actually picking up signals from its trainer for the correct answer
Perturbation sensitivity analysis:
Let X be the set of sentences containing the entity type we want to perturb.
Let N be the set of target entity names.
E is the anchor entity in each sentence, which we replace with every entity in N.
Measure the difference in classifier score and take the average.
What about "He is like Gandhi" vs. "He is like Hitler"?
Partial input baselines:
• Hypothesis Only Baselines in Natural Language Inference – https://www.aclweb.org/anthology/S18-2023.pdf
• How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks – Divyansh Kaushik, Zachary C. Lipton