This document summarizes work using Bayesian optimization to compress BERT models for question answering while balancing model size and performance. It describes distilling BERT into smaller student models using SQuAD 2.0 data. SigOpt was used to tune model architectures and training, finding models that exceeded baseline performance at comparable size, as well as models over 20% smaller at near-baseline performance. The best models found had 4-6 layers and maintained roughly 67% or better exact match on SQuAD 2.0.
How does Distillation work?
[Diagram: the same training data is fed to a trained teacher model and an untrained student model. The student is trained against two objectives: a soft-target loss matching the teacher's output distribution and a hard-target loss matching the ground-truth labels. The result is a trained student model.]
For more on distillation: Hinton et al. (2015); Intel's overview.
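Concretely, the student minimizes a weighted sum of the two losses in the diagram. A minimal PyTorch sketch, assuming a softmax temperature T and a mixing weight alpha (both names are illustrative, not taken from the slides):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target loss: KL divergence between temperature-softened student
    # and teacher distributions (Hinton et al. 2015). The T**2 factor keeps
    # gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # Hard-target loss: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # alpha trades off imitating the teacher vs. fitting the labels.
    return alpha * soft + (1.0 - alpha) * hard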
Distilling BERT for Question Answering
[Diagram: the same setup specialized for QA. BERT, pre-trained for language modeling and then fine-tuned on SQuAD 2.0, serves as the teacher. The student model trains on SQuAD 2.0 with a soft-target loss against the teacher's outputs and a hard-target loss against the ground-truth answers, producing the trained student model.]
For more on distillation: Hinton et al. (2015); DistilBERT.
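For extractive QA the same loss applies twice, once each to the start- and end-position logits that BERT-style QA heads produce. A hedged sketch of one training step, reusing distillation_loss from above (batch field names follow Hugging Face's SQuAD conventions, which is an assumption here):

import torch

def qa_distillation_step(teacher, student, batch, T, alpha):
    # The fine-tuned teacher is frozen; only the student receives gradients.
    with torch.no_grad():
        t_out = teacher(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"])
    s_out = student(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"])
    # Distill the per-token start and end logits separately, then average.
    loss_start = distillation_loss(s_out.start_logits, t_out.start_logits,
                                   batch["start_positions"], T, alpha)
    loss_end = distillation_loss(s_out.end_logits, t_out.end_logits,
                                 batch["end_positions"], T, alpha)
    return (loss_start + loss_end) / 2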
Defining the student model
[Diagram: the student model is defined by its architecture parameters and initialized with pre-trained model weights from DistilBERT, which is pre-trained for language understanding on BookCorpus and English Wikipedia.]
Sources: DistilBERT, Toronto Book Corpus, English Wikipedia, SigOpt.
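A minimal sketch of building such a student with Hugging Face transformers, assuming the tunable architecture parameters include layer count, head count, and dropout rates (the exact set tuned in this work is not listed on this slide):

from transformers import DistilBertConfig, DistilBertForQuestionAnswering

def build_student(n_layers, n_heads, dropout, attention_dropout):
    # Architecture parameters proposed by the optimizer define the student.
    config = DistilBertConfig(
        n_layers=n_layers,
        n_heads=n_heads,
        dropout=dropout,
        attention_dropout=attention_dropout,
    )
    # Initialize from DistilBERT's pre-trained weights; modules absent from
    # or mismatched with the checkpoint are freshly initialized.
    return DistilBertForQuestionAnswering.from_pretrained(
        "distilbert-base-uncased", config=config
    )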
What is the Baseline?
[Diagram: the baseline is the same distillation pipeline with the unmodified DistilBERT architecture as the student. BERT fine-tuned on SQuAD 2.0 is the teacher; the student trains on SQuAD 2.0 with soft- and hard-target losses, producing a trained DistilBERT.]
For more on distillation: Hinton et al. (2015); DistilBERT.
What are our metrics?
Two competing objectives: minimize model size and maximize model performance.
Baseline exact match: 67.07%
Baseline parameter count: 66.3M
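Both metrics are cheap to score per trained student. A small sketch (assumed, not from the slides); exact match uses the standard SQuAD answer normalization:

import re
import string

def model_size(model):
    # The size metric: total parameter count (66.3M for baseline DistilBERT).
    return sum(p.numel() for p in model.parameters())

def normalize(text):
    # Standard SQuAD normalization: lowercase, drop punctuation and articles.
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # SQuAD 2.0 marks unanswerable questions with an empty answer list; the
    # official script then compares against the empty string.
    golds = gold_answers or [""]
    return max(int(normalize(prediction) == normalize(g)) for g in golds)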
Metric Threshold: Dealing with dataset characteristics
The same two objectives, now with SigOpt's Metric Threshold applied to performance: the baseline exact match of 67.07% becomes a threshold, steering the search toward smaller models (baseline parameter count: 66.3M) that still meet baseline accuracy.
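A hedged sketch of how such an experiment might be created with the SigOpt Python client (parameter names and the observation budget are illustrative, not the presenters' exact setup):

from sigopt import Connection

conn = Connection(client_token="SIGOPT_API_TOKEN")  # assumed credential
experiment = conn.experiments().create(
    name="DistilBERT compression for SQuAD 2.0",
    parameters=[
        dict(name="n_layers", type="int", bounds=dict(min=1, max=6)),
        dict(name="n_heads", type="int", bounds=dict(min=1, max=12)),
        # ... the remaining training, architecture, and distillation parameters
    ],
    metrics=[
        # The threshold keeps the search focused on models at or above the
        # baseline exact match of 67.07% while size is minimized.
        dict(name="exact", objective="maximize", threshold=67.07),
        dict(name="num_parameters", objective="minimize"),
    ],
    observation_budget=200,  # assumed budget
)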
What are we tuning?
● 9 model training parameters: SGD parameters, batch size, warm-up, weight initialization
● 6 model architecture parameters: number of layers and attention heads, pruning, dropouts
● 3 distillation parameters: temperature and loss function weights
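A hedged sketch of what this 18-parameter search space could look like in SigOpt's parameter format (names and bounds are illustrative; the slide gives only the categories and counts):

parameters = [
    # Model training parameters (9 on the slide), for example:
    dict(name="learning_rate", type="double", bounds=dict(min=1e-6, max=1e-3)),
    dict(name="batch_size", type="int", bounds=dict(min=8, max=64)),
    dict(name="warmup_steps", type="int", bounds=dict(min=0, max=1000)),
    # Model architecture parameters (6 on the slide), for example:
    dict(name="n_layers", type="int", bounds=dict(min=1, max=6)),
    dict(name="n_heads", type="int", bounds=dict(min=1, max=12)),
    dict(name="dropout", type="double", bounds=dict(min=0.0, max=0.5)),
    dict(name="attention_dropout", type="double", bounds=dict(min=0.0, max=0.5)),
    # Distillation parameters (3 on the slide):
    dict(name="temperature", type="double", bounds=dict(min=1.0, max=10.0)),
    dict(name="alpha_soft", type="double", bounds=dict(min=0.0, max=1.0)),
    dict(name="alpha_hard", type="double", bounds=dict(min=0.0, max=1.0)),
]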
The Optimization Cycle
[Diagram: the optimization cycle. SigOpt proposes the student model's architecture and training parameters along with the distillation parameters; the student is distilled on SQuAD 2.0 from BERT fine-tuned on SQuAD 2.0; the resulting validation accuracy and model size are reported back to the optimizer to close the loop.]
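This cycle maps directly onto SigOpt's suggestion/observation loop. A hedged sketch continuing the experiment created above; build_student and model_size are the illustrative helpers sketched earlier, while distill and evaluate stand in for a training loop built on qa_distillation_step and a SQuAD 2.0 dev-set scorer (all assumptions, not the presenters' code):

for _ in range(experiment.observation_budget):
    # 1. SigOpt suggests architecture, training, and distillation parameters.
    suggestion = conn.experiments(experiment.id).suggestions().create()
    params = suggestion.assignments

    # 2. Build the student and distill it from BERT fine-tuned on SQuAD 2.0.
    student = build_student(params["n_layers"], params["n_heads"],
                            params["dropout"], params["attention_dropout"])
    distill(teacher, student, squad_train, params)  # assumed training routine

    # 3. Report validation exact match and parameter count back to SigOpt.
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        values=[
            dict(name="exact", value=evaluate(student, squad_dev)),  # assumed
            dict(name="num_parameters", value=model_size(student)),
        ],
    )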
Choose the model architecture that meets your needs
[Pareto chart: maximize performance vs. minimize size, with three highlighted models relative to the baseline: +3.45% performance at +0.09% size; -0.25% performance at -22.47% size; +3.19% performance at -1.69% size.]
Some architecture options
[Chart: the same trade-off chart as the previous slide, with each highlighted model annotated by its architecture:]
● 4 layers, 11 attention heads; no dropout, raised temperature, soft target loss weighted more
● 6 layers, 11 attention heads; no dropout, low temperature, almost all soft target loss
● 6 layers, 12 attention heads; no dropout, raised temperature, soft target loss weighted more
Why does it matter?
By using Multimetric Bayesian Optimization, we're able to easily understand the trade-offs made during compression. By understanding these trade-offs, we're able to choose a model architecture that best suits our needs.
Learn more about SigOpt
● Read our research and product blog.
● See more videos on our YouTube channel.
● Get free beta access to Experiment Management by joining the beta.
● Upcoming webinar: Introducing Experiment Management, Thursday, July 9 at 10am PT.