Serving BERT in production
https://bit.ly/pytorch-workshop-2021
Image Ref
Agenda
1. Machine Learning at Walmart
2. Review some DL concepts + Hands-on
3. Optimizing the model for production + Hands-on
4. Break
5. Deploying the model + Hands-on
6. Lessons Learned
7. Q/A
About Us
Walmart: ML Engineers on the Search team
Nidhin Pattaiyil
Adway Dhillon
Machine Learning at Walmart Search
- SpellCheck
- Typeahead
- Related Search / Guided Navigation
- Item & SKU Understanding
- Query Understanding (product, gender, age, size and other attributes)
- Personalized Ranking
ML in Search: Typeahead / Guided Navigation
ML in Search: Query / Item Understanding
Query: Nike Men Shoes
Brand: NIKE
Gender: Men
Product Type:
- Athletic Shoes : 99%
- Casual and Dress Shoes: 55 %
Item
Brand: Nike
Color: Black / White
Shoe Size: 7.5M
What were the requirements?
- Current production models are written in TensorFlow and served via the Java Native Interface
- Serve large models like BERT
- Support PyTorch / ONNX
- SLA < 40 ms
Intro
Setup
- Repo:
- https://bit.ly/pytorch-workshop-2021
Dataset
- Amazon Berkeley Objects (ABO) Dataset (from UC Berkeley and Amazon)
Dataset
- Amazon Berkeley Objects (ABO) Dataset (from Amazon and UC Berkeley)
Demo
Review Some Deep Learning Concepts
Models before BERT
- Word Embedding Layer (Word2Vec, GloVe, ELMo)
- Bidirectional LSTM
Cons:
- Slow to train: RNN cells are sequential due to recurrence
- Not deeply bidirectional, so context is lost: left-to-right and right-to-left contexts are learnt separately and then concatenated
- Transformer models help alleviate these issues
Image Ref
BERT
Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Some Use Cases:
- Neural Machine Translation
- Question Answering
- Sentiment Analysis
Problems Solved:
- Contextualized embeddings
- Can be trained on a large corpus in a semi-supervised manner
- General language model trained on masked-word prediction
- Tokenizer handles out-of-vocabulary words
Architecture:
- Attention is all you need
- Stacked Transformer encoder model
BERT: Different Architectures
For environments with limited compute resources, the BERT recipe has different model sizes with different
attention heads and layers of complexity. The following table shows some of them with their corresponding GLUE
scores.
Model       | H (hidden size) | L (layers) | SST-2 | QQP
BERT Tiny   | 128             | 2          | 83.2  | 62.2/83.4
BERT Mini   | 256             | 4          | 85.9  | 66.4/86.2
BERT Small  | 512             | 4          | 89.7  | 68.1/87.0
BERT Medium | 512             | 8          | 89.6  | 69.6/87.9
Bi-LSTM vs BERT
Bi-LSTM Predictions:
- Mattress: 0.19
- Mattress Encasements: 0.10
- Bed Sheets: 0.05
BERT Predictions:
- Bed Sheets: 0.72
- Bedding Sets
- Bed-in-a-Bag: 0.26
Query: bedding for twin xl mattress
F1 Score improved from 0.75 to 0.91
Hands-on
https://bit.ly/pytorch-workshop-2021
Notebook:
- 02_inference_review.ipynb
- 02_timing.ipynb
Tokenization (pre-processing)
- BERT uses the WordPiece tokenizer
- A token is a word or a piece of a word
- Vocab size is smaller than GloVe (~30k vs ~300k)
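For illustration, a minimal WordPiece tokenization sketch with the Hugging Face tokenizer (the checkpoint name and example queries are placeholders, not the workshop's exact code):

from transformers import AutoTokenizer

# Load a pretrained WordPiece vocabulary (bert-base-uncased is illustrative)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or unseen words are split into sub-word pieces prefixed with "##"
print(tokenizer.tokenize("bedding for twin xl mattress"))

# Calling the tokenizer adds special tokens and returns input ids + attention mask
encoded = tokenizer("nike men shoes", return_tensors="pt")
print(encoded["input_ids"], encoded["attention_mask"])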
Sample Training Code (using HuggingFace)
Sample Training Code (using HuggingFace)
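The training code shown on these slides is not reproduced here; a rough sketch of fine-tuning a BERT classifier with the Hugging Face Trainer might look like the following (the checkpoint name, num_product_types, train_ds and eval_ds are placeholders):

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_product_types)   # one label per product type

args = TrainingArguments(
    output_dir="bert-product-type",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds,   # tokenized (query, product type) examples
                  eval_dataset=eval_ds)
trainer.train()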
Sample Inference Code
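Similarly, a hedged inference sketch using the fine-tuned model and tokenizer from above:

import torch

model.eval()
inputs = tokenizer("bedding for twin xl mattress", return_tensors="pt",
                   truncation=True, max_length=64)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
print(probs.argmax(dim=-1))   # index of the predicted product type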
Optimizing the model for production
Post Training Optimization: Quantization
- Techniques for performing computations and storing tensors at lower bit widths than floating-point precision; reduces memory use and speeds up compute
- Hardware support for INT8 computation is typically 2 to 4 times faster than FP32 compute
Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT
Post Training Optimization: Quantization
- Dynamic Quantization:
- Weights are quantized
- Activations are quantized on the fly for int8 compute (no memory savings)
- However, the activations are read from and written to memory in floating-point format
Dynamic Quantization in PyTorch
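A minimal dynamic-quantization sketch in PyTorch, quantizing only the nn.Linear layers of an already-loaded model (the variable name model is assumed):

import torch

# Quantize the weights of nn.Linear layers to int8; activations are quantized on the fly
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)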
Post Training Optimization: Quantization
- Static Quantization:
- Weights and activations are quantized using calibration data after training
- Quantized values are passed between operations instead of being converted to floats and back to ints between every operation, resulting in a significant speed-up
- Quantization Aware Training:
- Quantization error becomes part of the training error, so training learns to reduce it
- Leads to the greatest accuracy
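A rough sketch of eager-mode static quantization with a calibration pass (the fbgemm backend choice and calibration_loader are assumptions; real models also need QuantStub/DeQuantStub around the quantized region):

import torch

model.eval()
# fbgemm is the x86 server backend for quantized kernels
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)          # insert observers

with torch.no_grad():
    for batch in calibration_loader:                  # small, representative dataset
        prepared(**batch)                             # observers record activation ranges

quantized = torch.quantization.convert(prepared)      # int8 weights and activations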
Quantization results
Model        | FP32 accuracy | INT8 accuracy | Strategy                    | Inference performance gain | Notes
BERT         | 0.902         | 0.895         | Dynamic                     | 1.8x (581 ms -> 313 ms)    | Single thread; 1.6 GHz
Resnet-50    | 76.9          | 75.9          | Static                      | 2x (214 -> 103)            | Single thread; 1.6 GHz
Mobilenet-v2 | 71.9          | 71.6          | Quantization Aware Training | 5.7x (97 -> 17)            | Samsung S9
Reference: Introduction to Quantization on PyTorch ; pytorch.org
Post Training Optimization: Distillation
- Larger models have higher knowledge capacity than small models, but not all of it may be used
- Transfer knowledge from a trained large model to a smaller one
Image Reference: https://towardsdatascience.com/knowledge-distillation-simplified-dd4973dbc764
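As an illustration (not the exact setup of the paper cited on the next slide), a typical distillation loss mixes a hard-label cross-entropy term with a soft-label term on temperature-scaled teacher logits:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard targets: usual cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard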
Distillation Results
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, 2018, Tang et al.
Model            | SST-2 | QQP
BERT (Large)     | 94.9  | 72.1
BERT Base        | 93.5  | 71.2
ELMo             | 90.4  | 63.1
BiLSTM           | 86.7  | 63.8
Distilled BiLSTM | 90.7  | 68.2
Eager Execution vs Script Mode
Eager Execution
- Good for training / prototyping / debugging
- Requires a Python process
- Slow for production use cases
- Flexible; can apply any modeling technique; easy to change
- Can use any library in the Python ecosystem
Script Mode
- Model runs in graph execution mode; better optimized for speed
- Only needs a C++ runtime, so it can be loaded from other languages
- TorchScript is an optimized intermediate representation of a PyTorch model
- Doesn't support everything that Python / eager mode supports
TorchScript JIT: Tracing vs Scripting
Tracing
- Does not support Python data structures or control flow in the model
- Requires a sample input
Scripting
- Supports custom control flow and Python data structures
- Only requires the model
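For illustration, a minimal tracing sketch (the checkpoint name, dummy input shapes, and torchscript=True loading are assumptions, not the workshop's exact code):

import torch
from transformers import AutoModelForSequenceClassification

# torchscript=True makes the model return plain tuples, which tracing requires
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True)
model.eval()

input_ids = torch.randint(0, 30000, (1, 64))            # dummy token ids
attention_mask = torch.ones(1, 64, dtype=torch.long)    # dummy mask
traced = torch.jit.trace(model, (input_ids, attention_mask))
traced.save("model_traced.pt")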
TorchScript: Script
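The code shown on this slide is not reproduced here; a hedged sketch of scripting a module with data-dependent control flow (the module itself is a toy example):

import torch

class ThresholdPostProcess(torch.nn.Module):
    # Toy post-processing with an if/else branch; tracing would bake in one path,
    # scripting preserves both
    def forward(self, logits: torch.Tensor, threshold: float) -> torch.Tensor:
        probs = torch.softmax(logits, dim=-1)
        if float(probs.max()) < threshold:
            return torch.zeros_like(probs)   # nothing is confident enough
        return probs

scripted = torch.jit.script(ThresholdPostProcess())
scripted.save("postprocess_scripted.pt")
print(scripted.code)   # inspect the generated TorchScript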
TorchScript Timing
- Inference time for the TorchScript model, CPU vs GPU
Runtime     | CPU (ms)         | GPU (ms)
PyTorch     | 1339             | 460
TorchScript | 768 (42% faster) | 360 (21% faster)
Benchmarking Transformers: PyTorch and TensorFlow from HuggingFace
Full Table
Hands-on: Optimizing the model
- Quantizing the model
- Converting the BERT model with TorchScript
https://bit.ly/pytorch-workshop-2021
Notebook: 03_optimizing_model
Deploying the model
Options for deploying PyTorch model
- Plain Flask / FastAPI
- Model Servers: TorchServe / Nvidia Triton / TensorFlow Serving
Benefits of TorchServe
- Optimized for serving models
- Configurable support for CUDA and GPU environments
- Multi-model serving
- Model versioning for A/B testing
- Server-side batching
- Support for pre- and post-processing
Packaging a model / MAR
- Torch Model Archiver: a utility from the PyTorch team
- Packages model code and artifacts into one .mar file
- Can include Python model-specific dependencies
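For example, a hedged archiver invocation (the model name and file names are placeholders):

torch-model-archiver --model-name product_type_bert \
  --version 1.0 \
  --serialized-file model.pt \
  --handler handler.py \
  --extra-files "index_to_name.json,vocab.txt" \
  --export-path model_store

The resulting product_type_bert.mar in model_store is what torchserve loads in the serving step below.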
PyTorch Base Handler
Model Serving Handler
● Acts as a layer on top of the BaseHandler provided by TorchServe
● All common initialization, pre-processing, error handling and metric collection steps can be abstracted into this custom handler class
● Any model-specific logic can be added in the child class that inherits from the custom handler class
● This provides a resilient wrapper that can be extended and used by multiple teams to serve models through TorchServe, with minimal config and code changes
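A minimal custom-handler sketch on top of TorchServe's BaseHandler (the tokenizer loading and the assumption that the model returns logits first are illustrative, not the exact handler used in the workshop):

import torch
from ts.torch_handler.base_handler import BaseHandler

class QueryClassifierHandler(BaseHandler):
    def initialize(self, context):
        super().initialize(context)   # BaseHandler loads the serialized model onto the device
        from transformers import AutoTokenizer
        # Assumption: tokenizer files were packaged into the MAR via --extra-files
        self.tokenizer = AutoTokenizer.from_pretrained(context.system_properties["model_dir"])

    def preprocess(self, data):
        queries = [row.get("data") or row.get("body") for row in data]
        queries = [q.decode("utf-8") if isinstance(q, (bytes, bytearray)) else q for q in queries]
        return self.tokenizer(queries, padding=True, truncation=True, return_tensors="pt")

    def inference(self, inputs):
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            output = self.model(inputs["input_ids"], inputs["attention_mask"])
        return output[0]   # assumption: logits come first (eager or traced HF models)

    def postprocess(self, logits):
        return torch.softmax(logits, dim=-1).tolist()   # one probability list per request item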
Built-in handlers
- TorchServe ships with handlers for these use cases:
- image_classifier
- image_segmenter
- object_detector
- text_classifier
- Just provide the model file and an index_to_name.json
Serving
torchserve --ts-config config.properties --start --model-store model_store
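A small config.properties sketch with commonly used keys (the values shown are illustrative):

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=model_store
load_models=all
default_workers_per_model=2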
APIs
A. Management API: http://localhost:8081
List models:
- curl "http://localhost:8081/models"
Get details on a model:
- curl http://localhost:8081/models/model_name
B. Inference API: http://localhost:8080
Inference:
curl -X POST http://localhost:8080/predictions/model_name
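The Management API can also register and scale models at runtime, for example (the MAR file name and worker counts are illustrative):

# Register a MAR file from the model store and start workers
curl -X POST "http://localhost:8081/models?url=product_type_bert.mar&initial_workers=2"
# Scale workers for an already-registered model
curl -X PUT "http://localhost:8081/models/product_type_bert?min_worker=4"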
Hands-on: Deploying the Model
- Package the given model using the Torch Model Archiver
- Write a custom handler to support pre-processing and post-processing
https://bit.ly/pytorch-workshop-2021
Notebook: 04_packaging
Lessons learned
Lessons Learned
- Distilled BERT base with quantization and other optimizations can be served on CPU and meet the < 40 ms SLA
- Distilled BERT base on an NVIDIA T4 can hit < 10 ms under peak traffic
- Throughput of 1500 RPS while staying under the SLA
Q/A
Check out Walmart Global Tech for openings for Software Engineers, Machine
Learning Engineers and Data Scientists: https://careers.walmart.com/technology

Editor's Notes

  1. Intro (10 mins): Setup. Introduce the deep learning BERT model - the classification models at Walmart. Walk over the notebooks on Google Colab. Show the end model served.
  2. Review Some Deep Learning Concepts (10 mins): Review sample trained PyTorch model code. Review the sample model's transformer architecture. Tokenization / pre- and post-processing.
  3. https://jalammar.github.io/illustrated-bert/
  4. Transition: That was a brief view of the HF ecosystem and how to run inference with a trained model. I hope you saw that large models have a pretty big compute cost; let's look at strategies used for reducing inference time. We will look into 3 optimization techniques.
  5. Break off here for a hands-on.
  6. Transition: The first technique we will look at is quantization.
  7. Content: When we trained our model, we trained it with 32-bit precision floats. For inference, do we need such precision? Most of the time, we can convey the same information with 16-bit or 8-bit ints; not only do we reduce the storage / memory requirements, but on most hardware int8 computations are faster. In the image, the top half shows the full range, which we transform to the range below. Transition: In PyTorch, we have 3 different types.
  8. Content: 3 main quantization techniques. In all quantization techniques, you quantize the weights. In dynamic quantization, the activations are quantized dynamically on the fly for fast int8 computation, but read/written back to memory as floats. Transition: Let's look at the other two techniques.
  9. Content: In DQ, we didn't statically quantize the activations. In SQ, we quantize the activations by giving the model calibration data; because the activations are also quantized, we avoid the constant float-to-int conversion, which further speeds inference. If you know you will need quantization, you can optimize for it while training by taking quantization error as part of the training error, which leads to the best accuracy. Transition: Let's look at the results of these techniques in practice.
  10. Content: BERT: a 0.5% accuracy drop, but we get a 1.8x speedup (581 ms -> 313 ms). ResNet: we saw a 1% accuracy drop but got a 2x speedup with static quantization. MobileNet: saw a 0.3% drop with quantization-aware training and a 5x improvement when running on a mobile phone, where floating-point calculations are slower. Quantization is a powerful technique, especially in non-GPU environments. Transition: The next optimization we will look at is distillation.
  11. Content: Large models like BERT have a large knowledge capacity, but do you need it all for your problem? You might have a smaller model like a Bi-LSTM. What if you could teach the smaller model to learn from the larger model's predictions and loss? Transition: Let's see the results of this in practice.
  12. Content: On the Stanford sentiment task, BERT base got an accuracy of 93.5%. A regular Bi-LSTM model only got 86.7%. After distillation, the accuracy jumped by 4%. Transition: Let's look at the last optimization technique.
  13. Content: One of the reasons PyTorch became popular is that it runs in eager execution mode by default. Eager mode gives immediate feedback from operations on our layers and great debugging support, and you can use any library in Python. But this flexibility makes it a bit slow for production and ties it to a Python runtime. Script mode: in script mode, you create an optimized execution graph of your model that is optimized for speed. Read points. Transition: Let's learn how we can convert a model to script mode using PyTorch's just-in-time compiler.
  14. Content: The PyTorch just-in-time compiler supports two modes: tracing and scripting. Read points. Transition: Let's see a code snippet.
  15. Content: Because we have an if/else block, we need to use script mode. Transition: Let's see how much improvement we got.
  16. Content: On CPU, we see a 42% improvement. On GPU, we got a 21% speedup. Transition: We covered three techniques; let's explore them in a lab.
  17. Package the given model using the Torch Model Archiver. Write a custom handler to support pre-processing and post-processing.