2. ➤ 10x AWS Certifications, including SA Pro, DevOps Pro and the Machine Learning Specialty.
➤ Visionary in MLOps: has delivered production ML workloads at scale, including 1,500 inferences per minute with active monitoring and alerting
➤ Contributes to the AWS Community by speaking at several summits,
community days and meet-ups.
➤ Regular blogger, open-source contributor, and SME on Machine
Learning, MLOps, DevOps, Containers and Serverless.
➤ Experienced principal solutions architect and lead developer with over 6 years of AWS experience. He has been responsible for running production workloads at anywhere from 200 to over 18,000 requests per second
WHO AM I?
Phil Basford
phil@inawisdom.com
@philipbasford
Phil B#4237
3. Inference types
ML OPS – INFERENCE TYPES
Real Time
➤ Business critical; common uses are chat bots, classifiers, recommenders or linear regressors, e.g. credit risk, journey times, etc.
➤ Hundreds or thousands of individual predictions per second
➤ API Driven with Low Latency, typically
below 135ms at the 90th percentile.
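A latency target like that 90th-percentile figure can be checked with a small helper. This is a minimal sketch, assuming `predict` stands in for whatever endpoint call is being measured; the name `p90_latency_ms` is illustrative, not from the talk:

```python
import statistics
import time

def p90_latency_ms(predict, payloads):
    """Invoke `predict` once per payload and return the
    90th-percentile latency in milliseconds."""
    latencies = []
    for payload in payloads:
        start = time.perf_counter()
        predict(payload)
        latencies.append((time.perf_counter() - start) * 1000.0)
    # quantiles(..., n=10) yields 9 cut points; the last one is p90.
    return statistics.quantiles(latencies, n=10)[-1]
```

In practice the same percentile is usually read off CloudWatch metrics rather than measured client-side, but the helper is handy in load-test scripts.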
Near Real Time
➤ Commonly used for image classification or
file analysis
➤ Hundreds of individual predictions per minute, with processing completed within seconds
➤ Event or Message Queue based,
predictions are sent back or stored
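The queue-based pattern above can be sketched as a worker that drains messages, scores each one, and hands the result to a store. This is a hedged sketch: the `{'id', 'payload'}` message shape and the `drain_and_predict` name are illustrative, and in production the queue would be SQS or similar rather than an in-process queue:

```python
import queue

def drain_and_predict(q, predict, store):
    """Drain the message queue, score each message, and persist the
    prediction via `store`. Returns the number of messages handled."""
    handled = 0
    while True:
        try:
            msg = q.get_nowait()
        except queue.Empty:
            return handled
        store(msg["id"], predict(msg["payload"]))
        handled += 1
```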
Occasional
➤ Examples are simple classifiers like Tax
codes
➤ Only a few predictions a month, and processing needs to be completed within minutes
➤ API, Event or Message Queue based, predictions sent back or stored
Batch
➤ End of month reporting, invoice
generation, warranty plan management
➤ Runs at Daily / Monthly / Set Times
➤ The data set is typically millions or tens of
millions of rows at once
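For data sets of that size, the batch job is usually streamed through the model in chunks rather than loaded whole. A minimal sketch, assuming a CSV input and a `predict` callable that scores a list of rows (the `score_file` name and 10,000-row default are illustrative):

```python
import csv

def score_file(in_path, out_path, predict, chunk_size=10_000):
    """Stream a large CSV through the model in fixed-size chunks so a
    multi-million-row run never holds the whole file in memory."""
    with open(in_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                writer.writerows(predict(chunk))
                chunk = []
        if chunk:  # score the final partial chunk
            writer.writerows(predict(chunk))
```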
Micro Batch
➤ Anomaly detection, invoice
approval and Image processing
➤ Executed regularly: every X minutes or every Y events; triggered by file upload or data ingestion
➤ The data set is typically hundreds
or thousands of rows at once
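The "every X minutes or Y events" trigger above amounts to a buffer that flushes on either count or age. A minimal sketch, with `MicroBatcher` and its defaults invented for illustration:

```python
import time

class MicroBatcher:
    """Buffer incoming events and flush them to the model when the
    buffer reaches `max_events` records or `max_age_s` seconds."""

    def __init__(self, predict, max_events=500, max_age_s=60.0):
        self.predict = predict
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.buffer = []
        self.opened = None  # time the current buffer was started

    def add(self, event):
        """Add one event; returns predictions on flush, else None."""
        if not self.buffer:
            self.opened = time.monotonic()
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_events
                or time.monotonic() - self.opened >= self.max_age_s):
            return self.flush()
        return None

    def flush(self):
        batch, self.buffer = self.buffer, []
        return self.predict(batch)
```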
Edge
➤ Used for Computer Vision, Fault Detection
in Manufacturing
➤ Runs in mobile phone apps and on low power devices. Uses sensors (e.g. video, location, or heat)
➤ Model output is normally sent back to the
Cloud at regular intervals for analysis.
4. Fargate
OPTION 1
➤ Supports Batch and Realtime
➤ Low Latency (<100ms)
➤ Supports CPU only, not GPU (can step down to full ECS)
➤ Pay Per Hour
➤ Application Auto Scaling
➤ Runs Docker and full native support
➤ Not integrated with notebooks or the SageMaker SDK.
➤ No Model Monitor support (which records predictions)
➤ Requires you to build your own images or
a deep learning container
➤ Memory and GPU Limits (can step it down
to full ECS)
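Since Fargate makes you build your own serving image, the container entry point can be very small. A stdlib-only sketch, assuming a JSON-over-HTTP contract; the `predict` stub and port are placeholders, not the speaker's actual service:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Stand-in model: a real image would load a trained artifact here.
    return {"score": sum(features)}

class InferenceHandler(BaseHTTPRequestHandler):
    """Minimal JSON prediction handler for a custom container image."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        out = json.dumps(predict(json.loads(body)["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(out)))
        self.end_headers()
        self.wfile.write(out)

    def log_message(self, *args):
        pass  # keep container logs quiet in this sketch

def serve(port=8080):
    """Blocking entry point, used as the container's CMD."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

A production image would normally use a proper WSGI/ASGI server instead, but the request/response shape is the same.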
5. SageMaker : Endpoints and Batch Transforms
➤ Supports Batch and Realtime
➤ Built-in algorithms, framework and BYOM (bring your own model) support
➤ Low Latency (<100ms)
➤ Supports CPU and GPU
➤ Pay Per Hour (only recently added to Savings Plans)
➤ Application Auto Scaling
➤ Runs Docker and full native support
➤ One click Deployment: Integration with
SageMaker Studio and Notebook support
via SDK.
➤ Model Monitor support (records
predictions)
➤ No resource limits
OPTION 2
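Calling a deployed SageMaker endpoint from application code is a single boto3 call. A hedged sketch: the client is injected so it can be stubbed in tests, the endpoint name is hypothetical, and the `{"instances": [...]}` payload shape depends on your serving container:

```python
import json

def invoke(runtime, endpoint_name, features):
    """Score one record against a SageMaker real-time endpoint.
    `runtime` is a boto3 'sagemaker-runtime' client."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"instances": [features]}),
    )
    # The response Body is a stream; read and decode the JSON result.
    return json.loads(response["Body"].read())
```

With a real client this would be `invoke(boto3.client("sagemaker-runtime"), "my-endpoint", [1.0, 2.0])`.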
6. Lambda
OPTION 3
➤ Simple
➤ Supports only Realtime, or micro batch (15-minute execution limit)
➤ Low Latency (<100ms)
➤ Supports CPU only, not GPU
➤ Pay Per Request
➤ Scales on concurrency
➤ Savings Plans
➤ *Custom image: runs Docker with full native support
➤ Not integrated with notebooks or the SageMaker SDK.
➤ No Model Monitor support (which records predictions)
➤ Memory and GPU Limits
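A Lambda inference function typically loads the model at init time so warm invocations skip that cost. A minimal sketch, assuming an API Gateway proxy event; `MODEL` is a stand-in for deserialising an artifact bundled into the deployment package or container image:

```python
import json

# Loaded once per execution environment (cold start), reused while warm.
MODEL = lambda features: {"score": sum(features)}

def handler(event, context):
    """Lambda entry point for a real-time prediction request."""
    features = json.loads(event["body"])["features"]
    return {"statusCode": 200, "body": json.dumps(MODEL(features))}
```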