AI has made tremendous progress over the past decade, with many advancements building on fundamental research from decades ago. Accelerating the pipeline from research to commercialization is daunting because scaling technologies in the real world poses many challenges beyond the theoretical work done in the lab. Qualcomm AI Research has taken on the task of not only generating novel AI research but also being first to demonstrate proof-of-concepts on commercial devices, enabling technology to scale in the real world. This presentation covers:
The challenges of deploying cutting-edge research on real-world mobile devices
How Qualcomm AI Research is solving system and feasibility challenges with full-stack optimizations to quickly move from research to commercialization
Examples where Qualcomm AI Research has had industrial or academic firsts
AI firsts: Leading from research to proof-of-concept
1. Jilei Hou
Vice President, Engineering
Qualcomm Technologies, Inc.
San Diego March 15, 2022
@QCOMResearch
AI firsts:
Leading from research
to proof-of-concepts
2. 2
Today’s
Agenda
The importance of full-stack AI research
A broad spectrum of AI firsts by
Qualcomm AI Research in both
research and proof-of-concept
Our future AI research directions
and next potential AI firsts
Questions?
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
3. 3
Power efficiency
Model design, compression,
quantization, algorithms,
efficient hardware,
software tools
Efficient learning
Robust learning through
minimal data,
unsupervised learning,
on-device learning
On-device learning
Continuous learning,
contextual, always-on,
privacy-preserved,
distributed learning
Advancing AI
research to make
edge AI ubiquitous
A platform to scale AI
across the industry
Action
Reinforcement
learning for
decision making
Perception
Object detection,
speech recognition,
contextual fusion
Reasoning
Scene understanding,
language understanding,
behavior prediction
Cloud | Edge cloud | Automotive | IoT/IIoT | Mobile/XR
4. 4
Leading machine learning
research for edge AI
across the entire spectrum of topics
Bayesian
distributed learning
Graph and kernel
optimization
Federated learning
Deep learning
for 3D/geometry
Audio and video
compression
AI for wireless
& RF sensing
Energy-efficient
perception
AI for
chip design
On-device learning
Quantum AI
Deep generative models
G-CNN
Self-supervised learning
Reinforcement learning
Causality & system-2
Deep learning for graphics
Video recognition and prediction
Fingerprint
Voice UI
Model quantization, compression, & NAS
HW-SW co-design
Compute-in-memory
Power management
AI Model Efficiency
Toolkit (AIMET)
Platform research
Applied
research
Fundamental
research
Visual quality improvement
5. 5
Vision
Identify a problem
or need; establish
requirements
Ecosystem
collaboration
Collaborate and
drive the ecosystem
toward rapid
commercialization
at scale
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Full stack
AI research
Model, hardware, and software
innovation across each layer
to accelerate AI applications
Early R&D and
technology inventions
essential to leading
the ecosystem forward
Transfer tech to commercial
teams and influence future
research with learnings
from deployment
~2-3
years
Model
quantization &
optimization
Develop tech & tools
to quantize weights and
modify architecture to run
efficiently on hardware
Software
compilation
Develop tech & tools
to improve graph-level
and kernel-level software
compilation performance
Proof of concept
Target teams integrate models
into final application for stable and
intuitive demonstration
Invention
Invent new methods
that set state-of-the-art
6. 6
SOTA: State-of-the-art; Cityscapes Benchmark: https://www.cityscapes-dataset.com/
Federated learning
Video semantic segmentation
Model quantization
On-device learning
Invented the best
techniques for fast
deployment of
8-bit quantization
Best power-efficiency
toolkit in the industry
Invented continuous
learning techniques for
SOTA on-device voice-UI
First demonstration
of 30% improvement
to keyword spotting
Invented methods
for combining
differential privacy
and compression
First end-to-end research
software framework
deployable on mobile
Topped the Cityscapes
leaderboard with loss
function innovation for
boundary-awareness
First real-time semantic
segmentation at FHD on mobile
Brought to you
by Qualcomm
AI Research
AI Firsts
Video super resolution
Neural video compression
Group equivariant CNN
AI for wireless
Pioneer for
rotational
equivariance; best
paper at ICLR’18
First G-CNN
segmentation for health
on mobile
Invented neural
augmentation to
enhance physical
layer algorithms
First weakly supervised
method for real-world
passive RF sensing
Full stack optimization
for visual quality
improvement at
4K resolution
Invented instance-adaptive
compression for SOTA
performance & new
deployment scenarios
First real-time HD
decoding on mobile
First 4K SR at 100+
FPS on mobile
7. 7
Source: Welling
Deep neural networks are energy hungry and growing fast
AI is being powered by the explosive growth of deep neural networks
Chart: weight parameter count (log scale, 10^0 to 10^14) versus year (1940 to 2030)
1943: First NN (~10 parameters)
1988: NetTalk (~20K)
2009: Hinton’s Deep Belief Net (~10M)
2013: Google/Y! (~1B)
2017: Very large neural networks (N = 137B)
2021: Extremely large neural networks (N = 1.6T)
2025 (projected): N = 100T = 10^14
Will we have reached the capacity of the human brain?
Energy efficiency of the human brain is estimated to be 100,000x better than current hardware
8. 8
1: FP32 model compared to quantized model
Leading
research to
efficiently
quantize
AI models
Promising results show that
low-precision integer inference
can become widespread
Virtually the same accuracy
between a FP32 and quantized
AI model through:
• Automated, data free,
post-training methods
• Automated training-based
mixed-precision method
Significant performance per watt
improvements through quantization
Automated reduction in precision
of weights and activations while
maintaining accuracy
Models trained at high precision:
• 32-bit floating point (e.g., 3452.3194)
Inference at lower precision:
• 16-bit integer (e.g., 3452): up to 4X increase in performance per watt from savings in memory and compute1
• 8-bit integer (e.g., 255): up to 16X increase in performance per watt from savings in memory and compute1
• 4-bit integer (e.g., 15): up to 64X increase in performance per watt from savings in memory and compute1
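To make the precision reduction above concrete, here is a minimal NumPy sketch (an illustration only, not Qualcomm's implementation) that maps an FP32 weight tensor to 8-bit integers with an affine scale and zero-point, then measures the round-trip error:

import numpy as np

def quantize_affine(x, num_bits=8):
    # Affine (asymmetric) quantization: map the FP32 range onto the integer grid.
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(64, 64).astype(np.float32)   # stand-in for FP32 model weights
q, scale, zp = quantize_affine(weights, num_bits=8)
error = np.abs(weights - dequantize(q, scale, zp)).mean()
print(f"scale={scale:.6f}  zero_point={zp}  mean abs error={error:.6f}")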
9. 9
Data-free
quantization
How can we make
quantization as simple
as possible?
SOTA 8-bit
results
Making 8-bit weight
quantization ubiquitous
<1%
Accuracy drop for
MobileNet V2
against FP32 model
Data-Free Quantization Through Weight Equalization
and Bias Correction (Nagel, van Baalen, et al.,
ICCV 2019)
Created an automated method
that addresses bias and
imbalance in weight ranges:
No training
Data free
Invented
the best
techniques
for fast
deployment
of 8-bit
quantization
SOTA: State-of-the-art
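Cross-layer weight equalization, one ingredient of the data-free quantization work cited above, rescales adjacent layers so that their per-channel weight ranges match while the network function is unchanged (a ReLU between the layers is positively homogeneous). Below is a minimal NumPy sketch of that rescaling for two fully connected layers; the published method also handles convolutions and bias absorption/correction:

import numpy as np

def equalize_pair(w1, b1, w2):
    # w1: (out1, in1), w2: (out2, out1); a ReLU sits between the two layers.
    r1 = np.abs(w1).max(axis=1)              # per-output-channel range of layer 1
    r2 = np.abs(w2).max(axis=0)              # per-input-channel range of layer 2
    s = np.sqrt(r1 / r2)                     # choose s so that r1/s == r2*s
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(32, 16)), rng.normal(size=32)
w2 = rng.normal(size=(8, 32))
w1_eq, b1_eq, w2_eq = equalize_pair(w1, b1, w2)

relu = lambda z: np.maximum(z, 0.0)
x = rng.normal(size=(4, 16))
y_before = relu(x @ w1.T + b1) @ w2.T
y_after = relu(x @ w1_eq.T + b1_eq) @ w2_eq.T
print("output unchanged:", np.allclose(y_before, y_after))
print("ranges equalized:", np.allclose(np.abs(w1_eq).max(axis=1), np.abs(w2_eq).max(axis=0)))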
AdaRound
Is rounding to the nearest
value the best approach
for quantization?
Making 4-bit weight
quantization ubiquitous
<2.5%
Accuracy drop for
MobileNet V2
against FP32 model
Up or Down? Adaptive Rounding for Post-Training
Quantization (Nagel, Amjad, et al., ICML 2020)
Created an automated
method for finding the
best rounding choice:
No training
Minimal unlabeled data
SOTA 4-bit
weight results
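The core idea of AdaRound, sketched very roughly below, is that each weight's rounding direction (down or up) is chosen to minimize the layer's output reconstruction error on a small unlabeled batch, rather than always rounding to the nearest grid point. The published method optimizes a continuous relaxation; this toy NumPy version uses greedy flipping purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 16))                      # FP32 weights of one small layer
x = rng.normal(size=(16, 128))                    # small unlabeled calibration batch
scale = np.abs(w).max() / (2 ** 3 - 1)            # 4-bit symmetric per-tensor scale (clipping omitted)

floor_grid = np.floor(w / scale)
choice = np.round(w / scale) - floor_grid          # 0 = round down, 1 = round up (starts at nearest)

def layer_mse(c):
    w_q = (floor_grid + c) * scale
    return np.mean((w @ x - w_q @ x) ** 2)

nearest = choice.copy()
for _ in range(2):                                 # greedy passes over all weights
    for i in range(w.shape[0]):
        for j in range(w.shape[1]):
            trial = choice.copy()
            trial[i, j] = 1.0 - trial[i, j]
            if layer_mse(trial) < layer_mse(choice):
                choice = trial

print("round-to-nearest output MSE:", layer_mse(nearest))
print("adaptive rounding output MSE:", layer_mse(choice))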
Transformer
quantization
Demonstrated effectiveness of
earlier techniques and created
new per-embedding quantization
No training
Minimal unlabeled data
How well do these methods
apply to transformers and
what more is needed?
SOTA for 8-bit
transformers
Making 8-bit weight quantization
for transformers ubiquitous
<1%
Accuracy drop on
problematic GLUE
benchmarks with
per-embedding-group
post-training
quantization
Understanding and Overcoming the Challenges of
Efficient Transformer Quantization (Bondarenko,
Nagel, et al., EMNLP 2021)
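The per-embedding-group idea can be illustrated with a toy example: transformer activations often have a few embedding dimensions with large outliers, so a single per-tensor scale wastes most of the 8-bit grid. Splitting the embedding dimension into groups, each with its own scale, keeps the error low. The sketch below is illustrative only and is not the paper's exact formulation:

import numpy as np

def quantize_sym(x, scale, num_bits=8):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(2)
acts = rng.normal(size=(128, 768))                 # (tokens, embedding dimension)
acts[:, :8] *= 50.0                                # a few outlier embedding dimensions

# Per-tensor: one scale shared across all 768 embedding dimensions.
per_tensor = quantize_sym(acts, np.abs(acts).max() / 127)

# Per-embedding-group: split the embedding dimension into groups, one scale each.
groups = np.split(acts, 16, axis=1)                # 16 groups of 48 dimensions
per_group = np.concatenate([quantize_sym(g, np.abs(g).max() / 127) for g in groups], axis=1)

print("per-tensor MSE:         ", np.mean((acts - per_tensor) ** 2))
print("per-embedding-group MSE:", np.mean((acts - per_group) ** 2))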
10. 10
Relaxed Quantization
(ICLR 2019)
Data-free Quantization
(ICCV 2019)
AdaRound
(ICML 2020)
Bayesian Bits
(NeurIPS 2020)
DONNA-NAS
(ICCV 2021)
Transformer Quantization
(EMNLP 2021)
Joint Pruning and Quantization
(ECCV 2020)
Qualcomm Neural Processing SDK and Qualcomm AIMET Pro are products of Qualcomm Technologies, Inc.
Driving the industry toward integer inference and power-efficient AI
Leading model efficiency research and fast commercialization
Qualcomm® Neural Processing SDK
Qualcomm® AI Model Efficiency Toolkit (AIMET) Pro
AIMET
Model efficiency
research
Model efficiency
commercialization
Model efficiency
open-sourcing
11. 11
AIMET
State-of-the-art quantization and compression techniques
github.com/quic/aimet
AIMET Model Zoo
Accurate pre-trained 8-bit quantized models
github.com/quic/aimet-model-zoo
Driving the industry toward integer inference and power-efficient AI
AIMET Model Zoo is a product of Qualcomm Innovation Center.
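For readers who want to try the open-source toolkit, a typical quantization-simulation flow looks roughly like the sketch below. The class and argument names are written from memory and may differ between AIMET releases, so treat them as assumptions and check github.com/quic/aimet for the current API:

import torch
from torchvision.models import mobilenet_v2
from aimet_torch.quantsim import QuantizationSimModel   # API name assumed; verify against the repo

model = mobilenet_v2(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Wrap the model with simulated 8-bit quantizers for weights and activations.
sim = QuantizationSimModel(model, dummy_input=dummy_input,
                           default_param_bw=8, default_output_bw=8)

def calibrate(sim_model, _):
    # A few unlabeled batches let AIMET choose quantization ranges (encodings).
    with torch.no_grad():
        for _ in range(8):
            sim_model(torch.randn(4, 3, 224, 224))

sim.compute_encodings(calibrate, forward_pass_callback_args=None)
# sim.model can now be evaluated to estimate INT8 accuracy before export.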
12. 12
Video monitoring
Extended reality Smart cities
Smart factories
Autonomous vehicles
Video conferencing
Smart homes
Smartphone
The need for intelligent, personalized
experiences powered by AI is ever-growing
How do we maintain privacy and deal
with all the data from edge devices?
13. 13
Data and
labels
Training
With offline training,
the test data can differ
from training data
(domain shift, distribution
shift, anomalies) and may
even change continuously
Test
data
On-device learning can
help to improve and
maintain accuracy when
original pre-trained model
cannot generalize well
Adapt
model
Inference
Deploy
On-device
learning
offers several
benefits
• Continuous learning
• Personalization
• Data privacy
• Scale
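A minimal sketch of the "adapt model" step above: fine-tune only a small head on a handful of locally collected samples while the backbone stays frozen, which keeps on-device compute and memory low. This is a generic illustration, not a specific Qualcomm recipe:

import torch
import torch.nn as nn

# Pre-trained model: frozen backbone + small head that is adapted on the device.
backbone = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
head = nn.Linear(32, 5)
for p in backbone.parameters():
    p.requires_grad = False

# A few locally collected, labeled samples (e.g., from the device's own sensors).
local_x = torch.randn(16, 64)
local_y = torch.randint(0, 5, (16,))

optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(20):                      # brief on-device adaptation loop
    logits = head(backbone(local_x))
    loss = loss_fn(logits, local_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("adapted-head loss:", loss.item())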
14. 14
Federated learning brings on-device learning to a new level
Adaptation on the device, once or continuously, locally and/or globally for continuous model enhancement
Global adaptation
Local adaptation
Offline learning
Data
On-device learning
Locally adapt once to a few samples
(e.g., few shot learning) or continuously
(e.g., unsupervised learning)
Adapt model
based on
local data
Offline training prior to deployment
Federated learning
Aggregate model updates across
multiple users to globally improve
model from more diverse data
Federated learning for global adaptation
while still preserving privacy
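The global-adaptation loop can be sketched in a few lines of PyTorch in the spirit of federated averaging: each simulated device fine-tunes the global model on its own data, and the server averages the resulting weights. This toy version omits the transport, scheduling, and privacy machinery of the deployable framework shown on the next slides:

import torch
import torch.nn as nn

def local_update(global_state, data, targets, lr=0.05, steps=5):
    model = nn.Linear(10, 2)
    model.load_state_dict(global_state)          # start from the current global model
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(data), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.state_dict()

global_model = nn.Linear(10, 2)
devices = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(4)]

for round_idx in range(3):
    updates = [local_update(global_model.state_dict(), x, y) for x, y in devices]
    # Server: average the device models to form the new global model.
    new_state = {k: torch.stack([u[k] for u in updates]).mean(dim=0)
                 for k in global_model.state_dict()}
    global_model.load_state_dict(new_state)
    print(f"round {round_idx}: aggregated {len(updates)} device updates")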
15. 15
DP-REC: Private & Communication-Efficient Federated Learning, 2021
We combine
differential
privacy with
model update
compression
for DP-REC
Our federated learning
method uses differential
privacy to enable high
compression for a drastic
reduction in communications
Differential privacy
Model update
compression
A differentially private model update ensures
that information from the local data is reduced
Compression decreases the message size
and may reduce information from local data
DP-REC
Differentially Private Relative Entropy Coding (DP-REC)
- model updates reduce the information from local data
and can be compressed ‘for free’
Results: 332.0x compression for next-character prediction and 105.1x compression for tag prediction
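A toy sketch of the general recipe of combining the two ideas above: clip a client's model update so any one device has bounded influence, add Gaussian noise for differential privacy, then compress the message. Note that DP-REC itself compresses with relative entropy coding; the top-k sparsification below is only a generic stand-in:

import torch

def private_compressed_update(update, clip_norm=1.0, noise_std=0.5, keep_frac=0.05):
    # 1) Clip the update so one client's data has bounded influence.
    norm = update.norm()
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    # 2) Add Gaussian noise for differential privacy.
    noisy = clipped + noise_std * torch.randn_like(clipped)
    # 3) Compress: keep only the largest-magnitude entries (generic stand-in;
    #    DP-REC instead encodes the update with relative entropy coding).
    k = max(1, int(keep_frac * noisy.numel()))
    topk = noisy.abs().flatten().topk(k)
    mask = torch.zeros_like(noisy).flatten()
    mask[topk.indices] = 1.0
    return noisy * mask.view_as(noisy)

update = torch.randn(1000)                         # a client's model update
msg = private_compressed_update(update)
print("nonzero entries sent:", int((msg != 0).sum()), "of", msg.numel())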
16. 16
Deployable federated learning framework for mobile
from Qualcomm AI Research
Architecture (diagram): each mobile device runs an Android app with a worker host and a Torch host (LibTorch, C++), connected through a pipe and gRPC.
The coordinator server runs a controller, a worker manager, and the FL trainer with Python worker control, driven by PyTorch/TensorFlow code from ML experts.
Server and devices communicate over gRPC on a TCP/IP network.
ML experts | Coordinator server | Mobile devices
Benefits
Scalable
Customizable
Deployable for real world
Supports TensorFlow
and PyTorch
Works on mobile
17. 17
First federated
learning framework
for mobile devices
Demonstration of voice user
verification using federated
learning on smartphones
(NeurIPS’21)
5000 worker nodes to train a
voice user verification model
Demo video
18. 18
1M
Minutes of video
crossing the internet
per second
15B
Minutes of talking
per day on WhatsApp
calls
82%
Of all consumer
internet traffic is
online video
76
Minutes per day watching
video on digital devices
by US adults
8B
Average daily
video views
on Facebook
The scale of video and voice
being created and consumed is massive
Cisco Visual Networking Index: Forecast and Trends, 2017–2022; WhatsApp blog 4/28/20
19. 19
AI-based
compression
has compelling
benefits
No special-purpose
hardware required, other
than AI acceleration
Easy to upgrade, standardize,
and deploy new codecs
Specialized to a specific
data distribution
Easy to develop new codecs
for new modalities
Improved rate-distortion
trade-off
Optimized for advanced
perceptual quality metrics
Semantics aware for
human visual perception
Can generate
visual details not
in the bitstream
20. 20
Instance-adaptive video compression
Overfitting for Fun and Profit: Instance-Adaptive Data Compression, ICLR 2021
Neural video codec research shows promising results
Our
research
Rate-Distortion AEs
[Habibian et al., ICCV ’19]
Frame-Recurrent AEs
[Golinski et al., ACCV ’20]
Instance-Adaptive Compression
[Rozendaal et al., ICLR ’21]
Neural B-Frame Coding
[Pourreza et al., ICCV ’21]
Neural Coding in YUV420
[Egilmez et al., JSPS ’21]
Diagram: sender and receiver share a model prior and a latent prior (shared knowledge).
The encoder q_φ(z|x) maps the input x to latents z, which are entropy-coded into a bitstream b_z.
After overfitting on the instance, the sender transmits quantized weight-deltas δ relative to the model prior, plus a smaller encoded bitstream.
The receiver applies the weight-deltas to its copy of the decoder p_θ(x|z) and reconstructs x̂.
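A toy sketch of the instance-adaptive idea in the diagram above: both sides start from a shared model, the sender overfits it to the one video being transmitted, and only coarsely quantized weight-deltas (plus the now better-fitting latent bitstream) need to be sent. Rate-distortion weighting of the delta cost and the actual entropy coding are omitted:

import torch
import torch.nn as nn

# Shared (pre-trained) autoencoder known to both sender and receiver.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 8), nn.ReLU(), nn.Linear(8, 64))
global_state = {k: v.clone() for k, v in model.state_dict().items()}

frames = torch.randn(256, 64)                      # stand-in for the single video instance

# Sender: overfit the shared model to this one instance.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    loss = nn.functional.mse_loss(model(frames), frames)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sender: transmit coarsely quantized weight-deltas instead of full weights.
step = 1e-2
deltas = {k: torch.round((model.state_dict()[k] - global_state[k]) / step) * step
          for k in global_state}

# Receiver: apply the deltas to its copy of the shared model, then decode.
receiver = nn.Sequential(nn.Linear(64, 8), nn.ReLU(), nn.Linear(8, 64))
receiver.load_state_dict({k: global_state[k] + deltas[k] for k in global_state})
print("reconstruction MSE:", nn.functional.mse_loss(receiver(frames), frames).item())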
21. 21
*We previously showcased real-time all-intra neural video decoding. Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
Neural inter-frame video decoder demo implementation
1280 × 720
Mobile device powered by
Snapdragon® 8 Mobile Platform
Pipeline diagram: offline processing runs the encoder with parallel entropy encoding to produce the bitstream; on the device, CPU cores perform parallel entropy decoding while the AI accelerator runs the motion and residual networks of the decoder.
Demo showcases a real-time inter-frame neural decoder on a mobile device at 30+ frames per second
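The split described above (entropy decoding on CPU cores, the neural networks on the AI accelerator) is essentially a two-stage pipeline. The schematic Python below shows that pattern with placeholder functions; every name here is hypothetical and stands in for the real entropy decoder and accelerator runtime:

import concurrent.futures as futures
import time

def entropy_decode(frame_idx):
    # Placeholder for parallel entropy decoding of one frame's bitstream on CPU cores.
    time.sleep(0.01)
    return {"frame": frame_idx, "motion_syms": ..., "residual_syms": ...}

def run_decoder_networks(symbols):
    # Placeholder for the motion and residual networks running on the AI accelerator.
    time.sleep(0.02)
    return f"reconstructed frame {symbols['frame']}"

with futures.ThreadPoolExecutor(max_workers=1) as cpu_pool:
    pending = cpu_pool.submit(entropy_decode, 0)
    for t in range(1, 5):
        symbols = pending.result()                     # symbols for the previous frame are ready
        pending = cpu_pool.submit(entropy_decode, t)   # CPU starts on the next frame...
        print(run_decoder_networks(symbols))           # ...while the accelerator decodes this one
    print(run_decoder_networks(pending.result()))      # final frame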
22. 22
First HD
neural video
codec on
mobile
Demonstration of real-time
neural video decoding
on a smartphone at NeurIPS’21
Demo video
23. 23
However, on-device
deployment that meets the real-
time, latency, and power
requirements at high resolution
has not been feasible before
AI-based super
resolution offers
improved visual
quality over
traditional methods
Super resolution upscales a (W/S) × (H/S) input to a W × H output, where S is the upscaling factor.
24. 24
Custom
architecture
Quantization-robust
model architecture
using optimized
residual connections
Qualcomm Hexagon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
Our full stack
optimizations
have made
state-of-the-art
single-image
super resolution
at 4K possible
on mobile
Quantization
Cross-layer equalization,
bias correction, and
quantization-aware
training using AIMET
Hardware-
optimized
Efficiently utilize AI
acceleration of the
Qualcomm® Hexagon™
Tensor Processor via
channel-wise input
tiling
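To make the architecture discussion concrete, here is a generic PyTorch sketch of a lightweight residual super-resolution network with a pixel-shuffle upsampler, the family of designs this slide refers to. It is not the actual Qualcomm model, and the layer sizes are arbitrary:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Conv + identity skip. (Quantization-friendly variants can fold the skip
    # into the convolution weights at inference; that step is omitted here.)
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(x) + x)

class TinySuperResolution(nn.Module):
    def __init__(self, scale=2, channels=16, num_blocks=3):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)      # rearranges channels into a higher-resolution image

    def forward(self, x):
        return self.shuffle(self.tail(self.body(self.head(x))))

lr_frame = torch.randn(1, 3, 540, 960)             # a low-resolution input frame
sr_frame = TinySuperResolution(scale=2)(lr_frame)
print(sr_frame.shape)                              # torch.Size([1, 3, 1080, 1920])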
25. 25
Our SR implementation provides higher performance
at lower latency and power while maintaining accuracy
Settings for
comparisons:
• Running on a device
• Scaling factor: 2x
(4x is much faster)
• Output resolution:
1024x1024
• INT8 quantized models
Chart: our models vs. existing solutions — PSNR (INT8) plotted against relative latency and relative power for FSRCNN, ERFDN, ERFDN-8, SESR (M3/M5/M7/M11/XL), ABPN, XLSR, and SRResNet, spanning 2016 to present; our models reach higher PSNR at lower relative latency and power.
26. 26
First 4K super
resolution at
100+ FPS
on mobile
Our new machine-learning
based super resolution method
Low-resolution Super-resolution
28. 28
Neural Augmentation of Kalman Filter with Hypernetwork for Channel Tracking, Globecom 2021
Combine inductive bias
from domain knowledge
with neural networks to address
interpretability, out-of-domain
generalization, and achieve
better sample complexity
Neural
augmentation for
enhanced wireless
communication
Hypernetwork Kalman filtering: adapt Kalman filter parameters using a NN →
outperforms the NN baseline (LSTM) and a manually adapted Kalman filter (binned KF)
Neural augmentation:
• Keep the Kalman equations
for prediction.
• Use a recurrent network
to update the parameters
of the Kalman filter
Generative channel modeling: learn to model a complex system
with a computationally efficient and differentiable model
A GAN maps z ~ N(0, I) to sampled wireless channel responses H, for transmit antennas i = 1, …, N_T and receive antennas j = 1, …, N_R
Neural augmentation:
• Keep the linearity of the
model from Maxwell's equations
(y = x ∗ H).
• Use generative models
to learn the distribution
of the linear model H
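A minimal PyTorch sketch of the neural-augmentation pattern described above: the standard Kalman predict and update equations are kept, while a small recurrent network adapts the noise parameters at every step from the innovation. This is a toy, untrained 1-D random-walk example, not the published Globecom 2021 model:

import torch
import torch.nn as nn

class HyperKalman1D(nn.Module):
    # Keeps the Kalman predict/update equations; a GRU adapts the noise
    # parameters (q, r) at each step from the innovation.
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.GRUCell(1, hidden)
        self.to_params = nn.Linear(hidden, 2)      # outputs log q_t, log r_t

    def forward(self, measurements):
        x, p = torch.zeros(()), torch.ones(())     # state estimate and variance
        h = torch.zeros(1, self.rnn.hidden_size)
        estimates = []
        for z in measurements:
            innovation = (z - x).view(1, 1)
            h = self.rnn(innovation, h)
            log_q, log_r = self.to_params(h).squeeze(0)
            q, r = log_q.exp(), log_r.exp()
            # Kalman predict (random-walk state model) and update:
            p = p + q
            k = p / (p + r)
            x = x + k * (z - x)
            p = (1 - k) * p
            estimates.append(x)
        return torch.stack(estimates)

torch.manual_seed(0)
true_channel = torch.cumsum(0.1 * torch.randn(50), dim=0)   # slowly drifting channel gain
observations = true_channel + 0.3 * torch.randn(50)
filtered = HyperKalman1D()(observations)
print("tracking MSE:", torch.mean((filtered - true_channel) ** 2).item())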
29. 29
WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels, 2021.
First weakly
supervised
indoor
positioning
Our new machine-learning-
based methods work on large
floor plans and require only
weakly labeled training data
and a floor plan. (MWC’21)
Commercial precise positioning
Weakly/self-supervised learning
Demo video
30. 30
SOTA: State-of-the-art
Future
AI
Firsts
AI cloud platform
User-friendly automation for neural architecture
search and quantization, with support on cloud
platforms
Conditional compute
Frame-level early exit or mixture of experts
for significantly higher inference efficiency
ML for discrete optimization
AI-based algorithms for improving runtime,
scalability, and performance of combinatorial
optimization solvers
On-device learning
Real-time model adaptation to improve
computer vision applications on mobile
Wireless AI
Joint sensing and communication
through generative modeling
3D AI
Efficient total scene capture
and novel view synthesis
Neural reasoning
Move beyond perception to reasoning
with auto-regressive language models
AI for hardware design
Data-efficient microarchitecture hardware/software
co-design and system-on-chip placement & routing
31. 31
We are conducting leading
research to enable edge AI
Due to our full-stack
AI research, we are
first to demonstrate
proof-of-concepts
on mobile devices
We are solving system
and feasibility challenges
to move from research
to commercialization