For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit-baidu
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Dr. Ren Wu, former distinguished scientist at Baidu's Institute of Deep Learning (IDL), presents the keynote talk, "Enabling Ubiquitous Visual Intelligence Through Deep Learning," at the May 2015 Embedded Vision Summit.
Deep learning techniques have been making headlines lately in computer vision research. Using techniques inspired by the human brain, deep learning employs massive replication of simple algorithms which learn to distinguish objects through training on vast numbers of examples. Neural networks trained in this way are gaining the ability to recognize objects as accurately as humans.
Some experts believe that deep learning will transform the field of vision, enabling the widespread deployment of visual intelligence in many types of systems and applications. But there are many practical problems to be solved before this goal can be reached. For example, how can we create the massive sets of real-world images required to train neural networks? And given their massive computational requirements, how can we deploy neural networks into applications like mobile and wearable devices with tight cost and power consumption constraints?
In this talk, Ren shares an insider’s perspective on these and other critical questions related to the practical use of neural networks for vision, based on the pioneering work being conducted by his former team at Baidu.
Note 1: Regarding the ImageNet results included in this presentation, the organizers of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) have said: “Because of the violation of the regulations of the test server, these results may not be directly comparable to results obtained and reported by other teams.” (http://www.image-net.org/challenges/LSVRC/announcement-June-2-2015)
Note 2: The presenter, Ren Wu, has told the Embedded Vision Alliance that “There was some ambiguity with the rules. According to the ‘official’ interpretation of the rules, there should be no more than 52 submissions within a half year. For us, we achieved the reported results after 200 tests total within a half year. We believe there is no way to obtain any measurable gains, nor did we try to obtain any gains, from an 'extra' hundred tests as our networks have billions of parameters and are trained by tens of billions of training samples.”
4. Deep Blue
A classic example of application-specific system design: an IBM
supercomputer with 480 custom-made VLSI chess chips, running a
massively parallel search algorithm with a highly optimized implementation.
6. Deep Learning Works
“We deepened our investment in
advanced technologies like Deep
Learning, which is already yielding near
term enhancements in customer ROI
and is expected to drive
transformational change over the
longer term.”
– Robin Li, Baidu CEO
8. Deep Learning vs. Human Brain
Deep architecture in a neural network: pixels → edges → object parts (combinations of edges) → object models
Deep architecture in the brain: retina → Area V1 (edge detectors) → Area V2 (primitive shape detectors) → Area V4 (higher-level visual abstractions)
Slide credit: Andrew Ng
[Diagram: voice, text, and image inputs feeding deep learning models that serve the user]
9. Deep Convolutional Neural Networks
* Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster
courtesy of Jonatan Ward, Sergey Andreev, Francisco Heredia, Bogdan Lazar, Zlatka Manevska
10. Big Data
• >2,000 PB storage
• 10-100 PB/day processing
• 100b-1,000b webpages
• 100b-1,000b index entries
• 1b-10b updates/day
• 100 TB-1 PB logs/day
12. Deep Learning: Two Step Process
Supercomputers are used for training;
the trained models are then
deployed everywhere:
Datacenters
Tablets, smartphones
Wearable devices
IoTs
13. Deep Learning: Training
Big data + Deep learning + High performance computing =
Intelligence
Big data + Deep learning + Heterogeneous computing =
Success
15. ImageNet Classification Challenge
• ImageNet dataset
• More than 15 million images belonging to about 22,000 categories
• ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
• Classification task: 1.2 million images in 1,000 categories
• One of the most challenging computer vision benchmarks
• Increasing attention both from industry and academic communities
* Olga Russakovsky et al. ECCV 2014
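The benchmark numbers throughout this talk are top-5 error: a prediction counts as correct if the true label appears among the model's five highest-scoring classes. A minimal sketch of the metric (function name and toy data are illustrative):

```python
import numpy as np

def top5_error(scores, labels):
    """ILSVRC classification metric: fraction of examples whose true
    label is NOT among the five highest-scoring classes."""
    top5 = np.argsort(scores, axis=1)[:, -5:]       # indices of the 5 largest scores
    hits = np.any(top5 == labels[:, None], axis=1)  # true label in top 5?
    return 1.0 - hits.mean()

# toy example: 2 images, 10 classes
scores = np.array([[0., 0., 0., 9., 1., 2., 3., 4., 5., 6.],
                   [0., 9., 8., 7., 6., 5., 4., 3., 2., 1.]])
labels = np.array([3, 0])           # first label is in its top 5, second is not
err = top5_error(scores, labels)    # 0.5
```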
20. Latest Results
Team Date Top-5 test error
GoogLeNet 2014 6.66%
Deep Image 01/12/2015 5.98%
Deep Image 02/05/2015 5.33%
Microsoft 02/05/2015 4.94%
Google 03/02/2015 4.82%
Deep Image 05/10/2015 4.58%
21. Insights and Inspirations
多算胜，少算不胜
孙子《计篇》 (Sun Tzu, 544-496 BC)
"More calculation wins; less calculation loses."
元元本本，殚见洽闻
班固《西都赋》 (Ban Gu, 32-92 AD)
"The more you see, the more you know."
明足以察秋毫之末
孟子《梁惠王上》 (Mencius, 372-289 BC)
"Sight keen enough to discern the tip of an autumn hair": the ability to see very fine details.
22. Project Minwa (百度敏娲)
• Minerva + Athena + 女娲 (Nüwa)
• Athena: Goddess of wisdom, warfare,
divine intelligence, architecture, and crafts
• Minerva: Goddess of wisdom, magic,
medicine, arts, commerce, and defense
• 女娲 (Nüwa): molded humans from clay, mended the sky with smelted stones; associated with marriage and musical instruments
World's Largest Artificial Neural Networks
• Pushing the state of the art
• ~100x bigger than previous ones
• A new kind of intelligence?
23. Hardware/Software Co-design
• Stochastic gradient descent (SGD)
• High compute density
• Scales up to 100 nodes
• High bandwidth, low latency
• 36 nodes, 144 GPUs, 6.9 TB host memory, 1.7 TB device memory
• 0.6 PFLOPS
• Highly optimized software stack
• RDMA / GPU Direct
• New data-partitioning and communication strategies
GPUs + InfiniBand
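Scaling SGD across many GPU nodes as above implies synchronous data-parallel training: each worker computes a gradient on its own data shard, the gradients are averaged across workers, and every replica applies the same update. A minimal single-process sketch under that assumption (the shard layout and least-squares objective are illustrative, not Baidu's code):

```python
import numpy as np

def data_parallel_sgd_step(w, shards, lr, grad_fn):
    """One synchronous data-parallel SGD step: each 'GPU' computes a
    gradient on its own shard, the gradients are all-reduced (averaged),
    and every replica applies the same update."""
    grads = [grad_fn(w, X, y) for (X, y) in shards]  # one gradient per worker
    g = np.mean(grads, axis=0)                       # stands in for the allreduce
    return w - lr * g

# toy least-squares problem: loss = 0.5 * ||Xw - y||^2, grad = X^T (Xw - y)
def grad_fn(w, X, y):
    return X.T @ (X @ w - y)

X, y = np.eye(2), np.array([1.0, 2.0])
shards = [(X[:1], y[:1]), (X[1:], y[1:])]   # two "GPUs", one row each
w = np.zeros(2)
for _ in range(200):
    w = data_parallel_sgd_step(w, shards, lr=0.5, grad_fn=grad_fn)
# w converges to [1.0, 2.0]
```

Because every replica sees the same averaged gradient, all copies of the model stay bit-identical, which is what makes the trained network trivially deployable anywhere afterward.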
25. Speedup (Wall Time to Convergence)
[Chart: validation-set accuracy vs. training time in hours, log scale, for 1, 16, and 32 GPUs]
At 80% accuracy: 32 GPUs take 8.6 hours vs. 212 hours on 1 GPU, a 24.7x speedup.
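The quoted speedup is simply the ratio of wall-clock times to reach the same 80% validation accuracy:

```python
t_1gpu = 212.0               # hours to 80% accuracy on 1 GPU (from the slide)
t_32gpu = 8.6                # hours to 80% accuracy on 32 GPUs
speedup = t_1gpu / t_32gpu   # ~24.7x
```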
26. Never Have Enough Training Examples!
Data Augmentation
Key observations:
• Invariant to the illuminant of the scene
• Invariant to the observer
Augmentation approaches:
• Color casting
• Optical distortion
• Rotation, cropping, etc.
"见多识广" (the more you see, the more you know)
27. And Color Constancy
The Color of the Dress
“Inspired by the color constancy principle.
Essentially, this ‘forces’ our neural network to
develop its own color constancy ability.”
28. Data Augmentation
Possible variations:
Augmentation Number of possible changes
Color casting 68,920
Vignetting 1,960
Lens distortion 260
Rotation 20
Flipping 2
Cropping 82,944 (crop size 224x224, input image size 512x512)
The Deep Image system learned from ~2 billion examples, out
of 90 billion possible candidates.
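The cropping count follows from the geometry: a 224x224 window in a 512x512 image has 512 - 224 = 288 offsets per axis as counted here, and 288² = 82,944. A sketch of the counting plus a toy augmentation in the spirit of the list above (parameter ranges are illustrative, not Baidu's actual settings):

```python
import numpy as np

def num_crops(image_size, crop_size):
    """Number of (row, col) offsets for a square crop, as counted on the slide."""
    return (image_size - crop_size) ** 2

def augment(img, rng):
    """One random augmentation: a per-channel color cast, an optional
    horizontal flip, and a random 224x224 crop."""
    img = img + rng.integers(-20, 21, size=3)[None, None, :]  # color cast
    if rng.random() < 0.5:
        img = img[:, ::-1]                                    # horizontal flip
    r, c = rng.integers(0, img.shape[0] - 224, size=2)        # crop offset
    return np.clip(img[r:r + 224, c:c + 224], 0, 255)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(512, 512, 3))
patch = augment(img, rng)   # shape (224, 224, 3)
```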
31. Multi-scale Training
• Same crop size, different resolutions
• Fixed 224x224 crops
• Downsized training images reduce
computational cost, but not to
state-of-the-art accuracy
• Different models trained on
different image sizes:
256x256 and 512x512
• The high-resolution model works better:
• 256x256: top-5 error 7.96%
• 512x512: top-5 error 7.42%
• Multi-scale models are
complementary
• Fused model: 6.97%
"明察秋毫" (seeing the finest details)
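One simple way to realize the fusion of complementary multi-scale models is to average their class probabilities; the talk does not specify the exact fusion rule, so this sketch is just one common choice:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fuse(scores_per_scale):
    """Average the class probabilities of models trained at different
    input resolutions (a sketch, not Baidu's exact scheme)."""
    return np.mean([softmax(s) for s in scores_per_scale], axis=0)

s256 = np.array([[2.0, 0.5, 0.1]])   # scores from a 256x256 model (toy values)
s512 = np.array([[1.8, 0.9, 0.0]])   # scores from a 512x512 model (toy values)
fused = fuse([s256, s512])           # each row is still a probability vector
```

Averaging helps precisely because the models make partially independent errors, which is what "complementary" means on the slide.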
34. Single Model Performance
• One basic configuration has 16 layers
• The number of weights in our configuration is 212.7M
• About 40% more weights than VGG's
Team Top-5 val. error
VGG 8.0%
GoogLeNet 7.89%
BN-Inception 5.82%
MSRA, PReLU-net 5.71%
Deep Image 5.40%
40. Major Differentiators
• Custom-built supercomputer dedicated to DL
• Simple, scalable algorithm + fully optimized
software stack
• Larger models
• More aggressive data augmentation
• Multi-scale training, including high-resolution images
Scalability + insights,
pushed to the extreme
41. Deep Learning: Deployment
Big data + Deep learning + High performance computing =
Intelligence
Big data + Deep learning + Heterogeneous computing =
Success
42. Owl of Minwa (百度敏鸮)
Models are trained by supercomputers;
the trained models are then deployed in many ways:
data centers (cloud), smartphones, and even wearables and IoT devices
OpenCL-based, lightweight, and high-performance
DNNs everywhere!
Knowledge, wisdom, perspicacity, and erudition
43. DNNs Everywhere
Supercomputers: 1,000s of GPUs
Datacenters: 100k-1m servers
Tablets, smartphones: 2b (in China)
Wearable devices, IoTs: 50b in 2020?
Supercomputers are used for training;
trained DNNs are then deployed to data centers (cloud),
smartphones, and even wearables and IoTs.
44. Offline Mobile DNN App
• Image recognition on the mobile device
• Real-time, no connectivity
needed
• Directly from the video stream: what
you point at is what you get
• Everything is done on the device
• OpenCL-based, highly optimized
• Large deep neural network models
• Thousands of objects: flowers, dogs,
bags, etc.
• Unleashes the full potential of the
device hardware
• Smartphones now; wearables and
IoTs tomorrow
46. Cloud Computing: What’s Missing?
Bandwidth?
Latency?
And power consumption?
Moving data around is expensive, very expensive!
* Artem Vasilyev: CNN optimizations for embedded systems and FFT
49. Heterogeneous Computing
“The human mind and brain is not a single general-purpose processor
but a collection of highly specialized components, each solving a
different, specific problem, and yet collectively making up who we
are as human beings and thinkers.” - Prof. Nancy Kanwisher