Xuedong Huang - Deep Learning and Intelligent Applications

Machine Learning Prague 2016
www.mlprague.com

  1. Deep Learning and Intelligent Applications
     Dr Xuedong Huang, Distinguished Engineer & Head of Advanced Technology Group, Microsoft Technology and Research
     xdh@microsoft.com
  2. What drives speech technology progress?
     • Stage: Our computing infrastructure, including GPUs, is a stage for performers
     • Oxygen: Big usage data is oxygen to beautify our performance
     • Performers: Deep learning is changing everything as our top performer
  3. Speech recognition could reach human parity in the next 3 years
     (Chart labels: 2015 System; Human Error Rate 4%)
  4. Dong Yu and Xuedong Huang: Microsoft Computational Network Toolkit (5/25/2016)
  5. ImageNet: Microsoft 2015 ResNet
     ImageNet classification top-5 error (%):
     Challenge     Entry         Top-5 error (%)
     ILSVRC 2010   NEC America   28.2
     ILSVRC 2011   Xerox         25.8
     ILSVRC 2012   AlexNet       16.4
     ILSVRC 2013   Clarifai      11.7
     ILSVRC 2014   VGG           7.3
     ILSVRC 2014   GoogLeNet     6.7
     ILSVRC 2015   ResNet        3.5
     Microsoft's entries took first place in all five tracks this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation
  6. Nvidia CEO's View
     http://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/
  7. Design Goal of CNTK
     • A deep learning tool that balances
       • Efficiency: Can train production systems as fast as possible
       • Performance: Can achieve state-of-the-art performance on benchmark tasks and production systems
       • Flexibility: Can support various tasks such as speech, image, and text, and can try out new ideas quickly
     • Inspiration: Legos
       • Each brick is very simple and performs a specific function
       • Create arbitrary objects by combining many bricks
     • CNTK enables the creation of existing and novel models by combining simple functions in arbitrary ways.
  8. Functionality
     • Supports
       • CPU and GPU with a focus on GPU clusters
       • Windows and Linux
       • Automatic numerical differentiation
       • Efficient static and recurrent network training through batching
       • Data parallelization within and across machines with 1-bit quantized SGD
       • Memory sharing during execution planning
     • Modularized: separation of
       • computational networks
       • execution engine
       • learning algorithms
       • model description
       • data readers
     • Models can be described and modified with
       • C++ code
       • Network definition language (NDL) and model editing language (MEL)
       • BrainScript (beta)
       • Python and C# (planned)
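Slide 8 mentions data parallelization with 1-bit quantized SGD but does not show how it works. The following is a minimal sketch of the general idea, assuming error feedback (the quantization error is added back into the next gradient) and assuming each sign group is reconstructed by its mean. The function name `one_bit_quantize` and these reconstruction choices are illustrative assumptions, not CNTK's actual implementation.

```python
import numpy as np

def one_bit_quantize(grad, residual):
    """Quantize a gradient to one bit per value with error feedback.

    Illustrative only: reconstructing each sign group by its mean is an
    assumption of this sketch, not a statement about CNTK's internals.
    """
    corrected = grad + residual             # re-inject the error left over from the last step
    bits = corrected >= 0                   # the single bit exchanged per value
    pos = corrected[bits].mean() if bits.any() else 0.0
    neg = corrected[~bits].mean() if (~bits).any() else 0.0
    dequantized = np.where(bits, pos, neg)  # what a receiving worker would reconstruct
    new_residual = corrected - dequantized  # quantization error carried to the next step
    return bits, (pos, neg), dequantized, new_residual

grad = np.array([0.5, -0.25, 1.5, -0.75])
bits, levels, approx, residual = one_bit_quantize(grad, np.zeros_like(grad))
```

Only `bits` and the two reconstruction levels would need to be communicated between workers; `residual` stays local and is passed back in on the next minibatch, so no gradient information is permanently discarded.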
  9. Architecture
     (Diagram. Components: CN Description, ICNBuilder, CN, IDataReader (task-specific reader; features & labels), ILearner (SGD, AdaGrad, etc.), IExecutionEngine (CPU/GPU). Edge labels: Use, Build, Load, Get data, Evaluate, Compute Gradient.)
  10. At the Heart: Computational Networks
      • A generalization of machine learning models that can be described as a series of computational steps.
        • E.g., DNN, CNN, RNN, LSTM, DSSM, log-linear model
      • Representation:
        • A list of computational nodes denoted as n = {node name : operation name}
        • The parent-children relationship describing the operands {n : c_1, ..., c_{K_n}}
          • K_n is the number of children of node n. For leaf nodes K_n = 0.
          • Order of the children matters: e.g., XY is different from YX
        • Given the inputs (operands) the value of the node can be computed.
      • Can flexibly describe deep learning models.
        • Adopted by many other popular tools as well
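The representation on slide 10 (a node list plus an ordered parent-children map) can be sketched in a few lines. This is a toy illustration, not CNTK code: the operation names in `OPS`, the node names, and the `evaluate` helper are all assumptions made for the example.

```python
import numpy as np

# Operation table (hypothetical names chosen for this sketch).
OPS = {
    "Times": lambda a, b: a @ b,   # operand order matters: W @ x is not x @ W
    "Plus": lambda a, b: a + b,
    "Sigmoid": lambda a: 1.0 / (1.0 + np.exp(-a)),
}

# n = {node name : operation name}; leaves carry no operation to run.
nodes = {"x": "Input", "W": "Parameter", "b": "Parameter",
         "z": "Times", "s": "Plus", "h": "Sigmoid"}

# {n : c_1, ..., c_{K_n}}; leaf nodes have K_n = 0 children.
children = {"x": [], "W": [], "b": [],
            "z": ["W", "x"], "s": ["z", "b"], "h": ["s"]}

def evaluate(name, leaf_values, cache=None):
    """Compute a node's value from its operands, memoizing shared subgraphs."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    kids = children[name]
    if not kids:
        value = leaf_values[name]
    else:
        operands = [evaluate(c, leaf_values, cache) for c in kids]
        value = OPS[nodes[name]](*operands)
    cache[name] = value
    return value

leaves = {"x": np.array([1.0, 1.0]),
          "W": np.array([[1.0, 2.0], [3.0, 4.0]]),
          "b": np.array([-3.0, -7.0])}
h = evaluate("h", leaves)  # sigmoid(W @ x + b)
```

Keeping the graph as plain data, separate from the code that walks it, mirrors the separation of model description and execution engine listed on the Functionality slide.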
  11. CNTK Summary
      • CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux
      • CNTK is extensible thanks to its low-coupling modular design: adding new readers and new computation nodes is easy with a new reader design
      • Network definition language, macros, and model editing language (as well as BrainScript and Python bindings in the future) make network design and modification easy
      • Compared to other tools, CNTK has a great balance between efficiency, performance, and flexibility
  12. CNTK Computational Performance
      (Chart: Speed Comparison, frames/second, higher is better. Toolkits: CNTK, Theano, TensorFlow, Torch 7, Caffe. Series: 1 GPU, 1 x 4 GPUs, 2 x 4 GPUs (8 GPUs).)
      Theano only supports 1 GPU.
      We report 8 GPUs (2 machines) for CNTK only, as it is the only public toolkit that can scale beyond a single machine. Our system can scale beyond 8 GPUs across multiple machines with superior distributed system performance.
  13. Project Oxford: Adding "smart" to your applications
      A portfolio of APIs, SDKs and apps that enable developers to easily add intelligent services, such as vision or speech capabilities, to their solutions
  14. Understand the data around your application
  15. PROJECT OXFORD
  16. Computer Vision APIs
      • Analyze an Image: Understand content within an image
      • OCR: Detect and recognize words within an image
      • Generate Thumbnail: Scale and crop images, while retaining key content
  17. Analyze Image
      Type of Image:
        Clip Art Type: 0 (Non-clipart)
        Line Drawing Type: 0 (Non-Line Drawing)
        Black & White Image: False
      Content of Image:
        Categories: [{ "name": "people_swimming", "score": 0.099609375 }]
        Adult Content: False
        Adult Score: 0.18533889949321747
        Faces: [{ "age": 27, "gender": "Male", "faceRectangle": {"left": 472, "top": 258, "width": 199, "height": 199}}]
      Image Colors:
        Dominant Color Background: White
        Dominant Color Foreground: Grey
        Dominant Colors: White
        Accent Color
  18. OCR
      Image text: LIFE IS LIKE RIDING A BICYCLE TO KEEP YOUR BALANCE YOU MUST KEEP MOVING
      JSON:
        { "language": "en", "orientation": "Up", "regions": [
          { "boundingBox": "41,77,918,440", "lines": [
            { "boundingBox": "41,77,723,89", "words": [
              { "boundingBox": "41,102,225,64", "text": "LIFE" },
              { "boundingBox": "356,89,94,62", "text": "IS" },
              { "boundingBox": "539,77,225,64", "text": "LIKE" } . . .
      Good At:
      • Scanned Documents
      • Photos with Text
      • Fine Grained Location Information
      Need to Improve:
      • Vehicle License Plate
      • Hand-written Text
      • Characters with Large Sizes
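A response shaped like the OCR JSON on slide 18 can be flattened into text lines with a short helper. The sketch below uses only the three words visible on the slide (the rest of the response is elided there). It assumes each `boundingBox` string is ordered left, top, width, height, which matches the slide's numbers but is not stated on it. The helper names `parse_box` and `extract_lines` are hypothetical.

```python
import json

# Only the part of the response visible on the slide; the remainder is elided there.
response = json.loads("""
{"language": "en", "orientation": "Up", "regions": [
  {"boundingBox": "41,77,918,440", "lines": [
    {"boundingBox": "41,77,723,89", "words": [
      {"boundingBox": "41,102,225,64", "text": "LIFE"},
      {"boundingBox": "356,89,94,62", "text": "IS"},
      {"boundingBox": "539,77,225,64", "text": "LIKE"}]}]}]}
""")

def parse_box(box):
    """Split a 'left,top,width,height' string (assumed order) into integers."""
    left, top, width, height = (int(v) for v in box.split(","))
    return {"left": left, "top": top, "width": width, "height": height}

def extract_lines(resp):
    """Join the words of every line in every region, in response order."""
    return [" ".join(word["text"] for word in line["words"])
            for region in resp["regions"]
            for line in region["lines"]]

text_lines = extract_lines(response)
line_box = parse_box(response["regions"][0]["lines"][0]["boundingBox"])
```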
  19. Smart Thumbnail
      (Image comparison: Smart Cropping Off | Smart Cropping On)
  20. Face APIs
      • Face Detection: Detect faces and their attributes within an image
      • Face Verification: Check if two faces belong to the same person
      • Similar Face Searching: Find similar faces within a set of images
      • Face Grouping: Organize many faces into groups
      • Face Identification: Search which person a face belongs to
  21. Face APIs
      Detection: "faceRectangle": {"width": 193, "height": 193, "left": 326, "top": 204} …
      Feature Attributes: "attributes": { "age": 42, "gender": "male", "headPose": { "roll": "8.2", "yaw": "-37.8", "pitch": "0.0" }}
      Identification: Jasper Williams
      Grouping
  22. Emotion APIs
      Recognize Emotions: Detect emotions based on facial expressions
  23. Emotion APIs
      Face Detection: "faceRectangle": {"width": 193, "height": 193, "left": 326, "top": 204} …
      Emotion Scores: "scores": { "anger": 5.182241e-8, "contempt": 0.0000242813, "disgust": 5.621025e-7, "fear": 0.00115027453, "happiness": 1.06114619e-8, "neutral": 0.003540177, "sadness": 9.30888746e-7, "surprise": 0.9952837}
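The emotion scores on slide 23 are most useful once reduced to a single label. A minimal sketch, using the slide's values and a hypothetical helper `dominant_emotion`:

```python
# Scores copied from the slide.
scores = {
    "anger": 5.182241e-8, "contempt": 0.0000242813, "disgust": 5.621025e-7,
    "fear": 0.00115027453, "happiness": 1.06114619e-8, "neutral": 0.003540177,
    "sadness": 9.30888746e-7, "surprise": 0.9952837,
}

def dominant_emotion(scores):
    """Return the (label, score) pair with the highest score."""
    label = max(scores, key=scores.get)
    return label, scores[label]

label, confidence = dominant_emotion(scores)
```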
  24. Video APIs
      • Stabilization: Smooth and stabilize shaky video
      • Face Detection and Tracking: Detect and track faces in videos
      • Motion Detection: Detect when motion occurs
  25. Stabilization
      The Stabilization API provides automatic video stabilization and smoothing for shaky videos.
      This API uses many of the same technologies found in Microsoft Hyperlapse.
      Best For: Small camera motions, with or without rolling shutter effects (e.g. holding a static camera, walking with a slow speed).
  26. Face Detection and Tracking
      High precision face location detection and tracking.
      Can detect up to 64 human faces in a video (no smaller than 24x24 pixels)
      Detected and tracked faces are returned with coordinates and a Face ID to track throughout the video.
      Time (sec)   Face ID   x, y         Width, Height
      0            0         0.59, 0.23   0.09, 0.16
      0            1         0.38, 0.15   0.07, 0.12
      1            0         0.54, 0.25   0.09, 0.15
      1            1         0.23, 0.18   0.07, 0.12
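The tracking table on slide 26 can be regrouped into one track per Face ID. The sketch below assumes the x, y, width, and height values are fractions of the frame size (they all lie between 0 and 1 on the slide, but the slide does not say so explicitly). The frame size and the helper names `to_pixels` and `build_tracks` are hypothetical.

```python
from collections import defaultdict

# Rows from the slide: (time in seconds, face ID, x, y, width, height)
rows = [
    (0, 0, 0.59, 0.23, 0.09, 0.16),
    (0, 1, 0.38, 0.15, 0.07, 0.12),
    (1, 0, 0.54, 0.25, 0.09, 0.15),
    (1, 1, 0.23, 0.18, 0.07, 0.12),
]

def to_pixels(x, y, w, h, frame_w, frame_h):
    """Scale normalized coordinates (assumed) to whole pixels."""
    return (round(x * frame_w), round(y * frame_h),
            round(w * frame_w), round(h * frame_h))

def build_tracks(rows, frame_w, frame_h):
    """Group detections by face ID, keeping them in time order as listed."""
    tracks = defaultdict(list)
    for t, face_id, x, y, w, h in rows:
        tracks[face_id].append((t, to_pixels(x, y, w, h, frame_w, frame_h)))
    return dict(tracks)

# Hypothetical 1000x500 frame, chosen only to make the arithmetic easy to follow.
tracks = build_tracks(rows, frame_w=1000, frame_h=500)
```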
  27. Motion Detection
      Indicates when motion occurs against a fixed background (e.g. surveillance video)
      Trained to reduce false alarms, such as lighting and shadow changes.
      Current limitations:
      • No support for night-vision videos
      • Semi-transparent and small objects are not detected well
      Start Time   End Time   In Region
      1.9          3.6        0
      5.2          15.1       0
  28. Speech APIs
      • Voice Recognition (Speech to Text): Converts spoken audio to text
      • Voice Output (Text to Speech): Synthesize audio from text
      • Speaker ID & Diarisation: Coming soon
  29. Voice Recognition
  30. Voice Recognition
                          Short Form       Long Form
      Duration of Audio   < 15 seconds     < 2 minutes
      Final Result        n-best choice    Best choice, delivered at sentence pauses
      Partial Results     Yes              Yes
  31. Voice Output
      Synthesize audio from text via POST request
      Maximum audio return of 15 seconds
      17 languages supported
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US"> <voice name="Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)"> Synthesize audio from text, to speak to your users. </voice></speak>
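The SSML body on slide 31 has to be well-formed XML, so text containing characters such as `&` or `<` needs escaping before it is posted. A minimal sketch of building the request body, assuming the same `speak`/`voice` structure as the slide; the helper name `build_ssml` is hypothetical, the `mstts` namespace declaration from the slide is omitted for brevity, and the HTTP request itself is not shown.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def build_ssml(text, voice_name, lang="en-US"):
    """Return an SSML document string with text and attributes safely escaped."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        "</speak>"
    )

ssml = build_ssml(
    "Tom & Jerry",
    "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)",
)

# Round-trip through a parser to confirm the document is well-formed.
root = ET.fromstring(ssml)
voice = root.find(f"{{{SSML_NS}}}voice")
```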
  32. Speaker Recognition APIs
      • Speaker Verification: Check if two voices are the same
      • Speaker Identification: Identify who is speaking
  33. Speaker Recognition APIs
      • Enrollment: Create a unique voiceprint for a profile
      • Recognition: After enrolling one or more voices, identify who is speaking from an audio clip
      • Verification: Confirm if a voice belongs to a previously enrolled profile
      (Diagram. Verification: "Is this Anna's voice?" against the profile Anna. Identification: "Whose voice is this?" against the profiles Anna, Mike, Marry.)
  34. CRIS
      Customize both language and acoustic models
      Tailor speech recognition to your app & environment
  35. Custom Recognition Intelligent Service
      • Create custom language models for the vocabulary of the application
      • Adapt acoustic models to better match the expected environment of the application's users
      • Deploy to a custom endpoint and access from any device
  36. Spell Check APIs
      • State-of-the-art cloud based spelling algorithms: Recognizes a wide variety of spelling errors
      • Recognize name errors and homonyms in context: Difficult to spot errors that use the context of the words around them
      • Updates over time: Support for new brands and coined expressions as they emerge
  37. Spell Check APIs
      Check a single word or a whole sentence
        "Our engineers developed this four you!"
        Corrected Text: "four" → "for"
      Identify errors and get suggestions
        "spellingErrors": [ { "offset": 5, "token": "gona", "type": "UnknownToken", "suggestions": [ { "token": "gonna" } ] }
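The `spellingErrors` fragment on slide 37 carries enough information to patch the input text. The sketch below assumes `offset` is a zero-based character index into the input and that the first suggestion is the one to apply. The input sentence and the helper name `apply_suggestions` are made up for illustration; the slide does not show the original text for this response.

```python
def apply_suggestions(text, spelling_errors):
    """Replace each flagged token with its first suggestion."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for err in sorted(spelling_errors, key=lambda e: e["offset"], reverse=True):
        if not err["suggestions"]:
            continue
        start = err["offset"]
        end = start + len(err["token"])
        if text[start:end] != err["token"]:
            continue  # offset convention did not match; leave the text untouched
        text = text[:start] + err["suggestions"][0]["token"] + text[end:]
    return text

# Error entry copied from the slide; the sentence is a hypothetical input that fits it.
errors = [{"offset": 5, "token": "gona", "type": "UnknownToken",
           "suggestions": [{"token": "gonna"}]}]
corrected = apply_suggestions("I am gona go", errors)
```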
  38. LUIS
      Understand what your users are saying
      Use pre-built Bing & Cortana models or create your own
  39. Language Understanding Intelligent Service
      • Reduce labeling effort with interactive featuring
      • Use visualizations to gauge performance and improvements
      • Leverage speech recognition with seamless integration
      • Deploy using just a few examples with active learning
  40. Language Understanding Models
      {
        "entities": [ { "entity": "flight_delays", "type": "Topic" } ],
        "intents": [
          { "intent": "FindNews", "score": 0.99853384 },
          { "intent": "None", "score": 0.07289317 },
          { "intent": "ReadNews", "score": 0.0167122427 },
          { "intent": "ShareNews", "score": 1.0919299E-06 }
        ]
      }
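An application consuming a response like the one on slide 40 typically acts on the highest-scoring intent and falls back when nothing is confident enough. A minimal sketch using the slide's values; the helper `top_intent` and the 0.5 threshold are assumptions for illustration.

```python
import json

# Response copied from the slide (with straight quotes so it parses as JSON).
response = json.loads("""
{"entities": [{"entity": "flight_delays", "type": "Topic"}],
 "intents": [
   {"intent": "FindNews", "score": 0.99853384},
   {"intent": "None", "score": 0.07289317},
   {"intent": "ReadNews", "score": 0.0167122427},
   {"intent": "ShareNews", "score": 1.0919299E-06}]}
""")

def top_intent(resp, threshold=0.5):
    """Return the best-scoring intent name, or "None" if it is below threshold."""
    best = max(resp["intents"], key=lambda i: i["score"])
    return best["intent"] if best["score"] >= threshold else "None"

intent = top_intent(response)
topics = [e["entity"] for e in response["entities"] if e["type"] == "Topic"]
```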
  41. oxfordSignUp
  42. Intelligent Services Made Easy
      https://social.msdn.microsoft.com/forums/azure/en-US/home?forum=mlapi
      http://www.projectoxford.ai/doc
      https://github.com/Microsoft/ProjectOxford-ClientSDK
      https://github.com/matvelloso/twinsornot (the story)
  43. What is next?
  44. Q&A?
      Email me any follow-up questions: xdh@microsoft.com
