Chad Hart examines the use of AI and Machine Learning (ML) in Real Time Communications (RTC) applications including speech analytics, voicebots, computer vision, and ML optimization of RTC components. Chad includes examples from his AI in RTC research report, webrtcHacks, and cogint.ai.
2. cwh.consulting
A blog for WebRTC developers
webrtcHacks.com
@webrtcHacks
AI & RTC blog
cogint.ai
@cogintai
WebRTC and ML for Developer Event
November 16, 2018 in San Francisco
krankygeek.com
About Me
Chad Hart
Analyst & Product Consultant
https://cwh.consulting
@chadwallacehart
chad@cwh.consulting
3. cwh.consulting
AI in RTC Research Study
• Authors
• Chad Hart – cwh.consulting
• Tsahi Levent-Levi - BlogGeek.me
• Methodology
• 40+ 1-on-1 vendor interviews
• ~100 respondent web survey
• Analysis of 126 companies & all major
products
• Output: 147-page report
5. cwh.consulting
AI in RTC use case categories
speech analytics
voicebots
RTC optimization
computer vision
Image source:
pixabay.com/en/a-i-ai-anatomy-2729782
9. cwh.consulting
My name is Chad Hart. You might be
familiar with me from a brand -- if you are
WebRTC people; I've done webrtcHacks
now for about five years or so. Outside of
webrtcHacks, I have been an independent
analyst. I mostly do product management
and strategy type work and product
marketing.
Reality:
transcription quality is often not so great
My name is a chat heart of you might be
familiar with Dave from a brand or if you
are, a web or to see people I've done
about five years, I'm or so a of an
independent analyst. So I'm mostly do
park management strategy type. For a
product, marketing.
Machine Transcription vs. Actual Transcription
Challenges highlighted: non-standard spelling, industry jargon, speech disfluencies, US-English language assumption
https://www.nojitter.com/post/240173958/when-speech-analytics-makes-gibberish-useful
10. cwh.consulting
Higher-level speech analytics
• Perfect transcription is not needed to
provide useful analysis.
• Higher-level speech analytics systems look
for patterns in speech.
• These patterns can be matched to
business outcomes, such as did a caller
end up purchasing or did they give a good
customer satisfaction score.
• There are often meaningful patterns
beyond the words that were spoken – like
how fast each party was speaking, or how
often the agent talked compared to the
customer.
• There is also a lot of work going into
looking at caller emotion and sentiment.
Source: CallMiner
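The "beyond the words" metrics above can be made concrete with a small sketch. Assuming diarized segments of (speaker, start, end) in seconds (the sample call data here is made up), an agent/customer talk-time ratio falls out of simple arithmetic:

```python
def talk_stats(segments):
    # segments: (speaker, start_sec, end_sec) tuples from diarization.
    totals = {}
    for speaker, start, end in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end - start)
    total = sum(totals.values())
    # Each speaker's share of the total talk time.
    return {spk: secs / total for spk, secs in totals.items()}

segments = [
    ("agent", 0.0, 12.5),
    ("customer", 13.0, 18.0),
    ("agent", 18.5, 40.0),
    ("customer", 41.0, 44.0),
]
shares = talk_stats(segments)
print(round(shares["agent"], 2))  # 0.81 - the agent dominates the call
```

A pattern like "agent talks more than 80% of the time" can then be correlated with outcomes such as purchase rate or satisfaction scores.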
12. cwh.consulting
• Another area we examined was voice bots.
• These are smart speakers like the Google Home, which was recently made available in
South Korea, and AI assistants like Bixby or Siri.
• Building a voicebot is complex. You not only need to transcribe the speech and run
some natural language understanding on it like in speech analytics, but you need to
also generate speech and deal with interactivity with the customer in real time.
• There is very broad interest in using these voicebots
• Every telephony device maker is interested in adding a voice user interface to their
products – and this is a natural fit since people “talk” to these devices already.
• Typical conference room equipment is already set up to capture good-quality audio
with minimal noise from a variety of locations throughout the room using microphone
arrays.
• However, most companies are just starting to figure out how to use them in their
products.
Voicebots – Smart Speakers & Assistants
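The pipeline described above (transcribe, understand, then generate speech in real time) can be sketched as a single turn. Every stage below is a hypothetical stand-in for a real engine, an ASR service, a trained NLU model, a TTS voice; only the shape of the flow is real:

```python
def stt(audio_bytes):
    # Stand-in for a speech recognizer.
    return "what is my balance"

def nlu(text):
    # Stand-in for intent classification.
    return "check_balance" if "balance" in text else "fallback"

def tts(text):
    # Stand-in for speech synthesis; returns a tag instead of audio.
    return "<synthesized: %s>" % text

RESPONSES = {
    "check_balance": "Your balance is 42 dollars.",
    "fallback": "Sorry, I did not catch that.",
}

def voicebot_turn(audio_bytes):
    text = stt(audio_bytes)        # transcribe the caller
    intent = nlu(text)             # understand what they want
    return tts(RESPONSES[intent])  # speak the answer back

print(voicebot_turn(b"...caller audio..."))
```

The hard part in practice is doing all three stages fast enough that the exchange still feels conversational.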
13. cwh.consulting
Flattening the IVR:
humans don’t speak in menus
https://cogint.ai/dialogflow-phone-bot/
[Diagram: a traditional IVR descends through several levels of DTMF menus over time before reaching one of its responses, while a voicebot maps each utterance directly to an intent and a response.]
Traditional IVR Menu vs. Voicebot
10 potential responses in an IVR menu hierarchy vs. a voicebot
14. cwh.consulting
Flattening the IVR:
humans don’t speak in menus
• One major area where voicebots will have an impact is in IVRs.
• Traditional IVRs were designed for DTMF input and are usually setup with multiple
levels of menus.
• Because people cannot remember more than a few menu options at a time, you
cannot put too many options in each menu.
• As a result, to fit many options, you need to have a complex menu with many
layers.
• Users hate this because such menus are difficult to navigate and take too long.
• Voicebots help to flatten the IVR into just a few layers.
• Rather than navigating a complex menu, users can just say what they want and use
natural language to get the information they need.
• This is good for call centers too because users are more likely to stay in the IVR
instead of immediately dropping out to an operator.
https://cogint.ai/dialogflow-phone-bot/
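A toy sketch of the flattened model: one utterance maps straight to an intent and a response, with no menu traversal in between. Real systems such as Dialogflow use trained NLU models; the keyword rules and intents here are invented stand-ins:

```python
INTENTS = {
    "check_balance": (["balance", "how much"], "Your balance is ..."),
    "report_outage": (["outage", "no service"], "Let me check your line."),
    "agent": (["human", "operator"], "Transferring you now."),
}

def match_intent(utterance):
    # Map a free-form utterance directly to an intent and response.
    text = utterance.lower()
    for intent, (keywords, response) in INTENTS.items():
        if any(kw in text for kw in keywords):
            return intent, response
    return "fallback", "Sorry, could you rephrase that?"

intent, reply = match_intent("How much is left on my account balance?")
print(intent)  # check_balance
```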
15. cwh.consulting
New voicebots: consumer ⇨ business
Notable Consumer Voicebot Market Milestones
krankygeek.com/research
KRANKY GEEK RESEARCH
18. cwh.consulting
Object detection over WebRTC with TensorFlow
Blog post:
https://webrtchacks.com/webrtc-cv-tensorflow/
Demo video: https://youtu.be/vzTXW0hGINM
• Using open source libraries and existing work,
it is relatively simple to set up your own
server and process real-time video, without
having a PhD in computer vision.
• Here is an example of a server I set up to do
real-time analysis of a WebRTC stream.
19. cwh.consulting
Object detection over WebRTC with TensorFlow – example
architecture
https://webrtchacks.com/webrtc-cv-tensorflow/
TensorFlow
Object
Detection
Flask
Server Browser
local.js
index.html
objDetect.js
POST with image
object details
web assets
GET web assets
• This is just a very basic example that uses an
HTTP POST to send several images per
second to a cloud-based server for
processing.
• As you saw in the video, there can be a little
bit of lag.
• Using a GPU-accelerated server, or even
something like Google’s TPUs, which were
specifically designed to accelerate heavy
machine learning graphs, would have helped.
• But ultimately, streaming high-quality
images will always have its limits.
• Wouldn’t it be nice if you could do the heavy
processing locally with hardware acceleration,
just like you can hardware-accelerate codecs
like H.264?
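To make the exchange in this architecture concrete, here is a rough sketch of the server side. The real server is a Flask app wrapping the TensorFlow Object Detection API; `run_detector` below is a stand-in so the POST-in, JSON-out contract is visible without a model:

```python
import json

def run_detector(image_bytes):
    # Stand-in for the TensorFlow Object Detection call; returns one
    # fake detection with normalized box coordinates.
    return [{"name": "person", "score": 0.92,
             "x": 0.1, "y": 0.2, "width": 0.3, "height": 0.6}]

def handle_post(image_bytes, threshold=0.5):
    # The route body: drop weak detections, reply with the JSON that
    # objDetect.js uses to draw boxes over the video element.
    detections = [d for d in run_detector(image_bytes)
                  if d["score"] >= threshold]
    return json.dumps(detections)

print(handle_post(b"...jpeg bytes..."))
```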
20. cwh.consulting
ML processing moving to the edge,
with faster, local processing
• That’s exactly what you can do with some new chipsets from vendors like
Intel.
• This is an example of a kit from Google called the AIY Vision Kit that
includes the Intel Movidius processor.
• The Movidius is designed to run deep neural networks locally and is
especially well-suited to low-power computer vision applications.
• This kit runs on a tiny, single-core Raspberry Pi Zero with only 512MB of RAM.
• Google used to sell just the vision bonnet add-on board for $45.
Now you can buy the complete kit with the Raspberry Pi for $90 in the US.
• Note that Amazon also has a computer vision kit it calls DeepLens. That
runs on something more like an Intel NUC mini-PC and costs $250.
22. cwh.consulting
Improvements with edge hardware (demonstration)
• Let’s look at this in action
• This all runs locally on the Pi.
• So in this case, I am doing the computer
vision processing locally while sending the
stream and annotations remotely
Blog post:
https://webrtchacks.com/aiy-vision-kit-uv4l-
web-server
Video:
https://youtu.be/h0O18R1rI9U
23. cwh.consulting
Fun use cases with native mobile libraries
• With new native mobile libraries like
Apple’s CoreML and Google’s ML Kit, this
kind of processing is relatively simple to add.
• Some of the engineers at Houseparty
wrote a blog post demonstrating how
to do smile detection
• Similar libraries are available that
detect facial boundaries and let you
put hats, sunglasses, beards, and other
silly masks on people – I am sure you
have seen some of these!
• Similar techniques can be used in a
business context to blur out
backgrounds for remote workers who
call into a video conference.
https://webrtchacks.com/ml-kit-smile-detection/
24. cwh.consulting
MLKit CPU consumption: high framerates are not practical (without
special hardware)
[Chart: CPU usage (%) at different framerates processed by ML Kit]
https://webrtchacks.com/ml-kit-smile-detection/
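One common way to tame this CPU cost is to run the ML pipeline on only every Nth frame rather than at the full camera framerate. A hypothetical sketch of that throttling:

```python
def frames_to_process(total_frames, camera_fps, target_fps):
    # Indices of the frames that actually get sent to the ML pipeline.
    step = max(1, round(camera_fps / target_fps))
    return list(range(0, total_frames, step))

# A 30 fps camera throttled to ~3 detections per second over 10 seconds:
picked = frames_to_process(300, 30, 3)
print(len(picked))  # 30 frames instead of 300
```

For something like smile detection, a few samples per second is usually enough; the annotation just updates less often than the video.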
26. cwh.consulting
WebRTC CV is coming to the browser
https://w3c.github.io/webrtc-nv-use-cases/#funnyhats*
This is from a W3C document examining use cases for the next version of WebRTC
28. cwh.consulting
Mozilla RNNoise – real time, low-power noise suppression with
deep learning
• One example is a research project
from Mozilla that uses Deep Learning
to provide better real-time noise
suppression.
• This is designed for lower-power
devices and does not require any
specialized hardware.
• We do not have time now, but you can
go to that link and try some demos.
• Unfortunately this was just a research
project, but it gives you some idea of
what could be done in this and other
areas.
https://people.xiph.org/~jm/demo/rnnoise/
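For contrast with the learned approach, here is the kind of classical baseline RNNoise improves on: a simple energy-based noise gate that silences low-energy frames. This is not RNNoise, just an illustrative sketch on synthetic sample frames:

```python
def noise_gate(frames, threshold):
    # Silence any frame whose mean energy falls below the threshold.
    out = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        out.append(frame if energy >= threshold else [0.0] * len(frame))
    return out

# One loud "speech-like" frame and one quiet "noise-like" frame:
frames = [[0.5, -0.4, 0.6], [0.01, -0.02, 0.01]]
gated = noise_gate(frames, threshold=0.01)
print(gated[1])  # [0.0, 0.0, 0.0] - the quiet frame is silenced
```

A fixed threshold like this fails on non-stationary noise, which is exactly where a small recurrent network earns its keep.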
Editor's notes
As a quick background, my name is Chad Hart.
I am an analyst and consultant focused on real time communications products and services
Some of you may be familiar with webrtcHacks – a blog I have run since 2013 that aims to provide useful content for WebRTC developers
I also recently launched a blog to specifically explore topics related to AI, Machine Learning and RTC. You can check that out at cogint.ai
Lastly, I also help to run the Kranky Geek series of events with the help of Google and other sponsors like Intel, Nexmo and Agora.
We hold an event every year in San Francisco.
This year we will also be focusing on the AI in RTC topics with many great talks from companies like Facebook, Microsoft, IBM and many more.
The AI in RTC topic has been a major focus of mine.
I recently came off a long-term project where I ran a new product incubator group that launched a speech analytics service inside a telco.
I could see speech analytics and other machine-learning based technologies were starting to intersect with real time communications.
To understand this better I teamed up with Tsahi Levent-Levi of BlogGeek.me, another WebRTC analyst many of you know, to write a research report on this topic.
We covered more than 125 vendors, ran an industry survey, and had 1-on-1 conversations with 40 vendors.
So what is AI in RTC?
I am not talking about science fiction robots making phone calls
I am going to talk about how modern machine learning techniques can be used to improve and expand real time communications.
We saw 4 major categories of use cases
Speech analytics
voice bots
computer vision,
And using Machine Learning (ML) to optimize lower-level RTC protocols and networks
By far the most common use case was speech analytics
There is a broad range of use cases, from providing transcription on conference calls to real-time agent coaching based on what the customer is saying in the call center.
Speech transcription – also known as ASR or Speech-to-text (STT)
Has made a lot of improvements over the past couple of years thanks to deep learning techniques.
Many vendors now claim they are at human levels of accuracy.
The reality is that transcription still has a number of challenges.
The example here shows a transcription where I was introducing myself.
As you can see – the machine transcription did not do such a great job.
This specific example is probably worse than average, but not uncommon.
The first major challenge is getting languages and dialects correct.
I am sure that this is a big struggle for this audience as you deal with STT technologies made outside of Korea.
I am lucky that English, and particularly American English, is by far the best supported language.
Many vendors also have support for many dialects of English, such as British, Australian, and Indian accents.
You will find much more limited support for Korean.
I do not think I have seen any major international vendor support specific Korean dialects.
Fortunately this is improving and newer algorithms require less training data, so it is becoming easier to build support for new languages.
Non-standard spellings and specific industry jargon that does not appear in the dictionary like “WebRTC” is also a challenge.
Most systems now have techniques that let you specify a custom vocabulary to correct these.
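One crude way to picture the custom-vocabulary idea is post-correcting known mis-transcriptions with a phrase map. Real STT APIs handle this upstream (phrase hints and custom vocabularies fed to the recognizer); the mappings below are made-up examples drawn from the transcript slide:

```python
CORRECTIONS = {
    "web or to see": "WebRTC",
    "chat heart": "Chad Hart",
}

def apply_vocabulary(transcript):
    # Replace each known mis-transcription with the intended term.
    for wrong, right in CORRECTIONS.items():
        transcript = transcript.replace(wrong, right)
    return transcript

print(apply_vocabulary("my name is a chat heart, a web or to see person"))
# -> "my name is a Chad Hart, a WebRTC person"
```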
It is also important to note that perfect transcription is not needed to provide useful analysis.
Higher-level speech analytics systems look for patterns in speech.
These patterns can be matched to business outcomes, such as did a caller end up purchasing or did they give a good customer satisfaction score.
There are often meaningful patterns beyond the words that were spoken – like how fast each party was speaking, or how often the agent talked compared to the customer.
There is also a lot of work going into looking at caller emotion and sentiment.
Another area we examined was voice bots.
These are smart speakers like the Google Home, which was recently made available in South Korea (https://voicebot.ai/2018/09/11/google-home-arriving-in-south-korean-on-september-18-pre-orders-start-today/)
And AI assistants like Bixby or Siri.
Building a voicebot is complex. You not only need to transcribe the speech and run some natural language understanding on it like in speech analytics, but you need to also generate speech and deal with interactivity with the customer in real time.
There is very broad interest in using these voicebots
Every telephony device maker is interested in adding a voice user interface to their products – and this is a natural fit since people “talk” to these devices already.
Typical conference room equipment is already set up to capture good-quality audio with minimal noise from a variety of locations throughout the room using microphone arrays
However, most companies are just starting to figure out how to use them in their products.
One major area where voicebots will have an impact is in IVRs.
Traditional IVRs were designed for DTMF input and are usually setup with multiple levels of menus.
Because people cannot remember more than a few menu options at a time, you cannot put too many options in each menu.
As a result, to fit many options, you need to have a complex menu with many layers.
Users hate this because such menus are difficult to navigate and take too long.
Voicebots help to flatten the IVR into just a few layers.
Rather than navigating a complex menu, users can just say what they want and use natural language to get the information they need.
This is good for call centers too because users are more likely to stay in the IVR instead of immediately dropping out to an operator.
Actually, many advanced IVR systems like those sold by companies like Nuance, Aspect, and Genesys already have natural language inputs and responses.
One big change here is the growth of the consumer voicebot market.
As this technology has matured, these solutions are now being targeted at business telephony use cases, not just consumers.
For example, IBM launched a voice gateway option for its Watson assistant.
Amazon is integrating its natural language engine called Lex into Amazon Connect, its contact center solution.
Microsoft’s language processing platform is called LUIS, and it has a bot-builder framework that can use this to integrate into consumer Skype and Skype for Business.
Just this summer, Google launched its contact center AI initiative where it has partnered with many major communications providers and vendors.
As part of Google’s solution, they are looking to penetrate call centers by using Dialogflow, their natural language understanding engine, and are using other tools to help agents more quickly answer questions.
Existing IVR technology that incorporates natural language tends to be very expensive.
Big vendors like Amazon, Google, and Microsoft are adapting technologies they built for the much larger consumer market and applying that to business use cases at much lower costs, often with better performance.
One of Google’s customers, Marks and Spencer, commented they were able to save the equivalent of 100 full-time employees using this technology across their call center.
The last area I would like to discuss is computer vision.
This domain already had a lot of usage in consumer applications and is just starting to find some business use cases.
There are many application areas, including counting people, identifying faces, using gestures for controls, and even augmented reality.
Using open source libraries and existing work, it is relatively simple to set up your own server and process real-time video, without having a PhD in computer vision.
Here is an example of a server I set up to do real-time analysis of a WebRTC stream.
This is just a very basic example that uses an HTTP POST to send several images per second to a cloud-based server for processing.
As you saw in the video, there can be a little bit of lag.
Using a GPU-accelerated server, or even something like Google’s TPUs, which were specifically designed to accelerate heavy machine learning graphs, would have helped.
But ultimately, streaming high-quality images will always have its limits.
Wouldn’t it be nice if you could do the heavy processing locally with hardware acceleration, just like you can hardware-accelerate codecs like H.264?
That’s exactly what you can do with some new chipsets from vendors like Intel.
This is an example of a kit from Google called the AIY Vision Kit that includes the Intel Movidius processor.
The Movidius is designed to run deep neural networks locally and is especially well-suited to low-power computer vision applications.
This kit runs on a tiny, single-core Raspberry Pi Zero with only 512MB of RAM.
Google used to sell just the vision bonnet add-on board for $45. Now you can buy the complete kit with the Raspberry Pi for $90 in the US.
Note that Amazon also has a computer vision kit it calls DeepLens. That runs on something more like an Intel NUC mini-PC and costs $250.
Let’s look at this in action
This all runs locally on the Pi.
So in this case, I am doing the computer vision processing locally while sending the stream and annotations remotely
With new native mobile libraries like Apple’s CoreML and Google’s ML Kit, it is relatively simple.
Some of the engineers at Houseparty wrote a blog post demonstrating how to do smile detection
Similar libraries are available that detect facial boundaries and let you put hats, sunglasses, beards, and other silly masks on people – I am sure you have seen some of these!
Similar techniques can be used in a business context to blur out backgrounds for remote workers who call into a video conference.
The last area is RTC optimization.
There are many opportunities to use machine learning to improve bandwidth estimation, echo cancellation, and perform better error correction.
We were very surprised that there has been relatively little investment made here.
One example is a research project from Mozilla that uses Deep Learning to provide better real-time noise suppression.
This is designed for lower power devices and does not require any specialized hardware.
We do not have time now, but you can go to that link and try some demos.
It is pretty neat.
Unfortunately this was just a research project, but it gives you some idea of what could be done in this and other areas.
Before I take questions, I did want to mention we have a special discount code for RTC Korea attendees.
If you are interested in seeing our full 147-page report, you can use that for a big discount.