2. Introduction
Historical Survey
Motivation
General Model of ASR
Feature Extraction
Hidden Markov Model
Existing Systems
Developing ASR Systems
Revolution in ASR
4. The first speech recognition system, Audrey, was developed at Bell Laboratories in 1952. It could recognise digits spoken by a single speaker.
In the 1970s, Carnegie Mellon introduced the HARPY system, which could recognise 1011 words with different pronunciations.
In the 1980s, new systems based on the Hidden Markov Model (HMM) were introduced. The HMM is a statistical approach and was more robust than the earlier technology.
5. Language is the fundamental mode of communication. Communicating effectively with machines in natural language remains a challenge.
ASR systems can be used to control machines, find content online and help generate content.
Most speech recognition systems and content (about 80%) are available for only 10 major languages. Hence, there is a need to extend these systems to local languages.
Interacting with machines in our own local language would be a major achievement of the modern era.
6. Speech Input
Analog to digital
Feature Extraction (generate speech fingerprint)
Compare and select words
Compare and select sentence-level match
Pick most probable word
Output
Templates: word fingerprint template, sentence fingerprint template
Training algorithm: HMM, probability matrices updated
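The compare-and-select step of this pipeline can be illustrated with a minimal sketch. It assumes each word template is a short feature vector and that "most probable" is approximated by the smallest Euclidean distance. The template values and the names `distance`, `select_word` and `WORD_TEMPLATES` are hypothetical, and a real system would score candidates with trained HMM probabilities rather than raw distances.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_word(fingerprint, templates):
    """Return the word whose stored template is closest to the fingerprint."""
    return min(templates, key=lambda word: distance(fingerprint, templates[word]))

# Hypothetical word fingerprint templates (illustrative values only).
WORD_TEMPLATES = {
    "yes": [0.9, 0.1, 0.3],
    "no": [0.2, 0.8, 0.5],
    "stop": [0.4, 0.4, 0.9],
}

# Fingerprint extracted from a new utterance (also illustrative).
best = select_word([0.85, 0.15, 0.35], WORD_TEMPLATES)
print(best)
```

The same selection can be repeated at sentence level by comparing sequences of word matches against sentence templates.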
7. “Divide sound waves, extract phonemes and represent them using some parameters”
LPCC: low resource, high popularity, easy implementation, single speaker, single language, below 300 words
MFCC: moderate resource, high popularity, easy-to-moderate implementation, multi-speaker, multi-language, moderate vocabulary
RASTA-PLP: high resource, low popularity, moderate-to-hard implementation, multi-speaker, multi-language, large vocabulary
Outdated techniques: Power Spectral Analysis (FFT), First Order Derivative (DELTA), Energy Normalization
DNN: multiple users, multiple languages, large vocabulary, abundant resources
Phonetic Dictionary (TTS Synthesizer)
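Two of the building blocks named on this slide, power spectral analysis and the first-order derivative (delta), can be sketched in plain Python. This is a simplified illustration, not an LPCC, MFCC or RASTA-PLP implementation: it uses a naive DFT in place of an FFT, a plain frame-to-frame difference for the delta, and a synthetic sine wave as input. The function names and the frame size are assumptions made for the example.

```python
import cmath
import math

def frames(signal, size, hop):
    """Split a signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def power_spectrum(frame):
    """Power spectrum of one frame via a naive DFT (an FFT yields the same values faster)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def delta(features):
    """First-order difference between the feature vectors of consecutive frames."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(features, features[1:])]

# Synthetic input: a sine wave completing 4 cycles in every 16-sample frame.
signal = [math.sin(2 * math.pi * 4 * t / 16) for t in range(64)]
spectra = [power_spectrum(f) for f in frames(signal, size=16, hop=8)]
deltas = delta(spectra)

# Frequency bin carrying the most power in the first frame.
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
print(peak_bin)
```

Because the test signal is stationary, the delta features are close to zero; for real speech they capture how the spectrum changes from frame to frame.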
8. Hidden or latent data: Z1, Z2, ..., Zn (“Markov Chain”)
Observed data: X1, X2, ..., Xn
Why HMM?
Simple for sequential and temporal data.
Handles real-world applications.
It works on the principle of
New State = f (old state, noise)
Parameters: initial probability, transition probability, observation probability
Applications:
Speech recognition.
Facial expression recognition.
Handwriting recognition.
Bioinformatics: analyzing biological data.
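To show how the initial, transition and observation probabilities of an HMM combine, here is a minimal sketch of Viterbi decoding, a standard way to recover the most likely hidden-state sequence from observed data. The two states (`silence`, `voiced`), the three observation symbols and all probability values are made up for illustration and are not taken from any trained model.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (initial[s] * emission[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the best predecessor state p for reaching state s.
            prob, path = max(
                ((best[p][0] * transition[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emission[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])

# Hypothetical toy model: hidden states and quantised energy observations.
states = ["silence", "voiced"]
initial = {"silence": 0.6, "voiced": 0.4}
transition = {
    "silence": {"silence": 0.7, "voiced": 0.3},
    "voiced": {"silence": 0.4, "voiced": 0.6},
}
emission = {
    "silence": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "voiced": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

prob, path = viterbi(["low", "mid", "high"], states, initial, transition, emission)
print(path, prob)
```

In a speech recogniser the hidden states would typically correspond to sub-word units and the observations to feature vectors, with the probability tables estimated by a training algorithm.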
9. Large systems like Siri, Google voice and Cortana are based on neural networks.
High-performance processors.
AI algorithms.
Parallel processing.
10. In order to build from scratch: time, money, scientists, computing power, engineers
Using built-in support:
Kaldi is a toolkit for speech recognition written in C++ and
licensed under the Apache License v2.0. Kaldi is intended for
use by speech recognition researchers.
An open source toolkit for speech recognition, which includes
a recognizer library written in C; an adjustable, modifiable
recognizer written in Java.
Acoustic model, language model, Input source, Dictionary
11. Use of ASR systems to interact with the devices used in daily life.
ASR systems working in local languages.
Developing neural-network-based ASR systems that work in all major languages.