IT FOR MANAGERS
REPORT ON
SPEECH RECOGNITION SYSTEM
SUBMITTED TO DR. ROSHAN A. SHEIKH
MARCH, 2009
IQBAL S/O SHAHZAD
REGISTRATION # 9952
MBA(M) - SECTION A
ABSTRACT
This report has been submitted to Dr. Roshan A. Sheikh of Iqra University Karachi, as a
requirement for the completion of the course IT for Managers for MBA students. I have
prepared this brief report on the Speech Recognition System after two weeks of study and
research on the topic. I have done my best in presenting and explaining the concepts and
interpreting the report in its proper form.
This report presents an overview of speech recognition technology, software,
development and applications. It begins with an introduction to Speech Recognition Technology
then it explains how such systems work, and the level of accuracy that can be expected.
Applications of speech recognition technology in education and beyond are then explored. A
brief comparison of the most common systems is presented, as well as notes on the main
centres of speech recognition research in the UK educational sector. The report concludes with
potential uses of speech recognition in education, probable main uses of the technology in the
future, and a selection of key web-based resources. It also covers the software being
used for this purpose in homes and in business environments.
A video is also presented with this report, which shows an example of how speech
recognition can be used in Windows Vista. This video was prepared solely by me on my personal
computer. It is available in the soft copy of the project on the attached CD.
TABLE OF CONTENTS
1. Introduction
   1.1 Introduction
   1.2 A Closer Look
2. Terms and Concepts
   2.1 Utterances
   2.2 Pronunciation
   2.3 Grammar
   2.4 Speaker Dependence
   2.5 Accuracy
   2.6 Training
3. How Speech Recognition Works
   3.1 How Speech Recognition Works
   3.2 Acceptance and Rejection
4. Types of Speech Recognition
   4.1 Isolated Words
   4.2 Connected Words
   4.3 Continuous Speech
   4.4 Spontaneous Speech
   4.5 Voice Verification / Identification
5. Hardware
   5.1 Sound Cards
   5.2 Microphones
   5.3 Computers / Processors
6. Uses / Applications of Speech Recognition
   6.1 Military
      6.1.1 High-Performance Fighter Aircraft
      6.1.2 Helicopters
      6.1.3 Training Air Traffic Controllers
   6.2 People with Disabilities
   6.3 Speech Recognition in Telephony Environment
      6.3.1 Communications Management and Personal Assistants
      6.3.2 General Information
      6.3.3 E-Commerce
   6.4 Potential Uses in Education
   6.5 Computer and Video Games
   6.6 Medical Transcription
   6.7 Mobile Devices
   6.8 Voice Security Systems
7. Future Applications
   7.1 Home / Domestic Appliances
   7.2 Wearable Computers
   7.3 Precision Surgery
8. Speech Recognition Software
   8.1 Free Software
   8.2 Commercial Software
      8.2.1 Dragon NaturallySpeaking
      8.2.2 IBM ViaVoice
      8.2.3 Microsoft Speech Recognition System
      8.2.4 MacSpeech Dictate
      8.2.5 Philips Speech Engine
      8.2.6 Other Commercial Software
9. Conclusion
1. INTRODUCTION
Have you ever talked to your computer? I mean, have you really, really talked to your
computer? Where it actually recognized what you said and then did something as a result? If
you have, then you've used a technology known as speech recognition.
Designing a machine that understands human behavior, particularly the capability of
speaking naturally and responding properly to spoken language, has intrigued engineers and
scientists for centuries. Today speech technologies are commercially available for a limited but
interesting range of tasks. These technologies enable machines to respond correctly and
reliably to human voices, and provide useful and valuable services. While we are still far from
having a machine that converses with humans on any topic like another human, many
important scientific and technological advances have taken place, bringing us closer to the
machines that recognize and understand fluently spoken speech.
Speech recognition, simply, is the process of converting spoken input to text; it is thus
sometimes referred to as speech-to-text. Speech recognition, also referred to as voice
recognition, is software technology that lets the user control computer functions and dictate
text by voice. For example, a person can move the mouse cursor with a voice command, such
as "mouse up"; control application functions, such as opening a file menu; create documents,
such as letters or reports; or start a media player by saying "Music".
1.2 A Closer Look
The speech recognition process is performed by a software component known as the
speech recognition engine. The primary function of the speech recognition engine is to process
spoken input and translate it into text that an application understands. The application can then
do one of two things:
 The application can interpret the result of the recognition as a command. In this case,
the application is a command and control application. An example of a command and
control application is one in which the caller says “check balance”, and the application
returns the current balance of the caller’s account.
 If an application handles the recognized text simply as text, then it is considered a
dictation application. In a dictation application, if you said “check balance,” the
application would not interpret the result, but simply return the text “check balance”.
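To make the distinction concrete, here is a minimal Python sketch of the same recognized text handled both ways; the command table and handler names are hypothetical, not part of any real engine:

# Hypothetical sketch: the same recognized text handled two ways.
def handle_as_command(recognized_text):
    # Command and control: map the text onto an action.
    commands = {"check balance": lambda: "Your balance is $100"}
    action = commands.get(recognized_text.lower())
    return action() if action else "Command not understood"

def handle_as_dictation(recognized_text):
    # Dictation: return the text itself, uninterpreted.
    return recognized_text

print(handle_as_command("check balance"))    # -> Your balance is $100
print(handle_as_dictation("check balance"))  # -> check balance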
Speech recognition is an alternative to traditional methods of interacting with a
computer, such as textual input through a keyboard. An effective system can replace, or reduce
the reliance on, standard keyboard and mouse input. This can especially assist the following:
 People who have few keyboard skills or little experience, who are slow typists, or who do
not have the time or resources to develop keyboard skills.
 Dyslexic people, or others who have problems with character or word use and
manipulation in a textual form.
 People with physical disabilities that affect either their data entry, or ability to read (and
therefore check) what they have entered.
A speech recognition system consists of the following:
 A microphone, for the person to speak into.
 Speech recognition software.
 A computer to take and interpret the speech.
 A good quality soundcard for input and/or output.
 Proper, clear pronunciation by the user.
However, systems on computers meant for more individual use, such as for personal
word processing, usually require a degree of “training” before use. Here, an individual user
“trains” the system to understand words or word fragments (see section 2.6); this training is
often referred to as “enrolment”.
2. TERMS AND CONCEPTS
Following are a few of the basic terms and concepts that are fundamental to speech
recognition. It is important to have a good understanding of these concepts.
2.1 Utterances
When the user says something, this is known as an utterance. An utterance is any
stream of speech between two periods of silence. Utterances are sent to the speech engine to
be processed.
Silence, in speech recognition, is almost as important as what is spoken, because silence
delineates the start and end of an utterance. Here's how it works. The speech recognition
engine is "listening" for speech input. When the engine detects audio input - in other words, a
lack of silence -- the beginning of an utterance is signaled. Similarly, when the engine detects a
certain amount of silence following the audio, the end of the utterance occurs.
Utterances are sent to the speech engine to be processed. If the user doesn’t say
anything, the engine returns what is known as a silence timeout - an indication that there was
no speech detected within the expected timeframe, and the application takes an appropriate
action, such as reprompting the user for input.
An utterance can be a single word, or it can contain multiple words (a phrase or a
sentence). For example, “Word”, “Microsoft Word,” or “I’d like to run Microsoft Word” are all
examples of possible utterances. Whether these words and phrases are valid at a particular
point in a dialog is determined by which grammars are active. Note that there are small
snippets of silence between the words spoken within a phrase. If the user pauses too long
between the words of a phrase, the end of an utterance can be detected too soon, and only a
partial phrase will be processed by the engine.
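As a toy illustration of this endpointing idea, the Python sketch below marks an utterance using a simple frame-energy threshold; the threshold and frame counts are invented, and real engines use far more robust voice-activity detection:

# Toy endpointing: an utterance begins when frame energy rises above a
# threshold and ends after enough consecutive silent frames.
def find_utterance(frames, energy_threshold=0.01, max_silent_frames=30):
    start, silent = None, 0
    for i, frame in enumerate(frames):
        energy = sum(s * s for s in frame) / len(frame)
        if energy > energy_threshold:
            if start is None:
                start = i            # beginning of utterance signaled
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= max_silent_frames:
                return start, i - silent   # end of utterance detected
    # None here corresponds to a silence timeout: no speech was detected.
    return (start, len(frames)) if start is not None else None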
2.2 Pronunciation
The speech recognition engine uses all sorts of data, statistical models, and algorithms
to convert spoken input into text. One piece of information that the speech recognition engine
uses to process a word is its pronunciation, which represents what the speech engine thinks a
word should sound like.
Words can have multiple pronunciations associated with them. For example, the word
“the” has at least two pronunciations in the U.S. English language: “thee” and “thuh”.
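Internally, engines often hold this information in a pronunciation lexicon mapping each word to one or more phoneme sequences. A minimal Python sketch, using ARPAbet-style phoneme symbols for illustration:

# A tiny pronunciation lexicon: one word may have several pronunciations.
lexicon = {
    "the": [["DH", "IY"],    # "thee"
            ["DH", "AH"]],   # "thuh"
    "word": [["W", "ER", "D"]],
}

for phones in lexicon["the"]:
    print("-".join(phones))   # DH-IY, then DH-AH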
2.3 Grammar
Grammars define the domain, or context, within which the recognition engine works.
The engine compares the current utterance against the words and phrases in the active
grammars. If the user says something that is not in the grammar, the speech engine will not be
able to understand it correctly. For this reason, general-purpose speech engines usually have very large grammars.
Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by
the Speech Recognition system. Generally, smaller vocabularies are easier for a computer to
recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry
need not be a single word; an entry can be as long as a sentence or two. Smaller
vocabularies can have as few as one or two recognized utterances (e.g. "Wake Up"), while very
large vocabularies can have a hundred thousand or more!
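A minimal Python sketch of checking an utterance against the active grammars; the phrase lists are invented for illustration:

# The engine only accepts utterances that appear in an active grammar.
active_grammars = {
    "wake": {"wake up"},
    "apps": {"word", "microsoft word", "i'd like to run microsoft word"},
}

def in_grammar(utterance):
    text = utterance.lower().strip()
    return any(text in phrases for phrases in active_grammars.values())

print(in_grammar("Microsoft Word"))            # True
print(in_grammar("open the pod bay doors"))    # False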
2.4 Speaker Dependence
Speaker dependence describes the degree to which a speech recognition system
requires knowledge of a speaker’s individual voice characteristics to successfully process
speech. The speech recognition engine can “learn” how you speak words and phrases; it can be
trained to your voice.
Speech recognition systems that require a user to train the system to his/her voice are
known as speaker-dependent systems. If you are familiar with desktop dictation systems, most
are speaker dependent like IBM Via Voice. Because they operate on very large vocabularies,
dictation systems perform much better when the speaker has spent the time to train the
system to his/her voice.
Speech recognition systems that do not require a user to train the system are known as
speaker-independent systems. Speech recognition in the VoiceXML world must be speaker-
independent. Think of how many users (hundreds, maybe thousands) may be calling into your
web site. You cannot require that each caller train the system to his or her voice. The speech
recognition system in a voice-enabled web application MUST successfully process the speech of
many different callers without having to understand the individual voice characteristics of each
caller.
2.5 Accuracy
The ability of a recognizer can be examined by measuring its accuracy, or how well it
recognizes utterances. The performance of a speech recognition system is measurable. Perhaps
the most widely used measurement is accuracy. It is typically a quantitative measurement and
can be calculated in several ways. Arguably the most important measurement of accuracy is
whether the desired end result occurred. This measurement is useful in validating application
design. For example, if the user said "yes," the engine returned "yes," and the "YES" action was
executed, it is clear that the desired result was achieved. But what happens if the engine
returns text that does not exactly match the utterance? For example, what if the user said
"nope," the engine returned "no," yet the "NO" action was executed? Should that be
considered a successful dialog? The answer to that question is yes, because the desired result
was achieved.
Another measurement of recognition accuracy is whether the engine recognized the
utterance exactly as spoken. This measure of recognition accuracy is expressed as a percentage
and represents the number of utterances recognized correctly out of the total number of
utterances spoken. It is a useful measurement when validating grammar design. Using the
previous example, if the engine returned "nope" when the user said "no," this would be
considered a recognition error. Based on the accuracy measurement, you may want to analyze
your grammar to determine if there is anything you can do to improve accuracy. For instance,
you might need to add "nope" as a valid word to your grammar. You may also want to check
your grammar to see if it allows words that are acoustically similar (for example,
"repeat/delete," "Austin/Boston," and "Addison/Madison"), and determine if there is any way
you can make the allowable words more distinctive to the engine.
Recognition accuracy is an important measure for all speech recognition applications. It
is tied to grammar design and to the environment of the user. Good ASR (Automatic Speech
Recognition) systems have an accuracy of 98% or more!
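As a concrete illustration, the utterance-level accuracy measure described above can be computed in a few lines; a minimal Python sketch with invented test data:

# Utterance-level recognition accuracy, as a percentage.
def recognition_accuracy(spoken, recognized):
    correct = sum(1 for s, r in zip(spoken, recognized) if s == r)
    return 100.0 * correct / len(spoken)

spoken     = ["no", "yes", "repeat", "delete"]
recognized = ["nope", "yes", "repeat", "repeat"]   # "repeat/delete" confusion
print(recognition_accuracy(spoken, recognized))    # 50.0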
2.6 Training
Some speech recognizers have the ability to adapt to a speaker. When the system has
this ability, it may allow training to take place. An ASR (Automatic Speech Recognition) system
is trained by having the speaker repeat standard or common phrases and adjusting its
comparison algorithms to match that particular speaker. Training a recognizer usually improves
its accuracy.
Training can also be used by speakers who have difficulty speaking, or pronouncing
certain words. As long as the speaker can consistently repeat an utterance, ASR systems with
training should be able to adapt.
3. HOW SPEECH RECOGNITION WORKS
Now that we've discussed some of the basic terms and concepts involved in speech
recognition, let's put them together and take a look at how the speech recognition process
works.
As you can probably imagine, the speech recognition engine has a rather complex task
to handle: taking raw audio input and translating it into recognized text that an
application understands. The major components involved are:
 Audio input: the raw digital audio, transformed into a more useful acoustic representation.
 Grammar: applied so the speech recognizer knows which words and phonemes to expect. A
grammar could be anything from a context-free grammar to full-blown English.
 Acoustic model: the engine's knowledge of how speech sounds in the environment in which
it operates.
 Recognized text: the text string the engine returns as its result.
The first thing we want to take a look at is the audio input coming into the recognition
engine. It is important to understand that this audio stream is rarely pristine. It contains not
only the speech data (what was said) but also background noise. This noise can interfere with
the recognition process, and the speech engine must handle (and possibly even adapt to) the
environment within which the audio is spoken.
As we've discussed, it is the job of the speech recognition engine to convert spoken
input into text. To do this, it employs all sorts of data, statistics, and software algorithms. Its
first job is to process the incoming audio signal and convert it into a format best suited for
further analysis. Once the speech data is in the proper format, the engine searches for the best
match. It does this by taking into consideration the words and phrases it knows about (the
active grammars), along with its knowledge of the environment in which it is operating. The
knowledge of the environment is provided in the form of an acoustic model. Once it identifies
the most likely match for what was said, it returns what it recognized as a text string.
Most speech engines try very hard to find a match, and are usually very "forgiving." But
it is important to note that the engine is always returning its best guess for what was said.
[Figure: an example of a digital audio signal]
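For readers who want to experiment with this pipeline end to end, the free CMU Sphinx engine (listed in section 8.1) can be driven from Python via the third-party SpeechRecognition package. This is just one possible route, not the only one, and utterance.wav is a placeholder file name:

# Sketch: raw audio in, recognized text out, via the CMU Sphinx engine.
# Requires the third-party packages SpeechRecognition and pocketsphinx.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("utterance.wav") as source:    # placeholder file name
    audio = recognizer.record(source)            # capture the digital audio
try:
    # The engine searches its language model for the best match
    # and returns its best guess as a text string.
    print(recognizer.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Utterance rejected: no confident match found")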
3.2 Acceptance and Rejection
When the recognition engine processes an utterance, it returns a result. The result can
be either of two states: acceptance or rejection. An accepted utterance is one in which the
engine returns recognized text.
Whatever the caller says, the speech recognition engine tries very hard to match the
utterance to a word or phrase in the active grammar. Sometimes the match may be poor
because the caller said something that the application was not expecting, or the caller spoke
indistinctly. In these cases, the speech engine returns the closest match, which might be
incorrect. Some engines also return a confidence score along with the text to indicate the
likelihood that the returned text is correct.
Not all utterances that are processed by the speech engine are accepted. Acceptance or
rejection is flagged by the engine with each processed utterance.
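As an illustration, an application might act on acceptance, rejection, and confidence like this; the 0.5 threshold and the function shape are invented for the sketch, not taken from any particular engine:

# Accept or reject a recognition result using the engine's confidence score.
def handle_result(text, confidence, threshold=0.5):
    if text is None or confidence < threshold:
        return "REJECTED: please repeat that"   # e.g. reprompt the caller
    return "ACCEPTED: " + text

print(handle_result("check balance", 0.92))   # accepted
print(handle_result("check balance", 0.31))   # rejected, likely a poor match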
4. TYPES OF SPEECH RECOGNITION
Speech recognition systems can be separated into several different classes by describing
what types of utterances they have the ability to recognize. These classes are based on the fact
that one of the difficulties of ASR is the ability to determine when a speaker starts and finishes
an utterance. Most packages can fit into more than one class, depending on which mode
they're using.
4.1 Isolated Words
Isolated word recognizers usually require each utterance to have quiet (lack of an audio
signal) on BOTH sides of the sample window. This doesn't mean that the system accepts only
single words; rather, it requires a single utterance at a time. Often, these systems have
"Listen/Not-Listen" states, where they require the speaker to wait between utterances (usually
doing processing during the pauses). "Isolated Utterance" might be a better name for this class.
4.2 Connected Words
Connected word systems (or more correctly 'connected utterances') are similar to isolated
words, but allow separate utterances to be 'run together' with a minimal pause between them.
4.3 Continuous Speech
Continuous recognition is the next step. Recognizers with continuous speech capabilities
are some of the most difficult to create because they must utilize special methods to determine
utterance boundaries. Continuous speech recognizers allow users to speak almost naturally,
while the computer determines the content. Basically, it's computer dictation.
4.4 Spontaneous Speech
There appears to be a variety of definitions for what spontaneous speech actually is. At
a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR
system with spontaneous speech ability should be able to handle a variety of natural speech
features such as words being run together, "ums" and "ahs", and even slight stutters.
4.5 Voice Verification/Identification
Some ASR systems have the ability to identify specific users. Such voice verification and
security systems are discussed in section 6.8.
5. HARDWARE
5.1 Sound Cards
Because speech requires a relatively low bandwidth, just about any medium- to high-
quality 16-bit sound card will get the job done. You must have sound enabled on your system,
with the correct drivers installed. Sound card quality often starts a heated discussion
about its impact on accuracy and noise.
Sound cards with the 'cleanest' A/D (analog to digital) conversions are recommended,
but most often the clarity of the digital sample is more dependent on the microphone quality
and even more dependent on the environmental noise. Electrical "noise" from monitors, PCI
slots, hard drives, etc. is usually nothing compared to audible noise from the computer fans,
squeaking chairs, or heavy breathing.
Some ASR software packages may require a specific sound card. It's usually a good idea
to stay away from specific hardware requirements, because it limits many of your possible
future options and decisions. You'll have to weigh the benefits and costs if you are considering
packages that require specific hardware to function properly.
5.2 Microphones
A quality microphone is key when utilizing ASR. In most cases, a desktop microphone
just won't do the job: it tends to pick up more ambient noise, which gives ASR programs a hard
time.
Handheld microphones are also not the best choice, as they can be cumbersome to pick
up all the time. While they do limit the amount of ambient noise, they are most useful in
applications that require changing speakers often, or when speaking to the recognizer isn't
done frequently (when wearing a headset isn't an option).
The best choice, and by far the most common, is the headset style. It allows the ambient
noise to be minimized, while allowing you to have the microphone at the tip of your tongue all
the time. Headsets are available without earphones and with earphones (mono or stereo). I
recommend the stereo headphones, but it's just a matter of personal taste.
A quick note about levels: don't forget to turn up your microphone volume. This can be
done with a program such as XMixer or OSS Mixer, and care should be taken to avoid feedback
noise. If the ASR software includes auto-adjustment programs, use them instead, as they are
optimized for their particular recognition system.
5.3 Computers/Processors
ASR applications can be heavily dependent on processing speed. This is because a large
amount of digital filtering and signal processing can take place in ASR.
As with just about any CPU-intensive software, the faster the better; likewise, the more
memory the better. It's possible to do some speech recognition with a 100 MHz processor and
16 MB of RAM, but for fast processing (large dictionaries, complex recognition schemes, or high
sample rates), you should shoot for a minimum of a 1 GHz processor and 1 GB of RAM. Because
of the processing required, most
software packages list their minimum requirements.
6. USES / APPLICATIONS
6.1 Military
6.1.1 High-performance fighter aircraft
Substantial efforts have been devoted in the last decade to the test and evaluation of
speech recognition in fighter aircraft. Of particular note are the U.S. program in speech
recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft, the program
in France on installing speech recognition systems on Mirage aircraft, and programs in the UK
dealing with a variety of aircraft platforms. In these programs, speech recognizers have been
operated successfully in fighter aircraft with applications including: setting radio frequencies,
commanding an autopilot system, setting steer-point coordinates and weapons release
parameters, and controlling flight displays. Generally, only very limited, constrained
vocabularies have been used successfully, and a major effort has been devoted to integration of
the speech recognizer with the avionics system.
Some important conclusions from the work were as follows:
1. Speech recognition has definite potential for reducing pilot workload, but this potential was
not realized consistently.
2. Achievement of very high recognition accuracy (95% or more) was the most critical factor
for making the speech recognition system useful — with lower recognition rates, pilots
would not use the system.
3. More natural vocabulary and grammar, and shorter training times would be useful, but only
if very high recognition rates could be maintained.
4. Laboratory research in robust speech recognition for military environments has produced
promising results which, if extendable to the cockpit, should improve the utility of speech
recognition in high-performance aircraft.
The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-
dependent system, i.e. it requires each pilot to create a template. The system is not used for
any safety critical or weapon critical tasks, such as weapon release or lowering of the
undercarriage, but is used for a wide range of other cockpit functions. Voice commands are
confirmed by visual and/or aural feedback. The system is seen as a major design feature in the
reduction of pilot workload, and even allows the pilot to assign targets to himself with two
simple voice commands or to any of his wingmen with only five commands.
6.1.2 Helicopters
The problems of achieving high recognition accuracy under stress and noise pertain
strongly to the helicopter environment as well as to the fighter environment. The acoustic noise
problem is actually more severe in the helicopter environment, not only because of the high
noise levels but also because the helicopter pilot generally does not wear a facemask, which
would reduce acoustic noise in the microphone. Substantial test and evaluation programs have
been carried out in the past decade in speech recognition systems applications in helicopters,
notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the
Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech
recognition in the Puma helicopter. There has also been much useful work in Canada. Results
have been encouraging, and voice applications have included: control of communication radios;
setting of navigation systems; and control of an automated target handover system.
As in fighter applications, the overriding issue for voice in helicopters is the impact on
pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these
represent only a feasibility demonstration in a test environment. Much remains to be done
both in speech recognition and in overall speech technology in order to consistently achieve
performance improvements in operational settings.
6.1.3 Training Air Traffic Controllers
Training for military air traffic controllers (ATC) represents an excellent application for
speech recognition systems. Many ATC training systems currently require a person to act as a
"pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the
dialog which the controller would have to conduct with pilots in a real ATC situation. Speech
recognition and synthesis techniques offer the potential to eliminate the need for a person to
act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also
characterized by highly structured speech as the primary output of the controller, hence
reducing the difficulty of the speech recognition task.
The U.S. Naval Training Equipment Center has sponsored a number of developments of
prototype ATC trainers using speech recognition. Generally, the recognition accuracy falls short
of providing graceful interaction between the trainee and the system. However, the prototype
training systems have demonstrated a significant potential for voice interaction in these
systems, and in other training applications. The U.S. Navy has sponsored a large-scale effort in
ATC training systems, where a commercial speech recognition unit was integrated with a
complex training system including displays and scenario creation. Although the recognizer was
constrained in vocabulary, one of the goals of the training programs was to teach the
controllers to speak in a constrained language, using specific vocabulary specifically designed
for the ATC task. Research in France has focused on the application of speech recognition in
ATC training systems, directed at issues both in speech recognition and in application of task-
domain grammar constraints.
Another approach to ATC simulation with speech recognition has been created by
Supremis. The Supremis system is not constrained by rigid grammars imposed by the underlying
limitations of other recognition strategies.
6.2 People with Disabilities
It has been suggested that one of the most promising areas for the application of speech
recognition is in helping handicapped people (Leggett and Williams, 1984). Speech recognition
technology helps people with disabilities interact with computers more easily. People with
motor limitations, who cannot use a standard keyboard and mouse, can use their voices to
navigate the computer and create documents. For example, Braille input/output devices, touch
screen systems, and trackballs have all been used successfully in classrooms. The technology
is also useful to people with learning disabilities who experience difficulty with spelling and
writing. Some individuals with speech impairments may use speech recognition as a therapeutic
tool to improve vocal quality. People with overuse or repetitive stress injuries also benefit from
using speech recognition to operate their computers hands free. Speech recognition technology
has great potential to provide people with disabilities greater access to computers and a world
of opportunities.
Mr. Jones is a reporter who must submit his articles in HTML for publishing in an on-line
journal. Over his twenty-year career, he has developed repetitive stress injury (RSI) in his hands
and arms, and it has become painful for him to type. He uses a combination of speech
recognition and an alternative keyboard to prepare his articles, but he doesn't use a mouse. It
took him several months to become sufficiently accustomed to using speech recognition to be
comfortable working for many hours at a time. There are some things he has not worked out
yet, such as a sound card conflict that arises whenever he tries to use speech recognition on
Web sites that have streaming audio. (Source : http://www.w3.org/WAI/EO/Drafts/PWD-Use-
Web/).
6.3 Speech Recognition in Telephony Environment
William Meisel, who holds a Ph.D. in Electrical Engineering, ran a speech recognition
company for ten years. He is president of the speech industry consulting firm TMA Associates
and publisher and editor of the Speech Recognition Update newsletter. According to him:
Telephone speech recognition creates a Voice Web. Sites that support speech
recognition constitute the Voice Web. Most sites today have individual phone numbers
(typically toll-free). Such sites are often called "voice portals". There are, however, likely to be
more popular voice portals than Web portals; every wireless and landline telephone service
provider will eventually be a voice portal, and there will be independent, corporate, and
specialized voice portals. VoiceXML, a new standard created by the VoiceXML Forum
(www.voicexml.org) and the W3C Voice Browser working group (www.w3.org/voice), is a way
that companies can provide a voice-interactive application on a Web server without needing
speech engines or telephone line interface hardware. The VoiceXML code is downloaded to the
voice portal and executed by a VoiceXML interpreter, much as a Web browser on a PC
interprets HTML.
(Source : William Meisel’s Guide Book on The Voice Web)
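To make the VoiceXML idea concrete, here is a minimal VoiceXML 2.0 dialog, held in a Python string for illustration; it relies on the standard built-in boolean grammar, and a real service would serve such a document from a web server for the voice portal's interpreter to fetch:

# A minimal VoiceXML 2.0 dialog, held in a Python string for illustration.
# It uses the built-in "boolean" (yes/no) grammar defined by the standard.
VXML_BALANCE = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="balance">
    <field name="wants_balance" type="boolean">
      <prompt>Would you like to hear your account balance?</prompt>
      <filled>
        <if cond="wants_balance">
          <prompt>Your balance is one hundred dollars.</prompt>
        </if>
        <exit/>
      </filled>
    </field>
  </form>
</vxml>"""

print(VXML_BALANCE)  # a web server would return this to the interpreter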
The Voice Web is not just an extension of the Internet, although information on existing
Web sites can be used to support interactive voice services. It can run applications totally unlike
visual Web applications and totally independent of the HTML-based Web. Some of the
applications that the Voice Web is supporting are listed here.
6.3.1 Communications management and personal assistants
Communications management usually includes dialing by name using a personal
directory. Personal-assistant functionality includes call screening, taking and accessing voice
messages, and one-number access to the subscriber (scanning several subscriber numbers
based on subscriber instructions). Other personalized features include maintaining a schedule
and delivering reminders. Unified messaging includes features such as reviewing email or fax
headers by phone using text-to-speech. Since subscribers will make calls through their personal
assistant, the voice portal can potentially get additional revenues from providing bundled local
and/or long-distance service.
Enterprise applications, such as voice-activated auto attendants that direct calls by
name, can form a corporate voice portal. Corporate voice portals can also provide such services as
reservations for a conference, location of a local store outlet, or a connection to customer
service.
6.3.2 General information
General information includes weather, sports scores, horoscopes, general news,
financial news, stock quotes, traffic conditions, and driving directions. Such information is
intended to make a voice-enabled service part of a subscriber’s daily habit. Information can be
customized, using, for example, the user’s personal stock portfolio or the user’s current
location. As voice portals evolve, the caller will be able to "voicemark" specialized voice-
equipped Web sites.
6.3.3 E-commerce
V-commerce supports a variety of transactions that can result in product or service
sales. These include transactions similar to ordering from a Web site or telephone catalog
service. They also include finding a business by saying its trade name or its category.
Entertainment is part of e-commerce, and it will be part of the Voice Web. For example,
the caller can use speech recognition to choose audio channels to listen to.
(Source : Receiver Magazine, Vodafone - 2001)
6.4 Potential uses in education
Contact with a number of practitioners and researchers in the field of speech
recognition led to some interesting speculation regarding the feasible use of this technology in
education.
Potential applications, each followed by its problems and likelihood:

1. Teaching students of foreign languages to pronounce vocabulary correctly.
   Unlikely in the near future on a large scale, due to the software training currently involved.

2. Teaching overseas students to pronounce English correctly.
   As above: unlikely in the near future on a large scale.

3. Making notes of observations during scientific experiments, so the scientist/researcher can
   focus on the observation without needing to view the monitor or keyboard (similar to how a
   coroner verbally records notes during an autopsy).
   Likely, and probably already used in individual circumstances. Noise from the experiment,
   the researcher's need to rapidly record observations, and a vocabulary that covers the
   scientific terms all present issues.

4. Enabling students who are physically handicapped and unable to use a keyboard to enter
   text verbally.
   Used already, and becoming increasingly widespread.

5. Enabling people with textual interpretive problems, e.g. dyslexia, to enter text verbally.
   Used already, and becoming increasingly widespread.

6. Restricting access to a high-security computer, where a keyboard or other input device may
   be used by hackers.
   Interest from a number of people, though a lack of "proof of concept" research hinders
   further development. Unlikely to be available in the near future.

7. Narrative-oriented research, where transcripts are automatically generated, removing the
   time needed to generate transcripts manually as well as human error.
   Likely in the near future. Current speech recognition technology imposes an unacceptable
   compromise between accuracy and inhibiting the interviewee. Quicker and easier training
   systems for the interviewee will help, as will increases in portable computing processing
   power.

8. Capturing the speech of a lecturer or tutor.
   Unlikely on a large scale, due to vocabulary, training and interpretive issues. In addition,
   filming the lecture produces combined audio and visual content, which may be more useful.

9. Using a speech recognition system in an examination.
   Very likely. Technically this is possible, and within current UK examination guidelines it
   appears to be acceptable.
(Source : http://www.becta.org.uk/technology/speechrecog/docs/finalreport.pdf - the
final report (June 2000) from an experimental project to see how effective speech recognition
technologies could be for people with special educational needs.)
6.5 Computer and Video Games
Speech input has been used in a limited number of computer and video games, on a
variety of PC and console-based platforms, over the past decade. For example, the game
Seaman involved growing and controlling strange half-man, half-fish characters in a virtual
aquarium. A microphone, sold with the game, allowed the player to issue one of a pre-
determined list of command words and questions to the fish. The accuracy of interpretation, in
use, seemed variable; during gaming sessions, colleagues with strong accents had to speak in an
exaggerated and slower manner in order for the game to understand their commands.
Microphone-based games are available for two of the three main video game consoles
(PlayStation 2 and Xbox). However, these games primarily use speech in an online player-to-
player manner, rather than spoken words being interpreted electronically. For example,
MotoGP for the Xbox allows online players to ride against each other in a motorbike race
simulation, and speak (via microphone headset) to the nearest players (bikers) in the race.
There is currently interest in, but less development of, video games that interpret speech.
The Microsoft Xbox, Nintendo GameCube, and Sony PlayStation 2 consoles all offer
games with speech input/output. Currently, most games are war-action-shooter games. In
these, speech recognition provides high-level commands to virtual teammates who respond
with a variety of recorded quips. Let's take as examples two graphically realistic, tactical
squad-based shooter games: Ghost Recon 2 and SOCOM II: U.S. Navy Seals. Both games are
available on the Sony PlayStation 2. The speech recognition systems for these games are
provided by Fonix and ScanSoft, respectively.
In Ghost Recon 2, the user is the leader of a team of
three secret Special Forces soldiers who must capture various
military targets in North Korea in the year 2007. The team is
critical to the user’s survival from enemy gunfire. Saying “Move
out!” directs the team to move ahead of you as you make your
way through the virtual, hilly terrain toward various objectives.
The speech commands (“Move out,” “Covering fire,”
“Grenade,” “Take point,” “Hold position,” “Regroup”) are
easily-recalled, high-level instructions to the team members. The commands that can be
obeyed depend on the immediate situation. If you say, "Take point," and the hostile fire is too
great, the designated team member may say, "No can do, Captain." Occasionally, the retort is
somewhat less respectful.
In SOCOM II: U.S. Navy Seals, a team of four men,
including the first-person leader, attempts to stop an arms
smuggling group in rural Albania. The team has to avoid the
enemy, meet an informant, blow up weapons caches, and make
their escape. The speech commands in this game are spoken in
three parts, using a simple grammar. The commands may be
addressed to "Fireteam" (all other team members) or to
individuals, like "Able" (your partner). Then there are
approximately 12 action commands including “Fire at will,”
“Deploy,” “Move to,” “Get down,” and others. The third part of
the command includes nine letters of the military alphabet
(“Charlie,” “Delta,” etc.) indicating where the “Move to” and
similar commands are intended. They represent the specific
locations of game objectives.
(Source: Article from The Speech Technology Magazine Apr 2005,
http://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=29432)
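The three-part command structure described above lends itself to a very simple grammar. A hypothetical Python sketch of such a parser, with word lists abridged from the description:

# Hypothetical parser for the three-part commands described above:
# <addressee> <action> <location>.
ADDRESSEES = {"fireteam", "able"}
ACTIONS = {"fire at will", "deploy", "move to", "get down"}
LOCATIONS = {"alpha", "bravo", "charlie", "delta", "echo",
             "foxtrot", "golf", "hotel", "india"}

def parse_command(utterance):
    text = utterance.lower().strip()
    for who in ADDRESSEES:
        if text.startswith(who):
            rest = text[len(who):].strip()
            for act in ACTIONS:
                if rest.startswith(act):
                    loc = rest[len(act):].strip()
                    if not loc or loc in LOCATIONS:
                        return who, act, loc or None
    return None  # not in the grammar: the utterance would be rejected

print(parse_command("Fireteam move to Charlie"))
# -> ('fireteam', 'move to', 'charlie')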
6.6 Medical Transcription
Medical transcription, also known as MT, is an allied
health profession which deals with the process of transcription:
converting voice-recorded reports, as dictated by physicians
and/or other healthcare professionals, into text format.
Every day, doctors scour the market looking for new
ways to help simplify their office routines and reduce their
costs. Medical Transcription software saves their time and
money. The speech recognition product produces accurate
and fully formatted transcriptions from clinicians' dictations.
The goal is to minimize editing time by MTs and, as a result,
increase MT productivity. It interprets and formats a
document, so that it is close to a final product.
Benefits:
 Organized and formatted document sections
 Punctuation inserted even if not spoken
 Numbers interpreted and presented appropriately. This includes dosages, measurements,
lists, etc.
 Formatting based on each organization’s preferences and specifications
 Inserts speech-activated ‘normals’
 No explicit training required
 Continually learns and improves from MT edits
Examples:
When a clinician dictates: "Exam…vital signs…two twelve…eighty eight and
regular…thirteen…BP one forty one hundred and one thirty five ninety five"
Speech Recognition software can output: PHYSICAL EXAMINATION: VITAL SIGNS: Weight
212, pulse 88 and regular, respiration 13, blood pressure is 140/100, 135/95.
When a provider says: "The following problems were reviewed…hypertension …please
enter my hypertension template…use my normal cad"
Speech Recognition software can output: PROBLEMS: The following problems were
reviewed:
 Hypertension: No headache, visual disturbance, chest pain, palpitation, focal neurologic
complaint, dyspnea, edema, claudication, or complaint from current medication.
 Coronary artery disease: No chest pain, dyspnea, PND, orthopnea, palpitation, weakness,
syncope, or obvious problems related to medications.
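A greatly simplified Python sketch of the rule-based formatting step suggested by the examples above; the number mappings are invented, and commercial products use far richer language models:

# Toy formatting pass: replace spoken numbers and capitalize the result.
# The mappings below are invented for illustration.
SPOKEN_NUMBERS = {
    "two twelve": "212",
    "eighty eight": "88",
    "thirteen": "13",
    "one forty": "140",
    "one hundred": "100",
}

def format_dictation(spoken):
    text = spoken.lower()
    for words, digits in SPOKEN_NUMBERS.items():
        text = text.replace(words, digits)
    return text.upper()

print(format_dictation("vital signs...two twelve...eighty eight and regular"))
# -> VITAL SIGNS...212...88 AND REGULAR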
6.7 Mobile Devices
The growth of cellular telephony combined with recent advances in speech recognition
technology results in sizeable potential opportunities for mobile speech recognition
applications. Speech recognition in mobile phones has already been introduced, but there is a
lot of work to be done in this particular field. When speech recognition was first introduced in
mobile phones, it was used to call a contact by saying his or her name. The user first needed to
record a voice clip of each contact's name and associate it with that contact. When the user
later said a name, the mobile compared it with the recorded clips and called the person whose
name was spoken.
New smart mobile phones are being introduced every month. These phones don't require
recording the names first; they have their own speech system, which can read the names as
written in English. When the user says a name, the phone uses its speech system to compare the
spoken sound with the saved contacts and then calls the contact whose name was spoken.
Nuance Communications has launched the Nuance Mobile Speech Platform, which will
improve the text-to-speech and speech recognition abilities of mobile devices. Through this
platform, end users will be able to perform searches, dictate emails and SMS messages, and
have any incoming emails and messages read out to them, which will improve the usability and
efficiency of mobile devices.
The Nuance Mobile Speech Platform can be used to speech-enable a mobile application,
and specifically offers pre-built components for the following:
 Nuance Local Search - search business names and categories, residential listings, weather,
dining and entertainment, movies, etc.
 Nuance Mobile Navigation - voice destination entry (including street addresses, businesses
and points of interest) and spoken turn-by-turn directions.
 Nuance Content Search - search catalogs with items in music, video, games and more.
 Nuance Mobile Web Search - search the Web from a mobile device.
 Nuance Mobile Communications - compose email, SMS, and IM messages by speaking.
(Source: Nuance Communications http://www.nuance.com)
6.8 Voice Security Systems
Voice Security Systems technology uses a person's voice print to uniquely identify
individuals using biometric speaker verification technology. Speech is processed through a non-
contact method; you do not need to see or to touch the person to be able to recognize them.
The popularity of speaker verification is swiftly growing because speech is easy to obtain
without the addition of dedicated hardware. Improved, robust speech recognition algorithms
and PC hardware have also brought this one-time futuristic idea into the present.
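In outline, speaker verification compares a stored voice print against a fresh voice sample. A hedged Python sketch using cosine similarity between feature vectors; the vectors and the 0.8 threshold are purely illustrative:

# Toy speaker verification: compare voice-print feature vectors.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(enrolled_print, sample_print, threshold=0.8):
    # Accept the claimed identity only if the prints are similar enough.
    return cosine_similarity(enrolled_print, sample_print) >= threshold

print(verify([0.2, 0.7, 0.1], [0.25, 0.65, 0.12]))  # True: likely the same speaker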
At Voice Security Systems, a decade of research
and development has led them to believe that the
explosive speech processing market is here to stay.
Their Voice Protect® method of biometric voice
authentication is ideally suited for low memory,
database independent applications using smart cards
or other physical devices such as cell phones. Due to
the value of biometric security for use in fraud
prevention, and the added convenience of knowing a
person is who they claim to be, they believe speaker
verification will be more widely accepted by the
consumer market before speech recognition.
Voice Security Systems can deliver biometric security technology to the market at a
lower cost than anyone else in the industry, with no recurring maintenance costs such as
database management or complicated user training. Once the Voice Protect® technology is
built into a product it will continue to function independently for the life of the product.
Voice security systems can be applied in our daily lives; for example, they can be
successfully applied in garage door openers, computers and laptops, automobiles, PDAs and
handheld devices, smartcard applications, cell phones, door access systems, and ATM machines.
(Source: Voice Security Systems Inc. http://www.voice-security.com/)
7. FUTURE APPLICATIONS
There are a number of scenarios where speech recognition is either being delivered,
developed for, researched or seriously discussed. As with many contemporary technologies,
such as the Internet, online payment systems and mobile phone functionality, development is
at least partially market-driven.
IBM intends to have better-than-human Automatic Speech Recognition by 2010. Bill
Gates predicted that by 2011 the quality of ASR will catch up to humans. Justin Rattner from
Intel said in 2005 that by 2015, computers will have "strong capabilities" in speech-to-text.
At some point in the future, speech recognition may become speech understanding. The
statistical models that allow computers to decide what a person just said may someday allow
them to grasp the meaning behind the words. Although it is a huge leap in terms of
computational power and software sophistication, some researchers argue that speech
recognition development offers the most direct line from the computers of today to true
artificial intelligence. We can talk to our computers today. In 25 years, they may very well talk
back.
7.1 Home Appliances
Designers have developed very convenient user interfaces to consumer appliances.
What could be easier than pressing buttons on a remote control to select television channels or
flipping a switch to turn on a light? These types of direct manipulation user interfaces will
continue to be widely used. However, because current buttons and switches are not intelligent,
you cannot ask your remote control when "Star Trek" is on, and you must walk to the light
switch before turning the light on. Speech enables consumer appliances to act intelligently,
responding to speech commands and answering verbal questions. For example, speech
enhances consumer appliances by enabling the user to say instructions such as:
1. To the VCR: "Record tonight's 'Star Trek'."
2. To the coffeepot: "Start at 6:30 a.m. tomorrow."
3. To the light switch: "Turn on the lights one half-hour before sunset."
There is, inevitably, interest in the use of speech recognition in domestic appliances
such as ovens, refrigerators, dishwashers and washing machines. One school of thought is that,
like the use of speech recognition in cars, this can reduce the number of parts and therefore the
cost of production of the machine. However, removal of the normal buttons and controls would
present problems for people who, for physical or learning reasons, cannot use speech
recognition systems.
7.2 Wearable Computers
Perhaps the most futuristic application is in the use and functionality of wearable
computers i.e. unobtrusive devices that you can wear like a watch, or are even embedded in
your clothes. These would allow people to go about their everyday lives, but still store
information (thoughts, notes, to-do lists) verbally, or communicate via email, phone or
videophone, through wearable devices. Crucially, this would be done without having to interact
with the device, or even remember that it is there; the user would just speak, the device would
know what to do with the speech, and would carry out the appropriate task.
The rapid miniaturization of computing
devices, the rapid rise in processing power, and
advances in mobile wireless technologies, are
making these devices more feasible. There are
still significant problems, such as background
noise and the idiosyncrasies of an individual’s
language, to overcome. However, it is
speculated that reliable versions of such devices
will become commercially available during this
decade.
The conventional human-computer interface, such as the GUI, which assumes a keyboard,
mouse, and bit-map display, is insufficient for the wearable environment. Although
handwritten character recognizers and keyboards that can be used with
one hand have been developed as input devices for computers, speech recognition has recently
received more interest. The main reason for this is that it permits both hands and eyes to be
kept free and therefore is less restricted in its use and can achieve quicker communication. In
addition, speech can convey not only linguistic information but also the emotion and identity of
speakers. IBM's wearable PC, for example, has a microphone in its controller and can
recognize speech once ViaVoice has been installed.
7.3 Precision Surgery
Developments in keyhole and micro surgery have clearly shown that an approach of as
little invasive or non-essential surgery as possible improves success rates and patient recovery
times. There is occasional speculation in various medical fora regarding the use of speech
recognition in precision surgery, where a procedure is partially or totally carried out by
automated means.
For example, in removing a tumour or blockage without damaging surrounding tissue, a
command could be given to make an incision of a precise and small length e.g. 2 millimetres.
However, the legal implications of such technology are a formidable barrier to significant
developments in this area. If speech were incorrectly interpreted and, for example, a limb were
accidentally sliced off, who would be liable: the surgeon, the surgery system developers, or the
speech recognition software developers?
8. SPEECH RECOGNITION SOFTWARE
Modern speech recognition software enables a single computer user to speak text
and/or commands to the computer, largely, but not entirely, bypassing the use of the keyboard
and mouse interface.
The idea has been portrayed in science fiction for many decades, quite frequently
depicting computers that do not even have keyboards or mice. Such computers are also
typically depicted as being able to keep up no matter how fast a person speaks, and without
regard to who the speaker is, the language spoken, or even how many speakers there are. In
other words, they depict a computer that hears as well as a multilingual person does.
Attempts to develop usable speech recognition software began in the mid-1900s, and
proved to be far more daunting than anyone had imagined. It also turned out to require so
much computing power that only the most modern computers are now able to perform the
functions required in real time (i.e., as fast as you can speak).
The first commercially practical products became available around 1990, (e.g. the Voice
Navigator, a standalone computer dedicated 100% to speech recognition) and used up all the
available computing power of the machine, which would send its output to a second computer.
They weren't particularly accurate and could only understand a single person at a time,
requiring retraining, not of the operator but of the machine itself, to work for another person.
Despite these limitations, they could type so rapidly that even after taking time to make
corrections, a person with disabilities could easily accomplish more work with the machine than
without it. For persons with physical disabilities, the ability to simply talk to your computer
could be a priceless asset. Consider, for instance, an author with Parkinson's disease who can
barely control his hands, yet is conveniently able to create an article.
8.1 Free Software
Many software packages are available for speech recognition, and a number of them are free of
cost. Some free packages are:
 XVoice
(http://www.compapp.dcu.ie/~tdoris/Xvoice/
http://www.zachary.com/creemer/xvoice.html)
 CVoiceControl/kVoiceControl
(http://www.kiecza.de/daniel/linux/index.html)
 Ears
(ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/)
 NICO ANN Toolkit
(http://www.speech.kth.se/NICO/index.html)
 Myers' Hidden Markov Model Software
(http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html)
 Jialong He's Speech Recognition Research Tool
(http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html)
 Open Mind Speech
(http://freespeech.sourceforge.net)
 GVoice
(http://www.cse.ogi.edu/~omega/gnome/gvoice/)
 ISIP
(http://www.isip.msstate.edu/project/speech/)
 CMU Sphinx
(http://www.speech.cs.cmu.edu/sphinx/Sphinx.html)
8.2 Commercial Software
8.2.1 Dragon NaturallySpeaking
Dragon NaturallySpeaking is almost universally regarded in reviews as the best voice-
recognition software, with the potential for 99.8 percent accuracy (reviews say 95 percent is
more realistic). NaturallySpeaking integrates easily with Microsoft productivity software. The
Preferred version can also be used with a compatible digital-audio recorder, MP3
player/recorder or PDA for recording voice notes or lectures on the go; NaturallySpeaking will
later transcribe your recordings. Reviews say Dragon NaturallySpeaking is the most
sophisticated product on the market, but that if you have
Windows Vista or plan to buy a new computer with it, you
should try the voice-recognition capabilities included with
Vista, which by most accounts are nearly as robust as Dragon
NaturallySpeaking.
(Source: http://www.nuance.com/naturallyspeaking/)
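To put the two accuracy figures above in perspective, a quick back-of-the-envelope
calculation (mine, not the reviewers') translates them into correction workload:

```python
# Expected word errors in a 1,000-word dictated document at the claimed
# (99.8%) versus the more realistic (95%) accuracy quoted in reviews.
for accuracy in (0.998, 0.95):
    errors = round(1000 * (1 - accuracy))
    print(f"{accuracy:.1%} accuracy -> about {errors} errors per 1,000 words")
```

At 95 percent accuracy a user corrects roughly 50 words per 1,000 dictated, twenty-five times
the error rate implied by the 99.8 percent claim.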
8.2.2 IBM ViaVoice
IBM ViaVoice is a range of language-specific continuous speech recognition software
products offered by IBM. The current version is designed primarily for use in embedded
devices.
Individual language editions may have different features, specifications, technical
support, and microphone support. Some of the products or editions available are:
• Advanced Edition,
• Standard Edition,
• Personal Edition,
• ViaVoice for Mac OS X Edition,
• Pro USB Edition,
• Simply Dictation for Mac.
Prior to the development of ViaVoice, IBM developed
a product named VoiceType. In 1997, ViaVoice was first
introduced to the general public. Two years later, in 1999,
IBM released a free-of-charge version of ViaVoice.
I didn't find a single review that recommends ViaVoice
over Dragon NaturallySpeaking, but ViaVoice is the only
program that will run on older or less powerful computers.
Dragon NaturallySpeaking is extremely demanding (you need
at the very least 512 MB RAM, a recent processor and 1 GB
free hard-drive space). However, reviews say ViaVoice isn't as
accurate as Dragon NaturallySpeaking, and mistakes aren't as easy to correct. ViaVoice hasn't
been updated in years.
(Source: http://www.ibm.com/software/speech/)
8.2.3 Microsoft Speech Recognition System
In 1993, Microsoft hired Xuedong Huang from CMU to lead its speech efforts. Microsoft
has been involved in research on both speech recognition and text-to-speech.[2] The company's
research eventually led to the development of the Speech API (SAPI).
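To make this concrete, the sketch below reaches SAPI through its COM interface from a
script. It exercises the synthesis side of the API purely to show that SAPI exposes speech
services to ordinary applications; Windows and the third-party pywin32 package are assumed
prerequisites, not details taken from this report.

```python
# Minimal sketch: calling the Microsoft Speech API (SAPI) over COM from Python.
# Assumes Windows plus pywin32 (pip install pywin32) - an illustrative setup.
import win32com.client

# "SAPI.SpVoice" is the COM ProgID of the SAPI speech synthesizer.
voice = win32com.client.Dispatch("SAPI.SpVoice")
voice.Speak("SAPI exposes speech services through a COM interface.")
```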
Speech recognition technology has been used in some of Microsoft's products, including
Microsoft Dictation (a research prototype that ran on Windows 9x). It was also included in
Office XP, Office 2003[3], Microsoft Plus! for Windows XP, Windows XP Tablet PC Edition, and
Windows Mobile (as Microsoft Voice Command)[4]. However, prior to Windows Vista, speech
recognition was not mainstream. In response, Windows Speech Recognition was bundled with
Windows Vista and released in 2006, making the operating system the first mainstream version
of Microsoft Windows to offer fully integrated support for speech recognition.
Windows Speech Recognition in Windows Vista empowers users to interact with their
computers by voice. It was designed for people who want to significantly limit their use of the
mouse and keyboard while maintaining or increasing their overall productivity. You can dictate
documents and emails in mainstream applications, use voice commands to start and switch
between applications, control the operating system, and even fill out forms on the Web.
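As a hedged illustration of the command-and-control side (my own sketch, not Microsoft's
API), the fragment below shows what happens after an engine returns recognized text: the
application maps a small fixed set of command phrases onto actions. The phrases and target
programs are hypothetical examples.

```python
# Toy command-and-control dispatcher: map recognized utterances to actions.
# The engine that produces `recognized_text` is assumed to exist elsewhere.
import subprocess

COMMANDS = {
    "start notepad": lambda: subprocess.Popen(["notepad.exe"]),
    "start calculator": lambda: subprocess.Popen(["calc.exe"]),
}

def dispatch(recognized_text: str) -> None:
    # Normalize the utterance the same way the command phrases are written.
    action = COMMANDS.get(recognized_text.strip().lower())
    if action is not None:
        action()  # run the bound action
    else:
        print(f"No command bound to: {recognized_text!r}")

dispatch("Start Notepad")
```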
Windows Speech Recognition is a new feature in Windows Vista, built using the latest
Microsoft speech technologies. Windows Vista Speech Recognition provides excellent
recognition accuracy that improves with each use as it adapts to your speaking style and
vocabulary. Speech Recognition is available in English (U.S.), English (U.K.), German (Germany),
French (France), Spanish (Spain), Japanese, Chinese (Traditional), and Chinese (Simplified).
Early reviews say it rivals Dragon NaturallySpeaking 9 for accuracy. If you buy a new
computer, you'll get Vista by default, so you can try out its voice-recognition features before
buying other software. You can also upgrade an older computer to Vista, but the system
requirements are demanding. Reviewers say Dragon NaturallySpeaking has a slight edge, but
cite no compelling reason to buy it if you have or plan to buy Vista.
(Source: http://www.microsoft.com/speech/speech2007/default.mspx)
8.2.4 MacSpeech Dictate
MacSpeech is a company that develops speech recognition software for Apple
Macintosh computers. Established in 1996 by current CEO Andrew Taylor, it is currently the
only company developing voice dictation systems for the Macintosh, and its full product line is
devoted to speech recognition and dictation. In 2008 its previous flagship product, iListen, was
replaced by Dictate, which is built around Nuance's licensed Dragon NaturallySpeaking engine.
Reviews say Dictate, introduced in early 2008, is as accurate in tests as Dragon
NaturallySpeaking itself and much better than its predecessor, iListen. Dictate comes with a
microphone headset, and no products directly compete with it.
(Source: http://www.macspeech.com/dictate/)
8.2.5 Philips SpeechMagic
SpeechMagic is an industrial-grade platform for capturing information in digital form,
developed by Philips Speech Recognition Systems of Vienna, Austria. SpeechMagic features
large-vocabulary speech recognition as well as a number of services aimed at supporting
“accurate, convenient and efficient” information capture in healthcare IT applications. The
technology is used mainly in the healthcare sector; however, applications are also available for
the legal market and for tax consultants.
SpeechMagic supports 25 recognition languages and provides more than 150 ConTexts
(industry-specific vocabularies). More than 8,000 healthcare sites in 45 nations use
SpeechMagic to capture information and create professional documents. The world’s largest
site powered by SpeechMagic is in the United States, with more than 60,000 authors, more
than 3,000 editors and a throughput of 400 million lines per year.
In 2005, growth consulting company Frost & Sullivan recognized SpeechMagic with the
Market Leadership Award in European Healthcare. In 2007, Frost & Sullivan presented Philips
Speech Recognition Systems with the Global Excellence Award in Speech Recognition.
(Source: http://www.myspeech.com/)
8.2.6 Other Commercial Software
There are many other commercial software packages for speech recognition, including:
• HTK
(http://htk.eng.cam.ac.uk/)
• CSLU Toolkit
(http://cslu.cse.ogi.edu/toolkit/)
• Simmortel Voice
(http://www.simmortel.com)
• Quack.com by AOL
(http://www.quack.com)
• SpeechWorks
(http://www.speechworks.com)
• Babel Technologies
(http://www.babeltech.com)
• Vocalis Speechware
(http://www.vocalisspeechware.com)
• Entropic
(http://htk.eng.cam.ac.uk)
9. CONCLUSION
Speech recognition will revolutionize the way people conduct business over the Web
and will, ultimately, differentiate world-class e-businesses. VoiceXML ties speech recognition
and telephony together and provides the technology with which businesses can develop and
deploy voice-enabled Web solutions today. These solutions can greatly expand the accessibility
of Web-based self-service transactions to customers who would otherwise not have access
and, at the same time, leverage a business's existing Web investments. Speech recognition and
VoiceXML clearly represent the next wave of the Web. In the near future, people will operate
their home and business computers by speech rather than by keyboard and mouse, and home
automation may come to be built largely on speech recognition.
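As a final hedged sketch of what tying speech recognition and telephony together looks like
in practice, the fragment below assembles a minimal VoiceXML dialog as a string; the dialog
itself (a two-option prompt) is a hypothetical example, and in a real deployment the document
would be served to a VoiceXML gateway, which performs the actual recognition and telephony
work.

```python
# Build a minimal, hypothetical VoiceXML 2.0 dialog. The gateway that fetches
# this document, not this script, performs speech recognition and telephony.
VXML_DIALOG = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main">
    <field name="choice">
      <prompt>Say balance or transactions.</prompt>
      <option>balance</option>
      <option>transactions</option>
      <filled>
        <prompt>You said <value expr="choice"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>"""

print(VXML_DIALOG)  # a web server would return this to the voice gateway
```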

Más contenido relacionado

La actualidad más candente

Speech to text conversion
Speech to text conversionSpeech to text conversion
Speech to text conversionankit_saluja
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice RecognitionAmrita More
 
Speech recognition final presentation
Speech recognition final presentationSpeech recognition final presentation
Speech recognition final presentationhimanshubhatti
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminarDiptimaya Sarangi
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition TechnologySeminar Links
 
A seminar report on speech recognition technology
A seminar report on speech recognition technologyA seminar report on speech recognition technology
A seminar report on speech recognition technologySrijanKumar18
 
Automatic speech recognition system
Automatic speech recognition systemAutomatic speech recognition system
Automatic speech recognition systemAlok Tiwari
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognitionananth
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By MatlabAnkit Gujrati
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition TechnologyAamir-sheriff
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech RecognitionAhmed Moawad
 
Speech recognition
Speech recognitionSpeech recognition
Speech recognitionCharu Joshi
 
Speech Recognition by Iqbal
Speech Recognition by IqbalSpeech Recognition by Iqbal
Speech Recognition by IqbalIqbal
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech RecognitionHugo Moreno
 
Artificial Intelligence for Speech Recognition
Artificial Intelligence for Speech RecognitionArtificial Intelligence for Speech Recognition
Artificial Intelligence for Speech RecognitionRHIMRJ Journal
 
Abstract of speech recognition
Abstract of speech recognitionAbstract of speech recognition
Abstract of speech recognitionVinay Jaisriram
 
speech processing and recognition basic in data mining
speech processing and recognition basic in  data miningspeech processing and recognition basic in  data mining
speech processing and recognition basic in data miningJimit Rupani
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speechBilgin Aksoy
 
Speech recognition techniques
Speech recognition techniquesSpeech recognition techniques
Speech recognition techniquessonukumar142
 

La actualidad más candente (20)

Speech to text conversion
Speech to text conversionSpeech to text conversion
Speech to text conversion
 
Voice Recognition
Voice RecognitionVoice Recognition
Voice Recognition
 
Speech recognition final presentation
Speech recognition final presentationSpeech recognition final presentation
Speech recognition final presentation
 
Speech recognition system seminar
Speech recognition system seminarSpeech recognition system seminar
Speech recognition system seminar
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition Technology
 
A seminar report on speech recognition technology
A seminar report on speech recognition technologyA seminar report on speech recognition technology
A seminar report on speech recognition technology
 
Automatic speech recognition system
Automatic speech recognition systemAutomatic speech recognition system
Automatic speech recognition system
 
Deep Learning For Speech Recognition
Deep Learning For Speech RecognitionDeep Learning For Speech Recognition
Deep Learning For Speech Recognition
 
Speech Recognition System By Matlab
Speech Recognition System By MatlabSpeech Recognition System By Matlab
Speech Recognition System By Matlab
 
Speech Recognition Technology
Speech Recognition TechnologySpeech Recognition Technology
Speech Recognition Technology
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech Recognition
 
Speech recognition
Speech recognitionSpeech recognition
Speech recognition
 
Speech Recognition by Iqbal
Speech Recognition by IqbalSpeech Recognition by Iqbal
Speech Recognition by Iqbal
 
Speech Recognition
Speech RecognitionSpeech Recognition
Speech Recognition
 
Artificial Intelligence for Speech Recognition
Artificial Intelligence for Speech RecognitionArtificial Intelligence for Speech Recognition
Artificial Intelligence for Speech Recognition
 
Speech Recognition System
Speech Recognition SystemSpeech Recognition System
Speech Recognition System
 
Abstract of speech recognition
Abstract of speech recognitionAbstract of speech recognition
Abstract of speech recognition
 
speech processing and recognition basic in data mining
speech processing and recognition basic in  data miningspeech processing and recognition basic in  data mining
speech processing and recognition basic in data mining
 
Introduction to text to speech
Introduction to text to speechIntroduction to text to speech
Introduction to text to speech
 
Speech recognition techniques
Speech recognition techniquesSpeech recognition techniques
Speech recognition techniques
 

Destacado

Speech recognition project report
Speech recognition project reportSpeech recognition project report
Speech recognition project reportSarang Afle
 
Text to-speech & voice recognition
Text to-speech & voice recognitionText to-speech & voice recognition
Text to-speech & voice recognitionMark Williams
 
Ideological rationale
Ideological rationaleIdeological rationale
Ideological rationaleRabia Nawaz
 
Allama Muhammad Iqbal
Allama Muhammad IqbalAllama Muhammad Iqbal
Allama Muhammad IqbalAfia Shahid
 
The Great Leader M A Jinnah
The Great Leader M A JinnahThe Great Leader M A Jinnah
The Great Leader M A Jinnahkharison
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognitionRichie
 

Destacado (7)

Speech recognition project report
Speech recognition project reportSpeech recognition project report
Speech recognition project report
 
Bhutto speeches 1948-66
Bhutto speeches 1948-66Bhutto speeches 1948-66
Bhutto speeches 1948-66
 
Text to-speech & voice recognition
Text to-speech & voice recognitionText to-speech & voice recognition
Text to-speech & voice recognition
 
Ideological rationale
Ideological rationaleIdeological rationale
Ideological rationale
 
Allama Muhammad Iqbal
Allama Muhammad IqbalAllama Muhammad Iqbal
Allama Muhammad Iqbal
 
The Great Leader M A Jinnah
The Great Leader M A JinnahThe Great Leader M A Jinnah
The Great Leader M A Jinnah
 
Automatic speech recognition
Automatic speech recognitionAutomatic speech recognition
Automatic speech recognition
 

Similar a Speech Recognition by Iqbal

A survey on Enhancements in Speech Recognition
A survey on Enhancements in Speech RecognitionA survey on Enhancements in Speech Recognition
A survey on Enhancements in Speech RecognitionIRJET Journal
 
Real Time Direct Speech-to-Speech Translation
Real Time Direct Speech-to-Speech TranslationReal Time Direct Speech-to-Speech Translation
Real Time Direct Speech-to-Speech TranslationIRJET Journal
 
Desktop Based Voice Assistant Application Using Machine Learning Approach
Desktop Based Voice Assistant Application Using Machine Learning ApproachDesktop Based Voice Assistant Application Using Machine Learning Approach
Desktop Based Voice Assistant Application Using Machine Learning ApproachIRJET Journal
 
Instant speech translation 10BM60080 - VGSOM
Instant speech translation   10BM60080 - VGSOMInstant speech translation   10BM60080 - VGSOM
Instant speech translation 10BM60080 - VGSOMsathiyaseelanm
 
Control mouse and computer system using voice commands
Control mouse and computer system using voice commandsControl mouse and computer system using voice commands
Control mouse and computer system using voice commandseSAT Journals
 
VOCAL- Voice Command Application using Artificial Intelligence
VOCAL- Voice Command Application using Artificial IntelligenceVOCAL- Voice Command Application using Artificial Intelligence
VOCAL- Voice Command Application using Artificial IntelligenceIRJET Journal
 
Voice Assistant Using Python and AI
Voice Assistant Using Python and AIVoice Assistant Using Python and AI
Voice Assistant Using Python and AIIRJET Journal
 
IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...
IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...
IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...IRJET Journal
 
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP ProcessIRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP ProcessIRJET Journal
 
Developing a hands-free interface to operate a Computer using voice command
Developing a hands-free interface to operate a Computer using voice commandDeveloping a hands-free interface to operate a Computer using voice command
Developing a hands-free interface to operate a Computer using voice commandMohammad Liton Hossain
 
IRJET- Voice to Code Editor using Speech Recognition
IRJET- Voice to Code Editor using Speech RecognitionIRJET- Voice to Code Editor using Speech Recognition
IRJET- Voice to Code Editor using Speech RecognitionIRJET Journal
 
Assistive Examination System for Visually Impaired
Assistive Examination System for Visually ImpairedAssistive Examination System for Visually Impaired
Assistive Examination System for Visually ImpairedEditor IJCATR
 
Computer science basics for nonit students
Computer science basics for nonit studentsComputer science basics for nonit students
Computer science basics for nonit studentsSrikanth KS
 
11-miwai10_submission_12
11-miwai10_submission_1211-miwai10_submission_12
11-miwai10_submission_12Long Tran
 
IRJET- Applications of Artificial Intelligence in Neural Machine Translation
IRJET- Applications of Artificial Intelligence in Neural Machine TranslationIRJET- Applications of Artificial Intelligence in Neural Machine Translation
IRJET- Applications of Artificial Intelligence in Neural Machine TranslationIRJET Journal
 
IRJET- Virtual Vision for Blinds
IRJET- Virtual Vision for BlindsIRJET- Virtual Vision for Blinds
IRJET- Virtual Vision for BlindsIRJET Journal
 
IRJET- ASL Language Translation using ML
IRJET- ASL Language Translation using MLIRJET- ASL Language Translation using ML
IRJET- ASL Language Translation using MLIRJET Journal
 
1P A R T Introduction to Analytics and AII
1P A R T Introduction to Analytics and AII1P A R T Introduction to Analytics and AII
1P A R T Introduction to Analytics and AIITatianaMajor22
 
Speech enabled interactive voice response system
Speech enabled interactive voice response systemSpeech enabled interactive voice response system
Speech enabled interactive voice response systemeSAT Journals
 

Similar a Speech Recognition by Iqbal (20)

A survey on Enhancements in Speech Recognition
A survey on Enhancements in Speech RecognitionA survey on Enhancements in Speech Recognition
A survey on Enhancements in Speech Recognition
 
Real Time Direct Speech-to-Speech Translation
Real Time Direct Speech-to-Speech TranslationReal Time Direct Speech-to-Speech Translation
Real Time Direct Speech-to-Speech Translation
 
Desktop Based Voice Assistant Application Using Machine Learning Approach
Desktop Based Voice Assistant Application Using Machine Learning ApproachDesktop Based Voice Assistant Application Using Machine Learning Approach
Desktop Based Voice Assistant Application Using Machine Learning Approach
 
Instant speech translation 10BM60080 - VGSOM
Instant speech translation   10BM60080 - VGSOMInstant speech translation   10BM60080 - VGSOM
Instant speech translation 10BM60080 - VGSOM
 
Desktop assistant
Desktop assistant Desktop assistant
Desktop assistant
 
Control mouse and computer system using voice commands
Control mouse and computer system using voice commandsControl mouse and computer system using voice commands
Control mouse and computer system using voice commands
 
VOCAL- Voice Command Application using Artificial Intelligence
VOCAL- Voice Command Application using Artificial IntelligenceVOCAL- Voice Command Application using Artificial Intelligence
VOCAL- Voice Command Application using Artificial Intelligence
 
Voice Assistant Using Python and AI
Voice Assistant Using Python and AIVoice Assistant Using Python and AI
Voice Assistant Using Python and AI
 
IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...
IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...
IRJET- Communication System for Blind, Deaf and Dumb People using Internet of...
 
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP ProcessIRJET -  	  E-Assistant: An Interactive Bot for Banking Sector using NLP Process
IRJET - E-Assistant: An Interactive Bot for Banking Sector using NLP Process
 
Developing a hands-free interface to operate a Computer using voice command
Developing a hands-free interface to operate a Computer using voice commandDeveloping a hands-free interface to operate a Computer using voice command
Developing a hands-free interface to operate a Computer using voice command
 
IRJET- Voice to Code Editor using Speech Recognition
IRJET- Voice to Code Editor using Speech RecognitionIRJET- Voice to Code Editor using Speech Recognition
IRJET- Voice to Code Editor using Speech Recognition
 
Assistive Examination System for Visually Impaired
Assistive Examination System for Visually ImpairedAssistive Examination System for Visually Impaired
Assistive Examination System for Visually Impaired
 
Computer science basics for nonit students
Computer science basics for nonit studentsComputer science basics for nonit students
Computer science basics for nonit students
 
11-miwai10_submission_12
11-miwai10_submission_1211-miwai10_submission_12
11-miwai10_submission_12
 
IRJET- Applications of Artificial Intelligence in Neural Machine Translation
IRJET- Applications of Artificial Intelligence in Neural Machine TranslationIRJET- Applications of Artificial Intelligence in Neural Machine Translation
IRJET- Applications of Artificial Intelligence in Neural Machine Translation
 
IRJET- Virtual Vision for Blinds
IRJET- Virtual Vision for BlindsIRJET- Virtual Vision for Blinds
IRJET- Virtual Vision for Blinds
 
IRJET- ASL Language Translation using ML
IRJET- ASL Language Translation using MLIRJET- ASL Language Translation using ML
IRJET- ASL Language Translation using ML
 
1P A R T Introduction to Analytics and AII
1P A R T Introduction to Analytics and AII1P A R T Introduction to Analytics and AII
1P A R T Introduction to Analytics and AII
 
Speech enabled interactive voice response system
Speech enabled interactive voice response systemSpeech enabled interactive voice response system
Speech enabled interactive voice response system
 

Más de Iqbal

Demutualization Of Stock Exchanges
Demutualization Of Stock ExchangesDemutualization Of Stock Exchanges
Demutualization Of Stock ExchangesIqbal
 
Leadership by Iqbal
Leadership by IqbalLeadership by Iqbal
Leadership by IqbalIqbal
 
Revenue Management by Iqbal
Revenue Management by IqbalRevenue Management by Iqbal
Revenue Management by IqbalIqbal
 
Motivation from Concepts to Application by Iqbal
Motivation from Concepts to Application by IqbalMotivation from Concepts to Application by Iqbal
Motivation from Concepts to Application by IqbalIqbal
 
Revenue management by Iqbal
Revenue management by IqbalRevenue management by Iqbal
Revenue management by IqbalIqbal
 
Understanding and Managing Speaker Anxiety by Iqbal
Understanding and Managing Speaker Anxiety by IqbalUnderstanding and Managing Speaker Anxiety by Iqbal
Understanding and Managing Speaker Anxiety by IqbalIqbal
 
Understanding and Managing Speaker Anxiety Final
Understanding and Managing Speaker Anxiety FinalUnderstanding and Managing Speaker Anxiety Final
Understanding and Managing Speaker Anxiety FinalIqbal
 
Correct Usage of Nouns and Pronouns by Iqbal
Correct Usage of Nouns and Pronouns by IqbalCorrect Usage of Nouns and Pronouns by Iqbal
Correct Usage of Nouns and Pronouns by IqbalIqbal
 
Stress Management by Iqbal
Stress Management by IqbalStress Management by Iqbal
Stress Management by IqbalIqbal
 
Leadership by Iqbal
Leadership by IqbalLeadership by Iqbal
Leadership by IqbalIqbal
 

Más de Iqbal (10)

Demutualization Of Stock Exchanges
Demutualization Of Stock ExchangesDemutualization Of Stock Exchanges
Demutualization Of Stock Exchanges
 
Leadership by Iqbal
Leadership by IqbalLeadership by Iqbal
Leadership by Iqbal
 
Revenue Management by Iqbal
Revenue Management by IqbalRevenue Management by Iqbal
Revenue Management by Iqbal
 
Motivation from Concepts to Application by Iqbal
Motivation from Concepts to Application by IqbalMotivation from Concepts to Application by Iqbal
Motivation from Concepts to Application by Iqbal
 
Revenue management by Iqbal
Revenue management by IqbalRevenue management by Iqbal
Revenue management by Iqbal
 
Understanding and Managing Speaker Anxiety by Iqbal
Understanding and Managing Speaker Anxiety by IqbalUnderstanding and Managing Speaker Anxiety by Iqbal
Understanding and Managing Speaker Anxiety by Iqbal
 
Understanding and Managing Speaker Anxiety Final
Understanding and Managing Speaker Anxiety FinalUnderstanding and Managing Speaker Anxiety Final
Understanding and Managing Speaker Anxiety Final
 
Correct Usage of Nouns and Pronouns by Iqbal
Correct Usage of Nouns and Pronouns by IqbalCorrect Usage of Nouns and Pronouns by Iqbal
Correct Usage of Nouns and Pronouns by Iqbal
 
Stress Management by Iqbal
Stress Management by IqbalStress Management by Iqbal
Stress Management by Iqbal
 
Leadership by Iqbal
Leadership by IqbalLeadership by Iqbal
Leadership by Iqbal
 

Último

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Último (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

Speech Recognition by Iqbal

  • 1. IITT FFOORR MMAANNAAGGEERRSS RREEPPOORRTT OONN SSPPEEEECCHH RREECCOOGGNNIITTIIOONN SSYYSSTTEEMM SSUUBBMMIITTTTEEDD TTOO DDRR.. RROOSSHHAANN AA.. SSHHEEIIKKHH MMAARRCCHH,, 22000099 IQBAL S/O SHAHZAD REGISTRATION # 9952 MBA(M) - SECTION A
  • 2. Speech Recognition System IT Project IQBAL P a g e | 1 AABBSSTTRRAACCTT This report has been submitted to Dr. Roshan A. Sheikh of Iqra University Karachi, as a requirement for the completion of the course , IT for Managers for MBA students. I have prepared this brief report on Speech Recognition System after deep study and research on the topic for two weeks. I have done by best in presenting, explaining the concepts and interpreting the report in its proper form. This report presents an overview of speech recognition technology, software, development and applications. It begins with an introduction to Speech Recognition Technology then it explains how such systems work, and the level of accuracy that can be expected. Applications of speech recognition technology in education and beyond are then explored. A brief comparison of the most common systems is presented, as well as notes on the main centres of speech recognition research in the UK educational sector. The report concludes with potential uses of speech recognition in education, probable main uses of the technology in the future, and a selection of key web-based resources. It also includes software that are being used for this purpose in homes and also in business environment. A video is also presented with this report which shows an example of how we can use speech recognition in windows vista. This video is prepared solely by me on my personal computer. It is available in the soft copy of the project in attached CD.
  • 3. Speech Recognition System IT Project IQBAL P a g e | 2 TTAABBLLEE OOFF CCOONNTTEENNTTSS 1. Introduction ………………………………………………………………………………………….......... 4 1.1 Introduction ………………………………………………………………………………………… 4 1.2 Closer Look …………………………………………………………………………………………. 4-5 2. Terms and Concepts ……………………………………………………………………………….……… 6 2.1 Utterances ………………………………………………………………………………….………. 6 2.2 Pronunciation …………………………………………………………………………….…….…. 6 2.3 Grammar …………………………………………………………………………………….……… 7 2.4 Speaker Dependence ……………………………………………………………….….……… 7 2.5 Accuracy …………………………………………………………………………………….………. 8 2.6 Training ………………………………………………………………………………….….………. 8-9 3. How Speech Recognition Works ………………………………………………………………….… 10 3.1 How Speech Recognition Works ……………………………………………………….… 10 3.2 Acceptance and Regection ……………………………………………………………….… 11-12 4. Types of Speech Recognition ………………………………………………………………………… 13 4.1 Isolated Words …………………………………………………………………………………… 13 4.2 Connected Words ………………………………………………………………………………. 13 4.3 Continuous Speech …………………………………………………………………………….. 13 4.4 Spontaneous Speech ………………………………………………………………………….. 13-14 4.5 Voice Verification / Identification ………………………………………………………. 14 5. Hardware ……………………………………………………………………………………………………... 15 5.1 Soud Cards …………………………………………………………………………………………. 15 5.2 Microphones ……………………………………………………………………………………… 15-16 5.3 Computers / Processors …………………………………………………………………….. 16 6. Uses / Applications of Speech Recognition ………………………………………………….. 17 6.1 Military ……………………………………………………………………………………………... 17 6.1.1 High Performance Fighter Aircrafts ………………………………………. 17 6.1.2 Helicopters ……………………………………………………………………………. 18 6.1.3 Training Air Traffic Controllers ……………………………………………… 18-19 6.2 People with Disabilities ………………………………………………………………………. 19 6.3 Speech Recognition in Telephony Environment ………………………………….. 20 6.3.1 Communications Management and Personal Assistants …………. 21
  • 4. Speech Recognition System IT Project IQBAL P a g e | 3 6.3.2 General Information …………………………………………………….…………. 21 6.3.3 E-Commerce …………………………………………………………………………… 21 6.4 Potential Uses in Education ………………………………………………………………… 22-23 6.5 Computer and Video Games ………………………………………………………………. 23-24 6.6 Medical Transcription ………………………………………………………………………… 24-25 6.7 Mobile Devices …………………………………………………………………………………... 25-26 6.8 Voice Security Systems ……………………………………………………………………….. 26-27 7. Future Applications ………………………………………………………………………………………. 28 7.1 Home / Domestic Appliances …………………………………………………………….. 28-29 7.2 Wearable Computers ………………………………………………………………………… 29 7.3 Precision Surgery ………………………………………………………………………………. 30 8. Speech Recognition Software ………………………………………………………………………. 31 8.1 Free Software ……………………………………………………………………………………. 31-32 8.2 Commercial Software ……………………………………………………………………….. 32 8.2.1 Dragon Naturally Speeking ……………………………………………………. 32-33 8.2.2 IBM Via Voice ……………………………………………………………………….. 33 8.2.3 Microsoft Speech Recognition System …………………………………… 34 8.2.4 MacSpeech Dictate ……………………………………………………………….. 35 8.2.5 Philips Speech Engine ……………………………………………………………. 35-36 8.2.6 Other commercial software ………………………………………………….. 36 9. Conclusion …………………………………………………………………………………………………… 37
  • 5. Speech Recognition System IT Project IQBAL P a g e | 4 11.. IINNTTRROODDUUCCTTIIOONN Have you ever talked to your computer? I mean, have you really, really talked to your computer? Where it actually recognized what you said and then did something as a result? If you have, then you've used a technology known as speech recognition. Designing a machine that understand human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Today speech technologies are commercially available for a limited but interesting range of tasks. These technologies enable machines to respond correctly and reliably to human voices, and provide useful and valuable services. While we are still far from having a machine that converses with humans on any topic like another human, many important scientific and technological advances have taken place, bringing us closer to the machines that recognize and understand fluently spoken speech. “Speech Recognition Simply is the process of converting spoken input to text. Speech recognition is thus sometimes referred to as speech-to-text. Speech recognition, also referred to as voice recognition, is software technology that lets the user control computer functions and dictate text by voice. For example, a person can move the mouse cursor with a voice command, such as “mouse up;” control application functions, such as opening up a file menu; or create documents, such as letters or reports or start media player by saying “Music”. 1.2 A Closer Look The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands. The application can then do one of two things:  The application can interpret the result of the recognition as a command. In this case, the application is a command and control application. An example of a command and control application is one in which the caller says “check balance”, and the application returns the current balance of the caller’s account.  If an application handles the recognized text simply as text, then it is considered a dictation application. In a dictation application, if you said “check balance,” the application would not interpret the result, but simply return the text “check balance”.
  • 6. Speech Recognition System IT Project IQBAL P a g e | 5 Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace, or reduce the reliability on, standard keyboard and mouse input. This can especially assist the following:  People who have little keyboard skills or experience, who are slow typists, or do not have the time or resources to develop keyboard skills.  Dyslexic people, or others who have problems with character or word use and manipulation in a textual form.  People with physical disabilities that affect either their data entry, or ability to read (and therefore check) what they have entered. A speech recognition system consists of the following:  A microphone, for the person to speak into.  Speech recognition software.  A computer to take and interpret the speech.  A good quality soundcard for input and/or output.  A proper and good pronunciation. However, systems on computers meant for more individual use, such as for personal word processing, usually require a degree of “training” before use. Here, an individual user “trains” the system to understand words or word fragments (see section 2.6); this training is often referred to as “enrolment”.
  • 7. Speech Recognition System IT Project IQBAL P a g e | 6 22.. TTEERRMMSS AANNDD CCOONNCCEEPPTTSS Following are a few of the basic terms and concepts that are fundamental to speech recognition. It is important to have a good understanding of these concepts. 2.1 Utterances When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence. Utterances are sent to the speech engine to be processed. Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here's how it works. The speech recognition engine is "listening" for speech input. When the engine detects audio input - in other words, a lack of silence -- the beginning of an utterance is signaled. Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs. Utterances are sent to the speech engine to be processed. If the user doesn’t say anything, the engine returns what is known as a silence timeout - an indication that there was no speech detected within the expected timeframe, and the application takes an appropriate action, such as reprompting the user for input. An utterance can be a single word, or it can contain multiple words (a phrase or a sentence). For example, “Word”, “Microsoft Word,” or “I’d like to run Microsoft Word” are all examples of possible utterances. Whether these words and phrases are valid at a particular point in a dialog is determined by which grammars are active. Note that there are small snippets of silence between the words spoken within a phrase. If the user pauses too long between the words of a phrase, the end of an utterance can be detected too soon, and only a partial phrase will be processed by the engine. 2.2 Pronunciation The speech recognition engine uses all sorts of data, statistical models, and algorithms to convert spoken input into text. One piece of information that the speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks a word should sound like. Words can have multiple pronunciations associated with them. For example, the word “the” has at least two pronunciations in the U.S. English language: “thee” and “thuh”.
  • 8. Speech Recognition System IT Project IQBAL P a g e | 7 2.3 Grammar Grammars define the domain, or context, within which the recognition engine works. The engine compares the current utterance against the words and phrases in the active grammars. If the user says something that is not in the grammar, the speech engine will not be able to understand it correctly. So usually speech engines have a very vast grammar. Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the Speech Recognition system. Generally, smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single word. They can be as long as a sentence or two. Smaller vocabularies can have as few as 1 or 2 recognized utterances (e.g."Wake Up"), while very large vocabularies can have a hundred thousand or more! 2.4 Speaker Dependence Speaker dependence describes the degree to which a speech recognition system requires knowledge of a speaker’s individual voice characteristics to successfully process speech. The speech recognition engine can “learn” how you speak words and phrases; it can be trained to your voice. Speech recognition systems that require a user to train the system to his/her voice are known as speaker-dependent systems. If you are familiar with desktop dictation systems, most are speaker dependent like IBM Via Voice. Because they operate on very large vocabularies, dictation systems perform much better when the speaker has spent the time to train the system to his/her voice. Speech recognition systems that do not require a user to train the system are known as speaker-independent systems. Speech recognition in the VoiceXML world must be speaker- independent. Think of how many users (hundreds, maybe thousands) may be calling into your web site. You cannot require that each caller train the system to his or her voice. The speech recognition systemin a voice-enabled web application MUST successfully process the speech of many different callers without having to understand the individual voice characteristics of each caller.
  • 9. Speech Recognition System IT Project IQBAL P a g e | 8 2.5 Accuracy The ability of a recognizer can be examined by measuring its accuracy − or how well it recognizes utterances. The performance of a speech recognition system is measurable. Perhaps the most widely used measurement is accuracy. It is typically a quantitative measurement and can be calculated in several ways. Arguably the most important measurement of accuracy is whether the desired end result occurred. This measurement is useful in validating application design. For example, if the user said "yes," the engine returned "yes," and the "YES" action was executed, it is clear that the desired result was achieved. But what happens if the engine returns text that does not exactly match the utterance? For example, what if the user said "nope," the engine returned "no," yet the "NO" action was executed? Should that be considered a successful dialog? The answer to that question is yes because the desired result was acheived. Another measurement of recognition accuracy is whether the engine recognized the utterance exactly as spoken. This measure of recognition accuracy is expressed as a percentage and represents the number of utterances recognized correctly out of the total number of utterances spoken. It is a useful measurement when validating grammar design. Using the previous example, if the engine returned "nope" when the user said "no," this would be considered a recognition error. Based on the accuracy measurement, you may want to analyze your grammar to determine if there is anything you can do to improve accuracy. For instance, you might need to add "nope" as a valid word to your grammar. You may also want to check your grammar to see if it allows words that are acoustically similar (for example, "repeat/delete," "Austin/Boston," and "Addison/Madison"), and determine if there is any way you can make the allowable words more distinctive to the engine. Recognition accuracy is an important measure for all speech recognition applications. It is tied to grammar design and to the environment of the user. Good ASR (Automatic Speech Recognition) systems have an accuracy of 98% or more! 2.6 Training Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR (Automatic Speech Recognition) system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy.
  • 10. Speech Recognition System IT Project IQBAL P a g e | 9 Training can also be used by speakers that have difficulty speaking, or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.
  • 11. Speech Recognition System IT Project IQBAL P a g e | 10 33.. HHOOWW SSPPEEEECCHH RREECCOOGGNNIITTIIOONN WWOORRKKSS Now that we've discussed some of the basic terms and concepts involved in speech recognition, let's put them together and take a look at how the speech recognition process works. As you can probably imagine, the speech recognition engine has a rather complex task to handle, that of taking raw audio input and translating it to recognized text that an application understands. As shown in the diagram below, the major components we want to discuss are:  Audio input - Transform of the digital audio into a better acoustic representation  Apply a "grammar" so the speech recognizer knows what phonemes to expect. A grammar could be anything from a context-free grammar to full-blown English.  Acoustic Model  Recognized text The first thing we want to take a look at is the audio input coming into the recognition engine. It is important to understand that this audio stream is rarely pristine. It contains not only the speech data (what was said) but also background noise. This noise can interfere with
  • 12. Speech Recognition System IT Project IQBAL P a g e | 11 the recognition process, and the speech engine must handle (and possibly even adapt to) the environment within which the audio is spoken. As we've discussed, it is the job of the speech recognition engine to convert spoken input into text. To do this, it employs all sorts of data, statistics, and software algorithms. Its first job is to process the incoming audio signal and convert it into a format best suited for further analysis. Once the speech data is in the proper format, the engine searches for the best match. It does this by taking into consideration the words and phrases it knows about (the active grammars), along with its knowledge of the environment in which it is operating. The knowledge of the environment is provided in the form of an acoustic model. Once it identifies the most likely match for what was said, it returns what it recognized as a text string. Most speech engines try very hard to find a match, and are usually very "forgiving." But it is important to note that the engine is always returning it's best guess for what was said. (This is an example of a digital audio) 3.2 Acceptance and Rejection When the recognition engine processes an utterance, it returns a result. The result can be either of two states: acceptance or rejection. An accepted utterance is one in which the engine returns recognized text. Whatever the caller says, the speech recognition engine tries very hard to match the utterance to a word or phrase in the active grammar. Sometimes the match may be poor because the caller said something that the application was not expecting, or the caller spoke indistinctly. In these cases, the speech engine returns the closest match, which might be
  • 13. Speech Recognition System IT Project IQBAL P a g e | 12 incorrect. Some engines also return a confidence score along with the text to indicate the likelihood that the returned text is correct. Not all utterances that are processed by the speech engine are accepted. Acceptance or rejection is flagged by the engine with each processed utterance.
  • 14. Speech Recognition System IT Project IQBAL P a g e | 13 44.. TTYYPPEESS OOFF SSPPEEEECCHH RREECCOOGGNNIITTIIOONN Speech recognition systems can be separated in several different classes by describing what types of utterances they have the ability to recognize. These classes are based on the fact that one of the difficulties of ASR is the ability to determine when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on which mode they're using. 4.1 Isolated Words Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on BOTH sides of the sample window. It doesn't mean that it accepts single words, but does require a single utterance at a time. Often, these systems have "Listen/Not−Listen" states, where they require the speaker to wait between utterances (usually doing processing during the pauses). Isolated Utterance might be a better name for this class. 4.2 Connected Words Connect word systems (or more correctly 'connected utterances') are similar to Isolated words, but allow separate utterances to be 'run−together' with a minimal pause between them. 4.3 Continuous Speech Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must utilize special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it's computer dictation. 4.4 Spontaneous Speech There appears to be a variety of definitions for what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR
  • 15. Speech Recognition System IT Project IQBAL P a g e | 14 system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, "ums" and "ahs", and even slight stutters. 4.5 Voice Verification/Identification Some ASR systems have the ability to identify specific users. This document doesn't cover verification or security systems.
5. HARDWARE

5.1 Sound Cards

Because speech requires relatively low bandwidth (quantified in the short example at the end of this section), just about any medium- to high-quality 16-bit sound card will get the job done. You must have sound enabled in your kernel, and you must have the correct drivers installed.

Sound card quality often starts a heated discussion about its impact on accuracy and noise. Sound cards with the "cleanest" A/D (analog-to-digital) conversion are recommended, but most often the clarity of the digital sample depends more on the microphone quality, and even more on the environmental noise. Electrical "noise" from monitors, PCI slots, hard drives and so on is usually nothing compared to audible noise from computer fans, squeaking chairs, or heavy breathing.

Some ASR software packages may require a specific sound card. It is usually a good idea to stay away from specific hardware requirements, because they limit many of your possible future options and decisions. You will have to weigh the benefits and costs if you are considering packages that require specific hardware to function properly.

5.2 Microphones

A quality microphone is key when using ASR. In most cases, a desktop microphone just won't do the job: desktop microphones tend to pick up more ambient noise, which gives ASR programs a hard time. Hand-held microphones are also not the best choice, as they are cumbersome to pick up all the time. While they do limit the amount of ambient noise, they are most useful in applications that require changing speakers often, or where speaking to the recognizer is infrequent and wearing a headset is not an option.

The best choice, and by far the most common, is the headset style. It minimizes ambient noise while keeping the microphone at the tip of your tongue all the time. Headsets are available with or without earphones (mono or stereo). I recommend the stereo headphones, but it is just a matter of personal taste.

A quick note about levels: do not forget to turn up your microphone volume. This can be done with a program such as XMixer or OSS Mixer, and care should be taken to avoid feedback noise. If the ASR software includes auto-adjustment programs, use them instead, as they are optimized for their particular recognition system.

5.3 Computers/Processors

ASR applications can be heavily dependent on processing speed, because a large amount of digital filtering and signal processing can take place in ASR. As with just about any CPU-intensive software, the faster the better; likewise, the more memory the better. It is possible to do some speech recognition with a 100 MHz processor and 16 MB of RAM, but for fast processing (large dictionaries, complex recognition schemes, or high sample rates) you should aim for a minimum of a 1 GHz processor and 1 GB of RAM. Because of the processing required, most software packages list their minimum requirements.
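The "relatively low bandwidth" claim in section 5.1 is easy to quantify. The figures below are common speech-capture conventions, and the arithmetic is just a worked example:

```python
# Back-of-the-envelope data rate for speech audio.
sample_rate = 16_000    # Hz; a common rate for speech (telephony uses 8 kHz)
sample_width = 2        # bytes per sample (16-bit)
channels = 1            # mono is typical for ASR

bytes_per_second = sample_rate * sample_width * channels
print(bytes_per_second)                    # 32000 bytes/s, about 31 KB/s
print(bytes_per_second * 60 / 1_000_000)   # about 1.9 MB per minute of speech
```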
6. USES / APPLICATIONS

6.1 Military

6.1.1 High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft, the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system. Some important conclusions from the work were as follows:

1. Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently.
2. Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful; with lower recognition rates, pilots would not use the system.
3. More natural vocabulary and grammar, and shorter training times, would be useful, but only if very high recognition rates could be maintained.
4. Laboratory research in robust speech recognition for military environments has produced promising results which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance aircraft.

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but it is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and it even allows the pilot to assign targets to himself with two simple voice commands, or to any of his wingmen with only five commands.

6.1.2 Helicopters

The problems of achieving high recognition accuracy under stress and noise apply as strongly to the helicopter environment as to the fighter environment. The acoustic noise problem is actually more severe in helicopters, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade on speech recognition systems in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma helicopter, and there has also been much useful work in Canada. Results have been encouraging, and voice applications have included control of communication radios, setting of navigation systems, and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done, both in speech recognition and in overall speech technology, to consistently achieve performance improvements in operational settings.

6.1.3 Training Air Traffic Controllers

Training for military air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller which simulates the dialog the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also characterized by highly structured speech as the primary output of the controller, which reduces the difficulty of the speech recognition task.

The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers using speech recognition. Generally, the recognition accuracy falls short of providing graceful interaction between the trainee and the system. However, the prototype training systems have demonstrated significant potential for voice interaction in these and other training applications. The U.S. Navy has sponsored a large-scale effort in ATC training systems, in which a commercial speech recognition unit was integrated with a complex training system including displays and scenario creation. Although the recognizer was constrained in vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained language, using a vocabulary specifically designed for the ATC task. Research in France has focused on the application of speech recognition to ATC training systems, addressing issues both in speech recognition and in the application of task-domain grammar constraints. Another approach to ATC simulation with speech recognition has been created by Supremis. The Supremis system is not constrained by the rigid grammars imposed by the underlying limitations of other recognition strategies.

6.2 People with Disabilities

It has been suggested that one of the most promising areas for the application of speech recognition is in helping handicapped people (Leggett and Williams, 1984). Speech recognition technology helps people with disabilities interact with computers more easily. People with motor limitations who cannot use a standard keyboard and mouse can use their voices to navigate the computer and create documents; for comparison, Braille input/output devices, touch-screen systems and trackballs have all been used successfully in classrooms. The technology is also useful to people with learning disabilities who experience difficulty with spelling and writing. Some individuals with speech impairments may use speech recognition as a therapeutic tool to improve vocal quality, and people with overuse or repetitive stress injuries benefit from using speech recognition to operate their computers hands-free. Speech recognition technology has great potential to give people with disabilities greater access to computers and a world of opportunities.

Consider an example: Mr. Jones is a reporter who must submit his articles in HTML for publishing in an online journal. Over his twenty-year career, he has developed repetitive stress injury (RSI) in his hands and arms, and it has become painful for him to type. He uses a combination of speech recognition and an alternative keyboard to prepare his articles, but he does not use a mouse. It took him several months to become sufficiently accustomed to speech recognition to be comfortable working for many hours at a time. There are some things he has not worked out yet, such as a sound card conflict that arises whenever he tries to use speech recognition on Web sites that have streaming audio. (Source: http://www.w3.org/WAI/EO/Drafts/PWD-Use-Web/)
6.3 Speech Recognition in the Telephony Environment

William Meisel, who holds a Ph.D. in Electrical Engineering, ran a speech recognition company for ten years. He is president of the speech industry consulting firm TMA Associates and publisher and editor of the Speech Recognition Update newsletter. According to him, telephone speech recognition creates a "Voice Web": the sites that support speech recognition constitute the Voice Web. Most such sites today have individual phone numbers (typically toll-free) and are often called "voice portals". There are, however, likely to be more voice portals than Web portals; every wireless and landline telephone service provider will eventually be a voice portal, and there will be independent, corporate, and specialized voice portals.

VoiceXML, a standard created by the VoiceXML Forum (www.voicexml.org) and the W3C Voice Browser working group (www.w3.org/voice), is a way for companies to provide a voice-interactive application on a Web server without needing speech engines or telephone line interface hardware. The VoiceXML code is downloaded to the voice portal and executed by a VoiceXML interpreter, much as a Web browser on a PC interprets HTML. (Source: William Meisel's guide book on The Voice Web)
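To give a flavour of the standard, here is a minimal, illustrative VoiceXML 2.0 document. The prompt wording, the grammar file and the submit URL are placeholder assumptions, not taken from any real deployment; a voice portal's interpreter would fetch this page, speak the prompt, match the caller's answer against the referenced grammar, and post the result back to an ordinary Web server:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="weather">
    <!-- Collect one piece of spoken input from the caller. -->
    <field name="city">
      <prompt>Which city would you like the weather for?</prompt>
      <!-- Hypothetical grammar file listing the recognizable city names. -->
      <grammar src="cities.grxml" type="application/srgs+xml"/>
    </field>
    <!-- Once filled, submit the recognized value like an HTML form. -->
    <block>
      <submit next="http://example.com/weather" namelist="city"/>
    </block>
  </form>
</vxml>
```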
The Voice Web is not just an extension of the Internet, although information on existing Web sites can be used to support interactive voice services. It can run applications totally unlike visual Web applications and totally independent of the HTML-based Web. Some of the applications that the Voice Web is supporting are listed here.

6.3.1 Communications management and personal assistants

Communications management usually includes dialing by name using a personal directory. Personal-assistant functionality includes call screening, taking and accessing voice messages, and one-number access to the subscriber (scanning several subscriber numbers based on subscriber instructions). Other personalized features include maintaining a schedule and delivering reminders. Unified messaging includes features such as reviewing email or fax headers by phone using text-to-speech. Since subscribers will make calls through their personal assistant, the voice portal can potentially earn additional revenue from providing bundled local and/or long-distance service. Enterprise applications, such as voice-activated auto attendants that direct calls by name, can form a corporate voice portal. Corporate voice portals can also provide services such as reservations for a conference, location of a local store outlet, or a connection to customer service.

6.3.2 General information

General information includes weather, sports scores, horoscopes, general news, financial news, stock quotes, traffic conditions, and driving directions. Such information is intended to make a voice-enabled service part of a subscriber's daily habit. Information can be customized using, for example, the user's personal stock portfolio or the user's current location. As voice portals evolve, the caller will be able to "voicemark" specialized voice-equipped Web sites.

6.3.3 E-commerce

V-commerce supports a variety of transactions that can result in product or service sales. These include transactions similar to ordering from a Web site or a telephone catalog service. They also include finding a business by saying its trade name or its category. Entertainment is part of e-commerce, and it will be part of the Voice Web; for example, the caller can use speech recognition to choose audio channels to listen to. (Source: Receiver Magazine, Vodafone, 2001)
6.4 Potential Uses in Education

Contact with a number of practitioners and researchers in the field of speech recognition led to some interesting speculation regarding feasible uses of this technology in education:

1. Teaching students of foreign languages to pronounce vocabulary correctly: unlikely in the near future on a large scale, due to the software training currently involved.

2. Teaching overseas students to pronounce English correctly: unlikely in the near future on a large scale, for the same reason.

3. Making notes of observations during scientific experiments, so the scientist/researcher can focus on the observation without needing to view the monitor or keyboard (similar to how a coroner verbally records notes during an autopsy): likely, and probably already used in individual circumstances. Noise from the experiment, the researcher's need to record some observations rapidly, and the need for a vocabulary that covers the scientific terms all present issues.

4. Enabling students who are physically handicapped and unable to use a keyboard to enter text verbally: used already, and becoming increasingly widespread.

5. Enabling people with textual interpretive problems, e.g. dyslexia, to enter text verbally: used already, and becoming increasingly widespread.

6. Restricting access on a high-security computer, where a keyboard or other input device might be used by hackers: interest from a number of people, though a lack of proof-of-concept research hinders further development. Unlikely to be available in the near future.

7. Narrative-oriented research, where transcripts are automatically generated, removing the time needed to generate the transcript manually, as well as human error: likely in the near future. Current speech recognition technology imposes an unacceptable compromise between accuracy and inhibiting the interviewee; quicker and easier training systems for the interviewee will help, as will increases in portable computing power.

8. Capturing the speech of a lecturer or tutor: unlikely on a large scale, due to vocabulary, training and interpretive issues. In addition, filming the lecture captures audio and visual content combined, which may be more useful.

9. Using a speech recognition system in an examination: very likely. Technically this is possible, and within current UK examination guidelines it appears to be acceptable.

(Source: http://www.becta.org.uk/technology/speechrecog/docs/finalreport.pdf, the final report (June 2000) of an experimental project on how effective speech recognition technologies could be for people with special educational needs.)

6.5 Computer and Video Games

Speech input has been used in a limited number of computer and video games, on a variety of PC and console-based platforms, over the past decade. For example, the game Seaman involved growing and controlling strange half-man, half-fish characters in a virtual aquarium. A microphone, sold with the game, allowed the player to issue one of a predetermined list of command words and questions to the fish. The accuracy of interpretation, in use, seemed variable; during gaming sessions, colleagues with strong accents had to speak in an exaggerated and slower manner for the game to understand their commands.

Microphone-based games are available for two of the three main video game consoles (PlayStation 2 and Xbox). However, these games primarily use speech in an online player-to-player manner, rather than spoken words being interpreted electronically. For example, MotoGP for the Xbox allows online players to ride against each other in a motorbike racing simulation and to speak (via microphone headset) to the nearest players (bikers) in the race.

There is currently interest in, but less development of, video games that interpret speech. The Microsoft Xbox, Nintendo GameCube, and Sony PlayStation 2 consoles all offer games with speech input/output. Currently, most are war-action shooter games in which speech recognition provides high-level commands to virtual teammates, who respond with a variety of recorded quips. Consider two examples, the graphically realistic, tactical squad-based shooter games Ghost Recon 2 and SOCOM II: U.S. Navy Seals, both available on the Sony PlayStation 2. The speech recognition systems for these games are provided by Fonix and ScanSoft, respectively.

In Ghost Recon 2, the user is the leader of a team of three secret Special Forces soldiers who must capture various military targets in North Korea in the year 2007. The team is critical to the user's survival under enemy gunfire. Saying "Move out!" directs the team to move ahead of you as you make your way through the virtual, hilly terrain toward various objectives. The speech commands ("Move out," "Covering fire," "Grenade," "Take point," "Hold position," "Regroup") are easily recalled, high-level instructions to the team members. The commands that can be obeyed depend on the immediate situation: if you say "Take point" and the hostile fire is too great, the designated team member may say, "No can do, Captain." Occasionally, the retort is somewhat less respectful.

In SOCOM II: U.S. Navy Seals, a team of four men, including the first-person leader, attempts to stop an arms smuggling group in rural Albania. The team has to avoid the enemy, meet an informant, blow up weapons caches, and make their escape. The speech commands in this game are spoken in three parts, using a simple grammar. A command may be addressed to "Fireteam" (all other team members) or to an individual such as "Able" (your partner). Then there are approximately 12 action commands, including "Fire at will," "Deploy," "Move to," and "Get down." The third part of the command is one of nine letters of the military alphabet ("Charlie," "Delta," etc.) indicating where the "Move to" and similar commands are intended; they represent the specific locations of game objectives. (Source: article from Speech Technology Magazine, April 2005, http://www.speechtechmag.com/Articles/ReadArticle.aspx?ArticleID=29432)
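The three-part command structure just described is easy to picture as a tiny grammar. The sketch below reconstructs the idea in Python; the word lists are abbreviated guesses based only on the commands quoted above, not the game's actual vocabulary:

```python
# Toy parser for an addressee + action + optional-location command grammar.
ADDRESSEES = {"Fireteam", "Able"}
ACTIONS = {"Fire at will", "Deploy", "Move to", "Get down"}
LOCATIONS = {"Charlie", "Delta", "Echo"}   # subset of the nine letters

def parse(command):
    """Split a command into (addressee, action, location or None);
    return None if the phrase does not fit the grammar."""
    for addressee in ADDRESSEES:
        if not command.startswith(addressee + " "):
            continue
        rest = command[len(addressee) + 1:]
        if rest in ACTIONS:
            return addressee, rest, None
        for action in ACTIONS:
            for location in LOCATIONS:
                if rest == f"{action} {location}":
                    return addressee, action, location
    return None

print(parse("Able Move to Charlie"))   # ('Able', 'Move to', 'Charlie')
print(parse("Fireteam Deploy"))        # ('Fireteam', 'Deploy', None)
print(parse("Able Dance"))             # None: out-of-grammar command
```

Constraining recognition to such a small, fixed grammar is exactly what makes in-game speech input tractable: the engine only ever has to choose among a few dozen phrases.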
6.6 Medical Transcription

Medical transcription, also known as MT, is an allied health profession which deals in the process of transcription: converting voice-recorded reports, as dictated by physicians and other healthcare professionals, into text format. Every day, doctors scour the market looking for new ways to simplify their office routines and reduce their costs, and medical transcription software saves them time and money. Speech recognition products in this field produce accurate and fully formatted transcriptions from clinicians' dictations. The goal is to minimize editing time by MTs and, as a result, increase MT productivity: the software interprets and formats a document so that it is close to a final product. Benefits include:

- Organized and formatted document sections
- Punctuation inserted even if not spoken
- Numbers interpreted and presented appropriately, including dosages, measurements, lists, etc.
- Formatting based on each organization's preferences and specifications
- Insertion of speech-activated "normals"
- No explicit training required
- Continual learning and improvement from MT edits

Examples: when a clinician dictates

"Exam…vital signs…two twelve…eighty eight and regular…thirteen…BP one forty one hundred and one thirty five ninety five"

speech recognition software can output:

PHYSICAL EXAMINATION: VITAL SIGNS: Weight 212, pulse 88 and regular, respiration 13, blood pressure is 140/100, 135/95.

When a provider says

"The following problems were reviewed…hypertension…please enter my hypertension template…use my normal cad"

speech recognition software can output:

PROBLEMS: The following problems were reviewed:
- Hypertension: No headache, visual disturbance, chest pain, palpitation, focal neurologic complaint, dyspnea, edema, claudication, or complaint from current medication.
- Coronary artery disease: No chest pain, dyspnea, PND, orthopnea, palpitation, weakness, syncope, or obvious problems related to medications.
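The "two twelve" to "212" and "eighty eight" to "88" conversions above are a number-normalization step. Here is a toy sketch of the idea; the rules cover only the patterns in the example and are nothing like a full medical-transcription formatter:

```python
# Toy spoken-number normalizer: tens+units combine ("eighty eight" -> 88),
# while separate groups concatenate as digits ("two twelve" -> 212).
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def normalize(spoken):
    words = spoken.lower().split()
    groups, i = [], 0
    while i < len(words):
        w = words[i]
        if w in TENS:
            value = TENS[w]
            if i + 1 < len(words) and words[i + 1] in UNITS:
                value += UNITS[words[i + 1]]   # "eighty" + "eight" -> 88
                i += 1
            groups.append(str(value))
        elif w in TEENS:
            groups.append(str(TEENS[w]))
        elif w in UNITS:
            groups.append(str(UNITS[w]))
        i += 1
    return "".join(groups)

print(normalize("two twelve"))     # 212
print(normalize("eighty eight"))   # 88
print(normalize("thirteen"))       # 13
```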
6.7 Mobile Devices

The growth of cellular telephony, combined with recent advances in speech recognition technology, creates sizeable potential opportunities for mobile speech recognition applications. Speech recognition in mobile phones has already been introduced, but there is a lot of work still to be done in this field. When speech recognition was first introduced in mobiles, it was used to call a contact by saying the contact's name. The user first needed to record a voice clip of the name of each contact and associate it with that contact; when the user later said a name, the phone compared it with the recorded sounds and called the person whose name was spoken.

New smartphones are introduced every month, and they no longer require recording the names first. They have their own speech system, which can read the names as written, so when the user says a name, the phone compares the spoken sound against its saved contacts and calls the contact whose name was spoken.

Nuance Communications has launched the Nuance Mobile Speech Platform, which will improve the text-to-speech and speech recognition abilities of mobile devices. Through this platform, end users will be able to perform searches, dictate emails and SMS messages, and have incoming emails and messages read out to them, improving the usability and efficiency of mobile devices. The Nuance Mobile Speech Platform can be used to speech-enable a mobile application, and specifically offers pre-built components for the following:

- Nuance Local Search: search business names and categories, residential listings, weather, dining and entertainment, movies, etc.
- Nuance Mobile Navigation: voice destination entry (including street addresses, businesses and points of interest) and spoken turn-by-turn directions.
- Nuance Content Search: search catalogs with items in music, video, games and more.
- Nuance Mobile Web Search: search the Web from a mobile device.
- Nuance Mobile Communications: compose email, SMS, and IM messages by speaking.

(Source: Nuance Communications, http://www.nuance.com)
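The record-a-clip-per-contact approach described above amounts to template matching. A classic way to compare a new utterance against stored templates while tolerating different speaking speeds is dynamic time warping (DTW); the sketch below uses made-up one-dimensional feature sequences purely for illustration, whereas a real dialer would match sequences of spectral feature vectors:

```python
# Template-matching voice dialing: pick the contact whose stored
# template is nearest to the new utterance under DTW distance.
def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-D feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

def best_contact(utterance, templates):
    """Return the contact name whose template is closest."""
    return min(templates,
               key=lambda name: dtw_distance(utterance, templates[name]))

templates = {"alice": [1.0, 2.0, 3.0, 2.0], "bob": [3.0, 1.0, 1.0]}
print(best_contact([1.1, 2.1, 2.9, 2.2], templates))   # alice
```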
6.8 Voice Security Systems

Voice security technology uses a person's voice print to uniquely identify individuals through biometric speaker verification. Speech is processed by a non-contact method: you do not need to see or touch people to recognize them. The popularity of speaker verification is growing swiftly because speech is easy to obtain without dedicated hardware, and improved, robust speech recognition algorithms and PC hardware have brought this one-time futuristic idea into the present.

At Voice Security Systems, a decade of research and development has led them to believe that the explosive speech processing market is here to stay. Their Voice Protect® method of biometric voice authentication is ideally suited for low-memory, database-independent applications using smart cards or other physical devices such as cell phones. Because of the value of biometric security in fraud prevention, and the added convenience of knowing that people are who they claim to be, the company believes speaker verification will be accepted by the consumer market more widely, and sooner, than speech recognition. Voice Security Systems claims it can deliver biometric security technology to the market at a lower cost than anyone else in the industry, with no recurring maintenance costs such as database management or complicated user training. Once the Voice Protect® technology is built into a product, it continues to function independently for the life of the product. The technology can be applied in our daily lives, for example in garage door openers, computers and laptops, automobiles, PDAs and handheld devices, smart-card applications, cell phones, door access systems and ATMs. (Source: Voice Security Systems Inc., http://www.voice-security.com/)
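As a toy illustration of the verification step, the sketch below compares an enrolled voice-print vector with a freshly captured one and accepts the identity claim only above a similarity threshold. The vectors, the cosine measure and the threshold are illustrative assumptions; this report has no details of how Voice Protect® actually works:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def verify(enrolled, sample, threshold=0.95):
    """Accept the identity claim only if the voice prints are close."""
    return cosine(enrolled, sample) >= threshold

enrolled_print = [0.2, 0.7, 0.1, 0.9]   # made-up enrollment vector
print(verify(enrolled_print, [0.22, 0.68, 0.12, 0.88]))   # True: same speaker
print(verify(enrolled_print, [0.9, 0.1, 0.8, 0.2]))       # False: impostor
```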
7. FUTURE APPLICATIONS

There are a number of scenarios in which speech recognition is being delivered, developed for, researched or seriously discussed. As with many contemporary technologies, such as the Internet, online payment systems and mobile phone functionality, development is at least partially demand-driven. IBM intends to have better-than-human automatic speech recognition by 2010; Bill Gates has predicted that by 2011 the quality of ASR will catch up to humans; and Justin Rattner of Intel said in 2005 that by 2015 computers will have "strong capabilities" in speech-to-text.

At some point in the future, speech recognition may become speech understanding. The statistical models that allow computers to decide what a person just said may someday allow them to grasp the meaning behind the words. Although that is a huge leap in terms of computational power and software sophistication, some researchers argue that speech recognition development offers the most direct line from the computers of today to true artificial intelligence. We can talk to our computers today; in 25 years, they may very well talk back.

7.1 Home Appliances

Designers have developed very convenient user interfaces for consumer appliances. What could be easier than pressing buttons on a remote control to select television channels, or flipping a switch to turn on a light? These types of direct-manipulation user interfaces will continue to be widely used. However, because current buttons and switches are not intelligent, you cannot ask your remote control when "Star Trek" is on, and you must walk to the light switch before turning the light on. Speech enables consumer appliances to act intelligently, responding to speech commands and answering verbal questions. For example, speech enhances consumer appliances by enabling the user to give instructions such as:

1. To the VCR: "Record tonight's 'Star Trek'."
2. To the coffeepot: "Start at 6:30 a.m. tomorrow."
3. To the light switch: "Turn on the lights one half-hour before sunset."

There is, inevitably, interest in the use of speech recognition in domestic appliances such as ovens, refrigerators, dishwashers and washing machines. One school of thought is that, like the use of speech recognition in cars, this can reduce the number of parts and therefore the cost of producing the machine. However, removal of the normal buttons and controls would present problems for people who, for physical or learning reasons, cannot use speech recognition systems.

7.2 Wearable Computers

Perhaps the most futuristic application is in the use and functionality of wearable computers, i.e. unobtrusive devices that you can wear like a watch, or that are even embedded in your clothes. These would allow people to go about their everyday lives but still store information (thoughts, notes, to-do lists) verbally, or communicate via email, phone or videophone, through wearable devices. Crucially, this would be done without having to interact with the device, or even remember that it is there; the user would just speak, and the device would know what to do with the speech and carry out the appropriate task. The rapid miniaturization of computing devices, the rapid rise in processing power, and advances in mobile wireless technologies are making these devices more feasible. There are still significant problems to overcome, such as background noise and the idiosyncrasies of an individual's language. However, it is speculated that reliable versions of such devices will become commercially available during this decade.

The conventional human-computer interface, the GUI, which assumes a keyboard, mouse and bit-mapped display, is insufficient for the wearable environment. Although handwritten-character recognizers and keyboards that can be used with one hand have been developed as input devices, speech recognition has recently received more interest. The main reason is that it leaves both hands and eyes free, is therefore less restricted in its use, and can achieve quicker communication. In addition, speech can convey not only linguistic information but also the emotion and identity of speakers. IBM's wearable PC described above has a microphone in its controller and can recognize speech once ViaVoice has been installed.
7.3 Precision Surgery

Developments in keyhole and micro surgery have clearly shown that an approach of as little invasive or non-essential surgery as possible increases success rates and shortens patient recovery times. There is occasional speculation in various medical fora regarding the use of speech recognition in precision surgery, where a procedure is partially or totally carried out by automated means. For example, in removing a tumour or blockage without damaging surrounding tissue, a command could be given to make an incision of a precise and small length, e.g. 2 millimetres. However, the legal implications of such technology are a formidable barrier to significant developments in this area. If speech were incorrectly interpreted and, say, a limb were accidentally severed, who would be liable: the surgeon, the surgery system developers, or the speech recognition software developers?
8. SPEECH RECOGNITION SOFTWARE

Modern speech recognition software enables a computer user to speak text and/or commands to the computer, largely, but not entirely, bypassing the keyboard and mouse interface. The idea has been portrayed in science fiction for many decades, quite frequently depicting computers that do not even have keyboards or mice. Such computers are also typically depicted as able to keep up no matter how fast a person speaks, without regard to who the speaker is, the language spoken, or even how many speakers there are. In other words, they depict a computer that hears the way a multilingual person does.

Attempts to develop usable speech recognition software began in the mid-twentieth century and proved far more daunting than anyone had imagined. The task also turned out to require so much computing power that only modern computers can perform the necessary functions in real time (i.e., as fast as you can speak). The first commercially practical products became available around 1990 (e.g. the Voice Navigator, a standalone computer dedicated entirely to speech recognition, which sent its output to a second computer) and used up all the available computing power of the machine. They were not particularly accurate and could only understand a single person at a time, requiring retraining, not of the operator but of the machine itself, to work for another person. Despite these limitations, they could type so rapidly that, even after taking time to make corrections, a person with disabilities could easily accomplish more work with the machine than without it. For persons with physical disabilities, the ability simply to talk to a computer can be a priceless asset; consider, for instance, an author with Parkinson's disease who can barely control his hands, yet can conveniently create an article.

8.1 Free Software

Many software packages are used for speech recognition, and a number of them are free of charge. Some free packages are:

- XVoice (http://www.compapp.dcu.ie/~tdoris/Xvoice/, http://www.zachary.com/creemer/xvoice.html)
- CVoiceControl/kVoiceControl (http://www.kiecza.de/daniel/linux/index.html)
- Ears (ftp://svr-ftp.eng.cam.ac.uk/comp.speech/recognition/)
- NICO ANN Toolkit (http://www.speech.kth.se/NICO/index.html)
- Myers' Hidden Markov Model Software (http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/myers.hmm.html)
- Jialong He's Speech Recognition Research Tool (http://www.itl.atr.co.jp/comp.speech/Section6/Recognition/jialong.html)
- Open Mind Speech (http://freespeech.sourceforge.net)
- GVoice (http://www.cse.ogi.edu/~omega/gnome/gvoice/)
- ISIP (http://www.isip.msstate.edu/project/speech/)
- CMU Sphinx (http://www.speech.cs.cmu.edu/sphinx/Sphinx.html)

8.2 Commercial Software

8.2.1 Dragon NaturallySpeaking

Dragon NaturallySpeaking is almost universally regarded in reviews as the best voice recognition software, with a claimed potential of 99.8 percent accuracy (reviews say 95 percent is more realistic). NaturallySpeaking integrates easily with Microsoft productivity software. The Preferred version can also be used with a compatible digital audio recorder, MP3 player/recorder or PDA for recording voice notes or lectures on the go; NaturallySpeaking will later transcribe your recordings.
Reviews say Dragon NaturallySpeaking is the most sophisticated product on the market, but that if you have Windows Vista, or plan to buy a new computer with it, you should first try the voice recognition capabilities included with Vista, which by most accounts are nearly as robust as Dragon NaturallySpeaking. (Source: http://www.nuance.com/naturallyspeaking/)

8.2.2 IBM ViaVoice

IBM ViaVoice is a range of language-specific continuous speech recognition software products offered by IBM. The current version is designed primarily for use in embedded devices. Individual language editions may differ in features, specifications, technical support, and microphone support. Some of the products or editions available are:

- Advanced Edition
- Standard Edition
- Personal Edition
- ViaVoice for Mac OS X Edition
- Pro USB Edition
- Simply Dictation for Mac

Prior to the development of ViaVoice, IBM developed a product named VoiceType. ViaVoice was first introduced to the general public in 1997, and two years later, in 1999, IBM released a free-of-charge version. I did not find a single review that recommends ViaVoice over Dragon NaturallySpeaking, but ViaVoice is the only program that will run on older or less powerful computers; Dragon NaturallySpeaking is extremely demanding (you need at the very least 512 MB of RAM, a recent processor and 1 GB of free hard-drive space). However, reviews say ViaVoice is not as accurate as Dragon NaturallySpeaking, mistakes are not as easy to correct, and the product has not been updated in years. (Source: http://www.ibm.com/software/speech/)
8.2.3 Microsoft Speech Recognition System

In 1993, Microsoft hired Xuedong Huang from CMU to lead its speech efforts. Microsoft has been involved in research on both speech recognition and text-to-speech, and the company's research eventually led to the development of the Speech API (SAPI). Speech recognition technology has been used in several Microsoft products, including Microsoft Dictation (a research prototype that ran on Windows 9x), Office XP, Office 2003, Microsoft Plus! for Windows XP, Windows XP Tablet PC Edition, and Windows Mobile (as Microsoft Voice Command). However, prior to Windows Vista, speech recognition was not mainstream. In response, Windows Speech Recognition was bundled with Windows Vista, released in 2006, making it the first mainstream version of Microsoft Windows to offer fully integrated support for speech recognition.

Windows Speech Recognition in Windows Vista lets users interact with their computers by voice. It was designed for people who want to limit their use of the mouse and keyboard significantly while maintaining or increasing their overall productivity. You can dictate documents and emails in mainstream applications, use voice commands to start and switch between applications, control the operating system, and even fill out forms on the Web. Windows Speech Recognition is a new feature in Windows Vista, built using the latest Microsoft speech technologies, and provides recognition accuracy that improves with each use as it adapts to your speaking style and vocabulary. Speech recognition is available in English (U.S.), English (U.K.), German (Germany), French (France), Spanish (Spain), Japanese, Chinese (Traditional), and Chinese (Simplified).

Early reviews say it rivals Dragon NaturallySpeaking 9 for accuracy. If you buy a new computer, you will get Vista by default, so you can try out its voice recognition features before buying other software. You can also upgrade an older computer to Vista, but the system requirements are demanding. Reviewers say Dragon NaturallySpeaking has a slight edge, but cite no compelling reason to buy it if you have, or plan to buy, Vista. (Source: http://www.microsoft.com/speech/speech2007/default.mspx)
8.2.4 MacSpeech Dictate

MacSpeech is a company that develops speech recognition software for Apple Macintosh computers. In 2008, its previous flagship product, iListen, was replaced by Dictate, which is built around Nuance's licensed Dragon NaturallySpeaking engine. MacSpeech was established in 1996 by current CEO Andrew Taylor and is currently the only company that develops voice dictation systems for the Macintosh; its full product line is devoted to speech recognition and dictation. Reviews say Dictate, introduced in early 2008, is as accurate as Dragon NaturallySpeaking in tests, and much better than the previous MacSpeech program, iListen. Dictate comes with a microphone headset, and no products directly compete with it. (Source: http://www.macspeech.com/dictate/)

8.2.5 Philips SpeechMagic

SpeechMagic is an industrial-grade platform for capturing information in digital format, developed by Philips Speech Recognition Systems of Vienna, Austria. SpeechMagic features large-vocabulary speech recognition as well as a number of services aimed at supporting "accurate, convenient and efficient" information capture in healthcare IT applications. The technology is mainly used in the healthcare sector; however, applications are also available for the legal market as well as for tax consultants. SpeechMagic supports 25 recognition languages and provides more than 150 ConTexts (industry-specific vocabularies). More than 8,000 healthcare sites in 45 nations use SpeechMagic to capture information and create professional documents. The world's largest site powered by SpeechMagic is in the United States, with more than 60,000 authors, more than 3,000 editors and a throughput of 400 million lines per year.
Growth consulting company Frost & Sullivan recognized SpeechMagic in 2005 with the Market Leadership Award in European Healthcare. In 2007, Frost & Sullivan presented Philips Speech Recognition Systems with the Global Excellence Award in Speech Recognition. (Source: http://www.myspeech.com/)

8.2.6 Other Commercial Software

There are many other commercial software packages used for speech recognition. Some of them are:

- HTK (http://htk.eng.cam.ac.uk/)
- CSLU Toolkit (http://cslu.cse.ogi.edu/toolkit/)
- Simmortel Voice (http://www.simmortel.com)
- Quack.com by AOL (http://www.quack.com)
- SpeechWorks (http://www.speechworks.com)
- Babel Technologies (http://www.babeltech.com)
- Vocalis Speechware (http://www.vocalisspeechware.com)
- Entropic (http://htk.eng.cam.ac.uk)
9. CONCLUSION

Speech recognition will revolutionize the way people conduct business over the Web and will, ultimately, differentiate world-class e-businesses. VoiceXML ties speech recognition and telephony together and provides the technology with which businesses can develop and deploy voice-enabled Web solutions today. These solutions can greatly expand the accessibility of Web-based self-service transactions to customers who would otherwise not have access and, at the same time, leverage a business's existing Web investments. Speech recognition and VoiceXML clearly represent the next wave of the Web. In the near future, people will operate their home and business computers by speech rather than by keyboard and mouse, and home automation will be completely based on speech recognition systems.