3. Speech… How?
[Diagram: Sender → Message → Receiver, carried over a Channel that also adds Noise; the Message = Signal + Protocol]
4. Computer Analogy
[Diagram: human Speech Production ↔ computer Speech Synthesis (TTS: text → speech); human Speech Perception ↔ computer Speech Recognition (ASR: speech → text)]
5. Recognition Made Easy
I bought a boat. (English)
افرنقعوا أيها المتكأكئين (Arabic: roughly "Disperse, you who have crowded around!")
gute Nacht (German: "good night")
[Diagram: Feature Extraction → Decoder (Search), which consults the Grammar, Lexicon, and Phone Models]
6. Recognizer Characteristics
Discrete words / continuous speech
Read / spontaneous speech
Speaker dependent / independent
Small / large vocabulary
Finite state / context sensitive language model
7. What to study
Phonetics and Phonology (Linguistics)
Speech Signal Processing (DSP)
Pattern Recognition (AI)
Hidden Markov Models (HMMs)
Artificial Neural Networks (ANNs)
Hybrid ANN-HMM
8. Phonetics
Phonetics: study of the production, perception, and physical properties of speech sounds
Phonology: describes the way sounds function within a given language and how they are combined and organized
Phoneme: the smallest phonetic unit in a language that is capable of conveying a distinction in meaning
E.g. boat-bought, car-jar, نشاط-شمس, أرض-أحمد
9. Speech Signal Processing
Sampling
Rate: e.g. 16 kHz
Sample size: e.g. 16 bits
Format: PCM (.wav files)
Time or frequency domain features?
Spectrogram: represents the time-varying spectrum of a signal (x = time, y = frequency, intensity = energy)
How can we represent the features? Filter banks, LPCs, MFCCs
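The slide's time-versus-frequency question can be made concrete: a spectrogram is nothing more than the magnitude spectrum of successive short frames of the signal, stacked over time. Below is a minimal sketch in plain Python of the spectrum of a single frame, using a naive DFT; a real front end would use the FFT, windowing, and overlapping frames.

```python
import math

def dft_magnitudes(frame):
    """Naive DFT: magnitude of each non-negative frequency bin of one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mags.append(math.hypot(re, im))
    return mags

# one 64-sample frame holding a pure tone with 4 cycles per frame
frame = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=lambda k: mags[k])
print(peak_bin)   # 4: all the energy sits in bin 4
```

Repeating this for every 10–25 ms frame of an utterance gives the (x = time, y = frequency, intensity = energy) picture the slide describes; filter banks and MFCCs are further compressions of exactly this spectrum.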
What is the need for speech technology? Why do we need to develop computer technologies that tackle human speech? Speech is the easiest way for people to communicate, so why not communicate with computers by the same means? It would be really useful. Recall the definition of AI: solving problems that humans still do better than machines. A very young child can speak, hear, and understand you, yet cannot read, write, or even do simple calculations. That is why we need speech technology.
Like any communication system, the speech communication process comprises a message that must be carried from a sender to a receiver through a channel. The sender turns the message in his brain into commands to the speech apparatus: a particular configuration of the vocal cords, throat, tongue, lips, and lungs. The air passing through all of these is shaped into compressions and rarefactions with a particular vibration pattern, which travel through the air to the other side. The receiver collects these signals with the outer ear; they strike the eardrum and cross the middle ear, via the hammer, anvil, and stirrup, into the cochlea of the inner ear, where they become nerve signals. The brain then searches for their meaning until it understands the message. Of course, the medium also carries other signals spreading through the air, such as the sound of a fan, cars, students outside, buzzing, and so on. All of this is called noise: the channel does not carry only the sender's signal, it carries many signals combined into one complex signal, and the receiver must do some processing to filter it first. But if all of this happens, will you be able to understand the incoming signal after all that processing and filtering? Imagine that I am talking in Japanese and you understand only German! Or will an air conditioner understand the signal transmitted by a TV remote control? So the message is not just a signal; it is a signal plus a communication protocol agreed upon between sender and receiver. In speech, the message = signal + language.
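The note's point that the receiver must first filter the channel's combined signal can be sketched numerically: a hypothetical low-frequency "message" tone plus high-frequency interference, cleaned with a crude moving-average low-pass filter. The signal lengths, frequencies, and helper names below are invented for illustration only.

```python
import math

def moving_average(x, w=5):
    """Crude low-pass filter: replace each sample by the mean of a w-sample window."""
    half = w // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

n = 200
clean = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]          # the "message"
hiss = [0.5 * math.sin(2 * math.pi * 60 * t / n) for t in range(n)]    # channel noise
noisy = [c + h for c, h in zip(clean, hiss)]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

filtered = moving_average(noisy)
print(rms_error(noisy, clean) > rms_error(filtered, clean))   # True: filtering helps
```

The averaging passes the slow tone almost untouched while strongly attenuating the fast interference, so the filtered signal sits much closer to the original message, which is exactly the receiver-side clean-up step the note describes.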
Our focus is mainly on ASR. Note: besides the microphone and speaker, the sound card in the computer, with its A/D and D/A converters, plays the role of the ear and mouth (the physical part of speech processing). Note: the microphone converts acoustic pressure (the compressions and rarefactions of the sound wave) into an electrical analog signal; the speakers do the opposite operation.
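The A/D step the sound card performs amounts to sampling plus quantization. A minimal sketch using the 16 kHz / 16-bit PCM figures from the slide (the 440 Hz tone and the helper name are invented for illustration):

```python
import math

SAMPLE_RATE = 16000      # 16 kHz, as on the slide
FULL_SCALE = 32767       # largest positive value of a signed 16-bit sample

def quantize_16bit(x):
    """Map an 'analog' amplitude in [-1.0, 1.0] to a signed 16-bit PCM sample."""
    return max(-32768, min(32767, round(x * FULL_SCALE)))

# 10 ms of a 440 Hz tone, sampled at 16 kHz: 160 discrete samples
n_samples = SAMPLE_RATE // 100
pcm = [quantize_16bit(math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
       for t in range(n_samples)]
print(len(pcm))   # 160
```

These integer samples are exactly what a PCM .wav file stores; the D/A converter on playback runs the mapping in reverse.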
After the audience hear the three sentences from you (without displaying them), ask them what they understood from each utterance.

You will not understand the third sentence, assuming you know English only (no Arabic or German): your ear notices a strange sound (the German ch, like Arabic خ) that it cannot perceive.

In the second sentence (assuming you know Arabic), your ear can perceive every pronounced sound (you have what are called phone models in the sound database in your brain), and by intuition you can get the sentence structure, an imperative, because you have the language grammar in your brain too. But you still cannot understand the sentence, because the words you heard have no entries in your dictionary (the word lexicon in your brain).

For the first sentence (assuming you know English), the uttered sounds are fine and so are the words, but two of the words have almost the same pronunciation. Only with the aid of the language grammar can you work out that the first is a verb while the second is a noun.

From this example it becomes clear that speech perception is a search process the brain performs in a fraction of a second, trying to find the best match for the heard utterance given a large knowledge base made up of the language sounds, a word dictionary, and the language grammar. From here comes the structure of a speech recognition engine.
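The search described above can be caricatured in a few lines: a toy decoder that, given the phones it "heard", finds the grammar sentence whose expected phone string (built via the lexicon) is closest in edit distance. All of the lexicon entries, grammar sentences, and phone labels below are invented purely for the sketch; a real decoder searches a vastly larger space with probabilistic scores.

```python
# Toy knowledge base: phone models stand in for the lexicon's phone strings.
LEXICON = {
    "i": ["ay"], "a": ["ah"],
    "bought": ["b", "ao", "t"], "caught": ["k", "ao", "t"], "boat": ["b", "ow", "t"],
}
GRAMMAR = [  # the only sentences the language model allows
    ["i", "bought", "a", "boat"],
    ["i", "caught", "a", "boat"],
]

def edit_distance(a, b):
    """Plain Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def decode(heard_phones):
    """Return the grammar sentence whose expected phones best match what was heard."""
    def cost(sentence):
        expected = [p for w in sentence for p in LEXICON[w]]
        return edit_distance(heard_phones, expected)
    return min(GRAMMAR, key=cost)

heard = ["ay", "b", "ao", "t", "ah", "b", "ow", "t"]   # an idealized hearing
print(" ".join(decode(heard)))                          # "i bought a boat"
```

The three knowledge sources of the slide appear explicitly: phone models (the phone symbols), the lexicon (word to phones), and the grammar (allowed word sequences); the decoder is the search over their combination.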
Read speech: the text is known and expected before it is spoken. Spontaneous speech: unplanned, unexpected speech. Speaker-dependent: the engine needs to build a special profile for every user and be trained on their voice and way of speaking before it can run properly and give acceptable results. Finite-state language model: a small, limited set of sentences, such as telephone numbers. Context-sensitive language model: unrestricted and dependent on the context of the speech; it needs a complicated NLP system.
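A finite-state language model of the kind described here can be tiny. Below is a sketch of an automaton that accepts only "call" followed by one or more digit words, in the spirit of the telephone-number example; the vocabulary and state names are invented for the illustration.

```python
# A minimal finite-state language model: the automaton accepts only sentences
# of the form "call <digit> <digit> ...".
DIGITS = {"zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"}

def accepts(words):
    """Run the word sequence through the automaton; True iff it ends in a final state."""
    state = "start"
    for w in words:
        if state == "start" and w == "call":
            state = "need_digit"
        elif state in ("need_digit", "digit") and w in DIGITS:
            state = "digit"              # self-loop: any number of digits
        else:
            return False                 # word not allowed here: reject
    return state == "digit"              # final state: "call" plus >= 1 digit

print(accepts("call five five one".split()))   # True
print(accepts("call me maybe".split()))        # False
```

Because the set of accepted sentences is so constrained, the recognizer's search space stays small; a context-sensitive model has no such enumerable automaton, which is why it needs a full NLP system.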
Phonology answers the question: which sounds exist in this language? Phonetics answers the question: what are the properties of these sounds? Phonetics studies the sounds of languages from three basic points of view: their production in the vocal organs (articulatory phonetics), their physical properties (acoustic phonetics), and their effect on the ear (auditory phonetics).
When a child starts learning: he sees a dog, asks you what it is, and you tell him it is a dog. After that, when he sees a donkey or a cat, he points to it and says it is a dog; you tell him no, this is a donkey and this is a cat. Later he points to your cat and says it is a cat; you tell him, no, it is not just a cat, it is my cat, and its name is Poosy. This is the idea of a model. At first the child made a model in his mind mapping any four-legged creature to "dog". Then he narrowed his model to dogs, donkeys, and cats; then he narrowed it again to the cat Poosy.

The same idea applies to a mathematical model. Depending on your system's size and nature, you choose the granularity of your models. If your system only recognizes one of three fixed sentences, you might make one HMM per sentence. If it searches a dictionary of 10 words, make an HMM for each word. If it searches over combinations of words in different orders, narrow your models down to the level of sub-word units: tri-phones, mono-phones, or even allophones, according to the system size and the search-tree size and depth the system can bear. Note that the number of states in your model is a function of the model size you choose, i.e. a function of the feature vectors, or in other words of the time length of the unit of utterance you build a model for (ranging usually from a whole word down to a sub-phone).
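The per-unit HMMs discussed above can be illustrated with a minimal, invented two-state model decoded by the Viterbi algorithm. All numbers and the discrete "lo"/"hi" observation alphabet are made up for the sketch; a real recognizer uses one such model per word, tri-phone, or phone, with emission densities over feature vectors.

```python
# A minimal two-state HMM decoded with the Viterbi algorithm.
STATES = ["s1", "s2"]
START = {"s1": 0.8, "s2": 0.2}                                        # initial probs
TRANS = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.1, "s2": 0.9}}  # transitions
EMIT = {"s1": {"lo": 0.7, "hi": 0.3}, "s2": {"lo": 0.2, "hi": 0.8}}   # emissions

def viterbi(obs):
    """Most likely hidden-state path for an observation sequence."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    path = {s: [s] for s in STATES}
    for o in obs[1:]:
        nv, npath = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: v[p] * TRANS[p][s])
            nv[s] = v[best_prev] * TRANS[best_prev][s] * EMIT[s][o]
            npath[s] = path[best_prev] + [s]
        v, path = nv, npath
    return path[max(STATES, key=lambda s: v[s])]

print(viterbi(["lo", "lo", "hi", "hi"]))   # ['s1', 's1', 's2', 's2']
```

The number of states here is fixed at two only because the toy unit is short; as the note says, a model covering a whole word needs more states than one covering a single sub-phone.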