Recently, a variety of video data have been generated, stored, and accessed thanks to advances in computer technology and the Internet.
To search for a video, or a scene within a video, quickly from this data, an efficient and effective technique is needed.
I therefore propose a video scene retrieval system based on speech recognition using HMMs (Hidden Markov Models).
The proposed system was applied to scene retrieval experiments that evaluate the recognition rate for 457 short words.
The results show an average detection accuracy of 68%.
1. A Study on the Video Scene
Retrieving System
with a Speech Recognizer
2013. 5. 14
Yoshika OSAWA
Kohno Lab.
2. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
3. 1. Introduction
• A variety of video data are being
generated, stored, and accessed with
advances in the Internet.
• To search for a video scene quickly from
this data, an efficient technique is needed.
5. 1. Introduction
• A Subtitling System for Broadcast
Programs with a Speech Recognizer
o Ando et al. (2001)
6. 1. Introduction
• Extract voices from the video.
• The advantages of voice:
Easy to make texts.
Simple association.
Apply speech recognition to scene
retrieving.
7. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
8. 2. Aim of Study
Implement a scene retrieving
system, then verify its accuracy and
check its operation.
Make annotations automatically with
speech recognition.
9. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
10. 3. Composition of System
Start
→ Select a Video
→ Voice Divide Section
→ Speech Recognize Section
→ Input a Keyword
→ Scene Retrieve Section
→ Output the result
→ End
11. i. Voice Divide Section
• Focus on the Amplitude
o Use signals only while they exceed the
threshold value of the amplitude.
o Reject segments that are too short to
be recognized.
o Thresholds derived from experiment:

axis       threshold
Amplitude  10 [%]
Time       1000 [ms]
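The dividing rule above can be sketched as follows. This is a minimal illustration of amplitude thresholding with a minimum-duration check, assuming the 10% threshold is relative to the peak amplitude; the function name and normalization are my assumptions, not the author's implementation.

```python
import numpy as np

def divide_voice(samples, rate, amp_threshold=0.10, min_len_ms=1000):
    """Return (start, end) sample indices of segments whose amplitude
    exceeds amp_threshold (10% of the peak) for at least min_len_ms."""
    active = np.abs(samples) > amp_threshold * np.max(np.abs(samples))
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                      # segment begins
        elif not a and start is not None:
            # keep the segment only if it is long enough to recognize
            if (i - start) * 1000 / rate >= min_len_ms:
                segments.append((start, i))
            start = None
    if start is not None and (len(samples) - start) * 1000 / rate >= min_len_ms:
        segments.append((start, len(samples)))
    return segments
```

Each returned segment would then be passed to the Speech Recognize Section.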
13. (1) Pre-Processing Unit
• Digitization
o Sampling frequency: 16kHz
o Quantization bit : 16bit
• Noise Reduction
o Additive noise: subtract the spectrum measured during silence
o Multiplicative noise: subtract on the log axis
(e.g., the characteristics of the SM57 microphone)
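A minimal sketch of the two noise-reduction ideas named above: additive noise removed by subtracting a magnitude spectrum estimated from a silent interval, and multiplicative (channel) distortion removed by subtraction on the log axis, where it becomes additive. The function names and the spectral floor are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(frame_mag, noise_mag, floor=0.01):
    """Additive noise: subtract the average magnitude spectrum taken
    from silence, flooring negative results to a small fraction of
    the original frame to avoid invalid (negative) magnitudes."""
    return np.maximum(frame_mag - noise_mag, floor * frame_mag)

def log_subtraction(frame_log_spec, channel_log_spec):
    """Multiplicative distortion is additive on the log axis,
    so it can be removed there by plain subtraction."""
    return frame_log_spec - channel_log_spec
```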
15. (2) Feature Extraction Unit
• Resolution of human hearing
o Higher sensitivity at lower frequencies
• A filter that matches human hearing:
the Mel frequency scale
16. (2) Feature Extraction Unit
• Inverse Fourier transform on the Mel-frequency axis
o New axis: cepstrum
o Separates the voice pitch from the resonance frequencies
• MFCC (Mel Frequency Cepstrum Coefficients)
o Information about vowels
• ΔMFCC
o Information about consonants
• Feature vector
o (average power, MFCC, ΔMFCC)
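The cepstral step above can be sketched as follows: an inverse cosine transform of the log Mel spectrum yields the cepstrum, whose low-order coefficients are the MFCCs, and frame-to-frame differences give ΔMFCC. The Mel filterbank itself is omitted; the input here is assumed to already be a log Mel spectrum, and the unnormalized DCT-II form is my choice, not necessarily the author's.

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_coeffs=12):
    """Inverse (cosine) transform of the log Mel spectrum gives the
    cepstrum; the low-order coefficients (MFCCs) carry the
    spectral-envelope (vowel) information."""
    n = len(log_mel)
    k = np.arange(n)
    # DCT-II basis, unnormalized
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), k + 0.5) / n)
    return basis @ log_mel

def delta(feats):
    """Frame-to-frame difference (ΔMFCC), carrying the temporal
    (consonant) cues; first frame's delta is zero."""
    return np.diff(feats, axis=0, prepend=feats[:1])
```

A per-frame feature vector would then concatenate the average power, the MFCCs, and the ΔMFCCs.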
18. (3) Identification Unit
Speech waveform: observable
Character information: not directly observable
Estimate the character information
from the waveform using an HMM
(Hidden Markov Model).
Maximum likelihood calculation: Viterbi algorithm
Parameter training: Baum-Welch algorithm
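The maximum-likelihood decoding step named above can be sketched with a discrete-observation Viterbi algorithm. A real recognizer such as Julius decodes continuous feature vectors against trained acoustic models; this toy version with an explicit transition matrix `A` and emission matrix `B` only illustrates the principle.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for observation indices `obs`,
    given initial probabilities pi, transition matrix A (from, to),
    and emission matrix B (state, symbol). Log-domain for stability."""
    n_states, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])       # best log-prob per state
    back = np.zeros((T, n_states), dtype=int)       # backpointers
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)          # scores[i, j]: i -> j
        back[t] = np.argmax(scores, axis=0)
        logd = scores[back[t], range(n_states)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(logd))]                   # best final state
    for t in range(T - 1, 0, -1):                   # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```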
19. iii. Scene Retrieve Section
• Matching keyword and text
1. Input a keyword
2. Match the keyword by string searching
3. Extract the scenes in which the keyword was spoken
4. Output a thumbnail
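The matching step above amounts to a plain substring search over the recognized text of each divided segment. The `(start time, text)` segment structure here is my assumption about how the sections connect, not the author's data format.

```python
def retrieve_scenes(segments, keyword):
    """segments: list of (start_seconds, recognized_text) pairs, one per
    divided voice segment. Returns the start times of the scenes in
    which the keyword appears in the recognized text."""
    return [start for start, text in segments if keyword in text]
```

The returned start times would then be used to seek the video and render thumbnails.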
20. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
21. 4. Evaluation Experiment
1. Compare the recognized result with the word actually heard
2. Calculate the recognition rate
3. Evaluate it by number of characters per word

Sample data:
Video   NHK news
Time    3 minutes
Number  30 videos
Words   457 words
Engine  Julius
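The per-word-length evaluation described above can be sketched as a simple tally. The `(n_chars, correct)` pair format is an assumption for illustration; the actual bookkeeping in the study is not specified.

```python
from collections import defaultdict

def rates_by_length(results):
    """results: iterable of (n_chars, correct) pairs, one per spoken word.
    Returns ({n_chars: recognition rate}, overall average rate)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n_chars, correct in results:
        totals[n_chars] += 1
        hits[n_chars] += int(correct)
    per_len = {n: hits[n] / totals[n] for n in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_len, overall
```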
22. 4. Evaluation Experiment
Total average recognition rate: 68%.

Recognition rate by word length (characters):
1: 67%   2: 73%   3: 69%
4: 46%   5: 45%   6: 40%
23. 4. Evaluation Experiment
• Verify the correspondence between the
keyword and the seek destination
o Select a thumbnail and play from that scene
o Check whether the keyword was spoken.
24. 4. Evaluation Experiment
• The recognition rate decreases as the number
of characters increases.
• The retrieved scenes correspond to
the keyword.
• Recognition errors occur in weak consonant parts.
o The Voice Divide Section needs improvement.
o The recognition accuracy must also be improved.
25. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
26. 5. Conclusion
• A system for watching video efficiently
o Uses speech recognition
o Makes annotations automatically
• Future work
o Adopt the zero-crossing number in the Voice
Divide Section
o Take in the latest speech recognition technology.
o Incorporate image recognition.
Good afternoon, everyone. I'm Yoshika OSAWA, and I am very happy to see all of you today. Let's begin. The theme of my presentation is "A Study on the Video Scene Retrieving System with a Speech Recognizer", which I studied last year at Gunma National College of Technology.