Recently, a variety of video data have been generated, stored, and accessed thanks to advances in computer technology and the Internet.
To search for a video, or a scene within a video, quickly from this data, an efficient and effective technique is needed.
I therefore propose a video scene retrieval system based on speech recognition using HMMs (Hidden Markov Models).
The proposed system was applied to scene retrieval experiments that evaluate the recognition rate for 457 short words.
The results show an average detection accuracy of 68%.
1. A Study on the Video Scene
Retrieving System
with a Speech Recognizer
2013. 5. 14
Yoshika OSAWA
Kohno Lab.
2. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
3. 1. Introduction
• A variety of video data are being
generated, stored, and accessed with
advances in the Internet.
• To search for a video scene quickly from
this data, an efficient technique is needed.
5. 1. Introduction
• A Subtitling System for Broadcast
Programs with a Speech Recognizer
o Ando et al. (2001)
6. 1. Introduction
• Extract voices from the video.
• The advantages of voice:
Easy to make texts.
Simple association.
Apply speech recognition to scene
retrieving.
7. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
8. 2. Aim of Study
Implement a scene retrieving
system, then verify its accuracy and
check its operation.
Make annotations automatically with
speech recognition.
9. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
10. 3. Composition of System
Start
→ Select a Video
→ Voice Divide Section
→ Speech Recognize Section
→ Input a Keyword
→ Scene Retrieve Section
→ Output the result
→ End
11. i. Voice Divide Section
• Focus on the Amplitude
o Use signals only while they exceed the
threshold value of the amplitude.
o Reject segments that are too short to
be recognized.
o Thresholds derived from experiment:

axis       threshold
Amplitude  10 [%]
Time       1000 [ms]
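The dividing rule above can be sketched as follows. This is a minimal illustration of amplitude thresholding with a minimum-duration check, assuming the 10% threshold is relative to the peak amplitude; the function name and normalization are my assumptions, not the author's implementation.

```python
import numpy as np

def divide_voice(samples, rate, amp_threshold=0.10, min_len_ms=1000):
    """Return (start, end) sample indices of segments whose amplitude
    exceeds amp_threshold (10% of the peak) for at least min_len_ms."""
    active = np.abs(samples) > amp_threshold * np.max(np.abs(samples))
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                      # segment begins
        elif not a and start is not None:
            # keep the segment only if it is long enough to recognize
            if (i - start) * 1000 / rate >= min_len_ms:
                segments.append((start, i))
            start = None
    if start is not None and (len(samples) - start) * 1000 / rate >= min_len_ms:
        segments.append((start, len(samples)))
    return segments
```

Each returned segment would then be passed to the Speech Recognize Section.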
13. (1) Pre-Processing Unit
• Digitization
o Sampling frequency: 16kHz
o Quantization bit : 16bit
• Noise Reduction
o Additive noise: subtract the spectrum measured during silence
o Multiplicative noise: subtract on the log axis
(e.g., the characteristics of the SM57 microphone)
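A minimal sketch of the two noise-reduction ideas named above: additive noise removed by subtracting a magnitude spectrum estimated from a silent interval, and multiplicative (channel) distortion removed by subtraction on the log axis, where it becomes additive. The function names and the spectral floor are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(frame_mag, noise_mag, floor=0.01):
    """Additive noise: subtract the average magnitude spectrum taken
    from silence, flooring negative results to a small fraction of
    the original frame to avoid invalid (negative) magnitudes."""
    return np.maximum(frame_mag - noise_mag, floor * frame_mag)

def log_subtraction(frame_log_spec, channel_log_spec):
    """Multiplicative distortion is additive on the log axis,
    so it can be removed there by plain subtraction."""
    return frame_log_spec - channel_log_spec
```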
15. (2) Feature Extraction Unit
• Resolution of human hearing
o Higher sensitivity at lower frequencies
• A filter that matches human hearing:
the Mel frequency scale
16. (2) Feature Extraction Unit
• Inverse Fourier transform on the Mel-frequency axis
o New axis: cepstrum
o Separates the voice pitch from the resonance frequencies
• MFCC (Mel Frequency Cepstrum Coefficients)
o Information about vowels
• ΔMFCC
o Information about consonants
• Feature vector
o (average power, MFCC, ΔMFCC)
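The cepstral step above can be sketched as follows: an inverse cosine transform of the log Mel spectrum yields the cepstrum, whose low-order coefficients are the MFCCs, and frame-to-frame differences give ΔMFCC. The Mel filterbank itself is omitted; the input here is assumed to already be a log Mel spectrum, and the unnormalized DCT-II form is my choice, not necessarily the author's.

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_coeffs=12):
    """Inverse (cosine) transform of the log Mel spectrum gives the
    cepstrum; the low-order coefficients (MFCCs) carry the
    spectral-envelope (vowel) information."""
    n = len(log_mel)
    k = np.arange(n)
    # DCT-II basis, unnormalized
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), k + 0.5) / n)
    return basis @ log_mel

def delta(feats):
    """Frame-to-frame difference (ΔMFCC), carrying the temporal
    (consonant) cues; first frame's delta is zero."""
    return np.diff(feats, axis=0, prepend=feats[:1])
```

A per-frame feature vector would then concatenate the average power, the MFCCs, and the ΔMFCCs.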
18. (3) Identification Unit
Speech waveform: observable
Character information: not directly observable
Estimate the character information
from the waveform using an HMM
(Hidden Markov Model).
Maximum likelihood calculation: Viterbi algorithm
Parameter training: Baum-Welch algorithm
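The maximum-likelihood decoding step named above can be sketched with a discrete-observation Viterbi algorithm. A real recognizer such as Julius decodes continuous feature vectors against trained acoustic models; this toy version with an explicit transition matrix `A` and emission matrix `B` only illustrates the principle.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state sequence for observation indices `obs`,
    given initial probabilities pi, transition matrix A (from, to),
    and emission matrix B (state, symbol). Log-domain for stability."""
    n_states, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])       # best log-prob per state
    back = np.zeros((T, n_states), dtype=int)       # backpointers
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)          # scores[i, j]: i -> j
        back[t] = np.argmax(scores, axis=0)
        logd = scores[back[t], range(n_states)] + np.log(B[:, obs[t]])
    path = [int(np.argmax(logd))]                   # best final state
    for t in range(T - 1, 0, -1):                   # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```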
19. iii. Scene Retrieve Section
• Matching keyword and text
1. Input a keyword
2. Match the keyword by string searching
3. Extract the scenes in which the keyword was spoken
4. Output a thumbnail
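The matching step above amounts to a plain substring search over the recognized text of each divided segment. The `(start time, text)` segment structure here is my assumption about how the sections connect, not the author's data format.

```python
def retrieve_scenes(segments, keyword):
    """segments: list of (start_seconds, recognized_text) pairs, one per
    divided voice segment. Returns the start times of the scenes in
    which the keyword appears in the recognized text."""
    return [start for start, text in segments if keyword in text]
```

The returned start times would then be used to seek the video and render thumbnails.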
20. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
21. 4. Evaluation Experiment
1. Compare the recognized result with the word actually heard
2. Calculate the recognition rate
3. Evaluate it by number of characters per word

Sample data:
Video   NHK news
Time    3 minutes
Number  30 videos
Words   457 words
Engine  Julius
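The per-word-length evaluation described above can be sketched as a simple tally. The `(n_chars, correct)` pair format is an assumption for illustration; the actual bookkeeping in the study is not specified.

```python
from collections import defaultdict

def rates_by_length(results):
    """results: iterable of (n_chars, correct) pairs, one per spoken word.
    Returns ({n_chars: recognition rate}, overall average rate)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n_chars, correct in results:
        totals[n_chars] += 1
        hits[n_chars] += int(correct)
    per_len = {n: hits[n] / totals[n] for n in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_len, overall
```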
22. 4. Evaluation Experiment
Total average recognition rate: 68%.

Recognition rate by word length (characters):
1: 67%   2: 73%   3: 69%
4: 46%   5: 45%   6: 40%
23. 4. Evaluation Experiment
• Verify the correspondence between the
keyword and the seek destination
o Select a thumbnail and play from that scene
o Check whether the keyword was spoken.
24. 4. Evaluation Experiment
• The recognition rate decreases as the number
of characters increases.
• The retrieved scenes correspond to
the keyword.
• Recognition errors occur in weak consonant parts.
o The Voice Divide Section needs improvement.
o The recognition accuracy must also be improved.
25. Outline
1. Introduction
2. Aim of Study
3. Composition of System
i. Voice Divide Section
ii. Speech Recognize Section
iii. Scene Retrieve Section
4. Evaluation Experiment
5. Conclusion
26. 5. Conclusion
• A system for watching video efficiently
o Uses speech recognition
o Makes annotations automatically
• Future work
o Adopt the zero-crossing number in the Voice
Divide Section
o Take in the latest speech recognition technology.
o Incorporate image recognition.
Good afternoon, everyone. I'm Yoshika OSAWA, and I am very happy to see all of you today. Let's begin. The theme of my presentation is "A Study on the Video Scene Retrieving System with a Speech Recognizer", which I studied last year at Gunma National College of Technology.