
[DSC Europe 22] Parakeet - a parrot that can repeat words very well - Tomislav Krizan

Automatic Speech Recognition is a well-explored and, in some cases, a nearly solved problem. However, in many real use cases we wish to use ASR to separate multiple speakers, as well as to perform speaker diarization (determine who speaks when). Until recently, methods to tackle this issue were based on a modular approach, where separate networks are used for voice activity detection, speaker diarization, and ASR. More recently, much work has been done on end-to-end methods, in which a single network learns both tasks simultaneously. We propose a method in which we utilize Whisper, a novel model by OpenAI, and modify its structure to adapt it for both speaker diarization and speech recognition.


  1. PARAKEET - VOICE FINGERPRINTING & RECOGNITION
  2. Agenda
     • About us!?
     • Motivation
     • Project approach ...
     • Project outcome…
     • Q&A
  3. About us
     • We are a data-centric company with many years of experience in different industries:
       • Telco
       • Banking
       • Finance
       • Retail
       • Manufacturing
       • Distribution
       • Transportation
     • We are here to solve any data-related problems and extract value from your data.
  4. Organizational Structure
     • As a company, we are strategically focused on knowledge.
     • We have established a team-based matrix organization to provide flexibility of teams according to the competencies needed for particular projects.
     • This helps us accomplish optimal results in quality, within defined budgets and deadlines.
     • Teams: Consulting, Development, Research (R&D)
     • Note: Competencies of all team members are developed so that they can also be actively engaged in teams outside their primary competence.
  5. Evolution of ASR, TTS and multispeaker ...
  6. Glossary
     • ASR - Automatic Speech Recognition
     • TTS - Text-To-Speech
     • WER - word error rate
     • cpWER - concatenated minimum-permutation word error rate
     • DER - diarization error rate
     • VAD - voice activity detection
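For readers less familiar with the metrics above, here is a minimal sketch of how WER is computed: word-level edit distance divided by the number of reference words. This is an illustration only, not the evaluation code used in the project.

```python
# Minimal WER illustration: Levenshtein distance over words,
# normalized by the reference length. Not the presenters' evaluation code.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])          # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)      # deletion, insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("meeting adjourned", "meeting edjourned"))  # 0.5: one substitution out of two reference words
```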
  7. Motivation
     • Automatic speech recognition is a long-researched problem, the goal of which is to turn human speech into words.
     • In the technological sphere, this problem is better known as "speech-to-text", and applications include conversational agents (Siri) and conversation transcription.
     • End-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR).
     • Multi-speaker speech separation and recognition is a central task in the cocktail party problem.
  8. Ground work ...
     • ASR (automatic speech recognition) with a single speaker and clean data is a very well solved problem (superhuman performance)
     • In real use cases, we might have multiple speakers, different accents, noise, etc.
     • Diarization - "who spoke when" -> a very hard problem; the number of speakers is unknown in advance
  9. Ground work ...
     • Historically, separating speakers was done modularly - a separate network for automatic speech recognition (speech-to-text) and for speaker diarization
     • Recently, end-to-end approaches (one network for all tasks) have started making an impact
     • Big companies (Microsoft, Google) have models that perform well, but they are not open-source
     • One network for VAD (voice activity detection), speaker diarization, and ASR
  10. Our approach
     • (Relevant projects) Audio separation for music instruments with a custom-built CV framework merging SOTA architectures:
       • MobileNet
       • ResUNet
  11. Single microphone recording issue
  12. Version 0.1
     • Wave2Vec 2.0 - unsupervised pretraining, uses unlabeled data, fine-tuned for downstream tasks
     • Open source!!!
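As an illustration of the Version 0.1 starting point, the sketch below loads a publicly available Wave2Vec 2.0 checkpoint from Hugging Face and runs greedy CTC decoding on a single file. The checkpoint name and the `meeting.wav` path are assumptions; the actual checkpoint and fine-tuning setup used in the project are not stated in the slides.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Publicly available English checkpoint; the checkpoint actually used in the project is not stated.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("meeting.wav")  # hypothetical file; 16 kHz mono audio expected
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # shape: (batch, time, vocab)

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])  # greedy CTC decoding -> transcript
```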
  13. Version 0.2 - enhanced Wave2Vec
     • "Injecting" layers in different parts of the architecture - output VAD/diarization/ASR
     • Initial layer(s) - VAD (almost binary task)
     • Mid layer(s) - diarization (complex calculation)
     • Final layer(s) - ASR/speech-to-text (high complexity)
     • This is our baseline for all other work!
  14. Whisper by OpenAI
     • Transformer-based model, uses log-mel spectrograms
     • Trained on 680,000 hours of multilingual data
     • Robust to accents and noise - important
     • Open source!!!
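A minimal example of how Whisper is typically called for plain transcription (no diarization) is sketched below, assuming the open-source `openai-whisper` package and a hypothetical `meeting.wav` input; the deck does not specify which model size was used.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")        # log-mel spectrogram frontend + Transformer encoder/decoder
result = model.transcribe("meeting.wav")  # language detection + decoding in one call

print(result["text"])                     # full transcript
for seg in result["segments"]:            # per-segment timestamps (no speaker labels)
    print(f'{seg["start"]:.1f}-{seg["end"]:.1f}s: {seg["text"]}')
```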
  15. Version 0.6 - enhanced Whisper
     • "Injecting" layers
     • Initial layer(s) - VAD (almost binary task)
     • Mid layer(s) - diarization (complex calculation)
     • Final layer(s) - ASR/speech-to-text (high complexity)
     • Since it is robust to noise, it makes a good base model for real-world multi-speaker ASR applications (meetings have noise, people have different accents)
     • A big problem occurs when speakers overlap
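The slides do not publish the exact architecture, so the sketch below is only one plausible reading of the "injecting layers" idea: auxiliary VAD and diarization heads attached to hidden states at different depths of the Whisper encoder (via Hugging Face's `WhisperModel`), while the usual decoder still produces the transcript. The chosen injection depths, head designs, and `MAX_SPEAKERS` value are illustrative assumptions, not the project's actual configuration.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class WhisperWithAuxHeads(nn.Module):
    """Sketch of 'injecting' task heads at different encoder depths.
    The exact layer choices and head designs are assumptions for illustration."""
    MAX_SPEAKERS = 4  # assumed upper bound on concurrent speakers

    def __init__(self, checkpoint: str = "openai/whisper-base"):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained(checkpoint)
        d_model = self.whisper.config.d_model
        self.vad_head = nn.Linear(d_model, 1)                   # early layer: speech / non-speech per frame
        self.diar_head = nn.Linear(d_model, self.MAX_SPEAKERS)  # mid layer: per-frame speaker activity
        # The Whisper decoder (not shown here) still produces the ASR transcript as usual.

    def forward(self, input_features: torch.Tensor):
        enc = self.whisper.encoder(input_features, output_hidden_states=True)
        hidden = enc.hidden_states                        # tuple: input embeddings + one entry per layer
        early, mid = hidden[2], hidden[len(hidden) // 2]  # assumed injection points
        vad_logits = self.vad_head(early).squeeze(-1)     # (batch, frames)
        diar_logits = self.diar_head(mid)                 # (batch, frames, MAX_SPEAKERS)
        return vad_logits, diar_logits, enc.last_hidden_state
```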
  16. Short DEMO
     • Model ASR and diarization output: "That’s all for today. Okay we have to fill in all this stuff stuff m stuff. Meeting adjourned, meeting edjourned, yeah, I think I’ve learned not to bring play-dough to meetings. Yeah, I think it would be a good idea, I like it."
     • Detected speakers: Speaker 1, Speaker 2, Speaker 3, Speaker 4
  17. Final conclusion…
  18. www.atmc.ai info@atmc.ai
