Automatic Speech Recognition (ASR) is a well-explored and, in some cases, nearly solved problem. However, in many real-world use cases we wish to use ASR to separate multiple speakers, as well as to perform speaker diarization (determining who speaks when). Until recently, methods tackling this problem were based on a modular approach, with separate networks for voice activity detection, speaker diarization, and ASR. Recently, much progress has been made on end-to-end methods, in which a single network learns both tasks simultaneously. We propose a method that utilizes Whisper, a recent model by OpenAI, modifying its structure to adapt it for both speaker diarization and speech recognition.
3. About us
• We are a data-centric company with many years of experience across industries:
• Telco
• Banking
• Finance
• Retail
• Manufacturing
• Distribution
• Transportation
• We are here to solve any data-related problem and extract value from your data.
4. Organizational Structure
• As a company, we are strategically focused on knowledge.
• We have established a team-based matrix organization in order to provide flexibility: teams are assembled according to the competencies needed for a particular project.
• This helps us accomplish optimal results in quality, within defined budgets and deadlines.
[Diagram: team-based matrix organization across Consulting, Development, Research, and R&D]
* Note: Competencies of all team members are developed so that they can also be actively engaged in teams outside their primary competence.
6. Glossary
• ASR – Automatic Speech Recognition
• TTS – Text-To-Speech
• WER – word error rate
• cpWER – concatenated minimum-permutation word error rate
• DER – diarization error rate
• VAD – voice activity detection
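Two of these metrics can be made concrete with a short sketch. Below is a minimal pure-Python implementation of WER (word-level edit distance over the reference length) and cpWER (per-speaker concatenated transcripts scored under the best speaker permutation). Aggregation details vary between toolkits, and this version assumes equal numbers of reference and hypothesis speakers, so treat it as illustrative:

```python
from itertools import permutations

def edit_distance(r, h):
    """Levenshtein distance between two word lists (rolling 1-D DP)."""
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # substitution / match
            prev = cur
    return dp[len(h)]

def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference length."""
    r = ref.split()
    return edit_distance(r, hyp.split()) / max(len(r), 1)

def cp_wer(refs, hyps):
    """cpWER: concatenate each speaker's words, then score under the
    speaker-to-speaker mapping (permutation) with the fewest total errors.
    Assumes len(refs) == len(hyps)."""
    total_ref = sum(len(r.split()) for r in refs)
    best = min(
        sum(edit_distance(refs[i].split(), hyps[p].split())
            for i, p in enumerate(perm))
        for perm in permutations(range(len(hyps)))
    )
    return best / max(total_ref, 1)
```

For example, swapping two speakers' transcripts gives a cpWER of 0 even though a naive per-speaker WER would be large, which is exactly the permutation invariance the metric exists for.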
7. Motivation
• Automatic speech recognition is a long-researched problem, the goal of which is to turn human speech into words.
• In the technological sphere, this problem is better known as "speech-to-text", and applications include conversational agents (e.g. Siri) and conversation transcription.
• End-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR).
• Multi-speaker speech separation and recognition is a central task in the cocktail party problem.
8. Ground work ...
• ASR (automatic speech recognition) with a single speaker and clean data is a very well solved problem (superhuman performance)
• In real use cases, we may have multiple speakers, different accents, noise, etc.
• Diarization ("who spoke when") is a very hard problem: the number of speakers is unknown in advance
9. Ground work ...
• Historically, separating speakers was handled in a modular fashion: a separate network for automatic speech recognition (speech-to-text) and another for speaker diarization
• Recently, end-to-end approaches (one network for all tasks) have started making an impact
• Big companies (Microsoft, Google) have models that perform well, but they are not open-source
• One network for VAD (voice activity detection), speaker diarization, and ASR
10. Our approach
• (Relevant projects) Audio separation for musical instruments with a custom-built CV framework merging SOTA architectures
• MobileNet
• ResUNet
12. Version 0.1
• wav2vec 2.0 – unsupervised pretraining on unlabeled data, fine-tuned for downstream tasks
• Open source!!!
13. Version 0.2 – enhanced wav2vec 2.0
• "Injecting" layers at different parts of the architecture to output VAD / diarization / ASR
• Initial layer(s) – VAD (almost binary task)
• Mid layer(s) – diarization (complex calculation)
• Final layer(s) – ASR / speech-to-text (high complexity)
• This is our baseline for all other work!
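The layer-injection idea can be sketched with a toy PyTorch encoder: auxiliary heads are tapped at increasing depths, matching the intuition that VAD needs shallow features while ASR needs deep ones. The layer indices, head shapes, and plain `nn.TransformerEncoderLayer` blocks are illustrative assumptions, not the actual wav2vec 2.0 configuration:

```python
import torch
import torch.nn as nn

class MultiTaskEncoder(nn.Module):
    """Toy stand-in for 'injected' task heads at different encoder depths."""

    def __init__(self, d_model=64, n_layers=6, n_speakers=4, vocab=32):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4,
                                       dim_feedforward=128,
                                       batch_first=True)
            for _ in range(n_layers)
        )
        # Heads, shallow to deep: speech/non-speech, speaker, token logits
        self.vad_head = nn.Linear(d_model, 1)
        self.diar_head = nn.Linear(d_model, n_speakers)
        self.asr_head = nn.Linear(d_model, vocab)

    def forward(self, x):  # x: (batch, frames, d_model)
        outputs = {}
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == 0:                            # initial layer -> VAD
                outputs["vad"] = torch.sigmoid(self.vad_head(x))
            elif i == len(self.layers) // 2:      # mid layer -> diarization
                outputs["diar"] = self.diar_head(x)
            elif i == len(self.layers) - 1:       # final layer -> ASR
                outputs["asr"] = self.asr_head(x)
        return outputs

model = MultiTaskEncoder()
out = model(torch.randn(2, 50, 64))  # 2 utterances, 50 frames each
```

In training, each head would get its own loss (frame-level binary cross-entropy for VAD, a speaker-label loss for diarization, CTC or cross-entropy for ASR), summed with task weights.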
14. Whisper by OpenAI
• Transformer-based model; takes log-mel spectrograms as input
• Trained on 680,000 hours of multilingual data
• Robust to accents and noise – important
• Open source!!!
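A log-mel front end of the kind Whisper consumes (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins) can be sketched in plain NumPy. Whisper's own preprocessing differs in details (padding, normalization, dynamic-range clipping), so this is for intuition only:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Windowed power spectrum -> triangular mel filterbank -> log10."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, n_fft//2+1)

    # Triangular filters with centers equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    return np.log10(np.maximum(power @ fb.T, 1e-10))  # (frames, n_mels)

# One second of a 440 Hz tone as a smoke test
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

One second of 16 kHz audio yields 98 frames of 80 mel bins here, i.e. roughly one feature vector every 10 ms.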
15. Version 0.6 – enhanced Whisper
• "Injecting" layers, as in Version 0.2:
• Initial layer(s) – VAD (almost binary task)
• Mid layer(s) – diarization (complex calculation)
• Final layer(s) – ASR / speech-to-text (high complexity)
• Since Whisper is robust to noise, this yields a good model for real-world multi-speaker ASR (meetings are noisy, people have different accents)
• A big problem remains when speakers overlap
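For intuition about why the VAD sub-task is "almost binary", here is a classic energy-threshold detector: a crude hand-crafted baseline, not the learned VAD head described above. The frame sizes and the -35 dB threshold are arbitrary illustrative choices:

```python
import numpy as np

def energy_vad(audio, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Mark a frame as speech when its log-energy exceeds a fixed
    threshold relative to full scale (audio assumed in [-1, 1])."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    flags = []
    for start in range(0, len(audio) - frame + 1, hop):
        chunk = audio[start : start + frame]
        energy_db = 10 * np.log10(np.mean(chunk ** 2) + 1e-12)
        flags.append(energy_db > threshold_db)
    return np.array(flags)

# Half a second of silence followed by half a second of a loud tone
sr = 16000
sig = np.concatenate([np.zeros(sr // 2),
                      0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)])
speech = energy_vad(sig, sr)
```

A detector like this collapses immediately under noise or overlapping speakers, which is precisely why VAD is learned jointly with the rest of the network here.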
16. Short DEMO
Model ASR and diarization output:
That’s all for today. Okay we have to fill in
all this stuff stuff m stuff. Meeting adjourned,
meeting edjourned, yeah, I think I’ve learned
not to bring play-dough to meetings. Yeah, I
think it would be a good idea, I like it.
[Legend: Speaker 1, Speaker 2, Speaker 3, Speaker 4]