Automatic Speech Recognition (ASR) is a well-explored and, in some cases, nearly solved problem. However, in many real-world use cases we wish to use ASR to separate multiple speakers, as well as to perform speaker diarization (determining who speaks when). Until recently, methods tackling this problem were based on a modular approach, with separate networks for voice activity detection, speaker diarization, and ASR. Recently, much progress has been made on end-to-end methods, in which a single network learns both tasks simultaneously. We propose a method that utilizes Whisper, a recent model by OpenAI, modifying its structure to adapt it for both speaker diarization and speech recognition.
3. About us
• We are a data-centric company with many years of experience across industries:
• Telco
• Banking
• Finance
• Retail
• Manufacturing
• Distribution
• Transportation
• We are here to solve any data-related problem and extract value from your data.
4. Organizational Structure
• As a company, we are strategically focused on knowledge.
• We have established a team-based matrix organization in order to provide flexibility: teams are assembled according to the competencies needed for a particular project.
• This helps us accomplish optimal results in quality, within defined budgets and deadlines.
[Diagram: team-based matrix organization across Consulting, Development, Research, and R&D]
* Note: Competencies of all team members are developed so that they can also be actively engaged in teams outside their primary competence.
6. Glossary
• ASR – Automatic Speech Recognition
• TTS – Text-To-Speech
• WER – word error rate
• cpWER – concatenated minimum-permutation word error rate
• DER – diarization error rate
• VAD – voice activity detection
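Two of these metrics can be made concrete with a short sketch. Below is a minimal pure-Python implementation of WER (word-level edit distance over the reference length) and cpWER (per-speaker concatenated transcripts scored under the best speaker permutation). Aggregation details vary between toolkits, and this version assumes equal numbers of reference and hypothesis speakers, so treat it as illustrative:

```python
from itertools import permutations

def edit_distance(r, h):
    """Levenshtein distance between two word lists (rolling 1-D DP)."""
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + cost)      # substitution / match
            prev = cur
    return dp[len(h)]

def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference length."""
    r = ref.split()
    return edit_distance(r, hyp.split()) / max(len(r), 1)

def cp_wer(refs, hyps):
    """cpWER: concatenate each speaker's words, then score under the
    speaker-to-speaker mapping (permutation) with the fewest total errors.
    Assumes len(refs) == len(hyps)."""
    total_ref = sum(len(r.split()) for r in refs)
    best = min(
        sum(edit_distance(refs[i].split(), hyps[p].split())
            for i, p in enumerate(perm))
        for perm in permutations(range(len(hyps)))
    )
    return best / max(total_ref, 1)
```

For example, swapping two speakers' transcripts gives a cpWER of 0 even though a naive per-speaker WER would be large, which is exactly the permutation invariance the metric exists for.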
7. Motivation
• Automatic speech recognition is a long-researched problem, the goal of which is to turn human speech into words.
• In the technological sphere, this problem is better known as "speech-to-text", and applications include conversational agents (e.g. Siri) and conversation transcription.
• End-to-end models have become a popular alternative to traditional hybrid models in automatic speech recognition (ASR).
• Multi-speaker speech separation and recognition is a central task in the cocktail party problem.
8. Ground work ...
• ASR (automatic speech recognition) with a single speaker and clean data is a very well solved problem (superhuman performance)
• In real use cases, we may have multiple speakers, different accents, noise, etc.
• Diarization ("who spoke when") is a very hard problem: the number of speakers is unknown in advance
9. Ground work ...
• Historically, separating speakers was handled in a modular fashion: a separate network for automatic speech recognition (speech-to-text) and another for speaker diarization
• Recently, end-to-end approaches (one network for all tasks) have started making an impact
• Big companies (Microsoft, Google) have models that perform well, but they are not open-source
• One network for VAD (voice activity detection), speaker diarization, and ASR
10. Our approach
• (Relevant projects) Audio separation for musical instruments with a custom-built CV framework merging SOTA architectures
• MobileNet
• ResUNet
12. Version 0.1
• wav2vec 2.0 – unsupervised pretraining on unlabeled data, fine-tuned for downstream tasks
• Open source!!!
13. Version 0.2 – enhanced wav2vec 2.0
• "Injecting" layers at different parts of the architecture to output VAD / diarization / ASR
• Initial layer(s) – VAD (almost binary task)
• Mid layer(s) – diarization (complex calculation)
• Final layer(s) – ASR / speech-to-text (high complexity)
• This is our baseline for all other work!
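The layer-injection idea can be sketched with a toy PyTorch encoder: auxiliary heads are tapped at increasing depths, matching the intuition that VAD needs shallow features while ASR needs deep ones. The layer indices, head shapes, and plain `nn.TransformerEncoderLayer` blocks are illustrative assumptions, not the actual wav2vec 2.0 configuration:

```python
import torch
import torch.nn as nn

class MultiTaskEncoder(nn.Module):
    """Toy stand-in for 'injected' task heads at different encoder depths."""

    def __init__(self, d_model=64, n_layers=6, n_speakers=4, vocab=32):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4,
                                       dim_feedforward=128,
                                       batch_first=True)
            for _ in range(n_layers)
        )
        # Heads, shallow to deep: speech/non-speech, speaker, token logits
        self.vad_head = nn.Linear(d_model, 1)
        self.diar_head = nn.Linear(d_model, n_speakers)
        self.asr_head = nn.Linear(d_model, vocab)

    def forward(self, x):  # x: (batch, frames, d_model)
        outputs = {}
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == 0:                            # initial layer -> VAD
                outputs["vad"] = torch.sigmoid(self.vad_head(x))
            elif i == len(self.layers) // 2:      # mid layer -> diarization
                outputs["diar"] = self.diar_head(x)
            elif i == len(self.layers) - 1:       # final layer -> ASR
                outputs["asr"] = self.asr_head(x)
        return outputs

model = MultiTaskEncoder()
out = model(torch.randn(2, 50, 64))  # 2 utterances, 50 frames each
```

In training, each head would get its own loss (frame-level binary cross-entropy for VAD, a speaker-label loss for diarization, CTC or cross-entropy for ASR), summed with task weights.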
14. Whisper by OpenAI
• Transformer-based model; takes log-mel spectrograms as input
• Trained on 680,000 hours of multilingual data
• Robust to accents and noise – important
• Open source!!!
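A log-mel front end of the kind Whisper consumes (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins) can be sketched in plain NumPy. Whisper's own preprocessing differs in details (padding, normalization, dynamic-range clipping), so this is for intuition only:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Windowed power spectrum -> triangular mel filterbank -> log10."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, n_fft//2+1)

    # Triangular filters with centers equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)

    return np.log10(np.maximum(power @ fb.T, 1e-10))  # (frames, n_mels)

# One second of a 440 Hz tone as a smoke test
t = np.arange(16000) / 16000
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

One second of 16 kHz audio yields 98 frames of 80 mel bins here, i.e. roughly one feature vector every 10 ms.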
15. Version 0.6 – enhanced Whisper
• "Injecting" layers, as in Version 0.2:
• Initial layer(s) – VAD (almost binary task)
• Mid layer(s) – diarization (complex calculation)
• Final layer(s) – ASR / speech-to-text (high complexity)
• Since Whisper is robust to noise, this yields a good model for real-world multi-speaker ASR (meetings are noisy, people have different accents)
• A big problem remains when speakers overlap
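For intuition about why the VAD sub-task is "almost binary", here is a classic energy-threshold detector: a crude hand-crafted baseline, not the learned VAD head described above. The frame sizes and the -35 dB threshold are arbitrary illustrative choices:

```python
import numpy as np

def energy_vad(audio, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Mark a frame as speech when its log-energy exceeds a fixed
    threshold relative to full scale (audio assumed in [-1, 1])."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    flags = []
    for start in range(0, len(audio) - frame + 1, hop):
        chunk = audio[start : start + frame]
        energy_db = 10 * np.log10(np.mean(chunk ** 2) + 1e-12)
        flags.append(energy_db > threshold_db)
    return np.array(flags)

# Half a second of silence followed by half a second of a loud tone
sr = 16000
sig = np.concatenate([np.zeros(sr // 2),
                      0.5 * np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr)])
speech = energy_vad(sig, sr)
```

A detector like this collapses immediately under noise or overlapping speakers, which is precisely why VAD is learned jointly with the rest of the network here.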
16. Short DEMO
Model ASR and diarization output:
That’s all for today. Okay we have to fill in
all this stuff stuff m stuff. Meeting adjourned,
meeting edjourned, yeah, I think I’ve learned
not to bring play-dough to meetings. Yeah, I
think it would be a good idea, I like it.
[Legend: Speaker 1, Speaker 2, Speaker 3, Speaker 4]