The ideal flow of an ML project assumes the presence of data labeled for the particular business task. But what if we need to start a project as soon as possible and don't yet have a relevant amount of such data, or don't have an annotation budget? This problem especially hurts cases where we plan to use deep learning approaches as the most promising ones. In this talk we will discuss real projects where we faced this problem in the past and demonstrate a few helpful practical approaches to solving it.
[DSC Europe 22] Starting deep learning projects without a sufficient amount of labeled data – a few practical examples – Sergei Smirnov
1. Starting deep learning projects without a sufficient amount of labeled data – a few practical examples
2. SERGEI SMIRNOV
Chief Data Scientist II
• 6+ years at EPAM
• 12+ years of experience with ML/DS
• Participated in many projects in RecSys, NLP, CV, time series, etc.
• Joined EPAM Serbia in 2022
• Responsible for DS/MLE in Serbia/Montenegro/Türkiye
• Responsible for developing the DS competency at EPAM globally
3. Intro
• Ideal flow of a deep learning project:
• Solid business understanding
• Acceptable data quality
• Availability of labeled data
• A representative sample
• High-quality labels
• A budget for data labeling/re-labeling if needed
4. Why this talk is important
• The typical set-up at the early stages of a project
• Problems with labels
• Pre-trained models and leveraging existing solutions
• Auto-labeling/label propagation
5. Why this question is relevant for EPAM
• Sometimes you need to persuade the customer to work with you
• Start-up mode
• New customer
• No labeling budget
• Specific cases (hard to find datasets/models for them)
7. Darts
• Support for a startup
• Darts scoring:
• Detection of the board
• Detection of the sector
• Scoring
• Challenges with classical CV:
• Too many heuristics to maintain
• Edge cases which could make the solution unstable
8. Board detection using deep learning
• No datasets with dartboards
• The circles/lines-based solution was unstable
• Intuition:
• Let's label a small sample
• Add this data to fine-tune a pre-trained model
• The model can learn new boxes faster than new classes
• Real solution:
• Use a model pre-trained on the COCO dataset
• Use the 'clock' class as a proxy for the board
• Label a few cases where it fails
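The "proxy class" trick above can be sketched as follows; the detector call itself is omitted, and the detection tuples, names, and threshold are illustrative assumptions, not the project's code:

```python
# Sketch of the class-proxy idea: a detector pre-trained on COCO has no
# "dartboard" class, but "clock" (round, high-contrast face) fires on
# dartboards too, so we keep only confident "clock" detections and treat
# them as board candidates.

def board_candidates(detections, proxy_class="clock", min_score=0.5):
    """detections: list of (class_name, score, (x1, y1, x2, y2))."""
    return [box for cls, score, box in detections
            if cls == proxy_class and score >= min_score]

detections = [
    ("clock",  0.91, (120, 80, 380, 340)),   # the dartboard
    ("person", 0.88, (10, 20, 90, 300)),
    ("clock",  0.30, (400, 10, 420, 30)),    # low-confidence wall clock
]
print(board_candidates(detections))  # [(120, 80, 380, 340)]
```

The labeled "failing cases" from the last bullet are then exactly the frames where this filter returns nothing useful.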
9. What we did next
• Re-trained the model
• Minimally corrected the results
• Trained a lightweight model (SSD) + tracking between frames
• 1-2 weeks of work
• Results of the project:
• A pipeline for board detection based on deep learning
• A classical CV segmentation algorithm
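A minimal sketch of the "tracking between frames" idea, assuming a simple IoU-based association (function names and the threshold are invented for illustration, not the project's code):

```python
# Minimal IoU tracker sketch: the lightweight SSD detector can miss the
# board in some frames, so we keep the last confident box and accept a
# new detection only if it overlaps the previous one enough.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def track(prev_box, detection, min_iou=0.3):
    """Keep the previous box when the new detection disagrees too much."""
    if detection is None:
        return prev_box                      # detector missed this frame
    if prev_box is None or iou(prev_box, detection) >= min_iou:
        return detection                     # accept the new detection
    return prev_box                          # reject an inconsistent jump

print(track((0, 0, 10, 10), (1, 1, 11, 11)))  # (1, 1, 11, 11): accepted
```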
10. CV segmentation case
• A start-up doing pig weight estimation
• No labeled data
• Time/budget constraints
11. How to do pig segmentation
• Initial approach:
• Add segmentation and object detection models
• There is no 'pig' class, but the 'human' class works almost fine
• Improve segmentation results with a color space transformation
• Re-train the segmentation model (U-Net-like)
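The color-space idea can be sketched like this; the HSV thresholds and the toy per-pixel representation are assumptions for illustration, not the actual project values:

```python
import colorsys

# Illustrative color-space trick: pinkish pig pixels separate from a
# darker pen floor much more cleanly in HSV than in raw RGB, so a coarse
# HSV mask can clean up a segmentation model's output.

def pigish_mask(pixels, min_sat=0.1, min_val=0.5):
    """pixels: list of (r, g, b) in 0..1 -> list of booleans."""
    mask = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        mask.append(s >= min_sat and v >= min_val)
    return mask

pixels = [(0.9, 0.6, 0.6),   # pink-ish: likely pig
          (0.2, 0.2, 0.2)]   # dark floor
print(pigish_mask(pixels))   # [True, False]
```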
12. Results
• After correcting the labels, we re-trained the segmentation networks
• Relevant results (5-7% MAPE on weight estimates)
• The whole PoC solution was done in 3 weeks
13. Audio/Voice case
• A huge telecom company
• A content distribution platform
• Problems:
• No metadata for football games, TV shows, etc.
• They wanted a system for content segmentation
• They couldn't provide us a labeled dataset
• Chosen use case: The Voice TV show
14. Why was the project started?
• We have full episodes of The Voice TV show
• Extract fragments of different types:
• Musical fragments
• Speech fragments
• Judges
• Fragments with a concrete person
• Other
• We started with the 1st and 2nd cases
16. Customer's asks
• The solution should work fast
• They couldn't provide GPU instances
• So we decided to feed only the audio stream to the ML algorithm
• Using the segmentation results, we can assemble highlights via time segmentation
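Assembling highlights from time segmentation can be sketched as merging predicted intervals separated by short gaps; the gap threshold and the numbers are illustrative:

```python
# Sketch of highlight assembly: merge predicted music intervals that are
# separated by short gaps, so a song briefly interrupted by applause
# still yields one highlight clip.

def merge_intervals(intervals, max_gap=2.0):
    """intervals: sorted list of (start, end) in seconds."""
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_intervals([(0.0, 30.5), (31.0, 62.0), (120.0, 150.0)]))
# [(0.0, 62.0), (120.0, 150.0)]
```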
17. Which data we need for our model
[Diagram: an input audio file on a timeline T0…T3 with two music segments; the ground truth is Music: { interval_1: start = T0, end = T1; interval_2: start = T2, end = T3 }]
18. Where we can get labeled data
• The customer couldn't provide labeled data and had no labeling budget
• We decided to find an open dataset to solve this task
• We decided to use AudioSet to extract different types of audio fragments and create synthetic samples
20. Model – audio features
[Architecture diagram: audio features → GRU → fully-connected layer → CRF layer → output mask]
21. Model – target
[Diagram: the frame-level target over the timeline T0…T3; frames inside the music intervals are labeled 1, all other frames 0]
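Converting the interval ground truth into a per-frame 0/1 target might look like this; the frame length and example numbers are assumptions for illustration:

```python
# Sketch of turning interval ground truth into a frame-level target:
# frames inside a music interval get label 1, everything else gets 0.

def frame_targets(intervals, duration, frame_sec=0.5):
    """intervals: list of (start, end) seconds; returns 0/1 per frame."""
    n_frames = int(duration / frame_sec)
    target = [0] * n_frames
    for start, end in intervals:
        for i in range(n_frames):
            t = i * frame_sec
            if start <= t < end:
                target[i] = 1
    return target

# Two music intervals in a 5-second clip, 0.5 s frames:
print(frame_targets([(0.0, 1.5), (3.0, 4.0)], duration=5.0))
# [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
```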
22. Generation of labeled data
[Diagram: a pool of positive samples and a pool of negative samples feed a synthetic sample generator, which produces the final data sample together with the start time and finish time of the positive fragment]
23. More details about labeled data generation
[Diagram: random crop generation: a left border and a right border are sampled around the final data sample, and the start time and finish time of the label are adjusted to the crop]
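The generator idea can be sketched as follows, with lists of numbers standing in for audio arrays (a simplification; the pools and values are invented):

```python
import random

# Sketch of the synthetic sample generator: glue a negative clip, a
# positive (music) clip, and another negative clip together, and record
# where the positive part starts and finishes - that pair is the label.

def make_sample(positive_pool, negative_pool, rng=random):
    pos = rng.choice(positive_pool)
    left, right = rng.choice(negative_pool), rng.choice(negative_pool)
    audio = left + pos + right
    start, finish = len(left), len(left) + len(pos)
    return audio, (start, finish)

rng = random.Random(0)
audio, (start, finish) = make_sample([[7, 7, 7]], [[1, 2], [3]], rng=rng)
print(audio[start:finish])  # the positive fragment: [7, 7, 7]
```

The recorded (start, finish) pair is exactly the ground truth the target mask is built from, so no human labeling is needed for these samples.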
27. Outcome
• We delivered a relevant segmentation model to the customer
• We gained the customer's trust in our ML capabilities
• After this project we ran a new innovation program with the customer
29. A few words about the project
• Customer: a company which develops software for the financial sector
• Initial set-up:
• The customer had the idea to replace a commercial solution
• The customer wanted to build a unified pipeline for many types of documents
• This pipeline should be customizable
• We didn't have a budget for data labeling
• We had to work with an engineering team on the customer's side
35. Iterative pipeline for model training
• Start with documents without labels
• Mine labeled dataset v1 via fuzzy matching with high thresholds; train a model
• Mine labeled dataset v2 via fuzzy matching with lower thresholds plus the model's results; train a pattern extraction model
• Mine labeled dataset v3 using the pattern extraction model
• Produce the final version of the labeled dataset via clustering + heuristics
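The fuzzy-matching mining step could be sketched with the standard library's difflib; the field values and thresholds here are invented for illustration, not the project's actual matcher:

```python
import difflib

# Sketch of threshold-based fuzzy matching for label mining: a known
# field value is compared against OCRed text spans, and only matches
# above the similarity threshold become labels. A high threshold gives
# the precision-oriented first dataset version; a lower one is usable
# once a trained model can filter the extra noise.

def mine_labels(known_value, candidates, threshold=0.9):
    out = []
    for span in candidates:
        score = difflib.SequenceMatcher(None, known_value.lower(),
                                        span.lower()).ratio()
        if score >= threshold:
            out.append((span, round(score, 2)))
    return out

spans = ["lnvoice Number", "Invoice Number", "Due Date"]
print(mine_labels("Invoice Number", spans, threshold=0.9))
```

Note how the OCR error ("lnvoice") is still captured at the 0.9 threshold, which is the point of fuzzy rather than exact matching.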
36. Modeling pipeline and inference
• The initial model gave 70+ % automation for the main use case
• The new solution replaced the commercial one in the customer's flow
• After a few iterations of improvement using the established labeling flow:
• Achieved 85+ % automation
• More sophisticated cases were solved
• More modern models were developed on top of the established labeling pipeline (both manual and auto-labeling)
37. Label propagation
• To save domain experts' time on manual labeling, we apply label propagation
• After collecting a bunch of files that we inferenced incorrectly, we cluster them using HDBSCAN and pick some individual samples from each cluster
• We give these samples to experts for manual labeling of entities
• After that, we use rule-based approaches to find entities in the same place in other documents within each cluster
[3D clustering visualisation]
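The "pick samples from each cluster" step can be sketched as below; the clustering itself (HDBSCAN in the talk) is assumed already done, and the document names are invented:

```python
# Sketch of picking representatives per cluster for expert labeling,
# given cluster ids assigned by a clusterer. HDBSCAN marks noise as
# cluster -1, and those samples are skipped rather than sent to experts,
# since there is no cluster to propagate their labels within.

def pick_representatives(doc_ids, cluster_ids, per_cluster=1):
    by_cluster = {}
    for doc, cluster in zip(doc_ids, cluster_ids):
        if cluster == -1:          # noise: nothing to propagate to
            continue
        by_cluster.setdefault(cluster, []).append(doc)
    return {c: docs[:per_cluster] for c, docs in by_cluster.items()}

clusters = [0, 0, 1, -1, 1, 0]
docs = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf", "f.pdf"]
print(pick_representatives(docs, clusters))
# {0: ['a.pdf'], 1: ['c.pdf']}
```

Experts label only the returned representatives; rule-based matching then copies those entity positions to the rest of each cluster.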
38. Active learning pipeline
[Diagram of the loop:]
• New documents without labels are OCRed and fed to the model
• Predictions that fail auto-validation go to manual validation; the rest are checked against ground-truth data
• Validated predictions form an increment for the labeled set, updating the labeled dataset
• The pattern extraction model mines the OCRed docs for patterns and labels documents for new patterns
• The training pipeline trains a new model on the updated training dataset of labeled docs
• The new model replaces the old one if it works better
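One iteration of the loop above can be sketched as follows; the toy model, validator, and manual-fix stand-in are all invented for illustration:

```python
# Sketch of one active-learning iteration: only predictions that fail
# auto-validation go to humans, and the labeled set grows by the whole
# validated increment either way.

def active_learning_step(model, docs, auto_valid, labeled):
    """model: doc -> prediction; auto_valid: prediction -> bool."""
    for doc in docs:
        pred = model(doc)
        if auto_valid(pred):
            labeled[doc] = pred              # trusted without human review
        else:
            labeled[doc] = manual_fix(pred)  # expert corrects the failure
    return labeled

def manual_fix(pred):                        # stand-in for expert validation
    return pred.strip().title()

model = lambda doc: {"a.pdf": "ACME Corp", "b.pdf": " acme corp "}[doc]
auto_valid = lambda pred: pred == pred.strip()
print(active_learning_step(model, ["a.pdf", "b.pdf"], auto_valid, {}))
# {'a.pdf': 'ACME Corp', 'b.pdf': 'Acme Corp'}
```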
39. Summary
• Not having a labeling budget is not the end of the world
• Even if your business task is not typical, you could try to re-use a pre-trained model:
• Auto-labeling
• Label propagation
• Class similarity
• A combination of classical CV + heuristics + ML iterations can bring results
• Of course, to have a stable system it is important to have good-quality labeled data