The ideal flow of an ML project assumes the presence of data labeled for the particular business task. But what if we need to start a project as soon as possible and don't yet have a relevant amount of such data, or don't have an annotation budget? This problem especially hurts cases where we plan to use deep learning approaches as the most promising ones. In this talk we will discuss real projects where we faced this problem in the past and demonstrate a few helpful practical approaches to solving it.
[DSC Europe 22] Starting deep learning projects without a sufficient amount of labeled data – a few practical examples – Sergei Smirnov
1. Starting deep learning projects without a sufficient amount of labeled data – a few practical examples
2. SERGEI SMIRNOV
Chief Data Scientist II
• 6+ years at EPAM
• 12+ years of experience with ML/DS
• Participated in many projects in RecSys, NLP, CV, time series, etc.
• Joined EPAM Serbia in 2022
• Responsible for DS/MLE in Serbia/Montenegro/Türkiye
• Responsible for developing the DS competency at EPAM globally
3. Intro
• Ideal flow of a deep learning project:
• Solid business understanding
• Acceptable data quality
• Availability of labeled data
• A representative sample
• High-quality labels
• A budget for data labeling/re-labeling if needed
4. Why this talk is important
• The typical set-up at the early stages of a project
• Problems with labels
• Pre-trained models and leveraging existing solutions
• Auto-labeling/label propagation
5. Why this question is relevant for EPAM
• Sometimes you need to persuade the customer to work with you
• Start-up mode
• New customer
• No labeling budget
• Specific cases (hard to find datasets/models for them)
7. Darts
• Support for a startup
• Darts scoring:
• Detection of the board
• Detection of the sector
• Scoring
• Challenges with classical CV:
• Too many heuristics to maintain
• Edge cases which could make the solution unstable
8. Board detection using deep learning
• No datasets with dartboards
• The circles/lines-based solution was unstable
• Intuition:
• Let's label a small sample
• Add this data to fine-tune a pre-trained model
• The model can learn new boxes faster than new classes
• Real solution:
• Use a model pre-trained on the COCO dataset
• Use the 'clock' class as a proxy for the board
• Label a few cases where it fails
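The "proxy class" trick above can be sketched as follows; the detector call itself is omitted, and the detection tuples, names, and threshold are illustrative assumptions, not the project's code:

```python
# Sketch of the class-proxy idea: a detector pre-trained on COCO has no
# "dartboard" class, but "clock" (round, high-contrast face) fires on
# dartboards too, so we keep only confident "clock" detections and treat
# them as board candidates.

def board_candidates(detections, proxy_class="clock", min_score=0.5):
    """detections: list of (class_name, score, (x1, y1, x2, y2))."""
    return [box for cls, score, box in detections
            if cls == proxy_class and score >= min_score]

detections = [
    ("clock",  0.91, (120, 80, 380, 340)),   # the dartboard
    ("person", 0.88, (10, 20, 90, 300)),
    ("clock",  0.30, (400, 10, 420, 30)),    # low-confidence wall clock
]
print(board_candidates(detections))  # [(120, 80, 380, 340)]
```

The labeled "failing cases" from the last bullet are then exactly the frames where this filter returns nothing useful.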
9. What we did next
• Re-trained the model
• Minimally corrected the results
• Trained a lightweight model (SSD) + tracking between frames
• 1-2 weeks of work
• Results of the project:
• A pipeline for board detection based on deep learning
• A classical CV segmentation algorithm
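A minimal sketch of the "tracking between frames" idea, assuming a simple IoU-based association (function names and the threshold are invented for illustration, not the project's code):

```python
# Minimal IoU tracker sketch: the lightweight SSD detector can miss the
# board in some frames, so we keep the last confident box and accept a
# new detection only if it overlaps the previous one enough.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def track(prev_box, detection, min_iou=0.3):
    """Keep the previous box when the new detection disagrees too much."""
    if detection is None:
        return prev_box                      # detector missed this frame
    if prev_box is None or iou(prev_box, detection) >= min_iou:
        return detection                     # accept the new detection
    return prev_box                          # reject an inconsistent jump

print(track((0, 0, 10, 10), (1, 1, 11, 11)))  # (1, 1, 11, 11): accepted
```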
10. CV segmentation case
• A start-up doing pig weight estimation
• No labeled data
• Time/budget constraints
11. How to do pig segmentation
• Initial approach:
• Add segmentation and object detection models
• There is no 'pig' class, but the 'human' class works almost fine
• Improve segmentation results with a color space transformation
• Re-train the segmentation model (U-Net-like)
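The color-space idea can be sketched like this; the HSV thresholds and the toy per-pixel representation are assumptions for illustration, not the actual project values:

```python
import colorsys

# Illustrative color-space trick: pinkish pig pixels separate from a
# darker pen floor much more cleanly in HSV than in raw RGB, so a coarse
# HSV mask can clean up a segmentation model's output.

def pigish_mask(pixels, min_sat=0.1, min_val=0.5):
    """pixels: list of (r, g, b) in 0..1 -> list of booleans."""
    mask = []
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        mask.append(s >= min_sat and v >= min_val)
    return mask

pixels = [(0.9, 0.6, 0.6),   # pink-ish: likely pig
          (0.2, 0.2, 0.2)]   # dark floor
print(pigish_mask(pixels))   # [True, False]
```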
12. Results
• After correcting the labels, we re-trained the segmentation networks
• Relevant results (5-7% MAPE on weight estimates)
• The whole PoC solution was done in 3 weeks
13. Audio/Voice case
• A huge telecom company
• A content distribution platform
• Problems:
• No metadata for football games, TV shows, etc.
• They wanted a system for content segmentation
• They couldn't provide us a labeled dataset
• Chosen use case: The Voice TV show
14. Why was the project started?
• We have full episodes of The Voice TV show
• Extract fragments of different types:
• Musical fragments
• Speech fragments
• Judges
• Fragments with a concrete person
• Other
• We started with the 1st and 2nd cases
16. Customer's asks
• The solution should work fast
• They couldn't provide GPU instances
• So we decided to feed only the audio stream to the ML algorithm
• Using the segmentation results, we can assemble highlights via time segmentation
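Assembling highlights from time segmentation can be sketched as merging predicted intervals separated by short gaps; the gap threshold and the numbers are illustrative:

```python
# Sketch of highlight assembly: merge predicted music intervals that are
# separated by short gaps, so a song briefly interrupted by applause
# still yields one highlight clip.

def merge_intervals(intervals, max_gap=2.0):
    """intervals: sorted list of (start, end) in seconds."""
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_intervals([(0.0, 30.5), (31.0, 62.0), (120.0, 150.0)]))
# [(0.0, 62.0), (120.0, 150.0)]
```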
17. Which data we need for our model
[Diagram: an input audio file on a timeline T0…T3 with two music segments; the ground truth is Music: { interval_1: start = T0, end = T1; interval_2: start = T2, end = T3 }]
18. Where we can get labeled data
• The customer couldn't provide labeled data and had no labeling budget
• We decided to find an open dataset to solve this task
• We decided to use AudioSet to extract different types of audio fragments and create synthetic samples
20. Model – audio features
[Architecture diagram: audio features → GRU → fully-connected layer → CRF layer → output mask]
21. Model – target
[Diagram: the frame-level target over the timeline T0…T3; frames inside the music intervals are labeled 1, all other frames 0]
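Converting the interval ground truth into a per-frame 0/1 target might look like this; the frame length and example numbers are assumptions for illustration:

```python
# Sketch of turning interval ground truth into a frame-level target:
# frames inside a music interval get label 1, everything else gets 0.

def frame_targets(intervals, duration, frame_sec=0.5):
    """intervals: list of (start, end) seconds; returns 0/1 per frame."""
    n_frames = int(duration / frame_sec)
    target = [0] * n_frames
    for start, end in intervals:
        for i in range(n_frames):
            t = i * frame_sec
            if start <= t < end:
                target[i] = 1
    return target

# Two music intervals in a 5-second clip, 0.5 s frames:
print(frame_targets([(0.0, 1.5), (3.0, 4.0)], duration=5.0))
# [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
```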
22. Generation of labeled data
[Diagram: a pool of positive samples and a pool of negative samples feed a synthetic sample generator, which produces the final data sample together with the start time and finish time of the positive fragment]
23. More details about labeled data generation
[Diagram: random crop generation: a left border and a right border are sampled around the final data sample, and the start time and finish time of the label are adjusted to the crop]
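The generator idea can be sketched as follows, with lists of numbers standing in for audio arrays (a simplification; the pools and values are invented):

```python
import random

# Sketch of the synthetic sample generator: glue a negative clip, a
# positive (music) clip, and another negative clip together, and record
# where the positive part starts and finishes - that pair is the label.

def make_sample(positive_pool, negative_pool, rng=random):
    pos = rng.choice(positive_pool)
    left, right = rng.choice(negative_pool), rng.choice(negative_pool)
    audio = left + pos + right
    start, finish = len(left), len(left) + len(pos)
    return audio, (start, finish)

rng = random.Random(0)
audio, (start, finish) = make_sample([[7, 7, 7]], [[1, 2], [3]], rng=rng)
print(audio[start:finish])  # the positive fragment: [7, 7, 7]
```

The recorded (start, finish) pair is exactly the ground truth the target mask is built from, so no human labeling is needed for these samples.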
27. Outcome
• We delivered a relevant segmentation model to the customer
• We gained the customer's trust in our ML capabilities
• After this project we ran a new innovation program with the customer
29. A few words about the project
• Customer: a company which develops software for the financial sector
• Initial set-up:
• The customer had the idea to replace a commercial solution
• The customer wanted to build a unified pipeline for many types of documents
• This pipeline should be customizable
• We didn't have a budget for data labeling
• We had to work with an engineering team on the customer's side
35. Iterative pipeline for model training
• Start with documents without labels
• Mine labeled dataset v1 via fuzzy matching with high thresholds; train a model
• Mine labeled dataset v2 via fuzzy matching with lower thresholds plus the model's results; train a pattern extraction model
• Mine labeled dataset v3 using the pattern extraction model
• Produce the final version of the labeled dataset via clustering + heuristics
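The fuzzy-matching mining step could be sketched with the standard library's difflib; the field values and thresholds here are invented for illustration, not the project's actual matcher:

```python
import difflib

# Sketch of threshold-based fuzzy matching for label mining: a known
# field value is compared against OCRed text spans, and only matches
# above the similarity threshold become labels. A high threshold gives
# the precision-oriented first dataset version; a lower one is usable
# once a trained model can filter the extra noise.

def mine_labels(known_value, candidates, threshold=0.9):
    out = []
    for span in candidates:
        score = difflib.SequenceMatcher(None, known_value.lower(),
                                        span.lower()).ratio()
        if score >= threshold:
            out.append((span, round(score, 2)))
    return out

spans = ["lnvoice Number", "Invoice Number", "Due Date"]
print(mine_labels("Invoice Number", spans, threshold=0.9))
```

Note how the OCR error ("lnvoice") is still captured at the 0.9 threshold, which is the point of fuzzy rather than exact matching.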
36. Modeling pipeline and inference
• The initial model gave 70+ % automation for the main use case
• The new solution replaced the commercial one in the customer's flow
• After a few iterations of improvement using the established labeling flow:
• Achieved 85+ % automation
• More sophisticated cases were solved
• More modern models were developed on top of the established labeling pipeline (both manual and auto-labeling)
37. Label propagation
• To save domain experts' time on manual labeling, we apply label propagation
• After collecting a bunch of files that we inferenced incorrectly, we cluster them using HDBSCAN and pick some individual samples from each cluster
• We give these samples to experts for manual labeling of entities
• After that, we use rule-based approaches to find entities in the same place in other documents within each cluster
[3D clustering visualisation]
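The "pick samples from each cluster" step can be sketched as below; the clustering itself (HDBSCAN in the talk) is assumed already done, and the document names are invented:

```python
# Sketch of picking representatives per cluster for expert labeling,
# given cluster ids assigned by a clusterer. HDBSCAN marks noise as
# cluster -1, and those samples are skipped rather than sent to experts,
# since there is no cluster to propagate their labels within.

def pick_representatives(doc_ids, cluster_ids, per_cluster=1):
    by_cluster = {}
    for doc, cluster in zip(doc_ids, cluster_ids):
        if cluster == -1:          # noise: nothing to propagate to
            continue
        by_cluster.setdefault(cluster, []).append(doc)
    return {c: docs[:per_cluster] for c, docs in by_cluster.items()}

clusters = [0, 0, 1, -1, 1, 0]
docs = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf", "f.pdf"]
print(pick_representatives(docs, clusters))
# {0: ['a.pdf'], 1: ['c.pdf']}
```

Experts label only the returned representatives; rule-based matching then copies those entity positions to the rest of each cluster.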
38. Active learning pipeline
[Diagram of the loop:]
• New documents without labels are OCRed and fed to the model
• Predictions that fail auto-validation go to manual validation; the rest are checked against ground-truth data
• Validated predictions form an increment for the labeled set, updating the labeled dataset
• The pattern extraction model mines the OCRed docs for patterns and labels documents for new patterns
• The training pipeline trains a new model on the updated training dataset of labeled docs
• The new model replaces the old one if it works better
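One iteration of the loop above can be sketched as follows; the toy model, validator, and manual-fix stand-in are all invented for illustration:

```python
# Sketch of one active-learning iteration: only predictions that fail
# auto-validation go to humans, and the labeled set grows by the whole
# validated increment either way.

def active_learning_step(model, docs, auto_valid, labeled):
    """model: doc -> prediction; auto_valid: prediction -> bool."""
    for doc in docs:
        pred = model(doc)
        if auto_valid(pred):
            labeled[doc] = pred              # trusted without human review
        else:
            labeled[doc] = manual_fix(pred)  # expert corrects the failure
    return labeled

def manual_fix(pred):                        # stand-in for expert validation
    return pred.strip().title()

model = lambda doc: {"a.pdf": "ACME Corp", "b.pdf": " acme corp "}[doc]
auto_valid = lambda pred: pred == pred.strip()
print(active_learning_step(model, ["a.pdf", "b.pdf"], auto_valid, {}))
# {'a.pdf': 'ACME Corp', 'b.pdf': 'Acme Corp'}
```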
39. Summary
• Not having a labeling budget is not the end of the world
• Even if your business task is not typical, you could try to re-use a pre-trained model:
• Auto-labeling
• Label propagation
• Class similarity
• A combination of classical CV + heuristics + ML iterations can bring results
• Of course, to have a stable system it is important to have good-quality labeled data