SlideShare una empresa de Scribd logo
1 de 40
Starting deep learning projects without sufficient
amount of labeled data – few practical examples
SERGEI SMIRNOV
Chief Data Scientist II
• I have been working at EPAM 6+ years
• Experience with ML/DS 12+ years
• Participated in many projects in RecSys, NLP, CV, Time Series etc.
• Joining EPAM Serbia in 2022
• Responsible for DS/MLE in Serbia/Montenegro/Turkiye/
• Responsible for development DS competency EPAM globally
Intro
• Ideal flow of deep learning project:
• Relevant business understanding
• Normal data quality
• Presence of labeled data
• Representative sample
• Good quality of labeled data
• Budget for data labeling/re-labeling if needed
Why this talk is important
• Typical set-up of early stages of project
• Problem with labels
• Pre-trained models and leveraging existing solutions
• Auto-labeling/labeling propagation
Why this question is actual for EPAM
• Sometimes you need to persuade customer to work with you
• Start-up mode
• New customer
• No labeling budget
• Specific case (hard to find datasets/models for such case)
Agenda
• Few computer vision cases
• Sound-basic case
• Typical NLP case
Darts
• Support for startup
• Darts scoring
• Detection of board
• Detection of sector
• Scoring
• Challenges with classical CV
• Too many heuristic to use
• Edge cases which could make
solution not stable
Board detection using deep learning
• No datasets with board
• Non-stable solution based on circles/lines
• Intuition
• Let’s label small sample
• Add this data to pre-train model
• Model can learn boxes faster than new classes
• Real solution
• Using pre-train model on COCO dataset
• Use clock entity
• Label few not working cases
What did we do next
• Re-train model
• Minimal correction of results
• Train lightweighted model (SSD) + tracking between frames
• 1-2 weeks of work
• Result of the project
• Pipeline for board detection based on deep learning
• Classical CV algorithm of segmentation
CV segmentation case
• Start-up with pig weight calculation
• No labeled data
• Time/budget constraints
How to do pig segmentation
• Initial approach
• Add segmentation and object
detection model
• No entity pig, but entity human is
almost ok
• Improve segmentation results by
color space transformation
• Re-train segmentation model (U-
net like)
Results
• After labeling correction we re-train segmentation networks
• Relevant results (5-7% of MAPE for weight results)
• Whole PoC solution was done in 3 weeks
Audio/Voice case
• Huge telecom company
• Content distribution platform
• Problems
• No meta-data for football games, TV shows etc.
• They wanted system of segmentation of content
• They can’t provide to us labeled dataset
• Chosen use-case: Voice TV show
Why project was initially started?
• We have full Voice TV show
• Extract fragments of different types
• Musical fragments
• Speech fragments
• Judges
• Fragments with concreate person
• Other
• We started to solve 1st case and 2nd case
Proposed ML solution (HIGH LEVEL scheme)
Customer’s ask
• Solution should work fast
• They can’t provide GPU instances
• So, we decided to go only with audio stream for ML algorithm
• Using segmentation results we can combine highlights using time
segmentation
Which data we need for our model
Input audio file
T0 T1 T2 T3
music music
Music:
{ interval_1 : start = T0, end = T1
interval_2 : start = T2, end = T3}
Ground truth
Where we can get labeled data
• Customer can’t provide labeled data and has no labeling budget
• We decided to find some open dataset to solve this task
• Decided to use AudioSet for extracting different types of audio
fragments to create synthetic samples
Audio processing
Audio file Mel spectrogram DCNN PCA
Model – audio features
Audio
features
CRF
layer
mask
GRU Fully-connected layer
Model - target
Audio features
T0 T1 T2 T3
music music
Target
111111 111111
000000 0
0
T0 T1 T2 T3
Generation of labeled data
…
…
Pool of positive samples
Pool of negative samples
Synthetic
sample
generator
Final data sample
Start time Finish time
More details about labeled data generation
Final data sample
Start time Finish time
Start time Finish time
Left border Right border
Random crop generation
First results
Generating hard-negatives
• Run first model on real Voice TV show samples
• Manually analyze results of segmentation
• Add hard-negatives to model
Results on hold-out set
Average Precision=0.96
Outcome
• We provide relevant segmentation model to the customer
• We got trust in our ML capabilities
• After this project we run new innovation program with customer
Financial service entity extraction
{
key_1 : value_1,
………………………
key_last : value_last ,
key_lineItem1 :
lineItem1,
……………………………………..
key_lineItem_last :
lineItem_last
}
Few words about project
•Customer : company which develops software for financial sector
•Initial set up
•Customer had idea to remove commercial solution
•Customer wanted to build unified pipeline for many types of
documents
•This pipeline should be customizable
•We didn’t have budget on data labeling
•We should work with engineering team from customer side
Data flow
Overview of model – high level
Overview of model – feature extraction
Overview of model – target
Auto-labeling
Iterative pipeline for model training
Documents
without labels
Mining labeled
dataset - v1
Training model
Fuzzy matching
with high
thresholds
Mining labeled
dataset – v2
Fuzzy matching
with lower
thresholds +
modeling results
Pattern extraction
model
Mining labeled
dataset – v3
Clustering + Heuristics
Mining labeled
dataset – final
version
Modeling pipeline and inference
• Initial model gave 70+ % of automation for main use-case
• New solution replaced commercial one in customer flow
• After few iteration of improvement using established labeling flow
• Achieved 85+ % of automation
• More sophisticated cases were solved
• More modern models were developed under the top of established labeling
pipeline (both manual and auto-labeling)
Labeling propagation
• To save the time of domain experts on manual
labeling we apply labeling propagation.
• After collecting bunch of files that we
inferenced incorrectly we cluster them using
HDBSCAN and pick some individual samples
from each cluster
• We give these samples to experts for manual
labeling of entities
• After that we use rule-based approaches to
find entities on the same place in other
documents within each cluster
37
3d clustering visualisation
Active learning pipeline
New documents
without labels
Pattern extraction
model
Updated labeled
dataset
Model Manual validation
Predictions (only
which failed auto
validation)
Increment for
labeled set
Training pipeline
Check against
ground truth data
Labeled docs
Training dataset
OCRed docs
OCRed docs
Patterns Labeled documents for
new patterns
New model
Update model if it
works better
Summary
• Not having labeling budget is not end of the world
• Even if your business task is not typical-you could try to re-use pre-
trained model
• Auto-labeling
• Label propagation
• Class similarity
• Combination of classical + heuristic + ML iteration can bring result
• Of course, for having stable system it is important to have good
labeling data quality
Contacts
• E-mail: sergei_smirnov1@epam.com
• Telegram: +79215801652

Más contenido relacionado

Similar a [DSC Europe 22] Starting deep learning projects without sufficient amount of labeled data – few practical examples - Sergei Smirnov

Agile & Lean @ MediaGeniX
Agile & Lean @ MediaGeniXAgile & Lean @ MediaGeniX
Agile & Lean @ MediaGeniXESUG
 
NUS-ISS Learning Day 2018- Application of analytics in manufacturing sector
NUS-ISS Learning Day 2018- Application of analytics in manufacturing sectorNUS-ISS Learning Day 2018- Application of analytics in manufacturing sector
NUS-ISS Learning Day 2018- Application of analytics in manufacturing sectorNUS-ISS
 
Lifecycle of a Data Science Project
Lifecycle of a Data Science ProjectLifecycle of a Data Science Project
Lifecycle of a Data Science ProjectDigital Vidya
 
FlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at HumanaFlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at HumanaDatabricks
 
Steps in Simulation Study
Steps in Simulation StudySteps in Simulation Study
Steps in Simulation StudyNalin Adhikari
 
Lauri Pietarinen - What's Wrong With My Test Data
Lauri Pietarinen - What's Wrong With My Test DataLauri Pietarinen - What's Wrong With My Test Data
Lauri Pietarinen - What's Wrong With My Test DataTEST Huddle
 
Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceData Science Milan
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or realityAwantik Das
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer visionEran Shlomo
 
Webinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep LearningWebinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep LearningLucidworks
 
Artur Suchwalko “What are common mistakes in Data Science projects and how to...
Artur Suchwalko “What are common mistakes in Data Science projects and how to...Artur Suchwalko “What are common mistakes in Data Science projects and how to...
Artur Suchwalko “What are common mistakes in Data Science projects and how to...Lviv Startup Club
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity ManagementEDB
 
The art of system and solution testing
The art of system and solution testingThe art of system and solution testing
The art of system and solution testinggaoliang641
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesPiet J.H. Daas
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Praveen Penumathsa
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makerszekeLabs Technologies
 

Similar a [DSC Europe 22] Starting deep learning projects without sufficient amount of labeled data – few practical examples - Sergei Smirnov (20)

Agile & Lean @ MediaGeniX
Agile & Lean @ MediaGeniXAgile & Lean @ MediaGeniX
Agile & Lean @ MediaGeniX
 
NUS-ISS Learning Day 2018- Application of analytics in manufacturing sector
NUS-ISS Learning Day 2018- Application of analytics in manufacturing sectorNUS-ISS Learning Day 2018- Application of analytics in manufacturing sector
NUS-ISS Learning Day 2018- Application of analytics in manufacturing sector
 
Lifecycle of a Data Science Project
Lifecycle of a Data Science ProjectLifecycle of a Data Science Project
Lifecycle of a Data Science Project
 
FlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at HumanaFlorenceAI: Reinventing Data Science at Humana
FlorenceAI: Reinventing Data Science at Humana
 
Steps in Simulation Study
Steps in Simulation StudySteps in Simulation Study
Steps in Simulation Study
 
Lauri Pietarinen - What's Wrong With My Test Data
Lauri Pietarinen - What's Wrong With My Test DataLauri Pietarinen - What's Wrong With My Test Data
Lauri Pietarinen - What's Wrong With My Test Data
 
Think Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial IntelligenceThink Big | Enterprise Artificial Intelligence
Think Big | Enterprise Artificial Intelligence
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Deep learning for NLP
Deep learning for NLPDeep learning for NLP
Deep learning for NLP
 
The Art of Project Estimation
The Art of Project EstimationThe Art of Project Estimation
The Art of Project Estimation
 
Webinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep LearningWebinar: Question Answering and Virtual Assistants with Deep Learning
Webinar: Question Answering and Virtual Assistants with Deep Learning
 
Artur Suchwalko “What are common mistakes in Data Science projects and how to...
Artur Suchwalko “What are common mistakes in Data Science projects and how to...Artur Suchwalko “What are common mistakes in Data Science projects and how to...
Artur Suchwalko “What are common mistakes in Data Science projects and how to...
 
AshwiniCV- SAP Basis
AshwiniCV- SAP BasisAshwiniCV- SAP Basis
AshwiniCV- SAP Basis
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
 
The art of project estimation
The art of project estimationThe art of project estimation
The art of project estimation
 
The art of system and solution testing
The art of system and solution testingThe art of system and solution testing
The art of system and solution testing
 
EMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniquesEMOS 2018 Big Data methods and techniques
EMOS 2018 Big Data methods and techniques
 
Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram Generating test cases using UML Communication Diagram
Generating test cases using UML Communication Diagram
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 

Más de DataScienceConferenc1

[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDF
[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDF[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDF
[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDFDataScienceConferenc1
 
[DSC Europe 23] Rania Wazir - Mathematician jokes, cute cat photos, offensiv...
[DSC Europe 23] Rania Wazir -  Mathematician jokes, cute cat photos, offensiv...[DSC Europe 23] Rania Wazir -  Mathematician jokes, cute cat photos, offensiv...
[DSC Europe 23] Rania Wazir - Mathematician jokes, cute cat photos, offensiv...DataScienceConferenc1
 
[DSC Europe 23] Irena Cerovic - AI in International Development.pdf
[DSC Europe 23] Irena Cerovic - AI in International Development.pdf[DSC Europe 23] Irena Cerovic - AI in International Development.pdf
[DSC Europe 23] Irena Cerovic - AI in International Development.pdfDataScienceConferenc1
 
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...DataScienceConferenc1
 
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptxDataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Goran Dumic - Data-Driven Approach In Treatments
[DSC Europe 23][DigiHealth]  Goran Dumic -  Data-Driven Approach In Treatments[DSC Europe 23][DigiHealth]  Goran Dumic -  Data-Driven Approach In Treatments
[DSC Europe 23][DigiHealth] Goran Dumic - Data-Driven Approach In TreatmentsDataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Milos Todorovic - Bridging the Gap-Innovating Ag...
[DSC Europe 23][DigiHealth]  Milos Todorovic - Bridging the Gap-Innovating Ag...[DSC Europe 23][DigiHealth]  Milos Todorovic - Bridging the Gap-Innovating Ag...
[DSC Europe 23][DigiHealth] Milos Todorovic - Bridging the Gap-Innovating Ag...DataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...
[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...
[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...DataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Vladimir Brusic - SMART HEALTH HOME: Technology,...
[DSC Europe 23][DigiHealth]  Vladimir Brusic - SMART HEALTH HOME: Technology,...[DSC Europe 23][DigiHealth]  Vladimir Brusic - SMART HEALTH HOME: Technology,...
[DSC Europe 23][DigiHealth] Vladimir Brusic - SMART HEALTH HOME: Technology,...DataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Dimitar Penkov Grid Search Optimization of Novel...
[DSC Europe 23][DigiHealth]  Dimitar Penkov Grid Search Optimization of Novel...[DSC Europe 23][DigiHealth]  Dimitar Penkov Grid Search Optimization of Novel...
[DSC Europe 23][DigiHealth] Dimitar Penkov Grid Search Optimization of Novel...DataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMED
[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMED[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMED
[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMEDDataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...
[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...
[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...DataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...
[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...
[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...DataScienceConferenc1
 
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...DataScienceConferenc1
 
[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with Seif
[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with Seif[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with Seif
[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with SeifDataScienceConferenc1
 
[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...
[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...
[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...DataScienceConferenc1
 
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...DataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...DataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help you
[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help you[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help you
[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help youDataScienceConferenc1
 
[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...
[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...
[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...DataScienceConferenc1
 

Más de DataScienceConferenc1 (20)

[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDF
[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDF[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDF
[DSC Europe 23] Luciano Catani - AI in Diplomacy.PDF
 
[DSC Europe 23] Rania Wazir - Mathematician jokes, cute cat photos, offensiv...
[DSC Europe 23] Rania Wazir -  Mathematician jokes, cute cat photos, offensiv...[DSC Europe 23] Rania Wazir -  Mathematician jokes, cute cat photos, offensiv...
[DSC Europe 23] Rania Wazir - Mathematician jokes, cute cat photos, offensiv...
 
[DSC Europe 23] Irena Cerovic - AI in International Development.pdf
[DSC Europe 23] Irena Cerovic - AI in International Development.pdf[DSC Europe 23] Irena Cerovic - AI in International Development.pdf
[DSC Europe 23] Irena Cerovic - AI in International Development.pdf
 
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
[DSC Europe 23] Ilija Duni - How Foursquare Builds Meaningful Bridges Between...
 
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx
[DSC Europe 23] Branka Panic - Peace in the age of artificial intelligence.pptx
 
[DSC Europe 23][DigiHealth] Goran Dumic - Data-Driven Approach In Treatments
[DSC Europe 23][DigiHealth]  Goran Dumic -  Data-Driven Approach In Treatments[DSC Europe 23][DigiHealth]  Goran Dumic -  Data-Driven Approach In Treatments
[DSC Europe 23][DigiHealth] Goran Dumic - Data-Driven Approach In Treatments
 
[DSC Europe 23][DigiHealth] Milos Todorovic - Bridging the Gap-Innovating Ag...
[DSC Europe 23][DigiHealth]  Milos Todorovic - Bridging the Gap-Innovating Ag...[DSC Europe 23][DigiHealth]  Milos Todorovic - Bridging the Gap-Innovating Ag...
[DSC Europe 23][DigiHealth] Milos Todorovic - Bridging the Gap-Innovating Ag...
 
[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...
[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...
[DSC Europe 23][DigiHealth] Urosh VIlimanovich Clinical Data Management and C...
 
[DSC Europe 23][DigiHealth] Vladimir Brusic - SMART HEALTH HOME: Technology,...
[DSC Europe 23][DigiHealth]  Vladimir Brusic - SMART HEALTH HOME: Technology,...[DSC Europe 23][DigiHealth]  Vladimir Brusic - SMART HEALTH HOME: Technology,...
[DSC Europe 23][DigiHealth] Vladimir Brusic - SMART HEALTH HOME: Technology,...
 
[DSC Europe 23][DigiHealth] Dimitar Penkov Grid Search Optimization of Novel...
[DSC Europe 23][DigiHealth]  Dimitar Penkov Grid Search Optimization of Novel...[DSC Europe 23][DigiHealth]  Dimitar Penkov Grid Search Optimization of Novel...
[DSC Europe 23][DigiHealth] Dimitar Penkov Grid Search Optimization of Novel...
 
[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMED
[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMED[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMED
[DSC Europe 23][DigiHealth] Tomislav Krizan - AIMED
 
[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...
[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...
[DSC Europe 23][DigiHealth] Katarina Vucicevic - Navigating theKinetics of Dr...
 
[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...
[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...
[DSC Europe 23][DigiHealth] Anja Baresic 0- Croatian digital Healthcare ecosy...
 
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...
 
[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with Seif
[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with Seif[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with Seif
[DSC Europe 23][AI:CSI] Uros Arsenijevic Unlocking Cybersecurity with Seif
 
[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...
[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...
[DSC Europe 23][AI:CSI] Goran Gvozden Improving Cybersecurity Posture with an...
 
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
[DSC Europe 23][AI:CSI] Aleksa Stojanovic - Applying AI for Threat Detection ...
 
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
[DSC Europe 23][DigiHealth] Muthu Ramachandran AI and Blockchain Framework fo...
 
[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help you
[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help you[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help you
[DSC Europe 23][DigiHealth] Ligia Kornowska-How_may AI help you
 
[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...
[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...
[DSC Europe 23][DigiHealth] Ilya Zakharov - NETWORK NEUROSCIENCE WHERE THE BR...
 

Último

办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 

Último (20)

办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 

[DSC Europe 22] Starting deep learning projects without sufficient amount of labeled data – few practical examples - Sergei Smirnov

  • 1. Starting deep learning projects without sufficient amount of labeled data – few practical examples
  • 2. SERGEI SMIRNOV Chief Data Scientist II • I have been working at EPAM 6+ years • Experience with ML/DS 12+ years • Participated in many projects in RecSys, NLP, CV, Time Series etc. • Joining EPAM Serbia in 2022 • Responsible for DS/MLE in Serbia/Montenegro/Turkiye/ • Responsible for development DS competency EPAM globally
  • 3. Intro • Ideal flow of deep learning project: • Relevant business understanding • Normal data quality • Presence of labeled data • Representative sample • Good quality of labeled data • Budget for data labeling/re-labeling if needed
  • 4. Why this talk is important • Typical set-up of early stages of project • Problem with labels • Pre-trained models and leveraging existing solutions • Auto-labeling/labeling propagation
  • 5. Why this question is actual for EPAM • Sometimes you need to persuade customer to work with you • Start-up mode • New customer • No labeling budget • Specific case (hard to find datasets/models for such case)
  • 6. Agenda • Few computer vision cases • Sound-basic case • Typical NLP case
  • 7. Darts • Support for startup • Darts scoring • Detection of board • Detection of sector • Scoring • Challenges with classical CV • Too many heuristic to use • Edge cases which could make solution not stable
  • 8. Board detection using deep learning • No datasets with board • Non-stable solution based on circles/lines • Intuition • Let’s label small sample • Add this data to pre-train model • Model can learn boxes faster than new classes • Real solution • Using pre-train model on COCO dataset • Use clock entity • Label few not working cases
  • 9. What did we do next • Re-train model • Minimal correction of results • Train lightweighted model (SSD) + tracking between frames • 1-2 weeks of work • Result of the project • Pipeline for board detection based on deep learning • Classical CV algorithm of segmentation
  • 10. CV segmentation case • Start-up with pig weight calculation • No labeled data • Time/budget constraints
  • 11. How to do pig segmentation • Initial approach • Add segmentation and object detection model • No entity pig, but entity human is almost ok • Improve segmentation results by color space transformation • Re-train segmentation model (U- net like)
  • 12. Results • After labeling correction we re-train segmentation networks • Relevant results (5-7% of MAPE for weight results) • Whole PoC solution was done in 3 weeks
  • 13. Audio/Voice case • Huge telecom company • Content distribution platform • Problems • No meta-data for football games, TV shows etc. • They wanted system of segmentation of content • They can’t provide to us labeled dataset • Chosen use-case: Voice TV show
  • 14. Why project was initially started? • We have full Voice TV show • Extract fragments of different types • Musical fragments • Speech fragments • Judges • Fragments with concreate person • Other • We started to solve 1st case and 2nd case
  • 15. Proposed ML solution (HIGH LEVEL scheme)
  • 16. Customer’s ask • Solution should work fast • They can’t provide GPU instances • So, we decided to go only with audio stream for ML algorithm • Using segmentation results we can combine highlights using time segmentation
  • 17. Which data we need for our model Input audio file T0 T1 T2 T3 music music Music: { interval_1 : start = T0, end = T1 interval_2 : start = T2, end = T3} Ground truth
  • 18. Where we can get labeled data • Customer can’t provide labeled data and has no labeling budget • We decided to find some open dataset to solve this task • Decided to use AudioSet for extracting different types of audio fragments to create synthetic samples
  • 19. Audio processing Audio file Mel spectrogram DCNN PCA
  • 20. Model – audio features Audio features CRF layer mask GRU Fully-connected layer
  • 21. Model - target Audio features T0 T1 T2 T3 music music Target 111111 111111 000000 0 0 T0 T1 T2 T3
  • 22. Generation of labeled data … … Pool of positive samples Pool of negative samples Synthetic sample generator Final data sample Start time Finish time
  • 23. More details about labeled data generation Final data sample Start time Finish time Start time Finish time Left border Right border Random crop generation
  • 25. Generating hard-negatives • Run first model on real Voice TV show samples • Manually analyze results of segmentation • Add hard-negatives to model
  • 26. Results on hold-out set Average Precision=0.96
  • 27. Outcome • We provide relevant segmentation model to the customer • We got trust in our ML capabilities • After this project we run new innovation program with customer
  • 28. Financial service entity extraction { key_1 : value_1, ……………………… key_last : value_last , key_lineItem1 : lineItem1, …………………………………….. key_lineItem_last : lineItem_last }
  • 29. Few words about project •Customer : company which develops software for financial sector •Initial set up •Customer had idea to remove commercial solution •Customer wanted to build unified pipeline for many types of documents •This pipeline should be customizable •We didn’t have budget on data labeling •We should work with engineering team from customer side
  • 31. Overview of model – high level
  • 32. Overview of model – feature extraction
  • 33. Overview of model – target
  • 35. Iterative pipeline for model training Documents without labels Mining labeled dataset - v1 Training model Fuzzy matching with high thresholds Mining labeled dataset – v2 Fuzzy matching with lower thresholds + modeling results Pattern extraction model Mining labeled dataset – v3 Clustering + Heuristics Mining labeled dataset – final version
  • 36. Modeling pipeline and inference • Initial model gave 70+ % of automation for main use-case • New solution replaced commercial one in customer flow • After few iteration of improvement using established labeling flow • Achieved 85+ % of automation • More sophisticated cases were solved • More modern models were developed under the top of established labeling pipeline (both manual and auto-labeling)
  • 37. Labeling propagation • To save the time of domain experts on manual labeling we apply labeling propagation. • After collecting bunch of files that we inferenced incorrectly we cluster them using HDBSCAN and pick some individual samples from each cluster • We give these samples to experts for manual labeling of entities • After that we use rule-based approaches to find entities on the same place in other documents within each cluster 37 3d clustering visualisation
  • 38. Active learning pipeline New documents without labels Pattern extraction model Updated labeled dataset Model Manual validation Predictions (only which failed auto validation) Increment for labeled set Training pipeline Check against ground truth data Labeled docs Training dataset OCRed docs OCRed docs Patterns Labeled documents for new patterns New model Update model if it works better
  • 39. Summary • Not having labeling budget is not end of the world • Even if your business task is not typical-you could try to re-use pre- trained model • Auto-labeling • Label propagation • Class similarity • Combination of classical + heuristic + ML iteration can bring result • Of course, for having stable system it is important to have good labeling data quality

Notas del editor

  1. Split on 2 slides – first 1 with ideal flow, second– problematic case
  2. Create separate slide with clock=board, ask ritorical question to audience