© 2022 Tryolabs
Data Versioning
Towards Reproducibility in
Machine Learning
Nicolás Eiris
Machine Learning Engineer
Tryolabs
Tryolabs
• We build custom AI solutions
• 70+ team members
• 12+ years of experience
• Served more than 150 clients
Agenda
1. Main pain points in ML workflows
2. Useful open source tool
3. Takeaways
4. References
Dilemma in ML development
Building everything manually from scratch vs. using a tool to support the development phase (from collecting data to deploying on the edge).
Main pain points in ML workflows
Standard ML workflow
1. Data ingestion
2. Exploratory data analysis
3. Data cleaning
4. Feature engineering
5. Modeling
6. Experimentation & evaluation
7. Rolling out to production
ML pipeline in practice
Ad-hoc copies pile up at every stage:
• Data: data, data_v2
• EDA*: EDA, EDA_2, EDA_3
• Features: features, features copy, features_2, features_3
• Models: model, model_1, model_1_2, model_2_2, model_data_v2, model_prefinal, model_final
Manual cloud workflow:
• Upload code
• Set up storage & upload data
• Set up cloud runner (GPU, NAS, etc.)
• Wait, the data & code don't match
• Run train/test script
• Download data + code
• Sync data and code
• Oh no! The requirements have changed
• Where do I report output results anyway?
• Rage quit job
*EDA = Exploratory data analysis
Main pain points in ML workflows
1. Reproducibility
● Teamwork
● Usually ad-hoc processes
● Productivity bottleneck
● Challenges
○ Changes in data
○ Hyperparameter inconsistency
○ Randomness
○ Manual and ad-hoc execution of experiments
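Three of these challenges (changes in data, hyperparameter inconsistency, randomness) can be made concrete with a small sketch. The Python below is an illustration only, not taken from the talk or from DVC: `run_experiment`, the parameter names, and the toy dataset are hypothetical. It seeds a local random generator and records the hyperparameters together with a hash of the data, so two runs with the same inputs can be compared for equality.

```python
import hashlib
import json
import random


def run_experiment(params: dict, data: bytes) -> dict:
    """Run a toy 'experiment' and return a manifest describing it."""
    # Seed a local generator so the run does not depend on global state.
    rng = random.Random(params["seed"])
    # Stand-in for training: a pseudo-random score derived only from the seed.
    score = round(rng.random(), 6)
    return {
        "params": params,
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "score": score,
    }


# Hypothetical hyperparameters and dataset, for illustration only.
params = {"seed": 42, "learning_rate": 0.01}
dataset = b"x,y\n1,2\n3,4\n"

first = run_experiment(params, dataset)
second = run_experiment(params, dataset)
print(json.dumps(first, indent=2, sort_keys=True))
print("identical runs:", first == second)
```

If either the seed, the hyperparameters, or the data bytes change, the manifest changes too, which is the kind of signal a versioning tool can act on.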
Main pain points in ML workflows
1. Reproducibility
“Changes are uploaded, please run the whole notebook again.”
Main pain points in ML workflows
2. Data sharing
• Complex READMEs on how to gather data from remote storage
• Security and data privacy risks
• Manual versioning of dataset changes
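One way to picture automated dataset versioning is content addressing: each file version is stored under the hash of its contents, and only a small pointer file travels with the code. The sketch below is a simplified illustration under that assumption, not DVC's actual implementation; `snapshot`, `restore`, and the `.pointer.json` format are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def snapshot(data_file: Path, store: Path) -> Path:
    """Copy data_file into a content-addressed store and write a pointer file."""
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(content)  # same content, same key: stored once
    pointer = data_file.parent / (data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer


def restore(pointer: Path, store: Path) -> bytes:
    """Fetch the exact data version that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    return (store / meta["sha256"]).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    workdir, store = Path(tmp) / "project", Path(tmp) / "store"
    workdir.mkdir()
    data = workdir / "data.csv"

    data.write_bytes(b"id,label\n1,cat\n")
    pointer_v1 = snapshot(data, store).read_text()  # remember the v1 pointer

    data.write_bytes(b"id,label\n1,cat\n2,dog\n")
    pointer_path = snapshot(data, store)

    # Putting the old pointer back (as a Git checkout would) selects v1 again.
    pointer_path.write_text(pointer_v1)
    restored_v1 = restore(pointer_path, store)
    stored_versions = len(list(store.iterdir()))
    print(restored_v1, stored_versions)
```

Because the pointer file is tiny, it can live in the Git repository next to the code, while the store can sit on any shared storage.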
Main pain points in ML workflows
2. Data sharing
“I wish I could automate this process…”
Main pain points in ML workflows
3. Experiments execution & tracking
● Experiment setup traceability challenges
● Inefficient results comparison & evaluation
● Manual process:
○ Spreadsheet
○ GitHub (metadata files)
○ Tracking tools (steep learning curve)
Ideal development experience
Structured pipeline composed of interdependent steps
Sharing experiments, models, and results in a simple way
Easily adding files or directories to a remote repository
Stop worrying about source code and data association
Useful open source tool:
Data Version Control (DVC)
DVC high-level overview
Each commit is shown with its own model, features, and data:
• 7fe5fc5: Update features
• d512ef1: Update dataset and input parameters
• 23811e0: Adjusting input parameters
• e7eb61f: Add the new dataset and features
• 020c55f: Adjusting input parameters
DVC high-level overview
Figure: accuracy over time (marked values: 76% and 87%), with labels Collaborate, Deploy to production, Rollback (commit 8d7aa3d), Cloud, and Local Cache.
Main features
● Git-compatible
● Reproducible
● Low friction branching
● Storage agnostic
● ML pipeline framework
● Language & framework agnostic
● Track failures
● Experiments & metrics tracking
Pipelines
● Pipelines composed of interdependent steps
○ Dependencies
○ Code to execute
○ Outputs
● Pipeline visualization command: dvc dag
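A pipeline of interdependent steps can be modeled as a directed acyclic graph. The following Python sketch is an illustration under stated assumptions, not DVC's pipeline format or code: the stage names and the `deps`/`outs` keys are hypothetical. It orders stages by their dependencies and works out which stages need to rerun after an input changes.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each lists the artifacts it depends on and produces.
stages = {
    "prepare": {"deps": ["raw_data"], "outs": ["clean_data"]},
    "featurize": {"deps": ["clean_data"], "outs": ["features"]},
    "train": {"deps": ["features", "params"], "outs": ["model"]},
    "evaluate": {"deps": ["model", "features"], "outs": ["metrics"]},
}


def execution_order(stages: dict) -> list:
    """Order stages so each runs after the stages producing its dependencies."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


def stages_to_rerun(stages: dict, changed: set) -> list:
    """Return the stages affected by changed artifacts, in execution order."""
    dirty = set(changed)
    rerun = []
    for name in execution_order(stages):
        if dirty & set(stages[name]["deps"]):
            rerun.append(name)
            # Outputs of a rerun stage count as changed for later stages.
            dirty |= set(stages[name]["outs"])
    return rerun


order = execution_order(stages)
print(order)
print(stages_to_rerun(stages, {"params"}))
```

In this toy graph, changing only `params` leaves `prepare` and `featurize` untouched, which is the saving that a dependency-aware pipeline is meant to give.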
Metrics differences
Smooth comparison process: numerical and graphical visualization
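The numeric side of such a comparison boils down to a per-metric difference between two runs. The sketch below shows one possible shape of that computation; `metrics_diff` is a hypothetical helper and the metric values are illustrative, not results from the talk.

```python
def metrics_diff(baseline: dict, candidate: dict) -> list:
    """Return (name, old, new, change) rows for every metric in either run."""
    rows = []
    for name in sorted(set(baseline) | set(candidate)):
        old, new = baseline.get(name), candidate.get(name)
        # A metric missing from one side has no meaningful change value.
        change = None if old is None or new is None else round(new - old, 6)
        rows.append((name, old, new, change))
    return rows


# Illustrative values only.
baseline = {"accuracy": 0.76, "loss": 0.52}
candidate = {"accuracy": 0.87, "loss": 0.31}

rows = metrics_diff(baseline, candidate)
for name, old, new, change in rows:
    print(f"{name:<10} {old!s:>6} {new!s:>6} {change!s:>8}")
```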
Continuous integration
• Automatically check data version
• Benchmark new model against previously deployed models
• Metrics diff & interactive plots in Pull Requests
• Re-train & refine in the cloud
1. Push data + code
2. Set up cloud runner from CI/CD (GPU, NAS, etc.)
3. Run train/test script
4. Push & report metrics tables/graphs in PR comments
5. Win!
SOURCE: WWW.DVC.COM
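The benchmarking and reporting steps can be pictured as a small gate that a CI job runs after training. This is a minimal sketch under assumptions: `build_report` is a hypothetical helper, the metric values are illustrative, and posting the Markdown to a pull request is left to whatever CI tooling is in use.

```python
def build_report(deployed: dict, candidate: dict, metric: str = "accuracy"):
    """Return (passed, markdown) comparing a candidate with the deployed model."""
    # The candidate passes if it is at least as good on the chosen metric.
    passed = candidate[metric] >= deployed[metric]
    lines = ["| metric | deployed | candidate |", "| --- | --- | --- |"]
    for name in sorted(deployed):
        lines.append(f"| {name} | {deployed[name]} | {candidate.get(name, 'n/a')} |")
    lines.append("")
    lines.append(f"Result: {'PASS' if passed else 'FAIL'} on {metric}")
    return passed, "\n".join(lines)


# Illustrative values only.
deployed_metrics = {"accuracy": 0.76}
candidate_metrics = {"accuracy": 0.87}

passed, report = build_report(deployed_metrics, candidate_metrics)
print(report)
```

A CI job could fail the build when `passed` is false, so a weaker model never reaches production unnoticed.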
Experiments batch execution
“I can’t believe the number of hours saved by
queuing and executing experiments in parallel.”
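Queuing and parallel execution can be sketched with the Python standard library. This is an illustration of the idea, not of DVC's experiment queue: `train_once`, the hyperparameter grid, and the scoring rule are hypothetical and chosen only so the output is deterministic.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_once(params: dict) -> dict:
    """Stand-in for a training run; returns the params with a toy score."""
    # Hypothetical scoring rule, used only to keep the example deterministic.
    score = 1.0 - abs(params["learning_rate"] - 0.01) - 0.001 * params["depth"]
    return {**params, "score": round(score, 4)}


# Queue one experiment per hyperparameter combination.
grid = {"learning_rate": [0.001, 0.01, 0.1], "depth": [2, 4]}
queue = [dict(zip(grid, values)) for values in product(*grid.values())]

# Execute the queue on a small worker pool; map() keeps the input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(train_once, queue))

best = max(results, key=lambda r: r["score"])
print(len(results), best)
```

Threads keep the sketch short; real training jobs are CPU- or GPU-bound and would more likely run in separate processes or on separate machines.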
UI does not have to be built from scratch
SOURCE: WWW.DVC.COM
● Show plots for selected experiments
● Compare results
● Run new experiments
● Generate trend charts
Takeaways
Adopting a development support tool across the entire ML workflow may be crucial for the success of a project.
Stop reinventing the wheel for common ML challenges.
Boost developers' productivity by enabling them to focus on coding.
Integrating the DVC tool favors quality attributes such as maintainability, scalability, and security.
Support the end-to-end experience, from EDA to production.
Takeaways
Reproducibility
With a couple of commands, replicate the environment state from other team members (without re-executing the whole pipeline or experiment).
Data sharing
Data and source code association out of the box, with a wide variety of remote storage options.
Experiments
Quickly run multiple experiments in parallel, with various ways of visualizing and comparing results.
Takeaways - tool vs. from scratch
We learned that in most cases, using an all-in-one framework like DVC reduces the work compared with manually handling reproducibility, experimentation, and data sharing tasks.
Resources
DVC documentation
https://dvc.org/doc
Platform to quickly get started with DVC
https://katacoda.com/dvc/courses/get-started
Norfair - Tryolabs object tracking open-source library
https://github.com/tryolabs/norfair
Reproducibility in machine learning
https://towardsdatascience.com/reproducible-machine-learning-cf1841606805
Thank you!

Más contenido relacionado

Similar a Data Versioning Towards Reproducibility in ML

Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionFlorian Wilhelm
 
Application Modernization to Accelerate Business Growth | JK Tech Webinar
Application Modernization to Accelerate Business Growth | JK Tech WebinarApplication Modernization to Accelerate Business Growth | JK Tech Webinar
Application Modernization to Accelerate Business Growth | JK Tech WebinarJK Tech
 
vodQA Pune (2019) - Insights into big data testing
vodQA Pune (2019) - Insights into big data testingvodQA Pune (2019) - Insights into big data testing
vodQA Pune (2019) - Insights into big data testingvodQA
 
Manoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9yearsManoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9yearsManoj Sharma
 
Manoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9yearsManoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9yearsManoj Sharma
 
Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...Wit Jakuczun
 
MT01 The business imperatives driving cloud adoption
MT01 The business imperatives driving cloud adoptionMT01 The business imperatives driving cloud adoption
MT01 The business imperatives driving cloud adoptionDell EMC World
 
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela PoklukarDataScienceConferenc1
 
Why we should consider Open Hybrid Cloud.pdf
Why we should  consider Open Hybrid Cloud.pdfWhy we should  consider Open Hybrid Cloud.pdf
Why we should consider Open Hybrid Cloud.pdfMasahiko Umeno
 
451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOps451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOpsDelphix
 
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...DataWorks Summit
 
INT Inc | Benefits of a Microservices Architecture
INT Inc | Benefits of a Microservices ArchitectureINT Inc | Benefits of a Microservices Architecture
INT Inc | Benefits of a Microservices ArchitectureThelma Gros
 
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...HostedbyConfluent
 
automotive_nx_design
automotive_nx_designautomotive_nx_design
automotive_nx_designAIMFirst
 
Belgium & Luxembourg dedicated online Data Virtualization discovery workshop
Belgium & Luxembourg dedicated online Data Virtualization discovery workshopBelgium & Luxembourg dedicated online Data Virtualization discovery workshop
Belgium & Luxembourg dedicated online Data Virtualization discovery workshopDenodo
 
MT74 - Is Your Tech Support Keeping Up with Your Instr Tech
MT74 - Is Your Tech Support Keeping Up with Your Instr TechMT74 - Is Your Tech Support Keeping Up with Your Instr Tech
MT74 - Is Your Tech Support Keeping Up with Your Instr TechDell EMC World
 
Production-Ready Kubernetes: It's Not About Technology
Production-Ready Kubernetes: It's Not About TechnologyProduction-Ready Kubernetes: It's Not About Technology
Production-Ready Kubernetes: It's Not About TechnologyAntoine Craske
 

Similar a Data Versioning Towards Reproducibility in ML (20)

Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
 
Application Modernization to Accelerate Business Growth | JK Tech Webinar
Application Modernization to Accelerate Business Growth | JK Tech WebinarApplication Modernization to Accelerate Business Growth | JK Tech Webinar
Application Modernization to Accelerate Business Growth | JK Tech Webinar
 
vodQA Pune (2019) - Insights into big data testing
vodQA Pune (2019) - Insights into big data testingvodQA Pune (2019) - Insights into big data testing
vodQA Pune (2019) - Insights into big data testing
 
Manoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9yearsManoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9years
 
Manoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9yearsManoj Sharma_Enovia_9years
Manoj Sharma_Enovia_9years
 
Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...Always Be Deploying. How to make R great for machine learning in (not only) E...
Always Be Deploying. How to make R great for machine learning in (not only) E...
 
MT01 The business imperatives driving cloud adoption
MT01 The business imperatives driving cloud adoptionMT01 The business imperatives driving cloud adoption
MT01 The business imperatives driving cloud adoption
 
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
 
Why we should consider Open Hybrid Cloud.pdf
Why we should  consider Open Hybrid Cloud.pdfWhy we should  consider Open Hybrid Cloud.pdf
Why we should consider Open Hybrid Cloud.pdf
 
451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOps451 Research: Data Is the Key to Friction in DevOps
451 Research: Data Is the Key to Friction in DevOps
 
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...Quick! Quick! Exploration!: A framework for searching a predictive model on A...
Quick! Quick! Exploration!: A framework for searching a predictive model on A...
 
INT Inc | Benefits of a Microservices Architecture
INT Inc | Benefits of a Microservices ArchitectureINT Inc | Benefits of a Microservices Architecture
INT Inc | Benefits of a Microservices Architecture
 
Ravindra Prasad
Ravindra PrasadRavindra Prasad
Ravindra Prasad
 
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
 
Developer want change Ops want control - devops
Developer want change Ops want control - devopsDeveloper want change Ops want control - devops
Developer want change Ops want control - devops
 
automotive_nx_design
automotive_nx_designautomotive_nx_design
automotive_nx_design
 
Belgium & Luxembourg dedicated online Data Virtualization discovery workshop
Belgium & Luxembourg dedicated online Data Virtualization discovery workshopBelgium & Luxembourg dedicated online Data Virtualization discovery workshop
Belgium & Luxembourg dedicated online Data Virtualization discovery workshop
 
SRE & Kubernetes
SRE & KubernetesSRE & Kubernetes
SRE & Kubernetes
 
MT74 - Is Your Tech Support Keeping Up with Your Instr Tech
MT74 - Is Your Tech Support Keeping Up with Your Instr TechMT74 - Is Your Tech Support Keeping Up with Your Instr Tech
MT74 - Is Your Tech Support Keeping Up with Your Instr Tech
 
Production-Ready Kubernetes: It's Not About Technology
Production-Ready Kubernetes: It's Not About TechnologyProduction-Ready Kubernetes: It's Not About Technology
Production-Ready Kubernetes: It's Not About Technology
 

Más de Edge AI and Vision Alliance

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...Edge AI and Vision Alliance
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...Edge AI and Vision Alliance
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...Edge AI and Vision Alliance
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...Edge AI and Vision Alliance
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...Edge AI and Vision Alliance
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...Edge AI and Vision Alliance
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...Edge AI and Vision Alliance
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsightsEdge AI and Vision Alliance
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...Edge AI and Vision Alliance
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...Edge AI and Vision Alliance
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...Edge AI and Vision Alliance
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...Edge AI and Vision Alliance
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...Edge AI and Vision Alliance
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...Edge AI and Vision Alliance
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...Edge AI and Vision Alliance
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from SamsaraEdge AI and Vision Alliance
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...Edge AI and Vision Alliance
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...Edge AI and Vision Alliance
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...Edge AI and Vision Alliance
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...Edge AI and Vision Alliance
 

Más de Edge AI and Vision Alliance (20)

“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
“Learning Compact DNN Models for Embedded Vision,” a Presentation from the Un...
 
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
 
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
“Selecting Tools for Developing, Monitoring and Maintaining ML Models,” a Pre...
 
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
“Building Accelerated GStreamer Applications for Video and Audio AI,” a Prese...
 
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
“Understanding, Selecting and Optimizing Object Detectors for Edge Applicatio...
 
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
“Introduction to Modern LiDAR for Machine Perception,” a Presentation from th...
 
“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...“Vision-language Representations for Robotics,” a Presentation from the Unive...
“Vision-language Representations for Robotics,” a Presentation from the Unive...
 
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
“ADAS and AV Sensors: What’s Winning and Why?,” a Presentation from TechInsights
 
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
“Computer Vision in Sports: Scalable Solutions for Downmarkets,” a Presentati...
 
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
“Detecting Data Drift in Image Classification Neural Networks,” a Presentatio...
 
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
“Deep Neural Network Training: Diagnosing Problems and Implementing Solutions...
 
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
“AI Start-ups: The Perils of Fishing for Whales (War Stories from the Entrepr...
 
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
“A Computer Vision System for Autonomous Satellite Maneuvering,” a Presentati...
 
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
“Bias in Computer Vision—It’s Bigger Than Facial Recognition!,” a Presentatio...
 
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
“Sensor Fusion Techniques for Accurate Perception of Objects in the Environme...
 
“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara“Updating the Edge ML Development Process,” a Presentation from Samsara
“Updating the Edge ML Development Process,” a Presentation from Samsara
 
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
“Combating Bias in Production Computer Vision Systems,” a Presentation from R...
 
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
“Developing an Embedded Vision AI-powered Fitness System,” a Presentation fro...
 
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
“Navigating the Evolving Venture Capital Landscape for Edge AI Start-ups,” a ...
 
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
“Advanced Presence Sensing: What It Means for the Smart Home,” a Presentation...
 

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Data Versioning Towards Reproducibility in ML

  • 1. © 2022 Tryolabs Data Versioning Towards Reproducibility in Machine Learning Nicolás Eiris Machine Learning Engineer Tryolabs
  • 2. Tryolabs • We build custom AI solutions • 70+ team members • 12+ years of experience • Served more than 150 clients
  • 3. Agenda: 1. Main pain points in ML workflows 2. Useful open source tool 3. Takeaways 4. References
  • 4. Dilemma in ML development: building everything manually from scratch vs. using a tool to support the development phase (from collecting data to deploying on the edge).
  • 5. Main pain points in ML workflows
  • 6. Standard ML workflow (diagram stages, in extracted order): data ingestion; exploratory data analysis; data cleaning; experimentation & evaluation; modeling; feature engineering; rolling out to production
  • 7. ML pipeline in practice. Files: model, features, data, data_v2, features copy, features_2, features_3, model_1, model_1_2, model_prefinal, model_data_v2, model_2_2, model_final, EDA, EDA_2, EDA_3 (EDA = exploratory data analysis). Steps: upload code; set up storage & upload data; set up cloud runner (GPU, NAS, etc.); "Wait, the data & code don’t match"; run train/test script; download data + code; sync data and code; "Oh no! The requirements have changed"; "Where do I report output results anyway?"; rage quit job
  • 8. Main pain points in ML workflows: 1. Reproducibility ● Teamwork ● Usually ad-hoc processes ● Productivity bottleneck ● Challenges ○ Changes in data ○ Hyperparams inconsistency ○ Randomness ○ Manual and ad-hoc execution of experiments
  • 9. Main pain points in ML workflows: 1. Reproducibility “Changes are uploaded, please run all the notebook again.”
  • 10. Main pain points in ML workflows: 2. Data sharing • Complex READMEs on how to gather data from remote storage • Security and data privacy risks • Manual versioning of dataset changes
  • 11. Main pain points in ML workflows: 2. Data sharing “I wish I could automate this process…” NO STORAGE
  • 12. Main pain points in ML workflows: 3. Experiments execution & tracking ● Experiments setup traceability challenges ● Inefficient results comparison & evaluation ● Manual process: ○ Spreadsheet ○ GitHub (metadata files) ○ Tracking tools (big learning curve)
  • 13. Ideal development experience • Structured pipeline composed of interdependent steps • Sharing experiments, models, and results in a simple way • Easily adding files or directories to a remote repository • Stop worrying about source code and data association
  • 14. Useful open source tool: DATA VERSION CONTROL
  • 15. DVC high-level overview. Commits, each shown with its own data, features, and model: 7fe5fc5 Update features; d512ef1 Update dataset and input parameters; 23811e0 Adjusting input parameters; e7eb61f Add the new dataset and features; 020c55f Adjusting input parameters
  • 16. DVC high-level overview (diagram): accuracy over time, 76% and 87%; Collaborate; Deploy to production; commit 8d7aa3d; Rollback; Cloud; Local Cache
  • 17. Main features ● Git-compatible ● Reproducible ● Low friction branching ● Storage agnostic ● ML pipeline framework ● Language & framework agnostic ● Track failures ● Experiments & metrics tracking
  • 18. Pipelines ● Pipelines composed of interdependent steps, each declaring: ○ Dependencies ○ Code to execute ○ Outputs ● Additional pipeline visualization command: dvc dag
  • 19. Metrics differences: smooth comparison process, with numeric and graphic visualization
  • 20. Continuous integration • Automatically check data version • Benchmark new model against previously deployed models • Metrics diff & interactive plots in Pull Requests • Re-train & refine in the cloud. Flow: push data + code; set up cloud runner from CI/CD (GPU, NAS, etc.); run train/test script; push & report metrics; tables/graphs in PR comments; win! Source: www.dvc.com
  • 21. Experiments batch execution “I can’t believe the number of hours saved by queuing and executing experiments in parallel.”
  • 22. UI does not have to be built from scratch ● Show plots for selected experiments ● Compare results ● Run new experiments ● Generate trend charts. Source: www.dvc.com
  • 24. Takeaways: Adopting a development support tool across the entire ML workflow may be crucial for the success of a project. Stop reinventing the wheel for common ML challenges. Boost developers' productivity by enabling them to focus on coding. Integrating the DVC tool favors quality attributes such as maintainability, scalability, and security. Support an end-to-end experience, from EDA to production.
  • 25. Takeaways. Reproducibility: with a couple of commands, replicate the environment state from other team members (without re-executing the whole pipeline or experiment). Data sharing: data and source code association out of the box, with a wide variety of remote storage options. Experiments: quickly run multiple experiments in parallel, with various ways of visualizing and comparing results.
  • 26. Takeaways - tool vs. from scratch: We learned that in most cases, using an all-in-one framework like DVC reduces the work compared with manually handling reproducibility, experimentation, and data sharing tasks.
  • 27. Resources: DVC documentation https://dvc.org/doc • Platform to quickly get started with DVC https://katacoda.com/dvc/courses/get-started • Norfair - Tryolabs object tracking open-source library https://github.com/tryolabs/norfair • Reproducibility in machine learning https://towardsdatascience.com/reproducible-machine-learning-cf1841606805
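
The pipeline idea from the slides (steps that declare dependencies, code to execute, and outputs, and that need not be re-executed when nothing changed) can be illustrated with a small sketch. This is not DVC's implementation, only a minimal Python illustration under stated assumptions: `Stage`, `reproduce`, `demo`, and the `pipeline.lock` file name are hypothetical, stages are assumed to be listed in dependency order, and SHA-256 is chosen here purely for illustration.

```python
import hashlib
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


def file_hash(path: Path) -> str:
    """Hash file contents, so a change in data is detected even if the name is unchanged."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


@dataclass
class Stage:
    """Hypothetical pipeline step: dependencies, code to execute, outputs."""
    name: str
    deps: List[Path]
    outs: List[Path]
    run: Callable[[], None]


def reproduce(stages: List[Stage], lock_file: Path) -> List[str]:
    """Run stages in the given order, skipping any stage whose dependency
    hashes match the recorded ones and whose outputs exist.
    Returns the names of the stages that were actually executed."""
    lock: Dict[str, Dict[str, str]] = (
        json.loads(lock_file.read_text()) if lock_file.exists() else {}
    )
    executed: List[str] = []
    for stage in stages:
        current = {str(p): file_hash(p) for p in stage.deps}
        outputs_present = all(p.exists() for p in stage.outs)
        if lock.get(stage.name) == current and outputs_present:
            continue  # nothing changed upstream: reuse existing outputs
        stage.run()
        lock[stage.name] = current
        executed.append(stage.name)
    lock_file.write_text(json.dumps(lock, indent=2))
    return executed


def demo() -> List[List[str]]:
    """Toy two-stage pipeline: data -> features -> model."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data.csv"
        features = root / "features.csv"
        model = root / "model.txt"
        data.write_text("1\n2\n3\n")

        def featurize() -> None:
            rows = [int(x) for x in data.read_text().split()]
            features.write_text("\n".join(str(x * x) for x in rows))

        def train() -> None:
            values = [int(x) for x in features.read_text().split()]
            model.write_text(str(sum(values) / len(values)))

        stages = [
            Stage("featurize", [data], [features], featurize),
            Stage("train", [features], [model], train),
        ]
        lock = root / "pipeline.lock"
        first = reproduce(stages, lock)    # everything runs
        second = reproduce(stages, lock)   # nothing changed, nothing runs
        data.write_text("1\n2\n3\n4\n")    # simulate a dataset update
        third = reproduce(stages, lock)    # change propagates downstream
        return [first, second, third]


if __name__ == "__main__":
    print(demo())
```

The design choice worth noting is that staleness is decided by content hashes rather than file names or timestamps, which is what removes the need for copies such as data_v2 or features_3.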