ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak

Building a machine learning model is an iterative process. A data scientist will build tens to hundreds of models before arriving at one that meets some acceptance criteria. However, the current style of model building is ad hoc, and there is no practical way for a data scientist to manage models that are built over time. In addition, there is no means to run complex queries on models and related data. In this talk, we present ModelDB, a novel end-to-end system for managing machine learning (ML) models. Using client libraries, ModelDB automatically tracks and versions ML models in their native environments (e.g. spark.ml, scikit-learn). A common set of abstractions enables ModelDB to capture models and pipelines built across different languages and environments. The structured representation of models and metadata then provides a platform for users to issue complex queries across various modeling artifacts. Our rich web frontend provides a way to query ModelDB at varying levels of granularity. ModelDB has been open-sourced at https://github.com/mitdbg/modeldb.

ModelDB: A system to manage machine learning models
Manasi Vartak
PhD Student, MIT DB Group

People
Manasi Vartak (PhD student, MIT)
Srinidhi Viswanathan (MEng, MIT)
Samuel Madden (Faculty, MIT)
Matei Zaharia (Faculty, Stanford)
Harihar Subramanyam (MEng, MIT)
Wei-En Lee (MEng student, MIT)

Building a default prediction algorithm

Profession          Credit History                     Risk of Default
Politician          Reasonable                         0.3
Struggling artist   Poor                               0.7
Investor            Has more money than our company    0.0
…                   …                                  …

Pictured, in row order: Barack Obama, Lindsay Lohan, Warren Buffett

Model 3
RandomForestClassifier
val udf1: (Int => Int) = (delayed..)
df.withColumn("timesDelayed", udf1)

Model 5
credit-default-clean.csv
RandomForestClassifier
df.withColumn("timesDelayed", udf1)
  .withColumn("percentPaid", udf2)
val lrGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(5, 10, 15))
  .addGrid(rf.numTrees, Array(50, 100))
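
A grid like this one multiplies the number of candidate models quickly, which is part of why they pile up untracked. As a rough illustration in plain Python (this is not Spark's `ParamGridBuilder`; the `expand_grid` helper and the dict layout are hypothetical), the two `addGrid` calls above expand to 3 × 2 = 6 configurations:

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# Values copied from the slide; the dict layout itself is an assumption.
rf_grid = {"maxDepth": [5, 10, 15], "numTrees": [50, 100]}
configs = expand_grid(rf_grid)

for config in configs:
    print(config)
print(len(configs), "configurations")
```

Each of these configurations yields a separate fitted model, and that is before any change to the feature columns or the input file.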

Model 50
credit-default-clean.csv
df.withColumn("timesDelayed", udf1)
  .withColumn("percentPaid", udf2)
  .withColumn("creditUsed", udf3)
…
val lrGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.01, 0.1, 0.5, 0.7))
val scaler = new StandardScaler()
  .setInputCol("features")
…
val labelIndexer1 = new LabelIndexer()
val labelIndexer2 = new LabelIndexer()
…
val udf1: (Int => Int) = (delayed..)
val udf2: (String, Int) = …

I’m willing to bet…
No one in here tracks (all of) their models
…and this is not unusual

Why is this a problem?
• No record of experiments
• Insights lost along the way
• Difficult to reproduce results
• Cannot search for or query models
• Difficult to collaborate
Did my colleague do that already?
How did normalization affect my ROC?
How does someone review your model?
Where’s the LR model I tried last week with featureX?
What params did I use?
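
Questions like these become simple lookups once each run is stored as a structured record. A minimal sketch, assuming a hypothetical list of run records (the field names, dates, and parameter values are illustrative only, not ModelDB's schema or results from the talk):

```python
from datetime import date, timedelta

# Arbitrary reference date and made-up run records, for illustration only.
today = date(2017, 1, 15)
runs = [
    {"id": 1, "model": "LogisticRegression", "features": ["timesDelayed", "featureX"],
     "params": {"elasticNetParam": 0.1}, "trained_on": today - timedelta(days=6)},
    {"id": 2, "model": "RandomForestClassifier", "features": ["timesDelayed"],
     "params": {"maxDepth": 10, "numTrees": 50}, "trained_on": today - timedelta(days=2)},
    {"id": 3, "model": "LogisticRegression", "features": ["percentPaid"],
     "params": {"elasticNetParam": 0.5}, "trained_on": today - timedelta(days=30)},
]


def find_runs(runs, model, feature, since):
    """Return runs of the given model type that used `feature` on or after `since`."""
    return [r for r in runs
            if r["model"] == model and feature in r["features"] and r["trained_on"] >= since]


# "Where's the LR model I tried last week with featureX?"
matches = find_runs(runs, "LogisticRegression", "featureX", today - timedelta(days=7))
for run in matches:
    # "What params did I use?"
    print(run["id"], run["params"])
```

Without such records, the same question has to be answered from memory or by digging through old notebooks.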

Model Management
track, store, and index modeling artifacts so that they may subsequently be reproduced, shared, queried, and analyzed

ModelDB: a system to manage machine learning models
http://modeldb.csail.mit.edu

ModelDB: an end-to-end model management system
[Diagram. Components: Ingest models, metadata; Model artifact Storage & Versioning; Query; Collaboration, Reproducibility. Stage labels: track; store & index; query, reproduce++]

ModelDB Architecture & Design Decisions
1. Support for diverse languages and environments
2. Minimal changes to existing workflows
3. Rich visual interface
4. Support for complex queries
[Architecture diagram labels: spark.ml, scikit-learn, Native Client, Events, Scala, Python, …, thrift, ModelDB Backend, Storage, ModelDB Frontend: vis + query]
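
One way to read "minimal changes to existing workflows" is that a native client wraps the usual `fit` call and emits an event describing it. The sketch below is only an illustration of that idea under stated assumptions: `EventLog`, `MeanPredictor`, and `fit_tracked` are hypothetical names, not ModelDB's client API, and the event is kept in memory as JSON rather than sent over thrift.

```python
import json
import time


class EventLog:
    """Stand-in for the backend: collects events as JSON strings."""

    def __init__(self):
        self.events = []

    def send(self, event):
        # A real client would serialize and ship this to a backend service.
        self.events.append(json.dumps(event, sort_keys=True))


class MeanPredictor:
    """Toy model so the sketch runs without any ML library."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink
        self.mean_ = None

    def fit(self, ys):
        self.mean_ = self.shrink * sum(ys) / len(ys)
        return self


def fit_tracked(model, data, log):
    """Call model.fit(data) and emit one 'fit' event describing the run."""
    start = time.time()
    fitted = model.fit(data)
    log.send({
        "event": "fit",
        "model_type": type(model).__name__,
        # Constructor-style attributes only; fitted state ends with "_".
        "params": {k: v for k, v in vars(model).items() if not k.endswith("_")},
        "num_rows": len(data),
        "seconds": round(time.time() - start, 3),
    })
    return fitted


log = EventLog()
model = fit_tracked(MeanPredictor(shrink=0.5), [1.0, 2.0, 3.0], log)
print(model.mean_)
print(log.events[0])
```

The only change to the modeling code is the call site: `fit_tracked(model, data, log)` instead of `model.fit(data)`.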

ModelDB Features
• Experiment tracking
• Versioning
• Reproducibility
• Comparisons, queries, search
• Collaboration
Log models, params, pipelines etc. via ModelDB API
Model search, query, comparison via frontend
Central repository of models
Review models, annotate
All pipeline details, params logged
Every modeling run = version
Ongoing Work
• Unified querying of modeling artifacts
• Mining data in ModelDB
• Model monitoring and retraining

ModelDB available now!
http://modeldb.csail.mit.edu
*MIT License

ModelDB available now!
• Download, try it out!
• Tell us what you think; what can we do better?
• Contribute! (see Issues on repo for some ideas)

ModelDB: a system to manage machine learning models
mvartak@csail.mit.edu | @DataCereal
http://modeldb.csail.mit.edu

4. Accuracy: 62% Model 1

13. Demo

14. ModelDB w/ scikit-learn
