HORIZONTALLY SCALABLE ML PIPELINES WITH A FEATURE STORE
Alexandru A. Ormenișan 1, Mahmoud Ismail 1, Kim Hammar 2, Robin Andersson 2, Ermias Gebremeskel 2, Theofilos Kakantousis 2, Antonios Kouzoupis 2, Fabio Buso 2, Gautier Berthou 2,3, Jim Dowling 1,2, Seif Haridi 1,3
ABSTRACT
Machine Learning (ML) pipelines are the fundamental building block for productionizing ML models. However,
much introductory material for machine learning and deep learning emphasizes ad-hoc feature engineering and
training pipelines to experiment with ML models. Such pipelines have a tendency to become complex over time
and do not allow features to be easily re-used across different pipelines. Duplicating features can even lead to
correctness problems when features have different implementations for training and serving. In this demo, we
introduce the Feature Store as a new data layer in horizontally scalable machine learning pipelines.
1 INTRODUCTION
In this demonstration, we introduce Hopsworks, an open-source data platform with support for machine learning (ML) pipelines, where every stage in the pipeline can be scaled out to potentially hundreds of servers, enabling faster training of larger models on more data. An end-to-end machine learning pipeline covers all stages in the production and operation of ML models: data ingestion and preparation, experimentation, training, deployment, and operation. Our demonstration has the added twist of introducing a new data management layer, the Feature Store. We will show how the Feature Store simplifies such pipelines, providing an API for data engineers to produce features and an API with which data scientists can easily select features when designing new models. Our Feature Store is built on open-source frameworks (see Figure 3), using Apache Spark to compute features, Apache Hive to store feature data, and MySQL Cluster (NDB) for feature metadata. It also supports generating training data, from selected features, in native file formats for TensorFlow, Keras, and PyTorch, and it is integrated in a multi-tenant security model that enables the governance and reuse of features as first-class entities within an organization. We also provide Python APIs both for the Feature Store and for building pipelines in our framework. These APIs ease the programmer's burden when discovering and using services such as Kafka, the Feature Store, our REST API, and model serving, as well as when performing distributed model training and hyperparameter optimization.
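The two-sided API described above can be sketched in outline as follows. This is a minimal in-memory stand-in for illustration only; the `FeatureStore` class, its method names, and the feature-group names are hypothetical, not the actual Hopsworks API.

```python
# Minimal sketch of the two Feature Store APIs described above:
# data engineers publish feature groups; data scientists select features
# across groups to assemble training data. All names are hypothetical.

class FeatureStore:
    def __init__(self):
        self._feature_groups = {}  # group name -> list of row dicts

    # Data-engineer API: publish a computed feature group.
    def create_feature_group(self, name, rows):
        self._feature_groups[name] = rows

    # Data-scientist API: select named features, joined on a key,
    # to produce rows of training data.
    def get_features(self, feature_names, join_key):
        joined = {}
        for rows in self._feature_groups.values():
            for row in rows:
                joined.setdefault(row[join_key], {}).update(row)
        return [{f: r[f] for f in feature_names} for r in joined.values()]

store = FeatureStore()
store.create_feature_group(
    "customer_profile",
    [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 51}])
store.create_feature_group(
    "transactions_agg",
    [{"customer_id": 1, "avg_spend": 120.0}, {"customer_id": 2, "avg_spend": 45.5}])

# Select features from two different groups without duplicating their pipelines.
train = store.get_features(["age", "avg_spend"], join_key="customer_id")
```

The key design point the sketch illustrates is that feature computation (the engineer's side) is decoupled from feature selection (the scientist's side), so a feature is implemented once and reused across models.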
1 KTH Royal Institute of Technology, Sweden. 2 Logical Clocks AB, Sweden. 3 RISE AB, Sweden. Correspondence to: Alexandru A. Ormenișan <aaor@kth.se>.
Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA, 2019. Copyright 2019 by the author(s).
Figure 1. End-to-End ML Pipeline, orchestrated by Airflow
2 END-TO-END ML PIPELINES
In ML pipelines on Hopsworks (see Figure 1), horizontal scalability is provided by Apache Spark for feature engineering, Apache Hive for storing feature data, HopsFS (Niazi et al., 2017) (a next-generation HDFS) as the distributed filesystem for training data, Ring-AllReduce on TensorFlow for distributed training, and Kubernetes for model serving. We also support Kafka for storing logs from ML models served on Kubernetes, and Spark Streaming applications for processing those logs in near real time. Distributed training and hyperparameter optimization with TensorFlow and PyTorch use PySpark to distribute computation, replicated conda environments to ensure Python libraries are available on all hosts, and YARN to allocate GPUs, CPUs, and memory to applications. Our pipelines are driven and managed by a scalable, consistent, distributed shared-memory service built on MySQL Cluster.
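The parallel hyperparameter-search pattern mentioned above can be sketched as follows. On Hopsworks the trials are distributed over PySpark executors with GPUs allocated by YARN; here a local thread pool stands in for the cluster, and the trial function and its loss formula are purely illustrative.

```python
# Sketch of the parallel hyperparameter-search pattern described above.
# A thread pool stands in for PySpark executors; run_trial stands in for
# training a model and returning its validation loss.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_trial(params):
    lr, batch_size = params
    # Illustrative stand-in for a real training run's validation loss.
    loss = (lr - 0.01) ** 2 + 0.001 * batch_size
    return params, loss

# The search space is mapped over the pool, one trial per worker.
grid = list(product([0.001, 0.01, 0.1], [32, 64]))
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_trial, grid))

best_params, best_loss = min(results, key=lambda r: r[1])
```

The same map-over-a-search-space structure is what lets each stage scale out: adding workers (or executors) shortens the search without changing the trial code.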
2.1 Demonstration
In our live demonstration, we will use a managed version of Hopsworks running in a web browser at www.hops.site. We will start with an Airflow job orchestrating an entire ML pipeline that consists of a series of jobs, all implemented in Jupyter notebooks and invoked from Airflow by an operator that calls the Hopsworks REST API (see Figure 2).

Figure 2. Hopsworks Data and ML Platform Architecture

Figure 3. Feature Store Architecture

We will then go through the Jupyter notebooks individually. We will start by discovering and downloading an ML dataset to work with; Hopsworks supports peer-to-peer sharing of large datasets. We will then write and run a Spark job to engineer features from the dataset. We also support Beam/Flink jobs, so we can show an additional data validation job using TensorFlow Extended (TFX) (Baylor et al., 2017). After the feature engineering stage, the features will be published to our Feature Store. In the next stage, a data scientist writes a notebook to select the features that will be used to train a model. This job outputs training data in a chosen file format (such as .tfrecords, .csv, .parquet, .npy, or .hdf5) to HopsFS. The next job in the pipeline is the experimentation or training step. Here, the data scientist designs (or evolves) a model architecture and discovers good hyperparameters by running many parallel experiments on GPUs. Similar to MLflow (Zaharia et al., 2018), Hopsworks manages and archives ML experiments for easy reproduction. The next stage is to train a model with the chosen hyperparameters using many GPUs and Ring-AllReduce for distributed training. The output of the training step is a model, saved to HopsFS. The model can then be analyzed and validated (for example, using TFX model analysis) and deployed for serving with TensorFlow Serving (with many options available, such as A/B testing and the number of replicated instances behind a load balancer). Finally, we will show client applications using the deployed model, and a streaming job will be run to monitor model predictions, notify of any anomalies, and collect more training data (by storing predictions that are later linked with outcomes).
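The training-data export step in the walkthrough can be sketched as follows. A plain .csv writer stands in here for the Feature Store's export to the framework's native format (Hopsworks also targets .tfrecords, .parquet, .npy, and .hdf5); the file name and feature columns are illustrative, not from the demo itself.

```python
# Sketch of the training-data export step described above: features selected
# from the Feature Store are materialized to a file in a chosen format,
# which the training job then reads. Column and file names are illustrative.
import csv

selected_features = [
    {"age": 34, "avg_spend": 120.0, "label": 1},
    {"age": 51, "avg_spend": 45.5, "label": 0},
]

def write_training_data(rows, path):
    # One header row of feature names, then one row per training example.
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

write_training_data(selected_features, "training_data.csv")
```

Writing the materialized file to a distributed filesystem (HopsFS in the demo) is what lets the subsequent distributed training job read the same training data from every worker.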
2.2 Teaching Platform and Interactive Demo
Our research organization runs a managed version of Hopsworks with over 1000 users. This managed platform (see Figure 2) is also used for lab and project work in Deep Learning and Big Data courses at a large technical university. Hopsworks has a user interface and strong security support, built around TLS certificates for applications, users, and services. As with JupyterHub, students can write interactive programs in Jupyter notebooks. In contrast to JupyterHub, users can run jobs that access up to 100s of TBs of data on 1000s of cores, and use 10s of GPUs for model experimentation or training. Hopsworks also includes a project-based multi-tenancy environment, so students can work in groups on sandboxed data. Our managed platform provides many popular machine learning datasets, including ImageNet, Wikipedia, YouTube-8M, and Kinetics.
In the interactive demo, participants will be invited to use their laptops and an Internet connection to create accounts on our managed platform. With an account, a user can create a project, link in a large machine learning dataset, and run pre-canned notebooks or end-to-end machine learning pipelines. Participants will be given enough quota to use 100s of CPUs, and several GPUs, if they so choose.
2.3 Summary
We will demonstrate a data and ML platform, Hopsworks, that supports the design, debugging, and running of deep learning pipelines at scale. All stages of the pipeline can be scaled out horizontally, as the platform manages the whole stack, from resource management (YARN with GPU support) to API support for running distributed training and hyperparameter optimization experiments.
ACKNOWLEDGMENT
This work was funded by the Swedish Foundation for Strategic Research projects “Smart Intra-body network, RIT15-0119” and “Continuous Deep Analytics, BD15-0006”, and by the EU Horizon 2020 project AEGIS under Grant Agreement no. 732189.
REFERENCES
Baylor, D., Breck, E., Cheng, H.-T., Fiedel, N., Foo, C. Y., Haque, Z., Haykal, S., Ispir, M., Jain, V., Koc, L., et al. TFX: A TensorFlow-based production-scale machine learning platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387–1395. ACM, 2017.

Niazi, S., Ismail, M., Haridi, S., Dowling, J., Grohsschmiedt, S., and Ronström, M. HopsFS: Scaling hierarchical file system metadata using NewSQL databases. In FAST, pp. 89–104, 2017.

Zaharia, M., Chen, A., Davidson, A., Ghodsi, A., Hong, S. A., Konwinski, A., Murching, S., Nykodym, T., Ogilvie, P., Parkhe, M., et al. Accelerating the machine learning lifecycle with MLflow. Data Engineering, pp. 39, 2018.