Audrey Reznik, Data Scientist at ExxonMobil, and John Archer, Red Hat Solution Architect, present how to use OpenShift to create value for data science teams, improve their agility, and improve collaboration in larger organizations.
Delivering Agile Data Science on Openshift - Red Hat Summit 2019
1. Delivering Agile Data Science on Openshift
Audrey Reznik
Data Scientist
May 9th, 2019
John Archer
Principal Energy Solution Architect
How to create Instant Business Value
2. Red Hat Summit May 2019 – Delivering agile data science solutions with OpenShift
MEET THE SPEAKERS
John Archer
Principal Energy Solution Architect
Red Hat since 2015
BEA Systems, BSI Consulting, DocuQuest, Andrews & Kurth, SilverStream, Petris and Oracle
Upstream Data Management, DoD, APIs, eCommerce, IoT, data science and blockchain
SPE, SEG, PPDM, HJUG, HDUG, HAL-PC, Energistics
Audrey Reznik
Data Scientist
Upstream Research Center
ExxonMobil since 2007
Chevron, Akamai, Entriq, Digital Medical Registrar, Spider Technologies, Ziff Davis
3. DATA SCIENCE TEAM PRESSURES
EXPLOSIVE GROWTH in data analytics teams and analytic tools
MULTIPLE TEAMS COMPETING for use of the same storage and computing resources
CONGESTION in busy analytic clusters causing frustration and missed SLAs
EMERGING DATAOPS: Data Scientist Developers vs Full Stack Developer agility and enablement gaps
5. NEED: SHARE CODE (PRODUCT) WITH USERS
Jupyter Notebooks as a technology we could use to combine Python code, a GUI, and documentation for sharing with customers.
Start of an Interactive Data Science environment.
Red Hat OpenShift PoC at ExxonMobil. Could this new technology benefit us in creating a Reproducible & Interactive Data Science environment?
Prize: This would enable the team not only to obtain customer feedback quickly, but also to use Agile Methodology easily and therefore deliver MVPs quickly.
Drawback: how does one avoid the setup/configuration issues and reliably deploy the notebook?
PC Setup
• pip install required Anaconda libraries
• Jupyter Notebook Python 3.x (load onto PC or set up server)
• Local admin access
• Access to latest source code
• OS? SQL Server
6. LOCAL PC VS OPENSHIFT PROJECT CONTAINERS
Container v2.0
Jupyter Notebook Python 3.x (image)
Libraries
• NumPy
• Pandas
• Matplotlib
• IPyWidgets
• SciPy
• Lmfit
• Seaborn
• Plotly
SQLite
GIT: Image project, Code project
OpenShift: URL to PoC Code
Local PC Setup
• pip install required Anaconda libraries
• Jupyter Notebook Python 3.x (load onto PC or set up server)
• Local admin access
• Access to latest source code
• OS? SQL Server
Reproducible Data Science environment that users interact with via Chrome.
Hardware freedom & easier reproduction!
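A minimal sketch of how the library list on this slide could be checked, either inside a container image or on a local PC. The module names are assumed to be the usual import names for the listed libraries, and `missing_modules` is a hypothetical helper for illustration, not part of the setup described in the talk.

```python
from importlib import util

# Assumed import names for the libraries listed on the slide.
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "ipywidgets",
    "scipy", "lmfit", "seaborn", "plotly", "sqlite3",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be found in this environment."""
    return [name for name in modules if util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing from this environment:", ", ".join(missing))
    else:
        print("All notebook libraries are available.")
```

Running the same check in the image and on a PC gives a quick view of where the two environments differ.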
7. REPRODUCIBLE & INTERACTIVE SCIENTIFIC ENVIRONMENT
For a Data Scientist, the ability to rapidly deploy code and quickly obtain feedback from a user is extremely valuable and Agile! OpenShift facilitates these capabilities!
Agile
1. Understand the Problem
2. Suggest Solutions, Deliver POC
3. Refine the Problem
How to Deploy?
GIT: Image project, Code project
OpenShift: URL to PoC Code
Nexus
Image
Python (PyPI)
Security
“Interactive” feedback! As a user I want to provide frequent feedback!
8. DEPLOY SOURCE CODE WITH SOURCE TO IMAGE (S2I)
• Reusable Data Science Applications: data location
• To reusable Data Science Images: can they be re-consumed or modified for particular use cases?
• E.g. we have a base Python image that has been modified to provide TensorFlow and scikit-learn for Data Science projects.
• Reusable data access containers: SQL Server, Oracle, PI, SAP HANA.
Developer: code to Git Repository
BUILD APP (OpenShift): Source-to-Image (S2I), Builder Image
BUILD IMAGE (OpenShift): Image Registry
DEPLOY (OpenShift): deploy Application Container
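A minimal sketch of how the S2I flow above can be triggered from a script, assuming the `oc` CLI and its `builder-image~source-repository` form of `oc new-app`. The builder tag, repository URL, and application name below are hypothetical placeholders, not values from the talk.

```python
import shlex


def s2i_new_app_command(builder_image, git_url, app_name):
    """Build an `oc new-app` argument list using the builder~source form."""
    return ["oc", "new-app", f"{builder_image}~{git_url}", f"--name={app_name}"]


cmd = s2i_new_app_command(
    "python:3.6",                                     # hypothetical builder image tag
    "https://git.example.com/team/notebook-app.git",  # hypothetical Git repository
    "notebook-app",                                   # hypothetical application name
)

# Print a shell-safe version of the command for review before running it.
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` once the values have been reviewed.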
9. MATURING THE CI/CD PIPELINE
Seeing an emerging notion of Data ScienceOps workflows. Current OS production CI/CD in progress.
Challenges we are experiencing include:
1. OnPrem databases in different countries
2. Development/Deployment in Jupyter notebooks
GIT
Jenkins: build, test, package
Jenkins: archive artifacts in Nexus
Nexus
OS build image, deploy to TEST
OS build image, deploy to PROD
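A rough sketch of the gating idea behind the pipeline above: stages run in order and the run stops at the first failure, so nothing reaches TEST or PROD unless the earlier stages passed. The stage names follow the slide, but the stage functions are placeholders and `run_pipeline` is a hypothetical helper, not the Jenkins configuration used in the talk.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stop at the first failure, return completed names."""
    completed = []
    for name, step in stages:
        if not step():
            break
        completed.append(name)
    return completed


# Placeholder stages: each lambda stands in for a real build or deploy step.
STAGES: List[Stage] = [
    ("build", lambda: True),
    ("test", lambda: True),
    ("package", lambda: True),
    ("archive artifacts in Nexus", lambda: True),
    ("build image and deploy to TEST", lambda: True),
    ("build image and deploy to PROD", lambda: True),
]

if __name__ == "__main__":
    print(run_pipeline(STAGES))
```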
10. MACHINE LEARNING ON OPENSHIFT
Figure 1. liquid estimates. Marco De Mattia
Unique performance computing requirements for Artificial Intelligence, Machine Learning, Neural Networks and GPUs
Multiple Data Science images:
• TensorFlow
• PyTorch
• Scikit-learn
Testing GPU (NVIDIA V100) cluster (OCP). Additional service to internal HPC.
Next Steps: examine RAPIDS.AI to execute end-to-end data science pipelines in the GPU…
11. OPENSHIFT GPU PROOF OF CONCEPT (POC)
GPU POC: read & analyze petro-physical data. Use ML Algorithms to generate analysis/models on GPU cluster.
Vetted models can be pushed to Azure for deployment.
Diagram components: GPUDB, Data Scientist, URL to ML App, User, ML Algorithms (GIT Repo), L4 Network, onPrem Database(s), Containers
Figure 2. GPU POC workflow, Audrey Reznik
12. READY FOR ANY CLOUD – PRIVATE AND PUBLIC
DATA GRAVITY DRIVES THE LOCATION
• OpenShift for on premise and Public Cloud (Azure) for Container as a Service (CaaS)
1. CaaS Security enabled through AD groups created onPremise and DevOps practices
2. Self-service for accessing Data Science packages with network, routing and DNS services
3. Storage can be self-service with PVC or extended with Ceph and OCP Storage options
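A minimal sketch of what self-service storage with a PVC can look like, assuming a standard Kubernetes PersistentVolumeClaim manifest. The claim name and size are hypothetical, and `pvc_manifest` is an illustrative helper rather than anything from the talk.

```python
import json


def pvc_manifest(name, size, access_mode="ReadWriteOnce"):
    """Return a PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical claim: 10Gi of storage for notebook data.
manifest = pvc_manifest("notebook-data", "10Gi")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted to the cluster by a team member, without a separate storage request.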
Where does your application live? How do you access it? Is my application secure?
Resulting in enabled Data Science teams that:
• Perform more experiments
• Spend less time on plumbing
• Focus on delivering value to ExxonMobil
13. EXXONMOBIL DATA SCIENCE OS TIMELINE
• Dec 2017: Started with Data Virtualization for Calgary Optimization
• Feb 2018: Containerized JBoss Data Virtualization on OpenShift on premise
• Mar 2018: Spoke with Data Science teams (Python, MATLAB, Julia and R users)
• Apr 2018: Introduced Graham Dumpleton’s JupyterHub container image
• Jul 2018: Built “Base” Data Science image (Python 3.x, AI libraries)
• Nov 2018: Built test OCP 3.10 cluster for NVIDIA V100 testing for TensorFlow and Keras
• Dec 2018: Delivered Data Science Workshop on OpenShift to eight different data science teams
• Feb 2019: Data Science developers deliver faster and collaborate globally within 2 months
• Mar 2019: Successfully deployed ODH supporting multiple notebook kernels and GPU
14. MOVING FORWARD: EXXONMOBIL DATA SCIENCE CAPABILITY TODAY
As a Data Scientist, all I care about is that, using OpenShift, I can now deploy a common Jupyter Notebook / Anaconda image (with all required libraries) in a matter of seconds.
This frees me (and other Data Scientists) to perform data science and not worry about architecture and delivery mechanisms. Now that is Democratizing Data Science!
Selected OpenShift on premises and public cloud for Container as a Service (CaaS)
• OpenShift supports:
• One Click Notebooks and JupyterHub/Lab templates
• Self-service for accessing data & data science packages
• Nexus Repository to allow for Python, Java, R, PHP, .Net package managers
• Docker public repository security built-in process: protects against rooted containers and new CVE attacks
• NVIDIA GPU support allows for sharing these resources across multiple teams
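A minimal sketch of how a notebook image could point pip at an internal Nexus repository, assuming pip's standard `[global]` / `index-url` configuration keys. The Nexus URL is a hypothetical placeholder and `pip_conf` is an illustrative helper, not the configuration used at ExxonMobil.

```python
import configparser
import io


def pip_conf(index_url):
    """Render pip configuration text that sets a custom package index."""
    config = configparser.ConfigParser()
    config["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical internal repository URL.
text = pip_conf("https://nexus.example.com/repository/pypi/simple")
print(text)
```

Baking such a file into the base image lets every notebook install packages through the same curated repository.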
Jupyter Notebook & select conda libraries image being used for Kearl Mining Optimization Studies
15. DATA SCIENTIST DEVELOPERS NEEDS
All Developers need
● Choice of architectures
● Choice of programming languages
● Choice of databases and persistence
● Choice of application services
● Choice of development tools
● Choice of build and deploy workflows
Data Science Additional Needs
● Access to GPUs and varied storage
● Access to Curated Data
● Automated ScienceOps pipelines
● Collaboration with the Business
● Access to specific data science languages and toolsets
They don’t want to have to worry about the infrastructure.
16. YOUR DIFFERENTIATION DEPENDS ON YOUR ABILITY TO DELIVER INTELLIGENT APPS FASTER
CONTAINERS, KUBERNETES, DEVOPS & DATAOPS ARE KEY INGREDIENTS
Innovation Culture
Cloud-native Applications
AI & Machine Learning
Internet of Things
Virtual GPU
18. OPENDATAHUB.IO ARCHITECTURE
CONTAINER STORAGE (CEPH)
CONTAINER HOST (RHEL/RHCOS)
Microsoft Azure, AWS, OpenStack, Datacenter, Laptop, Google Cloud
CONTAINER ORCHESTRATION AND MANAGEMENT (OPENSHIFT)
S3 API Object Store, BLOCK, FILE
GPU, FPGA
APPLICATION LIFE CYCLE MANAGEMENT (OPENSHIFT)
DEVOPS WORKFLOW (CODE & DATA)
API GATEWAY (3SCALE), SERVICE MESH (ISTIO)
SERVERLESS
PRIVATE MICRO SERVICES (CONTAINERIZED CUSTOM APPS)
CONTAINER APPS
PRE-DEFINED AI LIBRARY (BOTS | ANOMALY | CLASSIFICATION | SENTIMENT | …)
AI TOOLCHAIN & WORKFLOW (JUPYTER, SUPERSET, …)
COMMON SERVICES
• SERVICE CATALOG & SELF SERVICE UI/CLI
• IDENTITY / POLICY (ACCESS, PLACEMENT) / LINEAGE (CODE AND DATA)
• MANAGEMENT CONSOLE / INSIGHTS / AIOPS (PROMETHEUS | ELASTIC | …)
FEDERATION
LEGEND: RH Core Platform, OpenShift ALM, Red Hat Middleware, Community & ISV Ecosystem, Technology Roadmap, Customer Content
PYTHON / FLASK, JAVA, JAVASCRIPT, ...
STREAMING (KAFKA - streamzi), MSG BUS (AMQ), ANALYTICS (SPARK), ML (TENSORFLOW | …), MEMORY CACHE (JDG), DECISION (BxMS)
HDFS | REDIS | SQL | NoSQL | GRAPHDB | TIMESERIES | ELASTIC | ...
19. MODERN DATA ANALYTICS PIPELINE
Pipeline stages: DATA GENERATION; INGEST; DATA SCIENCE, MACHINE LEARNING; STREAM PROCESSING; TRANSFORM, MERGE, JOIN; DATA ANALYTICS
• IoT Telemetry
• G&G - Well Logs
• Transactions
• Production
• NiFi
• Kafka
• MQTT
• Presto
• Impala
• SparkSQL
• Notebooks
• TensorFlow
• PyTorch
• Keras
• scikit-learn
• AutoML*
• Kafka
• MQTT
• WebSockets
• Hadoop
• Spark
• Pandas
• Apache Arrow
• Spark
• Hadoop
20. CONNECTING THE EDGE TO DATA SCIENTISTS
Architecture Tenets
• Highly scalable, flexible, elastic, microservice-based architecture
• Fully portable: on premise to any public cloud vendor
• Leverages the power and agility of open source software without lock-in
• Multitenant: CPU and GPU powered workloads
Data Scientist, Data Managers, Citizen Data Scientist
Cognitive AI: Vision, Speech, Face, Audio, Video, Text
Data: Models, Curation, Prep, Quality, Publishing
Security
AI/ML/Data Science Pods
Packages: Python, R, Jupyter.org, TensorFlow, Keras, Pandas, Bokeh, Dash, Prometheus, Grafana, SciPy, NumPy, SymPy, Julia, Spark, PySpark, Theano, Scikit, FaceDetect
AppDev & App Services and Persistence Pods
Persistence: MongoDB, MariaDB, MySQL, Postgres, Couchbase, Redis, MS-SQL, Oracle
App Dev: Node.js, .Net Core, Java, Python, PHP, Ruby, Rails, JavaScript, Perl
App Services: Integration, BPM, Rules, Messaging, API, IoT, Microservices, Istio
SSO and Authentication: OIDC, SAML, OAuth, JWT, Kerberos
DevOps
REST, ODBC, JDBC, WS
Use Cases: Predictive Maintenance, Autonomous Operations, Supply Chain Improvements, Downstream Reliability
IoT “Things”: REST, MQTT, WSS, Kafka
OnPremise, Public Cloud
21. OSS DATA SCIENCE PROJECTS
● JupyterHub on OpenShift
○ Jupyter notebook, JupyterHub, JupyterLab, OpenShift Templates
● Kubeflow
○ Kube project for TensorFlow, Seldon, JupyterHub/Lab, PyTorch, MPI Operator
● Opendatahub.io
○ Ceph, Spark, JupyterHub/Lab, TensorFlow
○ Simplified Multiple Kernels support
○ GPU Support
○ Resource management and instance culling
● radanalytics.io
○ OpenShift Spark
○ Oshinko - Apache Spark Cluster
○ Spark Operator
22. HOW CAN I GET STARTED?
● Join OpenShift Commons - ML SIG - https://commons.openshift.org/
● OpenShift Self Service Education - https://learn.openshift.com
● Install Minishift - https://docs.okd.io/latest/minishift/getting-started/installing.html
○ MacOS - brew cask install minishift
○ Manual - https://github.com/minishift/minishift/releases
● Install Jupyter and JupyterHub OpenShift templates
○ https://github.com/jupyter-on-openshift/jupyterhub-quickstart
● Review the OpenDataHub.io project
23. Delivering Agile Data Science solutions with OpenShift … and providing Business Value!