Building analytical micro services
powered by Jupyter Kernels
About me - Luciano Resende
Data Science Platform Architect – IBM – CODAIT (formerly Spark Technology Center)
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache Spark,
Apache Toree, among other projects related to the Apache Spark ecosystem
lresende@apache.org
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende
About me – Kevin Bates
Sr. Software Engineer – IBM – CODAIT (formerly Spark Technology Center)
• Over 30 years developing enterprise-level software
• Working in the Jupyter ecosystem for the last 14 months
kbates4@gmail.com
https://www.linkedin.com/in/kevinbatessoftware
@kbates4
https://github.com/kevin-bates
Jupyter Notebooks
Notebooks are interactive computational
environments, in which you can
combine code execution, rich text,
mathematics, plots and rich media.
Jupyter Notebook Platform Architecture
• The Notebook UI runs in the browser
• The Notebook Server serves the notebooks
• Kernels interpret/execute cell contents
§ Are responsible for code execution
§ Abstract away the differences between languages
Jupyter Notebooks Architecture
[Diagram: a Notebook Server Process (JavaScript, Notebook Management, Kernel Management, Kernel Proxy) and a Python Process (IPython Kernel, User Code: sklearn, Spark, TensorFlow, …), linked by the Shell, IOPub, stdin, control, and heartbeat channels.]
Jupyter Connection Profile
{
"stdin_port": 37934,
"control_port": 42264,
"hb_port": 34727,
"shell_port": 35502,
"iopub_port": 59585,
"transport": "tcp",
"ip": "127.0.0.1",
"signature_scheme": "hmac-sha256",
"key": "4b306a48-98a0715362cbc47aafbc4e5f",
"kernel_name": ""
}
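As a rough illustration of how a client might use this profile, the sketch below derives the ZeroMQ address of each channel and signs the four parts of a message with the profile's key. The helper names `channel_endpoints` and `sign` are hypothetical, and the address format shown applies to the `tcp` transport only; real clients such as jupyter_client handle these details for you.

```python
import hashlib
import hmac
import json

# Connection profile copied from the slide above.
profile = {
    "stdin_port": 37934,
    "control_port": 42264,
    "hb_port": 34727,
    "shell_port": 35502,
    "iopub_port": 59585,
    "transport": "tcp",
    "ip": "127.0.0.1",
    "signature_scheme": "hmac-sha256",
    "key": "4b306a48-98a0715362cbc47aafbc4e5f",
    "kernel_name": "",
}


def channel_endpoints(profile):
    """Map each channel name to the address a client would connect to (tcp only)."""
    port_keys = {
        "shell": "shell_port",
        "iopub": "iopub_port",
        "stdin": "stdin_port",
        "control": "control_port",
        "heartbeat": "hb_port",
    }
    return {
        name: f"{profile['transport']}://{profile['ip']}:{profile[key]}"
        for name, key in port_keys.items()
    }


def sign(profile, header, parent_header, metadata, content):
    """Return the hex HMAC over the four serialized message parts, in order."""
    # "hmac-sha256" -> hashlib.sha256
    digest = getattr(hashlib, profile["signature_scheme"].split("-", 1)[1])
    mac = hmac.new(profile["key"].encode("utf-8"), digestmod=digest)
    for part in (header, parent_header, metadata, content):
        mac.update(json.dumps(part).encode("utf-8"))
    return mac.hexdigest()


endpoints = channel_endpoints(profile)
signature = sign(profile, {"msg_type": "execute_request"}, {}, {}, {"code": "1+1"})
print(endpoints["shell"])
```

The kernel recomputes the same HMAC over the bytes it receives and rejects messages whose signature does not match, which is why the `key` in the profile must stay private.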
Jupyter Messaging Protocol
Available Sockets:
• Shell (requests, history, info)
• IOPub (status, display, results)
• Stdin (input requests from kernel)
• Control (shutdown, interrupt)
• Heartbeat (poll)
Message flow
Two types of responses
• Results
• Computations that return a result
• 1+1
• val a = 2 + 5
• Stream Content
• Values that are written to output stream
• println("Hello World")
• df.show(10)
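Message exchange between the client program and the kernel:
Client → Kernel: Evaluate (msgid=1) '1+1'
Kernel → Client: Busy (msgid=1)
Kernel → Client: Status (msgid=1) ok/error
Kernel → Client: Result (msgid=1)
Kernel → Client: Stream Content (msgid=1)
Kernel → Client: Idle (msgid=1)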
Introducing
Jupyter Enterprise Gateway
The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy, scalability, security, etc.)
• Attempts at scaling up (one large server) or having a single Notebook server
per user were insufficient
• Jupyter Kernel Gateway introduced a Bring Your Own Notebook model via
Websocket “Personality” and Notebook extension – nb2kg
Initial prototype using Jupyter Kernel Gateway
[Diagram: nb2kg (proxy) clients pass through a Security Layer to the Gateway Node, where Jupyter Kernel Gateway hosts the Python Kernels and their Spark Drivers and connects to them over the Shell, IOPub, stdin, control, and heartbeat channels. The YARN Cluster holds the YARN Resource Manager and the YARN Workers running the Spark Executors.]
Issue #1: All kernels and Spark drivers run on a single node
Issue #2: All Spark jobs run as the same user ID
Issue: All kernels run on a single node
[Chart: Maximum Number of Simultaneous Kernels. X-axis: cluster size (32 GB nodes); y-axis: max kernels (4 GB heap). The maximum stays at 8 kernels for clusters of 4, 8, 12, and 16 nodes.]
Jupyter Enterprise Gateway: Initial Goals
Optimized Resource Allocation
§ Run Spark in YARN Cluster Mode to better utilize cluster resources
§ Pluggable architecture for additional Resource Managers and Lifecycle Management
§ General framework for remote kernels
Multiuser support with user impersonation
§ Enhance security and sandboxing by enabling user impersonation when running kernels (using
Kerberos)
§ Individual HDFS home folder for each notebook user
§ Enables use of same user ID for both notebook and batch jobs
Enhanced Security
§ Secure socket communications
§ Any network communication should be encrypted
[Diagram: software stack, top to bottom: Jupyter Enterprise Gateway, Jupyter Kernel Gateway, Jupyter Notebook, jupyter_client.]
Jupyter Enterprise Gateway
• Multitenancy
• Remote kernel lifecycle management via process proxies
• Impersonation: Alice’s kernel runs under Alice’s user ID.
[Diagram: nb2kg (proxy) clients pass through a Security Layer to the Gateway Node running Jupyter Enterprise Gateway. On the YARN Workers of the YARN Cluster, three YARN containers each hold a Jupyter Kernel and a Spark Driver, each shown with a group of Spark Executors.]
Scalability Benefits
[Chart: Maximum Number of Simultaneous Kernels. X-axis: cluster size (32 GB nodes); y-axis: max kernels (4 GB heap).
Before JEG: 8 kernels at 4, 8, 12, and 16 nodes.
After JEG: 16 kernels at 4 nodes, 32 at 8 nodes, 48 at 12 nodes, 64 at 16 nodes.]
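The numbers in the Scalability Benefits chart can be reproduced with simple arithmetic. The sketch below is only a reading of the chart: the single-node limit of 8 is consistent with a 32 GB node divided into 4 GB kernel heaps, and the figure of four kernels per node with JEG is an assumption taken from the plotted values (16 kernels on 4 nodes), not something the slides derive.

```python
NODE_MEMORY_GB = 32
KERNEL_HEAP_GB = 4
# Assumption read off the chart: 16 kernels on 4 nodes, 64 on 16 nodes.
KERNELS_PER_NODE_WITH_JEG = 4


def max_kernels_before(nodes):
    """All kernels share the single gateway node, so cluster size does not matter."""
    return NODE_MEMORY_GB // KERNEL_HEAP_GB


def max_kernels_after(nodes):
    """Kernels are spread across the cluster, so capacity grows with node count."""
    return nodes * KERNELS_PER_NODE_WITH_JEG


table = {n: (max_kernels_before(n), max_kernels_after(n)) for n in (4, 8, 12, 16)}
for nodes, (before, after) in table.items():
    print(f"{nodes:>2} nodes: before={before} after={after}")
```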
Jupyter Notebooks with Enterprise Gateway
[Diagram: the Notebook Server Process (JavaScript, Notebook Management, Kernel Management via NB2KG) talks to the Enterprise Gateway Process (Kernel Lifecycle Management, Process Proxies, WebSocket passthrough via Kernel Gateway), which reaches the Kernel Launcher Process (embedded IPython Kernel, User Code: sklearn, Spark, TensorFlow, …) over the Shell, IOPub, stdin, control, and heartbeat channels.]
Building an Analytical Micro Service
Use Case – Sentiment Analysis
Utilizing the Yelp Dataset from Kaggle
Utilizing the AFINN sentiment analysis library in Python
Using PySpark for training and scoring the model
Using Jupyter Kernels to integrate the microservice with Spark
Yelp Dataset: https://www.kaggle.com/yelp-dataset/yelp-dataset
Use Case – Sentiment Analysis
Integrates analytics into your regular Python application/microservice by leveraging Jupyter Notebook kernels.
[Diagram: HTTP requests reach a Flask application made up of a Sentiment Resource, a Sentiment Provider, and a Kernel Launcher/Client. The client talks to the Enterprise Gateway Node (Jupyter Enterprise Gateway: multitenancy, remote kernels, and kernel lifecycle management). On the YARN Workers of the YARN Cluster, each YARN container holds a Jupyter Kernel and a Spark Driver, shown with a group of Spark Executors.]
Use Case – Sentiment Analysis
Loading the Yelp Dataset
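# Assumes `spark` is the SparkSession provided by the kernel.
# Load the business table, inferring column types from the CSV.
business = (
    spark.read
    .option('header', 'true')
    .option('inferSchema', 'true')
    .csv('file:///opt/data/yelp_business.csv')
)
# Keep a single city and persist it as Parquet.
city_business = business.filter(business.city == 'Brooklyn')
city_business.write.parquet(path='yelp/business', mode='overwrite')
# Load the reviews table the same way.
reviews = (
    spark.read
    .option('header', 'true')
    .option('inferSchema', 'true')
    .csv('file:///opt/data/yelp_review.csv')
)
# A left semi join keeps only the reviews whose business_id appears in city_business.
city_business_reviews = reviews.join(
    city_business,
    city_business.business_id == reviews.business_id,
    'left_semi')
city_business_reviews.write.parquet(path='yelp/reviews', mode='overwrite')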
Application – Sentiment REST API
• Leverages Flask-RESTful
• Exposes a sentiment REST API
• Request sentiment for a given business
• http://<host>:5000/sentiment/<business_id>
• During Application startup
• Start the kernel
• Perform required data load operations
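The shape of this resource can be sketched as follows. The slide names Flask-RESTful; to keep the example free of third-party dependencies, this sketch swaps in the standard library's `http.server`, so it shows the routing idea rather than the demo's actual code. `parse_business_id`, `lookup_sentiment`, and `SentimentHandler` are hypothetical names, and `lookup_sentiment` is a placeholder for the call into the Sentiment Provider.

```python
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

SENTIMENT_ROUTE = re.compile(r"^/sentiment/(?P<business_id>[^/]+)$")


def parse_business_id(path):
    """Return the business id from a /sentiment/<business_id> path, else None."""
    match = SENTIMENT_ROUTE.match(path)
    return match.group("business_id") if match else None


def lookup_sentiment(business_id):
    """Hypothetical placeholder for the call into the Sentiment Provider."""
    return {"business_id": business_id, "sentiment": None}


class SentimentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        business_id = parse_business_id(self.path)
        if business_id is None:
            self.send_error(404, "Unknown resource")
            return
        body = json.dumps(lookup_sentiment(business_id)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=5000):
    """Start serving; application startup is where the kernel would be
    started and the required data loaded before accepting requests."""
    HTTPServer(("", port), SentimentHandler).serve_forever()
```

Calling `serve()` would expose `http://<host>:5000/sentiment/<business_id>`, matching the URL pattern on the slide.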
Application – Sentiment Provider
• Encapsulate all interactions with the
kernel
• Start/Stop the Kernel
• Load necessary tables
• Retrieve business details
• Calculate sentiment for each business review
• Possible enhancements/todos
• Return data as NumPy arrays
• Provide more flexibility to manipulate
display on the resource side (e.g. pretty HTML)
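To make the "calculate sentiment for each business review" step concrete, here is a minimal sketch of AFINN-style scoring, in which each known word carries an integer valence and a review's score is the sum. The mini-lexicon, the helper names, and the choice to average scores per business are all assumptions for illustration; the demo itself uses the AFINN library and runs the computation through the kernel.

```python
import re

# Hypothetical mini-lexicon in the AFINN style (integer valence per word).
# The real AFINN list is much larger; these entries are for illustration only.
LEXICON = {"good": 3, "great": 3, "love": 3, "bad": -3, "terrible": -3, "slow": -1}


def score_review(text):
    """Sum the valence of every known word in the review."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def business_sentiment(reviews):
    """Average the per-review scores; return 0.0 when there are no reviews."""
    if not reviews:
        return 0.0
    return sum(score_review(review) for review in reviews) / len(reviews)


reviews = ["Great pizza, love the crust", "Terrible and slow service"]
scores = [score_review(review) for review in reviews]
print(scores, business_sentiment(reviews))
```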
Application – Kernel Launcher
• Encapsulate kernel lifecycle
• Start/Stop the Kernel
• Instantiate a new kernel object
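One way to picture the launcher is as a thin wrapper around the gateway's kernel REST resources: a POST to `/api/kernels` starts a kernel, a DELETE on `/api/kernels/<id>` stops it, and `/api/kernels/<id>/channels` is the WebSocket used to talk to it. The sketch below only builds the requests and sends nothing; the class name, the host `gateway.example.com`, and the kernel name are placeholders.

```python
import json
import urllib.request


class GatewayKernelLauncher:
    """Builds the REST requests that start and stop a kernel on a gateway."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def start_request(self, kernel_name, username=None):
        payload = {"name": kernel_name}
        if username:
            # Enterprise Gateway reads KERNEL_USERNAME to decide whom the kernel runs as.
            payload["env"] = {"KERNEL_USERNAME": username}
        return urllib.request.Request(
            f"{self.base_url}/api/kernels",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def stop_request(self, kernel_id):
        return urllib.request.Request(
            f"{self.base_url}/api/kernels/{kernel_id}", method="DELETE"
        )

    def channels_url(self, kernel_id):
        # http -> ws, https -> wss
        ws_base = "ws" + self.base_url[len("http"):]
        return f"{ws_base}/api/kernels/{kernel_id}/channels"


# Placeholder host and kernel name, for illustration only.
launcher = GatewayKernelLauncher("http://gateway.example.com:8888/")
start = launcher.start_request("spark_python_yarn_cluster", username="alice")
print(start.get_method(), start.full_url)
```

Passing each request to `urllib.request.urlopen` would perform the call; the JSON response to the start request carries the kernel id needed for the other two methods.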
Application – Kernel Object
• Encapsulate kernel code execution
• Create proper code execution requests
• Monitor kernel responses
• Errors
• Result Responses
• Stream Responses
• Kernel idle status
• Return code execution responses
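"Create proper code execution requests" comes down to building an `execute_request` message. The sketch below shows roughly what such a message could look like when sent as JSON over a gateway WebSocket; `build_execute_request` is a hypothetical helper, the field names follow the Jupyter messaging protocol, and the `channel` entry is what the WebSocket layer uses to route the message to the shell channel.

```python
import json
import uuid
from datetime import datetime, timezone


def build_execute_request(code, session_id=None):
    """Create an execute_request message addressed to the shell channel."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session_id or uuid.uuid4().hex,
            "username": "",
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.0",
        },
        "parent_header": {},
        "metadata": {},
        "channel": "shell",
        "content": {
            "code": code,
            "silent": False,
            "store_history": False,
            "user_expressions": {},
            "allow_stdin": False,
        },
        "buffers": [],
    }


request = build_execute_request("1+1")
wire_text = json.dumps(request)  # what would be sent over the WebSocket
print(request["header"]["msg_id"])
```

The `msg_id` generated here is the value that later appears in the `parent_header` of every response, which is how the kernel object can match results, stream output, errors, and the final idle status to the request that caused them.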
Application – Live Demo
Further Reading
Connecting directly to a kernel
• See Jupyter Client https://github.com/jupyter/jupyter_client
• See https://github.com/lresende/toree-gateway/blob/master/python/toree_client.py
Jupyter Kernel Gateway/Enterprise Gateway HTTP Personality
• Starts the Gateway in single notebook mode
• Notebook cells (based on resource identifier comment) become accessible via URLs
# GET /hello/world
print("I'm cell #1")
• http://jupyter-kernel-gateway.readthedocs.io/en/latest/http-mode.html
Resources
Jupyter Enterprise Gateway source code at GitHub
https://github.com/jupyter-incubator/enterprise_gateway
Jupyter Enterprise Gateway Documentation
http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Microservice Demo Application
https://github.com/lresende/eg-sentiment-microservice-demo
