Building Analytical Microservices Powered by Jupyter Kernels

About me - Luciano Resende
Data Science Platform Architect – IBM – CODAIT (formerly Spark Technology Center)
• Have been contributing to open source at ASF for over 10 years
• Currently contributing to: Jupyter Notebook ecosystem, Apache Bahir, Apache Spark,
Apache Toree, and other projects in the Apache Spark ecosystem
lresende@apache.org
http://lresende.blogspot.com/
https://www.linkedin.com/in/lresende
@lresende1975
https://github.com/lresende

About me – Kevin Bates
Sr. Software Engineer – IBM – CODAIT (formerly Spark Technology Center)
• Over 30 years developing enterprise-level software
• Working in the Jupyter ecosystem for the last 14 months
kbates4@gmail.com
https://www.linkedin.com/in/kevinbatessoftware
@kbates4
https://github.com/kevin-bates

Jupyter Notebooks
Notebooks are interactive computational
environments, in which you can
combine code execution, rich text,
mathematics, plots and rich media.

Jupyter Notebook Platform Architecture
• The Notebook UI runs in the browser
• The Notebook Server serves the notebooks
• Kernels interpret and execute cell contents
  • They are responsible for code execution
  • They abstract the different languages

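Jupyter Notebooks Architecture
[Diagram: the Notebook Server process (JavaScript front end, notebook management, kernel management, kernel proxy) communicates with an IPython kernel running in a separate Python process over five channels: shell, IOPub, stdin, control, and heartbeat. The kernel runs user code, which can use libraries such as sklearn, Spark, and TensorFlow.]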

Jupyter Connection Profile
{
"stdin_port": 37934,
"control_port": 42264,
"hb_port": 34727,
"shell_port": 35502,
"iopub_port": 59585,
"transport": "tcp",
"ip": "127.0.0.1",
"signature_scheme": "hmac-sha256",
"key": "4b306a48-98a0715362cbc47aafbc4e5f",
"kernel_name": ""
}
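A client reads this profile to find one endpoint per channel. The following is a minimal sketch of that mapping using only the standard library; the helper name `channel_endpoints` is hypothetical and not part of any Jupyter package.

```python
import json

# The connection profile from the slide, as a client would read it from disk.
PROFILE_JSON = """
{
  "stdin_port": 37934,
  "control_port": 42264,
  "hb_port": 34727,
  "shell_port": 35502,
  "iopub_port": 59585,
  "transport": "tcp",
  "ip": "127.0.0.1",
  "signature_scheme": "hmac-sha256",
  "key": "4b306a48-98a0715362cbc47aafbc4e5f",
  "kernel_name": ""
}
"""


def channel_endpoints(profile):
    """Map each channel name to its endpoint URL (hypothetical helper)."""
    port_keys = {
        "shell": "shell_port",
        "iopub": "iopub_port",
        "stdin": "stdin_port",
        "control": "control_port",
        "heartbeat": "hb_port",
    }
    base = f"{profile['transport']}://{profile['ip']}"
    return {name: f"{base}:{profile[key]}" for name, key in port_keys.items()}


profile = json.loads(PROFILE_JSON)
endpoints = channel_endpoints(profile)
for name, url in endpoints.items():
    print(f"{name:10s} {url}")
```

In practice the `jupyter_client` package handles this step; the sketch only makes the structure of the profile explicit.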

Jupyter Messaging Protocol
Available Sockets:
• Shell (requests, history, info)
• IOPub (status, display, results)
• Stdin (input requests from kernel)
• Control (shutdown, interrupt)
• Heartbeat (poll)
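Messages sent over these sockets are signed with the `key` and `signature_scheme` from the connection profile: the signature is the HMAC hex digest of the serialized header, parent header, metadata, and content, in that order. A minimal sketch of signing and verifying one message follows; the helper names and the header values are illustrative assumptions.

```python
import hashlib
import hmac
import json

KEY = "4b306a48-98a0715362cbc47aafbc4e5f"  # "key" from the connection profile


def serialize(header, parent_header, metadata, content):
    """Serialize the four message dictionaries into byte frames."""
    return [
        json.dumps(part, sort_keys=True).encode("utf-8")
        for part in (header, parent_header, metadata, content)
    ]


def sign(key, frames):
    """HMAC-SHA256 hex digest over the four frames, in order."""
    mac = hmac.new(key.encode("utf-8"), digestmod=hashlib.sha256)
    for frame in frames:
        mac.update(frame)
    return mac.hexdigest()


def verify(key, frames, signature):
    """Constant-time comparison of the expected and received signatures."""
    return hmac.compare_digest(sign(key, frames), signature)


# Illustrative header values; a real client generates unique ids.
header = {
    "msg_id": "1",
    "session": "demo-session",
    "username": "demo",
    "msg_type": "execute_request",
    "version": "5.0",
}
frames = serialize(header, {}, {}, {"code": "1+1", "silent": False})
signature = sign(KEY, frames)

# Changing any frame after signing invalidates the signature.
tampered = frames[:3] + [b'{"code": "2+2"}']
print(verify(KEY, frames, signature), verify(KEY, tampered, signature))
```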

Message flow
Two types of responses
• Results
  • Computations that return a result
  • 1+1
  • val a = 2 + 5
• Stream Content
  • Values that are written to the output stream
  • println("Hello World")
  • df.show(10)
Sequence between the client program and the kernel:
• Client → Kernel: Evaluate (msgid=1) '1+1'
• Kernel → Client: Busy (msgid=1)
• Kernel → Client: Status (msgid=1) ok/error
• Kernel → Client: Result (msgid=1)
• Kernel → Client: Stream Content (msgid=1)
• Kernel → Client: Idle (msgid=1)
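The flow above can be sketched as a loop that reads kernel messages for one request until the kernel reports idle. The example runs against simulated messages rather than a live kernel; the message types (`status`, `stream`, `execute_result`, `error`) follow the Jupyter messaging protocol, while `collect_responses` and `iopub` are hypothetical helper names.

```python
def collect_responses(messages, msg_id):
    """Gather results, stream output, and errors for one request."""
    outcome = {"results": [], "stream": [], "errors": []}
    for msg in messages:
        # Ignore traffic that belongs to other requests.
        if msg["parent_header"].get("msg_id") != msg_id:
            continue
        msg_type = msg["header"]["msg_type"]
        content = msg["content"]
        if msg_type == "execute_result":
            outcome["results"].append(content["data"]["text/plain"])
        elif msg_type == "stream":
            outcome["stream"].append(content["text"])
        elif msg_type == "error":
            outcome["errors"].append(f"{content['ename']}: {content['evalue']}")
        elif msg_type == "status" and content["execution_state"] == "idle":
            break  # the kernel has finished processing this request
    return outcome


def iopub(msg_type, parent_id, **content):
    """Build a simulated kernel message (only the fields used above)."""
    return {
        "header": {"msg_type": msg_type},
        "parent_header": {"msg_id": parent_id},
        "content": content,
    }


messages = [
    iopub("status", "1", execution_state="busy"),
    iopub("stream", "1", name="stdout", text="Hello World\n"),
    iopub("execute_result", "1", data={"text/plain": "2"}, execution_count=1),
    iopub("stream", "2", name="stdout", text="other request\n"),
    iopub("status", "1", execution_state="idle"),
    iopub("stream", "1", name="stdout", text="after idle\n"),
]
outcome = collect_responses(messages, "1")
print(outcome)
```

Filtering on the parent message id matters because the IOPub channel is shared: output from other requests can arrive interleaved with the replies a client is waiting for.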

Introducing
Jupyter Enterprise Gateway

The Origins of Jupyter Enterprise Gateway
• Multiple IBM products embedding Spark on YARN
• All wanted to add Jupyter notebooks with Spark
• Usual enterprise requirements (multitenancy, scalability, security, etc.)
• Attempts at scaling up (one large server) or having a single Notebook server
per user were insufficient
• Jupyter Kernel Gateway introduced a Bring Your Own Notebook model via
Websocket “Personality” and Notebook extension – nb2kg

Initial prototype using Jupyter Kernel Gateway
[Diagram: nb2kg clients (one labeled Proxy) connect through a security layer to a Gateway Node running Jupyter Kernel Gateway. Each Python kernel and its Spark driver run on the Gateway Node and are reached over the shell, IOPub, stdin, control, and heartbeat channels. Spark executors run on the YARN workers of the YARN cluster, alongside the YARN Resource Manager.]
• Issue #1: All kernels and Spark drivers run on a single node
• Issue #2: All Spark jobs run as the same user ID

Issue: All kernels run on a single node
[Chart: Maximum Number of Simultaneous Kernels. X axis: cluster size (32 GB nodes). Y axis: max kernels (4 GB heap).]
Cluster size | 4 Nodes | 8 Nodes | 12 Nodes | 16 Nodes
Max kernels  | 8       | 8       | 8        | 8

Jupyter Enterprise Gateway: Initial Goals
Optimized Resource Allocation
• Run Spark in YARN Cluster Mode to better utilize cluster resources
• Pluggable architecture for additional Resource Managers and Lifecycle Management
• General framework for remote kernels
Multiuser support with user impersonation
• Enhance security and sandboxing by enabling user impersonation when running kernels (using Kerberos)
• Individual HDFS home folder for each notebook user
• Enables use of the same user ID for both notebook and batch jobs
Enhanced Security
• Secure socket communications
• Any network communication should be encrypted
[Diagram labels: Jupyter Kernel Gateway, Jupyter Notebook, jupyter_client]

[Diagram: Enterprise Gateway on a YARN cluster. nb2kg clients (one labeled Proxy) connect through a security layer to the Gateway Node, which provides:
• Multitenancy
• Remote kernel lifecycle management via process proxies
On the YARN workers, each YARN container holds a Jupyter kernel and its Spark driver, with its own Spark executors. Impersonation: Alice's kernel runs under Alice's user ID.]

Scalability Benefits
[Chart: Maximum Number of Simultaneous Kernels. X axis: cluster size (32 GB nodes). Y axis: max kernels (4 GB heap).]
Cluster size | Before JEG | After JEG
4 Nodes      | 8          | 16
8 Nodes      | 8          | 32
12 Nodes     | 8          | 48
16 Nodes     | 8          | 64

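Jupyter Notebooks with Enterprise Gateway
[Diagram: the Notebook Server process (JavaScript front end, notebook management, kernel management via NB2KG) connects to the Enterprise Gateway process, which provides kernel lifecycle management through process proxies and websocket passthrough via Kernel Gateway. The gateway reaches a Kernel Launcher process with an embedded IPython kernel over the shell, IOPub, stdin, control, and heartbeat channels. User code in that kernel can use libraries such as sklearn, Spark, and TensorFlow.]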

Building an Analytical Microservice

Use Case – Sentiment Analysis
Utilizing Yelp Dataset from Kaggle
Utilizing AFINN sentiment analysis library in Python
Using PySpark for training and scoring model
Using Jupyter Kernels to integrate the microservice with Spark
Yelp Dataset: https://www.kaggle.com/yelp-dataset/yelp-dataset

Integrate analytics into your regular Python application or microservice by leveraging Jupyter Notebook kernels
[Diagram: an HTTP client calls the Sentiment Resource (Flask). The Sentiment Provider uses the Kernel Launcher/Client to reach the Enterprise Gateway Node, which provides multitenancy, remote kernels, and kernel lifecycle management. On the YARN workers of the YARN cluster, each YARN container holds a Jupyter kernel and its Spark driver, with its own Spark executors.]

Loading the Yelp Dataset
# Assumes `spark` is the SparkSession already provided by the kernel.
business = (
    spark.read
    .option('header', 'true')
    .option('inferSchema', 'true')
    .csv('file:///opt/data/yelp_business.csv')
)
# Keep only the businesses located in Brooklyn.
city_business = business.filter(business.city == 'Brooklyn')
city_business.write.parquet(path='yelp/business', mode='overwrite')

reviews = (
    spark.read
    .option('header', 'true')
    .option('inferSchema', 'true')
    .csv('file:///opt/data/yelp_review.csv')
)
# A left semi join keeps only reviews whose business_id appears in city_business.
city_business_reviews = reviews.join(
    city_business,
    city_business.business_id == reviews.business_id,
    'left_semi',
)
city_business_reviews.write.parquet(path='yelp/reviews', mode='overwrite')

Application – Sentiment REST API
• Leverages Flask-RESTful
• Exposes a sentiment REST API
• Request sentiment for a given business
• http://<host>:5000/sentiment/<business_id>
• During Application startup
• Start the kernel
• Perform required data load operations
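The shape of this API can be sketched without third-party packages. The example below swaps in a plain WSGI callable for Flask-RESTful so that it runs with the standard library alone; it is not the demo application's code. `FakeSentimentProvider`, `make_app`, and `call` are hypothetical names, and the provider returns a fixed placeholder value instead of talking to a kernel.

```python
import json


class FakeSentimentProvider:
    """Hypothetical stand-in for the provider that talks to the kernel."""

    def __init__(self):
        self.started = False

    def start(self):
        # The real application would start the kernel and load data here.
        self.started = True

    def sentiment(self, business_id):
        # Placeholder value; a real provider would query the kernel.
        return {"business_id": business_id, "sentiment": 0.0}


def make_app(provider):
    """Build a WSGI app that serves GET /sentiment/<business_id>."""
    provider.start()  # application startup work happens once, up front

    def app(environ, start_response):
        parts = environ.get("PATH_INFO", "").strip("/").split("/")
        is_get = environ.get("REQUEST_METHOD") == "GET"
        if is_get and len(parts) == 2 and parts[0] == "sentiment" and parts[1]:
            status, payload = "200 OK", provider.sentiment(parts[1])
        else:
            status, payload = "404 Not Found", {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        headers = [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ]
        start_response(status, headers)
        return [body]

    return app


def call(app, path):
    """Invoke the WSGI app directly, without opening a network socket."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


provider = FakeSentimentProvider()
app = make_app(provider)
print(call(app, "/sentiment/abc123"))
```

To serve the sketch on port 5000, the standard library's `wsgiref.simple_server.make_server("", 5000, app).serve_forever()` is enough for local experiments.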

Application – Sentiment Provider
• Encapsulate all interactions with the
kernel
• Start/Stop the Kernel
• Load necessary tables
• Retrieve business details
• Calculate sentiment for each business review
• Possible enhancements/todos
• Return data as NumPy arrays
• Provide more flexibility to manipulate
display on the resource side (e.g., pretty HTML)
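The per-review calculation can be illustrated with an AFINN-style word list, where each word carries an integer score between -5 and +5 and a review's score is the sum over its words. The tiny lexicon below is an assumption made for the example, not the real AFINN list, and averaging the review scores per business is also an assumption, since the slides do not say how scores are aggregated.

```python
import re

# Illustrative AFINN-style word scores (assumed values, not the real lexicon).
LEXICON = {
    "great": 3,
    "love": 3,
    "good": 3,
    "bad": -3,
    "terrible": -3,
    "slow": -1,
}


def review_score(text):
    """Sum the scores of the known words in one review."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def business_sentiment(reviews):
    """Average the review scores; 0.0 when there are no reviews."""
    if not reviews:
        return 0.0
    return sum(review_score(review) for review in reviews) / len(reviews)


reviews = [
    "Great pizza, love the crust",
    "Terrible service and slow delivery",
    "Good coffee",
]
scores = [review_score(review) for review in reviews]
print(scores, business_sentiment(reviews))
```

In the architecture described in this deck, this scoring would run inside the kernel next to the Spark data, and only the aggregated values would travel back to the provider.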

Application – Kernel Launcher
• Encapsulate kernel lifecycle
• Start/Stop the Kernel
• Instantiate a new kernel object
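One way to sketch the launcher is as a thin wrapper over the gateway's REST interface, which mirrors the Notebook server's kernel API: `POST /api/kernels` starts a kernel, `DELETE /api/kernels/<id>` stops it, and `/api/kernels/<id>/channels` is the websocket for messages. The class below is a hypothetical illustration, not the demo's code; the host `localhost:8888` and the kernel name `python3` are assumptions.

```python
import json
import urllib.request


class KernelLauncher:
    """Hypothetical launcher managing kernels through a gateway REST API."""

    def __init__(self, gateway_host):
        self.http_base = f"http://{gateway_host}/api/kernels"
        self.ws_base = f"ws://{gateway_host}/api/kernels"

    def start_request(self, kernel_name):
        """Build the POST request that asks the gateway for a new kernel."""
        body = json.dumps({"name": kernel_name}).encode("utf-8")
        return urllib.request.Request(
            self.http_base,
            data=body,
            method="POST",
            headers={"Content-Type": "application/json"},
        )

    def stop_request(self, kernel_id):
        """Build the DELETE request that shuts a kernel down."""
        return urllib.request.Request(
            f"{self.http_base}/{kernel_id}", method="DELETE"
        )

    def channels_url(self, kernel_id):
        """Websocket URL used to exchange messages with the kernel."""
        return f"{self.ws_base}/{kernel_id}/channels"

    def start_kernel(self, kernel_name):
        """Send the start request and return the new kernel's id."""
        with urllib.request.urlopen(self.start_request(kernel_name)) as response:
            return json.loads(response.read())["id"]


# Building requests needs no running gateway; start_kernel() does.
launcher = KernelLauncher("localhost:8888")
start = launcher.start_request("python3")
print(start.get_method(), start.full_url)
print(launcher.channels_url("kernel-id-placeholder"))
```

Separating request construction from sending keeps the lifecycle logic easy to test without a gateway.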

Application – Kernel Object
• Encapsulate kernel code execution
• Create proper code execution requests
• Monitor kernel responses
• Errors
• Result Responses
• Stream Responses
• Kernel idle status
• Return code execution responses
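Creating a proper code execution request means building an `execute_request` message in the Jupyter messaging format. A minimal sketch of such a message, in the JSON form used on the websocket channel, follows; the function name and default values are assumptions, and the sending and monitoring steps are left out.

```python
import json
import uuid


def execute_request(code, session_id, username="anonymous"):
    """Build an execute_request message for the shell channel."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,  # used to match replies to this request
            "session": session_id,
            "username": username,
            "msg_type": "execute_request",
            "version": "5.0",
        },
        "parent_header": {},
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": False,
            "user_expressions": {},
            "allow_stdin": False,
        },
        "channel": "shell",
    }


request = execute_request("1 + 1", session_id=uuid.uuid4().hex)
wire_text = json.dumps(request)  # text that would be sent over the websocket
print(request["header"]["msg_id"], len(wire_text))
```

The kernel copies the request's header into the `parent_header` of every reply, which is how a client tells its own errors, results, stream output, and idle status apart from other traffic.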

Further Reading
Connect directly to a kernel
• See Jupyter Client https://github.com/jupyter/jupyter_client
• See https://github.com/lresende/toree-gateway/blob/master/python/toree_client.py
Jupyter Kernel Gateway/Enterprise Gateway HTTP Personality
• Starts the Gateway in single notebook mode
• Notebook cells (based on resource identifier comment) become accessible via URLs
# GET /hello/world
print("I'm cell #1")
• http://jupyter-kernel-gateway.readthedocs.io/en/latest/http-mode.html

Resources
Jupyter Enterprise Gateway source code at GitHub
https://github.com/jupyter-incubator/enterprise_gateway
Jupyter Enterprise Gateway Documentation
http://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Microservice Demo Application
https://github.com/lresende/eg-sentiment-microservice-demo
