Abstract:
In this presentation, we describe the "Spark Kernel", which enables applications, such as end-user-facing and interactive applications, to interface with Spark clusters. It provides a gateway to define and run Spark tasks and to collect results from a cluster without the friction associated with shipping jars and reading results from peripheral systems. Using the Spark Kernel as a proxy, applications can be hosted remotely from Spark.
2. Outline
• Scenario
• Problem
  • How do you enable interactive applications against Apache Spark?
• Solution
  • Spark Kernel
• Architecture
• Memory issue
• Comm API
• Livesheets (line-of-business tool)
• RESTful Server (query interface)
• Extending the Spark Kernel
• Summary & Questions
3. Scenario
• Livesheets prototype
  • Needs to be able to build computations on the fly
  • Needs to be able to perform computations on static (historical) data as well as dynamic (streaming) data
  • Needs to be responsive (order of seconds instead of minutes)
4. Problem
How do you enable interactive applications?
• Spark Submit for job submission to Apache Spark
• JDBC and other offerings available for Spark SQL
• RESTful interfaces available to submit jars
• Spark Shell offers code snippet support to execute against a Spark cluster
5. Problem
How do you enable interactive applications?
• Our first approach used Spark Submit
  • Bundled Spark-based computations into a jar
  • Started an external process to run the Spark Submit script against the jar
6. Problem
How do you enable interactive applications?
• What was wrong with the Spark Submit approach?
  • Had to rebundle the jar every time a computation changed
  • Not easy to attach to an existing Spark job
  • Getting results involved writing to a data store and then reading back out
  • Very slow turnaround
7. Solution: Spark Kernel
• Scala application that can do the following:
  • Define and execute raw Scala source code
  • Define and run Spark tasks via code snippets or jars
  • Collect results directly from a Spark cluster
• Benefits
  • Avoids the friction of shipping jars and reading results from peripheral systems
  • Well-defined API (IPython/Jupyter)
  • Acts as a proxy for Spark applications so that they can run remotely from Spark
  • Provides a client library for application development
[Architecture diagram: IPython and applications (via the Kernel Client library) talk to the Kernel over ZeroMQ using the IPython message protocol; the Kernel drives a Spark cluster consisting of one Master and several Workers.]
11. Channels of Communication
• IPython Protocol
  • Specifies the incoming and outgoing messages handled by the kernel
  • Defines the purposes of the five channels of communication
• ZeroMQ API
  • Uses ZeroMQ for socket communication via the five defined ports
  • Uses ZMTP as the wire protocol
[Diagram: the kernel exposes five ØMQ channels: Heartbeat, Shell, Control, StdIn, and IOPub.]
12. Channels of Communication
• Heartbeat
  • Used to indicate that the kernel is still alive
  • Echoes received messages back to the client (see the ping sketch after these channel slides)
  • Primarily used by IPython
• Shell
  • Used to communicate requests from a client to the kernel
  • Main purposes are code execution and Comm messages from a client
13. Channels of Communication
• Control
  • Serves as a higher-priority shell channel
  • Typically used to receive shutdown signals
• StdIn
  • Used to communicate requests from the kernel to the client(s)
  • Primarily used by IPython as a form of communication for users through the UI
14. Channels of Communication
• IOPub
  • Broadcasts messages to all listening clients
  • Used to communicate side effects (standard out/error) as well as Comm messages
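As a minimal illustration of how a client exercises one of these channels, the sketch below pings the Heartbeat channel using JeroMQ (the library the kernel itself wraps, as described next). The address and port are placeholders; real values come from the kernel's connection information.

  import org.zeromq.{SocketType, ZContext}

  // Heartbeat check: the kernel echoes back whatever bytes it receives,
  // so any reply means the kernel is alive.
  object HeartbeatPing extends App {
    val ctx = new ZContext()
    try {
      val socket = ctx.createSocket(SocketType.REQ)
      socket.connect("tcp://127.0.0.1:45323") // hypothetical heartbeat port
      socket.send("ping".getBytes, 0)
      val reply = new String(socket.recv(0)) // echoed payload
      println(s"kernel alive, echoed: $reply")
    } finally ctx.close()
  }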
15. Processing Messages
[Diagram: inside the kernel, Akka stages messages through Message Parsing and Validation, Routing, and Message Handling.]
• Message Parsing and Validation
  • Uses Akka actors wrapping JeroMQ as an abstraction to parse messages
  • Calculates an HMAC (keyed-hash message authentication code) using SHA-256 and a secret key to validate against the signature in a message (a sketch follows)
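The signature follows the IPython message spec: an HMAC-SHA256 digest over the serialized header, parent header, metadata, and content frames, in that order. A minimal sketch using the JDK's javax.crypto (the key and frames are placeholders):

  import javax.crypto.Mac
  import javax.crypto.spec.SecretKeySpec

  // HMAC-SHA256 over the four JSON frames of a message; the hex digest is
  // compared against the signature frame received on the wire.
  def signature(key: String, frames: Seq[String]): String = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(key.getBytes("UTF-8"), "HmacSHA256"))
    frames.foreach(f => mac.update(f.getBytes("UTF-8")))
    mac.doFinal().map("%02x".format(_)).mkString
  }

  // Placeholder frames: header, parent_header, metadata, content.
  val sig = signature("secret-key",
    Seq("""{"msg_type":"execute_request"}""", "{}", "{}", """{"code":"1+1"}"""))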
16. Processing Messages
• Routing
  • Incoming messages are routed by message type to the associated message handler actors
  • Outgoing messages are routed by message type to the associated channels
17. Processing Messages
• Message Handling
  • Each message type has an associated Akka actor to handle the request
  • Some handlers use child actors to perform tasks, protecting the state of the handler by following Erlang's Error Kernel Pattern as well as reducing strain on the handler (see the sketch below)
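A minimal sketch of that pattern with Akka actors; the message and actor names here are invented for illustration and are not the kernel's actual classes. The handler keeps its state and hands risky work to a throwaway child, so a failure restarts only the child:

  import akka.actor.{Actor, ActorSystem, Props}

  case class HandleRequest(code: String) // hypothetical request message
  case class Result(text: String)

  // Child performs the failure-prone work; a crash restarts only this actor.
  class Worker extends Actor {
    def receive = {
      case HandleRequest(code) => sender() ! Result(s"evaluated: $code")
    }
  }

  // The handler is the "error kernel": it holds state (a counter here) and
  // delegates the risky work to children instead of doing it itself.
  class ExecuteHandler extends Actor {
    private var handled = 0
    def receive = {
      case req: HandleRequest =>
        handled += 1
        context.actorOf(Props[Worker]()) ! req
      case Result(text) => println(text)
    }
  }

  val system = ActorSystem("kernel-sketch")
  system.actorOf(Props[ExecuteHandler]()) ! HandleRequest("1 + 1")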
18. Scala Interpreter
[Diagram: the kernel hosts a Scala Interpreter alongside a Class Server and a Spark Context.]
• Scala Interpreter
  • Uses the Spark REPL API to execute Scala code
  • Contains zero modifications to Spark's REPL
  • Injects variables to provide the Spark APIs and kernel APIs, including magics and Comm communication
19. Scala Interpreter
• Class Server
  • Exposes generated REPL classes to the Spark cluster
  • In Spark's Scala 2.10 implementation of the REPL, this is created for us
• Spark Context
  • Standard Scala-based Spark Context
  • Exposed as a variable named 'sc' for user-submitted code (see the snippet below)
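For example, a snippet sent to the kernel can use the injected 'sc' directly, with no jar packaging or submit step (the computation itself is illustrative):

  // Runs as-is inside the kernel: `sc` is already bound to the Spark Context.
  val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)
  squares.sum() // the result is returned to the client as the snippet's output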
20. Kernel Client Architecture
[Diagram: the Kernel Client mirrors the kernel's five ØMQ sockets (Heartbeat, Shell, Control, StdIn, IOPub) and layers the same Akka stages (Message Parsing and Validation, Routing, Message Handling) beneath an API consumed by the application.]
• Exposes public methods accessible from Scala and Java (see the sketch below)
• Client sockets mirror and communicate with the kernel sockets
• The client's actor system shares its codebase with the kernel
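In application code this looks roughly like the sketch below; the bootstrap and callback names are assumptions based on the project's examples and may differ from the current repository, so treat them as illustrative and check the repository's documentation:

  import com.typesafe.config.ConfigFactory

  // Rough sketch of client-library usage; names are assumptions, not the
  // confirmed API.
  val profile = ConfigFactory.load() // channel ports and HMAC key for the kernel

  val client = (new ClientBootstrap(profile)
    with StandardSystemInitialization
    with StandardHandlerInitialization).createClient

  // Code goes out over the Shell channel; results arrive asynchronously.
  client.execute("sc.parallelize(1 to 100).sum()")
    .onResult(result => println(s"result: $result"))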
23. Memory Issue
• Scala REPL (and therefore the Spark Shell)
  • Generates new classes with each code snippet compiled (leads to PermGen space issues on the JVM), as illustrated below
  • Instantiates a new Request class instance per execution to hold state (leads to an OutOfMemory exception)
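To make the growth concrete, here is a schematic of what the Scala 2.10 REPL generates per line; the wrapper names are simplified, not the exact compiler output:

  // What the user types:    What the REPL roughly compiles it into:
  // val x = 1               class $line1 { val x = 1 }            // new class
  // val y = x + 1           class $line2 { val y = $line1.x + 1 } // another class
  //
  // Each snippet also allocates a request object that holds its state, so
  // class metadata (PermGen) and heap usage both grow without reclamation.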
25. Comm API
[Diagram: the frontend (client) and backend (kernel) exchange open, msg, and close messages.]
• Flexibility
  • Bidirectional communication
  • Ability to programmatically define messages and their interactions
• Performance
  • Avoids recompiling code
  • Does not keep execution state
• Simplicity
  • Start (open) communication
  • Send data (msg)
  • Stop (close) communication
26. Comm API
• Comm Open Request
  • Establishes a new link between the frontend and backend
  • Can contain data needed for initialization

  {
    "comm_id" : "u-u-i-d",
    "target_name" : "my_comm",
    "data" : {}
  }
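Between open and close, data flows in either direction as Comm msg requests; per the IPython protocol these reference the same comm_id (the payload below is a placeholder):

  {
    "comm_id" : "u-u-i-d",
    "data" : { "value" : 42 }
  }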
28. Comm API
• Comm Close Request
  • Removes the link between the frontend and backend components
  • Can contain data needed for teardown

  {
    "comm_id" : "u-u-i-d",
    "data" : {}
  }
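On the kernel side, handling these messages amounts to registering a target name and callbacks. The method names below (register, addMsgHandler, addCloseHandler) are assumptions recalled from the project's Comm documentation; verify them against the repository:

  // Sketch of kernel-side Comm handling; names are assumptions and may not
  // match the current API exactly.
  kernel.comm.register("my_comm")
    .addMsgHandler { (writer, commId, data) =>
      writer.writeMsg(data) // push a response back out over the comm link
    }
    .addCloseHandler { (writer, commId, data) =>
      println(s"comm $commId closed") // teardown bookkeeping
    }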
41. Summary
• Goal: provide an API that enables interactive Spark applications
• The kernel provides a responsive API for using Apache Spark
  • Submit code snippets in the same fashion as the Spark Shell
  • Use the Comm API for programmatically defined messages
• The kernel implements the IPython message protocol
  • Works with IPython notebooks out of the box
• Repository: https://github.com/ibm-et/spark-kernel