Real Time Machine Learning
Visualization with Spark
Chester Chen
Director of Engineering
Alpine Data
March 13, 2016
COMPANY CONFIDENTIAL2
Who am I?
• Director of Engineering at Alpine Data
• Founder and Organizer of SF Big Analytics Meetup (3500+ members)
• Previous Employment:
– Architect / Director at Tinga, Symantec, AltaVista, Ascent Media, ClearStory
Systems, WebWare.
• Experience with Spark
– Exposed to Spark since Spark 0.6
– Architect for Alpine Spark Integration on Spark 1.1, 1.3 and 1.5.x
• Hadoop Distribution
– CDH, HDP and MapR
Alpine Data at a Glance
Enterprise Scale Predictive Analytics with deep experience in Machine Learning, Data Science, and
Distributed Data Architectures
Industry Innovations and IP
Broad patents awarded for in-cluster and in-database machine learning - 2012
First web-based solution for end-to-end Predictive analytics - 2012
Created Industry first integrated Analytics Services Platform - 2013
First Predictive Analytics solution to be certified on Spark - 2014
Launched Touchpoints, Industry first predictive applications service layer - 2015
Global Brand Names in Financial Services, Telco/Media, Healthcare, Manufacturing, Public Sector and Retail
Visionary in the Gartner Magic Quadrant for Advanced Analytics
Key Partners:
Real Time ML Visualization with Spark -- What is Spark
Lightning-fast cluster computing
http://spark.apache.org/
Iris data set, K-Means clustering with K=3
[Figure: scatter plot, Sepal width vs Petal length; legend: Cluster 0, Cluster 1, Cluster 2, Centroids]
Iris data set, K-Means clustering with K=3
[Figure: clustering plot annotated with "distance"]
What is K-Means?
• Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector,
• k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk}
• The clusters are determined by minimizing the within-cluster sum of squares (WCSS): the sum, over all clusters, of the squared distances from each point in a cluster to that cluster's center. In other words, the objective is to find
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
• where μi is the mean of points in Si.
• https://en.wikipedia.org/wiki/K-means_clustering
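As a rough illustration of the objective above, the cost for a given set of centers can be computed by assigning each point to its nearest center and summing the squared distances. The sketch below is plain Scala with no Spark dependency; `KMeansCostSketch`, `squaredDistance`, and `cost` are hypothetical names chosen for this example, not code from the talk or from Spark MLlib.

```scala
// Minimal sketch (assumption: points and centers fit in memory as arrays).
object KMeansCostSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Sum over all points of the squared distance to the nearest center.
  def cost(points: Seq[Point], centers: Seq[Point]): Double = {
    require(centers.nonEmpty, "at least one center is required")
    points.map(p => centers.map(c => squaredDistance(p, c)).min).sum
  }
}

object KMeansCostDemo extends App {
  val points  = Seq(Array(0.0, 0.0), Array(2.0, 0.0), Array(10.0, 0.0))
  val centers = Seq(Array(1.0, 0.0), Array(10.0, 0.0))
  println(KMeansCostSketch.cost(points, centers))
}
```

This is the quantity that a "Cost vs Iteration" chart tracks: it should shrink as the centers settle.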
Visualization Cost
[Figure: line chart "Cost vs Iteration"; x-axis: iteration, 0 to 25; y-axis: cost, 35 to 38.5]
Real Time ML Visualization – Why ?
• Use Cases
– Use visualization to determine whether to end the training early
• Need a way to visualize the training process including the
convergence, clustering or residual plots, etc.
• Need a way to stop the training and save current model
• Need a way to disable or enable the visualization
Real Time ML Visualization with Spark
DEMO
How to Enable Real Time ML Visualization ?
• A callback interface for Spark Machine Learning Algorithm to send messages
– Algorithms decide when and what message to send
– Algorithms don’t care how the message is delivered
• A task channel to handle the message delivery from Spark Driver to Spark Client
– It doesn’t care about the content of the message or who sent the message
• The message is delivered from Spark Client to Browser
– We use HTML5 Server-Sent Events (SSE) and HTTP Chunked Response (push)
– Pull is possible, but requires a message queue
• Visualization using JavaScript Frameworks Plot.ly and D3
Spark Job in Yarn-Cluster mode
[Diagram: Application Host (Command Line, Rest API, Servlet, Spark Client) and Hadoop Cluster (Yarn-Container running the Spark Driver: Spark Job, Spark Context, Spark ML algorithm)]
Spark Job in Yarn-Cluster mode
[Diagram: Application Host (Command Line, Rest API, Servlet, Spark Client) and Hadoop Cluster (Spark Job: App Context, Spark ML Algorithms, ML Listener, Message Logger)]
Spark Job in Yarn-Cluster mode
[Diagram: Browser; Application Host (Web/Rest API Server, Spark Client); Hadoop Cluster (Spark Job: App Context, Spark ML Algorithms, ML Listener, Message Logger); link labeled Akka]
Enable Real Time ML Visualization
[Diagram: Browser (Plotly, D3) receives SSE / Chunked Response from the Web Server (Rest API Server, Spark Client); the Spark Client is linked via Akka to the Spark Job in the Hadoop Cluster (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener)]
Machine Learning Listeners
Callback Interface: ML Listener
trait MLListener {
  // By-name parameter: the message is only built if a listener evaluates it
  def onMessage(message: => Any): Unit
}
Callback Interface: MLListenerSupport
trait MLListenerSupport {
  // rest of code
  def sendMessage(message: => Any): Unit = {
    if (listenerEnabled()) {
      listeners.foreach(l => l.onMessage(message))
    }
  }
}
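To make the callback idea concrete, here is a minimal, self-contained sketch of how the two traits could fit together. The slides elide the body of `MLListenerSupport`, so the listener list, the `enabled` and `stopRequested` flags, and `requestStop()` are assumptions made for this example; `Recorder` and `ToyTrainer` are hypothetical stand-ins for a real listener and a real algorithm.

```scala
import scala.collection.mutable.ListBuffer

trait MLListener {
  def onMessage(message: => Any): Unit
}

// Assumed implementation; the talk only shows sendMessage.
trait MLListenerSupport {
  private val listeners = ListBuffer.empty[MLListener]
  @volatile private var enabled = false
  @volatile private var stopRequested = false

  def addListener(l: MLListener): this.type = { listeners += l; this }
  def enableListener(flag: Boolean): this.type = { enabled = flag; this }
  def listenerEnabled(): Boolean = enabled
  def stopIteration: Boolean = stopRequested
  def requestStop(): Unit = { stopRequested = true }

  // The by-name message is not evaluated at all when the listener is disabled.
  def sendMessage(message: => Any): Unit =
    if (enabled) listeners.foreach(_.onMessage(message))
}

// Hypothetical listener that simply records what it receives.
class Recorder extends MLListener {
  val seen = ListBuffer.empty[Any]
  def onMessage(message: => Any): Unit = { seen += message; () }
}

// Hypothetical iterative "algorithm" that reports each iteration.
class ToyTrainer extends MLListenerSupport {
  def run(maxIterations: Int): Int = {
    var i = 0
    while (!stopIteration && i < maxIterations) {
      i += 1
      sendMessage(s"iteration $i")
    }
    i
  }
}
```

The algorithm decides when and what to send; it never learns how (or whether) the message is delivered, which is the separation the slides describe.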
KMeansExt: KMeans with MLListener
class KMeansExt private (…) extends Serializable
    with Logging
    with MLListenerSupport {
  ...
}
KMeansExt: KMeans with MLListener
case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)

private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
  ...
  // stopIteration lets the user end training early
  while (!stopIteration &&
         iteration < maxIterations && !activeRuns.isEmpty) {
    ...
    if (listenerEnabled()) {
      sendMessage(KMeansCoreStats(…))
    }
    ...
  }
}
KMeans Spark Job Setup
val kMeans = new KMeansExt().setK(numClusters)
  .setEpsilon(epsilon)
  .setMaxIterations(maxIterations)
  .enableListener(enableVisualization)
  .addListener(new KMeansListener(...))
appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))
kMeans.run(vectors)
KMeans ML Listener
class KMeansListener(columnNames: List[String],
                     data: RDD[Vector],
                     logger: MessageLogger) extends MLListener {
  // sampling the data
  override def onMessage(message: => Any): Unit = message match {
    case coreStats: KMeansCoreStats =>
      // use the KMeans model of the current iteration to predict the
      // cluster indexes of the sample
      // construct a message consisting of sample, cost, iteration and centroids
      // use the logger to send the message out
    case _ => // ignore other message types
  }
}
ML Task Observer
• Receives commands from the user to update a running Spark job
• Once it receives an UpdateTaskCmd through a notify call, it performs the necessary update operation
trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}
class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger)
    extends TaskObserver {
  // implement notify
}
Logistic Regression MLListener
class LogisticRegression(…) extends MLListenerSupport {
  def train(data: RDD[(Double, Vector)]): LogisticRegressionModel = {
    // initialization code
    val (rawWeights, loss) = OWLQN.runOWLQN( …)
    generateLORModel(…)
  }
}
Logistic Regression MLListener
object OWLQN extends Logging {
  def runOWLQN(/*args*/, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
    // the cost function carries the listener support so it can report progress
    val costFun = new CostFun(data, mlSupport, IterationState(), /*other args */)
    val states: Iterator[lbfgs.State] =
      lbfgs.iterations(
        new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector
      )
    …
  }
}
Logistic Regression MLListener
In the cost function:
override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
  val shouldStop = mlSupport.exists(_.stopIteration)
  if (!shouldStop) {
    …
    // foreach rather than map: the call is made only for its side effect
    mlSupport.filter(_.listenerEnabled()).foreach { s =>
      s.sendMessage((iState.iteration, w, loss))
    }
    …
  } else {
    …
  }
}
Task Communication Channel
Task Channel : Akka Messaging
[Diagram: Spark Application containing an Application Context (Actor System with a Messager Actor and a Task Channel Actor) and the SparkContext with its Spark tasks; links labeled Akka]
Task Channel : Akka messaging
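[Diagram: same components as the "Enable Real Time ML Visualization" slide: Browser (Plotly, D3), Web Server (Rest API Server, Spark Client), and the Spark Job in the Hadoop Cluster (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener), connected by SSE / Chunked Response and Akka]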
Push To The Browser
HTTP Chunked Response and SSE
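[Diagram: same components as the "Enable Real Time ML Visualization" slide: Browser (Plotly, D3), Web Server (Rest API Server, Spark Client), and the Spark Job in the Hadoop Cluster (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener), connected by SSE / Chunked Response and Akka]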
HTML5 Server-Sent Events (SSE)
• Server-Sent Events (SSE) is one-way messaging
– An event is when a web page automatically gets an update from the server
• Register an event source (JavaScript)
var source = new EventSource(url);
• The callback onmessage(message)
source.onmessage = function(message){...}
• Data format: each line starts with "data: " and ends with \n; a blank line (\n\n) ends the event
data: {\n
data: "key" : "value"\n
data: }\n\n
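The wire format above can be produced mechanically on the server side. The following is a small sketch, not code from the talk: `SseFormat` and `toEvent` are hypothetical names, and the function handles only the `data:` field of the SSE format (no `event:`, `id:`, or `retry:` fields).

```scala
// Minimal sketch: turn a (possibly multi-line) payload into one SSE event.
object SseFormat {
  // Every payload line gets its own "data: " prefix; the trailing blank
  // line ("\n\n") tells the browser that the event is complete.
  def toEvent(payload: String): String =
    payload
      .split("\r?\n", -1)
      .map(line => s"data: $line")
      .mkString("", "\n", "\n\n")
}

object SseFormatDemo extends App {
  print(SseFormat.toEvent("{\n\"key\": \"value\"\n}"))
}
```

On the browser side, `EventSource` joins the `data:` lines of one event with newlines before handing the result to `onmessage`, so a multi-line JSON payload arrives as a single string.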
HTTP Chunked Response
• Spray REST server supports chunked responses
val responseStart =
  HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n"))
requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)
val nextChunk = MessageChunk(s"data: $r \n\n")
requestCtx.responder ! nextChunk.withAck(Messages.Ack)
requestCtx.responder ! MessageChunk(s"data: Finished \n\n")
requestCtx.responder ! ChunkedMessageEnd
Push vs. Pull
Push
• Pros
– The data is streamed (pushed) to browser via chunked response
– There is no need for data queue, but the data can be lost if not consumed
– Multiple pages can be pushed at the same time, which allows multiple
visualization views
• Cons
– With a slow network, a slow browser and fast data iterations, the data might all
show up in the browser at once, rather than showing a nice iteration-by-iteration
display
– If you control the chunked response by network acknowledgement,
the visualization may not show up at all, because the data is not pushed while
the acknowledgement is delayed by the slow network
Push vs. Pull
Pull
• Pros
– Message does not get lost, since it can be temporarily stored in the
message queue
– The visualization will render at an even pace
• Cons
– Need to periodically send a request to the server for updates
– We will need a message queue to hold messages until they are consumed
– Hard to support rendering on multiple pages with a simple message
queue
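The pull model can be sketched with a simple in-memory queue: the Spark side publishes messages, and each browser request drains up to a fixed number of them. This is only an illustration under stated assumptions; `PullBuffer`, `publish`, and `poll` are hypothetical names, and the talk does not say which queue implementation would be used.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal sketch of a message buffer for the pull model (hypothetical).
class PullBuffer[A] {
  private val queue = new ConcurrentLinkedQueue[A]()

  // Called by the producer (e.g. the task channel) for every new message.
  def publish(msg: A): Unit = { queue.add(msg); () }

  // Called once per browser request: removes and returns at most `max`
  // messages, oldest first. Returns an empty list when nothing is pending.
  def poll(max: Int): List[A] =
    Iterator
      .continually(Option(queue.poll()))
      .take(max)
      .takeWhile(_.isDefined)
      .collect { case Some(msg) => msg }
      .toList
}
```

Because `poll` removes what it returns, two pages polling the same buffer would each see only part of the stream, which is the multi-page limitation listed above.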
Visualization: Plot.ly + D3
[Figures: plots titled "Cost vs. Iteration", "ArrTime vs. Distance" and "ArrTime vs. DepTime"; screenshot labeled "Alpine Workflow"]
Use Plot.ly to render graph
function showCost(dataParsed) {
  var costTrace = { … };
  var data = [ costTrace ];
  var costLayout = {
    xaxis: {…},
    yaxis: {…},
    title: …
  };
  // 'cost' is the id of the page element that holds the chart
  Plotly.newPlot('cost', data, costLayout);
}
Real Time ML Visualization: Summary
• Training a machine learning model involves a lot of experimentation,
so we need a way to visualize the training process.
• We presented a system to enable real time machine learning
visualization with Spark:
– Gives visibility into the training of a model
– Allows us to monitor the convergence of the algorithms during training
– Lets us stop the iterations when convergence is good enough.
Thank You
Chester Chen
chester@alpinenow.com
LinkedIn
https://www.linkedin.com/in/chester-chen-3205992
SlideShare
http://www.slideshare.net/ChesterChen/presentations
demo video
https://youtu.be/DkbYNYQhrao

Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3 2018 02-08-what's-new-in-apache-spark-2.3
2018 02-08-what's-new-in-apache-spark-2.3
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 

Último

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 

Último (20)

WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
WSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security ProgramWSO2CON 2024 - How to Run a Security Program
WSO2CON 2024 - How to Run a Security Program
 
WSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AIWSO2CON 2024 Slides - Unlocking Value with AI
WSO2CON 2024 Slides - Unlocking Value with AI
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
WSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaSWSO2CON 2024 Slides - Open Source to SaaS
WSO2CON 2024 Slides - Open Source to SaaS
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
Abortion Pill Prices Boksburg [(+27832195400*)] 🏥 Women's Abortion Clinic in ...
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 

Real Time Machine Learning Visualization With Spark

  • 1. Real Time Machine Learning Visualization with Spark. Chester Chen, Director of Engineering, Alpine Data. March 13, 2016
  • 2. Who am I?
      • Director of Engineering at Alpine Data
      • Founder and Organizer of SF Big Analytics Meetup (3500+ members)
      • Previous employment:
          – Architect / Director at Tinga, Symantec, AltaVista, Ascent Media, ClearStory Systems, WebWare
      • Experience with Spark:
          – Exposed to Spark since Spark 0.6
          – Architect for Alpine Spark Integration on Spark 1.1, 1.3 and 1.5.x
      • Hadoop distributions:
          – CDH, HDP and MapR
  • 3. Alpine Data at a Glance
      Enterprise-scale predictive analytics with deep experience in machine learning, data science, and distributed data architectures.
      Industry innovations and IP:
      • Broad patents awarded for in-cluster and in-database machine learning (2012)
      • First web-based solution for end-to-end predictive analytics (2012)
      • Created the industry's first integrated Analytics Services Platform (2013)
      • First predictive analytics solution to be certified on Spark (2014)
      • Launched Touchpoints, the industry's first predictive applications service layer (2015)
      Global brand names in financial services, telco/media, healthcare, manufacturing, public sector and retail.
      Visionary in the Gartner Magic Quadrant for Advanced Analytics.
      Key Partners:
  • 4. Real Time ML Visualization with Spark: What is Spark? "Lightning-fast cluster computing" (http://spark.apache.org/)
  • 5. Iris data set, K-Means clustering with K=3. [Plot: sepal width vs. petal length; legend: Cluster 0, Cluster 1, Cluster 2, centroids]
  • 6. Iris data set, K-Means clustering with K=3. [Plot annotation: distance]
  • 7. What is K-Means?
      • Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector,
      • k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk}.
      • The clusters are determined by minimizing the within-cluster sum of squares (WCSS), that is, the sum of squared distances from each point in a cluster to that cluster's center. In other words, the objective is to find
        $$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
      • where μi is the mean of the points in Si.
      • https://en.wikipedia.org/wiki/K-means_clustering
  • 9. Real Time ML Visualization: Why?
      • Use cases
          – Use visualization to determine whether to end the training early
      • Need a way to visualize the training process, including convergence, clustering or residual plots, etc.
      • Need a way to stop the training and save the current model
      • Need a way to disable or enable the visualization
  • 10. Real Time ML Visualization with Spark: DEMO
  • 11. How to Enable Real Time ML Visualization?
      • A callback interface for Spark machine learning algorithms to send messages
          – Algorithms decide when to send a message and what it contains
          – Algorithms don't care how the message is delivered
      • A task channel to handle message delivery from the Spark driver to the Spark client
          – It doesn't care about the content of the message or who sent it
      • The message is delivered from the Spark client to the browser
          – We use HTML5 Server-Sent Events (SSE) and HTTP chunked response (push)
          – Pull is possible, but requires a message queue
      • Visualization using the JavaScript frameworks Plot.ly and D3
  • 12. Spark Job in Yarn-Cluster mode. [Diagram labels: Spark Client; Hadoop Cluster; Yarn-Container; Spark Driver; Spark Job; Spark Context; Spark ML algorithm; Command Line; Rest API; Servlet; Application Host]
  • 13. Spark Job in Yarn-Cluster mode. [Diagram labels: Spark Client; Hadoop Cluster; Command Line; Rest API; Servlet; Application Host; Spark Job; App Context; Spark ML Algorithms; ML Listener; Message Logger]
  • 14. Spark Job in Yarn-Cluster mode. [Diagram labels: Spark Client; Hadoop Cluster; Application Host; Spark Job; App Context; Spark ML Algorithms; ML Listener; Message Logger; Web/Rest API Server; Akka; Browser]
  • 15. Enable Real Time ML Visualization. [Diagram labels: SSE; Plotly; D3; Browser; Rest API Server; Web Server; Spark Client; Hadoop Cluster; Spark Job; App Context; Message Logger; Task Channel; Spark ML Algorithms; ML Listener; Akka; Chunked Response]
  • 16. Enable Real Time ML Visualization. [Same diagram labels as slide 15]
  • 18. Callback Interface: ML Listener
        trait MLListener {
          def onMessage(message: => Any): Unit
        }
  • 19. Callback Interface: MLListenerSupport
        trait MLListenerSupport {
          // rest of code
          def sendMessage(message: => Any): Unit = {
            if (enableListener) {
              listeners.foreach(l => l.onMessage(message))
            }
          }
        }
  • 20. KMeansExt: KMeans with MLListener
        class KMeansExt private (…) extends Serializable with Logging with MLListenerSupport {
          ...
        }
  • 21. KMeansExt: KMeans with MLListener
        case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)
        private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
          ...
          while (!stopIteration && iteration < maxIterations && !activeRuns.isEmpty) {
            ...
            if (listenerEnabled()) {
              sendMessage(KMeansCoreStats(…))
            }
            ...
          }
        }
  • 22. KMeans Spark Job Setup
        val kMeans = new KMeansExt().setK(numClusters)
          .setEpsilon(epsilon)
          .setMaxIterations(maxIterations)
          .enableListener(enableVisualization)
          .addListener(new KMeansListener(...))
        appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))
        kMeans.run(vectors)
  • 23. KMeans ML Listener
        class KMeansListener(columnNames: List[String], data: RDD[Vector], logger: MessageLogger) extends MLListener {
          // sampling the data
          def onMessage(message: => Any): Unit = message match {
            case coreStats: KMeansCoreStats =>
              // use the KMeans model of the current iteration to predict the sample's cluster indexes
              // construct a message consisting of sample, cost, iteration and centroids
              // use the logger to send the message out
          }
        }
  • 24. ML Task Observer
      • Receives commands from the user to update the running Spark job
      • Once it receives an UpdateTask command through the notify call, it performs the necessary update operation
        trait TaskObserver {
          def notify(task: UpdateTaskCmd)
        }
        class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger) extends TaskObserver {
          // implement notify
        }
  • 25. Logistic Regression MLListener
        class LogisticRegression(…) extends MLListenerSupport {
          def train(data: RDD[(Double, Vector)]): LogisticRegressionModel = {
            // initialization code
            val (rawWeights, loss) = OWLQN.runOWLQN(…)
            generateLORModel(…)
          }
        }
  • 26. Logistic Regression MLListener
        object OWLQN extends Logging {
          def runOWLQN(/*args*/, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
            val costFun = new CostFun(data, mlSupport, IterationState(), /*other args */)
            val states: Iterator[lbfgs.State] = lbfgs.iterations(
              new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector
            )
            …
          }
        }
  • 27. Logistic Regression MLListener. In the cost function:
        override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
          val shouldStop = mlSupport.exists(_.stopIteration)
          if (!shouldStop) {
            …
            mlSupport.filter(_.listenerEnabled()).foreach { s =>
              s.sendMessage((iState.iteration, w, loss))
            }
            …
          } else {
            …
          }
        }
  • 29. Task Channel: Akka Messaging. [Diagram labels: Spark Application; Application Context; Actor System; Messager Actor; Task Channel Actor; SparkContext; Spark tasks; Akka]
  • 30. Task Channel: Akka Messaging. [Same diagram labels as slide 15]
  • 32. HTTP Chunked Response and SSE. [Same diagram labels as slide 15]
  • 33. HTML5 Server-Sent Events (SSE)
      • Server-Sent Events (SSE) is one-way messaging
          – An event is when a web page automatically gets an update from the server
      • Register an event source (JavaScript)
        var source = new EventSource(url);
      • The callback onMessage(data)
        source.onmessage = function(message){...}
      • Data format:
        data: {\n
        data: "key" : "value"\n
        data: }\n\n
  • 34. HTTP Chunked Response
      • Spray Rest Server supports chunked response
        val responseStart = HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n"))
        requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)
        val nextChunk = MessageChunk(s"data: $r \n\n")
        requestCtx.responder ! nextChunk.withAck(Messages.Ack)
        requestCtx.responder ! MessageChunk(s"data: Finished \n\n")
        requestCtx.responder ! ChunkedMessageEnd
  • 35. Push vs. Pull: Push
      • Pros
          – The data is streamed (pushed) to the browser via chunked response
          – There is no need for a data queue, but data can be lost if it is not consumed
          – Multiple pages can be pushed to at the same time, which allows multiple visualization views
      • Cons
          – With a slow network, a slow browser and fast data iterations, the data might all show up in the browser at once, rather than as a nice iteration-by-iteration display
          – If you control the chunked response by network acknowledgement, the visualization may not show up at all, because the data is not pushed while acknowledgements are slow
  • 36. Push vs. Pull: Pull
      • Pros
          – Messages do not get lost, since they can be temporarily stored in the message queue
          – The visualization renders at an even pace
      • Cons
          – Need to periodically send a request to the server for updates
          – Need a message queue to hold messages until they are consumed
          – Hard to support rendering on multiple pages with a simple message queue
  • 37. Visualization: Plot.ly + D3. [Plot titles: Cost vs. Iteration (two panels); ArrTime vs. Distance; ArrTime vs. DepTime; Alpine Workflow]
  • 38. Use Plot.ly to render graph
        function showCost(dataParsed) {
          var costTrace = { … };
          var data = [ costTrace ];
          var costLayout = {
            xaxis: {…},
            yaxis: {…},
            title: …
          };
          Plotly.newPlot('cost', data, costLayout);
        }
  • 39. Real Time ML Visualization: Summary
      • Training a machine learning model involves a lot of experimentation, so we need a way to visualize the training process.
      • We presented a system to enable real time machine learning visualization with Spark:
          – Gives visibility into the training of a model
          – Allows us to monitor the convergence of the algorithms during training
          – Lets us stop the iterations when convergence is good enough
  • 40. Thank You
        Chester Chen, chester@alpinenow.com
        LinkedIn: https://www.linkedin.com/in/chester-chen-3205992
        SlideShare: http://www.slideshare.net/ChesterChen/presentations
        Demo video: https://youtu.be/DkbYNYQhrao
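
As a rough illustration of the SSE data format on slide 33 and the chunks written on slide 34, the sketch below frames a payload as an event-stream message and parses it back. The helper names `formatSseEvent` and `parseSseStream` are made up for this example and are not part of the talk's code or of any library; in a browser, `EventSource` does the parsing itself.

```javascript
// Hypothetical helpers (not from the talk): frame and parse SSE messages.

// Turn a payload into one SSE event: every line of the serialized payload
// gets its own "data: " prefix, and a blank line terminates the event.
function formatSseEvent(payload) {
  const text = typeof payload === "string" ? payload : JSON.stringify(payload, null, 1);
  return text.split("\n").map((line) => "data: " + line).join("\n") + "\n\n";
}

// Parse a buffer of complete SSE events back into their data strings.
// Only the "data" field is handled; comments and other fields are ignored.
function parseSseStream(buffer) {
  const events = [];
  for (const block of buffer.split("\n\n")) {
    const dataLines = [];
    for (const line of block.split("\n")) {
      if (line.startsWith("data:")) {
        // A single leading space after the colon is not part of the value.
        const value = line.slice(5);
        dataLines.push(value.startsWith(" ") ? value.slice(1) : value);
      }
    }
    if (dataLines.length > 0) {
      events.push(dataLines.join("\n"));
    }
  }
  return events;
}

// Example: one K-Means iteration message, as the server might push it.
const wire = formatSseEvent({ iteration: 3, cost: 12.5 });
const received = parseSseStream(wire).map((data) => JSON.parse(data));
console.log(received);
```

Prefixing every line with `data: ` is what keeps a multi-line payload inside a single event, since only an empty line ends the event.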

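The pull alternative from slides 35 and 36 can be sketched as a small in-memory message log that the browser polls. This is only an illustration under assumptions: the class name `MessageLog` and its methods are hypothetical, and giving each poller its own offset is one possible way around the "multiple pages" limitation of a simple queue, not something the talk describes.

```javascript
// Hypothetical in-memory log for the pull model (not from the talk).
class MessageLog {
  constructor() {
    this.messages = [];
  }

  // Called on the server side whenever the Spark job reports an iteration.
  append(message) {
    this.messages.push(message);
  }

  // Called by a polling client: returns everything after `offset` together
  // with the offset to use in the next poll. Nothing is removed, so several
  // pages can read the same messages at their own pace.
  poll(offset) {
    const batch = this.messages.slice(offset);
    return { batch, nextOffset: offset + batch.length };
  }
}

const log = new MessageLog();
log.append({ iteration: 1, cost: 40.2 });
log.append({ iteration: 2, cost: 31.7 });

// Two independent pages polling the same log.
const pageA = log.poll(0);
log.append({ iteration: 3, cost: 30.9 });
const pageB = log.poll(0);
const pageAAgain = log.poll(pageA.nextOffset);
console.log(pageA.batch.length, pageB.batch.length, pageAAgain.batch.length);
```

The trade-off is that the log grows until the job ends; a real deployment would need some retention policy, which this sketch leaves out.
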
Editor's notes

  1. Steps: choose initial centers; compute the distance d from each point to each centroid and assign the point to the nearest one (min d); choose new centers; convergence is reached when the centroids no longer change.
  2. Once we define MLListenerSupport, we can gather stats at the initial, iteration and final steps and call: sendMessage(gatherKMeansStats(/*…*/))
  3. Turn into picture
  4. Two slides
  5. Two slides
  6. Share contact info? Link to slides again?
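
Notes 1 and 2 can be tied together in a small, single-machine sketch: a plain Lloyd-style K-Means loop that reports statistics to listeners on each iteration and can be stopped early. This is not the Spark `KMeansExt` implementation from the slides; the function `kMeansWithListener`, its options object and the `control.stop` flag are assumptions made for illustration. The message is passed as a function so that it is only built when the listener is enabled, mirroring the by-name `message: => Any` parameter.

```javascript
// Illustrative single-machine K-Means with a listener hook (not the Spark code).

function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return sum;
}

// Index of the center closest to `point`.
function nearest(point, centers) {
  let best = 0;
  for (let c = 1; c < centers.length; c++) {
    if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) best = c;
  }
  return best;
}

function kMeansWithListener(points, initialCenters, options) {
  const { maxIterations, listenerEnabled, listeners, control } = options;
  // Build the message lazily so a disabled listener costs nothing.
  const sendMessage = (makeMessage) => {
    if (listenerEnabled) {
      const message = makeMessage();
      listeners.forEach((listener) => listener(message));
    }
  };

  let centers = initialCenters.map((c) => c.slice());
  let iteration = 0;
  let converged = false;
  while (!control.stop && !converged && iteration < maxIterations) {
    // Assignment step: attach every point to its nearest center.
    const sums = centers.map((c) => c.map(() => 0));
    const counts = centers.map(() => 0);
    let cost = 0; // measured against the centers used for this assignment
    for (const p of points) {
      const idx = nearest(p, centers);
      cost += squaredDistance(p, centers[idx]);
      counts[idx] += 1;
      p.forEach((v, d) => { sums[idx][d] += v; });
    }
    // Update step: move each center to the mean of its points
    // (an empty cluster keeps its previous center).
    const next = centers.map((c, i) =>
      counts[i] === 0 ? c : sums[i].map((s) => s / counts[i]));
    converged = next.every((c, i) => squaredDistance(c, centers[i]) === 0);
    centers = next;
    iteration += 1;
    sendMessage(() => ({ iteration, centers, cost }));
  }
  return { centers, iteration, converged };
}

// Usage: two obvious clusters on a line; the listener records the cost curve.
const costs = [];
const result = kMeansWithListener(
  [[0], [1], [9], [10]],
  [[0], [1]],
  {
    maxIterations: 20,
    listenerEnabled: true,
    listeners: [(m) => costs.push(m.cost)],
    control: { stop: false },
  }
);
console.log(result.centers, costs);
```

A listener (or a user command routed the way the task observer does on slide 24) can set `control.stop = true` to end training after the current iteration, which is the "stop early and keep the current model" use case from slide 9.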