Shir Bromberg (Big Data team leader) @ Yotpo:
Nowadays, many of an organization’s main applications rely on Spark pipelines. As these applications become more significant to businesses, so does the need to quickly deploy, test and monitor them.
The standard way of running Spark jobs is to deploy them on a dedicated managed cluster. However, this solution is relatively expensive and can have a long setup time. Therefore, we developed a way to run Spark on any container orchestration platform. This lets us run Spark in a simple, customizable, and testable way.
In this talk, we will present our open-source Docker images for running Spark on Nomad servers. We will cover:
* The issues we had running Spark on managed clusters and the solution we developed.
* How to build a Spark Docker image.
* And finally, what you may achieve by using Spark on Nomad.
The benefits of running Spark on your own Docker
1. The Benefits of Running Spark on Docker
Shir Bromberg
Big Data Team Leader @ Yotpo
2. Agenda
● Motivation
● Spark on Docker
○ Solution overview
○ Building the Spark docker
○ Deploying your application using
“Spark on Nomad”
● Why it's better!
● Next steps
3. Motivation
● Many applications rely on Spark
● These applications also require:
○ unit testing
○ quality tests
○ analytics
○ …
● A year ago: we ran these jobs on an
on-demand managed cluster
○ Managed cluster = EMR (AWS) or
Dataproc (GCP)
4. Pain Points of Managed Clusters
● Setup time: long cluster startup
○ Tests should be quick
● Environment: installed packages on the
cluster must be managed manually
○ Using an AMI ⇒ coupling with AWS :(
● Pricing: pay per instance
● Slow releases
6. Existing Solutions
Option 1: Start a cluster for every test
Problem:
● Time-consuming
Option 2: Cluster that is always up
Problems:
● Environment must be able to run all
test cases
● Can be expensive
7. In an ideal world...
Our solution for working with Spark should follow these guidelines:
● Reduce cost
● Simple to use
● Better testability
● Monitoring
● Cloud agnostic
● Scalability
9. Spark in a nutshell
● An open-source framework for distributed data processing
● Provides high-level APIs in Scala, Java, Python, and R
Stack (bottom to top): data storage layer (S3/HDFS) → resource manager (YARN/Mesos/Standalone) → Spark Core → libraries (Streaming, SQL, GraphX, MLlib)
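For orientation, the same application code can run locally or on a cluster just by changing the master URL passed to spark-submit; the class and jar names here are placeholders:

```shell
# Run locally on all cores (handy for quick tests):
spark-submit --master "local[*]" --class com.example.MyApp my-app.jar

# ...or unchanged on a cluster resource manager such as YARN:
spark-submit --master yarn --class com.example.MyApp my-app.jar
```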
10. Docker in a nutshell
Enables you to package your code with its dependencies into a deployable unit
called a container.
Build flow: Dockerfile → build image → push/pull via Docker Hub
Runtime: OS → Docker engine → containers (several per host)
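The Dockerfile → image flow above can be sketched as follows; the base image, Spark version, and paths are illustrative assumptions, not Yotpo's actual Dockerfile:

```dockerfile
# Illustrative only: package a Spark distribution into an image.
FROM openjdk:8-jre-slim
# Copy an unpacked Spark distribution into the image (path is a placeholder).
COPY spark-2.4.0-bin-hadoop2.7 /opt/spark
ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:/opt/spark/bin
WORKDIR /opt/spark
```

Running `docker build -t my-spark .` then `docker push` yields the deployable unit any node can pull from a registry such as Docker Hub.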
11. Nomad in a nutshell
A container orchestration platform by HashiCorp.
● Server: resource scheduling, task scheduling, leader election
● Allocation request (via Nomad's API): image, instances, memory, CPU, ...
● Nomad cluster: a set of nodes; clients execute the scheduled tasks
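The allocation fields above (image, instances, memory, CPU) map onto a Nomad job specification roughly like this fragment; the job name, count, and resource values are illustrative, not from the talk:

```hcl
# Illustrative Nomad job spec (pre-1.0 syntax era of the talk).
job "spark-workers" {
  datacenters = ["dc1"]
  type        = "service"        # long-running; "batch" for one-off jobs

  group "workers" {
    count = 3                    # instances

    task "worker" {
      driver = "docker"
      config {
        image = "metorikku/spark"
      }
      resources {
        cpu    = 1000            # MHz
        memory = 2048            # MB
      }
    }
  }
}
```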
20. Building our own Spark Docker image
sounds difficult…
Apparently, it's not!
21. And we made it open source
GitHub:
https://github.com/YotpoLtd/metorikku/tree/master/docker/spark
Docker Hub:
https://hub.docker.com/r/metorikku/spark
22. Spark Components
● Driver
○ Runs the main() function of the
application
○ Distributes tasks among the
nodes
● Executor
○ A process that executes
multiple tasks
(Diagram: one Driver coordinating two Executors)
23. Spark Components
● Master
○ Manages the cluster's resources
● Worker
○ A process on each node that
launches the executors which
execute the tasks
● Standalone deploy mode
● Client mode
(Diagram: Master and Driver, with two Workers each running an Executor)
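In this standalone, client-mode setup, the submit step looks roughly like the following; the master hostname, class, and jar name are placeholder assumptions:

```shell
# Client mode: the driver runs in the submitting process,
# while executors are scheduled on the standalone workers.
spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --class com.example.MyJob \
  my-job.jar
```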
24. Spark Docker
● Submit Docker: submits the Spark
command
● Master Docker: runs the
spark-master service
● Worker Docker: runs the
spark-worker service
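A hedged sketch of how the three images could fit together, using Spark's standalone daemon classes; the entrypoints and paths inside the metorikku/spark image may differ (see the linked repo), and networking (a shared Docker network so "spark-master" resolves) is omitted for brevity:

```shell
# Master Docker: runs the spark-master service
docker run -d --name spark-master \
  metorikku/spark /opt/spark/bin/spark-class \
  org.apache.spark.deploy.master.Master

# Worker Docker: runs a spark-worker that registers with the master
docker run -d --name spark-worker \
  metorikku/spark /opt/spark/bin/spark-class \
  org.apache.spark.deploy.worker.Worker spark://spark-master:7077

# Submit Docker: submits the Spark command
# (class and jar are illustrative placeholders)
docker run --rm metorikku/spark /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --class com.example.MyApp my-app.jar
```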
31. Nomad configuration with ease
● Batch and service (streaming) job types
● Dynamic port allocation
● Spark UI address using the job name
● Auto restart/reschedule
● Unified environments, no more
bootstrap actions
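Dynamic ports and auto restart come from standard Nomad job stanzas; an illustrative fragment (assuming the pre-1.0 job spec of the talk's era, where the network block sits under resources):

```hcl
group "master" {
  restart {
    attempts = 3
    interval = "5m"
    delay    = "30s"
    mode     = "delay"          # auto restart failed tasks
  }

  task "spark-master" {
    driver = "docker"
    config {
      image = "metorikku/spark"
    }
    resources {
      network {
        port "ui" {}            # Nomad picks a free port for the Spark UI
      }
    }
  }
}
```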
32. What About Kubernetes?
Spark can run on clusters managed by Kubernetes.
● Pros:
○ Built-in support in Spark
○ K8s is aware that it's running Spark
● Cons:
○ We are not using K8s (we work with
HashiCorp tools)
○ Nomad is simpler
33. Have we met your expectations?
● Reduce cost
● Simple to use
● Better testability
● Monitoring
● Cloud agnostic
● Scalability
34. Reduce cost
● Resource sharing: better optimized
● vs. EMR pricing: $3,500/month ($42K/year)
35. Simple to use
● Writing and deploying Spark jobs is easier than ever
36. Better testability
● Setup time: a cluster of 100 nodes in 1-2 minutes!
● Run locally or in a CI environment for tests
37. Monitoring
● Observability "for free" on the existing orchestration platform infra
38. Cloud agnostic
● Can be used in any environment (on-premises or any cloud provider)
39. Scalability
● Auto scaling using orchestration tools (Libra)