Scaling Spark – Vertically: The mantra of Spark technology is divide and conquer, especially for problems too big for a single computer. The more you divide a problem across worker nodes, the more total memory and processing parallelism you can exploit. But this comes with a trade-off: splitting applications and data across multiple nodes is nontrivial, and more distribution means more network traffic, which becomes a bottleneck. Can you achieve scale and parallelism without those costs?
We’ll show results from a variety of Spark application domains, including structured data, graph processing, and common machine learning workloads, on a single high-capacity scaled-up system versus a more distributed approach, and discuss how virtualization can be used to define node size flexibly to achieve the best balance for Spark performance.
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
1. SCALE | SIMPLIFY | OPTIMIZE | EVOLVE
Comparing a Virtual Supercomputer with a Cluster for Spark In-Memory Computations
Ike Nassi
Ike.nassi@tidalscale.com
2. Why Run Spark?
Spark originated as an in-memory alternative to Hadoop
Run huge analytics on clusters of commodity servers
Enjoy the hardware economy of “scale-out”
Apply a rich set of transformations and actions
Operate out of memory as much as possible
3. Today’s Conundrum: Scale Up vs. Scale Out?
                      Scale Up   Scale Out
Software Simplicity      ✔           ✗
HW Cost                  ✗           ✔
4. TidalScale – The Best of Both
Software Simplicity   HW Cost
         ✔               ✔
Easy to say, but this is a ridiculously difficult problem!
5. Key takeaways
Simplicity of scale-up:
• We allow the simplicity of scale-up – you can run multi-terabyte analytics on a single Spark node.
Scale-out “under the hood”:
• We offer a new class of virtual supercomputers to host Spark – we hide the complexity of scale-out “under the hood”.
6. Traditional Spark in two layers
Programming Paradigm
RDD – Resilient Distributed Dataset / DataFrame
Parallel in-memory execution
Lazy, repeatable evaluation thanks to recorded lineage (narrow and “wide” dependencies)
Rich set of operators beyond just Map-Reduce
Implementation Plumbing
Clusters – standalone, Mesos, YARN
Data – HDFS, DataFrames
Memory management
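To make the paradigm layer concrete, here is a minimal, hedged Scala sketch of the style of program the bullets above describe: lazy transformations recorded as lineage, executed only when an action runs. The input path and word-count logic are illustrative, not part of the original deck.

// Sketch of the RDD programming paradigm: transformations are lazy and
// recorded as lineage; nothing runs until an action forces evaluation.
// The input path is hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object ParadigmSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("paradigm-sketch"))

    val counts = sc.textFile("hdfs:///data/events.txt")   // hypothetical dataset
      .flatMap(_.split("\\s+"))                           // transformation (narrow)
      .map(word => (word, 1))                             // transformation (narrow)
      .reduceByKey(_ + _)                                 // wide dependency: needs a shuffle

    // The action triggers evaluation of the whole lineage, in memory where possible.
    counts.take(10).foreach(println)

    sc.stop()
  }
}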
7. Alternative Spark in two layers
Programming Paradigm
RDD – Resilient Distributed Dataset / DataFrame
Parallel in-memory execution
Lazy, repeatable evaluation thanks to recorded lineage (narrow and “wide” dependencies)
Rich set of operators beyond just Map-Reduce
TidalScale as alternate plumbing!
8. Today’s Spark cluster with multiple nodes
[Diagram: the Spark application runs on a manager node (cluster manager on its own operating system and hardware) and dispatches work to executors on worker nodes, each with its own OS and hardware.]
9. Virtual Supercomputer running Spark
[Diagram: the Spark application, cluster manager, and a single standard operating system (Linux, FreeBSD, or Windows) run on top of a layer of HyperKernels spanning the physical servers. The application draws from a pool of processors and JVMs in a single coherent memory space; the OS sees a collection of cores, disks, and networks in one huge address space.]
10. A tale of two approaches
Feature                        Scale out under the hood              Scale out with worker nodes
Organization                   One super-node                        Cluster of worker nodes
Cross-connect                  10Gb Ethernet                         TCP/IP
Shared variables and shuffle   Across JVMs in one address space      Across distinct nodes
RDD partitioning               See shuffle                           See shuffle
Scale out                      Add servers “under the hood”          Add servers to the cluster
Scale up                       Scale-out creates a bigger computer   None
Reuse                          Run any application                   Other cluster techs like Hadoop
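The “shared variables and shuffle” row is the key performance difference between the two approaches, so here is a short, hedged Scala sketch of the two mechanisms it refers to: a broadcast variable shared across executor JVMs, and a shuffle that moves records between partitions. The data and names are illustrative only.

// Illustration of the two cross-partition mechanisms named in the table:
// broadcast variables and shuffles. Data is made up.
import org.apache.spark.{SparkConf, SparkContext}

object ShareAndShuffleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("share-and-shuffle"))

    // Broadcast: one read-only copy per executor JVM. In one coherent address
    // space this stays local; across worker nodes it crosses the network.
    val weights = sc.broadcast(Map("a" -> 0.5, "b" -> 1.5))

    val scored = sc.parallelize(Seq(("a", 2.0), ("b", 3.0), ("a", 4.0)))
      .map { case (k, v) => (k, v * weights.value.getOrElse(k, 1.0)) }

    // Shuffle: reduceByKey repartitions by key, moving records between partitions.
    scored.reduceByKey(_ + _).collect().foreach(println)

    sc.stop()
  }
}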
11. Experiment Setup
SynthBenchmark from the Apache Spark examples (Apache.org)
• git://git.apache.org/spark.git (spark-1.6.1-bin-hadoop2.6.tgz)
• Applies the PageRank algorithm to a generated graph
• Benchmark scaled from 15GB to 150GB by varying the number of vertices
Scale Out Spark Configuration on EC2:
• 1 Master: EC2 r3.2xlarge (8 vCPUs, 61GB)
• 5 Workers: EC2 r3.xlarge (4 vCPUs, 28.5GB)
• 4 Intel E5-2670 CPUs × 5 servers = 20 CPUs total allocated to Spark
Scale Up Spark Configuration on TidalScale:
• TidalScale TidalPod with 5 nodes
• 20 Intel E5-2643 v3 CPUs allocated to Spark
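For orientation, the following is a hedged Scala/GraphX sketch of the shape of the benchmarked workload: generate a synthetic log-normal graph and run a fixed number of PageRank iterations. It is not the SynthBenchmark source itself; the vertex count, iteration count, and partition count are placeholders.

// Sketch only: synthetic graph + PageRank, approximating the benchmarked job.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.PartitionStrategy
import org.apache.spark.graphx.util.GraphGenerators

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch"))

    // numEParts is the "edge partitions" knob discussed on the next slide.
    val graph = GraphGenerators
      .logNormalGraph(sc, numVertices = 1000000, numEParts = 10)
      .partitionBy(PartitionStrategy.EdgePartition2D)
      .cache()

    // Fixed-iteration PageRank; the action at the end forces evaluation.
    val ranks = graph.staticPageRank(numIter = 10).vertices
    println(s"Rank sum: ${ranks.map(_._2).sum()}")

    sc.stop()
  }
}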
13. Experiment Setup
* Note: The number of edge partitions in this example has been set to a fixed constant across all workload sizes for illustrative purposes. Normal practice is to vary the number of edge partitions based on workload size.
                         A: EC2 Small   B: EC2 Big   C: TidalScale Big   D: TidalScale Bigger
Cluster Configuration    10 nodes       5 nodes      1 TidalPod          1 TidalPod
Edge partitions *        10             10           10                  20
Spark.worker.instances   10             5            1                   1
Spark.worker.cores       20             20           20                  20
Spark Memory per Node    10GB           28GB         140GB               300GB
Total Spark Memory       100GB          140GB        140GB               300GB
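As a rough illustration (not from the deck) of how a configuration like column B could be expressed on the application side, the sketch below uses the standard spark.executor.memory and spark.cores.max properties; the worker-instance and worker-core counts in the table are normally set on the standalone workers themselves, and the values here are placeholders.

// Illustrative only: application-side settings approximating configuration B.
import org.apache.spark.{SparkConf, SparkContext}

object ConfigSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("pagerank-config-sketch")
      .set("spark.executor.memory", "28g") // per-node Spark memory, as in column B
      .set("spark.cores.max", "20")        // cap on total cores claimed by the app

    val sc = new SparkContext(conf)
    // ... build the synthetic graph and run PageRank as in the earlier sketch ...
    sc.stop()
  }
}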
17. Experiment Observations
Tuning Spark is complex
• We spent most of our time tuning Spark parameters
• We are not sure we’ve tuned optimally for either the EC2 distributed Spark cluster or the TidalScale standalone case, but the parameters were the same in both
Choice of the number of data partitions really matters
• A suboptimal choice can have 2-3x performance impact
• We used 10 edge partitions for both the EC2 and TidalScale configurations (see the sketch below)
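A hedged Scala sketch of the knob these bullets describe: the same aggregation run with the partition count passed explicitly to the shuffle. The counts and data are made up; the point is only that the value is a deliberate choice with a real performance impact.

// Illustration of choosing the number of partitions for a shuffle.
import org.apache.spark.{SparkConf, SparkContext}

object PartitionChoiceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-choice"))

    val pairs = sc.parallelize(1 to 1000000).map(i => (i % 1000, 1L))

    // Too few partitions limits parallelism; too many adds scheduling and
    // shuffle overhead. Here the count is passed explicitly to the shuffle.
    val coarse = pairs.reduceByKey(_ + _, numPartitions = 10)
    val fine   = pairs.reduceByKey(_ + _, numPartitions = 200)

    println(s"coarse partitions: ${coarse.getNumPartitions}, fine: ${fine.getNumPartitions}")
    sc.stop()
  }
}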
18. Possible mixed model with multi-terabyte manager
[Diagram: a mixed model. Conventional worker nodes (each with its own OS and hardware) run executors, while the Spark application and cluster manager run on a multi-terabyte “super manager”: a single operating system hosted by HyperKernels spanning several physical servers.]
19. Conclusions & Recommendations
Spark standalone on TidalScale performs similarly to a cluster
Without TidalScale, larger workloads can run out of memory without careful Spark tuning
We recommend using both scale up and scale out
20. Key messages – more obvious now?
A new class of virtual supercomputers to host Spark
Run multi-terabyte analytics on a single Spark node
21. Value Proposition
Scale:
• Aggregates compute resources for large scale in-memory analysis and decision support
• Scales like a cluster using commodity hardware at linear cost
• Allows customers to grow gradually as their needs develop
Simplify:
• Dramatically simplifies application development
• No need to distribute work across servers
• Existing applications run as a single instance, without modification, as if on a highly flexible mainframe
Optimize:
• Automatic dynamic hierarchical resource optimization
Evolve:
• Applicable to modern and emerging microprocessors, memories, interconnects, persistent storage & networks
I’m here to present a different approach to large-scale, in-memory computations.
Ike’s contact is on the last slide
With a little more time (25 mins.) we can set the context and thereby frame the discussion.
Scale-out – not much for me to add to Ike…
Spark has some 80 transformations (within a partition) and actions (often across partitions) that greatly enhance the original MapReduce of tools like Hadoop
It may seem sacrilegious (you’ll find your word) to address a group of Spark enthusiasts on the theme of a single huge node, but it’s a different way of thinking about the method
The remainder of this talk is about this different approach
…which we claim is completely in line with Spark’s direction
We like Spark and actively support the technology.
We think it’s useful to distinguish the powerful programming paradigm from the underlying implementation.
Our message is that there are different ways to achieve the same end.
[It may be a red herring here, but the power of the “80 operators” applied to RDDs is what makes Spark cool. This talk may not want to explore that.]
Here’s a stock Spark diagram…
The “driver program” runs on the manager, dispatching tasks to the executors.
[We try to give the generic message without being too Sales-y, YET.]
The moral of this table is that you can have your cake (parallelization with in-memory processing) and eat it (solve non-Spark problems), too.
The issue is where to scale out -- under the hood, invisibly to the operating system; or at the server level, over a network.
Spark shares variables and shuffles data across partitions – a key performance issue
The punchline is that when you scale up you get the benefit of reuse, the opportunity to run any demanding application, which is beneficial for experimentation.
You can show this slide or just talk through these points while showing the next slide:
SynthBenchmark benchmark from Apache.org
git://git.apache.org/spark.git (spark-1.6.1-bin-hadoop2.6.tgz)
Applies the PageRank algorithm to a generated graph
Benchmark scaled from 15GB to 150GB by number of vertices
Scale Out Spark Configuration on EC2:
1 Master: ec2 r3.2xlarge (8 cpus, 61G)
5 Workers: r3.xlarge (4 cpus, 28.5G)
4 Intel E5-2670 CPUs x 5 servers = 20 CPUs total allocated to Spark
Scale Up Spark Configuration on TidalScale:
TidalScale TidalPod with 5 nodes
20 Intel E5-2643 v3 CPUs allocated to Spark
Note: The TidalPod is booted with 224GB total – 200GB for Spark and 24GB for the OS. This means each physical node hosts about 45GB of the guest’s memory.
We ran the PageRank workloads in four tests:
A “EC2 Small” - 10 node EC2 using 15GB servers (total spark memory = 100GB)
B “EC2 Big” - 5 node EC2 using 31GB servers (total spark memory = 140GB)
C “TidalScale Big” - 5 node TidalPod with hardware equivalent to B
D “TidalScale Bigger” – 5 node TidalPod booted at 2.5TB
B “EC2 Big” and C “TidalScale Big” are the two to compare directly.
Time in seconds is on the Y axis; size of workload in memory is on the X axis.
The plot is log-log on both axes.
These two lines show the effect of more sharding – the 10 node EC2 config is slower than the 5 node EC2 config
At the larger sizes the jobs fail with Out of Memory errors on the worker nodes (denoted by the red box on each line).
The TidalPod “Big” result (“Big” meaning case C, configured with equivalent HW to the EC2 5 nodes config).
This shows similar performance between the two 5 node EC2 and TidalScale configurations (“B” – “EC2 Big” versus “C” – “TidalScale Big”).
For fun we tested a larger TidalScale single Spark instance to see if we could get further up the workload size – the config we show here is a 400GB Spark worker on a very large TidalPod with 20 edge partitions instead of 10.
The shape of the performance result line is different because of the effect of the greater number of edge partitions.
The job does NOT fail because of Out of Memory but because of a different Spark standalone-mode issue (according to one forum, a bug).
Tuning is time consuming
TidalScale can help you address Out of Memory challenges!
We’re committed to big data analytics as carried out in a variety of environments
More memory can expedite the assimilation of data from the workers
A more extreme example has virtual supercomputers for worker nodes
We give you flexibility in how you deploy node size in your spark applications
[Repeat the mantra in slightly different form to reinforce the message.]
Given the Spark context, here are some ground rules.
We see huge opportunity in the 80% solution up to 15TB. We’ll talk at the end about the realm of hundreds of terabytes and challenge problems.
One of the ground rules is to maintain the economy of scale-out. A multi-million dollar HPC-class machine is another conversation.
Goals that we’ve added to the discussion are simplicity of deployment and use, especially for one-off experiments, but also the versatility to support different problems that arise.
[I removed many words for readability. Still too many but the point isn’t to read every one.]
Our work here is the outcome of years of development
This is the punchline of my talk
TidalScale technology is where scale-up meets scale-out
Spark provides an excellent, if at first surprising, context for this conversation
Spark is migrating from its original model of multiple JVMs on distributed machines
…to a more bare-metal approach of JIT compiled code operating on memory allocated C-style