1. Copyright © 2015, Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted
Auto Scaling Systems With Elastic Spark Streaming
PhuDuc Nguyen
Consulting Engineer @ Oracle Data Cloud
2. Problem - streaming jobs with variable traffic patterns
● Our traffic pattern has natural peaks and valleys.
● We also experience unpredictable traffic spikes.
○ Breaking world news events - e.g., Brexit.
○ Some of our partners run batch jobs and send data at unknown times.
3. Various scenarios that result in “catch up mode”
● The streaming job has fallen behind real time. To catch up, throughput must exceed 1x so the job can process the backlog plus the live stream.
● Catch up from failure: hardware, software, network, etc. Data doesn’t stop just because your streaming job has. Deal with it. Be ready to catch up.
● Catch up from replay: pushing historical data back through the pipeline to fix a bug or apply a new feature/transformation retroactively.
● Unforeseen spikes in traffic.
● Underestimating the size of a streaming job.
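The “> 1x” requirement is easy to quantify: with a backlog of B events and a live arrival rate of r events/sec, catching up within T seconds requires a processing rate of r + B/T. A quick sketch with illustrative numbers (not from the talk):

```python
def required_throughput(backlog_events, arrival_rate, catchup_seconds):
    """Rate needed to drain a backlog while also keeping up with live traffic.
    Returns (events/sec, multiple of the live arrival rate)."""
    rate = arrival_rate + backlog_events / catchup_seconds
    return rate, rate / arrival_rate

# e.g. a 36M-event backlog at 10k events/sec live, to be cleared within one hour
rate, multiple = required_throughput(36_000_000, 10_000, 3_600)
# → 20,000 events/sec, i.e. 2x the live rate
```

The multiple tells you how much extra capacity to request before the job can return to steady state.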
4. Problems aren’t new - can use traditional methods to cope…
● Statically size the cluster/job to the maximum servers needed to handle the peak rate, plus some percentage of headroom.
○ Works, but wastes resources off-peak; the larger the difference between peaks and valleys, the larger the waste when idle.
● Manually restart the job, adjusting spark.cores.max=n.
○ Manually babysitting a cluster is painful. No one wants to wake up at 3am to grow or shrink a cluster.
5. What about backpressure?
● An async consumer signals the upstream producer to slow down. The Reactive Streams model is a brilliantly simple, elegant, and powerful API.
● The implementation in Spark Streaming is effective and works very well.
● You can provision “enough” servers, let backpressure kick in during peaks/spikes, and catch up later during a valley.
○ But this can mean falling behind, or lagging real-time data, while backpressure is active. If the job is required to stay real-time, buffering upstream may not be an option.
○ Once caught up, during a traffic valley, you’ll likely have idle/wasted resources.
○ A sustained spike above the “normal peak” lasting many days may exceed your upstream buffer - e.g., Brexit caused a very large spike in our traffic for about 10 days, but our Kafka retention window is only 5 days.
○ Ultimately, max throughput is capped by the number of servers the job has...and sometimes we simply need more servers.
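Spark Streaming’s rate-based backpressure is switched on via configuration; a minimal spark-defaults.conf fragment might look like this (the rate values are illustrative assumptions to be tuned per job, not recommendations from the talk):

```properties
# enable rate-based backpressure (available since Spark 1.5)
spark.streaming.backpressure.enabled        true
# initial receive rate before the rate estimator has data (assumed value)
spark.streaming.backpressure.initialRate    10000
# hard per-Kafka-partition ceiling, independent of backpressure (assumed value)
spark.streaming.kafka.maxRatePerPartition   20000
```

With backpressure alone, excess data queues upstream (in Kafka) rather than in Spark, which is exactly why the retention-window limit above matters.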
6. A solution: Elastic Spark Streaming
● Auto scale - the ability to add/remove servers dynamically without restarting the stream.
● The streaming job must automatically adjust to the demands of traffic.
● Maximize resource utilization - use servers when you need them, release them when you don’t. Don’t pay for idle resources.
7.-9. (diagram slides; no text content captured)
10. Implementation
● We collect “processing %” = totalProcessingTime / microbatchInterval.
● If processing % > 100%, that microbatch fell behind.
● If processing % < 100%, that microbatch is stable.
● Don’t act on a single data point. Collect a sliding window and act on the trend.
● Be careful not to create a system that constantly scales up and down within seconds or minutes - the streaming job appears indecisive and thrashes/flaps.
● Each streaming job has a unique workload, traffic pattern, and microbatch interval, and thus requires tunable/configurable thresholds.
○ Max threshold defaults to 100%; trigger scale-up after N data points have a median > max threshold.
○ Min threshold defaults to 75%; trigger scale-down after N data points have a median < min threshold and the queued batches are empty.
○ N is configurable.
○ Scale up and down by a configurable percentage of the current servers.
● sparkContext.requestExecutors(count) and sparkContext.killExecutor(executorId).
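The heuristic above can be sketched as a small decision helper. This is a hypothetical class (names and structure are ours, not from the talk); the defaults mirror the slide: max threshold 100%, min threshold 75%, a window of N data points, and a step sized as a percentage of current servers:

```python
from collections import deque
from statistics import median

class ScalingDecider:
    """Sliding-window median heuristic for elastic scaling (illustrative sketch)."""

    def __init__(self, window_size=10, max_threshold=1.00,
                 min_threshold=0.75, scale_pct=0.25):
        self.window = deque(maxlen=window_size)   # last N "processing %" values
        self.max_threshold = max_threshold
        self.min_threshold = min_threshold
        self.scale_pct = scale_pct

    def record(self, processing_time_ms, batch_interval_ms):
        # "processing %" for one microbatch
        self.window.append(processing_time_ms / batch_interval_ms)

    def decide(self, current_executors, queued_batches):
        """Return a signed executor delta: >0 scale up, <0 scale down, 0 hold."""
        if len(self.window) < self.window.maxlen:
            return 0  # not enough data points yet; avoid flapping
        m = median(self.window)
        step = max(1, int(current_executors * self.scale_pct))
        if m > self.max_threshold:
            return +step                            # falling behind: grow
        if m < self.min_threshold and queued_batches == 0:
            return -step                            # comfortably ahead: shrink
        return 0
```

A positive delta would map onto sparkContext.requestExecutors(delta); a negative one onto killExecutor calls for chosen executor IDs. Acting on the median of a full window, rather than the latest point, is what keeps the job from thrashing.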
11. Testing Elastic Streams
● Invariant: elasticity affects performance and resource utilization; it must never impact data integrity.
● Implement random scaling - wait, why?! Before using any heuristic to trigger elasticity, we want to prove that no matter how good or bad the heuristics are, elasticity can only affect performance. Even with the worst heuristics, we want to prove that no data loss or corruption occurs. Random scaling intentionally forces the streaming job into a state of thrashing/flapping, and we can observe that performance is much worse than with no scaling at all.
● A single command in the testing project does the following…
○ Build a new Mesos cluster.
○ Use one static Docker image that contains all input/test data.
○ Start 2 Spark Streaming jobs that ingest the same dataset and share the same processing logic: one job runs with no scaling, the other randomly scales and thrashes.
○ Start a final Spark job that ingests the output of the other 2 jobs, then performs a distinct and a diff across the datasets for every record and field. They must be equivalent; otherwise you’ve violated the invariant.
○ Tear down the Mesos cluster.
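The distinct-and-diff check can be sketched in miniature with plain Python sets standing in for the two jobs’ Spark output datasets (an assumption for illustration; the real comparison runs as a Spark job):

```python
def datasets_equivalent(output_a, output_b):
    """Invariant check: after de-duplication, the scaled and unscaled jobs'
    outputs must match on every record and field. Records are assumed to be
    hashable tuples here."""
    distinct_a, distinct_b = set(output_a), set(output_b)
    diff = distinct_a.symmetric_difference(distinct_b)  # records in one side only
    return len(diff) == 0, diff

ok, diff = datasets_equivalent(
    [("id1", "x"), ("id2", "y"), ("id2", "y")],  # duplicate from a replayed batch
    [("id2", "y"), ("id1", "x")],
)
# ok is True: the duplicate collapses under distinct, so the outputs agree
```

In Spark terms this corresponds to distinct() on each output followed by a two-way subtract/except; any surviving row is a violation of the invariant.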
12. Where do the Spark jobs add servers from and release to?
● We deploy our streaming jobs on a Mesos cluster, launched via Marathon.
● Each streaming job
○ Expands/contracts independently of the others - it has no knowledge of its neighbors.
○ Gains more servers from the Mesos cluster and releases them back to the Mesos cluster on demand.
● A lightweight app monitors used/unused resources on Mesos. If the Mesos cluster needs to grow, the app automatically triggers the automation to add Mesos workers from the cloud. When the Mesos cluster needs to shrink, the app automatically releases servers back to the cloud. Like nested balloons that expand and contract on demand...
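The lightweight monitor’s grow/shrink decision for the outer “balloon” might look like the following (function name, thresholds, and step size are all assumptions for illustration; the talk does not specify them):

```python
def worker_delta(used_cpus, total_cpus, grow_at=0.85, shrink_at=0.50, step=2):
    """Decide how many Mesos workers to add to or remove from the cloud pool,
    based on overall cluster CPU utilization (illustrative sketch)."""
    utilization = used_cpus / total_cpus
    if utilization > grow_at:
        return +step      # pool is tight: trigger automation to add workers
    if utilization < shrink_at:
        return -step      # pool is slack: release workers back to the cloud
    return 0              # within the comfort band: do nothing
```

The same hysteresis idea as the per-job heuristic applies here: the gap between the grow and shrink thresholds keeps the worker pool from flapping.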
13. Metrics
● A single ingest streaming job ingests over 5 TB/day.
● Its output feeds many other streaming jobs (over 30). The results of processing generate > 2x the ingest rate.
● At peak times, the single ingest job processes > 130k events/second.
● Roughly 500-800 servers in the cloud for our team alone - elasticity changes this number frequently.
● Our team alone spends $250k/month = $3M/year on infrastructure.
● 8 engineers are responsible for all of it: infrastructure, development, and testing.
14. Impact of Elastic Spark Streaming
● Operational gains
○ Minimize manual developer intervention
○ Devs focus on features, not babysitting
● Financial impact
○ Maximize resource utilization
○ If your cloud bill is in the millions, your savings can be in the millions!
15. Thanks for listening. Q&A?