Hortonworks presentation at the Boulder/Denver Big Data Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN. Covers Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, and tuning.
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
1. Scaling Spark Workloads on YARN
Boulder/Denver Big Data
Shane Kumpf & Mac Moore
Solutions Engineers, Hortonworks
July 2015
2. Agenda
• Introduction
  – Why we love Spark, Spark Strategy, What's Next
• YARN: The Data Operating System
• Spark: Processing Internals Review
• Spark on YARN
• Demo: Scaling Spark on YARN in the cloud
• Q & A
3. Why We Love Spark at Hortonworks
Made for Data Science
• All apps need to get predictive at scale and fine granularity
Democratizes Machine Learning
• Spark is doing for ML on Hadoop what Hive did for SQL on Hadoop
Elegant Developer APIs
• DataFrames, Machine Learning, and SQL
Realize Value of the Data Operating System
• A key tool in the Hadoop toolbox
Community
• Broad developer, customer, and partner interest
[Diagram: YARN Data Operating System, spanning storage, resource management, governance, security, and operations]
4. Data Operating System: Open Enterprise Hadoop
Hadoop/YARN-powered data operating system
100% open source, multi-tenant data platform for any application, any dataset, anywhere.
Built on a centralized architecture of shared enterprise services:
• Scalable tiered storage
• Resource and workload management
• Trusted data governance and metadata management
• Consistent operations
• Comprehensive security
• Developer APIs and tools
5. Themes for Spark Strategy
Spark is made for Data Science
• Lead in the community for ML optimization
• Data Science theme of Spark Summit / Hadoop Summit
Provide notebooks for data exploration & visualization
• IPython Ambari stack
• Zeppelin: we're very excited about this project
Process more Hadoop data efficiently in Spark
• Hive/ORC data delivered, HBase work in progress
Innovate at the core
• Security, Spark on YARN improvements, and more
6. Current State of Security in Spark
Only Spark on YARN supports Kerberos today
• Leverage Kerberos for authentication
Spark reads data from HDFS & ORC
• HDFS file permissions (& Ranger integration) apply to Spark jobs
Spark submits jobs to a YARN queue
• YARN queue ACLs (& Ranger integration) apply to Spark jobs
Wire encryption
• Spark has some coverage; not all channels are covered
LDAP authentication
• No authentication on the Spark UI out of the box; supports a filter for hooking in LDAP
7. What about ORC support?
ORC: Optimized Row Columnar format
ORC is an Apache TLP providing columnar storage for Hadoop
Spark ORC support
• ORC support in HDP/Spark since 1.2.x (alpha)
• ORC support merged into Apache Spark in 1.4
• Joint blog with Databricks @ hortonworks.com
• Changes between ORC in Spark 1.3.1 and Spark 1.4.1
• ORC now uses the standard API to read/write
orc.apache.org
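To make this concrete, here is a minimal sketch of reading and writing ORC data through the DataFrame API added in Spark 1.4, assuming a Hive-enabled context from spark-shell; the table name and output path below are hypothetical.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc is provided by spark-shell

// Write a DataFrame out in ORC format (hypothetical table and path)
val people = hiveContext.table("people")
people.write.format("orc").save("/apps/hive/warehouse/people_orc")

// Read the ORC files back as a DataFrame
val df = hiveContext.read.format("orc").load("/apps/hive/warehouse/people_orc")
df.printSchema()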
9. Apache Zeppelin
Features
• A web-based notebook for interactive analytics
• Ad-hoc experimentation with Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc.
• Deeply integrated with Spark and Hadoop
• Can be managed via Ambari Stacks
• Supports multiple language backends
• Pluggable "interpreters"
• Incubating at Apache
• 100% open source and open community
Use Cases
• Data exploration & discovery
• Visualization: tables, graphs, charts
• Interactive snippet-at-a-time experience
• Collaboration and publishing
• A "modern data science studio"
10. Where can I find more?
• Arun Murthy's keynotes at Hadoop Summit & Spark Summit
  – Hadoop Summit (http://bit.ly/1IC1BEG)
  – Spark Summit (http://bit.ly/1M7qw47)
• Data Science with Spark & Zeppelin session at Hadoop Summit
  – http://bit.ly/1DdKeTs
• Data Science with Spark + Zeppelin blog
  – http://bit.ly/1HFd545
• ORC Support in Spark blog
  – http://bit.ly/1OkA1uU
12. YARN Introduction
The Architectural Center
• YARN moved Hadoop "beyond batch": run batch, interactive, and real-time applications simultaneously on shared hardware.
• Intelligently places workloads on cluster members based on resource requirements, labels, and data locality.
• Runs user code in containers, providing isolation and lifecycle management.
[Diagram: Hortonworks Data Platform 2.2. YARN (cluster resource management) runs over HDFS and hosts batch, interactive, and real-time data access engines (Pig, Hive, Cascading, HBase, Accumulo, Solr, Spark, Storm, Sqoop, Flume, Kafka), alongside governance (Falcon), security (Ranger, Knox), and operations (Ambari, ZooKeeper, Oozie)]
13. YARN Architecture - Overview
ResourceManager
• Global resource scheduler
NodeManager
• Per-machine agent
• Manages container life-cycle and resource monitoring
Container
• Basic unit of allocation
• Fine-grained resource allocation across multiple resource types (memory, CPU; future: disk, network, GPU, etc.)
ApplicationMaster
• Per-application master that manages application scheduling and task execution
• E.g. the MapReduce ApplicationMaster
14. YARN Concepts
• Application
  – An application is a job or a long-running service submitted to YARN
  – Examples:
    – Job: MapReduce job
    – Service: HBase cluster
• Container
  – Basic unit of allocation
    – MapReduce map or reduce task
    – HBase HMaster or RegionServer
  – Fine-grained resource allocations
    – container_0 = 2 GB, 1 CPU
    – container_1 = 1 GB, 6 CPU
  – Replaces the fixed map/reduce slots from Hadoop 1
15. YARN Resource Request
Resource Model
• Ask for a specific amount of resources (memory, CPU, etc.) on a specific machine or rack
• Capabilities define how much memory and CPU are requested
• relaxLocality = false forces containers onto subsets of machines, e.g. via YARN node labels
A ResourceRequest consists of:
priority
resourceName
capability
numContainers
relaxLocality
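As an illustration of those fields, here is a minimal sketch of building a ResourceRequest through the YARN Java API; the values and host name are hypothetical, and in practice the Spark ApplicationMaster constructs these requests on your behalf.

import org.apache.hadoop.yarn.api.records.{Priority, Resource, ResourceRequest}

val capability = Resource.newInstance(4096, 2)   // capability: 4 GB of memory, 2 vcores
val request = ResourceRequest.newInstance(
  Priority.newInstance(1),                       // priority
  "worker-node-01",                              // resourceName: a host, a rack, or "*" for any node
  capability,
  5,                                             // numContainers
  false)                                         // relaxLocality = false pins the request to that node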
16. YARN Capacity Scheduler
Capacity Sharing
• Elasticity
• Queues to subdivide resources
• Job submission Access Control Lists
Capacity Enforcement
• Max capacity per queue
• User limits within queue
• Preemption
Administration
• Ambari Capacity Scheduler View
17. Hierarchical Queues
[Diagram: an example queue hierarchy]
root
  Adhoc (10%)
    Dev (10%), Reserved (20%), Prod (70%)
  DW (70%)
    Prod (80%)
    Dev (20%)
      P0 (70%), P1 (30%)
  Mrktng (20%)
Parent queues subdivide capacity; applications are submitted to leaf queues.
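A hierarchy like this is expressed as properties in capacity-scheduler.xml. As a rough sketch, using the queue names from the diagram above (the exact values and sub-queue layout are assumptions), the top of the tree might look like:

yarn.scheduler.capacity.root.queues = Adhoc,DW,Mrktng
yarn.scheduler.capacity.root.Adhoc.capacity = 10
yarn.scheduler.capacity.root.DW.capacity = 70
yarn.scheduler.capacity.root.Mrktng.capacity = 20
yarn.scheduler.capacity.root.DW.queues = Prod,Dev
yarn.scheduler.capacity.root.DW.Prod.capacity = 80
yarn.scheduler.capacity.root.DW.Dev.capacity = 20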
18. The YARN capacity scheduler helps manage resources across the cluster
19. YARN Application Submission - Walkthrough
[Diagram: a client (Client2) submits an application to the ResourceManager's Scheduler; ApplicationMasters (AM 1, AM 2) and their containers (1.1-1.3, 2.1-2.4) are launched across NodeManagers in the cluster]
21. First, a bit of review - What is Spark?
• Distributed runtime engine for fast, large-scale data processing.
• Designed for iterative computations and interactive data mining.
• Provides an API framework to support in-memory cluster computing.
• Multi-language support: Scala, Java, Python
22. So what makes Spark fast? Data access methods are not equal!
23. MapReduce vs Spark
• MapReduce: on disk
• Spark: in memory
24. RDD - The main programming abstraction
Resilient Distributed Datasets
• Collections of objects spread across a cluster, cached or stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• Immutable; each transformation creates a new RDD
Operations
• Lazy transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
25. RDD In Action
[Diagram: a chain of RDDs built by transformations, with an action producing a value]
textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()    # 74
linesWithSpark.first()    # '# Apache Spark'
26. RDD Graph
textFile -> flatMap -> map -> reduceByKey -> collect
textFile                               // RDD[String]
  .flatMap(line => line.split(" "))    // RDD[String]: individual words
  .map(word => (word, 1))              // RDD[(String, Int)]
  .reduceByKey(_ + _, 3)               // RDD[(String, Int)]
  .collect()                           // Array[(String, Int)]
27. DAG Scheduler
[Diagram: the textFile / flatMap / map / reduceByKey / collect graph split into Stage 1 and Stage 2 at the shuffle boundary before reduceByKey]
Goals
• Split the graph into stages based on the types of transformations
• Pipeline narrow transformations (transformations without data movement) into a single stage
28. DAG Scheduler - Double Click
[Diagram: the same graph, with Stage 1 covering textFile and the maps, and Stage 2 covering reduceByKey and collect]
Stage 1
1. Read HDFS split
2. Apply both maps
3. Write shuffle data
Stage 2
1. Read shuffle data
2. Final reduce
3. Send result to the driver
29. Tasks - How work gets done
The fundamental unit of work in Spark:
1. Fetch input, based on the InputFormat or a shuffle.
2. Execute the task.
3. Materialize task output via a shuffle, a write, or a result sent to the driver.
30. Input Formats control task input
• Hadoop InputFormats control how data on HDFS is read into each task.
  – Controls splits: how data is split up. Each task (by default) gets one split, which is typically a single HDFS block.
  – Controls the concept of a record: is a record a whole line? A single word? An XML element?
• Spark can use both the old- and new-API InputFormats for creating RDDs.
  – newAPIHadoopRDD and hadoopRDD
  – Save time: use Hadoop InputFormats rather than writing a custom RDD. See the sketch below.
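A minimal sketch of the two entry points, run from spark-shell (sc is the SparkContext; the HDFS path is hypothetical):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// New-API InputFormat: each record is (byte offset, line of text)
val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/events")
  .map { case (_, text) => text.toString }

// sc.textFile is a convenience wrapper that uses the old-API TextInputFormat under the hood
val sameLines = sc.textFile("hdfs:///data/events")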
31. Executor - The Spark Worker
Isolation for tasks
1. Each application gets its own executors.
2. Executors run tasks in threads and cache data.
3. Executors run in separate processes for isolation.
4. An executor lives for the duration of the application.
32. Executor - The Spark Worker
[Diagram: a single executor process with multiple cores; each core runs tasks in threads, and each task fetches input, executes, and writes output]
35. Spark on YARN
Modus Operandi
• 1 executor = 1 YARN container
• 2 modes: yarn-client or yarn-cluster
• yarn-client: driver on the client side; good for the REPL
• yarn-cluster: driver inside the YARN ApplicationMaster; good for batch and automated jobs
[Diagram: YARN ResourceManager, ApplicationMaster, and monitoring UI]
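For example, a minimal pair of launch commands (the application class and jar names are hypothetical):

# yarn-client: the driver runs in the local spark-shell process; good for interactive work
spark-shell --master yarn-client

# yarn-cluster: the driver runs inside the YARN ApplicationMaster; good for batch jobs
spark-submit --master yarn-cluster --class com.example.MyJob my-job.jar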
36. Why Spark on YARN
Core Features
• Run other workloads alongside Spark
• Leverage Spark dynamic resource allocation
• Currently the only way to run in a kerberized environment
• Ability to provide capacity guarantees via the Capacity Scheduler
[Diagram: the same HDP 2.2 stack shown on slide 12]
37. Executor Allocations on YARN
Static Allocation
• A static number of executors is started on the cluster.
• Executors live for the duration of the application, even when idle.
Dynamic Allocation
• A minimal number of executors is started initially.
• Executors are added exponentially based on pending tasks.
• After an idle period, executors are stopped and resources are returned to the resource pool.
38. Static Allocation Details
Static Allocation
• Traditional means of starting executors on nodes.
spark-shell --master yarn-client \
  --driver-memory 3686m \
  --executor-memory 17g \
  --executor-cores 7 \
  --num-executors 7
• A static number of executors is specified by the submitter.
• Size and count of executors is key for good performance.
39. Dynamic Allocation Details
Dynamic Allocation
• Scale executor count based on pending tasks.
spark-shell --master yarn-client \
  --driver-memory 3686m \
  --executor-memory 3686m \
  --executor-cores 1 \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.minExecutors=1" \
  --conf "spark.dynamicAllocation.maxExecutors=100" \
  --conf "spark.shuffle.service.enabled=true"
• Minimum and maximum number of executors are specified.
• Exclusive to running Spark on YARN.
40. Enabling Dynamic Allocation
Dynamic allocation is not enabled out of the box; it requires the spark_shuffle YARN aux service plus:
--conf "spark.dynamicAllocation.enabled=true"
--conf "spark.shuffle.service.enabled=true"
1. Copy the spark-shuffle jar onto the NodeManager classpath.
2. Configure the YARN aux service for spark_shuffle:
   Add spark_shuffle to yarn.nodemanager.aux-services
   Set yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService
3. Restart the NodeManagers to pick up the spark-shuffle jar.
4. Run the Spark job with the dynamic allocation configs.
41. Dynamic Allocation Configuration Options
spark.dynamicAllocation.minExecutors
Minimum number of executors; also the initial number spawned at job submission (the initial count can be overridden with spark.dynamicAllocation.initialExecutors).
--conf "spark.dynamicAllocation.minExecutors=1"
spark.dynamicAllocation.maxExecutors
Maximum number of executors; executors are added based on pending tasks up to this maximum.
--conf "spark.dynamicAllocation.maxExecutors=100"
42. Dynamic Allocation Configuration Options
spark.dynamicAllocation.schedulerBacklogTimeout
Initial delay to wait before allocating additional executors. Default: 5 seconds.
--conf "spark.dynamicAllocation.schedulerBacklogTimeout=10"
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
After the initial round of executors is scheduled, how long until the next round of scheduling? Default: 5 seconds.
--conf "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=10"
[Chart: executors started over time, ramping up in successively larger rounds]
43. Dynamic Allocation - Good citizenship in a shared environment
spark.dynamicAllocation.executorIdleTimeout
Amount of idle time in seconds before an executor container is killed and its resources returned to YARN. Default: 10 minutes.
--conf "spark.dynamicAllocation.executorIdleTimeout=60"
spark.dynamicAllocation.cachedExecutorIdleTimeout
Because caching RDDs is key to performance, this setting was introduced to keep executors holding cached data around longer.
--conf "spark.dynamicAllocation.cachedExecutorIdleTimeout=1800"
44. Sizing your Spark job
Difficult Landscape
• Conflicting recommendations are often found online.
• Requires knowledge of the data set, task distribution, cluster topology, RDD cache churn, hardware profile...
1 executor per core? 1 executor per node? 3-5 executors if I/O bound? yarn.nodemanager.resource.memory-mb? 18 GB max heap?
It depends.
45. Common suggestions to improve performance
Do these things:
1. Cache RDDs in memory*
2. Don't spill to disk if possible
3. Use a better serializer
4. Consider compression
5. Limit GC activity
6. Get parallelism right* ... or scale elastically
* New considerations with Spark on YARN
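A minimal sketch touching a few of these knobs (serializer, compression, parallelism, caching); the application name, input path, and partition count are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("tuning-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // 3. use a better serializer
  .set("spark.rdd.compress", "true")                                      // 4. consider compression
  .set("spark.default.parallelism", "48")                                 // 6. get parallelism right

val sc = new SparkContext(conf)
val events = sc.textFile("hdfs:///data/events")
  .repartition(48)   // spread work evenly across executors
  .cache()           // 1. cache RDDs in memory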
46. Sizing Spark Executors on YARN
Relationship
1. Setting the executor memory size sets the JVM heap, NOT the container size.
2. Executor memory + the greater of (10% of executor memory or 384 MB) = container size.
3. To avoid wasted resources, ensure executor memory + memoryOverhead is not smaller than yarn.scheduler.minimum-allocation-mb (YARN will round the container up to at least that size).
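As a worked example, this relationship is presumably why the earlier spark-shell examples used the odd-looking 3686m value:

executor (or driver) memory = 3686 MB
memoryOverhead              = max(10% of 3686 MB, 384 MB) = 384 MB
requested container size    = 3686 MB + 384 MB = 4070 MB, which fits inside a 4096 MB (4 GB) allocation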
47. Sizing Spark Executors on YARN
Relevant YARN Container Settings
• yarn.nodemanager.resource.cpu-vcores
  – Number of vcores available for YARN containers per NodeManager
• yarn.nodemanager.resource.memory-mb
  – Total memory available for YARN containers per NodeManager
• yarn.scheduler.minimum-allocation-mb
  – Minimum resource request allowed per allocation, in megabytes
  – Smallest container available for an executor
• yarn.scheduler.maximum-allocation-mb
  – Maximum resource request allowed per allocation, in megabytes
  – Largest container available for an executor
  – Typically equal to yarn.nodemanager.resource.memory-mb
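For instance, on a hypothetical worker node with 16 cores and 64 GB of RAM, these settings might look like:

yarn.nodemanager.resource.cpu-vcores = 16
yarn.nodemanager.resource.memory-mb  = 57344   (56 GB, leaving ~8 GB for the OS and Hadoop daemons)
yarn.scheduler.minimum-allocation-mb = 1024
yarn.scheduler.maximum-allocation-mb = 57344   (typically equal to resource.memory-mb)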
48. Tuning Advice
How do we get it right?
• Test, gather, and test some more
• Define an SLA!
• Tune the job, not the cluster
• Tune the job to meet the SLA!
• Don't tune prematurely; it's the root of all evil
Starting Points
• Keep your heap reasonable, but large enough to handle your dataset.
  – Recall that we only get about 60% of the heap for RDD caching.
  – Measure GC and ensure the percentage of time spent there is low.
• For jobs that depend heavily on cached RDDs, limit executors per machine to one where possible.
  – See the first point: if RDD cache churn or GC are a problem, make smaller executors and run multiple per machine.
Starting Points (continued)
• High-memory hardware: multiple executors per machine.
  – Keep the heap reasonable.
• For CPU-bound tasks with limited data needs, more executors can be better.
  – Run with 2-4 GB executors with a single vcore and measure performance.
• Tune task parallelism.
  – As a rule of thumb, increase the task count by 1.5x each round of testing and measure the results.
49. Avoid spilling or caching to disk
Caching strategies
• Use the default .cache() or .persist(), which stores data as deserialized Java objects (MEMORY_ONLY).
  – Trade-off: lower CPU usage versus size of data in memory.
• Don't use disk persistence.
  – It's typically faster to recompute the partition, and there is a good chance many of the blocks are still in the operating system page cache.
• If the default strategy results in the data not fitting in memory, use MEMORY_ONLY_SER, which stores the data as serialized objects.
  – Trade-off: higher CPU usage, but the data set is typically around 50% smaller in memory.
  – Can significantly impact job run time for larger data sets; use with caution.
import org.apache.spark.storage.StorageLevel._
theRdd.persist(MEMORY_ONLY_SER)
50. Data Access with Spark on YARN
Gotchas
• Don't cache base RDDs: poor distribution.
  – Do cache intermediate data sets: good distribution across dynamically allocated executors.
• Ensure executors remain running until you are done with the cached data.
  – Cached data goes away when the executors do, and is costly to recompute.
• Data locality is getting better, but isn't great.
  – SPARK-1767 introduced locality waits for cached data.
• computePreferredLocations is pretty broken.
  – Only use it if necessary; it gets overwritten in some scenarios, and better approaches are in the works.
val locData = InputFormatInfo.computePreferredLocations(Seq(
  new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
val sc = new SparkContext(conf, locData)
51. Future Improvements for Spark on YARN
RDD Sharing
– Short term: keep executors with cached RDDs around longer
– HDFS memory tier for RDD caching
– Experimental off-heap caching in Tachyon (lower overhead than persist())
– Cache rebalancing
Data Locality for Dynamic Allocation
– No more preferredLocations; discover locality from RDD lineage.
Container/Executor Sizing
– Make it easier: automatically determine the appropriate size.
– Long term: specify task size only, and memory, cores, and overhead are determined automatically.
Secure All The Things!
– SASL for shuffle data
– SSL for the HTTP endpoints
– Encrypted shuffle (SPARK-5682)
57. Scenarios
Promising Use Cases
1. CPU-bound workloads
2. Bursty usage
3. Zeppelin / ad-hoc data exploration
4. Multi-tenant, multi-use, centralized clusters
5. Dev/QA clusters
58. Cloudbreak
• Developed by SequenceIQ
• Open source, with options to extend with a custom UI
• Launches Ambari and deploys the selected distribution via Blueprints in Docker containers
• Customer registers, delegates access to cloud credentials, and runs Hadoop on their own cloud account (Azure, AWS, etc.)
• Elastic: spin up any number of nodes, up/down scale on the fly
"Cloud-agnostic Hadoop As-A-Service API"
59. Page
 59
Â
BI
 /
 AnalyWcs
Â
(Hive)
Â
IoT
 Apps
Â
(Storm,
 HBase,
 Hive)
Â
Launch HDP on Any Cloud for Any Application
Dev
 /
 Test
Â
(all
 HDP
 services)
Â
Data
 Science
Â
(Spark)
Â
Cloudbreak
Â
1.⯠Pick
 a
 Blueprint
Â
2.⯠Choose
 a
 Cloud
Â
3.⯠Launch
 HDP!
Â
Example
 Ambari
 Blueprints:
Â
Â
IoT
 Apps,
 BI
 /
 Analy2cs,
 Data
 Science,
Â
Dev
 /
 Test
Â
60. Step 1: Sign up for a free Cloudbreak account
URL to sign up for a free account:
https://accounts.sequenceiq.com/
General Cloudbreak documentation:
http://sequenceiq.com/cloudbreak/#cloudbreak
61. Step 2: Create or add credentials
• Varies by cloud, but typically only a couple of steps.
62. Step 3: Note the blueprint for your use case
• An Ambari blueprint describes the components of the HDP stack to include in the cloud deployment
• Cloudbreak comes with some default blueprints, such as a Spark cluster or a streaming architecture
• Pick the appropriate blueprint, or create your own!
63. Step 4: Create Cluster
• Ensure your credential is selected by clicking on "select a credential"
• Click Create cluster, give it a name, choose a region, choose a network
• Choose the desired blueprint
• Set the instance type and number of nodes
• Click create and start cluster
64. Step 5: Wait for cluster install to complete
• Depending on the instance types and blueprint chosen, cluster install should complete in 10-35 minutes
• Once the cluster install is complete, click on the Ambari server address link (highlighted on the screenshot) and log in to Ambari with admin/admin
• Your HDP cluster is ready to use
65. Periscope: Auto up and down scaling
• Define alerts for the number of pending YARN containers.
66. Periscope: Auto up and down scaling
• Define scaling policies for how Periscope should react to the defined alerts.
67. Periscope: Auto up and down scaling
• Define the min/max cluster size and "cooldown" period (how long to wait between scaling events).
• The number of compute nodes will automatically scale when out of capacity for containers.
68. Benefits
Why do I care?
• Less contention between jobs
  – Less waiting for your neighbor's job to finish; elastic scale gives us all compute time.
• Improved job run times
  – Testing has shown a 30%+ decrease in job run times for moderate-duration, CPU-bound jobs.
• Decreased costs over persistent IaaS clusters
  – Spin down resources not in use.
  – If time = money, improved job run times will decrease costs.
• Capacity planning hack!
  – Scaling up a lot? You should probably add more capacity...
  – Never scaling up? You probably overbuilt...