Scale-Out Resource Management at Microsoft using Apache YARN
1. Scale-Out Resource Management at Microsoft Using Apache YARN
Raghu Ramakrishnan
CTO for Data
Microsoft
Technical Fellow
Head, Big Data Engineering
2. Data to Intelligent Action
Store any data (e.g., relations)
Do any analysis (SQL queries: Hive, …)
At any speed (batch: Hive, …)
At any scale … elastic!
Anywhere
3.
4.
5.
6. Windows
SMSG
Live
Ads
CRM/Dynamics
Windows Phone
Xbox Live
Office365
STB Malware Protection
Microsoft Stores
STB Commerce Risk
Messenger
LCA
Exchange
Yammer
Skype
Bing
data managed: EBs
cluster sizes: 10s of Ks
# machines: 100s of Ks
daily I/O: >100 PBs
# internal developers: 1000s
# daily jobs: 100s of Ks
7. • Interactive and real-time analytics require low-latency (in-memory/flash) access to data
• Massive data volumes require scale-out stores using commodity servers, even for archival storage
Tiered Storage
Seamlessly move data across tiers, mirroring life-cycle and usage patterns
Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across
different storage systems (and governance boundaries)?
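The trade-off above can be made concrete with a small sketch: schedule compute next to the lowest-latency replica of the data, falling back to remote reads otherwise. This is an illustrative model, not a YARN or HDFS API; the function names and the tier latency table are assumptions.

```python
# Hypothetical sketch: choose the node holding the lowest-latency replica of a
# block; if no candidate node has a local copy, run anywhere and read remotely.
TIER_LATENCY_US = {"ram": 1, "ssd": 100, "hdd": 10_000, "archive": 10_000_000}

def place_task(replicas, candidate_nodes):
    """replicas: list of (node, tier) pairs; returns the node to run on."""
    local = [(TIER_LATENCY_US[tier], node)
             for node, tier in replicas if node in candidate_nodes]
    if local:
        return min(local)[1]      # schedule next to the fastest local copy
    return candidate_nodes[0]     # no local copy: pick any node, read remotely

# Example: a block mirrored on n1 (disk) and n2 (flash)
best = place_task([("n1", "hdd"), ("n2", "ssd")], ["n1", "n2", "n3"])
```

Mirroring data across tiers (per the slide, following life-cycle and usage patterns) is what makes a low-latency local copy available to the scheduler in the first place.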
8. • Many different analytic engines (OSS and
vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run
on the same machines (where the data lives)
Resource Management with Multitenancy and SLAs
Policy-driven management of vast compute pools co-located with data
Schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix
across tens of thousands of machines?
9. Shared Data and Compute
(Diagram: Tiered Storage under a Compute Fabric for resource management, with a relational query engine and machine learning running on top.)
Multiple analytic engines sharing the same resource pool
Compute and store/cache on the same machines
11. Resource Managers for Big Data
Allocate compute containers to competing jobs
Multiple job engines share a common pool of containers
YARN: resource manager for Hadoop 2.x
Related systems: Corona, Mesos, Omega
12. YARN Gaps
• Resource allocation SLOs
• Scalability limitations
• High allocation latency
• Support for specialized execution frameworks
• Interactive environments, long-running services
• Amoeba / Rayon
• Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
• Status: prototypes, JIRAs and papers
• Federation
• Status: prototype and JIRA
• Framework-level Pooling
• Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
• Status: spec
Microsoft Contributions to OSS Apache YARN
17. Contributing to Apache
(Diagram: a Client submits Job1 to the RM Scheduler; an App Master on a NodeManager coordinates Tasks across NodeManagers.)
Related JIRAs: MR-5192, MR-5194, MR-5197, MR-5189, MR-5176, MR-5196, YARN-569
Engaging with OSS:
• Talk with active developers
• Show early/partial work
• Submit small patches
• It's OK to leave things unfinished
18.
19. Sharing a Cluster Between Production and Best-Effort Jobs
Production Jobs (P)
Money-making, large (recurrent) jobs with SLAs
e.g., shows up at 3pm, deadline at 6pm, ~90 min runtime
Best-Effort Jobs
Interactive, exploratory jobs submitted by data scientists without SLAs
However, latency is still important (a user is waiting)
20. Reservation-Based Scheduling in Hadoop
(Curino, Krishnan, Difallah, Douglas, Ramakrishnan, Rao; Rayon paper, SoCC 2014)
New idea: support SLOs for production jobs by using
- Job-provided resource requirements in RDL
- System-enforced admission control
21. Resource Definition Language (RDL)
e.g., atom (<2GB, 1 core>, 1, 10, 1min, 10 bundles/min)
(simplified for OSS release)
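To make the atom fields concrete, here is an illustrative model of the example above (<2GB, 1 core> bundles; gang of 1; up to 10 in parallel; 1-minute duration; 10 bundles/min). The field names are my reading of the simplified syntax, not YARN's actual reservation API.

```python
# Hypothetical model of a simplified RDL atom; field semantics are assumptions.
from dataclasses import dataclass

@dataclass
class Atom:
    mem_gb: int          # memory per container bundle
    cores: int           # cores per bundle
    gang: int            # minimum bundles that must run together
    max_parallel: int    # maximum concurrent bundles
    duration_min: int    # how long each bundle runs
    rate_per_min: int    # admission rate limit (bundles/min)

    def peak_demand(self):
        """Worst-case simultaneous footprint as (GB, cores)."""
        return (self.mem_gb * self.max_parallel, self.cores * self.max_parallel)

# atom(<2GB, 1 core>, 1, 10, 1min, 10 bundles/min)
a = Atom(mem_gb=2, cores=1, gang=1, max_parallel=10, duration_min=1, rate_per_min=10)
# a.peak_demand() == (20, 10): 20 GB and 10 cores at peak
```

A declarative spec like this is what lets the system do admission control: it can bound the reservation's footprint over time before committing to it.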
22. Reservation-Based Scheduling
Architecture: Teach the RM About Time
Steps:
1. App formulates reservation request in RDL
2. Request is "placed" in the Plan by the Planning Agent
3. Allocation is validated against the sharing policy
4. System commits to deliver resources on time
5. Plan is dynamically enacted (Plan Follower)
6. Jobs get (reserved) resources
7. System adapts to live conditions (adapter)
(Architecture diagram: the ResourceManager is extended with a Plan, Planning Agent, sharing policy, adapter, Plan Follower, and Adaptive Scheduler, dispatching production and best-effort jobs to NodeManagers.)
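Steps 2-4 can be sketched as a toy admission check: model the Plan as a per-minute capacity timeline and admit a reservation only if every minute in its window still has room. This mirrors the idea of teaching the RM about time; it is not Rayon's actual algorithm, and all names are illustrative.

```python
# Toy sketch of plan-based admission control (assumed names, not YARN code).
class Plan:
    def __init__(self, cluster_capacity, horizon_min):
        self.capacity = cluster_capacity
        self.committed = [0] * horizon_min   # resources promised per minute

    def try_reserve(self, start, duration, demand):
        """Admit a reservation iff it fits in every minute of its window."""
        window = range(start, start + duration)
        if any(self.committed[t] + demand > self.capacity for t in window):
            return False                     # would break an earlier promise
        for t in window:
            self.committed[t] += demand      # system commits to deliver
        return True

plan = Plan(cluster_capacity=100, horizon_min=60)
ok1 = plan.try_reserve(start=0, duration=30, demand=70)   # fits: admitted
ok2 = plan.try_reserve(start=10, duration=10, demand=50)  # 70+50 > 100: rejected
ok3 = plan.try_reserve(start=30, duration=10, demand=50)  # after the first job: fits
```

Because rejection happens at reservation time (step 3), production SLAs are protected up front, while unreserved capacity remains available to best-effort jobs.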
23. Results: Comparing Rayon With CapacityScheduler
• Meets all production job SLAs
• Lowers best-effort job latency
• Increases cluster utilization and throughput
Committed to Hadoop trunk and the 2.6 release
Now part of Cloudera CDH and Hortonworks HDP
24. Rayon OSS
Initial Umbrella JIRA: YARN-1051 (14 sub-tasks)
Rayon V2 Umbrella JIRA: YARN-2572 (25 sub-tasks; tooling, REST APIs, UI, documentation, perf improvements)
High-Availability Umbrella JIRA: YARN-2573 (7 sub-tasks)
Heterogeneity/node-labels Umbrella JIRA: YARN-4193 (8 sub-tasks)
Algo enhancements: YARN-3656 (1 sub-task)
Folks involved: Carlo Curino, Subru Krishnan, Ishai Menache, Sean Po, Jonathan Yaniv, Arun Suresh, Anubhav Dhoot, Alexey Tumanov
Included in Apache Hadoop 2.6; various enhancements in the upcoming Apache Hadoop 2.8
25.
26. Why Federation?
Problem:
• YARN scalability is bounded by the centralized ResourceManager
• Load grows with #nodes, #apps, #containers, and heartbeat frequency
• Maintenance and operations on a single massive cluster are painful
Solution:
• Scale by federating multiple YARN clusters
• Appears as a single massive cluster to an app
• NodeManagers heartbeat to one RM only
• Most apps talk to one RM only; a few apps might span sub-clusters (achieved by transparently proxying AM-RM communication) when a single app exceeds sub-cluster size, or for load balancing
• Easier provisioning/maintenance
• Leverage the cross-company stabilization effort on smaller YARN clusters
• Use ~6k YARN clusters as-is as building blocks
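The routing idea can be sketched as follows: apps see one logical cluster, and a router picks a sub-cluster RM for each app (least-loaded here) and proxies the submission. This is an assumed illustration; YARN Federation's real Router and AMRMProxy components are more involved.

```python
# Hypothetical router sketch: pick the least-utilized sub-cluster for an app.
def pick_subcluster(subclusters):
    """subclusters: dict name -> (used, total) containers. Return target RM."""
    return min(subclusters, key=lambda s: subclusters[s][0] / subclusters[s][1])

# Three sub-clusters; only the chosen one sees this app's AM-RM traffic.
subclusters = {"sc-1": (5000, 6000), "sc-2": (1000, 6000), "sc-3": (5500, 6000)}
target = pick_subcluster(subclusters)
```

Because each NodeManager heartbeats to a single RM and most apps bind to one sub-cluster, the per-RM load stays bounded even as the federated pool grows.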
35. Mercury and Yaq OSS
• Umbrella JIRA for Mercury: YARN-2877
• Sub-tasks: mostly RESOLVED, some with PATCH AVAILABLE
• Amoeba / Rayon
• Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
• Status: prototypes, JIRAs and papers
• Federation
• Status: prototype and JIRA
• Framework-level Pooling
• Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
• Status: spec
Microsoft Contributions to OSS Apache YARN
You're familiar with SQL Server, and many of you know Hadoop and Azure HDInsight.
This is a little bigger.
Analytic storage for the cloud:
Users want to think about the content of their data, what it can tell them about their business, and who can access it.
They don't want to think about:
• remote vs. local storage
• RAM vs. flash
• security