Apache Hadoop YARN:
State of the Union
Weiwei Yang/Chunde Ren
Hortonworks/Alibaba
Weiwei Yang
• Software Engineer at Hortonworks, YARN dev team
• Apache Hadoop committer and PMC member
Chunde Ren
• Staff Software Engineer at Alibaba
• Leading the Hadoop team in the real-time computation platform
• State of the Union: Service, ML, Cloud and beyond
  - Scale and Performance
  - Unified platform
• Apache YARN 3.1 in Alibaba
  - Utilization+: balance & oversubscription
  - Hybrid clusters
• Q&A
State of the Union: Service, ML, Cloud…
Weiwei Yang
Hortonworks, YARN
1 Year Timeline: GA Releases
2.9.0 3.0.0 3.1.0 3.2.0
• Submarine
• Node attributes
• Service upgrade
• Containerization improvements
• Global Scheduling
• Multiple resource types
• New YARN UI
• Timeline v2
• GPU/FPGA
• YARN Native Service
• Global scheduling
• Placement Constraints
• YARN Federation
• Opportunistic Containers
• New YARN UI
• Timeline v2
Nov 17 Dec 17 Aug 18 Oct 18
2.9.1 3.0.3 3.1.1 3.2.0
Apache Hadoop YARN
Unified Data Operative System
ML
Streaming
Ad-hoc
Deep Learning
No-SQL
SQL
Service
Compute
Resource
SLA
Utilization
Focus area
• Continue to evolve at large scale
• Scale
• Global Scheduling
• Unified platform
• Container runtime and Services
• Placement constraints
• Beyond: Submarine/CSI
Scale Today
• Many sites run clusters made up of a large number of nodes
• Oath (Yahoo!), Twitter, LinkedIn, Microsoft, Alibaba, etc.
• 50K nodes in a single cluster at Microsoft [1]
• Roadmap: to 100K nodes and beyond
[1] https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/
Global Scheduling
[Diagram: Threads 1, 2 and 3 each produce a placement proposal (Proposals 1, 2 and 3) from the scheduler state; the proposals go to the Committer]
Global Scheduling: takeaways
• Addresses hotspot issues
• Allows plugging in customized node-scoring policies (customized slot-selection themes)
• Scoring can be done in the background or in place
• Not a good fit for clusters that run only small batch jobs
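The propose-then-commit flow described on the two Global Scheduling slides can be sketched roughly as follows. This is a minimal illustration, not YARN's implementation: `Proposal`, `Committer`, `propose` and `most_free_memory` are hypothetical names, and memory is the only resource modeled. Each thread scores nodes against a possibly stale snapshot, and the committer re-validates against the live state before accepting.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Proposal:
    """A placement suggested by one scheduling thread (hypothetical type)."""
    container_id: str
    node: str
    memory_mb: int


class Committer:
    """Applies proposals one at a time against the shared scheduler state."""

    def __init__(self, free_memory_mb: Dict[str, int]) -> None:
        self.free_memory_mb = dict(free_memory_mb)

    def try_commit(self, proposal: Proposal) -> bool:
        # Re-check the live state: another proposal may have used the node
        # after this proposal was computed from an older snapshot.
        if self.free_memory_mb.get(proposal.node, 0) < proposal.memory_mb:
            return False
        self.free_memory_mb[proposal.node] -= proposal.memory_mb
        return True


def most_free_memory(node: str, snapshot: Dict[str, int]) -> float:
    """Example pluggable scoring policy: prefer the emptiest node."""
    return snapshot[node]


def propose(container_id: str, memory_mb: int, snapshot: Dict[str, int],
            score: Callable[[str, Dict[str, int]], float]) -> Optional[Proposal]:
    """Pick the best-scoring node that fits, or None if nothing fits."""
    fitting = [n for n, free in snapshot.items() if free >= memory_mb]
    if not fitting:
        return None
    best = max(fitting, key=lambda n: score(n, snapshot))
    return Proposal(container_id, best, memory_mb)


if __name__ == "__main__":
    committer = Committer({"host1": 8192, "host2": 4096})
    snapshot = dict(committer.free_memory_mb)  # shared by both "threads"
    first = propose("c1", 6144, snapshot, most_free_memory)
    second = propose("c2", 6144, snapshot, most_free_memory)
    print(committer.try_commit(first), committer.try_commit(second))
```

In this sketch the second proposal is rejected at commit time because both were computed from the same snapshot; a real scheduler would then re-propose for that container.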
Docker Container
• Better packing model
• Lightweight mechanism for packaging and resource isolation
• Popularized and made accessible by Docker
• Natively integrated in YARN
• Docker container runtime
• Many security/usability improvements added in 3.x
Container Runtime
Run both Docker and non-Docker containers on the same cluster
YARN Service
Simplified Service Framework
• Service discovery: DNS Registration
• Service timeout
• Upgrade
• Integrated REST API/CLI/UI
Service - UI
Placement Constraints
Anti-affinity
Don't place containers together
Affinity
Collocate containers
Cardinality
Control number of containers per node/rack
Expression, namespace, service spec and more
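One way to read the three constraint types is as a single cardinality check over tagged containers already placed on each node. The sketch below assumes that reading: anti-affinity is "zero matching containers on the node", affinity is "at least one", and cardinality is an arbitrary range. The function names and the `(node, tag)` placement tuples are hypothetical, and only node scope is modeled (not rack).

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def allowed_nodes(nodes: Iterable[str],
                  placements: Iterable[Tuple[str, str]],
                  tag: str,
                  min_count: float,
                  max_count: float) -> List[str]:
    """Nodes whose current count of `tag` containers lies in [min, max]."""
    counts = Counter(node for node, placed_tag in placements if placed_tag == tag)
    return [n for n in nodes if min_count <= counts[n] <= max_count]


def anti_affinity(nodes, placements, tag):
    """Don't place next to containers carrying `tag`."""
    return allowed_nodes(nodes, placements, tag, 0, 0)


def affinity(nodes, placements, tag):
    """Only place next to at least one container carrying `tag`."""
    return allowed_nodes(nodes, placements, tag, 1, math.inf)


if __name__ == "__main__":
    nodes = ["host1", "host2", "host3"]
    placed = [("host1", "hbase"), ("host1", "hbase"), ("host2", "hbase")]
    print(anti_affinity(nodes, placed, "hbase"))
    print(affinity(nodes, placed, "hbase"))
    print(allowed_nodes(nodes, placed, "hbase", 0, 1))
```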
Node Attributes
rm.yarn.io/hostName="host1"
rm.yarn.io/hostType="new"
nm.yarn.io/javaVersion="1.8"
rm.yarn.io/hostName="host2"
rm.yarn.io/hostType="old"
nm.yarn.io/javaVersion="1.9"
RM STATE
hostName=host1
javaVersion>1.6
Host1 Host2
Allocation = host1
Allocation = host1 | host2
1. Centralized/Distributed node-attributes
2. Distributed node-attributes support config/script providers
3. Admin tools and RESTful APIs
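The two allocations shown on the slide (hostName=host1 and javaVersion>1.6) can be reproduced with a small matcher. This is only a sketch: `matching_nodes` and the operator table are hypothetical, attribute prefixes such as `rm.yarn.io/` are dropped for brevity, and dotted values are compared numerically as an assumption about what `>` should mean for versions.

```python
import operator
from typing import Dict, List

# Attribute values taken from the slide; prefixes omitted.
NODES: Dict[str, Dict[str, str]] = {
    "host1": {"hostName": "host1", "hostType": "new", "javaVersion": "1.8"},
    "host2": {"hostName": "host2", "hostType": "old", "javaVersion": "1.9"},
}

OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt}


def _comparable(value: str):
    """Treat dotted numbers such as '1.8' as tuples, anything else as text."""
    parts = value.split(".")
    if all(part.isdigit() for part in parts):
        return tuple(int(part) for part in parts)
    return value


def matching_nodes(nodes: Dict[str, Dict[str, str]],
                   attribute: str, op: str, expected: str) -> List[str]:
    """Names of nodes whose attribute satisfies `attribute op expected`."""
    compare = OPS[op]
    want = _comparable(expected)
    return sorted(
        name for name, attrs in nodes.items()
        if attribute in attrs and compare(_comparable(attrs[attribute]), want)
    )


if __name__ == "__main__":
    print(matching_nodes(NODES, "hostName", "=", "host1"))
    print(matching_nodes(NODES, "javaVersion", ">", "1.6"))
```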
Submarine: TF Hello world
Run distributed TF training with one command:
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-job-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for worker ..." \
  --num_ps 2 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps"
Submarine - Architecture
• Container Storage Interface (CSI): a vendor-neutral interface that bridges Container Orchestrators and Storage Providers.
[Diagram: the Container Storage Interface (CSI) shown with storage providers Amazon EBS and Hadoop Ozone]
The Adaption of CSI
Volume Resource: part of the container resource request
Deployment
[Diagram. Master: Resource Manager with Volume Manager. Each of two slaves: Node Manager, CSI Driver Adaptor, 3rd-party CSI driver (Controller, Node and Identity plugins), and containers with volume mounts. Backend: storage system.]
Apache YARN 3.1 in Alibaba
Chunde Ren
Alibaba, Real-time Computation
Apache YARN
BI Recommendation Security Search
Apache HDFS
Ads
Streaming + Batch
The Ecosystem
The challenges & solutions
• YARN Clusters
• Tens of thousands of nodes, version: 3.1.0+patches
• Long running streaming jobs >> Batch jobs, Machine Learning
• SLA: latency, priority, failover
• HA, performance, load-balancer, placement constraints, isolation
• Resource Utilization
• Elastic queue capacity & preemption, Oversubscription, Hybrid cluster
Enhanced Capacity Scheduler
Load Balance – Node Scores
Dynamically optimize container distribution across the cluster
• Policy: app or queue
• Shuffle candidates
• Auto-resize allocation threads
• Score Cache
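A rough sketch of how shuffled candidates and a score cache might fit together is shown below. Everything here is an assumption for illustration: the `NodeScorer` class, the "1 minus utilization" score, the time-to-live cache, and sampling a random subset as the "shuffle" step are not taken from the Alibaba scheduler.

```python
import random
import time
from typing import Callable, Dict, Iterable, Tuple


class NodeScorer:
    """Scores nodes by spare capacity and caches each score for a short time."""

    def __init__(self, utilization_of: Callable[[str], float],
                 ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._utilization_of = utilization_of
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, float]] = {}  # node -> (score, time)

    def score(self, node: str) -> float:
        now = self._clock()
        cached = self._cache.get(node)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh, skip recomputation
        value = 1.0 - self._utilization_of(node)
        self._cache[node] = (value, now)
        return value


def pick_node(nodes: Iterable[str], scorer: NodeScorer,
              sample_size: int, rng: random.Random) -> str:
    """Score a random sample of nodes and return the best one."""
    pool = list(nodes)
    candidates = rng.sample(pool, min(sample_size, len(pool)))
    return max(candidates, key=scorer.score)


if __name__ == "__main__":
    utilization = {"a": 0.9, "b": 0.2, "c": 0.5}
    scorer = NodeScorer(utilization.get)
    print(pick_node(utilization, scorer, 3, random.Random(0)))
```

Sampling keeps concurrent allocation threads from all converging on the same top-scoring node, and the cache trades a little staleness for fewer score computations.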
Load Balance - Reschedule
[Diagram labels: Hotspot, Candidates Selector, Rescheduler, App1 Container, App2 Container, Scheduler edit, Scheduler States, Notify Scheduler (MARK_CONTAINER_FOR_PREEMPTION), Resource Manager, MainScheduler]
Eliminate Hotspots and Fragmentation
• NM/Container utilization metrics
• Identify hotspot & idle nodes
• Preemption:
- Priority
- candidate app cache
- preemption contract protocol
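The hotspot and idle-node identification and the priority-ordered preemption listed above could look roughly like this. The thresholds, the `RunningContainer` type, and the rule "lowest app priority first, larger usage first within a priority" are assumptions made for the sketch; the candidate app cache and the preemption contract protocol are not modeled.

```python
from typing import Dict, List, NamedTuple, Tuple


class RunningContainer(NamedTuple):
    container_id: str
    app_priority: int  # assumption: a smaller number means lower priority
    usage: float       # fraction of the node's capacity in use


def classify_nodes(utilization: Dict[str, float],
                   hot_threshold: float = 0.85,
                   idle_threshold: float = 0.30) -> Tuple[List[str], List[str]]:
    """Split nodes into hotspots and idle nodes from utilization metrics."""
    hot = sorted(n for n, u in utilization.items() if u >= hot_threshold)
    idle = sorted(n for n, u in utilization.items() if u <= idle_threshold)
    return hot, idle


def select_for_preemption(containers: List[RunningContainer],
                          relief_needed: float) -> List[str]:
    """Pick containers on a hot node until enough usage would be freed."""
    chosen: List[str] = []
    relieved = 0.0
    ordered = sorted(containers, key=lambda c: (c.app_priority, -c.usage))
    for container in ordered:
        if relieved >= relief_needed:
            break
        chosen.append(container.container_id)
        relieved += container.usage
    return chosen


if __name__ == "__main__":
    print(classify_nodes({"n1": 0.95, "n2": 0.5, "n3": 0.1}))
    on_hot_node = [
        RunningContainer("c1", 5, 0.30),
        RunningContainer("c2", 1, 0.10),
        RunningContainer("c3", 1, 0.20),
    ]
    print(select_for_preemption(on_hot_node, 0.25))
```

The selected containers would then be marked for preemption and rescheduled, ideally onto the idle nodes found by `classify_nodes`.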
Resource Oversubscription
Container execution type: Guaranteed (G) / Opportunistic (O)
Schedule Mode
• Distributed Scheduler
• extended AMS
• extended CapacityScheduler
Isolation
• CPU: O container share = 2
• Memory: OOM Controller + Killer vs QoS Monitor
• Network: net_cls + tc + switch
• Disk IO: HDD blk.weight vs SSD
Preemption
• High/Low watermark
• O(App Priority) < G(App Priority)
Result: auto operation, utilization >10%
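The high/low watermark preemption for oversubscribed nodes can be sketched as follows. The semantics are assumed, not taken from the slides: once node usage crosses the high watermark, opportunistic containers are preempted (lowest app priority first) until usage falls to the low watermark, and guaranteed containers are never touched. The `NodeContainer` type and the default watermark values are hypothetical.

```python
from typing import List, NamedTuple


class NodeContainer(NamedTuple):
    container_id: str
    opportunistic: bool  # False means a Guaranteed container
    app_priority: int    # assumption: a smaller number means lower priority
    usage: float         # fraction of the node's capacity in use


def opportunistic_to_preempt(containers: List[NodeContainer],
                             high: float = 0.9,
                             low: float = 0.7) -> List[str]:
    """Return the O containers to preempt so usage drops to `low` or below."""
    total = sum(c.usage for c in containers)
    if total <= high:
        return []  # below the high watermark, nothing to do
    victims: List[str] = []
    candidates = sorted((c for c in containers if c.opportunistic),
                        key=lambda c: c.app_priority)
    for container in candidates:
        if total <= low:
            break
        victims.append(container.container_id)
        total -= container.usage
    return victims


if __name__ == "__main__":
    node = [
        NodeContainer("g1", False, 10, 0.5),
        NodeContainer("o1", True, 1, 0.25),
        NodeContainer("o2", True, 2, 0.25),
    ]
    print(opportunistic_to_preempt(node))
```

Using two watermarks rather than one avoids preempting again immediately after usage dips just under the trigger level.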
Hybrid Cluster
Improve Online Service Cluster resource utilization
Architecture: YARN on K8s
Isolation: powered by aliKernel
Use Case: machine learning & deep learning
[Diagram, labeled "Utilization": K8s Master (API Server, Controller) and YARN ResourceManager; on the nodes, K8s-allocated Services alongside a NodeManager (NM in Docker) running G+O containers]
Apache Hadoop YARN State of the Union

Más contenido relacionado

La actualidad más candente

AWS Database Services-Philadelphia AWS User Group-4-17-2018
AWS Database Services-Philadelphia AWS User Group-4-17-2018AWS Database Services-Philadelphia AWS User Group-4-17-2018
AWS Database Services-Philadelphia AWS User Group-4-17-2018Bert Zahniser
 
Deep Dive with Amazon EC2 Container Service Hands-on Workshop
Deep Dive with Amazon EC2 Container Service Hands-on WorkshopDeep Dive with Amazon EC2 Container Service Hands-on Workshop
Deep Dive with Amazon EC2 Container Service Hands-on WorkshopAmazon Web Services
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftLi Gao
 
Running Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWSRunning Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWSAmazon Web Services
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersAmazon Web Services
 
Bridging the Gap: Connecting AWS and Kafka
Bridging the Gap: Connecting AWS and KafkaBridging the Gap: Connecting AWS and Kafka
Bridging the Gap: Connecting AWS and KafkaPengfei (Jason) Li
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
6 Roadmap Cloudstack Developer Day
6 Roadmap Cloudstack Developer Day6 Roadmap Cloudstack Developer Day
6 Roadmap Cloudstack Developer DayKimihiko Kitase
 
Azure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaverAzure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaverGary Jackson MBCS
 
Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Tsuyoshi OZAWA
 
AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...
AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...
AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...Amazon Web Services
 
System design for video streaming service
System design for video streaming serviceSystem design for video streaming service
System design for video streaming serviceNirmik Kale
 
AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)
AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)
AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)Amazon Web Services
 
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech TalksMigrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech TalksAmazon Web Services
 
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...Amazon Web Services
 
AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...
AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...
AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...Amazon Web Services
 
Office 365 SaaS Mail Integration with SAP on Azure
Office 365 SaaS Mail Integration with SAP on AzureOffice 365 SaaS Mail Integration with SAP on Azure
Office 365 SaaS Mail Integration with SAP on AzureGary Jackson MBCS
 

La actualidad más candente (20)

AWS Database Services-Philadelphia AWS User Group-4-17-2018
AWS Database Services-Philadelphia AWS User Group-4-17-2018AWS Database Services-Philadelphia AWS User Group-4-17-2018
AWS Database Services-Philadelphia AWS User Group-4-17-2018
 
Deep Dive with Amazon EC2 Container Service Hands-on Workshop
Deep Dive with Amazon EC2 Container Service Hands-on WorkshopDeep Dive with Amazon EC2 Container Service Hands-on Workshop
Deep Dive with Amazon EC2 Container Service Hands-on Workshop
 
Scaling spark on kubernetes at Lyft
Scaling spark on kubernetes at LyftScaling spark on kubernetes at Lyft
Scaling spark on kubernetes at Lyft
 
Running Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWSRunning Containerised Applications at Scale on AWS
Running Containerised Applications at Scale on AWS
 
ENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million UsersENT309 Scaling Up to Your First 10 Million Users
ENT309 Scaling Up to Your First 10 Million Users
 
Bridging the Gap: Connecting AWS and Kafka
Bridging the Gap: Connecting AWS and KafkaBridging the Gap: Connecting AWS and Kafka
Bridging the Gap: Connecting AWS and Kafka
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
6 Roadmap Cloudstack Developer Day
6 Roadmap Cloudstack Developer Day6 Roadmap Cloudstack Developer Day
6 Roadmap Cloudstack Developer Day
 
Azure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaverAzure Custom Backup Solution for SAP NetWeaver
Azure Custom Backup Solution for SAP NetWeaver
 
Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014Taming YARN @ Hadoop Conference Japan 2014
Taming YARN @ Hadoop Conference Japan 2014
 
AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...
AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...
AWS re:Invent 2016: Design Patterns for High Availability: Lessons from Amazo...
 
System design for video streaming service
System design for video streaming serviceSystem design for video streaming service
System design for video streaming service
 
AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)
AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)
AWS re:Invent 2016: Development Workflow with Docker and Amazon ECS (CON302)
 
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech TalksMigrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
Migrating Your Oracle Database to PostgreSQL - AWS Online Tech Talks
 
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
AWS re:Invent 2016: Netflix: Container Scheduling, Execution, and Integration...
 
AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...
AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...
AWS re:Invent 2016: Get Technically Inspired by Container-Powered Migrations ...
 
Cloud jiffy vs Heroku
Cloud jiffy vs HerokuCloud jiffy vs Heroku
Cloud jiffy vs Heroku
 
Office 365 SaaS Mail Integration with SAP on Azure
Office 365 SaaS Mail Integration with SAP on AzureOffice 365 SaaS Mail Integration with SAP on Azure
Office 365 SaaS Mail Integration with SAP on Azure
 
Simplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & TroubleshootingSimplified Cluster Operation & Troubleshooting
Simplified Cluster Operation & Troubleshooting
 
YARN Services
YARN ServicesYARN Services
YARN Services
 

Similar a Apache Hadoop YARN State of the Union

MariaDB on Docker
MariaDB on DockerMariaDB on Docker
MariaDB on DockerMariaDB plc
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stackNitin Mehta
 
Running database infrastructure on containers
Running database infrastructure on containersRunning database infrastructure on containers
Running database infrastructure on containersMariaDB plc
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Kai Sasaki
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRAmazon Web Services
 
Getting Started with MariaDB with Docker
Getting Started with MariaDB with DockerGetting Started with MariaDB with Docker
Getting Started with MariaDB with DockerMariaDB plc
 
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld
 
Getting started with MariaDB with Docker
Getting started with MariaDB with DockerGetting started with MariaDB with Docker
Getting started with MariaDB with DockerMariaDB plc
 
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement VMware Tanzu
 
How does Apache Pegasus (incubating) community develop at SensorsData
How does Apache Pegasus (incubating) community develop at SensorsDataHow does Apache Pegasus (incubating) community develop at SensorsData
How does Apache Pegasus (incubating) community develop at SensorsDataacelyc1112009
 
What are clouds made from
What are clouds made fromWhat are clouds made from
What are clouds made fromJohn Garbutt
 
BDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarnBDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarnJerry Wen
 
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014Amazon Web Services
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Mac Moore
 
Netflix0SS Services on Docker
Netflix0SS Services on DockerNetflix0SS Services on Docker
Netflix0SS Services on DockerDocker, Inc.
 
Ibm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinalIbm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinalaspyker
 
Better, faster, cheaper infrastructure with apache cloud stack and riak cs redux
Better, faster, cheaper infrastructure with apache cloud stack and riak cs reduxBetter, faster, cheaper infrastructure with apache cloud stack and riak cs redux
Better, faster, cheaper infrastructure with apache cloud stack and riak cs reduxJohn Burwell
 
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]Rainforest QA
 
AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...
AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...
AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...Amazon Web Services
 

Similar a Apache Hadoop YARN State of the Union (20)

MariaDB on Docker
MariaDB on DockerMariaDB on Docker
MariaDB on Docker
 
Hacking apache cloud stack
Hacking apache cloud stackHacking apache cloud stack
Hacking apache cloud stack
 
Running database infrastructure on containers
Running database infrastructure on containersRunning database infrastructure on containers
Running database infrastructure on containers
 
Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0
 
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMRBDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
BDA 302 Deep Dive on Migrating Big Data Workloads to Amazon EMR
 
Getting Started with MariaDB with Docker
Getting Started with MariaDB with DockerGetting Started with MariaDB with Docker
Getting Started with MariaDB with Docker
 
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and PerformanceVMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
VMworld 2013: Architecting VMware Horizon Workspace for Scale and Performance
 
Getting started with MariaDB with Docker
Getting started with MariaDB with DockerGetting started with MariaDB with Docker
Getting started with MariaDB with Docker
 
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
Slides for the Apache Geode Hands-on Meetup and Hackathon Announcement
 
How does Apache Pegasus (incubating) community develop at SensorsData
How does Apache Pegasus (incubating) community develop at SensorsDataHow does Apache Pegasus (incubating) community develop at SensorsData
How does Apache Pegasus (incubating) community develop at SensorsData
 
High availability
High availabilityHigh availability
High availability
 
What are clouds made from
What are clouds made fromWhat are clouds made from
What are clouds made from
 
BDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarnBDTC2015 hulu-梁宇明-voidbox - docker on yarn
BDTC2015 hulu-梁宇明-voidbox - docker on yarn
 
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
 
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015Scaling Spark Workloads on YARN - Boulder/Denver July 2015
Scaling Spark Workloads on YARN - Boulder/Denver July 2015
 
Netflix0SS Services on Docker
Netflix0SS Services on DockerNetflix0SS Services on Docker
Netflix0SS Services on Docker
 
Ibm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinalIbm cloud nativenetflixossfinal
Ibm cloud nativenetflixossfinal
 
Better, faster, cheaper infrastructure with apache cloud stack and riak cs redux
Better, faster, cheaper infrastructure with apache cloud stack and riak cs reduxBetter, faster, cheaper infrastructure with apache cloud stack and riak cs redux
Better, faster, cheaper infrastructure with apache cloud stack and riak cs redux
 
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
How does Riak compare to Cassandra? [Cassandra London User Group July 2011]
 
AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...
AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...
AWS re:Invent 2016: Service Integration Delivery and Automation Using Amazon ...
 

Último

Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfVictor Lopez
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024vaibhav130304
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfWSO2
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfQ-Advise
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMarkus Moeller
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignNeo4j
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Andrea Goulet
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationHelp Desk Migration
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...OnePlan Solutions
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems ApproachNeo4j
 
Weeding your micro service landscape.pdf
Weeding your micro service landscape.pdfWeeding your micro service landscape.pdf
Weeding your micro service landscape.pdftimtebeek1
 
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024SimonedeGijt
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAShane Coughlan
 
The Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test AutomationThe Strategic Impact of Buying vs Building in Test Automation
The Strategic Impact of Buying vs Building in Test AutomationElement34
 
Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationWave PLM
 
SQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionSQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionMohammed Fazuluddin
 
Community is Just as Important as Code by Andrea Goulet
Community is Just as Important as Code by Andrea GouletCommunity is Just as Important as Code by Andrea Goulet
Community is Just as Important as Code by Andrea GouletAndrea Goulet
 
Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...
Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...
Salesforce Introduced Zero Copy Partner Network to Simplify the Process of In...CloudMetic
 

Último (20)

Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024
 
architecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdfarchitecting-ai-in-the-enterprise-apis-and-applications.pdf
architecting-ai-in-the-enterprise-apis-and-applications.pdf
 
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdfMicrosoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
Microsoft 365 Copilot; An AI tool changing the world of work _PDF.pdf
 
Microsoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdfMicrosoft365_Dev_Security_2024_05_16.pdf
Microsoft365_Dev_Security_2024_05_16.pdf
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
Weeding your micro service landscape.pdf

Apache Hadoop YARN State of the Union

  • 1. Apache Hadoop YARN: State of the Union Weiwei Yang/Chunde Ren Hortonworks/Alibaba
  • 2. Weiwei Yang • Software Engineer at Hortonworks, YARN dev team • Apache Hadoop committer and PMC member Chunde Ren • Staff Software Engineer at Alibaba • Leading the Hadoop team in real-time computation platform
  • 3. • State of the Union: Service, ML, Cloud and beyond (Scale and Performance; Unified platform) • Apache YARN 3.1 in Alibaba (Utilization+: balance & oversubscription; Hybrid clusters) • Q&A
  • 4. State of the Union: Service, ML, Cloud… Weiwei Yang Hortonworks, YARN
  • 5. 1 Year Timeline: GA Releases 2.9.0, 3.0.0, 3.1.0, 3.2.0 • Submarine • Node attributes • Service upgrade • Containerization improvements • Global Scheduling • Multiple Resource types • New YARN UI • Timeline v2 • GPU/FPGA • YARN Native Service • Global scheduling • Placement Constraints • YARN Federation • Opportunistic Containers • New YARN UI • Timeline v2 • Dates: Nov 17, Dec 17, Aug 18, Oct 18 • Latest releases: 2.9.1, 3.0.3, 3.1.1, 3.2.0
  • 6. Apache Hadoop YARN Unified Data Operative System ML Streaming Ad-hoc Deep Learning No-SQL SQL Service Compute Resource SLA Utilization
  • 7. Focus area • Continue to evolve at large scale • Scale • Global Scheduling • Unified platform • Container runtime and Services • Placement constraints • Beyond: Submarine/CSI
  • 8. Scale Today • Many sites run clusters made up of a large number of nodes • Oath (Yahoo!), Twitter, LinkedIn, Microsoft, Alibaba etc. • 50K nodes in a single cluster at Microsoft [1] • Roadmap: to 100K and beyond • [1] https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/
  • 9. Global Scheduling Thread 1 Proposal 1 Thread 2 Proposal 2 Thread 3 Proposal 3 Scheduler state Placement Committer
  • 10. Global Scheduling: takeaways • Addresses hotspot issues • Allows plugging in customized node scoring policies (customized slot-selection themes) • Scoring can be done in the background or in place • Not a fit for clusters that merely run small batches
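The propose-then-commit flow of slides 9 and 10 can be sketched roughly as below. This is a minimal illustration, not YARN's CapacityScheduler code: `Proposal`, `Committer`, `propose` and the `most_free_memory` scoring policy are hypothetical names, and the two proposals are built one after another from the same stale snapshot to stand in for concurrent scheduling threads.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    app: str
    node: str
    memory_mb: int


class Committer:
    """Single point that validates proposals against the live scheduler state."""

    def __init__(self, free_mb):
        self._free_mb = dict(free_mb)
        self._lock = threading.Lock()

    def snapshot(self):
        # Scheduling threads work on a copy, so they never block each other.
        with self._lock:
            return dict(self._free_mb)

    def try_commit(self, proposal):
        with self._lock:
            if self._free_mb.get(proposal.node, 0) < proposal.memory_mb:
                return False  # state changed since the snapshot: reject
            self._free_mb[proposal.node] -= proposal.memory_mb
            return True


def most_free_memory(node, snapshot):
    # Hypothetical pluggable scoring policy: prefer the emptiest node.
    return snapshot[node]


def propose(app, memory_mb, snapshot, score=most_free_memory):
    candidates = [n for n, free in snapshot.items() if free >= memory_mb]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: score(n, snapshot))
    return Proposal(app, best, memory_mb)


committer = Committer({"host1": 4096, "host2": 2048})
stale = committer.snapshot()
p1 = propose("app1", 3072, stale)
p2 = propose("app2", 3072, stale)  # same snapshot, so it also targets host1
results = [committer.try_commit(p1), committer.try_commit(p2)]
```

Both proposals pick the same node because they share a snapshot; the committer accepts the first and rejects the second, which would then have to be re-proposed against fresh state.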
  • 11. Docker Container • Better packing model • Lightweight mechanism for packaging and resource isolation • Popularized and made accessible by Docker • Natively integrated in YARN • Docker container runtime • Many security/usability improvements added to 3.x
  • 12. Container Runtime: run both Docker and non-Docker containers on the same cluster
  • 13. YARN Service Simplified Service Framework • Service discovery: DNS Registration • Service timeout • Upgrade • Integrated REST API/CLI/UI
  • 15. Placement Constraints • Anti-affinity: don't place containers together • Affinity: collocate containers • Cardinality: control the number of containers per node/rack • Expression, namespace, service spec and more
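One way to read the three constraint types on slide 15 is as bounds on how many containers carrying a given tag already sit on a candidate node. The sketch below assumes that model; the function names and the tag lists are hypothetical and this is not the YARN `PlacementConstraints` API.

```python
from collections import Counter


def satisfies(existing_tags, tag, min_count, max_count):
    """True if the node currently holds between min_count and max_count
    containers tagged `tag` (both bounds inclusive)."""
    count = Counter(existing_tags)[tag]
    return min_count <= count <= max_count


def anti_affinity(existing_tags, tag):
    # Place only where no container with this tag is running.
    return satisfies(existing_tags, tag, 0, 0)


def affinity(existing_tags, tag):
    # Place only where at least one container with this tag is running.
    return satisfies(existing_tags, tag, 1, float("inf"))


def cardinality(existing_tags, tag, max_per_node):
    # Placing one more container must not exceed max_per_node.
    return satisfies(existing_tags, tag, 0, max_per_node - 1)


node = ["hbase", "hbase", "spark"]  # tags of containers already on the node
```

With this node, a new `hbase` container fails anti-affinity, a container that wants to sit next to `spark` passes affinity, and a third `hbase` container is allowed only if the per-node cap is at least 3.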
  • 16. Node Attributes • Host1: rm.yarn.io/hostName="host1", rm.yarn.io/hostType="new", nm.yarn.io/javaVersion="1.8" • Host2: rm.yarn.io/hostName="host2", rm.yarn.io/hostType="old", nm.yarn.io/javaVersion="1.9" • RM state: hostName=host1 gives Allocation = host1; javaVersion>1.6 gives Allocation = host1 | host2 • 1. Centralized/Distributed node-attributes 2. Distributed node-attributes support config/script providers 3. Admin tools and RESTful APIs
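The matching shown on slide 16 can be reproduced with a small sketch. It uses the two hosts and the two expressions from the slide; `matching_nodes` and `version_key` are hypothetical helpers, the attribute prefixes are dropped for brevity, and the `>` operator follows the slide rather than any particular YARN release.

```python
NODES = {
    "host1": {"hostName": "host1", "hostType": "new", "javaVersion": "1.8"},
    "host2": {"hostName": "host2", "hostType": "old", "javaVersion": "1.9"},
}


def version_key(version):
    """Turn '1.8' into (1, 8) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def matching_nodes(nodes, attr, op, value):
    """Return the sorted names of nodes whose attribute satisfies the expression."""
    if op == "=":
        test = lambda actual: actual == value
    elif op == "!=":
        test = lambda actual: actual != value
    elif op == ">":
        test = lambda actual: version_key(actual) > version_key(value)
    else:
        raise ValueError(f"unsupported operator: {op}")
    return sorted(
        name
        for name, attrs in nodes.items()
        if attr in attrs and test(attrs[attr])
    )


by_name = matching_nodes(NODES, "hostName", "=", "host1")
by_java = matching_nodes(NODES, "javaVersion", ">", "1.6")
```

`by_name` contains only host1, while `by_java` contains both hosts, matching the two allocations on the slide.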
  • 17. Submarine: TF Hello world. Run distributed TF training with one command: yarn jar hadoop-yarn-applications-submarine-<version>.jar job run --name tf-job-001 --docker_image <your docker image> --input_path hdfs://default/dataset/cifar-10-data --checkpoint_path hdfs://default/tmp/cifar-10-jobdir --num_workers 2 --worker_resources memory=8G,vcores=2,gpu=2 --worker_launch_cmd "cmd for worker ..." --num_ps 2 --ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps"
  • 19. The Adaption of CSI • Container Storage Interface (CSI): a vendor-neutral interface to bridge Container Orchestrators and Storage Providers (shown: Amazon EBS, Hadoop Ozone)
  • 20. Volume Resource: part of the container resource request
  • 21. Deployment (diagram): Resource Manager (master) and two Node Managers (slaves); components shown: Volume Manager, CSI Driver Adaptor, 3rd Party CSI Driver (Controller Plugin, Node Plugin, Identity Plugin), Containers, Volume Mount, Storage System
  • 22. Apache YARN 3.1 in Alibaba Chunde Ren Alibaba, Real-time Computation
  • 23. The Ecosystem: BI, Recommendation, Security, Search, Ads; Streaming + Batch; Apache YARN; Apache HDFS
  • 24. The challenges & solutions • YARN Clusters • Tens of thousands of nodes, version: 3.1.0 + patches • Long-running streaming jobs >> Batch jobs, Machine Learning • SLA: latency, priority, failover • HA, performance, load balancer, placement constraints, isolation • Resource Utilization • Elastic queue capacity & preemption, Oversubscription, Hybrid cluster
  • 26. Load Balance – Node Scores: dynamically optimize container distribution across the cluster • Policy: app or queue • Shuffle candidates • Auto-resize allocation threads • Score cache
  • 27. Load Balance - Reschedule: Eliminate Hotspots and Fragmentations • NM/Container utilization metrics • Identify hotspot & idle nodes • Preemption: priority, candidate app cache, preemption contract protocol • Diagram labels: Hotspot Candidates Selector, Rescheduler, App1 Container, App2 Container, Scheduler edit, Scheduler States, Notify Scheduler, MARK_CONTAINER_FOR_PREEMPTION, Resource Manager, MainScheduler
  • 28. Resource Oversubscription • Container execution type: Guaranteed/Opportunistic • Schedule Mode: Distributed Scheduler, extended AMS, extended CapacityScheduler • Isolation: CPU: O container share = 2; Memory: OOM Controller + Killer vs QoS Monitor; Network: net_cls + tc + switch; Disk IO: HDD blk.weight vs SSD • Preemption: High/Low watermark; O(App Priority) < G(App Priority) • Result: auto operation, utilization >10%
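The high/low watermark preemption on slide 28 is not spelled out in the deck, so the following is only one plausible reading: once node utilization passes the high watermark, opportunistic (O) containers are killed, least important first, until utilization drops to the low watermark; guaranteed (G) containers are never chosen. The `Container` fields, the watermark values and the rule that a smaller priority number means less important are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    opportunistic: bool
    priority: int   # assumption: smaller number = less important
    usage_pct: int  # share of node capacity in percent


def select_victims(containers, high_pct=90, low_pct=70):
    """Pick opportunistic containers to preempt on an overloaded node."""
    total = sum(c.usage_pct for c in containers)
    if total <= high_pct:
        return []  # below the high watermark: nothing to do
    victims = []
    opportunistic = sorted(
        (c for c in containers if c.opportunistic),
        key=lambda c: c.priority,
    )
    for c in opportunistic:
        if total <= low_pct:
            break  # back under the low watermark
        victims.append(c.name)
        total -= c.usage_pct
    return victims


busy_node = [
    Container("g1", opportunistic=False, priority=5, usage_pct=50),
    Container("o1", opportunistic=True, priority=1, usage_pct=20),
    Container("o2", opportunistic=True, priority=2, usage_pct=25),
]
victims = select_victims(busy_node)
```

Using two watermarks rather than one avoids oscillation: preemption starts late (at the high mark) but then frees enough headroom (down to the low mark) that the node does not immediately trip the threshold again.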
  • 29. Hybrid Cluster: improve online service cluster resource utilization • Architecture: YARN on K8s • Isolation: powered by aliKernel • Use Case: machine learning & deep learning • Utilization: (figure) • Diagram labels: K8s, NodeManager, G+O Containers, (NM docker), Services, Allocated Services, ResourceManager, Controller, API Server, Master