2. Weiwei Yang
• Software Engineer at
Hortonworks, YARN dev team
• Apache Hadoop committer and
PMC member
Chunde Ren
• Staff Software Engineer at
Alibaba
• Leading the Hadoop team in
real-time computation platform
3. • State of the Union: Service, ML, Cloud and beyond
Scale and Performance
Unified platform
• Apache YARN 3.1 in Alibaba
Utilization+: balance & oversubscription
Hybrid clusters
• Q&A
4. State of the Union: Service,
ML, Cloud…
Weiwei Yang
Hortonworks, YARN
5. 1 Year Timeline: GA Releases
2.9.0 3.0.0 3.1.0 3.2.0
• Submarine
• Node attributes
• Service upgrade
• Containerize
improvements
• Global
Scheduling
• Multiple
Resource types
• New YARN UI
• Timeline v2
• GPU/FPGA
• YARN Native
Service
• Global
scheduling
• Placement
Constraints
• YARN Federation
• Opportunistic
Containers
• New YARN UI
• Timeline v2
Nov 17 Dec 17 Aug 18 Oct 18
2.9.1 3.0.3 3.1.1 3.2.0
6. Apache Hadoop YARN
Unified Data Operative System
ML
Streaming
Ad-hoc
Deep Learning
No-SQL
SQL
Service
Compute
Resource
SLA
Utilization
7. Focus area
• Continue to evolve at large scale
• Scale
• Global Scheduling
• Unified platform
• Container runtime and Services
• Placement constraints
• Beyond: Submarine/CSI
8. Scale Today
• Many sites run clusters made up of a large number of nodes
• Oath (Yahoo!), Twitter, LinkedIn, Microsoft, Alibaba, etc.
• 50K nodes in a single cluster at Microsoft [1]
• Roadmap: to 100K and beyond
[1] https://azure.microsoft.com/en-us/blog/how-microsoft-drives-exabyte-analytics-on-the-world-s-largest-yarn-cluster/
10. Global Scheduling: takeaways
• Addresses hotspot issues
• Allows plugging in customized node scoring policies (customized slot-selection themes)
• Scoring can be done in the background or in place
• Not a good fit for clusters that only run small batch jobs
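The idea of a pluggable node scoring policy can be illustrated with a minimal sketch. This is not YARN's scheduler API: the `Node` class, the `pick_node` function, and both scoring functions are hypothetical names used only for illustration, and memory is the only resource considered.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a cluster node (not a YARN class)."""
    name: str
    total_mb: int
    used_mb: int

    @property
    def free_mb(self) -> int:
        return self.total_mb - self.used_mb

    @property
    def free_ratio(self) -> float:
        return self.free_mb / self.total_mb


def least_loaded_score(node: Node) -> float:
    """Example policy: prefer nodes with more free memory (spreads load)."""
    return node.free_ratio


def most_packed_score(node: Node) -> float:
    """Alternative policy: prefer fuller nodes (packs containers tightly)."""
    return -node.free_ratio


def pick_node(nodes, request_mb, score=least_loaded_score):
    """Return the best-scoring node that can fit the request, or None."""
    candidates = [n for n in nodes if n.free_mb >= request_mb]
    if not candidates:
        return None
    return max(candidates, key=score)


nodes = [
    Node("n1", total_mb=16384, used_mb=15360),  # too full for the request
    Node("n2", total_mb=16384, used_mb=4096),
    Node("n3", total_mb=16384, used_mb=8192),
]

print(pick_node(nodes, request_mb=2048).name)
print(pick_node(nodes, request_mb=2048, score=most_packed_score).name)
```

Looking across all candidate nodes before choosing one is what lets a policy like `least_loaded_score` steer new containers away from hotspots. Swapping the `score` argument changes the placement behavior without touching the selection loop.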
11. Docker Container
• Better packaging model
• Lightweight mechanism for packaging and resource isolation
• Popularized and made accessible by Docker
• Natively integrated in YARN
• Docker container runtime
• Many security/usability improvements added in 3.x
15. Placement Constraints
• Anti-affinity: don't place containers together
• Affinity: collocate containers
• Cardinality: control number of containers per node/rack
Expression, namespace, service spec and more
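The three constraint types can be read as variations of one check: count the containers with a given tag inside a scope (node or rack) and compare against bounds. The sketch below is a simplified model of that reading, not YARN's constraint API. The function names and the `node_tags` cluster state are hypothetical.

```python
from collections import Counter


def cardinality_ok(tags_in_scope, target_tag, min_count, max_count):
    """True if the scope (a node or a rack) currently holds between
    min_count and max_count containers carrying target_tag."""
    count = Counter(tags_in_scope)[target_tag]
    return min_count <= count <= max_count


def anti_affinity_ok(tags_in_scope, target_tag):
    # Anti-affinity: no container with target_tag may be in the scope.
    return cardinality_ok(tags_in_scope, target_tag, 0, 0)


def affinity_ok(tags_in_scope, target_tag):
    # Affinity: at least one container with target_tag must be in the scope.
    return cardinality_ok(tags_in_scope, target_tag, 1, float("inf"))


# Hypothetical cluster state: tags of the containers running on each node.
node_tags = {
    "node1": ["hbase", "hbase"],
    "node2": ["zk"],
    "node3": [],
}


def eligible_nodes(check, *args):
    """Names of nodes on which a new container would satisfy the check."""
    return sorted(n for n, tags in node_tags.items() if check(tags, *args))


print(eligible_nodes(anti_affinity_ok, "zk"))         # nodes with no zk container
print(eligible_nodes(affinity_ok, "hbase"))           # nodes already running hbase
print(eligible_nodes(cardinality_ok, "hbase", 0, 1))  # nodes with at most one hbase
```

To check a rack scope instead, the same functions can be applied to the combined tag list of all nodes in that rack.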
17. Submarine: TF Hello world
Run distributed TF training with one command:
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-job-001 --docker_image <your docker image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=2 \
  --worker_launch_cmd "cmd for worker ..." \
  --num_ps 2 \
  --ps_resources memory=4G,vcores=2,gpu=0 \
  --ps_launch_cmd "cmd for ps"
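When the same job is launched repeatedly with different settings, the command on this slide can be assembled in a script. The sketch below only uses the flags shown on the slide. The `build_job_run` helper is a hypothetical name, and the `<version>` and `<your docker image>` placeholders are kept as they appear and must be filled in before a real run.

```python
import shlex

# Placeholder kept exactly as on the slide; substitute the real jar version.
JAR = "hadoop-yarn-applications-submarine-<version>.jar"


def build_job_run(name, docker_image, options):
    """Build the argv list for a 'job run' invocation.

    options maps flag names (without the leading '--') to values,
    in the order they should appear on the command line.
    """
    argv = ["yarn", "jar", JAR, "job", "run",
            "--name", name, "--docker_image", docker_image]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


argv = build_job_run(
    "tf-job-001",
    "<your docker image>",  # placeholder from the slide
    {
        "input_path": "hdfs://default/dataset/cifar-10-data",
        "checkpoint_path": "hdfs://default/tmp/cifar-10-jobdir",
        "num_workers": 2,
        "worker_resources": "memory=8G,vcores=2,gpu=2",
        "worker_launch_cmd": "cmd for worker ...",
        "num_ps": 2,
        "ps_resources": "memory=4G,vcores=2,gpu=0",
        "ps_launch_cmd": "cmd for ps",
    },
)

# shlex.join quotes arguments containing spaces or shell metacharacters,
# so the printed line can be pasted into a shell.
print(shlex.join(argv))
```

Keeping the arguments as a list and quoting only at print time avoids mistakes with values such as the launch commands, which contain spaces.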