At Yahoo! over the past year we have helped migrate hundreds of our grids' users to YARN. Our YARN clusters have in aggregate run over 18 million jobs with more than 3 billion tasks, consuming over 10 thousand years of compute time, with a single cluster running 90 thousand jobs a day. From this experience we would like to share what we have learned about running YARN well, how this differs from running a 1.0-based cluster, and what it takes to migrate your jobs from 1.0 to YARN.
2. Agenda
• Who We Are
• Some Background on YARN and YARN at Yahoo!
• What Was Not So Good
• What Was Good
3. Who I Am
Robert (Bobby) Evans
• Technical Lead @ Yahoo!
• Apache Hadoop Committer and PMC Member
• Past
– Hardware Design
– Linux Kernel and Device Driver Development
– Machine Learning on Hadoop
• Current
– Hadoop Core Development (MapReduce and YARN)
– TEZ, Storm and Spark
4. Who I Represent
• Yahoo! Hadoop Team
– We are over 40 people developing, maintaining and
supporting a complete Hadoop stack including
Pig, Hive, HBase, Oozie, and HCatalog.
• Hadoop Users @ Yahoo!
5. Agenda
• Who We Are
• Some Background on YARN and YARN at Yahoo!
• What Was Not So Good
• What Was Good
7. Yahoo! Scale
• About 40,000 Nodes Running Hadoop.
• Around 500,000 Map/Reduce jobs a day.
• Consuming in excess of 230 compute years
every single day.
• Over 350 PB of Storage.
• On 0.23.X we have over 20,000 years of
compute time under our belts.
http://www.flickr.com/photos/doctorow/2699158636/
9. Agenda
• Who We Are
• Some Background on YARN and YARN at Yahoo!
• What Was Not So Good
• What Was Good
10. The AM Runs on Unreliable Hardware
• Split Brain/AM Recovery (FIXED for MR but not perfect)
– For anyone else writing a YARN app, be aware you
have to handle this.
11. The AM Runs on Unreliable Hardware
• Debugging the AM is hard when it does crash.
• AM can get overwhelmed if it is on a slow node or the
job is very large.
• Tuning the AM is difficult to get right for large jobs.
– Be sure to tune the heap/container size. 1GB heap
can fit about 100,000 task attempts in memory
(25,000 tasks worst case).
http://www.flickr.com/photos/cushinglibrary/3963200463/
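The heap rule of thumb above can be turned into a quick estimate. This is only a sketch: it assumes the 1GB per 100,000 task attempts figure scales linearly, and it takes 4 attempts per task as the worst case (which is what 100,000 attempts for 25,000 tasks implies). The function name `estimate_am_heap_gb` is made up for illustration and is not part of Hadoop.

```python
import math

# Rule of thumb from the slide: a 1 GB AM heap holds about 100,000 task attempts.
ATTEMPTS_PER_GB = 100_000
# Assumption: worst case of 4 attempts per task (100,000 attempts / 25,000 tasks).
WORST_CASE_ATTEMPTS_PER_TASK = 4


def estimate_am_heap_gb(num_tasks, attempts_per_task=WORST_CASE_ATTEMPTS_PER_TASK):
    """Return a rough whole-GB AM heap estimate for a job with num_tasks tasks.

    Assumes memory use grows linearly with the number of task attempts,
    which is an extrapolation from the single data point on the slide.
    """
    attempts = num_tasks * attempts_per_task
    return max(1, math.ceil(attempts / ATTEMPTS_PER_GB))
```

For example, `estimate_am_heap_gb(100_000)` suggests planning for about 4 GB in the worst case. The AM container has to be sized larger than the heap itself to leave room for non-heap memory.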
12. Lack of Flow Control
• Both the AM and the RM are based on an asynchronous
event framework that has no flow control.
http://www.flickr.com/photos/iz4aks/4085305231/
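To make "no flow control" concrete, here is a minimal sketch of what flow control could look like in an event dispatcher: a bounded queue that refuses new events when it is full, so that producers have to slow down or retry. This is not how the AM or RM dispatcher is implemented; the `BoundedDispatcher` class and its methods are hypothetical names used only for illustration.

```python
import queue


class BoundedDispatcher:
    """Hypothetical event dispatcher whose queue refuses work when full."""

    def __init__(self, max_pending):
        # A bounded queue is the simplest form of flow control:
        # it puts a hard cap on the number of events waiting to be handled.
        self._events = queue.Queue(maxsize=max_pending)

    def offer(self, event):
        """Try to enqueue without blocking; return False when the queue is full."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            return False

    def drain(self, handler):
        """Handle every queued event in order and return how many were handled."""
        handled = 0
        while True:
            try:
                event = self._events.get_nowait()
            except queue.Empty:
                return handled
            handler(event)
            handled += 1
```

With an unbounded queue, a slow handler lets pending events pile up in memory without limit. With the bounded version, `offer` starts returning `False` and the caller can back off.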
13. Name Node Load
• YARN launches tasks faster than 1.0
• MR keeps a running history log for recovery
• Log Aggregation.
– 7 days of aggregated logs used up approximately
30% of the total namespace.
• 50% higher write load on HDFS for the same
jobs
• 160% more rename operations
• 60% more create, addBlock and fsync
operations
14. Web UI
• Resource Manager and History Server Forget Apps too Quickly
• Browser/Javascript Heavy
• Follows the YARN model, so it can be confusing for those used to
the old UI.
15. Binary Incompatibility
• Map/Reduce APIs are not binary compatible between 1.0
and 0.23. They are source compatible though, so only a
recompile is required.
16. Agenda
• Who We Are
• Some Background on YARN and YARN at Yahoo!
• What Was Not So Good
• What Was Good
17. Operability
“The issues were not with
incompatibilities, but coupling between
applications and check-offs.”
-- Rajiv Chittajallu
18. Performance
Tests run on a 350 node cluster on top of JDK 1.6.0
Benchmark                                  1.0.2    0.23.3   Improvement
Sort (GB/s throughput)                     2.26     2.35     4%
Sort with compression (GB/s throughput)    4.5      4.5      0%
Shuffle (mean shuffle time secs)           303.8    263.5    13%
Scan (GB/s throughput)                     25.2     22.9     -9%
Gridmix 3 replay (Runtime secs)            2817     2668     5%
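The Improvement column can be reproduced from the two measurement columns. The sketch below assumes the percentages are relative to the 1.0.2 baseline, that higher is better for throughput rows, and that lower is better for time rows. Under those assumptions it matches every row of the table. `improvement_pct` is an illustrative helper, not part of any benchmark tooling.

```python
def improvement_pct(baseline, candidate, higher_is_better=True):
    """Percent improvement of candidate over baseline, rounded to a whole percent.

    For throughput metrics a larger candidate is an improvement; for timing
    metrics (higher_is_better=False) a smaller candidate is an improvement.
    """
    delta = candidate - baseline if higher_is_better else baseline - candidate
    return round(100.0 * delta / baseline)
```

For example, `improvement_pct(303.8, 263.5, higher_is_better=False)` gives the 13% shown for Shuffle, and `improvement_pct(25.2, 22.9)` gives the -9% regression shown for Scan.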
19. Web Services/Log Aggregation
• No more scraping of web pages needed
– Resource Manager
– Node Managers
– History Server
– MR App Master
• Deep analysis of log output using Map/Reduce
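As a rough illustration of what replaces page scraping, the sketch below lists applications through the Resource Manager's REST interface under `/ws/v1/cluster/apps`. The host name, the port 8088, and the `user` filter value are placeholders, and the JSON layout (`apps` containing an `app` list, each with `id` and `state`) is assumed from the web services documentation, so check it against your own version. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def apps_url(rm_address, **filters):
    """Build the cluster applications URL for a Resource Manager host:port."""
    base = "http://%s/ws/v1/cluster/apps" % rm_address
    if not filters:
        return base
    # Sort the filters so the generated URL is deterministic.
    return base + "?" + urlencode(sorted(filters.items()))


def parse_apps(body):
    """Extract (id, state) pairs from the JSON body of an apps response."""
    payload = json.loads(body)
    # Assumption: an empty result may come back as {"apps": null}.
    apps = (payload.get("apps") or {}).get("app") or []
    return [(app.get("id"), app.get("state")) for app in apps]


def fetch_apps(rm_address, **filters):
    """Fetch and parse the application list (requires network access to the RM)."""
    with urlopen(apps_url(rm_address, **filters)) as resp:
        return parse_apps(resp.read().decode("utf-8"))
```

A call such as `fetch_apps("rm.example.com:8088", user="alice")` would return a list of `(id, state)` tuples that a monitoring script can use directly.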
21. Total Capacity
Our most heavily used cluster was able to increase from
80,000 jobs a day to 125,000 jobs a day.
That is more than a 50% increase. It is as if we had bought over
1000 new servers and added them to the cluster.
This is primarily due to the removal of the artificial split
between maps and reduces, but also because the Job
Tracker could not keep up with tracking/launching all the
tasks.
22. Conclusion
Upgrading to 0.23 from 1.0 took a lot of planning and effort.
Most of that was stabilization and hardening of Hadoop for
the scale that we run at, but it was worth it.
MR AM abandons containers that were already running. Testing recovery code that is a path rarely taken.
Uber-AM also saw big performance gains for small jobs. We have run other performance tests, but most of them are on different hardware and compare different versions of 0.23. Sorry, we are not going to release the code for the benchmarks.