With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
4. Pushing Cluster Utility Boundaries
[Chart: compute total vs. used (TB), one-month sample (2016); y-axis 0 to 300 TB]
Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. util. 70%
Before: 10,500 servers. After: 2,200 servers
40% decrease in TCO; 65% increase in compute capacity; 50% increase in avg. utilization
5. Pushing Cluster Heterogeneity Boundaries
[Diagram: Racks 1 through N on a network backplane, mixing CPU servers with JBODs & 10GbE, GPU servers, and Hi-Mem servers; GPU servers interconnected over 100Gbps InfiniBand]
6. Pushing Deep Learning Boundaries
CaffeOnSpark: a powerful DL platform on existing clusters; fully distributed, high-level API, incremental learning, Apache License
github.com/yahoo/caffeonspark
10. Pushing NoSQL Boundaries with Omid¹
Highly performant and fault-tolerant ACID transactional framework
New Apache Incubator project: incubator.apache.org/projects/omid.html
Handles millions of transactions per day for search and personalization products
¹ Omid stands for "Hope" in Persian
13. THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)
Editor's notes
(1 min)
Good morning. My name is Sumeet Singh, and I am a Sr. Director of Products at Yahoo.
We have a long history of involvement with Hadoop, and we rely on the platform heavily as a business. And as a result, we continue to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for our organization.
I am going to talk about some of the recent innovations and open source contributions Yahoo has made that I believe push the platform's boundaries.
(1 min – T 2 min)
And, finally, a set of internal tools for monitoring, on-boarding and reporting.
(1 min – T 3 min)
In Q3 last year, we began a tech refresh cycle in which we intended to retire three reasonably large clusters totaling 10,500 old servers
The clusters had an aggregate utilization of less than 50%, shown by the purple line here for the three clusters
(1 min – T 4 min)
With the consolidation, we were able to setup a single brand new cluster that absorbed over 100 active projects running on the old clusters
The new cluster has storage parity and 65% more compute capacity than the previous three clusters combined
We are now able to run the cluster at an average utilization of 70% or more (the purple line), a 50% increase from before, and at a 40% lower cluster TCO that, I would argue, more than funds what we spent setting up the new cluster
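A back-of-envelope sanity check of those consolidation gains (a sketch only; the ~46.7% pre-consolidation utilization is inferred from the stated 50% increase to 70%):

```python
# Rough check of the consolidation numbers quoted in the talk.
# Assumed inputs: 65% more raw compute capacity after consolidation,
# and average utilization that rose by 50% to reach 70%.
capacity_gain = 1.65              # new cluster vs. old three clusters combined
util_after = 0.70                 # consolidated cluster's average utilization
util_before = util_after / 1.5    # 70% is a 50% increase over the old aggregate

# Effective (used) compute delivered, relative to the old clusters:
used_compute_ratio = capacity_gain * (util_after / util_before)

print(f"old utilization ~ {util_before:.1%}")
print(f"used compute ratio ~ {used_compute_ratio:.2f}x")  # about 2.5x
```

So even with 80% fewer servers, the new cluster delivers roughly two and a half times the *used* compute, which is where the TCO argument comes from.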
(1 min – T 5 min)
And then connected the GPUs with 100G InfiniBand for RDMA that gave us the capability to fully distribute the deep learning
(2 min – T 7 min)
And, best yet, CaffeOnSpark was open sourced last month under the Apache 2.0 license
(1 min – T 8 min)
MapReduce, in blue, accounted for two-thirds of workloads at the end of March but is declining in favor of Tez, now at 21%; the two are tracking each other as Hive and Pig workloads move to Tez at scale
Spark is relatively stable at about 12%, with most of the iterative processing / ML workloads running on it
(2 min – T 10 min)
In the absence of one, we established a real-world streaming benchmark; the code is available on GitHub. I am excited to tell you that most of these multi-tenancy, scale, and security changes are available in the community releases or are on their way to being released
(1 min – T 11 min)
A certain class of problems in big data analytics doesn't scale well because the queries take too much time or too many resources: count distinct, most frequent items, quantiles, etc.
That's where sketch algorithms come in: "good enough" approximate answers work great for interactivity (and real-time stream data)
We have used Sketches successfully for several use cases, such as audience analytics and Flurry analytics for our Mobile Developer Suite
Sketches integrates really well with Druid for sub-second OLAP, where we have made many contributions recently, such as dimension joins, reliable pull-based real-time ingestion, and schema introspection
Sketches is now available in open source, and integrates well with Pig and Hive from the Hadoop ecosystem
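To make the sketch idea concrete, here is a toy k-minimum-values (KMV) distinct-count sketch in plain Python. This is only an illustration of the approximate-counting principle; the production DataSketches library uses far more refined Theta sketches:

```python
import hashlib

class KMVSketch:
    """Toy k-minimum-values sketch for approximate count-distinct.

    Idea: hash every item to a uniform value in (0, 1] and keep only the
    k smallest hashes. If the k-th smallest hash is m, then roughly
    k / m distinct items have been seen. Memory is O(k), not O(n).
    """

    def __init__(self, k=256):
        self.k = k
        self.mins = []  # k smallest normalized hash values seen so far

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).digest()
        # Map the first 8 bytes to a float in (0, 1].
        return (int.from_bytes(h[:8], "big") + 1) / 2**64

    def update(self, item):
        v = self._hash(item)
        if len(self.mins) == self.k and v >= self.mins[-1]:
            return  # larger than the current k-th minimum: irrelevant
        if v in self.mins:
            return  # duplicate item: same hash, already tracked
        if len(self.mins) < self.k:
            self.mins.append(v)
        else:
            self.mins[-1] = v  # replace the largest of the k minima
        self.mins.sort()

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # still exact below capacity
        # k-th smallest of n uniforms is ~ k / (n + 1), so n ~ (k - 1) / m.
        return (self.k - 1) / self.mins[-1]

sketch = KMVSketch(k=256)
for i in range(100_000):
    sketch.update(f"user-{i}")
print(round(sketch.estimate()))  # within a few percent of the true 100,000
```

The same trade-off (tiny, mergeable state for a bounded-error answer) is what makes the real sketches fast enough for interactive OLAP in Druid.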
(1 min – T 12 min)
HBase is another cornerstone technology that we rely on extensively, and there are applications on HBase that need to bundle multiple read and write operations into a single unit of work. That's exactly where Omid comes in
With Omid, applications can execute transactions with ACID properties without worrying about performance and fault tolerance
Omid executes millions of transactions per day for our incremental content management platform for nextgen search and personalization products
And, I am pleased to say that the same technology is now available as a new Apache incubator project
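To illustrate the idea behind Omid-style transactions, here is a toy Python sketch of snapshot isolation with a centralized timestamp oracle and a write-write conflict check at commit time. This is a conceptual illustration only, not Omid's actual API; the real system persists commit metadata in HBase and handles failover, which this sketch ignores:

```python
import itertools

class TimestampOracle:
    """Toy centralized timestamp oracle, in the spirit of Omid's TSO."""

    def __init__(self):
        self._clock = itertools.count(1)
        self._last_commit = {}  # key -> commit timestamp of latest write

    def begin(self):
        return next(self._clock)

    def try_commit(self, start_ts, write_set):
        # Snapshot isolation: abort if any key in the write set was
        # committed after this transaction's snapshot was taken.
        if any(self._last_commit.get(k, 0) > start_ts for k in write_set):
            return None  # write-write conflict: abort
        commit_ts = next(self._clock)
        for k in write_set:
            self._last_commit[k] = commit_ts
        return commit_ts

class Transaction:
    def __init__(self, tso, store):
        self.tso, self.store = tso, store
        self.start_ts = tso.begin()
        self.writes = {}

    def put(self, key, value):
        self.writes[key] = value  # writes are buffered until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        # Read the latest version visible at start_ts (the snapshot).
        versions = self.store.get(key, [])
        visible = [(ts, v) for ts, v in versions if ts <= self.start_ts]
        return max(visible)[1] if visible else None

    def commit(self):
        commit_ts = self.tso.try_commit(self.start_ts, self.writes)
        if commit_ts is None:
            return False  # conflict: the caller should retry
        for k, v in self.writes.items():
            self.store.setdefault(k, []).append((commit_ts, v))
        return True

# Two transactions racing on the same row: the later snapshot aborts.
store, tso = {}, TimestampOracle()
t1, t2 = Transaction(tso, store), Transaction(tso, store)
t1.put("row1", "a")
t2.put("row1", "b")
print(t1.commit())  # True
print(t2.commit())  # False (write-write conflict with t1)
```

Note there are no locks anywhere: conflicts are detected at commit time against timestamps, which is what lets this style of transaction manager scale to millions of transactions per day.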
(2 min – T 14 min)
And finally, a hierarchical file system layout for humongous tables avoids HDFS directory limits and speeds up directory creation times, scaling easily to 10M regions
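A hypothetical sketch of that layout idea: hash each region name into two levels of fixed-size buckets so that no single directory ever holds millions of children (illustrative only; the path scheme and bucket fanout here are assumptions, and the actual HBase layout differs in detail):

```python
import hashlib

def region_path(table, region):
    """Map a region to a two-level hash-bucketed directory path.

    With 256 x 256 = 65,536 leaf buckets, 10M regions average only
    ~153 entries per directory, instead of one flat directory with
    10M children (which hits HDFS directory limits).
    """
    h = hashlib.md5(region.encode()).hexdigest()
    b1, b2 = h[:2], h[2:4]  # two hex-byte bucket levels
    return f"/hbase/data/{table}/{b1}/{b2}/{region}"

print(region_path("humongous_table", "region-0000001"))
```

Because the bucket is derived from a hash of the region name, lookups stay O(1): any client can recompute the path without listing directories.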
(1 min – T 15 min)
We believe that increasing machine intelligence, quest for lowering latency, higher efficiency of cluster operations, and achieving desired scale that balances out cost and efficiency are the key boundaries to push for the coming 12 months and beyond.
(30 sec – T 15.5 min)
Thank you, and enjoy the rest of the Summit. If you have questions, please drop by Liffey Hall 2 at 12:20 p.m. today, or visit the Yahoo booth #400.