With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
4. Pushing Cluster Utility Boundaries
[Chart: compute total vs. used (TB), one-month sample (2016); y-axis 0 to 300 TB]
Consolidated cluster: HDFS 65 PB, compute 240 TB, avg. util. 70%
Before: 10,500 servers. After: 2,200 servers
40% decrease in TCO; 65% increase in compute capacity; 50% increase in avg. utilization
5. Pushing Cluster Heterogeneity Boundaries
[Diagram: Racks 1 through N on a network backplane, mixing CPU servers with JBODs & 10GbE, GPU servers, and Hi-Mem servers; GPU servers interconnected over 100Gbps InfiniBand]
6. Pushing Deep Learning Boundaries
CaffeOnSpark: a powerful DL platform on existing clusters; fully distributed, high-level API, incremental learning, Apache License
github.com/yahoo/caffeonspark
10. Pushing NoSQL Boundaries with Omid¹
Highly performant and fault-tolerant ACID transactional framework
New Apache Incubator project: incubator.apache.org/projects/omid.html
Handles millions of transactions per day for search and personalization products
¹ Omid stands for "Hope" in Persian
13. THANK YOU
SUMEET SINGH (@sumeetksingh)
Sr. Director, Cloud and Big Data Platforms
Icon Courtesy – iconfinder.com (under Creative Commons)
Editor's notes
(1 min)
Good morning. My name is Sumeet Singh, and I am a Sr. Director of Products at Yahoo.
We have a long history of involvement with Hadoop, and we rely on the platform heavily as a business. And as a result, we continue to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for our organization.
I am going to talk about some of the recent innovations and open source contributions Yahoo has made that I believe push the platform's boundaries.
(1 min – T 2 min)
And, finally, a set of internal tools for monitoring, on-boarding and reporting.
(1 min – T 3 min)
In Q3 last year, we began a tech refresh cycle in which we intended to retire three reasonably large clusters totaling 10,500 old servers
The clusters had an aggregate utilization of less than 50%, shown by the purple line here for the three clusters
(1 min – T 4 min)
With the consolidation, we were able to setup a single brand new cluster that absorbed over 100 active projects running on the old clusters
The new cluster has storage parity and 65% more compute capacity than the previous three clusters combined
We are now able to run the cluster at an average utilization of 70% or more (the purple line), a 50% increase from before, and at a 40% lower cluster TCO that, I would argue, more than funds what we spent setting up the new cluster
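A back-of-envelope sanity check of those consolidation gains (a sketch only; the ~46.7% pre-consolidation utilization is inferred from the stated 50% increase to 70%):

```python
# Rough check of the consolidation numbers quoted in the talk.
# Assumed inputs: 65% more raw compute capacity after consolidation,
# and average utilization that rose by 50% to reach 70%.
capacity_gain = 1.65              # new cluster vs. old three clusters combined
util_after = 0.70                 # consolidated cluster's average utilization
util_before = util_after / 1.5    # 70% is a 50% increase over the old aggregate

# Effective (used) compute delivered, relative to the old clusters:
used_compute_ratio = capacity_gain * (util_after / util_before)

print(f"old utilization ~ {util_before:.1%}")
print(f"used compute ratio ~ {used_compute_ratio:.2f}x")  # about 2.5x
```

So even with 80% fewer servers, the new cluster delivers roughly two and a half times the *used* compute, which is where the TCO argument comes from.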
(1 min – T 5 min)
And then connected the GPUs with 100G InfiniBand for RDMA that gave us the capability to fully distribute the deep learning
(2 min – T 7 min)
And, best yet, CaffeOnSpark was open sourced last month under the Apache 2.0 license
(1 min – T 8 min)
MapReduce, in blue, accounted for two-thirds of workloads at the end of March but is declining in favor of Tez, now at 21%; the two are tracking each other as Hive and Pig workloads move to Tez at scale
Spark is relatively stable at about 12%, with most of the iterative processing / ML workloads running on it
(2 min – T 10 min)
In the absence of one, we established a real-world streaming benchmark; the code is available on GitHub. I am excited to tell you that most of these multi-tenancy, scale, and security changes are available in the community releases or are on their way to being released
(1 min – T 11 min)
A certain class of problems in big data analytics doesn't scale well because the queries take too much time or too many resources: count distinct, most frequent items, quantiles, etc.
That's where sketch algorithms come in: "good enough" approximate answers work great for interactivity (and real-time stream data)
We have used Sketches successfully for several use cases, such as audience analytics and Flurry analytics for our Mobile Developer Suite
Sketches integrates really well with Druid for sub-second OLAP, where we have made many contributions recently, such as dimension joins, reliable pull-based real-time ingestion, and schema introspection
Sketches is now available in open source, and integrates well with Pig and Hive from the Hadoop ecosystem
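To make the sketch idea concrete, here is a toy k-minimum-values (KMV) distinct-count sketch in plain Python. This is only an illustration of the approximate-counting principle; the production DataSketches library uses far more refined Theta sketches:

```python
import hashlib

class KMVSketch:
    """Toy k-minimum-values sketch for approximate count-distinct.

    Idea: hash every item to a uniform value in (0, 1] and keep only the
    k smallest hashes. If the k-th smallest hash is m, then roughly
    k / m distinct items have been seen. Memory is O(k), not O(n).
    """

    def __init__(self, k=256):
        self.k = k
        self.mins = []  # k smallest normalized hash values seen so far

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).digest()
        # Map the first 8 bytes to a float in (0, 1].
        return (int.from_bytes(h[:8], "big") + 1) / 2**64

    def update(self, item):
        v = self._hash(item)
        if len(self.mins) == self.k and v >= self.mins[-1]:
            return  # larger than the current k-th minimum: irrelevant
        if v in self.mins:
            return  # duplicate item: same hash, already tracked
        if len(self.mins) < self.k:
            self.mins.append(v)
        else:
            self.mins[-1] = v  # replace the largest of the k minima
        self.mins.sort()

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))  # still exact below capacity
        # k-th smallest of n uniforms is ~ k / (n + 1), so n ~ (k - 1) / m.
        return (self.k - 1) / self.mins[-1]

sketch = KMVSketch(k=256)
for i in range(100_000):
    sketch.update(f"user-{i}")
print(round(sketch.estimate()))  # within a few percent of the true 100,000
```

The same trade-off (tiny, mergeable state for a bounded-error answer) is what makes the real sketches fast enough for interactive OLAP in Druid.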
(1 min – T 12 min)
HBase is another cornerstone technology that we rely on extensively, and there are applications on HBase that need to bundle multiple read and write operations into a single unit of work. That's exactly where Omid comes in
With Omid, applications can execute transactions with ACID properties without worrying about performance and fault tolerance
Omid executes millions of transactions per day for our incremental content management platform for nextgen search and personalization products
And, I am pleased to say that the same technology is now available as a new Apache incubator project
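To illustrate the idea behind Omid-style transactions, here is a toy Python sketch of snapshot isolation with a centralized timestamp oracle and a write-write conflict check at commit time. This is a conceptual illustration only, not Omid's actual API; the real system persists commit metadata in HBase and handles failover, which this sketch ignores:

```python
import itertools

class TimestampOracle:
    """Toy centralized timestamp oracle, in the spirit of Omid's TSO."""

    def __init__(self):
        self._clock = itertools.count(1)
        self._last_commit = {}  # key -> commit timestamp of latest write

    def begin(self):
        return next(self._clock)

    def try_commit(self, start_ts, write_set):
        # Snapshot isolation: abort if any key in the write set was
        # committed after this transaction's snapshot was taken.
        if any(self._last_commit.get(k, 0) > start_ts for k in write_set):
            return None  # write-write conflict: abort
        commit_ts = next(self._clock)
        for k in write_set:
            self._last_commit[k] = commit_ts
        return commit_ts

class Transaction:
    def __init__(self, tso, store):
        self.tso, self.store = tso, store
        self.start_ts = tso.begin()
        self.writes = {}

    def put(self, key, value):
        self.writes[key] = value  # writes are buffered until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        # Read the latest version visible at start_ts (the snapshot).
        versions = self.store.get(key, [])
        visible = [(ts, v) for ts, v in versions if ts <= self.start_ts]
        return max(visible)[1] if visible else None

    def commit(self):
        commit_ts = self.tso.try_commit(self.start_ts, self.writes)
        if commit_ts is None:
            return False  # conflict: the caller should retry
        for k, v in self.writes.items():
            self.store.setdefault(k, []).append((commit_ts, v))
        return True

# Two transactions racing on the same row: the later snapshot aborts.
store, tso = {}, TimestampOracle()
t1, t2 = Transaction(tso, store), Transaction(tso, store)
t1.put("row1", "a")
t2.put("row1", "b")
print(t1.commit())  # True
print(t2.commit())  # False (write-write conflict with t1)
```

Note there are no locks anywhere: conflicts are detected at commit time against timestamps, which is what lets this style of transaction manager scale to millions of transactions per day.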
(2 min – T 14 min)
And finally, a hierarchical file system layout for humongous tables avoids HDFS directory limits and speeds up directory creation times, scaling easily to 10M regions
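A hypothetical sketch of that layout idea: hash each region name into two levels of fixed-size buckets so that no single directory ever holds millions of children (illustrative only; the path scheme and bucket fanout here are assumptions, and the actual HBase layout differs in detail):

```python
import hashlib

def region_path(table, region):
    """Map a region to a two-level hash-bucketed directory path.

    With 256 x 256 = 65,536 leaf buckets, 10M regions average only
    ~153 entries per directory, instead of one flat directory with
    10M children (which hits HDFS directory limits).
    """
    h = hashlib.md5(region.encode()).hexdigest()
    b1, b2 = h[:2], h[2:4]  # two hex-byte bucket levels
    return f"/hbase/data/{table}/{b1}/{b2}/{region}"

print(region_path("humongous_table", "region-0000001"))
```

Because the bucket is derived from a hash of the region name, lookups stay O(1): any client can recompute the path without listing directories.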
(1 min – T 15 min)
We believe that increasing machine intelligence, quest for lowering latency, higher efficiency of cluster operations, and achieving desired scale that balances out cost and efficiency are the key boundaries to push for the coming 12 months and beyond.
(30 sec – T 15.5 min)
Thank you, and enjoy the rest of the Summit. If you have questions, please drop by Liffey Hall 2 at 12:20 p.m. today, or visit the Yahoo booth #400.