Costing Your Big Data Operations
PRESENTED BY Sumeet Singh, Amrit Lal ⎪ June 5, 2014
2014 Hadoop Summit, San Jose, California
Introduction
Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
§  Manages Hadoop products team at Yahoo!
§  Responsible for Product Management, Strategy and Customer Engagements
§  Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§  MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
Amrit Lal
Product Manager
Hadoop and Big Data Platforms
Cloud Engineering Group
701 First Avenue, Sunnyvale, CA 94089 USA
@amritasshwar
§  Product Manager at Yahoo engaged in building high class and robust Hadoop infrastructure services
§  Eight years of experience across HSBC, Oracle and Google in developing products and platforms for high growth enterprises
§  MBA from Carnegie Mellon University
Agenda
1  Total Cost of Ownership (TCO) Models
2  Deeper Understanding of (Resource) Usage
3  P&L, Metering and Billing Provisions
4  Benchmark Costs
5  Improve Utilization and ROI
Why do Costing?
Profitability: Understanding the data services costs (an element of your total project cost) to determine how profitable the project is
ROI: Investment decisions both at the platform and app / project level
Operational Efficiency: Benchmark, improve ops by focusing on avg. utilization, increasing the # hosted apps, storage efficiencies, job performance etc.
Planning: Capital planning and budgeting, product improvements
Cost Transparency: Metering / usage metrics, billing, chargeback / showback, P&L
Costing is Relevant Irrespective of the Service Model
Private Cloud
§  Fixed costs that favor scale and 24x7 operations
§  Centralized operations
§  Multi-tenant clusters with security and data sharing
§  Cost a function of desired SLA
§  Utilization and # hosted apps a primary lever
§  Tenants often tend to ignore costs
Public Cloud
§  Costs vary with usage and favor a run-and-done model
§  Decentralized operations, ops / headcount costs still relevant
§  Dedicated virtual clusters
§  Monthly bills!
§  Releasing cluster instances, when not needed, a wise idea
§  Users often overlook the peripheral costs
Important with Multi-tenancy and Scale
[Figure: number of servers (axis 0 to 45,000) and raw HDFS storage in PB (axis 0 to 500) by year, 2006 to 2014]
Milestones annotated on the chart:
§  Yahoo! Commits to Scaling Hadoop for Production Use
§  Research Workloads in Search and Advertising
§  Production (Modeling) with machine learning & WebMap
§  Revenue Systems with Security, Multi-tenancy, and SLAs
§  Open Sourced with Apache
§  Hortonworks Spinoff for Enterprise hardening
§  Nextgen Hadoop (H 0.23 YARN)
§  New Services (HBase, Storm, Hive etc.)
§  Increased User-base with partitioned namespaces
§  Apache H 2.x (Low latency, Util, HA etc.)
Hosted Apps Growth on Apache Hadoop
[Figure: number of new projects by quarter, Q1-11 to Q1-14 (axis 260 to 560); data labels 272, 330, 382, 495, 525]
New Customer Apps On-boarded
§  58 projects in 2011
§  52 projects in 2012
§  113 projects in 2013
Multi-tenant Apache HBase Growth
Zero to “20” Use Cases (60,000 Regions) in a Year
[Figure: number of region servers (axis 0 to 1,200) and data stored in PB (axis 0 to 40) by quarter, Q1-13 to Q1-14; data labels 1140 region servers and 33.6 PB]
Multi-tenant Apache Storm Growth
Zero to “175” Production Topologies in a Year
[Figure: number of supervisors (axis 0 to 800) and number of topologies (axis 0 to 200) by quarter, Q1-13 to Q1-14; data labels 760 supervisors and 175 topologies; the multi-tenancy release is marked on the timeline]
Capital Deployment for Big Data Infrastructure
[Diagram: components that capital is deployed on. Hadoop clusters (DataNode / NodeManager, NameNode, RM); HBase clusters (DataNodes / RegionServers, NameNode, HBase Master); Storm clusters (Nimbus, Supervisor); ZooKeeper pools; HTTP / HDFS / GDM load proxies; Oozie server; HS2 / HCat; administration, management and monitoring; the network backplane; applications and data (data feeds, data stores)]
Big Data Platforms Technology Stack at Yahoo
[Diagram: technology stack layers: Compute, Services, Storage, Infrastructure Services]
Components shown: Hive, Pig, Oozie, HDFS Proxy, GDM, YARN, MapReduce, Storm, Spark, Tez, HDFS, HBase, Zookeeper, Support Shop, Monitoring, Starling, Messaging Service, HCatalog
Resources Consumed in Big Data Operations
[Diagram: clusters in datacenters (Colo 1, Rack 1 through Rack N) and the server resources they consume: CPU, memory, storage, bandwidth]
Elements of a TCO Model
Monthly TCO: $2.1 M
Chart segment labels: 60%, 12%, 7%, 6%, 3%, 2%, 10%
TCO Components
1  Cluster Hardware
§  Data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2  R&D HC
§  Headcount for platform software development, quality, and release engineering
3  Active Use and Operations (Recurring)
§  Recurring datacenter ops cost (power, space, labor support, and facility maintenance)
4  Network Hardware
§  Aggregated network component costs, including switches, wiring, terminal servers, power strips etc.
5  Acquisition / Install (One-time)
§  Labor, POs, transportation, space, support, upgrades, decommissions, shipping / receiving etc.
6  Operations Engineering
§  Headcount for service engineering and data operations teams responsible for day-to-day ops and support
7  Network Bandwidth
§  Data transferred into and out of clusters for all colos, including cross-colo transfers
ILLUSTRATIVE
Understanding Apache Hadoop Resources
[Diagram: storage and compute. The NameNode and DataNodes hold DFS blocks (storage); the Resource Manager and Node Managers run MR containers (memory), and each container runs tasks (Task 1, Task 2, Task 3)]
Unit Costs for Hadoop Operations
Compute
Containers where apps can perform computation and access HDFS if needed
Unit: $ / GB-Hour (H 0.23/2.0)
Total Capacity: GBs of memory available for an hour
Unit Cost: Monthly Compute Cost / Avail. Compute Capacity
Storage
HDFS (usable) space needed by an app with default replication factor of three
Unit: $ / GB Stored
Total Capacity: Usable storage space (less replication and overheads)
Unit Cost: Monthly Storage Cost / Avail. Usable Storage
Bandwidth
Network bandwidth needed to move data into / out of the clusters by the app
Unit: $ / GB for inter-region data transfers
Total Capacity: Inter-region (peak) link capacity
Unit Cost: [Monthly GB In + Out] x $ / GB
Namespace
Files and directories used by the apps, tracked to understand / limit the load on the NN
Unit: N/A
Total Capacity: N/A
Unit Cost: N/A
Working Through A Hadoop Example
Compute
Monthly Cost: Monthly TCO (less bw.) = $2 M; Compute @ 50% = $1 M
Monthly Capacity: 315 TB memory = 315 TB x 24 x 30 = 227 M GB-Hours
Unit Cost: $1 M / 227 M GB-Hours = $0.004 / GB-Hour / Month
Storage
Monthly Cost: Monthly TCO (less bw.) = $2 M; Storage @ 50% = $1 M
Monthly Capacity: Raw HDFS = 200 PB; Usable HDFS = [ 200 x 0.8 (20% overhead) ] / 3 = 53.3 PB
Unit Cost: $1 M / 53.3 PB = $0.019 / GB / Month
Bandwidth
Monthly Cost: Monthly Charges = $0.1 M
Monthly Capacity: Total Data In + Out = 5 PB
Unit Cost: $0.1 M / 5 PB = $0.02 / GB transferred
ILLUSTRATIVE
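The Hadoop unit-cost arithmetic can be reproduced with a minimal sketch like the one below. The function and constant names are hypothetical, the inputs are the illustrative figures from the slide, and decimal units (1 PB = 1,000,000 GB) are an assumption.

```python
# Illustrative inputs taken from the worked example; names are hypothetical.
MONTHLY_TCO_USD = 2_000_000      # monthly TCO, excluding bandwidth
COMPUTE_SHARE = 0.5              # share of TCO attributed to compute
STORAGE_SHARE = 0.5              # share of TCO attributed to storage
HOURS_PER_MONTH = 24 * 30
GB_PER_TB = 1_000                # assumption: decimal units
GB_PER_PB = 1_000_000


def compute_unit_cost(memory_tb):
    """$ per GB-hour: compute share of TCO over monthly GB-hours of memory."""
    gb_hours = memory_tb * GB_PER_TB * HOURS_PER_MONTH
    return MONTHLY_TCO_USD * COMPUTE_SHARE / gb_hours


def usable_storage_pb(raw_pb, overhead=0.2, replication=3):
    """Usable HDFS space after overhead and replication."""
    return raw_pb * (1 - overhead) / replication


def storage_unit_cost(raw_pb):
    """$ per usable GB per month."""
    return MONTHLY_TCO_USD * STORAGE_SHARE / (usable_storage_pb(raw_pb) * GB_PER_PB)


def bandwidth_unit_cost(monthly_charges_usd, transferred_pb):
    """$ per GB transferred in or out."""
    return monthly_charges_usd / (transferred_pb * GB_PER_PB)


compute_rate = compute_unit_cost(315)
storage_rate = storage_unit_cost(200)
bandwidth_rate = bandwidth_unit_cost(100_000, 5)
print(f"compute {compute_rate:.4f}, storage {storage_rate:.4f}, bandwidth {bandwidth_rate:.4f}")
```

The unrounded compute and storage rates come out slightly different from the slide's rounded $0.004 and $0.019.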
Measuring Hadoop Resource Consumption
Compute
Monthly Job and Task Cost:
Map GB-Hours = GB(M1) x T(M1) + GB(M2) x T(M2) + …
Reduce GB-Hours = GB(R1) x T(R1) + GB(R2) x T(R2) + …
Cost = (M + R) GB-Hours x $0.004 / GB-Hour / Month = $ for the Job / Month
Monthly Roll-ups: (M + R) GB-Hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform
Storage
Monthly Job and Task Cost:
/ project (app) directory quota in GB (peak monthly storage used)
/ user directory quota in GB (peak monthly storage used)
/ data is accounted for with each user accountable for their portion of use. For example, the share of user U1 = GB Read (U1) / [ GB Read (U1) + GB Read (U2) + … ]
Monthly Roll-ups: Roll-ups through relationship among user, file ownership, app, and their BU
Bandwidth
Monthly Job and Task Cost: Bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In/Out
Monthly Roll-ups: Roll-ups through relationship among user, app, and their BU
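The compute formulas above can be sketched as follows. This is only an illustration: the record layout, the function names, and the sample job figures are hypothetical, and the $0.004 rate is the illustrative one from the worked example.

```python
from collections import defaultdict

RATE_PER_GB_HOUR = 0.004  # illustrative rate from the worked example


def gb_hours(tasks):
    """Sum GB x hours over tasks given as (container_memory_gb, runtime_hours)."""
    return sum(gb * hours for gb, hours in tasks)


def job_cost(map_tasks, reduce_tasks, rate=RATE_PER_GB_HOUR):
    """Cost of one job: (map + reduce) GB-hours times the unit rate."""
    return (gb_hours(map_tasks) + gb_hours(reduce_tasks)) * rate


def roll_up(job_records, key):
    """Sum GB-hours per user, app, or BU; job_records are dicts (hypothetical layout)."""
    totals = defaultdict(float)
    for record in job_records:
        totals[record[key]] += record["gb_hours"]
    return dict(totals)


# Hypothetical job: two mappers and one reducer.
map_tasks = [(1.5, 0.5), (1.5, 0.25)]
reduce_tasks = [(4.0, 1.0)]
cost = job_cost(map_tasks, reduce_tasks)

# Hypothetical monthly job log rolled up by business unit.
jobs = [
    {"user": "u1", "bu": "BU1", "gb_hours": 5.125},
    {"user": "u2", "bu": "BU1", "gb_hours": 10.0},
    {"user": "u3", "bu": "BU2", "gb_hours": 2.0},
]
by_bu = roll_up(jobs, "bu")
```

The same `roll_up` call with `"user"` as the key gives the per-user view; an app-level view would need an app field on each record.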
Measuring Hadoop Resource Consumption
[Figure: resource consumption by queue, queue 1 through queue 11]
Measuring Hadoop Resource Consumption
[Figure: SLA Dashboard on Hadoop Analytics Warehouse]
Putting it Together for Hadoop Services
Hadoop Services Billing Rate Card [ Monthly Rates ]
Resource | Unit | Aggregated / Measured | Cost
HDFS (Storage) | GB | Monthly, peak storage used | $0.019 / GB
Compute | Map-Reduce GB-Hours | Number of GBs used by mappers and reducers and hours they ran for | $0.004 / GB-Hour
Network Bandwidth | GB | Monthly, total in / out | $0.02 / GB
Monthly Bill for May 2014
BU | HDFS Used (PB) | HDFS Effective Used (PB) | HDFS Cost ($ M) | Compute Used (GB-Hours) | Compute Cost ($ M) | Network Transferred (PB) | Network Cost ($ M) | Total Cost ($ M)
BU1 | 15 PB | 3.45 PB | $0.065 | 12.5 M | $0.05 | 1.25 PB | $0.025 | $0.14 M
BU2 | 10 PB | 2.65 PB | $0.05 | 6.25 M | $0.025 | 0.5 PB | $0.01 | $0.085 M
… | … | … | … | … | … | … | … | …
BU N | … | … | … | … | … | … | … | …
Total | 148 PB | 39.5 PB | $0.75 M | 125 M | $0.5 M | 5 PB | $0.1 M | $1.35 M
ILLUSTRATIVE
Multi-Tenant Deployment For Apache HBase
[Diagram: projects X, Y & Z run on shared Region Servers 1 through N. Each Region Server JVM hosts one or more table regions for each project (for example X:Table:Region 1, Y:Table:Region 1, …, Z:Table:Region 1) and performs HDFS reads / writes; HMaster and Zookeeper sit alongside the Region Servers]
Understanding Apache HBase Resources
[Diagram: region-level reads / writes for table regions (X:Table:Region 1, Y:Table:Region M, …, Z:Table:Region N) are served by the RegionServer JVM (heap); region data is held as HFiles in HDFS storage (disk)]
Reads
Total Reads @ RS = Reads (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N)
Read Share (X) = Total Table X / Total Table (X, Y, Z)
Writes
Total Writes @ RS = Writes (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N)
Write Share (X) = Total Table X / Total Table (X, Y, Z)
Data Stored
Total Table Data @ RS = Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N
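The share formulas above amount to dividing one table's operations by all operations seen on a region server. A minimal sketch, assuming per-region counters keyed by a hypothetical `(table, region)` tuple and using invented counts:

```python
def table_share(ops_by_region, table):
    """Fraction of a region server's operations that belong to one table.

    ops_by_region maps (table, region) -> operation count (hypothetical layout).
    """
    total = sum(ops_by_region.values())
    if total == 0:
        return 0.0
    table_total = sum(count for (name, _region), count in ops_by_region.items()
                      if name == table)
    return table_total / total


# Invented read counters for one region server.
reads = {("X", 1): 600, ("X", 2): 200, ("Y", 1): 150, ("Z", 1): 50}
read_share_x = table_share(reads, "X")  # (600 + 200) / 1000
```

The same function applies unchanged to write counters or to bytes stored per region.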
Unit Costs for HBase Operations
Writes
Write operations performed on Region Server while writing to individual table regions
Unit: $ / 1000 Writes
Total Capacity: Total write operations across Region Servers
Unit Cost: Monthly Write TCO / Total Write Ops (K)
Reads
Read operations performed on Region Server while reading from individual table regions
Unit: $ / 1000 Reads
Total Capacity: Total read operations across Region Servers
Unit Cost: Monthly Read TCO / Total Read Ops (K)
Storage
HDFS (usable) space needed by table region’s HFiles with default replication factor
Unit: $ / GB Stored
Total Capacity: Usable storage space (less replication and overheads)
Unit Cost: Monthly Storage Cost / Avail. Usable Storage
Bandwidth
Network bandwidth needed to move data into / out of the clusters by clients
Unit: $ / GB for inter-region data transfers
Total Capacity: Inter-region (peak) link capacity
Unit Cost: Monthly GB [In + Out] x $ / GB
Working Through An HBase Example
Writes
Monthly Cost: Monthly TCO (less bw.) = $60 K; Write Serving @ 25% = $15 K
Monthly Capacity: Total Write operations across Region Servers = 100 M
Unit Cost: $15 K / 100 M = $0.15 per 1000 writes per month
Reads
Monthly Cost: Monthly TCO (less bw.) = $60 K; Read Serving @ 25% = $15 K
Monthly Capacity: Total Read operations across Region Servers = 200 M
Unit Cost: $15 K / 200 M = $0.075 per 1000 reads per month
Storage
Monthly Cost: Monthly TCO (less bw.) = $60 K; Storage @ 50% = $30 K
Monthly Capacity: Raw HDFS = 10 PB; Usable HDFS = [ 10 x 0.8 (20% overhead) ] / 3 = 2.67 PB
Unit Cost: $30 K / 2.67 PB = $0.011 / GB / Month
Bandwidth
Monthly Cost: Monthly Charges = $5 K
Monthly Capacity: Total Data In + Out = 0.25 PB
Unit Cost: $5 K / 0.25 PB = $0.02 / GB transferred
ILLUSTRATIVE
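The per-1000-operation rates follow directly from the TCO split. The sketch below is a hedged illustration: the function names are hypothetical, the TCO figures are the illustrative ones from the slide, and the usage figures in the last lines are assumed for the example.

```python
def rate_per_thousand(monthly_tco_usd, share, total_ops):
    """$ per 1000 operations: a share of monthly TCO over total ops in thousands."""
    return monthly_tco_usd * share / (total_ops / 1000)


def ops_cost(op_count, rate):
    """Monthly charge for op_count operations at a per-1000 rate."""
    return op_count / 1000 * rate


# Illustrative figures from the worked example.
write_rate = rate_per_thousand(60_000, 0.25, 100_000_000)  # $ per 1000 writes
read_rate = rate_per_thousand(60_000, 0.25, 200_000_000)   # $ per 1000 reads

# Assumed usage for one tenant: 30 M writes and 20 M reads in the month.
tenant_ops_charge = ops_cost(30_000_000, write_rate) + ops_cost(20_000_000, read_rate)
```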
Measuring HBase Resource Consumption
Writes
Monthly HBase Project Cost:
Write Ops per Region Server per Table Region = #W(R1:RS1) + #W(R2:RS1) + …
Cost = Total Writes x $0.15 / 1000 writes / month = $ for the Table / RS / Month
Monthly Roll-ups: Write Ops cost for all tables across all region servers for a user, app, BU or the platform
Reads
Monthly HBase Project Cost:
Read Ops per Region Server per Table Region = #R(R1:RS1) + #R(R2:RS1) + …
Cost = Total Reads x $0.075 / 1000 reads / month = $ for the Table / RS / Month
Monthly Roll-ups: Read Ops cost for all tables across all region servers for a user, app, BU or the platform
Storage
Monthly HBase Project Cost:
HDFS size of regions under hbase/table/<regions> in GBs
Cost = Total HDFS size x $0.011 / GB / Month = $ for the Table / Month
Monthly Roll-ups: Total HDFS size for all tables across all region servers for a user, app, BU or the platform
Bandwidth
Monthly HBase Project Cost: Bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In/Out
Monthly Roll-ups: Roll-ups through relationship among user, app, and their BU
Putting it Together for HBase Services
HBase Services Billing Rate Card [ Monthly Rates ]
Resource | Unit | Aggregated / Measured | Cost
Write Operations | Count of operations | Monthly, total write operations across regions of table | $0.15 / 1000 Writes
Read Operations | Count of operations | Monthly, total read operations across regions of table | $0.075 / 1000 Reads
HDFS (Storage) | GB | Monthly, peak storage used | $0.011 / GB
Network Bandwidth | GB | Monthly, total in / out | $0.02 / GB
Monthly Bill for May 2014
BU | Write Count (M) | Write Cost ($ K) | Read Count (M) | Read Cost ($ K) | HDFS Used (PB) | HDFS Effective Used (PB) | HDFS Cost ($ K) | Network Transferred (PB) | Network Cost ($ K) | Total Cost ($ K)
BU 1 | 30 M | $4.5 | 20 M | $1.5 | 3 PB | 0.8 PB | $8.80 | 1.25 PB | $0.025 | $14.82
BU 2 | 10 M | $1.5 | 60 M | $4.5 | 1 PB | 0.27 PB | $2.93 | 0.5 PB | $0.01 | $8.94
… | … | … | … | … | … | … | … | … | … | …
BU N | … | … | … | … | … | … | … | … | … | …
Total | 100 M | $15 | 200 M | $15 | 10 PB | 2.67 PB | $29.4 | 0.25 PB | $5 | $64.4
ILLUSTRATIVE
Multi-Tenant Deployment For Apache Storm
[Diagram: topologies X, Y & Z run on shared Supervisors 1 through N. Each Supervisor hosts workers for each topology (for example X: Worker 1, Y: Worker 1, …, Z: Worker 1); Nimbus and Zookeeper sit alongside the Supervisors]
Understanding Apache Storm Resources
[Diagram: a Supervisor with fixed worker slots hosting a worker for Topology A and a worker for Topology B, each worker running several tasks]
§  A Supervisor runs one or more worker processes for one or more topologies
§  Each Supervisor has a fixed number of worker slots
§  A worker process belongs to a specific topology
§  The workers from topologies are distributed randomly across supervisors
§  Tasks perform the actual data processing
Unit Costs for Storm Operations
Compute
Worker slots where topology workers execute the actual logic / tasks of spouts and bolts in parallel
Unit: $ / Slot-Hour
Total Capacity: Total number of slots
Unit Cost: Monthly Slots Used / Avail. Slots
Bandwidth
Network bandwidth needed to move data into / out of the clusters by topologies
Unit: $ / GB for inter-region data transfers
Total Capacity: Inter-region (peak) link capacity
Unit Cost: [Monthly GB In + Out] x $ / GB
Working Through a Storm Example
Compute
Monthly Cost: Monthly TCO (less bw.) = $30 K; 24 slots per Supervisor @ 100% = $30 K
Monthly Capacity: 19.2 K Slots = 19.2 K x 24 x 30 = 13.8 M Slot-Hours
Unit Cost: $30 K / 13.8 M Slot-Hours = $0.002 / Slot-Hour / Month
Bandwidth
Monthly Cost: Monthly Charges = $2.5 K
Monthly Capacity: Total Data In + Out = 0.12 PB
Unit Cost: $2.5 K / 0.12 PB = $0.02 / GB transferred
ILLUSTRATIVE
Measuring Storm Resource Consumption
Compute
Monthly Cost:
Worker Slot-Hours for Topologies = #W(TP1) x T(TP1) + #W(TP2) x T(TP2) + …
Cost = Worker Slot-Hours x $0.002 / Slot-Hour / Month = $ for the Topology / Month
Monthly Roll-ups: Worker Slot-Hours for all Topologies can be summed up for the month for a user, app, BU, or the entire platform
Bandwidth
Monthly Cost: Bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In/Out
Monthly Roll-ups: Roll-ups through relationship among user, app, and their BU
ILLUSTRATIVE
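A minimal sketch of the Storm slot-hour arithmetic, assuming the illustrative TCO and slot count from the worked example. The function names and the two sample topologies are hypothetical.

```python
HOURS_PER_MONTH = 24 * 30


def slot_hour_rate(monthly_tco_usd, total_slots):
    """$ per slot-hour: monthly TCO over all slot-hours available in the month."""
    return monthly_tco_usd / (total_slots * HOURS_PER_MONTH)


def topology_slot_hours(topologies):
    """Sum workers x hours over topologies given as (worker_count, hours_run)."""
    return sum(workers * hours for workers, hours in topologies)


rate = slot_hour_rate(30_000, 19_200)  # unrounded; the slide rounds to $0.002

# Hypothetical usage: one topology with 8 workers running all month,
# another with 4 workers running half the month.
usage = topology_slot_hours([(8, 720), (4, 360)])
cost = usage * 0.002  # billed at the rounded rate-card value
```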
Putting it Together for Storm Services
Storm Services Billing Rate Card [ Monthly Rates ]
Resource | Unit | Aggregated / Measured | Cost
Compute | Worker Slot-Hours | Number of slots used by Topology workers and hours they ran for | $0.002 / Slot-Hour
Network Bandwidth | GB | Monthly, total in / out | $0.02 / GB
Monthly Bill for May 2014
BU | Compute Used (Slot-Hours) | Compute Cost ($ K) | Network Transferred (PB) | Network Cost ($ K) | Total Cost ($ K)
BU1 | 2.5 M | $5 | 0.02 PB | $0.4 | $5.4
BU2 | 1.25 M | $2.5 | 0.04 PB | $0.8 | $3.3
… | … | … | … | … | …
BU N | … | … | … | … | …
Total | 10 M | $20 | 0.12 PB | $2.4 | $22.4
ILLUSTRATIVE
Project Based Costing for Grid Services
Project Summary | Period | Cost (K)
Grid Services Cost | May 2014 | $165.5 K
Project Usage Details (Data Center DC1) | Usage | Cost (K)
Apache Hadoop Services | | $126 K
Compute (Map & Reduce GB-Hours consumed @ $0.004/GB-Hour) | 12.5 M | $50 K
Storage (GBs of peak storage used @ $0.019/GB) | 3.45 PB | $66 K
Network (GBs In/Out @ $0.02/GB) | 0.5 PB | $10 K
Apache HBase Services | | $34.1 K
Reads (Number of Read Operations @ $0.075/1000 Reads) | 30 M | $2.2 K
Writes (Number of Write Operations @ $0.15/1000 Writes) | 20 M | $3.0 K
Storage (GBs of peak storage used @ $0.011/GB) | 2.45 PB | $26.9 K
Network (GBs In/Out @ $0.02/GB) | 0.1 PB | $2 K
Apache Storm Services | | $5.4 K
Compute (Slot Hours consumed @ $0.002/Slot-Hour) | 2.5 M | $5 K
Network (GBs In/Out @ $0.02/GB) | 0.02 PB | $0.4 K
ILLUSTRATIVE
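The project bill is usage times rate, summed per service. The sketch below recomputes it from the table's usage and rate figures; the tuple layout and function name are hypothetical, and decimal units (1 PB = 1,000,000 GB) are an assumption. Because the slide rounds each line, the recomputed totals differ slightly from the printed ones.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units

# (service, line item, usage, rate per unit of usage), from the table above.
LINE_ITEMS = [
    ("Hadoop", "compute GB-hours", 12.5e6, 0.004),
    ("Hadoop", "storage GB", 3.45 * GB_PER_PB, 0.019),
    ("Hadoop", "network GB", 0.5 * GB_PER_PB, 0.02),
    ("HBase", "reads", 30e6, 0.075 / 1000),
    ("HBase", "writes", 20e6, 0.15 / 1000),
    ("HBase", "storage GB", 2.45 * GB_PER_PB, 0.011),
    ("HBase", "network GB", 0.1 * GB_PER_PB, 0.02),
    ("Storm", "slot-hours", 2.5e6, 0.002),
    ("Storm", "network GB", 0.02 * GB_PER_PB, 0.02),
]


def bill_by_service(items):
    """Sum usage x rate per service."""
    totals = {}
    for service, _item, usage, rate in items:
        totals[service] = totals.get(service, 0.0) + usage * rate
    return totals


totals = bill_by_service(LINE_ITEMS)
grand_total = sum(totals.values())
for service, amount in totals.items():
    print(f"{service}: ${amount / 1000:.1f} K")
print(f"Grid services: ${grand_total / 1000:.1f} K")
```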
Platform P&L
Line Item Q4’12 Q1’13 Q2’13 Q3 ’13 Total Total %
Y! Gross Revenues
Cost of revenues (less Grid CapEx)
Gross Profit
Grid OpEx
R&D Headcount
SE&O Headcount
Acquisition/Install
Active Use/ Ops
Network Bandwidth
Total Grid OpEx
Grid CapEx
Grid Services
Total Grid CapEx
Contribution Margin
Indirect Costs
G&A
Sales and Marketing
ILLUSTRATIVE
LEFT BLANK ON PURPOSE
Hadoop Cost Benchmarking – An Approach
Quantity equivalent
On-Premise (Monthly) | Used | Unused | Total
M/R | 71.4 M | 61.6 M | 133 M
HDFS | 148 PB | 52 PB | 200 PB
Avg. Data Processed | - | - | 75 PB
Public Cloud (Public Pricing or Terms-based) | Used On-Premise Eqv.
Compute Instances (normalized time, RAM, 32/64 ops, I/O etc.) | 1,000 instances / hr.
Storage (account for 3x repl., job / app space) | 30 PB / month
Instance Storage | 2.5 PB daily
Cost equivalent
On-Premise (Monthly) | Used | Unused | Total
M/R | $0.50 M | $0.50 M | $1 M
HDFS | $0.75 M | $0.25 M | $1 M
Total * | $1.25 M | $0.75 M | $2 M
Public Cloud (Public Pricing or Terms-based) | Cost
Compute: 1,000 x $0.70 / instance / hr. x 24 x 30 | $0.5 M
Storage: 30 PB x $0.04 / GB / month | $1.2 M
Other Costs (if any) such as reads, writes, data services / hour etc. | $0.25 M
Total | $1.95 M
* Ignored bandwidth, assumed equivalent
ILLUSTRATIVE
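The public-cloud side of the comparison can be sketched as below. The prices and quantities are the illustrative ones from the slide; the function name is hypothetical and decimal units (1 PB = 1,000,000 GB) are an assumption.

```python
HOURS_PER_MONTH = 24 * 30
GB_PER_PB = 1_000_000  # assumption: decimal units


def cloud_monthly_cost(instances, price_per_instance_hour,
                       storage_pb, price_per_gb_month, other_costs=0.0):
    """Monthly public-cloud equivalent: compute + storage + other charges."""
    compute = instances * price_per_instance_hour * HOURS_PER_MONTH
    storage = storage_pb * GB_PER_PB * price_per_gb_month
    return compute + storage + other_costs


cloud_total = cloud_monthly_cost(1_000, 0.70, 30, 0.04, other_costs=250_000)
on_prem_total = 2_000_000  # illustrative on-premise monthly TCO (used + unused)
print(f"cloud ${cloud_total / 1e6:.2f} M vs on-premise ${on_prem_total / 1e6:.2f} M")
```

Re-running the function with different instance counts or prices gives a simple way to test how sensitive the comparison is to utilization.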
HBase and Storm Cost Benchmarking
HBase (quantity equivalent)
Resource | On-Premise | Total | Public Cloud (Public Pricing or Terms-based) | Used On-Premise Eqv.
Reads | Peak concurrent reads for a given record size | 300 MB/s | Reads on chosen instances (benchmarks 45 MB/s) | 300/45 = 7 instances
Writes | Peak concurrent writes for a given record size | 160 MB/s | Writes on chosen instances (benchmarks 10 MB/s) | 160/10 = 16 instances
Storage | Data storage in tables (incl. replication) | 1.6 TB | Data served per instance (benchmarks 0.5 TB incl. repl.) | 1.6/0.5 = 3
Instances required based on thru-put and storage needs: 16 instances / hour
Cost calculations stay the same as Hadoop.
Storm (quantity equivalent)
Resource | On-Premise | Total | Public Cloud (Public Pricing or Terms-based) | Used On-Premise Eqv.
Slot-Hours | Slot hours per month | 2.5 M | Instance hours based on memory and CPU requirements (12 slots / instance) | 0.21 M instance hours
Cost calculations stay the same as Hadoop.
* Ignored bandwidth, assumed equivalent
ILLUSTRATIVE
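The sizing step takes the largest instance count across the read, write, and storage dimensions. The sketch below is an assumption-laden illustration: the dictionary keys and function name are hypothetical, and it rounds every dimension up, so storage yields 4 instances where the slide shows 3. The overall answer is unchanged because writes dominate.

```python
import math


def instances_required(demand, per_instance):
    """Largest per-dimension instance count, rounding each dimension up."""
    return max(math.ceil(demand[key] / per_instance[key]) for key in demand)


# Illustrative HBase figures from the slide (hypothetical key names).
hbase_demand = {"read_mb_s": 300, "write_mb_s": 160, "storage_tb": 1.6}
hbase_benchmark = {"read_mb_s": 45, "write_mb_s": 10, "storage_tb": 0.5}
hbase_instances = instances_required(hbase_demand, hbase_benchmark)

# Storm: monthly slot-hours converted to instance-hours at 12 slots per instance.
storm_instance_hours = 2_500_000 / 12
```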
Improving Utilization favors on-premise setup
[Figure: cost ($) versus utilization / consumption (compute and storage) for three options: on-premise Hadoop as a Service, on-demand public cloud service, and terms-based public cloud service. On-premise has a high starting cost; crossover points (marked x) separate the region that favors public cloud service from the region, when scaling up, that favors on-premise Hadoop as a Service]
Sensitivity analysis on costs based on current and expected utilization or target utilization can provide further insights into your operations and cost competitiveness
Improving Utilization improves ROI
[Figure: cost amortized over apps ($) versus time, across Phase I, 2012 – 2013 (H 0.23), and 2014 & Future; Cost (t) = C at Time = t and Cost (t’) = C’ at Time = t’]
§  # Apps continue to grow on the Platform
§  At time t, BU profits are R (t) – C (t) = π (t)
§  Platform’s goal is to continue to increase the ROI while supporting new technology and services
§  R (t’) – C (t’) = π (t’), where C (t’) < C (t) and π (t’) > π (t) for same or bigger revenues.
Going Forward
Hadoop
§  CPU as a resource
§  Pre-emption and priority
§  Long-running jobs
§  Other potential resources such as disk, network, GPUs etc.
§  Tez as the execution engine / Container reuse
HBase
§  Multiple Region Servers per node
§  Larger JVMs / GC improvements
§  HBase-on-YARN
§  cgroup profiles
Storm
§  Storm-on-YARN
§  Resource aware scheduling (memory, CPU, network)
§  cgroup profiles
§  More experience with multi-tenancy
YARN
Co-exist with HBase to share the compute and memory – Using the cgroup profiles at the Storm JVM level and topology worker level
Resource aware scheduling – Memory & CPU
Thank You
@sumeetksingh
@amritasshwar
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.

Más contenido relacionado

La actualidad más candente

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drilltshiran
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2Giovanna Roda
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopVictoria López
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideDanairat Thanabodithammachari
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 

La actualidad más candente (20)

HW09 Hadoop Vaidya
HW09 Hadoop VaidyaHW09 Hadoop Vaidya
HW09 Hadoop Vaidya
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Enabling R on Hadoop
Enabling R on HadoopEnabling R on Hadoop
Enabling R on Hadoop
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
Introduction to Hadoop part 2
Introduction to Hadoop part 2Introduction to Hadoop part 2
Introduction to Hadoop part 2
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoopHadoop, MapReduce and R = RHadoop
Hadoop, MapReduce and R = RHadoop
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 

Destacado

Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Sumeet Singh
 
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Sumeet Singh
 
Costing your Bug Data Operations
Costing your Bug Data OperationsCosting your Bug Data Operations
Costing your Bug Data OperationsDataWorks Summit
 
Big Data Asset Maturity Model
Big Data Asset Maturity ModelBig Data Asset Maturity Model
Big Data Asset Maturity Modelnoahwong
 
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede...StampedeCon
 
Show me the Money! Cost & Resource Tracking for Hadoop and Storm
Show me the Money! Cost & Resource  Tracking for Hadoop and Storm Show me the Money! Cost & Resource  Tracking for Hadoop and Storm
Show me the Money! Cost & Resource Tracking for Hadoop and Storm DataWorks Summit/Hadoop Summit
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop The TCO Calculator - Estimate the True Cost of Hadoop
The TCO Calculator - Estimate the True Cost of Hadoop MapR Technologies
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 
IT Operating Model
IT Operating ModelIT Operating Model
IT Operating Modelanusharaju38
 
How to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsHow to create new business models with Big Data and Analytics
How to create new business models with Big Data and AnalyticsAki Balogh
 
Big Data: It’s all about the Use Cases
Big Data: It’s all about the Use CasesBig Data: It’s all about the Use Cases
Big Data: It’s all about the Use CasesJames Serra
 
How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model How to Build & Sustain a Data Governance Operating Model
How to Build & Sustain a Data Governance Operating Model DATUM LLC
 
Cost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationCost of Ownership for Hadoop Implementation
Cost of Ownership for Hadoop ImplementationDataWorks Summit
 

Hadoop Summit San Jose 2014: Costing Your Big Data Operations

  • 1. Costing Your Big Data Operations. Presented by Sumeet Singh and Amrit Lal, June 5, 2014. 2014 Hadoop Summit, San Jose, California.
  • 2. Introduction
      Sumeet Singh, Senior Director, Product Management, Hadoop and Big Data Platforms, Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA, @sumeetksingh
        §  Manages Hadoop products team at Yahoo!
        §  Responsible for Product Management, Strategy and Customer Engagements
        §  Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
        §  MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
      Amrit Lal, Product Manager, Hadoop and Big Data Platforms, Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA, @amritasshwar
        §  Product Manager at Yahoo engaged in building high class and robust Hadoop infrastructure services
        §  Eight years of experience across HSBC, Oracle and Google in developing products and platforms for high growth enterprises
        §  MBA from Carnegie Mellon University
  • 3. Agenda
      1. Total Cost of Ownership (TCO) Models
      2. Deeper Understanding of (Resource) Usage
      3. P&L, Metering and Billing Provisions
      4. Benchmark Costs
      5. Improve Utilization and ROI
  • 4. Why do Costing?
      §  Profitability: Understanding the data services costs (an element of your total project cost) to determine how profitable the project is
      §  ROI: Investment decisions both at the platform and app / project level
      §  Operational Efficiency: Benchmark, improve ops by focusing on avg. utilization, increasing the # hosted apps, storage efficiencies, job performance etc.
      §  Planning: Capital planning and budgeting, product improvements
      §  Cost Transparency: Metering / usage metrics, billing, chargeback / showback, P&L
  • 5. Costing is Relevant Irrespective of the Service Model
      Private Cloud
        §  Fixed costs that favor scale and 24x7 operations
        §  Centralized operations
        §  Multi-tenant clusters with security and data sharing
        §  Cost is a function of the desired SLA
        §  Utilization and # hosted apps are the primary levers
        §  Tenants often tend to ignore costs
      Public Cloud
        §  Costs vary with usage and favor a run-and-done model
        §  Decentralized operations; ops / headcount costs still relevant
        §  Dedicated virtual clusters
        §  Monthly bills!
        §  Releasing cluster instances when not needed is a wise idea
        §  Users often overlook the peripheral costs
  • 6. Important with Multi-tenancy and Scale
      [Chart: Number of Servers and Raw HDFS Storage (in PB) by Year, 2006 to 2014]
      Milestones marked on the chart:
        §  Yahoo! Commits to Scaling Hadoop for Production Use
        §  Research Workloads in Search and Advertising
        §  Production (Modeling) with machine learning & WebMap
        §  Revenue Systems with Security, Multi-tenancy, and SLAs
        §  Open Sourced with Apache
        §  Hortonworks Spinoff for Enterprise hardening
        §  Nextgen Hadoop (H 0.23 YARN)
        §  New Services (HBase, Storm, Hive etc.)
        §  Increased User-base with partitioned namespaces
        §  Apache H 2.x (Low latency, Util, HA etc.)
  • 7. Hosted Apps Growth on Apache Hadoop
      [Chart: Number of New Projects by quarter, Q1-11 to Q1-14; data labels 272, 330, 382, 495, 525]
      New Customer Apps On-boarded: 58 projects in 2011, 52 projects in 2012, 113 projects in 2013
  • 8. Multi-tenant Apache HBase Growth
      Zero to “20” Use Cases (60,000 Regions) in a Year
      [Chart: Number of Region Servers and Data Stored (in PB) by quarter, Q1-13 to Q1-14; data labels 1140 region servers and 33.6 PB]
  • 9. Multi-tenant Apache Storm Growth
      Zero to “175” Production Topologies in a Year
      [Chart: Number of Supervisors and Number of Topologies by quarter, Q1-13 to Q1-14; data labels 760 supervisors and 175 topologies; annotation: Multi-tenancy Release]
  • 10. Capital Deployment for Big Data Infrastructure
    [Diagram components: DataNode, NodeManager, NameNode, RM; DataNodes, RegionServers, NameNode, HBase Master; Nimbus, Supervisor; Administration, Management and Monitoring; ZooKeeper Pools; HTTP/HDFS/GDM Load Proxies; Applications and Data; Data Feeds; Data Stores; Oozie Server; HS2/HCat; Network Backplane]
  • 11. Big Data Platforms Technology Stack at Yahoo
    [Diagram layers: Compute, Services, Storage, Infrastructure Services]
    [Diagram components: Hive, Pig, Oozie, HDFS Proxy, GDM, YARN, MapReduce, HDFS, HBase, Zookeeper, Support Shop, Monitoring, Starling, Messaging Service, HCatalog, Storm, Spark, Tez]
  • 12. Resources Consumed in Big Data Operations
    [Diagram: clusters in datacenters (Colo 1, Rack 1 … Rack N) and server resources (CPU, memory, storage, bandwidth)]
  • 13. Elements of a TCO Model (ILLUSTRATIVE)
    Monthly TCO: $2.1 M [Chart segment labels: 60%, 12%, 7%, 6%, 3%, 2%, 10%]
    TCO Components
    §  1  Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
    §  2  R&D HC: headcount for platform software development, quality, and release engineering
    §  3  Active Use and Operations (Recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance)
    §  4  Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips etc.
    §  5  Acquisition / Install (One-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping / receiving etc.
    §  6  Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support
    §  7  Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers
  • 14. Understanding Apache Hadoop Resources
    [Diagram: NameNode and Resource Manager; DataNode / Node Manager hosts holding DFS blocks (storage) and MR containers (memory) that run Task 1, Task 2, Task 3; labels: Storage and Compute, MapReduce and Memory]
  • 15. Unit Costs for Hadoop Operations
    §  Compute: containers where apps can perform computation and access HDFS if needed
        Unit: $ / GB-Hour (H 0.23/2.0)
        Total capacity: GBs of memory available for an hour
        Unit cost: Monthly Compute Cost / Avail. Compute Capacity
    §  Storage: HDFS (usable) space needed by an app with default replication factor of three
        Unit: $ / GB stored
        Total capacity: usable storage space (less replication and overheads)
        Unit cost: Monthly Storage Cost / Avail. Usable Storage
    §  Bandwidth: network bandwidth needed to move data into / out of the clusters by the app
        Unit: $ / GB for inter-region data transfers
        Total capacity: inter-region (peak) link capacity
        Unit cost: [Monthly GB In + Out] x $ / GB
    §  Namespace: files and directories used by the apps, tracked to understand / limit the load on NN
        Unit: N/A; Total capacity: N/A; Unit cost: N/A
  • 16. Working Through A Hadoop Example (ILLUSTRATIVE)
    §  Compute
        Monthly cost: Monthly TCO (less bw.) = $2 M; Compute @ 50% = $1 M
        Monthly capacity: 315 TB memory = 315 TB x 24 x 30 = 227 M GB-Hours
        Unit cost: $1 M / 227 M GB-Hours = $0.004 / GB-Hour / Month
    §  Storage
        Monthly cost: Monthly TCO (less bw.) = $2 M; Storage @ 50% = $1 M
        Monthly capacity: raw HDFS = 200 PB; usable HDFS = [ 200 x 0.8 (20% overhead) ] / 3 = 53.3 PB
        Unit cost: $1 M / 53.3 PB = $0.019 / GB / Month
    §  Bandwidth
        Monthly cost: monthly charges = $0.1 M
        Monthly capacity: total data In + Out = 5 PB
        Unit cost: $0.1 M / 5 PB = $0.02 / GB transferred
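The arithmetic on this slide can be reproduced with a short sketch. The function and constant names below are hypothetical, the inputs are the slide's illustrative figures, and decimal units (1 TB = 1,000 GB) are assumed because that is what the slide's rounding implies.

```python
# Sketch of the slide's unit-cost arithmetic. All names are hypothetical;
# the inputs are the slide's illustrative figures.
HOURS_PER_MONTH = 24 * 30   # the slide assumes a 30-day month
GB_PER_TB = 1_000           # assumption: decimal units
GB_PER_PB = 1_000_000


def compute_unit_cost(monthly_cost_usd, memory_tb):
    """Dollars per GB-hour: monthly compute cost over available GB-hours."""
    available_gb_hours = memory_tb * GB_PER_TB * HOURS_PER_MONTH
    return monthly_cost_usd / available_gb_hours


def usable_storage_pb(raw_pb, overhead=0.20, replication=3):
    """Usable HDFS space after overhead and the replication factor."""
    return raw_pb * (1 - overhead) / replication


def storage_unit_cost(monthly_cost_usd, raw_pb):
    """Dollars per usable GB stored per month."""
    return monthly_cost_usd / (usable_storage_pb(raw_pb) * GB_PER_PB)


def bandwidth_unit_cost(monthly_charges_usd, transferred_pb):
    """Dollars per GB transferred in or out of the clusters."""
    return monthly_charges_usd / (transferred_pb * GB_PER_PB)


if __name__ == "__main__":
    print(f"compute:   ${compute_unit_cost(1_000_000, 315):.4f} / GB-hour")
    print(f"storage:   ${storage_unit_cost(1_000_000, 200):.4f} / GB")
    print(f"bandwidth: ${bandwidth_unit_cost(100_000, 5):.4f} / GB")
```

The unrounded results are slightly different from the rate-card values (for example about $0.0044 rather than $0.004 per GB-hour), so a real rate card would need an explicit rounding policy.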
  • 17. Measuring Hadoop Resource Consumption
    §  Compute
        Monthly job and task cost: Map GB-Hours = GB(M1) x T(M1) + GB(M2) x T(M2) + …; Reduce GB-Hours = GB(R1) x T(R1) + GB(R2) x T(R2) + …; Cost = (M + R) GB-Hours x $0.004 / GB-Hour / Month = $ for the Job / Month
        Monthly roll-ups: (M + R) GB-Hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform
    §  Storage
        Monthly cost: /project (app) directory quota in GB (peak monthly storage used); /user directory quota in GB (peak monthly storage used); /data is accounted for with each user accountable for their portion of use, e.g. GB Read (U1) / [ GB Read (U1) + GB Read (U2) + … ]
        Monthly roll-ups: through relationship among user, file ownership, app, and their BU
    §  Bandwidth
        Monthly cost: bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In / Out
        Monthly roll-ups: through relationship among user, app, and their BU
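As a rough illustration of the metering formulas on this slide, the sketch below sums container GB-hours for one job and splits shared data by read volume. The function names and the sample task figures are hypothetical; only the $0.004 / GB-Hour rate comes from the deck.

```python
# Sketch only: names and sample inputs are hypothetical.
def gb_hours(tasks):
    """Sum of container memory (GB) x runtime (hours) over a list of tasks."""
    return sum(memory_gb * hours for memory_gb, hours in tasks)


def job_cost(map_tasks, reduce_tasks, rate_per_gb_hour=0.004):
    """Cost = (Map GB-Hours + Reduce GB-Hours) x rate."""
    return (gb_hours(map_tasks) + gb_hours(reduce_tasks)) * rate_per_gb_hour


def read_shares(gb_read_by_user):
    """Each user's portion of shared data: GB Read (Ui) / sum of GB Read."""
    total = sum(gb_read_by_user.values())
    if total == 0:
        return {user: 0.0 for user in gb_read_by_user}
    return {user: gb / total for user, gb in gb_read_by_user.items()}


if __name__ == "__main__":
    maps = [(1.5, 0.5), (1.5, 0.25)]   # (container GB, hours) per map task
    reduces = [(2.0, 1.0)]             # (container GB, hours) per reduce task
    print(job_cost(maps, reduces))
    print(read_shares({"u1": 30.0, "u2": 10.0}))
```

Monthly roll-ups would then be plain sums of these per-job costs grouped by user, app, or BU.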
  • 18. Measuring Hadoop Resource Consumption
    [Chart with per-queue series: queue 1 to queue 11]
  • 19. Measuring Hadoop Resource Consumption
    [Screenshot: SLA Dashboard on Hadoop Analytics Warehouse]
  • 20. Putting it Together for Hadoop Services (ILLUSTRATIVE)
    Hadoop Services Billing Rate Card [ Monthly Rates ]
    §  HDFS (Storage): unit GB; measured as monthly peak storage used; $0.019 / GB
    §  Compute: unit Map-Reduce GB-Hours; measured as number of GBs used by mappers and reducers and hours they ran for; $0.004 / GB-Hour
    §  Network Bandwidth: unit GB; measured as monthly total in / out; $0.02 / GB
    Monthly Bill for May 2014
    §  BU1: HDFS 15 PB used, 3.45 PB effective, $0.065 M; compute 12.5 M GB-Hours, $0.05 M; bandwidth 1.25 PB transferred, $0.025 M; total $0.15 M
    §  BU2: HDFS 10 PB used, 2.65 PB effective, $0.05 M; compute 6.25 M GB-Hours, $0.025 M; bandwidth 0.5 PB transferred, $0.01 M; total $0.085 M
    §  … BU N: …
    §  Total: HDFS 148 PB used, 39.5 PB effective, $0.75 M; compute 125 M GB-Hours, $0.5 M; bandwidth 5 PB transferred, $0.1 M; total $1.35 M
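A bill line can be derived by applying the rate card to metered usage. The sketch below is one possible shape for that calculation; the dictionary keys and function name are hypothetical, and the example uses the BU2 usage figures from the slide.

```python
# Sketch only: key names and the function are hypothetical.
GB_PER_PB = 1_000_000  # assumption: decimal units

RATE_CARD = {
    "storage_per_gb": 0.019,       # $ / GB of peak effective storage
    "compute_per_gb_hour": 0.004,  # $ / GB-Hour
    "bandwidth_per_gb": 0.02,      # $ / GB transferred in or out
}


def monthly_bill(effective_storage_pb, compute_gb_hours, transferred_pb,
                 rates=RATE_CARD):
    """Return the per-resource charges and the total, in dollars."""
    storage = effective_storage_pb * GB_PER_PB * rates["storage_per_gb"]
    compute = compute_gb_hours * rates["compute_per_gb_hour"]
    bandwidth = transferred_pb * GB_PER_PB * rates["bandwidth_per_gb"]
    return {
        "storage": storage,
        "compute": compute,
        "bandwidth": bandwidth,
        "total": storage + compute + bandwidth,
    }


if __name__ == "__main__":
    # BU2 on the slide: 2.65 PB effective, 6.25 M GB-Hours, 0.5 PB transferred
    for item, dollars in monthly_bill(2.65, 6_250_000, 0.5).items():
        print(f"{item:>9}: ${dollars:,.0f}")
```

For BU2 this comes to roughly $85 K, in line with the $0.085 M total shown on the slide.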
  • 21. Multi-Tenant Deployment For Apache HBase
    [Diagram: HMaster and Zookeeper; projects X, Y & Z on shared Region Servers 1, 2, …, M, N; each Region Server JVM hosts X:Table:Region i, Y:Table:Region i, …, Z:Table:Region i and performs HDFS reads / writes]
  • 22. Understanding Apache HBase Resources
    [Diagram: region-level reads / writes for X:Table:Region 1, Y:Table:Region M, …, Z:Table:Region N in the RegionServer JVM (heap), backed by HFiles in HDFS storage (disk)]
    §  Reads: Total Reads @ RS = Reads (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N); Read Share (X) = Total Table X / Total Table (X, Y, Z)
    §  Writes: Total Writes @ RS = Writes (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N); Write Share (X) = Total Table X / Total Table (X, Y, Z)
    §  Data Stored: Total Table Data @ RS = Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N
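The read and write shares described here are simple ratios of per-table totals. The sketch below shows one way to compute them from per-region counters; the function name, the input layout, and the sample counts are all hypothetical.

```python
# Sketch only: the input layout and sample counts are hypothetical.
from collections import defaultdict


def table_shares(ops_by_region):
    """Map {(table, region): op_count} to {table: share of all operations}."""
    totals = defaultdict(int)
    for (table, _region), count in ops_by_region.items():
        totals[table] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {table: 0.0 for table in totals}
    return {table: count / grand_total for table, count in totals.items()}


if __name__ == "__main__":
    reads = {
        ("X", "reg1"): 30, ("X", "reg2"): 10,
        ("Y", "reg1"): 40,
        ("Z", "reg1"): 20,
    }
    print(table_shares(reads))  # share of reads attributed to each table
```

The same function applies unchanged to write counters, since the write share uses the same ratio.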
  • 23. Unit Costs for HBase Operations
    §  Writes: write operations performed on Region Server while writing to individual table regions
        Unit: $ / 1000 Writes
        Total capacity: total write operations across Region Servers
        Unit cost: Monthly Write TCO / Total Write Ops (K)
    §  Reads: read operations performed on Region Server while reading from individual table regions
        Unit: $ / 1000 Reads
        Total capacity: total read operations across Region Servers
        Unit cost: Monthly Read TCO / Total Read Ops (K)
    §  Storage: HDFS (usable) space needed by table region’s HFiles with default replication factor
        Unit: $ / GB stored
        Total capacity: usable storage space (less replication and overheads)
        Unit cost: Monthly Storage Cost / Avail. Usable Storage
    §  Bandwidth: network bandwidth needed to move data into / out of the clusters by clients
        Unit: $ / GB for inter-region data transfers
        Total capacity: inter-region (peak) link capacity
        Unit cost: Monthly GB [In + Out] x $ / GB
  • 24. Working Through An HBase Example (ILLUSTRATIVE)
    §  Writes
        Monthly cost: Monthly TCO (less bw.) = $60 K; Write Serving @ 25% = $15 K
        Monthly capacity: total write operations across Region Servers = 100 M
        Unit cost: $15 K / 100 M = $0.15 per 1000 writes per month
    §  Reads
        Monthly cost: Monthly TCO (less bw.) = $60 K; Read Serving @ 25% = $15 K
        Monthly capacity: total read operations across Region Servers = 200 M
        Unit cost: $15 K / 200 M = $0.075 per 1000 reads per month
    §  Storage
        Monthly cost: Monthly TCO (less bw.) = $60 K; Storage @ 50% = $30 K
        Monthly capacity: raw HDFS = 10 PB; usable HDFS = [ 10 x 0.8 (20% overhead) ] / 3 = 2.67 PB
        Unit cost: $30 K / 2.67 PB = $0.011 / GB / Month
    §  Bandwidth
        Monthly cost: monthly charges = $5 K
        Monthly capacity: total data In + Out = 0.25 PB
        Unit cost: $5 K / 0.25 PB = $0.02 / GB transferred
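The HBase unit costs follow the same pattern as the Hadoop ones, with the monthly TCO split 25% to writes, 25% to reads, and 50% to storage. The sketch below reproduces that arithmetic; the function names are hypothetical and decimal units are assumed.

```python
# Sketch only: names are hypothetical; inputs are the slide's figures.
GB_PER_PB = 1_000_000  # assumption: decimal units


def cost_per_thousand_ops(monthly_cost_usd, total_ops):
    """Dollars per 1000 operations."""
    return monthly_cost_usd / (total_ops / 1_000)


def storage_cost_per_gb(monthly_cost_usd, raw_pb, overhead=0.20, replication=3):
    """Dollars per usable GB after overhead and replication."""
    usable_gb = raw_pb * (1 - overhead) / replication * GB_PER_PB
    return monthly_cost_usd / usable_gb


if __name__ == "__main__":
    monthly_tco = 60_000  # less bandwidth
    print(cost_per_thousand_ops(0.25 * monthly_tco, 100_000_000))  # writes
    print(cost_per_thousand_ops(0.25 * monthly_tco, 200_000_000))  # reads
    print(storage_cost_per_gb(0.50 * monthly_tco, 10))             # storage
```

The 25 / 25 / 50 split is itself a modeling choice from the slide; a different allocation would shift cost between operation-heavy and storage-heavy tenants.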
  • 25. Measuring HBase Resource Consumption
    §  Writes
        Monthly HBase project cost: Write Ops per Region Server per Table Region = #W(R1:RS1) + #W(R2:RS1) + …; Cost = Total Writes x $0.15 / 1000 writes / month = $ for the Table / RS / Month
        Monthly roll-ups: write ops cost for all tables across all region servers for a user, app, BU, or the platform
    §  Reads
        Monthly HBase project cost: Read Ops per Region Server per Table Region = #R(R1:RS1) + #R(R2:RS1) + …; Cost = Total Reads x $0.075 / 1000 reads / month = $ for the Table / RS / Month
        Monthly roll-ups: read ops cost for all tables across all region servers for a user, app, BU, or the platform
    §  Storage
        Monthly HBase project cost: HDFS size of regions under hbase/table/<regions> in GBs; Cost = Total HDFS size x $0.011 / GB / Month = $ for the Table / Month
        Monthly roll-ups: total HDFS size for all tables across all region servers for a user, app, BU, or the platform
    §  Bandwidth
        Monthly HBase project cost: bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In / Out
        Monthly roll-ups: through relationship among user, app, and their BU
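To make the per-table metering concrete, the sketch below sums per-region operation counts and applies the deck's illustrative rates. The function name, argument layout, and sample counts are hypothetical.

```python
# Sketch only: names and sample inputs are hypothetical; the default
# rates are the deck's illustrative monthly rates.
def table_monthly_cost(writes_per_region, reads_per_region, hdfs_gb,
                       write_rate=0.15, read_rate=0.075, storage_rate=0.011):
    """Return write, read, storage, and total cost in dollars for one table."""
    write_cost = sum(writes_per_region) / 1_000 * write_rate
    read_cost = sum(reads_per_region) / 1_000 * read_rate
    storage_cost = hdfs_gb * storage_rate
    return {
        "writes": write_cost,
        "reads": read_cost,
        "storage": storage_cost,
        "total": write_cost + read_cost + storage_cost,
    }


if __name__ == "__main__":
    cost = table_monthly_cost(
        writes_per_region=[1_000_000, 2_000_000],  # #W per region
        reads_per_region=[4_000_000],              # #R per region
        hdfs_gb=500,                               # size of the table's regions
    )
    print(cost)
```

Rolling up to a user, app, or BU would mean summing these dictionaries over all tables that entity owns.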
  • 26. Putting it Together for HBase Services (ILLUSTRATIVE)
    HBase Services Billing Rate Card [ Monthly Rates ]
    §  Write Operations: unit count of operations; measured as monthly total write operations across regions of table; $0.15 / 1000 Writes
    §  Read Operations: unit count of operations; measured as monthly total read operations across regions of table; $0.075 / 1000 Reads
    §  HDFS (Storage): unit GB; measured as monthly peak storage used; $0.011 / GB
    §  Network Bandwidth: unit GB; measured as monthly total in / out; $0.02 / GB
    Monthly Bill for May 2014 (costs in $ K)
    §  BU 1: writes 30 M, $4.5; reads 20 M, $1.5; HDFS 3 PB used, 0.8 PB effective, $8.80; bandwidth 1.25 PB transferred, $0.025; total $14.82
    §  BU 2: writes 10 M, $1.5; reads 60 M, $4.5; HDFS 1 PB used, 0.27 PB effective, $2.93; bandwidth 0.5 PB transferred, $0.01; total $8.94
    §  … BU N: …
    §  Total: writes 100 M, $15; reads 200 M, $15; HDFS 10 PB used, 2.67 PB effective, $29.4; bandwidth 0.25 PB transferred, $5; total $64.4
  • 27. Multi-Tenant Deployment For Apache Storm
    [Diagram: Nimbus and Zookeeper; topologies X, Y & Z on shared Supervisors 1, 2, …, M, N; each Supervisor hosts X: Worker i, Y: Worker i, …, Z: Worker i]
  • 28. Understanding Apache Storm Resources
    [Diagram: a Supervisor with fixed worker slots hosting Topology A : Worker and Topology B : Worker, each running several Tasks]
    §  A Supervisor runs one or more worker processes for one or more topologies
    §  Each Supervisor has a fixed number of worker slots
    §  A worker process belongs to a specific topology
    §  The workers from topologies are distributed randomly across Supervisors
    §  Tasks perform the actual data processing
  • 29. Unit Costs for Storm Operations
    §  Compute: worker slots where topology workers execute the actual logic / tasks of spouts and bolts in parallel
        Unit: $ / Slot-Hour
        Total capacity: total number of slots
        Unit cost: Monthly Slots Used / Avail. Slots
    §  Bandwidth: network bandwidth needed to move data into / out of the clusters by topologies
        Unit: $ / GB for inter-region data transfers
        Total capacity: inter-region (peak) link capacity
        Unit cost: [Monthly GB In + Out] x $ / GB
  • 30. Working Through a Storm Example (ILLUSTRATIVE)
    §  Compute
        Monthly cost: Monthly TCO (less bw.) = $30 K; 24 slots per Supervisor @ 100% = $30 K
        Monthly capacity: 19.2 K slots = 19.2 K x 24 x 30 = 13.8 M Slot-Hours
        Unit cost: $30 K / 13.8 M Slot-Hours = $0.002 / Slot-Hour / Month
    §  Bandwidth
        Monthly cost: monthly charges = $2.5 K
        Monthly capacity: total data In + Out = 0.12 PB
        Unit cost: $2.5 K / 0.12 PB = $0.02 / GB transferred
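The Storm unit costs can be checked the same way as the Hadoop ones. The sketch below uses hypothetical function names with the slide's illustrative inputs, and assumes decimal units for the PB to GB conversion.

```python
# Sketch only: names are hypothetical; inputs are the slide's figures.
HOURS_PER_MONTH = 24 * 30  # the slide assumes a 30-day month
GB_PER_PB = 1_000_000      # assumption: decimal units


def slot_hour_unit_cost(monthly_cost_usd, total_slots):
    """Dollars per slot-hour: monthly cost over available slot-hours."""
    return monthly_cost_usd / (total_slots * HOURS_PER_MONTH)


def bandwidth_unit_cost(monthly_charges_usd, transferred_pb):
    """Dollars per GB transferred in or out."""
    return monthly_charges_usd / (transferred_pb * GB_PER_PB)


if __name__ == "__main__":
    print(f"${slot_hour_unit_cost(30_000, 19_200):.5f} / slot-hour")
    print(f"${bandwidth_unit_cost(2_500, 0.12):.4f} / GB")
```

Both results round to the rates quoted on the slide ($0.002 / Slot-Hour and $0.02 / GB).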
  • 31. Measuring Storm Resource Consumption (ILLUSTRATIVE)
    §  Compute
        Monthly cost: Worker Slot-Hours for Topologies = #W(TP1) x T(TP1) + #W(TP2) x T(TP2) + …; Cost = Worker Slot-Hours x $0.002 / Slot-Hour / Month = $ for the Topology / Month
        Monthly roll-ups: Worker Slot-Hours for all topologies can be summed up for the month for a user, app, BU, or the entire platform
    §  Bandwidth
        Monthly cost: bandwidth measured at the cluster level and divided among select apps and users of data based on average volume In / Out
        Monthly roll-ups: through relationship among user, app, and their BU
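The slot-hour formula can be sketched as follows. The function names and the sample topologies are hypothetical; only the $0.002 / Slot-Hour rate is taken from the deck.

```python
# Sketch only: names and sample inputs are hypothetical.
def worker_slot_hours(topologies):
    """Sum of worker count x hours run over (workers, hours) pairs."""
    return sum(workers * hours for workers, hours in topologies)


def topology_cost(topologies, rate_per_slot_hour=0.002):
    """Cost = Worker Slot-Hours x rate, in dollars."""
    return worker_slot_hours(topologies) * rate_per_slot_hour


if __name__ == "__main__":
    # (number of workers, hours run in the month) for two topologies
    usage = [(4, 720), (10, 360)]
    print(worker_slot_hours(usage), topology_cost(usage))
```

Because Storm topologies are long-running, hours run will often equal the full month, which makes the worker count the main cost driver.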
  • 32. Putting it Together for Storm Services (ILLUSTRATIVE)
    Storm Services Billing Rate Card [ Monthly Rates ]
    §  Compute: unit Worker Slot-Hours; measured as number of slots used by topology workers and hours they ran for; $0.002 / Slot-Hour
    §  Network Bandwidth: unit GB; measured as monthly total in / out; $0.02 / GB
    Monthly Bill for May 2014 (costs in $ K)
    §  BU1: compute 2.5 M Slot-Hours, $5; bandwidth 0.02 PB transferred, $0.4; total $5.4
    §  BU2: compute 1.25 M Slot-Hours, $2.5; bandwidth 0.04 PB transferred, $0.8; total $3.3
    §  … BU N: …
    §  Total: compute 10 M Slot-Hours, $20; bandwidth 0.12 PB transferred, $2.4; total $22.4
  • 33. Project Based Costing for Grid Services (ILLUSTRATIVE)
    Project Summary
    §  Grid Services Cost, May 2014: $165.5 K
    Project Usage Details (Data Center DC1)
    §  Apache Hadoop Services: $126 K
        Compute (Map & Reduce GB-Hours consumed @ $0.004 / GB-Hour): 12.5 M, $50 K
        Storage (GBs of peak storage used @ $0.019 / GB): 3.45 PB, $66 K
        Network (GBs In / Out @ $0.02 / GB): 0.5 PB, $10 K
    §  Apache HBase Services: $34.1 K
        Reads (number of read operations @ $0.075 / 1000 Reads): 30 M, $2.2 K
        Writes (number of write operations @ $0.15 / 1000 Writes): 20 M, $3.0 K
        Storage (GBs of peak storage used @ $0.011 / GB): 2.45 PB, $26.9 K
        Network (GBs In / Out @ $0.02 / GB): 0.1 PB, $2 K
    §  Apache Storm Services: $5.4 K
        Compute (Slot-Hours consumed @ $0.002 / Slot-Hour): 2.5 M, $5 K
        Network (GBs In / Out @ $0.02 / GB): 0.02 PB, $0.4 K
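The project summary is a roll-up of the per-service line items. The sketch below shows that aggregation using the dollar figures listed on the slide (in $ K); the data layout and function names are hypothetical.

```python
# Sketch only: the layout and names are hypothetical; the values are the
# slide's illustrative line items, in thousands of dollars.
PROJECT_USAGE_K = {
    "Apache Hadoop Services": {"Compute": 50.0, "Storage": 66.0, "Network": 10.0},
    "Apache HBase Services": {"Reads": 2.2, "Writes": 3.0, "Storage": 26.9,
                              "Network": 2.0},
    "Apache Storm Services": {"Compute": 5.0, "Network": 0.4},
}


def service_totals(usage):
    """Sum the line items under each service."""
    return {service: sum(items.values()) for service, items in usage.items()}


def project_total(usage):
    """Grand total across all services."""
    return sum(service_totals(usage).values())


if __name__ == "__main__":
    for service, total in service_totals(PROJECT_USAGE_K).items():
        print(f"{service}: ${total:.1f} K")
    print(f"Grid Services Cost: ${project_total(PROJECT_USAGE_K):.1f} K")
```

With these inputs the per-service subtotals and the grand total agree with the figures on the slide ($126 K, $34.1 K, $5.4 K, and $165.5 K).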
  • 34. Platform P&L (ILLUSTRATIVE; values left blank on purpose)
    Columns: Q4’12, Q1’13, Q2’13, Q3’13, Total, Total %
    Line items:
    §  Y! Gross Revenues
    §  Cost of revenues (less Grid CapEx)
    §  Gross Profit
    §  Grid OpEx: R&D Headcount; SE&O Headcount; Acquisition / Install; Active Use / Ops; Network Bandwidth; Total Grid OpEx
    §  Grid CapEx: Grid Services; Total Grid CapEx
    §  Contribution Margin
    §  Indirect Costs: G&A; Sales and Marketing
  • 35. Hadoop Cost Benchmarking: An Approach (ILLUSTRATIVE)
    Quantity equivalent (on-premise monthly used / unused / total, then public cloud equivalent of the used on-premise quantity, at public pricing or terms-based)
    §  M/R: 71.4 M / 61.6 M / 133 M; public cloud: compute instances (normalized time, RAM, 32/64 ops, I/O etc.), 1,000 instances / hr.
    §  HDFS: 148 PB / 52 PB / 200 PB; public cloud: storage (account for 3x repl., job / app space), 30 PB / month
    §  Avg. data processed: - / - / 75 PB; public cloud: instance storage, 2.5 PB daily
    Cost equivalent
    §  M/R: $0.50 M / $0.50 M / $1 M; public cloud: 1,000 x $0.70 / instance / hr. x 24 x 30 = $0.5 M
    §  HDFS: $0.75 M / $0.25 M / $1 M; public cloud: 30 PB x $0.04 / GB / month = $1.2 M
    §  Public cloud other costs (if any) such as reads, writes, data services / hour etc.: $0.25 M
    §  Total *: $1.25 M / $0.75 M / $2 M; public cloud total: $1.95 M
    * Ignored bandwidth, assumed equivalent
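The public cloud side of this comparison is a straightforward price-times-quantity calculation. The sketch below reproduces it with the slide's illustrative prices; the function name and return layout are hypothetical, and bandwidth is left out, as on the slide.

```python
# Sketch only: names are hypothetical; prices and quantities are the
# slide's illustrative figures. Bandwidth is ignored, as on the slide.
HOURS_PER_MONTH = 24 * 30
GB_PER_PB = 1_000_000  # assumption: decimal units


def public_cloud_monthly_cost(instances, price_per_instance_hour,
                              storage_pb, price_per_gb_month,
                              other_costs_usd=0.0):
    """Return compute, storage, other, and total monthly cost in dollars."""
    compute = instances * price_per_instance_hour * HOURS_PER_MONTH
    storage = storage_pb * GB_PER_PB * price_per_gb_month
    return {
        "compute": compute,
        "storage": storage,
        "other": other_costs_usd,
        "total": compute + storage + other_costs_usd,
    }


if __name__ == "__main__":
    cloud = public_cloud_monthly_cost(1_000, 0.70, 30, 0.04, 250_000)
    on_premise_total = 2_000_000  # used plus unused capacity
    print(f"public cloud: ${cloud['total']:,.0f}")
    print(f"on-premise:   ${on_premise_total:,.0f}")
```

Note that the comparison sets the on-premise total (including unused capacity) against a public cloud estimate sized only to the used quantity, which is why utilization matters so much to the outcome.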
  • 36. HBase and Storm Cost Benchmarking (ILLUSTRATIVE)
    Quantity equivalent for HBase (on-premise total, then public cloud equivalent at public pricing or terms-based)
    §  Reads: peak concurrent reads for a given record size, 300 MB/s; public cloud: reads on chosen instances (benchmarks 45 MB/s), 300/45 = 7 instances
    §  Writes: peak concurrent writes for a given record size, 160 MB/s; public cloud: writes on chosen instances (benchmarks 10 MB/s), 160/10 = 16 instances
    §  Storage: data storage in tables (incl. replication), 1.6 TB; public cloud: data served per instance (benchmarks 0.5 TB incl. repl.), 1.6/0.5 = 3
    §  Instances required based on throughput and storage needs: 16 instances / hour
    §  Cost calculations stay the same as Hadoop
    Quantity equivalent for Storm
    §  Slot-Hours: slot hours per month, 2.5 M; public cloud: instance hours based on memory and CPU requirements (12 slots / instance), 0.21 M instance hours
    §  Cost calculations stay the same as Hadoop
    * Ignored bandwidth, assumed equivalent
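The sizing logic here amounts to dividing each peak requirement by a per-instance benchmark and taking the largest result. The sketch below assumes rounding up to whole instances, which is not stated on the slide (it shows 3 for storage, where rounding up gives 4); the outcome is unaffected because write throughput dominates. All function names are hypothetical.

```python
# Sketch only: names are hypothetical, and rounding up to whole
# instances is an assumption not stated on the slide.
import math


def instances_needed(peak_demand, per_instance_capacity):
    """Whole instances required to cover a peak demand."""
    return math.ceil(peak_demand / per_instance_capacity)


def hbase_instances(read_mb_s, write_mb_s, storage_tb,
                    bench_read_mb_s=45, bench_write_mb_s=10,
                    bench_storage_tb=0.5):
    """Instances required: the largest of the three per-dimension needs."""
    return max(
        instances_needed(read_mb_s, bench_read_mb_s),
        instances_needed(write_mb_s, bench_write_mb_s),
        instances_needed(storage_tb, bench_storage_tb),
    )


def storm_instance_hours(slot_hours, slots_per_instance=12):
    """Instance hours equivalent to a monthly slot-hour total."""
    return slot_hours / slots_per_instance


if __name__ == "__main__":
    print(hbase_instances(300, 160, 1.6))
    print(round(storm_instance_hours(2_500_000)))
```

Sizing on the binding dimension means the other two dimensions are over-provisioned, which is one source of the peripheral costs that a public cloud estimate can hide.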
  • 37. Improving Utilization favors on-premise setup
    [Chart: cost ($) versus utilization / consumption (compute and storage) for on-premise Hadoop as a Service, on-demand public cloud service, and terms-based public cloud service; crossover points (x) separate the region that favors public cloud service from the region that favors on-premise Hadoop as a Service; labels: High starting cost, Scaling up]
    §  Sensitivity analysis on costs based on current and expected utilization or target utilization can provide further insights into your operations and cost competitiveness
  • 38. Improving Utilization improves ROI
    [Chart: cost amortized over apps ($) versus time, across Phase I, 2012-2013 (H 0.23), and 2014 & Future; Cost (t) = C at time t and Cost (t’) = C’ at time t’; # apps continue to grow on the platform]
    §  At time t, BU profits are R (t) - C (t) = π (t)
    §  Platform’s goal is to continue to increase the ROI while supporting new technology and services
    §  R (t’) - C (t’) = π (t’), where C (t’) < C (t) and π (t’) > π (t) for same or bigger revenues
  • 39. Going Forward
    Hadoop
    §  CPU as a resource
    §  Pre-emption and priority
    §  Long-running jobs
    §  Other potential resources such as disk, network, GPUs etc.
    §  Tez as the execution engine / container reuse
    HBase
    §  Multiple Region Servers per node
    §  Larger JVMs / GC improvements
    §  HBase-on-YARN
    §  cgroup profiles
    Storm
    §  Storm-on-YARN
    §  Resource aware scheduling (memory, CPU, network)
    §  cgroup profiles
    §  More experience with multi-tenancy
    Notes
    §  Co-exist with HBase to share the compute and memory: using the cgroup profiles at the Storm JVM level and topology worker level
    §  Resource aware scheduling: memory & CPU
    [Diagram label: YARN]
  • 40. Thank You @sumeetksingh @amritasshwar We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.