Twitter is powered by thousands of microservices that run on our internal cloud platform, a suite of multi-tenant platform services that offer compute, storage, messaging, monitoring, etc. as a service. These platforms have thousands of tenants and run atop hundreds of thousands of servers, across on-prem and the public cloud. The scale and diversity of these multi-tenant infrastructure services make it extremely difficult to forecast capacity, compute resource utilization and cost, and drive efficiency.
In this talk, I would like to share how my team is building a system (Kite, a unified service manager) to help define, model, provision, meter and charge for infrastructure resources. These resources include primitive bare-metal servers and VMs on the public cloud, as well as abstract resources offered by multi-tenant services such as our Compute platform (powered by Apache Aurora/Mesos), Storage (Manhattan for key/value, Cache, RDBMS) and Observability. Along with how we solved this problem, I also intend to share a few case studies on how we used this data to plan capacity better and drive a cultural change in engineering that helped improve overall resource utilization and drive significant savings in infrastructure spending.
9. INFRASTRUCTURE & DATACENTER MANAGEMENT
CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, INGRESS & PROXY
FRAMEWORK/LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG MGMT
DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
INFRASTRUCTURE SERVICES
STORAGE: MANHATTAN, BLOBSTORE, GRAPHSTORE, TIMESERIESDB
COMPUTE: MESOS/AURORA, HADOOP
DB/DW: MYSQL, VERTICA, POSTGRES
DEPLOY (Workflows)
17. Chargeback @Twitter
Ability to meter allocation & utilization of resources per service, per project, per engineering team to improve visibility & enable accountability
21. Chargeback @Twitter
Support diverse Infrastructure and Platform Services
1. Resource Catalog: Consistent way to inventory infrastructure resources
• Resource Fluidity: Support primitive (CPU) and abstract resource (“Tweets / second”). Extend existing resource
2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability
33. Total Cost of Ownership for Aurora
Cost of Physical Server ($X / day)
Total available Cores
Core categories: Operational Overhead; Headroom; Production Used Cores; Non-Prod Used Cores; Quota Buffer (Underutilized Quota); Container Size Buffer (Underutilized Reservation)
Callouts: Total used Cores; Excess Cores (incl. DR, Spikes, Overallocation); Cores used by platform for operations & maintenance
Unit price: $X per core-day
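To make the core-day unit price concrete, here is a minimal sketch of one way such a number could be derived. It is not the formula from the talk, which the slide does not spell out. The assumption is that the full daily cost of the fleet is spread over the cores tenants actually use, so overhead and headroom are amortized into the price. The names CoreBreakdown and price_per_core_day and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreBreakdown:
    """Cores in a cluster, split by the categories named on the slide."""
    operational_overhead: float
    headroom: float
    production_used: float
    non_prod_used: float

    @property
    def total_available(self) -> float:
        return (self.operational_overhead + self.headroom
                + self.production_used + self.non_prod_used)

    @property
    def total_used(self) -> float:
        return self.production_used + self.non_prod_used


def price_per_core_day(server_cost_per_day: float,
                       cores_per_server: int,
                       cores: CoreBreakdown) -> float:
    """Assumed model: total fleet cost per day divided by used cores."""
    if cores_per_server <= 0:
        raise ValueError("cores_per_server must be positive")
    if cores.total_used <= 0:
        raise ValueError("no used cores to charge against")
    servers = cores.total_available / cores_per_server
    total_cost_per_day = servers * server_cost_per_day
    return total_cost_per_day / cores.total_used


# Hypothetical numbers: 100 cores on 10-core servers costing $50/day each,
# so $500/day is spread over the 70 cores that tenants actually use.
example = CoreBreakdown(operational_overhead=10, headroom=20,
                        production_used=60, non_prod_used=10)
print(round(price_per_core_day(50.0, 10, example), 2))
```

Under this assumption the charged price is higher than the raw hardware cost per core, which is what gives tenants an incentive to release unused quota.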
36. Chargeback @Twitter
Metering Pipeline (ETL Job)
Diagram components: INFRASTRUCTURE SERVICE 1; INFRASTRUCTURE SERVICE 2; INGEST METRICS; RAW FACT; TRANSFORMER; RESOLVED FACT; RESOURCE CATALOG; IDENTIFIER OWNERSHIP MAPPING; DATA FIDELITY; REPORT
Schema(client_identifier, offering_measure, volume, metadata, timestamp)
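The schema above can be sketched as a typed record. This is an illustration only: the field names come from the slide, while the field types, the epoch-seconds timestamp, the parse_raw_fact helper and the "aurora.cores" measure name are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Mapping


@dataclass(frozen=True)
class RawFact:
    """One metered data point, mirroring the slide's Schema(...) fields."""
    client_identifier: str
    offering_measure: str
    volume: float
    timestamp: datetime
    metadata: Dict[str, str] = field(default_factory=dict)


def parse_raw_fact(record: Mapping[str, Any]) -> RawFact:
    """Validate an ingested record (assumed to carry epoch-second timestamps)."""
    volume = float(record["volume"])
    if volume < 0:
        raise ValueError("volume must be non-negative")
    return RawFact(
        client_identifier=str(record["client_identifier"]),
        offering_measure=str(record["offering_measure"]),
        volume=volume,
        timestamp=datetime.fromtimestamp(float(record["timestamp"]),
                                         tz=timezone.utc),
        metadata=dict(record.get("metadata", {})),
    )


# Hypothetical record emitted by an infrastructure service.
fact = parse_raw_fact({
    "client_identifier": "tweetypie.prod.tweetypie",
    "offering_measure": "aurora.cores",
    "volume": "12.5",
    "timestamp": 0,
})
print(fact.client_identifier, fact.volume)
```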
37. Chargeback @Twitter
Metering Pipeline (ETL Job): Transformer
1. Resolve Ownership
2. Cost Computation
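The two transformer steps can be sketched as a pair of lookups. This is a simplified illustration, not the production ETL job. It assumes the ownership mapping and the resource catalog can each be reduced to a dictionary (client identifier to owner, offering measure to unit price). The names ResolvedFact and transform are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping, Optional


@dataclass(frozen=True)
class RawFact:
    client_identifier: str
    offering_measure: str
    volume: float


@dataclass(frozen=True)
class ResolvedFact:
    client_identifier: str
    offering_measure: str
    volume: float
    owner: Optional[str]   # None when no ownership mapping exists
    cost: Optional[float]  # None when the catalog has no unit price


def transform(raw_facts: Iterable[RawFact],
              ownership: Mapping[str, str],
              unit_prices: Mapping[str, float]) -> List[ResolvedFact]:
    """Step 1 resolves the owner, step 2 computes cost = volume * unit price."""
    resolved = []
    for fact in raw_facts:
        owner = ownership.get(fact.client_identifier)
        price = unit_prices.get(fact.offering_measure)
        cost = None if price is None else fact.volume * price
        resolved.append(ResolvedFact(fact.client_identifier,
                                     fact.offering_measure,
                                     fact.volume, owner, cost))
    return resolved


# Hypothetical inputs: one mapped identifier, one unmapped.
facts = [RawFact("tweetypie.prod.tweetypie", "aurora.cores", 10.0),
         RawFact("unknown.prod.job", "aurora.cores", 4.0)]
for row in transform(facts,
                     ownership={"tweetypie.prod.tweetypie": "tweetypie"},
                     unit_prices={"aurora.cores": 2.0}):
    print(row.owner, row.cost)
```

Unresolved owners and missing prices are kept as None rather than dropped, so a later fidelity check can count and report them.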
40. Chargeback @Twitter
Metering Pipeline (ETL Job): Data Fidelity & Reporting
1. Verify Data Integrity & Fidelity
2. Alert when things don’t seem the way they should be
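One simple form of such a fidelity check is to compare per-measure totals between two pipeline runs and flag large swings. The sketch below assumes that approach. The talk does not describe the actual checks, and the fidelity_alerts name and the 50% threshold are hypothetical.

```python
from typing import List, Mapping


def fidelity_alerts(previous_totals: Mapping[str, float],
                    current_totals: Mapping[str, float],
                    max_relative_change: float = 0.5) -> List[str]:
    """Return human-readable alerts for suspicious run-over-run changes."""
    alerts = []
    for measure in sorted(set(previous_totals) | set(current_totals)):
        prev = previous_totals.get(measure)
        curr = current_totals.get(measure)
        if prev is None:
            alerts.append(f"{measure}: new measure with no history")
        elif curr is None:
            alerts.append(f"{measure}: missing from current run")
        elif prev == 0:
            if curr != 0:
                alerts.append(f"{measure}: volume appeared from zero")
        else:
            change = abs(curr - prev) / abs(prev)
            if change > max_relative_change:
                alerts.append(f"{measure}: volume changed by {change:.0%}")
    return alerts


# Hypothetical daily totals per offering measure.
yesterday = {"aurora.cores": 100.0, "manhattan.storage_gb": 10.0}
today = {"aurora.cores": 105.0, "manhattan.storage_gb": 30.0}
for alert in fidelity_alerts(yesterday, today):
    print(alert)
```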
47. Chargeback @Twitter
Reports
Customers and the reports they use:
• Infrastructure & Platform Operators: Overall Cluster Growth; Allocation vs. Utilization of resources by Client/Tenant
• Finance & Execs: Budget vs. Spend per Org; Infrastructure PnL; Overall Efficiency & Trends
• Service Owners & Developers: Team Bill; Per Service Allocation vs. Utilization of Resources
51. Chargeback @Twitter
Learnings
1. Invest in data fidelity
• Trust in data is most important.
• Invest in monitoring & alerting for data inconsistencies
• Leverage this for detecting abnormal increases/decreases and notifying users
2. Accurate Ownership Mapping
• Static mappings go out of date quickly
• Invest in systems (e.g., Kite) for users to manage it themselves
3. Logical grouping of resources
• Identifiers were too granular and teams were too broad.
• Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain it
4. Track historical data
• Unit prices change over time
• Orgs / Teams change over time
• Resources get added / removed
• Change history is essential for consistency, which is used for capacity planning
57. [Architecture diagram] Components: SERVICE IDENTITY MANAGER; RESOURCE PROVISIONING MANAGER; DASHBOARD (SINGLE PANE OF GLASS); REPORTING; SERVICE LIFECYCLE WORKFLOWS; METADATA; RESOURCE QUOTA MANAGEMENT; METERING & CHARGEBACK; CLIENT IDENTITY; PROVIDER APIS & ADAPTERS; INFRASTRUCTURE SERVICE (x3); INFRASTRUCTURE & PLATFORM SERVICE
59. Kite @Twitter
Client Identifier Management
Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership
• Capture Org Structure: Support org structure changes, project transfer workflows to ensure up-to-date ownership of identifiers
• Unify client identifier provisioning workflow: Enables single source of truth and reduces operator pain around provisioning and managing client identifiers.
63. IDENTITY ENTITY MODEL
Hierarchy (each level is 1:N to the next): BUSINESS OWNER > TEAM > PROJECT > SERVICE/SYSTEM ACCOUNT > <INFRA, CLIENTID>
Example: INFRASTRUCTURE > TWEETYPIE > tweetypie > tweetypie > <Aurora, tweetypie.prod.tweetypie>
Example: REVENUE > ADS PREDICTION > prediction > ads-prediction > <Aurora, ads-prediction.prod.campaign-x>
Entities are time varying dimensions
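A common way to model time-varying dimensions is to give each ownership row a validity interval and look it up as of a date. The sketch below assumes that approach. The talk does not say how Kite stores history, and OwnershipRecord, owner_as_of, the dates and "EXAMPLE TEAM B" are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class OwnershipRecord:
    """One row of the hierarchy, valid for [valid_from, valid_to)."""
    infra: str
    client_id: str
    service_account: str
    project: str
    team: str
    business_owner: str
    valid_from: date
    valid_to: Optional[date] = None  # None means still current


def owner_as_of(records: Iterable[OwnershipRecord],
                infra: str, client_id: str,
                day: date) -> Optional[OwnershipRecord]:
    """Return the ownership row in effect for an identifier on a given day."""
    for rec in records:
        if (rec.infra == infra and rec.client_id == client_id
                and rec.valid_from <= day
                and (rec.valid_to is None or day < rec.valid_to)):
            return rec
    return None


# Hypothetical history: the project moves to another team on 2019-01-01.
history = [
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "tweetypie",
                    "tweetypie", "TWEETYPIE", "INFRASTRUCTURE",
                    date(2018, 1, 1), date(2019, 1, 1)),
    OwnershipRecord("Aurora", "tweetypie.prod.tweetypie", "tweetypie",
                    "tweetypie", "EXAMPLE TEAM B", "INFRASTRUCTURE",
                    date(2019, 1, 1)),
]
print(owner_as_of(history, "Aurora", "tweetypie.prod.tweetypie",
                  date(2018, 6, 1)).team)
```

With rows like these, a bill for a past month is attributed to whoever owned the identifier then, even after a reorg or project transfer.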
75. Impact & Future Work
Future Work
1. Resource provisioning
• Extend Quota Manager and unify the experience into Kite
• Onboard Hadoop, Storage and other systems
2. Enable project deprecation
• Detect unused resources, notify users, trigger deprecation process based on policy
3. Capacity Planning
• Provide historic trends and help with forecast of capacity
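The project-deprecation item (detect unused resources, notify users, trigger deprecation based on policy) could be sketched as an idle-time classification. This is speculative, since the slide describes future work. The policy fields, thresholds and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Mapping


@dataclass(frozen=True)
class DeprecationPolicy:
    """Hypothetical policy: idle-day thresholds for each action."""
    idle_days_before_notice: int
    idle_days_before_deprecation: int


def classify_idle(last_used: Mapping[str, date],
                  today: date,
                  policy: DeprecationPolicy) -> Dict[str, str]:
    """Map each idle identifier to 'notify' or 'deprecate'; active ones are omitted."""
    actions = {}
    for identifier, last in last_used.items():
        idle_days = (today - last).days
        if idle_days >= policy.idle_days_before_deprecation:
            actions[identifier] = "deprecate"
        elif idle_days >= policy.idle_days_before_notice:
            actions[identifier] = "notify"
    return actions


# Hypothetical last-usage dates derived from metering data.
policy = DeprecationPolicy(idle_days_before_notice=30,
                           idle_days_before_deprecation=90)
usage = {"job-a": date(2020, 5, 30),
         "job-b": date(2020, 4, 15),
         "job-c": date(2020, 1, 1)}
print(classify_idle(usage, date(2020, 6, 1), policy))
```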