SlideShare a Scribd company logo
1 of 16
Download to read offline
Designing and Deploying
 Internet-Scale Services


          James Hamilton
                   2008.01.17
    Architect, Windows Live Platform Services
            JamesRH@microsoft.com
     Web: research.microsoft.com/~jamesrh
        Blog: perspectives.mvdirona.com
Agenda
•   Overview
•   Recovery-Oriented Computing
•   Overall Application Design
•   Operational Issues
•   Summary



Contributors: Search, Mail, Exchange Hosted Services, Live Collaboration Server, Contacts &
Storage, Spaces, Xbox Live, Rackable Systems, Messenger, WinLive Operations, & MS.com Ops
1/17/2008                                                                                     2
Background and biases
• 15 years in database engine development
      – Lead architect on IBM DB2
      – Architect on SQL Server
            • Led variety of core engine teams including SQL client, SQL
              compiler, optimizer, XML, full text search, execution engine,
              protocols, etc.
• Led the Exchange Hosted Services Team
      – Email anti-spam, anti-virus, and archiving for 2.2m seats
        with $27m revenue
      – ~700 servers in 10 data centers world-wide
• Currently architect on Windows Live Platform Services
• Automation & redundancy is only way to:
      – Reduce costs
      – Improve rate of innovation
      – Reduce operational failures and downtime


1/17/2008                                                                     3
Motivation
• System-to-admin ratio indicator of admin costs
      – Tracking total ops costs often gamed
            • Outsourcing halves ops costs without addressing real issues
      – Inefficient properties: <10:1
      – Enterprise: 150:1
      – Best services: over 2,000:1
• 80% of ops issues from design and development
      – Poorly written applications are difficult to automate
• Focus on reducing ops costs during design &
  development

1/17/2008                                                                   4
What does operations do?
                              Site Management                      Architectural
                      Software      Total                        Engineering Total
                    Development      7%                                 8%
                        Total
                         7%
                     Requests Total
                          6%


                                                                      Deployment
                                Overhead Total                        Management
                                     11%                                 Total
                                                                         31%



                                                    Incident
                                                   Management
                         Problem                                                       Source: Deepak Patil, Global
                                                      Total
                     Engineering Total                                               Foundation Services (8/14/2006)
                                                      20%
                           10%



• 51% is deployment & incident management (known resolution)
•   Teams: Messenger, Contacts and Storage & business unit IT services

1/17/2008                                Windows Live Platform Services                                            5
ROC design pattern
• Recover-oriented computing (ROC)
      – Assume software & hardware will fail frequently & unpredictably
• Heavily instrument applications to detect failures

   App
              Bohr Bug      Urgent                                  Bohr bug: Repeatable functional
                             Alert
                                                                    software issue (functional bugs);
      Heisenbug
                                                                    should be rare in production
                                                                    Heisenbug: Software issue that only
  Restart
                                                                    occurs in unusual cross-request
    Failure
               Reboot
                                                                    timing issues or the pattern of long
                                                                    sequences of independent
                 Failure
                           Re-image                                 operations; some found only in
                             Failure
                                                                    production
                                       Replace

                                        Machine out of rotation and power down
                                        Set LCD/LED to quot;needs servicequot;


1/17/2008                                  Windows Live Platform Services                             6
Overall application design
•   Single-box deployment
•   Development and testing in full environment
•   Quick service health check
•   Zero trust of underlying components
•   Pod or cluster independence
•   Implement & test ops tools and utilities
•   Simplicity throughout
•   Partition & version everything

1/17/2008                                         7
Design for auto-mgmt & provisioning
•   Support for geo-distribution
•   Auto-provisioning & auto-installation mandatory
•   Manage quot;service rolequot; rather than servers
•   Multi-system failures are common
      – Limit automation range of action
• Never rely on local, non-replicated persistent state
• Don't worry about clean shutdown
      – Often won't get it & need this path tested
• Explicitly install everything and then verify
• Force fail all services and components regularly

1/17/2008                 Windows Live Platform Services   8
Release cycle & testing
• Ship frequently:
      – Small releases ship more smoothly
      – Increases pace of innovation
      – Long stabilization periods not required in services
• Use production data to find problems (traffic capture)
      – Measurable release criteria
      – Release criteria includes quality and throughput data
• Track all recovered errors to protect against automation-
  supported service entropy
• Test all error paths in integration & in production
• Test in production via incremental deployment & roll-back
      – Never deploy without tested roll-back
      – Continue testing after release

1/17/2008                                                       9
Design for incremental release
• Incrementally release with schema changes?
   – Old code must run against new schema, or
   – Two-phase process (avoid if possible)
            • Update code to support both, commit changes, and then upgrade schema
• Incrementally release with user experience (UX) changes?
   – Separate UX from infrastructure
   – Ensure old UX works with new infrastructure
   – Deploy infrastructure incrementally
   – On success, bring a small beta population onto new UX
   – On continued success, announce new UX and set a date to
      roll out
• Client-side code?
   – Ensure old & new clients both run with new infrastructure
1/17/2008                                                                       10
Graceful degradation & admission control

• No amount of quot;head roomquot; is sufficient
   – Even at 25-50% H/W utilization, spikes will exceed 100%
• Prevent overload through admission control
• Graceful degradation prior to admission control
   – Find less resource-intensive modes to provide (possibly)
     degraded services
• Related concept: Metered rate-of-service admission
   – Service login typically more expensive than steady state
   – Allow a single or small number of users in when restarting
     a service after failure

1/17/2008                                                     11
Auditing, monitoring, & alerting
• Produce perf data, health data & throughput data
• All config changes need to be tracked via audit log
• Alerting goals:
      – No customer events without an alert (detect problems)
      – Alert to event ratio nearing 1 (don’t false alarm)
• Alerting is an art … need to tune alerting frequently
      – Can’t embed in code (too hard to change)
      – Code produces events, events tracked centrally, alerts produced via
        queries over event DB
• Testing in production requires very reliable monitoring
      – Combination of detection & capability to roll back allows nimbleness
• Tracked events for all interesting issues
      – Latencies are toughest issues to detect

1/17/2008                                                                      12
Dependency management
• Expect latency & failures in dependent services
    – Run on cached data or offer degraded services
    – Test failure & latency frequently in production
• Don’t depend upon features not yet shipped
    – It takes time to work out reliability & scaling issues
• Select dependent components & services thoughtfully
    – On-server components need consistent quality goals
    – Dependent services should be large granule (“worth” sharing)
• Isolate services & decouple components
    – Contain faults within services
    – Assume different upgrade rates
    – Rather than auth on each connect, use session key and refresh every N
      hours (avoids login storms)

1/17/2008                                                                13
Customer & press communications plan
• Systems fail & you will experience latency
• Communicate through multiple channels
      – Opt-in RSS, web, IM, email, etc.
      – If app has client, report details through
        client
• Set ETA expectations & inform
• Some events will bring press attention
• There is a natural tendency to hide systems issues
• Prepare for serious scenarios in advance
   • Data loss, data corruption, security breach, privacy violation
• Prepare communications skeleton plan in advance
   • Who gets called, communicates with the press, & how data is gathered
   • Silence typically interpreted as hiding something or lack of control

1/17/2008                                                              14
Summary
• Reduced operations costs & improved reliability
  through automation
• Full automation dependent upon partitioning &
  redundancy
• Each human administrative interaction is an
  opportunity for error
• Design for failure in all components & test
  frequently
• Rollback & deep monitoring allows safe
  production testing

1/17/2008                                           15
More Information
• Designing & Deploying Internet-Scale Services paper:
      – http://research.microsoft.com/~JamesRH/TalksAndPapers/JamesRH_Lisa.pdf
• Autopilot: Automatic Data Center Operation
      – http://research.microsoft.com/users/misard/papers/osr2007.pdf
• Recovery-Oriented Computing
      – http://roc.cs.berkeley.edu/
      – http://www.cs.berkeley.edu/~pattrsn/talks/HPCAkeynote.ppt
      – http://www.sciam.com/article.cfm?articleID=000DAA41-3B4E-1EB7-
        BDC0809EC588EEDF
• These slides:
      – http://research.microsoft.com/~jamesrh
• Email:
      – JamesRH@microsoft.com
• External Blog:
      – http://perspectives.mvdirona.com
1/17/2008                                                                        16

More Related Content

What's hot

Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...InfluxData
 
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...Edge AI and Vision Alliance
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...InfluxData
 
AGILE Open Call #1 Pitch
AGILE Open Call #1 PitchAGILE Open Call #1 Pitch
AGILE Open Call #1 PitchAGILE IoT
 
Studio 5000® Application Code Manager: Introduction and Demonstration
Studio 5000® Application Code Manager: Introduction and DemonstrationStudio 5000® Application Code Manager: Introduction and Demonstration
Studio 5000® Application Code Manager: Introduction and DemonstrationRockwell Automation
 
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...Rockwell Automation
 
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol SupportCloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol SupportVMware Tanzu
 
FactoryTalk® AssetCentre: Overview
FactoryTalk® AssetCentre: OverviewFactoryTalk® AssetCentre: Overview
FactoryTalk® AssetCentre: OverviewRockwell Automation
 
IoT and Cloud services interactions
IoT and Cloud services interactionsIoT and Cloud services interactions
IoT and Cloud services interactionsAGILE IoT
 
Smart Devices: Helping Design, Operate and Maintain The Connected Enterprise
Smart Devices: Helping Design, Operate and Maintain The Connected EnterpriseSmart Devices: Helping Design, Operate and Maintain The Connected Enterprise
Smart Devices: Helping Design, Operate and Maintain The Connected EnterpriseRockwell Automation
 
Robert Murphy Driving Value from Smart Manufacturing
Robert Murphy Driving Value from Smart ManufacturingRobert Murphy Driving Value from Smart Manufacturing
Robert Murphy Driving Value from Smart ManufacturingRockwell Automation
 
S1: Predix ISV Partner Program (Predix Transform 2016)
S1: Predix ISV Partner Program (Predix Transform 2016)S1: Predix ISV Partner Program (Predix Transform 2016)
S1: Predix ISV Partner Program (Predix Transform 2016)Predix
 
Project Design Considerations for Integration
Project Design Considerations for IntegrationProject Design Considerations for Integration
Project Design Considerations for IntegrationRockwell Automation
 
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)Samy Fodil
 
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured ObjectsRA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured ObjectsRockwell Automation
 
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...Rockwell Automation
 
Performance and Analytics Cloud for Machine Builders
Performance and Analytics Cloud for Machine BuildersPerformance and Analytics Cloud for Machine Builders
Performance and Analytics Cloud for Machine BuildersRockwell Automation
 
Migration Tools to Convert Your Legacy DCS
Migration Tools to Convert Your Legacy DCSMigration Tools to Convert Your Legacy DCS
Migration Tools to Convert Your Legacy DCSRockwell Automation
 
Smart Devices - Design ,Operate and Maintain
Smart Devices - Design ,Operate and MaintainSmart Devices - Design ,Operate and Maintain
Smart Devices - Design ,Operate and Maintainsoftconsystem
 

What's hot (20)

Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
 
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
 
AGILE Open Call #1 Pitch
AGILE Open Call #1 PitchAGILE Open Call #1 Pitch
AGILE Open Call #1 Pitch
 
Studio 5000® Application Code Manager: Introduction and Demonstration
Studio 5000® Application Code Manager: Introduction and DemonstrationStudio 5000® Application Code Manager: Introduction and Demonstration
Studio 5000® Application Code Manager: Introduction and Demonstration
 
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
 
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol SupportCloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
 
CDK - The next big thing - Quang Phuong
CDK - The next big thing - Quang PhuongCDK - The next big thing - Quang Phuong
CDK - The next big thing - Quang Phuong
 
FactoryTalk® AssetCentre: Overview
FactoryTalk® AssetCentre: OverviewFactoryTalk® AssetCentre: Overview
FactoryTalk® AssetCentre: Overview
 
IoT and Cloud services interactions
IoT and Cloud services interactionsIoT and Cloud services interactions
IoT and Cloud services interactions
 
Smart Devices: Helping Design, Operate and Maintain The Connected Enterprise
Smart Devices: Helping Design, Operate and Maintain The Connected EnterpriseSmart Devices: Helping Design, Operate and Maintain The Connected Enterprise
Smart Devices: Helping Design, Operate and Maintain The Connected Enterprise
 
Robert Murphy Driving Value from Smart Manufacturing
Robert Murphy Driving Value from Smart ManufacturingRobert Murphy Driving Value from Smart Manufacturing
Robert Murphy Driving Value from Smart Manufacturing
 
S1: Predix ISV Partner Program (Predix Transform 2016)
S1: Predix ISV Partner Program (Predix Transform 2016)S1: Predix ISV Partner Program (Predix Transform 2016)
S1: Predix ISV Partner Program (Predix Transform 2016)
 
Project Design Considerations for Integration
Project Design Considerations for IntegrationProject Design Considerations for Integration
Project Design Considerations for Integration
 
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
 
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured ObjectsRA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
 
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
 
Performance and Analytics Cloud for Machine Builders
Performance and Analytics Cloud for Machine BuildersPerformance and Analytics Cloud for Machine Builders
Performance and Analytics Cloud for Machine Builders
 
Migration Tools to Convert Your Legacy DCS
Migration Tools to Convert Your Legacy DCSMigration Tools to Convert Your Legacy DCS
Migration Tools to Convert Your Legacy DCS
 
Smart Devices - Design ,Operate and Maintain
Smart Devices - Design ,Operate and MaintainSmart Devices - Design ,Operate and Maintain
Smart Devices - Design ,Operate and Maintain
 

Viewers also liked

Viewers also liked (17)

A HistóRia De Feio
A HistóRia De FeioA HistóRia De Feio
A HistóRia De Feio
 
Workshop C Impact Of Growth Agenda
Workshop C   Impact Of Growth AgendaWorkshop C   Impact Of Growth Agenda
Workshop C Impact Of Growth Agenda
 
Aeropuerto
AeropuertoAeropuerto
Aeropuerto
 
Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...
Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...
Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...
 
PresentacióN1
PresentacióN1PresentacióN1
PresentacióN1
 
ApresentaçãO2tic
ApresentaçãO2ticApresentaçãO2tic
ApresentaçãO2tic
 
Opera bug
Opera bugOpera bug
Opera bug
 
Creativity Day Milano 27 Febbraio Milano
Creativity Day Milano 27 Febbraio MilanoCreativity Day Milano 27 Febbraio Milano
Creativity Day Milano 27 Febbraio Milano
 
adirealoral
adirealoraladirealoral
adirealoral
 
Paper.li work
Paper.li workPaper.li work
Paper.li work
 
点道
点道点道
点道
 
Aria au pays du Web
Aria au pays du WebAria au pays du Web
Aria au pays du Web
 
Liberty
LibertyLiberty
Liberty
 
2011 04-19 html5 css3 及脚本
2011 04-19 html5 css3 及脚本2011 04-19 html5 css3 及脚本
2011 04-19 html5 css3 及脚本
 
microtest12
microtest12microtest12
microtest12
 
SowProg, présentation lors du Mama 2012
SowProg, présentation lors du Mama 2012SowProg, présentation lors du Mama 2012
SowProg, présentation lors du Mama 2012
 
Sir Richard Mac Cormac, Density
Sir Richard Mac Cormac, DensitySir Richard Mac Cormac, Density
Sir Richard Mac Cormac, Density
 

Similar to Designing and Deploying Internet-Scale Services

Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...nick_hans
 
Netcetera Proactive Management Service
Netcetera Proactive Management ServiceNetcetera Proactive Management Service
Netcetera Proactive Management ServicePeter Skelton
 
Agile Maintenance
Agile MaintenanceAgile Maintenance
Agile MaintenanceNaresh Jain
 
Joe Honan Virtualization Trends
Joe Honan   Virtualization TrendsJoe Honan   Virtualization Trends
Joe Honan Virtualization Trends1velocity
 
Slow Cool 20081009 Final
Slow Cool 20081009 FinalSlow Cool 20081009 Final
Slow Cool 20081009 Finalrajivmordani
 
Netpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APMNetpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APMBoni Bruno
 
Exposing and Fixing Common App Performance Problems
Exposing and Fixing Common App Performance ProblemsExposing and Fixing Common App Performance Problems
Exposing and Fixing Common App Performance ProblemsRiverbed Technology
 
E Shared Services Effort
E Shared Services EffortE Shared Services Effort
E Shared Services EffortArturo Saavedra
 
E g innovations overview
E g innovations overviewE g innovations overview
E g innovations overviewNuno Alves
 
eG Enterprise Citrix XenDesktop Monitor Product Tour
eG Enterprise Citrix XenDesktop Monitor Product ToureG Enterprise Citrix XenDesktop Monitor Product Tour
eG Enterprise Citrix XenDesktop Monitor Product ToureG Innovations
 
Monitoring Cloud/Virtual/Physical IT Infrastructures
Monitoring Cloud/Virtual/Physical IT InfrastructuresMonitoring Cloud/Virtual/Physical IT Infrastructures
Monitoring Cloud/Virtual/Physical IT InfrastructuresJohnnie Burke-Gaffney
 
Managing and Monitoring Virtual/Cloud/Physical Infrastructures
Managing and Monitoring Virtual/Cloud/Physical InfrastructuresManaging and Monitoring Virtual/Cloud/Physical Infrastructures
Managing and Monitoring Virtual/Cloud/Physical InfrastructuresJohnnie Burke-Gaffney
 
Appdynamics Training Session
Appdynamics Training SessionAppdynamics Training Session
Appdynamics Training SessionCodvaTech Labs
 
Building Composite Applications in Lotus Notes
Building Composite Applications in Lotus NotesBuilding Composite Applications in Lotus Notes
Building Composite Applications in Lotus Notesdominion
 
Movie Rating Site [for presentation]
Movie Rating Site [for presentation]Movie Rating Site [for presentation]
Movie Rating Site [for presentation]SH Rajøn
 
Unified Process
Unified ProcessUnified Process
Unified Processguy_davis
 
Top 5 .NET Challenges, Performance Monitoring Tips & Tricks
Top 5 .NET Challenges, Performance Monitoring Tips & TricksTop 5 .NET Challenges, Performance Monitoring Tips & Tricks
Top 5 .NET Challenges, Performance Monitoring Tips & TricksAppDynamics
 

Similar to Designing and Deploying Internet-Scale Services (20)

Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...
 
Netcetera Proactive Management Service
Netcetera Proactive Management ServiceNetcetera Proactive Management Service
Netcetera Proactive Management Service
 
Agile Maintenance
Agile MaintenanceAgile Maintenance
Agile Maintenance
 
Joe Honan Virtualization Trends
Joe Honan   Virtualization TrendsJoe Honan   Virtualization Trends
Joe Honan Virtualization Trends
 
Slow Cool 20081009 Final
Slow Cool 20081009 FinalSlow Cool 20081009 Final
Slow Cool 20081009 Final
 
Netpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APMNetpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APM
 
Exposing and Fixing Common App Performance Problems
Exposing and Fixing Common App Performance ProblemsExposing and Fixing Common App Performance Problems
Exposing and Fixing Common App Performance Problems
 
E Shared Services Effort
E Shared Services EffortE Shared Services Effort
E Shared Services Effort
 
E g innovations overview
E g innovations overviewE g innovations overview
E g innovations overview
 
eG Enterprise Citrix XenDesktop Monitor Product Tour
eG Enterprise Citrix XenDesktop Monitor Product ToureG Enterprise Citrix XenDesktop Monitor Product Tour
eG Enterprise Citrix XenDesktop Monitor Product Tour
 
Monitoring Cloud/Virtual/Physical IT Infrastructures
Monitoring Cloud/Virtual/Physical IT InfrastructuresMonitoring Cloud/Virtual/Physical IT Infrastructures
Monitoring Cloud/Virtual/Physical IT Infrastructures
 
Managing and Monitoring Virtual/Cloud/Physical Infrastructures
Managing and Monitoring Virtual/Cloud/Physical InfrastructuresManaging and Monitoring Virtual/Cloud/Physical Infrastructures
Managing and Monitoring Virtual/Cloud/Physical Infrastructures
 
Appdynamics Training Session
Appdynamics Training SessionAppdynamics Training Session
Appdynamics Training Session
 
XS 2008 Boston Capacity Planning
XS 2008 Boston Capacity PlanningXS 2008 Boston Capacity Planning
XS 2008 Boston Capacity Planning
 
Building Composite Applications in Lotus Notes
Building Composite Applications in Lotus NotesBuilding Composite Applications in Lotus Notes
Building Composite Applications in Lotus Notes
 
Movie Rating Site [for presentation]
Movie Rating Site [for presentation]Movie Rating Site [for presentation]
Movie Rating Site [for presentation]
 
Unified Process
Unified ProcessUnified Process
Unified Process
 
Seminar - JBoss Migration
Seminar - JBoss MigrationSeminar - JBoss Migration
Seminar - JBoss Migration
 
Top 5 .NET Challenges, Performance Monitoring Tips & Tricks
Top 5 .NET Challenges, Performance Monitoring Tips & TricksTop 5 .NET Challenges, Performance Monitoring Tips & Tricks
Top 5 .NET Challenges, Performance Monitoring Tips & Tricks
 
Web 2.0 Expo Notes
Web 2.0 Expo NotesWeb 2.0 Expo Notes
Web 2.0 Expo Notes
 

Recently uploaded

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 

Recently uploaded (20)

Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 

Designing and Deploying Internet-Scale Services

  • 1. Designing and Deploying Internet-Scale Services James Hamilton 2008.01.17 Architect, Windows Live Platform Services JamesRH@microsoft.com Web: research.microsoft.com/~jamesrh Blog: perspectives.mvdirona.com
  • 2. Agenda • Overview • Recovery-Oriented Computing • Overall Application Design • Operational Issues • Summary Contributors: Search, Mail, Exchange Hosted Services, Live Collaboration Server, Contacts & Storage, Spaces, Xbox Live, Rackable Systems, Messenger, WinLive Operations, & MS.com Ops 1/17/2008 2
  • 3. Background and biases • 15 years in database engine development – Lead architect on IBM DB2 – Architect on SQL Server • Led variety of core engine teams including SQL client, SQL compiler, optimizer, XML, full text search, execution engine, protocols, etc. • Led the Exchange Hosted Services Team – Email anti-spam, anti-virus, and archiving for 2.2m seats with $27m revenue – ~700 servers in 10 data centers world-wide • Currently architect on Windows Live Platform Services • Automation & redundancy is only way to: – Reduce costs – Improve rate of innovation – Reduce operational failures and downtime 1/17/2008 3
  • 4. Motivation • System-to-admin ratio indicator of admin costs – Tracking total ops costs often gamed • Outsourcing halves ops costs without addressing real issues – Inefficient properties: <10:1 – Enterprise: 150:1 – Best services: over 2,000:1 • 80% of ops issues from design and development – Poorly written applications are difficult to automate • Focus on reducing ops costs during design & development 1/17/2008 4
  • 5. What does operations do? Site Management Architectural Software Total Engineering Total Development 7% 8% Total 7% Requests Total 6% Deployment Overhead Total Management 11% Total 31% Incident Management Problem Source: Deepak Patil, Global Total Engineering Total Foundation Services (8/14/2006) 20% 10% • 51% is deployment & incident management (known resolution) • Teams: Messenger, Contacts and Storage & business unit IT services 1/17/2008 Windows Live Platform Services 5
  • 6. ROC design pattern • Recover-oriented computing (ROC) – Assume software & hardware will fail frequently & unpredictably • Heavily instrument applications to detect failures App Bohr Bug Urgent Bohr bug: Repeatable functional Alert software issue (functional bugs); Heisenbug should be rare in production Heisenbug: Software issue that only Restart occurs in unusual cross-request Failure Reboot timing issues or the pattern of long sequences of independent Failure Re-image operations; some found only in Failure production Replace Machine out of rotation and power down Set LCD/LED to quot;needs servicequot; 1/17/2008 Windows Live Platform Services 6
  • 7. Overall application design • Single-box deployment • Development and testing in full environment • Quick service health check • Zero trust of underlying components • Pod or cluster independence • Implement & test ops tools and utilities • Simplicity throughout • Partition & version everything 1/17/2008 7
  • 8. Design for auto-mgmt & provisioning • Support for geo-distribution • Auto-provisioning & auto-installation mandatory • Manage quot;service rolequot; rather than servers • Multi-system failures are common – Limit automation range of action • Never rely on local, non-replicated persistent state • Don't worry about clean shutdown – Often won't get it & need this path tested • Explicitly install everything and then verify • Force fail all services and components regularly 1/17/2008 Windows Live Platform Services 8
  • 9. Release cycle & testing • Ship frequently: – Small releases ship more smoothly – Increases pace of innovation – Long stabilization periods not required in services • Use production data to find problems (traffic capture) – Measurable release criteria – Release criteria includes quality and throughput data • Track all recovered errors to protect against automation- supported service entropy • Test all error paths in integration & in production • Test in production via incremental deployment & roll-back – Never deploy without tested roll-back – Continue testing after release 1/17/2008 9
  • 10. Design for incremental release • Incrementally release with schema changes? – Old code must run against new schema, or – Two-phase process (avoid if possible) • Update code to support both, commit changes, and then upgrade schema • Incrementally release with user experience (UX) changes? – Separate UX from infrastructure – Ensure old UX works with new infrastructure – Deploy infrastructure incrementally – On success, bring a small beta population onto new UX – On continued success, announce new UX and set a date to roll out • Client-side code? – Ensure old & new clients both run with new infrastructure 1/17/2008 10
  • 11. Graceful degradation & admission control • No amount of quot;head roomquot; is sufficient – Even at 25-50% H/W utilization, spikes will exceed 100% • Prevent overload through admission control • Graceful degradation prior to admission control – Find less resource-intensive modes to provide (possibly) degraded services • Related concept: Metered rate-of-service admission – Service login typically more expensive than steady state – Allow a single or small number of users in when restarting a service after failure 1/17/2008 11
  • 12. Auditing, monitoring, & alerting • Produce perf data, health data & throughput data • All config changes need to be tracked via audit log • Alerting goals: – No customer events without an alert (detect problems) – Alert to event ratio nearing 1 (don’t false alarm) • Alerting is an art … need to tune alerting frequently – Can’t embed in code (too hard to change) – Code produces events, events tracked centrally, alerts produced via queries over event DB • Testing in production requires very reliable monitoring – Combination of detection & capability to roll back allows nimbleness • Tracked events for all interesting issues – Latencies are toughest issues to detect 1/17/2008 12
  • 13. Dependency management • Expect latency & failures in dependent services – Run on cached data or offer degraded services – Test failure & latency frequently in production • Don’t depend upon features not yet shipped – It takes time to work out reliability & scaling issues • Select dependent components & services thoughtfully – On-server components need consistent quality goals – Dependent services should be large granule (“worth” sharing) • Isolate services & decouple components – Contain faults within services – Assume different upgrade rates – Rather than auth on each connect, use session key and refresh every N hours (avoids login storms) 1/17/2008 13
  • 14. Customer & press communications plan • Systems fail & you will experience latency • Communicate through multiple channels – Opt-in RSS, web, IM, email, etc. – If app has client, report details through client • Set ETA expectations & inform • Some events will bring press attention • There is a natural tendency to hide systems issues • Prepare for serious scenarios in advance • Data loss, data corruption, security breach, privacy violation • Prepare communications skeleton plan in advance • Who gets called, communicates with the press, & how data is gathered • Silence typically interpreted as hiding something or lack of control 1/17/2008 14
  • 15. Summary • Reduced operations costs & improved reliability through automation • Full automation dependent upon partitioning & redundancy • Each human administrative interaction is an opportunity for error • Design for failure in all components & test frequently • Rollback & deep monitoring allows safe production testing 1/17/2008 15
  • 16. More Information • Designing & Deploying Internet-Scale Services paper: – http://research.microsoft.com/~JamesRH/TalksAndPapers/JamesRH_Lisa.pdf • Autopilot: Automatic Data Center Operation – http://research.microsoft.com/users/misard/papers/osr2007.pdf • Recovery-Oriented Computing – http://roc.cs.berkeley.edu/ – http://www.cs.berkeley.edu/~pattrsn/talks/HPCAkeynote.ppt – http://www.sciam.com/article.cfm?articleID=000DAA41-3B4E-1EB7- BDC0809EC588EEDF • These slides: – http://research.microsoft.com/~jamesrh • Email: – JamesRH@microsoft.com • External Blog: – http://perspectives.mvdirona.com 1/17/2008 16