SlideShare a Scribd company logo
1 of 16
Download to read offline
Designing and Deploying
 Internet-Scale Services


          James Hamilton
                   2008.01.17
    Architect, Windows Live Platform Services
            JamesRH@microsoft.com
     Web: research.microsoft.com/~jamesrh
        Blog: perspectives.mvdirona.com
Agenda
•   Overview
•   Recovery-Oriented Computing
•   Overall Application Design
•   Operational Issues
•   Summary



Contributors: Search, Mail, Exchange Hosted Services, Live Collaboration Server, Contacts &
Storage, Spaces, Xbox Live, Rackable Systems, Messenger, WinLive Operations, & MS.com Ops
1/17/2008                                                                                     2
Background and biases
• 15 years in database engine development
      – Lead architect on IBM DB2
      – Architect on SQL Server
            • Led variety of core engine teams including SQL client, SQL
              compiler, optimizer, XML, full text search, execution engine,
              protocols, etc.
• Led the Exchange Hosted Services Team
      – Email anti-spam, anti-virus, and archiving for 2.2m seats
        with $27m revenue
      – ~700 servers in 10 data centers world-wide
• Currently architect on Windows Live Platform Services
• Automation & redundancy is only way to:
      – Reduce costs
      – Improve rate of innovation
      – Reduce operational failures and downtime


1/17/2008                                                                     3
Motivation
• System-to-admin ratio indicator of admin costs
      – Tracking total ops costs often gamed
            • Outsourcing halves ops costs without addressing real issues
      – Inefficient properties: <10:1
      – Enterprise: 150:1
      – Best services: over 2,000:1
• 80% of ops issues from design and development
      – Poorly written applications are difficult to automate
• Focus on reducing ops costs during design &
  development

1/17/2008                                                                   4
What does operations do?
                              Site Management                      Architectural
                      Software      Total                        Engineering Total
                    Development      7%                                 8%
                        Total
                         7%
                     Requests Total
                          6%


                                                                      Deployment
                                Overhead Total                        Management
                                     11%                                 Total
                                                                         31%



                                                    Incident
                                                   Management
                         Problem                                                       Source: Deepak Patil, Global
                                                      Total
                     Engineering Total                                               Foundation Services (8/14/2006)
                                                      20%
                           10%



• 51% is deployment & incident management (known resolution)
•   Teams: Messenger, Contacts and Storage & business unit IT services

1/17/2008                                Windows Live Platform Services                                            5
ROC design pattern
• Recover-oriented computing (ROC)
      – Assume software & hardware will fail frequently & unpredictably
• Heavily instrument applications to detect failures

   App
              Bohr Bug      Urgent                                  Bohr bug: Repeatable functional
                             Alert
                                                                    software issue (functional bugs);
      Heisenbug
                                                                    should be rare in production
                                                                    Heisenbug: Software issue that only
  Restart
                                                                    occurs in unusual cross-request
    Failure
               Reboot
                                                                    timing issues or the pattern of long
                                                                    sequences of independent
                 Failure
                           Re-image                                 operations; some found only in
                             Failure
                                                                    production
                                       Replace

                                        Machine out of rotation and power down
                                        Set LCD/LED to quot;needs servicequot;


1/17/2008                                  Windows Live Platform Services                             6
Overall application design
•   Single-box deployment
•   Development and testing in full environment
•   Quick service health check
•   Zero trust of underlying components
•   Pod or cluster independence
•   Implement & test ops tools and utilities
•   Simplicity throughout
•   Partition & version everything

1/17/2008                                         7
Design for auto-mgmt & provisioning
•   Support for geo-distribution
•   Auto-provisioning & auto-installation mandatory
•   Manage quot;service rolequot; rather than servers
•   Multi-system failures are common
      – Limit automation range of action
• Never rely on local, non-replicated persistent state
• Don't worry about clean shutdown
      – Often won't get it & need this path tested
• Explicitly install everything and then verify
• Force fail all services and components regularly

1/17/2008                 Windows Live Platform Services   8
Release cycle & testing
• Ship frequently:
      – Small releases ship more smoothly
      – Increases pace of innovation
      – Long stabilization periods not required in services
• Use production data to find problems (traffic capture)
      – Measurable release criteria
      – Release criteria includes quality and throughput data
• Track all recovered errors to protect against automation-
  supported service entropy
• Test all error paths in integration & in production
• Test in production via incremental deployment & roll-back
      – Never deploy without tested roll-back
      – Continue testing after release

1/17/2008                                                       9
Design for incremental release
• Incrementally release with schema changes?
   – Old code must run against new schema, or
   – Two-phase process (avoid if possible)
            • Update code to support both, commit changes, and then upgrade schema
• Incrementally release with user experience (UX) changes?
   – Separate UX from infrastructure
   – Ensure old UX works with new infrastructure
   – Deploy infrastructure incrementally
   – On success, bring a small beta population onto new UX
   – On continued success, announce new UX and set a date to
      roll out
• Client-side code?
   – Ensure old & new clients both run with new infrastructure
1/17/2008                                                                       10
Graceful degradation & admission control

• No amount of quot;head roomquot; is sufficient
   – Even at 25-50% H/W utilization, spikes will exceed 100%
• Prevent overload through admission control
• Graceful degradation prior to admission control
   – Find less resource-intensive modes to provide (possibly)
     degraded services
• Related concept: Metered rate-of-service admission
   – Service login typically more expensive than steady state
   – Allow a single or small number of users in when restarting
     a service after failure

1/17/2008                                                     11
Auditing, monitoring, & alerting
• Produce perf data, health data & throughput data
• All config changes need to be tracked via audit log
• Alerting goals:
      – No customer events without an alert (detect problems)
      – Alert to event ratio nearing 1 (don’t false alarm)
• Alerting is an art … need to tune alerting frequently
      – Can’t embed in code (too hard to change)
      – Code produces events, events tracked centrally, alerts produced via
        queries over event DB
• Testing in production requires very reliable monitoring
      – Combination of detection & capability to roll back allows nimbleness
• Tracked events for all interesting issues
      – Latencies are toughest issues to detect

1/17/2008                                                                      12
Dependency management
• Expect latency & failures in dependent services
    – Run on cached data or offer degraded services
    – Test failure & latency frequently in production
• Don’t depend upon features not yet shipped
    – It takes time to work out reliability & scaling issues
• Select dependent components & services thoughtfully
    – On-server components need consistent quality goals
    – Dependent services should be large granule (“worth” sharing)
• Isolate services & decouple components
    – Contain faults within services
    – Assume different upgrade rates
    – Rather than auth on each connect, use session key and refresh every N
      hours (avoids login storms)

1/17/2008                                                                13
Customer & press communications plan
• Systems fail & you will experience latency
• Communicate through multiple channels
      – Opt-in RSS, web, IM, email, etc.
      – If app has client, report details through
        client
• Set ETA expectations & inform
• Some events will bring press attention
• There is a natural tendency to hide systems issues
• Prepare for serious scenarios in advance
   • Data loss, data corruption, security breach, privacy violation
• Prepare communications skeleton plan in advance
   • Who gets called, communicates with the press, & how data is gathered
   • Silence typically interpreted as hiding something or lack of control

1/17/2008                                                              14
Summary
• Reduced operations costs & improved reliability
  through automation
• Full automation dependent upon partitioning &
  redundancy
• Each human administrative interaction is an
  opportunity for error
• Design for failure in all components & test
  frequently
• Rollback & deep monitoring allows safe
  production testing

1/17/2008                                           15
More Information
• Designing & Deploying Internet-Scale Services paper:
      – http://research.microsoft.com/~JamesRH/TalksAndPapers/JamesRH_Lisa.pdf
• Autopilot: Automatic Data Center Operation
      – http://research.microsoft.com/users/misard/papers/osr2007.pdf
• Recovery-Oriented Computing
      – http://roc.cs.berkeley.edu/
      – http://www.cs.berkeley.edu/~pattrsn/talks/HPCAkeynote.ppt
      – http://www.sciam.com/article.cfm?articleID=000DAA41-3B4E-1EB7-
        BDC0809EC588EEDF
• These slides:
      – http://research.microsoft.com/~jamesrh
• Email:
      – JamesRH@microsoft.com
• External Blog:
      – http://perspectives.mvdirona.com
1/17/2008                                                                        16

More Related Content

What's hot

What's hot (20)

Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
Upgrading Made Easy: Moving to InfluxDB 2.x or InfluxDB Cloud with Cribl LogS...
 
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
“10 Things You Must Know Before Designing Your Own Camera,” a Presentation fr...
 
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor A...
 
AGILE Open Call #1 Pitch
AGILE Open Call #1 PitchAGILE Open Call #1 Pitch
AGILE Open Call #1 Pitch
 
Studio 5000® Application Code Manager: Introduction and Demonstration
Studio 5000® Application Code Manager: Introduction and DemonstrationStudio 5000® Application Code Manager: Introduction and Demonstration
Studio 5000® Application Code Manager: Introduction and Demonstration
 
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
RA TechED 2019 - IN03 - Develop Analytics That Scale Using FactoryTalk Innova...
 
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol SupportCloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
Cloud Foundry Summit 2015: Cloud Foundry and IoT Protocol Support
 
CDK - The next big thing - Quang Phuong
CDK - The next big thing - Quang PhuongCDK - The next big thing - Quang Phuong
CDK - The next big thing - Quang Phuong
 
FactoryTalk® AssetCentre: Overview
FactoryTalk® AssetCentre: OverviewFactoryTalk® AssetCentre: Overview
FactoryTalk® AssetCentre: Overview
 
IoT and Cloud services interactions
IoT and Cloud services interactionsIoT and Cloud services interactions
IoT and Cloud services interactions
 
Smart Devices: Helping Design, Operate and Maintain The Connected Enterprise
Smart Devices: Helping Design, Operate and Maintain The Connected EnterpriseSmart Devices: Helping Design, Operate and Maintain The Connected Enterprise
Smart Devices: Helping Design, Operate and Maintain The Connected Enterprise
 
Robert Murphy Driving Value from Smart Manufacturing
Robert Murphy Driving Value from Smart ManufacturingRobert Murphy Driving Value from Smart Manufacturing
Robert Murphy Driving Value from Smart Manufacturing
 
S1: Predix ISV Partner Program (Predix Transform 2016)
S1: Predix ISV Partner Program (Predix Transform 2016)S1: Predix ISV Partner Program (Predix Transform 2016)
S1: Predix ISV Partner Program (Predix Transform 2016)
 
Project Design Considerations for Integration
Project Design Considerations for IntegrationProject Design Considerations for Integration
Project Design Considerations for Integration
 
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
Connectivity is here (5 g, swarm,...). now, let's build interplanetary apps! (1)
 
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured ObjectsRA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
RA TechED 2019 - SY07- Next-Gen Device Library of Preconfigured Objects
 
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
RA TechED 2019 - IN02 - Empower Your Connected Enterprise with FactoryTalk In...
 
Performance and Analytics Cloud for Machine Builders
Performance and Analytics Cloud for Machine BuildersPerformance and Analytics Cloud for Machine Builders
Performance and Analytics Cloud for Machine Builders
 
Migration Tools to Convert Your Legacy DCS
Migration Tools to Convert Your Legacy DCSMigration Tools to Convert Your Legacy DCS
Migration Tools to Convert Your Legacy DCS
 
Smart Devices - Design ,Operate and Maintain
Smart Devices - Design ,Operate and MaintainSmart Devices - Design ,Operate and Maintain
Smart Devices - Design ,Operate and Maintain
 

Viewers also liked

Workshop C Impact Of Growth Agenda
Workshop C   Impact Of Growth AgendaWorkshop C   Impact Of Growth Agenda
Workshop C Impact Of Growth Agenda
Stuart Reid
 
Creativity Day Milano 27 Febbraio Milano
Creativity Day Milano 27 Febbraio MilanoCreativity Day Milano 27 Febbraio Milano
Creativity Day Milano 27 Febbraio Milano
roberto.design
 
Aria au pays du Web
Aria au pays du WebAria au pays du Web
Aria au pays du Web
JPVillain
 
2011 04-19 html5 css3 及脚本
2011 04-19 html5 css3 及脚本2011 04-19 html5 css3 及脚本
2011 04-19 html5 css3 及脚本
tbmallf2e
 

Viewers also liked (17)

A HistóRia De Feio
A HistóRia De FeioA HistóRia De Feio
A HistóRia De Feio
 
Workshop C Impact Of Growth Agenda
Workshop C   Impact Of Growth AgendaWorkshop C   Impact Of Growth Agenda
Workshop C Impact Of Growth Agenda
 
Aeropuerto
AeropuertoAeropuerto
Aeropuerto
 
Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...
Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...
Uma casa atlântica, Berlim em Lisboa ou os problemas do velho direito de auto...
 
PresentacióN1
PresentacióN1PresentacióN1
PresentacióN1
 
ApresentaçãO2tic
ApresentaçãO2ticApresentaçãO2tic
ApresentaçãO2tic
 
Opera bug
Opera bugOpera bug
Opera bug
 
Creativity Day Milano 27 Febbraio Milano
Creativity Day Milano 27 Febbraio MilanoCreativity Day Milano 27 Febbraio Milano
Creativity Day Milano 27 Febbraio Milano
 
adirealoral
adirealoraladirealoral
adirealoral
 
Paper.li work
Paper.li workPaper.li work
Paper.li work
 
点道
点道点道
点道
 
Aria au pays du Web
Aria au pays du WebAria au pays du Web
Aria au pays du Web
 
Liberty
LibertyLiberty
Liberty
 
2011 04-19 html5 css3 及脚本
2011 04-19 html5 css3 及脚本2011 04-19 html5 css3 及脚本
2011 04-19 html5 css3 及脚本
 
microtest12
microtest12microtest12
microtest12
 
SowProg, présentation lors du Mama 2012
SowProg, présentation lors du Mama 2012SowProg, présentation lors du Mama 2012
SowProg, présentation lors du Mama 2012
 
Sir Richard Mac Cormac, Density
Sir Richard Mac Cormac, DensitySir Richard Mac Cormac, Density
Sir Richard Mac Cormac, Density
 

Similar to Designing and Deploying Internet-Scale Services

Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...
nick_hans
 
Slow Cool 20081009 Final
Slow Cool 20081009 FinalSlow Cool 20081009 Final
Slow Cool 20081009 Final
rajivmordani
 

Similar to Designing and Deploying Internet-Scale Services (20)

Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...Maintenance: the cost for software vendors and what it means to your organiza...
Maintenance: the cost for software vendors and what it means to your organiza...
 
Netcetera Proactive Management Service
Netcetera Proactive Management ServiceNetcetera Proactive Management Service
Netcetera Proactive Management Service
 
Agile Maintenance
Agile MaintenanceAgile Maintenance
Agile Maintenance
 
Joe Honan Virtualization Trends
Joe Honan   Virtualization TrendsJoe Honan   Virtualization Trends
Joe Honan Virtualization Trends
 
Slow Cool 20081009 Final
Slow Cool 20081009 FinalSlow Cool 20081009 Final
Slow Cool 20081009 Final
 
Netpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APMNetpod - The Merging of NPM & APM
Netpod - The Merging of NPM & APM
 
Exposing and Fixing Common App Performance Problems
Exposing and Fixing Common App Performance ProblemsExposing and Fixing Common App Performance Problems
Exposing and Fixing Common App Performance Problems
 
E Shared Services Effort
E Shared Services EffortE Shared Services Effort
E Shared Services Effort
 
E g innovations overview
E g innovations overviewE g innovations overview
E g innovations overview
 
eG Enterprise Citrix XenDesktop Monitor Product Tour
eG Enterprise Citrix XenDesktop Monitor Product ToureG Enterprise Citrix XenDesktop Monitor Product Tour
eG Enterprise Citrix XenDesktop Monitor Product Tour
 
Monitoring Cloud/Virtual/Physical IT Infrastructures
Monitoring Cloud/Virtual/Physical IT InfrastructuresMonitoring Cloud/Virtual/Physical IT Infrastructures
Monitoring Cloud/Virtual/Physical IT Infrastructures
 
Managing and Monitoring Virtual/Cloud/Physical Infrastructures
Managing and Monitoring Virtual/Cloud/Physical InfrastructuresManaging and Monitoring Virtual/Cloud/Physical Infrastructures
Managing and Monitoring Virtual/Cloud/Physical Infrastructures
 
Appdynamics Training Session
Appdynamics Training SessionAppdynamics Training Session
Appdynamics Training Session
 
XS 2008 Boston Capacity Planning
XS 2008 Boston Capacity PlanningXS 2008 Boston Capacity Planning
XS 2008 Boston Capacity Planning
 
Building Composite Applications in Lotus Notes
Building Composite Applications in Lotus NotesBuilding Composite Applications in Lotus Notes
Building Composite Applications in Lotus Notes
 
Movie Rating Site [for presentation]
Movie Rating Site [for presentation]Movie Rating Site [for presentation]
Movie Rating Site [for presentation]
 
Unified Process
Unified ProcessUnified Process
Unified Process
 
Seminar - JBoss Migration
Seminar - JBoss MigrationSeminar - JBoss Migration
Seminar - JBoss Migration
 
Top 5 .NET Challenges, Performance Monitoring Tips & Tricks
Top 5 .NET Challenges, Performance Monitoring Tips & TricksTop 5 .NET Challenges, Performance Monitoring Tips & Tricks
Top 5 .NET Challenges, Performance Monitoring Tips & Tricks
 
Web 2.0 Expo Notes
Web 2.0 Expo NotesWeb 2.0 Expo Notes
Web 2.0 Expo Notes
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 

Designing and Deploying Internet-Scale Services

  • 1. Designing and Deploying Internet-Scale Services James Hamilton 2008.01.17 Architect, Windows Live Platform Services JamesRH@microsoft.com Web: research.microsoft.com/~jamesrh Blog: perspectives.mvdirona.com
  • 2. Agenda • Overview • Recovery-Oriented Computing • Overall Application Design • Operational Issues • Summary Contributors: Search, Mail, Exchange Hosted Services, Live Collaboration Server, Contacts & Storage, Spaces, Xbox Live, Rackable Systems, Messenger, WinLive Operations, & MS.com Ops 1/17/2008 2
  • 3. Background and biases • 15 years in database engine development – Lead architect on IBM DB2 – Architect on SQL Server • Led variety of core engine teams including SQL client, SQL compiler, optimizer, XML, full text search, execution engine, protocols, etc. • Led the Exchange Hosted Services Team – Email anti-spam, anti-virus, and archiving for 2.2m seats with $27m revenue – ~700 servers in 10 data centers world-wide • Currently architect on Windows Live Platform Services • Automation & redundancy is only way to: – Reduce costs – Improve rate of innovation – Reduce operational failures and downtime 1/17/2008 3
  • 4. Motivation • System-to-admin ratio indicator of admin costs – Tracking total ops costs often gamed • Outsourcing halves ops costs without addressing real issues – Inefficient properties: <10:1 – Enterprise: 150:1 – Best services: over 2,000:1 • 80% of ops issues from design and development – Poorly written applications are difficult to automate • Focus on reducing ops costs during design & development 1/17/2008 4
  • 5. What does operations do? Site Management Architectural Software Total Engineering Total Development 7% 8% Total 7% Requests Total 6% Deployment Overhead Total Management 11% Total 31% Incident Management Problem Source: Deepak Patil, Global Total Engineering Total Foundation Services (8/14/2006) 20% 10% • 51% is deployment & incident management (known resolution) • Teams: Messenger, Contacts and Storage & business unit IT services 1/17/2008 Windows Live Platform Services 5
  • 6. ROC design pattern • Recover-oriented computing (ROC) – Assume software & hardware will fail frequently & unpredictably • Heavily instrument applications to detect failures App Bohr Bug Urgent Bohr bug: Repeatable functional Alert software issue (functional bugs); Heisenbug should be rare in production Heisenbug: Software issue that only Restart occurs in unusual cross-request Failure Reboot timing issues or the pattern of long sequences of independent Failure Re-image operations; some found only in Failure production Replace Machine out of rotation and power down Set LCD/LED to quot;needs servicequot; 1/17/2008 Windows Live Platform Services 6
  • 7. Overall application design • Single-box deployment • Development and testing in full environment • Quick service health check • Zero trust of underlying components • Pod or cluster independence • Implement & test ops tools and utilities • Simplicity throughout • Partition & version everything 1/17/2008 7
  • 8. Design for auto-mgmt & provisioning • Support for geo-distribution • Auto-provisioning & auto-installation mandatory • Manage quot;service rolequot; rather than servers • Multi-system failures are common – Limit automation range of action • Never rely on local, non-replicated persistent state • Don't worry about clean shutdown – Often won't get it & need this path tested • Explicitly install everything and then verify • Force fail all services and components regularly 1/17/2008 Windows Live Platform Services 8
  • 9. Release cycle & testing • Ship frequently: – Small releases ship more smoothly – Increases pace of innovation – Long stabilization periods not required in services • Use production data to find problems (traffic capture) – Measurable release criteria – Release criteria includes quality and throughput data • Track all recovered errors to protect against automation- supported service entropy • Test all error paths in integration & in production • Test in production via incremental deployment & roll-back – Never deploy without tested roll-back – Continue testing after release 1/17/2008 9
  • 10. Design for incremental release • Incrementally release with schema changes? – Old code must run against new schema, or – Two-phase process (avoid if possible) • Update code to support both, commit changes, and then upgrade schema • Incrementally release with user experience (UX) changes? – Separate UX from infrastructure – Ensure old UX works with new infrastructure – Deploy infrastructure incrementally – On success, bring a small beta population onto new UX – On continued success, announce new UX and set a date to roll out • Client-side code? – Ensure old & new clients both run with new infrastructure 1/17/2008 10
  • 11. Graceful degradation & admission control • No amount of quot;head roomquot; is sufficient – Even at 25-50% H/W utilization, spikes will exceed 100% • Prevent overload through admission control • Graceful degradation prior to admission control – Find less resource-intensive modes to provide (possibly) degraded services • Related concept: Metered rate-of-service admission – Service login typically more expensive than steady state – Allow a single or small number of users in when restarting a service after failure 1/17/2008 11
  • 12. Auditing, monitoring, & alerting • Produce perf data, health data & throughput data • All config changes need to be tracked via audit log • Alerting goals: – No customer events without an alert (detect problems) – Alert to event ratio nearing 1 (don’t false alarm) • Alerting is an art … need to tune alerting frequently – Can’t embed in code (too hard to change) – Code produces events, events tracked centrally, alerts produced via queries over event DB • Testing in production requires very reliable monitoring – Combination of detection & capability to roll back allows nimbleness • Tracked events for all interesting issues – Latencies are toughest issues to detect 1/17/2008 12
  • 13. Dependency management • Expect latency & failures in dependent services – Run on cached data or offer degraded services – Test failure & latency frequently in production • Don’t depend upon features not yet shipped – It takes time to work out reliability & scaling issues • Select dependent components & services thoughtfully – On-server components need consistent quality goals – Dependent services should be large granule (“worth” sharing) • Isolate services & decouple components – Contain faults within services – Assume different upgrade rates – Rather than auth on each connect, use session key and refresh every N hours (avoids login storms) 1/17/2008 13
  • 14. Customer & press communications plan • Systems fail & you will experience latency • Communicate through multiple channels – Opt-in RSS, web, IM, email, etc. – If app has client, report details through client • Set ETA expectations & inform • Some events will bring press attention • There is a natural tendency to hide systems issues • Prepare for serious scenarios in advance • Data loss, data corruption, security breach, privacy violation • Prepare communications skeleton plan in advance • Who gets called, communicates with the press, & how data is gathered • Silence typically interpreted as hiding something or lack of control 1/17/2008 14
  • 15. Summary • Reduced operations costs & improved reliability through automation • Full automation dependent upon partitioning & redundancy • Each human administrative interaction is an opportunity for error • Design for failure in all components & test frequently • Rollback & deep monitoring allows safe production testing 1/17/2008 15
  • 16. More Information • Designing & Deploying Internet-Scale Services paper: – http://research.microsoft.com/~JamesRH/TalksAndPapers/JamesRH_Lisa.pdf • Autopilot: Automatic Data Center Operation – http://research.microsoft.com/users/misard/papers/osr2007.pdf • Recovery-Oriented Computing – http://roc.cs.berkeley.edu/ – http://www.cs.berkeley.edu/~pattrsn/talks/HPCAkeynote.ppt – http://www.sciam.com/article.cfm?articleID=000DAA41-3B4E-1EB7- BDC0809EC588EEDF • These slides: – http://research.microsoft.com/~jamesrh • Email: – JamesRH@microsoft.com • External Blog: – http://perspectives.mvdirona.com 1/17/2008 16