An introduction to designing
reliable cloud services

January 2014

Contents
Overview
Cloud service reliability versus resiliency
Recovery-oriented computing
Planning for failure
Designing for and responding to failure
Summary
Additional resources
Authors and contributors

Trustworthy Computing | An introduction to designing reliable cloud services

Overview
This paper describes reliability concepts and a reliability design-time process for organizations
that create, deploy, and/or consume cloud services. It explains fundamental concepts of
reliability and can help decision makers understand the factors and processes that make cloud
services more reliable. In addition, it provides architects, developers, and operations personnel
with insights into how to work together to increase the reliability of the services they design,
implement, and support.
Cloud service providers and their customers have varying degrees of responsibility, depending
on the cloud service model. The following figure shows the spectrum of responsibilities for both
providers and customers. For infrastructure as a service (IaaS) offerings, such as virtual
machines, the provider and the customer share responsibilities. Although the customer is
responsible for ensuring that the solutions they build on the offering run in a reliable manner,
the provider is still ultimately responsible for the reliability of the infrastructure (core compute,
network, and storage) components. When customers purchase software as a service (SaaS)
offerings, such as Microsoft Office 365, cloud providers hold primary responsibility for ensuring
the reliability of the service. Platform as a service (PaaS) offerings, such as Windows Azure,
occupy the middle of this responsibility spectrum, with providers being responsible for the
infrastructure and fabric controller layers. If a customer purchases an infrastructure as a service
(IaaS) offering, such as Windows Azure, the cloud provider is largely responsible for host
management and physical security of the facility itself.

Figure 1. Cloud customer and cloud provider responsibilities

With the emergence of cloud computing and online services, customers expect services to be
available whenever they need them—just like electricity or dial tone. This expectation requires
organizations that build and support cloud services to plan for probable failures and have
mechanisms that allow rapid recovery from such failures. Cloud services are complex and have
many dependencies, so it is important that all members of a service provider’s organization
understand their role in making the service they provide as reliable as possible.
This paper includes the following sections:
• Cloud service reliability versus resiliency
• Recovery-oriented computing
• Planning for failure
• Designing for and responding to failure
Although it is outside the scope of this paper, it is also important to understand that there are
cost tradeoffs associated with some reliability strategies that need consideration in order to
implement a service with sufficient reliability at optimal cost. Considerations could include
determining what features to include in the service and prioritizing the degree of reliability
associated with each feature.

Cloud service reliability versus resiliency
The Institute of Electrical and Electronics Engineers (IEEE) Reliability Society states that reliability
[engineering] is “a design engineering discipline which applies scientific knowledge to assure
that a system will perform its intended function for the required duration within a given
environment, including the ability to test and support the system through its total lifecycle.”1 For
software, it defines reliability as “the probability of failure-free software operation for a specified
period of time in a specified environment.”2
If one assumes that all cloud service providers strive to deliver a reliable experience for their
customers, it is important to fully understand what comprises a reliable cloud service. In essence,
a reliable cloud service is one that functions as the designer intended, when the customer
expects it to function, and wherever the connected customer is located. However, not every
component needs to operate flawlessly 100 percent of the time.
Cloud service providers strive for reliability. Resiliency is the ability of a cloud-based service to
withstand certain types of failure and yet remain functional from the customers’ perspective. A
service could be characterized as reliable simply because no part of the service has ever failed,
and yet the service might not be considered resilient because it has not been tested.
A resilient service is one that is designed and built so that potential failures have minimal effect
on the service’s availability and functionality. In addition, despite constant and persistent
reliability threats, resilient services should remain fully functional and allow customers to
perform the tasks necessary to complete their work.
Services should be designed to:
• Minimize the impact a failure has on any given customer. For example, the service
should degrade gracefully, which means that non-critical components of the service may fail
but critical functions still work.
• Minimize the number of customers affected by a failure. For example, the service should
be designed so that faults can be isolated to a subset of customers.
• Reduce the number of minutes that a customer (or customers) cannot use the service
in its entirety. For example, the service should be able to transfer customer requests from
one data center to another if a major failure occurs.
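The second goal above, isolating faults to a subset of customers, can be sketched as follows. This is a minimal illustration rather than a prescribed design: the function names, the partition count, and the choice of a SHA-256 hash for stable assignment are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, List


def partition_for(customer_id: str, partition_count: int) -> int:
    """Map a customer to a partition using a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def affected_customers(
    customers: Iterable[str], failed_partition: int, partition_count: int
) -> List[str]:
    """Return only the customers whose partition has failed."""
    return [
        c for c in customers
        if partition_for(c, partition_count) == failed_partition
    ]
```

Because each customer maps to exactly one partition, a fault confined to one partition leaves customers in the other partitions unaffected.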

1 IEEE Reliability Society, at http://rs.ieee.org
2 Ibid.

Recovery-oriented computing
Traditional computing systems have been designed to avoid failure. Cloud-based systems, however,
face inherent reliability challenges because of their scale and complexity. The recovery-oriented
computing (ROC) approach can help organizations frame software failure in a way that makes it
easier to design cloud services to respond to elements that are under their direct control as well
as to elements that are not.
The following three basic assumptions are associated with recovery-oriented computing:
• Devices and hardware will fail
• People make mistakes
• Software contains imperfections
Organizations that create cloud services must design them to mitigate the effects of these
failure sources as much as possible to provide reliability for their customers.
ROC research areas
ROC defines six research areas3 that can be adapted to cloud service design and implementation
recommendations. These research areas can help mitigate potential issues that are rooted in the
three basic assumptions listed above, and are explained in the following list:
• Fault zones. Organizations should partition cloud services into fault zones so failures can be
contained, which enables rapid recovery. Isolation and loose coupling of dependencies are
essential elements that contribute to fault containment and recovery capabilities. Fault
isolation mechanisms should apply to a wide range of failure scenarios, including software
imperfections and human-induced failures.
• Defense-in-depth. Organizations should use a defense-in-depth approach, which helps
ensure that a failure is contained if the first layer of protection does not isolate it. In other
words, organizations should not rely on a single protective measure but instead factor
multiple protective measures into their service design.
• Redundancy. Organizations should build redundancy into their systems to survive faults.
Redundancy enables isolation so that organizations can ensure the service continues to run,
perhaps in a degraded state, when a fault occurs and the system is in the process of being
recovered. Organizations should design fail-fast components that enable redundant systems
to detect failure quickly and isolate it during recovery.
• Diagnostic aids. Organizations should use diagnostic aids for root cause analysis of failures.
These aids must be suitable for use in non-production and production environments, and
should be able to rapidly detect the presence of failures and identify their root causes using
automated techniques.
3 Recovery-Oriented Computing Overview, at http://roc.cs.berkeley.edu/roc_overview.html

• Automated rollback. Organizations should create systems that provide automated rollback
for most aspects of operations, from system configuration to application management to
hardware and software upgrades. This functionality does not prevent human error but can
help mitigate the impact of mistakes and make services more dependable.
• Recovery process drills. Organizations should conduct recovery process drills routinely to
test repair mechanisms, both during development and while in production mode. Testing
helps ensure that the repair mechanisms work as expected and do not compound failures in a
production environment.
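One common way to get the fail-fast behavior described under redundancy is a circuit breaker. The source does not name this pattern, so the sketch below is an assumption about how it might be done; the class name, thresholds, and injectable clock are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast after repeated failures instead of waiting on a broken dependency."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                # Circuit is open: reject immediately so callers can fall back.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A caller that receives the fast rejection can switch to a redundant component or return a degraded result while the failed dependency recovers.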
Using the ROC approach can help an organization shift from strictly focusing on preventing
failures to also focusing on reducing the amount of time it takes to recover from a failure. In
other words, some degree of failure is inevitable, so
it is important to have recovery strategies in place. Two terms can frame the shift in thinking that
is required to create more reliable cloud services: mean time to failure (MTTF) and mean time to
recover (MTTR).
MTTF measures how long software and hardware run, on average, before they fail, and the goal is
to make the time between failures as long as possible. It is a necessary measure and works well
for packaged software, because software publishers are able to specify the computing environment in which the
software will optimally perform. However, focusing on MTTF by itself is insufficient for cloud
services, because portions of the computing environment are out of the direct control of the
provider and thus more unpredictable. It is important that cloud services are designed in such a
way that they can rapidly recover.
MTTR is the amount of time it takes to get a service running again after a failure. Shrinking
MTTR requires design and development practices that promote quicker detection and
subsequent recovery, and it also requires well-trained operations teams that are capable of
bringing components of the service back online as quickly as possible; an even better approach
would be for the system to automatically recover without human intervention. Organizations
should design cloud services so that they do not stop working, even when some components
fail; such an approach allows the service to degrade gracefully while still allowing users to
accomplish their work.
Embracing the ROC approach allows organizations to design services in ways that reduce MTTR
as much as possible and increase MTTF as much as possible.
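The relationship between the two measures can be made concrete with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The paper does not state this formula, so the sketch below is offered as supporting arithmetic; the function names and the sample figures are illustrative assumptions.

```python
HOURS_PER_YEAR = 8760.0


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected yearly downtime implied by the availability figure."""
    return (1.0 - steady_state_availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Illustrative comparison: shrinking MTTR tenfold yields the same availability
# as stretching MTTF tenfold.
baseline = steady_state_availability(1000.0, 1.0)
faster_recovery = steady_state_availability(1000.0, 0.1)
rarer_failure = steady_state_availability(10000.0, 1.0)
```

Under this formula, cutting recovery time is as effective as making failures proportionally rarer, which is why the ROC approach treats MTTR as a design target in its own right.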

Planning for failure
To help reduce MTTR, organizations need to design ways for their services to continue operating
when known failure conditions occur. For example, what should the service do when another
cloud service that it depends on is not available? What should the service do when it cannot
connect to its primary database? What hardware redundancies are required and where should
they be located? Can the service detect and respond gracefully to incorrect configuration
settings, allowing rollback of the system to a “last known good” state? At what point is rollback
of a given change no longer possible, necessitating a “patch and roll forward” mitigation
strategy instead?
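The "last known good" question above can be illustrated with a small sketch. It assumes a configuration is a simple dictionary and that the service exposes some health check; `ConfigManager`, `apply`, and `health_check` are hypothetical names, not part of any real product.

```python
from typing import Callable, Dict

Config = Dict[str, str]


class ConfigManager:
    """Track the active configuration and the last one that passed a health check."""

    def __init__(self, initial: Config) -> None:
        self.active: Config = dict(initial)
        self.last_known_good: Config = dict(initial)

    def apply(self, candidate: Config, health_check: Callable[[Config], bool]) -> bool:
        """Activate the candidate; roll back automatically if the check fails."""
        self.active = dict(candidate)
        if health_check(self.active):
            # The new settings work, so they become the rollback target.
            self.last_known_good = dict(self.active)
            return True
        # Incorrect settings detected: restore the last known good state.
        self.active = dict(self.last_known_good)
        return False
```

A sketch like this only covers changes that can be undone; a change that alters stored data irreversibly would still need the "patch and roll forward" strategy.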
Organizations that create cloud services should consider the three primary causes of failure
shown in the following figure:
Figure 2. Causes of failure
• Device and infrastructure failure. These failures range from expected, end-of-life failures to
catastrophic failures caused by natural disaster or accidents that are out of an organization’s
control.
• Human error. Administrator and configuration mistakes that are often out of an organization’s
control.
• Software imperfections. Code imperfections and software-related issues in the deployed online
service. Pre-release testing can control this to some degree.

Core design principles for reliable services
Organizations must address the following three essential reliability design principles when they
create specifications for a cloud service. These principles help to mitigate the effect of failures
when they occur:
• Design for resilience. The service must withstand component-level failures without requiring
human intervention. A service should be able to detect failures and automatically take
corrective measures so that users do not experience service interruptions. When failure does
occur, the service should degrade gracefully and provide partial functionality instead of going
completely offline. For example, a service should use fail-fast components and indicate
appropriate exceptions so that the system can automatically detect and resolve the issue.
There are also automated techniques that architects can include to predict service failure and
notify the organization about service degradation or failure.
• Design for data integrity. The service must capture, manipulate, store, or discard data in a
manner that is consistent with its intended operation. A service needs to preserve the integrity
of the information that customers have entrusted to it. For example, organizations should
replicate customer data stores to prevent hardware failures from causing data loss, and
adequately secure data stores to prevent unauthorized access.
• Design for recoverability. When the unforeseen happens, the service must be recoverable.
As much as possible, a service or its components should recover quickly and automatically.
Teams should be able to restore a service quickly and completely if an interruption occurs. For
example, services should be designed for component redundancy and data failover so that
when failure is detected in a component, a group of servers, or an entire physical location or
data center, another component, group of servers, or physical location automatically takes
over to keep the service running.
When designing cloud services, organizations should adapt these essential principles as
minimum requirements for handling potential failures.
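The data failover idea in the recoverability principle can be sketched as a read that tries replicas in order. This is a simplified illustration under stated assumptions: each replica is represented by a hypothetical name and fetch function, and only connection failures trigger failover.

```python
from typing import Callable, List, Sequence, Tuple

Replica = Tuple[str, Callable[[str], str]]


def read_with_failover(key: str, replicas: Sequence[Replica]) -> Tuple[str, str]:
    """Return (replica_name, value) from the first replica that responds."""
    errors: List[str] = []
    for name, fetch in replicas:
        try:
            return name, fetch(key)
        except ConnectionError as exc:
            # Record the failure and move on to the next replica automatically.
            errors.append(f"{name}: {exc}")
    raise ConnectionError("all replicas failed: " + "; ".join(errors))
```

Returning the replica name alongside the value lets monitoring record how often the service is running on a secondary location.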

Designing for and responding to failure
To build reliable cloud services, organizations should design for failure—that is, specify how a
service will respond gracefully when it encounters a failure condition. The process that is
illustrated in the following figure is intended for organizations that create SaaS solutions, and is
designed to help them identify and mitigate possible failures. However, organizations that
purchase cloud services can also use this process to develop an understanding of how the
services function and help them formulate questions to ask before entering into a service
agreement with a cloud provider.
Figure 3. An overview of the design process
Stages shown in the figure: Create initial service design; Failure mode & effects analysis; Design
coping strategies; Use fault injection; Capture unexpected faults; Monitor the live site.

Designing a service for reliability and implementing recovery mechanisms based on
recovery-oriented principles is an iterative process. Design iterations are fluid and take into account both
information garnered from pre-release testing and data about how the service performs after it
is deployed.
Failure mode and effects analysis
Failure mode and effects analysis (FMEA) is a key step in the design process for any online
service. Identifying the important interaction points and dependencies of a service enables the
engineering team to pinpoint changes that are required to ensure the service can be monitored
effectively for rapid detection of issues. This approach enables the engineering team to develop
ways for the service to withstand, or mitigate, faults. FMEA also helps the engineering teams
identify suitable test cases to validate whether the service is able to cope with faults, in test
environments as well as in production (otherwise known as fault injection).
As part of FMEA, organizations should create a component inventory of all components that the
service uses, whether they are user interface (UI) components hosted on a web server, a
database hosted in a remote data center, or an external service that the service depends on. The
team can then capture possible faults in a spreadsheet or other document and incorporate
relevant information into design specifications.
In addition to considering whether they have the capacity to detect failures, analyze their
root causes, and recover the service, an online service design team should consider questions
such as the following:
• What external services will the service be dependent on?
• What data sources will the service be dependent on?
• What configuration settings will the service require to operate properly?
• What hardware dependencies does the service have?
• What are the relevant customer scenarios that should be modeled?
To fully analyze how the service will use its components, the team can create a matrix that
captures which components are accessed for each customer scenario. For example, an online
video service might contain scenarios for logging in, for browsing an inventory of available
videos, selecting a video and viewing it, and then rating the video after viewing. Although these
scenarios share common information and components, each is a separate customer usage
scenario, and each accesses components that are independent from the other scenarios. The
matrix should identify each of these usage scenarios and contain a list of all required
components for each scenario.
Using a matrix also allows the service design team to create a map of possible failure points at
each component interaction point, and define a fault-handling mechanism for each.
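The scenario-to-component matrix for the online video example might be captured as follows. The component names are invented for illustration; the paper names the scenarios but not the components behind them.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical matrix: customer scenario -> components it depends on.
SCENARIO_COMPONENTS: Dict[str, Set[str]] = {
    "log in": {"web front end", "identity service"},
    "browse catalog": {"web front end", "catalog database"},
    "play video": {"web front end", "catalog database", "streaming service"},
    "rate video": {"web front end", "ratings service"},
}


def scenarios_affected_by(component: str, matrix: Dict[str, Set[str]]) -> List[str]:
    """List the scenarios that would be affected if the component failed."""
    return sorted(s for s, parts in matrix.items() if component in parts)


def failure_points(matrix: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Enumerate every (scenario, component) interaction needing a fault handler."""
    return sorted((s, c) for s, parts in matrix.items() for c in parts)
```

Reading the matrix by component rather than by scenario shows which failures have the widest reach, which can help prioritize where coping strategies are defined first.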
For more information on how Microsoft implements failure mode and effects analysis, read the
‘Resilience by design for cloud services’ whitepaper.
Designing and implementing coping strategies
Fault-handling mechanisms are also called coping strategies. In the design stage, architects
define what the coping strategies will be so that the software will do something reasonable
when a failure occurs. They should also define the types of instrumentation that engineers
should include in the service specification to enable monitors that can detect when a particular
type of failure occurs.
Designing coping strategies to do something reasonable depends on the functionality that the
service provides and the type of failure the coping strategy addresses. The key is to ensure that
when a component fails, it fails quickly and, if required, the service switches to a redundant
component. In other words, the service degrades gracefully but does not fail completely.
For example, the architects of a car purchasing service design their application to include ratings
for specific makes and models of cars. They design the purchasing service with a
dependency on another service that provides comparative ratings of the models. If the rating
service fails or is unavailable, the coping strategy might mean the purchasing service displays a
list of models without the associated ratings rather than not displaying a list at all. In other
words, when a particular failure happens the service should produce a reasonable result,
regardless of the failure. The result may not be optimal, but it should be reasonable from the
customer’s perspective. For a car purchasing service, it is reasonable to still produce a list of
models with standard features, optional features, and pricing without any rating data instead of
an error message or a blank page; the information is not optimal, but it might be useful to the
customer. It is best to think in terms of “reasonable, but not necessarily optimal” when deciding
what the response to a failure condition should be.
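The car purchasing example can be sketched as a coping strategy in code. The function and field names are hypothetical, and the sketch assumes the rating service signals unavailability with a connection or timeout error.

```python
from typing import Callable, Dict, List, Optional


def list_models(
    fetch_models: Callable[[], List[Dict[str, object]]],
    fetch_ratings: Callable[[], Dict[str, float]],
) -> List[Dict[str, object]]:
    """Return the model list, attaching ratings only if the rating service responds."""
    models = fetch_models()
    try:
        ratings = fetch_ratings()
    except (ConnectionError, TimeoutError):
        # Reasonable, but not necessarily optimal: show models without ratings.
        ratings = {}
    result = []
    for model in models:
        rating: Optional[float] = ratings.get(str(model["name"]))
        result.append({**model, "rating": rating})
    return result
```

The page still renders a useful list when ratings are missing, and the `None` rating gives the user interface a clear signal to hide the rating column rather than show an error.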
When designing and implementing instrumentation, it is important to monitor at the
component level as well as from the user’s perspective. This approach can allow the service team
to identify a trend in component-level performance before it becomes an incident that affects
users. The data that this kind of monitoring can produce enables organizations to gain insight
into how to improve the service’s reliability for later releases.
Monitoring the live site
Accurate monitoring information can be used by teams to improve services in several ways. For
example, it can provide teams with information to troubleshoot known problems or potential
problems in a service. It can also provide organizations with insights into how their services
perform when handling live workloads. It can also be fed directly into service-alerting
mechanisms to reduce the time to detect problems and therefore reduce MTTR.
Simulated workloads in test environments rarely capture the range of possible failures and faults
that live site workloads generate. Organizations can identify trends before they become failures
by carefully analyzing live site telemetry data and establishing thresholds, both upper and lower
ranges, that represent normal operating conditions. If the telemetry being collected in near real
time approaches either the upper or the lower threshold, an alarm can be triggered that
prompts the operations team to immediately triage the service and potentially prevent a failure.
They can also analyze failure and fault data that instrumentation and monitoring tools capture in
the production environment to better understand how the service operates and to determine
what monitoring improvements and new coping strategies they require.
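The upper and lower thresholds described above can be sketched as a simple classifier for a telemetry reading. The band limits, the `margin` parameter that defines "approaching", and the three status labels are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Normal operating range for one metric (hypothetical structure)."""
    lower: float
    upper: float
    margin: float  # fraction of the band width treated as "approaching a threshold"


def classify(value: float, band: Band) -> str:
    """Return 'breach', 'warning', or 'normal' for a telemetry reading."""
    if value < band.lower or value > band.upper:
        return "breach"
    edge = (band.upper - band.lower) * band.margin
    if value <= band.lower + edge or value >= band.upper - edge:
        # Close to either threshold: alert before it becomes a failure.
        return "warning"
    return "normal"
```

The lower threshold matters as much as the upper one: a request rate that drops well below its usual level can indicate that customers cannot reach the service at all.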

Using fault injection

Fault injection can be viewed as using software that is designed to break other software. For
teams that design and deploy cloud services, it is software designed and written by the team to
cripple the service in a deliberate and programmatic way. Fault injection is often used with stress
testing and is widely considered an important part of developing robust software.
When using fault injection on a service that is already deployed, organizations target locations
where coping strategies have been put in place so they can validate those strategies. In addition,
cloud providers can discover unexpected results that are generated by the service and that can
be used to appropriately harden the production environment.
Fault injection and recovery drills can provide valuable information, such as whether the service
functions as expected or whether unexpected faults occur under load. A service provider can use
this information to design new coping strategies to implement in future updates to the service.
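A minimal sketch of programmatic fault injection follows. It wraps a dependency call so that a configurable fraction of calls fail on purpose; the `FaultInjector` class, the choice of `ConnectionError` as the injected fault, and the injectable random source are assumptions made for illustration.

```python
import random
from typing import Any, Callable, Optional


class FaultInjector:
    """Wrap an operation so that some calls fail deliberately."""

    def __init__(self, failure_rate: float, rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0.0 and 1.0")
        self.failure_rate = failure_rate
        self.rng = rng if rng is not None else random.Random()
        self.injected = 0  # count of faults injected so far

    def wrap(self, operation: Callable[..., Any]) -> Callable[..., Any]:
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            if self.rng.random() < self.failure_rate:
                self.injected += 1
                raise ConnectionError("injected fault")
            return operation(*args, **kwargs)
        return wrapped
```

Wrapping the calls that a coping strategy protects, and then checking that the service still returns a reasonable result, is one way to validate the strategy before a real failure does.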

Summary
To design and implement a reliable cloud service requires organizations to assess how they
regard failure. Historically, reliability has been equated with preventing failure—that is,
delivering a tangible object free of faults or imperfections. Cloud services are complex and have
dependencies, so they become more reliable when they are designed to quickly recover from
unavoidable failures, particularly those that are out of an organization's control.
The processes that architects and engineers use to design cloud services can also affect the
reliability of a service. It is critical for service design to incorporate monitoring data from live
sites, especially when identifying the faults and failures that are addressed through coping
strategies that are tailored to a particular service. Organizations should also consider conducting
fault injection tests and recovery drills in their production environments. Doing so generates
data they can use to improve service reliability and that will help prepare organizations to
handle failures when they actually occur.

Additional resources
• The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project
http://roc.cs.berkeley.edu/
• Resilience by design for cloud services
http://aka.ms/resiliency
• Foundations of Trustworthy Computing: Reliability
www.microsoft.com/about/twc/en/us/reliability.aspx
• Microsoft Trustworthy Computing
www.microsoft.com/about/twc/en/us/default.aspx

Authors and contributors
MIKE ADAMS – Cloud and Enterprise
SHANNON BEARLY – Cloud and Enterprise
DAVID BILLS – Trustworthy Computing
SEAN FOY – Trustworthy Computing
MARGARET LI – Trustworthy Computing
TIM RAINS – Trustworthy Computing
MICHAEL RAY – Cloud and Enterprise
DAN ROGERS – Operating Systems
FRANK SIMORJAY – Trustworthy Computing
SIAN SUTHERS – Trustworthy Computing
JASON WESCOTT – Trustworthy Computing
© 2014 Microsoft Corp. All rights reserved.
This document is provided "as-is." Information and views expressed in this document, including URL
and other Internet Web site references, may change without notice. You bear the risk of using it. This
document does not provide you with any legal rights to any intellectual property in any Microsoft
product. Microsoft, Office 365, and Windows Azure are either registered trademarks or trademarks of
Microsoft Corporation in the United States and/or other countries.
You may copy and use this document for your internal, reference purposes. Licensed under
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported.

Trustworthy Computing | An introduction to designing reliable cloud services

14

Más contenido relacionado

La actualidad más candente

Key Achievements PowerPoint Presentation Slides
Key Achievements PowerPoint Presentation SlidesKey Achievements PowerPoint Presentation Slides
Key Achievements PowerPoint Presentation SlidesSlideTeam
 
Twitter tools for Recruitment - Sathish Ganesh - Sourcing Lab at TASCON16
Twitter tools for Recruitment - Sathish Ganesh - Sourcing Lab at TASCON16Twitter tools for Recruitment - Sathish Ganesh - Sourcing Lab at TASCON16
Twitter tools for Recruitment - Sathish Ganesh - Sourcing Lab at TASCON16SourcingAdda
 
A year with event sourcing and CQRS
A year with event sourcing and CQRSA year with event sourcing and CQRS
A year with event sourcing and CQRSSteve Pember
 
Keep Ahead of Evolving Cyberattacks with OPSWAT and F5 NGINX
Keep Ahead of Evolving Cyberattacks with OPSWAT and F5 NGINXKeep Ahead of Evolving Cyberattacks with OPSWAT and F5 NGINX
Keep Ahead of Evolving Cyberattacks with OPSWAT and F5 NGINXNGINX, Inc.
 
Software development training for technical recruiters
Software development training for technical recruitersSoftware development training for technical recruiters
Software development training for technical recruitersObi Mba Ogbanufe
 
Quality of hire metrics and why you must measure it
Quality of hire  metrics  and why you must measure itQuality of hire  metrics  and why you must measure it
Quality of hire metrics and why you must measure itDr. John Sullivan
 
Competitor Landscape Framework PowerPoint Presentation Slides
Competitor Landscape Framework PowerPoint Presentation SlidesCompetitor Landscape Framework PowerPoint Presentation Slides
Competitor Landscape Framework PowerPoint Presentation SlidesSlideTeam
 
Achievements And Challenges PowerPoint Presentation Slides
Achievements And Challenges PowerPoint Presentation SlidesAchievements And Challenges PowerPoint Presentation Slides
Achievements And Challenges PowerPoint Presentation SlidesSlideTeam
 
Technology Solutions Strategies Presentation Powerpoint
Technology Solutions Strategies Presentation PowerpointTechnology Solutions Strategies Presentation Powerpoint
Technology Solutions Strategies Presentation PowerpointSlideTeam
 
API 101 - Understanding APIs
API 101 - Understanding APIsAPI 101 - Understanding APIs
API 101 - Understanding APIs3scale
 
How to Battle Bad Reviews
How to Battle Bad ReviewsHow to Battle Bad Reviews
How to Battle Bad ReviewsGlassdoor
 
Employee Hiring Process PowerPoint Presentation Slides
Employee Hiring Process PowerPoint Presentation SlidesEmployee Hiring Process PowerPoint Presentation Slides
Employee Hiring Process PowerPoint Presentation SlidesSlideTeam
 
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and HailoMicroservices Architectures: Become a Unicorn like Netflix, Twitter and Hailo
Microservices Architectures: Become a Unicorn like Netflix, Twitter and Hailogjuljo
 
Company Summary of Business a Plan PowerPoint Presentation Slides
Company Summary of Business a Plan PowerPoint Presentation SlidesCompany Summary of Business a Plan PowerPoint Presentation Slides
Company Summary of Business a Plan PowerPoint Presentation SlidesSlideTeam
 
Be ready for hyperautomation with the UiPath RPA Platform
Be ready for hyperautomation with the UiPath RPA PlatformBe ready for hyperautomation with the UiPath RPA Platform
Be ready for hyperautomation with the UiPath RPA PlatformUiPath
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
Continuous Deployment Practices, with Production, Test and Development Enviro...
Continuous Deployment Practices, with Production, Test and Development Enviro...Continuous Deployment Practices, with Production, Test and Development Enviro...
Continuous Deployment Practices, with Production, Test and Development Enviro...Amazon Web Services
 
Creating a Sourcing Function
Creating a Sourcing FunctionCreating a Sourcing Function
Creating a Sourcing Functioncjparker
 
CIPD Research on Mgt Competencies
CIPD Research on Mgt CompetenciesCIPD Research on Mgt Competencies
CIPD Research on Mgt CompetenciesRye Cruz
 

An Introduction to Designing Reliable Cloud Services January 2014

  • 1. An introduction to designing reliable cloud services. January 2014. Contents: Overview (2); Cloud service reliability versus resiliency (4); Recovery-oriented computing (5); Planning for failure (7); Designing for and responding to failure (9); Summary (13); Additional resources (14); Authors and contributors (14). Trustworthy Computing | An introduction to designing reliable cloud services
  • 2. Overview. This paper describes reliability concepts and a reliability design-time process for organizations that create, deploy, and/or consume cloud services. It explains fundamental concepts of reliability and can help decision makers understand the factors and processes that make cloud services more reliable. In addition, it provides architects, developers, and operations personnel with insights into how to work together to increase the reliability of the services they design, implement, and support. Cloud service providers and their customers have varying degrees of responsibility, depending on the cloud service model. The following figure shows the spectrum of responsibilities for both providers and customers. For infrastructure as a service (IaaS) offerings, such as virtual machines, the provider and the customer share responsibilities. Although the customer is responsible for ensuring that the solutions they build on the offering run in a reliable manner, the provider is still ultimately responsible for the reliability of the infrastructure (core compute, network, and storage) components. When customers purchase software as a service (SaaS) offerings, such as Microsoft Office 365, cloud providers hold primary responsibility for ensuring the reliability of the service. Platform as a service (PaaS) offerings, such as Windows Azure, occupy the middle of this responsibility spectrum, with providers being responsible for the infrastructure and fabric controller layers. If a customer purchases an infrastructure as a service (IaaS) offering, such as Windows Azure, the cloud provider is largely responsible for host management and physical security of the facility itself.
  • 3. Figure 1. Cloud customer and cloud provider responsibilities
With the emergence of cloud computing and online services, customers expect services to be available whenever they need them, just like electricity or dial tone. This expectation requires organizations that build and support cloud services to plan for probable failures and have mechanisms that allow rapid recovery from such failures. Cloud services are complex and have many dependencies, so it is important that all members of a service provider’s organization understand their role in making the service they provide as reliable as possible.
This paper includes the following sections:
• Cloud service reliability versus resiliency
• Recovery-oriented computing
• Planning for failure
• Designing for and responding to failure
Although it is outside the scope of this paper, it is also important to understand that there are cost tradeoffs associated with some reliability strategies that need consideration in order to implement a service with sufficient reliability at optimal cost. Considerations could include determining what features to include in the service and prioritizing the degree of reliability associated with each feature.
  • 4. Cloud service reliability versus resiliency
The Institute of Electrical and Electronics Engineers (IEEE) Reliability Society states that reliability [engineering] is “a design engineering discipline which applies scientific knowledge to assure that a system will perform its intended function for the required duration within a given environment, including the ability to test and support the system through its total lifecycle.” [1] For software, it defines reliability as “the probability of failure-free software operation for a specified period of time in a specified environment.” [2]
If one assumes that all cloud service providers strive to deliver a reliable experience for their customers, it is important to fully understand what constitutes a reliable cloud service. In essence, a reliable cloud service is one that functions as the designer intended, when the customer expects it to function, and wherever the connected customer is located. However, not every component needs to operate flawlessly 100 percent of the time.
Cloud service providers strive for reliability. Resiliency is the ability of a cloud-based service to withstand certain types of failure and yet remain functional from the customers’ perspective. A service could be characterized as reliable simply because no part of the service has ever failed, and yet the service might not be considered resilient because it has not been tested. A resilient service is one that is designed and built so that potential failures have minimal effect on the service’s availability and functionality. In addition, despite constant and persistent reliability threats, resilient services should remain fully functional and allow customers to perform the tasks necessary to complete their work.
Services should be designed to:
• Minimize the impact a failure has on any given customer. For example, the service should degrade gracefully, which means that non-critical components of the service may fail but critical functions still work.
• Minimize the number of customers affected by a failure. For example, the service should be designed so that faults can be isolated to a subset of customers.
• Reduce the number of minutes that a customer (or customers) cannot use the service in its entirety. For example, the service should be able to transfer customer requests from one data center to another if a major failure occurs.
[1] IEEE Reliability Society, at http://rs.ieee.org
[2] Ibid.
  • 5. Recovery-oriented computing
Traditional computing systems have been designed to avoid failure. Cloud-based systems face inherent reliability challenges because of their scale and complexity. The recovery-oriented computing (ROC) approach can help organizations frame software failure in a way that makes it easier to design cloud services to respond to elements that are under their direct control as well as to elements that are not. The following three basic assumptions are associated with recovery-oriented computing:
• Devices and hardware will fail
• People make mistakes
• Software contains imperfections
Organizations that create cloud services must design them to mitigate the effects of these failures as much as possible to provide reliability for their customers.
ROC research areas
ROC defines six research areas [3] that can be adapted to cloud service design and implementation recommendations. These research areas can help mitigate potential issues that are rooted in the three basic assumptions listed above, and are explained in the following list:
• Fault zones. Organizations should partition cloud services into fault zones so failures can be contained, which enables rapid recovery. Isolation and loose coupling of dependencies are essential elements that contribute to fault containment and recovery capabilities. Fault isolation mechanisms should apply to a wide range of failure scenarios, including software imperfections and human-induced failures.
• Defense-in-depth. Organizations should use a defense-in-depth approach, which helps ensure that a failure is contained if the first layer of protection does not isolate it. In other words, organizations should not rely on a single protective measure but instead factor multiple protective measures into their service design.
• Redundancy. Organizations should build redundancy into their systems to survive faults.
Redundancy enables isolation so that organizations can ensure the service continues to run, perhaps in a degraded state, when a fault occurs and the system is in the process of being recovered. Organizations should design fail-fast components that enable redundant systems to detect failure quickly and isolate it during recovery.
• Diagnostic aids. Organizations should use diagnostic aids for root cause analysis of failures. These aids must be suitable for use in non-production and production environments, and should be able to rapidly detect the presence of failures and identify their root causes using automated techniques.
[3] Recovery-Oriented Computing Overview, at http://roc.cs.berkeley.edu/roc_overview.html
  • 6.
• Automated rollback. Organizations should create systems that provide automated rollback for most aspects of operations, from system configuration to application management to hardware and software upgrades. This functionality does not prevent human error but can help mitigate the impact of mistakes and make services more dependable.
• Recovery process drills. Organizations should conduct recovery process drills routinely to test repair mechanisms, both during development and while in production mode. Testing helps ensure that the repair mechanisms work as expected and do not compound failures in a production environment.
Using the ROC approach can help an organization shift from strictly focusing on preventing failures to also focusing on reducing the amount of time it takes to recover from a failure. In other words, some degree of failure is inevitable (that is, it cannot be avoided or prevented), so it is important to have recovery strategies in place. Two terms can frame the shift in thinking that is required to create more reliable cloud services: mean time to failure (MTTF) and mean time to recover (MTTR).
MTTF is a measure of how frequently software and hardware fail, and the goal is to make the time between failures as long as possible. It is a necessary measure and works well for packaged software, because software publishers are able to specify the computing environment in which the software will optimally perform. However, focusing on MTTF by itself is insufficient for cloud services, because portions of the computing environment are out of the direct control of the provider and thus more unpredictable. It is important that cloud services are designed in such a way that they can rapidly recover.
MTTR is the amount of time it takes to get a service running again after a failure.
Shrinking MTTR requires design and development practices that promote quicker detection and subsequent recovery. It also requires well-trained operations teams that are capable of bringing components of the service back online as quickly as possible. An even better approach is for the system to recover automatically, without human intervention. Organizations should design cloud services so that they do not stop working, even when some components fail; such an approach allows the service to degrade gracefully while still allowing users to accomplish their work.
Embracing the ROC approach allows organizations to design services in ways that reduce MTTR as much as possible and increase MTTF as much as possible.
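The relationship between MTTF, MTTR, and the fraction of time a service is usable can be made concrete with a small calculation. The sketch below uses the standard steady-state availability approximation, MTTF / (MTTF + MTTR); this formula and the sample figures are assumptions added for illustration and do not come from the paper.

```python
HOURS_PER_YEAR = 8760  # assumption: non-leap year


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is running."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(mttf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for the given MTTF and MTTR."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Hypothetical service: fails every 1,000 hours, takes 4 hours to recover.
baseline = availability(1000, 4)
# Two ways to improve it: recover twice as fast, or fail half as often.
faster_recovery = availability(1000, 2)
fewer_failures = availability(2000, 4)
```

Under this approximation, halving MTTR yields the same availability as doubling MTTF, which is one way to see why recovery speed deserves as much design attention as failure prevention.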
  • 7. Planning for failure
To help reduce MTTR, organizations need to design ways for their services to continue operating when known failure conditions occur. For example, what should the service do when another cloud service that it depends on is not available? What should the service do when it cannot connect to its primary database? What hardware redundancies are required and where should they be located? Can the service detect and respond gracefully to incorrect configuration settings, allowing rollback of the system to a “last known good” state? At what point is rollback of a given change no longer possible, necessitating a “patch and roll forward” mitigation strategy instead?
Organizations that create cloud services should consider the three primary causes of failure shown in the following figure:
Figure 2. Causes of failure
• Device and infrastructure failure: These failures range from expected, end-of-life failures to catastrophic failures caused by natural disaster or accidents that are out of an organization’s control.
• Human error: Administrator and configuration mistakes that are often out of an organization’s control.
• Software imperfections: Code imperfections and software-related issues in the deployed online service. Pre-release testing can control this to some degree.
Core design principles for reliable services
Organizations must address the following three essential reliability design principles when they create specifications for a cloud service. These principles help to mitigate the effect of failures when they occur:
• Design for resilience. The service must withstand component-level failures without requiring human intervention. A service should be able to detect failures and automatically take corrective measures so that users do not experience service interruptions. When failure does occur, the service should degrade gracefully and provide partial functionality instead of going completely offline.
For example, a service should use fail-fast components and indicate appropriate exceptions so that the system can automatically detect and resolve the issue. There are also automated techniques that architects can include to predict service failure and notify the organization about service degradation or failure.
• Design for data integrity. The service must capture, manipulate, store, or discard data in a manner that is consistent with its intended operation. A service needs to preserve the integrity of the information that customers have entrusted to it. For example, organizations should
  • 8. replicate customer data stores to prevent hardware failures from causing data loss, and adequately secure data stores to prevent unauthorized access.
• Design for recoverability. When the unforeseen happens, the service must be recoverable. As much as possible, a service or its components should recover quickly and automatically. Teams should be able to restore a service quickly and completely if an interruption occurs. For example, services should be designed for component redundancy and data failover so that when failure is detected in a component, a group of servers, or an entire physical location or data center, another component, group of servers, or physical location automatically takes over to keep the service running.
When designing cloud services, organizations should adopt these essential principles as minimum requirements for handling potential failures.
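The recoverability principle, in which a redundant component takes over automatically when a failure is detected, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `call_with_failover`, `primary_db`, and `secondary_db` are hypothetical names, and a real service would also need health checks, timeouts, and data replication.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every redundant component has failed."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each redundant component in order and return the first success.

    Each replica is expected to fail fast by raising an exception, so the
    caller can move on to the next one without waiting.
    """
    errors: List[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # record the failure and fail over
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Hypothetical components: the primary is down, the secondary is healthy.
def primary_db() -> str:
    raise ConnectionError("primary database unreachable")


def secondary_db() -> str:
    return "rows from secondary"


result = call_with_failover([primary_db, secondary_db])
```

The caller receives a result from the secondary without any human intervention; the collected errors remain available for diagnostics if every replica fails.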
  • 9. Designing for and responding to failure
To build reliable cloud services, organizations should design for failure; that is, specify how a service will respond gracefully when it encounters a failure condition. The process that is illustrated in the following figure is intended for organizations that create SaaS solutions, and is designed to help them identify and mitigate possible failures. However, organizations that purchase cloud services can also use this process to develop an understanding of how the services function and help them formulate questions to ask before entering into a service agreement with a cloud provider.
Figure 3. An overview of the design process
• Create initial service design
• Failure mode & effects analysis
• Design coping strategies
• Use fault injection
• Capture unexpected faults
• Monitor the live site
Designing a service for reliability and implementing recovery mechanisms based on recovery-oriented principles is an iterative process. Design iterations are fluid and take into account both information garnered from pre-release testing and data about how the service performs after it is deployed.
Failure mode and effects analysis
Failure mode and effects analysis (FMEA) is a key step in the design process for any online service. Identifying the important interaction points and dependencies of a service enables the engineering team to pinpoint changes that are required to ensure the service can be monitored effectively for rapid detection of issues. This approach enables the engineering team to develop ways for the service to withstand, or mitigate, faults. FMEA also helps the engineering teams identify suitable test cases to validate whether the service is able to cope with faults, in test environments as well as in production (otherwise known as fault injection).
  • 10. As part of FMEA, organizations should create a component inventory of all components that the service uses, whether they are user interface (UI) components hosted on a web server, a database hosted in a remote data center, or an external service that the service depends on. The team can then capture possible faults in a spreadsheet or other document and incorporate relevant information into design specifications. The following example questions are the types of questions that an online service design team should consider. In addition, teams should consider whether they have the capacity to detect failures, to analyze the root cause of the failure, and to recover the service.
• What external services will the service be dependent on?
• What data sources will the service be dependent on?
• What configuration settings will the service require to operate properly?
• What hardware dependencies does the service have?
• What are the relevant customer scenarios that should be modeled?
To fully analyze how the service will use its components, the team can create a matrix that captures which components are accessed for each customer scenario. For example, an online video service might contain scenarios for logging in, for browsing an inventory of available videos, selecting a video and viewing it, and then rating the video after viewing. Although these scenarios share common information and components, each is a separate customer usage scenario, and each accesses components that are independent from the other scenarios. The matrix should identify each of these usage scenarios and contain a list of all required components for each scenario. Using a matrix also allows the service design team to create a map of possible failure points at each component interaction point, and define a fault-handling mechanism for each.
For more information on how Microsoft implements failure mode and effects analysis, read the ‘Resilience by design for cloud services’ whitepaper.
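The scenario-to-component matrix described above can be kept as a simple data structure before it grows into a spreadsheet. The sketch below is illustrative only: the scenario names follow the online video example in the text, while the component names and helper functions are hypothetical.

```python
from typing import Dict, List, Tuple

# Each customer scenario maps to the components it touches.
# Component names are assumptions made for this example.
SCENARIO_COMPONENTS: Dict[str, List[str]] = {
    "log in": ["web front end", "identity service"],
    "browse inventory": ["web front end", "catalog database"],
    "view video": ["web front end", "catalog database", "streaming service"],
    "rate video": ["web front end", "ratings database"],
}


def failure_points(matrix: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """List every (scenario, component) interaction point in the matrix.

    Each pair is a place where a fault can surface and therefore needs a
    fault-handling mechanism defined for it.
    """
    return [(scenario, component)
            for scenario, components in matrix.items()
            for component in components]


def scenarios_affected_by(matrix: Dict[str, List[str]],
                          component: str) -> List[str]:
    """Return the scenarios that would be affected if a component failed."""
    return sorted(scenario for scenario, components in matrix.items()
                  if component in components)


points = failure_points(SCENARIO_COMPONENTS)
affected = scenarios_affected_by(SCENARIO_COMPONENTS, "catalog database")
```

Querying the matrix by component shows the reach of a single fault, which helps a team decide where a fault-handling mechanism matters most.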
Designing and implementing coping strategies
Fault-handling mechanisms are also called coping strategies. In the design stage, architects define what the coping strategies will be so that the software will do something reasonable when a failure occurs. They should also define the types of instrumentation that engineers should include in the service specification to enable monitors that can detect when a particular type of failure occurs.
Designing coping strategies to do something reasonable depends on the functionality that the service provides and the type of failure the coping strategy addresses. The key is to ensure that when a component fails, it fails quickly and, if required, the service switches to a redundant component. In other words, the service degrades gracefully but does not fail completely.
  • 11. For example, the architects of a car purchasing service design their application to include ratings for specific makes and models of cars. They design the purchasing service with a dependency on another service that provides comparative ratings of the models. If the rating service fails or is unavailable, the coping strategy might mean the purchasing service displays a list of models without the associated ratings rather than not displaying a list at all. In other words, when a particular failure happens the service should produce a reasonable result, regardless of the failure. The result may not be optimal, but it should be reasonable from the customer’s perspective. For a car purchasing service, it is reasonable to still produce a list of models with standard features, optional features, and pricing without any rating data instead of an error message or a blank page; the information is not optimal, but it might be useful to the customer. It is best to think in terms of “reasonable, but not necessarily optimal” when deciding what the response to a failure condition should be.
When designing and implementing instrumentation, it is important to monitor at the component level as well as from the user’s perspective. This approach can allow the service team to identify a trend in component-level performance before it becomes an incident that affects users. The data that this kind of monitoring can produce enables organizations to gain insight into how to improve the service’s reliability for later releases.
Monitoring the live site
Teams can use accurate monitoring information to improve services in several ways. For example, it can provide teams with information to troubleshoot known problems or potential problems in a service. It can also provide organizations with insights into how their services perform when handling live workloads.
In addition, it can be fed directly into service-alerting mechanisms to reduce the time to detect problems and therefore reduce MTTR.
Simulated workloads in test environments rarely capture the range of possible failures and faults that live site workloads generate. Organizations can identify trends before they become failures by carefully analyzing live site telemetry data and establishing thresholds, both upper and lower ranges, that represent normal operating conditions. If the telemetry being collected in near real time approaches either the upper or the lower threshold, an alarm can be triggered that prompts the operations team to immediately triage the service and potentially prevent a failure. Organizations can also analyze failure and fault data that instrumentation and monitoring tools capture in the production environment to better understand how the service operates and to determine what monitoring improvements and new coping strategies they require.
Using fault injection
Fault injection can be viewed as using software that is designed to break other software. For teams that design and deploy cloud services, it is software designed and written by the team to cripple the service in a deliberate and programmatic way. Fault injection is often used with stress testing and is widely considered an important part of developing robust software.
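The upper and lower telemetry thresholds described in the live-site monitoring discussion earlier on this slide can be sketched as a small check that warns before a limit is crossed. The `Threshold` class, the `margin` parameter, and the requests-per-second figures are assumptions chosen for illustration; the paper does not prescribe a specific mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Normal operating range for one telemetry metric."""
    lower: float
    upper: float
    margin: float = 0.1  # fraction of the range that counts as "approaching"


def check(sample: float, threshold: Threshold) -> str:
    """Classify a sample as 'ok', 'warning' (approaching a limit), or 'breach'."""
    band = (threshold.upper - threshold.lower) * threshold.margin
    if sample < threshold.lower or sample > threshold.upper:
        return "breach"
    if sample <= threshold.lower + band or sample >= threshold.upper - band:
        return "warning"
    return "ok"


# Hypothetical metric: requests per second, normally between 100 and 1,000.
requests_per_second = Threshold(lower=100, upper=1000)
statuses = [check(value, requests_per_second) for value in (500, 950, 120, 1200, 50)]
```

A lower bound matters as much as an upper one: an unusually quiet service can indicate that requests are failing before they reach it.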
  • 12. When using fault injection on a service that is already deployed, organizations target locations where coping strategies have been put in place so they can validate those strategies. In addition, cloud providers can discover unexpected results that are generated by the service and that can be used to appropriately harden the production environment.
Fault injection and recovery drills can provide valuable information, such as whether the service functions as expected or whether unexpected faults occur under load. A service provider can use this information to design new coping strategies to implement in future updates to the service.
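Using fault injection to validate a coping strategy can be illustrated with the car purchasing example from the coping strategies discussion. In this sketch a fault is injected into a hypothetical rating dependency, and the listing function is checked to confirm that it still returns a reasonable result. All names (`fetch_ratings`, `list_models`, `inject_fault`, `RatingServiceDown`) and the sample data are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional


class RatingServiceDown(Exception):
    """Simulated failure of the external rating service."""


def fetch_ratings(models: List[str]) -> Dict[str, float]:
    """Stand-in for the healthy rating dependency (hypothetical data)."""
    return {model: 4.5 for model in models}


def list_models(
    models: List[str],
    ratings_client: Callable[[List[str]], Dict[str, float]] = fetch_ratings,
) -> List[Dict[str, Optional[object]]]:
    """Return the model list, with ratings when they are available."""
    try:
        ratings = ratings_client(models)
    except Exception:
        # Coping strategy: show the list without ratings rather than no list.
        ratings = {}
    return [{"model": model, "rating": ratings.get(model)} for model in models]


def inject_fault(exc: Exception) -> Callable[..., Dict[str, float]]:
    """Build a replacement dependency that always fails with the given error."""
    def broken(*args: object, **kwargs: object) -> Dict[str, float]:
        raise exc
    return broken


healthy = list_models(["Model A", "Model B"])
degraded = list_models(["Model A", "Model B"], inject_fault(RatingServiceDown()))
```

The degraded result is not optimal, but it remains a usable list, which matches the "reasonable, but not necessarily optimal" guidance in the text.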
  • 13. Summary
Designing and implementing a reliable cloud service requires organizations to assess how they regard failure. Historically, reliability has been equated with preventing failure, that is, delivering a tangible object free of faults or imperfections. Cloud services are complex and have dependencies, so they become more reliable when they are designed to quickly recover from unavoidable failures, particularly those that are out of an organization's control.
The processes that architects and engineers use to design cloud services can also affect the reliability of a service. It is critical for service design to incorporate monitoring data from live sites, especially when identifying the faults and failures that are addressed through coping strategies that are tailored to a particular service. Organizations should also consider conducting fault injection tests and recovery drills in their production environments. Doing so generates data they can use to improve service reliability and that will help prepare organizations to handle failures when they actually occur.
  • 14. Additional resources
• The Berkeley/Stanford Recovery-Oriented Computing (ROC) Project http://roc.cs.berkeley.edu/
• Resilience by design for cloud services http://aka.ms/resiliency
• Foundations of Trustworthy Computing: Reliability www.microsoft.com/about/twc/en/us/reliability.aspx
• Microsoft Trustworthy Computing www.microsoft.com/about/twc/en/us/default.aspx
Authors and contributors
MIKE ADAMS – Cloud and Enterprise
SHANNON BEARLY – Cloud and Enterprise
DAVID BILLS – Trustworthy Computing
SEAN FOY – Trustworthy Computing
MARGARET LI – Trustworthy Computing
TIM RAINS – Trustworthy Computing
MICHAEL RAY – Cloud and Enterprise
DAN ROGERS – Operating Systems
FRANK SIMORJAY – Trustworthy Computing
SIAN SUTHERS – Trustworthy Computing
JASON WESCOTT – Trustworthy Computing
© 2014 Microsoft Corp. All rights reserved. This document is provided "as-is." Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it. This document does not provide you with any legal rights to any intellectual property in any Microsoft product. Microsoft, Office 365, and Windows Azure are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. You may copy and use this document for your internal, reference purposes. Licensed under Creative Commons Attribution-Non Commercial-Share Alike 3.0 Unported.